Difference between revisions of "User:Tjones5/Final project"

From LING073

Revision as of 22:48, 9 May 2017

Improvements Made to the Transducer

Final version: https://github.swarthmore.edu/tjones5/ling073-wbp

Commit: https://github.swarthmore.edu/tjones5/ling073-wbp/commit/125bb63f33f5f50caf682b746b972c549b4da7a0

  • .lexc:
    • optional dashes throughout (verb suffixes, pronouns)
    • improved organization of pronoun analysis: separation of 1st person from 2nd/3rd person
    • added several noun stems
  • .twol:
    • added %{U%} archiphoneme to appear in suffixes on nouns in dative case

Commit: https://github.swarthmore.edu/tjones5/ling073-wbp/commit/f8841c64db1c248815bc5c5a3ce4a23b8d30f641

  • .lexc:
    • added present verb analysis (ex: piyani = present form of "to be", not piyanimi)
    • added verb tense and suffix analysis (apparently verbs can sometimes take object suffixes; previously I thought only auxiliaries could do this)
    • added "kirra" as another possible allative object ending
    • added clitic analysis and several clitics -- these do not yet increase coverage, so something still needs to change
    • added 6 noun stems, ~55 verb stems, 3 adverb stems
  • .twol:
    • eliminated bi-directionality of some rules

Commit: next commit

Presentation

Description of the problem

  • Word structure is analyzed by composition of morphemes (the smallest units with grammatical meaning)
  • A morphological transducer connects forms with analyses.
  • Surface level -- represents the concatenation of letters which make up the actual spelling of the word, e.g. wolves
  • Lexical level -- represents the concatenation of morphemes making up a word, e.g. NOUN-wolf + PLURAL-es
  • Morphological parsing -- the problem of recognizing that a word breaks down into component morphemes, as above with "wolves"
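The relation between the two levels can be sketched as a pair of lookup tables. This is only a toy illustration, not how the lexc/twol transducer is built: the pairs are copied from the pronoun and noun tables on this page, and the function names `analyse_form` and `generate_forms` are hypothetical.

```python
from collections import defaultdict

# (lexical analysis, surface form) pairs copied from the tables on this page.
PAIRS = [
    ("ngaju<prn><pers><p1><sg>", "ngaju"),
    ("ngaju<prn><pers><p1><sg>", "ngajulu"),
    ("pirli<n><sg><abs>", "pirli"),
    ("pirli<n><sg><dat>", "pirli-ku"),
]

def build_tables(pairs):
    """Index the pairs in both directions, as a transducer relates both levels."""
    to_lexical = defaultdict(list)  # surface form -> lexical analyses
    to_surface = defaultdict(list)  # lexical analysis -> surface forms
    for lexical, surface in pairs:
        to_lexical[surface].append(lexical)
        to_surface[lexical].append(surface)
    return to_lexical, to_surface

TO_LEXICAL, TO_SURFACE = build_tables(PAIRS)

def analyse_form(surface):
    """Morphological parsing: surface form -> list of analyses (empty if unknown)."""
    return TO_LEXICAL.get(surface, [])

def generate_forms(lexical):
    """Generation: analysis -> list of surface forms (empty if unknown)."""
    return TO_SURFACE.get(lexical, [])
```

A real transducer encodes the same relation compactly as a finite-state network instead of listing every pair, which is what makes it practical for an agglutinative language.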

[Figure: Simple transducer.png]

Previous work

  • Many languages have working Apertium transducers.
  • Warlpiri doesn't have a working transducer -- in fact, no languages from the Pama–Nyungan family have a transducer

Benefits of this project

  • Why is a transducer useful to a community of Warlpiri speakers?
    • Search engine
    • Spell checker
    • Machine Translation, especially to/from English (the primary language of most Warlpiri speakers). Many are trying to revive the use of Warlpiri, so there are many potentially useful applications.
      • eng → wbp MT system could be used to look up how to say/write a certain word or phrase
      • wbp → eng MT system could be used to analyze (and then translate) untranslated Warlpiri stories, or give alternate translations
    • All of these applications might also help in starting analyzers for other indigenous languages that share the same orthography, are highly agglutinative, and have other grammatical similarities

Approaches

  • Have mostly worked within the lexc and twol files
  • Pronouns (optional dash not included in table):
Meaning | Word | Form
I, me | ngaju(lu) | ngaju<prn><pers><p1><sg> ↔ ngaju; ngaju<prn><pers><p1><sg> ↔ ngajulu
you | nyuntu(lu) | nyuntu<prn><pers><p2><sg> ↔ nyuntu; nyuntu<prn><pers><p2><sg> ↔ nyuntulu
he/she/it; to him/her/it | nyanungu | nyanungu<prn><pers><p3><sg> ↔ nyanungu
we (you & me) | ngali(jarra) | ngali<prn><pers><p1><du><incl> ↔ ngali; ngali<prn><pers><p1><du><incl> ↔ ngalijarra
we (him/her/it & me) | ngajarra | ngajarra<prn><pers><p1><du><excl> ↔ ngajarra
we (you & me & other(s)) | ngalipa | ngalipa<prn><pers><p1><pl><incl> ↔ ngalipa
we (them & me) | nganimpa | nganimpa<prn><pers><p1><pl><excl> ↔ nganimpa
you (both/two) | nyumpala | nyumpala<prn><pers><p2><du> ↔ nyumpala
you (more than 2) | nyurrarla | nyurrarla<prn><pers><p2><pl> ↔ nyurrarla
they/them (both/two) | nyanungu-jarra | nyanungu-jarra<prn><pers><p3><du> ↔ nyanungu-jarra
they/them (more than 2) | nyanungu-rra/nyanungu-patu | nyanungu<prn><pers><p3><pl> ↔ nyanungu-rra; nyanungu<prn><pers><p3><pl> ↔ nyanungu-patu
  • pirli (a noun with just 2 vowels) has different endings than warlkurru (which has more than two vowels)
    • however, when suffixes such as -jarra are added, pirli takes the endings of nouns with >2 vowels: pirli-ngku → pirli-jarra-rlu
case name | ~meaning | tag | possible forms | pirli "rock" (2 vowels) | warlkurru "axe" (>2 vowels)
absolutive | subject of intransitive verbs and object of transitive verbs | <abs> |  | pirli<n><sg><abs> ↔ pirli | warlkurru<n><sg><abs> ↔ warlkurru
dative | "to" | <dat> | -ku | pirli<n><sg><dat> ↔ pirli-ku | warlkurru<n><sg><dat> ↔ warlkurru-ku
ergative | "agent of a transitive verb" | <erg> | -ngku, -rlu | pirli<n><sg><erg> ↔ pirli-ngku | warlkurru<n><sg><erg> ↔ warlkurru-rlu
allative | "onto" | <all> | -kurra | pirli<n><sg><all> ↔ pirli-kurra | warlkurru<n><sg><all> ↔ warlkurru-kurra
comitative | "with" | <com> | -ngkajinta, -rlajinta | pirli<n><sg><com> ↔ pirli-ngkajinta | warlkurru<n><sg><com> ↔ warlkurru-rlajinta
elative | "out of" | <ela> | -ngurlu | pirli<n><sg><ela> ↔ pirli-ngurlu | warlkurru<n><sg><ela> ↔ warlkurru-ngurlu
locative | "at, in, on" | <loc> | -ngka, -rla | pirli<n><sg><loc> ↔ pirli-ngka | warlkurru<n><sg><loc> ↔ warlkurru-rla
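The vowel-count conditioning in the table can be sketched as follows. This is a simplified illustration, not the twol rules themselves. It assumes that every letter a, i or u counts as one vowel (so a long vowel written with a doubled letter would count twice), that the count is taken over everything already attached to the stem, and that the dative is always -ku, ignoring whatever alternation the %{U%} archiphoneme handles. The names `inflect` and `count_vowels` are hypothetical.

```python
VOWELS = set("aiu")

# Alternating suffixes from the case table:
# (form after stems with 2 vowels, form after stems with more than 2 vowels).
ALTERNATING = {
    "erg": ("ngku", "rlu"),
    "loc": ("ngka", "rla"),
    "com": ("ngkajinta", "rlajinta"),
}

# Suffixes that the table shows with a single form.
INVARIANT = {"abs": "", "dat": "ku", "all": "kurra", "ela": "ngurlu"}

def count_vowels(stem):
    """Count vowel letters; hyphens and consonants are ignored."""
    return sum(1 for ch in stem if ch in VOWELS)

def inflect(stem, case):
    """Attach the case suffix, choosing the allomorph by vowel count."""
    if case in INVARIANT:
        suffix = INVARIANT[case]
    else:
        short_form, long_form = ALTERNATING[case]
        suffix = short_form if count_vowels(stem) <= 2 else long_form
    return f"{stem}-{suffix}" if suffix else stem
```

Because the count runs over the whole string, `inflect("pirli-jarra", "erg")` picks -rlu without any special case, which matches the pirli-jarra-rlu example above.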


  • Disambiguation is a challenge because of flexible word order
$ echo "rdaka" | apertium -d . wbp-morph
^rdaka/rdaka<num>/rdaka<n><sg><abs>$^./.<sent>$
$ echo "rdaka-pala" | apertium -d . wbp-morph
^rdaka-pala/rdaka<num>$^./.<sent>$
  • Trying to write the following disambiguation rule: # If there is a noun to the right or left, I cannot be a numeral-like noun.
  • Examples of numeral-like nouns, sometimes distinguished by adding the number marker -pala.
    • mirdi = "four" or "elbow"
    • rdaka = "five" or "hand"
    • wirlki = "seven" or "boomerang"
    • narntirnki = "nine" or "curled"
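To experiment with the rule outside the analyzer, the output shown above can be read into a small data structure. The sketch below assumes the usual `^surface/reading/reading$` layout and ignores escaped characters; the helper names are hypothetical. It only checks the condition of the rule (a neighbouring token with a noun reading) and leaves open which reading should then be discarded.

```python
import re

LEXICAL_UNIT = re.compile(r"\^(.*?)\$")  # one unit: ^surface/reading1/reading2$
TAG = re.compile(r"<([^>]+)>")           # one tag: <n>, <num>, <sg>, ...

def parse_stream(line):
    """Split analyzer output into (surface, [readings]) pairs."""
    units = []
    for body in LEXICAL_UNIT.findall(line):
        surface, *readings = body.split("/")
        units.append((surface, readings))
    return units

def tags(reading):
    """Return the tags of one reading, e.g. ['n', 'sg', 'abs']."""
    return TAG.findall(reading)

def has_noun_neighbour(units, i):
    """True if the token left or right of position i has a reading tagged <n>."""
    for j in (i - 1, i + 1):
        if 0 <= j < len(units):
            _, readings = units[j]
            if any("n" in tags(r) for r in readings):
                return True
    return False
```

For example, `parse_stream` applied to the first output line above yields two units, `rdaka` with two readings and the sentence-final `.` with one.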


  • Next to implement: perhaps reduplication?
$ echo "kurdu" | apertium -d . wbp-morph
^kurdu/kurdu<n><sg><abs>$^./.<sent>$
$ echo "kurdu-kurdu" | apertium -d . wbp-morph
^kurdu-kurdu/kurdu<n><pl><abs>$^./.<sent>$
  • Need to write a scraper to test on a larger corpus (the current corpus is ~13,000 words)
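The reduplication idea can be prototyped before touching the lexc file. The sketch below is hypothetical: it assumes full reduplication written with a hyphen (X-X), assumes the base is a noun, and builds the plural absolutive analysis in the shape of the target output above.

```python
def reduplication_base(form):
    """Return X if form is a full reduplication 'X-X', otherwise None."""
    left, sep, right = form.partition("-")
    if sep and left and left == right:
        return left
    return None

def plural_analysis(form):
    """Build a plural analysis for a reduplicated noun, or None if not reduplicated."""
    base = reduplication_base(form)
    if base is None:
        return None
    return f"{base}<n><pl><abs>"
```

Forms such as rdaka-pala are left alone because the two halves differ.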


Evaluation

  • Coverage over a large corpus
  • Precision and recall against hand-annotated randomly selected forms
status | description | stems | coverage
prototype | language module that has not received heavy development | <1,000 | <60%
development | language module under development | ≥1,000 | ≥60%
working | language module with near-production-quality performance | ≥8,000 | ≥80%
production | language module used in a released pair | ≥10,000 | ≥90%
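Both evaluation measures can be computed with a few lines of code. The sketch below is one possible setup, not the project's actual evaluation script. It assumes tokens are given as (surface, readings) pairs, that an unknown token carries a single reading starting with `*` (the way Apertium marks unanalysed words), and that the hand annotation is a mapping from each form to its set of correct analyses. The function names are hypothetical.

```python
def coverage(analysed_tokens):
    """Fraction of tokens that received at least one real analysis."""
    if not analysed_tokens:
        return 0.0
    known = sum(
        1
        for _, readings in analysed_tokens
        if readings and not all(r.startswith("*") for r in readings)
    )
    return known / len(analysed_tokens)

def precision_recall(predicted, gold):
    """Compare predicted analyses with hand-annotated ones, form by form.

    predicted, gold: dict mapping a surface form to a set of analyses.
    Only forms present in the gold annotation are scored.
    """
    true_pos = false_pos = false_neg = 0
    for form, gold_set in gold.items():
        pred_set = predicted.get(form, set())
        true_pos += len(pred_set & gold_set)
        false_pos += len(pred_set - gold_set)
        false_neg += len(gold_set - pred_set)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

Coverage here is the naive kind: a token counts as covered if it gets any analysis, whether or not that analysis is correct, which is why precision and recall against annotated forms are measured separately.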