Difference between revisions of "User:Tjones5/Final project"
From LING073
Revision as of 22:48, 9 May 2017
Improvements Made to the Transducer
Final version: https://github.swarthmore.edu/tjones5/ling073-wbp
Commit: https://github.swarthmore.edu/tjones5/ling073-wbp/commit/125bb63f33f5f50caf682b746b972c549b4da7a0
- .lexc:
- optional dashes throughout (verb suffixes, pronouns)
- improved organization of pronoun analysis: separation of 1st person from 2nd/3rd person
- added several noun stems
- .twol:
- added %{U%} archiphoneme to appear in suffixes on nouns in dative case
Commit: https://github.swarthmore.edu/tjones5/ling073-wbp/commit/f8841c64db1c248815bc5c5a3ce4a23b8d30f641
- .lexc:
- added present verb analysis (ex: piyani = present form of "to be", not piyanimi)
- added verb tense and suffix analysis (apparently verbs can sometimes take object suffixes; previously I thought only auxiliaries could do this)
- added "kirra" as another possible allative object ending
- added clitic analysis and several clitics -- something needs to be changed so that they increase coverage
- added 6 noun stems, ~55 verb stems, 3 adverb stems
- .twol:
- eliminated bi-directionality of some rules
Commit: next commit
Presentation
Description of the problem
- Word structure is analyzed by composition of morphemes (the smallest units with grammatical meaning)
- A morphological transducer connects forms with analyses.
- Surface level -- represents the concatenation of letters which make up the actual spelling of the word, e.g. wolves
- Lexical level -- represents the concatenation of morphemes making up a word, e.g. NOUN-wolf + PLURAL-es
- Morphological parsing -- the problem of recognizing that a word breaks down into component morphemes, as above with "wolves"
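As a toy illustration of the two levels described above, the following Python sketch stores lexical/surface pairs in a list and looks them up in both directions. The names `PAIRS`, `analyse` and `generate`, and the tag spelling, are made up for this example; a real transducer is compiled from lexc and twol files rather than enumerated by hand.

```python
# Toy stand-in for a morphological transducer: a finite list of
# (lexical form, surface form) pairs that can be queried in either direction.
# The pairs and tags below are illustrative assumptions, not project data.
PAIRS = [
    ("wolf<n><sg>", "wolf"),
    ("wolf<n><pl>", "wolves"),
]


def analyse(surface):
    """Surface form -> list of lexical forms (morphological parsing)."""
    return [lex for lex, surf in PAIRS if surf == surface]


def generate(lexical):
    """Lexical form -> list of surface forms."""
    return [surf for lex, surf in PAIRS if lex == lexical]


print(analyse("wolves"))
print(generate("wolf<n><pl>"))
```

Because both functions read the same list, analysis and generation stay in sync, which is the property a transducer provides for a whole language rather than a handful of entries.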
Previous work
- Many languages have working Apertium transducers.
- Warlpiri doesn't have a working transducer -- in fact, no languages from the Pama–Nyungan family have a transducer
Benefits of this project
- Why is a transducer useful to a community of Warlpiri speakers?
- Search engine
- Spell checker
- Machine translation, especially to/from English (the primary language of most Warlpiri speakers). Many people are trying to revive the use of Warlpiri, so there are many potentially useful applications.
- eng → wbp MT system could be used to look up how to say/write a certain word or phrase
- wbp → eng MT system could be used to analyze (and then translate) untranslated Warlpiri stories, or give alternate translations
- The work behind these applications might also help in starting analyzers for other indigenous languages that share the same orthography, are highly agglutinative, and have other grammatical similarities
Approaches
- Worked mostly within the lexc and twol files
- Pronouns (optional dash not included in table):
Meaning | Word | Form |
---|---|---|
I, me | ngaju(lu) | ngaju<prn><pers><p1><sg> ↔ ngaju; ngaju<prn><pers><p1><sg> ↔ ngajulu |
you | nyuntu(lu) | nyuntu<prn><pers><p2><sg> ↔ nyuntu; nyuntu<prn><pers><p2><sg> ↔ nyuntulu |
he/she/it; to him/her/it | nyanungu | nyanungu<prn><pers><p3><sg> ↔ nyanungu |
we (you & me) | ngali(jarra) | ngali<prn><pers><p1><du><incl> ↔ ngali; ngali<prn><pers><p1><du><incl> ↔ ngalijarra |
we (him/her/it & me) | ngajarra | ngajarra<prn><pers><p1><du><excl> ↔ ngajarra |
we (you & me & other(s)) | ngalipa | ngalipa<prn><pers><p1><pl><incl> ↔ ngalipa |
we (them & me) | nganimpa | nganimpa<prn><pers><p1><pl><excl> ↔ nganimpa |
you (both/two) | nyumpala | nyumpala<prn><pers><p2><du> ↔ nyumpala |
you (more than 2) | nyurrarla | nyurrarla<prn><pers><p2><pl> ↔ nyurrarla |
they/them (both/two) | nyanungu-jarra | nyanungu-jarra<prn><pers><p3><du> ↔ nyanungu-jarra |
they/them (more than 2) | nyanungu-rra/nyanungu-patu | nyanungu<prn><pers><p3><pl> ↔ nyanungu-rra; nyanungu<prn><pers><p3><pl> ↔ nyanungu-patu |
- pirli (a noun with just 2 vowels) has different endings than warlkurru (which has more than two vowels)
- however when suffixes such as -jarra are added, pirli gets the endings of nouns with >2 vowels: pirli-ngku → pirli-jarra-rlu
case name | ~meaning | tag | possible forms | pirli "rock" (2 vowels) | warlkurru "axe" (>2 vowels) |
---|---|---|---|---|---|
absolutive | subject of intransitive verbs and object of transitive verbs | <abs> | — | pirli<n><sg><abs> ↔ pirli | warlkurru<n><sg><abs> ↔ warlkurru |
dative | "to" | <dat> | -ku | pirli<n><sg><dat> ↔ pirli-ku | warlkurru<n><sg><dat> ↔ warlkurru-ku |
ergative | "agent of a transitive verb" | <erg> | -ngku, -rlu | pirli<n><sg><erg> ↔ pirli-ngku | warlkurru<n><sg><erg> ↔ warlkurru-rlu |
allative | "onto" | <all> | -kurra | pirli<n><sg><all> ↔ pirli-kurra | warlkurru<n><sg><all> ↔ warlkurru-kurra |
comitative | "with" | <com> | -ngkajinta, -rlajinta | pirli<n><sg><com> ↔ pirli-ngkajinta | warlkurru<n><sg><com> ↔ warlkurru-rlajinta |
elative | "out of" | <ela> | -ngurlu | pirli<n><sg><ela> ↔ pirli-ngurlu | warlkurru<n><sg><ela> ↔ warlkurru-ngurlu |
locative | "at, in, on" | <loc> | -ngka, -rla | pirli<n><sg><loc> ↔ pirli-ngka | warlkurru<n><sg><loc> ↔ warlkurru-rla |
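The vowel-count alternation in the table can be sketched in Python as follows. This is only an illustration of the pattern, not the twol rule used in the project: the function and variable names are hypothetical, vowels are counted as the letters a, i, u (so a long vowel written with two letters would count twice), and stems with fewer than two vowels are treated like two-vowel stems because the source does not cover them.

```python
# Assumption: vowels are counted as letters; long vowels would count twice.
VOWELS = set("aiu")

# Suffix pairs from the table: (after 2-vowel stems, after longer stems).
ALTERNATING = {
    "erg": ("ngku", "rlu"),
    "loc": ("ngka", "rla"),
    "com": ("ngkajinta", "rlajinta"),
}
# Suffixes that have a single form in the table.
INVARIANT = {"dat": "ku", "all": "kurra", "ela": "ngurlu"}


def count_vowels(stem):
    """Count vowel letters in a stem, including any suffixes already attached."""
    return sum(1 for ch in stem if ch in VOWELS)


def inflect(stem, case):
    """Attach the case suffix, choosing the allomorph by vowel count."""
    if case == "abs":
        return stem
    if case in INVARIANT:
        return f"{stem}-{INVARIANT[case]}"
    short, long_ = ALTERNATING[case]
    # Assumption: stems with fewer than two vowels pattern with two-vowel stems.
    suffix = short if count_vowels(stem) <= 2 else long_
    return f"{stem}-{suffix}"


print(inflect("pirli", "erg"))
print(inflect("warlkurru", "erg"))
print(inflect("pirli-jarra", "erg"))
```

Counting vowels over the whole form, including suffixes such as -jarra, is what makes pirli-jarra pattern with the longer stems.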
- Disambiguation is a challenge because of flexible word order
echo "rdaka" | apertium -d . wbp-morph: ^rdaka/rdaka<num>/rdaka<n><sg><abs>$^./.<sent>$
echo "rdaka-pala" | apertium -d . wbp-morph: ^rdaka-pala/rdaka<num>$^./.<sent>$
- Trying to write the following disambiguation rule: # If there is a noun to the right or left, I cannot be a numeral-like noun.
- Examples of numeral-like nouns, sometimes distinguished by adding the number marker -pala.
- mirdi = "four" or "elbow"
- rdaka = "five" or "hand"
- wirlki = "seven" or "boomerang"
- narntirnki = "nine" or "curled"
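The intended rule can be prototyped in Python before it is written as an actual disambiguation rule. The sketch below is a stand-in, not the rule syntax used in the project: it parses analyser output of the form shown above, and removes `<num>` readings from a word when a neighbouring word has a noun reading. The function names and the two-word test input are hypothetical, and escaped slashes inside a cohort are not handled.

```python
import re

# One cohort looks like ^surface/reading1/reading2$
COHORT = re.compile(r"\^(.*?)\$")


def parse(stream):
    """Split an Apertium-style stream into [surface, [readings]] cohorts."""
    cohorts = []
    for body in COHORT.findall(stream):
        surface, *readings = body.split("/")
        cohorts.append([surface, readings])
    return cohorts


def has_noun(readings):
    """True if any reading carries the <n> tag."""
    return any("<n>" in r for r in readings)


def drop_numeral_next_to_noun(cohorts):
    """Remove <num> readings from a cohort with a noun reading on either side."""
    result = []
    for i, (surface, readings) in enumerate(cohorts):
        neighbours = cohorts[max(i - 1, 0):i] + cohorts[i + 1:i + 2]
        noun_nearby = any(has_noun(r) for _, r in neighbours)
        kept = [r for r in readings if "<num>" not in r]
        # Prune only if a neighbour is a noun and at least one reading survives.
        if noun_nearby and kept and len(kept) < len(readings):
            readings = kept
        result.append([surface, readings])
    return result


# Hypothetical two-word input built from forms that appear on this page.
stream = "^rdaka/rdaka<num>/rdaka<n><sg><abs>$ ^pirli/pirli<n><sg><abs>$^./.<sent>$"
print(drop_numeral_next_to_noun(parse(stream)))
```

The check that at least one reading survives keeps the sketch from deleting the only analysis of a word, which is the usual safety behaviour expected of a removal rule.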
- Next to implement: perhaps reduplication?
echo "kurdu" | apertium -d . wbp-morph: ^kurdu/kurdu<n><sg><abs>$^./.<sent>$
echo "kurdu-kurdu" | apertium -d . wbp-morph: ^kurdu-kurdu/kurdu<n><pl><abs>$^./.<sent>$
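One way to think about the reduplication target is as a check that a form consists of a known noun stem repeated around a dash. The Python sketch below shows only that check, under the assumption that full reduplication marks the plural as in the kurdu example; the function name and the stem set are hypothetical, and forms with further suffixes after the reduplicated stem are not handled.

```python
def analyse_reduplication(surface, noun_stems):
    """Return a plural analysis if `surface` is STEM-STEM for a known noun stem."""
    left, sep, right = surface.partition("-")
    if sep and left == right and left in noun_stems:
        return f"{left}<n><pl><abs>"
    return None


stems = {"kurdu"}  # hypothetical one-entry lexicon
print(analyse_reduplication("kurdu-kurdu", stems))
print(analyse_reduplication("kurdu", stems))
```

In a finite-state setting this kind of copying generally has to be spelled out per stem or handled by a dedicated mechanism, which is part of why reduplication is harder to implement than ordinary suffixation.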
- Need to write a scraper to test on a larger corpus (current corpus is ~13,000 words)
Evaluation
- Coverage over a large corpus
- Precision and recall against hand-annotated randomly selected forms
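Both measures can be computed with a few lines of Python. The sketch below assumes analyser output in the stream format shown earlier, where a token with no analysis appears with a single reading that starts with `*`. The function names, the sample tokens and the dictionary-of-sets layout for the hand annotations are assumptions made for illustration. Punctuation cohorts are counted as tokens here, which a real evaluation might want to exclude.

```python
import re

# One cohort looks like ^surface/reading1/reading2$
COHORT = re.compile(r"\^(.*?)\$")


def coverage(stream):
    """Fraction of cohorts with at least one reading not starting with '*'."""
    total = known = 0
    for body in COHORT.findall(stream):
        _surface, *readings = body.split("/")
        total += 1
        if any(not r.startswith("*") for r in readings):
            known += 1
    return known / total if total else 0.0


def precision_recall(predicted, gold):
    """predicted and gold map each annotated form to a set of analyses."""
    tp = sum(len(predicted.get(form, set()) & gold[form]) for form in gold)
    n_pred = sum(len(predicted.get(form, set())) for form in gold)
    n_gold = sum(len(analyses) for analyses in gold.values())
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical sample: one analysed token and one unknown token.
print(coverage("^kurdu/kurdu<n><sg><abs>$ ^xyz/*xyz$"))
```

Coverage only says that some analysis was returned; precision and recall against the hand-annotated forms are what show whether the returned analyses are the right ones.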
status | description | stems | coverage |
---|---|---|---|
prototype | language module that has not received heavy development | <1,000 | <60% |
development | language module under development | ≥1,000 | ≥60% |
working | language module with near-production-quality performance | ≥8,000 | ≥80% |
production | language module used in a released pair | ≥10,000 | ≥90% |
- Table Source: http://wiki.apertium.org/wiki/Languages
- Background on analyzers: http://www.phil.uu.nl/tst/2012/Slides/SLP_Lecture2.pdf