Difference between revisions of "User:Tjones5/Final project"
From LING073
Revision as of 02:07, 5 May 2017
Presentation
Description of the problem
- Word structure is analyzed by composition of morphemes (the smallest units with grammatical meaning)
- A morphological transducer connects forms with analyses.
- Surface level -- represents the concatenation of letters which make up the actual spelling of the word, e.g. wolves
- Lexical level -- represents the concatenation of morphemes making up a word, e.g. NOUN-wolf + PLURAL-es
- Morphological parsing -- the problem of recognizing that a word breaks down into component morphemes, as above with "wolves"
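The two levels above can be illustrated with a toy lookup table. This is only a sketch: a real transducer is compiled from lexc and twol sources rather than written as a dictionary, and the tags `<n>`, `<sg>`, `<pl>` on the English example are illustrative assumptions.

```python
# Toy lexicon pairing lexical-level analyses with surface forms.
# The entries and tags are illustrative, not taken from a real Apertium module.
LEXICON = {
    "wolf<n><sg>": "wolf",
    "wolf<n><pl>": "wolves",
}


def generate(analysis):
    """Lexical level -> surface level (None if the analysis is unknown)."""
    return LEXICON.get(analysis)


def analyse(surface):
    """Surface level -> every lexical analysis that produces this form."""
    return sorted(a for a, s in LEXICON.items() if s == surface)


print(analyse("wolves"))
print(generate("wolf<n><sg>"))
```

Running the same table in both directions is what makes it a transducer rather than a one-way parser.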
Previous work
- Many languages have working Apertium transducers:
- Warlpiri doesn't have a working transducer -- in fact, no languages from the Pama–Nyungan family have a transducer
Benefits of this project
- Why is a transducer useful to a community of Warlpiri speakers?
- Search engine
- Spell checker
- Machine Translation, especially to/from English (the primary language of most Warlpiri speakers). Many are trying to revive the use of Warlpiri, so there are many potentially useful applications.
- eng → wbp MT system could be used to look up how to say/write a certain word or phrase
- wbp → eng MT system could be used to analyze (and then translate) untranslated Warlpiri stories, or give alternate translations
- All of these applications might also be helpful in starting an analyzer for other indigenous languages that have the same orthography, are highly agglutinative, and have other grammatical similarities
Approaches
- So far, most of the work has been in the lexc and twol files
- Pronouns (optional dash not included in table):
Meaning | Word | Form |
---|---|---|
I, me | ngaju(lu) | ngaju<prn><pers><p1><sg> ↔ ngaju; ngaju<prn><pers><p1><sg> ↔ ngajulu |
you | nyuntu(lu) | nyuntu<prn><pers><p2><sg> ↔ nyuntu; nyuntu<prn><pers><p2><sg> ↔ nyuntulu |
he/she/it; to him/her/it | nyanungu | nyanungu<prn><pers><p3><sg> ↔ nyanungu |
we (you & me) | ngali(jarra) | ngali<prn><pers><p1><du><incl> ↔ ngali; ngali<prn><pers><p1><du><incl> ↔ ngalijarra |
we (him/her/it & me) | ngajarra | ngajarra<prn><pers><p1><du><excl> ↔ ngajarra |
we (you & me & other(s)) | ngalipa | ngalipa<prn><pers><p1><pl><incl> ↔ ngalipa |
we (them & me) | nganimpa | nganimpa<prn><pers><p1><pl><excl> ↔ nganimpa |
you (both/two) | nyumpala | nyumpala<prn><pers><p2><du> ↔ nyumpala |
you (more than 2) | nyurrarla | nyurrarla<prn><pers><p2><pl> ↔ nyurrarla |
they/them (both/two) | nyanungu-jarra | nyanungu-jarra<prn><pers><p3><du> ↔ nyanungu-jarra |
they/them (more than 2) | nyanungu-rra/nyanungu-patu | nyanungu<prn><pers><p3><pl> ↔ nyanungu-rra; nyanungu<prn><pers><p3><pl> ↔ nyanungu-patu |
- pirli (a noun with just 2 vowels) has different endings than warlkurru (which has more than two vowels)
- however when suffixes such as -jarra are added, pirli gets the endings of nouns with >2 vowels: pirli-ngku → pirli-jarra-rlu
case name | approximate meaning | tag | possible forms | pirli "rock" (2 vowels) | warlkurru "axe" (>2 vowels) |
---|---|---|---|---|---|
absolutive | subject of intransitive verbs and object of transitive verbs | <abs> | — | pirli<n><sg><abs> ↔ pirli | warlkurru<n><sg><abs> ↔ warlkurru |
dative | "to" | <dat> | -ku | pirli<n><sg><dat> ↔ pirli-ku | warlkurru<n><sg><dat> ↔ warlkurru-ku |
ergative | "agent of a transitive verb" | <erg> | -ngku, -rlu | pirli<n><sg><erg> ↔ pirli-ngku | warlkurru<n><sg><erg> ↔ warlkurru-rlu |
allative | "onto" | <all> | -kurra | pirli<n><sg><all> ↔ pirli-kurra | warlkurru<n><sg><all> ↔ warlkurru-kurra |
comitative | "with" | <com> | -ngkajinta, -rlajinta | pirli<n><sg><com> ↔ pirli-ngkajinta | warlkurru<n><sg><com> ↔ warlkurru-rlajinta |
elative | "out of" | <ela> | -ngurlu | pirli<n><sg><ela> ↔ pirli-ngurlu | warlkurru<n><sg><ela> ↔ warlkurru-ngurlu |
locative | "at, in, on" | <loc> | -ngka, -rla | pirli<n><sg><loc> ↔ pirli-ngka | warlkurru<n><sg><loc> ↔ warlkurru-rla |
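The suffix choice in the case table can be sketched in Python. This is a minimal illustration, not the twol rules used in the module. It assumes the choice depends only on the number of vowel letters (a, i, u) in the stem, as described above, and the function and dictionary names are hypothetical.

```python
VOWELS = set("aiu")  # assumption: only the letters a, i, u are counted

# (form after 2-vowel stems, form after longer stems), taken from the case table
ALTERNATING = {
    "erg": ("ngku", "rlu"),
    "loc": ("ngka", "rla"),
    "com": ("ngkajinta", "rlajinta"),
}
# Suffixes with a single form in the table
INVARIANT = {"dat": "ku", "all": "kurra", "ela": "ngurlu"}


def count_vowels(stem):
    """Count vowel letters in the stem, including any suffixes already attached."""
    return sum(ch in VOWELS for ch in stem)


def inflect(stem, case):
    """Attach the case suffix, picking the allomorph by vowel count."""
    if case == "abs":
        return stem  # absolutive is unmarked
    if case in INVARIANT:
        return f"{stem}-{INVARIANT[case]}"
    short_form, long_form = ALTERNATING[case]
    suffix = short_form if count_vowels(stem) <= 2 else long_form
    return f"{stem}-{suffix}"


print(inflect("pirli", "erg"))
print(inflect("warlkurru", "erg"))
print(inflect("pirli-jarra", "erg"))
```

Because the count runs over the whole stem, pirli-jarra is treated as a long stem and takes -rlu, which matches the pirli-ngku → pirli-jarra-rlu example.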
- Disambiguation is a challenge because of flexible word order
echo "rdaka" | apertium -d . wbp-morph: ^rdaka/rdaka<num>/rdaka<n><sg><abs>$^./.<sent>$
echo "rdaka-pala" | apertium -d . wbp-morph: ^rdaka-pala/rdaka<num>$^./.<sent>$
- Trying to write the following disambiguation rule: # If there is a noun to the right or left, I cannot be a numeral-like noun.
- Examples of numeral-like nouns, sometimes distinguished by adding the number marker -pala.
- mirdi = "four" or "elbow"
- rdaka = "five" or "hand"
- wirlki = "seven" or "boomerang"
- narntirnki = "nine" or "curled"
- Need to write a scraper to test on a larger corpus (current corpus is ~13,000 words)
- Next to implement: perhaps reduplication?
echo "kurdu" | apertium -d . wbp-morph: ^kurdu/kurdu<n><sg><abs>$^./.<sent>$
echo "kurdu-kurdu" | apertium -d . wbp-morph: ^kurdu-kurdu/kurdu<n><pl><abs>$^./.<sent>$
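The numeral rule described earlier in this section (no numeral reading when a noun stands to the left or right) can be prototyped in Python on analyser output of the kind shown above. This is a sketch of the logic only, not the syntax of the module's disambiguation file. The two-word input "rdaka pirli" is a made-up test phrase, and the parser ignores escaped characters in the stream format.

```python
import re


def parse_stream(text):
    """Split analyser output into (surface, [analyses]) pairs."""
    units = []
    for body in re.findall(r"\^(.*?)\$", text):
        surface, *analyses = body.split("/")
        units.append((surface, analyses))
    return units


def has_noun_reading(analyses):
    return any("<n>" in a for a in analyses)


def drop_numeral_next_to_noun(units):
    """Remove <num> readings from a word whose left or right neighbour can be a noun."""
    result = []
    for i, (surface, analyses) in enumerate(units):
        neighbours = units[max(i - 1, 0):i] + units[i + 1:i + 2]
        if any(has_noun_reading(a) for _, a in neighbours):
            kept = [a for a in analyses if "<num>" not in a]
            # Never delete the last remaining reading (e.g. rdaka-pala)
            analyses = kept or analyses
        result.append((surface, analyses))
    return result


# Hypothetical input: "rdaka" followed by the noun "pirli"
stream = "^rdaka/rdaka<num>/rdaka<n><sg><abs>$ ^pirli/pirli<n><sg><abs>$^./.<sent>$"
for surface, analyses in drop_numeral_next_to_noun(parse_stream(stream)):
    print(surface, analyses)
```

Keeping the last reading means a form such as rdaka-pala, which only has a numeral analysis, is left alone even when it sits beside a noun.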
Current Evaluation
- Planned measures: coverage over a large corpus, plus precision and recall against hand-annotated, randomly selected forms
- Coverage over a large corpus:
- Number of tokenised words in the corpus: 18427
- Coverage: 51.26%
- Precision/Recall on wbp.annotated.basic.txt from Polished RBMT:
- Precision: 97.54902%
- Recall: 97.07317%
- Number of stems: ~250
status | description | stems | coverage |
---|---|---|---|
prototype | language module that has not received heavy development | <1,000 | <60% |
development | language module under development | ≥1,000 | ≥60% |
working | language module with near-production-quality performance | ≥8,000 | ≥80% |
production | language module used in a released pair | ≥10,000 | ≥90% |
- Table Source: http://wiki.apertium.org/wiki/Languages
- Background on analyzers: http://www.phil.uu.nl/tst/2012/Slides/SLP_Lecture2.pdf
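The coverage and precision/recall figures reported above can be computed along the following lines. This is a sketch under assumptions: the page does not say how the evaluation script counts, so the definitions here are one common choice, the function names are hypothetical, and unknown words are assumed to be marked with a leading `*` in the analyser output.

```python
def coverage(units):
    """Share of tokens that received at least one non-unknown analysis.

    units: list of (surface, [analyses]) pairs; an analysis starting with
    "*" is assumed to mark an unknown word.
    """
    known = sum(
        any(not a.startswith("*") for a in analyses)
        for _, analyses in units
    )
    return known / len(units)


def precision_recall(predicted, gold):
    """predicted, gold: dicts mapping a surface form to a set of analyses."""
    tp = fp = fn = 0
    for form, gold_set in gold.items():
        pred_set = predicted.get(form, set())
        tp += len(pred_set & gold_set)  # analyses the transducer got right
        fp += len(pred_set - gold_set)  # extra analyses not in the annotation
        fn += len(gold_set - pred_set)  # annotated analyses that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Tiny made-up example, not data from the real corpus
units = [("rdaka", ["rdaka<num>"]), ("foo", ["*foo"])]
print(coverage(units))

gold = {"rdaka": {"rdaka<n><sg><abs>"}}
predicted = {"rdaka": {"rdaka<num>", "rdaka<n><sg><abs>"}}
print(precision_recall(predicted, gold))
```

Coverage only says whether a token received some analysis, so it can rise while precision falls. That is why the two measures are reported separately.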
Notes
- To implement:
- Reduplication
- Get a big corpus:
- Build a scraper (?) to get the corpus up to 25K words
- Use regular expressions to clean up scraped (or copied/pasted) text
- For evaluation:
- Go back and look at coverage
- Other ling to do: UD pipeline
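The regular-expression cleanup mentioned in the notes could look roughly like this. It is a minimal sketch that assumes the scraped text contains HTML tags and named entities. The function names are hypothetical, and the patterns would need adjusting to the actual sources.

```python
import re


def clean_scraped_text(raw):
    """Strip markup-like debris and normalise whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)    # drop HTML tags
    text = re.sub(r"&[a-z]+;", " ", text)  # drop named entities such as &nbsp;
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()


def tokenise(text):
    """Split into word tokens, keeping word-internal hyphens together."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text)


raw = "<p>kurdu-kurdu&nbsp;  pirli</p>"
cleaned = clean_scraped_text(raw)
print(cleaned)
print(tokenise(cleaned))
```

Keeping hyphenated forms as single tokens matters here, since suffixed and reduplicated words such as pirli-ngku and kurdu-kurdu are written with hyphens.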