User:Tjones5/Final project
From LING073
Revision as of 22:38, 9 May 2017 by Tjones5 (talk | contribs) (→Improvements Made to the Transducer)
Improvements Made to the Transducer
Final version: https://github.swarthmore.edu/tjones5/ling073-wbp
Commit: https://github.swarthmore.edu/tjones5/ling073-wbp/commit/125bb63f33f5f50caf682b746b972c549b4da7a0
- .lexc:
- optional dashes throughout (verb suffixes, pronouns)
- improved organization of pronoun analysis: separation of 1st person from 2nd/3rd person
- added several noun stems
- .twol:
- added %{U%} archiphoneme to appear in suffixes on nouns in dative case
Next commit: https://github.swarthmore.edu/tjones5/ling073-wbp/commit/f8841c64db1c248815bc5c5a3ce4a23b8d30f641
- .lexc:
- added present verb analysis (ex: piyani = present form of "to be", not piyanimi)
- added verb tense and suffix analysis (apparently verbs can sometimes take object suffixes; previously I thought only auxiliaries could do this)
- added "kirra" as another possible allative object ending
- added clitic analysis and several clitics -- something needs to be changed so that they increase coverage
- added 6 noun stems, ~55 verb stems, 3 adverb stems
- .twol:
- eliminated bi-directionality of some rules
Presentation
Description of the problem
- Word structure is analyzed by composition of morphemes (the smallest units with grammatical meaning)
- A morphological transducer connects forms with analyses.
- Surface level -- represents the concatenation of letters which make up the actual spelling of the word, e.g. wolves
- Lexical level -- represents the concatenation of morphemes making up a word, e.g. NOUN-wolf + PLURAL-es
- Morphological parsing -- the problem of recognizing that a word breaks down into component morphemes, as above with "wolves"
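As a toy illustration of the two levels, the sketch below maps between surface forms and lexical analyses with a hand-written lookup table. The tags (`<n>`, `<sg>`, `<pl>`) follow the Apertium-style tags used later on this page; the table `PAIRS` and the function names are hypothetical, and a real transducer encodes these pairs as a finite-state network rather than a Python list.

```python
# Toy illustration only: a real transducer is a finite-state network,
# not a lookup table. PAIRS and the function names are hypothetical.
PAIRS = [
    ("wolf<n><sg>", "wolf"),
    ("wolf<n><pl>", "wolves"),
]


def analyse(surface):
    """Surface form -> list of lexical analyses (morphological parsing)."""
    return [lex for lex, surf in PAIRS if surf == surface]


def generate(lexical):
    """Lexical analysis -> list of surface forms."""
    return [surf for lex, surf in PAIRS if lex == lexical]


print(analyse("wolves"))         # ['wolf<n><pl>']
print(generate("wolf<n><pl>"))   # ['wolves']
```

The same table is read in both directions, which is the sense in which a transducer "connects forms with analyses".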
Previous work
- Many languages have working Apertium transducers.
- Warlpiri doesn't have a working transducer -- in fact, no languages from the Pama–Nyungan family have a transducer
Benefits of this project
- Why is a transducer useful to a community of Warlpiri speakers?
- Search engine
- Spell checker
- Machine Translation, especially to/from English (the primary language of most Warlpiri speakers). Many are trying to revive the use of Warlpiri, so there are many potentially useful applications.
- eng → wbp MT system could be used to look up how to say/write a certain word or phrase
- wbp → eng MT system could be used to analyze (and then translate) untranslated Warlpiri stories, or give alternate translations
- All of these applications might also be helpful in starting an analyzer for other indigenous languages that have the same orthography, are highly agglutinative, and have other grammatical similarities
Approaches
- Worked mostly within the lexc and twol files
- Pronouns (optional dash not included in table):
Meaning | Word | Form |
---|---|---|
I, me | ngaju(lu) | ngaju<prn><pers><p1><sg> ↔ ngaju; ngaju<prn><pers><p1><sg> ↔ ngajulu |
you | nyuntu(lu) | nyuntu<prn><pers><p2><sg> ↔ nyuntu; nyuntu<prn><pers><p2><sg> ↔ nyuntulu |
he/she/it; to him/her/it | nyanungu | nyanungu<prn><pers><p3><sg> ↔ nyanungu |
we (you & me) | ngali(jarra) | ngali<prn><pers><p1><du><incl> ↔ ngali; ngali<prn><pers><p1><du><incl> ↔ ngalijarra |
we (him/her/it & me) | ngajarra | ngajarra<prn><pers><p1><du><excl> ↔ ngajarra |
we (you & me & other(s)) | ngalipa | ngalipa<prn><pers><p1><pl><incl> ↔ ngalipa |
we (them & me) | nganimpa | nganimpa<prn><pers><p1><pl><excl> ↔ nganimpa |
you (both/two) | nyumpala | nyumpala<prn><pers><p2><du> ↔ nyumpala |
you (more than 2) | nyurrarla | nyurrarla<prn><pers><p2><pl> ↔ nyurrarla |
they/them (both/two) | nyanungu-jarra | nyanungu-jarra<prn><pers><p3><du> ↔ nyanungu-jarra |
they/them (more than 2) | nyanungu-rra/nyanungu-patu | nyanungu<prn><pers><p3><pl> ↔ nyanungu-rra; nyanungu<prn><pers><p3><pl> ↔ nyanungu-patu |
- pirli (a noun with just 2 vowels) has different endings than warlkurru (which has more than two vowels)
- however when suffixes such as -jarra are added, pirli gets the endings of nouns with >2 vowels: pirli-ngku → pirli-jarra-rlu
case name | ~meaning | tag | possible forms | pirli "rock" (2 vowels) | warlkurru "axe" (>2 vowels) |
---|---|---|---|---|---|
absolutive | subject of intransitive verbs and object of transitive verbs | <abs> | — | pirli<n><sg><abs> ↔ pirli | warlkurru<n><sg><abs> ↔ warlkurru |
dative | "to" | <dat> | -ku | pirli<n><sg><dat> ↔ pirli-ku | warlkurru<n><sg><dat> ↔ warlkurru-ku |
ergative | agent of a transitive verb | <erg> | -ngku, -rlu | pirli<n><sg><erg> ↔ pirli-ngku | warlkurru<n><sg><erg> ↔ warlkurru-rlu |
allative | "onto" | <all> | -kurra | pirli<n><sg><all> ↔ pirli-kurra | warlkurru<n><sg><all> ↔ warlkurru-kurra |
comitative | "with" | <com> | -ngkajinta, -rlajinta | pirli<n><sg><com> ↔ pirli-ngkajinta | warlkurru<n><sg><com> ↔ warlkurru-rlajinta |
elative | "out of" | <ela> | -ngurlu | pirli<n><sg><ela> ↔ pirli-ngurlu | warlkurru<n><sg><ela> ↔ warlkurru-ngurlu |
locative | "at, in, on" | <loc> | -ngka, -rla | pirli<n><sg><loc> ↔ pirli-ngka | warlkurru<n><sg><loc> ↔ warlkurru-rla |
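The vowel-count pattern in the table can be sketched in a few lines of Python. This is only an illustration of the generalisation stated above, not the twol rule from the repository: the suffix pairs are copied from the table, the function names are hypothetical, and counting each vowel letter once (with no special treatment of long vowels) is an assumption.

```python
VOWELS = set("aiu")

# Suffix allomorphs copied from the case table:
# (form after a 2-vowel base, form after a base with more than 2 vowels).
ALLOMORPHS = {
    "erg": ("ngku", "rlu"),
    "loc": ("ngka", "rla"),
    "com": ("ngkajinta", "rlajinta"),
}
# Cases that show a single form in the table.
INVARIANT = {"dat": "ku", "all": "kurra", "ela": "ngurlu"}


def count_vowels(base):
    # Assumption: every vowel letter counts once.
    return sum(ch in VOWELS for ch in base)


def inflect(base, case):
    """Attach a case suffix to a base (stem plus any earlier suffixes)."""
    if case == "abs":
        return base
    if case in INVARIANT:
        return f"{base}-{INVARIANT[case]}"
    short, long_ = ALLOMORPHS[case]
    suffix = short if count_vowels(base) <= 2 else long_
    return f"{base}-{suffix}"


print(inflect("pirli", "erg"))        # pirli-ngku
print(inflect("warlkurru", "erg"))    # warlkurru-rlu
print(inflect("pirli-jarra", "erg"))  # pirli-jarra-rlu
```

Counting vowels over the whole base, including earlier suffixes such as -jarra, is what makes pirli-jarra pattern with the longer nouns.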
- Disambiguation is a challenge because of flexible word order
echo "rdaka" | apertium -d . wbp-morph
Output: ^rdaka/rdaka<num>/rdaka<n><sg><abs>$^./.<sent>$
echo "rdaka-pala" | apertium -d . wbp-morph
Output: ^rdaka-pala/rdaka<num>$^./.<sent>$
- Trying to write the following disambiguation rule: # If there is a noun to the right or left, I cannot be a numeral-like noun.
- Examples of numeral-like nouns, sometimes distinguished by adding the number marker -pala.
- mirdi = "four" or "elbow"
- rdaka = "five" or "hand"
- wirlki = "seven" or "boomerang"
- narntirnki = "nine" or "curled"
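To make the ambiguity concrete, the sketch below parses analyser output in the `^surface/reading/reading$` format shown above and applies a literal version of the draft rule: drop `<num>` readings when a neighbouring token has a noun reading. It is written in Python rather than as a constraint grammar rule, the function names are hypothetical, the two-word input is constructed from analyses on this page rather than taken from the corpus, and escaped characters in the stream are not handled. Whether the rule is the right linguistic generalisation is left open.

```python
import re

# One lexical unit in the stream format: ^surface/reading/reading$
UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def parse_stream(text):
    """Return a list of (surface, [readings]) pairs."""
    return [(m.group(1), m.group(2).split("/")) for m in UNIT.finditer(text)]


def has_tag(reading, tag):
    return f"<{tag}>" in reading


def remove_num_next_to_noun(units):
    """Drop <num> readings when the token to the left or right has a
    noun reading. The last remaining reading is never removed."""
    result = []
    for i, (surface, readings) in enumerate(units):
        neighbours = units[max(i - 1, 0):i] + units[i + 1:i + 2]
        noun_nearby = any(has_tag(r, "n") for _, rs in neighbours for r in rs)
        kept = [r for r in readings if not has_tag(r, "num")]
        if noun_nearby and kept:
            readings = kept
        result.append((surface, readings))
    return result


# Hypothetical input built from analyses shown on this page.
stream = "^rdaka/rdaka<num>/rdaka<n><sg><abs>$ ^pirli/pirli<n><sg><abs>$^./.<sent>$"
for surface, readings in remove_num_next_to_noun(parse_stream(stream)):
    print(surface, readings)
```

With the single-word input from the example above, no neighbour has a noun reading, so both readings of rdaka are kept.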
- Next to implement: perhaps reduplication?
echo "kurdu" | apertium -d . wbp-morph
Output: ^kurdu/kurdu<n><sg><abs>$^./.<sent>$
echo "kurdu-kurdu" | apertium -d . wbp-morph
Output: ^kurdu-kurdu/kurdu<n><pl><abs>$^./.<sent>$
- Need to write a scraper to test on a larger corpus (current corpus is ~13,000 words)
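The reduplication idea in the list above (kurdu-kurdu analysed as the plural of kurdu) can be prototyped outside the transducer. The sketch below handles only full reduplication written with a dash, checks the stem against a caller-supplied set of known noun stems, and always returns the absolutive; the function name and those restrictions are assumptions made for illustration, not a description of how it would be done in lexc.

```python
def analyse_reduplication(surface, noun_stems):
    """Return a plural analysis if surface is STEM-STEM for a known noun
    stem, otherwise None. Only full, dash-separated reduplication."""
    left, sep, right = surface.partition("-")
    if sep and left == right and left in noun_stems:
        return f"{left}<n><pl><abs>"
    return None


stems = {"kurdu", "pirli"}
print(analyse_reduplication("kurdu-kurdu", stems))  # kurdu<n><pl><abs>
print(analyse_reduplication("rdaka-pala", stems))   # None
```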
Current Evaluation
- Coverage over a large corpus
- Precision and recall against hand-annotated randomly selected forms
Initial evaluation:
- Coverage over large corpus
- Number of tokenised words in the corpus: 18481
- Coverage: 49.67%
Checkpoint at presentation:
- Coverage over large corpus
- Number of tokenised words in the corpus: 18427
- Coverage: 51.26%
- Number of stems: ~250
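For reference, coverage here means the share of tokens that receive at least one analysis. A minimal sketch of that calculation over analyser output follows; it is not the script that produced the figures above. It relies on Apertium marking unknown words with a leading `*` in the analysis (for example `^xyz/*xyz$`), counts punctuation as tokens, and ignores escaped characters; the function name is hypothetical.

```python
import re

# One lexical unit in the stream format: ^surface/reading/reading$
UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(stream):
    """Return (total_tokens, known_tokens, percent_known)."""
    total = known = 0
    for m in UNIT.finditer(stream):
        total += 1
        # Unknown words come back as ^word/*word$.
        if not m.group(2).startswith("*"):
            known += 1
    percent = 100.0 * known / total if total else 0.0
    return total, known, percent


sample = "^kurdu/kurdu<n><sg><abs>$ ^xyz/*xyz$"
print(coverage(sample))  # (2, 1, 50.0)
```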
status | description | stems | coverage |
---|---|---|---|
prototype | language module that has not received heavy development | <1,000 | <60% |
development | language module under development | ≥1,000 | ≥60% |
working | language module with near-production-quality performance | ≥8,000 | ≥80% |
production | language module used in a released pair | ≥10,000 | ≥90% |
- Table Source: http://wiki.apertium.org/wiki/Languages
- Background on analyzers: http://www.phil.uu.nl/tst/2012/Slides/SLP_Lecture2.pdf
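Read against the status table above, the checkpoint figures (~250 stems, 51.26% coverage) put the module at the prototype level. The sketch below encodes the thresholds from the table; treating a level as reached only when both the stem count and the coverage threshold are met is an assumption, since the table does not say how mixed cases are classified, and the function name is hypothetical.

```python
# (status, minimum stems, minimum coverage in %), strictest first,
# copied from the status table.
LEVELS = [
    ("production", 10000, 90.0),
    ("working", 8000, 80.0),
    ("development", 1000, 60.0),
]


def status(stems, coverage_pct):
    # Assumption: BOTH thresholds must be met to reach a level.
    for name, min_stems, min_cov in LEVELS:
        if stems >= min_stems and coverage_pct >= min_cov:
            return name
    return "prototype"


print(status(250, 51.26))  # prototype
```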
Notes
- To implement:
- Reduplication
- Get a big corpus:
- Build a scraper to get the corpus up to 25K words?
- Use regular expressions to clean up scraped (or copied/pasted) text
- For evaluation:
- Go back and look at coverage
- Other ling to do: UD pipeline
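The regular-expression clean-up mentioned in the notes above might look like the following. The specific patterns (leftover HTML tags, URLs, runs of whitespace) are guesses at what scraped or pasted text would contain, not a record of what was actually needed, and the function name is hypothetical.

```python
import re


def clean_text(raw):
    """Normalise scraped or pasted text before adding it to the corpus."""
    text = re.sub(r"<[^>]+>", " ", raw)        # drop leftover HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces/tabs
    text = re.sub(r" ?\n ?", "\n", text)       # trim spaces around newlines
    text = re.sub(r"\n{2,}", "\n", text)       # collapse blank lines
    return text.strip()


raw = "<p>kurdu-kurdu   pirli</p>\n\n\nhttp://example.org rdaka"
cleaned = clean_text(raw)
print(repr(cleaned))         # 'kurdu-kurdu pirli\nrdaka'
print(len(cleaned.split()))  # 3
```

Splitting the cleaned text on whitespace gives a rough word count for tracking progress toward the 25K target.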