User:Tjones5/Final project
From LING073
Placeholder page for final project
Notes
- To implement:
- Reduplication
- Get a big corpus:
- Possibly build a scraper to bring the corpus up to 25K
- Use regular expressions to clean up scraped (or copied/pasted) text
- For evaluation:
- Go back and look at coverage
- Other linguistic work to do: UD pipeline
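The regular-expression cleanup step in the notes above could look roughly like the sketch below. It is only an illustration: the function name `clean_text` and the three patterns (leftover HTML tags, URLs, runs of whitespace) are assumptions about what scraped text might contain, not a record of what was actually run on the corpus.

```python
import re


def clean_text(raw):
    """Strip common scraping residue from a string (hypothetical helper)."""
    text = re.sub(r"<[^>]+>", " ", raw)        # drop leftover HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # drop bare URLs
    text = re.sub(r"\s+", " ", text)           # collapse whitespace runs
    return text.strip()


print(clean_text("<p>first  second\nthird</p> http://example.org"))
```

Patterns would need to be adjusted once the real sources are known, for example to keep hyphens that mark morpheme boundaries.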
Presentation
Description of the problem
- Word structure is analyzed as a composition of morphemes (the smallest units that carry meaning or grammatical function)
- A morphological transducer connects forms with analyses.
- Surface level -- represents the concatenation of letters which make up the actual spelling of the word, e.g. foxes
- Lexical level -- represents the concatenation of morphemes making up a word, e.g. NOUN-fox + PLURAL-es
- Morphological parsing -- the problem of recognizing that a word breaks down into component morphemes, as above with "foxes"
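The relationship between the surface and lexical levels can be sketched with a toy lookup in Python. This is not a finite-state transducer and not the lexc/twol implementation; the `STEMS` and `SUFFIXES` tables and the `analyse` function are hypothetical names made up for illustration, using the "foxes" example from the list above.

```python
# Toy lexicon: stem -> part-of-speech tag (hypothetical entries).
STEMS = {"fox": "NOUN", "cat": "NOUN"}

# Toy suffix table: surface suffix -> lexical tag (hypothetical entries).
SUFFIXES = {"": "SINGULAR", "s": "PLURAL", "es": "PLURAL"}


def analyse(surface):
    """Return every lexical-level analysis found for a surface form."""
    analyses = []
    for suffix, tag in SUFFIXES.items():
        if not surface.endswith(suffix):
            continue
        # Strip the candidate suffix and check whether a known stem remains.
        stem = surface[: len(surface) - len(suffix)]
        if stem in STEMS:
            morph = f"{tag}-{suffix}" if suffix else tag
            analyses.append(f"{STEMS[stem]}-{stem} + {morph}")
    return analyses


print(analyse("foxes"))  # ['NOUN-fox + PLURAL-es']
print(analyse("dogs"))   # [] (no known stem, so no analysis)
```

Note that this sketch would also accept "foxs"; in a real transducer, spelling rules of the kind written in twol are what restrict which suffix form appears after which stem.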
Previous work
- Many languages have working Apertium transducers.
- Warlpiri doesn't have a working transducer -- in fact, no languages from the Pama–Nyungan family have a transducer
- Other work?
Benefits of this project
- Why is a transducer useful to a community of Warlpiri speakers?
- Search engine
- Spell checker
- Machine Translation, especially to/from English (the primary language of most Warlpiri speakers). Many are trying to revive the use of Warlpiri, so there are many potentially useful applications.
- eng → wbp MT system could be used to look up how to say/write a certain word or phrase
- wbp → eng MT system could be used to analyze (and then translate) untranslated Warlpiri phrases, or give alternate translations
- All of these applications might also be helpful in starting an analyzer for other indigenous languages that have the same orthography, are highly agglutinative, and have other grammatical similarities
Approaches
- Work so far has mostly been in the lexc and twol files
- Examples??
- Disambiguation is a challenge because of flexible word order
- Numerals are one thing to deal with
- Need to build a scraper to test on a larger corpus
Evaluation
- Coverage over a large corpus
- Precision and recall against a hand-annotated set of randomly selected forms
- Coverage over large corpus
- Number of tokenised words in the corpus: 18427
- Coverage: 51.26%
- Precision/Recall on wbp.annotated.basic.txt from Polished RMBT:
- Precision: 97.54902%
- Recall: 97.07317%
- Number of stems: ~250
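For reference, the two kinds of measure listed above can be computed along the following lines. This is a minimal sketch under stated assumptions: coverage is taken to be the share of tokens that receive at least one analysis, and precision/recall are counted over individual analyses per form. The function names and data layout are hypothetical, and the actual figures above may have been produced by a different script with different conventions.

```python
def coverage(tokens, is_analysed):
    """Fraction of tokens for which the analyser returns an analysis."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if is_analysed(token))
    return known / len(tokens)


def precision_recall(gold, predicted):
    """Compare analyses per form; both arguments map form -> set of analyses."""
    tp = fp = fn = 0
    for form, gold_set in gold.items():
        pred_set = predicted.get(form, set())
        tp += len(gold_set & pred_set)  # analyses that match the annotation
        fp += len(pred_set - gold_set)  # analyses the annotation lacks
        fn += len(gold_set - pred_set)  # annotated analyses that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example data, not taken from the Warlpiri corpus.
gold = {"foxes": {"fox<n><pl>"}, "cats": {"cat<n><pl>"}}
predicted = {"foxes": {"fox<n><pl>", "foxe<n><pl>"}}
print(precision_recall(gold, predicted))  # (0.5, 0.5)
```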
status | description | stems | coverage |
---|---|---|---|
prototype | language module that has not received heavy development | <1,000 | <60% |
development | language module under development | ≥1,000 | ≥60% |
working | language module with near-production-quality performance | ≥8,000 | ≥80% |
production | language module used in a released pair | ≥10,000 | ≥90% |
- Table Source: http://wiki.apertium.org/wiki/Languages
- Background on analyzers: http://www.phil.uu.nl/tst/2012/Slides/SLP_Lecture2.pdf
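As a quick way of reading the status table above against the reported numbers, the thresholds can be written as a small lookup. This is a sketch under one assumption that the table itself does not state: a module is given the highest status for which both the stem count and the coverage threshold are met. The name `module_status` is hypothetical, and the "used in a released pair" condition for production status is not modelled.

```python
# (status, minimum stems, minimum coverage), highest level first.
LEVELS = [
    ("production", 10_000, 0.90),
    ("working", 8_000, 0.80),
    ("development", 1_000, 0.60),
]


def module_status(stems, coverage):
    """Return the highest status whose stem and coverage minimums are both met."""
    for name, min_stems, min_coverage in LEVELS:
        if stems >= min_stems and coverage >= min_coverage:
            return name
    return "prototype"


# Figures reported above: roughly 250 stems, 51.26% coverage.
print(module_status(250, 0.5126))  # prototype
```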