Difference between revisions of "User:Tjones5/Final project"

From LING073
*Table Source: http://wiki.apertium.org/wiki/Languages
 
*Background on analyzers: http://www.phil.uu.nl/tst/2012/Slides/SLP_Lecture2.pdf
 

Revision as of 02:07, 5 May 2017

Presentation

Description of the problem

  • Word structure is analyzed by composition of morphemes (the smallest units with grammatical meaning)
  • A morphological transducer connects forms with analyses.
  • Surface level -- represents the concatenation of letters which make up the actual spelling of the word, e.g. wolves
  • Lexical level -- represents the concatenation of morphemes making up a word, e.g. NOUN-wolf + PLURAL-es
  • Morphological parsing -- the problem of recognizing that a word breaks down into component morphemes, as above with "wolves"

[Figure: Simple transducer.png]
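A toy sketch of the surface/lexical mapping described above. This is only a lookup table, not a real finite-state transducer, and the tag strings (`wolf<n><pl>` etc.) and the function names `generate`/`analyse` are made up for illustration.

```python
# Toy lexicon mapping lexical forms to surface forms.
# The entries and tag names are hypothetical, chosen to mirror the "wolves" example.
LEXICON = {
    "wolf<n><sg>": "wolf",
    "wolf<n><pl>": "wolves",
}


def generate(lexical):
    """Lexical level -> surface level (None if the form is not in the lexicon)."""
    return LEXICON.get(lexical)


def analyse(surface):
    """Surface level -> all lexical forms that spell out as this surface form."""
    return sorted(lex for lex, surf in LEXICON.items() if surf == surface)


if __name__ == "__main__":
    print(analyse("wolves"))
    print(generate("wolf<n><pl>"))
```

The same table is read in both directions, which is the property a transducer gives you for free: analysis and generation come from one description.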

Previous work

  • Many languages have working Apertium transducers:
  • Warlpiri doesn't have a working transducer -- in fact, no languages from the Pama–Nyungan family have a transducer

Benefits of this project

  • Why is a transducer useful to a community of Warlpiri speakers?
    • Search engine
    • Spell checker
    • Machine Translation, especially to/from English (the primary language of most Warlpiri speakers). Many are trying to revive the use of Warlpiri, so there are many potentially useful applications.
      • eng → wbp MT system could be used to look up how to say/write a certain word or phrase
      • wbp → eng MT system could be used to analyze (and then translate) untranslated Warlpiri stories, or give alternate translations
    • This work might also help in starting analyzers for other indigenous languages that use the same orthography, are highly agglutinative, and share other grammatical features

Approaches

  • Work so far has mostly been in the lexc and twol files
  • Pronouns (optional dash not included in table):
Meaning | Word | Form
I, me | ngaju(lu) | ngaju<prn><pers><p1><sg> ↔ ngaju; ngaju<prn><pers><p1><sg> ↔ ngajulu
you | nyuntu(lu) | nyuntu<prn><pers><p2><sg> ↔ nyuntu; nyuntu<prn><pers><p2><sg> ↔ nyuntulu
he/she/it; to him/her/it | nyanungu | nyanungu<prn><pers><p3><sg> ↔ nyanungu
we (you & me) | ngali(jarra) | ngali<prn><pers><p1><du><incl> ↔ ngali; ngali<prn><pers><p1><du><incl> ↔ ngalijarra
we (him/her/it & me) | ngajarra | ngajarra<prn><pers><p1><du><excl> ↔ ngajarra
we (you & me & other(s)) | ngalipa | ngalipa<prn><pers><p1><pl><incl> ↔ ngalipa
we (them & me) | nganimpa | nganimpa<prn><pers><p1><pl><excl> ↔ nganimpa
you (both/two) | nyumpala | nyumpala<prn><pers><p2><du> ↔ nyumpala
you (more than 2) | nyurrarla | nyurrarla<prn><pers><p2><pl> ↔ nyurrarla
they/them (both/two) | nyanungu-jarra | nyanungu-jarra<prn><pers><p3><du> ↔ nyanungu-jarra
they/them (more than 2) | nyanungu-rra/nyanungu-patu | nyanungu<prn><pers><p3><pl> ↔ nyanungu-rra; nyanungu<prn><pers><p3><pl> ↔ nyanungu-patu
  • pirli (a noun with just 2 vowels) has different endings than warlkurru (which has more than two vowels)
    • however when suffixes such as -jarra are added, pirli gets the endings of nouns with >2 vowels: pirli-ngku → pirli-jarra-rlu
case name | ~meaning | tag | possible forms | pirli "rock" (2 vowels) | warlkurru "axe" (>2 vowels)
absolutive | subject of intransitive verbs and object of transitive verbs | <abs> | | pirli<n><sg><abs> ↔ pirli | warlkurru<n><sg><abs> ↔ warlkurru
dative | "to" | <dat> | -ku | pirli<n><sg><dat> ↔ pirli-ku | warlkurru<n><sg><dat> ↔ warlkurru-ku
ergative | "agent of a transitive verb" | <erg> | -ngku, -rlu | pirli<n><sg><erg> ↔ pirli-ngku | warlkurru<n><sg><erg> ↔ warlkurru-rlu
allative | "onto" | <all> | -kurra | pirli<n><sg><all> ↔ pirli-kurra | warlkurru<n><sg><all> ↔ warlkurru-kurra
comitative | "with" | <com> | -ngkajinta, -rlajinta | pirli<n><sg><com> ↔ pirli-ngkajinta | warlkurru<n><sg><com> ↔ warlkurru-rlajinta
elative | "out of" | <ela> | -ngurlu | pirli<n><sg><ela> ↔ pirli-ngurlu | warlkurru<n><sg><ela> ↔ warlkurru-ngurlu
locative | "at, in, on" | <loc> | -ngka, -rla | pirli<n><sg><loc> ↔ pirli-ngka | warlkurru<n><sg><loc> ↔ warlkurru-rla
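The vowel-count rule behind the two sets of endings can be sketched in Python. This is only an illustration of the rule as stated here, not the twol implementation. It assumes that vowels are the letters a, i, u, that each vowel letter counts once (so a long vowel written double would count twice), and that stems with two or fewer vowels take the -ngku type endings. The names `SUFFIXES`, `count_vowels` and `inflect` are made up.

```python
VOWELS = set("aiu")  # assumption: Warlpiri orthography uses only these vowel letters

# case tag -> (ending for stems with <= 2 vowels, ending for stems with > 2 vowels)
SUFFIXES = {
    "abs": ("", ""),
    "dat": ("-ku", "-ku"),
    "erg": ("-ngku", "-rlu"),
    "all": ("-kurra", "-kurra"),
    "com": ("-ngkajinta", "-rlajinta"),
    "ela": ("-ngurlu", "-ngurlu"),
    "loc": ("-ngka", "-rla"),
}


def count_vowels(form):
    """Count vowel letters in a form; hyphens and consonants are ignored."""
    return sum(1 for ch in form.lower() if ch in VOWELS)


def inflect(stem, case):
    """Attach the case ending selected by the number of vowels in the stem."""
    short_ending, long_ending = SUFFIXES[case]
    ending = short_ending if count_vowels(stem) <= 2 else long_ending
    return stem + ending


if __name__ == "__main__":
    for stem in ("pirli", "warlkurru", "pirli-jarra"):
        print(stem, "->", inflect(stem, "erg"))
```

Because the count is taken over everything already attached, pirli-jarra has four vowels and gets -rlu, which matches pirli-ngku → pirli-jarra-rlu.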


  • Disambiguation is a challenge because of flexible word order
echo "rdaka" | apertium -d . wbp-morph: ^rdaka/rdaka<num>/rdaka<n><sg><abs>$^./.<sent>$
echo "rdaka-pala" | apertium -d . wbp-morph: ^rdaka-pala/rdaka<num>$^./.<sent>$
  • Trying to write the following disambiguation rule: # If there is a noun to the right or left, I cannot be a numeral-like noun.
  • Examples of numeral-like nouns, sometimes distinguished by adding the number marker -pala.
    • mirdi = "four" or "elbow"
    • rdaka = "five" or "hand"
    • wirlki = "seven" or "boomerang"
    • narntirnki = "nine" or "curled"
  • Need to write scraper to test on a larger corpus (current corpus is ~13,000 words)
  • Next to implement: perhaps reduplication?
echo "kurdu" | apertium -d . wbp-morph: ^kurdu/kurdu<n><sg><abs>$^./.<sent>$
echo "kurdu-kurdu" | apertium -d . wbp-morph: ^kurdu-kurdu/kurdu<n><pl><abs>$^./.<sent>$
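To see how often the numeral/noun ambiguity actually comes up, the analyser output can be read back in and checked for forms that have both a `<num>` and an `<n>` reading. A rough Python sketch, assuming the usual `^surface/reading/reading$` stream format shown above and ignoring escaped characters; the function names are made up, and this only finds the ambiguous forms, it does not disambiguate them.

```python
import re

COHORT = re.compile(r"\^([^$]*)\$")  # text between ^ and $
TAG = re.compile(r"<([^>]+)>")       # a single <tag>


def parse_stream(text):
    """Split analyser output into (surface form, list of readings) pairs."""
    cohorts = []
    for body in COHORT.findall(text):
        surface, *readings = body.split("/")
        cohorts.append((surface, readings))
    return cohorts


def first_tags(readings):
    """Collect the first tag of each reading (its part of speech)."""
    tags = set()
    for reading in readings:
        match = TAG.search(reading)
        if match:
            tags.add(match.group(1))
    return tags


def numeral_noun_ambiguous(text):
    """Return surface forms that have both a <num> and an <n> reading."""
    return [surface for surface, readings in parse_stream(text)
            if {"num", "n"} <= first_tags(readings)]


if __name__ == "__main__":
    out = "^rdaka/rdaka<num>/rdaka<n><sg><abs>$^./.<sent>$"
    print(numeral_noun_ambiguous(out))
```

Run over the whole corpus, the resulting list would give a set of real contexts to test the disambiguation rule against.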


Current Evaluation

  • Coverage over a large corpus
  • Precision and recall against hand-annotated, randomly selected forms
  • Coverage over a large corpus:
    • Number of tokenised words in the corpus: 18427
    • Coverage: 51.26%
  • Precision/Recall on wbp.annotated.basic.txt from Polished RBMT:
    • Precision: 97.54902%
    • Recall: 97.07317%
  • Number of stems: ~250
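For reference, a sketch of how these numbers could be computed. This is not the script that produced the figures above. It assumes unknown words show up in the analyser output as `^form/*form$`, and that precision and recall are counted over individual analyses per form; the function names are made up.

```python
import re

COHORT = re.compile(r"\^([^$]*)\$")  # text between ^ and $


def coverage(stream):
    """Share of tokens that receive at least one non-unknown (*) analysis."""
    known = total = 0
    for body in COHORT.findall(stream):
        _surface, *readings = body.split("/")
        total += 1
        if readings and not readings[0].startswith("*"):
            known += 1
    return known / total if total else 0.0


def precision(tp, fp):
    """Fraction of predicted analyses that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of gold analyses that were predicted."""
    return tp / (tp + fn) if tp + fn else 0.0


def count_matches(predicted, gold):
    """Compare {form: set of analyses} dicts; return (tp, fp, fn)."""
    tp = fp = fn = 0
    for form, gold_set in gold.items():
        pred_set = predicted.get(form, set())
        tp += len(pred_set & gold_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    return tp, fp, fn


if __name__ == "__main__":
    print(coverage("^kurdu/kurdu<n><sg><abs>$^xyz/*xyz$"))
    print(precision(199, 5) * 100, recall(199, 6) * 100)
```

As a sanity check on the arithmetic, 199 correct analyses with 5 false positives and 6 false negatives would give exactly 97.54902% precision and 97.07317% recall; whether those were the actual counts is not recorded here.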


status | description | stems | coverage
prototype | language module that has not received heavy development | <1,000 | <60%
development | language module under development | ≥1,000 | ≥60%
working | language module with near-production-quality performance | ≥8,000 | ≥80%
production | language module used in a released pair | ≥10,000 | ≥90%

Notes

  • To implement:
    • Reduplication
  • Get a big corpus:
    • Build a scraper? to get corpus up to 25K
    • Use regular expressions to clean up scraped (or copied/pasted) text
  • For evaluation:
    • Go back and look at coverage
  • Other ling to do: UD pipeline
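A possible starting point for the regular-expression cleanup step, in Python. It assumes the scraped text is HTML-ish and that word tokens are runs of ASCII letters optionally joined by hyphens; both are guesses about the eventual source, and `clean_text` and `tokenise` are made-up names.

```python
import re


def clean_text(raw):
    """Strip tags and named entities, then normalise whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)    # drop HTML tags
    text = re.sub(r"&[a-z]+;", " ", text)  # drop named entities such as &nbsp;
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()


def tokenise(text):
    """Return word tokens, keeping hyphenated forms such as kurdu-kurdu whole."""
    return re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)


if __name__ == "__main__":
    # Made-up input built from forms on this page; not a real sentence.
    raw = "<p>kurdu-kurdu&nbsp;rdaka-pala  pirli</p>"
    print(tokenise(clean_text(raw)))
```

Keeping hyphenated forms as single tokens matters here, since suffixes and reduplication are written with a dash.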