User:Tjones5/Final project

From LING073
Revision as of 22:38, 9 May 2017 by Tjones5 (talk | contribs) (Improvements Made to the Transducer)

Improvements Made to the Transducer

Final version: https://github.swarthmore.edu/tjones5/ling073-wbp

Commit: https://github.swarthmore.edu/tjones5/ling073-wbp/commit/125bb63f33f5f50caf682b746b972c549b4da7a0

  • .lexc:
    • optional dashes throughout (verb suffixes, pronouns)
    • improved organization of pronoun analysis: separation of 1st person from 2nd/3rd person
    • added several noun stems
  • .twol:
    • added %{U%} archiphoneme to appear in suffixes on nouns in dative case

Next commit: https://github.swarthmore.edu/tjones5/ling073-wbp/commit/f8841c64db1c248815bc5c5a3ce4a23b8d30f641

  • .lexc:
    • added present verb analysis (ex: piyani = present form of "to be", not piyanimi)
    • added verb tense and suffix analysis (apparently verbs can sometimes take object suffixes; previously I thought only auxiliaries could do this)
    • added "kirra" as another possible allative object ending
    • added clitic analysis and several clitics -- these do not yet increase coverage, so something still needs to change
    • added 6 noun stems, ~55 verb stems, 3 adverb stems
  • .twol:
    • eliminated bi-directionality of some rules

Presentation

Description of the problem

  • Word structure is analyzed by composition of morphemes (the smallest units with grammatical meaning)
  • A morphological transducer connects forms with analyses.
  • Surface level -- represents the concatenation of letters which make up the actual spelling of the word, e.g. wolves
  • Lexical level -- represents the concatenation of morphemes making up a word, e.g. NOUN-wolf + PLURAL-es
  • Morphological parsing -- the problem of recognizing that a word breaks down into component morphemes, as above with "wolves"
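
The relationship between the two levels can be sketched in a few lines of Python. This is only a toy illustration: a real transducer is compiled from lexc and twol sources rather than stored as a list, and the pairs and Apertium-style tags for "wolf" below are assumptions chosen to match the example above.

```python
# Toy illustration only: each entry pairs a lexical form with a surface form.
# The tags <n>, <sg>, <pl> are assumed here for the English example.
PAIRS = [
    ("wolf<n><sg>", "wolf"),
    ("wolf<n><pl>", "wolves"),
]


def analyse(surface):
    """Return every lexical form paired with the given surface form."""
    return [lex for lex, surf in PAIRS if surf == surface]


def generate(lexical):
    """Return every surface form paired with the given lexical form."""
    return [surf for lex, surf in PAIRS if lex == lexical]


print(analyse("wolves"))
print(generate("wolf<n><sg>"))
```

The same table is read in both directions: analysis goes from surface to lexical, generation from lexical to surface.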

[Figure: Simple transducer.png]

Previous work

  • Many languages have working Apertium transducers.
  • Warlpiri doesn't have a working transducer -- in fact, no languages from the Pama–Nyungan family have a transducer

Benefits of this project

  • Why is a transducer useful to a community of Warlpiri speakers?
    • Search engine
    • Spell checker
    • Machine Translation, especially to/from English (the primary language of most Warlpiri speakers). Many are trying to revive the use of Warlpiri, so there are many potentially useful applications.
      • eng → wbp MT system could be used to look up how to say/write a certain word or phrase
      • wbp → eng MT system could be used to analyze (and then translate) untranslated Warlpiri stories, or give alternate translations
    • All of these applications might also be helpful in starting an analyzer for other indigenous languages that have the same orthography, are highly agglutinative, and have other grammatical similarities

Approaches

  • Mostly have worked within lexc and twol files
  • Pronouns (optional dash not included in table):
Meaning | Word | Form
I, me | ngaju(lu) | ngaju<prn><pers><p1><sg> ↔ ngaju
 | | ngaju<prn><pers><p1><sg> ↔ ngajulu
you | nyuntu(lu) | nyuntu<prn><pers><p2><sg> ↔ nyuntu
 | | nyuntu<prn><pers><p2><sg> ↔ nyuntulu
he/she/it; to him/her/it | nyanungu | nyanungu<prn><pers><p3><sg> ↔ nyanungu
we (you & me) | ngali(jarra) | ngali<prn><pers><p1><du><incl> ↔ ngali
 | | ngali<prn><pers><p1><du><incl> ↔ ngalijarra
we (him/her/it & me) | ngajarra | ngajarra<prn><pers><p1><du><excl> ↔ ngajarra
we (you & me & other(s)) | ngalipa | ngalipa<prn><pers><p1><pl><incl> ↔ ngalipa
we (them & me) | nganimpa | nganimpa<prn><pers><p1><pl><excl> ↔ nganimpa
you (both/two) | nyumpala | nyumpala<prn><pers><p2><du> ↔ nyumpala
you (more than 2) | nyurrarla | nyurrarla<prn><pers><p2><pl> ↔ nyurrarla
they/them (both/two) | nyanungu-jarra | nyanungu-jarra<prn><pers><p3><du> ↔ nyanungu-jarra
they/them (more than 2) | nyanungu-rra/nyanungu-patu | nyanungu<prn><pers><p3><pl> ↔ nyanungu-rra
 | | nyanungu<prn><pers><p3><pl> ↔ nyanungu-patu
  • pirli (a noun with just 2 vowels) has different endings than warlkurru (which has more than two vowels)
    • however when suffixes such as -jarra are added, pirli gets the endings of nouns with >2 vowels: pirli-ngku → pirli-jarra-rlu
case name | ~meaning | tag | possible forms | pirli "rock" (2 vowels) | warlkurru "axe" (>2 vowels)
absolutive | subject of intransitive verbs and object of transitive verbs | <abs> | | pirli<n><sg><abs> ↔ pirli | warlkurru<n><sg><abs> ↔ warlkurru
dative | "to" | <dat> | -ku | pirli<n><sg><dat> ↔ pirli-ku | warlkurru<n><sg><dat> ↔ warlkurru-ku
ergative | "agent of a transitive verb" | <erg> | -ngku, -rlu | pirli<n><sg><erg> ↔ pirli-ngku | warlkurru<n><sg><erg> ↔ warlkurru-rlu
allative | "onto" | <all> | -kurra | pirli<n><sg><all> ↔ pirli-kurra | warlkurru<n><sg><all> ↔ warlkurru-kurra
comitative | "with" | <com> | -ngkajinta, -rlajinta | pirli<n><sg><com> ↔ pirli-ngkajinta | warlkurru<n><sg><com> ↔ warlkurru-rlajinta
elative | "out of" | <ela> | -ngurlu | pirli<n><sg><ela> ↔ pirli-ngurlu | warlkurru<n><sg><ela> ↔ warlkurru-ngurlu
locative | "at, in, on" | <loc> | -ngka, -rla | pirli<n><sg><loc> ↔ pirli-ngka | warlkurru<n><sg><loc> ↔ warlkurru-rla
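
The choice between the two sets of endings can be sketched as a vowel count over whatever the suffix attaches to. The Python below is a rough illustration, not the twol rules in the repository: it assumes the vowels are written a, i, u, it counts vowel letters (so a long vowel written with a doubled letter would count twice), and the function names are hypothetical. The suffix pairs are copied from the table above.

```python
VOWELS = set("aiu")  # assumption: the orthography writes vowels as a, i, u

# (form after a base with at most 2 vowels, form after a longer base)
SUFFIXES = {
    "abs": ("", ""),
    "dat": ("ku", "ku"),
    "erg": ("ngku", "rlu"),
    "all": ("kurra", "kurra"),
    "com": ("ngkajinta", "rlajinta"),
    "ela": ("ngurlu", "ngurlu"),
    "loc": ("ngka", "rla"),
}


def count_vowels(form):
    """Count vowel letters in a form; hyphens and consonants are ignored."""
    return sum(1 for ch in form if ch in VOWELS)


def inflect(base, case):
    """Attach the case suffix whose shape depends on the length of the base."""
    short_form, long_form = SUFFIXES[case]
    suffix = short_form if count_vowels(base) <= 2 else long_form
    return f"{base}-{suffix}" if suffix else base


print(inflect("pirli", "erg"))
print(inflect("warlkurru", "erg"))
print(inflect("pirli-jarra", "erg"))
```

Counting over the whole base, including any earlier suffix such as -jarra, is what makes pirli-jarra pattern with the longer nouns.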


  • Disambiguation is a challenge because of flexible word order
    echo "rdaka" | apertium -d . wbp-morph
    ^rdaka/rdaka<num>/rdaka<n><sg><abs>$^./.<sent>$
    echo "rdaka-pala" | apertium -d . wbp-morph
    ^rdaka-pala/rdaka<num>$^./.<sent>$
  • Trying to write the following disambiguation rule: # If there is a noun to the right or left, I cannot be a numeral-like noun.
  • Examples of numeral-like nouns, sometimes distinguished by adding the number marker -pala.
    • mirdi = "four" or "elbow"
    • rdaka = "five" or "hand"
    • wirlki = "seven" or "boomerang"
    • narntirnki = "nine" or "curled"
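
The logic of the intended rule can be sketched in Python. This is not constraint grammar syntax and not code from the repository: it assumes that "numeral-like noun" corresponds to the `<num>` reading, that a neighbour counts as a noun if any of its readings carries `<n>`, and that the last remaining reading of a word is never removed. The function names are hypothetical.

```python
def has_tag(readings, tag):
    """True if any reading in the list carries the given tag."""
    return any(tag in reading for reading in readings)


def drop_numeral_next_to_noun(cohorts):
    """cohorts is a list of (surface, [readings]); returns a filtered copy."""
    result = []
    for i, (surface, readings) in enumerate(cohorts):
        # Collect the readings of the immediate left and right neighbours.
        neighbours = []
        if i > 0:
            neighbours.append(cohorts[i - 1][1])
        if i + 1 < len(cohorts):
            neighbours.append(cohorts[i + 1][1])

        kept = list(readings)
        if any(has_tag(n, "<n>") for n in neighbours):
            kept = [r for r in readings if "<num>" not in r]
        # Assumption: keep the original readings rather than leave none.
        result.append((surface, kept or list(readings)))
    return result


sentence = [
    ("rdaka", ["rdaka<num>", "rdaka<n><sg><abs>"]),
    ("pirli", ["pirli<n><sg><abs>"]),
]
print(drop_numeral_next_to_noun(sentence))
```

A form like rdaka-pala, whose only reading is `<num>`, passes through unchanged because of the last-reading safeguard.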


  • Next to implement: perhaps reduplication?
    echo "kurdu" | apertium -d . wbp-morph
    ^kurdu/kurdu<n><sg><abs>$^./.<sent>$
    echo "kurdu-kurdu" | apertium -d . wbp-morph
    ^kurdu-kurdu/kurdu<n><pl><abs>$^./.<sent>$
  • Need to write a scraper to test on a larger corpus (current corpus is ~13,000 words)
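
One way to think about the reduplication item is as a pre-processing step in front of the existing analyser. The Python sketch below is an assumption-laden illustration, not how it would be written in lexc or twol: it only handles full reduplication written with a single hyphen, it assumes reduplication marks the plural as in the kurdu-kurdu example, and the one-entry lexicon and function names are hypothetical.

```python
# Hypothetical one-entry lexicon standing in for the real transducer.
LEXICON = {"kurdu": ["kurdu<n><sg><abs>"]}


def reduplicated_base(token):
    """Return X if the token has the shape X-X, otherwise None."""
    left, sep, right = token.partition("-")
    if sep and left and left == right:
        return left
    return None


def analyse(token):
    """Analyse a token, treating X-X as the plural of X (an assumption)."""
    base = reduplicated_base(token)
    if base is None:
        return list(LEXICON.get(token, []))
    # Swap the singular tag for the plural tag on each reading of the base.
    return [a.replace("<sg>", "<pl>", 1) for a in LEXICON.get(base, [])]


print(analyse("kurdu"))
print(analyse("kurdu-kurdu"))
```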


Current Evaluation

  • Coverage over a large corpus
  • Precision and recall against hand-annotated randomly selected forms
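
As a rough sketch of the coverage figure, the Python below counts analysed tokens in analyser output like the wbp-morph examples earlier on this page. It relies on the Apertium convention that an unknown word is returned with an analysis starting with `*`. Counting every cohort, punctuation included, is an assumption and may not match how the figures below were computed; the function name is hypothetical and escaped characters in the stream are not handled.

```python
import re

# A cohort is everything between ^ and $ in the analyser output.
COHORT = re.compile(r"\^(.*?)\$")


def coverage(stream_output):
    """Return the fraction of cohorts that received at least one analysis."""
    cohorts = COHORT.findall(stream_output)
    if not cohorts:
        return 0.0
    known = 0
    for cohort in cohorts:
        parts = cohort.split("/")
        # parts[0] is the surface form; an unknown word has "*form" next.
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known / len(cohorts)


sample = "^rdaka/rdaka<num>/rdaka<n><sg><abs>$ ^xyz/*xyz$"
print(f"{coverage(sample):.2%}")
```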

Initial Eval:

  • Coverage over large corpus
    • Number of tokenised words in the corpus: 18481
    • Coverage: 49.67%

Checkpoint at presentation:

  • Coverage over large corpus
    • Number of tokenised words in the corpus: 18427
    • Coverage: 51.26%
  • Number of stems: ~250




status | description | stems | coverage
prototype | language module that has not received heavy development | <1,000 | <60%
development | language module under development | ≥1,000 | ≥60%
working | language module with near-production-quality performance | ≥8,000 | ≥80%
production | language module used in a released pair | ≥10,000 | ≥90%
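
The thresholds in this table can be applied mechanically, as in the Python sketch below. It assumes a module must meet both the stem and the coverage threshold to reach a level, which the table does not state explicitly, and it ignores the non-numeric criterion for production (use in a released pair). The function name is hypothetical.

```python
# (status, minimum stems, minimum coverage in percent), highest level first.
LEVELS = [
    ("production", 10_000, 90.0),
    ("working", 8_000, 80.0),
    ("development", 1_000, 60.0),
]


def module_status(stems, coverage_percent):
    """Return the highest level whose two thresholds are both met."""
    for name, min_stems, min_coverage in LEVELS:
        if stems >= min_stems and coverage_percent >= min_coverage:
            return name
    return "prototype"


# Figures from the presentation checkpoint: ~250 stems, 51.26% coverage.
print(module_status(250, 51.26))
```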

Notes

  • To implement:
    • Reduplication
  • Get a big corpus:
    • Possibly build a scraper to get the corpus up to 25K
    • Use regular expressions to clean up scraped (or copied/pasted) text
  • For evaluation:
    • Go back and look at coverage
  • Other linguistic work to do: UD pipeline