Guarani and Warlpiri

From LING073

Resources for machine translation between Guarani and Warlpiri

Corpora

Developed Resources

Lexical Selection

  • See Lexical Selection page linked above for more detail
  • (gug) ahy'o → (wbp) ngukarna "throat"; lirra "voice"
  • (gug) ha'anga → (wbp) japirdimi "threaten"; walaparrirni "imitate"
  • (gug) ajúra → (wbp) japirdimi "collar"; waninja "neck"
  • (wbp) rdaka → (gug) avatisoka "hand"; po "five"
  • (wbp) kirdirrpa → (gug) itakua "cave"; ka'irãi "jail"

Initial Evaluation

wbp → gug evaluation

Results when running on tests

  • Total number of words: 14
  • Total number of unknown words: 4
  • Percentage of unknown words: 28.57%
  • Percentage of stems translated correctly: 71.43%

When removing unknown words:

  • WER = 78.57%
  • PER = 78.57%

When unknown words are not removed:

  • WER = 100%
  • PER = 100%

Results when running on sentences (corpus)

  • Total number of words: 58
  • Total number of unknown words: 68
  • Percentage of unknown words: 84.45%
  • Percentage of stems translated correctly: 15.52%

When removing unknown words:

  • WER = 22.41%
  • PER = 22.41%

When unknown words are not removed:

  • WER = 100%
  • PER = 100%


gug → wbp evaluation

Results when running on tests

  • Total number of words: 13
  • Total number of unknown words: 0
  • Percentage of unknown words: 0%
  • Percentage of stems translated correctly: 100%

When removing unknown words:

  • WER = 92.86%
  • PER = 92.86%

When unknown words are not removed:

  • WER = 92.86%
  • PER = 92.86%

Results when running on sentences (corpus)

  • Total number of words: 66
  • Total number of unknown words: 60
  • Percentage of unknown words: 90.9%
  • Percentage of stems translated correctly: 9.09%

When removing unknown words:

  • WER = 0.0%
  • PER = 0.0%

When unknown words are not removed:

  • WER = 100%
  • PER = 100%
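
The percentages in these lists follow directly from the word counts. A minimal Python sketch of the arithmetic, assuming that the share of stems translated correctly is taken as the share of words that are not unknown (this matches the figures above for the test sets, e.g. 4 unknown out of 14 words); the function names are hypothetical:

```python
def unknown_rate(total_words: int, unknown_words: int) -> float:
    """Percentage of words that the translator left unknown."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * unknown_words / total_words


def known_rate(total_words: int, unknown_words: int) -> float:
    """Percentage of words that were not unknown (assumed stem rate)."""
    return 100.0 - unknown_rate(total_words, unknown_words)


# wbp -> gug, tests: 14 words, 4 unknown
print(f"{unknown_rate(14, 4):.2f}%")  # 28.57%
print(f"{known_rate(14, 4):.2f}%")    # 71.43%
```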

Final Evaluation

gug → wbp and wbp → gug

100 more stems

Added 100 more stems (mostly nouns, verbs, and adjectives) to the bilingual and monolingual dictionaries.

gug → wbp

6 new lexical selection rules

  • (gug) akã → (wbp) jurru "head"; payarrpa "hill"
    • Rule: Select "jurru" (Eng: head) as the translation of "akã" when it is preceded by "ha'e" (Eng: his/her). Otherwise, select "payarrpa" as the default.
  • (gug) asu → (wbp) jampu "left"; jampu-karra "left-handed"
    • Rule: Select "jampu-karra" (Eng: left-handed) as the translation of "asu" when it is followed by person words such as "ava" (Eng: person) or "kuña" (Eng: woman) or "mitã" (Eng: child). Otherwise, select "jampu" (Eng: left) as the default.
  • (gug) henyhẽ → (wbp) jaka-ngalya "full"; ngayarrka "pregnant"
    • Rule: Select "ngayarrka" (Eng: pregnant) as the translation of "henyhẽ" when it is followed/preceded by "kuña" (Eng: woman) or other words describing female people. Otherwise, select "jaka-ngalya" (Eng: full) as the default.
  • (gug) mongúi → (wbp) kijirni "to throw"; pata-karrimi "to fall"
    • Rule: Select "kijirni" (Eng: to throw) as the translation of "mongúi" when it is followed by "ita" (Eng: rock) or "apu'a" (Eng: ball). Otherwise, select "pata-karrimi" as the default.
  • (gug) moha'anga → (wbp) pantirni "to draw"; jarntirni "to carve"
    • Rule: Select "pantirni" (Eng: to draw) as the translation of "moha'anga" when it is followed by "ha'anga" (Eng: drawing). Otherwise, select "jarntirni" as the default.
  • (gug) mbogyapy → (wbp) pipa-kurra-mani "to write"; nyinanja-wantimi "to sit down"
    • Rule: Select "pipa-kurra-mani" (Eng: to write) as the translation of "mbogyapy" when it is followed by "aranduka" (Eng: book). Otherwise, select "nyinanja-wantimi" as the default.
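
The rules above share one shape: a default translation, plus a preferred translation chosen when a trigger word is adjacent. A minimal Python sketch of that logic follows. It is not the Apertium lexical selection rule format, and the `SelectionRule` class, the `select` function and the two-word lemma sequences are hypothetical, introduced only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionRule:
    source: str             # Guarani lemma the rule applies to
    preferred: str          # Warlpiri translation used when a trigger is adjacent
    default: str            # Warlpiri translation used otherwise
    triggers: frozenset     # context lemmas that activate the preferred translation
    check_before: bool = False
    check_after: bool = True


def select(rule: SelectionRule, lemmas: list, i: int) -> str:
    """Pick a translation for lemmas[i] by looking at its neighbours."""
    if lemmas[i] != rule.source:
        raise ValueError(f"rule for {rule.source!r} does not apply to {lemmas[i]!r}")
    neighbours = []
    if rule.check_before and i > 0:
        neighbours.append(lemmas[i - 1])
    if rule.check_after and i + 1 < len(lemmas):
        neighbours.append(lemmas[i + 1])
    if any(n in rule.triggers for n in neighbours):
        return rule.preferred
    return rule.default


ASU = SelectionRule("asu", "jampu-karra", "jampu",
                    frozenset({"ava", "kuña", "mitã"}))
HENYHE = SelectionRule("henyhẽ", "ngayarrka", "jaka-ngalya",
                       frozenset({"kuña"}), check_before=True)

# Illustrative lemma sequences, not attested sentences.
print(select(ASU, ["asu", "kuña"], 0))        # jampu-karra
print(select(ASU, ["asu", "ita"], 0))         # jampu
print(select(HENYHE, ["kuña", "henyhẽ"], 1))  # ngayarrka
```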

6 additions to the morphology

  • Added numeral analysis:
    • Several number words in Warlpiri also have meanings as nouns (e.g. rdaka can mean "hand" or "five")
    • the -pala ending distinguishes them as number words, but it is not always necessary
    • "rdaka-pala": ^rdaka-pala/rdaka<num>$^./.<sent>$
    • "rdaka": ^rdaka/rdaka<num>/rdaka<n><sg><abs>$^./.<sent>$
  • Added adverb analysis:
    • "nyurruwiyi": ^nyurruwiyi/nyurruwiyi<adv>$^./.<sent>$
  • Added elative/source case for verb auxiliaries:
    • "kajangka": ^kajangka/ka<vaux><src><subj3sg>$^./.<sent>$
  • Added inceptive aspect for verbs:
    • "pakarnunjunu": ^pakarnunjunu/pakarni<v><tv><past><inc>$^./.<sent>$
  • Added possessive marker "nyanu" for nouns:
    • "kaja-nyanu-rlu": ^kaja-nyanu-rlu/kaja<n><pl><poss><erg>$^./.<sent>$
  • Added the topicalizing particle as a grammatical marker for nouns. -ju marks known information or indicates the topic, although it sometimes has no discernible function.
    • "kurdukuju": ^kurdukuju/kurdu<n><sg><dat><top>$^./.<sent>$
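
The analyses above are written in the analyser's stream format: each unit is `^surface/reading/reading$`, and each reading is a lemma followed by tags in angle brackets. A minimal Python sketch for reading such lines, assuming no escaped special characters occur in the input (the real format allows backslash escapes, which this ignores); `parse_stream` is a hypothetical helper name:

```python
import re

UNIT = re.compile(r"\^(.*?)\$")   # one ^...$ unit
TAG = re.compile(r"<([^<>]+)>")   # one <tag>


def parse_stream(stream: str) -> list:
    """Return [(surface, [(lemma, [tags...]), ...]), ...] for each unit."""
    units = []
    for body in UNIT.findall(stream):
        surface, *readings = body.split("/")
        analyses = []
        for reading in readings:
            lemma = reading.split("<", 1)[0]
            analyses.append((lemma, TAG.findall(reading)))
        units.append((surface, analyses))
    return units


units = parse_stream("^rdaka/rdaka<num>/rdaka<n><sg><abs>$^./.<sent>$")
for surface, analyses in units:
    print(surface, analyses)
# rdaka [('rdaka', ['num']), ('rdaka', ['n', 'sg', 'abs'])]
# . [('.', ['sent'])]
```

The first unit shows the ambiguity described above: "rdaka" receives both a numeral reading and a noun reading.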


wbp → gug

6 new lexical selection rules

  • (wbp) nyurltu-nyurltu → (gug) apañuãi "confusing, garbled"; apokytã "tangled"
    • Rule: select "apokytã" if "nyurltu-nyurltu" is followed by the Warlpiri word "marnilpa" (hair). In other cases, select "apañuãi" as the default.
  • (wbp) warlu → (gug) tini "hot"; aku "angry"
    • Rule: select "aku" if "warlu" is followed by the Warlpiri word "miparrpa" (face) or "yapa" (person). In other cases, select "tini" as the default.
  • (wbp) yajarni → (gug) henoi "fetch/invite (someone)"; kakuaa "grow (a plant)"
    • Rule: select "kakuaa" if "yajarni" is followed by the Warlpiri word "watiya" (plant). In other cases, select "henoi" as the default.
  • (wbp) paarr-paarr-jankami → (gug) hapy "to burn, singe (hair or fur)"; jahéi "to hurt"
    • Rule: select "hapy" if "paarr-paarr-jankami" is followed by the Warlpiri word "marnilpa" (hair). In other cases, select "jahéi" as the default.
  • (wbp) pajirni → (gug) su'u "to bite"; hupi "to pick/gather"
    • Rule: select "hupi" if "pajirni" is followed by the Warlpiri word "mangarri" (fruit). In other cases, select "su'u" as the default.
  • (wbp) pakarli → (gug) kuatia "headdress"; kuatia "paper"
    • Rule: select "kuatia" if "pakarli" is preceded by the Warlpiri word "yangkarni" (to wear). In other cases, select "kuatia" as the default.


4 new twol rules

  • Nasal harmony on dative suffix -pe:
    • -pe now properly changes to -me after a nasal vowel.
    • Ex: havõ (soap). Suffix is regularly -pe, but now becomes -me
 echo "havõme" | apertium -d . gug-morph
 ^havõme/havõ<n><sg><dat>$^./.<sent>$
  • Nasal harmony on accusative suffix -ve:
    • -ve now properly changes to -me on accusative forms of pronouns
    • Ex: peẽ (second person plural pronoun). Suffix is regularly -ve, but now becomes -me
 echo "peẽme" | apertium -d . gug-morph
 ^peẽme/peẽ<prn><pers><p2><sg><acc>$^./.<sent>$
  • Stress on accusative forms of pronouns:
    • The e at the end of pronouns should be stressed (é) when followed by the accusative suffix -ve
    • Ex: che (first person singular pronoun). With suffix -ve, becomes chéve
 echo "chéve" | apertium -d . gug-morph
 ^chéve/che<prn><pers><p1><sg><acc>$^./.<sent>$
  • Nasal harmony on reflexive prefix je-:
    • je- now properly changes to ñe- when before a nasal.
    • Ex: ñeñapytĩ (“to tie”).
 echo "oñeñapytĩ" | apertium -d . gug-morph
 ^oñeñapytĩ/ñapytĩ<v><tv><ar><pres><p3><sp><ref>$^./.<sent>$
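
The nasal harmony alternations above can be sketched in Python. This is a simplified illustration and not the twol rules themselves: it assumes that a stem counts as nasal only if it contains a vowel written with a tilde, and it does not model nasal consonants as triggers or the written stress seen in chéve. The function names are hypothetical:

```python
import unicodedata

VOWELS = set("aeiouy")
COMBINING_TILDE = "\u0303"


def has_nasal_vowel(word: str) -> bool:
    """True if any vowel carries a tilde (ñ is a consonant and is ignored)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return any(
        base in VOWELS and mark == COMBINING_TILDE
        for base, mark in zip(decomposed, decomposed[1:])
    )


def dative(stem: str) -> str:
    """Attach -pe, or -me when the stem contains a nasal vowel."""
    return stem + ("me" if has_nasal_vowel(stem) else "pe")


def reflexive(stem: str) -> str:
    """Attach je-, or ñe- when the stem contains a nasal vowel."""
    return ("ñe" if has_nasal_vowel(stem) else "je") + stem


print(dative("havõ"))       # havõme
print(reflexive("ñapytĩ"))  # ñeñapytĩ
```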


Test results

gug

  • Precision and recall:
    • Results:
 precisionRecall ../corpus/ling073-gug-corpus/ling073-gug-corpus/gug.annotated.basic.txt ../bootstrap/ling073-gug/corpus.out.txt
 Totals: 89 tp, 19 fp, 0 tn, 87 fn
 Precision: 82.40741%
 Recall: 50.56818%
  • Coverage over large corpus:
    • Results:
 aq-covtest ~/Source/corpus/ling073-gug-corpus/ling073-gug-corpus/gug.corpus.large.txt gug.automorf.bin
 Number of tokenised words in the corpus: 522284
 Coverage: 28.96%
  • Number of stems in transducer: ~270 (didn't count by hand)

wbp

  • Precision and recall:
    • Results:
 precisionRecall ../ling073-wbp-corpus/wbp.annotated.basic.txt corpus.out.txt
 Totals: 199 tp, 5 fp, 0 tn, 6 fn
 Precision: 97.54902%
 Recall: 97.07317%
  • Coverage over large corpus:
    • Results:
 aq-covtest ../ling073-wbp-corpus/wbp.corpus.large.txt wbp.automorf.bin
 Number of tokenised words in the corpus: 18427
 Coverage: 51.26%
  • Number of stems in transducer: ~250
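
Precision and recall follow from the true positive (tp), false positive (fp) and false negative (fn) totals. A short Python sketch using the standard definitions, which reproduce the wbp figures above; `precision_recall` is a hypothetical helper, not part of the `precisionRecall` script:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall) as percentages."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# wbp totals: 199 tp, 5 fp, 6 fn
p, r = precision_recall(tp=199, fp=5, fn=6)
print(f"Precision: {p:.5f}%")  # Precision: 97.54902%
print(f"Recall: {r:.5f}%")     # Recall: 97.07317%
```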

wbp → gug

  • WER and PER
 Number of words in reference: 76
 Number of words in test: 69
 Number of unknown words (marked with a star) in test: 45
 Percentage of unknown words: 65.22 %
 Word error rate (WER): 100.00 %
 Position-independent word error rate (PER): 94.74 %
  • Proportion of stems translated correctly in longer: 31/71 = 40.8%
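
For reference, WER and PER can be sketched in Python as follows. WER is the word-level edit distance divided by the reference length. For PER this sketch assumes one common formulation, (max(reference length, test length) - matching words) / reference length, which may differ in detail from what the evaluation script computes. Both rates can exceed 100% when the test output is longer than the reference. The function names and the toy token lists are hypothetical:

```python
from collections import Counter


def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref: list, hyp: list) -> float:
    """Word error rate as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref: list, hyp: list) -> float:
    """Position-independent error rate (assumed formulation, see above)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


ref = ["a", "b", "c", "d"]
hyp = ["b", "a", "c", "d"]    # same words, two swapped
print(wer(ref, hyp))  # 50.0
print(per(ref, hyp))  # 0.0
```

The toy example shows the difference between the two measures: reordering is penalised by WER but not by PER.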

Not yet tested on the longer and large corpora; wbp.longer.txt is needed for this.

gug → wbp

  • WER and PER over gug.longer.txt
 Number of words in reference: 72
 Number of words in test: 81
 Number of unknown words (marked with a star) in test: 44
 Percentage of unknown words: 54.32 %
 Word error rate (WER): 112.50 %
 Position-independent word error rate (PER): 108.33 %
  • Proportion of stems translated correctly in longer: (81 - 44) / 81 = 37/81 = 45.7%
  • Trimmed coverage and number of tokens in longer and large corpora
 aq-covtest ../ling073-gug-wbp-corpus/gug.longer.txt gug-wbp.automorf.bin
 Number of tokenised words in the longer corpus: 97
 Coverage: 50.52%

 aq-covtest ../ling073-gug-wbp-corpus/gug.corpus.large.txt gug-wbp.automorf.bin
 Number of tokenised words in the large corpus: 501127
 Coverage: 24.74%
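
Coverage here is the share of tokens that receive at least one analysis. A rough Python sketch of that count over analyser output, assuming unknown tokens are marked with a leading asterisk in the stream (as in `^xyz/*xyz$`). This is not the `aq-covtest` implementation, and "xyz" is a made-up unknown token:

```python
import re

# surface form, then everything after the first "/" up to "$"
UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(stream: str) -> float:
    """Percentage of units whose first reading is not marked unknown (*)."""
    units = UNIT.findall(stream)
    if not units:
        raise ValueError("no analysed tokens found")
    known = sum(1 for _surface, readings in units if not readings.startswith("*"))
    return 100.0 * known / len(units)


sample = "^rdaka/rdaka<num>/rdaka<n><sg><abs>$ ^xyz/*xyz$^./.<sent>$"
print(f"Coverage: {coverage(sample):.2f}%")  # Coverage: 66.67%
```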