Difference between revisions of "Khasi and Wôpanâak"

From LING073

Latest revision as of 00:17, 24 April 2017

Resources for machine translation between Khasi and Wôpanâak

Final Evaluation

Improvements to Wôpanâak Transducer

  • Updates to lexc
    • Removed {u} from first- and second-person suffixes, causing words such as "nutunantam" to analyze correctly. Additionally, {u} was added to third-person suffixes, allowing words like "nupuwak" to analyze correctly.
    • Fixed an overgeneration issue with intransitive verbs. Inanimate intransitive verbs can no longer receive first- or second-person morphology.
    • Reworked transitive verbs with inanimate objects, splitting the inflection lexicon into two, one for absolute forms and one for objective forms
    • Added the "TI2" paradigm of verb stems to the transducer by sending them to a different lexicon before inflection. This has allowed the analyzer to handle words such as "ahtaw" ("have/own")
    • Added the verbalizing suffix -w to the noun pathway, allowing nouns with it to analyze as verbs (e.g. "sôtyumâw" - "he/she is sachem")
    • Added a number of words which should have been receiving the -m possession suffix to the correct category. Also updated the -m suffix to change to -um when following a consonant. This is hopefully correct, as it is consistent with phonology elsewhere in the language, but may require a further update as it is largely conjecture on my part.
  • Updates to twol
    • Updated rules for deletion of {m} and {w} to work after vowel archiphonemes as well as vowels.
    • Consolidated multiple rule sets that did the same thing to the same archiphoneme into single rules, clearing up rule conflicts in the compiler.
    • Added a rule that deletes w from the prefix of 3p possessed dependent nouns beginning with ȣ. Now the transducer correctly analyzes and generates words such as "ȣshah" instead of "wȣshah" for "his father".
  • Updates to twoc
    • Added tags and rules to prevent overgeneration of forms associated with the -w verbalizing suffix: nouns will not receive a 3p possession reading without the appropriate prefix, and the verb form will never receive the 3p possession prefix incorrectly.
  • Updates to dix
    • Added 100 words to dix file and lexc file
  • Evaluation
    • Annotated corpus precision and recall:
      • Precision : 92.6%
      • Recall: 80.6%
    • Number of tokenised words in corpus: 553
    • Corpus coverage: 50.45%
    • Number of stems in transducer: 167
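
The coverage figure above is the share of corpus tokens that receive at least one analysis. As a minimal sketch of how such a figure can be computed, the snippet below assumes the analyser output is in Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token's analysis begins with `*`. The function name `coverage` and the sample tokens are placeholders for illustration, not taken from the project's scripts or corpus.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (known, total) token counts for a string of analyser output."""
    units = LEXICAL_UNIT.findall(analysed_text)
    total = len(units)
    # An unknown token carries a single analysis that starts with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known, total


# Placeholder tokens (not real Wôpanâak forms): two analysed, two unknown.
sample = "^aaa/aaa<n>$ ^bbb/*bbb$ ^ccc/ccc<v><tv>$ ^ddd/*ddd$"
known, total = coverage(sample)
print(f"Coverage: {100 * known / total:.2f}% ({known}/{total} tokens)")
```

By this definition, the reported 50.45% over 553 tokens is consistent with 279 tokens receiving an analysis (279/553 ≈ 50.45%).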

Improvements to Khasi Transducer

  • Updates to lexc
    • Completely restructured file and paths the transducer takes in order to make code much more readable, and fixed prefixation (I essentially rewrote most of the lexc file)
    • Added prefixed nominalization with 'kaba' (verbs and adverbs) and 'nym' and 'nong' (verbs)
    • Added progressives (prefixed) with 'nang' and 'iai'
    • Added future aspects with 'la'
    • Added past aspects with 'myn'
    • Added causality on verbs with 'pyn'
    • Added personal pronoun emphasis ('ma')
    • Added ungendered nouns and figured out how to work with them
  • Updates to disambiguator (x2)
    • Fixed 3rd and 4th disambiguator rules to be more specific to personal pronouns (If first word is a pronoun or an article, and next word is a noun/adj/verb, select article for the first word.)
    • All rules under "#Everything after this point was not written for the first assignment" in rlx file were written in the past but not as part of the disambiguator assignment; I found words that needed disambiguation so I fixed them as I found problems. There are three rules present that were not necessary for the first assignment, but that I completed previously.
    • For the last three rules necessary to (doubly) complete this portion, I wrote rules for selecting transitivity, differentiating between 'um' as a noun and 'um' as a negative personal pronoun, and getting a default for verbs that analyze twice (i.e. pyn+long vs pynlong, which are the same). See the rlx file for much more detail on all of the rules.
  • Updates to dix
    • Added 100 words to dix file and lexc file
  • Final Evaluation
    • My precision and recall test wasn't informative: every time I saw an error (or an undefined word) in my corpus, I added sections to the transducer that fixed those errors, then regenerated until I was finished, and I didn't save the initial output before I started modifying the lexc file. The command outputs:
      • Totals: 357 tp, 0 fp, 0 tn, 0 fn
      • Precision: 100.00000%
      • Recall: 100.00000%
      • Because the files are identical.
    • Coverage over the large corpus is 55.67%
    • There are 53,444 words in my large corpus
    • There are 262 stems in the Khasi transducer
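
For reference, the precision and recall totals above follow directly from the true-positive, false-positive and false-negative counts. The sketch below uses the standard definitions; the function name `precision_recall` is a placeholder and this is not the course's evaluation script.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions: precision = tp/(tp+fp), recall = tp/(tp+fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts reported above for the Khasi annotated corpus.
p, r = precision_recall(tp=357, fp=0, fn=0)
print(f"Precision: {p:.5%}")
print(f"Recall: {r:.5%}")
```

With zero false positives and zero false negatives both ratios are 357/357, which is why comparing a file against an identical copy yields 100% on both measures.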

Evaluation of MT between languages

  • Kha-Wam has 76.35% WER and PER
  • Kha-Wam has 45/82 stems translating correctly
  • Kha-Wam has 35.65% coverage and 1139 tokenized words
  • Wam-Kha has 91.30% WER and PER
  • Wam-Kha has 63/89 stems translating correctly
  • Wam-Kha has 25.51% coverage and 584 tokenized words
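
As a rough illustration of the two error rates quoted above, the sketch below computes word error rate (WER) as word-level edit distance divided by reference length, and position-independent error rate (PER) using one common bag-of-words definition. The exact formulas used by the evaluation tool behind these figures may differ, so treat the definitions, the function names and the placeholder sentences as assumptions.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Edit distance over words, normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate (assumed bag-of-words definition)."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Placeholder sentences: a substitution counts in both measures,
# while a pure reordering counts only in WER.
print(wer("a b c d", "a x c d"), per("a b c d", "a x c d"))
print(wer("a b c d", "a c b d"), per("a b c d", "a c b d"))
```

Under these definitions PER ignores word order, so the two scores differ only when some errors are purely a matter of ordering.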