Khasi and Wôpanâak
From LING073
Revision as of 23:17, 23 April 2017 by Jmalin1 (→Improvements to Wôpanâak Transducer)
Resources for machine translation between Khasi and Wôpanâak
Final Evaluation
Improvements to Wôpanâak Transducer
- Updates to lexc
- Removed {u} from first- and second-person suffixes, so that words such as "nutunantam" now analyze correctly. Also added {u} to third-person suffixes, so that words like "nupuwak" analyze correctly.
- Fixed an overgeneration issue with intransitive verbs: inanimate intransitive verbs can no longer receive first- or second-person morphology.
- Reworked transitive verbs with inanimate objects, splitting the inflection lexicon into two, one for absolute forms and one for objective forms
- Added the "TI2" paradigm of verb stems to the transducer by sending them to a different lexicon before inflection. This has allowed the analyzer to handle words such as "ahtaw" ("have/own")
- Added the verbalizing suffix -w to the noun pathway, allowing nouns with it to analyze as verbs (e.g. "sôtyumâw" - "he/she is sachem")
- Moved a number of words that should have been receiving the -m possession suffix into the correct category. Also updated the -m suffix to change to -um when following a consonant. This is consistent with phonology elsewhere in the language, but it is largely conjecture on my part and may require a further update.
- Updates to twol
- Updated rules for deletion of {m} and {w} to work after vowel archiphonemes as well as vowels.
- Consolidated multiple rule sets that did the same thing to the same archiphoneme into single rules, clearing up rule conflicts in the compiler.
- Added a rule that deletes w from the prefix of 3p possessed dependent nouns beginning with ȣ. Now the transducer correctly analyzes and generates words such as "ȣshah" instead of "wȣshah" for "his father".
- Updates to twoc
- Added tags and rules to prevent overgeneration of forms associated with the -w verbalizing suffix: nouns will not receive a 3p possession reading without the appropriate prefix, and the verb form will never receive the 3p possession prefix incorrectly.
- Updates to dix
- Added 100 words to dix file and lexc file
- Evaluation
- Annotated corpus precision and recall:
- Precision : 92.6%
- Recall: 80.6%
- Number of tokenised words in corpus: 553
- Corpus coverage: 50.45%
- Number of stems in transducer: 167
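The corpus coverage figure above can be read as the share of tokens that receive at least one analysis. The following is a minimal sketch of that calculation, assuming the analyser output is in Apertium stream format (`^surface/analysis$`, with unanalysed forms marked by a leading `*`). The function name `coverage` and the sample tokens are hypothetical and are not taken from the project's scripts or corpus.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (Assumption: escaped slashes or dollar signs inside a unit are not handled.)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (known, total, ratio) for analyser output in stream format.

    A unit counts as unknown when its analysis field starts with '*'.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    total = len(units)
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    ratio = known / total if total else 0.0
    return known, total, ratio


# Placeholder tokens, not real Wôpanâak or Khasi data.
sample = "^aaa/aaa<n>$ ^bbb/*bbb$ ^ccc/ccc<v>/ccc<n>$ ^ddd/*ddd$"
known, total, ratio = coverage(sample)
print(f"{known}/{total} tokens analysed ({ratio:.2%})")
```

This "naive" coverage only checks that some analysis exists; it says nothing about whether the analysis is correct, which is what precision and recall measure.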
Improvements to Khasi Transducer
- Updates to lexc
- Completely restructured file and paths the transducer takes in order to make code much more readable, and fixed prefixation (I essentially rewrote most of the lexc file)
- Added prefixed nominalization with 'kaba' (verbs and adverbs) and 'nym' and 'nong' (verbs)
- Added progressives (prefixed) with 'nang' and 'iai'
- Added future aspects with 'la'
- Added past aspects with 'myn'
- Added causality on verbs with 'pyn'
- Added personal pronoun emphasis ('ma')
- Added ungendered nouns and figured out how to work with them
- Updates to disambiguator (x2)
- Fixed 3rd and 4th disambiguator rules to be more specific to personal pronouns (If first word is a pronoun or an article, and next word is a noun/adj/verb, select article for the first word.)
- All rules under "#Everything after this point was not written for the first assignment" in the rlx file were written earlier, but not as part of the disambiguator assignment: I added them as I came across words that needed disambiguation. Three of the rules there were not necessary for the first assignment but had already been completed.
- For the last three rules necessary to (doubly) complete this portion, I wrote rules for selecting transitivity, differentiating between 'um' as a noun and 'um' as a negative personal pronoun, and getting a default for verbs that analyze twice (i.e. pyn+long vs pynlong, which are the same). See the rlx file for much more detail on all of the rules.
- Updates to dix
- Added 100 words to dix file and lexc file
- Final Evaluation
- My precision and recall test wasn't informative: every time I saw an error (or undefined word) in my corpus, I added sections to the transducer that fixed those errors. I then regenerated until I was finished, and I didn't save the initial output before I started modifying the lexc file. The command outputs:
- Totals: 357 tp, 0 fp, 0 tn, 0 fn
- Precision: 100.00000%
- Recall: 100.00000%
- The scores are perfect only because the two files being compared are identical.
- Coverage over the large corpus is 55.67%
- There are 53,444 words in my large corpus
- There are 262 stems in the Khasi transducer
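The effect described above, where comparing a file with itself gives 100% precision and recall, can be illustrated with a small sketch. It assumes each token maps to a set of analyses and counts true positives, false positives and false negatives over those sets. It is not the evaluation script that produced the totals above (it does not count true negatives, for instance), and the tokens and tags are placeholders.

```python
def precision_recall(gold, predicted):
    """Compare two token -> set-of-analyses mappings.

    Returns (tp, fp, fn, precision, recall).
    """
    tp = fp = fn = 0
    for token, gold_analyses in gold.items():
        pred_analyses = predicted.get(token, set())
        tp += len(gold_analyses & pred_analyses)   # analyses found in both
        fp += len(pred_analyses - gold_analyses)   # predicted but not in gold
        fn += len(gold_analyses - pred_analyses)   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, fn, precision, recall


# Hypothetical annotations, not taken from the Khasi corpus.
gold = {"w1": {"w1<n>"}, "w2": {"w2<v>", "w2<n>"}, "w3": {"w3<v>"}}
pred = {"w1": {"w1<n>"}, "w2": {"w2<v>"}, "w3": {"w3<n>"}}

print(precision_recall(gold, pred))   # an imperfect analyser
print(precision_recall(gold, gold))   # gold compared with itself
```

When the second argument is the gold standard itself, every analysis is a true positive, so both scores are 1.0 regardless of how good the transducer was originally. Saving the first analyser output before editing the lexc file avoids this.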
Evaluation of MT between languages
- Kha-Wam has 76.35% WER and PER
- Kha-Wam has 45/82 stems translating correctly
- Kha-Wam has 35.65% coverage and 1139 tokenized words
- Wam-Kha has 91.30% WER and PER
- Wam-Kha has 63/89 stems translating correctly
- Wam-Kha has 25.51% coverage and 584 tokenized words
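For readers unfamiliar with the two metrics, here is a minimal sketch of how word error rate (WER) and position-independent error rate (PER) are commonly defined. It assumes whitespace tokenisation and a single reference translation. The function names are hypothetical, and the exact definitions used by the tool that produced the figures above may differ.

```python
from collections import Counter


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def position_independent_error_rate(reference, hypothesis):
    """Like WER, but words are matched as a bag, ignoring order."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Placeholder sentences, not real translation output.
print(word_error_rate("a b c d", "b a c e"))
print(position_independent_error_rate("a b c d", "b a c e"))
```

Because PER ignores word order, it is never higher than WER under these definitions. In the placeholder example the swapped words count as errors for WER but not for PER.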