
Transducer for Hiaki

For my final project, I created a prototype transducer for the Hiaki language, which has extremely few computational resources. It is available on Swarthmore's GitHub and publicly on GitHub.

Implementation

The implementation includes the following notable or unusual functionality, in addition to standard treatment of nouns, verbs, pronouns, postpositions, adjectives, adverbs, and other common Hiaki word types:

Variation in Verb Forms

Hiaki verb roots show somewhat idiosyncratic variation in form depending on whether derivational suffixation appears. Because the variation is idiosyncratic, it could not be automated by rule; instead, the lexicon is structured so that both forms of a verb root can be entered, and the transducer generates only the appropriate one in each context.

Examples:

  • poona (without morphology) to pon (with morphology)
  • kiima (without morphology) to kima'a (with morphology)
  • ne'e (without morphology) to ni'i (with morphology)

As the above examples illustrate, this change is unpredictable; it can involve addition of letters, change of letters, or deletion of letters. For this reason it must be hard-coded into the transducer.
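As a standalone illustration of what the lexicon hard-codes, the following Python sketch (not the transducer's actual source, which is written in the transducer formalism) stores both root forms for the example verbs above and selects the bound form only when a derivational suffix is attached:

  # Minimal sketch: both allomorphs are listed explicitly, as in the lexicon.
  VERB_ROOTS = {
      "poona": {"free": "poona", "bound": "pon"},
      "kiima": {"free": "kiima", "bound": "kima'a"},
      "ne'e":  {"free": "ne'e",  "bound": "ni'i"},
  }

  def root_form(lemma, derivational_suffix=""):
      """Return the bound root before a derivational suffix, else the free root."""
      forms = VERB_ROOTS[lemma]
      if derivational_suffix:
          return forms["bound"] + derivational_suffix
      return forms["free"]

  assert root_form("poona") == "poona"
  assert root_form("ne'e", "SUFF") == "ni'iSUFF"   # "SUFF" is a placeholder suffix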

An important next step for the transducer would be to make sure both forms of every verb are included; this would possibly require consultation with a speaker, as sources do not always give both forms.

Noun Incorporation

Hiaki nouns and pronouns can be incorporated into verbs, losing their case morphology and becoming part of a single word. The transducer is structured to handle incorporation of arbitrary nouns into arbitrary verbs.

Due to the computational complexity of noun incorporation, it is currently disabled. However, pronoun incorporation, which occurs vastly more frequently in the corpus, remains enabled.
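As a rough illustration of the idea (a Python sketch with placeholder stems, not the mechanism actually used in the transducer), an incorporated form can be analyzed by trying each known caseless noun or pronoun stem as a prefix and checking whether the remainder is a known verb:

  # Placeholder stems; these are not attested Hiaki forms.
  INCORPORABLE_STEMS = {"NOUN1", "PRN1"}   # caseless noun/pronoun stems
  VERB_STEMS = {"VERB1", "VERB2"}

  def analyze_incorporation(token):
      """Yield (stem, verb) pairs for every split of the token into a known
      incorporated stem followed by a known verb stem."""
      for stem in INCORPORABLE_STEMS:
          if token.startswith(stem) and token[len(stem):] in VERB_STEMS:
              yield (stem, token[len(stem):])

  print(list(analyze_incorporation("PRN1VERB2")))   # [('PRN1', 'VERB2')]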

Reduplication

Many Hiaki verbs reduplicate productively to indicate, for instance, habitual aspect on verbs or plurality on adjectives. Reduplication is partially implemented using .twol rules. However, due to the complexity of Hiaki reduplication and the idiosyncrasies across verbs in which part of the verb reduplicates, the implementation is not yet complete; in particular, it can only handle reduplication of a single open or closed syllable, which must not begin with a vowel. (For example, the transducer can correctly analyze vivicha, the reduplicated form of vicha 'see'.)

Some "reduplications" are not actually reduplications; these appear on verbs which have a different morphological change in place of reduplication for the same semantic effect. Such "reduplications", which can be, for example, gemination of a consonant in the middle of the word, are hard-coded.

Examples of Reduplication Types:

  • wiuta to wiwiuta (open syllable reduplication)
  • kupikte to kupikupikte (two open syllable reduplication)
  • hakta to hakhakta (closed syllable reduplication)
  • maveta to mavveta (gemination -- not actually reduplication)
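To make the patterns concrete, the following Python sketch reproduces each type from the examples above (the transducer itself handles these through .twol rules and hard-coded entries rather than code like this); all four functions assume a consonant-initial root:

  import re

  def reduplicate_open(verb):
      """Copy the first open (CV) syllable: wiuta -> wiwiuta."""
      return re.match(r"[^aeiou]+[aeiou]", verb).group(0) + verb

  def reduplicate_two_open(verb):
      """Copy the first two open syllables: kupikte -> kupikupikte."""
      return re.match(r"[^aeiou]+[aeiou][^aeiou]+[aeiou]", verb).group(0) + verb

  def reduplicate_closed(verb):
      """Copy the first closed (CVC) syllable: hakta -> hakhakta."""
      return re.match(r"[^aeiou]+[aeiou][^aeiou]", verb).group(0) + verb

  def geminate(verb):
      """Geminate the onset of the second syllable: maveta -> mavveta."""
      m = re.match(r"([^aeiou]+[aeiou])([^aeiou])", verb)
      return m.group(1) + m.group(2) + verb[len(m.group(1)):]

  assert reduplicate_open("wiuta") == "wiwiuta"
  assert reduplicate_two_open("kupikte") == "kupikupikte"
  assert reduplicate_closed("hakta") == "hakhakta"
  assert geminate("maveta") == "mavveta"

Which pattern a given verb takes is idiosyncratic and has to be recorded per verb.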

Spellrelax

A copy of the spellrelax file originally from [http://svn.code.sf.net/p/apertium/svn/incubator/apertium-wal/dev/spellrelax.twol here] is used to relax apostrophe-like symbols into a single symbol, as such symbols can appear inside Hiaki words.
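As an approximation of what this does (a Python sketch; the exact set of characters relaxed is an assumption, not necessarily what the .twol file lists), apostrophe-like characters are collapsed into a plain apostrophe before lookup:

  # Map several apostrophe-like characters to a single symbol.
  APOSTROPHE_LIKE = "\u2019\u2018\u02BC\u0060\u00B4"   # right/left quote, modifier apostrophe, grave, acute
  RELAX_TABLE = {ord(ch): "'" for ch in APOSTROPHE_LIKE}

  def relax(token):
      """Replace every apostrophe-like character with an ASCII apostrophe."""
      return token.translate(RELAX_TABLE)

  print(relax("woho\u02BCoria"))   # -> woho'oria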

Next Steps

Several major grammatical aspects of the Hiaki language remain unimplemented, for example:

Number Suppletion

Many Hiaki verbs completely change root form depending on the number of one of their arguments, usually the object if the verb is transitive and the subject otherwise. This is implemented in the transducer with <sg> and <pl> tags on those verbs known to exhibit number agreement; however, existing word databases often do not make clear whether a verb root is singular, plural, or both. More work needs to be done to implement this properly.

Also, the transducer currently uses different lemmas for the two verbs, one for the singular form and one for the plural. To create an effective machine translation system in the future, these lemmas would likely need to be collapsed into one, which would require a small restructuring of the transducer. For the purposes of the current transducer, the lemmas are kept distinct to avoid the confusion of a lemma with no resemblance to its surface form; for certain uses of the transducer, such as automated glossing, this is likely better.
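A minimal Python sketch of the collapsed design discussed above (the forms here are placeholders, not attested Hiaki roots): a single lemma stores both suppletive roots, and the <sg>/<pl> tag selects which one is generated.

  # Placeholder roots standing in for a real suppletive verb pair.
  SUPPLETIVE_VERBS = {
      "LEMMA1": {"sg": "ROOT1SG", "pl": "ROOT1PL"},
  }

  def generate(lemma, number):
      """Generate the root for a suppletive verb from one lemma plus a number tag."""
      return SUPPLETIVE_VERBS[lemma][number]

  def analyze(form):
      """Map a suppletive root back to its single lemma and number tag."""
      for lemma, roots in SUPPLETIVE_VERBS.items():
          for number, root in roots.items():
              if form == root:
                  return (lemma, "<" + number + ">")
      return None

  print(generate("LEMMA1", "pl"))   # ROOT1PL
  print(analyze("ROOT1SG"))         # ('LEMMA1', '<sg>')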

Vowel Change Rules

The vowel change rules of Hiaki are fairly complex, and the forms of affixes often vary slightly depending on the root. The transducer needs to be improved by first identifying the regularities and patterns in affix alternation and then implementing them, probably through .twol rules. Some such patterns are currently implemented, but in general the transducer overgenerates, attaching every form of a given affix to any root. There are also likely many forms not yet encountered that need to be added to the transducer.

Root Change Rules

Many roots in Hiaki change form for a variety of reasons: number suppletion, the presence of a derivational suffix, incorporation into another word, or apparently idiosyncratic changes conditioned by specific suffixes, as in wohoʼoria-po and wohoʼok-u, both of which mean 'in a hole' but which have slightly different roots in addition to different locative suffixes. Before the transducer can be improved to handle these changes, what conditions them must be better understood. This requires research into existing work on Hiaki and possibly consultation with speakers.

Dealing with Spanish

A large amount of the text available in Hiaki comes from transcribed interviews with native speakers, who often switch briefly into Spanish while speaking Hiaki. Several very common Spanish words are included in the transducer as loanwords, but there may be a better way to handle this code-switching; this should be considered as development of the transducer continues.

Evaluation

Coverage was tested over a corpus (corpus/yaq.corpus.txt). Little Hiaki text is readily available, so the corpus used has only 3254 words. To continue work on the transducer, a larger corpus would likely need to be compiled.

Currently, the transducer tokenizes the corpus into 4018 tokens (including punctuation) and has a coverage of 67.21%.
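Coverage here means the proportion of tokens that receive at least one analysis. A Python sketch of that calculation, assuming one token per line of analyzer output and an Apertium-style convention of marking unknown tokens with '*' (both the format and the file name are assumptions):

  def coverage(analyzed_lines):
      """Fraction of non-empty lines (tokens) without an unknown-word mark."""
      total = known = 0
      for line in analyzed_lines:
          line = line.strip()
          if not line:
              continue
          total += 1
          if "*" not in line:
              known += 1
      return known / total if total else 0.0

  # Hypothetical usage:
  # with open("corpus/yaq.corpus.analyzed.txt") as f:
  #     print("%.2f%%" % (100 * coverage(f)))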

Precision and recall were evaluated on a small, 88-word segment of the corpus. The output of the transducer at the time of testing is stored, in CG format, in corpus/yaq.corpus.cg.txt; the correctly annotated version is stored in corpus/yaq.corpus.annotated.txt. Precision over this very small section was 80.53097%, and recall was 91.00000%. These high values reflect the fact that this is a well-understood section of the corpus (which is why it was chosen for annotation), so the transducer is over-fit to it, having been built with reference to this part of the corpus. It was not feasible to measure precision and recall on a random part of the corpus, as the rest of the corpus is not well enough understood to be annotated.
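For reference, precision and recall here are computed over readings: precision is the fraction of readings the transducer outputs that also appear in the annotated version, and recall is the fraction of annotated readings that the transducer outputs. A Python sketch of the calculation, assuming both files have already been parsed into one set of readings per token (the CG parsing itself is omitted, and the example readings are made up):

  def precision_recall(output_readings, gold_readings):
      """Each argument is a list with one set of readings per token."""
      out_total = gold_total = correct = 0
      for out, gold in zip(output_readings, gold_readings):
          out_total += len(out)
          gold_total += len(gold)
          correct += len(out & gold)
      precision = correct / out_total if out_total else 0.0
      recall = correct / gold_total if gold_total else 0.0
      return precision, recall

  out  = [{"vicha<v>"}, {"word2<n>", "word2<adj>"}]   # hypothetical readings
  gold = [{"vicha<v>"}, {"word2<n>"}]
  print(precision_recall(out, gold))   # (0.666..., 1.0)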

The transducer was then modified to improve precision and recall on yaq.corpus.cg.txt. The new output, in CG format, is stored in corpus/yaq.corpus.cg.new.txt. A new annotated version was also made, at corpus/yaq.corpus.cg.annotated.txt, because the transducer was updated to parse quotation marks, which are now included in the new CG-formatted documents. The new annotated corpus differs from the old one only in that it includes quotation marks. After the modification, precision was 74.07407% and recall was 95.23810%. The decrease in precision may be due to certain forms, which previously were not analyzed at all, now receiving multiple possibly correct analyses.