Kikuyu/Disambiguation

From LING073

Initial Evaluation of Ambiguity

The initial ambiguity of the corpus is 2340/1016 = 2.3031.
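The figure above is the total number of analyses divided by the number of tokens. As a minimal sketch of how such a count could be obtained, the following assumes the analyser output is in an Apertium-style stream, where each token looks like `^surface/analysis1/analysis2$`. The stream format, the tag names in the sample, and the function name `mean_ambiguity` are assumptions for illustration and are not taken from this page.

```python
import re

# Assumed lexical-unit shape: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]+)/([^$]*)\$")

def mean_ambiguity(stream: str) -> tuple[int, int, float]:
    """Return (analyses, tokens, analyses per token) for an analysed stream."""
    tokens = 0
    analyses = 0
    for match in LEXICAL_UNIT.finditer(stream):
        readings = match.group(2).split("/")  # one entry per analysis
        tokens += 1
        analyses += len(readings)
    ratio = analyses / tokens if tokens else 0.0
    return analyses, tokens, ratio

# Hypothetical two-token sample; the tags are made up for the example.
sample = "^nĩ/nĩ<cop><cl5><pl>/nĩ<cop><cl7><pl>$ ^irio/irio<n><cl5><pl>$"
print(mean_ambiguity(sample))   # (3, 2, 1.5)

# The corpus figure reported above, from its two counts.
print(round(2340 / 1016, 4))    # 2.3031
```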

The major reason this number is so high is that many Kikuyu words are genuinely ambiguous: the intended reading can be determined only from context. One such word is the verb form njĩkĩrire 'put' (first person singular), which appears to serve for either the current or the remote past (although this is hard to confirm from the data available).

Ambiguities that could be resolved by context also contribute to the high ambiguity value. For example, nĩ, the third-person present copula for all noun classes, appears many times in the corpus and has a very large number of analyses. Since noun class and plurality are both tagged, there are 25 distinct correct analyses of the form. The disambiguator can resolve this, at least when an overt subject is present; that is exactly what the added agreement rule does.

It is possible that some of these forms are not genuine forms of the language, but if so, there is no way to tell from any of the available grammars; the analysis of the language given predicts the existence of these forms. Further investigation of a corpus, preferably with the aid of a speaker, would be necessary to clarify this situation. In particular, there are numerous complications with tense, to the point that it is rare for a form to be analyzed with only one possible tense; this may be a true ambiguity in the language, or an error in the generator, but that is not yet clear.

It is also worth noting that some of the ambiguity comes from over-generation: the generator still produces a fair number of incorrect forms, and owing to the complexity of the morphology it has been difficult to eliminate all of these errors.

Example of Ambiguity

The following two sentences from the corpus both contain the ambiguous word nĩ:

ciana nĩ gũtheka irathekaga "the children were laughing"

mataha nĩ irio itahagwo na kaihũri "mataha is food scooped with a little half gourd"

In the first sentence, nĩ should be noun class seven (plural), and in the second, it should be noun class five (plural), to agree with the subject. Unfortunately, there is a complication: the analyser also treats mataha as a valid form of the verb tah 'fetch' (in fact the same verb as in itahagwo 'scoop'). Hence, in addition to the copula agreement rule, a disambiguation rule has been added that removes the verb reading of a word followed by a copula.

These rules will cause difficulties in cases where a verb is preceded by its object, as they will make the verb incorrectly agree in class with that object. This issue remains to be dealt with.

In prose, the rules added are as follows.

  1. If a copula, verb, or associative is preceded by a noun or pronoun of a particular noun class, select the form of that copula, verb, or associative which agrees with that noun class.
  2. If a copula, verb, or associative is preceded by a noun or pronoun of a particular number (singular or plural), select the form of that copula, verb, or associative which agrees with that number.
  3. If a word with a verb reading is followed by a copula, remove the verb reading.
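The three rules can be sketched on toy data as follows. This is an illustration only, not the actual rule file: the `Reading` class, the tag names, the readings given to mataha and nĩ, the choice to apply the verb-removal rule before the agreement rules, and the requirement that the preceding word be unambiguously a noun or pronoun are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass(frozen=True)
class Reading:
    lemma: str
    pos: str                        # hypothetical tags: n, prn, vblex, cop, assoc
    noun_class: Optional[int] = None
    number: Optional[str] = None    # "sg" or "pl"

AGREEING = {"cop", "vblex", "assoc"}   # targets of rules 1 and 2
CONTROLLERS = {"n", "prn"}             # words they agree with

def select(readings: List[Reading], keep: Callable[[Reading], bool]) -> List[Reading]:
    """Keep only readings passing `keep`, unless none would survive."""
    kept = [r for r in readings if keep(r)]
    return kept or readings

def disambiguate(sentence: List[List[Reading]]) -> List[List[Reading]]:
    sentence = [list(cohort) for cohort in sentence]
    # Rule 3: drop verb readings of a word directly followed by a copula.
    for i in range(len(sentence) - 1):
        if any(r.pos == "cop" for r in sentence[i + 1]):
            sentence[i] = select(sentence[i], lambda r: r.pos != "vblex")
    # Rules 1 and 2: agree in noun class, then in number, with the preceding word.
    for i in range(1, len(sentence)):
        prev = sentence[i - 1]
        if len(prev) != 1 or prev[0].pos not in CONTROLLERS:
            continue
        ctrl = prev[0]
        sentence[i] = select(
            sentence[i],
            lambda r: r.pos in AGREEING and r.noun_class == ctrl.noun_class)
        sentence[i] = select(
            sentence[i],
            lambda r: r.pos in AGREEING and r.number == ctrl.number)
    return sentence

# Hypothetical readings for the start of the second example sentence.
sentence = [
    [Reading("mataha", "n", 5, "pl"), Reading("tah", "vblex", 3, "pl")],
    [Reading("nĩ", "cop", 5, "pl"), Reading("nĩ", "cop", 7, "pl"),
     Reading("nĩ", "cop", 5, "sg")],
]
result = disambiguate(sentence)
print([len(cohort) for cohort in result])   # [1, 1]
```

Note that `select` never deletes the last remaining reading of a word, so a rule that matches nothing leaves the word as ambiguous as it was.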

The rules are able to correctly disambiguate both of the sentences given above.

Final Evaluation of Ambiguity

The final ambiguity was 2248/1016 = 2.2126, not a significant decrease from the original ambiguity: only 92 analyses (about 4%) were removed. It appears that the sort of ambiguity the rules were able to deal with is fairly rare (although present in the corpus). In particular, it is worth noting that nĩ often appears after a word other than a noun, in which case the agreement rules do not apply.
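The size of the change can be checked directly from the two reported counts. The short sketch below only redoes the arithmetic; the function name `ambiguity_change` is introduced here for illustration.

```python
def ambiguity_change(before: int, after: int, tokens: int) -> dict:
    """Summarise the effect of disambiguation from raw analysis counts."""
    removed = before - after
    return {
        "initial": round(before / tokens, 4),
        "final": round(after / tokens, 4),
        "removed": removed,
        "percent_removed": round(100 * removed / before, 1),
    }

summary = ambiguity_change(before=2340, after=2248, tokens=1016)
print(summary)
# {'initial': 2.3031, 'final': 2.2126, 'removed': 92, 'percent_removed': 3.9}
```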