Lingala/Disambiguation

From LING073

Initial Evaluation of Ambiguity

Initially, the corpus had an ambiguity of 1.1581.

Example of Ambiguity

One main source of ambiguity that constraint rules can readily fix involves kozala ("to be" in Lingala). Kozala can function as both a main verb, where it takes <v>, and an auxiliary verb, where it takes <vaux>. Two sentences that exhibit the ambiguity are shown below. Note that these analyses were produced before every word of each sentence had been added to the lexicon.

  • ^Moyíbi/moyíbi<n><sg><cl1>/moyíbi<n><sg><cl3>$ ^w/*w$â^ná/*ná$ ^bapoli/*bapoli$c^e/*e$ ^bazaláká/kozala<v><past><p3><pl><aa>/kozala<vaux><past><p3><pl><aa>$ ^koluka/*koluka$ ^yé/yé<prn><pers><p3><sg><aa>$ ^bandá/*bandá$ ^kala/*kala$^./.<sent>$
  • ^Ezalí/kozala<v><pres><p3><sg><nn>/kozala<v><pres><p3><pl><nn>/kozala<vaux><pres><p3><sg><nn>/kozala<vaux><pres><p3><pl><nn>$ ^miko/*miko$ ^mauti/*mauti$ ^na/na<pr>/na<cngcoo>$ ^yo/yó<prn><pers><p2><sg><aa>$ ^soko/*soko$ ^mokonzi/mokonzi<n><sg><cl1>$ ^awei/*awei$ ^ntango/*ntango$ ^na/na<pr>/na<cngcoo>$ ^lisálisi/*lisálisi$^!/!<sent>$

For convenience, the sentences without analysis are listed below:

  • Moyíbi wâná bapolice bazaláká koluka yé bandá kala.
  • Ezalí miko mauti na yo soko mokonzi awei ntango na lisálisi!

In order to ensure that all forms analyze, a number of items were added or changed in the Lexicon:

  • The determiner wâná and an appropriate analysis were added
  • The loan word bapolice, a plural noun of no class, was added to a special loan word lexicon
  • The verb stem -luk- (infinitive koluka, meaning to search) was added to the verb stem lexicon
  • Added both bandá and kala (since and long ago, respectively) to a new adverb lexicon
  • Added miko to the noun class III/IV lexicon
  • Added mauti as a relative pronoun in a new "tentative" lexicon. This is because it is translated "which", but I cannot verify the analysis independently in a grammar book or dictionary.
  • Added soko ("if") as a subordinating conjunction. Rearranged the conjunction lexicon to distinguish between coordinating and subordinating conjunctions.
  • Added kowâ with irregular stem -we- to the verb lexicon. This should analyze every form except the infinitive correctly.
  • Added ntango to the noun IX-X lexicon
  • Added lisálísí to the noun V-VI lexicon.

After these changes, the two sentences are analyzed as follows. The correct analysis of kozala is <vaux> in the first sentence and <v> in the second:

  • ^Moyíbi/moyíbi<n><sg><cl3>/moyíbi<n><sg><cl1>$ ^wâná/wâná<det><dem>$ ^bapolice/bapolice<n><pl>$ ^bazaláká/kozala<v><past><p3><pl><aa>/kozala<vaux><past><p3><pl><aa>$ ^koluka/koluka<v><inf>$ ^yé/yé<prn><pers><p3><sg><aa>$ ^bandá/bandá<adv>$ ^kala/kala<adv>$^./.<sent>$
  • ^Ezalí/kozala<v><pres><p3><sg><nn>/kozala<v><pres><p3><pl><nn>/kozala<vaux><pres><p3><sg><nn>/kozala<vaux><pres><p3><pl><nn>$ ^miko/miko<n><pl><cl4>$ ^mauti/mauti<prn><rel>$ ^na/na<pr>/na<cngcoo>$ ^yo/yó<prn><pers><p2><sg><aa>$ ^soko/soko<cnjsub>$ ^mokonzi/mokonzi<n><sg><cl1>$ ^awei/kowâ<v><pres><p3><sg><aa>$ ^ntango/ntango<n><sg><cl9>/ntango<n><pl><cl10>$ ^na/na<pr>/na<cngcoo>$ ^lisálisi/lisálísí<n><sg><cl5>$^!/!<sent>$

Note also that as a result of the changes to the lexicon, the corpus ambiguity level decreased slightly to 1.156.
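To make the ambiguity figure concrete, here is a minimal Python sketch of one common way to compute it. It assumes that "ambiguity" means the average number of analyses per lexical unit in the analyzer's stream-format output, and that an unknown form (marked with *) counts as one analysis. Both are assumptions, and the function name average_ambiguity is made up for illustration; the script actually used to produce the numbers on this page may differ.

```python
import re

# Matches one lexical unit in Apertium stream format: ^surface/analysis/...$
# (Assumption: no escaped "/" or "$" inside a unit.)
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def average_ambiguity(stream: str) -> float:
    """Mean number of analyses per lexical unit (assumed metric)."""
    counts = []
    for unit in LEXICAL_UNIT.findall(stream):
        surface, *analyses = unit.split("/")
        # An unknown form such as ^kala/*kala$ still counts as one analysis.
        counts.append(max(len(analyses), 1))
    if not counts:
        return 0.0
    return sum(counts) / len(counts)


sample = (
    "^bazaláká/kozala<v><past><p3><pl><aa>/kozala<vaux><past><p3><pl><aa>$ "
    "^koluka/koluka<v><inf>$ ^yé/yé<prn><pers><p3><sg><aa>$^./.<sent>$"
)
print(average_ambiguity(sample))  # 5 analyses over 4 units -> 1.25
```

On this four-unit fragment only bazaláká is ambiguous, so the average is 1.25; over the whole corpus, most forms have a single analysis, which is why the figures above are much closer to 1.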

Disambiguation

Overview

The main ambiguity of interest is the distinction between the <vaux> and <v> analyses of kozala, but we should first comment on some of the other ambiguities seen in the examples above. First, note that in sentence 1 Moyíbi is analyzed as both a class 1 and a class 3 noun. This is because as a class 1 noun it means "thief" and as a class 3 noun it means "theft". This would be fairly difficult to disambiguate with constraint grammar, though it might be possible to handle some special cases. In any case, this example is rather specific, so we do not address it here.

In sentence 2, notice that Ezalí has four analyses, not two as might be expected. This is because the inanimate prefix e- is the same in the singular and the plural, so both the <v> and <vaux> analyses are suggested in each number. To fix this, it is probably better to change the way we tag the word than to write constraint grammar rules. This should be possible by adding an <sp> tag to the e- prefix, denoting that it can be either singular or plural (this tag is on the Apertium list of symbols).

Also in sentence 2, na is recognized as both a preposition (<pr>) and a coordinating conjunction. In fact, na can take on even more roles in the sentence than have been implemented so far in lexc, so we postpone disambiguation of na for now.

Finally, ntango, a class 9/10 noun, has both a singular and a plural analysis. This is similar to the problem with ezalí, but it is complicated by the fact that the class differs between singular and plural even though the prefix n- stays the same. Given this, we postpone applying to ntango the <sp> tag fix that we apply to ezalí.
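The difference between the ezalí case and the ntango case can be illustrated with a short Python sketch. This is only a model of the effect of the <sp> fix, not the fix itself (which is made in lexc); the helper merge_number is hypothetical. It collapses two analyses into one <sp> analysis only when they are identical apart from <sg> versus <pl>.

```python
def merge_number(analyses):
    """Collapse analyses that differ only in <sg> vs <pl> into one <sp> analysis."""
    merged = []
    for analysis in analyses:
        if "<sg>" in analysis:
            if analysis.replace("<sg>", "<pl>") in analyses:
                # A plural twin exists: keep a single underspecified reading.
                merged.append(analysis.replace("<sg>", "<sp>"))
                continue
        elif "<pl>" in analysis and analysis.replace("<pl>", "<sg>") in analyses:
            # Already represented by the <sp> reading of its singular twin.
            continue
        merged.append(analysis)
    return merged


ezali = [
    "kozala<v><pres><p3><sg><nn>",
    "kozala<v><pres><p3><pl><nn>",
    "kozala<vaux><pres><p3><sg><nn>",
    "kozala<vaux><pres><p3><pl><nn>",
]
ntango = ["ntango<n><sg><cl9>", "ntango<n><pl><cl10>"]

print(merge_number(ezali))   # two readings remain: <v> and <vaux>, both <sp>
print(merge_number(ntango))  # unchanged: the class tag also differs
```

The four analyses of Ezalí reduce to two, leaving only the <v>/<vaux> choice, while the two analyses of ntango are left alone because cl9 and cl10 differ as well as the number.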

Implementation

Collapsing the singular and plural forms of inanimate verbs in lexc, as described above, lowered corpus ambiguity to 1.133.


We should now be able to disambiguate between <vaux> and <v> with two rules. In prose: "If a verb that can be either an auxiliary or a main verb is followed by another verb, remove its main-verb analysis," and "If a verb that can be either an auxiliary or a main verb is followed by any part of speech other than a verb, remove its auxiliary analysis." See GitHub for my implementation.
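The logic of the two rules can be sketched in plain Python. This is not the constraint grammar implementation referred to above, only a simulation of the prose rules on a simplified data structure: each word is a (surface form, list of readings) pair, and each reading is a string of a lemma followed by its tags. The names has_tag and disambiguate are invented for this sketch, and treating a sentence-final verb as "not followed by a verb" is an assumption.

```python
def has_tag(reading, tag):
    """True if the reading (lemma followed by tags) carries the given tag."""
    return tag in reading.split()[1:]


def disambiguate(cohorts):
    """Apply the two prose rules to a list of (surface, readings) pairs."""
    result = []
    for i, (surface, readings) in enumerate(cohorts):
        ambiguous = (any(has_tag(r, "v") for r in readings)
                     and any(has_tag(r, "vaux") for r in readings))
        if ambiguous:
            # Assumption: at the end of the sentence nothing follows,
            # which is treated the same as "followed by a non-verb".
            following = cohorts[i + 1][1] if i + 1 < len(cohorts) else []
            next_is_verb = any(has_tag(r, "v") or has_tag(r, "vaux")
                               for r in following)
            # Rule 1: before a verb, drop <v>. Rule 2: otherwise, drop <vaux>.
            drop = "v" if next_is_verb else "vaux"
            readings = [r for r in readings if not has_tag(r, drop)]
        result.append((surface, readings))
    return result


sentence_1 = [
    ("bazaláká", ["kozala v past p3 pl aa", "kozala vaux past p3 pl aa"]),
    ("koluka", ["koluka v inf"]),
]
sentence_2 = [
    ("Ezalí", ["kozala v pres p3 sp nn", "kozala vaux pres p3 sp nn"]),
    ("miko", ["miko n pl cl4"]),
]

print(disambiguate(sentence_1)[0])  # keeps only the vaux reading
print(disambiguate(sentence_2)[0])  # keeps only the v reading
```

In the first fragment bazaláká is followed by the infinitive koluka, so the main-verb reading is removed; in the second, Ezalí is followed by a noun, so the auxiliary reading is removed.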

After implementing these rules, the example sentences are disambiguated properly:

echo "Ezalí miko mauti na yo soko mokonzi awei ntango na lisálisi!" | apertium -d . lin-disam

"<Ezalí>"
	"kozala" v pres p3 sp nn
;	"kozala" vaux pres p3 sp nn REMOVE:16
"<miko>"
	"miko" n pl cl4
"<mauti>"
	"mauti" prn rel
"<na>"
	"na" pr
	"na" cngcoo
"<yo>"
	"yó" prn pers p2 sg aa
"<soko>"
	"soko" cnjsub
"<mokonzi>"
	"mokonzi" n sg cl1
"<awei>"
	"kowâ" v pres p3 sg aa
"<ntango>"
	"ntango" n sg cl9
	"ntango" n pl cl10
"<na>"
	"na" pr
	"na" cngcoo
"<lisálisi>"
	"lisálísí" n sg cl5
"<!>"
	"!" sent
"<.>"
	"." sent


echo "Moyíbi wâná bapolice bazaláká koluka yé bandá kala." | apertium -d . lin-disam

"<Moyíbi>"
	"moyíbi" n sg cl3
	"moyíbi" n sg cl1
"<wâná>"
	"wâná" det dem
"<bapolice>"
	"bapolice" n pl
"<bazaláká>"
	"kozala" vaux past p3 pl aa
;	"kozala" v past p3 pl aa REMOVE:13
"<koluka>"
	"koluka" v inf
"<yé>"
	"yé" prn pers p3 sg aa
"<bandá>"
	"bandá" adv
"<kala>"
	"kala" adv
"<..>"
	".." sent

Final Evaluation of Ambiguity

After making these changes, corpus ambiguity decreased to 1.120.