Lingala and Kikuyu

From LING073
Revision as of 22:48, 17 April 2017 by Jmundo1

Resources for machine translation between Lingala and Kikuyu. Code is stored in a GitHub Repository.

Developed Resources

Parallel Corpus

A parallel corpus created from two translations of Genesis is available on GitHub.

Contrastive Grammar

Contrastive Grammar is detailed on the wiki page Lingala and Kikuyu/Contrastive Grammar.

Lexical Selection

Lexical selection is detailed on the wiki page Lingala and Kikuyu/Lexical Selection.

Structural Transfer

Structural transfer is detailed on the wiki page Lingala and Kikuyu/Structural Transfer.

Installing Unidecode For MT Evaluation

Most available Lingala text does not mark tones; however, the Lingala transducer has been configured using spellrelax to provide analyses and surface generation with tone markings included as accents on vowels. As a result, generated translations will include tone markings not present in the reference translation and artificially inflate evaluation metrics such as WER and PER. Fortunately, we can take advantage of the fact that accented characters are not present in ASCII to replace accented characters with their unaccented version on the command line.

Unidecode is a Python package that implements a function to convert Unicode strings into their closest ASCII approximation. For simple accented characters, this is precisely the mapping we need to remove the accents for evaluation against a reference translation. Unidecode can also be used on the command line with our machine translation system as follows: cat /path/to/kik-corpus | apertium -d . kik-lin | unidecode. To install unidecode, we recommend pip; in our experience, apt-get does not run the installation scripts needed to make unidecode accessible in Bash.

On the Ling 073 virtual machines, pip can be installed using sudo apt-get install python-pip. After installing pip, unidecode can be installed using pip install unidecode.
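
Where installing unidecode is inconvenient, the same accent-stripping step can be approximated with Python's standard library. The sketch below is a stand-in and not the unidecode package itself: it decomposes each character with unicodedata and drops the combining marks, which is enough for Lingala tone accents. The function name strip_accents is hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks (e.g. tone accents) from text.

    Stdlib stand-in for unidecode: it only handles characters that
    decompose into a base letter plus combining marks.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # Lingala verbs quoted elsewhere on this page
    print(strip_accents("kokíma kofánda kobánza"))
```

Note that this would also strip the tildes from Kikuyu ĩ and ũ, so it should only be applied to Lingala output.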

Lexical Selection

This section is part of the Lexical transfer assignment. In addition to adding many one-to-one mappings to our .dix file, we have included a couple of one-to-many mappings for each direction of translation. See Lingala and Kikuyu/Lexical Selection for further details.

From Lingala to Kikuyu, we consider four one-to-many mappings:

  • The Lingala verb kokíma can be translated in English as both "to run" and "to run away." Kikuyu has two distinct verbs corresponding to these English translations, respectively, hanyũk and ũr.
  • The Lingala verb kofánda can be translated in English as both "to sit" and "to live" (as in reside). Kikuyu has two distinct verbs corresponding to these English translations, respectively, ikar and tũũr.
  • The Lingala verb kobánza can be translated in English as both "to think" and "to remember." Kikuyu has two distinct verbs corresponding to these English translations, respectively, ĩcir and ririkan.
  • The Lingala noun mabele can be translated in English as both "sand" and "milk." Kikuyu has two distinct nouns corresponding to these English translations, respectively, thanga and ria.

From Kikuyu to Lingala, we consider three one-to-many mappings:

  • The Kikuyu verb igu can be translated in English as both "to obey" and "to hear." Lingala has two distinct verbs corresponding to these English translations, respectively, kotosa and koyoka.
  • The Kikuyu verb rut can be translated in English as both "to learn" and "to remove." Lingala has two distinct verbs corresponding to these English translations, respectively, koyekola and kokukula.
  • The Kikuyu noun ria can be translated in English as both "milk" and "pond." Lingala has two distinct nouns corresponding to these English translations, respectively, mabele and etimá.
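
The one-to-many mappings listed above can be pictured as a lookup table with several candidates per source lemma. The following is only an illustrative sketch in plain Python. It is not the .dix or lexical selection rule format used by Apertium, and the names LIN_TO_KIK, KIK_TO_LIN, and translate, as well as the "first candidate is the default" behaviour, are assumptions made for the example.

```python
# One-to-many mappings taken from the lists above.
# Each source lemma maps to (English gloss, target stem) pairs.
LIN_TO_KIK = {
    "kokíma": [("to run", "hanyũk"), ("to run away", "ũr")],
    "kofánda": [("to sit", "ikar"), ("to live", "tũũr")],
    "kobánza": [("to think", "ĩcir"), ("to remember", "ririkan")],
    "mabele": [("sand", "thanga"), ("milk", "ria")],
}

KIK_TO_LIN = {
    "igu": [("to obey", "kotosa"), ("to hear", "koyoka")],
    "rut": [("to learn", "koyekola"), ("to remove", "kokukula")],
    "ria": [("milk", "mabele"), ("pond", "etimá")],
}


def translate(lemma, table, sense=None):
    """Return the target stem for lemma.

    With no sense given, fall back to the first listed candidate
    (a hypothetical default used only for this illustration).
    """
    candidates = table[lemma]
    if sense is None:
        return candidates[0][1]
    for gloss, target in candidates:
        if gloss == sense:
            return target
    raise KeyError(f"no sense {sense!r} for {lemma!r}")
```

In the real system the choice between candidates is made from context by lexical selection rules rather than by an explicit sense argument.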

Initial Evaluation

lin → kik evaluation

  • Evaluation on mini corpus:
    • WER: 87.50%
    • PER: 87.50%
    • Number of words: 9
    • Number of unknown words: 0
    • Proportion of stems translated correctly: 100%
  • Evaluation of corpus:
    • WER: 99.06%
    • PER: 94.34%
    • Number of words: 107
    • Number of unknown words: 71
    • Proportion of stems translated correctly: 33.64%

kik → lin evaluation

  • Evaluation on mini corpus:
    • WER: 80%
    • PER: 80%
    • Number of words: 7 (This should really be 8, but the command we are supposed to use to count this outputs 7. The proportion remains unchanged in our case, however.)
    • Number of unknown words: 0
    • Proportion of stems translated correctly: 100%
  • Evaluation of corpus:
    • WER: 98.13%
    • PER: 90.65%
    • Number of words: 106
    • Number of unknown words: 79 (Note: Using the commands on the wiki gives a slightly different count than using the apertium-eval-translator tool. We use the count given by the command, not the tool.)
    • Proportion of stems translated correctly: 25.47%
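
For readers unfamiliar with the two metrics reported above, the sketch below shows one common way to compute them. It is not the code of apertium-eval-translator, and that tool's tokenization and exact PER formula may differ. WER is taken here as word-level edit distance divided by the reference length, and PER as the same idea with word order ignored. The function names are hypothetical, and a non-empty reference is assumed.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: compares bags of words."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)
```

Because PER ignores word order, it can never exceed WER for the same sentence pair, which matches the pattern in the figures above.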

Final Evaluation

lin → kik Implementation Details

This direction of the pair was improved by extending the transducer to handle pronouns, proper names, and new types of morphology; refining the constraint grammar so that it better selects forms which agree properly; and adding transfer rules for several new input configurations, including associatives and adjectives.

Bilingual Dictionary Expansion

The differences can be seen in the GitHub commit history.

Morphology Expansion

  • Support for demonstrative pronouns/determiners (proximal, distal, and anaphoric)
    • Example word: "rĩrĩa" (Class 5 singular distal demonstrative)
    • Old output: ^rĩrĩa/*rĩrĩa$^./.<sent>$
    • New output: ^rĩrĩa/pro<rel><cl5><sg>/pro<prn><dist><cl5><sg>$^./.<sent>$
  • Support for possessive pronouns/determiners
    • Example word: "gĩakwa" (Possessive pronoun for Class 7 singular object with first-person singular owner)
    • Old output: ^gĩakwa/*gĩakwa$^./.<sent>$
    • New output: ^gĩakwa/pro<prn><poss><cl7><sg><sp1><ssg>$^./.<sent>$
  • Support for relative pronouns
    • Example word: ũrĩa (Class 1 singular relative pronoun, as well as other noun classes and other types of pronouns)
    • Old output: ^ũrĩa/*ũrĩa$^./.<sent>$
    • New output: ^ũrĩa/pro<prn><rel><cl1><sg>/pro<prn><rel><cl3><sg>/pro<prn><rel><cl14><sg>/pro<prn><dem><dist><cl1><sg>/pro<prn><dem><dist><cl3><sg>/pro<prn><dem><dist><cl14><sg>$^./.<sent>$
  • Support for personal pronouns
    • Example word: yo (Class 3 plural personal pronoun)
    • Old output: ^yo/*yo$^./.<sent>$
    • New output: ^yo/pro<prn><pers><p3><cl3><pl>/pro<prn><pers><p3><cl9p10><sg>$
  • Support for proper names (proper name tag introduced, and several names added)
    • Example word: Kenya (Kenya)
    • Old output: ^Kenya/Kenya$^./.<sent>$
    • New output: ^Kenya/Kenya<n><prop>$
  • Support for Class 16 and 17 nouns (which tend to have to do with place), and their agreement on adjectives
    • Example: kũndũ guothe (all places)
    • Old output: ^kũndũ/*kũndũ$ ^guothe/*guothe$
    • New output: ^kũndũ/ndũ<n><cl17><sg><indef>$ ^guothe/othe<adj><cl17><sg>$^./.<sent>$
  • Support for unique reversive middle voice morphology
    • Example word: rũthiũrũrũkĩirie (surround, roughly)
    • Old output: ^makĩinũka/*makĩinũka$^./.<sent>$
    • New output: ^makĩinũka/in<seq><mid><rev><p3><cl1><pl>/in<seq><mid><rev><p3><cl9p6><pl>/in<seq><mid><rev><p3><cl14><pl>/in<seq><mid><rev><p3><cl5><pl>/in<seq><mid><rev><p3><cl11p6><pl>/in<seq><mid><rev><p3><cl15><pl>/in<seq><mid><rev><p3><cl12p6><pl>/in<seq><mid><rev><refl><p3><cl1><pl>/in<seq><mid><rev><refl><p3><cl9p6><pl>/in<seq><mid><rev><refl><p3><cl14><pl>/in<seq><mid><rev><refl><p3><cl5><pl>/in<seq><mid><rev><refl><p3><cl11p6><pl>/in<seq><mid><rev><refl><p3><cl15><pl>/in<seq><mid><rev><refl><p3><cl12p6><pl>$^./.<sent>$
  • Support for infinitives
    • Example word: gũthambĩra (to swim)
    • Old output: ^gũthambĩra/*gũthambĩra$^./.<sent>$
    • New output: ^gũthambĩra/thambĩr<v><inf>$^./.<sent>$
  • An attempt was made to implement object agreement morphology on verbs. Doing so required adding twoc rules for each noun class, which made twoc take far too long to compile, so the attempt was abandoned. The twoc rules and lexc code remain commented out in their respective files in case a workaround for the compile time is found.

Additional Smoothing of Transducer

  • Support for ĩ to y alternation in certain cases (implemented through twol)
    • Example word: yarĩ (copula form)
    • Old output: ^yarĩ/*yarĩ$^./.<sent>$
    • New output: ^yarĩ/rĩ<cop><rempast><p1><pl>/rĩ<cop><rempast><p2><pl>/rĩ<cop><rempast><p3><cl1><pl>/rĩ<cop><rempast><p3><cl3><pl>/rĩ<cop><rempast><p3><cl9p6><pl>/rĩ<cop><rempast><p3><cl14><pl>/rĩ<cop><rempast><p3><cl5><pl>/rĩ<cop><rempast><p3><cl11p6><pl>/rĩ<cop><rempast><p3><cl15><pl>/rĩ<cop><rempast><p3><cl12p13><pl>/rĩ<cop><rempast><p3><cl7><pl>/rĩ<cop><rempast><p3><cl12p6><pl>/rĩ<cop><rempast><p3><cl9p10><pl>/rĩ<cop><rempast><p3><cl11p10><pl>$^./.<sent>$
  • Support for ũ to u alternation before o
    • Example word: kuona (to see)
    • Old output: ^kuona/*kuona$^./.<sent>$
    • New output: ^kuona/on<v><inf>$^./.<sent>$
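
The old and new outputs above are in the Apertium stream format, where each lexical unit has the shape ^surface/analysis1/analysis2$ and an unknown word gets a single analysis beginning with an asterisk. A minimal sketch for reading such output follows. The names parse_units and is_unknown are hypothetical, and the parser assumes there are no escaped "/", "^" or "$" characters inside a unit.

```python
import re

# Matches one lexical unit: ^surface/analysis1/analysis2$
UNIT = re.compile(r"\^(.*?)\$")


def parse_units(stream):
    """Split analyser output into (surface, [analyses]) pairs.

    Simplification: assumes no escaped '/', '^' or '$' inside units.
    """
    units = []
    for match in UNIT.finditer(stream):
        surface, *analyses = match.group(1).split("/")
        units.append((surface, analyses))
    return units


def is_unknown(analyses):
    """An unknown word has a single analysis starting with '*'."""
    return len(analyses) == 1 and analyses[0].startswith("*")


if __name__ == "__main__":
    for surface, analyses in parse_units("^kuona/on<v><inf>$^./.<sent>$"):
        print(surface, analyses, is_unknown(analyses))
```

Running the same function over the old and new outputs makes it easy to list exactly which forms moved from unknown to analysed.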

Additional Constraint Grammar Rules

Note: Several of the rules in this section required multiple rlx rules to implement, because they act on different aspects of agreement, including number, noun class, and person.

  • If I am a verb, demonstrative, or associative inflected to agree with a certain number/noun class, select me if I am preceded by an adjective preceded by a noun of that number/noun class.
    • Example: maaĩ maingĩ me (many waters are)
    • Output:
"<Maaĩ>"
	"aĩ" n cl5 pl
"<maingĩ>"
	"ingĩ" adj cl5 pl SELECT:27
;	"ingĩ" adj cl9p6 pl SELECT:27
;	"ingĩ" adj cl14 pl SELECT:27
;	"ingĩ" adj cl11p6 pl SELECT:27
;	"ingĩ" adj cl15 pl SELECT:27
;	"ingĩ" adj cl12p6 pl SELECT:27
"<me>"
	"rĩ" cop pres p3 cl5 pl SELECT:33
;	"rĩ" cop pres p3 cl1 pl SELECT:33
;	"rĩ" cop pres p3 cl9p6 pl SELECT:33
;	"rĩ" cop pres p3 cl14 pl SELECT:33
;	"rĩ" cop pres p3 cl11p6 pl SELECT:33
;	"rĩ" cop pres p3 cl15 pl SELECT:33
;	"rĩ" cop pres p3 cl12p6 pl SELECT:33
"<.>"
	"." sent
  • If I am a verb inflected to agree with a certain number/noun class, select me if I am preceded by a demonstrative preceded by a noun of that number/person/noun class.
    • Example: andũ acio nĩmagakorwo (those people will find)
    • Output:
"<na>"
	"na" cnjcoo
	"na" pr
"<andũ>"
	"ndũ" n cl1 pl
"<acio>"
	"pro" prn dem ana cl1 pl
"<nĩmagakorwo>"
	"kor" v pres pass p3 cl1 pl foc SELECT:40
	"kor" v remfut pass p3 cl1 pl foc SELECT:40
;	"kor" v pres pass p3 cl9p6 pl foc SELECT:40
;	"kor" v pres pass p3 cl14 pl foc SELECT:40
;	"kor" v pres pass p3 cl5 pl foc SELECT:40
;	"kor" v pres pass p3 cl11p6 pl foc SELECT:40
;	"kor" v pres pass p3 cl15 pl foc SELECT:40
;	"kor" v pres pass p3 cl12p6 pl foc SELECT:40
;	"kor" v remfut pass p3 cl9p6 pl foc SELECT:40
;	"kor" v remfut pass p3 cl14 pl foc SELECT:40
;	"kor" v remfut pass p3 cl5 pl foc SELECT:40
;	"kor" v remfut pass p3 cl11p6 pl foc SELECT:40
;	"kor" v remfut pass p3 cl15 pl foc SELECT:40
;	"kor" v remfut pass p3 cl12p6 pl foc SELECT:40
"<.>"
	"." sent
  • If I am a verb inflected to agree with a certain number/person/noun class, select me if I am preceded by a numeral preceded by an associative preceded by a noun of that noun class.
    • Example: nyamũ ya mũthia nĩ (the last animal is)
    • Output:
"<nyamũ>"
	"nyamũ" n cl9p10 pl
	"nyamũ" n cl9p10 sg
"<ya>"
	"a" assoc cl9p10 sg SELECT:27
;	"a" assoc cl3 pl SELECT:27
;	"a" assoc cl9p6 sg SELECT:27
"<mũthia>"
	"mũthia" num
"<nĩ>"
	"rĩ" cop pres p3 cl9p10 pl SELECT:46
	"rĩ" cop pres p3 cl9p10 sg SELECT:46
;	"rĩ" cop pres p3 cl1 pl SELECT:46
;	"rĩ" cop pres p3 cl1 sg SELECT:46
;	"rĩ" cop pres p3 cl3 pl SELECT:46
;	"rĩ" cop pres p3 cl3 sg SELECT:46
;	"rĩ" cop pres p3 cl9p6 pl SELECT:46
;	"rĩ" cop pres p3 cl9p6 sg SELECT:46
;	"rĩ" cop pres p3 cl14 pl SELECT:46
;	"rĩ" cop pres p3 cl14 sg SELECT:46
;	"rĩ" cop pres p3 cl5 pl SELECT:46
;	"rĩ" cop pres p3 cl5 sg SELECT:46
;	"rĩ" cop pres p3 cl11p6 pl SELECT:46
;	"rĩ" cop pres p3 cl11p6 sg SELECT:46
;	"rĩ" cop pres p3 cl15 pl SELECT:46
;	"rĩ" cop pres p3 cl15 sg SELECT:46
;	"rĩ" cop pres p3 cl12p13 pl SELECT:46
;	"rĩ" cop pres p3 cl12p13 sg SELECT:46
;	"rĩ" cop pres p3 cl7 pl SELECT:46
;	"rĩ" cop pres p3 cl7 sg SELECT:46
;	"rĩ" cop pres p3 cl12p6 pl SELECT:46
;	"rĩ" cop pres p3 cl12p6 sg SELECT:46
;	"rĩ" cop pres p3 cl11p10 pl SELECT:46
;	"rĩ" cop pres p3 cl11p10 sg SELECT:46
"<.>"
	"." sent
  • If I am a preposition followed by a verb, remove me.
    • Example: na ndĩarĩ na kĩndũ (and was of nothing)
    • Output:
"<na>"
	"na" cnjcoo
;	"na" pr REMOVE:68
"<ndĩarĩ>"
	"rĩ" cop rempast neg p3 cl9p6 sg
	"rĩ" cop rempast neg p3 cl9p10 sg
"<na>"
	"na" cnjcoo
	"na" pr
"<kĩndũ>"
	"ndũ" n cl7 sg
"<.>"
	"." sent

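The agreement rules above all follow the same pattern: among the readings of an ambiguous word, keep those whose noun class and number match a nearby noun. The sketch below restates that logic in plain Python for illustration only. It is not CG-3 syntax and not the contents of the project's rlx file, and the names agreement_tags and select_agreeing are hypothetical. Readings are represented as (lemma, tags) pairs.

```python
def agreement_tags(reading):
    """Return the noun-class and number tags of a (lemma, tags) reading."""
    _lemma, tags = reading
    return {t for t in tags if t.startswith("cl") or t in {"sg", "pl"}}


def select_agreeing(noun_reading, candidate_readings):
    """Keep candidates whose class and number match the noun.

    If no candidate agrees, all candidates are returned unchanged,
    so the sketch never deletes every reading of a word.
    """
    wanted = agreement_tags(noun_reading)
    kept = [r for r in candidate_readings if agreement_tags(r) == wanted]
    return kept or list(candidate_readings)


if __name__ == "__main__":
    noun = ("aĩ", ("n", "cl5", "pl"))
    adjective_readings = [
        ("ingĩ", ("adj", "cl5", "pl")),
        ("ingĩ", ("adj", "cl9p6", "pl")),
        ("ingĩ", ("adj", "cl14", "pl")),
    ]
    print(select_agreeing(noun, adjective_readings))
```

The real rules additionally check the position of the noun relative to the target (for example, across an intervening adjective or demonstrative), which this sketch leaves to the caller.
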
Additional Transfer Rules

Note: None of the examples here will work, because a bug remains in generation in the Kikuyu transducer: it cannot generate most of the forms it can analyze. All of the forms output in these examples would be generated correctly if this bug were fixed, with no other changes to the Kikuyu transducer.

The chunker is the only part of the structural transfer system that goes beyond the defaults, so only the chunker output and the final output are given.

  • Implemented transfer of adjectives
    • Example: ezibele monene
    • Output of chunker: ^nominal<nominal>{^rango<n><cl3><sg>$ ^nene<adj><cl3><sg>$}$^sent<SENT>{^.<sent>$}$
    • Final output: #rango #nene
    • Hypothetical final output if Kikuyu generation fixed: mũrango mũnene
  • Implemented transfer of imperatives
    • Example: koyemba
    • Output of chunker: ^imperative<verb><imperative>{^in<v><imp>$}$^sent<SENT>{^.<sent>$}$
    • Final output: #in
    • Hypothetical final output if Kikuyu generation fixed: ina
  • Implemented transfer of associative
    • Example: bato ya mokili (the people of the world)
    • Output of chunker: ^nominal<nominal>{^ndũ<n><cl1><pl>$ ^a<assoc><cl1><pl>$ ^thĩ<n><cl9p10><sg>$}$^sent<SENT>{^.<sent>$}$
    • Final output: #ndũ a thĩ
    • Hypothetical final output if Kikuyu generation fixed: andũ a thĩ
  • Implemented transfer of demonstratives
    • Example: moto óyo (this person)
    • Output of chunker: ^nominal<nominal>{^ndũ<n><cl1><sg>$ ^pro<prn><dem><prox><cl1><sg>$}$^sent<SENT>{^.<sent>$}$
    • Final output: #ndũ ũyũ
    • Hypothetical final output if Kikuyu generation fixed: mũndũ ũyũ
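
The chunker outputs above wrap the translated lexical units in chunks of the form ^name<tags>{...}$. As a reading aid, the following sketch pulls each chunk apart into its name, its tags, and its inner units. It is an assumption-laden illustration (hypothetical name parse_chunks, no handling of escaped characters or nested braces) and is not part of the Apertium toolchain.

```python
import re

# ^name<tag1><tag2>{ ...inner lexical units... }$
CHUNK = re.compile(r"\^([^<{^]+)((?:<[^>]+>)+)\{(.*?)\}\$")
UNIT = re.compile(r"\^(.*?)\$")
TAG = re.compile(r"<([^>]+)>")


def parse_chunks(stream):
    """Return a list of dicts describing each chunk in chunker output."""
    chunks = []
    for match in CHUNK.finditer(stream):
        name, tags, body = match.groups()
        chunks.append({
            "name": name,
            "tags": TAG.findall(tags),
            "units": UNIT.findall(body),
        })
    return chunks


if __name__ == "__main__":
    example = ("^nominal<nominal>{^ndũ<n><cl1><sg>$ "
               "^pro<prn><dem><prox><cl1><sg>$}$^sent<SENT>{^.<sent>$}$")
    for chunk in parse_chunks(example):
        print(chunk)
```

Comparing the units listed here with the final output shows which of them the generator failed to produce (those prefixed with #).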

kik → lin Implementation Details

This direction of the RBMT system was improved by adding items to the bilingual dictionary, adding additional morphology to the transducer, writing more constraint grammar rules, and writing additional transfer rules. Specific implementation details with examples of improved output are detailed in the sections that follow.

Bilingual Dictionary Expansion

The differences can be seen in the GitHub commit history.

Morphology Expansion

  • Support for proper nouns
    • Example sentence: "Nzambe" (God)
    • Old output: ^Nzambe/*Nzambe$^./.<sent>$
    • New output: ^Nzambe/Nzambe<n><prop>$^./.<sent>$
  • Support for imperative mood
    • Example sentence: "Kíma!" (Run!)
    • Old output: ^Kíma/*Kíma$^!/!<sent>$^./.<sent>$
    • New output: ^Kíma/kokíma<v><imp>$^!/!<sent>$^./.<sent>$
  • Support for intensifying suffixes (-a becomes -áká to denote certainty (infinitive, future) or insistence (imperative))
    • Example sentence 1: "Azalákí kopésáká" (He was really distributing...)
    • Old output: ^Azalákí/kozala<v><urp><p3><sg><aa>/kozala<vaux><urp><p3><sg><aa>$ ^kopésáká/*kopésáká$^./.<sent>$
    • New output: ^Azalákí/kozala<v><urp><p3><sg><aa>/kozala<vaux><urp><p3><sg><aa>$ ^kopésáká/kopésa<v><inf><intns>$^./.<sent>$
    • Example sentence 2: "Akopésáká" (He will really give...)
    • Old output: ^Akopésáká/*Akopésáká$^./.<sent>$
    • New output: ^Akopésáká/kopésa<v><intns><fut><p3><sg><aa>$^./.<sent>$
    • Example sentence 3: "Pésáká" (Now do give...)
    • Old output: ^Pésáká/*Pésáká$^./.<sent>$
    • New output: ^Pésáká/kopésa<v><imp><intns>$^./.<sent>$
  • Support for more than 1 radical extension
    • Example sentence: "Nabángisamákí na Pierre" (I was scared by Pierre, or more literally, I was made, by Pierre, to fear something/someone)
    • Old output: ^Nabángisamákí/*Nabángisamákí$ ^na/na<pr>/na<cngcoo>$ ^Pierre/Pierre<n><prop>$^./.<sent>$
    • New output: ^Nabángisamákí/kobánga<v><caus><pass><urp><p1><sg><aa>$ ^na/na<pr>/na<cngcoo>$ ^Pierre/Pierre<n><prop>$^./.<sent>$
  • Support for numerals
    • Example sentence: "zómi" (ten)
    • Old output: ^zómi/*zómi$^./.<sent>$
    • New output: ^zómi/zómi<num><ord>/zómi<num><card>$^./.<sent>$
  • Support for quantifiers
    • Example sentence: "mbelí nyónso" (each knife)
    • Old output: ^mbelí/mbelí<n><cl9><sg>/mbelí<n><cl9><pl>$ ^nyónso/*nyónso$^./.<sent>$
    • New output: ^mbelí/mbelí<n><cl9><sg>/mbelí<n><cl9><pl>$ ^nyónso/nyónso<det><qnt>$^./.<sent>$

Additional Constraint Grammar Rules

  • If I am an adjective and a quantifier but I do not agree in number with the preceding noun, I must be a quantifier
    • Example sentence: "bilóko moké" (A few things, not small things)
    • Old output (below)
"<bilóko>"
	"elóko" n cl7 pl
"<moké>"
	"moké" det qnt
	"moké" adj sg
"<.>"
	"." sent
    • New output (below)
"<bilóko>"
	"elóko" n cl7 pl
"<moké>"
	"moké" det qnt
;	"moké" adj sg REMOVE:21
"<.>"
	"." sent
  • If I am a noun that can be singular or plural and am followed by a singular adjective, I cannot be plural
    • Example sentence: "nzelá molaí" (a long road)
    • Old output (below)
"<nzelá>"
	"nzelá" n cl9 sg
	"nzelá" n cl9 pl
"<molaí>"
	"molaí" adj sg
"<.>"
	"." sent
    • New output (below)
"<nzelá>"
	"nzelá" n cl9 sg
;	"nzelá" n cl9 pl REMOVE:25
"<molaí>"
	"molaí" adj sg
"<.>"
	"." sent
  • If I am a noun that can be singular or plural and am followed by a plural adjective, I cannot be singular
    • Example sentence: "nzelá milaí" (long roads)
    • Old output (below)
"<nzelá>"
	"nzelá" n cl9 sg
	"nzelá" n cl9 pl
"<milaí>"
	"molaí" adj pl
"<.>"
	"." sent
    • New output (below)
"<nzelá>"
	"nzelá" n cl9 pl
;	"nzelá" n cl9 sg REMOVE:28
"<milaí>"
	"molaí" adj pl
"<.>"
	"." sent
  • If I can be an ordinal and a cardinal and am modifying a plural noun, I am not an ordinal. (One is never both an ordinal and cardinal, so this is not a problem.)
    • Example sentence: "bilóko míbalé" (two things, not the second thing)
    • Old output (below)
"<bilóko>"
	"elóko" n cl7 pl
"<míbalé>"
	"míbalé" num ord
	"míbalé" num card
"<.>"
	"." sent

    • New output (below)
"<bilóko>"
	"elóko" n cl7 pl
"<míbalé>"
	"míbalé" num card
;	"míbalé" num ord REMOVE:31
"<.>"
	"." sent
  • If I can be an ordinal and a cardinal and am modifying a singular noun, I am not a cardinal. (One is never both an ordinal and cardinal, so this is not a problem.)
    • Example sentence: "elóko míbalé" (the second thing, not two things)
    • Old output (below)
"<elóko>"
	"elóko" n cl7 sg
"<míbalé>"
	"míbalé" num ord
	"míbalé" num card
"<.>"
	"." sent
    • New output (below)
"<elóko>"
	"elóko" n cl7 sg
"<míbalé>"
	"míbalé" num ord
;	"míbalé" num card REMOVE:34
"<.>"
	"." sent
  • If I can be a demonstrative nominal and a relative pronoun and am preceded by a noun, I cannot be a demonstrative nominal.
    • Example sentence: "Mobali óyo amonaki ngáí" (The man who saw me...)
    • Old output (below)
"<Mobali>"
	"mobali" n cl1 sg
"<óyo>"
	"óyo" prn rel
	"óyo" det dem
	"óyo" prn dem sg
"<amonaki>"
	"komóna" v urp p3 sg aa
"<ngáí>"
	"ngáí" prn pers p1 sg aa
"<.>"
	"." sent
    • New output (below)
"<Mobali>"
	"mobali" n cl1 sg
"<óyo>"
	"óyo" prn rel
	"óyo" det dem
;	"óyo" prn dem sg REMOVE:37
"<amonaki>"
	"komóna" v urp p3 sg aa
"<ngáí>"
	"ngáí" prn pers p1 sg aa
"<.>"
	"." sent

Additional Transfer Rules

  • Enhance existing basic verb transfer rule to work with newly implemented imperative
    • Example sentence: "ina" (sing)
    • Old output
      • Tagger: ^in<v><imp><p1><sg>$^.<sent>$
      • Biltrans: ^in<v><imp><p1><sg>/koyemba<v><imp><p1><sg>$^.<sent>/.<sent>$
      • Chunker: ^verb<verb>{^koyemba<v><p1><sg><aa>$}$^sent<SENT>{^.<sent>$}$
      • Interchunk: ^verb<verb>{^koyemba<v><p1><sg><aa>$}$^sent<SENT>{^.<sent>$}$
      • Postchunk: ^koyemba<v><p1><sg><aa>$^.<sent>$
      • Complete translation: #koyemba
    • New output
      • Tagger: ^in<v><imp><p1><sg>$^.<sent>$
      • Biltrans: ^in<v><imp><p1><sg>/koyemba<v><imp><p1><sg>$^.<sent>/.<sent>$
      • Chunker: ^verb<verb>{^koyemba<v><imp>$}$^sent<SENT>{^.<sent>$}$
      • Interchunk: ^verb<verb>{^koyemba<v><imp>$}$^sent<SENT>{^.<sent>$}$
      • Postchunk: ^koyemba<v><imp>$^.<sent>$
      • Complete translation: yemba
  • Map Kikuyu associative to Lingala connective by removing extra agreement tags found in Kikuyu
    • Example sentence: "wa"
    • Old output
      • Tagger: ^a<assoc><cl1><sg>$^.<sent>$
      • Biltrans: ^a<assoc><cl1><sg>/ya<conn><cl1><sg>$^.<sent>/.<sent>$
      • Chunker: ^default<default>{^ya<conn><cl1><sg>$}$^sent<SENT>{^.<sent>$}$
      • Interchunk: ^default<default>{^ya<conn><cl1><sg>$}$^sent<SENT>{^.<sent>$}$
      • Postchunk: ^ya<conn><cl1><sg>$^.<sent>$
      • Complete translation: #ya
    • New output
      • Tagger: ^a<assoc><cl1><sg>$^.<sent>$
      • Biltrans: ^a<assoc><cl1><sg>/ya<conn><cl1><sg>$^.<sent>/.<sent>$
      • Chunker: ^conn_chunk<conn>{^ya<conn>$}$^sent<SENT>{^.<sent>$}$
      • Interchunk: ^conn_chunk<conn>{^ya<conn>$}$^sent<SENT>{^.<sent>$}$
      • Postchunk: ^ya<conn>$^.<sent>$
      • Complete translation: ya
  • Basic adjective transfer: keep number markings if the Lingala adjective is variable (inflects for number); discard them otherwise
    • Example sentence 1: anene (big)
    • Old output
      • Tagger: ^nene<adj><cl1><pl>$^.<sent>$
      • Biltrans: ^nene<adj><cl1><pl>/monéne<adj><cl1><pl>$^.<sent>/.<sent>$
      • Chunker: ^default<default>{^monéne<adj><cl1><pl>$}$^sent<SENT>{^.<sent>$}$
      • Interchunk: ^default<default>{^monéne<adj><cl1><pl>$}$^sent<SENT>{^.<sent>$}$
      • Postchunk: ^monéne<adj><cl1><pl>$^.<sent>$
      • Complete translation: #monéne
    • New output
      • Tagger: ^nene<adj><cl1><pl>$^.<sent>$
      • Biltrans: ^nene<adj><cl1><pl>/monéne<adj><cl1><pl>$^.<sent>/.<sent>$
      • Chunker: ^adjective<adj>{^monéne<adj><pl>$}$^sent<SENT>{^.<sent>$}$
      • Interchunk: ^adjective<adj>{^monéne<adj><pl>$}$^sent<SENT>{^.<sent>$}$
      • Postchunk: ^monéne<adj><pl>$^.<sent>$
      • Complete translation: minéne
    • Example sentence 2: mata (sour)
    • Old output
      • Tagger: ^mata<adj><cl7><pl>$^.<sent>$
      • Biltrans: ^mata<adj><cl7><pl>/bololo<adj><cl7><pl>$^.<sent>/.<sent>$
      • Chunker: ^default<default>{^bololo<adj><cl7><pl>$}$^sent<SENT>{^.<sent>$}$
      • Interchunk: ^default<default>{^bololo<adj><cl7><pl>$}$^sent<SENT>{^.<sent>$}$
      • Postchunk: ^bololo<adj><cl7><pl>$^.<sent>$
      • Complete translation: #bololo
    • New output
      • Tagger: ^mata<adj><cl7><pl>$^.<sent>$
      • Biltrans: ^mata<adj><cl7><pl>/bololo<adj><cl7><pl>$^.<sent>/.<sent>$
      • Chunker: ^adjective<adj>{^bololo<adj>$}$^sent<SENT>{^.<sent>$}$
      • Interchunk: ^adjective<adj>{^bololo<adj>$}$^sent<SENT>{^.<sent>$}$
      • Postchunk: ^bololo<adj>$^.<sent>$
      • Complete translation: bololo
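
The adjective rule above amounts to a small tag rewrite on the target-side lexical unit: noun-class tags are always dropped, and number tags survive only for variable adjectives. The sketch below reproduces the two chunker results shown above in plain Python. It is not the actual transfer rule file; the function name transfer_adjective and the one-entry VARIABLE_ADJECTIVES set are assumptions made for the example.

```python
import re

TAG = re.compile(r"<([^>]+)>")

# Hypothetical list of Lingala adjectives that inflect for number.
VARIABLE_ADJECTIVES = {"monéne"}
NUMBER_TAGS = {"sg", "pl"}


def transfer_adjective(unit):
    """Rewrite a target-side adjective unit such as 'monéne<adj><cl1><pl>'.

    Noun-class tags are always dropped; number is kept only for
    adjectives listed in VARIABLE_ADJECTIVES.
    """
    lemma = unit.split("<", 1)[0]
    kept = []
    for tag in TAG.findall(unit):
        if tag.startswith("cl"):
            continue  # Lingala adjectives carry no noun-class agreement here
        if tag in NUMBER_TAGS and lemma not in VARIABLE_ADJECTIVES:
            continue  # invariable adjective: discard number
        kept.append(tag)
    return lemma + "".join(f"<{t}>" for t in kept)


if __name__ == "__main__":
    print(transfer_adjective("monéne<adj><cl1><pl>"))
    print(transfer_adjective("bololo<adj><cl7><pl>"))
```

In the real system the distinction between variable and invariable adjectives would come from the dictionaries rather than from a hard-coded set.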

lin Transducer Evaluation

  • Precision (annotated corpus): 80.53%
  • Recall (annotated corpus): 55.64%
  • Coverage (large corpus): 62.10%
  • Number of words (large corpus): 275,546 (actual), 304,425 (tokenized)
  • Number of stems in transducer: 273

kik Transducer Evaluation

  • Precision (annotated corpus): 79.24%
  • Recall (annotated corpus): 91.52%
  • Coverage (large corpus): 58.39%
  • Number of words (large corpus): 107,564
  • Number of stems in transducer: about 267 (nouns, verbs, adjectives, and adverbs)
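
As a reminder of what the precision and recall figures above measure, here is one common set-based formulation, given as a sketch only. The script actually used for the course may count differently (for example per token rather than per form), and the name precision_recall and the dictionary layout are hypothetical.

```python
def precision_recall(gold, predicted):
    """Compare analyser output with a hand-annotated gold standard.

    gold, predicted: dicts mapping a surface form to a set of analyses.
    Precision is the share of returned analyses that are in the gold
    standard; recall is the share of gold analyses that were returned.
    """
    tp = fp = fn = 0
    for form, gold_analyses in gold.items():
        pred = predicted.get(form, set())
        tp += len(gold_analyses & pred)   # correct analyses returned
        fp += len(pred - gold_analyses)   # extra analyses returned
        fn += len(gold_analyses - pred)   # gold analyses missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = {"kuona": {"on<v><inf>"}, "zómi": {"zómi<num><card>"}}
    predicted = {
        "kuona": {"on<v><inf>"},
        "zómi": {"zómi<num><ord>", "zómi<num><card>"},
    }
    print(precision_recall(gold, predicted))
```

Under this formulation, an analyser that returns many ambiguous readings tends to score high recall but lower precision.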

lin → kik MT Evaluation

Note that the bug in the Kikuyu generator significantly alters these numbers. They should improve significantly if the bug is resolved.

  • WER (longer corpus): 96.49%
  • PER (longer corpus): 92.98%
  • Percentage unknown words: 13.91%
  • Percentage known words: 86.09%
  • Trimmed coverage over longer corpus [lin-kik.automorf.bin applied to lin.longer.txt]: 89.29%
  • Trimmed coverage over large corpus [lin-kik.automorf.bin applied to lin.corpus.large.txt]: 48.90%
  • Number of words in longer corpus: 117 (actual), 140 (tokenized, perhaps including punctuation)
  • Number of words in large corpus: 275,546 (actual), 304,425 (tokenized)

kik → lin MT Evaluation

  • WER (longer corpus): 86.32%
  • PER (longer corpus): 75.21%
  • Percentage unknown words: 39.50% [Actually probably a bit higher since some words were marked with an "@" and not translated while the script only looks for words marked with a "*"]
  • Percentage known words: 60.60%
  • Trimmed coverage over longer corpus [kik-lin.automorf.bin applied to kik.longer.txt]: 71.43%
  • Trimmed coverage over large corpus [kik-lin.automorf.bin applied to kik.corpus.larger.txt]: 46.40%
  • Number of words in longer corpus: 115
  • Number of words in large corpus: 107,564
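
As noted above, counting only words marked with "*" understates the number of untranslated words, because Apertium also prefixes "@" to lemmas missing from the bilingual dictionary and "#" to forms the generator could not produce. The following sketch counts all three markers in translated output. It is a hypothetical helper, not the script used for the figures above, and it splits on whitespace only, so punctuation attached to a word is counted as part of that word.

```python
def marker_counts(translation):
    """Count output tokens by Apertium debug marker.

    '*' = unknown to the analyser, '@' = missing from the bilingual
    dictionary, '#' = generation failure, 'ok' = no marker.
    """
    counts = {"*": 0, "@": 0, "#": 0, "ok": 0}
    for token in translation.split():
        key = token[0] if token[0] in "*@#" else "ok"
        counts[key] += 1
    return counts


def percent_unknown(translation):
    """Percentage of tokens left untranslated ('*' or '@')."""
    counts = marker_counts(translation)
    total = sum(counts.values())
    return 100.0 * (counts["*"] + counts["@"]) / total if total else 0.0


if __name__ == "__main__":
    # '*foo' and '@bar' are made-up tokens for illustration
    print(marker_counts("#ndũ a thĩ *foo @bar"))
```

Reporting the "#" count separately also gives a direct measure of how much the Kikuyu generation bug affects the lin → kik figures.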