Kaingang/Transducer
From LING073
[1] Link to Kaingang Transducer
Analyser Evaluation
First Evaluation
- First evaluation of coverage 2/21/19:
$ aq-covtest ling073-kgp-corpus/kgp.corpus.basic.txt ling073-kgp/kgp.automorf.bin
Number of tokenised words in the corpus: 1303
Coverage: 26.94%
Top unknown words in the corpus:
68 tóg
49 tỹ
44 mũ
42 nĩ
38 mỹ
36 ẽg
33 tĩ
24 vỹ
21 ki
20 han
19 ke
19 to
17 Topẽ
16 jé
16 kar
15 vẽnh
14 h
13 vĩ
12 vẽ
11 ũ
Translation time: 0.005640268325805664 seconds
- Many of these top unknown words are already documented in our wiki and even appear in our morph tests. They are mostly markers, which are added to nouns, verbs, or pronouns to indicate something about them.
Classification of First Top Unknown Words
Kaingang | Approximate Meaning | Classification |
---|---|---|
tóg | subject is agent | subject indicator |
tỹ | agent is ergative, topic marker | subject indicator |
mũ | doing, an action is done repeatedly | might be verb to do? / verb marker <v><sit> |
nĩ | in the situation of doing the action | situational/continuous marker <v><sit> |
mỹ | subject in a yes-or-no question | subject indicator |
ẽg | "we" "us" "our" | Pronoun <prn><p1><pl> |
tĩ | habitually | verb habitual marker <v><hab> |
vỹ | subject is topic | subject indicator |
ki | in, on, at | circumstance marker postposition <post> |
han | to get better | verb <v><bas><iv> |
ke | leftovers, remaining | noun <n> subordinate noun |
to | to, in the direction of | circumstance marker postposition <post> |
Topẽ | God, anything sacred, to pray | <n> |
jé | subject expects the action, speaker desires the action | subject indicator |
kar | all | noun subordinate |
vẽnh | of oneself | reflexive pronoun |
h | ? | ? |
vĩ | word, speech, to speak | noun, verb |
vẽ | is, was, ergative | aspect marker |
ũ | makes indefinite noun | indefinite noun marker |
Second Evaluation
- Added the subject indicators (tóg, tỹ, mỹ, vỹ, jé) to the transducer with the tag <subj>. Second evaluation of coverage, with the subject indicators added, 2/22/19:
$ aq-covtest ling073-kgp-corpus/kgp.corpus.basic.txt ling073-kgp/kgp.automorf.bin
Number of tokenised words in the corpus: 1303
Coverage: 41.98%
Top unknown words in the corpus:
44 mũ
42 nĩ
36 ẽg
33 tĩ
21 ki
20 han
19 ke
19 to
17 Topẽ
16 kar
15 vẽnh
14 h
13 vĩ
12 vẽ
11 ũ
11 nén
9 ẽn
9 Ẽg
8 Jesus
8 José
Translation time: 0.0059816837310791016 seconds
- Coverage gained by each word when added on its own (coverage with only that word added, minus the 26.94% baseline):
- tóg -> 32.16% - 26.94% = 5.22%
- tỹ -> 30.78% - 26.94% = 3.84%
- mỹ -> 29.85% - 26.94% = 2.91%
- vỹ -> 28.78% - 26.94% = 1.84%
- jé -> 28.17% - 26.94% = 1.23%
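The subtraction above can be cross-checked with a short script. This is a minimal sketch, not part of the project's tooling: the variable names are made up for illustration, and the coverage figures are simply copied from the list above. It recomputes each word's gain and checks that the five gains together account for the jump from 26.94% to 41.98%.

```python
# Baseline coverage from the first evaluation (percent).
BASELINE = 26.94

# Coverage measured with only one subject indicator added (percent),
# copied from the list above.
coverage_with_word = {
    "tóg": 32.16,
    "tỹ": 30.78,
    "mỹ": 29.85,
    "vỹ": 28.78,
    "jé": 28.17,
}

# Gain per word = coverage with that word added minus the baseline.
gains = {word: round(cov - BASELINE, 2) for word, cov in coverage_with_word.items()}

# If the gains are independent, baseline + sum of gains should match
# the coverage of the second evaluation.
combined = round(BASELINE + sum(gains.values()), 2)

for word, gain in gains.items():
    print(f"{word}: +{gain:.2f}%")
print(f"combined coverage: {combined:.2f}%")
```

The gains add up cleanly here because each indicator is a distinct surface form, so no token is counted twice.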
Notes
- Total number of stems in the transducer:
$ lexccounter apertium-kgp.kgp.lexc
Unique entries: 105
- Current coverage over combined corpus: 41.98%
- Current list of unknown words returned by aq-covtest: mũ (44), nĩ (42), ẽg (36), tĩ (33), ki (21), han (20), ke (19), to (19), Topẽ (17), kar (16), vẽnh (15), h (14), vĩ (13), vẽ (12), ũ (11), nén (11), ẽn (9), Ẽg (9), Jesus (8), José (8)
- Tests:
- No tests in kgp.yaml fail.
- 5 of the 20 tests in commonwords.yaml pass.
Experiment
Tried testing coverage on a text file containing the entire unparsed Bible:
$ aq-covtest ling073-kgp-corpus/bible_kaingang.txt ling073-kgp/kgp.automorf.bin
Number of tokenised words in the corpus: 388887
Coverage: 42.44%
Top unknown words in the corpus:
16033 mũ
10740 ke
9523 nĩ
6238 ẽg
5604 Topẽ
5401 ki
5244 han
5013 tĩ
4944 to
4428 ẽn
4389 ũ
3985 vĩ
3569 h
3440 Jesus
3006 nỹtĩ
2907 vẽnh
2769 kar
2722 mré
2508 t
2407 sóg
Translation time: 0.7014105319976807 seconds
- Coverage is very similar to that of our combined corpus file (42.44% vs. 41.98%)!
Words Added Post-Experiment
- Most of the top unknown words are shared between the files.
Top unknown only in kgp.corpus.basic.txt | Top unknown only in bible_kaingang.txt |
---|---|
vẽ | nỹtĩ |
nén | mré |
José | t |
 | sóg |
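The comparison of the two top-unknown lists can be reproduced with set operations. The sketch below is illustrative only: the word lists are copied from the two aq-covtest runs (second evaluation of the basic corpus and the Bible experiment), and the `normalise` helper is an assumption that treats capitalised forms such as Ẽg as the same word as ẽg, so its output is lowercased (Topẽ appears as topẽ).

```python
import unicodedata


def normalise(word):
    """Hypothetical helper: NFC-normalise and lowercase a word so that
    case variants (e.g. "Ẽg" and "ẽg") are treated as one form."""
    return unicodedata.normalize("NFC", word).lower()


# Top unknown words, copied from the aq-covtest output for each file.
basic_top = "mũ nĩ ẽg tĩ ki han ke to Topẽ kar vẽnh h vĩ vẽ ũ nén ẽn Ẽg Jesus José".split()
bible_top = "mũ ke nĩ ẽg Topẽ ki han tĩ to ẽn ũ vĩ h Jesus nỹtĩ vẽnh kar mré t sóg".split()

basic = {normalise(w) for w in basic_top}
bible = {normalise(w) for w in bible_top}

both = basic & bible        # unknown in both files
basic_only = basic - bible  # unknown only in the basic corpus
bible_only = bible - basic  # unknown only in the Bible text

print("both:", sorted(both))
print("basic only:", sorted(basic_only))
print("bible only:", sorted(bible_only))
```

Without the case folding, Ẽg would show up as a fourth word unique to the basic corpus.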
- Top unknown words found in both kgp.corpus.basic.txt and bible_kaingang.txt: mũ, nĩ, ke, ẽg, Topẽ, to, han, tĩ, ki, kar, vẽnh, h, vĩ, ẽn, ũ, and Jesus.
- We added the words found in both to the .lexc file with the tag <unk> for the purposes of testing the coverage (except for ẽn and Jesus, which weren't in the original commonwords.yaml).
- The resulting coverage is:
$ aq-covtest ling073-kgp-corpus/bible_kaingang.txt ling073-kgp/kgp.automorf.bin
Number of tokenised words in the corpus: 388887
Coverage: 65.48%
Top unknown words in the corpus:
4428 ẽn
3440 Jesus
3006 nỹtĩ
2722 mré
2508 t
2407 sóg
2385 nén
2196 vẽ
2192 jykre
2126 ra
2077 tó
1921 ri
1746 ũn
1666 i
1639 tag
1605 a
1416 nĩgtĩ
1307 o
1187 g
1128 r
Translation time: 0.7653579711914062 seconds
- Number of stems:
$ lexccounter apertium-kgp.kgp.lexc
Unique entries: 119
- So 119 words cover 65.48% of the Bible in Kaingang.
- Adding ẽn, Jesus, nỹtĩ, mré, t, sóg, nén, and vẽ will get us to 71.42%. That would be 127 words.
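The 71.42% projection follows from the token counts in the last aq-covtest run: each added word contributes its count divided by the 388887 tokens in the file. The sketch below shows the arithmetic; the function name is hypothetical, and it assumes every listed occurrence would be analysed once the word is in the .lexc file (case variants not in the list are ignored).

```python
TOTAL_TOKENS = 388887     # tokens in bible_kaingang.txt, per aq-covtest
CURRENT_COVERAGE = 65.48  # percent, after adding the <unk> words

# Counts of the eight candidate words, copied from the top-unknown list.
candidates = {
    "ẽn": 4428,
    "Jesus": 3440,
    "nỹtĩ": 3006,
    "mré": 2722,
    "t": 2508,
    "sóg": 2407,
    "nén": 2385,
    "vẽ": 2196,
}


def projected_coverage(current, counts, total):
    """Hypothetical helper: current coverage (percent) plus the share of
    tokens that the newly added words would account for."""
    return round(current + 100 * sum(counts) / total, 2)


projection = projected_coverage(CURRENT_COVERAGE, candidates.values(), TOTAL_TOKENS)
print(f"{sum(candidates.values())} tokens -> projected coverage {projection:.2f}%")
```

The eight words account for 23092 tokens, a little under 6% of the file, which is where the gain over 65.48% comes from.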
Generator Evaluation
Initial evaluation of morphological generation
- Number of passing morphological analysis tests: 214
- Number of failing morphological analysis tests: 2
- Current coverage: 67.84%
- Number of passing morphological generation tests: 107
- Number of failing morphological generation tests: 2
Evaluation as of 3/2/2019
- Current coverage: 76.51%
- Number of passing morphological generation tests: 112
- Number of failing morphological generation tests: 0