Kaingang/Transducer

From LING073

[1] Link to Kaingang Transducer

Analyser Evaluation

First Evaluation

  • First evaluation of coverage 2/21/19:
$ aq-covtest ling073-kgp-corpus/kgp.corpus.basic.txt ling073-kgp/kgp.automorf.bin
Number of tokenised words in the corpus: 1303
Coverage: 26.94%
Top unknown words in the corpus:
68	 tóg
49	 tỹ 
44	 mũ
42	 nĩ
38	 mỹ
36	 ẽg
33	 tĩ
24	 vỹ
21	 ki
20	 han
19	 ke
19	 to
17	 Topẽ
16	 jé
16	 kar
15	 vẽnh
14	 h
13	 vĩ
12	 vẽ
11	 ũ
Translation time: 0.005640268325805664 seconds
  • Many of these top unknown words are already documented in our wiki and even appear in our morph tests. Most of them are markers, which are added to nouns, verbs, or pronouns to indicate something about them.
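To compare runs without copying numbers by hand, the text output shown above can be parsed with a few regular expressions. This is only a sketch: the function name `parse_covtest` is hypothetical, and the line patterns are assumptions based solely on the output pasted on this page, not on any documented aq-covtest format.

```python
import re

def parse_covtest(output: str):
    """Pull token count, coverage (%), and (word, count) pairs out of the pasted output.

    Assumes the layout seen on this page: a 'Number of tokenised words' line,
    a 'Coverage: NN.NN%' line, and 'count<whitespace>word' lines for unknown words.
    """
    tokens = int(re.search(r"Number of tokenised words in the corpus:\s*(\d+)", output).group(1))
    coverage = float(re.search(r"Coverage:\s*([\d.]+)%", output).group(1))
    unknown = []
    for line in output.splitlines():
        m = re.match(r"\s*(\d+)\s+(\S+)\s*$", line)
        if m:
            unknown.append((m.group(2), int(m.group(1))))
    return tokens, coverage, unknown

# First three unknown-word lines of the first evaluation, as an excerpt.
sample = """Number of tokenised words in the corpus: 1303
Coverage: 26.94%
Top unknown words in the corpus:
68\t tóg
49\t tỹ
44\t mũ
Translation time: 0.005640268325805664 seconds
"""

tokens, coverage, unknown = parse_covtest(sample)
print(tokens, coverage, unknown)
```

The parsed `(word, count)` pairs can then be divided by `tokens` to estimate how much of the corpus each unknown word accounts for.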

Classification of First Top Unknown Words

Kaingang | Approximate Meaning | Classification
tóg | subject is agent | subject indicator
tỹ | agent is ergative, topic marker | subject indicator
mũ | doing, an action is done repeatedly | might be verb to do? / verb marker <v><sit>
nĩ | in the situation of doing the action | situational/continuous marker <v><sit>
mỹ | subject in a yes-or-no question | subject indicator
ẽg | "we" "us" "our" | Pronoun <prn><p1><pl>
tĩ | habitually | verb habitual marker <v><hab>
vỹ | subject is topic | subject indicator
ki | in, on, at | circumstance marker postposition <post>
han | to get better | verb <v><bas><iv>
ke | leftovers, remaining | noun <n> subordinate noun
to | to, in the direction of | circumstance marker postposition <post>
Topẽ | God, anything sacred, to pray | <n>
jé | subject expects the action, speaker desires the action | subject indicator
 | habitually | verb habitual marker <v><hab>
kar | all | noun subordinate
vẽnh | of oneself | reflexive pronoun
h | ? | ?
vĩ | word, speech, to speak | noun, verb
vẽ | is, was | ergative aspect marker
ũ | makes indefinite noun | indefinite noun marker

Second Evaluation

  • Added the subject indicators (tóg, tỹ, mỹ, vỹ, jé) with the tag <subj>. Second evaluation of coverage, with these subject indicators added, 2/22/19:
$ aq-covtest ling073-kgp-corpus/kgp.corpus.basic.txt ling073-kgp/kgp.automorf.bin
Number of tokenised words in the corpus: 1303
Coverage: 41.98%
Top unknown words in the corpus:
44	 mũ
42	 nĩ
36	 ẽg
33	 tĩ
21	 ki
20	 han
19	 ke
19	 to
17	 Topẽ
16	 kar
15	 vẽnh
14	 h
13	 vĩ
12	 vẽ
11	 ũ
11	 nén
9	 ẽn
9	 Ẽg
8	 Jesus
8	 José
Translation time: 0.0059816837310791016 seconds
  • Coverage gained by each word individually (coverage with that word added minus the 26.94% baseline):
    • tóg -> 32.16% - 26.94% = 5.22%
    • tỹ -> 30.78% - 26.94% = 3.84%
    • mỹ -> 29.85% - 26.94% = 2.91%
    • vỹ -> 28.78% - 26.94% = 1.84%
    • jé -> 28.17% - 26.94% = 1.23%
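The subtractions above can be rechecked with a short script. This is a sketch under the assumption that each figure is the coverage measured with only that one word added to the first-evaluation transducer; the variable names are illustrative.

```python
BASELINE = 26.94  # coverage (%) from the first evaluation

# Coverage (%) reported above for each subject indicator
# (assumed to be measured with only that word added).
coverage_with_word = {
    "tóg": 32.16,
    "tỹ": 30.78,
    "mỹ": 29.85,
    "vỹ": 28.78,
    "jé": 28.17,
}

# Gain in percentage points contributed by each word.
gains = {word: round(cov - BASELINE, 2) for word, cov in coverage_with_word.items()}
total_gain = round(sum(gains.values()), 2)

for word, gain in gains.items():
    print(f"{word}: +{gain:.2f} points")
print(f"total: +{total_gain:.2f} points -> {BASELINE + total_gain:.2f}% coverage")
```

If the individual gains are independent, their sum added to the baseline should match the coverage reported for the second evaluation, which is a quick consistency check on the list.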

Notes

  • Total number of stems in the transducer:
$ lexccounter apertium-kgp.kgp.lexc
Unique entries: 105
  • Current coverage over combined corpus: 41.98%
  • Current list of unknown words returned by aq-covtest: mũ (44), nĩ (42), ẽg (36), tĩ (33), ki (21), han (20), ke (19), to (19), Topẽ (17), kar (16), vẽnh (15), h (14), vĩ (13), vẽ (12), ũ (11), nén (11), ẽn (9), Ẽg (9), Jesus (8), José (8)
  • Tests:
    • No tests in kgp.yaml fail.
    • 5 out of 20 of the tests in commonwords.yaml pass.

Experiment

Tried testing coverage on a text file containing the entire unparsed Bible:

$ aq-covtest ling073-kgp-corpus/bible_kaingang.txt ling073-kgp/kgp.automorf.bin
Number of tokenised words in the corpus: 388887
Coverage: 42.44%
Top unknown words in the corpus:
16033	 mũ
10740	 ke
9523	 nĩ
6238	 ẽg
5604	 Topẽ
5401	 ki
5244	 han
5013	 tĩ
4944	 to
4428	 ẽn
4389	 ũ
3985	 vĩ
3569	 h
3440	 Jesus
3006	 nỹtĩ
2907	 vẽnh
2769	 kar
2722	 mré
2508	 t
2407	 sóg
Translation time: 0.7014105319976807 seconds
  • Coverage is very similar to that of our combined corpus file (42.44% here vs. 41.98%)!

Words Added Post-Experiment

  • Most of the top unknown words are shared between the files.
Top unknown in...
kgp.corpus.basic.txt only | bible_kaingang.txt only
vẽ | nỹtĩ
nén | mré
José | t
 | sóg
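Splitting the two top-unknown lists into shared and file-specific words can be sketched with plain list operations. The two lists below are copied from the aq-covtest outputs on this page (second evaluation and Bible experiment); the NFC normalisation step is an added precaution, not something the original workflow is known to have used.

```python
import unicodedata

def nfc(words):
    """Normalise to NFC so composed and decomposed accents compare equal."""
    return [unicodedata.normalize("NFC", w) for w in words]

# Top 20 unknown words from the second evaluation of the basic corpus.
basic_top = nfc(["mũ", "nĩ", "ẽg", "tĩ", "ki", "han", "ke", "to", "Topẽ", "kar",
                 "vẽnh", "h", "vĩ", "vẽ", "ũ", "nén", "ẽn", "Ẽg", "Jesus", "José"])
# Top 20 unknown words from the Bible experiment.
bible_top = nfc(["mũ", "ke", "nĩ", "ẽg", "Topẽ", "ki", "han", "tĩ", "to", "ẽn",
                 "ũ", "vĩ", "h", "Jesus", "nỹtĩ", "vẽnh", "kar", "mré", "t", "sóg"])

basic_set, bible_set = set(basic_top), set(bible_top)

# Keep the original frequency order within each group.
shared = [w for w in basic_top if w in bible_set]
basic_only = [w for w in basic_top if w not in bible_set]
bible_only = [w for w in bible_top if w not in basic_set]

print("shared:", shared)
print("basic only:", basic_only)
print("bible only:", bible_only)
```

The comparison is case-sensitive, so capitalised Ẽg is reported as specific to the basic corpus even though lowercase ẽg is shared.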
  • Top unknown words found in both kgp.corpus.basic.txt and bible_kaingang.txt: mũ, nĩ, ke, ẽg, Topẽ, to, han, tĩ, ki, kar, vẽnh, h, vĩ, ẽn, ũ, and Jesus.
  • We added the words found in both to the .lexc file with the tag <unk> for the purposes of testing the coverage (except for ẽn and Jesus, which weren't in the original commonwords.yaml).
  • The resulting coverage is:
$ aq-covtest ling073-kgp-corpus/bible_kaingang.txt ling073-kgp/kgp.automorf.bin
Number of tokenised words in the corpus: 388887
Coverage: 65.48%
Top unknown words in the corpus:
4428	 ẽn
3440	 Jesus
3006	 nỹtĩ
2722	 mré
2508	 t
2407	 sóg
2385	 nén
2196	 vẽ
2192	 jykre
2126	 ra
2077	 tó
1921	 ri
1746	 ũn
1666	 i
1639	 tag
1605	 a
1416	 nĩgtĩ
1307	 o
1187	 g
1128	 r
Translation time: 0.7653579711914062 seconds
  • Number of stems:
$ lexccounter apertium-kgp.kgp.lexc
Unique entries: 119
  • So 119 words cover 65.48% of the bible in Kaingang.
  • Adding ẽn, Jesus, nỹtĩ, mré, t, sóg, nén, and vẽ will get us to 71.42%. That would be 127 words.
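The projected figure can be reproduced from the unknown-word counts in the output above. This is a rough sketch that assumes every occurrence of an added word becomes analysed and that no other token is affected; the constant names are illustrative.

```python
TOTAL_TOKENS = 388887     # tokenised words in bible_kaingang.txt
CURRENT_COVERAGE = 65.48  # coverage (%) with 119 stems

# Occurrence counts of the eight most frequent remaining unknown words.
candidates = {
    "ẽn": 4428,
    "Jesus": 3440,
    "nỹtĩ": 3006,
    "mré": 2722,
    "t": 2508,
    "sóg": 2407,
    "nén": 2385,
    "vẽ": 2196,
}

newly_covered = sum(candidates.values())
gain = 100 * newly_covered / TOTAL_TOKENS  # percentage points gained
projected = CURRENT_COVERAGE + gain

print(f"{newly_covered} tokens, +{gain:.2f} points, projected coverage {projected:.2f}%")
```

The same calculation can be rerun with any other set of candidate words to estimate which additions pay off most before editing the .lexc file.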

Generator Evaluation

Initial evaluation of morphological generation

  • Number of passing morphological analysis tests: 214
  • Number of failing morphological analysis tests: 2
  • Current coverage: 67.84%
  • Number of passing morphological generation tests: 107
  • Number of failing morphological generation tests: 2

Evaluation as of 3/2/2019

  • Current coverage: 76.51%
  • Number of passing morphological generation tests: 112
  • Number of failing morphological generation tests: 0