Line 1: |
Line 1: |
− | ==Pre-Final Project== | + | ==The Project== |
− | *Number of tokenised words in the corpus: 57847
| + | For my final project, I expanded my transducer until it reached 85% coverage, which involved adding words, disambiguations, and extra suffixes. |
− | *Coverage: 57.26%
| |
− | *Top unknown words in the corpus:
| |
− | *206 kam
| |
− | *200 kum
| |
− | *179 Bah
| |
− | *172 kiwei
| |
− | *171 bynta
| |
− | *170 baroh
| |
− | *166 lah
| |
− | *159 pat
| |
− | *147 mynta
| |
− | *130 noh
| |
− | *125 paidbah
| |
− | *124 ne
| |
− | *122 ïoh
| |
− | *119 por
| |
− | *118 wan
| |
− | *118 Shillong
| |
− | *117 namar
| |
− | *117 Khasi
| |
− | *112 katei
| |
− | *111 Jylla
| |
− | ==Step 1: added all the above words (and accompanying disambiguations)==
| |
− | *Coverage: 62.96%
| |
− | *Top unknown words in the corpus:
| |
− | *109 tang
| |
− | *108 ym
| |
− | *106 shuh
| |
− | *102 haduh
| |
− | *100 skul
| |
− | *98 sorkar
| |
− | *97 Seng
| |
− | *96 briew
| |
− | *90 M
| |
− | *88 kala
| |
− | *87 ri
| |
− | *87 lang
| |
− | *85 E
| |
− | *83 kine
| |
− | *82 seng
| |
− | *77 Meghalaya
| |
− | *77 tarik
| |
− | *74 naduh
| |
− | *74 haba
| |
− | *74 shu
| |
− | I'm not sure why M and E are appearing.
| |
| | | |
− | ==Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala ==
| + | My code can be found here: https://github.com/nfeldbaum/Khasi_Transducer |
− | *Coverage: 66.05%
| |
− | *Top unknown words in the corpus (excluding M, E, and kala):
| |
− | *72 bun
| |
− | *72 hi
| |
− | *71 BJP
| |
− | *71 District
| |
− | *70 April
| |
− | *69 samla
| |
− | *68 ryngkat
| |
− | *67 liang
| |
− | *63 bor
| |
− | *63 shim
| |
− | *63 MLA
| |
− | *62 lyngba
| |
− | *62 sdang
| |
− | *62 India
| |
− | *60 shah
| |
− | *60 tylli
| |
− | *59 kim
| |
− | ==Step 3: added all the above words (and accompanying disambiguations) ==
| |
− | *Coverage: 68.10%
| |
− | *Top unknown words in the corpus:
| |
− | *59 kynthup
| |
− | *59 bapher
| |
− | *58 President
| |
− | *58 ioh
| |
− | *58 khynnah
| |
− | *58 ula
| |
− | *58 ei
| |
− | *57 School
| |
− | *56 lada
| |
− | *56 Secretary
| |
− | *54 khnang
| |
− | *54 wat
| |
− | *54 kumba
| |
− | *53 Hills
| |
− | *52 Dr
| |
− | *51 biang
| |
− | *50 ju
| |
− | ==Step 4: added all the above words (and accompanying disambiguations) ==
| |
− | *Coverage: 70.14%
| |
− | *Top unknown words in the corpus:
| |
− | *49 hadien
| |
− | *49 hap
| |
− | *49 Ïaiong
| |
− | *48 Myntri
| |
− | *48 thaiñ
| |
− | *47 bnai
| |
− | *46 ïaid
| |
− | *46 D
| |
− | *45 jingïalang
| |
− | *45 nongïalam
| |
− | *45 jur
| |
− | *45 kiei
| |
− | *44 ngut
| |
− | *43 lad
| |
− | *41 kali
| |
− | *41 nonghikai
| |
− | *41 wanrah
| |
− | *40 tynrai
| |
− | ==Step 5: added all the above words (and accompanying disambiguations) ==
| |
− | *Coverage: 72.18%
| |
− | *Top unknown words in the corpus:
| |
− | *40 kyntu
| |
− | *40 kyrteng
| |
− | *40 ïathuh
| |
− | *40 kino
| |
− | *40 lympung
| |
− | *39 eh
| |
− | *38 khamtam
| |
− | *38 kyntiew
| |
− | *38 rangbah
| |
− | *37 shisha
| |
− | *37 Trai
| |
− | *36 Rangbah
| |
− | *36 pisa
| |
− | *36 pule
| |
− | *36 kren
| |
− | *36 bym
| |
− | *35 kumno
| |
− | *35 Bhoi
| |
− | *35 ïohi
| |
− | *34 hok
| |
− | ==Step 6: added all the above words (and accompanying disambiguations) ==
| |
− | *Coverage: 73.59%
| |
− | *Top unknown words in the corpus:
| |
− | *34 daw
| |
− | *34 surok
| |
− | *34 Sangma
| |
− | *34 donkam
| |
− | *34 pyrshah
| |
− | *34 o
| |
− | *34 rukom
| |
− | *33 Congress
| |
− | *33 jing
| |
− | *32 dep
| |
− | *32 phah
| |
− | *32 beit
| |
− | *32 jingmut
| |
− | *32 peit
| |
− | *31 elekshon
| |
− | *31 imlang
| |
− | *31 un
| |
− | *31 sahlang
| |
− | *31 Jaiñtia
| |
− | *31 pyndonkam
| |
− | ==Step 7: added all the above words (and accompanying disambiguations) ==
| |
− | *Coverage: 74.78%
| |
− | *Top unknown words in the corpus:
| |
− | *31 T
| |
− | *31 sahlang
| |
− | *31 C
| |
− | *31 e
| |
− | *30 daw
| |
− | *30 ngin
| |
− | *30 party
| |
− | *30 poi
| |
− | *30 lem
| |
− | *29 bniah
| |
− | *29 katba
| |
− | *29 kaban
| |
− | *29 Kong
| |
− | *29 Chie
| |
− | *29 kynhun
| |
− | *29 s
| |
− | *29 uwei
| |
− | *28 dkhot
| |
− | *28 rai
| |
− | *28 Party
| |
| | | |
− | ==Step 8: added all the above words (and accompanying disambiguations) == | + | ==The Results== |
− | *Coverage: 75.88%
| |
− | *Top unknown words in the corpus:
| |
− | *30 of
| |
− | *28 Unit
| |
− | *28 nyngkong
| |
− | *28 lynti
| |
− | *28 khubor
| |
− | *28 ktah
| |
− | *28 tnad
| |
− | *28 Mukul
| |
− | *28 kynthei
| |
− | *28 lai
| |
− | *27 kat
| |
− | *27 aiñ
| |
− | *27 ïakhun
| |
− | *27 satia
| |
− | *27 jingjia
| |
− | *27 SSA
| |
− | *26 Hima
| |
− | *26 kot
| |
− | *26 shi
| |
− | *26 K
| |
− | *At this point, I also decided to add all the letters in the alphabet as individual words.
| |
− | ==Step 9: added all the above words (and accompanying disambiguations) ==
| |
− | *Coverage: 77.14%
| |
− | *Top unknown words in the corpus:
| |
− | *56 EVM
| |
− | *26 jongka - unknown
| |
− | *26 lap
| |
− | *26 lei
| |
− | *25 ïaka - unknown
| |
− | *25 kynja
| |
− | *25 ophis
| |
− | *25 Committee
| |
− | *25 UDP
| |
− | *25 treikam
| |
− | *25 lait
| |
− | *24 Suk
| |
− | *24 bishar
| |
− | *24 im
| |
− | *24 ïadei
| |
− | *24 jingïarap
| |
− | *24 tit
| |
− | *24 riew
| |
− | *23 Nalor
| |
− | *23 shna
| |
− | ==Step 10: added all the above words (and accompanying disambiguations) ==
| |
− | *Coverage: 78.15%
| |
− | *Top unknown words in the corpus:
| |
− | *23 Election
| |
− | *23 sniew
| |
− | *23 General
| |
− | *23 pynban
| |
− | *23 kmen
| |
− | *22 thain
| |
− | *22 kyrpang
| |
− | *22 dawa
| |
− | *22 ïai
| |
− | *22 Pynursla
| |
− | *22 kdew
| |
− | *21 Iaiong
| |
− | *21 Association
| |
− | *21 Secondary
| |
− | *21 Sports
| |
− | *21 riti
| |
− | *21 phai
| |
− | *21 Jowai
| |
− | ==Step 11: added all the above words (and accompanying disambiguations) ==
| |
− | *Coverage: 78.91%
| |
− | *Top unknown words in the corpus:
| |
− | *21 MP
| |
− | *21 shitom
| |
− | *21 burom
| |
− | *21 palat
| |
− | *21 Block
| |
− | *20 tyngka
| |
− | *20 sakhi
| |
− | *20 Raid
| |
− | *20 shuwa
| |
− | *20 khmih
| |
− | *20 juh
| |
− | *20 Court
| |
− | *20 jait
| |
− | *20 thong
| |
− | *20 pulit
| |
− | *20 Lyngdoh
| |
− | ==Step 12: added all the above words (and accompanying disambiguations) ==
| |
− | Coverage: 79.62%
| |
− | Top unknown words in the corpus:
| |
− | 20 dor
| |
− | 20 ma
| |
− | 20 thong
| |
− | 20 man
| |
− | 20 heh
| |
− | 20 kumjuh
| |
− | 19 kajuh
| |
− | 19 nam
| |
− | 19 North
| |
− | 19 barim
| |
− | 19 jingmyntoi
| |
− | 19 Constituency
| |
− | 19 sngewthuh
| |
− | 19 hajar
| |
− | 19 kut
| |
− | 19 thoh
| |
− | 19 pud
| |
− | 19 hangne
| |
− | ==Step 13: added all the above words (and accompanying disambiguations) ==
| |
− | Coverage: 80.31%
| |
− | Top unknown words in the corpus:
| |
− | 19 Officer
| |
− | 19 madan
| |
− | 19 jop
| |
− | 18 kyrpad
| |
− | 18 kyrwoh
| |
− | 18 lawei
| |
− | 18 Niam
| |
− | 18 jia
| |
− | 18 plie
| |
− | 18 haka - unsure
| |
− | 18 kit
| |
− | 18 jingeh
| |
− | 18 nongrim
| |
− | 17 vote
| |
− | 17 Kynrad
| |
− | 17 pyrshang
| |
− | 17 paitbah
| |
− | ==Step 14: added all the above words (and accompanying disambiguations) ==
| |
− | Coverage: 80.94%
| |
− | Top unknown words in the corpus:
| |
− | 17 hynne
| |
− | 17 sem
| |
− | 17 jaidbynriew
| |
− | 17 beh
| |
− | 17 jingïalehkai
| |
− | 17 Lajong
| |
− | 17 jingrakhe
| |
− | 17 ïalehkai
| |
− | 17 masi
| |
− | 17 Dorbar
| |
− | 17 pdiang
| |
− | 17 katkum
| |
− | 16 Meet
| |
− | 16 kloi
| |
− | 16 dustur
| |
− | 16 kem
| |
− | 16 kaei
| |
| | | |
| + | I managed to reach 85.15% coverage on my initial 50,000-word corpus! However, reaching this figure required adding a significant number of English words that appear in the corpus. A truer measure of coverage is the 80.50% I get after deleting all the words that aren't Khasi. |
| | | |
− | ==Step 15: added all the above words (and accompanying disambiguations) ==
| + | Instead of testing precision and recall against hand-annotated, randomly selected forms, I decided to gather a further 25,000 words so that I could test my transducer on a corpus it hadn't been developed against. This new corpus can be found in the repository linked above under ling073-kha-corpus/kha.corpus.large.test.txt. On this test corpus, I achieved 83.39% coverage with English words included and 80.50% coverage without them, exactly the same Khasi-only figure as on my original corpus. |
− | Coverage: 81.56%
| |
− | Top unknown words in the corpus:
| |
− | 16 shaphang
| |
− | 16 arngut - unsure
| |
− | 16 jingpang
| |
− | 16 nongkyndong
| |
− | 16 Nongpoh
| |
− | 16 roi
| |
− | 16 Day
| |
− | 16 pynshai
| |
− | 16 lak
| |
− | 16 drok
| |
− | 16 National
| |
− | 16 kyrtong
| |
− | 16 ophisar
| |
− | 15 sien
| |
− | 15 nym
| |
− | 15 shakhmat
| |
− | 15 rung
| |
− | ==Step 16: added all the above words (and accompanying disambiguations) ==
| |
− | *Coverage: 82.06%
| |
− | *Top unknown words in the corpus:
| |
− | *15 Circle
| |
− | *15 Good
| |
− | *15 hiar
| |
− | *15 synshar
| |
− | *15 Hato
| |
− | *15 klur
| |
− | *15 Khanna
| |
− | *15 kynthoh
| |
− | *15 katne
| |
− | *15 kper
| |
− | *15 prokram
| |
− | *15 jingkyrshan
| |
− | *15 Nongstoiñ
| |
− | *15 Upper
| |
− | *14 lut
| |
− | *14 kynnoh
| |
− | ==Step 17: added all the above words (and accompanying disambiguations) ==
| |
− | *Coverage: 82.60%
| |
− | *Top unknown words in the corpus:
| |
− | *14 be
| |
− | *14 sngewbha
| |
− | *14 Vinod
| |
− | *14 pynbna
| |
− | *14 rep
| |
− | *14 bynrap
| |
− | *14 hukum
| |
− | *14 par
| |
− | *14 khang
| |
− | *14 ïashim
| |
− | *14 shynrang
| |
− | *14 College
| |
− | *14 Commission
| |
− | *14 nongmihkhmat
| |
− | *14 kyrdan
| |
− | *14 bampong
| |
− | ==Step 18: added all the above words (and accompanying disambiguations) ==
| |
− | Coverage: 83.03%
| |
− | Top unknown words in the corpus:
| |
− | 14 kular
| |
− | 14 nong
| |
− | 14 khraw
| |
− | 14 sngewbha
| |
− | 14 Bar
| |
− | 14 kumne
| |
− | 13 pilkrim
| |
− | 13 KSU
| |
− | 13 Khristan
| |
− | 13 pdeng
| |
− | 13 wad
| |
− | 13 pan
| |
− | 13 duh
| |
− | 13 sengbhalang
| |
− | 13 longkmie
| |
− | 13 ap
| |
| | | |
− | *aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
| |
| [[Category:sp17_FinalProjects]] | | [[Category:sp17_FinalProjects]] |
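As a rough illustration of how coverage figures like the ones above can be computed, here is a minimal Python sketch. It assumes the analyser output is in the Apertium stream format (`^surface/analysis$`, with unknown forms marked by a leading `*`). The function name `coverage`, the `exclude` parameter, the tags, and the sample line are all made up for illustration, and this is not a description of how `aq-covtest` itself works.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (escaped characters inside units are ignored in this sketch).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text, exclude=frozenset()):
    """Return (coverage percentage, Counter of unknown surface forms).

    Tokens whose surface form is in `exclude` are left out of the count
    entirely, which mimics deleting those words from the corpus.
    """
    total = 0
    known = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        if surface in exclude:
            continue
        total += 1
        if analyses.startswith("*"):
            # The analyser found no analysis for this form.
            unknown[surface] += 1
        else:
            known += 1
    percent = 100.0 * known / total if total else 0.0
    return percent, unknown


# Invented sample output; the tags are hypothetical.
sample = "^ka/ka<det>$ ^skul/*skul$ ^School/*School$ ^bah/bah<n>$"

pct_all, unknown_all = coverage(sample)
pct_filtered, _ = coverage(sample, exclude={"School"})

print(f"all tokens: {pct_all:.2f}%")
print(f"excluding listed words: {pct_filtered:.2f}%")
print(unknown_all.most_common(20))
```

Passing a list of non-Khasi words as `exclude` corresponds to the "without English words" figures, while `most_common(20)` gives a top-unknown-words list of the kind used at each step of the expansion.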