Difference between revisions of "User:Nfeldba1/Final project"
From LING073
Revision as of 21:00, 5 May 2017
Contents
- 1 Pre-Final Project
- 2 Step 1: added all the above words (and accompanying disambiguations)
- 3 Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala
- 4 Step 3: added all the above words (and accompanying disambiguations)
- 5 Step 4: added all the above words (and accompanying disambiguations)
- 6 Step 5: added all the above words (and accompanying disambiguations)
- 7 Step 6: added all the above words (and accompanying disambiguations)
Pre-Final Project
- Number of tokenised words in the corpus: 57847
- Coverage: 57.26%
- Top unknown words in the corpus:
- 206 kam
- 200 kum
- 179 Bah
- 172 kiwei
- 171 bynta
- 170 baroh
- 166 lah
- 159 pat
- 147 mynta
- 130 noh
- 125 paidbah
- 124 ne
- 122 ïoh
- 119 por
- 118 wan
- 118 Shillong
- 117 namar
- 117 Khasi
- 112 katei
- 111 Jylla
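A rough way to gauge how much coverage these twenty words could add is to sum their counts and divide by the number of tokens in the corpus. The sketch below does this with the figures listed above. The function name `estimated_gain` is illustrative only, and the estimate assumes each listed token becomes known exactly once; it ignores any other word forms the new entries might also analyse.

```python
# Counts of the 20 top unknown words listed above (Pre-Final Project).
TOP_UNKNOWN_COUNTS = [206, 200, 179, 172, 171, 170, 166, 159, 147, 130,
                      125, 124, 122, 119, 118, 118, 117, 117, 112, 111]
TOTAL_TOKENS = 57847  # number of tokenised words reported above


def estimated_gain(counts, total_tokens):
    """Percentage points of coverage gained if every listed token becomes known."""
    return 100.0 * sum(counts) / total_tokens


gain = estimated_gain(TOP_UNKNOWN_COUNTS, TOTAL_TOKENS)
print(f"{sum(TOP_UNKNOWN_COUNTS)} tokens, about {gain:.2f} percentage points")
# prints: 2883 tokens, about 4.98 percentage points
```

This is only a lower-bound style estimate: it counts the listed surface forms and nothing else.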
Step 1: added all the above words (and accompanying disambiguations)
- Coverage: 62.96%
- Top unknown words in the corpus:
- 109 tang
- 108 ym
- 106 shuh
- 102 haduh
- 100 skul
- 98 sorkar
- 97 Seng
- 96 briew
- 90 M
- 88 kala
- 87 ri
- 87 lang
- 85 E
- 83 kine
- 82 seng
- 77 Meghalaya
- 77 tarik
- 74 naduh
- 74 haba
- 74 shu
I'm not sure why M and E are appearing in the list of unknown words.
Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala
- Coverage: 66.05%
- Top unknown words in the corpus (excluding M, E, and kala):
- 72 bun
- 72 hi
- 71 BJP
- 71 District
- 70 April
- 69 samla
- 68 ryngkat
- 67 liang
- 63 bor
- 63 shim
- 63 MLA
- 62 lyngba
- 62 sdang
- 62 India
- 60 shah
- 60 tylli
- 59 kim
Step 3: added all the above words (and accompanying disambiguations)
- Coverage: 68.10%
- Top unknown words in the corpus:
- 59 kynthup
- 59 bapher
- 58 President
- 58 ioh
- 58 khynnah
- 58 ula
- 58 ei
- 57 School
- 56 lada
- 56 Secretary
- 54 khnang
- 54 wat
- 54 kumba
- 53 Hills
- 52 Dr
- 51 biang
- 50 ju
Step 4: added all the above words (and accompanying disambiguations)
- Coverage: 70.14%
- Top unknown words in the corpus:
- 49 hadien
- 49 hap
- 49 Ïaiong
- 48 Myntri
- 48 thaiñ
- 47 bnai
- 46 ïaid
- 46 D
- 45 jingïalang
- 45 nongïalam
- 45 jur
- 45 kiei
- 44 ngut
- 43 lad
- 41 kali
- 41 nonghikai
- 41 wanrah
- 40 tynrai
Step 5: added all the above words (and accompanying disambiguations)
- Coverage: 72.18%
- Top unknown words in the corpus:
- 40 kyntu
- 40 kyrteng
- 40 ïathuh
- 40 kino
- 40 lympung
- 39 eh
- 38 khamtam
- 38 kyntiew
- 38 rangbah
- 37 shisha
- 37 Trai
- 36 Rangbah
- 36 pisa
- 36 pule
- 36 kren
- 36 bym
- 35 kumno
- 35 Bhoi
- 35 ïohi
- 34 hok
Step 6: added all the above words (and accompanying disambiguations)
- Coverage: 73.59%
- Top unknown words in the corpus:
- 34 daw
- 34 surok
- 34 Sangma
- 34 donkam
- 34 pyrshah
- 34 o
- 34 rukom
- 33 Congress
- 33 jing
- 32 dep
- 32 phah
- 32 beit
- 32 jingmut
- 32 peit
- 31 elekshon
- 31 imlang
- 31 un
- 31 sahlang
- 31 Jaiñtia
- 31 pyndonkam
- Command used to measure coverage: aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
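To see how the returns change from step to step, the coverage figures reported on this page can be tabulated and differenced. The sketch below is only a convenience for reading the numbers; the names `COVERAGE` and `step_gains` are made up for illustration, and the values are copied from the sections above.

```python
# Coverage (in percent) reported after each stage on this page.
COVERAGE = [
    ("Pre-Final Project", 57.26),
    ("Step 1", 62.96),
    ("Step 2", 66.05),
    ("Step 3", 68.10),
    ("Step 4", 70.14),
    ("Step 5", 72.18),
    ("Step 6", 73.59),
]


def step_gains(coverage):
    """Return (label, gain in percentage points) for each stage after the first."""
    return [
        (label, round(current - previous, 2))
        for (_, previous), (label, current) in zip(coverage, coverage[1:])
    ]


for label, gain in step_gains(COVERAGE):
    print(f"{label}: +{gain:.2f}")

total = round(COVERAGE[-1][1] - COVERAGE[0][1], 2)
print(f"Total: +{total:.2f}")
```

Each step adds roughly twenty words, so the shrinking gains reflect the falling frequency of the words that remain unknown.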