Difference between revisions of "User:Nfeldba1/Final project"
From LING073
Revision as of 20:33, 5 May 2017
Contents
- 1 Pre-Final Project
- 2 Step 1: added all the above words (and accompanying disambiguations)
- 3 Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala
- 4 Step 3: added all the above words (and accompanying disambiguations)
- 5 Step 4: added all the above words (and accompanying disambiguations)
- 6 Step 5: added all the above words (and accompanying disambiguations)
Pre-Final Project
- Number of tokenised words in the corpus: 57847
- Coverage: 57.26%
- Top unknown words in the corpus:
- 206 kam
- 200 kum
- 179 Bah
- 172 kiwei
- 171 bynta
- 170 baroh
- 166 lah
- 159 pat
- 147 mynta
- 130 noh
- 125 paidbah
- 124 ne
- 122 ïoh
- 119 por
- 118 wan
- 118 Shillong
- 117 namar
- 117 Khasi
- 112 katei
- 111 Jylla
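For reference, the coverage figure is the share of corpus tokens that receive at least one analysis, and the list above ranks the tokens that receive none. The sketch below illustrates that idea only. It assumes whitespace tokenisation and uses a hypothetical set of known forms in place of the compiled analyser, so it is not how the reported numbers were actually produced.

```python
from collections import Counter

def coverage_report(tokens, known_forms, top_n=20):
    """Return (coverage percentage, most frequent unknown tokens with counts)."""
    # Count every token that the (stand-in) lexicon does not recognise.
    unknown = Counter(t for t in tokens if t not in known_forms)
    total = len(tokens)
    known = total - sum(unknown.values())
    pct = 100.0 * known / total if total else 0.0
    return pct, unknown.most_common(top_n)

# Toy data: the token list and the known_forms set are hypothetical,
# not taken from the real corpus or lexicon.
tokens = "ka kam ka kum kam ka".split()
pct, top = coverage_report(tokens, known_forms={"ka"})

print(f"Coverage: {pct:.2f}%")
print("Top unknown words in the corpus:")
for word, count in top:
    print(count, word)
```

Adding the top unknown words to the lexicon moves their tokens from the unknown count to the known count, which is why coverage rises from one step to the next.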
Step 1: added all the above words (and accompanying disambiguations)
- Coverage: 62.96%
- Top unknown words in the corpus:
- 109 tang
- 108 ym
- 106 shuh
- 102 haduh
- 100 skul
- 98 sorkar
- 97 Seng
- 96 briew
- 90 M
- 88 kala
- 87 ri
- 87 lang
- 85 E
- 83 kine
- 82 seng
- 77 Meghalaya
- 77 tarik
- 74 naduh
- 74 haba
- 74 shu
I'm not sure why M and E are appearing among the unknown words.
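One way to look into this would be to print the words surrounding each occurrence of the single-letter tokens. The following is a minimal sketch, assuming whitespace tokenisation. The `contexts` helper is hypothetical and the sample string is a placeholder, not corpus text.

```python
def contexts(text, targets, window=3):
    """Yield (token, surrounding words) for each occurrence of a target token."""
    words = text.split()
    for i, w in enumerate(words):
        if w in targets:
            lo, hi = max(0, i - window), i + window + 1
            yield w, " ".join(words[lo:hi])

# Placeholder input; with the real corpus, read the text from the corpus file.
sample = "a b M c d E f"
hits = list(contexts(sample, {"M", "E"}, window=2))
for token, ctx in hits:
    print(token, "|", ctx)
```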
Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala
- Coverage: 66.05%
- Top unknown words in the corpus (excluding M, E, and kala):
- 72 bun
- 72 hi
- 71 BJP
- 71 District
- 70 April
- 69 samla
- 68 ryngkat
- 67 liang
- 63 bor
- 63 shim
- 63 MLA
- 62 lyngba
- 62 sdang
- 62 India
- 60 shah
- 60 tylli
- 59 kim
Step 3: added all the above words (and accompanying disambiguations)
- Coverage: 68.10%
- Top unknown words in the corpus:
- 59 kynthup
- 59 bapher
- 58 President
- 58 ioh
- 58 khynnah
- 58 ula
- 58 ei
- 57 School
- 56 lada
- 56 Secretary
- 54 khnang
- 54 wat
- 54 kumba
- 53 Hills
- 52 Dr
- 51 biang
- 50 ju
Step 4: added all the above words (and accompanying disambiguations)
- Coverage: 70.14%
- Top unknown words in the corpus:
- 49 hadien
- 49 hap
- 49 Ïaiong
- 48 Myntri
- 48 thaiñ
- 47 bnai
- 46 ïaid
- 46 D
- 45 jingïalang
- 45 nongïalam
- 45 jur
- 45 kiei
- 44 ngut
- 43 lad
- 41 kali
- 41 nonghikai
- 41 wanrah
- 40 tynrai
Step 5: added all the above words (and accompanying disambiguations)
- Coverage: 72.18%
- Top unknown words in the corpus:
- 40 kyntu
- 40 kyrteng
- 40 ïathuh
- 40 kino
- 40 lympung
- 39 eh
- 38 khamtam
- 38 kyntiew
- 38 rangbah
- 37 shisha
- 37 Trai
- 36 Rangbah
- 36 pisa
- 36 pule
- 36 kren
- 36 bym
- 35 kumno
- 35 Bhoi
- 35 ïohi
- 34 hok
- Command used: aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
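To summarise the progression, the per-step gains can be computed from the coverage percentages reported in each section. This is a small sketch that only restates those figures. The `gains` helper is a name introduced here for illustration.

```python
# Coverage percentages as reported in each section of this page.
coverage = [
    ("Pre-Final Project", 57.26),
    ("Step 1", 62.96),
    ("Step 2", 66.05),
    ("Step 3", 68.10),
    ("Step 4", 70.14),
    ("Step 5", 72.18),
]

def gains(series):
    """Return (label, gain in percentage points) for each consecutive pair."""
    return [
        (label, round(cur - prev, 2))
        for (_, prev), (label, cur) in zip(series, series[1:])
    ]

step_gains = gains(coverage)
for label, gain in step_gains:
    print(f"{label}: +{gain:.2f} points")

total_gain = round(coverage[-1][1] - coverage[0][1], 2)
print(f"Total: +{total_gain:.2f} points")
```

The gains shrink after Step 1 because each later batch of unknown words is less frequent than the one before it.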