Difference between revisions of "User:Nfeldba1/Final project"
From LING073
Revision as of 00:51, 5 May 2017
Contents
- 1 Pre-Final Project
- 2 Step 1: added all the above words (and accompanying disambiguations)
- 3 Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala
- 4 Step 3: added all the above words (and accompanying disambiguations)
- 5 Step 4: added all the above words (and accompanying disambiguations)
Pre-Final Project
- Number of tokenised words in the corpus: 57847
- Coverage: 57.26%
- Top unknown words in the corpus:
- 206 kam
- 200 kum
- 179 Bah
- 172 kiwei
- 171 bynta
- 170 baroh
- 166 lah
- 159 pat
- 147 mynta
- 130 noh
- 125 paidbah
- 124 ne
- 122 ïoh
- 119 por
- 118 wan
- 118 Shillong
- 117 namar
- 117 Khasi
- 112 katei
- 111 Jylla
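As a rough check on how much the words listed above can contribute, the sketch below sums their frequencies and expresses the total as percentage points of coverage. The names `TOP_UNKNOWN`, `TOTAL_TOKENS`, and `coverage_points` are illustrative and not part of any Apertium tool. It assumes that every occurrence of a listed word becomes analysable once the word is added.

```python
# Frequencies of the 20 top unknown words, copied from the list above.
TOP_UNKNOWN = {
    "kam": 206, "kum": 200, "Bah": 179, "kiwei": 172, "bynta": 171,
    "baroh": 170, "lah": 166, "pat": 159, "mynta": 147, "noh": 130,
    "paidbah": 125, "ne": 124, "ïoh": 122, "por": 119, "wan": 118,
    "Shillong": 118, "namar": 117, "Khasi": 117, "katei": 112, "Jylla": 111,
}

# Number of tokenised words in the corpus, as reported above.
TOTAL_TOKENS = 57847


def coverage_points(counts, total_tokens):
    """Percentage points of coverage accounted for by the given token counts."""
    return 100.0 * sum(counts.values()) / total_tokens


if __name__ == "__main__":
    print(sum(TOP_UNKNOWN.values()))                             # 2883
    print(round(coverage_points(TOP_UNKNOWN, TOTAL_TOKENS), 2))  # 4.98
```

Under that assumption the 20 words account for about 4.98 points, while the coverage reported for Step 1 (62.96%) is 5.70 points above the starting 57.26%. The sketch does not explain the difference.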
Step 1: added all the above words (and accompanying disambiguations)
- Coverage: 62.96%
- Top unknown words in the corpus:
- 109 tang
- 108 ym
- 106 shuh
- 102 haduh
- 100 skul
- 98 sorkar
- 97 Seng
- 96 briew
- 90 M
- 88 kala
- 87 ri
- 87 lang
- 85 E
- 83 kine
- 82 seng
- 77 Meghalaya
- 77 tarik
- 74 naduh
- 74 haba
- 74 shu
I'm not sure why M and E are appearing.
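One way to investigate tokens such as M and E is to look at the words around each occurrence in the corpus. The following is a minimal sketch of such a lookup. The function name `contexts` and the regex tokeniser are assumptions made for illustration, and the tokeniser used by the actual coverage test may split the text differently.

```python
import re


def contexts(text, token, width=3):
    """Return each occurrence of `token` with up to `width` tokens on either side."""
    # Assumption: tokens are runs of word characters. This is only an
    # approximation of how the coverage test tokenises the corpus.
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok == token:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [f"[{tok}]"] + right))
    return hits


if __name__ == "__main__":
    # Made-up sample text, not taken from the corpus.
    sample = "alpha beta M gamma delta. M epsilon"
    for line in contexts(sample, "M", width=2):
        print(line)
    # alpha beta [M] gamma delta
    # gamma delta [M] epsilon
```

To use it on the real data, read the contents of ling073-kha-corpus/kha.corpus.large.txt into a string and pass it as `text`.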
Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala
- Coverage: 66.05%
- Top unknown words in the corpus (excluding M, E, and kala):
- 72 bun
- 72 hi
- 71 BJP
- 71 District
- 70 April
- 69 samla
- 68 ryngkat
- 67 liang
- 63 bor
- 63 shim
- 63 MLA
- 62 lyngba
- 62 sdang
- 62 India
- 60 shah
- 60 tylli
- 59 kim
Step 3: added all the above words (and accompanying disambiguations)
- Coverage: 68.10%
- Top unknown words in the corpus:
- 59 kynthup
- 59 bapher
- 58 President
- 58 ioh
- 58 khynnah
- 58 ula
- 58 ei
- 57 School
- 56 lada
- 56 Secretary
- 54 khnang
- 54 wat
- 54 kumba
- 53 Hills
- 52 Dr
- 51 biang
- 50 ju
Step 4: added all the above words (and accompanying disambiguations)
- Coverage: 70.14%
- Top unknown words in the corpus:
- 49 hadien
- 49 hap
- 49 Ïaiong
- 48 Myntri
- 48 thaiñ
- 47 bnai
- 46 ïaid
- 46 D
- 45 jingïalang
- 45 nongïalam
- 45 jur
- 45 kiei
- 44 ngut
- 43 lad
- 41 kali
- 41 nonghikai
- 41 wanrah
- 40 tynrai
- Command used to run the coverage test: aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
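The coverage figures reported at each step can be compared directly. The sketch below computes the gain in percentage points from one step to the next. The names `COVERAGE` and `gains` are illustrative, and the figures are copied from the sections above.

```python
# Coverage (in percent) reported at each stage of the project.
COVERAGE = [
    ("Pre-Final Project", 57.26),
    ("Step 1", 62.96),
    ("Step 2", 66.05),
    ("Step 3", 68.10),
    ("Step 4", 70.14),
]


def gains(coverage):
    """Gain in percentage points of each stage over the previous one."""
    return [
        (name, round(cur - prev, 2))
        for (_, prev), (name, cur) in zip(coverage, coverage[1:])
    ]


if __name__ == "__main__":
    for name, gain in gains(COVERAGE):
        print(f"{name}: +{gain:.2f}")
    # Step 1: +5.70
    # Step 2: +3.09
    # Step 3: +2.05
    # Step 4: +2.04
```

The gain per step shrinks as the remaining unknown words become less frequent. The total gain over the four steps is 12.88 percentage points.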