Difference between revisions of "Khasi/Final Project"
From LING073
Revision as of 18:28, 2 May 2017
Pre-Final Project
- Number of tokenised words in the corpus: 57847
- Coverage: 57.26%
- Top unknown words in the corpus:
- 206 kam
- 200 kum
- 179 Bah
- 172 kiwei
- 171 bynta
- 170 baroh
- 166 lah
- 159 pat
- 147 mynta
- 130 noh
- 125 paidbah
- 124 ne
- 122 ïoh
- 119 por
- 118 wan
- 118 Shillong
- 117 namar
- 117 Khasi
- 112 katei
- 111 Jylla
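The figures above allow a rough estimate of how much coverage could be gained by adding the 20 listed words. The following is a minimal sketch, assuming that every listed occurrence becomes a known token once its word is added and that nothing else changes. The function names are illustrative and not part of any tool used here. It is an estimate only; the coverage actually measured after adding the words is reported under Step 1.

```python
# Figures copied from the Pre-Final Project list above.
TOTAL_TOKENS = 57847
BASELINE_COVERAGE = 57.26  # percent

# Frequencies of the 20 top unknown words, in the order listed above.
TOP_UNKNOWN = [206, 200, 179, 172, 171, 170, 166, 159, 147, 130,
               125, 124, 122, 119, 118, 118, 117, 117, 112, 111]


def tokens_from_percent(percent, total):
    """Convert a percentage of the corpus into an approximate token count."""
    return round(total * percent / 100)


def percent_from_tokens(count, total):
    """Convert a token count into a percentage of the corpus."""
    return 100 * count / total


# Approximate number of tokens the analyser does not recognise yet.
unknown_tokens = tokens_from_percent(100 - BASELINE_COVERAGE, TOTAL_TOKENS)

# Assumption: each listed occurrence becomes a known token after the word is added.
listed_tokens = sum(TOP_UNKNOWN)
gain_points = percent_from_tokens(listed_tokens, TOTAL_TOKENS)

print(f"approx. unknown tokens: {unknown_tokens}")
print(f"tokens in the top-20 list: {listed_tokens}")
print(f"estimated gain: {gain_points:.2f} percentage points")
```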
Step 1: added all the above words (and accompanying disambiguations)
- Coverage: 62.96%
- Top unknown words in the corpus:
- 109 tang
- 108 ym
- 106 shuh
- 102 haduh
- 100 skul
- 98 sorkar
- 97 Seng
- 96 briew
- 90 M
- 88 kala
- 87 ri
- 87 lang
- 85 E
- 83 kine
- 82 seng
- 77 Meghalaya
- 77 tarik
- 74 naduh
- 74 haba
- 74 shu
I'm not sure why M and E are appearing, and I can't define kala.
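One way to look into where M and E come from would be to print the corpus lines in which they occur as standalone tokens. The sketch below is only an illustration: `lines_with_token` is a hypothetical helper, the sample lines are invented, and the "standalone" test is a simple regular expression that may not match how the analyser actually tokenises the text.

```python
import re


def lines_with_token(lines, token):
    """Yield (line number, line) pairs where `token` occurs as a whole word."""
    # Match the token only when it is not glued to other word characters.
    pattern = re.compile(rf"(?<!\w){re.escape(token)}(?!\w)")
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            yield number, line.rstrip("\n")


# Invented sample lines; in practice an open corpus file could be passed instead.
sample = [
    "Invented line with M standing alone.",
    "Meghalaya does not match.",
    "An E here too.",
]

hits_m = list(lines_with_token(sample, "M"))
hits_e = list(lines_with_token(sample, "E"))

for number, line in hits_m + hits_e:
    print(number, line)
```

Reading a handful of matching lines in context may show whether the two letters are initials, abbreviations, or something else.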
Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala
- Coverage: 66.05%
- Top unknown words in the corpus (excluding M, E, and kala):
- 72 bun
- 72 hi
- 71 BJP
- 71 District
- 70 April
- 69 samla
- 68 ryngkat
- 67 liang
- 63 bor
- 63 shim
- 63 MLA
- 62 lyngba
- 62 sdang
- 62 India
- 60 shah
- 60 tylli
- 59 kim
- Command used to measure coverage: aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
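The two numbers reported at each step (a coverage percentage and a ranked list of unknown words) can be illustrated with a small sketch. This is not how aq-covtest works internally: here a plain set of known words stands in for the compiled analyser (kha.automorf.bin), and `coverage_report` is a hypothetical helper written for this example. The toy token list reuses a few words from this page and is not real corpus data.

```python
from collections import Counter


def coverage_report(tokens, known_words, top_n=20):
    """Return (coverage percentage, most frequent unknown tokens)."""
    # Count every token that the stand-in "analyser" does not recognise.
    unknown = Counter(t for t in tokens if t not in known_words)
    total = len(tokens)
    known_count = total - sum(unknown.values())
    coverage = 100 * known_count / total if total else 0.0
    return coverage, unknown.most_common(top_n)


# Toy data for illustration only.
tokens = "kam kum kam bun hi kam bun".split()
known = {"kam", "kum"}

coverage, top_unknown = coverage_report(tokens, known)

print(f"Coverage: {coverage:.2f}%")
for word, count in top_unknown:
    print(count, word)
```

Adding a word to the known set and running the report again mirrors the step-by-step procedure recorded above.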