Khasi/Final Project
From LING073
Pre-Final Project
- Number of tokenised words in the corpus: 57847
- Coverage: 57.26%
- Top unknown words in the corpus (count, word):
  - 206 kam
  - 200 kum
  - 179 Bah
  - 172 kiwei
  - 171 bynta
  - 170 baroh
  - 166 lah
  - 159 pat
  - 147 mynta
  - 130 noh
  - 125 paidbah
  - 124 ne
  - 122 ïoh
  - 119 por
  - 118 wan
  - 118 Shillong
  - 117 namar
  - 117 Khasi
  - 112 katei
  - 111 Jylla
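The page does not say how the coverage figure and the unknown-word list were produced. As a minimal sketch, assuming coverage means the share of corpus tokens the analyser recognises, both numbers can be computed as below. The function name `coverage_report`, the `known` set, and the toy token list are hypothetical stand-ins for the real analyser and corpus.

```python
from collections import Counter


def coverage_report(tokens, known, top_n=20):
    """Return (coverage percentage, top unknown tokens with their counts).

    `known` is a hypothetical stand-in for "the analyser recognises this
    token"; a real pipeline would query the analyser instead of a set.
    """
    # Count every token that the stand-in analyser does not recognise.
    unknown = Counter(t for t in tokens if t not in known)
    n_unknown = sum(unknown.values())
    # Coverage is the share of tokens that were recognised.
    coverage = 100.0 * (len(tokens) - n_unknown) / len(tokens) if tokens else 0.0
    return coverage, unknown.most_common(top_n)


# Toy data for illustration only, not taken from the real corpus.
tokens = "ka kam ka kum ka kam u Bah".split()
known = {"ka", "u"}

cov, top = coverage_report(tokens, known, top_n=3)
print(f"Coverage: {cov:.2f}%")
for word, count in top:
    print(count, word)
```

Sorting unknown tokens by frequency, as in the list above, shows which additions to the lexicon would raise coverage fastest.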
Step 1: Added all of the above words (and accompanying disambiguations), except for Jylla.
- Coverage: 62.61%
- Top unknown words in the corpus (count, word):
  - 111 Jylla
  - 109 tang
  - 108 ym
  - 106 shuh
  - 102 haduh
  - 100 skul
  - 98 sorkar
  - 97 Seng
  - 96 briew
  - 90 M
  - 88 jylla
  - 88 kala
  - 87 ri
  - 87 lang
  - 85 E
  - 83 kine
  - 82 seng
  - 77 Meghalaya
  - 77 tarik
  - 74 naduh
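As a rough check on Step 1, the counts of the 19 words that were added can be summed and compared with the reported rise in coverage from 57.26% to 62.61%. The sketch below assumes the token total stayed at 57847 and that each listed count is an exact-form token count. The variable names are illustrative.

```python
TOTAL_TOKENS = 57847  # token count reported for the corpus

# Counts of the 19 words added in Step 1 (every word in the first list except Jylla).
added_counts = [206, 200, 179, 172, 171, 170, 166, 159, 147, 130,
                125, 124, 122, 119, 118, 118, 117, 117, 112]

newly_covered = sum(added_counts)
expected_gain = 100.0 * newly_covered / TOTAL_TOKENS  # percentage points
reported_gain = 62.61 - 57.26                         # percentage points

print(newly_covered)            # 2772
print(f"{expected_gain:.2f}")   # 4.79
print(f"{reported_gain:.2f}")   # 5.35
```

The listed counts account for about 4.79 of the 5.35 percentage points gained. The page does not explain the remainder. One possible explanation, offered only as an assumption, is that the new entries also match forms counted separately in the list, such as capitalisation variants.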