User:Nfeldba1/Final project

From LING073

Pre-Final Project

  • Number of tokenised words in the corpus: 57847
  • Coverage: 57.26%
  • Top unknown words in the corpus:
  • 206 kam
  • 200 kum
  • 179 Bah
  • 172 kiwei
  • 171 bynta
  • 170 baroh
  • 166 lah
  • 159 pat
  • 147 mynta
  • 130 noh
  • 125 paidbah
  • 124 ne
  • 122 ïoh
  • 119 por
  • 118 wan
  • 118 Shillong
  • 117 namar
  • 117 Khasi
  • 112 katei
  • 111 Jylla
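The figures above (token count, coverage, and ranked unknown words) can be illustrated with a minimal sketch. It assumes coverage is simply the share of tokens that receive at least one analysis, and it stands in for the analyser with a plain set of known word forms; the function name `coverage_report` and the toy data are hypothetical, and this is not how `aq-covtest` is implemented internally.

```python
from collections import Counter

def coverage_report(tokens, known, top_n=20):
    """Return (coverage percentage, most frequent unknown tokens with counts).

    `known` is a stand-in for the analyser: a token counts as covered
    if it is in this set (an assumption made for illustration only).
    """
    unknown = Counter(t for t in tokens if t not in known)
    total = len(tokens)
    covered = total - sum(unknown.values())
    pct = 100.0 * covered / total if total else 0.0
    return pct, unknown.most_common(top_n)

# Toy data, not drawn from the actual corpus.
tokens = ["ka", "kam", "ka", "kum", "kam", "u"]
known = {"ka", "u"}

pct, top = coverage_report(tokens, known)
print(f"Coverage: {pct:.2f}%")
for word, count in top:
    print(count, word)
```

Adding the top unknown words to the analyser (here, to `known`) and re-running the report corresponds to one step in the procedure below.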

Step 1: added all the above words (and accompanying disambiguations)

  • Coverage: 62.96%
  • Top unknown words in the corpus:
  • 109 tang
  • 108 ym
  • 106 shuh
  • 102 haduh
  • 100 skul
  • 98 sorkar
  • 97 Seng
  • 96 briew
  • 90 M
  • 88 kala
  • 87 ri
  • 87 lang
  • 85 E
  • 83 kine
  • 82 seng
  • 77 Meghalaya
  • 77 tarik
  • 74 naduh
  • 74 haba
  • 74 shu

I'm not sure why M and E are appearing.

Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala

  • Coverage: 66.05%
  • Top unknown words in the corpus (excluding M, E, and kala):
  • 72 bun
  • 72 hi
  • 71 BJP
  • 71 District
  • 70 April
  • 69 samla
  • 68 ryngkat
  • 67 liang
  • 63 bor
  • 63 shim
  • 63 MLA
  • 62 lyngba
  • 62 sdang
  • 62 India
  • 60 shah
  • 60 tylli
  • 59 kim

Step 3: added all the above words (and accompanying disambiguations)

  • Coverage: 68.10%
  • Top unknown words in the corpus:
  • 59 kynthup
  • 59 bapher
  • 58 President
  • 58 ioh
  • 58 khynnah
  • 58 ula
  • 58 ei
  • 57 School
  • 56 lada
  • 56 Secretary
  • 54 khnang
  • 54 wat
  • 54 kumba
  • 53 Hills
  • 52 Dr
  • 51 biang
  • 50 ju

Step 4: added all the above words (and accompanying disambiguations)

  • Coverage: 70.14%
  • Top unknown words in the corpus:
  • 49 hadien
  • 49 hap
  • 49 Ïaiong
  • 48 Myntri
  • 48 thaiñ
  • 47 bnai
  • 46 ïaid
  • 46 D
  • 45 jingïalang
  • 45 nongïalam
  • 45 jur
  • 45 kiei
  • 44 ngut
  • 43 lad
  • 41 kali
  • 41 nonghikai
  • 41 wanrah
  • 40 tynrai

Step 5: added all the above words (and accompanying disambiguations)

  • Coverage: 72.18%
  • Top unknown words in the corpus:
  • 40 kyntu
  • 40 kyrteng
  • 40 ïathuh
  • 40 kino
  • 40 lympung
  • 39 eh
  • 38 khamtam
  • 38 kyntiew
  • 38 rangbah
  • 37 shisha
  • 37 Trai
  • 36 Rangbah
  • 36 pisa
  • 36 pule
  • 36 kren
  • 36 bym
  • 35 kumno
  • 35 Bhoi
  • 35 ïohi
  • 34 hok

Command used to measure coverage:

  • aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
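To see how much each round of additions contributed, the coverage figures reported on this page can be tabulated. This is only a small arithmetic sketch; the labels `baseline` and `round N` are my own naming for the pre-project measurement and the five successive rounds of additions.

```python
# Coverage percentages as reported above: the baseline, then after each round.
coverage = [57.26, 62.96, 66.05, 68.10, 70.14, 72.18]

# Gain in percentage points contributed by each round of additions.
gains = [round(after - before, 2) for before, after in zip(coverage, coverage[1:])]

print(f"baseline: {coverage[0]:.2f}%")
for i, (value, gain) in enumerate(zip(coverage[1:], gains), start=1):
    print(f"round {i}: {value:.2f}% (+{gain:.2f})")

total_gain = round(coverage[-1] - coverage[0], 2)
print(f"total gain: +{total_gain:.2f} percentage points")
```

The gains shrink from round to round, which is what one would expect when words are added in descending order of frequency.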