Difference between revisions of "User:Nfeldba1/Final project"

Revision as of 21:30, 7 May 2017

Pre-Final Project

  • Number of tokenised words in the corpus: 57847
  • Coverage: 57.26%
  • Top unknown words in the corpus:
  • 206 kam
  • 200 kum
  • 179 Bah
  • 172 kiwei
  • 171 bynta
  • 170 baroh
  • 166 lah
  • 159 pat
  • 147 mynta
  • 130 noh
  • 125 paidbah
  • 124 ne
  • 122 ïoh
  • 119 por
  • 118 wan
  • 118 Shillong
  • 117 namar
  • 117 Khasi
  • 112 katei
  • 111 Jylla
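
For readers unfamiliar with these figures, the following is a minimal sketch of how a coverage percentage and a ranked list of unknown words could be computed. It is not how the actual coverage tool works internally: the `coverage_report` function, the whitespace tokenisation, and the tiny `lexicon` set standing in for the compiled analyser are all assumptions made for illustration.

```python
from collections import Counter


def coverage_report(tokens, is_known, top_n=20):
    """Return (coverage in percent, most frequent unknown tokens).

    `is_known` is a hypothetical stand-in for a morphological analyser:
    it takes a token and returns True if the token receives an analysis.
    """
    total = len(tokens)
    unknown = Counter(t for t in tokens if not is_known(t))
    known = total - sum(unknown.values())
    coverage = 100.0 * known / total if total else 0.0
    return coverage, unknown.most_common(top_n)


# Toy data only: a made-up lexicon and a made-up seven-token "corpus".
lexicon = {"ka", "u", "ki"}
tokens = "ka kam u kam ki kum ka".split()

cov, top_unknown = coverage_report(tokens, lambda w: w in lexicon)
print(f"Coverage: {cov:.2f}%")
for word, count in top_unknown:
    print(count, word)
```

Adding the most frequent unknown words to the lexicon first, as in the steps below, gives the largest coverage gain per entry, because each added word accounts for many tokens.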

Step 1: added all the above words (and accompanying disambiguations)

  • Coverage: 62.96%
  • Top unknown words in the corpus:
  • 109 tang
  • 108 ym
  • 106 shuh
  • 102 haduh
  • 100 skul
  • 98 sorkar
  • 97 Seng
  • 96 briew
  • 90 M
  • 88 kala
  • 87 ri
  • 87 lang
  • 85 E
  • 83 kine
  • 82 seng
  • 77 Meghalaya
  • 77 tarik
  • 74 naduh
  • 74 haba
  • 74 shu

I'm not sure why the single letters M and E are appearing among the unknown words.

Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala

  • Coverage: 66.05%
  • Top unknown words in the corpus (excluding M, E, and kala):
  • 72 bun
  • 72 hi
  • 71 BJP
  • 71 District
  • 70 April
  • 69 samla
  • 68 ryngkat
  • 67 liang
  • 63 bor
  • 63 shim
  • 63 MLA
  • 62 lyngba
  • 62 sdang
  • 62 India
  • 60 shah
  • 60 tylli
  • 59 kim

Step 3: added all the above words (and accompanying disambiguations)

  • Coverage: 68.10%
  • Top unknown words in the corpus:
  • 59 kynthup
  • 59 bapher
  • 58 President
  • 58 ioh
  • 58 khynnah
  • 58 ula
  • 58 ei
  • 57 School
  • 56 lada
  • 56 Secretary
  • 54 khnang
  • 54 wat
  • 54 kumba
  • 53 Hills
  • 52 Dr
  • 51 biang
  • 50 ju

Step 4: added all the above words (and accompanying disambiguations)

  • Coverage: 70.14%
  • Top unknown words in the corpus:
  • 49 hadien
  • 49 hap
  • 49 Ïaiong
  • 48 Myntri
  • 48 thaiñ
  • 47 bnai
  • 46 ïaid
  • 46 D
  • 45 jingïalang
  • 45 nongïalam
  • 45 jur
  • 45 kiei
  • 44 ngut
  • 43 lad
  • 41 kali
  • 41 nonghikai
  • 41 wanrah
  • 40 tynrai

Step 5: added all the above words (and accompanying disambiguations)

  • Coverage: 72.18%
  • Top unknown words in the corpus:
  • 40 kyntu
  • 40 kyrteng
  • 40 ïathuh
  • 40 kino
  • 40 lympung
  • 39 eh
  • 38 khamtam
  • 38 kyntiew
  • 38 rangbah
  • 37 shisha
  • 37 Trai
  • 36 Rangbah
  • 36 pisa
  • 36 pule
  • 36 kren
  • 36 bym
  • 35 kumno
  • 35 Bhoi
  • 35 ïohi
  • 34 hok

Step 6: added all the above words (and accompanying disambiguations)

  • Coverage: 73.59%
  • Top unknown words in the corpus:
  • 34 daw
  • 34 surok
  • 34 Sangma
  • 34 donkam
  • 34 pyrshah
  • 34 o
  • 34 rukom
  • 33 Congress
  • 33 jing
  • 32 dep
  • 32 phah
  • 32 beit
  • 32 jingmut
  • 32 peit
  • 31 elekshon
  • 31 imlang
  • 31 un
  • 31 sahlang
  • 31 Jaiñtia
  • 31 pyndonkam

Step 7: added all the above words (and accompanying disambiguations)

  • Coverage: 74.78%
  • Top unknown words in the corpus:
  • 31 T
  • 31 sahlang
  • 31 C
  • 31 e
  • 30 daw
  • 30 ngin
  • 30 party
  • 30 poi
  • 30 lem
  • 29 bniah
  • 29 katba
  • 29 kaban
  • 29 Kong
  • 29 Chie
  • 29 kynhun
  • 29 s
  • 29 uwei
  • 28 dkhot
  • 28 rai
  • 28 Party

Coverage was measured with:

  aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
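
The coverage figures reported at each step can be summarised with a short calculation. The sketch below uses only the percentages and the token count listed on this page. The variable names are illustrative, and the token estimates assume the corpus size (57847 tokens) stayed the same across all steps, which the page does not state explicitly.

```python
# Coverage (%) as reported: Pre-Final Project, then Steps 1 to 7.
coverage_by_step = [57.26, 62.96, 66.05, 68.10, 70.14, 72.18, 73.59, 74.78]
total_tokens = 57847  # "Number of tokenised words in the corpus"

# Gain in percentage points contributed by each step.
gains = [round(after - before, 2)
         for before, after in zip(coverage_by_step, coverage_by_step[1:])]
total_gain = round(coverage_by_step[-1] - coverage_by_step[0], 2)

# Approximate number of analysed tokens before and after
# (assumes the token count did not change between runs).
known_before = round(total_tokens * coverage_by_step[0] / 100)
known_after = round(total_tokens * coverage_by_step[-1] / 100)

for step, gain in enumerate(gains, start=1):
    print(f"Step {step}: +{gain:.2f} points")
print(f"Total: +{total_gain:.2f} points "
      f"(about {known_after - known_before} more tokens analysed)")
```

The per-step gains shrink from 5.70 points in Step 1 to 1.19 points in Step 7, which is what one would expect as the remaining unknown words become less frequent.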