User:Nfeldba1/Final project

From LING073

Pre-Final Project

  • Number of tokenised words in the corpus: 57847
  • Coverage: 57.26%
  • Top unknown words in the corpus:
  • 206 kam
  • 200 kum
  • 179 Bah
  • 172 kiwei
  • 171 bynta
  • 170 baroh
  • 166 lah
  • 159 pat
  • 147 mynta
  • 130 noh
  • 125 paidbah
  • 124 ne
  • 122 ïoh
  • 119 por
  • 118 wan
  • 118 Shillong
  • 117 namar
  • 117 Khasi
  • 112 katei
  • 111 Jylla
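The two figures above (coverage and the ranked list of unknown words) can be illustrated with a small sketch. This is not how the analyser works internally: it treats the analyser as a plain set of known word forms, and the function name `coverage_report`, the toy tokens, and the `known` set are all made up for illustration.

```python
from collections import Counter

def coverage_report(tokens, known_words, top_n=20):
    """Return (coverage_percent, top_unknown) for a list of tokens.

    Simplifying assumption: a token counts as "known" if it is in
    known_words. A real morphological analyser is a transducer, not a set.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    # Keep only the tokens the (hypothetical) analyser does not recognise.
    unknown = Counter({w: c for w, c in counts.items() if w not in known_words})
    known_total = total - sum(unknown.values())
    coverage = 100.0 * known_total / total if total else 0.0
    return coverage, unknown.most_common(top_n)

# Toy data, invented for this example (not taken from the corpus).
tokens = "ka kam ka kum kam ka".split()
known = {"ka"}

cov, top = coverage_report(tokens, known)
print(f"Coverage: {cov:.2f}%")
for word, count in top:
    print(count, word)
```

Each step below amounts to moving the top entries of the unknown list into the known set and measuring again.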

Step 1: added all the above words (and accompanying disambiguations)

  • Coverage: 62.96%
  • Top unknown words in the corpus:
  • 109 tang
  • 108 ym
  • 106 shuh
  • 102 haduh
  • 100 skul
  • 98 sorkar
  • 97 Seng
  • 96 briew
  • 90 M
  • 88 kala
  • 87 ri
  • 87 lang
  • 85 E
  • 83 kine
  • 82 seng
  • 77 Meghalaya
  • 77 tarik
  • 74 naduh
  • 74 haba
  • 74 shu

I'm not sure why M and E are appearing.
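One way to investigate would be to print the corpus lines where a letter occurs as a standalone token. The sketch below is only a lookup helper; the function name `contexts` and the sample lines are invented, and whether the real cause is initials, abbreviations, or something else is not established here.

```python
import re

def contexts(lines, token, width=30):
    """Return a snippet around each standalone occurrence of `token`."""
    # "Standalone" here means not directly preceded or followed by a word character.
    pattern = re.compile(r"(?<!\w)" + re.escape(token) + r"(?!\w)")
    hits = []
    for line in lines:
        for m in pattern.finditer(line):
            start = max(0, m.start() - width)
            end = min(len(line), m.end() + width)
            hits.append(line[start:end])
    return hits

# Invented sample lines, for illustration only.
sample = ["Shri A. M. Example", "Meghalaya", "M.E. School"]
for snippet in contexts(sample, "M"):
    print(snippet)

# With the real corpus, something like this could be used instead:
# with open("ling073-kha-corpus/kha.corpus.large.txt", encoding="utf-8") as f:
#     for snippet in contexts(f, "M"):
#         print(snippet.strip())
```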

Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala

  • Coverage: 66.05%
  • Top unknown words in the corpus (excluding M, E, and kala):
  • 72 bun
  • 72 hi
  • 71 BJP
  • 71 District
  • 70 April
  • 69 samla
  • 68 ryngkat
  • 67 liang
  • 63 bor
  • 63 shim
  • 63 MLA
  • 62 lyngba
  • 62 sdang
  • 62 India
  • 60 shah
  • 60 tylli
  • 59 kim

Step 3: added all the above words (and accompanying disambiguations)

  • Coverage: 68.10%
  • Top unknown words in the corpus:
  • 59 kynthup
  • 59 bapher
  • 58 President
  • 58 ioh
  • 58 khynnah
  • 58 ula
  • 58 ei
  • 57 School
  • 56 lada
  • 56 Secretary
  • 54 khnang
  • 54 wat
  • 54 kumba
  • 53 Hills
  • 52 Dr
  • 51 biang
  • 50 ju

Step 4: added all the above words (and accompanying disambiguations)

  • Coverage: 70.14%
  • Top unknown words in the corpus:
  • 49 hadien
  • 49 hap
  • 49 Ïaiong
  • 48 Myntri
  • 48 thaiñ
  • 47 bnai
  • 46 ïaid
  • 46 D
  • 45 jingïalang
  • 45 nongïalam
  • 45 jur
  • 45 kiei
  • 44 ngut
  • 43 lad
  • 41 kali
  • 41 nonghikai
  • 41 wanrah
  • 40 tynrai

Step 5: added all the above words (and accompanying disambiguations)

  • Coverage: 72.18%
  • Top unknown words in the corpus:
  • 40 kyntu
  • 40 kyrteng
  • 40 ïathuh
  • 40 kino
  • 40 lympung
  • 39 eh
  • 38 khamtam
  • 38 kyntiew
  • 38 rangbah
  • 37 shisha
  • 37 Trai
  • 36 Rangbah
  • 36 pisa
  • 36 pule
  • 36 kren
  • 36 bym
  • 35 kumno
  • 35 Bhoi
  • 35 ïohi
  • 34 hok

Step 6: added all the above words (and accompanying disambiguations)

  • Coverage: 73.59%
  • Top unknown words in the corpus:
  • 34 daw
  • 34 surok
  • 34 Sangma
  • 34 donkam
  • 34 pyrshah
  • 34 o
  • 34 rukom
  • 33 Congress
  • 33 jing
  • 32 dep
  • 32 phah
  • 32 beit
  • 32 jingmut
  • 32 peit
  • 31 elekshon
  • 31 imlang
  • 31 un
  • 31 sahlang
  • 31 Jaiñtia
  • 31 pyndonkam

Step 7: added all the above words (and accompanying disambiguations)

  • Coverage: 74.78%
  • Top unknown words in the corpus:
  • 31 T
  • 31 sahlang
  • 31 C
  • 31 e
  • 30 daw
  • 30 ngin
  • 30 party
  • 30 poi
  • 30 lem
  • 29 bniah
  • 29 katba
  • 29 kaban
  • 29 Kong
  • 29 Chie
  • 29 kynhun
  • 29 s
  • 29 uwei
  • 28 dkhot
  • 28 rai
  • 28 Party

Step 8: added all the above words (and accompanying disambiguations)

  • Coverage: 75.88%
  • Top unknown words in the corpus:
  • 30 of
  • 28 Unit
  • 28 nyngkong
  • 28 lynti
  • 28 khubor
  • 28 ktah
  • 28 tnad
  • 28 Mukul
  • 28 kynthei
  • 28 lai
  • 27 kat
  • 27 aiñ
  • 27 ïakhun
  • 27 satia
  • 27 jingjia
  • 27 SSA
  • 26 Hima
  • 26 kot
  • 26 shi
  • 26 K

At this point, I also decided to add all the letters in the alphabet as individual words.

Step 9: added all the above words (and accompanying disambiguations)

  • Coverage: 77.14%
  • Top unknown words in the corpus:
  • 56 EVM
  • 26 jongka - unknown
  • 26 lap
  • 26 lei
  • 25 ïaka - unknown
  • 25 kynja
  • 25 ophis
  • 25 Committee
  • 25 UDP
  • 25 treikam
  • 25 lait
  • 24 Suk
  • 24 bishar
  • 24 im
  • 24 ïadei
  • 24 jingïarap
  • 24 tit
  • 24 riew
  • 23 Nalor
  • 23 shna

Step 10: added all the above words (and accompanying disambiguations)

  • Coverage: 78.15%
  • Top unknown words in the corpus:
  • 23 Election
  • 23 sniew
  • 23 General
  • 23 pynban
  • 23 kmen
  • 22 thain
  • 22 kyrpang
  • 22 dawa
  • 22 ïai
  • 22 Pynursla
  • 22 kdew
  • 21 Iaiong
  • 21 Association
  • 21 Secondary
  • 21 Sports
  • 21 riti
  • 21 phai
  • 21 Jowai

Step 11: added all the above words (and accompanying disambiguations)

  • Coverage: 78.91%
  • Top unknown words in the corpus:
  • 21 MP
  • 21 shitom
  • 21 burom
  • 21 palat
  • 21 Block
  • 20 tyngka
  • 20 sakhi
  • 20 Raid
  • 20 shuwa
  • 20 khmih
  • 20 juh
  • 20 Court
  • 20 jait
  • 20 thong
  • 20 pulit
  • 20 Lyngdoh

Step 12: added all the above words (and accompanying disambiguations)

  • Coverage: 79.62%
  • Top unknown words in the corpus:
  • 20 dor
  • 20 ma
  • 20 thong
  • 20 man
  • 20 heh
  • 20 kumjuh
  • 19 kajuh
  • 19 nam
  • 19 North
  • 19 barim
  • 19 jingmyntoi
  • 19 Constituency
  • 19 sngewthuh
  • 19 hajar
  • 19 kut
  • 19 thoh
  • 19 pud
  • 19 hangne

Step 13: added all the above words (and accompanying disambiguations)

  • Coverage: 80.31%
  • Top unknown words in the corpus:
  • 19 Officer
  • 19 madan
  • 19 jop
  • 18 kyrpad
  • 18 kyrwoh
  • 18 lawei
  • 18 Niam
  • 18 jia
  • 18 plie
  • 18 haka
  • 18 kit
  • 18 jingeh
  • 18 nongrim
  • 17 vote
  • 17 Kynrad
  • 17 pyrshang
  • 17 paitbah

Command used to measure coverage:

  • aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
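To see how much each step contributed, the coverage figures reported on this page can be lined up and differenced. The sketch below only does arithmetic on those reported percentages; the variable names are my own.

```python
# Coverage reported on this page: the pre-project baseline, then Steps 1-13.
coverage = [57.26, 62.96, 66.05, 68.10, 70.14, 72.18, 73.59,
            74.78, 75.88, 77.14, 78.15, 78.91, 79.62, 80.31]

# Gain in percentage points contributed by each step.
gains = [round(after - before, 2) for before, after in zip(coverage, coverage[1:])]

for step, (cov, gain) in enumerate(zip(coverage[1:], gains), start=1):
    print(f"Step {step:2d}: {cov:.2f}%  (+{gain:.2f})")

total_gain = round(coverage[-1] - coverage[0], 2)
print(f"Total gain: +{total_gain:.2f} percentage points")
```

The per-step gain shrinks from 5.70 points in Step 1 to under one point in Steps 11 to 13, for a total of 23.05 points over the baseline.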