User:Nfeldba1/Final project

From LING073
Revision as of 01:22, 9 May 2017 by Nfeldba1 (talk | contribs) (Step 14: added all the above words (and accompanying disambiguations))

Pre-Final Project

  • Number of tokenised words in the corpus: 57847
  • Coverage: 57.26%
  • Top unknown words in the corpus:
  • 206 kam
  • 200 kum
  • 179 Bah
  • 172 kiwei
  • 171 bynta
  • 170 baroh
  • 166 lah
  • 159 pat
  • 147 mynta
  • 130 noh
  • 125 paidbah
  • 124 ne
  • 122 ïoh
  • 119 por
  • 118 wan
  • 118 Shillong
  • 117 namar
  • 117 Khasi
  • 112 katei
  • 111 Jylla
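
The three statistics above (token count, coverage, top unknown words) can be reproduced with a short script. The following is a minimal sketch, not the tool actually used for this log. It assumes the analyser output is in the Apertium stream format, where each token looks like `^surface/analysis$` and an analysis starting with `*` marks an unknown form. The function name `coverage_report`, the sample string, and the `<det>` tag are made up for illustration.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage_report(analysed_text, top_n=20):
    """Return (token_count, coverage_percent, top_unknown) for analyser output."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An analysis beginning with '*' marks a form the analyser does not know.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    coverage = 100.0 * known / total if total else 0.0
    return total, coverage, unknown.most_common(top_n)


# Hypothetical analyser output (the <det> tag is invented for this example).
sample = "^ka/ka<det>$ ^kam/*kam$ ^kum/*kum$ ^ka/ka<det>$"

total, coverage, top_unknown = coverage_report(sample)
print(f"Number of tokenised words in the corpus: {total}")
print(f"Coverage: {coverage:.2f}%")
for word, count in top_unknown:
    print(count, word)
```

Coverage here is simply the share of tokens that received at least one analysis, so adding the most frequent unknown words first gives the largest gain per entry.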

Step 1: added all the above words (and accompanying disambiguations)

  • Coverage: 62.96%
  • Top unknown words in the corpus:
  • 109 tang
  • 108 ym
  • 106 shuh
  • 102 haduh
  • 100 skul
  • 98 sorkar
  • 97 Seng
  • 96 briew
  • 90 M
  • 88 kala
  • 87 ri
  • 87 lang
  • 85 E
  • 83 kine
  • 82 seng
  • 77 Meghalaya
  • 77 tarik
  • 74 naduh
  • 74 haba
  • 74 shu

I'm not sure why M and E are appearing.
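
One way to investigate would be to look at where these single-letter tokens occur in the raw corpus. The sketch below prints short context snippets around a standalone token. The helper name `contexts` and the sample sentence are hypothetical, and the sample is not taken from the corpus. If the snippets showed, for example, abbreviations being split at periods, that would explain the counts, but that is only a guess.

```python
import re


def contexts(text, token, width=20, limit=5):
    """Return up to `limit` snippets showing `token` as a standalone word."""
    # Match the token only when it is not part of a longer word.
    pattern = re.compile(r"(?<!\w)" + re.escape(token) + r"(?!\w)")
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        snippets.append(text[start:end].replace("\n", " "))
        if len(snippets) >= limit:
            break
    return snippets


# Invented sample text, used only to show what the output looks like.
sample = "U M.L.A. jong ka Shillong. Ka E.V.M. ka long."

for letter in ("M", "E"):
    for snippet in contexts(sample, letter):
        print(letter, "|", snippet)
```

In practice `text` would be the contents of the corpus file rather than the sample string.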

Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala

  • Coverage: 66.05%
  • Top unknown words in the corpus (excluding M, E, and kala):
  • 72 bun
  • 72 hi
  • 71 BJP
  • 71 District
  • 70 April
  • 69 samla
  • 68 ryngkat
  • 67 liang
  • 63 bor
  • 63 shim
  • 63 MLA
  • 62 lyngba
  • 62 sdang
  • 62 India
  • 60 shah
  • 60 tylli
  • 59 kim

Step 3: added all the above words (and accompanying disambiguations)

  • Coverage: 68.10%
  • Top unknown words in the corpus:
  • 59 kynthup
  • 59 bapher
  • 58 President
  • 58 ioh
  • 58 khynnah
  • 58 ula
  • 58 ei
  • 57 School
  • 56 lada
  • 56 Secretary
  • 54 khnang
  • 54 wat
  • 54 kumba
  • 53 Hills
  • 52 Dr
  • 51 biang
  • 50 ju

Step 4: added all the above words (and accompanying disambiguations)

  • Coverage: 70.14%
  • Top unknown words in the corpus:
  • 49 hadien
  • 49 hap
  • 49 Ïaiong
  • 48 Myntri
  • 48 thaiñ
  • 47 bnai
  • 46 ïaid
  • 46 D
  • 45 jingïalang
  • 45 nongïalam
  • 45 jur
  • 45 kiei
  • 44 ngut
  • 43 lad
  • 41 kali
  • 41 nonghikai
  • 41 wanrah
  • 40 tynrai

Step 5: added all the above words (and accompanying disambiguations)

  • Coverage: 72.18%
  • Top unknown words in the corpus:
  • 40 kyntu
  • 40 kyrteng
  • 40 ïathuh
  • 40 kino
  • 40 lympung
  • 39 eh
  • 38 khamtam
  • 38 kyntiew
  • 38 rangbah
  • 37 shisha
  • 37 Trai
  • 36 Rangbah
  • 36 pisa
  • 36 pule
  • 36 kren
  • 36 bym
  • 35 kumno
  • 35 Bhoi
  • 35 ïohi
  • 34 hok

Step 6: added all the above words (and accompanying disambiguations)

  • Coverage: 73.59%
  • Top unknown words in the corpus:
  • 34 daw
  • 34 surok
  • 34 Sangma
  • 34 donkam
  • 34 pyrshah
  • 34 o
  • 34 rukom
  • 33 Congress
  • 33 jing
  • 32 dep
  • 32 phah
  • 32 beit
  • 32 jingmut
  • 32 peit
  • 31 elekshon
  • 31 imlang
  • 31 un
  • 31 sahlang
  • 31 Jaiñtia
  • 31 pyndonkam

Step 7: added all the above words (and accompanying disambiguations)

  • Coverage: 74.78%
  • Top unknown words in the corpus:
  • 31 T
  • 31 sahlang
  • 31 C
  • 31 e
  • 30 daw
  • 30 ngin
  • 30 party
  • 30 poi
  • 30 lem
  • 29 bniah
  • 29 katba
  • 29 kaban
  • 29 Kong
  • 29 Chie
  • 29 kynhun
  • 29 s
  • 29 uwei
  • 28 dkhot
  • 28 rai
  • 28 Party

Step 8: added all the above words (and accompanying disambiguations)

  • Coverage: 75.88%
  • Top unknown words in the corpus:
  • 30 of
  • 28 Unit
  • 28 nyngkong
  • 28 lynti
  • 28 khubor
  • 28 ktah
  • 28 tnad
  • 28 Mukul
  • 28 kynthei
  • 28 lai
  • 27 kat
  • 27 aiñ
  • 27 ïakhun
  • 27 satia
  • 27 jingjia
  • 27 SSA
  • 26 Hima
  • 26 kot
  • 26 shi
  • 26 K

At this point, I also decided to add all the letters of the alphabet as individual words.

Step 9: added all the above words (and accompanying disambiguations)

  • Coverage: 77.14%
  • Top unknown words in the corpus:
  • 56 EVM
  • 26 jongka - unknown
  • 26 lap
  • 26 lei
  • 25 ïaka - unknown
  • 25 kynja
  • 25 ophis
  • 25 Committee
  • 25 UDP
  • 25 treikam
  • 25 lait
  • 24 Suk
  • 24 bishar
  • 24 im
  • 24 ïadei
  • 24 jingïarap
  • 24 tit
  • 24 riew
  • 23 Nalor
  • 23 shna

Step 10: added all the above words (and accompanying disambiguations)

  • Coverage: 78.15%
  • Top unknown words in the corpus:
  • 23 Election
  • 23 sniew
  • 23 General
  • 23 pynban
  • 23 kmen
  • 22 thain
  • 22 kyrpang
  • 22 dawa
  • 22 ïai
  • 22 Pynursla
  • 22 kdew
  • 21 Iaiong
  • 21 Association
  • 21 Secondary
  • 21 Sports
  • 21 riti
  • 21 phai
  • 21 Jowai

Step 11: added all the above words (and accompanying disambiguations)

  • Coverage: 78.91%
  • Top unknown words in the corpus:
  • 21 MP
  • 21 shitom
  • 21 burom
  • 21 palat
  • 21 Block
  • 20 tyngka
  • 20 sakhi
  • 20 Raid
  • 20 shuwa
  • 20 khmih
  • 20 juh
  • 20 Court
  • 20 jait
  • 20 thong
  • 20 pulit
  • 20 Lyngdoh

Step 12: added all the above words (and accompanying disambiguations)

  • Coverage: 79.62%
  • Top unknown words in the corpus:
  • 20 dor
  • 20 ma
  • 20 thong
  • 20 man
  • 20 heh
  • 20 kumjuh
  • 19 kajuh
  • 19 nam
  • 19 North
  • 19 barim
  • 19 jingmyntoi
  • 19 Constituency
  • 19 sngewthuh
  • 19 hajar
  • 19 kut
  • 19 thoh
  • 19 pud
  • 19 hangne

Step 13: added all the above words (and accompanying disambiguations)

  • Coverage: 80.31%
  • Top unknown words in the corpus:
  • 19 Officer
  • 19 madan
  • 19 jop
  • 18 kyrpad
  • 18 kyrwoh
  • 18 lawei
  • 18 Niam
  • 18 jia
  • 18 plie
  • 18 haka - unsure
  • 18 kit
  • 18 jingeh
  • 18 nongrim
  • 17 vote
  • 17 Kynrad
  • 17 pyrshang
  • 17 paitbah

Step 14: added all the above words (and accompanying disambiguations)

  • Coverage: 80.94%
  • Top unknown words in the corpus:
  • 17 hynne
  • 17 sem
  • 17 jaidbynriew
  • 17 beh
  • 17 jingïalehkai
  • 17 Lajong
  • 17 jingrakhe
  • 17 ïalehkai
  • 17 masi
  • 17 Dorbar
  • 17 pdiang
  • 17 katkum
  • 16 Meet
  • 16 kloi
  • 16 dustur
  • 16 kem
  • 16 kaei

Step 15: added all the above words (and accompanying disambiguations)

  • Coverage: 81.56%
  • Top unknown words in the corpus:
  • 16 shaphang
  • 16 arngut
  • 16 jingpang
  • 16 nongkyndong
  • 16 Nongpoh
  • 16 roi
  • 16 Day
  • 16 pynshai
  • 16 lak
  • 16 drok
  • 16 National
  • 16 kyrtong
  • 16 ophisar
  • 15 sien
  • 15 nym
  • 15 shakhmat
  • 15 rung

  • Command used to produce the coverage figures: aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
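
The coverage figures recorded in this log can be turned into per-step gains to see how quickly the returns diminish. The sketch below uses only the percentages and the token count reported above. The token estimates are approximate, since they are derived from rounded percentages.

```python
# Coverage (%) before the first step and after each of steps 1 to 15,
# copied from the log above.
coverage = [
    57.26, 62.96, 66.05, 68.10, 70.14, 72.18, 73.59, 74.78,
    75.88, 77.14, 78.15, 78.91, 79.62, 80.31, 80.94, 81.56,
]
TOKENS = 57847  # number of tokenised words in the corpus

# Gain in percentage points contributed by each step.
gains = [round(after - before, 2) for before, after in zip(coverage, coverage[1:])]

for step, gain in enumerate(gains, start=1):
    # Rough number of corpus tokens newly covered by this step.
    approx_tokens = round(TOKENS * gain / 100)
    print(f"Step {step:2d}: +{gain:.2f} points (~{approx_tokens} tokens)")

total_gain = round(coverage[-1] - coverage[0], 2)
print(f"Total: +{total_gain:.2f} points over {len(gains)} steps")
```

The first step adds 5.70 points, while each of the last few steps adds well under one point, which is expected when words are added in order of frequency.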