User:Nfeldba1/Final project

From LING073
Revision as of 11:00, 11 May 2017 by Nfeldba1 (talk | contribs) (Step 20: added all the above words (and accompanying disambiguations))

Pre-Final Project

  • Number of tokenised words in the corpus: 57847
  • Coverage: 57.26%
  • Top unknown words in the corpus:
  • 206 kam
  • 200 kum
  • 179 Bah
  • 172 kiwei
  • 171 bynta
  • 170 baroh
  • 166 lah
  • 159 pat
  • 147 mynta
  • 130 noh
  • 125 paidbah
  • 124 ne
  • 122 ïoh
  • 119 por
  • 118 wan
  • 118 Shillong
  • 117 namar
  • 117 Khasi
  • 112 katei
  • 111 Jylla
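As a rough check, the counts in the list above can be turned into an estimate of how much coverage these 20 words account for. The sketch below is only arithmetic on the figures reported here; the function name `estimated_gain` is made up for illustration. The figure measured after a step need not match this estimate exactly, for instance if the added entries and disambiguations also cover forms outside this list.

```python
# Total token count and per-word counts copied from the list above.
TOTAL_TOKENS = 57847
unknown_counts = {
    "kam": 206, "kum": 200, "Bah": 179, "kiwei": 172, "bynta": 171,
    "baroh": 170, "lah": 166, "pat": 159, "mynta": 147, "noh": 130,
    "paidbah": 125, "ne": 124, "ïoh": 122, "por": 119, "wan": 118,
    "Shillong": 118, "namar": 117, "Khasi": 117, "katei": 112, "Jylla": 111,
}


def estimated_gain(counts, total_tokens):
    """Percentage points of coverage gained if every listed word became known."""
    return 100.0 * sum(counts.values()) / total_tokens


gain = estimated_gain(unknown_counts, TOTAL_TOKENS)
print(f"tokens accounted for by the list: {sum(unknown_counts.values())}")
print(f"estimated gain: {gain:.2f} percentage points")
```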

Step 1: added all the above words (and accompanying disambiguations)

  • Coverage: 62.96%
  • Top unknown words in the corpus:
  • 109 tang
  • 108 ym
  • 106 shuh
  • 102 haduh
  • 100 skul
  • 98 sorkar
  • 97 Seng
  • 96 briew
  • 90 M
  • 88 kala
  • 87 ri
  • 87 lang
  • 85 E
  • 83 kine
  • 82 seng
  • 77 Meghalaya
  • 77 tarik
  • 74 naduh
  • 74 haba
  • 74 shu

I'm not sure why M and E are appearing.
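One way to investigate stray single-letter tokens like these is to print the text around each occurrence in the corpus. The sketch below is a minimal concordance helper; the function name `contexts` and the sample sentence are hypothetical, and the guess that such letters come from abbreviations or initials split at full stops is only an assumption to be checked against the real corpus.

```python
import re


def contexts(text, target, width=30):
    """Return snippets of `text` around each standalone occurrence of `target`."""
    # (?<!\w) and (?!\w) make sure the target is not part of a longer word.
    pattern = re.compile(r"(?<!\w)" + re.escape(target) + r"(?!\w)")
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        snippets.append(text[start:end].replace("\n", " "))
    return snippets


# Hypothetical sample text, NOT taken from the corpus.
sample = "Ka M.E. School ka don ha Shillong. U Bah M. Lyngdoh u la wan."
for letter in ("M", "E"):
    for snippet in contexts(sample, letter):
        print(f"{letter}: ...{snippet}...")
```

To run it on the real data, read the corpus file into a string and pass that in place of `sample`.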

Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala

  • Coverage: 66.05%
  • Top unknown words in the corpus (excluding M, E, and kala):
  • 72 bun
  • 72 hi
  • 71 BJP
  • 71 District
  • 70 April
  • 69 samla
  • 68 ryngkat
  • 67 liang
  • 63 bor
  • 63 shim
  • 63 MLA
  • 62 lyngba
  • 62 sdang
  • 62 India
  • 60 shah
  • 60 tylli
  • 59 kim

Step 3: added all the above words (and accompanying disambiguations)

  • Coverage: 68.10%
  • Top unknown words in the corpus:
  • 59 kynthup
  • 59 bapher
  • 58 President
  • 58 ioh
  • 58 khynnah
  • 58 ula
  • 58 ei
  • 57 School
  • 56 lada
  • 56 Secretary
  • 54 khnang
  • 54 wat
  • 54 kumba
  • 53 Hills
  • 52 Dr
  • 51 biang
  • 50 ju

Step 4: added all the above words (and accompanying disambiguations)

  • Coverage: 70.14%
  • Top unknown words in the corpus:
  • 49 hadien
  • 49 hap
  • 49 Ïaiong
  • 48 Myntri
  • 48 thaiñ
  • 47 bnai
  • 46 ïaid
  • 46 D
  • 45 jingïalang
  • 45 nongïalam
  • 45 jur
  • 45 kiei
  • 44 ngut
  • 43 lad
  • 41 kali
  • 41 nonghikai
  • 41 wanrah
  • 40 tynrai

Step 5: added all the above words (and accompanying disambiguations)

  • Coverage: 72.18%
  • Top unknown words in the corpus:
  • 40 kyntu
  • 40 kyrteng
  • 40 ïathuh
  • 40 kino
  • 40 lympung
  • 39 eh
  • 38 khamtam
  • 38 kyntiew
  • 38 rangbah
  • 37 shisha
  • 37 Trai
  • 36 Rangbah
  • 36 pisa
  • 36 pule
  • 36 kren
  • 36 bym
  • 35 kumno
  • 35 Bhoi
  • 35 ïohi
  • 34 hok

Step 6: added all the above words (and accompanying disambiguations)

  • Coverage: 73.59%
  • Top unknown words in the corpus:
  • 34 daw
  • 34 surok
  • 34 Sangma
  • 34 donkam
  • 34 pyrshah
  • 34 o
  • 34 rukom
  • 33 Congress
  • 33 jing
  • 32 dep
  • 32 phah
  • 32 beit
  • 32 jingmut
  • 32 peit
  • 31 elekshon
  • 31 imlang
  • 31 un
  • 31 sahlang
  • 31 Jaiñtia
  • 31 pyndonkam

Step 7: added all the above words (and accompanying disambiguations)

  • Coverage: 74.78%
  • Top unknown words in the corpus:
  • 31 T
  • 31 sahlang
  • 31 C
  • 31 e
  • 30 daw
  • 30 ngin
  • 30 party
  • 30 poi
  • 30 lem
  • 29 bniah
  • 29 katba
  • 29 kaban
  • 29 Kong
  • 29 Chie
  • 29 kynhun
  • 29 s
  • 29 uwei
  • 28 dkhot
  • 28 rai
  • 28 Party
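Two words in the list above, sahlang and daw, also appear in the list before Step 7, which may mean the entries added for them did not catch every occurrence. A small check like the following can flag such words by comparing two consecutive lists. The helper name `still_unknown` is made up for illustration, and only a few entries from each list are copied in.

```python
def still_unknown(previous, current):
    """Words listed as unknown both before and after a step, with both counts."""
    return {word: (previous[word], current[word])
            for word in current if word in previous}


# A few entries copied from the list before Step 7 and the list after it.
before_step7 = {"daw": 34, "surok": 34, "Sangma": 34, "sahlang": 31, "pyndonkam": 31}
after_step7 = {"T": 31, "sahlang": 31, "C": 31, "e": 31, "daw": 30, "ngin": 30}

for word, (before, after) in still_unknown(before_step7, after_step7).items():
    print(f"{word}: {before} -> {after}")
```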

Step 8: added all the above words (and accompanying disambiguations)

  • Coverage: 75.88%
  • Top unknown words in the corpus:
  • 30 of
  • 28 Unit
  • 28 nyngkong
  • 28 lynti
  • 28 khubor
  • 28 ktah
  • 28 tnad
  • 28 Mukul
  • 28 kynthei
  • 28 lai
  • 27 kat
  • 27 aiñ
  • 27 ïakhun
  • 27 satia
  • 27 jingjia
  • 27 SSA
  • 26 Hima
  • 26 kot
  • 26 shi
  • 26 K

At this point, I also decided to add all the letters of the alphabet as individual words.

Step 9: added all the above words (and accompanying disambiguations)

  • Coverage: 77.14%
  • Top unknown words in the corpus:
  • 56 EVM
  • 26 jongka - unknown
  • 26 lap
  • 26 lei
  • 25 ïaka - unknown
  • 25 kynja
  • 25 ophis
  • 25 Committee
  • 25 UDP
  • 25 treikam
  • 25 lait
  • 24 Suk
  • 24 bishar
  • 24 im
  • 24 ïadei
  • 24 jingïarap
  • 24 tit
  • 24 riew
  • 23 Nalor
  • 23 shna

Step 10: added all the above words (and accompanying disambiguations)

  • Coverage: 78.15%
  • Top unknown words in the corpus:
  • 23 Election
  • 23 sniew
  • 23 General
  • 23 pynban
  • 23 kmen
  • 22 thain
  • 22 kyrpang
  • 22 dawa
  • 22 ïai
  • 22 Pynursla
  • 22 kdew
  • 21 Iaiong
  • 21 Association
  • 21 Secondary
  • 21 Sports
  • 21 riti
  • 21 phai
  • 21 Jowai

Step 11: added all the above words (and accompanying disambiguations)

  • Coverage: 78.91%
  • Top unknown words in the corpus:
  • 21 MP
  • 21 shitom
  • 21 burom
  • 21 palat
  • 21 Block
  • 20 tyngka
  • 20 sakhi
  • 20 Raid
  • 20 shuwa
  • 20 khmih
  • 20 juh
  • 20 Court
  • 20 jait
  • 20 thong
  • 20 pulit
  • 20 Lyngdoh

Step 12: added all the above words (and accompanying disambiguations)

  • Coverage: 79.62%
  • Top unknown words in the corpus:
  • 20 dor
  • 20 ma
  • 20 thong
  • 20 man
  • 20 heh
  • 20 kumjuh
  • 19 kajuh
  • 19 nam
  • 19 North
  • 19 barim
  • 19 jingmyntoi
  • 19 Constituency
  • 19 sngewthuh
  • 19 hajar
  • 19 kut
  • 19 thoh
  • 19 pud
  • 19 hangne

Step 13: added all the above words (and accompanying disambiguations)

  • Coverage: 80.31%
  • Top unknown words in the corpus:
  • 19 Officer
  • 19 madan
  • 19 jop
  • 18 kyrpad
  • 18 kyrwoh
  • 18 lawei
  • 18 Niam
  • 18 jia
  • 18 plie
  • 18 haka - unsure
  • 18 kit
  • 18 jingeh
  • 18 nongrim
  • 17 vote
  • 17 Kynrad
  • 17 pyrshang
  • 17 paitbah

Step 14: added all the above words (and accompanying disambiguations)

  • Coverage: 80.94%
  • Top unknown words in the corpus:
  • 17 hynne
  • 17 sem
  • 17 jaidbynriew
  • 17 beh
  • 17 jingïalehkai
  • 17 Lajong
  • 17 jingrakhe
  • 17 ïalehkai
  • 17 masi
  • 17 Dorbar
  • 17 pdiang
  • 17 katkum
  • 16 Meet
  • 16 kloi
  • 16 dustur
  • 16 kem
  • 16 kaei

Step 15: added all the above words (and accompanying disambiguations)

  • Coverage: 81.56%
  • Top unknown words in the corpus:
  • 16 shaphang
  • 16 arngut - unsure
  • 16 jingpang
  • 16 nongkyndong
  • 16 Nongpoh
  • 16 roi
  • 16 Day
  • 16 pynshai
  • 16 lak
  • 16 drok
  • 16 National
  • 16 kyrtong
  • 16 ophisar
  • 15 sien
  • 15 nym
  • 15 shakhmat
  • 15 rung

Step 16: added all the above words (and accompanying disambiguations)

  • Coverage: 82.06%
  • Top unknown words in the corpus:
  • 15 Circle
  • 15 Good
  • 15 hiar
  • 15 synshar
  • 15 Hato
  • 15 klur
  • 15 Khanna
  • 15 kynthoh
  • 15 katne
  • 15 kper
  • 15 prokram
  • 15 jingkyrshan
  • 15 Nongstoiñ
  • 15 Upper
  • 14 lut
  • 14 kynnoh

Step 17: added all the above words (and accompanying disambiguations)

  • Coverage: 82.60%
  • Top unknown words in the corpus:
  • 14 be
  • 14 sngewbha
  • 14 Vinod
  • 14 pynbna
  • 14 rep
  • 14 bynrap
  • 14 hukum
  • 14 par
  • 14 khang
  • 14 ïashim
  • 14 shynrang
  • 14 College
  • 14 Commission
  • 14 nongmihkhmat
  • 14 kyrdan
  • 14 bampong

Step 18: added all the above words (and accompanying disambiguations)

  • Coverage: 83.03%
  • Top unknown words in the corpus:
  • 14 kular
  • 14 nong
  • 14 khraw
  • 14 sngewbha
  • 14 Bar
  • 14 kumne
  • 13 pilkrim
  • 13 KSU
  • 13 Khristan
  • 13 pdeng
  • 13 wad
  • 13 pan
  • 13 duh
  • 13 sengbhalang
  • 13 longkmie
  • 13 ap

Step 19: added all the above words (and accompanying disambiguations)

  • Coverage: 83.54%
  • Top unknown words in the corpus:
  • 13 runar
  • 13 Mawlai
  • 13 thep
  • 13 kyndon
  • 13 ïap
  • 13 jingnang
  • 13 bala - unsure
  • 13 pynkhreh
  • 13 jingstad
  • 13 katto
  • 13 ding
  • 12 shakri
  • 12 san
  • 12 siew
  • 12 phang

Step 20: added all the above words (and accompanying disambiguations)

  • Coverage: 83.90%
  • Top unknown words in the corpus:
  • 12 kai
  • 12 Friday
  • 12 nongialam
  • 12 jingshisha
  • 12 jongki
  • 12 shisien
  • 12 men
  • 12 tnat
  • 12 Bangladesh
  • 12 Mobile
  • 12 Central
  • 12 kadei
  • 12 ngim
  • 12 naba

Step 21: added all the above words (and accompanying disambiguations) and reworked morphology with u, la, yn, ym, etc

  • Coverage: 84.51%
  • Top unknown words in the corpus:
  • 12 kongsan
  • 12 pyrda
  • 12 tulop
  • 12 kidei
  • 12 Wahumkhrah
  • 12 kadei
  • 12 Municipal
  • 12 kol
  • 12 pynbha
  • 12 Bagan
  • 12 thung
  • 12 Executive
  • 11 St
  • 11 bat
  • 11 pateng
  • 11 shem
  • 11 mon
  • 11 shen
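The coverage figures reported in this log can be tabulated to see how much each step gained. The sketch below uses only the numbers given above (the pre-project baseline followed by Steps 1 to 21); the helper name `gains` is made up for illustration.

```python
# Coverage (%) as reported in this log: pre-project baseline, then Steps 1-21.
coverage = [57.26, 62.96, 66.05, 68.10, 70.14, 72.18, 73.59, 74.78, 75.88,
            77.14, 78.15, 78.91, 79.62, 80.31, 80.94, 81.56, 82.06, 82.60,
            83.03, 83.54, 83.90, 84.51]


def gains(values):
    """Difference between each value and the one before it, in percentage points."""
    return [round(b - a, 2) for a, b in zip(values, values[1:])]


step_gains = gains(coverage)
for step, gain in enumerate(step_gains, start=1):
    print(f"Step {step:2d}: {coverage[step]:.2f}%  (+{gain:.2f})")

total_gain = round(coverage[-1] - coverage[0], 2)
print(f"Total gain: {total_gain:.2f} points over {len(step_gains)} steps")
```

The per-step gain generally shrinks as the remaining unknown words become rarer, since each batch of added words accounts for fewer tokens.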

Command used for the coverage tests:

  • aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
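For reference, the kind of figure reported throughout this log (coverage plus a ranked list of unknown words) can be sketched from analyser output in Apertium stream format, where an unknown word is written as `^form/*form$`. This is a minimal illustration, not aq-covtest's actual implementation. The function name `coverage_report`, the sample string, and the tags in it are assumptions, and the sketch counts every lexical unit, so its numbers may differ from the tool's if the tool tokenises differently.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage_report(analysed, top_n=20):
    """Return (coverage in percent, most common unknown surface forms)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed):
        total += 1
        # An analysis starting with '*' marks a form the analyser does not know.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    percent = 100.0 * known / total if total else 0.0
    return percent, unknown.most_common(top_n)


# Hypothetical analyser output; the tags are illustrative only.
sample = ("^ka/ka<det>$ ^skul/*skul$ ^ha/ha<pr>$ "
          "^Shillong/*Shillong$ ^ka/ka<det>$ ^skul/*skul$")
percent, top_unknown = coverage_report(sample)
print(f"Coverage: {percent:.2f}%")
for word, count in top_unknown:
    print(count, word)
```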