Difference between revisions of "User:Nfeldba1/Final project"

==Pre-Final Project==
 
*Number of tokenised words in the corpus: 57847
 
*Coverage: 57.26%
 
*Top unknown words in the corpus:
 
*206 kam
 
*200 kum
 
*179 Bah
 
*172 kiwei
 
*171 bynta
 
*170 baroh
 
*166 lah
 
*159 pat
 
*147 mynta
 
*130 noh
 
*125 paidbah
 
*124 ne
 
*122 ïoh
 
*119 por
 
*118 wan
 
*118 Shillong
 
*117 namar
 
*117 Khasi
 
*112 katei
 
*111 Jylla
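As a rough check on how much the list above can contribute, the sketch below sums the listed frequencies and divides by the corpus size. It assumes every occurrence of each listed form becomes analysable once the word is added, and that nothing else changes; the variable and function names are illustrative only.

```python
# Figures copied from the list above (pre-final project).
TOTAL_TOKENS = 57847
BASE_COVERAGE = 57.26  # percent, as reported above

top_unknown = {
    "kam": 206, "kum": 200, "Bah": 179, "kiwei": 172, "bynta": 171,
    "baroh": 170, "lah": 166, "pat": 159, "mynta": 147, "noh": 130,
    "paidbah": 125, "ne": 124, "ïoh": 122, "por": 119, "wan": 118,
    "Shillong": 118, "namar": 117, "Khasi": 117, "katei": 112, "Jylla": 111,
}


def estimated_gain(freqs, total_tokens):
    """Percentage points gained if every listed token becomes known."""
    return 100.0 * sum(freqs.values()) / total_tokens


gain = estimated_gain(top_unknown, TOTAL_TOKENS)
print(f"tokens covered by the list: {sum(top_unknown.values())}")  # 2883
print(f"estimated gain: {gain:.2f} points")                        # 4.98
print(f"estimated coverage: {BASE_COVERAGE + gain:.2f}%")          # 62.24%
```

The coverage reported in Step 1 is somewhat higher than this estimate. A plausible reason, though only a guess, is that the added entries also made other forms analysable.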
 
==Step 1: added all the above words (and accompanying disambiguations)==
 
*Coverage: 62.96%
 
*Top unknown words in the corpus:
 
*109 tang
 
*108 ym
 
*106 shuh
 
*102 haduh
 
*100 skul
 
*98 sorkar
 
*97 Seng
 
*96 briew
 
*90 M
 
*88 kala
 
*87 ri
 
*87 lang
 
*85 E
 
*83 kine
 
*82 seng
 
*77 Meghalaya
 
*77 tarik
 
*74 naduh
 
*74 haba
 
*74 shu
 
I'm not sure why M and E are showing up as unknown words.
 
  
==Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala ==
 
*Coverage: 66.05%
 
*Top unknown words in the corpus (excluding M, E, and kala):
 
*72 bun
 
*72 hi
 
*71 BJP
 
*71 District
 
*70 April
 
*69 samla
 
*68 ryngkat
 
*67 liang
 
*63 bor
 
*63 shim
 
*63 MLA
 
*62 lyngba
 
*62 sdang
 
*62 India
 
*60 shah
 
*60 tylli
 
*59 kim
 
==Step 3: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 68.10%

*Top unknown words in the corpus:
 
*59 kynthup
 
*59 bapher
 
*58 President
 
*58 ioh
 
*58 khynnah
 
*58 ula
 
*58 ei
 
*57 School
 
*56 lada
 
*56 Secretary
 
*54 khnang
 
*54 wat
 
*54 kumba
 
*53 Hills
 
*52 Dr
 
*51 biang
 
*50 ju
 
==Step 4: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 70.14%
 
*Top unknown words in the corpus:
 
*49 hadien
 
*49 hap
 
*49 Ïaiong
 
*48 Myntri
 
*48 thaiñ
 
*47 bnai
 
*46 ïaid
 
*46 D
 
*45 jingïalang
 
*45 nongïalam
 
*45 jur
 
*45 kiei
 
*44 ngut
 
*43 lad
 
*41 kali
 
*41 nonghikai
 
*41 wanrah
 
*40 tynrai
 
==Step 5: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 72.18%
 
*Top unknown words in the corpus:
 
*40 kyntu
 
*40 kyrteng
 
*40 ïathuh
 
*40 kino
 
*40 lympung
 
*39 eh
 
*38 khamtam
 
*38 kyntiew
 
*38 rangbah
 
*37 shisha
 
*37 Trai
 
*36 Rangbah
 
*36 pisa
 
*36 pule
 
*36 kren
 
*36 bym
 
*35 kumno
 
*35 Bhoi
 
*35 ïohi
 
*34 hok
 
==Step 6: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 73.59%
 
*Top unknown words in the corpus:
 
*34 daw
 
*34 surok
 
*34 Sangma
 
*34 donkam
 
*34 pyrshah
 
*34 o
 
*34 rukom
 
*33 Congress
 
*33 jing
 
*32 dep
 
*32 phah
 
*32 beit
 
*32 jingmut
 
*32 peit
 
*31 elekshon
 
*31 imlang
 
*31 un
 
*31 sahlang
 
*31 Jaiñtia
 
*31 pyndonkam
 
==Step 7: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 74.78%
 
*Top unknown words in the corpus:
 
*31 T
 
*31 sahlang
 
*31 C
 
*31 e
 
*30 daw
 
*30 ngin
 
*30 party
 
*30 poi
 
*30 lem
 
*29 bniah
 
*29 katba
 
*29 kaban
 
*29 Kong
 
*29 Chie
 
*29 kynhun
 
*29 s
 
*29 uwei
 
*28 dkhot
 
*28 rai
 
*28 Party
 
 
==Step 8: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 75.88%
 
*Top unknown words in the corpus:
 
*30 of
 
*28 Unit
 
*28 nyngkong
 
*28 lynti
 
*28 khubor
 
*28 ktah
 
*28 tnad
 
*28 Mukul
 
*28 kynthei
 
*28 lai
 
*27 kat
 
*27 aiñ
 
*27 ïakhun
 
*27 satia
 
*27 jingjia
 
*27 SSA
 
*26 Hima
 
*26 kot
 
*26 shi
 
*26 K
 
*At this point, I also decided to add every letter of the alphabet as an individual word.
 
==Step 9: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 77.14%
 
*Top unknown words in the corpus:
 
*56 EVM
 
*26 jongka - unknown
 
*26 lap
 
*26 lei
 
*25 ïaka - unknown
 
*25 kynja
 
*25 ophis
 
*25 Committee
 
*25 UDP
 
*25 treikam
 
*25 lait
 
*24 Suk
 
*24 bishar
 
*24 im
 
*24 ïadei
 
*24 jingïarap
 
*24 tit
 
*24 riew
 
*23 Nalor
 
*23 shna
 
==Step 10: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 78.15%
 
*Top unknown words in the corpus:
 
*23 Election
 
*23 sniew
 
*23 General
 
*23 pynban
 
*23 kmen
 
*22 thain
 
*22 kyrpang
 
*22 dawa
 
*22 ïai
 
*22 Pynursla
 
*22 kdew
 
*21 Iaiong
 
*21 Association
 
*21 Secondary
 
*21 Sports
 
*21 riti
 
*21 phai
 
*21 Jowai
 
==Step 11: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 78.91%

*Top unknown words in the corpus:
 
*21 MP
 
*21 shitom
 
*21 burom
 
*21 palat
 
*21 Block
 
*20 tyngka
 
*20 sakhi
 
*20 Raid
 
*20 shuwa
 
*20 khmih
 
*20 juh
 
*20 Court
 
*20 jait
 
*20 thong
 
*20 pulit
 
*20 Lyngdoh
 
==Step 12: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 79.62%

*Top unknown words in the corpus:

*20 dor

*20 ma

*20 thong

*20 man

*20 heh

*20 kumjuh

*19 kajuh

*19 nam

*19 North

*19 barim

*19 jingmyntoi

*19 Constituency

*19 sngewthuh

*19 hajar

*19 kut

*19 thoh

*19 pud

*19 hangne
 
==Step 13: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 80.31%

*Top unknown words in the corpus:

*19 Officer

*19 madan

*19 jop

*18 kyrpad

*18 kyrwoh

*18 lawei

*18 Niam

*18 jia

*18 plie

*18 haka - unsure

*18 kit

*18 jingeh

*18 nongrim

*17 vote

*17 Kynrad

*17 pyrshang

*17 paitbah
 
==Step 14: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 80.94%

*Top unknown words in the corpus:

*17 hynne

*17 sem

*17 jaidbynriew

*17 beh

*17 jingïalehkai

*17 Lajong

*17 jingrakhe

*17 ïalehkai

*17 masi

*17 Dorbar

*17 pdiang

*17 katkum

*16 Meet

*16 kloi

*16 dustur

*16 kem

*16 kaei
 
 
 
==Step 15: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 81.56%

*Top unknown words in the corpus:

*16 shaphang

*16 arngut - unsure

*16 jingpang

*16 nongkyndong

*16 Nongpoh

*16 roi

*16 Day

*16 pynshai

*16 lak

*16 drok

*16 National

*16 kyrtong

*16 ophisar

*15 sien

*15 nym

*15 shakhmat

*15 rung
 
==Step 16: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 82.06%
 
*Top unknown words in the corpus:
 
*15 Circle
 
*15 Good
 
*15 hiar
 
*15 synshar
 
*15 Hato
 
*15 klur
 
*15 Khanna
 
*15 kynthoh
 
*15 katne
 
*15 kper
 
*15 prokram
 
*15 jingkyrshan
 
*15 Nongstoiñ
 
*15 Upper
 
*14 lut
 
*14 kynnoh
 
==Step 17: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 82.60%
 
*Top unknown words in the corpus:
 
*14 be
 
*14 sngewbha
 
*14 Vinod
 
*14 pynbna
 
*14 rep
 
*14 bynrap
 
*14 hukum
 
*14 par
 
*14 khang
 
*14 ïashim
 
*14 shynrang
 
*14 College
 
*14 Commission
 
*14 nongmihkhmat
 
*14 kyrdan
 
*14 bampong
 
==Step 18: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 83.03%

*Top unknown words in the corpus:

*14 kular

*14 nong

*14 khraw

*14 sngewbha

*14 Bar

*14 kumne

*13 pilkrim

*13 KSU

*13 Khristan

*13 pdeng

*13 wad

*13 pan

*13 duh

*13 sengbhalang

*13 longkmie

*13 ap
 
==Step 19: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 83.54%

*Top unknown words in the corpus:

*13 runar

*13 Mawlai

*13 thep

*13 kyndon

*13 ïap

*13 jingnang

*13 bala - unsure

*13 pynkhreh

*13 jingstad

*13 katto

*13 ding

*12 shakri

*12 san

*12 siew

*12 phang
 
==Step 20: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 83.90%

*Top unknown words in the corpus:

*12 kai

*12 Friday

*12 nongialam

*12 jingshisha

*12 jongki

*12 shisien

*12 men

*12 tnat

*12 Bangladesh

*12 Mobile

*12 Central

*12 kadei

*12 ngim

*12 naba
 
==Step 21: added all the above words (and accompanying disambiguations) and reworked morphology with u, la, yn, ym, etc. ==
 
*Coverage: 84.51%

*Top unknown words in the corpus:

*12 kongsan

*12 pyrda

*12 tulop

*12 Wahumkhrah

*12 Municipal

*12 pynbha

*12 Bagan

*12 thung

*12 Executive

*11 St

*11 bat

*11 pateng

*11 shem

*11 mon

*11 shen
 
==Step 22: added all the above words (and accompanying disambiguations)==
 
*Coverage: 84.87%

*Top unknown words in the corpus:

*11 pa

*11 pyllait

*11 dak

*11 mariang

*11 Inter

*11 MDC

*11 jam

*11 NPP

*11 khia

*11 Education

*11 kylleng

*11 Adam

*11 met

*11 ïalade

*11 RBYF
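The coverage figures reported at each step can be put side by side to see how the gain per step shrinks as the remaining unknown words get rarer. The sketch below only restates the numbers already given on this page; the variable names are illustrative.

```python
# Coverage (percent) as reported on this page: index 0 is the pre-final
# project, indices 1..22 are Steps 1..22.
coverage = [
    57.26, 62.96, 66.05, 68.10, 70.14, 72.18, 73.59, 74.78,
    75.88, 77.14, 78.15, 78.91, 79.62, 80.31, 80.94, 81.56,
    82.06, 82.60, 83.03, 83.54, 83.90, 84.51, 84.87,
]

# Gain in percentage points contributed by each step.
gains = [round(after - before, 2) for before, after in zip(coverage, coverage[1:])]

for step, (cov, gain) in enumerate(zip(coverage[1:], gains), start=1):
    print(f"Step {step:2d}: {cov:.2f}%  (+{gain:.2f})")

total = coverage[-1] - coverage[0]
print(f"Total gain: {total:.2f} points over {len(gains)} steps")
```

Step 1 gains 5.70 points, while Steps 20 and 22 gain 0.36 each. Step 21, which also reworked the morphology, gains 0.61.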
 
 
*Command used for the coverage test: <code>aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin</code>
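For reference, the kind of computation behind a coverage figure and a "top unknown words" list can be sketched as follows. This is not how aq-covtest is implemented. It assumes analyser output in the Apertium stream format, where an unanalysed token appears as <code>^kam/*kam$</code>. The function name, the sample string, and the tags in it are made up for illustration, and escaped characters in the stream are ignored.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage_report(analysed_text, top_n=20):
    """Return (coverage in percent, most common unknown surface forms)."""
    known = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        if analyses.startswith("*"):  # "*" marks a token with no analysis
            unknown[surface] += 1
        else:
            known += 1
    total = known + sum(unknown.values())
    coverage = 100.0 * known / total if total else 0.0
    return coverage, unknown.most_common(top_n)


# Made-up sample; the tags are placeholders, not taken from the Khasi analyser.
sample = "^ka/ka<det>$ ^kam/*kam$ ^la/la<adv>$ ^kam/*kam$ ^kum/*kum$"
cov, top = coverage_report(sample)
print(f"Coverage: {cov:.2f}%")
for word, count in top:
    print(count, word)
```

On the sample, two of the five tokens are analysed, so the sketch reports 40.00% coverage and lists kam (2) ahead of kum (1).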
 
[[Category:sp17_FinalProjects]]
 

Revision as of 12:37, 11 May 2017