Difference between revisions of "User:Nfeldba1/Final project"

==Pre-Final Project==
*Number of tokenised words in the corpus: 57847
*Coverage: 57.26%
 
*Top unknown words in the corpus:
 
*206 kam
 
*200 kum
 
*179 Bah
 
*172 kiwei
 
*171 bynta
 
*170 baroh
 
*166 lah
 
*159 pat
 
*147 mynta
 
*130 noh
 
*125 paidbah
 
*124 ne
 
*122 ïoh
 
*119 por
 
*118 wan
 
*118 Shillong
 
*117 namar
 
*117 Khasi
 
*112 katei
 
*111 Jylla
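As a rough sanity check on the step that follows, the counts in the list above can be summed and compared with the corpus size. This is only a sketch of the arithmetic, using the figures reported on this page; the variable names are illustrative. The 20 counts sum to 2883 tokens, about 4.98 percentage points of the 57847-token corpus, while the gain reported in Step 1 (57.26% to 62.96%) is 5.70 points. The page does not say where the remaining difference comes from; the extra suffixes or other entries added alongside these words are one possibility, which is an assumption here.

```python
# Figures copied from this page; names are illustrative only.
total_tokens = 57847
counts = [206, 200, 179, 172, 171, 170, 166, 159, 147, 130,
          125, 124, 122, 119, 118, 118, 117, 117, 112, 111]

# Tokens that become known if exactly these 20 surface forms are added.
newly_covered = sum(counts)

# Coverage gain, in percentage points, that those tokens alone would explain.
expected_gain = 100.0 * newly_covered / total_tokens

# Gain actually reported between the initial run and Step 1.
observed_gain = 62.96 - 57.26

print(f"{newly_covered} tokens, expected gain {expected_gain:.2f} points, "
      f"observed gain {observed_gain:.2f} points")
```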
 
==Step 1: added all the above words (and accompanying disambiguations)==
 
*Coverage: 62.96%
 
*Top unknown words in the corpus:
 
*109 tang
 
*108 ym
 
*106 shuh
 
*102 haduh
 
*100 skul
 
*98 sorkar
 
*97 Seng
 
*96 briew
 
*90 M
 
*88 kala
 
*87 ri
 
*87 lang
 
*85 E
 
*83 kine
 
*82 seng
 
*77 Meghalaya
 
*77 tarik
 
*74 naduh
 
*74 haba
 
*74 shu
 
I'm not sure why M and E are appearing.
 
  
==Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala ==
*Coverage: 66.05%
 
*Top unknown words in the corpus (excluding M, E, and kala):
 
*72 bun
 
*72 hi
 
*71 BJP
 
*71 District
 
*70 April
 
*69 samla
 
*68 ryngkat
 
*67 liang
 
*63 bor
 
*63 shim
 
*63 MLA
 
*62 lyngba
 
*62 sdang
 
*62 India
 
*60 shah
 
*60 tylli
 
*59 kim
 
==Step 3: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 68.10%
*Top unknown words in the corpus:
 
*59 kynthup
 
*59 bapher
 
*58 President
 
*58 ioh
 
*58 khynnah
 
*58 ula
 
*58 ei
 
*57 School
 
*56 lada
 
*56 Secretary
 
*54 khnang
 
*54 wat
 
*54 kumba
 
*53 Hills
 
*52 Dr
 
*51 biang
 
*50 ju
 
==Step 4: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 70.14%
 
*Top unknown words in the corpus:
 
*49 hadien
 
*49 hap
 
*49 Ïaiong
 
*48 Myntri
 
*48 thaiñ
 
*47 bnai
 
*46 ïaid
 
*46 D
 
*45 jingïalang
 
*45 nongïalam
 
*45 jur
 
*45 kiei
 
*44 ngut
 
*43 lad
 
*41 kali
 
*41 nonghikai
 
*41 wanrah
 
*40 tynrai
 
==Step 5: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 72.18%
 
*Top unknown words in the corpus:
 
*40 kyntu
 
*40 kyrteng
 
*40 ïathuh
 
*40 kino
 
*40 lympung
 
*39 eh
 
*38 khamtam
 
*38 kyntiew
 
*38 rangbah
 
*37 shisha
 
*37 Trai
 
*36 Rangbah
 
*36 pisa
 
*36 pule
 
*36 kren
 
*36 bym
 
*35 kumno
 
*35 Bhoi
 
*35 ïohi
 
*34 hok
 
==Step 6: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 73.59%
 
*Top unknown words in the corpus:
 
*34 daw
 
*34 surok
 
*34 Sangma
 
*34 donkam
 
*34 pyrshah
 
*34 o
 
*34 rukom
 
*33 Congress
 
*33 jing
 
*32 dep
 
*32 phah
 
*32 beit
 
*32 jingmut
 
*32 peit
 
*31 elekshon
 
*31 imlang
 
*31 un
 
*31 sahlang
 
*31 Jaiñtia
 
*31 pyndonkam
 
==Step 7: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 74.78%
 
*Top unknown words in the corpus:
 
*31 T
 
*31 sahlang
 
*31 C
 
*31 e
 
*30 daw
 
*30 ngin
 
*30 party
 
*30 poi
 
*30 lem
 
*29 bniah
 
*29 katba
 
*29 kaban
 
*29 Kong
 
*29 Chie
 
*29 kynhun
 
*29 s
 
*29 uwei
 
*28 dkhot
 
*28 rai
 
*28 Party
 
  
==Step 8: added all the above words (and accompanying disambiguations) ==
*Coverage: 75.88%
 
*Top unknown words in the corpus:
 
*30 of
 
*28 Unit
 
*28 nyngkong
 
*28 lynti
 
*28 khubor
 
*28 ktah
 
*28 tnad
 
*28 Mukul
 
*28 kynthei
 
*28 lai
 
*27 kat
 
*27 aiñ
 
*27 ïakhun
 
*27 satia
 
*27 jingjia
 
*27 SSA
 
*26 Hima
 
*26 kot
 
*26 shi
 
*26 K
 
*At this point, I also decided to add all the letters in the alphabet as individual words.
 
==Step 9: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 77.14%
 
*Top unknown words in the corpus:
 
*56 EVM
 
*26 jongka - unknown
 
*26 lap
 
*26 lei
 
*25 ïaka - unknown
 
*25 kynja
 
*25 ophis
 
*25 Committee
 
*25 UDP
 
*25 treikam
 
*25 lait
 
*24 Suk
 
*24 bishar
 
*24 im
 
*24 ïadei
 
*24 jingïarap
 
*24 tit
 
*24 riew
 
*23 Nalor
 
*23 shna
 
==Step 10: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 78.15%
 
*Top unknown words in the corpus:
 
*23 Election
 
*23 sniew
 
*23 General
 
*23 pynban
 
*23 kmen
 
*22 thain
 
*22 kyrpang
 
*22 dawa
 
*22 ïai
 
*22 Pynursla
 
*22 kdew
 
*21 Iaiong
 
*21 Association
 
*21 Secondary
 
*21 Sports
 
*21 riti
 
*21 phai
 
*21 Jowai
 
==Step 11: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 78.91%
*Top unknown words in the corpus:
 
*21 MP
 
*21 shitom
 
*21 burom
 
*21 palat
 
*21 Block
 
*20 tyngka
 
*20 sakhi
 
*20 Raid
 
*20 shuwa
 
*20 khmih
 
*20 juh
 
*20 Court
 
*20 jait
 
*20 thong
 
*20 pulit
 
*20 Lyngdoh
 
==Step 12: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 79.62%
*Top unknown words in the corpus:
*20 dor
*20 ma
*20 thong
*20 man
*20 heh
*20 kumjuh
*19 kajuh
*19 nam
*19 North
*19 barim
*19 jingmyntoi
*19 Constituency
*19 sngewthuh
*19 hajar
*19 kut
*19 thoh
*19 pud
*19 hangne
 
==Step 13: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 80.31%
*Top unknown words in the corpus:
*19 Officer
*19 madan
*19 jop
*18 kyrpad
*18 kyrwoh
*18 lawei
*18 Niam
*18 jia
*18 plie
*18 haka - unsure
*18 kit
*18 jingeh
*18 nongrim
*17 vote
*17 Kynrad
*17 pyrshang
*17 paitbah
 
==Step 14: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 80.94%
*Top unknown words in the corpus:
*17 hynne
*17 sem
*17 jaidbynriew
*17 beh
*17 jingïalehkai
*17 Lajong
*17 jingrakhe
*17 ïalehkai
*17 masi
*17 Dorbar
*17 pdiang
*17 katkum
*16 Meet
*16 kloi
*16 dustur
*16 kem
*16 kaei
 
  
  
==Step 15: added all the above words (and accompanying disambiguations) ==
*Coverage: 81.56%
*Top unknown words in the corpus:
*16 shaphang
*16 arngut - unsure
*16 jingpang
*16 nongkyndong
*16 Nongpoh
*16 roi
*16 Day
*16 pynshai
*16 lak
*16 drok
*16 National
*16 kyrtong
*16 ophisar
*15 sien
*15 nym
*15 shakhmat
*15 rung
 
==Step 16: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 82.06%
 
*Top unknown words in the corpus:
 
*15 Circle
 
*15 Good
 
*15 hiar
 
*15 synshar
 
*15 Hato
 
*15 klur
 
*15 Khanna
 
*15 kynthoh
 
*15 katne
 
*15 kper
 
*15 prokram
 
*15 jingkyrshan
 
*15 Nongstoiñ
 
*15 Upper
 
*14 lut
 
*14 kynnoh
 
==Step 17: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 82.60%
 
*Top unknown words in the corpus:
 
*14 be
 
*14 sngewbha
 
*14 Vinod
 
*14 pynbna
 
*14 rep
 
*14 bynrap
 
*14 hukum
 
*14 par
 
*14 khang
 
*14 ïashim
 
*14 shynrang
 
*14 College
 
*14 Commission
 
*14 nongmihkhmat
 
*14 kyrdan
 
*14 bampong
 
==Step 18: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 83.03%
*Top unknown words in the corpus:
*14 kular
*14 nong
*14 khraw
*14 sngewbha
*14 Bar
*14 kumne
*13 pilkrim
*13 KSU
*13 Khristan
*13 pdeng
*13 wad
*13 pan
*13 duh
*13 sengbhalang
*13 longkmie
*13 ap
 
==Step 19: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 83.54%
*Top unknown words in the corpus:
*13 runar
*13 Mawlai
*13 thep
*13 kyndon
*13 ïap
*13 jingnang
*13 bala
*13 pilkrim
*13 pynkhreh
*13 jingstad
*13 katto
*13 ding
*12 shakri
*12 san
*12 siew
*12 phang
 
  
*Coverage test command: <code>aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin</code>
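The figures in each step above (a coverage percentage and a frequency-ranked list of unknown words) can be illustrated with a minimal sketch. It assumes "naive coverage" means the share of tokens that receive at least one analysis; it does not reproduce how <code>aq-covtest</code> itself works. The function name <code>coverage_report</code>, the toy lexicon, and the toy token list are all hypothetical, and a simple set lookup stands in for the compiled transducer.

```python
from collections import Counter


def coverage_report(tokens, is_known, top_n=20):
    """Return (coverage in percent, top_n most frequent unknown tokens)."""
    unknown = Counter(t for t in tokens if not is_known(t))
    known = len(tokens) - sum(unknown.values())
    coverage = 100.0 * known / len(tokens) if tokens else 0.0
    return coverage, unknown.most_common(top_n)


# Toy data for illustration only: a set lookup replaces the real analyser.
lexicon = {"ka", "u", "ki"}
tokens = "ka kam u kum ka kam ki".split()

cov, top = coverage_report(tokens, lambda t: t in lexicon)
print(f"Coverage: {cov:.2f}%")
for word, count in top:
    print(count, word)
```

Adding the words at the top of the unknown list to the lexicon and rerunning the report corresponds to one "Step" in the log above.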
 
 
[[Category:sp17_FinalProjects]]
 

Latest revision as of 12:21, 11 May 2017

==The Project==

For my final project, I expanded my transducer until it reached 85% coverage, which involved adding words, disambiguations, and extra suffixes.

My code can be found here: https://github.com/nfeldbaum/Khasi_Transducer

==The Results==

I managed to reach 85.15% coverage on my initial 50,000-word corpus. However, to reach this figure I ended up adding a significant number of English words that occur in the corpus. A truer measure of coverage is the 80.50% I get when I delete all the words that aren't part of Khasi.

Instead of testing precision and recall against hand-annotated, randomly selected forms, I decided to gather a further 25,000 words in order to test my transducer on a corpus it had not been developed against. This new corpus can be found in the repository linked above under ling073-kha-corpus/kha.corpus.large.test.txt. On this test corpus, I achieved 83.39% coverage with English words included and 80.50% coverage without English words, exactly the same as on my training corpus.
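The page does not spell out how the English words were excluded, so the sketch below shows two possible readings side by side: deleting the English entries from the lexicon, or deleting the English tokens from the corpus before measuring. Both are assumptions, and the token list, lexicon, and English word list are toy data invented for illustration; a set lookup again stands in for the transducer.

```python
def coverage(tokens, lexicon):
    """Percentage of tokens that are found in the lexicon."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in lexicon)
    return 100.0 * known / len(tokens)


# Toy data, invented for illustration only.
tokens = ["ka", "skul", "School", "u", "District", "kam"]
lexicon = {"ka", "skul", "School", "u", "District"}
english = {"School", "District"}  # hypothetical list of non-Khasi entries

# Coverage with the English entries left in the lexicon.
with_english = coverage(tokens, lexicon)

# Reading 1: delete the English entries from the lexicon, keep the corpus.
lexicon_filtered = coverage(tokens, lexicon - english)

# Reading 2: delete the English tokens from the corpus before measuring.
corpus_filtered = coverage([t for t in tokens if t not in english], lexicon)

print(f"{with_english:.2f} {lexicon_filtered:.2f} {corpus_filtered:.2f}")
```

The two readings generally give different numbers, so whichever one is used should be applied in the same way to both the original corpus and the test corpus when comparing them.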