Difference between revisions of "User:Nfeldba1/Final project"

From LING073
(Step 15: added all the above words (and accompanying disambiguations))
 
(8 intermediate revisions by the same user not shown)
Line 1: Line 1:
==Pre-Final Project==
+
==The Project==
*Number of tokenised words in the corpus: 57847
+
For my final project, I expanded my transducer until it reached 85% coverage, which involved adding words, disambiguations, and extra suffixes.
*Coverage: 57.26%
 
*Top unknown words in the corpus:
 
*206 kam
 
*200 kum
 
*179 Bah
 
*172 kiwei
 
*171 bynta
 
*170 baroh
 
*166 lah
 
*159 pat
 
*147 mynta
 
*130 noh
 
*125 paidbah
 
*124 ne
 
*122 ïoh
 
*119 por
 
*118 wan
 
*118 Shillong
 
*117 namar
 
*117 Khasi
 
*112 katei
 
*111 Jylla
 
==Step 1: added all the above words (and accompanying disambiguations)==
 
*Coverage: 62.96%
 
*Top unknown words in the corpus:
 
*109 tang
 
*108 ym
 
*106 shuh
 
*102 haduh
 
*100 skul
 
*98 sorkar
 
*97 Seng
 
*96 briew
 
*90      M
 
*88 kala
 
*87 ri
 
*87 lang
 
*85 E
 
*83 kine
 
*82 seng
 
*77 Meghalaya
 
*77 tarik
 
*74 naduh
 
*74 haba
 
*74 shu
 
I'm not sure why M and E are appearing.
 
  
==Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala ==
+
My code can be found here: https://github.com/nfeldbaum/Khasi_Transducer
*Coverage: 66.05%
 
*Top unknown words in the corpus (excluding M, E, and kala):
 
*72 bun
 
*72 hi
 
*71 BJP
 
*71 District
 
*70 April
 
*69 samla
 
*68 ryngkat
 
*67 liang
 
*63 bor
 
*63 shim
 
*63 MLA
 
*62 lyngba
 
*62 sdang
 
*62 India
 
*60 shah
 
*60 tylli
 
*59 kim
 
==Step 3: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 68.10%
 
*Top unknown words in the corpus:
 
*59 kynthup
 
*59 bapher
 
*58 President
 
*58 ioh
 
*58 khynnah
 
*58 ula
 
*58 ei
 
*57 School
 
*56 lada
 
*56 Secretary
 
*54 khnang
 
*54 wat
 
*54 kumba
 
*53 Hills
 
*52 Dr
 
*51 biang
 
*50 ju
 
==Step 4: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 70.14%
 
*Top unknown words in the corpus:
 
*49 hadien
 
*49 hap
 
*49 Ïaiong
 
*48 Myntri
 
*48 thaiñ
 
*47 bnai
 
*46 ïaid
 
*46 D
 
*45 jingïalang
 
*45 nongïalam
 
*45 jur
 
*45 kiei
 
*44 ngut
 
*43 lad
 
*41 kali
 
*41 nonghikai
 
*41 wanrah
 
*40 tynrai
 
==Step 5: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 72.18%
 
*Top unknown words in the corpus:
 
*40 kyntu
 
*40 kyrteng
 
*40 ïathuh
 
*40 kino
 
*40 lympung
 
*39 eh
 
*38 khamtam
 
*38 kyntiew
 
*38 rangbah
 
*37 shisha
 
*37 Trai
 
*36 Rangbah
 
*36 pisa
 
*36 pule
 
*36 kren
 
*36 bym
 
*35 kumno
 
*35 Bhoi
 
*35 ïohi
 
*34 hok
 
==Step 6: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 73.59%
 
*Top unknown words in the corpus:
 
*34 daw
 
*34 surok
 
*34 Sangma
 
*34 donkam
 
*34 pyrshah
 
*34 o
 
*34 rukom
 
*33 Congress
 
*33 jing
 
*32 dep
 
*32 phah
 
*32 beit
 
*32 jingmut
 
*32 peit
 
*31 elekshon
 
*31 imlang
 
*31 un
 
*31 sahlang
 
*31 Jaiñtia
 
*31 pyndonkam
 
==Step 7: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 74.78%
 
*Top unknown words in the corpus:
 
*31 T
 
*31 sahlang
 
*31 C
 
*31 e
 
*30 daw
 
*30 ngin
 
*30 party
 
*30 poi
 
*30 lem
 
*29 bniah
 
*29 katba
 
*29 kaban
 
*29 Kong
 
*29 Chie
 
*29 kynhun
 
*29 s
 
*29 uwei
 
*28 dkhot
 
*28 rai
 
*28 Party
 
  
==Step 8: added all the above words (and accompanying disambiguations) ==
+
==The Results==
*Coverage: 75.88%
 
*Top unknown words in the corpus:
 
*30 of
 
*28 Unit
 
*28 nyngkong
 
*28 lynti
 
*28 khubor
 
*28 ktah
 
*28 tnad
 
*28 Mukul
 
*28 kynthei
 
*28 lai
 
*27 kat
 
*27 aiñ
 
*27 ïakhun
 
*27 satia
 
*27 jingjia
 
*27 SSA
 
*26 Hima
 
*26 kot
 
*26 shi
 
*26 K
 
*At this point, I also decided to add all the letters in the alphabet as individual words.
 
==Step 9: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 77.14%
 
*Top unknown words in the corpus:
 
*56 EVM
 
*26 jongka - unknown
 
*26 lap
 
*26 lei
 
*25 ïaka - unknown
 
*25 kynja
 
*25 ophis
 
*25 Committee
 
*25 UDP
 
*25 treikam
 
*25 lait
 
*24 Suk
 
*24 bishar
 
*24 im
 
*24 ïadei
 
*24 jingïarap
 
*24 tit
 
*24 riew
 
*23 Nalor
 
*23 shna
 
==Step 10: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 78.15%
 
*Top unknown words in the corpus:
 
*23 Election
 
*23 sniew
 
*23 General
 
*23 pynban
 
*23 kmen
 
*22 thain
 
*22 kyrpang
 
*22 dawa
 
*22 ïai
 
*22 Pynursla
 
*22 kdew
 
*21 Iaiong
 
*21 Association
 
*21 Secondary
 
*21 Sports
 
*21 riti
 
*21 phai
 
*21 Jowai
 
==Step 11: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 78.91%
 
*Top unknown words in the corpus:
 
*21 MP
 
*21 shitom
 
*21 burom
 
*21 palat
 
*21 Block
 
*20 tyngka
 
*20 sakhi
 
*20 Raid
 
*20 shuwa
 
*20 khmih
 
*20 juh
 
*20 Court
 
*20 jait
 
*20 thong
 
*20 pulit
 
*20 Lyngdoh
 
==Step 12: added all the above words (and accompanying disambiguations) ==
 
Coverage: 79.62%
 
Top unknown words in the corpus:
 
20 dor
 
20 ma
 
20 thong
 
20 man
 
20 heh
 
20 kumjuh
 
19 kajuh
 
19 nam
 
19 North
 
19 barim
 
19 jingmyntoi
 
19 Constituency
 
19 sngewthuh
 
19 hajar
 
19 kut
 
19 thoh
 
19 pud
 
19 hangne
 
==Step 13: added all the above words (and accompanying disambiguations) ==
 
Coverage: 80.31%
 
Top unknown words in the corpus:
 
19 Officer
 
19 madan
 
19 jop
 
18 kyrpad
 
18 kyrwoh
 
18 lawei
 
18 Niam
 
18 jia
 
18 plie
 
18 haka - unsure
 
18 kit
 
18 jingeh
 
18 nongrim
 
17 vote
 
17 Kynrad
 
17 pyrshang
 
17 paitbah
 
==Step 14: added all the above words (and accompanying disambiguations) ==
 
Coverage: 80.94%
 
Top unknown words in the corpus:
 
17 hynne
 
17 sem
 
17 jaidbynriew
 
17 beh
 
17 jingïalehkai
 
17 Lajong
 
17 jingrakhe
 
17 ïalehkai
 
17 masi
 
17 Dorbar
 
17 pdiang
 
17 katkum
 
16 Meet
 
16 kloi
 
16 dustur
 
16 kem
 
16 kaei
 
  
 +
I reached 85.15% coverage on my initial 50,000-word corpus! However, getting to this figure meant adding a significant number of the English words that appear in the corpus. A truer measure of coverage is the 80.50% I get after deleting all the words that aren't Khasi.
  
==Step 15: added all the above words (and accompanying disambiguations) ==
+
Instead of testing precision and recall against hand-annotated, randomly selected forms, I gathered a further 25,000 words to test my transducer on a corpus it had not been developed on. This new corpus can be found in the repository linked above under ling073-kha-corpus/kha.corpus.large.test.txt. On this test corpus, I achieved 83.39% coverage with English words included and 80.50% coverage without English words. The latter figure is exactly the same as on my training corpus.
Coverage: 81.56%
 
Top unknown words in the corpus:
 
16 shaphang
 
16 arngut - unsure
 
16 jingpang
 
16 nongkyndong
 
16 Nongpoh
 
16 roi
 
16 Day
 
16 pynshai
 
16 lak
 
16 drok
 
16 National
 
16 kyrtong
 
16 ophisar
 
15 sien
 
15 nym
 
15 shakhmat
 
15 rung
 
==Step 16: added all the above words (and accompanying disambiguations) ==
 
Coverage: 82.06%
 
Top unknown words in the corpus:
 
16 arngut
 
15 Circle
 
15 Good
 
15 hiar
 
15 synshar
 
15 Hato
 
15 klur
 
15 Khanna
 
15 kynthoh
 
15 katne
 
15 kper
 
15 prokram
 
15 jingkyrshan
 
15 Nongstoiñ
 
15 Upper
 
14 lut
 
14 kynnoh
 
  
*aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
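
As a rough illustration of what a coverage check like the one above measures, here is a minimal Python sketch. It assumes the corpus has already been run through the analyser and that the output is in Apertium stream format, where each token appears as ^surface/analysis$ and an unrecognised form is marked with a leading asterisk. This is not the implementation of aq-covtest; the function name coverage_report, the sample string, and the tag in it are invented for the example, and escaped characters in the stream are ignored for simplicity.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplification: escaped characters such as \/ are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage_report(analysed_text, top_n=20):
    """Return (coverage percentage, most frequent unknown forms)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An unrecognised form is echoed back with a leading '*'.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    coverage = 100.0 * known / total if total else 0.0
    return coverage, unknown.most_common(top_n)


# Hypothetical analyser output; the <det> tag is made up for illustration.
sample = "^ka/ka<det>$ ^kam/*kam$ ^kum/*kum$ ^kam/*kam$"
coverage, top_unknown = coverage_report(sample)
print(f"Coverage: {coverage:.2f}%")
for word, count in top_unknown:
    print(count, word)
```

Each step listed above corresponds to one pass of this kind: measure coverage, read off the most frequent unknown forms, add them to the transducer, and measure again.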
 
 
[[Category:sp17_FinalProjects]]
 

Latest revision as of 13:21, 11 May 2017

The Project

For my final project, I expanded my transducer until it reached 85% coverage, which involved adding words, disambiguations, and extra suffixes.

My code can be found here: https://github.com/nfeldbaum/Khasi_Transducer

The Results

I reached 85.15% coverage on my initial 50,000-word corpus! However, getting to this figure meant adding a significant number of the English words that appear in the corpus. A truer measure of coverage is the 80.50% I get after deleting all the words that aren't Khasi.

Instead of testing precision and recall against hand-annotated, randomly selected forms, I gathered a further 25,000 words to test my transducer on a corpus it had not been developed on. This new corpus can be found in the repository linked above under ling073-kha-corpus/kha.corpus.large.test.txt. On this test corpus, I achieved 83.39% coverage with English words included and 80.50% coverage without English words. The latter figure is exactly the same as on my training corpus.
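
To make the difference between the "with English words" and "without English words" figures concrete, here is a small Python sketch under stated assumptions. A plain set of word forms stands in for the transducer, and the exclusion list of English words is hypothetical; how the non-Khasi words were actually identified and removed is not described here. The function name coverage and the toy token list are likewise invented for the example.

```python
def coverage(tokens, known_words, excluded=frozenset()):
    """Percentage of tokens found in known_words, after dropping excluded tokens."""
    kept = [t for t in tokens if t not in excluded]
    if not kept:
        return 0.0
    recognised = sum(1 for t in kept if t in known_words)
    return 100.0 * recognised / len(kept)


# Toy data: a set of forms stands in for the transducer's lexicon.
tokens = ["ka", "sorkar", "District", "kam", "School", "ka"]
lexicon = {"ka", "sorkar", "District", "School"}
english = {"District", "School"}  # hypothetical exclusion list

with_english = coverage(tokens, lexicon)
without_english = coverage(tokens, lexicon, excluded=english)
print(f"With English words:    {with_english:.2f}%")
print(f"Without English words: {without_english:.2f}%")
```

In this toy case the excluded words were all recognised, so removing them lowers coverage. That mirrors the drop from 85.15% to 80.50% on the initial corpus.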