Difference between revisions of "User:Nfeldba1/Final project"

From LING073
 
(15 intermediate revisions by the same user not shown)
Line 1: Line 1:
==Pre-Final Project==
+
==The Project==
*Number of tokenised words in the corpus: 57847
+
For my final project, I expanded my transducer until it reached 85% coverage, which involved adding words, disambiguations, and extra suffixes.
*Coverage: 57.26%
 
*Top unknown words in the corpus:
 
*206 kam
 
*200 kum
 
*179 Bah
 
*172 kiwei
 
*171 bynta
 
*170 baroh
 
*166 lah
 
*159 pat
 
*147 mynta
 
*130 noh
 
*125 paidbah
 
*124 ne
 
*122 ïoh
 
*119 por
 
*118 wan
 
*118 Shillong
 
*117 namar
 
*117 Khasi
 
*112 katei
 
*111 Jylla
 
==Step 1: added all the above words (and accompanying disambiguations)==
 
*Coverage: 62.96%
 
*Top unknown words in the corpus:
 
*109 tang
 
*108 ym
 
*106 shuh
 
*102 haduh
 
*100 skul
 
*98 sorkar
 
*97 Seng
 
*96 briew
 
*90 M
 
*88 kala
 
*87 ri
 
*87 lang
 
*85 E
 
*83 kine
 
*82 seng
 
*77 Meghalaya
 
*77 tarik
 
*74 naduh
 
*74 haba
 
*74 shu
 
I'm not sure why M and E are appearing.
 
  
==Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala ==
+
My code can be found here: https://github.com/nfeldbaum/Khasi_Transducer
*Coverage: 66.05%
 
*Top unknown words in the corpus (excluding M, E, and kala):
 
*72 bun
 
*72 hi
 
*71 BJP
 
*71 District
 
*70 April
 
*69 samla
 
*68 ryngkat
 
*67 liang
 
*63 bor
 
*63 shim
 
*63 MLA
 
*62 lyngba
 
*62 sdang
 
*62 India
 
*60 shah
 
*60 tylli
 
*59 kim
 
==Step 3: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 68.10%
 
*Top unknown words in the corpus:
 
*59 kynthup
 
*59 bapher
 
*58 President
 
*58 ioh
 
*58 khynnah
 
*58 ula
 
*58 ei
 
*57 School
 
*56 lada
 
*56 Secretary
 
*54 khnang
 
*54 wat
 
*54 kumba
 
*53 Hills
 
*52 Dr
 
*51 biang
 
*50 ju
 
==Step 4: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 70.14%
 
*Top unknown words in the corpus:
 
*49 hadien
 
*49 hap
 
*49 Ïaiong
 
*48 Myntri
 
*48 thaiñ
 
*47 bnai
 
*46 ïaid
 
*46 D
 
*45 jingïalang
 
*45 nongïalam
 
*45 jur
 
*45 kiei
 
*44 ngut
 
*43 lad
 
*41 kali
 
*41 nonghikai
 
*41 wanrah
 
*40 tynrai
 
==Step 5: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 72.18%
*Top unknown words in the corpus:
*40 kyntu
*40 kyrteng
*40 ïathuh
*40 kino
*40 lympung
*39 eh
*38 khamtam
*38 kyntiew
*38 rangbah
*37 shisha
*37 Trai
*36 Rangbah
*36 pisa
*36 pule
*36 kren
*36 bym
*35 kumno
*35 Bhoi
*35 ïohi
*34 hok
 
*Coverage measured with: aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
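The coverage figure and unknown-word list reported at each step can be approximated from analyser output. The following is a minimal sketch, not the aq-covtest implementation. It assumes the corpus has already been run through the analyser and is in Apertium stream format, where each token appears as ^surface/analysis$ and an unrecognised token has an analysis beginning with *. The function name coverage_report, the sample string, and the tags in it are hypothetical.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage_report(analysed_text, top_n=20):
    """Return (coverage percentage, top_n most frequent unknown surface forms)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An unrecognised token is marked by a '*' at the start of its analysis.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    coverage = 100.0 * known / total if total else 0.0
    return coverage, unknown.most_common(top_n)


# Hypothetical analyser output (tags are invented for illustration):
# two known tokens and two occurrences of one unknown token.
sample = "^ka/ka<det>$ ^kam/*kam$ ^jong/jong<pr>$ ^kam/*kam$"
coverage, top_unknown = coverage_report(sample)
print(f"Coverage: {coverage:.2f}%")
for word, count in top_unknown:
    print(count, word)
```

Each step above then amounts to adding the listed words to the lexicon, recompiling, and rerunning the count.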
 
  
 +
==The Results==
 +
 +
I reached 85.15% coverage on my initial 50,000-word corpus! However, getting to that figure involved adding a significant number of English words that appear in the corpus. A truer measure of corpus coverage is the 80.50% I get after deleting all the words that aren't part of Khasi.
 +
 +
Instead of testing precision and recall against hand-annotated, randomly selected forms, I gathered a further 25,000 words to test my transducer on a corpus it hadn't seen during development. This new corpus can be found in the repository linked above under ling073-kha-corpus/kha.corpus.large.test.txt. On this test corpus, I achieved 83.39% coverage with English words included and 80.50% coverage without English words, exactly the same as on my training corpus.
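The gap between the two figures comes from removing the English words from both the known count and the total. A small sketch of that calculation, with made-up data: the token list, lexicon, and set of English words below are toy values chosen for illustration (the words themselves come from the lists above), and the function name is hypothetical.

```python
def coverage(tokens, lexicon, exclude=frozenset()):
    """Percentage of tokens found in the lexicon, ignoring any token in `exclude`."""
    kept = [t for t in tokens if t not in exclude]
    if not kept:
        return 0.0
    known = sum(1 for t in kept if t in lexicon)
    return 100.0 * known / len(kept)


# Toy data only: not the real corpus or lexicon.
tokens = ["bynta", "sorkar", "BJP", "kam", "District", "shuh", "briew", "MLA"]
lexicon = {"bynta", "sorkar", "briew", "BJP", "District", "MLA"}
english = {"BJP", "District", "MLA"}

with_english = coverage(tokens, lexicon)
without_english = coverage(tokens, lexicon, exclude=english)
print(f"With English words:    {with_english:.2f}%")
print(f"Without English words: {without_english:.2f}%")
```

Because every English word in the toy lexicon is known, dropping them removes only known tokens, so the second figure is lower than the first.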
  
 
[[Category:sp17_FinalProjects]]
 

Latest revision as of 13:21, 11 May 2017
