Difference between revisions of "User:Nfeldba1/Final project"

From LING073
(Created page with "==Pre-Final Project== *Number of tokenised words in the corpus: 57847 *Coverage: 57.26% *Top unknown words in the corpus: *206 kam *200 kum *179 Bah *172 kiwei *171 bynta...")
 
 
(18 intermediate revisions by the same user not shown)
==Pre-Final Project==
*Number of tokenised words in the corpus: 57847
*Coverage: 57.26%
 
*Top unknown words in the corpus:
 
*206 kam
 
*200 kum
 
*179 Bah
 
*172 kiwei
 
*171 bynta
 
*170 baroh
 
*166 lah
 
*159 pat
 
*147 mynta
 
*130 noh
 
*125 paidbah
 
*124 ne
 
*122 ïoh
 
*119 por
 
*118 wan
 
*118 Shillong
 
*117 namar
 
*117 Khasi
 
*112 katei
 
*111 Jylla
 
==Step 1: added all the above words (and accompanying disambiguations)==
 
*Coverage: 62.96%
 
*Top unknown words in the corpus:
 
*109 tang
 
*108 ym
 
*106 shuh
 
*102 haduh
 
*100 skul
 
*98 sorkar
 
*97 Seng
 
*96 briew
 
*90 M
 
*88 kala
 
*87 ri
 
*87 lang
 
*85 E
 
*83 kine
 
*82 seng
 
*77 Meghalaya
 
*77 tarik
 
*74 naduh
 
*74 haba
 
*74 shu
 
I'm not sure why M and E are appearing as unknown words.
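
The jump from 57.26% to 62.96% in Step 1 can be sanity-checked against the first frequency list. The sketch below assumes that coverage is simply recognised tokens divided by the 57847 tokens in the corpus, and that each listed word was unknown in every one of its counted occurrences before being added; neither assumption is stated on this page.

```python
# Token count of the corpus, as reported under Pre-Final Project.
total_tokens = 57847

# Frequencies of the 20 top unknown words listed before Step 1 (kam ... Jylla).
counts = [206, 200, 179, 172, 171, 170, 166, 159, 147, 130,
          125, 124, 122, 119, 118, 118, 117, 117, 112, 111]

# Tokens that become recognised once all 20 words are in the lexicon.
added_tokens = sum(counts)

# Expected gain in percentage points under the assumptions above.
expected_gain = 100.0 * added_tokens / total_tokens

# Gain actually reported between the two coverage figures.
observed_gain = 62.96 - 57.26

print(f"tokens newly covered: {added_tokens}")
print(f"expected gain: {expected_gain:.2f} points")
print(f"observed gain: {observed_gain:.2f} points")
```

Under these assumptions the 20 words account for 2883 tokens, or about 4.98 percentage points, against a reported gain of 5.70 points. The page does not say what accounts for the remaining difference.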
 
  
==Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala ==
*Coverage: 66.05%
 
*Top unknown words in the corpus (excluding M, E, and kala):
*72 bun
 
*72 hi
*71 BJP
 
*71 District
*70 April
 
*69 samla
 
*68 ryngkat
 
*67 liang
 
*63 bor
 
*63 shim
 
*63 MLA
 
*62 lyngba
 
*62 sdang
 
*62 India
 
*60 shah
 
*60 tylli
 
*59 kim
 
==Step 3: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 68.10%
 
*Top unknown words in the corpus:
 
*59 kynthup
 
*59 bapher
 
*58 President
 
*58 ioh
 
*58 khynnah
 
*58 ula
 
*58 ei
 
*57 School
 
*56 lada
 
*56 Secretary
 
*54 khnang
 
*54 wat
 
*54 kumba
 
*53 Hills
 
*52 Dr
 
*51 biang
 
*50 ju
 
==Step 4: added all the above words (and accompanying disambiguations) ==
 
*Coverage: 70.14%
 
*Top unknown words in the corpus:
 
*49 hadien
 
*49 hap
 
*49 Ïaiong
 
*48 Myntri
 
*48 thaiñ
 
*47 bnai
 
*46 ïaid
 
*46 D
 
*45 jingïalang
 
*45 nongïalam
 
*45 jur
 
*45 kiei
 
*44 ngut
 
*43 lad
 
*41 kali
 
*41 nonghikai
 
*41 wanrah
 
*40 tynrai
 
  
*Command used to measure coverage: aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
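
Each step above reports a coverage percentage and a list of the most frequent unknown words. As a minimal sketch of how such a report can be computed (this is not aq-covtest's actual implementation), the following assumes the analyser output is in Apertium stream format, where an unrecognised token appears as ^word/*word$. The function name coverage_report and the tags in the sample string are hypothetical.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (escaped characters inside units are ignored in this sketch).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage_report(analysed_text, top_n=20):
    """Return (coverage percentage, most frequent unknown surface forms)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An analysis starting with '*' marks a token the analyser did not recognise.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    coverage = 100.0 * known / total if total else 0.0
    return coverage, unknown.most_common(top_n)


# Illustrative input only; the tags <prn> and <n> are made up for this example.
sample = "^ka/ka<prn>$ ^kam/*kam$ ^kum/*kum$ ^kam/*kam$ ^briew/briew<n>$"
coverage, top_unknown = coverage_report(sample)
print(f"Coverage: {coverage:.2f}%")
for word, count in top_unknown:
    print(count, word)
```

In the sample, two of the five tokens are recognised, so the sketch reports 40.00% coverage and lists kam (2) ahead of kum (1), mirroring the "count word" layout of the lists above.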
 
 
[[Category:sp17_FinalProjects]]
 

Latest revision as of 13:21, 11 May 2017

==The Project==

For my final project, I expanded my transducer until it reached 85% coverage, which involved adding words, disambiguations, and extra suffixes.

My code can be found here: https://github.com/nfeldbaum/Khasi_Transducer

==The Results==

I managed to reach 85.15% coverage on my initial 50,000-word corpus! However, part of that figure comes from adding a significant number of English words that appear in the corpus. A truer measure of coverage is the 80.50% I get when I delete all the words that aren't part of Khasi.

Instead of testing precision and recall against hand-annotated, randomly selected forms, I decided to gather a further 25,000 words in order to test my transducer on a corpus that was not used to build it. This new corpus can be found in the repository linked above under ling073-kha-corpus/kha.corpus.large.test.txt. On this test corpus, I achieved 83.39% coverage with English words included and 80.50% coverage without English words. The latter figure is exactly the same as on my training corpus.