Difference between revisions of "Khasi/Final Project"

From LING073

Revision as of 18:28, 2 May 2017

Pre-Final Project

  • Number of tokenised words in the corpus: 57847
  • Coverage: 57.26%
  • Top unknown words in the corpus:
  • 206 kam
  • 200 kum
  • 179 Bah
  • 172 kiwei
  • 171 bynta
  • 170 baroh
  • 166 lah
  • 159 pat
  • 147 mynta
  • 130 noh
  • 125 paidbah
  • 124 ne
  • 122 ïoh
  • 119 por
  • 118 wan
  • 118 Shillong
  • 117 namar
  • 117 Khasi
  • 112 katei
  • 111 Jylla
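
A minimal sketch of how a naive coverage figure and a list of top unknown words like the one above could be produced. This is an illustration only, not how aq-covtest works internally: the `known_forms` set is a hypothetical stand-in for the morphological analyser, and the regex tokeniser is an assumption.

```python
import re
from collections import Counter


def coverage_report(text, known_forms, top_n=20):
    """Return (token_count, coverage_percent, top_unknown_words).

    known_forms is a hypothetical set of lowercased word forms standing in
    for whatever the analyser recognises.
    """
    # Naive tokenisation: runs of word characters (Unicode-aware, so "ïoh" is one token).
    tokens = re.findall(r"\w+", text)
    # Count every token whose lowercased form is not recognised.
    unknown = Counter(t for t in tokens if t.lower() not in known_forms)
    known_count = len(tokens) - sum(unknown.values())
    coverage = 100.0 * known_count / len(tokens) if tokens else 0.0
    return len(tokens), coverage, unknown.most_common(top_n)


if __name__ == "__main__":
    sample = "ka kam ka kum kam Bah"
    total, pct, top = coverage_report(sample, {"ka"})
    print(total, f"{pct:.2f}%")
    for word, count in top:
        print(count, word)
```

The unknown words are reported with their original capitalisation, which matches the way the lists here show both Seng and seng as separate entries.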

Step 1: added all the above words (and accompanying disambiguations)

  • Coverage: 62.96%
  • Top unknown words in the corpus:
  • 109 tang
  • 108 ym
  • 106 shuh
  • 102 haduh
  • 100 skul
  • 98 sorkar
  • 97 Seng
  • 96 briew
  • 90 M
  • 88 kala
  • 87 ri
  • 87 lang
  • 85 E
  • 83 kine
  • 82 seng
  • 77 Meghalaya
  • 77 tarik
  • 74 naduh
  • 74 haba
  • 74 shu

I'm not sure why M and E appear as unknown words, and I couldn't find a definition for kala.

Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala

  • Coverage: 66.05%
  • Top unknown words in the corpus (excluding M, E, and kala):
  • 72 bun
  • 72 hi
  • 71 BJP
  • 71 District
  • 70 April
  • 69 samla
  • 68 ryngkat
  • 67 liang
  • 63 bor
  • 63 shim
  • 63 MLA
  • 62 lyngba
  • 62 sdang
  • 62 India
  • 60 shah
  • 60 tylli
  • 59 kim
  • Command used to measure coverage: aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin
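
As a rough sanity check on the coverage figures, the frequencies of the words added at each step can be turned into an expected gain in percentage points and compared with the reported gain. The sketch below assumes the token count of 57847 stays the same across steps (it is only reported for the pre-final run) and that each listed frequency counts tokens that become known once the word is added; `expected_gain` is a hypothetical helper name.

```python
TOTAL_TOKENS = 57847  # reported for the pre-final run; assumed unchanged afterwards

# Frequencies of the words added in Step 1 (the pre-final top unknown words).
added_step1 = [206, 200, 179, 172, 171, 170, 166, 159, 147, 130,
               125, 124, 122, 119, 118, 118, 117, 117, 112, 111]

# Frequencies of the words added in Step 2 (Step 1 list without M, E, and kala).
added_step2 = [109, 108, 106, 102, 100, 98, 97, 96, 87, 87,
               83, 82, 77, 77, 74, 74, 74]


def expected_gain(frequencies, total_tokens):
    """Percentage points of coverage gained if every listed token becomes known."""
    return 100.0 * sum(frequencies) / total_tokens


if __name__ == "__main__":
    reported = [57.26, 62.96, 66.05]
    steps = [("Step 1", added_step1), ("Step 2", added_step2)]
    for i, (name, freqs) in enumerate(steps):
        actual = reported[i + 1] - reported[i]
        print(f"{name}: expected {expected_gain(freqs, TOTAL_TOKENS):.2f} pp, "
              f"reported {actual:.2f} pp")
```

If the reported gain is larger than the expected one, a plausible explanation is that adding a stem also covers other forms of the same word that were not in the top-unknown list, but that is a guess rather than something these figures confirm.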