Difference between revisions of "Khasi/Final Project"

Line 23: Line 23:
  *112 katei
  *111 Jylla
- ==Step 1: added all the above words (and accompanying disambiguations) except for Jylla==
+ ==Step 1: added all the above words (and accompanying disambiguations)==
- *Coverage: 62.61%
+ *Coverage: 62.96%
  *Top unknown words in the corpus:
- *111 Jylla
  *109 tang
  *108 ym
Line 35: Line 34:
  *97 Seng
  *96 briew
- *90 M
+ *90     M
- *88 jylla
  *88 kala
  *87 ri
Line 46: Line 44:
  *77 tarik
  *74 naduh
+ *74 haba
+ *74 shu
+ I'm not sure why M and E are appearing, and I can't define kala.
+ ==Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala ==
+ Coverage: 66.05%
+ Top unknown words in the corpus (excluding M, E, and kala):
+ 72 bun
+ 72 hi
+ 71 BJP
+ 71 District
+ 70 April
+ 69 samla
+ 68 ryngkat
+ 67 liang
+ 63 bor
+ 63 shim
+ 63 MLA
+ 62 lyngba
+ 62 sdang
+ 62 India
+ 60 shah
+ 60 tylli
+ 59 kim
  [[Category:sp17_FinalProjects]]
+ aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin

Revision as of 18:27, 2 May 2017

Pre-Final Project

  • Number of tokenised words in the corpus: 57847
  • Coverage: 57.26%
  • Top unknown words in the corpus:
  • 206 kam
  • 200 kum
  • 179 Bah
  • 172 kiwei
  • 171 bynta
  • 170 baroh
  • 166 lah
  • 159 pat
  • 147 mynta
  • 130 noh
  • 125 paidbah
  • 124 ne
  • 122 ïoh
  • 119 por
  • 118 wan
  • 118 Shillong
  • 117 namar
  • 117 Khasi
  • 112 katei
  • 111 Jylla
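
For readers unfamiliar with how figures like these can be derived, here is a minimal sketch of a naive coverage count. It does not reproduce the internals of the coverage tool used on this page; it only assumes the analyser output is in the Apertium stream format, where each token is written as `^surface/analysis$` and an unknown token has an analysis beginning with `*`. The function name, the sample string, and the `<det>` tag are hypothetical, and backslash-escaped characters in the stream are not handled.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (assumption: no backslash-escaped ^, / or $ inside the unit)
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)((?:/[^$^]*)*)\$")


def naive_coverage(analysed_text):
    """Return (total tokens, coverage in percent, Counter of unknown forms)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An unknown word is emitted with a single analysis starting with '*'
        if analyses.startswith("/*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    percent = 100.0 * known / total if total else 0.0
    return total, percent, unknown


# Hypothetical analyser output, for illustration only
sample = "^kam/*kam$ ^ka/ka<det>$ ^kam/*kam$ ^Bah/*Bah$"
total, percent, unknown = naive_coverage(sample)
top_unknown = unknown.most_common(20)  # same shape as the lists on this page
```

Sorting the unknown forms by frequency with `most_common` gives a list of the kind shown above, which is what makes "add the top unknown words, then re-measure" a workable loop.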

Step 1: added all the above words (and accompanying disambiguations)

  • Coverage: 62.96%
  • Top unknown words in the corpus:
  • 109 tang
  • 108 ym
  • 106 shuh
  • 102 haduh
  • 100 skul
  • 98 sorkar
  • 97 Seng
  • 96 briew
  • 90 M
  • 88 kala
  • 87 ri
  • 87 lang
  • 85 E
  • 83 kine
  • 82 seng
  • 77 Meghalaya
  • 77 tarik
  • 74 naduh
  • 74 haba
  • 74 shu
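
As a rough sanity check on the Step 1 gain, the counts from the Pre-Final Project list can be converted into percentage points of coverage. This is only a sketch: it assumes the corpus still has 57847 tokens at Step 1 and that each listed count is an exact surface-form count; the variable and function names are made up for illustration.

```python
TOTAL_TOKENS = 57847          # from the Pre-Final Project section
PRE_COVERAGE = 57.26          # percent, before Step 1
STEP1_COVERAGE = 62.96        # percent, after Step 1

# Top unknown words from the Pre-Final Project list
pre_final_unknown = {
    "kam": 206, "kum": 200, "Bah": 179, "kiwei": 172, "bynta": 171,
    "baroh": 170, "lah": 166, "pat": 159, "mynta": 147, "noh": 130,
    "paidbah": 125, "ne": 124, "ïoh": 122, "por": 119, "wan": 118,
    "Shillong": 118, "namar": 117, "Khasi": 117, "katei": 112, "Jylla": 111,
}


def coverage_points(count, total=TOTAL_TOKENS):
    """Percentage points of coverage that `count` tokens represent."""
    return 100.0 * count / total


listed_tokens = sum(pre_final_unknown.values())
expected_gain = coverage_points(listed_tokens)
observed_gain = STEP1_COVERAGE - PRE_COVERAGE

print(f"listed tokens: {listed_tokens}")
print(f"expected gain: {expected_gain:.2f} points")
print(f"observed gain: {observed_gain:.2f} points")
```

The 20 listed forms account for 2883 tokens, a little under 5 percentage points, while the measured gain is 5.70 points. One possible explanation (a guess, not something measured here) is that the new entries also cover other forms of the same words, such as differently capitalised variants.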

I'm not sure why M and E are appearing, and I can't define kala.
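
One way to investigate stray single-letter tokens like these is to print a few words of context around each occurrence in the corpus. The sketch below is a simple keyword-in-context helper; the function name and the sample sentence are hypothetical, and its tokenisation (runs of word characters, with punctuation split off) is an assumption that may differ from what the analyser does. A plausible but unconfirmed cause would be initials or abbreviations split at the full stop.

```python
import re


def contexts(text, target, width=3, limit=5):
    """Return up to `limit` snippets showing `target` with `width` tokens either side."""
    # Assumed tokenisation: word-character runs, punctuation as separate tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + ["[" + tok + "]"] + right))
            if len(hits) == limit:
                break
    return hits


# Made-up sample text, for illustration only
sample = "u Bah M. E. Example la wan"
snippets = contexts(sample, "M", width=2)

# To run on the real corpus, read the file named on this page, e.g.:
# with open("ling073-kha-corpus/kha.corpus.large.txt", encoding="utf-8") as f:
#     for line in contexts(f.read(), "M"):
#         print(line)
```

If the snippets show M and E next to full stops and names, treating them as abbreviations rather than dictionary words would be a reasonable next step.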

Step 2: added all the above words (and accompanying disambiguations) except M, E, and kala

  • Coverage: 66.05%
  • Top unknown words in the corpus (excluding M, E, and kala):
  • 72 bun
  • 72 hi
  • 71 BJP
  • 71 District
  • 70 April
  • 69 samla
  • 68 ryngkat
  • 67 liang
  • 63 bor
  • 63 shim
  • 63 MLA
  • 62 lyngba
  • 62 sdang
  • 62 India
  • 60 shah
  • 60 tylli
  • 59 kim

aq-covtest ling073-kha-corpus/kha.corpus.large.txt kha.automorf.bin