Difference between revisions of "Khasi/Final Project"

From LING073
(Pre-Final Project)
Line 1: Line 1:
 
==Pre-Final Project==
 
Number of tokenised words in the corpus: 57847
+
*Number of tokenised words in the corpus: 57847
Coverage: 57.26%
+
*Coverage: 57.26%
Top unknown words in the corpus:
+
*Top unknown words in the corpus:
206 kam
+
*206 kam
200 kum
+
*200 kum
179 Bah
+
*179 Bah
172 kiwei
+
*172 kiwei
171 bynta
+
*171 bynta
170 baroh
+
*170 baroh
166 lah
+
*166 lah
159 pat
+
*159 pat
147 mynta
+
*147 mynta
130 noh
+
*130 noh
125 paidbah
+
*125 paidbah
124 ne
+
*124 ne
122 ïoh
+
*122 ïoh
119 por
+
*119 por
118 wan
+
*118 wan
118 Shillong
+
*118 Shillong
117 namar
+
*117 namar
117 Khasi
+
*117 Khasi
112 katei
+
*112 katei
111 Jylla
+
*111 Jylla
 
==Step 1: added all the above words (and accompanying disambiguations) except for Jylla==
 
Coverage: 62.61%
+
*Coverage: 62.61%
Top unknown words in the corpus:
+
*Top unknown words in the corpus:
111 Jylla
+
*111 Jylla
109 tang
+
*109 tang
108 ym
+
*108 ym
106 shuh
+
*106 shuh
102 haduh
+
*102 haduh
100 skul
+
*100 skul
98 sorkar
+
*98 sorkar
97 Seng
+
*97 Seng
96 briew
+
*96 briew
90 M
+
*90 M
88 jylla
+
*88 jylla
88 kala
+
*88 kala
87 ri
+
*87 ri
87 lang
+
*87 lang
85 E
+
*85 E
83 kine
+
*83 kine
82 seng
+
*82 seng
77 Meghalaya
+
*77 Meghalaya
77 tarik
+
*77 tarik
74 naduh
+
*74 naduh
  
 
[[Category:sp17_FinalProjects]]
 

Revision as of 23:56, 1 May 2017

Pre-Final Project

  • Number of tokenised words in the corpus: 57847
  • Coverage: 57.26%
  • Top unknown words in the corpus:
  • 206 kam
  • 200 kum
  • 179 Bah
  • 172 kiwei
  • 171 bynta
  • 170 baroh
  • 166 lah
  • 159 pat
  • 147 mynta
  • 130 noh
  • 125 paidbah
  • 124 ne
  • 122 ïoh
  • 119 por
  • 118 wan
  • 118 Shillong
  • 117 namar
  • 117 Khasi
  • 112 katei
  • 111 Jylla
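
The figures above (token count, coverage, and a frequency-ranked list of unknown words) can be produced along the following lines. This is only a minimal sketch, not the script used for this project. It assumes the analyser output is in Apertium stream format, where each token is written as ^surface/analysis$ and an unanalysed token as ^surface/*surface$. The function name coverage_report, the sample string, and the <prn> tag are hypothetical.

```python
import re
from collections import Counter

# Assumption: Apertium stream format. A token looks like ^surface/analysis$
# and an unknown token looks like ^surface/*surface$.
TOKEN_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage_report(analysed_text, top_n=20):
    """Return (token count, coverage in percent, top unknown words)."""
    total = 0
    unknown = Counter()
    for surface, analyses in TOKEN_RE.findall(analysed_text):
        total += 1
        # A leading '*' in the analysis part marks an unknown word.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    coverage = 100.0 * known / total if total else 0.0
    return total, coverage, unknown.most_common(top_n)


# Hypothetical sample; the <prn> tag is illustrative only.
sample = "^ka/ka<prn>$ ^kam/*kam$ ^kum/*kum$ ^kam/*kam$"
total, coverage, top = coverage_report(sample)
print(total, round(coverage, 2), top)
```

Counting surface forms case-sensitively, as here, would be consistent with Jylla and jylla appearing as separate entries in the lists on this page.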

Step 1: added all the above words (and accompanying disambiguations) except for Jylla

  • Coverage: 62.61%
  • Top unknown words in the corpus:
  • 111 Jylla
  • 109 tang
  • 108 ym
  • 106 shuh
  • 102 haduh
  • 100 skul
  • 98 sorkar
  • 97 Seng
  • 96 briew
  • 90 M
  • 88 jylla
  • 88 kala
  • 87 ri
  • 87 lang
  • 85 E
  • 83 kine
  • 82 seng
  • 77 Meghalaya
  • 77 tarik
  • 74 naduh
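
As a rough check on Step 1, the coverage gain can be compared with the token counts of the words that were added. The sketch below uses only the numbers reported on this page; the variable names are illustrative. The 19 added words account for 2772 tokens, about 4.79 percentage points, while the reported gain is 5.35 points. The page does not say where the remaining difference comes from.

```python
# Figures taken from the lists on this page.
total_tokens = 57847
coverage_before = 57.26  # percent, Pre-Final Project
coverage_after = 62.61   # percent, after Step 1

# Unknown-word counts for every word added in Step 1 (all except Jylla).
added = {
    "kam": 206, "kum": 200, "Bah": 179, "kiwei": 172, "bynta": 171,
    "baroh": 170, "lah": 166, "pat": 159, "mynta": 147, "noh": 130,
    "paidbah": 125, "ne": 124, "ïoh": 122, "por": 119, "wan": 118,
    "Shillong": 118, "namar": 117, "Khasi": 117, "katei": 112,
}

added_tokens = sum(added.values())
# Gain expected if only these exact surface forms became known.
direct_gain = 100.0 * added_tokens / total_tokens
observed_gain = coverage_after - coverage_before

print(f"tokens covered by added words: {added_tokens}")
print(f"direct gain:   {direct_gain:.2f} percentage points")
print(f"observed gain: {observed_gain:.2f} percentage points")
```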