User:Qfeng1/Final project

From LING073
Latest revision as of 21:04, 14 May 2019

Our project expands on what we accomplished in class: a monolingual transducer for the Chechen language. The main goal is to increase the transducer's coverage rate over a large corpus. Our corpus was extracted from all Wikipedia pages written in Chechen, so it includes a wide range of vocabulary.

Major Steps

  • To expand the morphology (in the lexc and twol files)
  • To generate a list of the most frequent unknown words in the corpus and add more word stems to the lexc file
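
The second step can be sketched in Python. This is only an illustration, not the script used in the project: it assumes the analysed corpus is in Apertium stream format, where an unknown word appears as ^surface/*surface$. The function name top_unknown_words and the sample text are hypothetical.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (assumption: the analysed corpus uses this format; escaped characters are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def top_unknown_words(analysed_text, n=10):
    """Return the n most frequent surface forms that received no analysis.

    In Apertium stream format an unknown word has a single "analysis"
    that starts with an asterisk, e.g. ^bar/*bar$.
    """
    counts = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        if analyses.startswith("*"):
            counts[surface.lower()] += 1
    return counts.most_common(n)


# Hypothetical analysed text with placeholder words, not real Chechen data.
sample = "^foo/foo<n><sg>$ ^bar/*bar$ ^Bar/*Bar$ ^baz/*baz$"
unknown = top_unknown_words(sample)
```

The words at the top of such a list are the stems whose addition to the lexc file should raise coverage the most.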

Code at GitHub

https://github.com/sfeng233/Chechen_Transducer

Corpus For Evaluation

https://github.swarthmore.edu/Ling073-sp19/ling073-che-corpus

  • Wikipedia Corpus: the large corpus that we run the coverage test on.
  • Hand Annotated Corpus: the corpus that we run the precisionRecall test on. It consists of example sentences that are glossed in the grammar book we use as the main reference for this project.
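
As a rough illustration of what a precision and recall test measures, the sketch below compares the set of analyses the transducer produces for each token with the set of hand-annotated analyses. This is one common definition and is an assumption here; the precisionRecall test used in the project may count things differently. The function name and the sample tags are hypothetical.

```python
def precision_recall(pairs):
    """Compute precision and recall over (predicted, gold) pairs of analysis sets.

    precision = correct analyses produced / all analyses produced
    recall    = correct analyses produced / all gold analyses
    """
    true_pos = false_pos = false_neg = 0
    for predicted, gold in pairs:
        true_pos += len(predicted & gold)   # analyses in both sets
        false_pos += len(predicted - gold)  # produced but not in the annotation
        false_neg += len(gold - predicted)  # annotated but not produced
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical tokens with placeholder analyses, not real project data.
pairs = [
    ({"foo<n><sg>", "foo<v><inf>"}, {"foo<n><sg>"}),  # one correct, one spurious
    (set(), {"bar<n><pl>"}),                          # token the transducer missed
]
precision, recall = precision_recall(pairs)
```

Under this definition, an ambiguous transducer lowers precision, while missing stems or missing morphology lower recall.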

Evaluation

# of stems in transducer | # of words in Wiki corpus | Coverage Rate | Precision | Recall
608                      | 14,093,835                | ~80.10%       | ~75.34%   | ~56.97%
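
To make the coverage figure concrete, the sketch below shows one way coverage can be computed and what the reported numbers imply. It assumes that coverage is the share of corpus tokens that receive at least one analysis and that the analysed text is in Apertium stream format; the function name is hypothetical, and the token estimate is derived from the rounded percentage above, so it is approximate.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (assumption: the analysed corpus uses this format)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the fraction of tokens with at least one analysis."""
    total = known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if not analyses.startswith("*"):  # "*" marks an unknown word
            known += 1
    return known / total if total else 0.0


# Hypothetical analysed text: three known tokens and one unknown token.
sample = "^a/a<n>$ ^b/b<v>$ ^c/*c$ ^d/d<adj>$"
sample_coverage = coverage(sample)

# Figures from the evaluation table; the percentage is rounded in the source.
reported_total = 14_093_835
reported_coverage = 0.8010
estimated_known = round(reported_total * reported_coverage)
estimated_unknown = reported_total - estimated_known
```

With the reported figures this comes to roughly 11.29 million analysed tokens and about 2.80 million tokens that are still unknown.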

Further Improvement

  • The augment bases of some nouns and verbs are still unclear because our resources are limited.
  • More verb morphology needs to be added:
      • preverbs
      • deictic prefixes
      • noun class agreement (it is still not clear when agreement occurs and when it does not)