Difference between revisions of "Magahi/Final Project"

From LING073
(Initial Evaluation)
(GitHub)
 
(12 intermediate revisions by 2 users not shown)
Line 1: Line 1:
= Initial Evaluation =
+
= What we did =
* Coverage: 1732745 / 3678603 (~0.47103343307228314662)<br>
+
We improved our morphological transducer by figuring out how to analyze the top unknown words repeatedly.
 +
 
 +
This consisted of adding new lemmas to our lexd as well as new patterns to cover compound verbs and adjectives.
 +
 
 +
= Potential Improvements =
 +
* Add a script to ignore the virama <code> ्</code> diacritic to the front of the apertium pipeline.
 +
* Find a Magahi dictionary or native Magahi speakers so we can add many more words and potentially correct old ones.
 +
* Optimize compilation process so we don't have to recompile old words whenever we add new ones.
 +
 
 +
= Evaluation =
 +
Over [https://github.swarthmore.edu/Ling073-sp21/ling073-mag-corpus mag.corpus.large.txt].
 +
== Initial Evaluation ==
 +
* Coverage: 94567 / 187701 (~0.50381724125071256946)
 
* Totals: 142 forms, 339 tp, 12 fp, 0 tn, 65 fn<br>
 
 
* Precision: 96.58120%<br>
 
Line 6: Line 18:
 
* Unknown words
 
 
<code>
 
<code>
  31801 ^हे/*हे$    
+
    2785 ^हे/*हे$
  17942 ^ऊ/*ऊ$     
+
    960 ^नञ/*नञ$
  17754 ^1/*1$
+
    793 ^ऊ/*ऊ$
  15114 ^तो/*तो$     
+
     655 ^अपपन/*अपपन$
  13642 ^2/*2$
+
    478 ^हममर/*हममर$
  12158 ^न/*न$    
+
    437 ^तो/*तो$
  9535 ^नयँ/*नयँ$  
+
     429 ^कवि/*कवि$
  9129 ^3/*3$
+
    424 ^न/*न$
  7072 ^नसध॰/*नसध॰$  
+
    414 ^तऽ/*तऽ$
  6821 ^5/*5$
+
    397 ^ले/*ले$
  6626 ^ले/*ले$     
+
    396 ^कुमार/*कुमार$
  6158 ^घर/*घर$     
+
    375 ^जे/*जे$
  5972 ^की/*की$  
+
    367 ^हऽ/*हऽ$
  5842 ^जे/*जे$      
+
     352 ^सिंह/*सिंह$
  5752 ^/*$    
+
     344 ^की/*की$
  5606 ^अप्पन/*अप्पन$
+
     317 ^लेल/*लेल$
  5513 ^mso/*mso$  
+
    309 ^साहित/*साहित$
  5487 ^10/*10$
+
    290 ^घर/*घर$
  5384 ^4/*4$
+
    280 ^कविता/*कविता$
  5318 ^6/*6$
+
    255 ^उनखर/*उनखर$
 
</code>
 
</code>
 +
* Remaining unknown forms: 93134
 +
* Total number of forms: 362
 +
* Lexical forms (not morphology): 273
  
 
== Final Evaluation ==
 
Line 60: Line 75:
 
* Lexical forms (not morphology): 1012
 
  
 +
= GitHub =
 +
* [https://github.swarthmore.edu/Ling073-sp21/ling073-mag Private]
 +
* [https://github.com/William103/MagahiApertium Public]
  
 
[[Category:sp21_FinalProjects]] [[Category:Magahi]]
 

Latest revision as of 23:27, 20 May 2021

What we did

We improved our morphological transducer iteratively: we repeatedly looked at the most frequent unknown words in the corpus and worked out how to analyze them.

This consisted of adding new lemmas to our lexd file, as well as new patterns to cover compound verbs and adjectives.

Potential Improvements

  • Add a script at the front of the Apertium pipeline that strips the virama diacritic (U+094D), so that the analyzer ignores it.
  • Find a Magahi dictionary or native Magahi speakers so we can add many more words and potentially correct existing entries.
  • Optimize the compilation process so that we don't have to recompile old words whenever we add new ones.
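
The first improvement could be prototyped as a small text filter. The following is only a minimal sketch, assuming the script should simply delete every virama (U+094D) from the input before analysis; the names `strip_virama` and `filter_stream` are hypothetical and not part of the project.

```python
import io

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def strip_virama(text: str) -> str:
    """Return the text with every virama character removed."""
    return text.replace(VIRAMA, "")


def filter_stream(src, dst) -> None:
    """Copy src to dst line by line, dropping viramas along the way.

    src and dst are any text file objects (for example sys.stdin
    and sys.stdout when the script sits at the front of a pipeline).
    """
    for line in src:
        dst.write(strip_virama(line))


# Small in-memory demonstration: U+0905 U+092A U+094D U+092A U+0928
# loses its virama and becomes U+0905 U+092A U+092A U+0928.
demo_in = io.StringIO("\u0905\u092A\u094D\u092A\u0928\n")
demo_out = io.StringIO()
filter_stream(demo_in, demo_out)
```

Calling `filter_stream(sys.stdin, sys.stdout)` would turn this into a filter whose output can be piped into the analyzer. Whether deleting the virama outright is the right normalization for every word is an open question that the sketch does not settle.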

Evaluation

Evaluated over mag.corpus.large.txt.

Initial Evaluation

  • Coverage: 94567 / 187701 (~0.50381724125071256946)
  • Totals: 142 forms, 339 tp, 12 fp, 0 tn, 65 fn
  • Precision: 96.58120%
  • Recall: 83.91089%
  • Unknown words

   2785 ^हे/*हे$
   960 ^नञ/*नञ$
   793 ^ऊ/*ऊ$
   655 ^अपपन/*अपपन$
   478 ^हममर/*हममर$
   437 ^तो/*तो$
   429 ^कवि/*कवि$
   424 ^न/*न$
   414 ^तऽ/*तऽ$
   397 ^ले/*ले$
   396 ^कुमार/*कुमार$
   375 ^जे/*जे$
   367 ^हऽ/*हऽ$
   352 ^सिंह/*सिंह$
   344 ^की/*की$
   317 ^लेल/*लेल$
   309 ^साहित/*साहित$
   290 ^घर/*घर$
   280 ^कविता/*कविता$
   255 ^उनखर/*उनखर$

  • Remaining unknown forms: 93134
  • Total number of forms: 362
  • Lexical forms (not morphology): 273
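
A frequency list like the one above can be derived from analyzer output in the Apertium stream format, where an unknown token appears as `^form/*form$`. The sketch below is one possible way to do the counting, not necessarily the method used here; `count_unknown` and `format_top` are hypothetical helper names, and the regular expression ignores escaped characters for simplicity.

```python
import re
from collections import Counter

# Matches ^surface/*surface$ and captures the surface form.
# Simplification: escaped characters inside a token are not handled.
UNKNOWN = re.compile(r"\^([^/^$]*)/\*[^$]*\$")


def count_unknown(analysed: str) -> Counter:
    """Count how often each unknown surface form occurs in analyzer output."""
    return Counter(m.group(1) for m in UNKNOWN.finditer(analysed))


def format_top(counts: Counter, n: int = 20) -> list:
    """Format the n most frequent unknown forms as 'count ^form/*form$' lines."""
    return [f"{c:7d} ^{w}/*{w}$" for w, c in counts.most_common(n)]


# Tiny made-up input: two unknown forms and one analyzed token.
sample = "^mso/*mso$ ^the/the<det>$ ^mso/*mso$ ^10/*10$"
for line in format_top(count_unknown(sample)):
    print(line)
```

Rerunning such a count after each round of lexicon additions shows which unknown forms are worth tackling next.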

Final Evaluation

  • Coverage: 150272 / 187701 (~0.80059243158001289285)
  • Totals: 142 forms, 340 tp, 103 fp, 0 tn, 64 fn
  • Precision: 76.74944% (this went down because massively expanding the lexicon introduced ambiguity; it was artificially high before because our entire lexicon was based on that one story)
  • Recall: 84.15842%
  • Unknown words

    58 ^डॉ०/*डॉ०$
    21 ^हौले/*हौले$
    20 ^गाड/*गाड$
    19 ^सुनावे/*सुनावे$
    19 ^छो/*छो$
    18 ^हलूं/*हलूं$
    18 ^विदवान/*विदवान$
    18 ^तोरे/*तोरे$
    18 ^जुगाड़/*जुगाड़$
    18 ^गते/*गते$
    17 ^होते/*होते$
    17 ^हिंछा/*हिंछा$
    17 ^हाँथ/*हाँथ$
    17 ^सथान/*सथान$
    17 ^सजल/*सजल$
    17 ^वाह/*वाह$
    17 ^योगदान/*योगदान$
    17 ^मुनचुन/*मुनचुन$
    17 ^महाकवि/*महाकवि$
    17 ^मरदाना/*मरदाना$

  • Remaining unknown forms: 37429
  • Total number of forms: 1172
  • Lexical forms (not morphology): 1012
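
The coverage, precision and recall figures in both evaluations follow from the raw counts using the standard definitions: coverage = known / total, precision = tp / (tp + fp), recall = tp / (tp + fn). The sketch below recomputes them as a sanity check; the dictionary layout and function names are illustrative only.

```python
def coverage(known: int, total: int) -> float:
    """Fraction of corpus tokens that received at least one analysis."""
    return known / total


def precision(tp: int, fp: int) -> float:
    """Share of produced analyses that are correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of expected analyses that were produced."""
    return tp / (tp + fn)


# Counts copied from the two evaluations reported above.
runs = {
    "initial": dict(known=94567, total=187701, tp=339, fp=12, fn=65),
    "final": dict(known=150272, total=187701, tp=340, fp=103, fn=64),
}

for name, r in runs.items():
    print(
        f"{name}: coverage {coverage(r['known'], r['total']):.5f}, "
        f"precision {100 * precision(r['tp'], r['fp']):.5f}%, "
        f"recall {100 * recall(r['tp'], r['fn']):.5f}%, "
        f"remaining unknown {r['total'] - r['known']}"
    )
```

The comparison makes the trade-off explicit: coverage rises from about 0.50 to about 0.80 while true positives barely change (339 to 340), so the drop in precision comes entirely from the additional false positives (12 to 103).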

GitHub