Magahi/Final Project

From LING073

Latest revision as of 22:27, 20 May 2021

What we did

We improved our morphological transducer iteratively: we repeatedly listed the most frequent unknown words in the corpus and worked out how to analyze them.

This consisted of adding new lemmas to our lexd file as well as new patterns to cover compound verbs and adjectives.
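
The ranking step in that loop can be sketched in Python. This is a minimal illustration, not necessarily the script we used: it assumes the analyzer output is in Apertium stream format, where an unanalyzed token appears as `^form/*form$`, and the name `top_unknown` is hypothetical.

```python
import re
from collections import Counter

# Matches an unknown lexical unit in Apertium stream format: ^surface/*surface$
UNKNOWN_RE = re.compile(r"\^([^/$]+)/\*[^$]*\$")

def top_unknown(analysis_output, n=20):
    """Return the n most frequent unanalyzed surface forms with their counts."""
    counts = Counter(m.group(1) for m in UNKNOWN_RE.finditer(analysis_output))
    return counts.most_common(n)

# Tiny made-up sample: two unknown tokens and one analyzed token.
sample = "^हे/*हे$ ^घर/घर<n>$ ^हे/*हे$ ^ऊ/*ऊ$"
for form, count in top_unknown(sample, n=2):
    print(f"{count:7d} ^{form}/*{form}$")
```

Analyzed tokens such as `^घर/घर<n>$` are skipped because their first reading does not start with `*`, so only the words that still need lexicon entries are counted.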

Potential Improvements

  • Add a script at the front of the Apertium pipeline that strips the virama diacritic (्) so that it is ignored during analysis.
  • Find a Magahi dictionary or native Magahi speakers so we can add many more words and potentially correct old ones.
  • Optimize the compilation process so we don't have to recompile old words whenever we add new ones.
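
The first improvement could look roughly like the following Python sketch. It assumes that dropping every occurrence of U+094D (DEVANAGARI SIGN VIRAMA) is the desired behaviour; the function names `strip_virama` and `filter_stream` are hypothetical, and nothing like this is part of the current pipeline.

```python
import io

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA

def strip_virama(text):
    """Remove every virama, leaving the bare consonant letters in place."""
    return text.replace(VIRAMA, "")

def filter_stream(src, dst):
    """Copy src to dst line by line with viramas removed."""
    for line in src:
        dst.write(strip_virama(line))

# Demonstration on an in-memory stream. In a real pipeline one would pass
# sys.stdin and sys.stdout instead and pipe the result into the analyzer.
out = io.StringIO()
filter_stream(io.StringIO("\u0905\u092a\u094d\u092a\u0928\n"), out)
print(out.getvalue(), end="")
```

The demonstration turns अप्पन into अपपन, the virama-less spelling that appears in the unknown-word lists in the evaluation.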

Evaluation

Evaluated over mag.corpus.large.txt (https://github.swarthmore.edu/Ling073-sp21/ling073-mag-corpus).

Initial Evaluation

  • Coverage: 94567 / 187701 (~50.38%)
  • Totals: 142 forms, 339 tp, 12 fp, 0 tn, 65 fn
  • Precision: 96.58120%
  • Recall: 83.91089%
  • Unknown words

   2785 ^हे/*हे$
   960 ^नञ/*नञ$
   793 ^ऊ/*ऊ$
   655 ^अपपन/*अपपन$
   478 ^हममर/*हममर$
   437 ^तो/*तो$
   429 ^कवि/*कवि$
   424 ^न/*न$
   414 ^तऽ/*तऽ$
   397 ^ले/*ले$
   396 ^कुमार/*कुमार$
   375 ^जे/*जे$
   367 ^हऽ/*हऽ$
   352 ^सिंह/*सिंह$
   344 ^की/*की$
   317 ^लेल/*लेल$
   309 ^साहित/*साहित$
   290 ^घर/*घर$
   280 ^कविता/*कविता$
   255 ^उनखर/*उनखर$

  • Remaining unknown forms: 93134
  • Total number of forms: 362
  • Lexical forms (not morphology): 273
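
The precision and recall figures follow from the tp/fp/fn totals using the standard definitions. A short Python check, assuming precision = tp / (tp + fp) and recall = tp / (tp + fn); the function name `precision_recall` is illustrative only:

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) as percentages."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return precision, recall

# Totals from the initial evaluation: 339 tp, 12 fp, 65 fn.
p, r = precision_recall(tp=339, fp=12, fn=65)
print(f"Precision: {p:.5f}%")  # Precision: 96.58120%
print(f"Recall: {r:.5f}%")     # Recall: 83.91089%
```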

Final Evaluation

  • Coverage: 150272 / 187701 (~80.06%)
  • Totals: 142 forms, 340 tp, 103 fp, 0 tn, 64 fn
  • Precision: 76.74944%. This went down because massively expanding the lexicon introduced ambiguity; it was artificially high before because our entire lexicon was built from the one story being tested.
  • Recall: 84.15842%
  • Unknown words

    58 ^डॉ०/*डॉ०$
    21 ^हौले/*हौले$
    20 ^गाड/*गाड$
    19 ^सुनावे/*सुनावे$
    19 ^छो/*छो$
    18 ^हलूं/*हलूं$
    18 ^विदवान/*विदवान$
    18 ^तोरे/*तोरे$
    18 ^जुगाड़/*जुगाड़$
    18 ^गते/*गते$
    17 ^होते/*होते$
    17 ^हिंछा/*हिंछा$
    17 ^हाँथ/*हाँथ$
    17 ^सथान/*सथान$
    17 ^सजल/*सजल$
    17 ^वाह/*वाह$
    17 ^योगदान/*योगदान$
    17 ^मुनचुन/*मुनचुन$
    17 ^महाकवि/*महाकवि$
    17 ^मरदाना/*मरदाना$

  • Remaining unknown forms: 37429
  • Total number of forms: 1172
  • Lexical forms (not morphology): 1012
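
The coverage figures in the two evaluations can be cross-checked with a little arithmetic. The Python sketch below assumes coverage is simply analyzed tokens divided by total tokens, and that "remaining unknown forms" is the total minus the analyzed count; the helper name `coverage` is hypothetical.

```python
TOTAL_TOKENS = 187701  # tokens in the evaluation corpus

def coverage(known, total=TOTAL_TOKENS):
    """Return (coverage fraction, number of tokens left unanalyzed)."""
    return known / total, total - known

initial_cov, initial_unknown = coverage(94567)
final_cov, final_unknown = coverage(150272)

print(f"Initial: {initial_cov:.4f}, unknown {initial_unknown}")
print(f"Final:   {final_cov:.4f}, unknown {final_unknown}")
print(f"Gain:    {final_cov - initial_cov:.4f}")
```

Under these assumptions the unknown counts come out to 93134 and 37429, matching the reported values, and coverage rises by roughly 30 percentage points between the two evaluations.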

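GitHub

  • Private: https://github.swarthmore.edu/Ling073-sp21/ling073-mag
  • Public: https://github.com/William103/MagahiApertium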