Magahi/Final Project

From LING073

What we did

We improved our morphological transducer by repeatedly finding the most frequent unknown words in the corpus and working out how to analyze them.

This meant adding new lemmas to our lexd file, as well as new patterns to cover compound verbs and adjectives.
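The "top unknown words" step can be sketched roughly as follows. This is only an illustration, not the script we actually used: it assumes the analyzer output is in Apertium stream format, where an unanalyzed token appears as `^form/*form$`, and the names `UNKNOWN_UNIT` and `top_unknown` are made up for this example.

```python
import re
from collections import Counter

# One lexical unit whose analysis is the unknown marker "*", e.g. ^हे/*हे$
UNKNOWN_UNIT = re.compile(r"\^([^/^$]+)/\*[^/^$]*\$")

def top_unknown(analysed_text, n=20):
    """Return the n most frequent unanalyzed surface forms with their counts."""
    counts = Counter(m.group(1) for m in UNKNOWN_UNIT.finditer(analysed_text))
    return counts.most_common(n)

# Tiny hand-made sample: two unknown forms and one analyzed noun.
sample = "^हे/*हे$ ^घर/घर<n>$ ^हे/*हे$ ^नञ/*नञ$"
for form, count in top_unknown(sample):
    print(count, form)
```

Running something like this after each round of lexicon additions gives the next batch of forms to work on.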

Potential Improvements

  • Add a script at the front of the Apertium pipeline that ignores the virama diacritic.
  • Find a Magahi dictionary or native Magahi speakers so we can add many more words and potentially correct old ones.
  • Optimize the compilation process so that we don't have to recompile old words whenever we add new ones.
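A minimal sketch of what the virama script might look like, assuming that "ignore" simply means deleting U+094D (DEVANAGARI SIGN VIRAMA) from the text before it reaches the analyzer. The function names are hypothetical, and the lexicon would need to be spelled without viramas for this to help.

```python
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA

def strip_virama(text):
    """Delete every virama from text, leaving all other characters untouched."""
    return text.replace(VIRAMA, "")

def filter_stream(lines):
    """Lazily strip viramas from an iterable of lines (e.g. sys.stdin)."""
    for line in lines:
        yield strip_virama(line)

# As a pipeline filter this could be driven with:
#   sys.stdout.writelines(filter_stream(sys.stdin))
print(strip_virama("विद्वान"))
```

Writing it as a plain filter over lines means it could sit in front of the analyzer in a shell pipe without changing anything downstream.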

Evaluation

Over mag.corpus.large.txt.

Initial Evaluation

  • Coverage: 94567 / 187701 (~50.38%)
  • Totals: 142 forms, 339 tp, 12 fp, 0 tn, 65 fn
  • Precision: 96.58120%
  • Recall: 83.91089%
  • Unknown words

   2785 ^हे/*हे$
   960 ^नञ/*नञ$
   793 ^ऊ/*ऊ$
   655 ^अपपन/*अपपन$
   478 ^हममर/*हममर$
   437 ^तो/*तो$
   429 ^कवि/*कवि$
   424 ^न/*न$
   414 ^तऽ/*तऽ$
   397 ^ले/*ले$
   396 ^कुमार/*कुमार$
   375 ^जे/*जे$
   367 ^हऽ/*हऽ$
   352 ^सिंह/*सिंह$
   344 ^की/*की$
   317 ^लेल/*लेल$
   309 ^साहित/*साहित$
   290 ^घर/*घर$
   280 ^कविता/*कविता$
   255 ^उनखर/*उनखर$

  • Remaining unknown forms: 93134
  • Total number of forms: 362
  • Lexical forms (not morphology): 273
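For reference, the coverage, precision, and recall figures above are consistent with the standard definitions, sketched below. The helper names are our own for this illustration and are not part of any evaluation tool.

```python
def coverage(known, total):
    """Share of corpus tokens that received at least one analysis."""
    return known / total

def precision(tp, fp):
    """Share of produced analyses that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of expected analyses that were produced."""
    return tp / (tp + fn)

# Initial evaluation figures taken from the list above.
print(f"coverage:  {coverage(94567, 187701):.2%}")
print(f"precision: {precision(339, 12):.5%}")
print(f"recall:    {recall(339, 65):.5%}")
print("remaining unknown forms:", 187701 - 94567)
```

The remaining unknown forms are simply the corpus tokens that were not covered.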

Final Evaluation

  • Coverage: 150272 / 187701 (~80.06%)
  • Totals: 142 forms, 340 tp, 103 fp, 0 tn, 64 fn
  • Precision: 76.74944% (this went down because massively expanding the lexicon introduced ambiguity; the earlier figure was artificially high because our entire lexicon had been based on that one story)
  • Recall: 84.15842%
  • Unknown words

    58 ^डॉ०/*डॉ०$
    21 ^हौले/*हौले$
    20 ^गाड/*गाड$
    19 ^सुनावे/*सुनावे$
    19 ^छो/*छो$
    18 ^हलूं/*हलूं$
    18 ^विदवान/*विदवान$
    18 ^तोरे/*तोरे$
    18 ^जुगाड़/*जुगाड़$
    18 ^गते/*गते$
    17 ^होते/*होते$
    17 ^हिंछा/*हिंछा$
    17 ^हाँथ/*हाँथ$
    17 ^सथान/*सथान$
    17 ^सजल/*सजल$
    17 ^वाह/*वाह$
    17 ^योगदान/*योगदान$
    17 ^मुनचुन/*मुनचुन$
    17 ^महाकवि/*महाकवि$
    17 ^मरदाना/*मरदाना$

  • Remaining unknown forms: 37429
  • Total number of forms: 1172
  • Lexical forms (not morphology): 1012

GitHub