Miskito/Final project

From LING073
[[Category:sp21_FinalProjects]] [[Category:Miskito]]

Revision as of 14:34, 20 May 2021

Overview

Over the course of the semester, we worked on developing a rule-based machine translation (RBMT) system for Miskito using Apertium resources.

Implementation

For our final project, we decided to focus on our monolingual transducer and improve it as much as possible. Specifically, we directed our attention to the .lexd and .twol files to encode grammatical generalizations. We implemented analysis for a wide range of morphological patterns in Miskito:

  • Verb tense morphology
    • Absolute Past
    • Present
    • Present Progressive
    • Absolute Future
    • Future
  • Irregular Verb Morphology
  • Negation in the present tense
  • Possessive Noun morphology

With a combined set of only about 230 nouns and verbs, we raised corpus coverage from roughly 51% to roughly 79%, as detailed in the Evaluation section below.

Areas to fix

  • Full coverage of possessive noun morphology
  • Adjective morphology

Developed Resources

We expanded our morphological analyzer's capabilities by adding in

Evaluation

To evaluate our progress, we used coverage-hfst to record our overall corpus coverage. Our reported results are below.

Initial Corpus Coverage

  • Coverage: 3873 / 7569 (~0.51169242964724534285)
  • Remaining unknown forms: 3696

Final Corpus Coverage

  • Coverage: 5987 / 7588 (~0.78900896151818661044)
  • Remaining unknown forms: 1601
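
The figures above can be checked with a short calculation. The following is a minimal sketch, not part of our toolchain: the function name `coverage` is hypothetical, and the only inputs are the counts reported above. It assumes that coverage is the number of analyzed forms divided by the total number of forms, and that the remaining unknown forms are the difference between the two.

```python
def coverage(known: int, total: int) -> float:
    """Return the fraction of corpus forms that received an analysis."""
    if total <= 0:
        raise ValueError("total must be positive")
    return known / total


# Counts copied from the reported results above.
initial_known, initial_total = 3873, 7569
final_known, final_total = 5987, 7588

initial = coverage(initial_known, initial_total)
final = coverage(final_known, final_total)

# Unknown forms are whatever the analyzer did not cover.
initial_unknown = initial_total - initial_known
final_unknown = final_total - final_known

print(f"initial coverage: {initial:.2%} ({initial_unknown} unknown)")
print(f"final coverage:   {final:.2%} ({final_unknown} unknown)")
print(f"gain: {(final - initial) * 100:.1f} percentage points")
```

Note that the two runs report slightly different corpus totals (7569 and 7588), so the gain compares two ratios rather than two counts over the same denominator.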

We were able to attain this increase in overall coverage by using two strategies:

  1. Adding stems for high-frequency words
  2. Implementing morphological rules (e.g., tenses)
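
The first strategy depends on knowing which unanalyzed forms occur most often. The sketch below shows one way this ranking could be done; it is an illustration under stated assumptions, not the procedure we actually ran. The function `rank_unknown_forms`, the `is_known` callback, and the sample tokens are all hypothetical placeholders (the tokens are not Miskito words). In practice `is_known` would be backed by the analyzer's output.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def rank_unknown_forms(
    tokens: Iterable[str],
    is_known: Callable[[str], bool],
    top_n: int = 10,
) -> List[Tuple[str, int]]:
    """Count tokens the analyzer does not recognize, most frequent first."""
    counts = Counter(tok for tok in tokens if not is_known(tok))
    return counts.most_common(top_n)


# Placeholder data: stand-ins for corpus tokens and for the set of
# forms the analyzer already covers (hypothetical, not real Miskito).
known_forms = {"aaa", "bbb"}
sample_corpus = ["aaa", "ccc", "ccc", "ddd", "bbb", "ccc", "ddd", "eee"]

ranked = rank_unknown_forms(sample_corpus, known_forms.__contains__, top_n=2)
for form, freq in ranked:
    print(f"{freq}\t{form}")
```

Working down a list like this means each added stem removes as many unknown forms as possible, which is why a small lexicon can still produce a large coverage gain.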