Difference between revisions of "Miskito/Final project"

From LING073
(Developed Resources)
(Evaluation)
 
(14 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
==Overview==
 
Over the course of the semester
+
Over the course of the semester, we worked on developing a rule-based machine translation (RBMT) system for Miskito using Apertium resources.
  
===Developed Resources===
+
====Implementation/Solution====
 +
For our final project, we focused on improving our monolingual transducer as much as possible. Specifically, we directed our attention to the <code>.lexd</code> and <code>.twol</code> files to encode grammatical generalizations. We implemented morphological analysis for a wide range of Miskito forms:
 +
 
 +
* Verb tense morphology
 +
** Absolute Past
 +
** Present
 +
** Present Progressive
 +
** Absolute Future
 +
** Future
 +
* Irregular Verb Morphology
 +
* Negation in the present tense
 +
* Possessive Noun morphology
 +
 
 +
A large portion of our increase in corpus coverage can be attributed to the following two strategies:
 +
* Adding high-frequency stems into the <code>.lexd</code> file.
 +
* Adding common morphological rules (i.e. stems) into the <code>.twol</code> file.
 +
 
 +
With a combined set of only around 230 nouns and verbs, we reached the corpus coverage reported in the Evaluation section below.
 +
 
 +
====Problems Encountered====
 +
* Limited Resources
 +
Because Miskito has only about 140,000 native speakers, few resources were available to draw on when building these language tools. Additionally, some of the resources we found were very dated or did not work well.
 +
* Reliable Information
 +
We were able to use a couple of dictionaries and grammar workbooks. However, we found inconsistencies between these sources, which made implementing some components of Miskito rather challenging.
 +
* Implementing Morphological Rules
 +
We found it challenging to establish the morphology of Miskito concretely and to implement it in our <code>.twol</code> file.
 +
 
 +
====Areas to Fix====
 +
* Full coverage of possessive noun morphology
 +
* Adjective morphology
 +
 
 +
==Developed Resources==
  
 
* [https://github.swarthmore.edu/Ling073-sp21/ling073-miq-corpus Corpus Repo]
 
Line 14: Line 45:
 
* [https://github.com/shanecjones1999/miq-morph-analyzer Link to the code for our morphological analyzer]
 
  
We expanded our morphological analyzer's capabilities by adding in
+
==Evaluation==
 +
 
 +
To evaluate our progress, we used <code>coverage-hfst</code> to measure our overall corpus coverage. Our results are reported below.
  
==Evaluation==
+
====Final Corpus Coverage====
 +
* Coverage: 5987 / 7588 (~0.7890)
 +
* Remaining unknown forms: 1601
  
[[Category:sp21_FinalProjects]]
+
[[Category:sp21_FinalProjects]] [[Category:Miskito]]

Latest revision as of 18:53, 20 May 2021

Overview

Over the course of the semester, we worked on developing a rule-based machine translation (RBMT) system for Miskito using Apertium resources.

Implementation/Solution

For our final project, we focused on improving our monolingual transducer as much as possible. Specifically, we directed our attention to the .lexd and .twol files to encode grammatical generalizations. We implemented morphological analysis for a wide range of Miskito forms:

  • Verb tense morphology
    • Absolute Past
    • Present
    • Present Progressive
    • Absolute Future
    • Future
  • Irregular Verb Morphology
  • Negation in the present tense
  • Possessive Noun morphology

A large portion of our increase in corpus coverage can be attributed to the following two strategies:

  • Adding high-frequency stems into the .lexd file.
  • Adding common morphological rules (i.e. stems) into the .twol file.
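
The first strategy depends on knowing which unanalyzed forms occur most often in the corpus. A minimal sketch of that ranking step is shown here, assuming the corpus has already been split into tokens and that the set of forms the analyzer recognizes is available. The function name `rank_unknown_forms` and the toy tokens are hypothetical and are not taken from our repositories.

```python
from collections import Counter


def rank_unknown_forms(tokens, known_forms, top_n=10):
    """Return the top_n most frequent tokens that are not in known_forms.

    tokens: iterable of corpus tokens (strings)
    known_forms: set of lowercased forms the analyzer already recognizes
    """
    # Count only tokens the analyzer does not recognize, ignoring case.
    counts = Counter(
        token.lower() for token in tokens if token.lower() not in known_forms
    )
    # most_common sorts by frequency, highest first.
    return counts.most_common(top_n)


# Toy placeholder data (not real Miskito words).
corpus_tokens = ["aaa", "bbb", "aaa", "ccc", "bbb", "aaa", "ddd"]
analyzed = {"ddd"}

for form, freq in rank_unknown_forms(corpus_tokens, analyzed, top_n=2):
    print(form, freq)
```

The forms at the top of such a list are the candidates whose stems would add the most coverage per entry in the .lexd file.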

With a combined set of only around 230 nouns and verbs, we reached the corpus coverage reported in the Evaluation section below.

Problems Encountered

  • Limited Resources

Because Miskito has only about 140,000 native speakers, few resources were available to draw on when building these language tools. Additionally, some of the resources we found were very dated or did not work well.

  • Reliable Information

We were able to use a couple of dictionaries and grammar workbooks. However, we found inconsistencies between these sources, which made implementing some components of Miskito rather challenging.

  • Implementing Morphological Rules

We found it challenging to establish the morphology of Miskito concretely and to implement it in our .twol file.

Areas to Fix

  • Full coverage of possessive noun morphology
  • Adjective morphology

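Developed Resources

  • Corpus Repo: https://github.swarthmore.edu/Ling073-sp21/ling073-miq-corpus
  • Link to the code for our morphological analyzer: https://github.com/shanecjones1999/miq-morph-analyzer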

Evaluation

To evaluate our progress, we used coverage-hfst to measure our overall corpus coverage. Our results are reported below.

Final Corpus Coverage

  • Coverage: 5987 / 7588 (~0.7890)
  • Remaining unknown forms: 1601
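
As a quick arithmetic check on the counts above, the sketch below recomputes the coverage ratio and the number of unknown forms from the two reported token counts. It assumes coverage is simply analyzed tokens divided by total tokens. The helper name `coverage_ratio` is hypothetical and is not part of coverage-hfst.

```python
def coverage_ratio(known_tokens, total_tokens):
    """Return the fraction of corpus tokens that received an analysis."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return known_tokens / total_tokens


# Counts reported in this section.
known, total = 5987, 7588

ratio = coverage_ratio(known, total)
unknown = total - known  # tokens with no analysis

print(f"Coverage: {known} / {total} (~{ratio:.4f})")
print(f"Remaining unknown forms: {unknown}")
```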