Dhivehi and English/Final project


Dhivehi RBMT and ONMT

Background

Dhivehi is a language with limited technological resources. There are online Dhivehi-English dictionaries and various keyboard layouts, but the dictionaries are not comprehensive, and there is no existing machine translation system for Dhivehi to English. For this project, we explored two approaches to Dhivehi-English machine translation: rule-based machine translation (RBMT) and neural machine translation (NMT).

Motivation/Goal

Building a robust Dhivehi-English machine translation system would benefit the Dhivehi language community by connecting it to English through technology. It would also be useful to learners of Dhivehi or English, and would promote the Dhivehi language more broadly. Our goal is to explore rule-based and neural machine translation for Dhivehi-English and to examine the advantages and disadvantages of each approach.

Methods

RBMT

Using Apertium, we studied the grammatical structure of Dhivehi and wrote various rules (phonological, syntactic, etc.) that map Dhivehi to English.
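
Dhivehi is a subject-object-verb (SOV) language, while English is subject-verb-object (SVO), so the structural transfer rules have to reorder constituents. The actual rules are written in Apertium's XML transfer formalism; the Python sketch below only illustrates the idea, with made-up lemmas and simplified tags.

  # Conceptual illustration of a structural transfer rule; the real rule
  # lives in an Apertium transfer file, not in Python.
  def transfer_sov_to_svo(tokens):
      """Reorder a bare subject-object-verb clause into subject-verb-object.

      `tokens` is a list of (lemma, tag) pairs from morphological analysis.
      """
      tags = [tag for _, tag in tokens]
      if tags == ["n", "n", "v"]:        # pattern: noun noun verb (SOV clause)
          subject, obj, verb = tokens
          return [subject, verb, obj]    # English surface order: SVO
      return tokens                      # anything else passes through unchanged

  # Made-up example: "Ali fish eats" (SOV) becomes "Ali eats fish" (SVO).
  print(transfer_sov_to_svo([("Ali", "n"), ("fish", "n"), ("eat", "v")]))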

Things Implemented (since the Final Project):

  1. ordinal number suffixes
  2. proper noun tags
  3. one structural transfer rule
  4. determiners, pronouns, interjections, demonstratives
  5. ~50 new words added

NMT

Neural Machine Translation: using OpenNMT, we trained on a Dhivehi-English parallel corpus consisting mostly of biblical text, some modern phrases, and the Universal Declaration of Human Rights. The neural network learns the features of Dhivehi from the input and tries to predict the correct English sequences.
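
OpenNMT consumes the parallel corpus as two plain-text files with one sentence per line, where line n of the Dhivehi file is the translation of line n of the English file. A minimal sketch of writing these files (the file names are our own convention and the sentence pairs are placeholders):

  # Write the line-aligned plain-text files OpenNMT expects.
  pairs = [
      ("<Dhivehi sentence 1>", "<English sentence 1>"),
      ("<Dhivehi sentence 2>", "<English sentence 2>"),
  ]

  with open("train.div", "w", encoding="utf-8") as src, \
       open("train.eng", "w", encoding="utf-8") as tgt:
      for dhivehi, english in pairs:
          src.write(dhivehi + "\n")
          tgt.write(english + "\n")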

Evaluation

Cross Validation for NMT

Fold  WER     PER     Words in target  Words in prediction
1     86.68%  40.49%  12970            9650
2     90.69%  29.38%  12664            12070
3     85.65%  48.29%  12292            7802

We used three-fold cross-validation and found that the second model performed best against its corresponding validation set. We judged it best because the WERs of all three models were similar, while model two's PER was significantly lower. Notably, model two's predictions also feature more words overall than those of the other two models.
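
For reference, WER (word error rate) is the word-level edit distance between a prediction and its reference, normalized by the reference length, while PER (position-independent error rate) ignores word order and compares bags of words. A minimal Python sketch of both metrics follows; it uses one common formulation of PER, and our actual evaluation script may differ in details.

  from collections import Counter

  def wer(reference, hypothesis):
      """Word error rate: word-level edit distance / reference length."""
      r, h = reference.split(), hypothesis.split()
      prev = list(range(len(h) + 1))         # dynamic-programming edit distance
      for i, rw in enumerate(r, 1):
          curr = [i] + [0] * len(h)
          for j, hw in enumerate(h, 1):
              cost = 0 if rw == hw else 1
              curr[j] = min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)  # substitution or match
          prev = curr
      return prev[-1] / len(r)

  def per(reference, hypothesis):
      """Position-independent error rate: like WER, but order-insensitive."""
      r, h = Counter(reference.split()), Counter(hypothesis.split())
      matches = sum((r & h).values())            # multiset intersection
      errors = max(sum(r.values()), sum(h.values())) - matches
      return errors / sum(r.values())

  print(wer("the cat sat", "cat the sat"))   # ~0.67: order matters for WER
  print(per("the cat sat", "cat the sat"))   # 0.0: order is ignored for PER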

RBMT and NMT

System  WER     PER     Words in target  Words in prediction
RBMT    94.77%  89.37%  12211            10125
NMT     88.17%  32.02%  12211            10629

We found that an NMT model (the Transformer architecture) trained for approximately thirty-nine hours performed better in terms of both WER and PER than the RBMT system we worked on this semester. We must note, however, that the data pooled together for training, validating, and testing the NMT system consisted primarily of biblical text, so the majority of the test data was likely also from the Bible.

Meaning of Results

These results tell us that the NMT approach is very promising. If we could find a relatively inexpensive translator (perhaps through Translation Services) and build a large parallel corpus of more modern phrases, we could train a more effective "modern-day" NMT system. The RBMT system, meanwhile, is more adaptable to modern texts, since we can tweak individual word mappings and grammar rules directly.

In short, if we had a large parallel corpus of modern text, the NMT implementation would likely provide decent translations given enough training time.

Future Work

Improving the RBMT system

  • Implement more grammatical structures including:
  1. Honorific System
  2. Verb Tenses and Forms
  3. Pronouns

Improving the NMT system

  • Tweak hyper-parameters of the neural net, such as the learning rate and the number of hidden layers (see the sketch after this list)
  • Expand size of parallel corpus
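
As a hypothetical illustration of the first point, a small sweep over two of those hyper-parameters could be scripted as below. The flag names (-layers, -learning_rate, and so on) follow OpenNMT-py's train.py and should be checked against the installed version before running; the data and model paths are our own placeholders.

  import itertools
  import subprocess

  # Hypothetical hyper-parameter sweep: launch one training run per setting.
  learning_rates = [0.5, 1.0, 2.0]
  layer_counts = [2, 4, 6]

  for lr, layers in itertools.product(learning_rates, layer_counts):
      subprocess.run([
          "python", "train.py",
          "-data", "data/div-eng",     # preprocessed corpus (placeholder path)
          "-save_model", f"models/div-eng_lr{lr}_l{layers}",
          "-learning_rate", str(lr),
          "-layers", str(layers),
          "-train_steps", "20000",
      ], check=True)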

Github

The code for this project is available at: https://github.com/nav6maini/Dhivehi-RBMT-and-ONMT

References

Apertium (https://www.apertium.org)

OpenNMT (https://opennmt.net)

Neural Machine Translation