Dhivehi and English/Final project
Dhivehi RBMT and ONMT
Background
Dhivehi is a low-resource language technologically. Online Dhivehi-English dictionaries and various keyboard layouts exist, but the dictionaries are not comprehensive, and there is no existing machine translation system for Dhivehi to English. For this project, we explored two approaches to Dhivehi-English machine translation: rule-based machine translation (RBMT) and neural machine translation (NMT).
Motivation/Goal
Building a robust Dhivehi-English machine translation system would benefit the Dhivehi language community by connecting it to the English-speaking world through technology. It would also help learners of Dhivehi or English, and it promotes the Dhivehi language itself. Our goal is to explore both rule-based and neural machine translation for Dhivehi-English and to examine the advantages and disadvantages of each approach.
Methods
RBMT
Using Apertium, we studied the grammatical structure of Dhivehi and wrote various rules (phonological, syntactic, etc.) that map Dhivehi to English.
Things Implemented (since the Final Project):
- ordinal number suffixes
- proper noun tags
- one structural transfer rule
- determiners, pronouns, interjections, demonstratives
- ~50 new words
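To illustrate the kind of mapping these rules perform, here is a toy sketch of the rule-based pipeline: lexical lookup followed by a structural transfer rule. The lexicon entries and the noun-adjective reordering rule below are hypothetical placeholders, not real Dhivehi data; Apertium expresses the same ideas with dictionary (.dix) entries and XML transfer rules.

```python
# Hypothetical lexicon: source word -> (English translation, part of speech).
# These are placeholder words, not actual Dhivehi vocabulary.
LEXICON = {
    "foo": ("house", "n"),
    "bar": ("big", "adj"),
}

def transfer(tokens):
    """Translate word by word, then apply one structural transfer rule:
    reorder a noun + adjective pair into English adjective + noun order."""
    tagged = [LEXICON.get(t, (t, "unk")) for t in tokens]
    out = []
    i = 0
    while i < len(tagged):
        if i + 1 < len(tagged) and tagged[i][1] == "n" and tagged[i + 1][1] == "adj":
            # rule fires: emit adjective before noun
            out += [tagged[i + 1][0], tagged[i][0]]
            i += 2
        else:
            # no rule matches: pass the translation through unchanged
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)

print(transfer(["foo", "bar"]))  # -> "big house"
```

In Apertium the lookup lives in the bilingual dictionary and the reordering in a transfer rule file, but the division of labor is the same.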
NMT
Neural Machine Translation: using OpenNMT, we trained on a Dhivehi-English parallel corpus consisting mostly of biblical text, some modern phrases, and the Universal Declaration of Human Rights. The neural network learns features of Dhivehi from the input and tries to predict the correct English sequences.
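Because the corpus is small, the models in the evaluation section are compared with three-fold cross validation. A minimal sketch of constructing such folds (the helper name and split strategy are illustrative, not the project's actual script):

```python
import random

def three_fold_splits(pairs, seed=0):
    """Shuffle the parallel corpus and split it into three folds; each fold
    serves once as validation data while the other two folds form the
    training data."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    folds = [pairs[i::3] for i in range(3)]  # round-robin split into thirds
    splits = []
    for k in range(3):
        train = [p for i, f in enumerate(folds) if i != k for p in f]
        splits.append((train, folds[k]))
    return splits

# Placeholder sentence pairs standing in for (Dhivehi, English) lines.
corpus = [(f"dv_{i}", f"en_{i}") for i in range(9)]
for train, valid in three_fold_splits(corpus):
    print(len(train), len(valid))  # 6 3 for each of the three folds
```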
Evaluation
Cross Validation for NMT
Fold | WER | PER | Words in target | Words in prediction |
---|---|---|---|---|
1 | 86.68% | 40.49% | 12970 | 9650 |
2 | 90.69% | 29.38% | 12664 | 12070 |
3 | 85.65% | 48.29% | 12292 | 7802 |
We used three-fold cross validation and found that the second model performed best against its corresponding validation set. We deemed it the best because the WERs of all three models were similar, but model two's PER was significantly lower. Note also that model two's predictions contain more words overall than those of the other two models.
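The two metrics in the tables can be computed as follows. This is a minimal sketch: WER is the word-level Levenshtein distance divided by the reference length, and PER uses one common position-independent formulation (bag-of-words overlap); it is not necessarily the exact scorer we used.

```python
from collections import Counter

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    prev_row = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            sub = prev_row[j - 1] + (r[i - 1] != h[j - 1])
            cur[j] = min(prev_row[j] + 1, cur[j - 1] + 1, sub)
        prev_row = cur
    return prev_row[len(h)] / len(r)

def per(ref, hyp):
    """Position-independent error rate: like WER but ignoring word order,
    computed from the multiset overlap of reference and hypothesis words."""
    r, h = Counter(ref.split()), Counter(hyp.split())
    matches = sum((r & h).values())  # words correct regardless of position
    nr, nh = sum(r.values()), sum(h.values())
    return (max(nr, nh) - matches) / nr

# Reordered words are penalized by WER but not by PER:
print(wer("the cat sat", "cat the sat"))  # -> 0.666...
print(per("the cat sat", "cat the sat"))  # -> 0.0
```

The gap between a model's WER and PER is therefore a rough measure of how many of its errors are word-order errors rather than wrong or missing words.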
RBMT and NMT
System | WER | PER | Words in target | Words in prediction |
---|---|---|---|---|
RBMT | 94.77% | 89.37% | 12211 | 10125 |
NMT | 88.17% | 32.02% | 12211 | 10629 |
We found that an NMT model (a Transformer) trained for approximately thirty-nine hours performed better in terms of WER and PER than the RBMT system we worked on this semester. We must note, however, that the data pooled together for training, validating, and testing the NMT system consisted primarily of biblical text, so the majority of the test data was likely from the Bible.
Meaning of Results
These results tell us that the NMT approach is very promising. If we could hire a relatively inexpensive translator (perhaps through a translation service) and build a large parallel corpus of more modern phrases, we could train a more effective "modern-day" NMT system. The RBMT system, by contrast, is already more adaptable to modern text, since we can tweak individual word mappings and grammar rules directly.
In short, if we had a large parallel corpus of modern text, then the NMT implementation would likely provide decent translation given enough training time.
Future Work
Improving the RBMT system
- Implement more grammatical structures including:
- Honorific System
- Verb Tenses and Forms
- Pronouns
Improving the NMT system
- Tweak hyperparameters of the neural network, such as the learning rate and the number of hidden layers
- Expand size of parallel corpus
Github
The code for this project is available at: https://github.com/nav6maini/Dhivehi-RBMT-and-ONMT