Kaingang and Portuguese/Final Project
The project: RBMT or NMT?
Background
Kaingang is a language spoken in southern Brazil. It has roughly 20,000 speakers, many of whom are bilingual in Portuguese. Kaingang is one of the most well-documented indigenous languages of Brazil; however, in our project we were only able to use a Kaingang-Portuguese dictionary and a Kaingang New Testament (the other resources were written in languages we could not read).
Idea
We wanted to find out how well a neural machine translation (NMT) system could work for a language pair that doesn't have millions of parallel lines (we have only ~10,000). We also wanted to compare it to our rule-based transducer, which we have been extending since the end of the semester.
Motivation
As far as we could find by searching, this is the only machine translation system available between Kaingang and another language. Our motivation was to make it possible for the Kaingang people not to have to choose between teaching Kaingang or Portuguese to later generations. We hoped that, if one of the translators eventually reached state-of-the-art quality, any need for Portuguese could be met through the translator, which would be an incentive to keep the Kaingang language alive.
What we did
- Added many more features to the rule-based transducer and improved its performance
- Trained a neural network on 75% of our parallel corpus and tested it on the remaining 25%, using OpenNMT
- Documented ambiguity in Kaingang
- Created a parallel corpus covering the whole New Testament in Kaingang and Portuguese!
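The 75%/25% split mentioned above can be sketched roughly as follows. This is only an illustration, not the script we actually ran: the function name `split_parallel`, the fixed random seed, the decision to shuffle before splitting, and the placeholder sentences are all assumptions made for the example.

```python
import random


def split_parallel(src_lines, tgt_lines, train_fraction=0.75, seed=0):
    """Split aligned sentence pairs into train and test portions.

    Hypothetical helper: shuffling with a fixed seed is an assumption,
    not something the original project is known to have done.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    # Keep each source line attached to its translation while shuffling.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


# Placeholder data standing in for the real Kaingang and Portuguese lines.
kgp = [f"kgp sentence {i}" for i in range(8)]
por = [f"por sentence {i}" for i in range(8)]

train, test = split_parallel(kgp, por)
print(len(train), len(test))
```

Zipping the two sides before shuffling is the important part: it keeps every Kaingang line paired with its Portuguese translation, so the two halves of the corpus never drift out of alignment.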
Links to repositories
Kaingang-Portuguese translator
Evaluation Metric
Our new evaluation accounts for the fact that words in the Apertium translations can carry # and * symbols. The accuracy script looks something like this:
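    # Per-sentence fraction of predicted words found in the reference translation.
    correct_word_percentage = []
    for i in range(len(true_translations)):
        correct_word_count = 0
        predicted_words = prediction[i].split()
        for word in predicted_words:
            # Drop the newline and Apertium's markers:
            # * marks an unknown word, # marks a generation failure.
            if word.strip("\n").strip("*").strip("#") in true_translations[i]:
                correct_word_count += 1
        # Divide the count (not the list) by the number of predicted words.
        correct_word_percentage.append(correct_word_count / len(predicted_words))

    # Average the per-sentence scores over the whole test set.
    average_accuracy = sum(correct_word_percentage) / len(correct_word_percentage)
    print(average_accuracy)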
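One thing to be aware of: if each reference translation is a plain string, Python's `in` performs a substring test, so a short word such as "a" would match inside "casa". A token-level variant is sketched below as a possible alternative; it is not the metric behind the numbers reported here. The function name `word_accuracy` and the toy Portuguese sentences are made up for the example.

```python
def word_accuracy(predictions, references, markers="*#"):
    """Average fraction of predicted words that appear as whole words
    in the matching reference (hypothetical token-level variant)."""
    scores = []
    for pred, ref in zip(predictions, references):
        ref_words = set(ref.split())
        # Remove Apertium's * and # markers before comparing.
        pred_words = [w.strip(markers) for w in pred.split()]
        if not pred_words:
            # An empty prediction scores zero instead of dividing by zero.
            scores.append(0.0)
            continue
        hits = sum(1 for w in pred_words if w in ref_words)
        scores.append(hits / len(pred_words))
    return sum(scores) / len(scores) if scores else 0.0


# Toy sentences, invented for illustration only.
predictions = ["o *Senhor #falar", "ele foi a cidade"]
references = ["o Senhor falou", "ele foi para a cidade"]

score = word_accuracy(predictions, references)
print(round(score, 3))  # 0.833
```

The first sentence scores 2/3 ("falar" is not in the reference) and the second 4/4, giving an average of 5/6. Like the original script, this ignores word order and repeated words, so it is a rough lexical overlap measure rather than a full translation quality score.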
Accuracy
Initial RBMT accuracy (different from poster because of the new evaluation metric):
Final RBMT accuracy:
Initial NMT accuracy:
Final NMT accuracy:
Specs for rule-based
Number of...
Coverage
$ aq-covtest ../ling073-kgp-corpus/kgp.corpus.large.txt kgp.automorf.bin
Number of tokenised words in the corpus: 406362
Coverage: 92.93%
Top unknown words in the corpus:
716 Cristo
609 vég
549 venh
480 nỹtĩn
470 Senhor
428 nĩm
392 hẽn
385 kuprĩg
375 cidade
350 henh
340 Paulo
334 jyvẽn
331 pir
308 mũꞌ
307 nĩꞌ
307 Pedro
289 régre
271 vin
270 nĩnh
269 tũꞌ
Translation time: 1.422363042831421 seconds
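To read these numbers: with 92.93% coverage over 406,362 tokens, roughly 28,700 tokens still receive no analysis. The sketch below shows one naive way a coverage figure and a top-unknown-words list can be computed. It assumes coverage means "fraction of tokens that are in a known-word set"; it is not the implementation of aq-covtest, and `coverage_report` and the single-letter tokens are placeholders.

```python
from collections import Counter


def coverage_report(tokens, known_words, top_n=3):
    """Return (coverage percentage, most frequent unknown tokens).

    Hypothetical helper: a real analyser decides "known" by running
    the morphological transducer, not by a set lookup.
    """
    unknown = Counter(t for t in tokens if t not in known_words)
    total = len(tokens)
    covered = total - sum(unknown.values())
    pct = 100.0 * covered / total if total else 0.0
    return pct, unknown.most_common(top_n)


# Placeholder corpus: single letters stand in for real words.
tokens = "a b c a x y x a b x".split()
known_words = {"a", "b", "c"}

pct, top_unknown = coverage_report(tokens, known_words)
print(f"Coverage: {pct:.2f}%")
print(top_unknown)
```

Sorting the unknown tokens by frequency, as in the list above, shows which additions to the lexicon would raise coverage fastest; here many of the top entries are proper names from the New Testament text.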