Kaingang and Portuguese/Final Project
Revision as of 23:27, 14 May 2019
The project: RBMT or NMT?
Background
Kaingang is a language spoken in southern Brazil. It has roughly 20,000 speakers, many of whom are bilingual in Portuguese. Kaingang is one of the most well-documented indigenous languages of Brazil; however, for our project we were only able to use a Kaingang-Portuguese dictionary and a Kaingang New Testament Bible (the other resources were written in languages we didn't read).
Idea
We wanted to find out how well neural machine translation could work for a language without millions of parallel lines (in fact, we have roughly 10,000). We also wanted to compare it to our rule-based transducer, which we have continued to extend since the end of the semester.
Motivation
This is the only machine translation system between Kaingang and another language that we could find through easily accessible searching. The motivation behind our project was to spare the Kaingang people the choice between teaching Kaingang or Portuguese to later generations. We hoped that, if one of the translators ever reached state-of-the-art quality, anything requiring Portuguese could be accessed through the translator, which would be an incentive to keep the Kaingang language alive.
What we did
- Added many more features to and improved performance of rule-based transducer
- Trained a neural network with 75% of our parallel corpus and tested with 25% using OpenNMT
- Documented ambiguity in Kaingang
- Created a parallel Kaingang-Portuguese corpus covering the whole New Testament!
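The 75%/25% train/test split mentioned above can be sketched as follows. This is a minimal illustration, not our actual pipeline or part of OpenNMT: `split_corpus` is a hypothetical helper, and it assumes a simple seeded random split of the parallel lines.

```python
import random

def split_corpus(src_lines, tgt_lines, train_frac=0.75, seed=0):
    """Shuffle a parallel corpus (keeping line pairs aligned) and
    split it into train and test portions."""
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]

# Invented placeholder data standing in for the ~10,000 parallel lines.
src = [f"kgp sentence {i}" for i in range(10)]
tgt = [f"por sentence {i}" for i in range(10)]
train, test = split_corpus(src, tgt)
print(len(train), len(test))  # 7 3
```

Shuffling before splitting matters here because the corpus is ordered Bible text, so a contiguous split would test on books the model never saw in a very different register.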
Links to repositories
Kaingang-Portuguese translator
Evaluation Metric
Our new evaluation accounts for the fact that words in the Apertium translation output can carry # and * symbols. The accuracy script looks something like this:
correct_word_percentage = []
for i in range(len(true_translations)):
    correct_word_count = 0
    for word in prediction[i].split():
        if word.strip("\n").strip("*").strip("#") in true_translations[i]:
            correct_word_count += 1
    correct_word_percentage.append(correct_word_count / len(prediction[i].split()))
average_accuracy = sum(correct_word_percentage) / len(correct_word_percentage)
print(average_accuracy)
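As a quick sanity check, the same metric can be wrapped in a function and run on toy data. The sentences below are invented and `word_accuracy` is a hypothetical name; the logic mirrors the script above, stripping Apertium's `*` (unknown word) and `#` markers before checking membership in the reference.

```python
def word_accuracy(predictions, true_translations):
    """Average per-sentence fraction of predicted words found in the
    reference translation, after stripping '*' and '#' markers."""
    percentages = []
    for pred, ref in zip(predictions, true_translations):
        words = pred.split()
        hits = sum(1 for w in words
                   if w.strip("\n").strip("*").strip("#") in ref)
        percentages.append(hits / len(words))
    return sum(percentages) / len(percentages)

preds = ["*ele fala #bem", "a casa"]   # invented toy output
refs = ["ele fala bem", "a aldeia"]    # invented toy references
print(word_accuracy(preds, refs))      # 0.75
```

Note that the membership test is a substring check against the reference sentence, so short words can match inside longer ones; this keeps the sketch faithful to the script above rather than "fixing" its behaviour.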
Accuracy
Initial RBMT accuracy:
Final RBMT accuracy: 9.91%
Initial NMT accuracy: 21.4%
Final NMT accuracy: (currently training on new larger corpus, so no final accuracy yet)
Specs for rule-based
Number of...
Coverage
- Coverage over the New Testament:
$ aq-covtest ../ling073-kgp-corpus/kgp.corpus.large.txt kgp.automorf.bin
Number of tokenised words in the corpus: 406362
Coverage: 92.93%
Top unknown words in the corpus:
716 Cristo
609 vég
549 venh
480 nỹtĩn
470 Senhor
428 nĩm
392 hẽn
385 kuprĩg
375 cidade
350 henh
340 Paulo
334 jyvẽn
331 pir
308 mũꞌ
307 nĩꞌ
307 Pedro
289 régre
271 vin
270 nĩnh
269 tũꞌ
Translation time: 1.422363042831421 seconds
- Coverage over dictionary sentences:
$ aq-covtest ../ling073-kgp-por-corpus/kgp.sentences.txt kgp.automorf.bin
Number of tokenised words in the corpus: 32659
Coverage: 92.72%
Top unknown words in the corpus:
30 nĩm
27 vég
26 isóg
24 jagy
23 prũ
23 tãg
22 kyr
20 mỹnh
20 régre
18 pir
18 féj
17 jã
17 Kusa
17 nĩgé
16 kãtĩ
16 jamã
15 venh
15 gãnh
15 jakré
14 nug
Translation time: 0.1878523826599121 seconds
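For illustration, a coverage figure like the ones above could be computed roughly as follows. This is a sketch under the assumption that the analyser marks unknown surface forms with a leading `*`, as Apertium's morphological output does; the token list and the `coverage` helper are invented for the example.

```python
from collections import Counter

def coverage(analysed_tokens):
    """Return the fraction of tokens the analyser recognised and the
    most frequent unknown words (tokens carrying a leading '*')."""
    unknown = Counter(t.lstrip("*") for t in analysed_tokens
                      if t.startswith("*"))
    known = sum(1 for t in analysed_tokens if not t.startswith("*"))
    return known / len(analysed_tokens), unknown.most_common(3)

# Invented analyser output: 5 known tokens, 3 unknown.
tokens = ["ti", "*Cristo", "vỹ", "*vég", "tỹ", "*Cristo", "nĩ", "mỹ"]
cov, top = coverage(tokens)
print(f"{cov:.2%}", top)  # 62.50% [('Cristo', 2), ('vég', 1)]
```

The real numbers above come from aq-covtest over the full corpora; this sketch just shows why proper nouns like "Cristo" and "Paulo" dominate the unknown list, since they occur often but are absent from the dictionary.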