Kaingang and Portuguese/Final Project


The project: RBMT or NMT?

Background

Kaingang is a language spoken in southern Brazil. It has roughly 20,000 speakers, many of whom are bilingual in Portuguese. Kaingang is one of the most well-documented indigenous languages of Brazil; however, for our project we were only able to use a Kaingang-Portuguese dictionary and a Kaingang New Testament Bible (the other resources were written in languages we don't read).

Idea

We wanted to find out how well neural machine translation could work for a language pair without millions of parallel lines (in fact, we have roughly 10,000). We also wanted to compare it to our rule-based transducer, which we have continued adding to since the end of the semester.

Motivation

As far as we could find, this is the only machine translation system available between Kaingang and any other language. The motivation behind our project was to make it possible for the Kaingang people not to have to choose between teaching Kaingang and teaching Portuguese to later generations. We hoped that if one of the translators eventually became state-of-the-art, any need for Portuguese could be met through the translator, which would be an incentive to keep the Kaingang language alive.

What we did

  • Added many more features to the rule-based transducer and improved its performance
  • Trained a neural network with OpenNMT, using 75% of our parallel corpus for training and 25% for testing (see the data-split sketch after this list)
  • Documented ambiguity in Kaingang
  • Created a parallel Kaingang-Portuguese corpus of the whole New Testament!
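
A minimal sketch of the 75%/25% split mentioned above, assuming the parallel corpus is stored as two line-aligned files (the file names here are hypothetical):

import random

# Hypothetical file names; the two files are aligned line by line.
with open("kgp.txt", encoding="utf-8") as f:
    kgp_lines = f.read().splitlines()
with open("por.txt", encoding="utf-8") as f:
    por_lines = f.read().splitlines()

pairs = list(zip(kgp_lines, por_lines))
random.seed(0)                   # reproducible split
random.shuffle(pairs)
cut = int(len(pairs) * 0.75)     # 75% train, 25% test

for name, subset in [("train", pairs[:cut]), ("test", pairs[cut:])]:
    with open("src-" + name + ".txt", "w", encoding="utf-8") as f:
        f.write("\n".join(src for src, tgt in subset) + "\n")
    with open("tgt-" + name + ".txt", "w", encoding="utf-8") as f:
        f.write("\n".join(tgt for src, tgt in subset) + "\n")

The resulting src-*/tgt-* files then go through OpenNMT's usual preprocess, train, and translate steps.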

Links to repositories

Kaingang transducer

Kaingang-Portuguese translator

Evaluation Metric

Our new evaluation metric accounts for the fact that words in the Apertium translation output can carry # and * symbols (Apertium's markers for ungenerated and unknown words). For each predicted sentence we count the fraction of predicted tokens that, once those markers are stripped, appear in the corresponding reference sentence, and we then average over all sentences. The accuracy script looks something like this:

# true_translations: reference sentences; prediction: machine translation output
correct_word_percentage = []
for i in range(len(true_translations)):
   correct_word_count = 0
   for word in prediction[i].split():
      # strip the newline and the Apertium * and # markers before checking the reference
      if word.strip("\n*#") in true_translations[i]:
         correct_word_count += 1
   correct_word_percentage.append(correct_word_count / len(prediction[i].split()))
average_accuracy = sum(correct_word_percentage) / len(correct_word_percentage)
print(average_accuracy)
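
For context, a minimal sketch of how the two lists above might be loaded, assuming the reference translations and the system output are stored one sentence per line (file names are hypothetical):

with open("tgt-test.txt", encoding="utf-8") as f:
    true_translations = f.read().splitlines()   # reference Portuguese sentences
with open("pred.txt", encoding="utf-8") as f:
    prediction = f.read().splitlines()          # Apertium or OpenNMT output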

Accuracy

Initial RBMT accuracy:

Final RBMT accuracy: 0.09910740279632312


Initial NMT accuracy: 0.21369239840801263

Final NMT accuracy: (currently training on new larger corpus, so no final accuracy yet)

Specs for rule-based

Number of...

Coverage

  • Coverage over the New Testament:
$ aq-covtest ../ling073-kgp-corpus/kgp.corpus.large.txt kgp.automorf.bin
Number of tokenised words in the corpus: 406362
Coverage: 92.93%
Top unknown words in the corpus:
716	 Cristo
609	 vég
549	 venh
480	 nỹtĩn
470	 Senhor
428	 nĩm
392	 hẽn
385	 kuprĩg
375	 cidade
350	 henh
340	 Paulo
334	 jyvẽn
331	 pir
308	 mũꞌ
307	 nĩꞌ
307	 Pedro
289	 régre
271	 vin
270	 nĩnh
269	 tũꞌ
Translation time: 1.422363042831421 seconds
  • Coverage over dictionary sentences:
$ aq-covtest ../ling073-kgp-por-corpus/kgp.sentences.txt kgp.automorf.bin
Number of tokenised words in the corpus: 32659
Coverage: 92.72%
Top unknown words in the corpus:
30	 nĩm
27	 vég
26	 isóg
24	 jagy
23	 prũ
23	 tãg
22	 kyr
20	 mỹnh
20	 régre
18	 pir
18	 féj
17	 jã
17	 Kusa
17	 nĩgé
16	 kãtĩ
16	 jamã
15	 venh
15	 gãnh
15	 jakré
14	 nug
Translation time: 0.1878523826599121 seconds

Word Error Rate
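
The metric itself is standard: per sentence, word error rate is the token-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. A minimal sketch of that computation (assuming whitespace tokenisation; this is not the accuracy script above):

def word_error_rate(reference, hypothesis):
    # Token-level edit distance between reference and hypothesis, divided by reference length.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)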

Documented ambiguity in Kaingang