Kaingang and Portuguese/Final Project

From LING073
Revision as of 23:16, 14 May 2019 by Drosset1 (talk | contribs) (Specs for rule-based)

The project: RBMT or NMT?

Background

Kaingang is a language spoken in southern Brazil. It has roughly 20,000 speakers, many of whom are bilingual in Portuguese. Kaingang is one of the most well-documented indigenous languages of Brazil; however, for our project we were only able to use a Kaingang-Portuguese dictionary and a Kaingang New Testament, because the other resources were written in languages we could not read.

Idea

We wanted to find out how well neural machine translation (NMT) could work for a language that doesn't have millions of parallel lines (we have only ~10,000). We also wanted to compare it to our rule-based transducer, which we have kept extending since the end of the semester.

Motivation

As far as we could find by searching, this is the only machine translation system available between Kaingang and another language. Our motivation was to help the Kaingang people avoid having to choose between teaching Kaingang or Portuguese to later generations. We hoped that, if one of the translators eventually reached state-of-the-art quality, any need for Portuguese could be met through the translator, which would be an incentive to keep the Kaingang language alive.

Links to repositories

Kaingang transducer

Kaingang-Portuguese translator

Evaluation Metric

Our new evaluation accounts for the fact that words in the Apertium translation output can carry # and * symbols (Apertium marks unknown words with * and words it could not generate with #). The accuracy script looks something like this:

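# Assumes prediction and true_translations are parallel lists of sentence strings.
correct_word_percentage = []
for i in range(len(true_translations)):
    words = prediction[i].split()
    correct_word_count = 0
    for word in words:
        # Strip Apertium's * and # markers before comparing.
        if word.strip("*#") in true_translations[i]:
            correct_word_count += 1
    # Fraction of predicted words found in the reference (0 for an empty prediction).
    correct_word_percentage.append(correct_word_count / len(words) if words else 0.0)

average_accuracy = sum(correct_word_percentage) / len(correct_word_percentage)
print(average_accuracy)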
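
One caveat with the script above: if each entry of true_translations is a plain string, the `in` test matches substrings, so a short predicted word such as "a" counts as correct whenever that letter occurs anywhere in the reference. A minimal sketch of a stricter variant that compares whole tokens might look like the following. The function name `word_accuracy` and the toy sentences are hypothetical, chosen only for illustration, and this is not the script from the repository.

```python
def word_accuracy(predictions, references):
    """Average fraction of predicted words that occur as whole tokens in the reference."""
    scores = []
    for pred, ref in zip(predictions, references):
        ref_tokens = set(ref.split())
        # Strip Apertium's * and # markers from each predicted word.
        words = [w.strip("*#") for w in pred.split()]
        if not words:
            # An empty prediction gets a score of 0 rather than dividing by zero.
            scores.append(0.0)
            continue
        correct = sum(1 for w in words if w in ref_tokens)
        scores.append(correct / len(words))
    return sum(scores) / len(scores) if scores else 0.0


# Toy example (hypothetical sentences, not project data).
preds = ["*o homem #foi para casa"]
refs = ["o homem foi para a casa"]
print(word_accuracy(preds, refs))
```

This variant is still case-sensitive and does not strip punctuation, so it would likely need normalization before being used on real output.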

Specs for rule-based

Number of...

Coverage
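
Coverage for a rule-based analyser is usually reported as the fraction of corpus tokens that receive at least one analysis. A rough sketch of that calculation is below, assuming the analyser output is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token as `^surface/*surface$`. The function name `naive_coverage` and the sample string are hypothetical, and escaped `$` characters inside tokens are not handled.

```python
import re


def naive_coverage(analysed_text):
    """Fraction of lexical units that have at least one analysis."""
    # Each lexical unit sits between ^ and $ in the stream format.
    units = re.findall(r"\^([^$]*)\$", analysed_text)
    if not units:
        return 0.0
    # Unknown words are marked with * right after the slash.
    known = sum(1 for unit in units if "/*" not in unit)
    return known / len(units)


# Hypothetical analyser output: one analysed token, one unknown token.
sample = "^ti/ti<prn>$ ^xyz/*xyz$"
print(naive_coverage(sample))
```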

Word Error Rate
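
Word error rate is conventionally the word-level edit distance between the system output and the reference, divided by the number of reference words. A minimal sketch follows, assuming whitespace tokenization and that Apertium's * and # markers are stripped before comparison. Both are assumptions, and the function name `wer` is hypothetical.

```python
def wer(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = [w.strip("*#") for w in hypothesis.split()]
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Toy example (hypothetical sentences, not project data).
print(wer("o homem foi para casa", "*o homem #foi casa"))
```

Because the distance is divided by the reference length, the value can exceed 1 when the hypothesis is much longer than the reference.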