Difference between revisions of "Kaingang and Portuguese/Final Project"

From LING073

Revision as of 23:53, 14 May 2019

The project: RBMT or NMT?

Background

Kaingang is a language spoken in southern Brazil. It has roughly 20,000 speakers, many of whom are bilingual in Portuguese. Kaingang is one of the best-documented indigenous languages of Brazil; however, in our project we were only able to use a Kaingang-Portuguese dictionary and a Kaingang New Testament, because the other resources were written in languages we could not read.

Idea

We wanted to find out how well neural machine translation (NMT) could work for a language pair that doesn't have millions of parallel lines (we have roughly 10,000). We also wanted to compare it to our rule-based transducer, which we have kept extending since the end of the semester.

Motivation

As far as we could find by searching, this is the only machine translation system available between Kaingang and another language. Our motivation was to spare Kaingang speakers from having to choose between teaching Kaingang and teaching Portuguese to later generations. We hoped that, if one of the translators eventually reached state-of-the-art quality, any need for Portuguese could be met through the translator, which would be an incentive to keep the Kaingang language alive.

What we did

  • Added many features to the rule-based transducer (much larger coverage, new disambiguation rules, and 2 new structural transfer rules) and improved its performance
  • Trained a neural network on 75% of our parallel corpus and tested it on the remaining 25%, using OpenNMT
  • Documented ambiguity in Kaingang
  • Created a parallel Kaingang-Portuguese corpus for the whole New Testament
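
The 75%/25% split mentioned above can be sketched as follows. This is only an illustration: the function name `split_parallel`, the fixed seed, and the decision to shuffle before splitting are assumptions, not a record of how the corpus was actually divided. The important property is that source and target lines stay aligned.

```python
import random

def split_parallel(src_lines, tgt_lines, train_fraction=0.75, seed=0):
    """Split aligned sentence pairs into train and test portions."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    # Assumption: shuffle so that both portions cover the whole corpus.
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]

# Hypothetical toy data standing in for the ~10,000 aligned lines.
kgp = [f"kgp sentence {i}" for i in range(8)]
por = [f"por sentence {i}" for i in range(8)]
train, test = split_parallel(kgp, por)
print(len(train), len(test))  # 6 2
```

Each portion would then be written out as separate source and target files, one sentence per line, for the NMT toolkit to read.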

Links to repositories

Kaingang transducer

Kaingang-Portuguese translator

Evaluation Metric

Our new evaluation metric accounts for the fact that words in Apertium's translated sentences can carry # and * symbols. The accuracy script looks something like this:

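# true_translations and prediction are assumed to be parallel lists of sentence strings
correct_word_percentage = []
for i in range(len(true_translations)):
    correct_word_count = 0
    words = prediction[i].split()
    for word in words:
        # Strip Apertium's * and # markers before comparing.
        # If the reference is a string, `in` is a substring test, not a whole-word match.
        if word.strip("\n").strip("*").strip("#") in true_translations[i]:
            correct_word_count += 1
    # Fraction of predicted words found in the reference sentence
    correct_word_percentage.append(correct_word_count / len(words))
average_accuracy = sum(correct_word_percentage) / len(correct_word_percentage)
print(average_accuracy)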
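
To make the metric concrete, here is a small self-contained sketch of the same idea with a worked example. The function name `word_accuracy` and the sample sentences are made up for illustration, and the marker stripping is simplified to a single `strip("*#")`, so it may differ from the original script in edge cases.

```python
def word_accuracy(predictions, references):
    """Mean fraction of predicted words that occur in the matching reference."""
    scores = []
    for predicted, reference in zip(predictions, references):
        words = predicted.split()
        if not words:
            # Assumption: an empty prediction scores zero.
            scores.append(0.0)
            continue
        # Remove leading/trailing * and # markers, then do a substring test
        # against the reference sentence (as the script above appears to do).
        hits = sum(1 for w in words if w.strip("*#") in reference)
        scores.append(hits / len(words))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical Apertium-style output and reference translations.
predictions = ["*Cristo #dizer para eles", "ele *vég a casa"]
references = ["Cristo disse para eles", "ele viu a casa"]
print(word_accuracy(predictions, references))  # 0.75
```

In both sample sentences three of the four predicted words are found in the reference. Because the comparison is a substring test, a short word such as "a" also matches inside "casa", so the metric is somewhat generous.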

Accuracy

Initial RBMT accuracy (higher than on the poster because of the new evaluation metric): 6.78%

Final RBMT accuracy: 9.91%


Initial NMT accuracy: 21.4%

Final NMT accuracy: not yet available (currently training on the new, larger corpus)

Specs for rule-based

Coverage

  • Coverage over the New Testament:
$ aq-covtest ../ling073-kgp-corpus/kgp.corpus.large.txt kgp.automorf.bin
Number of tokenised words in the corpus: 406362
Coverage: 92.93%
Top unknown words in the corpus:
716	 Cristo
609	 vég
549	 venh
480	 nỹtĩn
470	 Senhor
428	 nĩm
392	 hẽn
385	 kuprĩg
375	 cidade
350	 henh
340	 Paulo
334	 jyvẽn
331	 pir
308	 mũꞌ
307	 nĩꞌ
307	 Pedro
289	 régre
271	 vin
270	 nĩnh
269	 tũꞌ
Translation time: 1.422363042831421 seconds
  • Coverage over dictionary sentences:
$ aq-covtest ../ling073-kgp-por-corpus/kgp.sentences.txt kgp.automorf.bin
Number of tokenised words in the corpus: 32659
Coverage: 92.72%
Top unknown words in the corpus:
30	 nĩm
27	 vég
26	 isóg
24	 jagy
23	 prũ
23	 tãg
22	 kyr
20	 mỹnh
20	 régre
18	 pir
18	 féj
17	 jã
17	 Kusa
17	 nĩgé
16	 kãtĩ
16	 jamã
15	 venh
15	 gãnh
15	 jakré
14	 nug
Translation time: 0.1878523826599121 seconds
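
The coverage figures above are the share of corpus tokens that the analyser recognises. As a rough illustration of that arithmetic (not the actual implementation of the coverage script), the sketch below assumes analyser output in the Apertium stream format, where an unknown token appears as ^surface/*surface$. The sample string and its tags are invented for illustration.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]+)/([^$]*)\$")

def coverage(analysed_text, top_n=3):
    """Return (percent of known tokens, most frequent unknown surface forms)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An analysis starting with * marks a word the analyser does not know.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    percent = 100.0 * known / total if total else 0.0
    return percent, unknown.most_common(top_n)

# Hypothetical analyser output; the tags are placeholders.
sample = "^ti/ti<prn>$ ^vég/*vég$ ^ag/ag<prn>/ag<det>$ ^vég/*vég$"
percent, top_unknown = coverage(sample)
print(f"Coverage: {percent:.2f}%")  # Coverage: 50.00%
print(top_unknown)                    # [('vég', 2)]
```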

Word Error Rate

Further work

Documented ambiguity in Kaingang

We have not yet written disambiguation rules for the vast majority of these, but we compiled a list of all ambiguous words in the translator.

ag - (pr pes) eles / (pr dem) os
ã -
ãjag -
e - (mod) muito / (v iv) fazer, causar
eg -
ẽmĩ - (n) pao de milho / (v iv) tatear, apalpar
fa - (n) perna, planta, amargo / (v tv sg) lavar roupa
fag - (pr pes) elas / (pr dem) as
fi - (pr pes) ela / (pr dem) a
ge - (mod) entao, tambem / (n) semelhante / (v iv) fazer igual
gé - (mod) tambem / (v tv pl) levar, carregar
gunhgunh - (n) picao preto / (v tv pl) fincar
han - (v iv sg) sarar, melhorar / (v tv sg) fazer
he - (ij) sim / (v tv) dizer
hur - (o) ja / (v iv sg) caimbra
hỹn - (o) de certo, provavelmente / (int) onde?
inh -
jãn -
jãvãnh - (mod) recusar, nao saber fazer / (v tv sg) esperar, procurar
jé - (su) futuro / (n) reza, hino
jẽn -
jo - (conj) entao / (cir) antes, na frente
jũn - (n) palmito / (v tv) embrabecer
kafãn -
ke - (n) sobra, resto / (v iv) futuro, fazer, dizer
ker - (o) cuidado / (n) presente de comida, cama de barbante de urtiga
kre - (n) balaio, coxa / (v tv pl) cortar, ceifar
kren - (mod) quase / (v iv sg) escapar / (v tv sg) perder
krenkren - (v iv pl) escapar / (v tv pl) perder
kur - (o) ligeiro, rapido! / (n) roupa, vestido
mé - (mod) fazer diariamente, ligeiro, gostar de fazer / (n) carneiro
mẽ - (mod) muito, ligeiro / (v tv sg) cheirar, escutar, sentir, tocar
mro - (v iv sg) tomar banho / (v tv sg) por de molho
mũ - (a) fazendo, narrativo, acao unica, consequencia / (v iv pl) andar
mỹ - (cir) para / (su) pergunta
mỹr - (o) verdadeiro, e certo / (conj) ?
ne - (su) originador / (v tv sg) enterrar / (n) enterrado / de / (int) o que?
nẽ - (a) sera que? / (v tv) cobrir encostado
nig - (n) lagoa, poco / (v tv sg) dar pontape
nĩ - (a) na situacao de, sentado, ter, ter a obrigacao, condicao / (n) carne / (v iv sg) sentar
nỹ - (a) deitado / (su) topico na pergunta / (n) mae, irma da mae / (v iv sg) deitar-se
nỹtĩ - (a) sendo, ter / (v iv pl) existir
pẽ - (mod) muito, o verdadeiro / (n) braco
ra - (cir) para, na direcao de, apesar de / (n) queixo / (a) faca ja!
rã - (cir) perto, por baixo de / (n) sol, maduro / (v iv sg) entrar, comecar, estar perto de
rán - (n) encosta, declive, barranco, subida, perau / (v tv sg) escrever
rãké - (v iv) chegando tarde / tarde ?
rãnhrãj - (n) trabalho / (v tv pl) trabalhar
re - (n) grama, campo, fora de casa / (v iv) encher-se de bebida / (v iv pl) descer
sa - (a) pendurado / (n) sal, cachoeira / (v iv) pendurado / (v tv) pendurar, pregar
se - (n) quati / (v tv sg) amarrar, prender / (ij) puxa!
tãnh - (n) morador, dono / (v iv) fazer forte, habituar-se
ti - (pr pes) ele / (pr dem) o
tógfĩn - (n) gaviao / (v tv sg) atar, amarrar
tỹ - (cir) com, por / (erg) / (ex) / (top)
vãn - (n) taquara / (v tv sg) carregar coisa comprida
vãnh - (mod) nao querer ou poder fazer algo / (n) capoeira, mato, plantas curtas
ve - (n) aparencia, natureza, irma de um homem, primeiro / (v iv sg) parecer ser / (v tv sg) ver, enxergar
vo - (v iv) barulho de andar no mato, coracao batendo / (n) macuco
vỹ - (su) topico / (o) sera que e? / (voc) homem! moco!
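
The entries above follow a regular pattern: a headword, " - ", then readings separated by " / ", each with a parenthesised category and comma-separated glosses. A minimal sketch of reading such a line into a structure that could later feed disambiguation rules is shown below. The function name `parse_entry` and the output shape are assumptions made for illustration, not part of the project's code.

```python
import re

# A reading looks like "(category) gloss, gloss"; the gloss part may be empty.
READING = re.compile(r"^\((?P<cat>[^)]*)\)\s*(?P<gloss>.*)$")

def parse_entry(line):
    """Parse 'word - (cat) gloss, gloss / (cat) gloss' into (word, readings)."""
    word, _, rest = line.partition(" -")
    readings = []
    for part in rest.split(" / "):
        part = part.strip()
        if not part:
            continue  # entry with no readings filled in yet
        match = READING.match(part)
        if match:
            glosses = [g.strip() for g in match.group("gloss").split(",") if g.strip()]
            readings.append((match.group("cat"), glosses))
        else:
            # No category given; keep the text as a gloss with unknown category.
            readings.append((None, [part]))
    return word.strip(), readings

print(parse_entry("ag - (pr pes) eles / (pr dem) os"))
# ('ag', [('pr pes', ['eles']), ('pr dem', ['os'])])
```

Entries that are still empty, such as "ã -", come back with an empty list of readings, which makes it easy to find the words that still need documenting.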