Wamesa and Tongan
wad → ton evaluation
In the tests file, 77.47% of the words translate correctly, and the unknown word rate is 28.57%. WER and PER are both 71.43% when the unknown word marks (stars) are removed, and both 100% when the stars are kept.
ton → wad evaluation
Percentage of unknown words in ton.sentences.txt (bilingual corpus): 7%
Percentage of stems translated correctly: 93%
Percentage of unknown words in test file: 0%
Results with unknown word marks (stars) removed: WER 100%, PER 100%; number of position-independent correct words: 1
Results with unknown word marks (stars) kept: WER 100%, PER 100%
- One known bug remains: siʻaku is translated as #<poss>, but it should produce no output, since Wamesa has no equivalent. The percentages above were computed with the incorrect #<poss> removed from the ton-wad.tests.txt file; with it left in, the percentage rises to 114.29%.
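A figure above 100%, such as the 114.29% in the note above, is possible for WER because the edit count is divided by the length of the reference, so extra output words can push it past 100%. The following is a minimal Python sketch, assuming WER is the word-level Levenshtein distance divided by the number of reference words; the evaluation script actually used may differ in details. The tokens are placeholders rather than real Wamesa or Tongan words, and the 7-word reference length is an assumption chosen only because 8/7 rounds to 114.29%.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Hypothetical data: a 7-word reference and an 8-word output with no matches.
reference = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]
output = ["#<poss>", "h1", "h2", "h3", "h4", "h5", "h6", "h7"]

print(round(wer(reference, output), 2))      # 114.29
print(round(wer(reference, output[1:]), 2))  # 100.0 once the extra token is dropped
```

In this toy case the extra #<poss> token counts as one insertion on top of seven substitutions, giving 8 edits against 7 reference words.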
There are about 110 stems in the transducer and about 160 in the translation system. Precision and recall for the wad.annotated.basic.txt corpus are both 100%. There are 940 words in the wad.large.txt corpus, and the coverage over it is 26.17%.
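As an illustration of what the coverage figure measures, here is a minimal Python sketch of naive coverage: the share of tokens that receive at least one analysis. It assumes analyser output in the Apertium stream format, where each token appears as ^surface/analysis$ and an unknown token is marked with a leading star in the analysis (^surface/*surface$). The sample string and its tokens are hypothetical.

```python
import re


def coverage(analysed_text):
    """Percentage of lexical units that have at least one analysis."""
    units = re.findall(r"\^(.*?)\$", analysed_text)
    if not units:
        return 0.0
    known = [u for u in units if "/*" not in u]
    return 100.0 * len(known) / len(units)


# Hypothetical analyser output: two known tokens, two unknown tokens.
sample = "^aa/aa<n>$ ^bb/*bb$ ^cc/cc<v>$ ^dd/*dd$"
print(coverage(sample))  # 50.0

# 26.17% of a 940-word corpus corresponds to about 246 analysed words.
print(round(100.0 * 246 / 940, 2))  # 26.17
```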
1. Precision and recall against the annotated.basic corpus.
Precision: 49.38272% Recall: 70.79646%
2. Coverage over the large corpus: 58.33%
3. The number of words in the large corpus: 1200625
4. The number of stems in the transducer: ~270
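The precision and recall figures in item 1 can be read as follows: precision is the share of analyses produced by the transducer that also appear in the annotated corpus, and recall is the share of annotated analyses that the transducer produces. A minimal Python sketch, assuming each word form maps to a set of analyses; the forms and tags below are placeholders, and the actual evaluation script may count differently.

```python
def precision_recall(gold, system):
    """Return (precision, recall) in percent over the forms in the gold standard."""
    true_pos = sum(len(gold[f] & system.get(f, set())) for f in gold)
    returned = sum(len(system.get(f, set())) for f in gold)
    expected = sum(len(analyses) for analyses in gold.values())
    precision = 100.0 * true_pos / returned if returned else 0.0
    recall = 100.0 * true_pos / expected if expected else 0.0
    return precision, recall


# Hypothetical gold-standard and transducer analyses for two forms.
gold = {"w1": {"A", "B"}, "w2": {"C"}}
system = {"w1": {"A"}, "w2": {"C", "D", "E"}}

p, r = precision_recall(gold, system)
print(round(p, 2), round(r, 2))  # 50.0 66.67
```

One set of counts consistent with the reported figures is 80 correct analyses out of 162 returned (49.38272%) and out of 113 expected (70.79646%); the actual counts are not stated here, so this is only an inference from the percentages.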
MT Pair: ton-wad
1. WER and PER over longer corpus.
- WER: 100.00%
- PER: 96.55%
2. The proportion of stems translated correctly in the longer corpus: 96.55%
Percentage of unknown words: 3.45%
3. Trimmed coverage over longer and large corpora.
- Large: 58.33%
- Longer: 94.49%
4. The number of tokens in longer and large corpora.
- Large: 1200625
- Longer: 127