Difference between revisions of "Wamesa and Tongan"

From LING073

Latest revision as of 16:41, 9 May 2017

This page is a resource for machine translation between Wamesa and Tongan.

Initial Evaluation

wad → ton evaluation

On the tests file, 77.47% of the words translate correctly, and the unknown word rate is 28.57%. The WER and PER are both 71.43% when *-marked (unknown) words are removed, and 100% when the * marks are left in.

ton → wad evaluation

Percentage of unknown words in ton.sentences.txt (bilingual corpus): 7%

Percentage of stems translated correctly: 93%

Percentage of unknown words in test file: 0%

Results with unknown word marks (stars) removed: WER 100%, PER 100%; number of position-independent correct words: 1

Results with unknown word marks (stars) not removed: WER 100%, PER 100%

  • There is still one bug that we are trying to fix: siʻaku is translated as #<poss>, but it should translate to nothing because Wamesa has no equivalent. The percentages above were computed with the incorrect #<poss> removed from the ton-wad.tests.txt file; with it left in, the percentage rises to 114.29%.
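An error rate above 100% is possible because WER divides the number of word-level edits (substitutions, deletions, insertions) by the length of the reference, so extra output words can push it past the reference length. The following is a minimal Python sketch of WER and PER, not the script used for the figures above. The function names and the toy sentences are hypothetical, and the PER formula is one common definition that may differ in detail from the evaluation tool used for this pair.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def position_independent_error_rate(reference: str, hypothesis: str) -> float:
    """PER under one common definition (an assumption here):
    (max(len(ref), len(hyp)) - bag-of-words matches) / len(ref)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Toy data (made up): a 7-word reference and two hypothetical outputs.
reference = "a b c d e f g"
five_wrong = "a b x y z w v"        # 5 substitutions
all_wrong_plus_one = "t u v w x y z q"  # 7 substitutions + 1 insertion

print(f"{word_error_rate(reference, five_wrong):.2%}")
print(f"{word_error_rate(reference, all_wrong_plus_one):.2%}")
```

With a 7-word reference, 8 edits give 8/7, which is how a single spurious token such as #<poss> could take an error rate from 100% to roughly 114%, assuming the reported percentage is a WER.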

Final Evaluation

Wamesa Transducer

There are about 110 stems in the transducer, and about 160 in the translation system. Precision and recall for the wad.annotated.basic.txt corpus are both 100%. There are 940 words in the wad.large.txt corpus, and the coverage over it is 26.17%.
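Coverage here is the share of corpus tokens that receive at least one analysis. As a rough illustration, the Python sketch below computes naive coverage from a list of analysed tokens, assuming (as in Apertium-style output) that unknown tokens are prefixed with *. The function name and the token list are hypothetical. The count of 246 known tokens is an inference from 26.17% of 940, not a number reported on this page.

```python
def naive_coverage(analysed_tokens):
    """Fraction of tokens that are not marked as unknown with a leading '*'."""
    if not analysed_tokens:
        raise ValueError("corpus must contain at least one token")
    known = sum(1 for token in analysed_tokens if not token.startswith("*"))
    return known / len(analysed_tokens)


# Hypothetical corpus: 246 analysed tokens and 694 unknown ones (940 total).
tokens = ["known"] * 246 + ["*unknown"] * 694
print(f"{naive_coverage(tokens):.2%}")
```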

Tongan Transducer

1. Precision and recall against the annotated.basic corpus:

Precision: 49.38272%
Recall: 70.79646%

2. Coverage over the large corpus: 58.33%
3. Number of words in the large corpus: 1200625
4. Number of stems in the transducer: ~270
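For reference, precision is the share of analyses produced by the transducer that also appear in the annotated corpus, and recall is the share of annotated analyses that the transducer produces. The Python sketch below shows the calculation over sets of (form, analysis) pairs. The function name and the placeholder pairs are hypothetical, and the counts in the example (80 shared analyses, 162 produced, 113 annotated) are only one combination that reproduces the percentages above; they are an inference, not figures from this page.

```python
def precision_recall(gold, predicted):
    """Return (precision, recall) for two collections of (form, analysis) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Placeholder data: 80 shared pairs, 33 found only in the annotated corpus,
# and 82 produced only by the transducer.
shared = {(f"form{i}", f"analysis{i}") for i in range(80)}
gold_only = {(f"form{i}", f"gold{i}") for i in range(33)}
predicted_only = {(f"form{i}", f"extra{i}") for i in range(82)}

precision, recall = precision_recall(shared | gold_only, shared | predicted_only)
print(f"Precision: {precision:.5%}")
print(f"Recall: {recall:.5%}")
```

A precision well below recall, as reported here, usually means the transducer produces many analyses that the annotated corpus does not list.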

MT Pair: ton-wad

1. WER and PER over longer corpus.

  • WER: 100.00%
  • PER: 96.55%

2. The proportion of stems translated correctly in the longer corpus: 96.55%
Percentage of unknown words: 3.45%
3. Trimmed coverage over longer and large corpora.

  • Large: 58.33%
  • Longer: 94.49%

4. The number of tokens in longer and large corpora.

  • Large: 1200625
  • Longer: 127

Documentation

Contrastive Grammar
Lexical Selection
Structural Transfer

Developed Resources

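ton→wad translator: https://github.swarthmore.edu/mcostag1/ling073-ton-wad.git
wad→ton translator: https://github.swarthmore.edu/twarner2/ling073-wad-ton.git
Bidirectional Wamesa/Tongan Translator: https://github.swarthmore.edu/mcostag1/ling073-ton-wad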