Wamesa and Tongan
Latest revision as of 16:41, 9 May 2017
This page is a resource for machine translation between Wamesa (https://wikis.swarthmore.edu/ling073/Wamesa) and Tongan (https://wikis.swarthmore.edu/ling073/Tongan).
Initial Evaluation
wad → ton evaluation
On the tests file, 77.47% of the words translate correctly, and the unknown word rate is 28.57%. The WER and PER are both 71.43% when starred (unknown) words are removed, and 100% when the stars are left in.
ton → wad evaluation
Percentage of unknown words in ton.sentences.txt (bilingual corpus): 7%
Percentage of stems translated correctly: 93%
Percentage of unknown words in test file: 0%
Results when unknown word marks (stars) are removed:
- WER: 100%
- PER: 100%
- Number of position-independent correct words: 1
Results when unknown word marks (stars) are not removed:
- WER: 100%
- PER: 100%
- One bug remains unfixed: siʻaku is translated as #<poss>, but it should translate to nothing, since Wamesa has no equivalent. The percentages above were computed with the incorrect #<poss> removed from the ton-wad.tests.txt file; with it left in, the percentage increases to 114.29%.
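To see how an error rate above 100% is possible, here is a minimal sketch of word error rate (WER) and position-independent error rate (PER) in Python. It is not the evaluation script used for the figures above: the function names and the example sentences are hypothetical placeholders, and the PER formula is just one common formulation. WER divides the word-level edit distance by the reference length, so an extra output word (such as a stray #<poss>) counts as an insertion and can push the score past 100%.

```python
from collections import Counter


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate (bag-of-words comparison)."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Placeholder tokens, not real Wamesa or Tongan data.
reference = "a b c d e f g"             # 7 reference words
hypothesis = "t u v w x y z #<poss>"    # 8 output words, none correct

print(f"WER: {wer(reference, hypothesis):.2%}")  # WER: 114.29%
print(f"PER: {per(reference, hypothesis):.2%}")  # PER: 114.29%
```

With a 7-word reference and 8 incorrect output words, both scores come to 8/7 = 114.29%. That the test file has a 7-word reference is an inference from the reported percentages, not something the page states.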
Final Evaluation
Wamesa Transducer
There are about 110 stems in the transducer, and about 160 in the translation system. Precision and recall for the wad.annotated.basic.txt corpus are both 100%. There are 940 words in the wad.large.txt corpus, and the coverage over it is 26.17%.
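As a rough illustration of what the coverage figure means, the sketch below computes naive coverage as the share of tokens that receive at least one analysis. The `analyse` callback and the toy lexicon are hypothetical stand-ins for the real transducer lookup; only the arithmetic is the point.

```python
def coverage(tokens, analyse):
    """Fraction of tokens for which `analyse` returns at least one analysis."""
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if analyse(tok))
    return known / len(tokens)


# Toy stand-in for a transducer lookup (hypothetical, not the real analyser).
toy_lexicon = {"x": ["x<n>"], "y": ["y<v>"]}
tokens = ["x", "y", "z", "x"]

print(f"{coverage(tokens, lambda t: toy_lexicon.get(t, [])):.2%}")  # 75.00%

# Sanity check on the reported figure: 26.17% of 940 words is about 246 words.
print(round(0.2617 * 940))   # 246
print(f"{246 / 940:.2%}")    # 26.17%
```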
Tongan Transducer
1. Precision and recall against the annotated.basic corpus:
- Precision: 49.38272%
- Recall: 70.79646%
2. Coverage over the large corpus: 58.33%
3. Number of words in the large corpus: 1200625
4. Number of stems in the transducer: about 270
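For readers unfamiliar with the two measures, here is a minimal sketch of precision and recall over sets of (surface form, analysis) pairs. The function name and toy data are hypothetical, and this is not the script that produced the figures above. Precision is the share of produced analyses that are correct; recall is the share of expected analyses that were produced.

```python
def precision_recall(gold, predicted):
    """gold and predicted are sets of (surface form, analysis) pairs."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Toy data with placeholder forms and tags (hypothetical).
gold = {("w1", "w1<n>"), ("w2", "w2<v>"), ("w3", "w3<n>")}
predicted = {("w1", "w1<n>"), ("w2", "w2<n>"), ("w2", "w2<v>"), ("w4", "w4<adj>")}

p, r = precision_recall(gold, predicted)
print(f"Precision: {p:.2%}  Recall: {r:.2%}")  # Precision: 50.00%  Recall: 66.67%

# One set of counts that reproduces the reported figures (an inference, not
# stated on the page): 80 correct analyses out of 162 produced and 113 expected.
print(f"{80 / 162:.5%}")  # 49.38272%
print(f"{80 / 113:.5%}")  # 70.79646%
```

A precision well below recall, as reported here, suggests the transducer is producing many analyses that the annotated corpus does not contain.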
MT Pair: ton-wad
1. WER and PER over longer corpus.
- WER: 100.00%
- PER: 96.55%
2. The proportion of stems translated correctly in the longer corpus: 96.55%
Percentage of unknown words in the longer corpus: 3.45%
3. Trimmed coverage over longer and large corpora.
- Large: 58.33%
- Longer: 94.49%
4. The number of tokens in longer and large corpora.
- Large: 1200625
- Longer: 127
Documentation
Contrastive Grammar
Lexical Selection
Structural Transfer
Developed Resources
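ton→wad translator: https://github.swarthmore.edu/mcostag1/ling073-ton-wad.git
wad→ton translator: https://github.swarthmore.edu/twarner2/ling073-wad-ton.git
Bidirectional Wamesa/Tongan Translator: https://github.swarthmore.edu/mcostag1/ling073-ton-wad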