Difference between revisions of "Danish and English/Lexical Transfer"

From LING073
Latest revision as of 12:26, 3 April 2018

Lexical Transfer

Contrastive Grammar

We ran apertium-eval-translator on our contrastive grammar. The results are as follows:

Test file: 'dan-eng.tests.txt'
Reference file '../ling073-dan-eng-corpus/eng.tests.txt'
Statistics about input files
-------------------------------------------------------
Number of words in reference: 58
Number of words in test: 58
Number of unknown words (marked with a star) in test: 58
Percentage of unknown words: 100.00 %
Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 0
Word error rate (WER): 0.00 %
Number of position-independent correct words: 58
Position-independent word error rate (PER): 0.00 %
Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 58
Word Error Rate (WER): 100.00 %
Number of position-independent correct words: 0
Position-independent word error rate (PER): 100.00 %
Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 58
Percentage of unknown words that were free rides: 100.00 %


The 100% WER reported when unknown-word marks are kept is due to the fact that every word in our contrastive grammar translates with a leading pound sign (#). When we scrape these pound signs out of the file, our WER drops to a more reasonable 32.76%. We're still not sure why it remains this high, but it is certainly better than 100%.
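As an illustration of the cleanup step described above, here is a minimal Python sketch that removes a pound sign only when it begins a word. It is not the script we actually used; the function name `strip_pound_marks` and the sample sentence are hypothetical.

```python
import re


def strip_pound_marks(text: str) -> str:
    """Remove a '#' that starts a whitespace-delimited word.

    A '#' in the middle of a word is left untouched.
    """
    # (?<!\S) succeeds only at the start of the text or right after whitespace.
    return re.sub(r"(?<!\S)#", "", text)


# Hypothetical sample line, not taken from our test file.
sample = "#the #dog sleeps"
print(strip_pound_marks(sample))
```

The cleaned text could then be written to a new file and passed to the evaluator in place of the original test file.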

Test file: 'dan-eng.reformatted.tests.txt'
Reference file '../ling073-dan-eng-corpus/eng.tests.txt'
Statistics about input files
-------------------------------------------------------
Number of words in reference: 58
Number of words in test: 54
Number of unknown words (marked with a star) in test: 0
Percentage of unknown words: 0.00 %
Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 19
Word error rate (WER): 32.76 %
Number of position-independent correct words: 40
Position-independent word error rate (PER): 31.03 %
Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 19
Word Error Rate (WER): 32.76 %
Number of position-independent correct words: 40
Position-independent word error rate (PER): 31.03 %
Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0%
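
The WER figures above are consistent with a word-level edit distance divided by the number of words in the reference (19 / 58 = 32.76%). The following Python sketch shows that calculation under this assumption. It is not the evaluator's own code, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER in percent, assuming normalization by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Made-up sentences: one substitution and one deletion.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))

# The reported figures: edit distance 19 over 58 reference words.
print(round(100.0 * 19 / 58, 2))
```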

Mini-Corpus

We also ran apertium-eval-translator on a mini-corpus of Cinderella translated into Danish. For this test, we got a WER of roughly 95% (94.73% with unknown-word marks removed, 94.95% with them kept), meaning that at least a few words were translated completely correctly.

Test file: 'dan-eng.cinderella.txt'
Reference file '../ling073-dan-eng-corpus/grimm/Cinderella/eng_Cinderella.txt'
Statistics about input files
-------------------------------------------------------
Number of words in reference: 2695
Number of words in test: 2031
Number of unknown words (marked with a star) in test: 1279
Percentage of unknown words: 62.97 %
Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 2553
Word error rate (WER): 94.73 %
Number of position-independent correct words: 282
Position-independent word error rate (PER): 89.54 %
Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 2559
Word Error Rate (WER): 94.95 %
Number of position-independent correct words: 234
Position-independent word error rate (PER): 91.32 %
Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 6
Percentage of unknown words that were free rides: 0.47 %
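
As a sanity check on the report above, the percentages can be recomputed from the raw counts. The sketch below assumes that WER and PER are both normalized by the number of reference words and that PER is the share of reference words without a position-independent match. These assumptions reproduce the reported figures here, but they are not taken from the evaluator's source.

```python
def pct(numerator: float, denominator: float) -> float:
    """Percentage rounded to two decimals, as in the report."""
    return round(100.0 * numerator / denominator, 2)


# Counts copied from the Cinderella report.
ref_words = 2695
test_words = 2031
unknown_words = 1279
free_rides = 6

print("Unknown words:", pct(unknown_words, test_words))            # share of test words
print("WER (stars removed):", pct(2553, ref_words))                # edit distance / reference
print("WER (stars kept):", pct(2559, ref_words))
print("PER (stars removed):", pct(ref_words - 282, ref_words))     # unmatched / reference
print("PER (stars kept):", pct(ref_words - 234, ref_words))
print("Free rides:", pct(free_rides, unknown_words))               # share of unknown words
```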