Dzongkha and English

From LING073
Revision as of 04:32, 25 May 2021 by Tdorji1 (talk | contribs) (Final Evaluation)


Resources for machine translation between Dzongkha and English.

External Resources

  • Find our dzo-eng machine translation repo here.
  • Find our dzo morphological transducer repo here.
  • Find the apertium-eng repo here.
  • Find our dzo-eng corpus here.

dzo → eng evaluation

Coverage Analysis

  • Monolingual transducer coverage: 36 / 36
  • Bilingual transducer coverage: 36 / 36
  • Total number of tokens in the dzo.sentences.txt file: 55
  • Total number of tokens not found in the dictionary (number of unknown words): 0
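
The coverage figures above are ratios of analysed tokens to all tokens. As a minimal sketch of how such a ratio can be computed, the snippet below counts lexical units in Apertium stream format and treats a unit as unknown when its first analysis starts with `*`, which is how Apertium analysers mark unrecognised words. The function name `coverage` and the sample string are made up for illustration; this is not necessarily the script used to produce the numbers on this page, and it ignores escaped characters in the stream.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def coverage(analysed_text):
    """Return (known, total) token counts for analyser output.

    Assumption: an unknown token looks like ^word/*word$, i.e. its
    first analysis starts with an asterisk.
    """
    known = 0
    total = 0
    for unit in LEXICAL_UNIT.findall(analysed_text):
        parts = unit.split("/")
        total += 1
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known, total


# Hypothetical analyser output (not taken from the dzo transducer).
sample = "^cat/cat<n><sg>$ ^xyzzy/*xyzzy$ ^is/be<vbser><pres><p3><sg>$"
known, total = coverage(sample)
print(f"{known} / {total}")
```

Feeding the analyser output for dzo.sentences.txt through a function like this would yield a known/total pair in the same shape as the figures reported above.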

Sentence Evaluation

Sentence 1

 Original Sentence: བྱི་ལི་དེ་སྒྲོམ་ནང་འདུག་
 Intended English Translation: The cat is in the box.
 Biltrans Output: ^བྱི་ལི་<n>/cat<n>$^དེ་<det>/the<det><def>$^སྒྲོམ་<n><loc>/box<n><loc>$^འདུག་<vbser>/is<vbser>$
 Translation Output: #cat #the #box #is

Sentence 2

 Original Sentence: ང་བཅས་ཀྱི་ཁྱིམ་སྦོམ་ཡོད་
 Intended English Translation: Our house is big.
 Biltrans Output: ^ང་བཅས་<prn><p1><pl><gen>/we<prn><p1><pl><gen>$^ཁྱིམ་<n>/house<n>$^སྦོམ་<adj>/big<adj>$^ཡོད་<vbser>/has<vbser>$
 Translation Output: #we #big #house #has

Sentence 3

 Original Sentence: ཁོ་དགེ་སློང་མེན་
 Intended English Translation: He is not a monk.
 Biltrans Output: ^ཁོ་<prn><p3><sg><m>/he<prn><p3><sg><m>$^དགེ་སློང་<n>/monk<n>$^ཨིན་<vbser><neg>/is<vbser><neg>$
 Translation Output: #he #monk #is not

Sentence 4

 Original Sentence: མོ་ལུ་སྲིངམོ་གཉིས་ཡོད་
 Intended English Translation: She has two younger sisters.
 Biltrans Output: ^མོ་<prn><p3><sg><f><dat>/she<prn><p3><sg><f><dat>$^སྲིངམོ་<n>/sister<n>$^གཉིས་<num>/two<num>$^ཡོད་<vbser>/has<vbser>$
 Translation Output: #she #sister #two #has

Sentence 5

 Original Sentence: མོ་དེ་ཁོ་བ་རྒས་
 Intended English Translation: She is older than him.
 Biltrans Output: ^མོ་<prn><p3><sg><f>/she<prn><p3><sg><f>$^དེ་<det>/the<det><def>$^ཁོ་<prn><p3><sg><m>/he<prn><p3><sg><m>$^རྒས་<adj><comp>/older<adj><comp>$
 Translation Output: #she #the #he #older 

Sentence 6

 Original Sentence: ད་ཉིམ་ཤར་དོ་
 Intended English Translation: The sun is shining now.
 Biltrans Output: ^ད་<adv>/now<adv>$^ཉིམ་<n>/sun<n>$^ཤར་<v><iv><pres><prog>/shine<vblex><pres><prog>$
 Translation Output: now #sun #shine

Sentence 7

 Original Sentence: ང་དུས་ཚོད་ཁར་ལྷོད་ཅི་
 Intended English Translation: I arrived on time.
 Biltrans Output: ^ང་<prn><p1><sg>/I<prn><p1><sg>$^དུས་ཚོད་<n><loc>/time<n><loc>$^ལྷོད་<v><iv>/arrive<vblex>$
 Translation Output: #I #time arrived

Sentence 8

 Original Sentence: སྲུང་འབྲི་མི་དག་པ་གཅིག་འདུག་
 Intended English Translation: There are a few story writers.
 Biltrans Output: ^སྲུང་<n>/story<n>$^འབྲི་<v><tv><vadj>/write<vblex><vadj>$^གཅིག་<num>/one<num>$^འདུག་<vbser>/is<vbser>$
 Translation Output: #story #write #one #is 

Sentence 9

 Original Sentence: རྒྱ་མཚོ་ལས་ནོར་བུ་འཐོབ་
 Intended English Translation: [We] get jewel from the ocean.
 Biltrans Output: ^རྒྱ་མཚོ་<n><abl>/ocean<n><abl>$^ནོར་བུ་<n>/pearl<n>$^འཐོབ་<v><iv>/get<vblex>$
 Translation Output: #ocean #pearl #get 

Sentence 10

 Original Sentence: བྱི་ཙི་ཚུ་ཟུང་གེ་
 Intended English Translation: Let's catch the rats.
 Biltrans Output: ^བྱི་ཙི་<n><adj><qnt>/rat<n><adj><qnt>$^ཟུང་<v><tv>/catch<vblex>$^གེ་<vaux><adh>/shall<vaux><adh>$
 Translation Output: #rat #catch #shall
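
The Biltrans lines above follow the Apertium stream format: each lexical unit is wrapped in `^...$`, the source reading and the target reading are separated by `/`, and tags appear in angle brackets. In the Translation Output lines, a leading `#` conventionally marks a lexical unit for which the generator could not produce a surface form. The following is a minimal sketch for inspecting such lines; the helper names `split_reading`, `parse_biltrans`, and `count_ungenerated` are hypothetical, and the parser assumes exactly one target reading per unit, as in all ten sentences above.

```python
import re

UNIT = re.compile(r"\^(.*?)\$")   # one ^...$ lexical unit
TAG = re.compile(r"<([^>]+)>")    # one <tag>


def split_reading(reading):
    """Split 'lemma<tag1><tag2>' into ('lemma', ['tag1', 'tag2'])."""
    lemma = reading.split("<", 1)[0]
    return lemma, TAG.findall(reading)


def parse_biltrans(line):
    """Return [(source_reading, target_reading), ...] for one biltrans line.

    Assumption: each unit has a single target reading after the first '/'.
    """
    pairs = []
    for unit in UNIT.findall(line):
        source, _, target = unit.partition("/")
        pairs.append((split_reading(source), split_reading(target)))
    return pairs


def count_ungenerated(translation):
    """Count output tokens that carry the '#' generation-failure mark."""
    return sum(1 for token in translation.split() if token.startswith("#"))


# Sentence 6 from the evaluation above.
biltrans_6 = ("^ད་<adv>/now<adv>$^ཉིམ་<n>/sun<n>$"
              "^ཤར་<v><iv><pres><prog>/shine<vblex><pres><prog>$")
output_6 = "now #sun #shine"

for (src_lemma, src_tags), (tgt_lemma, tgt_tags) in parse_biltrans(biltrans_6):
    print(src_lemma, src_tags, "->", tgt_lemma, tgt_tags)
print("ungenerated tokens:", count_ungenerated(output_6))
```

Comparing the tag lists on each side of a pair is a quick way to spot where transfer rules still need to rewrite tags (for example `<v><iv>` on the Dzongkha side versus `<vblex>` on the English side).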


  • Added 100 more stems to the bilingual dictionary:
    • Initial # of stems: 66
    • Final # of stems: 166
  • Added two more lexical selection rules.
  • Added two more structural transfer rules.

Final Evaluation

Dzongkha monolingual transducer:

  • Precision and recall against the annotated.basic corpus: not available (the evaluation produced an error).
  • Coverage over the large corpus: 3121/6718 (~0.465)
  • Size of the large corpus: 131018 characters
  • # of stems in the transducer: 255 lexicon entries

Dzo-Eng MT:

  • WER and PER over test phrases: 83.87 %
  • WER and PER over longer corpus: 97.44 %
  • Number of stems translated correctly in the longer corpus: 101
  • Trimmed coverage over longer corpus: 101/132 (~0.765)
  • Trimmed coverage over large corpus: 9698/21689 (~0.447)
  • # of tokens in the longer corpus: 132
  • # of tokens in the large corpus: 21689
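
For readers unfamiliar with the two error rates reported above, here is a minimal sketch of how they are commonly defined. WER (word error rate) is the word-level edit distance between the system output and the reference, divided by the reference length. PER (position-independent error rate) ignores word order and compares the two sentences as bags of words. The PER formula below is one common formulation and the function names are illustrative; this is not necessarily what the tool behind the figures above computes. The example pair is Sentence 1, lowercased, with punctuation and `#` marks stripped by hand.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance normalised by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate (one common formulation).

    Words are matched as multisets, so reordering is not penalised.
    """
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return errors / len(ref)


reference = "the cat is in the box".split()
hypothesis = "cat the box is".split()
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"PER: {per(reference, hypothesis):.2%}")
```

Because PER forgives reordering, it is never higher than WER for the same sentence pair; a large gap between the two suggests that word order, rather than lexical choice, is the main source of error.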