Difference between revisions of "Dhivehi and English"

From LING073

Revision as of 01:20, 8 May 2019

Resources for Machine Translation between Dhivehi and English

External Resources

  1. Dhivehi-English GitHub
  2. Dhivehi GitHub
  3. English GitHub
  4. Dhivehi-English Parallel Corpus GitHub
  5. Lexical Selection
  6. Contrastive Grammar

You can link to pages on the wiki like this: Lexical Selection. Jwashin1 (talk) 00:48, 9 April 2019 (EDT) (fixed)


Initial div → eng Evaluation

  • Coverage of Dhivehi Transducer on div.sentences.txt:
      1. Number of tokenised words in the corpus: 627
      2. Coverage: 23.92%
  • Coverage of Bilingual Transducer on div.sentences.txt:
      1. Number of tokenised words in the corpus: 630
      2. Coverage: 22.86%
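
The coverage figures above are the share of tokens that receive at least one analysis. The following is a minimal sketch of that calculation, assuming the analyser output uses the usual Apertium stream format, in which an unknown token is printed as ^form/*form$. The function name `coverage` and the English sample string are hypothetical and only illustrate the counting; escaped characters in the stream are not handled.

```python
import re

# One Apertium lexical unit: ^surface/analysis1/analysis2...$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed):
    """Return (total tokens, known tokens, percent known) for analysed text."""
    total = 0
    known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed):
        total += 1
        # An unknown token carries a single analysis that starts with '*'.
        if not analyses.startswith("*"):
            known += 1
    percent = 100.0 * known / total if total else 0.0
    return total, known, percent


# Hypothetical sample: four tokens, one of them unknown.
sample = "^cat/cat<n><sg>$ ^blorp/*blorp$ ^runs/run<vblex><pri><p3><sg>$ ^./.<sent>$"
total, known, percent = coverage(sample)
print(f"{total} tokens, {known} known, {percent:.2f}% coverage")
# prints: 4 tokens, 3 known, 75.00% coverage
```

Under this definition, the reported 23.92% of 627 tokens corresponds to roughly 150 analysed tokens, and 22.86% of 630 to roughly 144. These counts are back-computed from the percentages and are not stated in the source.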
Sentence: މަދްރަސާ ގެއަށްވުރެ މާ ބޮޑެވެ.
Intended English translation: The school is much bigger than the house.
Lexical transfer output: ^މަދްރަސާ<n><nhum><sg><def><dir>/school<n><sg><def><dir>$ ^ގެ<n><nhum><sg><def><dat>/house<n><sg><def><dat>$ ^ވުރެ<post>/compared to<post>$ ^މާ<adv>/much<adv>$ ^ބޮޑު<adj>/big<adj>$ ^އެވެ<mod>/<mod>$^.<sent>/.<sent>$
Full translation output: #school #house #compared to much #big .

Sentence: މިފަދަ އިސްލާޙެއް ވާނީ ޒަމާނީ ޑިމޮކްރެސީގެ ރޫޙާއި ޚިލާފު އިސްލާޙެއް.
Intended English translation: This type of reform would be contrary to the spirit of modern democracy.
Lexical transfer output: ^މިފަދަ<det><dem><deg1>/this type of<det><dem><deg1>$ ^އިސްލާޙް<n><nhum><sg><ind><dir>/reform<n><sg><ind><dir>$ ^ވާނީ<v><tv><act><fut><pot>/be<v><tv><act><fut><pot>$ ^ޒަމާނީ<adj>/modern<adj>$ ^ޑިމޮކްރެސީ<n><nhum><sg><def><gen>/democracy<n><sg><def><gen>$ ^ރޫޙާއި<n><nhum><sg><def><soc>/spirit<n><sg><def><soc>$ ^ޚިލާފު<post>/against<post>$ ^އިސްލާޙް<n><nhum><sg><ind><dir>/reform<n><sg><ind><dir>$^.<sent>/.<sent>$
Full translation output: #this type of #reform #be modern #democracy #spirit #against #reform.

Sentence: ކުރަން ޖެހޭ ހޭދަ އިތުރުވުން.
Intended English translation: We hit spending surplus.
Lexical transfer output: ^ކުރަ<v><tv><act><pres><p1>/do<v><tv><act><pres><p1>$ ^ޖެހޭ<v><tv><act><pprs>/hit<v><tv><act><pprs>$ ^ހޭދަ<adj>/spending<adj>$ ^އިތުރުވުން<n><nhum><sg><def><dir>/surplus<n><sg><def><dir>$^.<sent>/.<sent>$
Full translation output: #do #hit #spending #surplus.

Sentence: ތަރައްގީ އަކީ އާއިލާގެ އާމްދަނީ އިތުރުވުން.
Intended English translation: Improvement is family income increasing.
Lexical transfer output: ^ތަރައްގީ<n><nhum><sg><def><dir>/improvement<n><sg><def><dir>$ ^އަކީ<mod>/is<mod>$ ^އާއިލާ<n><nhum><sg><def><gen>/family<n><sg><def><gen>$ ^އާމްދަނީ<n><nhum><sg><def><dir>/income<n><sg><def><dir>$ ^އިތުރުވުން<v><iv><act><pres><p3>/increase<v><iv><act><pres><p3>$^.<sent>/.<sent>$
Full translation output: #improvement #is #family #income #increase.

Sentence: މުބާރާތް މިއަދު ނިމޭނެއެވެ.
Intended English translation: The competition will end today.
Lexical transfer output: ^މުބާރާތް<n><nhum><sg><def><dir>/competition<n><sg><def><dir>$ ^މިއަދު<adv>/today<adv>$ ^ނިމެ<v><iv><pass><fut><p3>/end<v><iv><pass><fut><p3>$ ^އެވެ<mod>/<mod>$^.<sent>/.<sent>$
Full translation output: #competition today #end .

Sentence: މާދަމާ އަލީ ފާހަނަ ސާފު ކުރާނެ.
Intended English translation: Tomorrow Ali will clean the bathroom.
Lexical transfer output: ^މާދަމާ<adv>/tomorrow<adv>$ ^އަލީ<np>/Ali<np>$ ^ފާހަނަ<n><nhum><sg><def><dir>/bathroom<n><sg><def><dir>$ ^ސާފު<adj>/clean<adj>$ ^ކުރަ<v><tv><act><fut><p3>/do<v><tv><act><fut><p3>$^.<sent>/.<sent>$
Full translation output: tomorrow #Ali #bathroom #clean #do.

Sentence: އަހަރެން ފެތެނީ!
Intended English translation: I am sinking!
Lexical transfer output: ^އަހަރެން<prn><pers><p1><sg><std><dir>/I<prn><pers><p1><sg><std><dir>$ ^ފެތެ<v><iv><pass><pprs>/sink<v><iv><pass><pprs>$^!<sent>/!<sent>$
Full translation output: #I #sink!

Sentence: މީތި ކިހާވަރަކަހް؟
Intended English translation: How much is this?
Lexical transfer output: ^މީތި<prn><dem><deg1><sg><dir>/this<prn><dem><deg1><sg><dir>/it<prn><dem><deg1><sg><dir>$ ^ކިހާވަރަކަހް<itg>/how much<itg>$^؟<sent>/؟<sent>$
Full translation output: #this #how much#؟

Sentence: އަހަރެން އެކަނި ދުކޮހް ލާ!
Intended English translation: Leave me alone!
Lexical transfer output: ^އަހަރެން<prn><pers><p1><sg><std><dir>/I<prn><pers><p1><sg><std><dir>$ ^އެކަނި<adj>/alone<adj>$ ^ދުކޮހް<v><tv><act><pres><p3><imp>/leave<v><tv><act><pres><p3><imp>$ ^ލާ<mod>/<mod>$^!<sent>/!<sent>$
Full translation output: #I alone #leave !

Sentence: ކޮބާ ފާހަނަ؟
Intended English translation: Where's the bathroom?
Lexical transfer output: ^ކޮބާ<itg>/where<itg>$ ^ފާހަނަ<n><nhum><sg><def><dir>/bathroom<n><sg><def><dir>$^؟<sent>/؟<sent>$
Full translation output: #where #bathroom#؟
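
In the lexical transfer output, each unit has the form ^source<tags>/target<tags>$, and a unit can list more than one target (as with "this" and "it" above). The following sketch splits such a stream into source and target readings. The helper names `split_reading` and `parse_biltrans` are hypothetical, and the parser ignores escaped characters, so it is an illustration rather than a full stream-format parser.

```python
import re

UNIT = re.compile(r"\^([^$]*)\$")  # everything between ^ and $
TAG = re.compile(r"<([^>]*)>")     # one <tag>


def split_reading(reading):
    """Split 'lemma<tag1><tag2>' into (lemma, [tag1, tag2])."""
    lemma = reading.split("<", 1)[0]
    return lemma, TAG.findall(reading)


def parse_biltrans(stream):
    """Yield (source, targets); each side is a (lemma, tags) pair."""
    for body in UNIT.findall(stream):
        source, *targets = body.split("/")
        yield split_reading(source), [split_reading(t) for t in targets]


# Lexical transfer output for "The competition will end today."
sample = (
    "^މުބާރާތް<n><nhum><sg><def><dir>/competition<n><sg><def><dir>$ "
    "^މިއަދު<adv>/today<adv>$ "
    "^ނިމެ<v><iv><pass><fut><p3>/end<v><iv><pass><fut><p3>$ "
    "^އެވެ<mod>/<mod>$^.<sent>/.<sent>$"
)

for (src_lemma, src_tags), targets in parse_biltrans(sample):
    print(src_lemma, src_tags, "->", targets)
```

Printing the units this way makes it easier to spot entries such as the <mod> particle, whose target side has tags but an empty lemma.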

Final div → eng Evaluation

Additions

  • Bilingual dictionary: added 101 stems
  • Expanded morphology:
      1. Digits
      2. Digits (ordinal form)
      3. Numbers (citation form)
      4. Numbers (combining form)
      5. Demonstrative determiners as prefixes to nouns
  • 2 structural transfer rules (number + noun and determiner + noun)

Monolingual Transducer

Precision and Recall

Coverage Over Large Corpus

  • Number of tokenised words in the corpus: 692580
  • Coverage: 28.52%
  • Number of words in large corpus: 603356
  • Number of Unique Entries in transducer: 320

Machine Translation

WER and PER Over Longer Corpus

Statistics about input files
-------------------------------------------------------
Number of words in reference: 565
Number of words in test: 618
Number of unknown words (marked with a star) in test: 23
Percentage of unknown words: 3.72 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 589
Word error rate (WER): 104.25 %
Number of position-independent correct words: 71
Position-independent word error rate (PER): 96.81 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 589
Word Error Rate (WER): 104.25 %
Number of position-independent correct words: 71
Position-independent word error rate (PER): 96.81 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %
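
The percentages above can be reproduced from the reported counts. The sketch below is an assumption about how the figures relate, not a copy of the evaluation script: WER is taken as the word-level edit distance divided by the number of reference words, and PER as the longer of the two texts minus the position-independent correct words, again divided by the number of reference words. These formulas match the reported numbers. The function names are hypothetical.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def pi_correct(ref, hyp):
    """Count words shared by both lists, ignoring position."""
    return sum((Counter(ref) & Counter(hyp)).values())


def percent(errors, reference_words):
    return 100.0 * errors / reference_words


# Counts taken from the report above.
reference_words, test_words = 565, 618
reported_distance, reported_correct = 589, 71

wer = percent(reported_distance, reference_words)
per = percent(max(reference_words, test_words) - reported_correct,
              reference_words)
print(f"WER: {wer:.2f} %")  # prints: WER: 104.25 %
print(f"PER: {per:.2f} %")  # prints: PER: 96.81 %
```

A WER above 100 % is possible because the test output (618 words) is longer than the reference (565 words), so the edit distance can exceed the reference length.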