Difference between revisions of "Dhivehi and English"
From LING073
Revision as of 01:20, 8 May 2019
Resources for Machine Translation between Dhivehi and English
External Resources
- Dhivehi-English GitHub
- Dhivehi GitHub
- English GitHub
- Dhivehi-English Parallel Corpus GitHub
- Lexical Selection
- Contrastive Grammar
You can do links to pages on the wiki like this: Lexical Selection —Jwashin1 (talk) 00:48, 9 April 2019 (EDT) (fixed)
Initial div → eng Evaluation
- Coverage of Dhivehi Transducer on div.sentences.txt:
  - Number of tokenised words in the corpus: 627
  - Coverage: 23.92%
- Coverage of Bilingual Transducer on div.sentences.txt:
  - Number of tokenised words in the corpus: 630
  - Coverage: 22.86%
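The coverage figures above are the share of tokens that receive at least one analysis. A minimal sketch of that calculation follows, assuming analyser output in the Apertium stream format, where each token appears as `^surface/analysis$` and an unrecognised token's analysis starts with `*`. The function name `coverage` and the sample string are hypothetical, and escaped characters in the stream are ignored for simplicity.

```python
import re

# One lexical unit in stream format: ^surface/analysis1/analysis2$
# (simplification: escaped '/' or '$' inside a unit are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (known, total, percent) for analyser output in stream format."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # An analysis beginning with '*' marks a form the analyser did not recognise.
        if not match.group(2).startswith("*"):
            known += 1
    percent = 100.0 * known / total if total else 0.0
    return known, total, percent


# Hypothetical sample, not taken from div.sentences.txt
sample = "^the/the<det><def><sp>$ ^blorf/*blorf$ ^house/house<n><sg>$ ^zzz/*zzz$"
known, total, percent = coverage(sample)
print(f"{known}/{total} tokens analysed, coverage {percent:.2f}%")
```

By the same arithmetic, 23.92% of 627 tokens corresponds to roughly 150 analysed tokens, and 22.86% of 630 to roughly 144.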
Sentence | Intended English Translation | Lexical Transfer Output | Full Translation Output |
---|---|---|---|
މަދްރަސާ ގެއަށްވުރެ މާ ބޮޑެވެ. | The school is much bigger than the house. | ^މަދްރަސާ<n><nhum><sg><def><dir>/school<n><sg><def><dir>$ ^ގެ<n><nhum><sg><def><dat>/house<n><sg><def><dat>$ ^ވުރެ<post>/compared to<post>$ ^މާ<adv>/much<adv>$ ^ބޮޑު<adj>/big<adj>$ ^އެވެ<mod>/<mod>$^.<sent>/.<sent>$ | #school #house #compared to much #big . |
މިފަދަ އިސްލާޙެއް ވާނީ ޒަމާނީ ޑިމޮކްރެސީގެ ރޫޙާއި ޚިލާފު އިސްލާޙެއް. | This type of reform would be contrary to the spirit of modern democracy. | ^މިފަދަ<det><dem><deg1>/this type of<det><dem><deg1>$ ^އިސްލާޙް<n><nhum><sg><ind><dir>/reform<n><sg><ind><dir>$ ^ވާނީ<v><tv><act><fut><pot>/be<v><tv><act><fut><pot>$ ^ޒަމާނީ<adj>/modern<adj>$ ^ޑިމޮކްރެސީ<n><nhum><sg><def><gen>/democracy<n><sg><def><gen>$ ^ރޫޙާއި<n><nhum><sg><def><soc>/spirit<n><sg><def><soc>$ ^ޚިލާފު<post>/against<post>$ ^އިސްލާޙް<n><nhum><sg><ind><dir>/reform<n><sg><ind><dir>$^.<sent>/.<sent>$ | #this type of #reform #be modern #democracy #spirit #against #reform. |
ކުރަން ޖެހޭ ހޭދަ އިތުރުވުން. | We hit spending surplus. | ^ކުރަ<v><tv><act><pres><p1>/do<v><tv><act><pres><p1>$ ^ޖެހޭ<v><tv><act><pprs>/hit<v><tv><act><pprs>$ ^ހޭދަ<adj>/spending<adj>$ ^އިތުރުވުން<n><nhum><sg><def><dir>/surplus<n><sg><def><dir>$^.<sent>/.<sent>$ | #do #hit #spending #surplus. |
ތަރައްގީ އަކީ އާއިލާގެ އާމްދަނީ އިތުރުވުން. | Improvement is family income increasing. | ^ތަރައްގީ<n><nhum><sg><def><dir>/improvement<n><sg><def><dir>$ ^އަކީ<mod>/is<mod>$ ^އާއިލާ<n><nhum><sg><def><gen>/family<n><sg><def><gen>$ ^އާމްދަނީ<n><nhum><sg><def><dir>/income<n><sg><def><dir>$ ^އިތުރުވުން<v><iv><act><pres><p3>/increase<v><iv><act><pres><p3>$^.<sent>/.<sent>$ | #improvement #is #family #income #increase. |
މުބާރާތް މިއަދު ނިމޭނެއެވެ. | The competition will end today. | ^މުބާރާތް<n><nhum><sg><def><dir>/competition<n><sg><def><dir>$ ^މިއަދު<adv>/today<adv>$ ^ނިމެ<v><iv><pass><fut><p3>/end<v><iv><pass><fut><p3>$ ^އެވެ<mod>/<mod>$^.<sent>/.<sent>$ | #competition today #end . |
މާދަމާ އަލީ ފާހަނަ ސާފު ކުރާނެ. | Tomorrow Ali will clean the bathroom. | ^މާދަމާ<adv>/tomorrow<adv>$ ^އަލީ<np>/Ali<np>$ ^ފާހަނަ<n><nhum><sg><def><dir>/bathroom<n><sg><def><dir>$ ^ސާފު<adj>/clean<adj>$ ^ކުރަ<v><tv><act><fut><p3>/do<v><tv><act><fut><p3>$^.<sent>/.<sent>$ | tomorrow #Ali #bathroom #clean #do. |
އަހަރެން ފެތެނީ! | I am sinking! | ^އަހަރެން<prn><pers><p1><sg><std><dir>/I<prn><pers><p1><sg><std><dir>$ ^ފެތެ<v><iv><pass><pprs>/sink<v><iv><pass><pprs>$^!<sent>/!<sent>$ | #I #sink! |
މީތި ކިހާވަރަކަހް؟ | How much is this? | ^މީތި<prn><dem><deg1><sg><dir>/this<prn><dem><deg1><sg><dir>/it<prn><dem><deg1><sg><dir>$ ^ކިހާވަރަކަހް<itg>/how much<itg>$^؟<sent>/؟<sent>$ | #this #how much#؟ |
އަހަރެން އެކަނި ދުކޮހް ލާ! | Leave me alone! | ^އަހަރެން<prn><pers><p1><sg><std><dir>/I<prn><pers><p1><sg><std><dir>$ ^އެކަނި<adj>/alone<adj>$ ^ދުކޮހް<v><tv><act><pres><p3><imp>/leave<v><tv><act><pres><p3><imp>$ ^ލާ<mod>/<mod>$^!<sent>/!<sent>$ | #I alone #leave ! |
ކޮބާ ފާހަނަ؟ | Where's the bathroom? | ^ކޮބާ<itg>/where<itg>$ ^ފާހަނަ<n><nhum><sg><def><dir>/bathroom<n><sg><def><dir>$^؟<sent>/؟<sent>$ | #where #bathroom#؟ |
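
Each unit in the Lexical Transfer Output column has the shape `^source/target$`, with more than one target when the bilingual dictionary offers alternatives. The sketch below shows one way to pull out the bare source and target lemmas for inspection. The helper name `lexical_pairs` is hypothetical, and the parsing is a simplification that does not handle escaped characters.

```python
import re

LEXICAL_UNIT = re.compile(r"\^(.*?)\$")  # text between ^ and the next $
TAG = re.compile(r"<[^>]*>")             # morphological tags such as <n> or <sg>


def lexical_pairs(biltrans_line):
    """Return a list of (source_lemma, [target_lemmas]) for one line of output."""
    pairs = []
    for unit in LEXICAL_UNIT.findall(biltrans_line):
        source, *targets = unit.split("/")
        lemma = TAG.sub("", source)
        translations = [TAG.sub("", target) for target in targets]
        pairs.append((lemma, translations))
    return pairs


# Last row of the table above
line = "^ކޮބާ<itg>/where<itg>$ ^ފާހަނަ<n><nhum><sg><def><dir>/bathroom<n><sg><def><dir>$^؟<sent>/؟<sent>$"
for lemma, translations in lexical_pairs(line):
    print(lemma, "->", translations)
```

A unit with an empty target side, such as `^އެވެ<mod>/<mod>$`, yields an empty string as its only translation.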
Final div → eng Evaluation
Additions
- Bilingual dictionary: added 101 stems
- Expanded Morphology:
  - Digits
  - Digits (ordinal form)
  - Numbers (citation form)
  - Numbers (combining form)
  - Demonstrative Determiners as prefix to nouns
- 2 structural transfer rules (number + noun and determiner + noun)
Monolingual Transducer
Precision and Recall
Coverage Over Large Corpus
- Number of tokenised words in the corpus: 692580
- Coverage: 28.52%
- Number of words in large corpus: 603356
- Number of Unique Entries in transducer: 320
Machine Translation
WER and PER Over Longer Corpus
Statistics about input files
-------------------------------------------------------
Number of words in reference: 565
Number of words in test: 618
Number of unknown words (marked with a star) in test: 23
Percentage of unknown words: 3.72 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 589
Word error rate (WER): 104.25 %
Number of position-independent correct words: 71
Position-independent word error rate (PER): 96.81 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 589
Word Error Rate (WER): 104.25 %
Number of position-independent correct words: 71
Position-independent word error rate (PER): 96.81 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %
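
WER is the word-level edit distance divided by the number of reference words, so it can exceed 100% when the test output is longer than the reference and shares few words with it, as happens here. The sketch below recomputes the reported percentages from the counts above and includes a small word-level Levenshtein function for illustration. The PER formula used, (max(reference, test) − correct) / reference, is an assumption inferred from the fact that it reproduces the reported 96.81%; it is not taken from the evaluation tool's source.

```python
def edit_distance(reference, hypothesis):
    """Word-level Levenshtein distance between two lists of tokens."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Counts reported in the evaluation output above
ref_words, test_words, distance, pi_correct = 565, 618, 589, 71

reported_wer = 100.0 * distance / ref_words
# Assumed PER formula, chosen because it matches the reported figure
reported_per = 100.0 * (max(ref_words, test_words) - pi_correct) / ref_words

print(f"WER = {reported_wer:.2f} %")  # 104.25 %
print(f"PER = {reported_per:.2f} %")  # 96.81 %
```

For example, `wer(["a", "b"], ["x", "y", "z"])` gives 150.0: two substitutions and one insertion against a two-word reference.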