Latin and Mandarin Chinese
[[Category:Sp18_TranslationPairs]]
Latest revision as of 18:20, 25 April 2018
Resources for machine translation between Latin and Mandarin Chinese
Lexical Selection Documentation
lat → zho evaluation
Sentences WER: 586.79%
Sentences PER: 586.79%
Tests WER: 609.9%
Tests PER: 609.9%
zho → lat evaluation
Sentences WER: 97.42%
Sentences PER: 93.55%
Tests WER: 95.45%
Tests PER: 89.39%
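The figures above can exceed 100% because both metrics are normalised by the length of the reference, not of the system output. The following is a minimal Python sketch of one common formulation of WER (word-level edit distance) and PER (bag-of-words mismatch). The function names are illustrative, and the evaluation script actually used for this pair may differ in tokenisation and in how it defines PER.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, normalised by reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (assumed formulation):
    word order is ignored and only the multiset of words is compared."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)
```

For example, a two-word reference against a three-word output with no words in common gives `wer(["a", "b"], ["x", "y", "z"]) == 150.0`, while a reordered but otherwise correct output such as `["rosam", "puella", "amat"]` against `["puella", "rosam", "amat"]` has a non-zero WER and a PER of 0.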
Proportion of stems translated correctly
lat.sentences.txt: 314:173
lat.tests.txt: 66:17
zho.sentences.txt: 53:0
zho.tests.txt: 10:0
Lexical Selection
One case of one-to-many mapping from Chinese to Latin involves the preposition 在. This is a general locative preposition in Chinese, and it can mean anything from 'in' to 'above' to 'outside' to 'in front of' to 'west of.' Usually, the object of 在 is either a place word (such as 'China' or 'the library') or a location word meaning 'inside', 'top', etc., which is itself modified by a noun, giving a phrase roughly equivalent to "在 the inside of the box" or "在 the top of the table." In contrast, Latin, much like English, has many different prepositions, each specifying a different locational relationship to its object.
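The choice of Latin preposition for 在 can be sketched in Python as a lookup on the location word that follows it. This is only an illustration of the idea, not the lexical selection rules actually written for this pair: the table `LOCATION_WORD_TO_PREPOSITION`, the default, and the function `select_preposition` are all hypothetical, and the input is assumed to be already segmented into words.

```python
# Hypothetical table: Chinese location word -> Latin preposition.
LOCATION_WORD_TO_PREPOSITION = {
    "里": "in",     # inside
    "上": "super",  # on top of, above
    "外": "extra",  # outside
    "前": "ante",   # in front of
}

# Assumed fallback when 在 takes a plain place word such as 'the library'.
DEFAULT_PREPOSITION = "in"


def select_preposition(tokens):
    """Return a Latin preposition for the first 在 in `tokens`.

    Scans the words after 在 for a known location word and falls back
    to the default if none is found. Returns None if there is no 在.
    """
    if "在" not in tokens:
        return None
    start = tokens.index("在") + 1
    for token in tokens[start:]:
        if token in LOCATION_WORD_TO_PREPOSITION:
            return LOCATION_WORD_TO_PREPOSITION[token]
    return DEFAULT_PREPOSITION
```

With this table, `select_preposition(["在", "桌子", "上"])` picks `super`, while `select_preposition(["在", "图书馆"])` falls back to `in`.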
Another case involves pronouns: each Latin pronoun has many different forms for different cases. In contrast, each Chinese pronoun has at most two forms: one singular and one plural.
One-to-many mappings from Latin to Chinese are harder to find, since Latin is the more heavily inflected language and tends to make more distinctions explicit. However, note that Chinese is a classifier language, meaning that a classifier or measure word is needed in order to couple a noun with a numeral or determiner. Different nouns take different measure words depending on the shape of the object they represent, among other factors. Latin has no measure words to choose between when pairing a noun with a numeral; the numeral, where it declines, only has to agree with the noun in gender, number, and case. In that sense, this might be considered a one-to-many mapping (and sometimes literally a "one" to many mapping).
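As a rough Python sketch of the classifier problem, generating a Chinese numeral phrase from a Latin one means looking up a measure word for the noun. The table `MEASURE_WORDS` and the function `numeral_phrase` are hypothetical and cover only a few nouns; they are not taken from this pair's dictionaries.

```python
# Hypothetical table: Chinese noun -> measure word.
MEASURE_WORDS = {
    "书": "本",    # books and other bound volumes
    "桌子": "张",  # flat objects such as tables
    "狗": "只",    # many animals
}

# 个 is the general-purpose classifier, used here as the fallback.
GENERAL_CLASSIFIER = "个"


def numeral_phrase(numeral, noun):
    """Build a numeral + classifier + noun phrase in Chinese."""
    classifier = MEASURE_WORDS.get(noun, GENERAL_CLASSIFIER)
    return numeral + classifier + noun
```

So Latin "tres libri" would need `numeral_phrase("三", "书")`, which inserts 本 between the numeral and the noun, whereas the Latin side has nothing corresponding to it.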
Chinese also has a number of sentence-final particles that can be used to "soften" or otherwise modulate a sentence. For example, the particle 呢 often marks a rhetorical question. In Latin, a rhetorical question would, I think, have the same form as a genuine question, making this another possible one-to-many mapping from Latin to Chinese.
Additions
zho
Added 2 new lexical selection rules
Added 2 new structural transfer rules
lat
Added 6 new paradigms
Added 5 new disambiguation rules
Final Evaluation
zho
Precision and recall
Totals: 135 forms, 356 tp, 0 fp, 0 tn, 0 fn
Precision: 100.00000%
Recall: 100.00000%
Coverage over large corpus: 100.00%
# of words in large corpus: 232,490
# of stems in transducer: ~8574
lat-zho
longer corpus
WER: 103.17%
PER: 87.53%
Number of words in reference: 441
Number of words in test: 516
Number of unknown words (marked with a star) in test: 139
Percentage of unknown words: 26.94%
Trimmed coverage: 100.00%
large corpus
Number of tokenised words in the corpus: 4997
Coverage: 40.70%
Top unknown words in the corpus:
95 e
65 f
64 The
58 c
50 A
45 d
42 C
41 h
37 of
36 a
36 b
31 g
29 M
22 B
21 AMC
21 ISBN
21 T
20 the
19 Dead
18 s
lat
Precision and recall
Totals: 100 forms, 217 tp, 21 fp, 0 tn, 31 fn
Precision: 91.17647%
Recall: 87.50000%
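The precision and recall figures follow directly from the totals. A short Python check, using the standard definitions (precision = tp / (tp + fp), recall = tp / (tp + fn)) and the counts reported above:

```python
# Counts from the "Totals" line for the Latin analyser.
tp, fp, fn = 217, 21, 31

precision = 100.0 * tp / (tp + fp)  # share of produced analyses that are correct
recall = 100.0 * tp / (tp + fn)     # share of expected analyses that were produced

print(f"Precision: {precision:.5f}%")
print(f"Recall: {recall:.5f}%")
```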
Coverage over large corpus: 51.04%
# of words in large corpus: 7,279,816
# of stems in transducer: 1,034
zho-lat
WER: 120.48%
PER: 117.82%
Proportion of stems translated correctly: 8.2%
Longer corpus trimmed coverage:
Number of tokenised words in the corpus: 404
Coverage: 100.00%
Large corpus trimmed coverage:
Number of tokenised words in the corpus: 54696
Coverage: 100.00%