Latin and Mandarin Chinese

From LING073

Latest revision as of 18:20, 25 April 2018

Resources for machine translation between Latin and Mandarin Chinese

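Lexical Selection Documentation (https://wikis.swarthmore.edu/ling073/Latin_and_Mandarin_Chinese/Lexical_selection)

Structural transfer (https://wikis.swarthmore.edu/ling073/Latin_and_Mandarin_Chinese/Structural_transfer)

Machine Translator (https://github.swarthmore.edu/cdalton2/ling073-lat-zho)

Parallel Corpus

Contrastive Grammar (https://wikis.swarthmore.edu/ling073/Latin_and_Mandarin_Chinese/Contrastive_Grammar)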

lat → zho evaluation

Sentences WER: 586.79%

Sentences PER: 586.79%

Tests WER: 609.9%

Tests PER: 609.9%

zho → lat evaluation

Sentences WER: 97.42%

Sentences PER: 93.55%

Tests WER: 95.45%

Tests PER: 89.39%
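
Error rates above 100% can look surprising, so here is a minimal Python sketch of how WER and PER are commonly defined. It is an illustration under assumptions, not the evaluation script used for the figures above: the function names are made up for this example, and PER in particular has more than one formulation in the literature. Both rates are divided by the reference length, which is why a translation with many extra words can score well over 100%.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent (assumes a non-empty reference)."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (word order is ignored)."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


# Same words in a different order: PER forgives the reordering, WER does not.
ref = "puella librum legit".split()
hyp = "librum puella".split()
print(round(wer(ref, hyp), 2), round(per(ref, hyp), 2))

# A one-word reference against a three-word output: both rates are 200%.
print(wer(["a"], ["a", "b", "c"]), per(["a"], ["a", "b", "c"]))
```

When every error is an extra word, the two measures coincide, as in the second call.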

Proportion of stems translated correctly

lat.sentences.txt: 314:173

lat.tests.txt: 66:17

zho.sentences.txt: 53:0

zho.tests.txt: 10:0

Lexical Selection

One case of one-to-many mapping from Chinese to Latin involves the preposition 在. This is a general locative preposition in Chinese, and it can express anything from 'in' to 'above' to 'outside' to 'in front of' to 'west of'. Usually, the object of 在 is either a place word (such as 'China' or 'the library') or a word that means 'in', 'above', etc., which is then modified by a noun, giving a phrase roughly equivalent to "在 the inside of the box" or "在 the top of the table". In contrast, Latin, much like English, has many different prepositions, each specifying a different locational relationship to its object.
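
To make the idea concrete, here is a small Python sketch of how such a choice could be made. It is only an illustration: the table, the fallback, and the function name are hypothetical and are not the rules in the project's lexical selection file. It assumes the object of 在 has already been split into tokens and that the location word, if there is one, comes last.

```python
# Hypothetical table: Chinese location word -> Latin preposition.
LOCATION_WORD_TO_LATIN = {
    "里": "in",     # inside
    "上": "super",  # top / above
    "外": "extra",  # outside
    "前": "ante",   # in front of
}

# Assumed fallback for plain place words such as 'China' or 'the library'.
DEFAULT_PREPOSITION = "in"


def select_preposition(zai_object_tokens):
    """Pick a Latin preposition for 在 from the last token of its object."""
    if not zai_object_tokens:
        return DEFAULT_PREPOSITION
    return LOCATION_WORD_TO_LATIN.get(zai_object_tokens[-1], DEFAULT_PREPOSITION)


print(select_preposition(["桌子", "上"]))  # 'table' + 'top'
print(select_preposition(["中国"]))        # plain place word
```

The first call picks `super` and the second falls back to `in`.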

Another case involves pronouns: each Latin pronoun has many different forms for different cases, whereas each Chinese pronoun has at most two forms, one singular and one plural. A single Chinese pronoun can therefore correspond to several Latin forms.

It's a bit more difficult to find one-to-many mappings from Latin to Chinese since, as the cases above suggest, Latin is usually the side that makes the finer distinctions. However, Chinese is a classifier language: you need a classifier (measure word) to combine a noun with a numeral or determiner. Different nouns take different measure words depending on the shape of the object they denote, among other factors. Latin has no such requirement; when pairing a noun with a numeral, you only have to make sure the numeral agrees with the noun in gender, number, and case. In that sense, this might be considered a one-to-many mapping (and sometimes literally a "one"-to-many mapping).
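
As a rough illustration of the measure-word choice, the following Python sketch builds a Chinese numeral phrase from a numeral and a noun. The noun list, the fallback to the general measure word 个, and the function name are all assumptions made for this example; they are not taken from the project's transfer rules.

```python
# Hypothetical noun -> measure word table.
MEASURE_WORDS = {
    "书": "本",  # book: bound volumes
    "纸": "张",  # paper: flat objects
    "猫": "只",  # cat: small animals
}

# Assumed fallback: the general-purpose measure word.
GENERAL_MEASURE_WORD = "个"


def numeral_phrase(numeral, noun):
    """Join numeral + measure word + noun, e.g. for Latin 'unus liber'."""
    return numeral + MEASURE_WORDS.get(noun, GENERAL_MEASURE_WORD) + noun


print(numeral_phrase("一", "书"))  # 'one book'
print(numeral_phrase("一", "人"))  # 'one person', falls back to 个
```

The single Latin numeral for 'one' ends up next to a different measure word in each phrase, which is the one-to-many behaviour described above.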

Also note that Chinese has a number of sentence-ending particles that can be used to "soften" or otherwise modulate a sentence. For example, the particle 呢 often marks a rhetorical question. In Latin, a rhetorical question would, I think, have the same form as a genuine question, making this another possible one-to-many mapping from Latin to Chinese.

Additions

zho

Added 2 new lexical selection rules

Added 2 new structural transfer rules

lat

Added 6 new paradigms

Added 5 new disambiguation rules

Final Evaluation

zho

Precision and recall

Totals: 135 forms, 356 tp, 0 fp, 0 tn, 0 fn

Precision: 100.00000%

Recall: 100.00000%

Coverage over large corpus: 100.00%

# of words in large corpus: 232,490

# of stems in transducer: ~8574

lat-zho

longer corpus

WER: 103.17%

PER: 87.53%

Number of words in reference: 441

Number of words in test: 516

Number of unknown words (marked with a star) in test: 139

Percentage of unknown words: 26.94 %

Trimmed coverage: 100.00%

large corpus

Number of tokenised words in the corpus: 4997

Coverage: 40.70%

Top unknown words in the corpus:

95 e

65 f

64 The

58 c

50 A

45 d

42 C

41 h

37 of

36 a

36 b

31 g

29 M

22 B

21 AMC

21 ISBN

21 T

20 the

19 Dead

18 s
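
The coverage figure and the unknown-word list above can be thought of along the lines of this Python sketch. It is a simplified stand-in for the actual coverage script, not a copy of it: it assumes the translator output is already tokenised and that unknown words carry a leading star, as noted for the longer corpus. The function name and the sample tokens are made up.

```python
from collections import Counter


def coverage_report(tokens, top_n=3):
    """Return (coverage in percent, most frequent unknown words).

    Assumes a non-empty token list in which unknown words start with '*'.
    """
    unknown = [t[1:] for t in tokens if t.startswith("*")]
    coverage = 100.0 * (len(tokens) - len(unknown)) / len(tokens)
    return coverage, Counter(unknown).most_common(top_n)


sample = ["*The", "puella", "*of", "*The", "librum"]
coverage, top_unknown = coverage_report(sample)
print(coverage)     # 2 known tokens out of 5
print(top_unknown)  # unknown words, most frequent first
```

The same ratio accounts for the longer-corpus figure: 139 unknown words out of 516 is 26.94%. The many single letters and English words in the list suggest that much of the large corpus is not running Latin text.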

lat

Precision and recall

Totals: 100 forms, 217 tp, 21 fp, 0 tn, 31 fn

Precision: 91.17647%

Recall: 87.50000%
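
The precision and recall figures follow directly from the true positive, false positive, and false negative counts. A minimal Python sketch using the standard definitions (the function name is just for this example):

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) in percent from raw counts."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Counts from the lat totals line above.
p, r = precision_recall(tp=217, fp=21, fn=31)
print(f"Precision: {p:.5f}%")
print(f"Recall: {r:.5f}%")
```

With the zho counts (356 tp, 0 fp, 0 fn) the same function gives 100% for both.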

Coverage over large corpus: 51.04%

# of words in large corpus: 7,279,816

# of stems in transducer: 1,034

zho-lat

WER: 120.48%

PER: 117.82%

Proportion of stems translated correctly: 8.2%

Longer corpus trimmed coverage:

Number of tokenised words in the corpus: 404

Coverage: 100.00%

Large corpus trimmed coverage:

Number of tokenised words in the corpus: 54696

Coverage: 100.00%