Spring 2018/Contrastive grammar

From LING073

Machine Translation


Statistical Machine Translation (SMT) is the approach used in most of the dominant MT systems.

SMT uses parallel text in two languages (e.g., translations of the same material into two different languages) and algorithms that attempt to find correspondences between the texts. The resulting model can then predict to some extent text in one language when it encounters previously unseen text in the other language.
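
As a rough illustration of what "finding correspondences" can mean, the sketch below counts how often words co-occur across sentence pairs and picks the most frequent partner. This is a toy, not how production SMT systems are built; the three sentence pairs and the function names `cooccurrence_counts` and `best_guess` are invented for illustration.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus (invented for illustration): French / Spanish sentence pairs.
PARALLEL = [
    ("je mange", "yo como"),
    ("je dors", "yo duermo"),
    ("tu manges", "tú comes"),
]

def cooccurrence_counts(pairs):
    """Count how often each source word shares a sentence pair with each target word."""
    counts = Counter()
    for src, tgt in pairs:
        for s, t in product(set(src.split()), set(tgt.split())):
            counts[(s, t)] += 1
    return counts

def best_guess(word, counts):
    """Return the target word that co-occurs most often with `word`, or None if unseen."""
    candidates = {t: c for (s, t), c in counts.items() if s == word}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

counts = cooccurrence_counts(PARALLEL)
print(best_guess("je", counts))    # "je" appears with "yo" in two pairs, so: yo
print(best_guess("chat", counts))  # never seen in the corpus, so: None
```

With only three sentence pairs most words are tied between several candidates, which is one way to see why this kind of model needs a lot of parallel text.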

SMT requires large amounts of parallel text to work well, and fairly naïve models do not do well with morphologically rich languages.

  1. Can you think why this might be?


Rule-Based Machine Translation (RBMT) is of marginal popularity these days.

RBMT requires manual encoding of linguistic generalisations (e.g., hand-writing dictionaries, writing rules to change word order and grammatical properties, etc.), and does not require much parallel text for development. (It's good to have a certain amount of parallel text for evaluation, but you need far less of it than SMT needs for training.)

RBMT can leverage good morphological models of morphologically rich languages, and hence can perform well for morphologically rich languages. RBMT systems are much easier to develop for languages with similar structure (like closely-related languages or languages that are part of the same linguistic area), and are much harder to develop for dissimilar languages.


  1. What are the main advantages of SMT over RBMT?
  2. What are the main disadvantages of SMT as compared to RBMT?
  3. What are the main advantages of RBMT over SMT?
  4. What are the main disadvantages of RBMT as compared to SMT?

Other questions

  1. Why do you think we're using RBMT in this class?
  2. Why might we want to develop an MT system between two closely related languages?

How Apertium works

Apertium is an RBMT system that uses a pipeline consisting of a series of tools, several of them optional or interchangeable.

What constitutes a contrastive grammar for RBMT

  • Start by looking at material in the source language, i.e., the language you're translating from into your language.
  • For a given way of expressing the same thing (a translation, or equivalent sentence), find differences in the parse, ignoring the content of the lexical items.
  • I.e., make sure the full version of the sentence in each language parses correctly (and has correct disambiguation), and then compare the parses side-by-side.
  • You'll want to identify any differences in word order, tags used, etc.
    • An example from a pair of similar languages might be:
      "Je le mangerai": ^je/je<prn><pers><p1><sg><nom>$ ^le/le<prn><pers><p3><m><acc>$ ^mangerai/manger<v><tv><fut><p1><sg>$
      "Yo lo comeré": ^yo/yo<prn><pers><p1><sg><nom>$ ^lo/lo<prn><pers><p3><m><acc>$ ^comeré/comer<v><tv><fut><p1><sg>$
    • An example from a pair of rather different languages might be:
      "Yiyemeyeceğim": ^yiyemeyeceğim/ye<v><tv><abil><neg><fut>+i<cop><aor><p1><sg>$
      "I won't be able to eat it": ^I/I<prn><pers><p1><sg><subj>$ ^won't/will<vaux>+not<adv>$ ^be/be<v><iv>$ ^able/able<adj>$ ^to eat/eat<v><tv><inf>$ ^it/it<prn><pers><p3><nt><obj>$
  • In the above example of similar languages, there are no differences in tags, only in stems. Here there is nothing to contrast. In the above example of different languages, there is quite a bit that can be contrasted—but almost too much to break down into individual points.
  • An example of languages with some minor differences might be:
    • "Je l'ai vu": ^je/je<prn><pers><p1><sg><nom>$ ^l'ai/le<prn><pers><p3><m><acc>+avoir<vaux><pres><p1><sg>$ ^vu/voir<v><tv><prc_past>$
      "Yo lo vi": ^yo/yo<prn><pers><p1><sg><nom>$ ^lo/lo<prn><pers><p3><m><acc>$ ^vi/ver<v><tv><ifi><p1><sg>$
    • Here the difference is that one language uses an auxiliary with a participle where the other language uses a simple verb form.
  • Note that the following does not constitute a contrastive difference:
    • "Je mange la glace": ^je/je<prn><pers><p1><sg><nom>$ ^mange/manger<v><tv><pres><p1><sg>$ ^la/le<det><def><f>$ ^glace/glace<n><f>$
      "Yo como el hielo": ^yo/yo<prn><pers><p1><sg><nom>$ ^como/comer<v><tv><pres><p1><sg>$ ^el/el<det><def><m>$ ^hielo/hielo<n><m>$
    • This is still useful to note, as it can still be implemented as a transfer rule.
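
The comparison described above can be sketched in Python: strip the surface forms and lemmas from each parse and compare what is left. This is a minimal sketch rather than part of Apertium; the names `tag_skeleton` and `contrast` are hypothetical, and it assumes each unit has a single (disambiguated) analysis, that lemmas contain no `+`, and that the two parses can be compared position by position.

```python
import re
from itertools import zip_longest

# One lexical unit in Apertium stream format: ^surface/analysis$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")
TAG = re.compile(r"<([^<>]+)>")

def tag_skeleton(stream):
    """Return one tuple of tags per lexical unit, dropping surface forms and lemmas.
    Sub-readings joined with '+' stay in the same tuple, separated by a '+' marker."""
    skeleton = []
    for _surface, analysis in LEXICAL_UNIT.findall(stream):
        tags = []
        for i, part in enumerate(analysis.split("+")):
            if i > 0:
                tags.append("+")
            tags.extend(TAG.findall(part))
        skeleton.append(tuple(tags))
    return skeleton

def contrast(stream_a, stream_b):
    """List (position, tags_a, tags_b) wherever the two tag skeletons differ."""
    pairs = zip_longest(tag_skeleton(stream_a), tag_skeleton(stream_b))
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]

FRA_FUT = "^je/je<prn><pers><p1><sg><nom>$ ^le/le<prn><pers><p3><m><acc>$ ^mangerai/manger<v><tv><fut><p1><sg>$"
SPA_FUT = "^yo/yo<prn><pers><p1><sg><nom>$ ^lo/lo<prn><pers><p3><m><acc>$ ^comeré/comer<v><tv><fut><p1><sg>$"
FRA_PAST = "^je/je<prn><pers><p1><sg><nom>$ ^l'ai/le<prn><pers><p3><m><acc>+avoir<vaux><pres><p1><sg>$ ^vu/voir<v><tv><prc_past>$"
SPA_PAST = "^yo/yo<prn><pers><p1><sg><nom>$ ^lo/lo<prn><pers><p3><m><acc>$ ^vi/ver<v><tv><ifi><p1><sg>$"

print(contrast(FRA_FUT, SPA_FUT))                        # → []
print([i for i, _, _ in contrast(FRA_PAST, SPA_PAST)])   # → [1, 2]
```

An empty list means there is nothing to contrast, as in the future-tense pair. The same check would also flag the `<f>`/`<m>` mismatch in "Je mange la glace" / "Yo como el hielo", so deciding what counts as a grammatical contrast still takes human judgement.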

The examples above using templates

  • (fra) Je le mangerai → (spa) Yo lo comeré ("I'll eat it.")
    (fra) je<prn><pers><p1><sg><nom> le<prn><pers><p3><m><acc> manger<v><tv><fut><p1><sg> → (spa) yo<prn><pers><p1><sg><nom> lo<prn><pers><p3><m><acc> comer<v><tv><fut><p1><sg>
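
The analysis-only lines shown here can be derived from a stream-format parse by dropping each surface form. The following is a small sketch under the assumption that every lexical unit carries exactly one analysis; `analyses_only` is a hypothetical helper name, not an Apertium tool or a wiki template.

```python
import re

# Capture only the analysis part of each ^surface/analysis$ unit.
LEXICAL_UNIT = re.compile(r"\^[^/$]*/([^$]*)\$")

def analyses_only(stream):
    """Drop the ^surface/ prefix and the trailing $ from each lexical unit."""
    return " ".join(LEXICAL_UNIT.findall(stream))

FRA = "^je/je<prn><pers><p1><sg><nom>$ ^le/le<prn><pers><p3><m><acc>$ ^mangerai/manger<v><tv><fut><p1><sg>$"
print(analyses_only(FRA))
# → je<prn><pers><p1><sg><nom> le<prn><pers><p3><m><acc> manger<v><tv><fut><p1><sg>
```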

The assignment

This assignment is due at the end of the 8th week of class (this semester: Friday, March 23rd, 2018 at 23:59, i.e., by midnight).

Getting started

  1. Figure out who and what languages you'd like to work with on MT (source and destination languages). You can choose to work in the same or different groups, and to work with the language you're already working on or with a different language. This is an opportunity to regroup, especially if you don't enjoy the language you're working on, or you're having trouble working in your current group, etc.
    • The simplest case is that the source language is the language you've been working on this semester and the destination language is English or another language you know well (are able to easily make sense of text in the language) and that has an open source transducer (e.g., in Apertium).
    • You may also choose to do MT between your language and:
      • a closely related or structurally extremely similar language (e.g., languages that mark similar things on the same parts of speech and/or that have similar word order)
      • a language that is spoken in close proximity to your language or that shares similar cultural influences, even if unrelated (harder)
    • You may work in groups of 2 or 3, or in some rare cases I may allow groups of 4.
      • In groups where both languages have been worked on up to this point in the course, each language should be both a source and destination of translation.
      • Otherwise, the destination language should be a language you're comfortable with.
  2. Get additional repositories set up:
    • If you are working with someone who's done another language in this class:
      • Share the corpus, keyboard, and transducer repos with each other. I.e., give each other read access (not necessarily write access).
      • Fork the repos of the language you're translating from. Do this through the github interface.
      • Clone the forked repos locally (in your ~/ling073 directory).
      • You'll want to look at the corpus repo for this assignment, and you'll want the other stuff around for later. We'll also talk more about forking, pull requests, etc. later.
    • If you're working with another language that you know:
      • Make sure you can type in the language
      • Fork the repository (on github.com) for that language from Apertium and make sure everyone in your group has write access to the fork.
      • Clone the forked repo locally (in your ~/ling073 directory).
  3. Create a page on the wiki named "LanguageX and LanguageY" (either order, and just one page).
    • Add the page to the category Category:Sp19_TranslationPairs.
    • Put a note at the top along the lines of "Resources for machine translation between LanguageX and LanguageY", where the language names link back to your pages on them.
    • Add a link on each of the language pages to this new page, making a similar note as above.

Parallel corpus

  1. Create a repository on github named ling073-xyz-abc-corpus in the course github group. Make sure both/all collaborators have full access to it. Put a link to it on the wiki page in a section titled "Developed resources".
  2. Construct a parallel corpus of at least 500 characters in one of the languages (or 250 characters if it uses a syllabic writing system). If you're doing translation to English, double this number (since you should already have some parallel text prepared).
    • The corpus will ideally consist of sentences with the same meaning. You can extract sentences from translations of the same text (e.g., bible translations or the Universal Declaration of Human Rights), or use phrasebook examples that are similar.
    • You can also try to use example sentences from your sources to try to construct grammatical sentences with similar meaning, but make note of this.
    • In extreme cases, you may include short phrases with the same meaning, but please speak with the professor about this as soon as possible.
  3. Populate two files xyz.sentences.txt and abc.sentences.txt with the parallel text, one sentence per line. I recommend adding eng.sentences.txt to keep track of what the sentences mean if you're not working with English.
  4. Include LICENSE, AUTHORS, and MANIFEST files as before (the latter to note origin and licensing of groups of sentences and any other notes).
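
A quick sanity check on the sentence files can catch misaligned lines before you commit. The sketch below is only an illustration: `read_sentences` and `check_parallel` are hypothetical helper names, it assumes UTF-8 files, and it counts every character except line breaks, which is one possible reading of the character minimum.

```python
from pathlib import Path

# Threshold from the assignment; adjust to 250 for a syllabic script,
# or double it when translating to English.
MIN_CHARS = 500

def read_sentences(path):
    """Read a one-sentence-per-line file as a list of lines (without line breaks)."""
    return Path(path).read_text(encoding="utf-8").splitlines()

def check_parallel(paths, min_chars=MIN_CHARS):
    """Return a list of problems found; an empty list means the basic checks passed."""
    corpora = {str(p): read_sentences(p) for p in paths}
    problems = []
    # Parallel files must have the same number of lines to stay aligned.
    if len({len(lines) for lines in corpora.values()}) > 1:
        sizes = ", ".join(f"{p}: {len(lines)}" for p, lines in corpora.items())
        problems.append(f"line counts differ ({sizes})")
    # At least one of the languages should reach the character minimum.
    totals = [sum(len(line) for line in lines) for lines in corpora.values()]
    if not any(total >= min_chars for total in totals):
        problems.append(f"no file reaches {min_chars} characters")
    return problems
```

For example, `check_parallel(["xyz.sentences.txt", "abc.sentences.txt"])` run from the corpus repository returns an empty list when both files have the same number of lines and at least one of them is long enough.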

Contrastive grammar

  1. Create a page on the wiki named "LanguageX and LanguageY/Contrastive Grammar". Add it to the category Category:Sp19_ContrastiveGrammars. Add a link to the page under "Developed resources" on the main page.
  2. Identify at least five differences between the two languages.
    • These can be anything where the analyses of equivalent phrases look different when the roots are ignored: i.e., differences in tags used, differences in basic word order, etc.
  3. Document these differences on the wiki.
    • You'll want on the order of two or three examples of each grammatical difference.
    • Make two sections on the Contrastive Grammar page: one called "xyz-abc tests" and one called "abc-xyz tests".
      • If you're translating in one direction only, you just need "xyz-abc tests".
      • Each section will have equivalent tests, but in opposite directions.
    • Use the TransferMorphTest and TransferTest templates as used above and on the Transfer rules page.