Lexical transfer

From LING073

A previous version of this document is available at Spring 2018/Lexical transfer.

Machine Translation

Corpus-based approaches

Corpus-based approaches comprise Neural Machine Translation (NMT) and Statistical Machine Translation (SMT), and are what's used in most of the dominant MT systems out there.

SMT uses parallel text in two languages (e.g., translations of the same material into two different languages) and algorithms that attempt to find correspondences between the texts. NMT also uses parallel texts, and learns correspondences through neural nets. The resulting model of either can then predict, to some extent, text in one language when it encounters previously unseen text in the other language.

Both approaches require large amounts of parallel text to work well, and ~naïve models do not do well with morphologically rich languages.

  1. Can you think why this might be?

RBMT

Rule-Based Machine Translation (RBMT) is of marginal popularity these days.

RBMT requires manual encoding of linguistic generalisations (e.g., hand-writing dictionaries, writing rules to change word order and grammatical properties, etc.), and does not require much parallel text for development. (Though it's good to use a certain amount for evaluation—but you don't need nearly as much for development.)

RBMT can leverage good morphological models of morphologically rich languages, and hence can perform well for morphologically rich languages. RBMT systems are much easier to develop for languages with similar structure (like closely-related languages or languages that are part of the same linguistic area), and are much harder to develop for dissimilar languages.

Comparison

  1. What are the main advantages of SMT over RBMT?
  2. What are the main disadvantages of SMT as compared to RBMT?
  3. What are the main advantages of RBMT over SMT?
  4. What are the main disadvantages of RBMT as compared to SMT?

Other questions

  1. Why do you think we're using RBMT in this class?
  2. Why might we want to develop an MT system between two closely related languages?

How Apertium works

Apertium is an RBMT system that uses a pipeline consisting of a series of tools, several of them optional or interchangeable.

First steps towards MT

The first step we're going to be working on is lexical transfer. This basically just consists of making a bilingual dictionary.

In Apertium, lexical transfer matches the lemma and tags from the source-language (SL) side of the dictionary and simply replaces them with the lemma and tags on the target-language (TL) side of the dictionary. Any rearranging of tags and word-order will be dealt with later, in structural transfer.

This means that for any given entry you want to add a lemma, a POS tag, and any subcategory tags, but not grammatical tags. One simple example is house<n> → casa<n><f>, where the lemmas, the POS tag <n>, and a subcategory tag <f> are in the lexicon, but no number tags (e.g., <sg>, <pl>) are.

Also, since many words in a given language have multiple translations into another language, there will often be ambiguity. Lexical selection (what we'll work on next week) will allow us to resolve this ambiguity.

Things that will be good to know

Adding to the dictionary

You can add entries to the apertium-xyz-abc.xyz-abc.dix file in the "main" section. This dictionary is in XML format, the basics of which you can learn about in the first chapter of this introduction to XML.

An entry might look like this, with a translation from the "left side" (in <l> tags) to the right side (in <r> tags):

 <e><p><l>үй<s n="n"/></l>  <r>house<s n="n"/></r></p></e>

Remember that you only want to include the stem and the category and subcategory tags, not any grammatical tags (i.e., no singular or plural; no cases; etc.).
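If you're adding a lot of entries, a small shell helper can print them in the format shown above. This is only a sketch: bidix_entry and tags_to_xml are hypothetical names, and the dot-separated tag notation (e.g. n.f) is an assumption made for this example, not an Apertium convention. It does no XML escaping, so check the output before pasting it into the "main" section.

```shell
# Hypothetical helper: turn dot-separated tags (e.g. "n.f") into <s n="..."/> elements.
tags_to_xml() {
  printf '%s\n' "$1" | sed -e 's/[^.][^.]*/<s n="&"\/>/g' -e 's/\.//g'
}

# Hypothetical helper: print one bilingual dictionary entry.
# Arguments: left lemma, left tags, right lemma, right tags.
bidix_entry() {
  printf '<e><p><l>%s%s</l>  <r>%s%s</r></p></e>\n' \
    "$1" "$(tags_to_xml "$2")" "$3" "$(tags_to_xml "$4")"
}

# Example: category and subcategory tags only, no grammatical tags.
bidix_entry house n casa n.f
```

The last line prints an entry with house<n> on the left side and casa<n><f> on the right side.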

If you have a word that is present in one language but not another (e.g., it's realised as a tag instead of as a word), you can add an entry like this:

 <e r="LR"><p><l>not<s n="adv"/></l>  <r></r></p></e>

The r="LR" attribute restricts the entry to left-to-right translation: "not" gets rendered as null when translating from the language on the left side to the language on the right side, whereas null doesn't get realised as "not" when translating in the other direction (which would cause issues...). You can reverse it to r="RL" if a similar generalisation holds in the other direction.

Then:

  1. Make sure each of these stems is in the respective monolingual dictionaries (lexc files)
  2. Compile (make, for each transducer and then the bilingual pair) and check for errors.
  3. Test, e.g. echo "үй" | apertium -d . xyz-abc and debug (see #Testing for more details)
  4. Commit!

Here's the example from class.

Testing

You can test a translation like this:

echo "үйлөр" | apertium -d . kir-eng

There are three main types of output you could get:

  • houses — yay, everything worked and you got a valid translation!
  • #house — the stems are in the dictionary and translating properly, but it can't generate the exact form in the destination language
  • *үйлөр — the word cannot be analysed by the source language transducer or is not in the dictionary.

At this stage you want one of the first two outputs for everything—no *s!
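To get a quick overview of which of the three output types you're getting across a whole file, you could tally them. The function below is a rough sketch (classify_output is a hypothetical name): it splits translated text on whitespace and treats any token containing * as unknown and any other token containing # as not generated, so punctuation and everything else counts as fine.

```shell
# Hypothetical helper: tally the three kinds of output tokens.
# Reads translated text on stdin, e.g.:
#   apertium -d . kir-eng < xyz.sentences.txt | classify_output
classify_output() {
  tr -s '[:space:]' '\n' | awk '
    NF == 0               { next }                 # skip empty lines
    index($0, "*") > 0    { unknown++; next }      # *word: not analysed or not in the dictionary
    index($0, "#") > 0    { ungenerated++; next }  # #word: translated, but the form could not be generated
                          { ok++ }                 # everything else
    END { printf "ok=%d ungenerated=%d unknown=%d\n", ok, ungenerated, unknown }
  '
}
```

At this stage, the number you want to drive to zero is the unknown count.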

To see why #house fails, you can check what the dictionary is translating, like this:

echo "үйлөр" | apertium -d . kir-eng-biltrans

The output might look like this:

^үй<n><pl><nom>/house<n><pl><nom>$^.<sent>/.<sent>$

This is telling you that the source language transducer is analysing the input as үй<n><pl><nom> and is translating the stem correctly as house<n> and keeping the other tags. In this case, the reason that it can't generate the output form is that there is no <nom> tag in English. This is a nice grammatical difference. Later you'll want a transfer rule to remove this tag, and possibly make word order adjustments based on it.
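Biltrans output gets hard to read for longer sentences, so it can help to put each lexical unit on its own line, with the source analysis and its translation(s) separated by tabs. A minimal sketch, assuming no lemma contains an escaped / or $ (split_biltrans is a hypothetical name, and grep -o is a GNU/BSD extension rather than POSIX):

```shell
# Hypothetical helper: one lexical unit per line, source analysis first,
# then its translation(s), tab-separated.
# Usage: echo "үйлөр" | apertium -d . kir-eng-biltrans | split_biltrans
split_biltrans() {
  grep -o '\^[^$]*\$' |                  # pull out each ^...$ unit
    sed -e 's/^\^//' -e 's/\$$//' |      # strip the ^ and $ delimiters
    tr '/' '\t'                          # source<TAB>translation[<TAB>translation...]
}
```

Tags that appear on both sides of a line but don't exist in the target language (like <nom> for English) are the ones a transfer rule will later need to handle.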

Evaluating: calculating unknown/total words

Count the total number of tokens in the sentence/corpus (total number of words):

echo "put your sentence here" | apertium -d . xyz-abc | sed 's/ /\n/g' | wc -l

OR

cat corpus.txt | apertium -d . xyz-abc | sed 's/ /\n/g' | wc -l

Count the number of tokens not found in the dictionary (number of unknown words):

echo "put your sentence here" | apertium -d . xyz-abc | sed 's/ /\n/g' | grep '*' | wc -l

OR

cat corpus.txt | apertium -d . xyz-abc | sed 's/ /\n/g' | grep '*' | wc -l

Your goal is to decrease the ratio of unknown words to total number of words, ideally to 0.
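The two counts can also be taken in one pass and turned into the ratio directly. This is a sketch under the same assumptions as the commands above: tokens are whitespace-separated and any token containing * counts as unknown. The name unknown_ratio is made up for this example.

```shell
# Hypothetical helper: print unknown/total and the ratio for translated text on stdin.
# Usage: apertium -d . xyz-abc < corpus.txt | unknown_ratio
unknown_ratio() {
  tr -s '[:space:]' '\n' | awk '
    NF == 0            { next }        # skip empty lines
                       { total++ }     # every remaining line is one token
    index($0, "*") > 0 { unknown++ }   # tokens marked as unknown
    END {
      if (total == 0) { print "no tokens"; exit 1 }
      printf "%d/%d = %.2f\n", unknown, total, unknown / total
    }
  '
}
```

For example, output consisting of four tokens, two of them starred, gives 2/4 = 0.50.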

Additional guidance

If you get a segfault, there are two things to try:

  1. Run make clean, then make, and try again.
  2. If that doesn't help, try this:
    1. Remove the following blocks from modes.xml in your translation pair:
        <program name="apertium-tagger -g $2">
           <file name="abc-xyz.prob"/>
        </program>
        <program name="apertium-tagger -g $2">
          <file name="xyz-abc.prob"/>
        </program>
    2. Change all instances of "cg-proc -w" to "cg-proc -w -1 -n"
    3. Run make and try again.
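The "cg-proc -w" replacement in step 2 can be scripted instead of done by hand. The function below is a sketch, not an official tool (patch_cg_proc is a hypothetical name): it keeps a .bak copy of the file, and it first undoes any earlier patch so that running it twice doesn't add the options twice. Removing the apertium-tagger blocks is still easiest to do in an editor.

```shell
# Hypothetical helper: change "cg-proc -w" to "cg-proc -w -1 -n" in a modes file.
# Usage: patch_cg_proc modes.xml
patch_cg_proc() {
  modes_file="$1"
  cp "$modes_file" "$modes_file.bak" &&
    sed -e 's/cg-proc -w -1 -n/cg-proc -w/g' \
        -e 's/cg-proc -w/cg-proc -w -1 -n/g' \
        "$modes_file.bak" > "$modes_file"
}
```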

The assignment

This assignment is due at the end of the 9th week of class (this semester: end of the day Friday, April 9th, 2021, at midnight).

Figure out what language pair you'll work on

  1. Figure out who and what languages you'd like to work with on MT (source and destination languages). You can choose to work in the same or different groups, and to work with the language you're already working on or with a different language. This is an opportunity to regroup, especially if you don't enjoy the language you're working on, or you're having trouble working in your current group, etc.
    • The simplest case is that the source language is the language you've been working on this semester and the destination language is English or another language you know well (are able to easily make sense of text in the language) and that has an open source transducer (e.g., in Apertium). In this case, your language pair would be xyz-eng.
    • You may also choose to do MT between your language and:
      • a closely related or structurally extremely similar language (e.g., that mark similar things on the same parts of speech and/or that have similar word order)
      • a language that is spoken in close proximity to your language or that shares similar cultural influences, even if unrelated (harder)
      • a random other language that another group in the class has been working on (usually hardest)
    • You may work in groups of 2 or 3, or in some rare cases I may allow groups of 4.
      • In groups of 3 or 4 where both languages have been worked on up to this point in the course, each language should be both a source and destination of translation (this will double the work for future assignments). Each person in the group should work on MT to the language they've been working with so far.
      • Otherwise, you should choose a destination language that's a language you're comfortable with.
      • In general, the destination language you work on MT for (the "target language") should be the language of the pair that you're most comfortable with.
  2. Get additional repositories set up:
    • If you are working with someone who's done another language in this class:
      • Share the corpus, keyboard, and transducer repos with each other. I.e., give each other read access (not necessarily write access).
      • Fork the repos that were shared with you. Do this through the github interface.
      • Clone the forked repos locally (in your ~/ling073 directory).
      • You'll want to look at the corpus repo for this assignment, and probably make sure you can type in the language too, and you'll want the other stuff around for later. We'll also talk more about forking, pull requests, etc. as needed.

Set up the language pair

  1. Get the transducer for the second language you're working with.
    • Apertium's github repository has transducers for a lot of languages.
      • I recommend that you fork the transducer on github (i.e., copy the project to your own github account) so that you can make changes to it easily as needed (and potentially submit those changes back to Apertium at some point). For this you'll need a github.com account, you'll need to set up an ssh key for it, and you'll need to make sure both members of your group have write access to it.
      • Alternatively, you can just clone the repository, make an empty repository of the same name in our class github group, add that repository as an additional remote to your cloned copy, and push to it and set it as the default remote.
    • After forking the repository as appropriate, clone it locally into the same directory as your ling073-xyz repository (not "inside" your transducer repository, but "next to" it).
    • You can keep the name, e.g. apertium-abc; i.e., you don't need to rename it to ling073-abc.
    • Make sure to initialise (./autogen.sh) and compile (make) it.
  2. Bootstrap a translation pair to work on in your group. See apertium-init#Bootstrapping a translation pair.
  3. Make sure you're in the AUTHORS file and that the COPYING file reflects an open source license that you're okay with. Commit the language pair and add a link on the Language1 and Language2 wiki page.

Start a wiki page

Create a page on the wiki named "LanguageX and LanguageY" (either order, and just one page).

  • Add the page to the category Category:Sp21_TranslationPairs and the categories for each of the languages.
  • Put a note at the top along the lines of "Resources for machine translation between LanguageX and LanguageY", where the language names link back to your pages on them.
  • Add a link on each of the language pages to this new page (under a category like "Resources developed for LanguageX"), making a similar note as above.
  • Make a section called "External resources" and link to the following:
    • The repository for your new language pair
    • The repository for each transducer
    • The repository with the corpus (see next section)

Assemble a small parallel corpus

  1. Create a repository named ling073-xyz-abc-corpus in the course github group. Make sure both/all collaborators have full access to it. Put a link to it on the wiki page in a section titled "Developed resources".
  2. Construct a parallel corpus of at least 1000 characters in the source language (500 characters if a syllabic writing system). Note that the parallel text you already have can be leveraged here. If you're working with two unfamiliar languages (e.g., joining forces with another group in the class to work on translation between languages you've both developed transducers for), halve this number.
    • The corpus should consist of sentences with the same meaning, one per line. You can extract sentences from translations of the same text (e.g., bible translations or the Universal Declaration of Human Rights), or use phrasebook examples that are similar.
    • If you're working with two unfamiliar languages, you can also try to use example sentences from your sources to try to construct grammatical sentences with similar meaning, but make note of this.
    • In extreme cases, you may include short phrases (as opposed to full sentences) with the same meaning, but please speak with the professor about this as soon as possible.
  3. Populate two files xyz.sentences.txt and abc.sentences.txt with the parallel text, one sentence per line.
    • If you're not working with English or another language you read easily, I recommend adding eng.sentences.txt as well to keep track of what the sentences mean.
  4. Include LICENSE, AUTHORS, and MANIFEST files as before (the latter to note origin and licensing of groups of sentences and any other notes).
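Before committing the corpus, it's worth checking that the two sentence files line up and that the source side is long enough. The function below is a rough sketch (check_parallel is a hypothetical name). It counts characters with wc -m, which includes newlines and depends on your locale, so treat the number as approximate. Pass 500 as the third argument if the halved or syllabic-script threshold applies to you.

```shell
# Hypothetical helper: sanity-check a pair of sentence files.
# Usage: check_parallel xyz.sentences.txt abc.sentences.txt [min_chars]
check_parallel() {
  src_file="$1"; tgt_file="$2"; min_chars="${3:-1000}"
  src_lines=$(($(wc -l < "$src_file")))   # arithmetic expansion strips padding from wc
  tgt_lines=$(($(wc -l < "$tgt_file")))
  src_chars=$(($(wc -m < "$src_file")))
  status=0
  if [ "$src_lines" -ne "$tgt_lines" ]; then
    echo "line mismatch: $src_lines vs $tgt_lines"
    status=1
  fi
  if [ "$src_chars" -lt "$min_chars" ]; then
    echo "source too short: $src_chars < $min_chars characters"
    status=1
  fi
  if [ "$status" -eq 0 ]; then
    echo "ok: $src_lines sentence pairs, $src_chars source characters"
  fi
  return "$status"
}
```

A matching line count doesn't prove the sentences are aligned, so spot-check a few pairs by eye as well.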

Add to the lexicon

  1. Expand your transducer by adding all the stems and morphology needed to make at least 10 of the sentences (in the translation source language) in your parallel corpus analyse fully.
  2. Add translations for each stem in those same sentences to the apertium-xyz-abc.xyz-abc.dix file.
  3. Make some decisions to bring your tagsets closer together for the two transducers, as possible—e.g., if the transducers use different tags for what amount to the same thing, or if you are or are not marking a particular feature. Examples:
    • You may have used any of <v>, <vb>, <vblex>, <verb> for verbs—it would be good (but not necessary) if both languages used the same tag.
    • If one transducer marks transitivity on verbs and one doesn't, you might add that.
    • If one marks <sg> versus <pl>, and the other just has <pl> or lack of it, figure out if there's a linguistic basis for this decision in each language. It may be that there is, in which case you can deal with it later; but if the languages don't differ in how they do things, pick one and make both transducers do this.
    • You probably will want to bring your grammatical information tags closer together, e.g., <nz>/<nmz> and <ger> could potentially be combined.
    • However, you probably shouldn't [need to] change an existing transducer in major ways, and you shouldn't override [non-arbitrary] decisions you've already made about the tagset for your language just to make it more similar to an existing language!
  4. Find more words (examine dictionaries, texts) so that there's a total of at least 50 words, with at least 10 nouns and 10 verbs.
    • Make sure all the words that you've added to the bilingual dictionary are in the monolingual dictionaries as well (i.e., add anything that's missing).
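To keep track of how close you are to the 50-word target (with 10 nouns and 10 verbs), you can count entries in the bilingual dictionary. This is a sketch that assumes one <e> entry per line, as in the examples earlier on this page, and that the POS tag is the first tag on the left side. The name count_bidix is hypothetical, and you should substitute whatever verb tag your transducer uses.

```shell
# Hypothetical helper: count bilingual dictionary entries, optionally by left-side POS tag.
# Usage: count_bidix apertium-xyz-abc.xyz-abc.dix      # all entries
#        count_bidix apertium-xyz-abc.xyz-abc.dix n    # entries whose left side is tagged <n>
count_bidix() {
  dix_file="$1"; pos_tag="${2:-}"
  if [ -n "$pos_tag" ]; then
    # Replace <b/> (space inside a lemma) so the lemma contains no "<" before its first tag.
    sed 's:<b/>: :g' "$dix_file" | grep -c "<l>[^<]*<s n=\"$pos_tag\"/>"
  else
    grep -c '<e[ >]' "$dix_file"
  fi
}
```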

Evaluate

  1. Create a section on your Language1 and Language2 wiki page called xyz → abc evaluation.
  2. Report the coverage (number of forms analysed) of your monolingual transducer on your xyz.sentences.txt file.
  3. Report the coverage of your bilingual transducer (xyz-abc.automorf.bin) on the same file.
    • You'll need the coverage-lt.sh script. You can save it in your repo by clicking raw, copying the entirety of the script, running cat > coverage-lt.sh, pasting the script, hitting Enter and then Ctrl+c. Then you'll want to run chmod +x coverage-lt.sh so that you can run it directly. Then simply do ./coverage-lt.sh corpus.txt analyser.bin.
  4. Finally, for each of the 10 sentences you worked on, post on the wiki page the following:
    • The original sentence,
    • The intended English translation,
    • The output of your xyz-abc-biltrans (lexical transfer) mode,
    • The output of your xyz-abc (full translation) mode.

Make sure to commit and push your language pairs, corpora (including sentences files), and any changes you made to the transducers!