Lexical transfer

From LING073

Bootstrapping a language pair

  1. To bootstrap a language pair whose primary function is to translate from language xyz to language abc, do the following: apertium-init -a1 hfst -a2 hfst xyz-abc
  2. Rename the directory to ling073-xyz-abc.
  3. Commit the entire thing to git the same way you did your transducer after bootstrapping it (unless you did it wrong).
  4. Initialise it (in the new directory) with the following command: ./autogen.sh --with-lang1=/path/to/ling073-xyz --with-lang2=/path/to/ling073-abc
  5. Compile with make as always.

Adding to the dictionary

You can add entries to the apertium-xyz-abc.xyz-abc.dix file in the "main" section. This dictionary is in XML format, the basics of which you can learn about in the first chapter of this introduction to XML.

An entry might look like this, with a translation from the "left side" (in <l> tags) to the right side (in <r> tags):

 <e><p><l>үй<s n="n"/></l>  <r>house<s n="n"/></r></p></e>

You only want to include the stem and the category and subcategory tags, not any grammatical tags (i.e., no singular or plural; no cases; etc.).

If you have a word that is present in one language but not another (e.g., it's realised as a tag instead of as a word), you can add an entry like this:

 <e r="LR"><p><l>not<s n="adv"/></l>  <r></r></p></e>

The r="LR" attribute restricts the entry to left-to-right translation: "not" gets rendered as null when translating from the left-side language to the right-side language, whereas null never gets realised as "not" when translating in the opposite direction (which would cause issues...). You can use r="RL" instead if a similar generalisation holds the other way around.
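To double-check which entries apply in which direction, the entries can be read back programmatically. The following is only a minimal sketch: the sample section wraps the two entries shown above in a hypothetical <section> element, and the helper names side and describe are made up for illustration. It also assumes single-word stems (no <b/> or <g> elements inside <l> or <r>).

```python
import xml.etree.ElementTree as ET

# Hypothetical "main" section reusing the two example entries from this page.
SAMPLE = """
<section id="main" type="standard">
  <e><p><l>үй<s n="n"/></l><r>house<s n="n"/></r></p></e>
  <e r="LR"><p><l>not<s n="adv"/></l><r></r></p></e>
</section>
""".strip()

def side(elem):
    """Return (stem, tags) for an <l> or <r> element."""
    stem = elem.text or ""                      # text before the first <s/>
    tags = [s.get("n") for s in elem.findall("s")]
    return stem, tags

def describe(section_xml):
    """List (left, right, direction) for every <e> entry in the section."""
    rows = []
    for entry in ET.fromstring(section_xml).iter("e"):
        left = side(entry.find("p/l"))
        right = side(entry.find("p/r"))
        # No r attribute means the entry is used in both directions.
        rows.append((left, right, entry.get("r", "both")))
    return rows

for row in describe(SAMPLE):
    print(row)
```

An entry whose tag list contains anything beyond category and subcategory tags (e.g. pl or a case tag) would be a sign that grammatical tags have slipped into the bilingual dictionary.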

Then:

  1. Make sure each of these stems is in the respective monolingual dictionaries (lexc files)
  2. Compile (make, for each transducer and then the bilingual pair) and check for errors.
  3. Test, e.g. echo "үй" | apertium -d . xyz-abc and debug (see the Testing section below for more details)
  4. Commit!

Testing

You can test a translation like this:

echo "үйлөр" | apertium -d . kir-eng

There are three main types of output you could get:

  • houses — yay, everything worked and you got a valid translation!
  • #house — the stems are in the dictionary and translating properly, but it can't generate the exact form in the destination language
  • *үйлөр — the word cannot be analysed by the source language transducer or is not in the dictionary.

At this stage you want one of the first two outputs for everything—no *s!

To see why #house fails, you can check what the dictionary is translating, like this:

echo "үйлөр" | apertium -d . kir-eng-biltrans

The output might look like this:

^үй<n><pl><nom>/house<n><pl><nom>$^.<sent>/.<sent>$

This is telling you that the source language transducer is analysing the input as үй<n><pl><nom> and is translating the stem correctly as house<n> and keeping the other tags. In this case, the reason that it can't generate the output form is that there is no <nom> tag in English. This is a nice grammatical difference. Later you'll want a transfer rule to remove this tag, and possibly make word order adjustments based on it.
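When there are many words to check, it can help to pull the biltrans output apart automatically. The sketch below is an illustration only, not part of Apertium: the function names are made up, and it assumes the simple case shown above (one translation per word, no escaped characters in the stream).

```python
import re

LEXICAL_UNIT = re.compile(r"\^(.*?)\$")   # one unit: ^source/target$
TAG = re.compile(r"<([^>]+)>")            # one tag: <n>, <pl>, ...

def split_reading(reading):
    """Split 'үй<n><pl><nom>' into ('үй', ['n', 'pl', 'nom'])."""
    stem = reading.split("<", 1)[0]
    return stem, TAG.findall(reading)

def parse_biltrans(line):
    """Return a list of (source, target) readings, one pair per lexical unit."""
    pairs = []
    for unit in LEXICAL_UNIT.findall(line):
        source, _, target = unit.partition("/")
        pairs.append((split_reading(source), split_reading(target)))
    return pairs

line = "^үй<n><pl><nom>/house<n><pl><nom>$^.<sent>/.<sent>$"
for (src_stem, src_tags), (tgt_stem, tgt_tags) in parse_biltrans(line):
    print(src_stem, src_tags, "->", tgt_stem, tgt_tags)
```

Listing the target-side tags this way makes it easier to spot tags, such as nom here, that the target language generator does not know about.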

Evaluating

Unknown/total words

Count the total number of tokens in the sentence/corpus (total number of words):

echo "put your sentence here" | apertium -d . xyz-abc | sed 's/ /\n/g' | wc -l

OR

cat corpus.txt | apertium -d . xyz-abc | sed 's/ /\n/g' | wc -l

Count the number of tokens not found in the dictionary (number of unknown words):

echo "put your sentence here" | apertium -d . xyz-abc | sed 's/ /\n/g' | grep '*' | wc -l

OR

cat corpus.txt | apertium -d . xyz-abc | sed 's/ /\n/g' | grep '*' | wc -l

Your goal is to decrease the ratio of unknown words to total number of words, ideally to 0.
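The same ratio can be computed in one step once the translation has been saved. This is a rough sketch under a few assumptions: the function name unknown_ratio is made up, tokens are split on any whitespace (so counts can differ slightly from the sed pipeline above when there are repeated spaces), and only tokens that start with * are counted as unknown.

```python
def unknown_ratio(translated):
    """Return (unknown, total, ratio) for a string of apertium output."""
    tokens = translated.split()
    unknown = [t for t in tokens if t.startswith("*")]
    ratio = len(unknown) / len(tokens) if tokens else 0.0
    return len(unknown), len(tokens), ratio

# Sample output mixing the three result types described under Testing.
unknown, total, ratio = unknown_ratio("*үйлөр houses #house")
print(f"{unknown} unknown out of {total} tokens ({ratio:.0%})")
```

To use it on real output, the string could be read from a file that holds the result of running the corpus through apertium.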

Scrape a mini test corpus

  1. First make sure you have scrapeTransferTests. You should clone the tools repo (or git pull to update it, if you already have it cloned from other assignments). Then run sudo make. Test that running scrapeTransferTests gives you information on using the tool.
  2. Scrape the transferTests from your contrastive grammar page into a small parallel corpus. E.g., scrapeTransferTests -p abc-xyz "Language1_and_Language2/Contrastive_Grammar" will result in an abc.tests.txt and xyz.tests.txt file that contain the respective sides of any transferTests on your contrastive grammar page specified as being for abc to xyz translation.
  3. Add these two files to your bilingual corpus.

WER

WER, or word error rate, is a measure of how different two texts are. You will want to know how different the translation your language pair produces (the "test translation") is from the known good translation of the phrases in your parallel corpus (the "reference translation"). PER (position-independent error rate) is the same measurement, just not sensitive to position in a phrase. I.e., a correct translation of every word but in an entirely wrong word order will give you a high WER but a low PER.
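To get a feel for the two measures, here is a small worked sketch. It uses one common formulation (word-level edit distance divided by reference length for WER, and a bag-of-words comparison for PER); apertium-eval-translator may compute or report its numbers somewhat differently, so treat this as an illustration rather than a reimplementation. The function names are made up, and the reference is assumed to be non-empty.

```python
from collections import Counter

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # word missing from test
                            curr[j - 1] + 1,           # extra word in test
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, test):
    """Edit distance over words, relative to the reference length."""
    ref, hyp = reference.split(), test.split()
    return edit_distance(ref, hyp) / len(ref)

def per(reference, test):
    """Like WER, but word order is ignored (bag-of-words comparison)."""
    ref, hyp = reference.split(), test.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)

reference = "the house is big"
scrambled = "big is house the"   # every word right, order entirely wrong
print("WER:", wer(reference, scrambled))
print("PER:", per(reference, scrambled))
```

For the scrambled sentence every position mismatches, so WER comes out at 1.0, while PER is 0.0 because all four words are present.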

To test WER and PER:

  1. First make sure you have apertium-eval-translator. You should clone the tools repo (or git pull to update it, if you already have it cloned from other assignments). Then run sudo make. Test that running apertium-eval-translator gives you information on using the tool.
  2. You need two files: one test translation, and one reference translation. The reference translation is the parallel text in your corpus, e.g. abc.tests.txt. To get a test translation, run the source text through apertium and direct the output into a new file, e.g. cat xyz.tests.txt | apertium -d . xyz-abc > xyz-abc.tests.txt. You should add the [final] test translation to your repository.
  3. Just run apertium-eval-translator -r abc.tests.txt -t xyz-abc.tests.txt and you should get WER and PER, among other useful numbers.

The assignment

This assignment is due at the end of the 9th week of class (this semester: at midnight at the end of Saturday, March 25th, 2017).

  1. Bootstrap a language pair. (The person you're working with will bootstrap the reverse pair, or if you're confident of your git skills, you can work on the same pair together.)
  2. Make sure you're in the AUTHORS file and that the COPYING file reflects an open source license that you're okay with. Commit both language pairs and add a link to each on the Language1 and Language2 wiki pages.
  3. Add all the words needed for the transfer tests (from the last assignment) to the dix file.
    • And make sure each analyser can analyse all sentences correctly, which includes adding the words to the relevant lexc files.
  4. Make some decisions to bring your tagsets closer together, as far as possible, e.g., if you used different tags for what amounts to the same thing, or if one of you is marking a particular feature and the other is not. Examples:
    • You may have used any of <v>, <vb>, <vblex>, <verb> for verbs—it would be good if both languages used the same tag.
    • If one of you marked transitivity on verbs and one of you didn't, you should probably add that.
    • If one of you marks <sg> versus <pl>, and the other just has <pl> or lack of it, figure out if there's a linguistic basis for this decision in each language. It may be that there is, in which case you can deal with it later; but if the languages don't differ in how they do things, pick one and make both transducers do this.
    • You probably will want to bring your grammatical information tags closer together, e.g., <nz>/<nmz> and <ger> could potentially be combined.
  5. Find more words (examine dictionaries, texts) so that there's a total of at least 50 words, with at least 10 nouns and 10 verbs.
    • Make sure all the words are in monolingual dictionaries.
    • Find at least two cases in each direction of a one-to-many mapping, and document on the wiki page for the translation pair in a new section called "lexical selection".
  6. Evaluate in the following ways, adding these numbers to your Language1 and Language2 wiki page in a section called xyz → abc evaluation. Run on both your tests.txt files (scrape a mini test corpus) and your sentences.txt files.
    • WER and PER (probably close to 100%)
    • proportion of stems translated correctly (hopefully close to 100% for tests.txt files)
  7. Make sure to commit and push your language pairs, corpora (including new tests files, per WER section above), and any changes you made to the transducers!