Difference between revisions of "Spring 2018/Lexical transfer"

From LING073
Revision as of 19:52, 18 March 2017

Bootstrapping a language pair

  1. To bootstrap a language pair whose primary function is to translate from language xyz to language abc, do the following: apertium-init -a1 hfst -a2 hfst xyz-abc
  2. Rename the directory to ling073-xyz-abc.
  3. Commit the entire thing to git the same way you did your transducer after bootstrapping it (unless you did it wrong).
  4. Initialise it (in the new directory) with the following command: ./autogen.sh --with-lang1=/path/to/ling073-xyz --with-lang2=/path/to/ling073-abc
  5. Compile with make as always.
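
The steps above can be strung together roughly as follows. This is only a sketch under a few assumptions: that apertium-init names the new directory apertium-xyz-abc, that the generated script is called autogen.sh, and that the second module is passed with --with-lang2. The run and bootstrap_pair helpers are made up for illustration. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}

# run CMD ARGS...: echo the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# bootstrap_pair SRC TRG SRC_DIR TRG_DIR  (hypothetical helper)
# Assumes apertium-init creates a directory called apertium-SRC-TRG.
# The git commit (step 3) is left out; do it by hand after the rename.
bootstrap_pair() {
    src=$1; trg=$2; src_dir=$3; trg_dir=$4
    run apertium-init -a1 hfst -a2 hfst "$src-$trg" &&
    run mv "apertium-$src-$trg" "ling073-$src-$trg" &&
    run cd "ling073-$src-$trg" &&
    run ./autogen.sh --with-lang1="$src_dir" --with-lang2="$trg_dir" &&
    run make
}

bootstrap_pair xyz abc /path/to/ling073-xyz /path/to/ling073-abc
```

Reading the printed commands before running them for real is a cheap way to catch a wrong path or a swapped language code.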

Adding to the dictionary

You can add entries to the apertium-xyz-abc.xyz-abc.dix file in the "main" section. This dictionary is in XML format, the basics of which you can learn about in the first chapter of this introduction to XML.

An entry might look like this, with a translation from the "left side" (in <l> tags) to the right side (in <r> tags):

 <e><p><l>үй<s n="n"/></l>  <r>house<s n="n"/></r></p></e>

You only want to include the stem and the category and subcategory tags, not any grammatical tags (i.e., no singular or plural; no cases; etc.).
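
If you have many stems to add, a small shell helper can print entries in this shape for you to paste into the "main" section. This is just a sketch: dix_entry is a made-up name, it assumes both sides take the same single category tag, and it does not handle stems containing spaces.

```shell
#!/bin/sh
# dix_entry LEFT_STEM RIGHT_STEM TAG  (hypothetical helper)
# Prints one bilingual dictionary entry with the same category tag on both sides.
dix_entry() {
    printf '<e><p><l>%s<s n="%s"/></l><r>%s<s n="%s"/></r></p></e>\n' \
        "$1" "$3" "$2" "$3"
}

dix_entry үй house n
```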

Then:

  1. Make sure each of these stems is in the respective monolingual dictionaries (lexc files)
  2. Compile (make) and check for errors.
  3. Test, e.g. echo "үй" | apertium -d . xyz-abc and debug (see #Testing for more details)
  4. Commit!

Testing

You can test a translation like this:

echo "үйлөр" | apertium -d . kir-eng

There are three main types of output you could get:

  • houses — yay, everything worked and you got a valid translation!
  • #house — the stems are in the dictionary and translating properly, but it can't generate the exact form in the destination language
  • *үйлөр — the word cannot be analysed by the source language transducer or is not in the dictionary.

At this stage you want one of the first two outputs for everything—no *s!
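
For longer sentences it can help to label each output token with one of the three outcomes above. The following is a minimal sketch: classify_output is a made-up name, and it simply looks at the first character of each space-separated token.

```shell
#!/bin/sh
# classify_output (hypothetical helper): reads translated text on stdin and
# prints one "label<TAB>token" line per space-separated token.
classify_output() {
    tr -s ' ' '\n' | while IFS= read -r tok || [ -n "$tok" ]; do
        case $tok in
            '')  ;;                                        # skip empty lines
            \**) printf '%s\t%s\n' unknown "$tok" ;;       # starts with *
            \#*) printf '%s\t%s\n' ungenerated "$tok" ;;   # starts with #
            *)   printf '%s\t%s\n' ok "$tok" ;;
        esac
    done
}

# Sample text standing in for real apertium output:
echo 'houses #house *үйлөр' | classify_output
```

In practice you would pipe the output of apertium -d . xyz-abc into classify_output instead of the sample text.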

To see why #house fails, you can check what the dictionary is translating, like this:

echo "үйлөр" | apertium -d . kir-eng-biltrans

The output might look like this:

^үй<n><pl><nom>/house<n><pl><nom>$^.<sent>/.<sent>$

This is telling you that the source language transducer is analysing the input as үй<n><pl><nom> and is translating the stem correctly as house<n> and keeping the other tags. In this case, the reason that it can't generate the output form is that there is no <nom> tag in English. This is a nice grammatical difference. Later you'll want a transfer rule to remove this tag, and possibly make word order adjustments based on it.
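
When the biltrans output gets long, it can be easier to read with one lexical unit per line. Here is a rough sketch, assuming the ^source/target$ layout shown above; biltrans_units is a made-up name, and escaped slashes inside a unit are not handled.

```shell
#!/bin/sh
# biltrans_units (hypothetical helper): reads biltrans output on stdin and
# prints "source -> target [target ...]" for each ^...$ unit.
biltrans_units() {
    tr '$' '\n' |                      # one unit per line (split at the closing $)
    sed -n 's/^[^^]*\^//p' |           # drop everything up to the opening ^
    awk -F/ '{
        printf "%s ->", $1             # first field is the source analysis
        for (i = 2; i <= NF; i++) printf " %s", $i
        print ""
    }'
}

# The sample line from above, standing in for real output:
echo '^үй<n><pl><nom>/house<n><pl><nom>$^.<sent>/.<sent>$' | biltrans_units
```

Each line then shows which tags were carried over to the target side unchanged, which is where to look when a form fails to generate.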

Evaluating

Count the total number of tokens in the sentence:

echo "put your sentence here" | apertium -d . xyz-abc | sed 's/ /\n/g' | wc -l

Count the number of tokens not found in the dictionary:

echo "put your sentence here" | apertium -d . xyz-abc | sed 's/ /\n/g' | grep '*' | wc -l

Your goal is to decrease the ratio of unknown words to total number of words, ideally to 0.
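
The two counts can be combined into the ratio in one step. This is a minimal sketch: unknown_ratio is a made-up name, and, like the grep above, it counts any token containing a * as unknown.

```shell
#!/bin/sh
# unknown_ratio (hypothetical helper): reads translated text on stdin and
# prints "unknown/total = ratio".
unknown_ratio() {
    tr -s ' ' '\n' | awk '
        NF { total++; if (index($0, "*") > 0) unknown++ }   # skip empty lines
        END {
            if (total) printf "%d/%d = %.2f\n", unknown, total, unknown / total
            else print "0/0"
        }'
}

# Sample text standing in for real apertium output:
echo 'the *үйлөр are #house' | unknown_ratio
```

To use it on a real sentence, pipe the output of apertium -d . xyz-abc into unknown_ratio.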

TODO: WER, PER

TODO: scraping and storing transfer tests

The assignment

This assignment is due at the end of the 9th week of class (this semester: Friday, March 24th at noon)

  1. Bootstrap a language pair. (The person you're working with will bootstrap the reverse pair)
  2. Make sure you're in the AUTHORS file and that the COPYING file reflects an open source license that you're okay with. Commit both language pairs and add a link to each on the Language1 and Language2 wiki pages.
  3. Add all the words needed for the transfer tests (from the last assignment) to the bilingual dictionary (dix).
    • And make sure each analyser can analyse all sentences correctly, which includes adding the words to the relevant lexc files.
  4. Make some decisions to bring your tagsets closer together, as far as possible: e.g., if you used different tags for what amounts to the same thing, or if one of you marks a particular feature and the other doesn't. Examples:
    • You may have used any of <v>, <vb>, <vblex>, <verb> for verbs—it would be good if both languages used the same tag.
    • If one of you marked transitivity on verbs and one of you didn't, you should probably add that.
    • If one of you marks <sg> versus <pl>, and the other just has <pl> or its absence, figure out if there's a linguistic basis for this decision in each language. It may be that there is, in which case you can deal with it later; but if the languages don't differ in how they do things, pick one and make both transducers do this.
    • You probably will want to bring your grammatical information tags closer together, e.g., <nz>/<nmz> and <ger> could potentially be combined.
  5. Find more words (examine dictionaries, texts) so that there's a total of at least 50 words, with at least 10 nouns and 10 verbs.
    • Make sure all the words are in monolingual dictionaries.
    • Find at least two cases in each direction of a one-to-many mapping.
  6. Evaluate in the following ways:
    • WER, PER on tests (probably 0%)
    • number of stems translated correctly on tests (hopefully close to 100%)
    • number of forms in corpus with translated stems
    • Put these numbers on your Language1 and Language2 wiki pages in a section called xyz → abc evaluation.
  7. Make sure to commit and push your language pairs and any changes you made to the transducers!