Difference between revisions of "Polished RBMT system"

From LING073

Revision as of 14:20, 5 April 2017

Hand-annotating corpora

Measuring precision and recall

The assignment

  1. Before you begin, add a "structural_transfer" tag to your transducer repositories and your translation pair repository/ies.
  2. Set up some new corpora based on existing ones:
    • Combine your sentences and tests corpora so you have a new longer parallel corpus. Name the files abc.longer.txt and xyz.longer.txt.
    • Make a large corpus of raw text in your language; the bigger, the better. This step may simply consist of cleaning up and/or combining the existing corpora from the initial corpus assembly assignment. Call it abc.corpus.large.txt (in your monolingual corpus repo) and add notes to your MANIFEST file about where the text comes from.
    • Create a hand-annotated monolingual corpus of sentences (see above) covering at least 1000 characters (500 for syllabic scripts) of your abc.corpus.basic.txt file, ideally sentences you understand or have English glosses for. Call it abc.annotated.basic.txt and put it in your monolingual corpus repository.
  3. If you've been working on separate MT pairs, combine your MT pairs into one repository (which you both have full access to), making sure to incorporate all of the following:
    • All entries from both dictionaries in a single .dix file. Make sure all translations are in the default direction of the pair (e.g., abc-xyz) and that any r="RL" or r="LR" attributes are set for the correct direction.
    • Both lrx files are there and have the right names.
    • All transfer files for both directions (up to 6 files) are there and have the right names and content.
    • Also make sure that there are no compiled binaries or other compiled files committed to the repo. If needed, use the apertium-init script to bootstrap a new pair to get the list of just the files that need to be in the repo, and use the tricks presented in removing binaries from transducer repo to clean it up.
  4. Expand your MT pair in x/y of the following ways, listing on the wiki (where):
    • At least 100 more stems in the bilingual dictionary.
    • Expand your morphology to cover at least 6 more elements of some paradigm(s). This can be anything from additional verb or noun morphology, to adding all the forms of all the determiners (articles, demonstratives, etc.), to implementing nominal morphology on adjectives (e.g., if your language allows adjectives to be substantivised, which you'll want to add a tag for too).
    • At least 4 more twol rules that make your (analysis and) generation cleaner.
    • Disambiguation
    • Lexical selection
    • Structural transfer
  5. When you are done with the above, document the following measures:
    1. For each transducer:
      • Number of stems
      • Precision and recall, and over which corpus
      • Coverage, and over which corpus
      • The size of the corpus and the number of stems in the transducer.
    2. For MT in each direction:
      • WER and PER, and over which corpus
      • Trimmed coverage
      • The number of stems in the small corpus
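
For step 2, the longer parallel corpus could be assembled along these lines. This is only a minimal sketch: the function name combine_corpora and the input file names in the usage note are hypothetical, and only abc.longer.txt and xyz.longer.txt come from the assignment. Because the corpus is parallel, both sides should end up with the same number of lines, so the function returns the line count for you to compare.

```python
from pathlib import Path


def combine_corpora(parts, output):
    """Concatenate text files line by line; return the number of lines written.

    `parts` is a list of input paths (hypothetical names, adjust to your repo).
    Every line is written with exactly one trailing newline, so a missing
    final newline in one input cannot glue two sentences together.
    """
    lines = []
    for part in parts:
        lines.extend(Path(part).read_text(encoding="utf-8").splitlines())
    Path(output).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

A possible use, assuming your existing files are called abc.sentences.txt and abc.tests.txt (and likewise for xyz): call combine_corpora once per language, writing to abc.longer.txt and xyz.longer.txt, and check that the two returned counts are equal before committing.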
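
For the transducer measures in step 5, the following sketch shows one way the numbers could be computed. It assumes that both the hand-annotated corpus and the analyser output are in the Apertium stream format (^surface/analysis1/analysis2$, with unknown words marked by a leading *), that the two are aligned token by token, and that precision and recall have their usual definitions over analyses. The function names are hypothetical, and the parsing ignores escaped characters, so treat it as an illustration and not a replacement for the course's own instructions.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplified: escaped characters are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def parse_stream(text):
    """Return a list of (surface_form, set_of_analyses) pairs."""
    units = []
    for match in LEXICAL_UNIT.finditer(text):
        analyses = {a for a in match.group(2).split("/") if a}
        units.append((match.group(1), analyses))
    return units


def coverage(units):
    """Naive coverage: share of tokens with at least one known analysis."""
    if not units:
        return 0.0
    known = sum(
        1 for _, analyses in units
        if any(not a.startswith("*") for a in analyses)
    )
    return known / len(units)


def precision_recall(gold_units, system_units):
    """Compare analyses token by token; both lists must be aligned."""
    if len(gold_units) != len(system_units):
        raise ValueError("gold and system corpora are not token-aligned")
    true_pos = false_pos = false_neg = 0
    for (_, gold), (_, system) in zip(gold_units, system_units):
        system = {a for a in system if not a.startswith("*")}  # drop unknowns
        true_pos += len(gold & system)   # analyses the transducer got right
        false_pos += len(system - gold)  # analyses it should not have given
        false_neg += len(gold - system)  # analyses it failed to give
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = parse_stream("^the/the<det>$ ^dogs/dog<n><pl>$ ^bark/bark<vblex><pres>$")
    system = parse_stream("^the/the<det>$ ^dogs/dog<n><pl>/dog<vblex><pres>$ ^bark/*bark$")
    print("coverage:", coverage(system))
    print("precision, recall:", precision_recall(gold, system))
```

In the toy example, two of the three tokens are analysed, one of the three analyses given is spurious, and one gold analysis is missing, so coverage, precision and recall all come out at two thirds.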
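
For the MT measures in step 5, WER and PER can be illustrated as follows. This sketch assumes whitespace tokenisation, one sentence per line, and one common formulation of each measure: WER as word-level edit distance divided by reference length, and PER as the same idea with word order ignored. The function names are hypothetical, and any evaluation script used in the course may tokenise or normalise differently, so the numbers need not match exactly.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # ref word missing from hypothesis
                current[j - 1] + 1,      # extra word in hypothesis
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(ref_lines, hyp_lines):
    """Word error rate over aligned lists of reference and hypothesis lines."""
    errors = words = 0
    for ref, hyp in zip(ref_lines, hyp_lines):
        r, h = ref.split(), hyp.split()
        errors += edit_distance(r, h)
        words += len(r)
    return errors / words if words else 0.0


def per(ref_lines, hyp_lines):
    """Position-independent error rate: like WER, but ignoring word order."""
    errors = words = 0
    for ref, hyp in zip(ref_lines, hyp_lines):
        r, h = ref.split(), hyp.split()
        matches = sum((Counter(r) & Counter(h)).values())  # bag-of-words overlap
        errors += max(len(r), len(h)) - matches
        words += len(r)
    return errors / words if words else 0.0


if __name__ == "__main__":
    reference = ["the dog barks loudly"]
    hypothesis = ["dog the barks"]
    print(wer(reference, hypothesis))  # 0.75
    print(per(reference, hypothesis))  # 0.25
```

The example shows why it is worth reporting both: swapping two words and dropping one costs three edits for WER, but PER only counts the dropped word.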