Polished RBMT system

From LING073

Revision as of 06:01, 24 April 2017

Hand-annotating corpora

First you want to analyze your corpus and output to CG format:

cat corpus.txt | apertium -d . xyz-morph | cg-conv -a > corpus.out.txt

Your new file probably now looks something like this:

"<This>"
	"this" det dem sg
	"this" prn dem sg
"<is>"
	"be" vbser pres p3 sg
"<my>"
	"I" prn p1 sg pos
"<house>"
	"house" n sg
"<.>"
	"." sent
"<I>"
	"I" prn p1 sg subj
"<live>"
	"live" vblex inf
	"live" vblex past
"<here>"
	"*here"
"<..>"
	".." sent

In this example, you might notice a few things that need fixing:

  • "here" isn't being analysed; it should have an adverb reading
  • "house" should have a verb reading
  • "live" should have an adjective reading
  • "live" isn't the past tense form of this verb

The following annotation makes these corrections:

"<This>"
	"this" det dem sg
	"this" prn dem sg
"<is>"
	"be" vbser pres p3 sg
"<my>"
	"I" prn p1 sg pos
"<house>"
	"house" n sg
	"house" vblex tv inf
"<.>"
	"." sent
"<I>"
	"I" prn p1 sg subj
"<live>"
	"live" vblex inf
	"live" adj
"<here>"
	"here" adv
"<..>"
	".." sent

Note: There should be no unknown words ("analyses" with *) when you're done.
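One way to check for leftover unknown words is to scan the annotated file for readings that still begin with an asterisk. The following is a minimal sketch, not part of the course tools; the function name find_unknown_words is hypothetical, and it assumes the CG layout shown above (a "<surface>" line followed by indented readings).

```python
def find_unknown_words(cg_text):
    """Return the surface forms whose cohort still has a '*' (unknown) reading.

    Assumes the CG layout shown above: a line like "<form>" opens a cohort,
    and each following indented line is one reading. A real token whose
    lemma is literally "*" would be flagged too, so treat hits as hints.
    """
    unknown = []
    surface = None
    for line in cg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"<') and stripped.endswith('>"'):
            # Start of a new cohort: remember its surface form.
            surface = stripped[2:-2]
        elif stripped.startswith('"*') and surface is not None:
            # Reading whose lemma starts with '*': the analyser did not know it.
            unknown.append(surface)
    return unknown


sample = '\n'.join([
    '"<live>"',
    '\t"live" vblex inf',
    '"<here>"',
    '\t"*here"',
])
print(find_unknown_words(sample))  # ['here']
```

An empty list means every cohort in the file has only real analyses.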

Measuring precision and recall

Precision and recall are measures of how accurate a transducer is. Precision is the proportion of returned analyses that are correct, and recall is the proportion of correct analyses that are returned.

In the example above of hand annotation, the precision is 90% (there are 9 true positives and 1 false positive), meaning that 90% of the returned analyses were correct. Recall is lower, at 75% (there are 9 true positives and 3 false negatives), meaning that only 75% of the correct analyses were returned.
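The figures above can be reproduced with a few lines of Python. This is an illustrative sketch only, not the precisionRecall script from the tools repo; the function name and the (token index, reading) representation are assumptions made for the example. The unknown-word reading "*here" is left out of the returned analyses, which is what gives the count of 9 true positives and 1 false positive.

```python
def precision_recall(returned, correct):
    """Compute precision and recall from two sets of (token index, reading) pairs."""
    true_positives = len(returned & correct)    # returned and correct
    false_positives = len(returned - correct)   # returned but wrong
    false_negatives = len(correct - returned)   # correct but not returned
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Readings present in both the analyser output and the hand annotation.
shared = {
    (0, '"this" det dem sg'), (0, '"this" prn dem sg'),
    (1, '"be" vbser pres p3 sg'),
    (2, '"I" prn p1 sg pos'),
    (3, '"house" n sg'),
    (4, '"." sent'),
    (5, '"I" prn p1 sg subj'),
    (6, '"live" vblex inf'),
    (8, '".." sent'),
}
# The analyser also returned a wrong past-tense reading for "live".
returned = shared | {(6, '"live" vblex past')}
# The annotation adds the three readings the analyser missed.
correct = shared | {
    (3, '"house" vblex tv inf'),
    (6, '"live" adj'),
    (7, '"here" adv'),
}

p, r = precision_recall(returned, correct)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.75
```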

There is a script in the tools repo called precisionRecall. You can update the repo (git pull) and run sudo make to ensure that you have this script installed on your system. You can then run precisionRecall referencecorpus.txt annotatedcorpus.txt.

Measuring trimmed coverage

Measuring trimmed coverage is just the same as measuring coverage, but with the appropriate "trimmed" transducer (e.g., xyz-abc.automorf.bin).

The assignment

This assignment is due at the end of week 12 (this semester, at the end of the day on Tuesday, 18 April 2017, before class).

  1. Before you begin, make sure all previous assignments are done, and add a "structural_transfer" tag to your transducer repositories and your translation pair repository/ies to mark the end of previous assignments.
  2. Set up some new corpora based on existing ones:
    • Combine your sentences and tests corpora so you have a new longer parallel corpus. Name the files abc.longer.txt and xyz.longer.txt.
    • Make a large monolingual corpus of a bunch of raw text in your language. The more the better. This step may simply consist of you cleaning up and/or combining the existing corpora from the initial corpus assembly assignment. See if you can get it over 100K words. The bigger this corpus is the better. Call it abc.corpus.large.txt (in your monolingual corpus repo) and add notes to your MANIFEST file about where the text comes from.
    • A hand-annotated monolingual corpus of sentences (see above) covering at least 1000 characters (500 for syllabic scripts) of your abc.corpus.basic.txt file, ideally sentences you understand / have English glosses of. Put the sentences you want to annotate in abc.annotated.raw.txt and dump this to abc.annotated.basic.txt to annotate it in CG format. Add these files to your monolingual corpus repository.
  3. If you've been working on separate MT pairs, combine your MT pairs into one repository (which you both have full access to), making sure to incorporate all of the following:
    • All entries from both dictionaries in a single .dix file. Make sure all translations are in the default direction of the pair (e.g., abc-xyz) and that r="RL" or "LR" attributes are set up for the right direction.
    • Both lrx files are there and have the right names.
    • All transfer files for both directions (up to 6 files) are there and have the right names and content.
    • Also (again!) make sure that there are no compiled binaries or other compiled files committed to the repo. If needed, use the apertium-init script to bootstrap a new pair to get the list of just the files that need to be in the repo, and use the tricks presented in removing binaries from transducer repo to clean it up.
    • Make it clear on the pair's wiki page which repository link is the final resting place of all this (but leave a link to the other repo and don't remove it). Also add a link to the final repo in the README in the superseded repo, with a note that the code has been merged into the other pair.
  4. Expand your MT pair in at least four of the following ways for each translation direction, listing in a "Final evaluation" section on the language pair's wiki page what you did (move existing evaluation sections under a new section called "Initial evaluation"), and for every rule (for all of the following except adding stems), list an example of what output was improved.
    • At least 100 more stems in the bilingual dictionary (and monolingual dictionaries). This counts for both translation directions.
    • Expanded your morphology to cover at least 6 more elements of some paradigm(s). This can be anything from additional verb or noun morphology, to adding all the forms of all the determiners (articles, demonstratives, etc.), to implementing nominal morphology on adjectives (e.g., if your language allows adjectives to be substantivised, which you'll want to add a tag for too).
    • At least 4 more twol rules that make your (analysis and) generation cleaner.
    • At least 4 new disambiguation rules that make the output of your tagger more accurate.
    • At least 3 new lexical selection rules that make more of the right stems transfer over.
    • At least 3 new transfer rules that make more of the output of your MT system closer to an acceptable target translation.
  5. When you are done with the above, document the following measures in the "Final evaluation" section of the pair's wiki page:
    • For each transducer:
      • Precision and recall against the annotated.basic corpus,
      • Coverage over the large corpus,
      • The number of words in the large corpus,
      • The number of stems in the transducer.
    • For MT in each direction:
      • WER and PER over the longer corpus.
      • The proportion of stems translated correctly in the longer corpus.
      • Trimmed coverage over the longer and large corpora.
      • The number of tokens in the longer and large corpora.
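
For reference, word error rate (WER) is commonly computed as the word-level edit distance between the MT output and the reference translation, divided by the number of words in the reference. The sketch below illustrates that definition under simple assumptions (whitespace tokenisation, no case or punctuation normalisation); the function name word_error_rate is hypothetical, and the evaluation tool used for the course may tokenise or normalise differently. PER is not shown.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from output
            insertion = current[j - 1] + 1   # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One of four reference words is missing from the output.
print(word_error_rate("this is my house", "this is house"))  # 0.25
```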