Difference between revisions of "Polished RBMT system"

From LING073
m (The assignment)
 
(48 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== Hand-annotating corpora ==
+
''If you're looking for information on corpus annotation and evaluation (precision/recall), that has moved to the [[Midterm overview]] page''
  
== Measuring precision and recall ==
+
== Measuring trimmed coverage ==
  
== The assignment ==
+
Apertium RBMT systems use a "trimmed" version of the base transducer for each language in the MT pair.  Each transducer is trimmed to only include entries found in the bilingual dictionary (<code>.dix</code>).  The trimmed transducers are almost always smaller than the main transducers.  Measuring coverage of the trimmed transducer gives you some idea of the ceiling of MT accuracy—or, at least, the percentage of forms in a corpus the MT system can analyse, regardless of translation accuracy.
  
'''Before you begin''', add a "structural_transfer" tag to your transducer repositories and your translation pair repository/ies.
+
Measuring trimmed coverage is just the same as measuring coverage, but with the appropriate "trimmed" transducer (e.g., <code>xyz-abc.automorf.bin</code>).  You'll need to use the <code>coverage-ltproc</code> script instead of <code>coverage-hfst</code>.
  
Set up some '''new corpora''' based on existing ones:
+
== The assignment ==
# Combine your <code>sentences</code> and <code>tests</code> corpora so you have a new longer parallel corpus.  Name the files <code>abc.longer.txt</code> and <code>xyz.longer.txt</code>.
+
This assignment is due at the end of week 12 (this semester, at the end of the day on '''Friday, 29 April 2022, before midnight''').
# Make a large corpus of a bunch of raw text in your language.  The more the better.  This step may simply consist of you cleaning up and/or combining the existing corpora from the [[initial corpus assembly]] assignment.  The bigger this corpus is the better.  Call it <code>abc.corpus.large.txt</code> (in your monolingual corpus repo) and add notes to your <code>MAINFEST</code> file about where the text comes from.
 
# A hand-annotated monolingual corpus of sentences (see above) covering at least 1000 characters (500 for syllabic scripts) of your <code>abc.corpus.basic.txt</code> file, ideally sentences you understand / have English glosses of.  Call this <code>abc.annotated.basic.txt</code> and put it in your monolingual corpus repository.
 
 
 
If you've been working on separate MT pairs, '''combine your MT pairs''' into one repository (which you both have full access to), making sure to incorporate all of the following:
 
* All entries from both dictionaries in a single <code>.dix</code> file.  Make sure all translations are in the default direction of the pair (e.g., <code>abc-xyz</code>) and that <code>r="RL"</code> or <code>"LR"</code> attributes are set up for the right direction.
 
* Both <code>lrx</code> files are there and have the right names.
 
* All transfer files for both directions (up to 6 files) are there and have the right names and content.
 
* Also make sure that there are no compiled binaries or other compiled files committed to the repo.  If needed, use the <code>apertium-init</code> script to bootstrap a new pair to get the list of just the files that need to be in the repo, and use the tricks presented in [[removing binaries from transducer repo]] to clean it up.
 
  
'''Expand your MT pair''' in {{InlineComment|x/y}} of the following ways, listing on the wiki ({{InlineComment|where}}):
+
# '''Before you begin''', make sure all previous assignments are done, and add a "structural_transfer" tag to your transducer repositories and your translation pair repository/ies to mark the end of previous assignments.
* At least '''100 more stems''' in the bilingual dictionary.
+
#* Also, please remove all binaries from all repositories!  See [[removing binaries from transducer repo]].
* '''Expanded your morphology''' to cover '''at least 6 more elements''' of some paradigm(s).  This can be anything from additional verb or noun morphology, to adding all the forms of all the determiners (articles, demonstratives, etc.), to implementing nominal morphology on adjectives (e.g., if your language allows adjectives to be substantivised, which you'll want to add a tag for too).
+
# Set up some '''new corpora''' based on existing ones:
* At least '''4 more twol rules''' that make your (analysis and) generation cleaner.
+
#* Combine and merge your <code>sentences</code> and <code>tests</code> corpora so you [hopefully, but not necessarily] have a new '''longer parallel corpus'''.  Name the files <code>abc.longer.txt</code> and <code>xyz.longer.txt</code>.
* {{InlineComment|Disambiguation}}
+
#* Make a '''large monolingual corpus''' of a bunch of raw text in your language.  The more the better.  This step may simply consist of you cleaning up and/or combining the existing corpora from the [[initial corpus assembly]] assignment.  See if you can get it over 100K words.  The bigger this corpus is the better.  Call it <code>abc.corpus.large.txt</code> (in your monolingual corpus repo) and add notes to your <code>MANIFEST</code> file about where the text comes from.
* {{InlineComment|Lexical selection}}
+
# '''Expand your MT pair''' in at least '''three''' of the following ways for each translation direction that you're working on, listing in a "Final evaluation" section on the language pair's wiki page what you did (move existing evaluation sections under a new section called "Initial evaluation"), and for every rule (for all of the following except adding stems), list an example of what output was improved.
* {{InlineComment|Structural transfer}}
+
#* At least '''100 more stems''' in the bilingual dictionary (and monolingual dictionaries as needed). This counts for both translation directions, if you are working on two-way translation.  (If you've automated creation of your bilingual dictionary, then you can use this task to clean up 100 stems—at least 50 entries in your bilingual dictionary should be different than before—and properly categorise those 100 stems into the monolingual dictionary.)
 +
#* '''Expand your morphology''' to cover '''at least 5 more elements''' of some paradigm(s).  This can be anything from additional verb or noun morphology, to adding all the forms of all the determiners (articles, demonstratives, etc.), to implementing nominal morphology on adjectives (e.g., if your language allows adjectives to be substantivised, which you'll want to add a tag for too).
 +
#* At least '''4 more twol rules''' that make your (analysis and) generation cleaner in some additional way.
 +
#* At least '''3 new disambiguation rules''' that make the output of your tagger more accurate.
 +
#* At least '''2 new lexical selection rules''' that make more of the right stems transfer over.
 +
#* At least '''2 new transfer rules''' that make more of the output of your MT system closer to an acceptable target translation.
 +
# When you are done with the above:
 +
#* Document which of the above options you completed on the pair's wiki page (in a section like "Additions").  You don't have to list the words or rules you added, but do list that you added ''n'' words or ''n'' transfer rules or the like.
 +
#* Add a "polished RBMT system" tag to your repo.
 +
#* '''Document the following measures''' in the "Final evaluation" section of the pair's wiki page:
 +
#** For your monolingual transducer:
 +
#*** Updated precision and recall against the (updated) <code>eval.test</code> and <code>eval.gold</code> files (see [[Midterm overview#Evaluating your transducer]]),
 +
#*** Coverage over the <code>large</code> corpus,
 +
#*** The number of words in the <code>large</code> corpus,
 +
#*** The number of stems in the transducer.
 +
#** For MT in the direction(s) you developed (<code>abc-xyz</code> and potentially <code>xyz-abc</code>):
 +
#*** WER and PER over the <code>longer</code> corpus.
 +
#*** The proportion of stems translated correctly in the <code>longer</code> corpus.
 +
#*** Trimmed coverage over the <code>longer</code> and <code>large</code> corpora.
 +
#*** The number of tokens in the <code>longer</code> and <code>large</code> corpora.
  
When you are done with the above, '''document the following measures''':
+
[[Category:Assignments]]
# For each transducer:
+
[[Category:Tutorials]]
#* Number of stems
 
#* Precision and recall {{InlineComment|which corpus}},
 
#* Coverage, {{InlineComment|which corpus}}
 
#* The size of the corpus and number of stems in transducer.
 
# For MT in each direction:
 
#* WER and PER over {{InlineComment|which corpus}}
 
#* {{InlineComment|trimmed coverage}}
 
#* The number of stems in {{InlineComment|the small corpus}}
 

Latest revision as of 09:09, 26 April 2022

If you're looking for information on corpus annotation and evaluation (precision/recall), that has moved to the Midterm overview page

Measuring trimmed coverage

Apertium RBMT systems use a "trimmed" version of the base transducer for each language in the MT pair. Each transducer is trimmed to only include entries found in the bilingual dictionary (.dix). The trimmed transducers are almost always smaller than the main transducers. Measuring coverage of the trimmed transducer gives you some idea of the ceiling of MT accuracy—or, at least, the percentage of forms in a corpus the MT system can analyse, regardless of translation accuracy.

Measuring trimmed coverage is just the same as measuring coverage, but with the appropriate "trimmed" transducer (e.g., xyz-abc.automorf.bin). You'll need to use the coverage-ltproc script instead of coverage-hfst.
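
As a rough illustration of what a coverage figure means, the sketch below computes naive coverage from text that has already been analysed into the Apertium stream format, where each token looks like ^surface/analysis$ and an unknown form is marked with a leading asterisk (e.g., ^xyzzy/*xyzzy$). This is not the coverage-ltproc script itself, whose behaviour may differ; the function name naive_coverage and the sample input are hypothetical, and escaped characters in the stream format are ignored for simplicity.

```python
import re

# Matches one lexical unit in Apertium stream format: ^surface/reading/...$
# Simplifying assumption: no escaped ^, $ or / inside tokens.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$", re.DOTALL)


def naive_coverage(analysed_text):
    """Return the fraction of tokens that received at least one analysis.

    A token counts as unknown when its first reading starts with '*',
    which is how unanalysed forms are marked in the stream format.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        readings = unit.split("/")[1:]  # drop the surface form
        if readings and not readings[0].startswith("*"):
            known += 1
    return known / len(units)


if __name__ == "__main__":
    # Hypothetical analyser output: two known tokens, one unknown.
    sample = "^the/the<det>$ ^cat/cat<n><sg>$ ^xyzzy/*xyzzy$"
    print(f"coverage: {naive_coverage(sample):.2%}")
```

Running the same calculation over output from the trimmed transducer rather than the full one would give the trimmed figure; the number will be the same or lower, since trimming only removes entries.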

The assignment

This assignment is due at the end of week 12 (this semester, at the end of the day on Friday, 29 April 2022, before midnight).

  1. Before you begin, make sure all previous assignments are done, and add a "structural_transfer" tag to your transducer repositories and your translation pair repository/ies to mark the end of previous assignments.
  2. Set up some new corpora based on existing ones:
    • Combine and merge your sentences and tests corpora so you [hopefully, but not necessarily] have a new longer parallel corpus. Name the files abc.longer.txt and xyz.longer.txt.
    • Make a large monolingual corpus of raw text in your language. This step may simply consist of cleaning up and/or combining the existing corpora from the initial corpus assembly assignment. The bigger this corpus is, the better; see if you can get it over 100K words. Call it abc.corpus.large.txt (in your monolingual corpus repo) and add notes to your MANIFEST file about where the text comes from.
  3. Expand your MT pair in at least three of the following ways for each translation direction that you're working on, listing in a "Final evaluation" section on the language pair's wiki page what you did (move existing evaluation sections under a new section called "Initial evaluation"), and for every rule (for all of the following except adding stems), list an example of what output was improved.
    • At least 100 more stems in the bilingual dictionary (and monolingual dictionaries as needed). This counts for both translation directions, if you are working on two-way translation. (If you've automated creation of your bilingual dictionary, then you can use this task to clean up 100 stems—at least 50 entries in your bilingual dictionary should be different than before—and properly categorise those 100 stems into the monolingual dictionary.)
    • Expand your morphology to cover at least 5 more elements of some paradigm(s). This can be anything from additional verb or noun morphology, to adding all the forms of all the determiners (articles, demonstratives, etc.), to implementing nominal morphology on adjectives (e.g., if your language allows adjectives to be substantivised, which you'll want to add a tag for too).
    • At least 4 more twol rules that make your (analysis and) generation cleaner in some additional way.
    • At least 3 new disambiguation rules that make the output of your tagger more accurate.
    • At least 2 new lexical selection rules that make more of the right stems transfer over.
    • At least 2 new transfer rules that make more of the output of your MT system closer to an acceptable target translation.
  4. When you are done with the above:
    • Document which of the above options you completed on the pair's wiki page (in a section like "Additions"). You don't have to list the words or rules you added, but do list that you added n words or n transfer rules or the like.
    • Add a "polished RBMT system" tag to your repo.
    • Document the following measures in the "Final evaluation" section of the pair's wiki page:
      • For your monolingual transducer:
        • Updated precision and recall against the (updated) eval.test and eval.gold files (see Midterm overview#Evaluating your transducer),
        • Coverage over the large corpus,
        • The number of words in the large corpus,
        • The number of stems in the transducer.
      • For MT in the direction(s) you developed (abc-xyz and potentially xyz-abc):
        • WER and PER over the longer corpus.
        • The proportion of stems translated correctly in the longer corpus.
        • Trimmed coverage over the longer and large corpora.
        • The number of tokens in the longer and large corpora.
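
For reference, the two MT error measures named above can be sketched as follows. This is a minimal illustration and not the tool used in the course: it assumes whitespace tokenisation, takes WER as token-level edit distance divided by reference length, and uses one common formulation of PER that ignores word order. The function names are hypothetical, and a dedicated evaluation tool may normalise text or define PER slightly differently.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, 1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, 1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance over tokens / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate (bag-of-words formulation)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # Number of tokens shared by both sides, regardless of position.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat on the mat sat"
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"PER: {per(reference, hypothesis):.2%}")
```

In the example, the hypothesis contains all the right words in the wrong order, so PER is zero while WER is not; reporting both helps separate lexical errors from word-order errors.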