Difference between revisions of "Polished RBMT system"
From LING073
Revision as of 01:11, 6 April 2017
Hand-annotating corpora
Measuring precision and recall
The assignment
- Before you begin, add a "structural_transfer" tag to your transducer repositories and your translation pair repository/ies.
- Set up some new corpora based on existing ones:
  - Combine your sentences and tests corpora so you have a new, longer parallel corpus. Name the files abc.longer.txt and xyz.longer.txt.
  - Make a large corpus of raw text in your language. This step may simply consist of cleaning up and/or combining the existing corpora from the initial corpus assembly assignment. The bigger this corpus is, the better. Call it abc.corpus.large.txt (in your monolingual corpus repo) and add notes to your MANIFEST file about where the text comes from.
  - Make a hand-annotated monolingual corpus of sentences (see above) covering at least 1000 characters (500 for syllabic scripts) of your abc.corpus.basic.txt file, ideally sentences you understand / have English glosses of. Call this abc.annotated.basic.txt and put it in your monolingual corpus repository.
- If you've been working on separate MT pairs, combine your MT pairs into one repository (which you both have full access to), making sure to incorporate all of the following:
  - All entries from both dictionaries in a single .dix file. Make sure all translations are in the default direction of the pair (e.g., abc-xyz) and that r="RL" or "LR" attributes are set up for the right direction.
  - Both lrx files are there and have the right names.
  - All transfer files for both directions (up to 6 files) are there and have the right names and content.
  - Also make sure that there are no compiled binaries or other compiled files committed to the repo. If needed, use the apertium-init script to bootstrap a new pair to get the list of just the files that need to be in the repo, and use the tricks presented in "removing binaries from transducer repo" to clean it up.
- Expand your MT pair in at least four of the following ways, listing (on the wiki (where?)) what you did, and for every rule (for all of the following except adding stems), list an example of what output was improved.
  - At least 100 more stems in the bilingual dictionary (and monolingual dictionaries).
  - Expanded your morphology to cover at least 6 more elements of some paradigm(s). This can be anything from additional verb or noun morphology, to adding all the forms of all the determiners (articles, demonstratives, etc.), to implementing nominal morphology on adjectives (e.g., if your language allows adjectives to be substantivised, which you'll want to add a tag for too).
  - At least 4 more twol rules that make your (analysis and) generation cleaner.
  - At least 4 new disambiguation rules that make the output of your tagger more accurate.
  - At least 3 new lexical selection rules that make more of the right stems transfer over.
  - At least 3 new transfer rules that make more of the output of your MT system closer to an acceptable target translation.
- When you are done with the above, document the following measures:
  - For each transducer:
    - Precision and recall against the annotated.basic corpus,
    - Coverage against the large corpus,
    - The size of the large corpus,
    - The number of stems in the transducer.
  - For MT in each direction:
    - WER and PER over the longer corpus.
    - Trimmed coverage over the longer and large corpora.
    - The number of stems in the longer and large corpora.
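
As a rough illustration of what the measures listed above compute, here is a minimal Python sketch. It is not the tooling the course uses, and every function name in it is made up for this example. It assumes that WER is word-level edit distance divided by reference length, that PER is the same idea with word order ignored, that precision and recall are counted over individual analyses per token, and that unknown tokens appear in analyser output as ^word/*word$. Escaped characters in the stream are ignored, so the coverage figure is only approximate.

```python
import re
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop a reference word
                            curr[j - 1] + 1,      # add a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate: word order is ignored."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


def precision_recall(gold, predicted):
    """Precision and recall of predicted analyses against hand annotations.

    Both arguments are lists of sets, one set of analyses per token.
    """
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    return true_pos / n_predicted, true_pos / n_gold


def coverage(analysed):
    """Share of tokens that received at least one analysis.

    Assumption: unknown tokens look like ^word/*word$ in the input string.
    """
    units = re.findall(r"\^([^$]*)\$", analysed)
    known = [u for u in units if "/*" not in u]
    return len(known) / len(units)
```

For example, wer("a b c d".split(), "a c b d".split()) counts the swapped pair as two errors out of four words, while per on the same input reports no errors because every word is present. Running coverage over the output of a trimmed analyser should give a trimmed-coverage figure in the same way.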