Polished RBMT system
From LING073
Revision as of 14:30, 5 April 2017 by Jwashin1
Hand-annotating corpora
The assignment
Set up some new corpora based on existing ones:
- Combine your sentences and tests corpora so you have a new, longer parallel corpus. Name the files abc.longer.txt and xyz.longer.txt.
- Make a large corpus of raw text in your language. This step may simply consist of cleaning up and/or combining the existing corpora from the initial corpus assembly assignment. The bigger this corpus is, the better. Call it abc.corpus.large.txt (in your monolingual corpus repo) and add notes to your MANIFEST file about where the text comes from.
- Make a hand-annotated monolingual corpus of sentences (see above) covering at least 1000 characters (500 for syllabic scripts) of your abc.corpus.basic.txt file, ideally sentences you understand or have English glosses of. Call this abc.annotated.basic.txt and put it in your monolingual corpus repository.
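Combining the sentences and tests corpora can be done by hand, but a small script makes it easy to check that both sides of the parallel corpus stay aligned. The following is a minimal sketch only: the input file names (abc.sentences.txt, abc.tests.txt and their xyz counterparts) and the helper name combine_corpora are assumptions for illustration, not names prescribed by the assignment.

```python
from pathlib import Path


def combine_corpora(parts, output):
    """Concatenate text files in order, write the result, and return the line count.

    Blank lines are kept, so line N of one language's output still pairs with
    line N of the other language's output in a parallel corpus.
    """
    lines = []
    for part in parts:
        # splitlines() drops only the line terminators, not empty lines
        lines.extend(Path(part).read_text(encoding="utf-8").splitlines())
    Path(output).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Calling combine_corpora(["abc.sentences.txt", "abc.tests.txt"], "abc.longer.txt") and then the same for the xyz files should return the same number both times; if the two counts differ, the two sides of the corpus are no longer aligned line by line.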
Expand your MT pair in x/y of the following ways, listing on the wiki (where):
- At least 100 more stems in the bilingual dictionary
- Expanded morphology to cover ...
- Disambiguation
- Lexical selection
- Structural transfer
When you are done with the above, document the following measures:
- For each transducer:
  - Number of stems
  - Precision and recall, and over which corpus
  - Coverage, and over which corpus
  - The size of the corpus and the number of stems in the transducer
- For MT in each direction:
  - WER and PER, and over which corpus
  - Trimmed coverage
  - The number of stems in the small corpus
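To make the measures above concrete, here is a rough Python sketch of how each one can be computed. It is an illustration under stated assumptions, not the evaluation scripts used in the course: the function names are hypothetical, coverage is taken to be the share of tokens that receive at least one analysis, precision and recall are counted over individual analyses against a hand-annotated gold standard, WER is word-level edit distance divided by reference length, and PER uses one common bag-of-words formulation (other definitions exist).

```python
from collections import Counter


def coverage(analysed):
    """Naive coverage: share of tokens that got at least one analysis.

    `analysed` is a list of booleans, one per token (an assumed input format).
    """
    return sum(1 for ok in analysed if ok) / len(analysed)


def precision_recall(gold, predicted):
    """Precision and recall over analyses.

    `gold` and `predicted` are lists of sets of analyses, one set per token.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # analyses the transducer got right
        fp += len(p - g)  # analyses the transducer produced but gold lacks
        fn += len(g - p)  # gold analyses the transducer missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: like WER, but ignoring word order."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)
```

For a reference "a b c d" and a hypothesis "a c b d", wer gives 0.5 (two substitutions over four words) while per gives 0.0, since the hypothesis contains exactly the right words in the wrong order. Reporting both numbers helps separate lexical errors from word-order errors.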