Difference between revisions of "Polished RBMT system"
Revision as of 14:41, 30 April 2019
Hand-annotating corpora
First you want to analyze your corpus and output to CG format:
cat corpus.txt | apertium -d . xyz-morph | cg-conv -a > corpus.out.txt
Your new file probably now looks something like this:
"<This>"
    "this" det dem sg
    "this" prn dem sg
"<is>"
    "be" vbser pres p3 sg
"<my>"
    "I" prn p1 sg pos
"<house>"
    "house" n sg
"<.>"
    "." sent
"<I>"
    "I" prn p1 sg subj
"<live>"
    "live" vblex inf
    "live" vblex past
"<here>"
    "*here"
"<..>"
    ".." sent
In this example, you might notice a few things that need fixing:
- "here" isn't being analysed; it should have an adverb reading
- "house" should also have a verb reading
- "live" should also have an adjective reading
- "live" isn't the past tense form of this verb
The following annotation makes these corrections. This is sometimes called a "gold standard".
"<This>"
    "this" det dem sg
    "this" prn dem sg
"<is>"
    "be" vbser pres p3 sg
"<my>"
    "I" prn p1 sg pos
"<house>"
    "house" n sg
    "house" vblex tv inf
"<.>"
    "." sent
"<I>"
    "I" prn p1 sg subj
"<live>"
    "live" vblex inf
    "live" adj
"<here>"
    "here" adv
"<..>"
    ".." sent
Note: There should be no unknown words ("analyses" with *) when you're done.
Measuring precision and recall
Precision and recall are measures of how accurate a transducer is. Precision is the proportion of returned analyses that are correct, and recall is the proportion of correct analyses that are returned.
In the example above of hand annotation, the precision is 90% (there are 9 true positives and 1 false positive), meaning that 90% of the returned analyses were correct. Recall is lower, at 75% (there are 9 true positives and 3 false negatives), meaning that only 75% of the correct analyses were returned.
There is a script in the tools repo called precisionRecall. You can update the repo (git pull) and run sudo make to ensure that you have this script installed on your system. You can then run precisionRecall referencecorpus.txt annotatedcorpus.txt.
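To make the arithmetic behind these figures concrete, here is a minimal Python sketch that counts true positives, false positives and false negatives over two CG-format texts. It is not the precisionRecall script from the tools repo; the function names (parse_cg, precision_recall) are hypothetical, and it assumes that readings are indented, that both texts contain the same tokens in the same order, and that unknown-word readings (those starting with *) do not count as returned analyses.

```python
def parse_cg(text):
    """Return one (surface form, set of readings) pair per token (cohort)."""
    cohorts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            reading = " ".join(line.split())
            # Assumption: unknown-word readings ("*...") are not analyses.
            if cohorts and not reading.startswith('"*'):
                cohorts[-1][1].add(reading)
        elif line.startswith('"<'):
            cohorts.append((line.strip(), set()))
    return cohorts


def precision_recall(returned_text, gold_text):
    """Compare transducer output against a hand-annotated gold standard."""
    returned = parse_cg(returned_text)
    gold = parse_cg(gold_text)
    if [form for form, _ in returned] != [form for form, _ in gold]:
        raise ValueError("both texts must contain the same tokens in order")
    tp = fp = fn = 0
    for (_, got), (_, want) in zip(returned, gold):
        tp += len(got & want)   # returned and correct
        fp += len(got - want)   # returned but not correct
        fn += len(want - got)   # correct but not returned
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# The second sentence of the example above, before and after annotation.
RETURNED = '''
"<I>"
    "I" prn p1 sg subj
"<live>"
    "live" vblex inf
    "live" vblex past
"<here>"
    "*here"
"<..>"
    ".." sent
'''

GOLD = '''
"<I>"
    "I" prn p1 sg subj
"<live>"
    "live" vblex inf
    "live" adj
"<here>"
    "here" adv
"<..>"
    ".." sent
'''

precision, recall = precision_recall(RETURNED, GOLD)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

On this shorter excerpt there are 3 true positives, 1 false positive and 2 false negatives, so the sketch reports a precision of 0.75 and a recall of 0.60. Run over the full two-sentence example, the same counting gives the 90% and 75% quoted above.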
Measuring trimmed coverage
Measuring trimmed coverage is just the same as measuring coverage, but with the appropriate "trimmed" transducer (e.g., xyz-abc.automorf.bin).
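As a rough illustration of what a coverage figure is, the following Python sketch counts the tokens in CG-format output that received at least one known analysis. It assumes the output was produced with the transducer you want to measure (the trimmed one for trimmed coverage), that readings are indented, and that unknown words are marked with *. The function name coverage is hypothetical, and counting punctuation as tokens is a choice of this sketch rather than a rule from the assignment.

```python
def coverage(cg_text):
    """Return (known tokens, total tokens, proportion known)."""
    cohorts = []  # one list of readings per token
    for line in cg_text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            if cohorts:
                cohorts[-1].append(line.strip())
        elif line.startswith('"<'):
            cohorts.append([])
    total = len(cohorts)
    # A token is covered if at least one reading is not an unknown ("*") one.
    known = sum(
        1 for readings in cohorts
        if any(not r.startswith('"*') for r in readings)
    )
    return known, total, (known / total if total else 0.0)


SAMPLE = '''
"<I>"
    "I" prn p1 sg subj
"<live>"
    "live" vblex inf
    "live" vblex past
"<here>"
    "*here"
"<..>"
    ".." sent
'''

known, total, proportion = coverage(SAMPLE)
print(f"{known}/{total} tokens analysed ({proportion:.0%})")
```

Here three of the four tokens have an analysis, so the sketch reports 75% coverage.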
The assignment
This assignment is due at the end of week 12 (this semester, at the end of the day on Friday, 19 April 2019, before midnight).
- Before you begin, make sure all previous assignments are done, and add a "structural_transfer" tag to your transducer repositories and your translation pair repository/ies to mark the end of previous assignments.
  - Also, please remove all binaries from all repositories! See removing binaries from transducer repo.
- Set up some new corpora based on existing ones:
  - Combine and merge your sentences and tests corpora so you (hopefully, but not necessarily) have a new longer parallel corpus. Name the files abc.longer.txt and xyz.longer.txt.
  - Make a large monolingual corpus of raw text in your language. This step may simply consist of cleaning up and/or combining the existing corpora from the initial corpus assembly assignment. See if you can get it over 100K words; the bigger this corpus is, the better. Call it abc.corpus.large.txt (in your monolingual corpus repo) and add notes to your MANIFEST file about where the text comes from.
  - A hand-annotated monolingual corpus of sentences (see above) covering at least 1000 characters (500 for syllabic scripts) of your abc.corpus.basic.txt file, ideally sentences you understand / have English glosses of. Put the sentences you want to annotate in abc.annotated.raw.txt and dump this to abc.annotated.basic.txt to annotate it in CG format. Add these files to your monolingual corpus repository.
- Expand your MT pair in at least three of the following ways for each translation direction that you're working on. List what you did in a "Final evaluation" section on the language pair's wiki page (move existing evaluation sections under a new section called "Initial evaluation"). For every rule you add (that is, for all of the following except adding stems), give an example of output that was improved.
  - At least 100 more stems in the bilingual dictionary (and monolingual dictionaries as needed). This counts for both translation directions, if you are working on two-way translation. (If you've automated creation of your bilingual dictionary, then you can use this task to clean up 100 stems and properly categorise them in the monolingual dictionary; at least 50 entries in your bilingual dictionary should be different than before.)
  - Expand your morphology to cover at least 5 more elements of some paradigm(s). This can be anything from additional verb or noun morphology, to adding all the forms of all the determiners (articles, demonstratives, etc.), to implementing nominal morphology on adjectives (e.g., if your language allows adjectives to be substantivised, which you'll want to add a tag for too).
    - If you're using an HFST transducer in your MT system, adding at least 4 more twol rules that make your (analysis and) generation cleaner can count as an additional way.
  - At least 3 new disambiguation rules that make the output of your tagger more accurate.
  - At least 2 new lexical selection rules that make more of the right stems transfer over.
  - At least 2 new transfer rules that make more of the output of your MT system closer to an acceptable target translation.
- When you are done with the above:
  - Document which of the above options you completed on the pair's wiki page (in a section like "Additions"). You don't have to list the words or rules you added, but do list that you added n words or n transfer rules or the like.
  - Add a "polished RBMT system" tag to your repo.
  - Document the following measures in the "Final evaluation" section of the pair's wiki page:
    - For each monolingual transducer:
      - Precision and recall against the annotated.basic corpus,
      - Coverage over the large corpus,
      - The number of words in the large corpus,
      - The number of stems in the transducer.
    - For MT in each direction:
      - WER and PER over the longer corpus.
      - The proportion of stems translated correctly in the longer corpus.
      - Trimmed coverage over the longer and large corpora.
      - The number of tokens in the longer and large corpora.
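For orientation on the WER and PER measures listed above, here is a minimal Python sketch of one common way to define them over whitespace-separated tokens. WER is the word-level edit distance divided by the reference length; PER ignores word order and compares the two sentences as bags of words. The function names are hypothetical, and the evaluation tool you actually use (for example apertium-eval-translator) may tokenise, normalise case, or define PER slightly differently, so treat the numbers from this sketch as illustrative only.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance over reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate (one common bag-of-words definition)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


reference = "this is my house".split()
reordered = "this my is house".split()
shortened = "this is house".split()

print(wer(reference, reordered), per(reference, reordered))
print(wer(reference, shortened), per(reference, shortened))
```

The reordered hypothesis has all the right words in the wrong order, so its WER is 0.5 while its PER is 0.0. The shortened hypothesis is missing one word, so both measures come out at 0.25. Comparing the two numbers in this way gives a rough idea of how much of the error is due to word order rather than word choice.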