Evaluating your transducer
One of the standard ways of evaluating a morphological transducer is to hand-annotate analyses of valid forms in a language, and compare the output of the transducer to this "gold standard". Precision and recall are standard ways to quantify the comparison.
Creating a gold standard
First you want to analyze your corpus and output to CG format:
cat corpus.txt | apertium -d . xyz-morph | cg-conv -a > corpus.out.txt
Your new file probably now looks something like this:
"<This>" "this" det dem sg "this" prn dem sg "<is>" "be" vbser pres p3 sg "<my>" "I" prn p1 sg pos "<house>" "house" n sg "<.>" "." sent "<I>" "I" prn p1 sg subj "<live>" "live" vblex inf "live" vblex past "<here>" "*here" "<..>" ".." sent
In this example, you might note a few fixes:
- "here" isn't being analysed; it should have an adverb reading
- "house" should have a verb reading
- "live" should have an adjective reading
- "live" isn't the past tense form of this verb
The following annotation makes these corrections. This is sometimes called a "gold standard".
"<This>" "this" det dem sg "this" prn dem sg "<is>" "be" vbser pres p3 sg "<my>" "I" prn p1 sg pos "<house>" "house" n sg "house" vblex tv inf "<.>" "." sent "<I>" "I" prn p1 sg subj "<live>" "live" vblex inf "live" adj "<here>" "here" adv "<..>" ".." sent
Note: There should be no unknown words ("analyses" with *) in your gold standard when you're done annotating it.
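As a quick sanity check, you can count leftover unknown readings with grep. A minimal sketch, using a toy sample file as a stand-in for your real gold file (the file name is illustrative):

```shell
# Build a one-token sample containing an unknown reading (a reading
# beginning with *); substitute your own gold file in practice.
printf '"<here>"\n\t"*here"\n' > sample.gold.txt

# Count lines containing an unknown reading ("* marks an unanalysed form).
grep -c '"\*' sample.gold.txt   # prints 1 for this sample; a finished gold file should print 0
```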
Measuring precision and recall
Precision and recall are measures of how accurate a transducer is. Precision is the proportion of returned analyses that are correct, and recall is the proportion of correct analyses that are returned.
In the hand-annotated example above, the precision is 90% (there are 9 true positives and 1 false positive), meaning that 90% of the returned analyses were correct. Recall is lower, at 75% (there are 9 true positives and 3 false negatives), meaning that only 75% of the correct analyses were returned.
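Given true-positive, false-positive, and false-negative counts, the percentages work out as follows (a sketch using the counts from the example above):

```shell
# Counts from the hand-annotated example:
# 9 true positives, 1 false positive, 3 false negatives.
awk -v tp=9 -v fp=1 -v fn=3 'BEGIN {
  printf "precision = %.1f%%\n", 100 * tp / (tp + fp)   # correct out of returned
  printf "recall    = %.1f%%\n", 100 * tp / (tp + fn)   # correct out of expected
}'
# prints:
# precision = 90.0%
# recall    = 75.0%
```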
There is a script in the tools repo called precisionRecall. You can update the repo (git pull) and run sudo make to ensure that you have this script installed on your system. You can then run precisionRecall referencecorpus.txt annotatedcorpus.txt.
This assignment is due in class on Thursday of the 8th week (this semester, 9:55 on Thursday, March 23rd, 2023).
This assignment has two parts:
- evaluate your morphological transducer, and
- create a short presentation on your language and transducer.
Evaluate your morphological transducer by creating a hand-annotated monolingual corpus of sentences (see above) covering at least 1000 characters (500 for syllabic scripts) of your abc.corpus.basic.txt file, ideally sentences you understand / have English glosses of.
- Start by putting the sentences you want to annotate in a file, and use wc -m to count the number of characters.
- Use the command listed earlier to analyse your corpus and output it in CG format; this output is your test file.
- Copy the test file to a new file; this copy will serve as your gold file.
- Add these three new files to your monolingual corpus repository.
- Annotate the gold file (it's recommended to commit and push to git frequently). Do not fix errors in your transducer as you go.
- Measure precision and recall using the script to compare the test file to the gold file.
For your midterm overview, you will need to speak for about 5 minutes and outline the following:
- What language you worked on and some quick facts about it (where it's spoken, the size of the community that uses it, any prominent languages it might be related to),
- General morphological (lots of prefixes? suffixes? reduplication?) and phonological/orthographic (lots of phonological changes? accent markings? syllabic orthography?) properties of the language,
- Some evaluation metrics:
- What sort of coverage you're currently getting on your corpus (as a percentage of forms analysed, and as the number of tokens analysed out of the total), and what did the most to increase that number in the last couple of weeks,
- The accuracy of your transducer (precision and recall, as you measured above),
- The mean ambiguity over your corpus with and without disambiguation.
- A couple aspects of the course that you've found the most challenging (finding resources? understanding what constitutes normative orthography? understanding the grammar? learning to work with lexd or twol? implementing prefixes/infixes? dealing with too many characters in the orthography? not being able to find ambiguous forms?),
- Discuss some specific aspect(s) of the orthography/grammar that was/were particularly challenging to understand or deal with (one of the following):
- You can briefly outline a grammar point (what the alternation is) and what you had to do to implement it computationally.
- Alternatively, feel free to share a problem that's caused you lots of grief and/or that you still haven't solved.
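The coverage and mean-ambiguity figures mentioned above can be estimated directly from a CG-format analysis file. A minimal sketch with toy data (the file name and contents are illustrative, not from a real corpus):

```shell
# Build a tiny CG-format sample: 2 tokens, 1 unanalysed, 3 readings total.
printf '"<live>"\n\t"live" vblex inf\n\t"live" vblex past\n"<here>"\n\t"*here"\n' > sample.out.txt

# Tokens are lines starting with "<; readings are tab-indented lines;
# unanalysed forms have a reading beginning with *.
awk '/^"</ {t++} /"\*/ {u++} /^\t/ {r++} END {
  printf "coverage = %.1f%% (%d/%d tokens analysed)\n", 100*(t-u)/t, t-u, t
  printf "mean ambiguity = %.2f readings/token\n", r/t
}' sample.out.txt
# prints:
# coverage = 50.0% (1/2 tokens analysed)
# mean ambiguity = 1.50 readings/token
```

For ambiguity with and without disambiguation, run the same count over the analysis file before and after your disambiguation step.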
You may create a visual aid to support your presentation. I recommend creating a single page, either on the wiki or a PDF slide/"poster" with five sections that you can step through quickly. This year, please add a slide to the shared deck (let me know if you have trouble accessing it).
Each point can be covered very briefly—you only have about 5 minutes!
The point of this assignment is for the class to get a picture of what their fellow students' work is like. Everyone's working on different languages, so this should give some sense of the diversity of problems encountered. On the other hand, everyone's struggled with one thing or another, so there will be some familiarity in others' overviews.