Midterm overview

From LING073

Revision as of 11:36, 5 February 2022

Evaluating your transducer

One of the standard ways of evaluating a morphological transducer is to hand-annotate analyses of valid forms in a language, and compare the output of the transducer to this "gold standard". Precision and recall are standard ways to quantify the comparison.

Creating a gold standard

First, analyse your corpus and output the result in CG format:

cat corpus.txt | apertium -d . xyz-morph | cg-conv -a > corpus.out.txt

Your new file probably now looks something like this:

"<This>"
	"this" det dem sg
	"this" prn dem sg
"<is>"
	"be" vbser pres p3 sg
"<my>"
	"I" prn p1 sg pos
"<house>"
	"house" n sg
"<.>"
	"." sent
"<I>"
	"I" prn p1 sg subj
"<live>"
	"live" vblex inf
	"live" vblex past
"<here>"
	"*here"
"<..>"
	".." sent

In this example, you might notice a few things that need fixing:

  • "here" isn't being analysed; it should have an adverb reading
  • "house" should have a verb reading
  • "live" should have an adjective reading
  • "live" isn't the past tense form of this verb

The following hand-corrected annotation makes these corrections. It is the "gold standard" for this example.

"<This>"
	"this" det dem sg
	"this" prn dem sg
"<is>"
	"be" vbser pres p3 sg
"<my>"
	"I" prn p1 sg pos
"<house>"
	"house" n sg
	"house" vblex tv inf
"<.>"
	"." sent
"<I>"
	"I" prn p1 sg subj
"<live>"
	"live" vblex inf
	"live" adj
"<here>"
	"here" adv
"<..>"
	".." sent

Note: There should be no unknown words ("analyses" with *) in your gold standard when you're done annotating it.
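The check described in the note can be automated. The following is a minimal sketch, not part of the course tools: the function names parse_cg and unknown_forms are hypothetical, and it assumes the CG layout shown above (a "<form>" line followed by indented reading lines, with unknown words marked by a lemma starting with *).

```python
def parse_cg(text):
    """Parse CG-format text into a list of (surface_form, readings) pairs."""
    cohorts = []
    for line in text.splitlines():
        stripped = " ".join(line.split())  # normalise tabs and spaces
        if stripped.startswith('"<') and stripped.endswith('>"'):
            cohorts.append((stripped[2:-2], []))  # new cohort: "<form>"
        elif stripped and cohorts:
            cohorts[-1][1].append(stripped)  # reading under the current cohort
    return cohorts


def unknown_forms(text):
    """Return surface forms that still have a reading whose lemma starts with *."""
    return [form for form, readings in parse_cg(text)
            if any(r.startswith('"*') for r in readings)]


# Two cohorts taken from the example output above.
sample = '''
"<live>"
    "live" vblex inf
"<here>"
    "*here"
'''
print(unknown_forms(sample))  # ['here']
```

An empty list means the gold file no longer contains any unknown words.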

Measuring precision and recall

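Precision and recall are measures of how accurate a transducer is. Precision is the proportion of returned analyses that are correct (true positives divided by the sum of true positives and false positives), and recall is the proportion of correct analyses that are returned (true positives divided by the sum of true positives and false negatives).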

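In the hand-annotation example above, precision is 90%: of the 10 analyses the transducer returned, 9 are correct (true positives) and 1, the past-tense reading of "live", is not (a false positive). Recall is lower, at 75%: of the 12 analyses in the gold standard, 9 were returned and 3 were missed (false negatives: the verb reading of "house", the adjective reading of "live", and the adverb reading of "here"). The unanalysed "*here" is not counted as a returned analysis.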
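The arithmetic for this example can be reproduced with a short script. This is a sketch only and is not the precisionRecall script from the tools repo, whose behaviour may differ. The names parse_cg and precision_recall are hypothetical. The sketch assumes the two files contain the same tokens in the same order, and that unknown-word placeholders (lemmas starting with *) do not count as returned analyses.

```python
def parse_cg(text):
    """Parse CG-format text into a list of (surface_form, readings) pairs."""
    cohorts = []
    for line in text.splitlines():
        stripped = " ".join(line.split())  # normalise tabs and spaces
        if stripped.startswith('"<') and stripped.endswith('>"'):
            cohorts.append((stripped[2:-2], []))
        elif stripped and cohorts:
            cohorts[-1][1].append(stripped)
    return cohorts


def precision_recall(test_text, gold_text):
    """Return (tp, fp, fn, precision, recall) for test output against gold."""
    test, gold = parse_cg(test_text), parse_cg(gold_text)
    if len(test) != len(gold):
        raise ValueError("test and gold must contain the same tokens")
    tp = fp = fn = 0
    for (_, t_readings), (_, g_readings) in zip(test, gold):
        # Assumption: "*form" placeholders are not returned analyses.
        returned = {r for r in t_readings if not r.startswith('"*')}
        correct = set(g_readings)
        tp += len(returned & correct)
        fp += len(returned - correct)
        fn += len(correct - returned)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, fn, precision, recall


# Transducer output from the example above.
TEST = '''
"<This>"
    "this" det dem sg
    "this" prn dem sg
"<is>"
    "be" vbser pres p3 sg
"<my>"
    "I" prn p1 sg pos
"<house>"
    "house" n sg
"<.>"
    "." sent
"<I>"
    "I" prn p1 sg subj
"<live>"
    "live" vblex inf
    "live" vblex past
"<here>"
    "*here"
"<..>"
    ".." sent
'''

# Gold standard: the same cohorts with the four corrections applied.
GOLD = (TEST
        .replace('"house" n sg', '"house" n sg\n    "house" vblex tv inf')
        .replace('"live" vblex past', '"live" adj')
        .replace('"*here"', '"here" adv'))

tp, fp, fn, p, r = precision_recall(TEST, GOLD)
print(f"TP={tp} FP={fp} FN={fn} precision={p:.0%} recall={r:.0%}")
# TP=9 FP=1 FN=3 precision=90% recall=75%
```

Comparing readings per token, rather than as one pooled set, keeps a reading that is correct for one token from matching the same reading on another.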

There is a script in the tools repo called precisionRecall. You can update the repo (git pull) and run sudo make to ensure that you have this script installed on your system. You can then run precisionRecall referencecorpus.txt annotatedcorpus.txt.


The assignment

This assignment is due in class on Thursday of the 7th week (this semester, 8:30 on Thursday, March 7th, 2019).

This assignment has two parts: evaluate your morphological transducer and create a short presentation on your language and transducer.

Evaluate your morphological transducer by creating a hand-annotated monolingual corpus of sentences (see above) covering at least 1000 characters (500 for syllabic scripts) of your abc.corpus.basic.txt file, ideally sentences you understand / have English glosses of.

  1. Start by putting the sentences you want to annotate in abc.annotated.raw.txt. Use wc -m to count the number of characters.
  2. Use the command listed earlier to analyse your corpus and output it in CG format. Output to abc.annotated.test.txt.
  3. Copy the test file to abc.annotated.gold.txt
  4. Add these three new files to your monolingual corpus repository.
  5. Annotate the gold file (committing and pushing to git frequently is recommended). Do not fix errors in your transducer as you go.
  6. Measure precision and recall using the script to compare the test file to the gold file.
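The length requirement in step 1 can also be checked programmatically. This is a minimal sketch with a hypothetical function name. It assumes the text is read as UTF-8 and counts every character including spaces and newlines, which should roughly match what wc -m reports in a UTF-8 locale.

```python
def meets_length_requirement(text, syllabic=False):
    """Check the text against the 1000-character minimum (500 for syllabic scripts)."""
    minimum = 500 if syllabic else 1000
    return len(text) >= minimum


# Hypothetical usage with the file name from step 1:
# with open("abc.annotated.raw.txt", encoding="utf-8") as f:
#     print(meets_length_requirement(f.read()))
print(meets_length_requirement("a" * 1000))  # True
```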

For your midterm overview, you will need to speak for about 5 to 10 minutes and outline the following:

  • What language you worked on and some quick facts about it (where it's spoken, the size of the community that uses it, any prominent languages it might be related to),
  • General morphological (lots of prefixes? suffixes? reduplication?) and phonological/orthographic (lots of phonological changes? accent markings? syllabic orthography?) properties of the language,
  • What sort of coverage you're currently getting on your corpus (as a percentage of forms analysing, and as number of tokens being analysed out of total), and what did the most to increase that number in the last couple of weeks,
  • A couple aspects of the course that you've found the most challenging (finding resources? understanding what constitutes normative orthography? understanding the grammar? learning to work with lexc or twol? implementing prefixes/infixes? dealing with too many characters in the orthography?),
  • Some specific aspect(s) of the orthography/grammar that was/were particularly challenging to understand or deal with.
    • You should briefly outline the grammar point (what the alternation is) and what you had to do to implement it computationally.
    • Also feel free to share a problem that's caused you lots of grief and/or that you still haven't solved.
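The coverage figures asked for above can be estimated from the CG-format output of your analyser. This is a rough sketch with a hypothetical function name. It assumes that every "<form>" cohort counts as one token (punctuation included) and that a token counts as analysed if at least one of its readings is not a * placeholder. Other tools may count differently.

```python
def coverage(cg_text):
    """Return (analysed, total) token counts for CG-format analyser output."""
    cohorts = []  # one list of readings per token
    for line in cg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"<') and stripped.endswith('>"'):
            cohorts.append([])
        elif stripped and cohorts:
            cohorts[-1].append(stripped)
    analysed = sum(1 for readings in cohorts
                   if any(not r.startswith('"*') for r in readings))
    return analysed, len(cohorts)


sample = '''
"<live>"
    "live" vblex inf
"<here>"
    "*here"
"<..>"
    ".." sent
'''
analysed, total = coverage(sample)
print(f"{analysed}/{total} tokens analysed ({analysed / total:.1%})")
# 2/3 tokens analysed (66.7%)
```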

The point of this assignment is for the class to get a picture of what their fellow students' work is like. Everyone's working on different languages, so this should give some sense of the diversity of problems encountered. On the other hand, everyone's struggled with one thing or another, so there will be some familiarity in others' overviews.