Morphological disambiguator

From LING073

Why we need disambiguation

Imagine you have a word that has two different analyses, e.g.

^houses/house<n><pl>/house<v><tv><p3><sg>$
^this/this<det><dem><sg>/this<prn><dem><sg>$

Without disambiguation, the pipeline will effectively pick one of these analyses at random. Now imagine a sentence where the wrong analysis is chosen (in this case, twice!):

^The/the<det><def>$ ^motel/motel<n><sg>$ ^houses/house<n><pl>$ ^this/this<prn><dem><sg>$ ^dog/dog<n><sg>$^./.<sent>$

The goal of a disambiguator is to choose the correct analysis based on the surrounding words.

Using Constraint Grammar to disambiguate

Constraint Grammar (CG) is a formalism for writing context-sensitive constraints that select or remove analyses from the list of possible analyses for each word.

The structure of a CG file

At the top you define delimiters. You can then add lists (which are roughly analogous to sets in twol). After that you can have any number of constraints. You can also group lists and constraints into sections. In the original specification of Constraint Grammar, all the constraints in a section are applied simultaneously; the sections are applied in order, and lists are always global. In the version we use (VISL CG3), the constraints are applied in strict order within each section.
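A minimal sketch of this layout, with illustrative delimiters and set names (not prescribed by CG itself):

```
DELIMITERS = "<.>" "<!>" "<?>" ;

LIST Noun = n ;

SECTION
# Constraints go here; in VISL CG3 they apply in order.
REMOVE (n pl) IF (-1 Noun) ;
```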

Constraints

Think of constraints as a series of heuristics to whittle down the number of analyses until you have as little ambiguity as possible.

The two most commonly used constraint types either remove a reading under a certain condition (REMOVE) or select a reading under a certain condition, discarding all others (SELECT). The syntax is, e.g., REMOVE <target> [contextual tests] ;.

Targets can be list names, like Nominals, or individual tags in parentheses, e.g. (n pl). The target identifies the reading to select or remove.

Each contextual test is enclosed in parentheses, and all tests must match for the constraint to apply. Inside the parentheses, the first element is a position: 0 means the current cohort, -1 the previous cohort, 1 the following cohort, and so on. The rest of the test is what should be matched at that position. To match any member of the Nominals set in the previous position, write (-1 Nominals). You can match tag sequences by adding an extra set of parentheses, e.g. (1 (n pl)) matches a plural noun in the following position. You can match specific baseforms with quotation marks, e.g. (1 ("house" n)) matches any form of the noun "house" in the following position.
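Putting these pieces together, the two ambiguities in the example at the top of the page might be handled by rules along these lines (the list name and exact conditions are illustrative, not the only possible solution):

```
LIST Noun = n ;

# "houses" right after a noun ("the motel houses ...") is a verb:
# remove the plural-noun reading when a noun immediately precedes.
REMOVE (n pl) IF (-1 Noun) ;

# "this" directly before a noun is a determiner, not a pronoun.
SELECT (det dem) IF (1 Noun) ;
```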

Classroom activity: implement the example at the top of the page. Solution.

Useful commands

The file that your disambiguation constraints are defined in is the apertium-xyz.xyz.rlx file.

Getting output before disambiguation

example command:

echo "The motel houses this dog" | apertium -d . eng-morph

example output:

^The/the<det><def>$ ^motel/motel<n><sg>$ ^houses/house<n><pl>/house<v><tv><p3><sg>$ ^this/this<det><dem><sg>/this<prn><dem><sg>$ ^dog/dog<n><sg>$^./.<sent>$

Getting output after disambiguation

example command:

echo "The motel houses this dog" | apertium -d . eng-tagger

example output:

^The<det><def>$ ^motel<n><sg>$ ^house<v><tv><p3><sg>$ ^this<det><dem><sg>$ ^dog<n><sg>$^.<sent>$

Seeing which constraints are doing what

example command:

echo "The motel houses this dog" | apertium -d . eng-disam

example output:

"<The>"
	"the" det def
"<motel>"
	"motel" n sg
"<houses>"
	"house" v tv p3 sg
;	"house" n pl REMOVE:26
"<this>"
	"this" det dem sg
;	"this" prn dem sg REMOVE:29
"<dog>"
	"dog" n sg
"<.>"
	"." sent


Calculating ambiguity

The ambiguity of a corpus is the average number of analyses provided per token. You can get the number of tokens like this:

cat corpus.txt | lt-proc /path/to/xyz.automorf.bin | sed 's/$\W*\^/$\n^/g' | wc -l

You can get the total number of analyses in your corpus like this:

cat corpus.txt | lt-proc /path/to/xyz.automorf.bin | sed 's/$\W*\^/$\n^/g' | cut -f2- -d'/' | sed 's/\//\n/g' | wc -l

You can get the total number of analyses in your corpus after disambiguation like this:

cat corpus.txt | lt-proc /path/to/xyz.automorf.bin | cg-proc /path/to/xyz.rlx.bin | sed 's/$\W*\^/$\n^/g' | cut -f2- -d'/' | sed 's/\//\n/g' | wc -l


Automating it a little further, you can put something like the following in a file and run it as a shell script. The first three lines need to be replaced with the path to your corpus, your compiled analyser, and your compiled disambiguator, respectively.

CORPUS=/path/to/xyz.corpus.txt
MORPH=/path/to/xyz.automorf.bin
RLX=/path/to/xyz.rlx.bin # you might need to remove ".bin"
TOKENS=`cat ${CORPUS} | lt-proc ${MORPH} | sed 's/$\W*\^/$\n^/g' | wc -l`
ANALYSES=`cat ${CORPUS} | lt-proc ${MORPH} | sed 's/$\W*\^/$\n^/g' | cut -f2- -d'/' | sed 's/\//\n/g' | wc -l`
DISAMB=`cat ${CORPUS} | lt-proc ${MORPH} | cg-proc ${RLX} | sed 's/$\W*\^/$\n^/g' | cut -f2- -d'/' | sed 's/\//\n/g' | wc -l`
# If 'calc' is unavailable, echo "scale=3; $ANALYSES/$TOKENS" | bc works too.
AMBIGPRE=`calc $ANALYSES/$TOKENS`
AMBIGPOST=`calc $DISAMB/$TOKENS`
echo "Ambiguity before disambiguation: ${AMBIGPRE}"
echo "Ambiguity after disambiguation: ${AMBIGPOST}"

It's recommended that you put this script in tests/ and name it disambiguation-test.sh.

Checking what forms are ambiguous in the corpus and transducer

You can get all of the words in the corpus along with their analyses (or lack thereof) like this:

cat corpus.txt | lt-proc /path/to/xyz.automorf.bin | sed 's/$\W*\^/$\n^/g'

You can get all of the words in the corpus that have more than one analysis like this:

cat corpus.txt | lt-proc /path/to/xyz.automorf.bin | sed 's/$\W*\^/$\n^/g' | grep '\/.*\/'

You can find all forms in your transducer like this:

hfst-expand xyz.automorf.hfst

You can count ambiguity in your transducer by comparing the number of all forms to the number of unique forms.

Number of all forms:

hfst-expand xyz.automorf.hfst | wc -l

Number of unique forms:

hfst-expand xyz.automorf.hfst | cut -f1 -d':' | sort -u | wc -l

If the numbers differ, you'll want to compare the lists:

hfst-expand xyz.automorf.hfst | cut -f1 -d':' | sort > /tmp/totalforms
hfst-expand xyz.automorf.hfst | cut -f1 -d':' | sort -u > /tmp/uniqforms
diff /tmp/totalforms /tmp/uniqforms

Then you'll want to see what the multiple analyses are of any forms that show up there.
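As a shortcut, uniq -d prints only duplicated lines, which here are exactly the forms with more than one analysis. A sketch using printf to stand in for hfst-expand output (the entries are hypothetical):

```shell
# printf simulates `hfst-expand xyz.automorf.hfst` output here;
# uniq -d keeps only lines that occur more than once after sorting.
printf 'dog:dog<n><sg>\nhouses:house<n><pl>\nhouses:house<v><tv><p3><sg>\n' \
  | cut -f1 -d':' | sort | uniq -d
# prints: houses
```

On a real transducer: hfst-expand xyz.automorf.hfst | cut -f1 -d':' | sort | uniq -d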

The assignment

This assignment is due at noon on Friday of the 7th week of class (this semester, 12:00, Friday, March 3rd, 2017). On the Thursday before that, you will also be expected to present a brief midterm overview to the class.

Getting set up

  1. If you're still working on your generator, finish that first.
  2. Commit and push your code at this point, and then add a tag "generator":
    • git tag generator
    • Then push the tag to the repo: git push origin generator
  3. Make a Language/Disambiguation page on the wiki, linked to from your main Language page, with a section titled something like "initial evaluation of ambiguity".
  4. If you haven't yet, add some punctuation to your transducer.
    • You can look at the Punctuation lexicon in an existing lexc file, e.g. the Turkmen one.
    • Make sure to define your new multicharacter symbols (i.e., the new tags being used).
  5. Find the "xyz-tagger" section of your modes.xml file. Change cg-proc -w to cg-proc -1 -n -w (but not in the "xyz-disam" section!). Recompile, commit, push.
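After the change, the relevant part of the xyz-tagger mode in modes.xml might look roughly like this (file names and the surrounding pipeline are illustrative; your mode will differ):

```xml
<mode name="xyz-tagger">
  <pipeline>
    <program name="lt-proc -w">
      <file name="xyz.automorf.bin"/>
    </program>
    <!-- add -1 -n here, but leave the cg-proc -w in xyz-disam alone -->
    <program name="cg-proc -1 -n -w">
      <file name="xyz.rlx.bin"/>
    </program>
  </pipeline>
</mode>
```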

Identify ambiguity

  1. Add some more of the most common stems identified by a coverage test.
  2. Measure the level of ambiguity in your corpus.
    • If ambiguity is 1 (i.e., there is no ambiguity), then find any form in the language that can have more than one analysis. Check a grammar book and dictionary and think creatively.
      • Implement this into your transducer.
      • Recompile, and measure level of ambiguity again.
    • If ambiguity is greater than 1 (i.e., there are at least some ambiguous forms), then figure out what form is ambiguous.
  3. Record the level of ambiguity in the "initial evaluation of ambiguity" section on the disambig eval wiki page.
  4. Find two sentences, one for each of two possible readings of an ambiguous form. Make sure all the words are in your transducer with correct analyses. Put them on the wiki page, highlighting the correct reading of the ambiguous form in each sentence. For example, if the word was "this", examples might be:
    • ^This/this<det><dem><sg>/this<prn><dem><sg>$ ^is/be<v><iv><pres><p3><sg>$ ^a/a<det><ind>$ ^dog/dog<n><sg>$^./.<sent>$
    • ^I/I<prn><pers><p1><sg><sbj>$ ^saw/see<v><tv><past>$ ^this/this<det><dem><sg>/this<prn><dem><sg>$ ^dog/dog<n><sg>$^./.<sent>$
    • Note that it's best to use real sentences, e.g. from your corpus (or any of the sources that went into the corpus). If you absolutely must, you may create examples, but make note of this!

Write constraints

  1. Once you have identified examples of ambiguity, come up with prose forms of the constraints you'll need to write to get the correct form in each of the two sentences.
    • Be as linguistically sound as you can, so that the constraints will apply in the broadest possible range of scenarios.
  2. Using the prose forms of the constraints as guidance (and e.g., providing them as comments in the code), implement CG versions of the constraints.
    • Be sure to define any sets/lists you might need.
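As a sketch of this practice, a prose constraint and its CG counterpart might be paired like this (the set name and rule are illustrative):

```
LIST Nominals = n prn ;

# Prose: a demonstrative immediately followed by a nominal acts as a
# determiner, not a pronoun.
SELECT (det dem) IF (1 Nominals) ;
```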

Finish up

  1. Once you get your disambiguator working as expected, commit and push your code.
  2. Also, add a link to the git repository on the disambiguator wiki page, and vice versa (i.e., a link to the wiki page as a comment in the .rlx file in the git repo).
  3. Measure ambiguity again and add that number to the wiki page under something like "final evaluation of ambiguity".

More resources