Dependency syntax

From LING073

Dependency syntax is a framework for syntax that treats relations between words as "dependencies". It is convenient for many computational linguistics applications because every token has a "head" (the word it depends on) and a "relation" to that head. E.g., in "a house", the determiner/article "a" depends on the noun "house" (the noun is the head of the determiner), and the relation is det (the exact label depends on the annotation standard used).
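
As a rough illustration of the head-and-relation idea, the sketch below stores each token of "a house" together with the index of its head and the relation label. The Token class and its field names are made up for this example, and labelling "house" as root assumes the phrase stands on its own.

```python
from dataclasses import dataclass

@dataclass
class Token:
    index: int      # 1-based position in the sentence
    form: str
    head: int       # index of the head token; 0 means "no head" (root)
    relation: str   # label of the relation to the head

# "a house": the determiner depends on the noun with relation det.
# Treating "house" as root is an assumption for this two-word example.
phrase = [Token(1, "a", 2, "det"), Token(2, "house", 0, "root")]

def describe(tokens):
    """Return one readable line per token, naming its head and relation."""
    forms = {token.index: token.form for token in tokens}
    lines = []
    for token in tokens:
        head_form = forms.get(token.head, "ROOT")
        lines.append(f"{token.form} --{token.relation}--> {head_form}")
    return lines

print("\n".join(describe(phrase)))
```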

Universal Dependencies (UD) is a project aimed at designing a standardised set of dependency syntax annotation principles for use across languages, with community-contributed language-specific annotation guidelines. It's become pretty popular, and recently released its 2nd version of annotation principles.


Guidance

Annotation of a single sentence

You start with a sentence:

Everyone in this class works on a different language.

You run it through your disambiguator and manually disambiguate as needed so that each word has only one analysis (in CG format):

"<Everyone>"
	"everyone" prn ind mf sg
"<in>"
	"in" pr
"<this>"
	"this" det dem sg
"<class>"
	"class" n sg
"<works>"
	"work" vblex pres p3 sg
"<on>"
	"on" adv
"<a>"
	"a" det ind sg
"<different>"
	"different" adj
"<language>"
	"language" n sg
"<.>"
	"." sent

Then you consider each word and what other word it depends on and how. E.g., "Everyone" has an nsubj dependency on "works" (i.e., it's its nominal subject). You then encode these relationships in the sentence:

"<Everyone>"
	"everyone" prn ind mf sg @nsubj #1->5
"<in>"
	"in" pr @case #2->4
"<this>"
	"this" det dem sg @det #3->4
"<class>"
	"class" n sg @nmod #4->1
"<works>"
	"work" vblex pres p3 sg @root #5->0
"<on>"
	"on" adv @compound:prt #6->5
"<a>"
	"a" det ind sg @det #7->9
"<different>"
	"different" adj @amod #8->9
"<language>"
	"language" n sg @obj #9->5
"<.>"
	"." sent @punct #10->5

When plotted as a tree, this sentence looks like the following:

Dependency tree of "Everyone in this class works on a different language."
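
If the image is not available, the same tree can be approximated as indented text. The sketch below copies the token ids, relations and heads from the annotated sentence above; the render function is a hypothetical helper written for this page, not part of any UD tool.

```python
# (token_id, form, relation, head_id), copied from the annotated sentence;
# head 0 marks the root.
TOKENS = [
    (1, "Everyone", "nsubj", 5),
    (2, "in", "case", 4),
    (3, "this", "det", 4),
    (4, "class", "nmod", 1),
    (5, "works", "root", 0),
    (6, "on", "compound:prt", 5),
    (7, "a", "det", 9),
    (8, "different", "amod", 9),
    (9, "language", "obj", 5),
    (10, ".", "punct", 5),
]

def render(tokens, head=0, depth=0):
    """Return indented lines listing each head's dependents in sentence order."""
    lines = []
    for token_id, form, relation, head_id in tokens:
        if head_id == head:
            lines.append("  " * depth + f"{form} ({relation})")
            # Recurse to place this token's own dependents one level deeper.
            lines.extend(render(tokens, token_id, depth + 1))
    return lines

print("\n".join(render(TOKENS)))
```

Each level of indentation is one step further from the root, so "in" and "this" end up three levels deep, under "class", which is under "Everyone".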

Training a model

UDPipe is a Free / Open Source tool (should be on the virtual machine) that can train a parser for (read: make a model for how to best guess) UD relations. It will use whatever information you give it (including lemmas, morphological tags, etc.), and can be trained to guess these as well.

The following command trains a UD parser on corpus.conllu and outputs the model to abc.model.udpipe:

cat corpus.conllu | udpipe --tagger=none --tokenizer=none --train abc.model.udpipe

Testing a model

The following command parses corpus.conllu using the previously trained abc.model.udpipe:

udpipe --parse abc.model.udpipe corpus.conllu

With the additional switch --accuracy, it will return measures of its accuracy:

udpipe --accuracy --parse abc.model.udpipe corpus.conllu

The two main metrics it returns are the following:

  • Labeled Attachment Score (LAS): the percentage of tokens that the parser assigned the right dependency head and relation to.
  • Unlabeled Attachment Score (UAS): the percentage of tokens that the parser assigned the right dependency head to.
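
To make the two definitions concrete, here is a minimal sketch that computes both scores from aligned lists of (head, relation) pairs. The function name and the "predicted" values are invented for illustration, and UDPipe's own evaluation may differ in details that this sketch ignores.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages for two aligned lists of (head, relation)."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and the same length")
    # UAS only compares heads; LAS compares head and relation together.
    right_head = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    right_both = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100 * right_head / total, 100 * right_both / total

# Gold analysis of "Everyone in this class works" from the example sentence.
gold = [(5, "nsubj"), (4, "case"), (4, "det"), (1, "nmod"), (0, "root")]
# Hypothetical parser output: token 2 has the right head but the wrong
# relation, and token 4 has the wrong head.
predicted = [(5, "nsubj"), (4, "det"), (4, "det"), (5, "nmod"), (0, "root")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS: {uas:.1f}%  LAS: {las:.1f}%")
```

Since a token only counts towards LAS when its head is also right, LAS can never be higher than UAS.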

Removing morphological information

In conllu format, column 3 holds lemmas (LEMMA), column 5 holds language-specific POS tags (XPOS), and column 6 holds other morphological tags (FEATS). All values in these columns need to be blanked to _ to create a corpus with no morphological information.

The following command will output a corpus with columns 3, 5, and 6 replaced with _:

sed -r 's/[^\t]+/_/3 ; s/[^\t]+/_/5 ; s/[^\t]+/_/6' corpus.conllu

To output it to a new file, just use >:

sed -r 's/[^\t]+/_/3 ; s/[^\t]+/_/5 ; s/[^\t]+/_/6' corpus.conllu > new.corpus.conllu
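
If sed is not available, or you want to see exactly what happens to each line, the same blanking can be sketched in Python. The function name blank_columns and the sample line are made up for this example; it assumes tab-separated token lines and leaves comment lines and blank sentence separators alone.

```python
def blank_columns(line, columns=(3, 5, 6)):
    """Replace the given 1-based CoNLL-U columns with "_" on token lines."""
    if not line.strip() or line.startswith("#"):
        return line  # blank separators and comment lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    for column in columns:
        if column <= len(fields):
            fields[column - 1] = "_"
    # Put the newline back only if the input line had one.
    return "\t".join(fields) + ("\n" if line.endswith("\n") else "")

# Hypothetical token line for "Everyone" with morphology filled in.
sample = "1\tEveryone\teveryone\t_\tprn\tind|mf|sg\t5\tnsubj\t_\t_\n"
print(blank_columns(sample), end="")
```

To process a whole file, apply the function to every line of corpus.conllu and write the results to the new file.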

Converting between conllu and CG formats

Try the scripts available in Fran's ud-scripts repo, in particular vislcg3-to-conllu.py and conllu-to-vislcg.py. (And let someone know and/or file a bug report if you notice any problems.)

To convert to conllu format from CG (where "prefix" is just an arbitrary code to prefix to sentence ids):

cat corpus.txt | vislcg3-to-conllu.py prefix > corpus.conllu

To convert to CG format from conllu:

cat corpus.conllu | conllu-to-vislcg.py > corpus.txt

The conversion may be lossy, so don't expect a 1:1 roundtrip.
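
To get a feel for what the CG-to-conllu direction involves, here is a simplified sketch that turns one annotated CG word (a wordform line plus its single reading) into one conllu line. This is not the ud-scripts implementation: the function name is hypothetical, and the column placement (first tag in column 5, remaining tags joined with | in column 6, column 4 left as _) is an assumption based on the column description above. Real UD corpora use Name=Value pairs in column 6.

```python
import re

# An annotated reading looks like:  "everyone" prn ind mf sg @nsubj #1->5
READING = re.compile(
    r'^\s+"(?P<lemma>[^"]*)"\s*(?P<tags>.*?)\s*'
    r'@(?P<rel>\S+)\s+#(?P<id>\d+)->(?P<head>\d+)\s*$'
)

def cohort_to_conllu(wordform_line, reading_line):
    """Convert one CG wordform line and its reading to a 10-column conllu line."""
    form = wordform_line.strip()[2:-2]  # drop the surrounding "< and >"
    match = READING.match(reading_line)
    if match is None:
        raise ValueError(f"not an annotated reading: {reading_line!r}")
    tags = match["tags"].split()
    xpos = tags[0] if tags else "_"        # assumption: first tag is the POS
    feats = "|".join(tags[1:]) or "_"      # assumption: the rest go in column 6
    columns = [match["id"], form, match["lemma"], "_", xpos, feats,
               match["head"], match["rel"], "_", "_"]
    return "\t".join(columns)

print(cohort_to_conllu('"<Everyone>"', '\t"everyone" prn ind mf sg @nsubj #1->5'))
```

Anything the target format has no slot for is dropped in a conversion like this, which is one way a roundtrip ends up lossy.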

Useful links

The assignment

This assignment is due before the Tuesday class of the last week of classes (this semester, 11:20am on Tuesday, 25 April 2017).

Prepare your corpora

  1. Copy your abc.annotated.basic.txt corpus (from the previous assignment) to abc.annotated.ud.txt, and hand-disambiguate it. In other words, remove all but the correct analysis for each form (i.e., the one that's right in that sentence). Add, commit, and push this corpus.
  2. Create a second corpus of at least 500 characters (or 250 for syllabic writing systems), one sentence per line, and call it abc.annotated2.raw.txt. Run it through your tagger and convert to CG format as before. You can skip the "all and only all possible analyses" step that you did for precision and recall, and just correct it to have the correct analysis for the given sentence. Call this new file abc.annotated2.ud.txt, and add, commit, and push these two files.

Annotate your corpora for dependencies and post-process

  1. Add dependency annotation (heads and relations) to both abc.annotated.ud.txt and abc.annotated2.ud.txt. Commit and push.
  2. Convert these to conllu format, in files with .conllu extensions instead of .txt extensions. Commit and push.
  3. Remove all morphological information from abc.annotated.ud.conllu and call it abc.annotated.nomorph.conllu. Add, commit, and push.

Train and evaluate a parser

  1. Train two parsers: one on abc.annotated.ud.conllu and one on abc.annotated.nomorph.conllu. Call the models abc.withmorph.udpipe and abc.nomorph.udpipe, respectively. Put them in the dev/ folder of your repository and commit them.
  2. Try parsing both abc.annotated.ud.conllu and abc.annotated2.ud.conllu with each of the models you made. Make note of the UAS and LAS scores.

Documentation

  1. Make a page on the wiki called Language/Universal_Dependencies, add it to the category sp17_UD and link to it as a resource from your main wiki page for the language.
  2. Add an "Evaluation" section, and make a grid that includes the UAS and LAS scores from your four parsing attempts above, as well as the number of forms (and sentences) in each corpus.
  3. Add a "Dependency relations" section, and make a subsection for each of five dependency relations that you used at least twice in your annotation. For each relation, provide:
    • A description of the relation, noting various ways it might be used in the language,
    • Two examples of the relation from your corpus, preferably illustrating what you described.
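
For the corpus sizes in the "Evaluation" grid, a small counting sketch like the one below may help. The function name and the two-sentence sample are invented for illustration; it assumes sentences are separated by blank lines and counts only lines whose first column is a plain integer, so multiword-token ranges (such as 1-2) are not counted as forms.

```python
def count_conllu(text):
    """Return (sentences, forms) for a string of conllu data."""
    sentences = 0
    forms = 0
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            in_sentence = False  # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue  # comment lines such as sentence ids
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():
            forms += 1
            if not in_sentence:
                sentences += 1
                in_sentence = True
    return sentences, forms

# Hypothetical two-sentence corpus.
sample = (
    "# sent_id = abc:1\n"
    "1\tEveryone\teveryone\t_\tprn\tind|mf|sg\t2\tnsubj\t_\t_\n"
    "2\tworks\twork\t_\tvblex\tpres|p3|sg\t0\troot\t_\t_\n"
    "3\t.\t.\t_\tsent\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = abc:2\n"
    "1\tWork\twork\t_\tvblex\timp\t0\troot\t_\t_\n"
    "2\t!\t!\t_\tsent\t_\t1\tpunct\t_\t_\n"
)
print(count_conllu(sample))
```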