Dependency syntax

From LING073
Revision as of 01:53, 18 April 2017 by Jwashin1


Dependency syntax is a framework for syntax in which the relations between words are treated as "dependencies". It is convenient for many computational linguistics applications: any given token has a "head" (the word it depends on) and a "relation" to that head. E.g., in "a house", the determiner/article depends on the noun (the noun is the head of the determiner), and the relation is det (depending on the annotation standard used).

Universal Dependencies is a project aimed at designing a standardised set of dependency syntax annotation principles for use across languages, with community-contributed language-specific annotation guidelines.


Guidance

Annotation of a single sentence

You start with a sentence:

Everyone in this class works on a different language.

You run it through your disambiguator and manually disambiguate as needed so that each word has only one analysis, in CG format:

"<Everyone>"
	"everyone" prn ind mf sg
"<in>"
	"in" pr
"<this>"
	"this" det dem sg
"<class>"
	"class" n sg
"<works>"
	"work" vblex pres p3 sg
"<on>"
	"on" adv
"<a>"
	"a" det ind sg
"<different>"
	"different" adj
"<language>"
	"language" n sg
"<.>"
	"." sent

Then you consider each word, which other word it depends on, and how. E.g., "Everyone" has an nsubj dependency on "works" (i.e., it is its nominal subject). You then encode these relationships in the sentence: each reading gets a relation tag (e.g., @nsubj) and an indexed dependency (e.g., #1->5, meaning token 1 depends on token 5; the root of the sentence depends on 0):

"<Everyone>"
	"everyone" prn ind mf sg @nsubj #1->5
"<in>"
	"in" pr @case #2->4
"<this>"
	"this" det dem sg @det #3->4
"<class>"
	"class" n sg @nmod #4->1
"<works>"
	"work" vblex pres p3 sg @root #5->0
"<on>"
	"on" adv @compound:prt #6->5
"<a>"
	"a" det ind sg @det #7->9
"<different>"
	"different" adj @amod #8->9
"<language>"
	"language" n sg @obj #9->5
"<.>"
	"." sent @punct #10->5

When plotted as a tree, this sentence looks like the following:

[Figure: dependency tree of "Everyone in this class works on a different language."]
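For the conversion step in the assignment below, the same annotation can be sketched in conllu format: one token per line, ten tab-separated columns. The column mapping here is an assumption for illustration — the first CG tag is placed in column 5 (XPOS) with columns 4 (UPOS) and 6 (FEATS) left as _; the head and relation come directly from the #n->m indices and @-tags:

```
1	Everyone	everyone	_	prn	_	5	nsubj	_	_
2	in	in	_	pr	_	4	case	_	_
3	this	this	_	det	_	4	det	_	_
4	class	class	_	n	_	1	nmod	_	_
5	works	work	_	vblex	_	0	root	_	_
6	on	on	_	adv	_	5	compound:prt	_	_
7	a	a	_	det	_	9	det	_	_
8	different	different	_	adj	_	9	amod	_	_
9	language	language	_	n	_	5	obj	_	_
10	.	.	_	sent	_	5	punct	_	_
```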

Evaluation

Two metrics:

  • Labeled Attachment Score (LAS): the percentage of tokens that the parser assigned the right dependency head and relation to.
  • Unlabeled Attachment Score (UAS): the percentage of tokens that the parser assigned the right dependency head to.
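To make the two scores concrete, here is a small sketch that computes them directly from a gold-standard file and a parser-output file with identical tokenization (gold.conllu and sys.conllu are hypothetical files, built inline here; columns 7 and 8 of a conllu line hold the head and relation):

```shell
# Two tiny example files with identical tokenization: a gold standard
# and a parser's output that gets one relation wrong.
printf '1\tdogs\tdog\tn\t_\t_\t2\tnsubj\t_\t_\n2\tbark\tbark\tvblex\t_\t_\t0\troot\t_\t_\n' > gold.conllu
printf '1\tdogs\tdog\tn\t_\t_\t2\tobj\t_\t_\n2\tbark\tbark\tvblex\t_\t_\t0\troot\t_\t_\n' > sys.conllu

# paste joins the files line by line: gold = fields 1-10, system =
# fields 11-20. Compare HEAD ($7 vs $17) for UAS, and HEAD+DEPREL
# ($8 vs $18) for LAS, over token lines only.
paste gold.conllu sys.conllu | awk -F'\t' '
  /^[0-9]+\t/ {
    total++
    if ($7 == $17) { uas++; if ($8 == $18) las++ }
  }
  END { printf "UAS: %.1f%%  LAS: %.1f%%\n", 100*uas/total, 100*las/total }'
# prints: UAS: 100.0%  LAS: 50.0%
```

Here the parser attached both tokens to the right head (UAS 100%) but labeled only one of them correctly (LAS 50%).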

Training a model

A parser can be trained on a conllu corpus with the udpipe command-line tool.
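A minimal sketch of the training command, assuming UDPipe 1 is installed and on your PATH (flag names are taken from the UDPipe manual; check udpipe --help for your version):

```shell
# Train only the parser component (no tokenizer or tagger),
# using the hand-annotated conllu corpus as training data.
udpipe --train abc.withmorph.udpipe \
       --tokenizer=none --tagger=none \
       abc.annotated.ud.conllu
```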

Testing a model

A trained model can be evaluated against a gold-standard conllu file with udpipe, which reports the attachment scores described above.
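A sketch of the evaluation command, again assuming UDPipe 1 (its --accuracy mode prints UAS and LAS for the parser instead of parsed output; verify the flags with udpipe --help):

```shell
# Parse the gold-standard file with the trained model and report
# attachment scores rather than printing the parse trees.
udpipe --accuracy --parse abc.withmorph.udpipe abc.annotated2.ud.conllu
```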


Remove morphological information

In conllu format, column 3 holds the lemma (LEMMA), column 5 the POS tag (XPOS), and column 6 the remaining morphological tags (FEATS). All values in these three columns need to be blanked to _ to create a corpus with no morphological information.

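One way to do this is a sed one-liner (this sketch assumes GNU sed, which accepts \t for tab; the regex keeps the ten-column layout and only rewrites columns 3, 5, and 6 on token lines):

```shell
# Demonstrated on a single sample token line. In practice, run the
# same sed command on abc.annotated.ud.conllu and redirect the
# output to abc.annotated.nomorph.conllu.
printf '1\tdogs\tdog\tNOUN\tn\tpl\t2\tnsubj\t_\t_\n' \
  | sed -E '/^[0-9]/ s/^([^\t]*\t[^\t]*\t)[^\t]*(\t[^\t]*\t)[^\t]*\t[^\t]*/\1_\2_\t_/'
```

Comment lines and blank lines are left untouched because the substitution only applies to lines beginning with a token ID.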

Useful links

The assignment

Prepare your corpora

  1. Copy your abc.annotated.basic.txt corpus (from the previous assignment) to abc.annotated.ud.txt, and hand-disambiguate it. In other words, remove all but the correct analysis for each form (i.e., the one that's right in that sentence). Add, commit, and push this corpus.
  2. Create a second corpus of at least 500 characters (or 250 for syllabic writing systems), one sentence per line, and call it abc.annotated2.raw.txt. Run it through your tagger and convert to CG format as before. You can skip the "all and only the possible analyses" step that you did for precision and recall, and simply correct each form to the analysis that is right in the given sentence. Call this new file abc.annotated2.ud.txt, and add, commit, and push these two files.

Annotate your corpora for dependencies and post-process

  1. Add dependency annotation (heads and relations) to both abc.annotated.ud.txt and abc.annotated2.ud.txt. Commit and push.
  2. Convert these to conllu format, in files with .conllu extensions instead of .txt extensions. Commit and push.
  3. Remove all morphological information from abc.annotated.ud.conllu and call it abc.annotated.nomorph.conllu. Add, commit, and push.

Train and evaluate a parser

  1. Train two parsers: one on abc.annotated.ud.conllu and one on abc.annotated.nomorph.conllu. Call the models abc.withmorph.udpipe and abc.nomorph.udpipe, respectively.

Document your findings