Dependency syntax is a framework for syntax that treats the relations between words as "dependencies". It is easy to work with in many computational linguistics applications: every token has a "head" (the word it depends on) and a "relation" to that head. E.g., in "a house", the determiner/article depends on the noun (the noun is the head of the determiner), and the relation is det (the exact label depends on the annotation standard used).
Universal Dependencies (UD) is a project aimed at designing a standardised set of dependency syntax annotation principles for use across languages, with community-contributed language-specific annotation guidelines. It's become pretty popular, and recently released its 2nd version of annotation principles.
- 1 Guidance
- 2 The assignment
Annotation of a single sentence
You start with a sentence:
Everyone in this class works on a different language.
You run it through your disambiguator and manually disambiguate as needed so that each word has only one analysis (in CG format):
"<Everyone>" "everyone" prn ind mf sg "<in>" "in" pr "<this>" "this" det dem sg "<class>" "class" n sg "<works>" "work" vblex pres p3 sg "<on>" "on" adv "<a>" "a" det ind sg "<different>" "different" adj "<language>" "language" n sg "<.>" "." sent
Then you consider each word and what other word it depends on and how. E.g., "Everyone" has an nsubj dependency on "works" (i.e., it is the nominal subject of "works"). You then encode these relationships in the sentence:
"<Everyone>" "everyone" prn ind mf sg @nsubj #1->5 "<in>" "in" pr @case #2->4 "<this>" "this" det dem sg @det #3->4 "<class>" "class" n sg @nmod #4->1 "<works>" "work" vblex pres p3 sg @root #5->0 "<on>" "on" adv @compound:prt #6->5 "<a>" "a" det ind sg @det #7->9 "<different>" "different" adj @amod #8->9 "<language>" "language" n sg @obj #9->5 "<.>" "." sent @punct #10->5
When plotted as a tree, this sentence looks like the following:
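The same structure can also be recovered in plain text from the `#id->head` pairs. The sketch below is only an illustration: the regular expression and the function names `read_tokens` and `render_tree` are assumptions made for this example, not part of any CG tool. It reads the annotated sentence and prints each word indented under its head.

```python
import re

# The annotated example sentence, one cohort (word form plus its reading) per line.
CG_SENTENCE = '''
"<Everyone>" "everyone" prn ind mf sg @nsubj #1->5
"<in>" "in" pr @case #2->4
"<this>" "this" det dem sg @det #3->4
"<class>" "class" n sg @nmod #4->1
"<works>" "work" vblex pres p3 sg @root #5->0
"<on>" "on" adv @compound:prt #6->5
"<a>" "a" det ind sg @det #7->9
"<different>" "different" adj @amod #8->9
"<language>" "language" n sg @obj #9->5
"<.>" "." sent @punct #10->5
'''

# Hypothetical pattern for this example only:
# "<form>" "lemma" tags @relation #id->head
COHORT = re.compile(
    r'"<(?P<form>[^>]+)>"\s+"(?P<lemma>[^"]+)"\s+(?P<tags>[^@"]*?)\s*'
    r'@(?P<rel>\S+)\s+#(?P<id>\d+)->(?P<head>\d+)'
)


def read_tokens(text):
    """Return a list of (id, form, relation, head) tuples in sentence order."""
    return [
        (int(m["id"]), m["form"], m["rel"], int(m["head"]))
        for m in COHORT.finditer(text)
    ]


def render_tree(tokens, head_id=0, depth=0):
    """Return indented lines: every dependent of head_id, then its own dependents."""
    lines = []
    for token_id, form, rel, head in tokens:
        if head == head_id:
            lines.append("  " * depth + f"{form} ({rel})")
            lines.extend(render_tree(tokens, token_id, depth + 1))
    return lines


tokens = read_tokens(CG_SENTENCE)
print("\n".join(render_tree(tokens)))
```

Head 0 is the artificial root, so the output starts with "works (root)", with "Everyone", "on", "language", and "." indented one level below it.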
Training a model
UDPipe is a Free / Open Source tool (should be on the virtual machine) that can train a parser for (read: make a model for how to best guess) UD relations. It will use whatever information you give it (including lemmas, morphological tags, etc.), and can be trained to guess these as well.
The following command trains a UD parser on corpus.conllu and outputs the model to abc.model.udpipe:
cat corpus.conllu | udpipe --tagger=none --tokenizer=none --train abc.model.udpipe
Testing a model
The following command parses corpus.conllu using the previously trained model:
udpipe --parse abc.model.udpipe corpus.conllu
With the additional switch --accuracy, it will return measures of its accuracy:
udpipe --accuracy --parse abc.model.udpipe corpus.conllu
The two main metrics it returns are the following:
- Labeled Attachment Score (LAS): the percentage of tokens to which the parser assigned both the right dependency head and the right relation.
- Unlabeled Attachment Score (UAS): the percentage of tokens that the parser assigned the right dependency head to.
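To make the two definitions concrete, here is a minimal sketch that computes both scores from gold and predicted (head, relation) pairs. The function name `attachment_scores` and the "predicted" analysis are invented for illustration; UDPipe's own scoring may differ in details such as how it treats particular tokens.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    gold and predicted are equal-length lists of (head, relation) pairs,
    one pair per token, in the same token order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty corpus")
    # UAS only compares heads; LAS compares head and relation together.
    heads_right = sum(1 for (gh, _), (ph, _) in zip(gold, predicted) if gh == ph)
    both_right = sum(1 for g, p in zip(gold, predicted) if g == p)
    total = len(gold)
    return 100.0 * heads_right / total, 100.0 * both_right / total


# Gold annotation of "Everyone in this class works on a different language."
gold = [(5, "nsubj"), (4, "case"), (4, "det"), (1, "nmod"), (0, "root"),
        (5, "compound:prt"), (9, "det"), (9, "amod"), (5, "obj"), (5, "punct")]

# A made-up parser output: token 4 has the right head but the wrong label,
# token 6 has the wrong head and the wrong label.
predicted = list(gold)
predicted[3] = (1, "obl")
predicted[5] = (9, "case")

uas, las = attachment_scores(gold, predicted)
print(f"UAS: {uas:.1f}%  LAS: {las:.1f}%")
```

With one head error and two label errors out of ten tokens, this gives a UAS of 90% and a LAS of 80%. LAS can never be higher than UAS, since a token only counts for LAS if its head is already right.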
Removing morphological information
In conllu format, column 3 holds lemmas, column 5 holds (language-specific) POS tags, and column 6 holds other morphological tags. All values in these columns need to be blanked to _ to create a corpus with no morphological information.
The following command will output a corpus with columns 3, 5, and 6 replaced with _:
sed -r 's/[^\t]+/_/3 ; s/[^\t]+/_/5 ; s/[^\t]+/_/6' corpus.conllu
To output it to a new file, just redirect the output with >:
sed -r 's/[^\t]+/_/3 ; s/[^\t]+/_/5 ; s/[^\t]+/_/6' corpus.conllu > new.corpus.conllu
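If you would rather do this step in a script, the sed command above can be approximated in Python. This is a sketch under the same assumption as the sed command (tab-separated columns, with columns 3, 5, and 6 to be blanked). The function name `blank_morphology` and the sample token line are made up for illustration.

```python
def blank_morphology(line, columns=(3, 5, 6)):
    """Replace the given 1-based columns of a conllu token line with "_".

    Comment lines (starting with "#") and blank lines are returned unchanged.
    """
    if not line.strip() or line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    for column in columns:
        if column <= len(fields):
            fields[column - 1] = "_"
    return "\t".join(fields) + newline


# A made-up token line with 10 tab-separated columns.
sample = "\t".join(
    ["1", "Everyone", "everyone", "PRON", "prn", "Number=Sing", "5", "nsubj", "_", "_"]
) + "\n"

print(blank_morphology(sample), end="")
```

To process a whole corpus, apply the function to each line of corpus.conllu and write the results to the new file.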
Converting between conllu and CG formats
Try the scripts available in Fran's ud-scripts repo, in particular conllu-to-vislcg.py. (And let someone know and/or file a bug report if you notice any problems.)
To convert to conllu format from CG (where "prefix" is just an arbitrary code to prefix to sentence ids):
cat corpus.txt | vislcg3-to-conllu.py prefix > corpus.conllu
To convert to CG format from conllu:
cat corpus.conllu | conllu-to-vislcg.py > corpus.txt
The conversion may be lossy, so don't expect a 1:1 roundtrip.
- annotated examples from class
- Jonathan's UD visualiser tool.
- Fran's UDPipe notes.
- All 37 UD relations (and documentation and examples of each).
- Dep search — An interface for searching for dependency relations in existing UD corpora.
This assignment is due before the Tuesday class of the last week of classes (this semester, 11:20am on Tuesday, 25 April 2017).
Prepare your corpora
- Copy your abc.annotated.basic.txt corpus (from the previous assignment) to abc.annotated.ud.txt, and hand-disambiguate it. In other words, remove all but the correct analysis for each form (i.e., the one that's right in that sentence). Add, commit, and push this corpus.
- Create a second corpus of at least 500 characters (or 250 for syllabic writing systems), one sentence per line, and call it abc.annotated2.raw.txt. Run it through your tagger and convert to CG format as before. You can skip the "all and only all possible analyses" step that you did for precision and recall, and just correct it to have the correct analysis for the given sentence. Call this new file abc.annotated2.ud.txt, and add, commit, and push these two files.
Annotate your corpora for dependencies and post-process
- Add dependency annotation (heads and relations) to both corpora, abc.annotated.ud.txt and abc.annotated2.ud.txt. Commit and push.
- Convert these to conllu format, in files with .conllu extensions instead of .txt extensions. Commit and push.
- Remove all morphological information from abc.annotated.ud.conllu and call the result abc.annotated.nomorph.conllu. Add, commit, and push.
Train and evaluate a parser
- Train two parsers: one on abc.annotated.ud.conllu and one on abc.annotated.nomorph.conllu. Name the model trained on the latter abc.nomorph.udpipe, and name the other one analogously. Put them in the dev/ folder of your repository and commit them.
- Try parsing both of your dependency-annotated corpora (including abc.annotated2.ud.txt) with each of the models you made. Make note of the UAS and LAS scores.
- Make a page on the wiki called Language/Universal_Dependencies, add it to the category sp17_UD, and link to it as a resource from your main wiki page for the language.
- Add an "Evaluation" section, and make a grid that includes the UAS and LAS scores from your four parsing attempts above, as well as the number of forms (and sentences) in each corpus.
- Add a "Dependency relations" section, and make a subsection for each of five dependency relations that you used at least twice in your annotation. For each relation, provide:
- A description of the relation, noting various ways it might be used in the language,
- Two examples of the relation from your corpus, preferably illustrating what you described.
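For the corpus sizes requested in the "Evaluation" section, a small script can count sentences and forms in a conllu file. The sketch below is one possible approach, not a required method. The function name `corpus_size` is invented, and it assumes that a "form" is any line whose first column is a plain integer ID, so multiword-token ranges such as 1-2 are not counted separately.

```python
def corpus_size(conllu_text):
    """Return (sentences, forms) for a corpus given as a conllu string."""
    sentences = 0
    forms = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            continue  # comment line, e.g. a sentence id
        token_id = line.split("\t", 1)[0]
        if not token_id.isdigit():
            continue  # assumption: skip ranges like 1-2 and empty nodes like 1.1
        forms += 1
        if not in_sentence:
            sentences += 1
            in_sentence = True
    return sentences, forms


# A tiny made-up corpus: two sentences, three forms.
rows = [
    "# sent_id = demo:1",
    "\t".join(["1", "Everyone", "_", "_", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "works", "_", "_", "_", "_", "0", "root", "_", "_"]),
    "",
    "# sent_id = demo:2",
    "\t".join(["1", "Yes", "_", "_", "_", "_", "0", "root", "_", "_"]),
    "",
]
demo = "\n".join(rows)

print(corpus_size(demo))
```

To use it on your own data, read each .conllu file into a string and pass it to the function.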