Dependency syntax

Latest revision as of 09:20, 23 April 2019

Dependency syntax is a framework for syntax which considers relations between words to be "dependencies". It happens to be easy to deal with for various computational linguistics applications: any given token has a "head" (what it depends on) and a "relation" to that head. E.g., in "a house", a determiner/article depends on a noun (the noun is the head of the determiner), and the relation is one of det (depending on the annotation standard used).

Universal Dependencies (UD) is a project aimed at designing a standardised set of dependency syntax annotation principles for use across languages, with community-contributed language-specific annotation guidelines. It's become pretty popular, and is now in its 2nd version of annotation principles.


Guidance

Annotation of a single sentence

You start with a sentence:

Everyone in this class works on a different language.

You run it through your disambiguator and manually disambiguate as needed so that each word has only one analysis (in CG format):

"<Everyone>"
	"everyone" prn ind mf sg
"<in>"
	"in" pr
"<this>"
	"this" det dem sg
"<class>"
	"class" n sg
"<works>"
	"work" vblex pres p3 sg
"<on>"
	"on" adv
"<a>"
	"a" det ind sg
"<different>"
	"different" adj
"<language>"
	"language" n sg
"<.>"
	"." sent

Then you consider each word and what other word it depends on and how. E.g., "Everyone" has an nsubj dependency on "works" (i.e., it's its nominal subject). You then encode these relationships in the sentence:

"<Everyone>"
	"everyone" prn ind mf sg @nsubj #1->5
"<in>"
	"in" pr @case #2->4
"<this>"
	"this" det dem sg @det #3->4
"<class>"
	"class" n sg @nmod #4->1
"<works>"
	"work" vblex pres p3 sg @root #5->0
"<on>"
	"on" adv @compound:prt #6->5
"<a>"
	"a" det ind sg @det #7->9
"<different>"
	"different" adj @amod #8->9
"<language>"
	"language" n sg @obj #9->5
"<.>"
	"." sent @punct #10->5

When plotted as a tree, this sentence looks like the following:

Dependency tree of "Everyone in this class works on a different language."
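
The annotation step above can be sanity-checked with a short script. The following is a minimal sketch, not part of any course tooling: the function names read_cg_dependencies and check_tree are hypothetical, and it assumes every analysis line ends in "@relation #index->head" as in the example. It reads the dependencies out of CG text and checks that there is exactly one root, that every head exists, and that no token depends on itself through a cycle. The sample is a made-up three-word fragment, renumbered so it stands alone.

```python
import re

# Matches the trailing "@relation #index->head" part of a CG analysis line.
DEP_PATTERN = re.compile(r'@(\S+)\s+#(\d+)->(\d+)\s*$')

def read_cg_dependencies(cg_text):
    """Return (index, form, relation, head) tuples from annotated CG text."""
    tokens = []
    form = None
    for line in cg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"<') and stripped.endswith('>"'):
            form = stripped[2:-2]  # surface form, e.g. Everyone
        elif stripped and form is not None:
            match = DEP_PATTERN.search(stripped)
            if match:
                relation, index, head = match.groups()
                tokens.append((int(index), form, relation, int(head)))
    return tokens

def check_tree(tokens):
    """Return a list of problems; an empty list means no problem was found."""
    heads = {index: head for index, _, _, head in tokens}
    problems = []
    roots = [i for i, h in heads.items() if h == 0]
    if len(roots) != 1:
        problems.append(f'expected exactly one root, found {len(roots)}')
    for index, head in heads.items():
        if head != 0 and head not in heads:
            problems.append(f'token {index} points at missing head {head}')
    for start in heads:
        seen = set()
        current = start
        # Follow head links upward; meeting a token twice means a cycle.
        while current in heads:
            if current in seen:
                problems.append(f'cycle reachable from token {start}')
                break
            seen.add(current)
            current = heads[current]
    return problems

# Made-up fragment for illustration only (not the full sentence above).
sample = '''"<a>"
\t"a" det ind sg @det #1->3
"<different>"
\t"different" adj @amod #2->3
"<language>"
\t"language" n sg @root #3->0
'''
tokens = read_cg_dependencies(sample)
print(tokens)
print(check_tree(tokens))
```

An empty list from check_tree only means these three structural checks passed; it says nothing about whether the chosen relations are linguistically right.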

Training a model

UDPipe is a Free / Open Source tool (should be on the virtual machine) that can train a parser for (read: make a model for how to best guess) UD relations. It will use whatever information you give it (including lemmas, morphological tags, etc.), and can be trained to guess these as well.

The following command trains a UD parser on corpus.conllu and outputs the model to abc.model.udpipe:

cat corpus.conllu | udpipe --tagger=none --tokenizer=none --train abc.model.udpipe

Testing a model

The following command parses corpus.conllu using the previously trained abc.model.udpipe:

udpipe --parse abc.model.udpipe corpus.conllu

With the additional switch --accuracy, it will return measures of its accuracy:

udpipe --accuracy --parse abc.model.udpipe corpus.conllu

The two main metrics it returns are the following:

  • Labeled Attachment Score (LAS): the percentage of tokens that the parser assigned the right dependency head and relation to.
  • Unlabeled Attachment Score (UAS): the percentage of tokens that the parser assigned the right dependency head to.
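
To make the two definitions above concrete, here is a minimal sketch of how they could be computed by hand. The function attachment_scores is a hypothetical helper, not UDPipe's implementation, and UDPipe's own numbers may differ in details this sketch ignores. It assumes the gold and predicted analyses are given as equal-length lists of (head, relation) pairs, one per token; the data is invented.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    gold and predicted are equal-length lists of (head, relation) pairs,
    one pair per token, in the same token order.
    """
    if len(gold) != len(predicted):
        raise ValueError('gold and predicted must cover the same tokens')
    if not gold:
        raise ValueError('no tokens to score')
    # UAS: the head matches, whatever the relation.
    head_hits = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    # LAS: both the head and the relation match.
    full_hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * full_hits / total

# Invented four-token example: one wrong relation, one wrong head.
gold = [(2, 'det'), (0, 'root'), (2, 'obj'), (2, 'punct')]
predicted = [(2, 'det'), (0, 'root'), (2, 'nmod'), (3, 'punct')]
uas, las = attachment_scores(gold, predicted)
print(f'UAS: {uas:.2f}%  LAS: {las:.2f}%')  # UAS: 75.00%  LAS: 50.00%
```

Since a correct LAS hit requires a correct head, LAS can never be higher than UAS on the same data.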

Removing morphological information

In conllu format, column 3 holds lemmas (LEMMA), column 5 holds language-specific POS tags (XPOS), and column 6 holds other morphological tags (FEATS). All values in these columns need to be blanked to _ to create a corpus with no morphological information.

The following command will output a corpus with columns 3, 5, and 6 replaced with _:

sed -r 's/[^\t]+/_/3 ; s/[^\t]+/_/5 ; s/[^\t]+/_/6' corpus.conllu

To output it to a new file, just use >:

sed -r 's/[^\t]+/_/3 ; s/[^\t]+/_/5 ; s/[^\t]+/_/6' corpus.conllu > new.corpus.conllu
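
If sed is not available, or you want to be explicit about leaving comment lines alone, the same blanking can be sketched in Python. This is an illustrative alternative under stated assumptions, not a replacement the course prescribes: blank_morphology is a hypothetical name, and the sample row is invented (its tag values only imitate what a converted corpus might contain).

```python
def blank_morphology(conllu_text, columns=(3, 5, 6)):
    """Replace the given 1-based columns with "_" on every token line."""
    out = []
    for line in conllu_text.split('\n'):
        fields = line.split('\t')
        # Comment lines and blank sentence separators are passed through as-is.
        if line.startswith('#') or len(fields) < max(columns):
            out.append(line)
            continue
        for column in columns:
            fields[column - 1] = '_'
        out.append('\t'.join(fields))
    return '\n'.join(out)

# Invented sample: one comment line and one token line.
sample = '# text = Everyone\n1\tEveryone\teveryone\t_\tprn\tind|mf|sg\t5\tnsubj\t_\t_'
print(blank_morphology(sample))
```

To apply it to a file, read corpus.conllu into a string, pass it through blank_morphology, and write the result to the new file.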

Converting between conllu and CG formats

UD Annotatrix should convert reliably between conllu and CG format, but if not, you can follow this section.

Try the scripts available in Fran's ud-scripts repo, in particular vislcg3-to-conllu.py and conllu-to-vislcg.py. (And let someone know and/or file a bug report if you notice any problems.)

To convert to conllu format from CG (where "prefix" is just an arbitrary code to prefix to sentence ids):

cat corpus.txt | vislcg3-to-conllu.py prefix > corpus.conllu

To convert to CG format from conllu:

cat corpus.conllu | conllu-to-vislcg.py > corpus.txt

The conversion may be lossy so don't expect a 1:1 roundtrip.

Useful links

The assignment

This assignment is due before the Tuesday class of the last week of classes (this semester, 8:30 a.m. on Tuesday, 30 April 2019).

Prepare your corpora

  1. If you're doing a different language than you've already worked on, choose a language that does not have any UD corpora available. Ideally it's a language for which a morphological transducer is available. You'll need to be able to assemble a small corpus, figure out what the sentences mean, and generally identify the parts of speech of different words. You may work alone or in pairs.
  2. Copy (or create) your abc.annotated.basic.txt corpus (from the previous assignment) to abc.annotated.ud.txt, and hand-disambiguate it. In other words, remove all but the correct analysis for each form (i.e., the one that's right in that sentence). Add, commit, and push this corpus.
  3. Create a second corpus of at least 500 characters (or 250 for syllabic writing systems), one sentence per line, and call it abc.annotated2.raw.txt. Run it through your tagger and convert to CG format as before. You can skip the "all and only all possible analyses" step that you did for precision and recall, and just correct it to have the correct analysis for the given sentence. Call this new file abc.annotated2.ud.txt, and add, commit, and push these two files.

Annotate your corpora for dependencies and post-process

  1. Add dependency annotation (heads and relations) to both abc.annotated.ud.txt and abc.annotated2.ud.txt. Commit and push.
  2. Convert these to conllu format, in files with .conllu extensions instead of .txt extensions. Commit and push.
  3. Copy abc.annotated.ud.conllu to abc.annotated.nomorph.conllu, and remove all morphological information from the latter. Add, commit, and push.

Train and evaluate a parser

  1. Train two parsers: one on abc.annotated.ud.conllu and one on abc.annotated.nomorph.conllu. Call the models abc.withmorph.udpipe and abc.nomorph.udpipe, respectively. Put them in the dev/ folder of your repository and commit them.
  2. Try parsing both abc.annotated.ud.conllu and abc.annotated2.ud.conllu with each of the models you made. Make note of the UAS and LAS scores.

Documentation

  1. Make a page on the wiki called Language/Universal_Dependencies, add it to the category sp19_UD and link to it as a resource from your main wiki page for the language.
  2. Add an "Evaluation" section, and make a grid that includes the UAS and LAS scores from your four parsing attempts above, as well as the number of forms (and sentences) in each corpus.
  3. Add a "Dependency relations" section, and make a subsection for each of five dependency relations that you used at least twice in your annotation. For each relation, provide:
    • A description of the relation, noting various ways it might be used in the language,
    • Two examples of the relation from your corpus, preferably illustrating what you described.
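
For the number of forms and sentences requested in the "Evaluation" grid, a count can be taken directly from a conllu file. The following is a minimal sketch: count_forms_and_sentences is a hypothetical helper, and it assumes that "forms" means ordinary token lines, so it skips comment lines, multiword-token ranges (IDs like 1-2) and empty nodes (IDs like 1.1). The sample text is invented.

```python
def count_forms_and_sentences(conllu_text):
    """Return (forms, sentences) for a corpus in conllu format."""
    forms = 0
    sentences = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            in_sentence = False
            continue
        if line.startswith('#'):
            continue
        token_id = line.split('\t')[0]
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if '-' in token_id or '.' in token_id:
            continue
        forms += 1
        if not in_sentence:
            sentences += 1
            in_sentence = True
    return forms, sentences

# Invented two-sentence sample with only the first two columns filled in.
sample = (
    '# sent_id = abc:1\n'
    '1\tEveryone\t_\t_\t_\t_\t2\tnsubj\t_\t_\n'
    '2\tworks\t_\t_\t_\t_\t0\troot\t_\t_\n'
    '\n'
    '# sent_id = abc:2\n'
    '1\tYes\t_\t_\t_\t_\t0\troot\t_\t_\n'
)
print(count_forms_and_sentences(sample))  # (3, 2)
```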