Morphological analyser

Morphological transducers

A morphological transducer is just a directed graph. It consists of nodes (numbered below) and arcs (with labels), with a starting node (0 below) and an ending node (16 below).

You follow the arcs that are available from your input. The only acceptable paths are ones that start from the starting node and end at the ending node. You may match your input to either side of an arc's label (the two sides are separated by :), and the other side is returned as output.

In the transducer above, the left side is the form and the right side is the analysis. If you match your input to the left side (the form), then your output will be the right side (the analysis)—this is morphological analysis. Likewise, if you follow the transducer by matching your input to the right side (the analysis) and output the left side (the form), then you are performing morphological generation.

An example of a complete path is w:w o:o l:l v:f e:<n> s:<pl>. The left/form side of this spells wolves and the right/analysis side of this spells wolf<n><pl>. Mapping between one and the other is as simple as taking one as input and following the path—by outputting the other side of each arc, you will get the other as output!
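To make the path-following idea concrete, here is a minimal Python sketch. It encodes only the single path quoted above, not the whole transducer, and the function name follow and its input_side parameter are made up for this illustration.

```python
# Each arc is (form_symbol, analysis_symbol). This is just the one path
# w:w o:o l:l v:f e:<n> s:<pl> from the text, not the full transducer.
PATH = [("w", "w"), ("o", "o"), ("l", "l"), ("v", "f"), ("e", "<n>"), ("s", "<pl>")]


def follow(path, symbols, input_side):
    """Match `symbols` against one side of each arc and return the other side.

    input_side=0 matches the form side (analysis);
    input_side=1 matches the analysis side (generation).
    Returns None if the input does not fit the path.
    """
    if len(symbols) != len(path):
        return None
    output = []
    for arc, symbol in zip(path, symbols):
        if arc[input_side] != symbol:
            return None
        output.append(arc[1 - input_side])
    return "".join(output)


analysis = follow(PATH, list("wolves"), input_side=0)
form = follow(PATH, ["w", "o", "l", "f", "<n>", "<pl>"], input_side=1)
print(analysis)  # wolf<n><pl>
print(form)      # wolves
```

A real transducer has many paths sharing nodes, so a real lookup has to try every arc leaving the current node rather than walking one fixed list.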

Question: What are all the possible paths provided by this transducer?

The formalism we use (lexc)

Transducers are pretty cool, and quite efficient... for computers. Following paths by hand is tedious, and drawing a transducer for anything more complex than the example above is torture. See the transducer below for Tuvan.

This transducer provides the combinations of about 8 case markers, 5 possessive morphemes, and the plural marker for three Tuvan nouns.

An example is өг>{L}{A}р>{i}м>{D}{A}н mapping to өг<n><pl><px1sg><abl>, meaning "from my houses". The analysis side is clear to anyone familiar with tags (and knowing that "өг" means "house"). The form side is actually something that will get fixed by morphophonology, which we'll worry about later (for now: letters like {L} can be realised in a variety of ways, and > is used as a morpheme boundary); the actual orthographic form is өглеримден.

Question: How can we quantify the complexity of this graph?

Fortunately, we don't have to draw this graph by hand. We can simply define the various sections of it and link them together with a straightforward formalism called lexc. A section of a lexc file that corresponds (mostly) to the graph above looks like the following:

LEXICON CASES

%<nom%>: CLITICS-COPULA ;
%<gen%>:%>%{N%}%{I%}ң # ;
%<acc%>:%>%{N%}%{I%} # ;
%<dat%>:%>%{G%}%{A%} # ;
%<loc%>:%>%{D%}%{A%} CLITICS-COPULA ;
%<abl%>:%>%{D%}%{A%}н # ;
%<all%>:%>%{J%}е # ;
%<all%>:%>%{D%}%{I%}в%{A%} # ; ! Dir/LR

LEXICON POSSESSION

%<px1sg%>:%>%{i%}м CASES ;
%<px2sg%>:%>%{i%}ң CASES ;
%<px3sp%>:%>%{z%}%{I%}%{n%} CASES ;
%<px1pl%>:%>%{i%}в%{I%}с CASES ;
%<px2pl%>:%>%{i%}ң%{A%}р CASES ;

LEXICON N-INFL-COMMON

CASES ;
POSSESSION ;

LEXICON SUBST

N-INFL-COMMON ;

%<pl%>:%>%{L%}%{A%}р N-INFL-COMMON ;

LEXICON N1

%<n%>%<attr%>: # ;
%<n%>: SUBST ;

LEXICON Nouns

өг:өг N1 ; ! "yurt"
аът:аът N1 ; ! "horse"
ном:ном N1 ; ! "book"
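One way to get a feel for how such a file turns into paths is to expand it by hand in a few lines of Python. The sketch below is only an illustration under simplifying assumptions: it copies a reduced subset of the entries above (one noun, two cases, no POSSESSION, and SUBST pointing straight at CASES), and the dictionary layout and the expand function are invented for this example rather than anything lexc itself provides.

```python
# A reduced, hand-copied subset of the lexc above. Each entry is
# (analysis_side, form_side, continuation); "#" marks the end of a word.
# Escapes (%) are dropped because these are plain Python strings.
LEXICA = {
    "Nouns": [("өг", "өг", "N1")],
    "N1": [("<n><attr>", "", "#"), ("<n>", "", "SUBST")],
    "SUBST": [("", "", "CASES"), ("<pl>", ">{L}{A}р", "CASES")],
    "CASES": [("<gen>", ">{N}{I}ң", "#"), ("<abl>", ">{D}{A}н", "#")],
}


def expand(lexicon, analysis="", form=""):
    """Yield every (analysis, form) pair reachable from `lexicon`."""
    if lexicon == "#":
        yield analysis, form
        return
    for a, f, continuation in LEXICA[lexicon]:
        yield from expand(continuation, analysis + a, form + f)


pairs = list(expand("Nouns"))
for analysis, form in pairs:
    print(analysis, form, sep="\t")
```

With this subset the loop prints five pairs, including өг<n><pl><abl> paired with өг>{L}{A}р>{D}{A}н. Adding more entries to a lexicon multiplies the number of paths, which is why writing lexica is so much easier than drawing the graph.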

Questions

• What is % doing?
• What is ! doing?
• What is : doing?
• How are the continuation lexica (LEXICONs) connected?
• What is ; doing?
• What is # doing?
• What is mentioned in this code that isn't in the graph above?
• What is not mentioned in this code that is in the graph above?
• Can you match sections of the graph to sections of the code?

Phonology

The symbols like {L} above will need to be realised as different characters in different contexts.

For any symbols in your language that will be realised in different ways in different environments, you'll want to set up such an "archiphoneme". Use an uppercase letter for something that just has different forms, and use a lowercase letter for something that is inserted or deleted (i.e., is sometimes realised as nothing).

For now, it will suffice to define all the ways in which each archiphoneme surfaces by making a list in your twol file. This essentially allows all of the options to surface, which means you will be able to analyse incorrect forms as well as correct ones. Later, when you make a generator, you'll write rules to constrain where each of the symbols can occur.

Defining symbols

Don't forget to define all your symbols (archiphonemes like {L}, and tags like <pl>) in the lexc file! And define your archiphoneme symbols in the twol file, each with all its possible outputs.

So your twol file should contain an Alphabet section, which lists all the characters of the alphabet, and then all the archiphonemes with all their realisations. You will also want the > morpheme separator and some punctuation marks, all escaped. A condensed example for Tuvan follows:

Alphabet

А Б В Г Д Е Ё Ж З И Й К Л М Н Ң О Ө П Р С Т У Ү Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
а б в г д е ё ж з и й к л м н ң о ө п р с т у ү ф х ц ч ш щ ъ ы ь э ю я

%{A%}:а %{A%}:е
%{L%}:л %{L%}:н
%{i%}:0 %{i%}:ы %{i%}:и %{i%}:у %{i%}:ү

%.
%-

%>:0
;
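To see what "allowing all of the options to surface" means in practice, the following Python sketch enumerates every surface string that the unconstrained pairs in the example Alphabet permit for one lexical form. It is a rough illustration, not how twol works internally: the REALISATIONS table just copies the three archiphonemes and the > separator from the example, and the function name surface_forms is hypothetical.

```python
import re
from itertools import product

# Realisations copied from the condensed Alphabet example; "" stands for 0.
REALISATIONS = {
    "{A}": ["а", "е"],
    "{L}": ["л", "н"],
    "{i}": ["", "ы", "и", "у", "ү"],
    ">": [""],
}


def surface_forms(lexical):
    """Return every surface string the unconstrained pairs allow."""
    # Split into archiphonemes like {L} and single characters.
    symbols = re.findall(r"\{.\}|.", lexical)
    # Symbols without an entry (ordinary letters) surface as themselves.
    options = [REALISATIONS.get(s, [s]) for s in symbols]
    return {"".join(choice) for choice in product(*options)}


forms = surface_forms("өг>{L}{A}р>{i}м")
print(len(forms))                                   # 20
print("өглерим" in forms, "өгнарум" in forms)       # True True
```

Of the 20 strings, өглерим is the one that matches the beginning of өглеримден. The other 19, such as өгнарум, are over-generated, and ruling them out is the job of the rules you write later.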

Starting point

You'll need a Root lexicon in your lexc file. Bootstrapping a new language module per the instructions will create this for you, but don't forget that it's a thing!

Morphology that isn't suffixes

You may have noticed that analyses are generally in the form stem + POS tags + subcategory tags + function tags. What if some of your functional morphology occurs before the stem?

You can certainly implement that in lexc, but there's a problem: your tags will occur in the middle of the analysis. So instead of something like do<v><tv><rep><prc> ↔ redoing, you'd get something like <rep>do<v><tv><prc> ↔ redoing. This is undesirable.

Currently, the best way to handle this is documented in two places on the Apertium wiki: apertium:Replacement for flag diacritics and apertium:Morphotactic constraints with twol. You will also need to modify your Makefile to look more like the Makefile for Chukchi in terms of the twoc stuff, replacing ckt with the code for your language. You will then have to reconfigure your module (./autogen.sh) before recompiling (make).

Evaluation

Individual forms

To test whether/how your analyser is analysing a form, you can run the following:

echo "form" | apertium -d /path/to/analyser/ xyz-morph

An example might be the following:

apertium-tyv$ echo өглеримден | apertium -d . tyv-morph
^өглеримден/өг<n><pl><px1sg><abl>$^./.<sent>$

This output means that for the form өглеримден there is one analysis: өг<n><pl><px1sg><abl>. A form with multiple analyses would have them separated by /, like the following:

^өг/өг<n><nom>/өг<n><attr>/өг<n><nom>+э<cop><aor><p3><sg>$^./.<sent>$

A form with no analyses in the transducer will just return the form with an * before it, like the following:

^өглеримнен/*өглеримнен$^./.<sent>$
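If you want to inspect this output programmatically, a small Python sketch like the one below can split it into forms and analyses. It assumes each unit is written as ^form/analysis/...$ as in the examples, it ignores any escaping of special characters inside units, and the function name parse_stream is made up for this illustration.

```python
import re


def parse_stream(output):
    """Split analyser output into (form, analyses) pairs.

    Each unit looks like ^form/analysis1/analysis2$. An analysis that
    starts with * marks a form the transducer does not know, so it is
    dropped and the form ends up with an empty list.
    """
    units = []
    for unit in re.findall(r"\^(.*?)\$", output):
        form, *analyses = unit.split("/")
        known = [a for a in analyses if not a.startswith("*")]
        units.append((form, known))
    return units


sample = (
    "^өг/өг<n><nom>/өг<n><attr>/өг<n><nom>+э<cop><aor><p3><sg>$"
    "^өглеримнен/*өглеримнен$^./.<sent>$"
)
result = parse_stream(sample)
for form, analyses in result:
    print(form, analyses)
```

Here өг comes back with three analyses, өглеримнен with none, and the full stop with one.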

A long list of forms with known analyses

To test whether your analyser is analysing forms correctly, you can put your analyses into a yaml file and use morph-test or aq-morftest:

morph-test -csi xyz.yaml | most

or

aq-morftest -csi xyz.yaml | most

Coverage over a corpus

To test coverage over a corpus, you can use aq-covtest:

aq-covtest xyz.corpus.basic.txt /path/to/xyz.automorf.bin
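The idea behind a coverage number can be sketched in a few lines of Python. This is not aq-covtest's implementation, and the tool may count tokens differently (for example, in how it treats punctuation). The sketch simply assumes analyser output in the ^form/analysis$ shape shown earlier, counts a token as unknown when its first analysis starts with *, and uses a hypothetical function name, coverage.

```python
import re
from collections import Counter


def coverage(analysed):
    """Return (share of tokens with an analysis, Counter of unknown forms)."""
    total = 0
    unknown = Counter()
    for unit in re.findall(r"\^(.*?)\$", analysed):
        form, *analyses = unit.split("/")
        total += 1
        if not analyses or analyses[0].startswith("*"):
            unknown[form] += 1
    known = total - sum(unknown.values())
    share = known / total if total else 0.0
    return share, unknown


sample = (
    "^өглеримден/өг<n><pl><px1sg><abl>$"
    "^өглеримнен/*өглеримнен$^./.<sent>$"
)
share, unknown = coverage(sample)
print(f"{share:.1%}")            # 66.7%
print(unknown.most_common(10))   # [('өглеримнен', 1)]
```

The list of most frequent unknown forms is usually the most useful part, since it tells you which stems or suffixes to add next.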

The assignment

This assignment will be due on Thursday of the 5th week of class before class starts (this semester: 11:20am on Thursday, February 16th, 2017).

This assignment is to develop a morphological analyser that implements a good deal of the basic morphology of your language.

Getting set up:

1. Bootstrap a transducer for your language.
2. Initialise the module (./autogen.sh), and compile it (make).
• If this is successful, you should have several "modes" available; run apertium -d . -l to see.
• One mode should be an xyz-morph mode; this is your analyser. Check it by running echo "houses" | apertium -d . xyz-morph, which should give you a morphological analysis of the word "houses".
3. Add all of the tags you came up with during the Grammar documentation assignment to the Multichar_Symbols section of the apertium-xyz.xyz.lexc file. Provide a symbol, and a brief comment explaining what the symbol means.
4. Add all the characters of your language's orthography to the Alphabet section of the apertium-xyz.xyz.twol file. You may need to add archiphonemes later.
5. Use the morphTests2yaml script to create a yaml test file in a subdirectory called tests. Commit this file to the git repo. (You can remove blank sections if you like, and if they appear in the file.) There should be at least 50 tests in this file—make sure you have enough.

The hard stuff:

1. Build your morphological transducer, adding all of the stems from your Grammar documentation assignment, categorised correctly, so that at least half of your tests pass. You'll need to build up the morphotactics too.
2. Create a page on the wiki Language/Transducer that links to the code and has Evaluation and Notes sections.
• In the evaluation section, put the current coverage of your combined corpus and the top unknown words, using aq-covtest.
• In the notes section, say what tests still don't work and why.

Housekeeping:

1. Add yourself to the AUTHORS file.
2. Make sure the COPYING file contains an open-source license to your liking (default should be GPL3).
3. Add links to the transducer repo and wiki page to the list of resources you developed for your language on the language's page on this wiki.

Sanity checks before submitting:

1. Did you commit just the initial files created by bootstrap before you initialised or compiled the module? If not, start over with bootstrapping, being sure to copy over any files you've changed.
2. Did you commit your updates to lexc and twol files? And the yaml test file?
3. Do you have at least 50 tests in the tests file? Do at least half of them pass when run through morph-test?
4. If you have trouble analysing or compiling, are all your tags and symbols defined in both lexc and twol files?