Difference between revisions of "Morphological analyser"

From LING073
(Morphology that isn't suffixes)
(Evaluation)
 
(41 intermediate revisions by 3 users not shown)
Line 13: Line 13:
 
'''Question''': What are all the possible paths provided by this transducer?
 
'''Question''': What are all the possible paths provided by this transducer?
  
== The formalism we use (lexc) ==
+
== The formalism we use (lexd) ==
  
 
Transducers are pretty cool, and quite efficient... for computers.  Following paths by hand is tedious, and drawing a transducer for anything more complex than the example above is torture.  See the transducer below for Tuvan.
 
Transducers are pretty cool, and quite efficient... for computers.  Following paths by hand is tedious, and drawing a transducer for anything more complex than the example above is torture.  See the transducer below for Tuvan.
Line 25: Line 25:
 
'''Question''': How can we quantify the complexity of this graph? <!-- number of nodes/arcs, number of possible paths, ... -->
 
'''Question''': How can we quantify the complexity of this graph? <!-- number of nodes/arcs, number of possible paths, ... -->
  
Fortunately, we don't have to draw this graph by hand.  We can simply define the various sections of it and link them together with a straightforward formalism called '''lexc'''.  A section of a lexc file that corresponds (mostly) to the graph above looks like the following:
+
Fortunately, we don't have to draw this graph by hand.  We can simply define the various sections of it and link them together with a straightforward formalism called '''lexd'''.  A section of a lexd file that corresponds to the graph above looks like the following:
  
 
<pre>
 
<pre>
LEXICON CASES
+
PATTERNS
  
%<nom%>: CLITICS-COPULA ;
+
N-Stems [ <n>: ] [ <pl>:>{L}{A}р ]? Possession? Cases
%<gen%>:%>%{N%}{I%}ң # ;
+
%<acc%>:%>%{N%}%{I%} # ;
+
%<dat%>:%>%{G%}%{A%} # ;
+
%<loc%>:%>%{D%}%{A%} CLITICS-COPULA ;
+
%<abl%>:%>%{D%}%{A%}н # ;
+
%<all%>:%>%{J%}е # ;
+
%<all%>:%>%{D%}%{I%}в%{A%} # ; ! Dir/LR
+
  
  
LEXICON POSSESSION
+
LEXICON N-Stems
  
%<px1sg%>:%>%{i%}м CASES ;
+
өг:өг    # "yurt"
%<px2sg%>:%>%{i%}ң CASES ;
+
аът:аът  # "horse"
%<px3sp%>:%>%{z%}%{I%}%{n%} CASES ;
+
ном:ном  # "book"
%<px1pl%>:%>%{i%}в%{I%}с CASES ;
+
%<px2pl%>:%>%{i%}ң%{A%}р CASES ;
+
  
 +
LEXICON Possession
  
LEXICON N-INFL-COMMON
+
<px1sg>:>{i}м
 +
<px2sg>:>{i}ң
 +
<px3sp>:>{z}{I}{n}
 +
<px1pl>:>{i}в{I}с
 +
<px2pl>:>{i}ң{A}р
  
CASES ;
+
LEXICON Cases
POSSESSION ;
+
  
 
+
<nom>:
LEXICON SUBST
+
<gen>:>{N}{I}ң
 
+
<acc>:>{N}{I}
N-INFL-COMMON ;
+
<dat>:>{G}{A}
%<pl%>:%>%{L%}%{A%}р N-INFL-COMMON ;
+
<loc>:>{D}{A}
 
+
<abl>:>{D}{A}н
 
+
<all>:>{J}е
LEXICON N1
+
<all>:>{D}{I}в{A}  # Dir/LR
 
+
%<n%>%<attr%>: # ;
+
%<n%>: SUBST ;
+
 
+
 
+
LEXICON Nouns
+
 
+
өг:өг N1 ; ! "yurt"
+
аът:аът N1 ; ! "horse"
+
ном:ном N1 ; ! "book"
+
 
</pre>
 
</pre>
  
 
'''Questions'''
 
'''Questions'''
* What is <code>%</code> doing? <!-- escape character -->
+
* What is <code>#</code> doing? <!-- sets off comments -->
* What is <code>!</code> doing? <!-- sets off comments -->
+
 
* What is <code>:</code> doing? <!-- separates form from analysis -->
 
* What is <code>:</code> doing? <!-- separates form from analysis -->
* How are the continuation lexica (LEXICONs) connected? <!-- with reference at end of line to next lexicon -->
+
* What are the <code>?</code>s doing? <!-- says the thing before is optional -->
* What is <code>;</code> doing? <!-- end of line -->
+
* How are the lists of parts (LEXICONs) combined into a whole? <!-- with reference at end of line to next lexicon -->
* What is <code>#</code> doing? <!-- end of path -->
+
<!-- * What is mentioned in this code that isn't in the graph above? --><!-- CLITICS-COPULA -->
* What is mentioned in this code that isn't in the graph above? <!-- CLITICS-COPULA -->
+
 
* What is not mentioned in this code that is in the graph above? <!-- a starting place -->
 
* What is not mentioned in this code that is in the graph above? <!-- a starting place -->
* Can you match sections of the graph to sections of the code? <!-- LEXICON N1 = node 5, etc. -->
+
* Can you match sections of the graph to sections of the code? <!-- LEXICON N-Stems = node 5, etc. -->
  
 
== Additional nuances ==
 
== Additional nuances ==
Line 96: Line 79:
 
=== Defining symbols ===
 
=== Defining symbols ===
  
Don't forget to define all your symbols (archiphonemes like {{archiphon|L}}, and tags like {{tag|pl}}) in the lexc file!  And define your archiphoneme symbols in the twol file, each with all its possible outputs.
+
You'll want to define your archiphoneme symbols in the twol file, each with all its possible outputs.
  
 
So your twol file should contain an <code>Alphabet</code> section, which lists all the characters of the alphabet, and then all the archiphonemes with all their realisations.  You will also want the <code>&gt;</code> morpheme separator and some punctuation marks, all escaped.  A condensed example for Tuvan follows:
 
So your twol file should contain an <code>Alphabet</code> section, which lists all the characters of the alphabet, and then all the archiphonemes with all their realisations.  You will also want the <code>&gt;</code> morpheme separator and some punctuation marks, all escaped.  A condensed example for Tuvan follows:
Line 117: Line 100:
  
 
</pre>
 
</pre>
 
  
 
=== Starting point ===
 
=== Starting point ===
  
You'll need a <code>Root</code> lexicon in your lexc file.  Bootstrapping a new language module per the instructions will create this for you, but don't forget that it's a thing!
+
You'll need at least one <code>PATTERNS</code> section in your lexd file.  Bootstrapping a new language module per the instructions will create this for you, but don't forget that it's a thing!
  
 
=== Morphology that isn't suffixes ===
 
=== Morphology that isn't suffixes ===
  
You may have noticed that analyses are generally in the form '''stem''' + '''POS tags''' + '''subcategory tags''' + '''function tags'''.  What if some of your functional morphology occurs before the stem?
+
You may remember from last week that analyses are generally in the form '''stem''' + '''POS tags''' + '''subcategory tags''' + '''function tags'''.  What if some of your functional morphology occurs before the stem?
  
You can certainly implement that in lexc, but there's a problem: your tags will occur in the middle of the analysis.  So instead of something like {{morphTest|do{{tag|v}}{{tag|tv}}{{tag|rep}}{{tag|prc}}|redoing}}, you'd get something like {{morphTest|{{tag|rep}}do{{tag|v}}{{tag|tv}}{{tag|prc}}|redoing}}.  This is undesirable.
+
You can certainly implement that as shown above, but there's a problem: your tags will occur in the middle of the analysis.  That is, instead of something like {{morphTest|do{{tag|v}}{{tag|tv}}{{tag|rep}}{{tag|prog}}|redoing}}, you'd get something like {{morphTest|{{tag|rep}}do{{tag|v}}{{tag|tv}}{{tag|prog}}|redoing}}.  This is undesirable, because it makes it harder to keep track of which of your tags are for what.
  
Currently, the best way to handle this is documented in two places on the Apertium wiki: [[:apertium:Replacement for flag diacritics]] and [[:apertium:Morphotactic constraints with twol]].  The rules go in a new file, <code>apertium-xyz.xyz.twoc</code>. You will also need to modify your <code>Makefile.am</code> to look more like the [http://svn.code.sf.net/p/apertium/svn/incubator/apertium-ckt/Makefile.am Makefile for Chukchi] in terms of the <code>twoc</code> stuff (replacing <code>ckt</code> with the code for your language).  You will then have to reconfigure your module (<code>./autogen.sh</code>) before recompiling (<code>make</code>).
+
Fortunately lexd offers a trick for handling such things! You can list the left and right side of a given LEXICON in different parts of a PATTERN, and the pieces will be matched.  The following is a simple example (which assumes that the "re" prefix in English should be treated as productive and inflectional, which it probably shouldn't):
 
+
'''A slightly more generalised version of this solution''': The <code>twoc</code> file should include all your tags (both in angle brackets and square brackets) in the alphabet.  Then add a set named e.g. <code>Features</code> with all square-bracket tags.  You can then add a <code>Rule</code> that just '''removes any path without features that match'''.  I.e., you only get the forms that have both a plus and minus version of a given feature.  A short example is provided below:
+
  
 
<pre>
 
<pre>
Alphabet
+
PATTERNS
  
  %[%-nt%]:0 %[%-m%]:0 %[%-f%]:0 %[%-pl%]:0
+
Verbs
  %[%+nt%]:0 %[%+m%]:0 %[%+f%]:0 %[%+pl%]:0
+
  
;
+
PATTERN Verbs
  
Rules
+
V-Base V-Tenses
  
"Remove paths without matching suffix feature"
+
PATTERN V-Base
Fx:0 /<= _ ;
+
  except
+
      _ :* Fy:0 ;
+
  where Fx in ( %[%-nt%] %[%-m%] %[%-f%] %[%-pl%] )
+
        Fy in ( %[%+nt%] %[%+m%] %[%+f%] %[%+pl%] )
+
  matched ;
+
</pre>
+
  
If you can ever have forms with an odd number of feature tags output from lexc (e.g., a path where there's only a <code>%[%+m%]</code> form with no <code>%-</code> feature of any sort before it), you'll need another rule to get rid of those paths too, something like a reverse of the above rule.
+
V-Stems(1) [<v>:] V-Stems(2):
<pre>
+
:V-Prefixes V-Stems(1) [<v>:] V-Stems(2): V-Prefixes:
"Remove paths without matching prefix feature"
+
Fx:0 /<= _ ;
+
  except
+
      Fy:0 :* _ ;
+
  where Fy in ( %[%-m%] )
+
        Fx in ( %[%+m%] )
+
matched ;
+
</pre>
+
  
A matching <code>lexc</code> file, using gender circumfixes in Avar, might look like this:
+
LEXICON V-Prefixes
  
<pre>
+
<rep>:re>
Multichar_Symbols
+
<rev>:un>
  
%<aor%>
+
LEXICON V-Tenses
%<nt%>
+
%<m%>
+
%<f%>
+
%<pl%>
+
! etc.
+
%[%+nt%]
+
%[%+m%]
+
%[%+f%]
+
%[%+pl%]
+
  
LEXICON Root
+
<prog>:>ing
 +
<past>:>{e}d
  
Prefixes ;
+
LEXICON V-Stems(2)
  
 +
do:do        <tv>
 +
write:write  <tv>
 +
</pre>
 +
 +
'''Questions:'''
 +
* What are the forms output by this transducer?
 +
* (How many forms are there?)
 +
* What are the numbers in <code>()</code>?
 +
* What part of the file adds the prefixes and what part of the file adds the prefix tags?
 +
* What are the different <code>PATTERN</code> groups for?
 +
 +
==== Another example ====
 +
Here's a better example, showing (productive, inflectional) gender agreement on verbs in Avar.
 +
 +
<pre>
 +
PATTERNS
 +
 +
Verbs
  
LEXICON AOR
+
PATTERN Verbs
  
%<aor%>%<nt%>%[%+nt%]:уна # ;
+
:V-Gender V-Stems(1) [<v>:] V-Stems(2): V-Tense V-Gender:
%<aor%>%<m%>%[%+m%]:уна # ;
+
%<aor%>%<f%>%[%+f%]:уна # ;
+
%<aor%>%<pl%>%[%+pl%]:уна # ;
+
  
 +
LEXICON V-Tense
  
LEXICON Prefixes
+
<aor>:>уна
  
%[%-nt%]:б%> Verbs ;
+
LEXICON V-Gender
%[%-m%]:в%> Verbs ;
+
%[%-f%]:й%> Verbs ;
+
%[%-pl%]:р%> Verbs ;
+
  
 +
<nt>:б>
 +
<m>:в>
 +
<f>:й>
 +
<pl>:р>
  
LEXICON Verbs
+
LEXICON V-Stems(2)
  
бицине%<v%>%<tv%>:иц AOR ; ! "говорить"
+
бицине:иц    <tv> # "говорить"
 
</pre>
 
</pre>
  
Line 218: Line 191:
 
See [[Morphological analyser/Exercises]].
 
See [[Morphological analyser/Exercises]].
  
The work we did on this in class is available on Swarthmore's github at [https://github.swarthmore.edu/Ling073-sp17/ling073-eng ling073-sp17/ling073-eng].
+
The work we did on this in class is available on Swarthmore's github at [https://github.swarthmore.edu/Ling073-sp21/ling073-eng ling073-sp21/ling073-eng].
  
 
== Evaluation ==
 
== Evaluation ==
Line 227: Line 200:
 
  apertium-tyv$ echo өглеримден | apertium -d . tyv-morph
 
  apertium-tyv$ echo өглеримден | apertium -d . tyv-morph
 
  ^өглеримден/өг<n><pl><px1sg><abl>$^./.<sent>$
 
  ^өглеримден/өг<n><pl><px1sg><abl>$^./.<sent>$
 +
You can also do it this way:
 +
apertium-tyv$ echo өглеримден | hfst-proc tyv.automorf.hfst
 
This output means that for the form <code>өглеримден</code> there is one analysis: <code>өг{{tag|n}}{{tag|pl}}{{tag|px1sg}}{{tag|abl}}</code>.  A form with multiple analyses would have them separated by <code>/</code>, like the following:
 
This output means that for the form <code>өглеримден</code> there is one analysis: <code>өг{{tag|n}}{{tag|pl}}{{tag|px1sg}}{{tag|abl}}</code>.  A form with multiple analyses would have them separated by <code>/</code>, like the following:
 
  ^өг/өг<n><nom>/өг<n><attr>/өг<n><nom>+э<cop><aor><p3><sg>$^./.<sent>$
 
  ^өг/өг<n><nom>/өг<n><attr>/өг<n><nom>+э<cop><aor><p3><sg>$^./.<sent>$
 
A form with no analyses in the transducer will just return the form with an <code>*</code> before it, like the following:
 
A form with no analyses in the transducer will just return the form with an <code>*</code> before it, like the following:
 
  ^өглеримнен/*өглеримнен$^./.<sent>$
 
  ^өглеримнен/*өглеримнен$^./.<sent>$
 +
 +
=== See full contents of analyser ===
 +
 +
hfst-expand xyz.automorf.hfst
  
 
=== A long list of forms with known analyses ===
 
=== A long list of forms with known analyses ===
Line 239: Line 218:
  
 
=== Coverage over a corpus ===
 
=== Coverage over a corpus ===
To test coverage over a corpus, you can use <code>[[apertium-quality|aq-covtest]]</code>:
+
To test coverage over a corpus, you can use <code>coverage-hfst</code> or <code>[[apertium-quality|aq-covtest]]</code>:
 +
coverage-hfst xyz.corpus.basic.txt /path/to/xyz.automorf.hfst
 +
 
 
  aq-covtest xyz.corpus.basic.txt /path/to/xyz.automorf.bin
 
  aq-covtest xyz.corpus.basic.txt /path/to/xyz.automorf.bin
 +
 +
=== Generating forms ===
 +
If you need to test how a form generates, you can do something like the following:
 +
echo "^house<n><pl>$" | apertium -d . -f none xyz-gener
 +
This will return all forms currently being generated, e.g. <code>houses/housees</code>
 +
 +
=== Counting lexicon entries ===
 +
You can run the following command to list lexicons and counts in them all:
 +
lexd -x apertium-xyz.xyz.lexd > /dev/null
 +
 +
Note that a number of these lexicons are probably related to your morphology.  If you're counting "lexical entries", you should probably exclude the morphology ones.
  
 
== The assignment ==
 
== The assignment ==
This assignment will be due on Thursday of the 5th week of class before class starts (this semester: '''11:20am on Thursday, February 16th, 2017''').
+
This assignment will be due at the end of the day Friday of the 5th week of class (this semester: '''23:59 on Friday, March 12th, 2021''').
  
 
This assignment is to develop a morphological analyser that implements a good deal of the basic morphology of your language.
 
This assignment is to develop a morphological analyser that implements a good deal of the basic morphology of your language.
  
 
=== Getting set up ===
 
=== Getting set up ===
# [[Bootstrapping a transducer|Bootstrap a transducer]] for your language.
+
# Bootstrap a transducer for your language using <code>apertium-init</code> (installed on the lab machines):
# Initialise the module (<code>./autogen.sh</code>), and compile it (<code>make</code>).
+
#: <code>apertium-init -a lexd --with-spellrelax --prefix=ling073 xyz</code>
 +
# Go into the new directory (<code>cd ling073-xyz</code>), initialise the module (<code>./autogen.sh</code>), and compile it (<code>make</code>).
 
#* If this is successful, you should have several "modes" available; run <code>apertium -d . -l</code> to see.
 
#* If this is successful, you should have several "modes" available; run <code>apertium -d . -l</code> to see.
 
#* One mode should be an <code>xyz-morph</code> mode; this is your analyser.  Check it by running <code>echo "houses" | apertium -d . xyz-morph</code> , which should give you a morphological analysis of the word "houses".
 
#* One mode should be an <code>xyz-morph</code> mode; this is your analyser.  Check it by running <code>echo "houses" | apertium -d . xyz-morph</code> , which should give you a morphological analysis of the word "houses".
 
# {{highlight|Integrate any comments I've provided to you on your grammar documentation page so that all of your morphTests are in good order.  See the sanity checks at [[Grammar documentation#Sanity checks]] to check the main things.}}
 
# {{highlight|Integrate any comments I've provided to you on your grammar documentation page so that all of your morphTests are in good order.  See the sanity checks at [[Grammar documentation#Sanity checks]] to check the main things.}}
# Add ''all'' of the tags you came up with during the [[Grammar documentation]] assignment to the <code>Multichar_Symbols</code> section of the <code>apertium-xyz.xyz.lexc</code> file.  Provide a symbol, and a brief comment explaining what the symbol means.
+
# Augment the commented section at the top of the <code>apertium-xyz.xyz.lexd</code> file with any tags you came up with during the [[Grammar documentation]] assignment that aren't there already.  Provide a symbol, and a brief comment explaining what the symbol means.
 
# Add ''all'' the characters of your language's orthography to the <code>Alphabet</code> section of the <code>apertium-xyz.xyz.twol</code> file.  You may need to add archiphonemes later.
 
# Add ''all'' the characters of your language's orthography to the <code>Alphabet</code> section of the <code>apertium-xyz.xyz.twol</code> file.  You may need to add archiphonemes later.
# Use the [[morphTests2yaml]] script to create a yaml test file in a subdirectory called tests.  Commit this file to the git repo.  (You can remove blank sections if you like, and if they appear in the file.)  There should be at least 50 tests in this file—make sure you have enough.
+
# Use the <code>[[morphTests2yaml]]</code> script (installed on the lab machines) to create a yaml test file in a subdirectory called <code>tests/</code>.  Commit this file to the git repo.  (You can remove blank sections if you like, and if they appear in the file.)  There should be at least 50 tests in this file—make sure you have enough.
 +
# Create a new ''empty'' repo (that is, don't check the README option) in [https://github.swarthmore.edu/Ling073-sp21/ the course's GitHub organisation] with the name <code>ling073-xyz</code> (with your language's code in place of <code>xyz</code>).  Then add the SSH link as a remote origin in your initialised module and push the module to the GitHub repo:
 +
#: <code>git remote add origin git@github.swarthmore.edu:Ling073-sp21/ling073-xyz.git</code>
 +
#: <code>git push --set-upstream origin master</code>
  
 
=== The hard stuff ===
 
=== The hard stuff ===
# Build your morphological transducer, adding all of the stems from your Grammar documentation assignment, categorised correctly, so that '''at least half of your tests pass'''.  You'll need to build up the morphotactics too.
+
# '''Build your morphological transducer''', adding all of the stems from your Grammar documentation assignment, categorised correctly, so that '''at least half of your tests pass'''.  You'll need to build up the morphotactics too.
 
#* If too many of your grammar points are too hard to implement at this point (e.g., require some rules to change some characters to other characters), then you can skip one or two of them and instead add more "easy" forms to your transducer.
 
#* If too many of your grammar points are too hard to implement at this point (e.g., require some rules to change some characters to other characters), then you can skip one or two of them and instead add more "easy" forms to your transducer.
# Create a page on the wiki <code>Language/Transducer</code> that links to the code and has Evaluation and Notes sections.
+
#* Alternatively "hard-code" some forms, but add a comment in the lexd file near the relevant forms indicating that they need further work, and mention it on the wiki page (see below).
 +
#* Also don't forget to clean up your grammar page and rerun the scraping script to get a fresh version of the tests file.  If you're happy with your grammar page and don't expect to change it much, feel free to "clean up" your tests file manually.
 +
# '''Create a page on the wiki''' <code>Language/Transducer</code> that links to the code and has Evaluation and Notes sections.
 
#* In the Notes section, say what tests still don't work and why.
 
#* In the Notes section, say what tests still don't work and why.
#* Add the page to the category [[:Category:Sp17_Transducers]].
+
#* Add the page to the category [[:Category:Sp21_Transducers]].
  
 
=== Evaluation ===
 
=== Evaluation ===
Line 268: Line 266:
  
 
Evaluate coverage on your corpus and add one of the most frequent unanalysed words:
 
Evaluate coverage on your corpus and add one of the most frequent unanalysed words:
# Use <code>[[apertium-quality|aq-covtest]]</code> to see how many forms in your basic corpus are analysed, and what the top unknown forms are.
+
# Use <code>coverage-hfst</code> or <code>[[apertium-quality|aq-covtest]]</code> (as above) to see how many forms in your basic corpus are analysed, and what the top unknown forms are.
 
#* Make note of the coverage at this point
 
#* Make note of the coverage at this point
 
# Make a new <code>yaml</code> file in your tests directory with the top unanalysed words, and name it something like <code>commonwords.yaml</code>.  For the analysis side, just put an {{tag|unk}} tag (for "unknown") after each form.  Don't forget to commit this to your git repository.
 
# Make a new <code>yaml</code> file in your tests directory with the top unanalysed words, and name it something like <code>commonwords.yaml</code>.  For the analysis side, just put an {{tag|unk}} tag (for "unknown") after each form.  Don't forget to commit this to your git repository.
 
# Figure out what the analyses of '''at least three''' of these words should be, using the resources you have available (grammar books, etc.), and update the analysis side of your yaml file accordingly.
 
# Figure out what the analyses of '''at least three''' of these words should be, using the resources you have available (grammar books, etc.), and update the analysis side of your yaml file accordingly.
 
# Add '''at least one''' of these analyses to your transducer so that the test passes.
 
# Add '''at least one''' of these analyses to your transducer so that the test passes.
# Rerun <code>aq-covtest</code> to see by how much your coverage improved.
+
# Rerun <code>coverage-hfst</code> or <code>aq-covtest</code> to see by how much your coverage improved.
 
#* Add a note to the notes section of the additional top word(s) you added, and the resulting change in coverage (e.g., «by adding "{{morphTest|and{{tag|cnjcoo}}|and}}" to the transducer, coverage went from 9.76% to 12.32%»)
 
#* Add a note to the notes section of the additional top word(s) you added, and the resulting change in coverage (e.g., «by adding "{{morphTest|and{{tag|cnjcoo}}|and}}" to the transducer, coverage went from 9.76% to 12.32%»)
  
 
In the Evaluation section on the wiki page, add the following:
 
In the Evaluation section on the wiki page, add the following:
* Total number of stems in the transducer.  You can use the lexccounter script, or count the stems manually.  (For languages with non-suffixational morphology, you'll probably need to count the stems manually.)
+
* Total number of stems in the transducer.  You can use the following method, or count the stems manually.
 +
*: <code>lexd -x apertium-xyz.xyz.lexd > /dev/null</code> (then add the counts for the relevant individual lexicons)
 
* Current coverage over your combined corpus
 
* Current coverage over your combined corpus
 
* The current list of top unknown words returned by <code>aq-covtest</code>
 
* The current list of top unknown words returned by <code>aq-covtest</code>
Line 290: Line 289:
  
 
=== Sanity checks before submitting ===
 
=== Sanity checks before submitting ===
# Did you commit ''just'' the initial files created by bootstrap ''before'' you initialised or compiled the module?  If not, start over with bootstrapping, being sure to copy over any files you've changed.
+
# Did you commit ''just'' the initial files created by bootstrap ''before'' you initialised or compiled the module?  If not, start over with bootstrapping, being sure to copy over any files you've changed.  Or use [[Removing binaries from transducer repo|this method]].
# Did you commit your updates to <code>lexc</code> and <code>twol</code> files?  And the <code>yaml</code> test file?
+
# Did you commit your updates to <code>lexd</code> and <code>twol</code> files?  And the <code>yaml</code> test files?
 
# Do you have at least 50 tests in the main tests file?  Do at least half of them pass a morph-test?
 
# Do you have at least 50 tests in the main tests file?  Do at least half of them pass a morph-test?
 
# Did you add everything asked for to the wiki page (evaluation, etc.) and your repo (e.g., both yaml files)?
 
# Did you add everything asked for to the wiki page (evaluation, etc.) and your repo (e.g., both yaml files)?
# If you have trouble analysing or compiling, are all your tags and symbols defined in both lexc and twol files?
+
# If you have trouble analysing or compiling, are all your tags and symbols (full alphabet) defined in your twol file?
  
  

Latest revision as of 14:18, 20 May 2021

Morphological transducers

A morphological transducer is just a directed graph. It consists of nodes (numbered below) and arcs (with labels), with a starting node (0 below) and an ending node (16 below).

Simple transducer.png

You follow the arcs that are available from your input. The only acceptable paths are ones that start from the starting node and end at the ending node. You may match your input to either side of the arc's label (separated by : above), and the other side is returned as output.

In the transducer above, the left side is the form and the right side is the analysis. If you match your input to the left side (the form), then your output will be the right side (the analysis)—this is morphological analysis. Likewise, if you follow the transducer by matching your input to the right side (the analysis) and output the left side (the form), then you are performing morphological generation.

An example of a complete path is w:w o:o l:l v:f e:<n> s:<pl>. The left/form side of this spells wolves and the right/analysis side of this spells wolf<n><pl>. Mapping between one and the other is as simple as taking one as input and following the path—by outputting the other side of each arc, you will get the other as output!
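
If it helps to see the path-following idea as code, here is a minimal Python sketch. It is only an illustration, not how HFST actually stores or runs transducers. It encodes just the single wolves path from the example; the state numbers 0 to 6 and the names ARCS, tokenise and transduce are made up for this sketch and don't correspond to the node numbers in the figure.

```python
import re

# Toy transducer: state -> list of (form_symbol, analysis_symbol, next_state).
# Only the one path for "wolves" is encoded; state numbers are hypothetical.
ARCS = {
    0: [("w", "w", 1)],
    1: [("o", "o", 2)],
    2: [("l", "l", 3)],
    3: [("v", "f", 4)],
    4: [("e", "<n>", 5)],
    5: [("s", "<pl>", 6)],
}
START, FINAL = 0, {6}


def tokenise(text):
    """Split a string into symbols, keeping tags like <pl> whole."""
    return re.findall(r"<[^>]+>|.", text)


def transduce(text, analyse=True):
    """Return every output reachable by matching `text` on one side of the arcs.

    analyse=True matches the form side and outputs the analysis side;
    analyse=False does the reverse (generation).
    """
    symbols = tokenise(text)
    results = []

    def walk(state, pos, out):
        if pos == len(symbols):
            # Input used up: the path only counts if we are at an ending node.
            if state in FINAL:
                results.append("".join(out))
            return
        for form, analysis, nxt in ARCS.get(state, []):
            match, emit = (form, analysis) if analyse else (analysis, form)
            if match == symbols[pos]:
                walk(nxt, pos + 1, out + [emit])

    walk(START, 0, [])
    return results


print(transduce("wolves"))                      # analysis
print(transduce("wolf<n><pl>", analyse=False))  # generation
```

The same table of arcs serves both directions; the only thing that changes is which side of each label is matched against the input.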

Question: What are all the possible paths provided by this transducer?

The formalism we use (lexd)

Transducers are pretty cool, and quite efficient... for computers. Following paths by hand is tedious, and drawing a transducer for anything more complex than the example above is torture. See the transducer below for Tuvan.

Tuvan transducer.png

This transducer provides the combinations of about 8 case markers, 5 possessive morphemes, and the plural marker for three Tuvan nouns.

An example is өг>{L}{A}р>{i}м>{D}{A}н mapping to өг<n><pl><px1sg><abl>, meaning "from my houses". The analysis side is clear to anyone familiar with tags (and knowing that "өг" means "house"). The form side is actually something that will get fixed by morphophonology, which we'll worry about later (for now: letters like {L} can be realised in a variety of ways, and > is used as a morpheme boundary); the actual orthographic form is өглеримден.

Question: How can we quantify the complexity of this graph?

Fortunately, we don't have to draw this graph by hand. We can simply define the various sections of it and link them together with a straightforward formalism called lexd. A section of a lexd file that corresponds to the graph above looks like the following:

PATTERNS

N-Stems [ <n>: ] [ <pl>:>{L}{A}р ]? Possession? Cases


LEXICON N-Stems

өг:өг     # "yurt"
аът:аът   # "horse"
ном:ном   # "book"

LEXICON Possession

<px1sg>:>{i}м
<px2sg>:>{i}ң
<px3sp>:>{z}{I}{n}
<px1pl>:>{i}в{I}с
<px2pl>:>{i}ң{A}р

LEXICON Cases

<nom>:
<gen>:>{N}{I}ң
<acc>:>{N}{I}
<dat>:>{G}{A}
<loc>:>{D}{A}
<abl>:>{D}{A}н
<all>:>{J}е
<all>:>{D}{I}в{A}   # Dir/LR

Questions

  • What is # doing?
  • What is : doing?
  • What are the ?s doing?
  • How are the lists of parts (LEXICONs) combined into a whole?
  • What is not mentioned in this code that is in the graph above?
  • Can you match sections of the graph to sections of the code?
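
One way to get a feel for what the PATTERNS line does is to mimic it in Python. The sketch below is only an approximation of lexd's behaviour for this one pattern: each slot is a list of (analysis, form) pairs, a `?` slot may also be skipped, and the pattern is every combination of one choice per slot. The names (SLOTS, optional, expand) are invented for the sketch, the plural is written with Cyrillic р as in the worked example өг>{L}{A}р>{i}м>{D}{A}н, and the accusative is assumed to use the {I} archiphoneme.

```python
from itertools import product

# Each entry is (analysis_side, form_side), following the lexd example.
N_STEMS = [("өг", "өг"), ("аът", "аът"), ("ном", "ном")]
POS = [("<n>", "")]
PLURAL = [("<pl>", ">{L}{A}р")]  # Cyrillic р, as in the worked example
POSSESSION = [
    ("<px1sg>", ">{i}м"), ("<px2sg>", ">{i}ң"), ("<px3sp>", ">{z}{I}{n}"),
    ("<px1pl>", ">{i}в{I}с"), ("<px2pl>", ">{i}ң{A}р"),
]
CASES = [
    ("<nom>", ""), ("<gen>", ">{N}{I}ң"),
    ("<acc>", ">{N}{I}"),  # assumption: {I} is an archiphoneme here too
    ("<dat>", ">{G}{A}"), ("<loc>", ">{D}{A}"), ("<abl>", ">{D}{A}н"),
    ("<all>", ">{J}е"), ("<all>", ">{D}{I}в{A}"),
]


def optional(lexicon):
    """Mimic lexd's `?`: the slot may also contribute nothing."""
    return [("", "")] + lexicon


# N-Stems [ <n>: ] [ <pl>:... ]? Possession? Cases
SLOTS = [N_STEMS, POS, optional(PLURAL), optional(POSSESSION), CASES]


def expand(slots):
    """Yield (analysis, form) for every way of picking one entry per slot."""
    for combo in product(*slots):
        analysis = "".join(a for a, _ in combo)
        form = "".join(f for _, f in combo)
        yield analysis, form


pairs = list(expand(SLOTS))
print(len(pairs))  # 3 * 1 * 2 * 6 * 8 = 288
print(("өг<n><pl><px1sg><abl>", "өг>{L}{A}р>{i}м>{D}{A}н") in pairs)  # True
```

Counting the pairs like this is also one rough way to put a number on the complexity of the graph.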

Additional nuances

Phonology

The symbols like {L} above will need to be realised as different characters in different contexts.

For any symbols in your language that will be realised in different ways in different environments, you'll want to set up such an "archiphoneme". Use an uppercase letter for something that just has different forms, and use a lowercase letter for something that is inserted or deleted (i.e., is sometimes realised as nothing).

For now, it will suffice to define all the ways in which each archiphoneme surfaces by making a list in your twol file. This essentially allows all of the options to surface, which means you will be able to analyse incorrect forms as well as correct ones. Later, when you make a generator, you'll write rules to constrain where each of the symbols can occur.

Defining symbols

You'll want to define your archiphoneme symbols in the twol file, each with all its possible outputs.

So your twol file should contain an Alphabet section, which lists all the characters of the alphabet, and then all the archiphonemes with all their realisations. You will also want the > morpheme separator and some punctuation marks, all escaped. A condensed example for Tuvan follows:

Alphabet

   А Б В Г Д Е Ё Ж З И Й К Л М Н Ң О Ө П Р С Т У Ү Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
   а б в г д е ё ж з и й к л м н ң о ө п р с т у ү ф х ц ч ш щ ъ ы ь э ю я

   %{A%}:а %{A%}:е
   %{L%}:л %{L%}:н
   %{i%}:0 %{i%}:ы %{i%}:и %{i%}:у %{i%}:ү

   %.
   %-

   %>:0 
 ;
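
To see what "allowing all of the options to surface" means in practice, here is a small Python sketch. It is an illustration only, not what twol does internally: it takes the realisations listed in the condensed Alphabet above and expands an underlying form into every candidate surface form. The names REALISATIONS and surface_forms are invented for the sketch, and the example form өг>{L}{A}р>{i}м was chosen because it only uses archiphonemes that the condensed example defines.

```python
import re
from itertools import product

# Realisations copied from the condensed Alphabet; "" stands for 0 (nothing).
REALISATIONS = {
    "{A}": ["а", "е"],
    "{L}": ["л", "н"],
    "{i}": ["", "ы", "и", "у", "ү"],
    ">": [""],  # the morpheme boundary never surfaces (%>:0)
}


def surface_forms(underlying):
    """Return every surface string allowed when no rules constrain anything."""
    # Keep archiphonemes such as {L} together as single symbols.
    symbols = re.findall(r"\{[^}]+\}|.", underlying)
    # Ordinary letters have exactly one realisation: themselves.
    options = [REALISATIONS.get(sym, [sym]) for sym in symbols]
    return {"".join(choice) for choice in product(*options)}


forms = surface_forms("өг>{L}{A}р>{i}м")
print(len(forms))          # 2 * 2 * 5 = 20 candidates
print("өглерим" in forms)  # the correct form is among them
print("өгнарм" in forms)   # ...but so are incorrect ones
```

This is why the analyser will accept incorrect forms for now: until there are rules, every combination of realisations is a valid path.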

Starting point

You'll need at least one PATTERNS section in your lexd file. Bootstrapping a new language module per the instructions will create this for you, but don't forget that it's a thing!

Morphology that isn't suffixes

You may remember from last week that analyses are generally in the form stem + POS tags + subcategory tags + function tags. What if some of your functional morphology occurs before the stem?

You can certainly implement that as shown above, but there's a problem: your tags will occur in the middle of the analysis. That is, instead of something like do<v><tv><rep><prog> ↔ redoing, you'd get something like <rep>do<v><tv><prog> ↔ redoing. This is undesirable, because it makes it harder to keep track of which of your tags are for what.

Fortunately, lexd offers a trick for handling such things! You can list the left and right side of a given LEXICON in different parts of a PATTERN, and the pieces will be matched. The following is a simple example (which assumes that the "re" prefix in English should be treated as productive and inflectional, which it probably shouldn't):

PATTERNS

Verbs

PATTERN Verbs

V-Base V-Tenses

PATTERN V-Base

V-Stems(1) [<v>:] V-Stems(2):
:V-Prefixes V-Stems(1) [<v>:] V-Stems(2): V-Prefixes:

LEXICON V-Prefixes

<rep>:re>
<rev>:un>

LEXICON V-Tenses

<prog>:>ing
<past>:>{e}d

LEXICON V-Stems(2)

do:do        <tv>
write:write  <tv>

Questions:

  • What are the forms output by this transducer?
  • (How many forms are there?)
  • What are the numbers in ()?
  • What part of the file adds the prefixes and what part of the file adds the prefix tags?
  • What are the different PATTERN groups for?

Another example

Here's a better example, showing (productive, inflectional) gender agreement on verbs in Avar.

PATTERNS

Verbs

PATTERN Verbs

:V-Gender V-Stems(1) [<v>:] V-Stems(2): V-Tense V-Gender:

LEXICON V-Tense

<aor>:>уна

LEXICON V-Gender

<nt>:б>
<m>:в>
<f>:й>
<pl>:р>

LEXICON V-Stems(2)

бицине:иц     <tv>  # "говорить"

The output analyses would be the following:

бицине<v><tv><aor><nt>:бицуна
бицине<v><tv><aor><f>:йицуна
бицине<v><tv><aor><m>:вицуна
бицине<v><tv><aor><pl>:рицуна

In-class exercise

See Morphological analyser/Exercises.

The work we did on this in class is available on Swarthmore's GitHub at ling073-sp21/ling073-eng.

Evaluation

Individual forms

To test whether/how your analyser is analysing a form, you can run the following:

echo "form" | apertium -d /path/to/analyser/ xyz-morph

An example might be the following:

apertium-tyv$ echo өглеримден | apertium -d . tyv-morph
^өглеримден/өг<n><pl><px1sg><abl>$^./.<sent>$

You can also do it this way:

apertium-tyv$ echo өглеримден | hfst-proc tyv.automorf.hfst

The output above means that for the form өглеримден there is one analysis: өг<n><pl><px1sg><abl>. A form with multiple analyses would have them separated by /, like the following:

^өг/өг<n><nom>/өг<n><attr>/өг<n><nom>+э<cop><aor><p3><sg>$^./.<sent>$

A form with no analyses in the transducer will just return the form with an * before it, like the following:

^өглеримнен/*өглеримнен$^./.<sent>$
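
If you're checking many forms at once, it can help to summarise the output rather than read it by eye. The following is a minimal sketch, not part of the Apertium tools: count_analyses is a hypothetical helper name, and it assumes the output looks exactly like the examples above (units of the form ^form/analysis/analysis$, with an unanalysed form shown as a single entry starting with *).

```shell
# Hypothetical helper (not an Apertium command): read analyser output on
# stdin and print each surface form with its number of analyses.
# Assumes units look like ^form/analysis1/analysis2$ and that an unknown
# form appears as ^form/*form$, as in the examples above.
count_analyses() {
  tr '$' '\n' | awk -F'/' '
    {
      start = index($0, "^")
      if (start == 0) next                 # no lexical unit in this chunk
      $0 = substr($0, start + 1)           # drop everything up to and including ^
      n = NF - 1                           # fields after the form are analyses
      if (n == 1 && substr($2, 1, 1) == "*") n = 0   # *form means no analysis
      print $1 "\t" n
    }'
}
```

You could then run something like echo өглеримден | apertium -d . tyv-morph | count_analyses to get one line per form, with 0 marking forms the analyser doesn't know.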

See full contents of analyser

hfst-expand xyz.automorf.hfst

A long list of forms with known analyses

To test whether your analyser is analysing forms correctly, you can put your analyses into a yaml file and use morph-test or aq-morftest:

morph-test -csi xyz.yaml | most

or

aq-morftest -csi xyz.yaml | most

Coverage over a corpus

To test coverage over a corpus, you can use coverage-hfst or aq-covtest:

coverage-hfst xyz.corpus.basic.txt /path/to/xyz.automorf.hfst
aq-covtest xyz.corpus.basic.txt /path/to/xyz.automorf.bin
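
To make it clearer what "coverage" means here, the following sketch computes it directly from analyser output: the share of tokens that receive at least one analysis. coverage_from_stream is a hypothetical helper name, not one of the tools above, and it assumes the same output format as the examples in the previous section. It counts punctuation as tokens, so its numbers may differ from what coverage-hfst or aq-covtest report.

```shell
# Hypothetical helper (not an Apertium command): read analyser output on
# stdin and report how many tokens got at least one analysis.
# Assumes units look like ^form/analysis$ and unknowns look like ^form/*form$.
coverage_from_stream() {
  tr '$' '\n' | awk -F'/' '
    {
      start = index($0, "^")
      if (start == 0) next                 # skip chunks with no lexical unit
      $0 = substr($0, start + 1)           # drop everything up to and including ^
      total++
      if (!(NF == 2 && substr($2, 1, 1) == "*")) known++
    }
    END {
      if (total == 0) { print "no tokens"; exit }
      printf "%d/%d analysed (%.2f%%)\n", known, total, 100 * known / total
    }'
}
```

A possible use would be apertium -d . xyz-morph < xyz.corpus.basic.txt | coverage_from_stream, run from the directory of your compiled module.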

Generating forms

If you need to test how a form generates, you can do something like the following:

echo "^house<n><pl>$" | apertium -d . -f none xyz-gener

This will return all forms currently being generated, e.g. houses/housees
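
If you generate a batch of forms, a small filter can point out where more than one form comes back, which usually signals overgeneration like houses/housees. This is only a sketch: flag_overgeneration is a hypothetical helper name, and it assumes one generated result per line with alternatives separated by /, as in the example above.

```shell
# Hypothetical helper (not an Apertium command): read generator output on
# stdin, one result per line, and print the lines offering several forms.
flag_overgeneration() {
  awk -F'/' 'NF > 1 { print NF " forms: " $0 }'
}
```

For the example above, the line houses/housees would be reported as having 2 forms, while a line with a single form would be passed over silently.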

Counting lexicon entries

You can run the following command to list all lexicons and the number of entries in each:

lexd -x apertium-xyz.xyz.lexd > /dev/null

Note that a number of these lexicons are probably related to your morphology. If you're counting "lexical entries", you should probably exclude the morphology ones.

The assignment

This assignment will be due at the end of the day Friday of the 5th week of class (this semester: 23:59 on Friday, March 12th, 2021).

This assignment is to develop a morphological analyser that implements a good deal of the basic morphology of your language.

Getting set up

  1. Bootstrap a transducer for your language using apertium-init (installed on the lab machines):
    apertium-init -a lexd --with-spellrelax --prefix=ling073 xyz
  2. Go into the new directory (cd ling073-xyz), initialise the module (./autogen.sh), and compile it (make).
    • If this is successful, you should have several "modes" available; run apertium -d . -l to see.
    • One mode should be an xyz-morph mode; this is your analyser. Check it by running echo "houses" | apertium -d . xyz-morph, which should give you a morphological analysis of the word "houses".
  3. Integrate any comments I've provided to you on your grammar documentation page so that all of your morphTests are in good order. See the sanity checks at Grammar documentation#Sanity checks to check the main things.
  4. Augment the commented section at the top of the apertium-xyz.xyz.lexd file with any tags you came up with during the Grammar documentation assignment that aren't there already. Provide a symbol, and a brief comment explaining what the symbol means.
  5. Add all the characters of your language's orthography to the Alphabet section of the apertium-xyz.xyz.twol file. You may need to add archiphonemes later.
  6. Use the morphTests2yaml script (installed on the lab machines) to create a yaml test file in a subdirectory called tests/. Commit this file to the git repo. (You can remove blank sections if you like, and if they appear in the file.) There should be at least 50 tests in this file—make sure you have enough.
  7. Create a new empty repo (that is, don't check the README option) in the course's GitHub organisation with the name ling073-xyz (with your language's code in place of xyz). Then add the SSH link as a remote origin in your initialised module and push the module to the GitHub repo:
    git remote add origin git@github.swarthmore.edu:Ling073-sp21/ling073-xyz.git
    git push --set-upstream origin master

The hard stuff

  1. Build your morphological transducer, adding all of the stems from your Grammar documentation assignment, categorised correctly, so that at least half of your tests pass. You'll need to build up the morphotactics too.
    • If too many of your grammar points are too hard to implement at this point (e.g., require some rules to change some characters to other characters), then you can skip one or two of them and instead add more "easy" forms to your transducer.
    • Alternatively "hard-code" some forms, but add a comment in the lexd file near the relevant forms indicating that they need further work, and mention it on the wiki page (see below).
    • Also don't forget to clean up your grammar page, and to rerun the scraping script afterwards to get a fresh version of the tests file. If you're happy with your grammar page and don't expect to change it much, feel free to "clean up" your tests file manually.
  2. Create a page on the wiki Language/Transducer that links to the code and has Evaluation and Notes sections.

Evaluation

When you've finished getting half of your tests to pass, evaluate coverage on your corpus and add at least one of the most frequent unanalysed words:

  1. Use coverage-hfst or aq-covtest (as above) to see how many forms in your basic corpus are analysed, and what the top unknown forms are.
    • Make note of the coverage at this point
  2. Make a new yaml file in your tests directory with the top unanalysed words, and name it something like commonwords.yaml. For the analysis side, just put an <unk> tag (for "unknown") after each form. Don't forget to commit this to your git repository.
  3. Figure out what the analyses of at least three of these words should be, using the resources you have available (grammar books, etc.), and update the analysis side of your yaml file accordingly.
  4. Add at least one of these analyses to your transducer so that the test passes.
  5. Rerun coverage-hfst or aq-covtest to see by how much your coverage improved.
    • Add a note to the Notes section about the top word(s) you added and the resulting change in coverage (e.g., «by adding "and<cnjcoo> ↔ and" to the transducer, coverage went from 9.76% to 12.32%»)

In the Evaluation section on the wiki page, add the following:

  • Total number of stems in the transducer. You can use the following method, or count the stems manually.
    lexd -x apertium-xyz.xyz.lexd > /dev/null (then add the counts for the relevant individual lexicons)
  • Current coverage over your combined corpus
  • The current list of top unknown words returned by aq-covtest
  • Number of tests that pass in each yaml file
    • The main yaml file should have at least half of the tests passing
    • The commonwords.yaml file should have at least 1 passing test

Housekeeping

  1. Add yourself to the AUTHORS file.
  2. Make sure the COPYING file contains an open-source license to your liking (default should be GPL3).
  3. Add links to the transducer repo and wiki page to the list of resources you developed for your language on the language's page on this wiki.

Sanity checks before submitting

  1. Did you commit just the initial files created by bootstrap before you initialised or compiled the module? If not, start over with bootstrapping, being sure to copy over any files you've changed. Or use this method.
  2. Did you commit your updates to lexd and twol files? And the yaml test files?
  3. Do you have at least 50 tests in the main tests file? Do at least half of them pass a morph-test?
  4. Did you add everything asked for to the wiki page (evaluation, etc.) and your repo (e.g., both yaml files)?
  5. If you have trouble analysing or compiling, are all your tags and symbols (full alphabet) defined in your twol file?