Morphological generator

From LING073

The basics

Twol rules are written to change the output of the lexc level of the transducer on a character-by-character basis. The basic structure of a twol rule is a mapping of lexc-level symbols to orthographic form-level symbols in a particular environment. A basic rule mapping lexc-level (or "morphophonological") %{x%} to form-level (or "phonetic") x when it's before form-level y would be the following:

"{x} outputs as x before y"
%{x%}:x <=> _ :y ;

The first line is a unique rule name, which is mandatory. The next line begins with the rule operation, consisting of an input-level (or lexc-level) symbol, a separator colon, and an output-level (or form-level) symbol. The operation is followed by a rule operator and a context. The context contains a series of symbol pairs separated by spaces. The symbols in these pairs use the separator : to separate lexc-level from form-level values, and each symbol can consist of multiple characters. An underscore (_) shows the rule locus, i.e. the exact position among these symbols where the change takes place.

This structure and terminology is summarised in the following diagram: [Figure: Structure of twol rule.png]
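
The "if and only if" reading of <=> can be sketched in Python. Python is used here purely for illustration (twol itself is compiled by HFST), and obeys_rule is a hypothetical helper, not part of any toolkit. It takes a word that has already been aligned as (lexc-level, form-level) pairs and checks the single rule shown above.

```python
def obeys_rule(pairs):
    """Check the rule  %{x%}:x <=> _ :y ;  on an aligned list of
    (lexc_symbol, form_symbol) pairs."""
    for i, (lexical, surface) in enumerate(pairs):
        if lexical != "{x}":
            continue
        # Context: the next pair has form-level y (its lexc side is unconstrained).
        before_y = i + 1 < len(pairs) and pairs[i + 1][1] == "y"
        # <=> means {x}:x occurs in the context, and only there.
        if (surface == "x") != before_y:
            return False
    return True


print(obeys_rule([("{x}", "x"), ("y", "y")]))  # {x}:x before y
print(obeys_rule([("{x}", "0"), ("y", "y")]))  # {x}:0 before y
```

The first alignment is accepted. The second is rejected, because the rule requires {x} to surface as x whenever a form-level y follows.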

Some useful symbols

The basic stuff:

  • : separates the lexc side of the transducer (left) from the output side (right); e.g. %>:0
  • 0 on the right side means empty output
  • _ is the location in which the rule is applied
  • <=> separates what gets mapped from the environment, and can be read as "where". It's a good idea to put spaces around it.
  • ; represents the end of an environment.
  • .#. is the beginning or end of a path (/word).

Some more advanced stuff:

  • \ means "not" the thing after it
  • () means that the thing inside it is optional (1 or 0 occurrences match)
  • [] groups blocks of things
  • | means "or"
  • * after a symbol (e.g., x:y* or [ … ]*) means 0 to infinite occurrences of that thing match
  • / optionally interleaves the symbol after it into all possible positions in the set of symbols before it, e.g. [ :a :b :c ]/:x should be equivalent to [ (:x) :a (:x) :b (:x) :c (:x) ].
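
To get a feel for what the / expansion described above covers, here is a small Python sketch that enumerates every sequence matched by [ (:x) :a (:x) :b (:x) :c (:x) ]. It only looks at form-level symbols, and interleave_optional is a made-up name for this illustration. It follows the expansion as stated in the list, not the twol compiler itself.

```python
from itertools import product


def interleave_optional(symbols, extra):
    """Enumerate every sequence matched by [ (extra) s1 (extra) s2 ... (extra) ]."""
    results = []
    # One optional slot before each symbol, plus one after the last symbol.
    for choice in product([False, True], repeat=len(symbols) + 1):
        seq = []
        for use_extra, sym in zip(choice, symbols):
            if use_extra:
                seq.append(extra)
            seq.append(sym)
        if choice[-1]:
            seq.append(extra)
        results.append("".join(seq))
    return results


forms = interleave_optional(["a", "b", "c"], "x")
print(len(forms))
print(forms[0], forms[-1])
```

With three symbols there are four optional slots, so the sketch yields 16 sequences, ranging from abc to xaxbxcx.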

Some more tricks

Multiple environments

You can list multiple environments in a single rule:

"x outputs as y before a and after b"
x:y <=> _ :a ;
        :b _ ;

The rule above maps x to y before an output a, and also after an output b. Either environment on its own is enough to trigger the change.
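
As a rough Python sketch of how the two environments combine (the function name and its arguments are invented for this illustration), the contexts of one rule behave like an "or":

```python
def x_surfaces_as_y(prev_surface, next_surface):
    """Rule:  x:y <=> _ :a ;  :b _ ;   (either context licenses x:y)."""
    before_a = next_surface == "a"
    after_b = prev_surface == "b"
    return before_a or after_b


print(x_surfaces_as_y(prev_surface="c", next_surface="a"))  # before a only
print(x_surfaces_as_y(prev_surface="b", next_surface="c"))  # after b only
print(x_surfaces_as_y(prev_surface="c", next_surface="c"))  # neither context
```

The first two calls return True and the last returns False, so x stays unchanged only when neither environment applies.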

Excluded environments

You can also exclude certain environments using the word except:

"x outputs as y before a unless after b"
x:y <=> _ :a ;
        except
        :b _ ;

The rule above maps x to y before an output a unless it's after an output b.
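
The effect of an excluded environment can be sketched in Python as a small truth table. The names here are hypothetical and the sketch only models this one rule: the except context overrides the licensing context.

```python
def x_surfaces_as_y(prev_surface, next_surface):
    """Rule:  x:y <=> _ :a ;  except  :b _ ;"""
    before_a = next_surface == "a"
    after_b = prev_surface == "b"
    # The excluded environment overrides the licensing one.
    return before_a and not after_b


for prev in ("b", "c"):
    for nxt in ("a", "c"):
        print(prev, nxt, x_surfaces_as_y(prev, nxt))
```

Only the combination "not after b, before a" gives True. After b, x is left alone even when a follows.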

Matched correspondences

You can do a whole bunch of pairs of symbols in the same environment using a single rule, like this:

"{A} outputs as the vowel before it"
%{A%}:Vy <=> :Vx _ ;
      where Vx in ( a e o )
            Vy in ( a e o )
      matched ;

This rule turns %{A%} into either a, e, or o to match the character before it.

A similar rule:

"Vowels raise after a null-realised {x}"
Vx:Vy <=> %{x%}:0 _ ;
      where Vx in ( a e o )
            Vy in ( ə i u )
      matched ;

This rule says that a becomes ə, e becomes i, and o becomes u after an archiphoneme {x} that is deleted.
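
One way to think about matched is that it pairs the two lists position by position, as if you had written one rule per pair. The Python sketch below (illustration only, not something the twol compiler runs) spells out that expansion for the raising rule:

```python
vx = ["a", "e", "o"]
vy = ["ə", "i", "u"]

# "matched" pairs the sets position by position rather than as a cross product.
pairs = list(zip(vx, vy))

for lexical, surface in pairs:
    # Doubled braces in the f-string produce the literal %{x%}.
    print(f"{lexical}:{surface} <=> %{{x%}}:0 _ ;")
```

This prints three single-pair rules (a:ə, e:i, o:u), each with the same context as the original rule.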

Using sets

You can use a set to define a wider context to a rule. For example, you can define a set of vowels as follows:


Vowels = a e i o u ;

Then you can define a rule based on the context, e.g.:


"realise {e} as null if after a morpheme-final vowel"
%{e%}:0 <=> :Vowels %>:0 _ ;

(You can also theoretically use sets with matched correspondences, but I make no guarantees that it'll work as expected.)
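
Here is a minimal Python sketch of what the rule above does to lexc output. It assumes that vowels surface as themselves and that the default realisation of {e} is e. The inputs toe>{e}s and box>{e}s are hypothetical lexc outputs chosen for illustration, and realise is not a real tool.

```python
VOWELS = set("aeiou")


def realise(lexical_symbols):
    """Apply  %{e%}:0 <=> :Vowels %>:0 _ ;  with {e} defaulting to e."""
    surface = []
    for i, sym in enumerate(lexical_symbols):
        if sym == ">":
            continue  # the morpheme boundary %>:0 has empty output
        if sym == "{e}":
            # Context: a vowel, then the morpheme boundary, then {e}.
            after_final_vowel = (
                i >= 2
                and lexical_symbols[i - 1] == ">"
                and lexical_symbols[i - 2] in VOWELS
            )
            if not after_final_vowel:
                surface.append("e")
            continue
        surface.append(sym)
    return "".join(surface)


print(realise(["t", "o", "e", ">", "{e}", "s"]))
print(realise(["b", "o", "x", ">", "{e}", "s"]))
```

The first call gives toes ({e} deleted after the morpheme-final vowel) and the second gives boxes ({e} kept after a consonant).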

Multiple correct forms

Sometimes there are multiple correct forms for a given analysis in a language. In cases like this, you should analyse all of them, but only generate one. You can use Dir/LR in the comment section of a form to allow only form-to-analysis mapping (and not analysis-to-form mapping). You may also want to add both forms to your tests file.

An example of this is the English word "dwarf", the plural of which can be "dwarves" or "dwarfs". Since the former is considered correct by more people (or simply, is more commonly used), we'll consider it correct. You'll want something like this in your lexc file:

dwarf:dwar%{F%} NounInfl ;
dwarf:dwarf NounInfl ; ! Dir/LR

If you want to have both output forms in your yaml tests file (good for analysis tests, not for generation tests), you can have an entry like the following:

dwarf<n><pl> : {dwarves, dwarfs}
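
The asymmetry between analysis and generation can be sketched in Python with two lookup tables. This is a simplified model, not how the transducer is actually built: the entries list holds finished surface forms, and the "LR" marker stands in for the Dir/LR comment.

```python
# (analysis, surface form, direction restriction or None)
entries = [
    ("dwarf<n><pl>", "dwarves", None),
    ("dwarf<n><pl>", "dwarfs", "LR"),
]

analyser = {}   # form -> analyses (every entry is included)
generator = {}  # analysis -> forms (Dir/LR entries are left out)
for analysis, form, direction in entries:
    analyser.setdefault(form, []).append(analysis)
    if direction != "LR":
        generator.setdefault(analysis, []).append(form)

print(analyser["dwarfs"], analyser["dwarves"])
print(generator["dwarf<n><pl>"])
```

Both forms are analysed as dwarf<n><pl>, but only dwarves is generated.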

Common pitfalls

  • Problem: Error messages about "conflicts"
    Cause: Rules that apply to the same left side cannot have overlapping environments. If the environments do overlap, the compiler won't know which right side you want.
    Solution: Make sure the contexts for the rules are different. If necessary, put the context for one of the rules under an "except" in the other rule.
  • Make sure you have all your symbols defined in the Alphabet section of the twol file, including multi-character symbols
  • Make sure all your multi-character symbols are also listed in the Multichar_Symbols section of the lexc file.
  • Make sure there's at least one set in the Sets section, and at least one rule in the Rules section (even if they do nothing).

Useful commands

See what lexc outputs

echo "life<n><pl>" | hfst-lookup .deps/eng.LR.lexc.hfst 

There should be a line like this:

> life<n><pl>	li{f}e>{e}s	0.000000

This shows that the lexc part of the transducer maps life<n><pl> to li{f}e>{e}s.

If you're using twoc, you'll want to look up the lexc output in the file .deps/xyz.LR.lexc-twoc.hfst instead.
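
If you want to collect lexc output for many forms, the lookup line shown above can be split into its tab-separated fields. The Python sketch below assumes the three-field layout of that example line (input, output, weight) and an optional leading "> " prompt. parse_lookup_line is a hypothetical helper.

```python
def parse_lookup_line(line):
    """Split one hfst-lookup result line into (input, output, weight)."""
    if line.startswith("> "):
        line = line[2:]  # drop the prompt shown in the example output
    analysis, lexc_output, weight = line.rstrip("\n").split("\t")
    return analysis, lexc_output, float(weight)


print(parse_lookup_line("> life<n><pl>\tli{f}e>{e}s\t0.000000"))
```

The middle field, li{f}e>{e}s, is the string you would then turn into a pair test.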

Check what's being generated by the full transducer

echo "^life<n><pl>$" | apertium -d . -f none eng-gener

will output something like

lives

If you get something like #life, then this means it's not generating the form correctly. You'll want to find the output of lexc (as above) and set up a pair test for it (as below).

Pair tests

You can add pair tests throughout your twol file and test them using a pairtest script. The script is helpful if no correct form is being generated—it'll give you some idea of what rule might be the/a culprit. If you have a correct form being generated and incorrect forms too, this script won't help—you simply need to write more rules.

The accepted format is like this (in this case, mapping li{f}e>{e}s to lives):

!@ l:l i:i {f}:v e:e >:0 {e}:0 s:s

Once you have one or more pair tests, you can run the pairtest script linked to above. Note that you don't have to recompile the pair to run the tests (but you should be in the tests directory).
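
To double-check that a pair test line says what you meant, you can read its two sides back out. The Python sketch below is not the pairtest script; parse_pair_test is a hypothetical helper that assumes the format shown above, where 0 on the right side means empty output.

```python
def parse_pair_test(line):
    """Return (lexc-level string, form-level string) for a '!@' pair test line."""
    if not line.startswith("!@ "):
        raise ValueError("not a pair test line")
    lexical, surface = [], []
    for pair in line[3:].split():
        # Split on the last colon: lexc side on the left, form side on the right.
        lex, _, surf = pair.rpartition(":")
        lexical.append(lex)
        if surf != "0":  # 0 is empty output, so it adds nothing to the form
            surface.append(surf)
    return "".join(lexical), "".join(surface)


print(parse_pair_test("!@ l:l i:i {f}:v e:e >:0 {e}:0 s:s"))
```

For the example line this gives li{f}e>{e}s on the lexc side and lives on the form side.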

The assignment

This assignment will be due at the beginning of the Thursday class during the 6th week (this semester: 11:20am on Thursday, March 1, 2018).

Initial evaluation

  1. If you're still working on your analyser, finish that first.
  2. Commit and push your code at this point, and then add a tag "analyser":
    • git tag analyser
    • Then push the tag to the repo: git push origin analyser
  3. Make a section on your Transducer wiki page for generator evaluation with a subsection title something like "initial evaluation of morphological generation" and note the number of passing and failing morphological analysis tests at this point using the morphology test method used for your analyser assignment.
    • Put existing sections under a new "Analyser evaluation" section to keep it separate from the generator evaluation.
  4. Also rerun the coverage test on your corpus, and add the current coverage number.
  5. Run a morph test again on your tests file, but this time with the -l option (to test morphological generation) instead of with -s (morphological analysis tests). You also don't need -i; the full command you'll probably want is morph-test -cl xyz.yaml | most. Put the number of passes and fails for the generation test in the same subsection on the wiki page.

The hard stuff

Continue development of your transducer, focussing on twol rules (in the twoc file as well!), so that two-thirds of your [minimum 50] tests pass.

  • Note that if every test passes analysis but has multiple forms generated, exactly one-half of your tests will pass—this is simply due to the way the tests are counted (i.e., each form will be counted as both passing and failing). In such a situation, each form which subsequently is made to have only a correct form generated will reduce the number of failing tests by one, but not increase the number of passing tests. Hence, if all your tests analyse correctly, then you only need to write twol rules to make half of them generate correctly.
  • In another situation, half (or a little more) of your tests might be passing, and these same tests might generate correctly because you don't have any need for twol rules. If you do not need at least one twol rule (in twol or twoc files), then you should do extra work on your lexc file so that at least 85% of your tests pass.
  • If you already had some necessary twol rules implemented and didn't need to modify even one of them for this part of the assignment, then you only need to aim for 75% passing generation tests.
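
The counting in the first bullet can be checked with a little arithmetic. The Python sketch below assumes the scenario described there: every test analyses correctly, and each test starts out generating one correct and one incorrect form, so it counts once as a pass and once as a fail. pass_rate is a hypothetical helper, not part of morph-test.

```python
from fractions import Fraction


def pass_rate(n_tests, n_fixed):
    """Pass rate when n_fixed tests no longer generate an incorrect form."""
    passes = n_tests            # every test generates its correct form
    fails = n_tests - n_fixed   # one fail per test still overgenerating
    return Fraction(passes, passes + fails)


print(pass_rate(50, 0))
print(pass_rate(50, 25))
```

With 50 tests the rate starts at 1/2, and fixing 25 of them (half) brings it to 2/3, which is the target.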

Also note:

  • You will need to leave only one definition for each archiphoneme that you write rules to deal with. The one definition you leave should be the "default" realisation. For example, if you write a rule to deal with the {e} in English plurals, you might want to leave the default at null (%{e%}:0) or realised as "e" (%{e%}:e), but you don't want both! The rule you have should map the other (non-default) realisation.
  • Make sure that your code compiles with no errors, and preferably with no warnings!

Final evaluation

When you're done with the above, add a new subsection for "final evaluation of morphological generation" and add the following:

  • the number of passing and failing tests now,
  • the number of twol rules that you added (and a separate number for any that you might have already had that you modified) for this part of the assignment (even if zero because they were already all working). You can use git to compare your work since you started work on this using the "analyser" tag you made at the beginning:
    • to compare all your work: git diff analyser
    • to compare only a given file (e.g., your twol file): git diff analyser -- FILE (where FILE is the path to that file)
  • Again run the coverage test on your corpus, and add the new coverage number.

Additional resources