Difference between revisions of "Morphological generator"

From LING073

Revision as of 00:10, 16 March 2021

The basics

Twol (pronounced [tuˈɛl]) constraints (or rules) are written to limit (or change) the output of the lexd level of the transducer on a character-by-character basis. One way of thinking of it is that the rules map between the morphological or phonological level and the phonetic or orthographical level. It's less useful to think of the rules as "transforming" the input—this approach is slightly inaccurate and can lead to problems.

A twol rule is structured around an environment in which a lexd-level symbol (or symbols) is restricted to a particular orthographic (form-level) symbol.

A basic rule restricting lexd-level (or "morphophonological") %{x%} (note that % is an escape character) to be output only as form-level (or "phonetic") x when it's before form-level y would be the following:

"{x} outputs as x before y"
%{x%}:x <=> _ :y ;

The first line is a unique rule name, which is mandatory. The next line begins with the rule operation, consisting of an input-level (or lexd-level) symbol, a separator colon, and an output-level (or surface-form-level) symbol. The operation is followed by a rule operator and a context. The context contains a series of symbol pairs separated by spaces. The symbols in these pairs use the separator : to separate lexd-level from form-level values, and each side can consist of multiple characters. An underscore (_) is used to show the rule locus, that is, the exact position among these symbols where the change should take place.

This structure and terminology are summarised in the following diagram: [Figure: Structure of twol rule.png]
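
The two directions of <=> can be made concrete with a small Python sketch. Python is used here only for illustration and is not part of the twol toolchain. The sketch assumes a word is a list of (lexd-level, form-level) symbol pairs and hand-codes the example rule above; the function name obeys_rule is hypothetical.

```python
def obeys_rule(pairs):
    """Check  %{x%}:x <=> _ :y ;  against a list of (lexd, form) symbol pairs.

    The rule has two halves:
      =>  {x} may be realised as x ONLY when the next form-level symbol is y
      <=  {x} MUST be realised as x whenever the next form-level symbol is y
    """
    for i, (lexd, form) in enumerate(pairs):
        if lexd != "{x}":
            continue  # the rule says nothing about other symbols
        before_y = i + 1 < len(pairs) and pairs[i + 1][1] == "y"
        realised_as_x = form == "x"
        if realised_as_x != before_y:
            return False  # one of the two halves is violated
    return True


ok = obeys_rule([("{x}", "x"), ("y", "y")])             # in context, realised as x
bad_context = obeys_rule([("{x}", "x"), ("z", "z")])    # x outside the context
bad_output = obeys_rule([("{x}", "z"), ("y", "y")])     # in context, but not x
```

Note that the check looks at whole symbol pairs rather than rewriting a string, which is why thinking of rules as constraints is safer than thinking of them as transformations.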

Some useful symbols

The basic stuff:

  • : separates the lexd side of the transducer (left) from the output side (right); e.g. %>:0
  • 0 on the right side means empty output
  • _ is the location in which the rule is applied
  • <=> separates what gets mapped from the environment, and can be read as "where". It's a good idea to put spaces around it.
  • ; represents the end of an environment.
  • .#. is the beginning or end of a path (/word).

Some more advanced stuff:

  • \ means "not" the thing after it
  • () means that the thing inside it is optional (1 or 0 occurrences match)
  • [] groups blocks of things
  • | means "or"
  • * after a symbol (e.g., x:y* or [ … ]*) means 0 to infinite occurrences of that thing match
  • / optionally interleaves the symbol after it into all possible positions in the set of symbols before it, e.g. [ :a :b :c ]/:x should be equivalent to [ (:x) :a (:x) :b (:x) :c (:x) ].

How rules interact

One of the most important things to keep in mind about twol rule/constraints is that if defined in the same file, they operate in parallel, or phrased differently, are not "ordered". Each rule/constraint represents a direct mapping between the input/morphological and output/phonetic layer.

This means that rules may easily interact with one another, for better or worse. For instance, if you have rules to map li{F}e>{e}s to lives, the rule mapping {F}:v will probably need the {e}: in its environment (so as to not make the change in li{F}e). In cases like this, you can exclude the output side of the other rule-affected character in the first rule's environment, unless of course only one possible mapping of the other character triggers the first rule. In that case, you'll want to specify the mapping.
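
The parallel behaviour can be simulated with a short Python sketch. It is a toy model and not how hfst-twolc is implemented: it enumerates every pairing the alphabet allows and keeps only those that satisfy all rules at once. The alphabet, the rule for {F} (roughly %{F%}:v <=> _ (e) %>: %{e%}: ;) and all function names are assumptions made for illustration; the {e} rule is modelled on "realise {e} as null if after a morpheme-final vowel".

```python
import re
from itertools import product

# Assumed alphabet: lexd-level symbol -> possible form-level outputs ("" stands for 0).
ALPHABET = {"{F}": ["f", "v"], "{e}": ["e", ""], ">": [""]}
VOWELS = set("aeiou")


def tokenise(lexd_string):
    """Split a lexd-level string into symbols, keeping {..} archiphonemes whole."""
    return re.findall(r"\{[^}]+\}|.", lexd_string)


def candidates(symbols):
    """Every pair sequence the alphabet allows, before any rule is applied."""
    options = [[(s, out) for out in ALPHABET.get(s, [s])] for s in symbols]
    return [list(seq) for seq in product(*options)]


def f_rule(pairs):
    """Hypothetical rule:  %{F%}:v <=> _ (e) %>: %{e%}: ;"""
    for i, (lexd, form) in enumerate(pairs):
        if lexd != "{F}":
            continue
        following = [l for l, _ in pairs[i + 1:]]
        if following[:1] == ["e"]:  # the optional (e)
            following = following[1:]
        in_context = following[:2] == [">", "{e}"]
        if (form == "v") != in_context:
            return False
    return True


def e_rule(pairs):
    """%{e%}:0 <=> :Vowels %>:0 _ ;"""
    for i, (lexd, form) in enumerate(pairs):
        if lexd != "{e}":
            continue
        in_context = (
            i >= 2 and pairs[i - 1] == (">", "") and pairs[i - 2][1] in VOWELS
        )
        if (form == "") != in_context:
            return False
    return True


def generate(lexd_string, rules=(f_rule, e_rule)):
    """Keep only the candidates that satisfy every rule simultaneously."""
    forms = []
    for pairs in candidates(tokenise(lexd_string)):
        if all(rule(pairs) for rule in rules):
            forms.append("".join(form for _, form in pairs))
    return forms


unrestricted = sorted(generate("li{F}e>{e}s", rules=()))
plural = generate("li{F}e>{e}s")
singular = generate("li{F}e")
```

Neither rule runs "before" the other: each one independently rejects candidates, and a form survives only if no rule objects to it. With no rules at all, every combination the alphabet allows is generated.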

Some more tricks

Multiple environments

You can list multiple environments:

"x outputs as y before a and after b"
x:y <=> _ :a ;
        :b _ ;

The rule above maps x to y in either environment: before an output a, or after an output b.

Excluded environments

You can also exclude certain environments using the word except:

"x outputs as y before a unless after b"
x:y <=> _ :a ;
    except
        :b _ ;

The rule above maps x to y before an output a unless it's after an output b.
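
As a rough mental model (a Python sketch under the reading given above, not the compiler's actual algorithm), listed environments are alternatives and environments after except are vetoes. The helper names below are hypothetical.

```python
def before_a(pairs, i):
    """Context  _ :a  (the next form-level symbol is a)."""
    return i + 1 < len(pairs) and pairs[i + 1][1] == "a"


def after_b(pairs, i):
    """Context  :b _  (the previous form-level symbol is b)."""
    return i > 0 and pairs[i - 1][1] == "b"


def rule_applies(pairs, i, environments, excluded=()):
    """True if position i is in any listed environment and in no excluded one."""
    in_some = any(env(pairs, i) for env in environments)
    vetoed = any(env(pairs, i) for env in excluded)
    return in_some and not vetoed


# (lexd, form) pairs; "?" marks the form-level symbol still to be decided for x.
word_b_x_c = [("b", "b"), ("x", "?"), ("c", "c")]
word_b_x_a = [("b", "b"), ("x", "?"), ("a", "a")]
word_c_x_a = [("c", "c"), ("x", "?"), ("a", "a")]

# Multiple environments: being after b is enough on its own.
multi = rule_applies(word_b_x_c, 1, [before_a, after_b])
# Excluded environment: before a, but vetoed because it is also after b.
vetoed = rule_applies(word_b_x_a, 1, [before_a], excluded=[after_b])
# Excluded environment: before a and not after b, so the rule applies.
allowed = rule_applies(word_c_x_a, 1, [before_a], excluded=[after_b])
```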

Matched correspondences

You can do a whole bunch of pairs of symbols in the same environment using a single rule, like this:

"{A} outputs as the vowel before it"
%{A%}:Vy <=> :Vx _ ;
      where Vx in ( a e o )
            Vy in ( a e o )
      matched ;

This rule turns %{A%} into either a, e, or o to match the character before it.

A similar rule:

"Vowels raise after a null-realised {x}"
Vx:Vy <=> %{x%}:0 _ ;
      where Vx in ( a e o )
            Vy in ( ə i u )
      matched ;

This rule says that a becomes ə, e becomes i, and o becomes u after an archiphoneme {x} that is deleted.
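
One way to picture what matched does is to expand the rule by hand into one rule per pair, taking the nth member of each list together. The Python sketch below does exactly that on the rule text; expand_matched is a hypothetical helper written for illustration, not a tool that ships with HFST.

```python
def expand_matched(template, vx, vy):
    """Pair up the two lists position by position and fill in the template."""
    if len(vx) != len(vy):
        raise ValueError("matched lists must have the same length")
    return [template.replace("Vx", x).replace("Vy", y) for x, y in zip(vx, vy)]


expanded = expand_matched(
    "Vx:Vy <=> %{x%}:0 _ ;",
    ["a", "e", "o"],
    ["ə", "i", "u"],
)
for rule in expanded:
    print(rule)
```

Because the lists are paired by position, the order of the symbols in each where list matters.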

Using sets

You can use a set to define a wider context to a rule. For example, you can define a set of vowels as follows:

Sets

Vowels = a e i o u ;

Then you can define a rule based on the context, e.g.:

Rules

"realise {e} as null if after a morpheme-final vowel"
%{e%}:0 <=> :Vowels %>:0 _ ;

(You can also theoretically use sets with matched correspondences, but I make no guarantees that it'll work as expected.)

Multiple correct forms

Sometimes there are multiple correct forms for a given analysis in a language. In cases like this, you should analyse all of them, but only generate one. You can use Dir/LR in the comment section of a form to allow only form-to-analysis mapping (and not analysis-to-form mapping). You may also want to add both forms to your tests file.

An example of this is the English word "dwarf", the plural of which can be "dwarves" or "dwarfs". Since the former is considered correct by more people (or simply, is more commonly used), we'll consider it correct. You'll want something like this in your lexd file:

dwarf:dwar{F}  #
dwarf:dwarf    # Dir/LR

If you want to have both output forms in your yaml tests file (good for analysis tests, not for generation tests), you can have an entry like the following:

dwarf<n><pl> : {dwarves, dwarfs}
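
The effect of Dir/LR can be sketched in Python as a toy lookup table. This is a simplification for illustration only: real lexd entries are stems such as dwar{F}, whereas the table below stores whole forms, and the names ENTRIES, analyse and generate_forms are hypothetical.

```python
# (analysis, form, direction restriction) for the two plurals of "dwarf".
ENTRIES = [
    ("dwarf<n><pl>", "dwarves", None),
    ("dwarf<n><pl>", "dwarfs", "Dir/LR"),  # analysed, but never generated
]


def analyse(form):
    """Form-to-analysis: every entry is available."""
    return [analysis for analysis, f, _ in ENTRIES if f == form]


def generate_forms(analysis):
    """Analysis-to-form: entries marked Dir/LR are left out."""
    return [
        form
        for a, form, direction in ENTRIES
        if a == analysis and direction != "Dir/LR"
    ]


both_analysed = (analyse("dwarves"), analyse("dwarfs"))
generated = generate_forms("dwarf<n><pl>")
```

Both forms are analysed to the same analysis, but only one comes back out of generation.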

Common pitfalls

  • Problem: Error messages about "conflicts"
    Cause: Rules that apply to the same left side cannot have overlapping environments. If the environments do overlap, the compiler won't know which right side you want.
    Solution: Make sure the contexts for the rules are different. If necessary, put the context for one of the rules under an "except" in the other rule.
  • Make sure you have all your symbols defined in the Alphabet section of the twol file, including multi-character symbols
  • Make sure there's at least one set in the Sets section, and at least one rule in the Rules section (even if they do nothing).
  • Make sure you put spaces around but not within symbol pairs. Also make sure the spacing next to brackets is correct. Spaces are very important in twol!

Useful commands

See what lexd outputs

echo "life<n><pl>" | hfst-lookup .deps/eng.LR.lexd.hfst 

There should be a line like this:

> life<n><pl>	li{F}e>{e}s	0.000000

This shows that the lexd part of the transducer maps life<n><pl> to li{F}e>{e}s.


You can also output all the morphological forms this way:

hfst-expand .deps/eng.LR.lexd.hfst

Check what's being generated by the full transducer

echo "^life<n><pl>$" | apertium -d . -f none eng-gener

or

echo "life<n><pl>" | hfst-proc eng.autogen.hfst

will output something like

lifes/lifees/lives/livees

If you get something like #life, then this means it's not generating the form correctly. You'll want to find the output of lexd (as above) and set up a pair test for it (as below).

Other useful tricks

Pair tests

You can add pair tests throughout your twol file and test them using a pairtest script. The script is helpful if no correct form is being generated—it'll give you some idea of what rule might be the/a culprit. If you have a correct form being generated and incorrect forms too, this script won't help—you simply need to write more rules.

The accepted format is like this (in this case, mapping li{F}e>{e}s to lives):

!@ l:l i:i {F}:v e:e >:0 {e}:0 s:s

Once you have one or more pair tests, you can run the pairtest script linked to above. Note that you don't have to recompile the pair to run the tests (but you should be in the tests directory).
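
To double-check that a pair test says what you intend, it can help to read the two levels back out of it. The Python sketch below is a hypothetical helper, not the pairtest script itself; it assumes each token is a lexd:form pair and that 0 on the form side means empty output.

```python
def parse_pair_test(line):
    """Return (lexd-level string, form-level string) for a '!@' pair test line."""
    body = line.strip()
    if not body.startswith("!@"):
        raise ValueError("pair tests start with !@")
    lexd_side, form_side = [], []
    for token in body[2:].split():
        left, _, right = token.partition(":")
        lexd_side.append(left)
        form_side.append("" if right == "0" else right)  # 0 means no output
    return "".join(lexd_side), "".join(form_side)


levels = parse_pair_test("!@ l:l i:i {F}:v e:e >:0 {e}:0 s:s")
print(levels)
```

The first string should match what lexd outputs for the analysis, and the second should be the form you expect to be generated.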

Out-of-lexicon characters

If you're experiencing weird tokenisation issues, like half-words ending up in your top unknown words list, then you probably have out-of-lexicon characters. This means that some characters in your corpus are not encountered anywhere in your transducer. You'll need to make sure there are words with them in your lexd file. This should result in more accurate coverage tests and a more useful list of unknown words.

Another reason you might get half words ending up in your top unknown words list is because some words contain punctuation characters, like a punctuation apostrophe (' or ’) instead of an alphabetic one (ʼ). The only real way to fix this is to change your corpus. You could also add the words, but really no words in your corpus should have punctuation characters in them—instead you'll want to accept those using spellrelax (see below).

Spellrelax

A spellrelax file lets you allow for typographical variance. For example, if your language has lots of diacritics, but it's common for people to not use the diacritics, then you'll want to map the non-diacritic versions to the diacritic versions. Alternatively, if your language uses a special apostrophe-like symbol, then you'll probably want to map other apostrophe symbols to this.

You can make a spellrelax file based on existing ones.

  • Some existing spellrelax files: ote and quc and uzb, all of which include apostrophe-like characters, and some of which include letters with/without diacritics and other useful examples.

You'll also need to adjust your Makefile.am to compile the spellrelax file and combine it with your analyser. This will involve the following:

  • Adjust your automorf.hfst line to look more like this:
$(LANG1).automorf.hfst: .deps/$(LANG1).LR.hfst $(BASENAME).$(LANG1).spellrelax
	hfst-invert $< | hfst-expand-equivalences -T $(BASENAME).$(LANG1).spellrelax | hfst-fst2fst -O -o $@
  • Make sure that your indents at the beginning of the second line are tabs and not spaces!
  • Then you'll have to reconfigure your compiling environment by running ./autogen.sh before running make again.

Postdix

Postdix can do phonology across word boundaries. You can look at the English postdix to see how the a/an alternation is dealt with.

The assignment

This assignment will be due at the end of the week during the 6th week (this semester: 23:59 on Friday, March 19th, 2021).

Initial evaluation

  1. If you're still working on your analyser, finish that first.
  2. Commit and push your code at this point, and then add a tag "analyser":
    • git tag analyser
    • Then push the tag to the repo: git push origin analyser
  3. Make a section on your Transducer wiki page for generator evaluation with a subsection title something like "initial evaluation of morphological generation" and note the number of passing and failing morphological analysis tests at this point using the morphology test method used for your analyser assignment.
    • Put existing sections under a new "Analyser evaluation" section to keep it separate from the generator evaluation.
  4. Also rerun the coverage test on your corpus, and add the current coverage number.
  5. Run a morph test again on your tests file, but this time with the -l option (to test morphological generation) instead of with -s (morphological analysis tests). You also don't need -i; the full command you'll probably want is morph-test -cl xyz.yaml | most. Put the number of passes and fails for the generation test in the same subsection on the wiki page.

The hard stuff

Continue development of your transducer, focussing on twol rules, so that two-thirds of your [minimum 50] tests pass.

  • Note that if every test passes analysis but has multiple forms generated, then exactly one-half of your generation tests will pass—this is simply due to the way the tests are counted (i.e., each form will be counted as both passing and failing). In such a situation, each form which subsequently is made to have only a correct form generated will reduce the number of failing tests by one, but not increase the number of passing tests. Hence, if all your tests analyse correctly, then you only need to write twol rules to make half of them generate correctly.
  • In another situation, half (or a little more) of your tests might be passing, and these same tests might generate correctly because you don't have any need for twol rules. If you do not need at least one twol rule (verify this with me!), then you should do extra work on your lexd file so that at least 85% of your tests pass.
  • If you already had some necessary twol rules implemented and didn't need to modify even one of them for this part of the assignment, then you only need to aim for 75% passing generation tests.
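
The arithmetic behind the first point can be checked with a few lines of Python. The sketch assumes the counting described above: every test analyses correctly, every test that still overgenerates counts as one pass plus one fail, and fixing a test removes one fail without adding a pass. The function name is hypothetical.

```python
from fractions import Fraction


def generation_pass_rate(total_tests, fixed):
    """Share of passing generation tests once `fixed` tests no longer overgenerate."""
    passes = total_tests          # the correct form is always among the outputs
    fails = total_tests - fixed   # one fail per test that still overgenerates
    return Fraction(passes, passes + fails)


start = generation_pass_rate(50, 0)     # nothing fixed yet
halfway = generation_pass_rate(50, 25)  # half of the tests fixed
```

With 50 tests, the rate starts at one-half and reaches two-thirds once 25 of them generate only the correct form.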

Also note:

  • Make sure that your code compiles with no errors, and preferably with no warnings!

Final evaluation

When you're done with the above, add a new subsection for "final evaluation of morphological generation" and add the following:

  • the number of passing and failing tests now,
  • the number of twol rules that you added (and a separate number for any that you might have already had that you modified) for this part of the assignment (even if zero because they were already all working). You can use git to compare your work since you started work on this using the "analyser" tag you made at the beginning:
    • to compare all your work: git diff analyser
    • to compare only a given file (e.g., your twol file): git diff analyser apertium-xyz.xyz.twol
  • Again run the coverage test on your corpus, and add the new coverage number.

Additional resources