Mixe/Transducer

From LING073

GitHub repository (internal Swarthmore access may be required to view).

Analyser Evaluation

Notes

Most of the adjectives aren't passing. This may have something to do with the saltillos. More generally, a morphtest sometimes fails even though the form is correctly analyzed when I run echo form | apertium -d . mto-morph.

Some tests fail because we haven't accounted for various phonological processes. I'm not sure of the best way to deal with this, especially since I can't predict when such alternations take place (at least, I haven't figured out the pattern yet).

The way we've implemented verb morphology seems to work in general, but there's probably a better, cleaner way to approach it. Many verbs in Totontepec have stems that alternate depending on morphological environment. The issue isn't that there are alternations, but that there are so many of them, and they don't seem to follow a clear pattern across all verbs. The current solution is to have multi-columned verb stem LEXICONs and about 40 verb patterns. Not only does this look confusing and cluttered, but it makes troubleshooting verb analysis and generation more tedious.

Currently, we have 82 morphTests, 47 of which pass.

Tests that don't pass (as of 4-10 at 1pm):

  • most of the inflected transitive verbs
  • nouns and verbs with phonological alternations. We're not sure how to predict when such alternations will occur, so we're not sure of the best way to deal with them.
  • complex verbs. We currently have seven morphTests for verbs that include non-obligatory morphology. We haven't added those morphotactics to the transducer yet, mostly because there are about 20 possible morpheme spots per verb, so there are many additional morphemes to implement.

From the top unknown words in our corpus, we determined the analyses of the following:

  • dü <pro><perf>
  • tseꞌe <disc><asrt>
  • maas <adj>
  • ꞌax <disc>
  • juuꞌ <pro>
  • laata <n>

Adding these to our transducer, however, only raised our coverage from 0.41 to 0.47. Something that seems to be happening is that some saltillos are being treated as word boundaries. This is why we're getting fragments like e and tse as unknown words, when at least some of those tokens are really tseꞌe. I'm not sure why this is happening.

Evaluation

We currently have at least 78 stems in the transducer. It may be over twice as many, but I'm not sure whether to count stem alternations (as written in multi-column lexicons) as one each, or as alternations for one stem.

Current coverage over corpus: 0.47 (47%).

Current list of top unknown words:

  • e
  • ax
  • tse
  • ve
  • juu
  • apiv
  • k
  • y
  • jats
  • em
  • üü
  • ëëts
  • èts
  • pi
  • müjit
  • kajha
  • jüdü
  • ijt

Note that most of these aren't actually words, but rather parts of words, or words just missing saltillos. Not sure why they aren't being correctly analyzed.
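A list like the one above can be tallied from analyser output. This is a minimal sketch, not the script that produced the list: it assumes the output is in Apertium stream format, where an unanalysed token appears as ^form/*form$. The names UNKNOWN_UNIT and top_unknown are made up for illustration, and the sample string is invented.

```python
import re
from collections import Counter

# Matches one lexical unit whose analysis is unknown, e.g. ^tse/*tse$
# (assumes plain Apertium stream format with no escaped characters).
UNKNOWN_UNIT = re.compile(r"\^([^/$^]+)/\*[^$]*\$")

def top_unknown(analysed, n=20):
    """Return the n most frequent unanalysed surface forms with their counts."""
    return Counter(UNKNOWN_UNIT.findall(analysed)).most_common(n)

# Invented sample: two unknown fragments and one analysed word.
sample = "^tse/*tse$ ^e/*e$ ^maas/maas<adj>$ ^e/*e$"
print(top_unknown(sample))  # [('e', 2), ('tse', 1)]
```

Fragments such as e and tse rise to the top of this kind of tally whenever a word is split before it reaches the analyser, which matches what the list shows.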

In mto.yaml, 47 out of 82 tests currently pass. In commonwords.yaml, 6 out of 20 tests currently pass.

Generator Evaluation

Initial evaluation of morphological generation

Upon replacing all apostrophes in mto.yaml with saltillos, the number of passing tests jumped from 47 to 62. While it's good that the issue wasn't due to errors in our patterns, this does mean that our current spellrelax rule for handling apostrophes and saltillos isn't working correctly. I'll try to fix that during this morphological generation assignment.

  • Total morphological analysis tests: 82
  • Number passing: 62
  • Number failing: 20

I also went into the monolingual mto corpus and replaced all apostrophes with saltillos. This seems to have fixed the issue mentioned above, where segments of words were being analyzed as full words (and thus getting counted in the unknown forms). This has also technically reduced our corpus coverage, from 0.47 to 0.42, but this number seems to be much more accurate.
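The replacement described above can be sketched in a few lines. This is only an illustration under assumptions: the set of apostrophe-like characters in APOSTROPHES is a guess at what the corpus contains, and the regex tokenizer is a toy stand-in, not how apertium itself tokenizes. It does show one way an apostrophe can split tseꞌe into tse and e while the saltillo (U+A78C, a letter) keeps the word whole.

```python
import re

SALTILLO = "\ua78c"  # LATIN SMALL LETTER SALTILLO
# Characters assumed to stand in for the saltillo (an assumption; adjust to the corpus).
APOSTROPHES = "'\u2019"

def normalize_saltillos(text):
    """Replace every apostrophe-like character with the saltillo."""
    return text.translate({ord(ch): SALTILLO for ch in APOSTROPHES})

def tokenize(text):
    """Toy tokenizer: each run of Unicode word characters is one token."""
    return re.findall(r"\w+", text)

raw = "tse'e"
print(tokenize(raw))                       # ['tse', 'e']
print(tokenize(normalize_saltillos(raw)))  # ['tseꞌe']
```

Normalizing the corpus this way is a workaround. A working spellrelax rule would let the analyser accept both spellings without editing the text.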

Now, the top unknown words are actually words:

  • veꞌem
  • jats
  • ꞌëëts
  • yꞌijt
  • veꞌm
  • müjit
  • kahja
  • and so on

Initial generation test:

  • Total: 120
  • Passes: 67
  • Fails: 53

Final evaluation of morphological generation

Number of passing and failing generation tests:

  • Total: 89
  • Passes: 75
  • Fails: 14

Number of added twol rules: 5

Current corpus coverage: 317/724 (44%).
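For reference, a coverage figure of this kind can be computed as known tokens over total tokens. The sketch below is not the script that produced the number above. It assumes analyser output in Apertium stream format (^surface/analysis$, with * marking an unknown token), and the names UNIT and coverage are made up for illustration.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (assumes no escaped characters inside the unit).
UNIT = re.compile(r"\^([^/$^]+)/([^$]*)\$")

def coverage(analysed):
    """Return (known, total) token counts for a string of analyser output."""
    units = UNIT.findall(analysed)
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known, len(units)

# Invented two-token sample: one analysed word, one unknown word.
known, total = coverage("^maas/maas<adj>$ ^jats/*jats$")
print(f"{known}/{total} = {known / total:.0%}")  # 1/2 = 50%

# The figure reported above: 317 known tokens out of 724.
print(f"{317 / 724:.0%}")  # 44%
```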