Mixe/Transducer

From LING073
Revision as of 15:01, 10 April 2022 by Eresend1 (talk | contribs) (Notes)

GitHub repository (internal Swarthmore access may be required to view).

Notes

MorphTests in mto.yaml

Most of the adjectives aren't passing, possibly because of the saltillos. More generally, a morphTest sometimes fails even though the form is analyzed correctly when I run echo form | apertium -d . mto-morph

Some tests fail because we haven't accounted for various phonological processes. I'm not sure of the best way to deal with this, especially since I can't yet predict when such alternations take place (I haven't figured out the pattern).

The way we've implemented verb morphology seems to work in general, but there's probably a better, cleaner way to approach it. Many verbs in Totontepec have stems that alternate depending on morphological environment. The issue isn't that there are alternations, but that there are so many of them, and they don't seem to follow a clear pattern across all verbs. The current solution is to have multi-columned verb stem LEXICONs and about 40 verb patterns. Not only does this look confusing and cluttered, but it makes troubleshooting verb analysis and generation more tedious.

Currently, we have 82 morphTests, 47 of which pass.

Tests that don't pass (as of 4-10 at 1pm):

  • most of the inflected transitive verbs
  • nouns and verbs with phonological alternations. We're not sure how to predict when such alternations will occur, so we're not sure of the best way to deal with them.
  • complex verbs. We currently have seven morphTests for verbs that include non-obligatory morphology. We haven't added those morphotactics to the transducer yet, mostly because there are about 20 possible morpheme spots per verb, so there are many additional morphemes to implement.

From the top unknown words in our corpus, we determined the analyses of the following:

  • dü <pro><perf>
  • tseꞌe <disc><asrt>
  • maas <adj>
  • ꞌax <disc>
  • juuꞌ <pro>
  • laata <n>

Adding these to our transducer, however, only raised our coverage from 0.41 to 0.47. Some saltillos seem to be treated as word boundaries, which would explain why we're getting fragments like e and tse as unknown words when at least some of those tokens are really tseꞌe. I'm not sure why this is happening.
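One way to probe the saltillo problem is to check whether the saltillo is in the set of characters the tokenizer treats as letters. The following is a minimal Python sketch, not the project's actual tokenizer: tokenize and tally_saltillo_like are hypothetical helpers, the letter set is an incomplete assumption, and the list of apostrophe-like code points is a guess at what a corpus might contain. It shows that leaving the saltillo out of the letter set produces exactly the fragments tse and e.

```python
import re
import unicodedata
from collections import Counter

SALTILLO = "\ua78c"  # LATIN SMALL LETTER SALTILLO

# Apostrophe-like code points a corpus might use for the saltillo (assumed list).
LOOKALIKES = ("\u0027", "\u2019", "\u02bc", "\ua78b", "\ua78c")

# Assumed, incomplete set of letters; the real alphabet lives in the transducer.
BASE_LETTERS = "a-zäëüè"


def tokenize(text, extra_letters=""):
    """Split text into runs of letters; anything outside the set is a boundary."""
    pattern = "[" + BASE_LETTERS + re.escape(extra_letters) + "]+"
    return re.findall(pattern, text.lower())


def tally_saltillo_like(text):
    """Count each apostrophe-like code point, keyed by its Unicode name."""
    counts = Counter(ch for ch in text if ch in LOOKALIKES)
    return {unicodedata.name(ch): n for ch, n in counts.items()}


if __name__ == "__main__":
    word = "tse" + SALTILLO + "e"
    print(tokenize(word))            # saltillo acts as a boundary
    print(tokenize(word, SALTILLO))  # saltillo kept inside the word
    print(tally_saltillo_like(word + " juu'"))
```

Running tally_saltillo_like over the raw corpus would show whether the text mixes several code points for the saltillo, which is one possible reason a form could be analyzed in one place and not in another.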

Evaluation

We currently have at least 78 stems in the transducer. It may be over twice as many, but I'm not sure whether to count stem alternations (as written in multi-column lexicons) as one each, or as alternations for one stem.
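Both ways of counting can be reported side by side. The sketch below is a rough Python illustration that assumes the lexicons are written in lexd, with headers of the form LEXICON Name(N) for an N-column lexicon and one entry per line; the source does not say this outright. count_stems is a hypothetical helper and the sample entries are placeholders, not real Mixe stems. It returns, per lexicon, the number of rows (each stem counted once) and the number of cells (each alternant counted separately).

```python
import re

# Header such as "LEXICON VerbStem(2)"; the column count is optional.
LEXICON_HEADER = re.compile(r"^LEXICON\s+([^\s(]+)(?:\((\d+)\))?")


def count_stems(lexd_source):
    """Return {lexicon: (rows, cells)}; cells = rows * declared column count."""
    counts = {}
    current = None
    columns = 1
    for raw in lexd_source.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (ignores escaped \#)
        if not line:
            continue
        header = LEXICON_HEADER.match(line)
        if header:
            current = header.group(1)
            columns = int(header.group(2) or 1)
            counts[current] = (0, 0)
        elif line.startswith("PATTERN"):
            current = None  # lines under a pattern header are not stems
        elif current is not None:
            rows, cells = counts[current]
            counts[current] = (rows + 1, cells + columns)
    return counts


if __name__ == "__main__":
    # Placeholder entries for illustration only.
    sample = """
PATTERNS
VerbStem(1) Suffix

LEXICON VerbStem(2)
aaa aab   # one stem, two alternants
bbb bbc

LEXICON Suffix
x
"""
    for name, (rows, cells) in count_stems(sample).items():
        print(name, "rows:", rows, "cells:", cells)
```

The row count corresponds to treating alternations as belonging to one stem, and the cell count to treating each alternant as its own stem.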

Current coverage over corpus: 0.47.
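For reference, naive coverage is the share of tokens that receive at least one analysis. A rough Python sketch of that calculation over analyzer output follows. It assumes the Apertium stream format, in which each token appears as ^surface/analysis$ and the analysis of an unknown token begins with *; the function name coverage and the sample string are illustrative and not taken from the project's scripts.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage(analyzed_text):
    """Return (fraction of tokens with an analysis, Counter of unknown forms)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analyzed_text):
        total += 1
        if analyses.startswith("*"):  # analyzer marks unknown tokens with *
            unknown[surface] += 1
    known = total - sum(unknown.values())
    return (known / total if total else 0.0), unknown


if __name__ == "__main__":
    # Made-up analyzer output for illustration only.
    sample = "^maas/maas<adj>$ ^tse/*tse$ ^e/*e$ ^laata/laata<n>$"
    ratio, unknown = coverage(sample)
    print(round(ratio, 2))
    print(unknown.most_common(5))
```

The Counter of unknown forms is the same kind of frequency list as the top unknown words, so the two figures can come from a single pass over the analyzed corpus.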

Current list of top unknown words:

  • e
  • ax
  • tse
  • ve
  • juu
  • apiv
  • k
  • y
  • jats
  • em
  • üü
  • ëëts
  • èts
  • pi
  • müjit
  • kajha
  • jüdü
  • ijt

Note that most of these aren't actually words, but rather parts of words, or words that are missing their saltillos. I'm not sure why they aren't being analyzed correctly.

In mto.yaml, 47 out of 82 tests currently pass. In commonwords.yaml, 6 out of 20 tests currently pass.