Mixe/Transducer

Revision as of 15:07, 10 April 2022

GitHub repository: https://github.swarthmore.edu/ling073-sp22/ling073-mto (internal Swarthmore access may be required to view).

Analyser Evaluation

Notes

Most of the adjectives aren't passing. Maybe it has something to do with the saltillos. More generally, a morphtest sometimes fails even though the form is correctly analyzed when I run echo form | apertium -d . mto-morph by hand.

Some tests fail because we haven't accounted for various phonological processes. I'm not sure of the best way to deal with this, especially since I can't predict when such alternations take place (at least, I haven't figured out the pattern yet).

The way we've implemented verb morphology seems to work in general, but there's probably a better, cleaner way to approach it. Many verbs in Totontepec have stems that alternate depending on morphological environment. The issue isn't that there are alternations, but that there are so many of them, and they don't seem to follow a clear pattern across all verbs. The current solution is to have multi-columned verb stem LEXICONs and about 40 verb patterns. Not only does this look confusing and cluttered, but it makes troubleshooting verb analysis and generation more tedious.
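As a rough illustration of the multi-column approach, assuming the transducer is written in lexd (the stems and pattern names below are invented for illustration, not our actual entries): each row of a multi-column LEXICON lists one verb's stem alternants, and each PATTERN selects the column that surfaces in a given morphological environment.

```lexd
# Hypothetical sketch: column 1 = independent stem, column 2 = dependent stem.
PATTERN VerbIndep
VerbStem(1) IndepSuffix

PATTERN VerbDep
VerbStem(2) DepSuffix

LEXICON VerbStem(2)
tsum:tsum   tsuum:tsuum
pet:pet     peet:peet
```

With around 40 such patterns, the file grows quickly, which is exactly the clutter and troubleshooting tedium described above.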

Currently, we have 82 morphTests, 47 of which pass.
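For context, morphtests in this setup are usually written in a YAML file that maps an analysis to the surface form it should generate (and, in the other direction, analyze). A sketch of the format, with guessed paths and test forms drawn from the analyses discussed on this page; the exact layout of our mto.yaml may differ:

```yaml
Config:
  hfst:
    Gen: ../mto.autogen.hfst      # guessed path to the generator
    Morph: ../mto.automorph.hfst  # guessed path to the analyser

Tests:
  Adjectives:
    maas<adj>: maas
  Discourse particles:
    tseꞌe<disc><asrt>: tseꞌe
```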

Tests that don't pass (as of 10 April at 1pm):

  • most of the inflected transitive verbs
  • nouns and verbs with phonological alternations. We're not sure how to predict when such alternations will occur, so we're not sure of the best way to deal with them.
  • complex verbs. We currently have seven morphTests for verbs that include non-obligatory morphology. We haven't added those morphotactics to the transducer yet, mostly because there are about 20 possible morpheme spots per verb, so there are many additional morphemes to implement.
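For the complex verbs, lexd's optional pattern elements might keep the morphotactics manageable. A hypothetical sketch (all slot names invented, and the real verb template has many more slots):

```lexd
# Each trailing ? marks a non-obligatory slot, so one pattern line
# covers every combination of present and absent morphemes.
PATTERN VerbComplex
Negation? Aspect? VerbStem(1) PersonSuffix Plural?
```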

From the top unknown words in our corpus, we determined the analyses of the following:

  • dü <pro><perf>
  • tseꞌe <disc><asrt>
  • maas <adj>
  • ꞌax <disc>
  • juuꞌ <pro>
  • laata <n>

By adding these to our transducer, however, our coverage only went from 0.41 to 0.47. Something that seems to be happening is that some saltillos are being treated as word boundaries. This is why we're getting bits like e and tse as unknown words, when really, at least some of those tokens are tseꞌe. But I'm not sure why this is happening.
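A minimal sketch of the suspected tokenization problem, in Python with an invented alphabet pattern (the real tokenizer is whatever apertium's pipeline uses): if the modifier letter saltillo (U+A78C) isn't in the character set the tokenizer treats as word-internal, it acts like a word boundary and splits tseꞌe into tse and e.

```python
import re

# Modifier letter saltillo (U+A78C), used word-internally in Totontepec Mixe.
SALTILLO = "\uA78C"

def tokenize(text, alphabet_pattern):
    """Split text into maximal runs of alphabet characters."""
    return re.findall(alphabet_pattern, text)

sentence = "jaꞌa tseꞌe"

# If the alphabet omits the saltillo, it behaves as a word boundary:
print(tokenize(sentence, r"[a-zü]+"))               # ['ja', 'a', 'tse', 'e']

# Including U+A78C in the alphabet keeps such tokens whole:
print(tokenize(sentence, rf"[a-zü{SALTILLO}]+"))    # ['jaꞌa', 'tseꞌe']
```

If this is what's happening, the fragments e and tse in the unknown-word list would be exactly the pieces left on either side of a stripped saltillo.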

Evaluation

We currently have at least 78 stems in the transducer. It may be over twice as many, but I'm not sure whether to count stem alternations (as written in multi-column lexicons) as one each, or as alternations for one stem.

Current coverage over the corpus: 0.47 (47%).
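For reference, naive coverage here is the fraction of running corpus tokens that receive at least one analysis. A minimal sketch, where a toy set of known forms stands in for the transducer:

```python
def coverage(tokens, has_analysis):
    """Fraction of running corpus tokens that get at least one analysis."""
    analyzed = sum(1 for token in tokens if has_analysis(token))
    return analyzed / len(tokens)

# Toy stand-ins: a few known forms and a six-token "corpus".
lexicon = {"tseꞌe", "maas", "dü"}
corpus = ["tseꞌe", "maas", "tse", "e", "dü", "juu"]

print(coverage(corpus, lexicon.__contains__))  # 0.5
```

Note that this counts token occurrences, not distinct types, which is why adding a handful of very frequent words (like tseꞌe and dü) can move coverage by several points at once.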

Current list of top unknown words:

  • e
  • ax
  • tse
  • ve
  • juu
  • apiv
  • k
  • y
  • jats
  • em
  • üü
  • ëëts
  • èts
  • pi
  • müjit
  • kajha
  • jüdü
  • ijt

Note that most of these aren't actually words, but rather parts of words, or words that are just missing their saltillos. We're not sure why they aren't being analyzed correctly.

In mto.yaml, 47 out of 82 tests currently pass. In commonwords.yaml, 6 out of 20 tests currently pass.