Neo-Aramaic/Transducer

From LING073
Revision as of 13:36, 17 March 2018 by Eschalk1 (talk | contribs) (Notes)

The code for the transducer can be found in this GitHub repository.

Getting set up: Part One

Getting set up for this stage of the project was an ongoing process for us; while it was easy to bootstrap our transducer and start working with Apertium, we found that grammar tags, our Alphabet, and our tests/ all needed continuous updating. This was because our grammar rules used tags and characters not found in our defined tags or alphabet, and these changes had to trickle down at some point to tests/, too. We continue to update these to reflect the growing number of tags we use to complete our grammar coverage. We now have tags for verb features, parts of speech, an archiphoneme, person morphology, gender morphology, number morphology, case morphology, TAM morphology, pronoun morphology, and punctuation.

Our Alphabet includes Syriac characters, Latin characters, a null morpheme (useful for analyzing vowel diacritics), and punctuation.

Alphabet
ܐ ܒ ܓ ܕ ܗ ܘ ܙ ܚ ܛ ܝ ܟ ܠ ܡ ܢ ܣ ܥ ܦ ܨ ܩ ܪ ܫ ܬ
ܸ ܲ ܵ ܸ ܹ ܼ
◌:0
. ؛ ! - — ، ؟ ' " « » ” “ ( ] [ ) ܀ \ /
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z

This section also includes our archiphoneme, which is used to realize the vowel sound A as the alaf when it appears at the beginning or end of a word. This is helpful when working with stems where the alaf often disappears after prefixes and suffixes are added to the stem.

%{A%}:0
! %{A%}:ܐ
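
As a rough illustration of what the archiphoneme is meant to do, here is a minimal Python sketch of the intended behaviour described above: `{A}` surfaces as alaf at the start or end of a word and as nothing elsewhere. This is not our twol code; the function name `realize` and the Latin placeholder stems are hypothetical, and the real rules may need to look at morpheme boundaries rather than raw word edges.

```python
ALAF = "\u0710"   # Syriac letter alaph
ARCHI = "{A}"     # archiphoneme symbol as it appears in underlying forms

def realize(underlying: str) -> str:
    """Surface {A} as alaf at a word edge and as nothing elsewhere (assumed behaviour)."""
    out = []
    i = 0
    n = len(underlying)
    while i < n:
        if underlying.startswith(ARCHI, i):
            at_start = i == 0
            at_end = i + len(ARCHI) == n
            if at_start or at_end:
                out.append(ALAF)
            # word-internal {A} is simply dropped
            i += len(ARCHI)
        else:
            out.append(underlying[i])
            i += 1
    return "".join(out)

# Hypothetical placeholder stems, not real Neo-Aramaic forms
print(realize("{A}bc"))    # alaf kept word-initially
print(realize("x{A}bc"))   # alaf dropped after a prefix
```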

The hard stuff: Part One

When we moved on to the hardest part of putting together a morphological transducer, our biggest challenge was working with the many prefixes and infixes found in Neo-Aramaic. We focused in the beginning on easy forms, like plural nouns, that used mostly suffixes in their morphology. When we moved on to our prefix morphology, we used the solution suggested on the course wiki, where our twoc file only processes forms with a plus and minus version of any given feature:

Rules

! A [-x] flag is only allowed if a matching [+x] flag follows later in the form.
"Remove paths without matching suffix feature"
Fx:0 /<= _ ;
    except
        _ :* Fy:0 ;
    where Fx in ( %[%-hab%] %[%-fut%] %[%-obl%] )
          Fy in ( %[%+hab%] %[%+fut%] %[%+obl%] )
    matched ;

! A [+x] flag is only allowed if a matching [-x] flag appears earlier in the form.
"Remove paths without matching prefix feature"
Fx:0 /<= _ ;
    except
        Fy:0 :* _ ;
    where Fy in ( %[%-hab%] %[%-fut%] %[%-obl%] )
          Fx in ( %[%+hab%] %[%+fut%] %[%+obl%] )
    matched ;

This guarantees that if a verb picks up a -hab or -fut tag, the form is analyzed as having a habitual or future prefix only if it also picks up a matching +hab or +fut tag later on. The same pairing applies to -obl and +obl.
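
To make the effect of the two rules concrete, here is a small Python sketch that simulates the filtering on a string of symbols. It is only a model of the behaviour, not how hfst-twolc works internally; the function name `path_is_allowed` and the placeholder `stem` are hypothetical.

```python
import re

# Matches flag symbols such as [-fut] or [+hab]
FLAG = re.compile(r"\[([+-])(\w+)\]")

def path_is_allowed(path: str) -> bool:
    """Keep a path only if every [-x] has a later [+x] and every [+x] has an earlier [-x]."""
    flags = [(m.group(1), m.group(2)) for m in FLAG.finditer(path)]
    for i, (sign, name) in enumerate(flags):
        if sign == "-":
            # first rule: a minus flag needs a matching plus flag to its right
            if ("+", name) not in flags[i + 1:]:
                return False
        else:
            # second rule: a plus flag needs a matching minus flag to its left
            if ("-", name) not in flags[:i]:
                return False
    return True

print(path_is_allowed("[-fut]stem[+fut]"))  # matched pair
print(path_is_allowed("[-fut]stem"))        # unmatched minus flag
print(path_is_allowed("[-hab]stem[+fut]"))  # mismatched features
```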

Evaluation

We have 2556 tokenised words in our corpus, with a coverage of 9.66%. We have about 37 stems in our transducer (definitely a place we should be concentrating work right now).
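
For a sense of scale, the coverage figure can be turned back into a token count. The sketch below assumes coverage means analysed tokens divided by total tokens, which is the usual definition but is not spelled out above; the function names are hypothetical.

```python
def coverage(analysed: int, total: int) -> float:
    """Coverage as a percentage of tokens that receive at least one analysis (assumed definition)."""
    return 100.0 * analysed / total

def tokens_for(percent: float, total: int) -> int:
    """Approximate number of analysed tokens implied by a coverage percentage."""
    return round(percent / 100.0 * total)

TOTAL_TOKENS = 2556  # from the evaluation above

analysed = tokens_for(9.66, TOTAL_TOKENS)
print(analysed)                                     # roughly how many tokens we currently analyse
print(round(coverage(analysed, TOTAL_TOKENS), 2))   # back to a percentage
```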

Top unknown words in the corpus:

114 ܐ

86 ܕ

66 ܠܹܗ

62 ܡ

54 ܒ

46 ܢ

44 ܘ

44 ܠ

39 ܡܘܼܠܸܕ

36 ܝ

31 ܪ

30 ܡܢ

23 ܵܐ

21 ܗ

20 ܹܐ

18 ܠܐ

17 ܫ

14 ܵ

13 ܚ

12 ܡܕܝܢܬܐ
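
A list like the one above can be produced by counting the tokens that get no analysis. The following is a minimal sketch, not the script we actually ran: it assumes the unknown tokens are already available as a Python list, and every function name is hypothetical. The second helper flags tokens with at most one base letter (ignoring combining vowel marks), since many entries in the list are a single letter.

```python
import unicodedata
from collections import Counter

def top_unknown(unknown_tokens, n=20):
    """Return the n most frequent unknown tokens as (token, count) pairs."""
    return Counter(unknown_tokens).most_common(n)

def base_letter_count(token):
    """Count characters that are not nonspacing combining marks (e.g. vowel points)."""
    return sum(1 for ch in token if unicodedata.category(ch) != "Mn")

def looks_like_fragment(token):
    """Flag tokens with at most one base letter as probable fragments rather than words."""
    return base_letter_count(token) <= 1

# Hypothetical toy input, not our corpus
tokens = ["\u0710", "\u0721\u0722", "\u0710", "\u0735\u0710"]
for token, count in top_unknown(tokens):
    print(count, token, "fragment?" if looks_like_fragment(token) else "")
```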

Notes

The first unknown word we're going to analyze is ܡܘܼܠܸܕ, which is the ninth most common unknown word and appears 39 times in the corpus. This one is pretty easy because we were able to find it in the Sureth dictionary, where it is translated as "to breed, to beget, to conceive." It's not surprising that it appears so often in our corpus, since the Bible is so concerned with lineage.

ܡܘܼܠܸܕ:ܡܘܼܠܸܕ PresentVerbInfl; ! molidh "to beget, to conceive"

Adding this form to our transducer bumped our coverage up to 12.03%!

However, besides this word, we started noticing that a LOT of these uncovered words are single letters; they don't seem to be actual words per se. At first we couldn't figure out why so many letters would be sprinkled through our corpus, except as an error. It turns out that they are the result of a copy-paste error, as seen in this sentence taken from a PDF portion of our corpus:


53 of 83 tests pass now. Yay!