Neo-Aramaic/Transducer
The code for the transducer can be found in this GitHub repository.
Getting set up: Part One
Getting set up for this stage of the project was an ongoing process for us. While it was easy to bootstrap our transducer and start working with Apertium, we found that our grammar tags, our Alphabet, and our tests/ all needed continuous updating. This was because our grammar rules used tags and characters not found in our defined tags or alphabet, and these changes eventually had to trickle down to tests/, too. We continue to update these to reflect the growing number of tags we use to complete our grammar coverage. We now have tags for verb features, parts of speech, an archiphoneme, person morphology, gender morphology, number morphology, case morphology, TAM morphology, pronoun morphology, and punctuation.
Our Alphabet includes Syriac characters, Latin characters, a null morpheme (useful for analyzing vowel diacritics), and punctuation.
Alphabet ܐ ܒ ܓ ܕ ܗ ܘ ܙ ܚ ܛ ܝ ܟ ܠ ܡ ܢ ܣ ܥ ܦ ܨ ܩ ܪ ܫ ܬ ܸ ܲ ܵ ܸ ܹ ܼ ◌:0 . ؛ ! - — ، ؟ ' " « » ” “ ( ] [ ) ܀ \ / A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z
This section also includes our archiphoneme, which is used to realize the vowel sound A as the alaf when it appears at the beginning or end of a word. This is helpful when working with stems where the alaf often disappears after prefixes and suffixes are added to the stem.
%{A%}:0 ! %{A%}:ܐ
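The behaviour described above can be illustrated with a minimal Python sketch. It is not our twol rule, which is not reproduced here. It only assumes that the archiphoneme is written as the multi-character symbol {A}, that it surfaces as the alaf at either edge of the word, and that it surfaces as nothing anywhere else. The function name realize_archiphoneme and the Latin placeholder strings are hypothetical and stand in for real stems and affixes.

```python
ALAF = "\u0710"   # Syriac letter alaf (ܐ)
ARCHI = "{A}"     # archiphoneme symbol, as it appears in the underlying form


def realize_archiphoneme(underlying: str) -> str:
    """Map an underlying form to a surface form.

    Assumption: {A} becomes alaf when it is the first or last symbol
    of the word, and is deleted in every other position.
    """
    out = []
    i = 0
    n = len(underlying)
    while i < n:
        if underlying.startswith(ARCHI, i):
            at_start = i == 0
            at_end = i + len(ARCHI) == n
            if at_start or at_end:
                out.append(ALAF)
            # Word-internal {A} is simply skipped (realized as nothing).
            i += len(ARCHI)
        else:
            out.append(underlying[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # "stem" and "pre" are placeholders, not real Neo-Aramaic material.
    print(realize_archiphoneme("{A}stem"))     # alaf kept at the start
    print(realize_archiphoneme("pre{A}stem"))  # alaf dropped after a prefix
```

Once a prefix is attached, {A} is no longer word-initial, so the sketch drops it, which matches the disappearing alaf described above.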
The hard stuff: Part One
When we moved on to the hardest part of putting together a morphological transducer, our biggest challenge was working with the many prefixes and infixes found in Neo-Aramaic. In the beginning we focused on easy forms, like plural nouns, whose morphology mostly uses suffixes. When we moved on to our prefix morphology, we used the solution suggested on the course wiki, in which our twoc file only keeps forms that carry both the minus and the plus version of a given feature:
Rules

"Remove paths without matching suffix feature"
Fx:0 /<= _ ;
    except
        _ :* Fy:0 ;
    where Fx in ( %[%-hab%] %[%-fut%] %[%-obl%] )
          Fy in ( %[%+hab%] %[%+fut%] %[%+obl%] )
    matched ;

"Remove paths without matching prefix feature"
Fx:0 /<= _ ;
    except
        Fy:0 :* _ ;
    where Fy in ( %[%-hab%] %[%-fut%] %[%-obl%] )
          Fx in ( %[%+hab%] %[%+fut%] %[%+obl%] )
    matched ;
This guarantees that if a verb picks up a -hab or -fut tag, the form is only analyzed as having a habitual or future prefix if it also picks up the matching +hab or +fut tag later on.
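The effect of the two rules above can be sketched in Python. This is only an illustration of the filtering logic, not how the compiled transducer works internally. It assumes a path is a flat list of symbols, and the function name path_is_kept and the placeholder symbol "stem" are hypothetical.

```python
FEATURES = ("hab", "fut", "obl")


def path_is_kept(symbols):
    """Return True if every feature symbol on the path has its partner.

    Mirrors the two rules:
      * a [-f] symbol is only allowed if a [+f] follows it somewhere later;
      * a [+f] symbol is only allowed if a [-f] precedes it somewhere earlier.
    """
    for feat in FEATURES:
        minus, plus = f"[-{feat}]", f"[+{feat}]"
        for i, sym in enumerate(symbols):
            if sym == minus and plus not in symbols[i + 1:]:
                return False  # prefix feature without matching suffix feature
            if sym == plus and minus not in symbols[:i]:
                return False  # suffix feature without matching prefix feature
    return True


if __name__ == "__main__":
    print(path_is_kept(["[-fut]", "stem", "[+fut]"]))  # matched pair
    print(path_is_kept(["[-fut]", "stem"]))            # prefix with no partner
    print(path_is_kept(["[-fut]", "stem", "[+hab]"]))  # mismatched features
```

Because the rules are declared as matched, a -fut symbol can only be licensed by +fut and never by +hab or +obl, which is why the sketch checks each feature separately.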
Evaluation
Number of tokenised words in the corpus: 2556
Coverage: 9.66%
Top unknown words in the corpus:
114 ܐ
86 ܕ
66 ܠܹܗ
62 ܡ
54 ܒ
46 ܢ
44 ܘ
44 ܠ
39 ܡܘܼܠܸܕ
36 ܝ
31 ܪ
30 ܡܢ
23 ܵܐ
21 ܗ
20 ܹܐ
18 ܠܐ
17 ܫ
14 ܵ
13 ܚ
12 ܡܕܝܢܬܐ
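For reference, figures like the ones above can be computed along the following lines. This is a rough sketch and not the evaluation script we actually ran. It assumes the corpus is already tokenised into a list of strings, and that the analyser is available as a hypothetical is_known callable that returns True when a token receives at least one analysis.

```python
from collections import Counter


def coverage_report(tokens, is_known, top_n=20):
    """Return (coverage in percent, most frequent unknown tokens).

    tokens   -- list of tokenised words from the corpus
    is_known -- hypothetical callable: True if the transducer analyses the token
    top_n    -- how many unknown tokens to list
    """
    total = len(tokens)
    unknown = Counter(t for t in tokens if not is_known(t))
    known = total - sum(unknown.values())
    coverage = 100.0 * known / total if total else 0.0
    return coverage, unknown.most_common(top_n)


if __name__ == "__main__":
    # Toy data with placeholder tokens, not taken from our corpus.
    toy_tokens = ["a", "b", "b", "c"]
    toy_lexicon = {"a"}
    pct, top_unknown = coverage_report(toy_tokens, toy_lexicon.__contains__)
    print(f"Coverage: {pct:.2f}%")
    for word, count in top_unknown:
        print(count, word)
```

Read this way, a coverage of 9.66% over 2556 tokens corresponds to roughly 247 tokens that received an analysis.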
Notes
53 of 83 tests pass now. Yay!