Neo-Aramaic/Transducer

From LING073
Revision as of 23:29, 16 March 2018 by Eschalk1 (talk | contribs) (The hard stuff: Part One)


The code for the transducer can be found in this GitHub repository.

Getting set up: Part One

Getting set up for this stage of the project was an ongoing process for us. It was easy to bootstrap our transducer and start working with Apertium, but our grammar tags, our Alphabet, and our tests/ all needed continuous updating: our grammar rules kept using tags and characters that were not yet in our defined tags or alphabet, and each of these changes eventually had to trickle down to tests/ as well. We continue to update all three as the number of tags we need for our grammar coverage grows. We now have tags for verb features, parts of speech, an archiphoneme, person morphology, gender morphology, number morphology, case morphology, TAM morphology, pronoun morphology, and punctuation.

Our Alphabet includes Syriac characters, Latin characters, a null morpheme (useful for analyzing vowel diacritics), and punctuation.

Alphabet
ܐ ܒ ܓ ܕ ܗ ܘ ܙ ܚ ܛ ܝ ܟ ܠ ܡ ܢ ܣ ܥ ܦ ܨ ܩ ܪ ܫ ܬ
ܸ ܲ ܵ ܸ ܹ ܼ
◌:0
. ؛ ! - — ، ؟ ' " « » ” “ ( ] [ ) ܀ \ /
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z

This section also includes our archiphoneme, which is used to realize the vowel sound A as the alaf when it appears at the beginning or end of a word. This is helpful when working with stems where the alaf often disappears after prefixes and suffixes are added to the stem.

%{A%}:0
! %{A%}:ܐ
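The behavior described for the archiphoneme can be illustrated outside of the transducer. The following is a minimal Python sketch, not our twol code: it assumes the archiphoneme is written `{A}` in an underlying form, surfaces it as alaf at the start or end of the word, and deletes it everywhere else. The function name `realize` and the Latin test strings are hypothetical placeholders, not real Neo-Aramaic stems.

```python
ALAF = "\u0710"  # SYRIAC LETTER ALAPH
ARCHI = "{A}"    # assumed spelling of the archiphoneme in an underlying form


def realize(form: str) -> str:
    """Surface {A} as alaf at a word edge and delete it word-internally."""
    out = []
    i = 0
    while i < len(form):
        if form.startswith(ARCHI, i):
            at_start = i == 0
            at_end = i + len(ARCHI) == len(form)
            if at_start or at_end:
                out.append(ALAF)
            # Word-internal {A} contributes nothing to the surface form.
            i += len(ARCHI)
        else:
            out.append(form[i])
            i += 1
    return "".join(out)


# Placeholder strings only: "x" and "y" stand in for a prefix and a suffix.
print(realize("{A}bc") == ALAF + "bc")   # edge {A} surfaces as alaf
print(realize("x{A}bc") == "xbc")        # {A} after a prefix is deleted
```

The sketch treats the whole string as one word, so "edge" here simply means the first or last position of the string.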

The hard stuff: Part One

When we moved on to the hardest part of putting together a morphological transducer, our biggest challenge was working with the many prefixes and infixes found in Neo-Aramaic. We focused at first on easy forms, like plural nouns, whose morphology is mostly suffixal. For our prefix morphology, we used the solution suggested on the course wiki, where our twoc file only keeps forms in which the minus version of a given feature is paired with its plus version:

Rules

"Remove paths without matching suffix feature"
Fx:0 /<= _ ;
    except
        _ :* Fy:0 ;
    where Fx in ( %[%-hab%] %[%-fut%] %[%-obl%] )
          Fy in ( %[%+hab%] %[%+fut%] %[%+obl%] )
    matched ;

"Remove paths without matching prefix feature"
Fx:0 /<= _ ;
    except
        Fy:0 :* _ ;
    where Fy in ( %[%-hab%] %[%-fut%] %[%-obl%] )
          Fx in ( %[%+hab%] %[%+fut%] %[%+obl%] )
    matched ;

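This guarantees that a form which picks up a -hab, -fut, or -obl tag is analyzed as having the corresponding prefix only if it also picks up the matching +hab, +fut, or +obl tag later in the form. Likewise, a form with one of the plus tags is kept only if the matching minus tag appears earlier.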
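The effect of the two rules can be sketched in plain Python. This is an illustration of the filtering idea only, not twol and not how the compiled transducer works internally: it assumes a path is a list of symbols, that flags are written `[-hab]`, `[+hab]`, and so on, and that `path_is_kept` is a hypothetical helper name. The word `stem` is a placeholder.

```python
FEATURES = ("hab", "fut", "obl")


def path_is_kept(symbols):
    """Return True if every minus flag has a later matching plus flag,
    and every plus flag has an earlier matching minus flag."""
    for feat in FEATURES:
        minus, plus = f"[-{feat}]", f"[+{feat}]"
        for i, sym in enumerate(symbols):
            # First rule: a minus flag needs a matching plus flag after it.
            if sym == minus and plus not in symbols[i + 1:]:
                return False
            # Second rule: a plus flag needs a matching minus flag before it.
            if sym == plus and minus not in symbols[:i]:
                return False
    return True


print(path_is_kept(["[-fut]", "stem", "[+fut]"]))  # matched pair
print(path_is_kept(["[-fut]", "stem"]))            # minus flag with no plus flag
print(path_is_kept(["[-hab]", "stem", "[+fut]"]))  # mismatched features
```

Only the first path survives; the other two are removed because a flag is left without its partner.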

Evaluation

Number of tokenised words in the corpus: 2556

Coverage: 9.66%

Top unknown words in the corpus:

114 ܐ

86 ܕ

66 ܠܹܗ

62 ܡ

54 ܒ

46 ܢ

44 ܘ

44 ܠ

39 ܡܘܼܠܸܕ

36 ܝ

31 ܪ

30 ܡܢ

23 ܵܐ

21 ܗ

20 ܹܐ

18 ܠܐ

17 ܫ

14 ܵ

13 ܚ

12 ܡܕܝܢܬܐ
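For reference, figures like the ones above can be computed along the following lines. This is a minimal Python sketch and not the script we actually ran: it assumes coverage means the share of tokens that receive at least one analysis, and `coverage_report` and `is_known` are hypothetical names. The toy tokens are placeholders, not corpus data.

```python
from collections import Counter


def coverage_report(tokens, is_known, top_n=20):
    """Return (coverage percentage, most frequent unknown tokens).

    tokens   -- list of tokenised words
    is_known -- callable returning True if a token gets at least one analysis
    """
    unknown = Counter(t for t in tokens if not is_known(t))
    known_count = len(tokens) - sum(unknown.values())
    pct = 100.0 * known_count / len(tokens) if tokens else 0.0
    return pct, unknown.most_common(top_n)


# Toy example: pretend only "c" is analysed by the transducer.
toy_tokens = ["a", "b", "a", "c", "d"]
pct, top_unknown = coverage_report(toy_tokens, lambda t: t == "c", top_n=1)
print(f"Coverage: {pct:.2f}%")
print(top_unknown)
```

In practice `is_known` would wrap a lookup against the compiled analyser; the lambda here only stands in for that step.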

Notes

53 of 83 tests pass now. Yay!