Neo-Aramaic/Transducer

The code for the transducer can be found in this GitHub repository.

Getting set up: Part One

Getting set up for this stage of the project was an ongoing process for us; while it was easy to bootstrap our transducer and start working with Apertium, we found that our grammar tags, our Alphabet, and our tests/ directory all needed continuous updating. This was because our grammar rules used tags and characters not yet declared in our tag set or alphabet, and those changes eventually had to trickle down to tests/ as well. We continue to update these files to reflect the growing number of tags we use to complete our grammar coverage. We now have tags for verb features, parts of speech, an archiphoneme, and person, gender, number, case, TAM, and pronoun morphology, as well as punctuation.
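As a rough illustration, these tags are declared as multicharacter symbols at the top of our lexc file. A hedged sketch (the tags shown here are a small illustrative subset, not our full inventory):

! Illustrative subset of our tag declarations
Multichar_Symbols
%<n%>                 ! noun
%<v%>                 ! verb
%<m%> %<f%>           ! gender
%<sg%> %<pl%>         ! number
%<p1%> %<p2%> %<p3%>  ! person
%<hab%> %<fut%>       ! TAM
%{A%}                 ! archiphoneme
%[%-hab%] %[%+hab%]   ! prefix-matching flags (see below)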

Our Alphabet includes Syriac characters, Latin characters, a null morpheme (useful for analyzing vowel diacritics), and punctuation.

Alphabet
ܐ ܒ ܓ ܕ ܗ ܘ ܙ ܚ ܛ ܝ ܟ ܠ ܡ ܢ ܣ ܥ ܦ ܨ ܩ ܪ ܫ ܬ
ܸ ܲ ܵ ܸ ܹ ܼ
◌:0
. ؛ ! - — ، ؟ ' " « » ” “ ( ] [ ) ܀ \ /
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z

This section also includes our archiphoneme, which realizes the vowel sound A as alaf (ܐ) when it appears at the beginning or end of a word. This is helpful when working with stems, where the alaf often disappears once prefixes and suffixes are added to the stem.

%{A%}:0
! %{A%}:ܐ
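A two-level rule along the following lines can then restrict the alaf realization to word edges; this is a hedged sketch of the idea rather than our exact rule:

"Archiphoneme {A} realized as alaf at a word edge"
%{A%}:ܐ <=> .#. _ ;
            _ .#. ;

Everywhere else the archiphoneme surfaces as 0, which is what lets a stem keep a single underlying form whether or not affixes are attached.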

The hard stuff: Part One

When we moved on to the hardest part of putting together a morphological transducer, our biggest challenge was working with the many prefixes and infixes found in Neo-Aramaic. At the beginning we focused on easy forms, like plural nouns, whose morphology is mostly suffixing. When we moved on to our prefix morphology, we used the solution suggested on the course wiki, where our twoc file only lets a form through when the minus version of a given feature is matched by a plus version of the same feature:

Rules

"Remove paths without matching suffix feature"
Fx:0 /<= _ ;
    except
        _ :* Fy:0 ;
    where Fx in ( %[%-hab%] %[%-fut%] %[%-obl%] )
          Fy in ( %[%+hab%] %[%+fut%] %[%+obl%] )
    matched ;

"Remove paths without matching prefix feature"
Fx:0 /<= _ ;
    except
        Fy:0 :* _ ;
    where Fy in ( %[%-hab%] %[%-fut%] %[%-obl%] )
          Fx in ( %[%+hab%] %[%+fut%] %[%+obl%] )
    matched ;

This guarantees that if a verb form picks up a -hab or -fut flag from a prefix, the transducer will only analyze it as having a habitual or future prefix if the form also picks up a matching +hab or +fut flag later on.
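On the lexc side, the pairing can be set up roughly as follows; this is a hedged sketch with hypothetical lexicon names and prefix shapes, not our actual entries. The prefix lexicon attaches a minus flag alongside the surface prefix, the lexicon after the stem attaches the matching plus flag next to its tag (both flags declared as Multichar_Symbols), and the twoc rules above delete the flags while discarding any path where only one of the pair appears:

! Hedged sketch; lexicon names and prefix forms are placeholders
LEXICON VerbPrefixes
%[%-hab%]:ܟܸ VerbStems ;   ! habitual prefix
%[%-fut%]:ܒܸܬ VerbStems ;  ! future prefix

LEXICON VerbTense
%<hab%>%[%+hab%]: # ;
%<fut%>%[%+fut%]: # ;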

Evaluation

We have 2556 tokenised words in our corpus, with a coverage of 9.66%. We have about 37 stems in our transducer (definitely the area where we should be concentrating our work right now).
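For reference, we get the coverage number and the unknown-word list below from a pipeline along these lines (a sketch assuming our file names; the course scripts may differ slightly):

# Analyze the corpus; hfst-proc echoes unknown tokens as ^word/*word$
cat aii-basic-corpus.txt | apertium-destxt | hfst-proc aii.automorf.hfst > corpus.analyzed.txt
# Count the most frequent unanalyzed forms
grep -oE '/\*[^$]+' corpus.analyzed.txt | sort | uniq -c | sort -rn | head -20

Coverage is then the share of tokens that received at least one analysis.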

Top unknown words in the corpus:

114 ܐ
86 ܕ
66 ܠܹܗ
62 ܡ
54 ܒ
46 ܢ
44 ܘ
44 ܠ
39 ܡܘܼܠܸܕ
36 ܝ
31 ܪ
30 ܡܢ
23 ܵܐ
21 ܗ
20 ܹܐ
18 ܠܐ
17 ܫ
14 ܵ
13 ܚ
12 ܡܕܝܢܬܐ

Notes

The first unknown word we're going to analyze is ܡܘܼܠܸܕ, which is the ninth most common unknown word and appears 39 times in the corpus. This one is pretty easy because we were able to find it in the Sureth dictionary, where it translates as "to breed, to beget, to conceive." It's not surprising that this one appears so often in our corpus, since the Bible is so concerned with lineage.

ܡܘܼܠܸܕ:ܡܘܼܠܸܕ PresentVerbInfl ; ! molidh "to beget, to conceive"

Adding this form to our transducer bumped our coverage up to 12.03%!

However, besides this word, we started noticing that a LOT of these uncovered words are single letters; they don't seem to be actual words per se. We couldn't figure out why so many stray letters would be sprinkled through our corpus, except as an error. It turns out they come from a copy-paste error, as seen in this sentence taken from a PDF piece of our corpus:

(Screenshot of the original sentence in the PDF source: Noticescreenshot.png)

This sentence came over to our aii-basic-corpus.txt file looking like this:

ܡܬܪܓܡܹܢܐ ܡܗܼܝܹܪ ܸܒܠܫܢ ܼ ܕܪܘܫ ܹܡ ܿ ܵ ܐ ܐ ܵ ܵ o ܐ

We knew we would need a more powerful way to get text out of our PDF sources, and after some research we decided to look for an OCR engine with Syriac support. Unfortunately, our first thought, Adobe Acrobat Pro, offered only a small selection of languages, and even ABBYY FineReader didn't include Syriac in its list of 192 recognized languages. Fortunately, our friend George Kiraz (a local Syriac scholar) came through and recommended Tesseract, an open-source OCR engine now developed by Google! Our .txt file output by Tesseract looks a lot more like the original:

 ܡܬܪܓܡܠܝܐ ܡܗܝܠܐ ܒܠܦܫܐ ܕܪܢܕܫܠܪܩ
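Running it is a one-liner; here is roughly what we do for each page, assuming the page has been exported as an image and the Syriac traineddata (syr) is installed (file names here are illustrative):

# OCR one scanned page with Tesseract's Syriac model; writes page.txt
tesseract page.png page -l syr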

However, after looking more carefully at our corpus, we decided to remove most of our PDF source because it was too messy for Apertium to work with; it was full of problematic characters, like stray Latin script and even a few instances of �, that caused errors when we later tried to calculate ambiguity. We replaced these lines with a translation of the Universal Declaration of Human Rights.

53 of 83 tests pass now. Yay!