Neo-Aramaic/Transducer


The code for the transducer can be found in this GitHub repository.

Analyzer Evaluation

Getting set up: Part One

Getting set up for this stage of the project was an ongoing process for us; while it was easy to bootstrap our transducer and start working with Apertium, we found that our grammar tags, our Alphabet, and our tests/ directory all needed continuous updating. This was because our grammar rules kept using tags and characters not found in our defined tags or alphabet, and those changes then had to trickle down to tests/ as well. We continue to update these files to reflect the growing number of tags we use to complete our grammar coverage. We now have tags for verb features, parts of speech, an archiphoneme, person, gender, number, case, TAM, and pronoun morphology, and punctuation.
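
As a rough sketch of what this inventory looks like in the lexc Multichar_Symbols section (the tag names below are illustrative examples, not a complete or exact listing of ours):

Multichar_Symbols
%<n%> %<v%> %<adj%> %<prn%> %<punct%>   ! parts of speech
%<p1%> %<p2%> %<p3%>                    ! person
%<m%> %<f%>                             ! gender
%<sg%> %<pl%>                           ! number
%<hab%> %<fut%>                         ! TAM
%{A%}                                   ! archiphoneme
%[%-fut%] %[%+fut%]                     ! paired prefix features (see below)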

Our Alphabet includes Syriac characters, Latin characters, a null morpheme (useful for analyzing vowel diacritics), and punctuation.

Alphabet
ܐ ܒ ܓ ܕ ܗ ܘ ܙ ܚ ܛ ܝ ܟ ܠ ܡ ܢ ܣ ܥ ܦ ܨ ܩ ܪ ܫ ܬ
ܸ ܲ ܵ ܸ ܹ ܼ
◌:0
. ؛ ! - — ، ؟ ' " « » ” “ ( ] [ ) ܀ \ /
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z

This section also includes our archiphoneme, which is used to realize the vowel sound A as an alaf when it appears at the beginning or end of a word. This is helpful for stems whose alaf often disappears once prefixes or suffixes are added.

%{A%}:0
! %{A%}:ܐ
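
For reference, here's a minimal twol sketch of the boundary behavior we're describing (the rule name and contexts are illustrative; our actual rule is worded differently):

"Archiphonemic alaf surfaces only at word boundaries"
%{A%}:ܐ <=> .#. _ ;
            _ .#. ;

Everywhere else, the default %{A%}:0 pair from the Alphabet applies, which is what deletes the alaf once a prefix or suffix attaches to the stem.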

The hard stuff: Part One

The hardest part of putting together the morphological transducer was working with the many prefixes and infixes found in Neo-Aramaic. In the beginning we focused on easy forms, like plural nouns, whose morphology is mostly suffixing. When we moved on to prefix morphology, we used the solution suggested on the course wiki, where the twoc file only keeps forms in which each minus version of a feature is matched by a plus version:

Rules

"Remove paths without matching suffix feature"
Fx:0 /<= _ ;
    except
        _ :* Fy:0 ;
    where Fx in ( %[%-hab%] %[%-fut%] %[%-obl%] )
          Fy in ( %[%+hab%] %[%+fut%] %[%+obl%] )
    matched ;

"Remove paths without matching prefix feature"
Fx:0 /<= _ ;
    except
        Fy:0 :* _ ;
    where Fy in ( %[%-hab%] %[%-fut%] %[%-obl%] )
          Fx in ( %[%+hab%] %[%+fut%] %[%+obl%] )
    matched ;

This guarantees that if a form picks up a -hab or -fut feature from a prefix, it will only be analyzed as having a habitual or future prefix if it also picks up a matching +hab or +fut feature later in the path.
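
As a hedged illustration of where these paired features come from (the lexicon and tag names below are ours for illustration, though ܒܹܬ bet- and ܟܝܼ ci- are the real future and habitual prefixes), the prefix entry introduces the minus feature and the tense tag introduces the matching plus feature:

LEXICON VerbPrefixes
:ܒܹܬ%[%-fut%] VerbStems ;   ! future prefix bet- carries the minus feature
:ܟܝܼ%[%-hab%] VerbStems ;   ! habitual prefix ci- carries the minus feature
VerbStems ;                 ! unprefixed path

LEXICON TenseTags
%<fut%>:%[%+fut%] # ;       ! future tag carries the matching plus feature
%<hab%>:%[%+hab%] # ;       ! habitual tag carries the matching plus feature

The twoc rules above then throw out any path where a minus feature has no matching plus feature (or vice versa), so the bet- prefix can only survive on a path that also takes the future tag.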

Evaluation

We have 2595 tokenised words in our corpus, with a coverage of 15.26%. We have about 37 stems in our transducer (definitely the area where we should concentrate our work right now).

Top unknown words in the corpus, before we added some of them:

66 ܠܹܗ
62 ܕ
54 ܘ
52 ܐ
50 ܚܕ
49 ܡ
44 ܒ
40 ܪ
39 ܡܘܼܠܸܕ
37 ܠ
35 ܢ
31 ܡܢ
28 ܟܠ
28 ܝܢ
28 ܝ
27 ܗܩܘܬܐ
25 ܠܐ
22 ܐܝܬܠܗ
20 ܗ
19 ܵܐ


Top unknown words in the corpus after we added ܠܹܗ, ܚܕ, ܡܘܼܠܸܕ, ܟܠ:

62 ܕ
54 ܘ
52 ܐ
49 ܡ
44 ܒ
40 ܪ
37 ܠ
35 ܢ
31 ܡܢ
28 ܝ
28 ܝܢ
27 ܗܩܘܬܐ
25 ܠܐ
22 ܐܝܬܠܗ
20 ܗ
19 ܵܐ
17 ܫ
16 ܹܐ
13 ܩܐ
12 ܦܐܫ


Currently, 53 of 83 tests in aii.yaml are passing. We have four passing tests in commonwords.yaml.

Notes

Eighteen of the thirty failing tests in aii.yaml are from our "plural nouns" section, in which we gave examples of different ways to form plural nouns. Since there are many different strategies for noun pluralization in Neo-Aramaic, not all of which are totally regular, we only implemented one strategy in our transducer. Thus the tests that exemplify forms that we didn't implement are still failing.
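
A hedged lexc sketch of what a single regular pattern like ours looks like (we're assuming the common -ܵܐ singular / -ܹܐ plural alternation here, with stems stored without the final vowel; our real entries may differ):

LEXICON RegNounInfl
%<n%>%<sg%>:ܵܐ # ;   ! regular singular ending -a
%<n%>%<pl%>:ܹܐ # ;   ! the one plural strategy we implemented, -ə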

Six failing tests are from the preposition section. Rules relating to prepositions were not among the 10 grammatical features we documented in our wiki page, so our transducer currently doesn't deal with prepositions.

Four failing tests are from the examples we had in our verbs section. While we could implement these stems, we haven't done so yet because we're not sure if they follow the same paradigm as the patx/madməx group we currently have implemented.

Finally, there are two failing tests in the habitual prefix section. Both of these come from the form ܟܝܼܐܵܬܹܐ ci-atə 'to come (habitually).' This word has two vowels in a row, and the only way we can think of to spell it is to have an alaf in the middle of the word to carry the second vowel. But our twol rules dictate that an archiphonemic alaf should be realized as null when it's not at a word boundary. We could spell ܟܝܼܐܵܬܹܐ with a plain alaf rather than with an archiphoneme, but that would then mess up ܒܹܬܵܬܹܐ bet-atə 'he will come,' in which the alaf does have to be realized as null.
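
Schematically, the bind looks like this (the starred forms are our reconstruction of the bad output each spelling would produce):

! Option 1: stem spelled with the archiphoneme, %{A%}ܵܬܹܐ
!   future:   ܒܹܬ + stem → ܒܹܬܵܬܹܐ    (correct: the alaf goes to null)
!   habitual: ܟܝܼ + stem → *ܟܝܼܵܬܹܐ   (wrong: the a-vowel loses its carrier)
! Option 2: stem spelled with a plain alaf, ܐܵܬܹܐ
!   habitual: ܟܝܼ + stem → ܟܝܼܐܵܬܹܐ   (correct)
!   future:   ܒܹܬ + stem → *ܒܹܬܐܵܬܹܐ  (wrong: the alaf should be null here)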

The first unknown word we're going to analyze is ܡܘܼܠܸܕ, which is the ninth most common unknown word and appears 39 times in the corpus. This one was easy because we found it in the Sureth dictionary, where it translates as "to breed, to beget, to conceive." It's not surprising that it appears so often in our corpus, since the Bible is so concerned with lineage.

ܡܘܼܠܸܕ:ܡܘܼܠܸܕ PresentVerbInfl ; ! molidh "to beget, to conceive"

Adding this form to our transducer bumped our coverage from 9.66% up to 12.03%!

However, besides this word, we started noticing that a lot of these uncovered words are single letters that don't seem to be actual words at all. We couldn't figure out why so many stray letters would be sprinkled through our corpus, except as an error. It turns out they come from a copy-paste error, as seen in this sentence taken from a PDF portion of our corpus:

[Screenshot: the original sentence as rendered in the PDF]

This sentence came over to our aii-basic-corpus.txt file looking like this:

ܡܬܪܓܡܹܢܐ ܡܗܼܝܹܪ ܸܒܠܫܢ ܼ ܕܪܘܫ ܹܡ ܿ ܵ ܐ ܐ ܵ ܵ o ܐ

We knew we would need to find a more powerful way to get text out of our PDF sources, and after some research we decided to look for an OCR engine with Syriac support. Unfortunately, our first thought, Adobe Acrobat Pro, offered only a tiny selection of languages, and even ABBYY FineReader didn't include Syriac in its list of 192 recognized languages. Fortunately, our friend George Kiraz (a local Syriac scholar) came through and recommended Tesseract, an open-source OCR engine developed by Google! The .txt file Tesseract produced looks a lot more like the original:

 ܡܬܪܓܡܠܝܐ ܡܗܝܠܐ ܒܠܦܫܐ ܕܪܢܕܫܠܪܩ

However, after looking more carefully at our corpus, we decided to remove most of our PDF source because it was too messy for Apertium to work with; it was full of problematic characters like stray Latin script and even a few instances of � that caused errors when we later tried to calculate ambiguity. We replaced these lines with a translation of the Universal Declaration of Human Rights.

Adding the declaration significantly changed the list of unknown words and removed a lot of those single letters we were worried about; the next word we decided to investigate was ܟܠ, which was our twelfth most common unknown word and appeared 28 times in our corpus. This was also easy to find in the Sureth dictionary and is a simple adjective that means "any, all." It's used often in the Declaration of Human Rights since each statement in this document begins with ܟܠ ܚܕ, which translates as "any person."

ܟܠ:ܟܠ AdjectiveInfl ; ! kul 'any, all'

Our coverage had dropped to 9.67% after we added the Universal Declaration of Human Rights; adding this word brought it up to 10.75%!

We next added ܚܕ, which was our fifth most common unknown word with 50 appearances in the corpus. The Sureth dictionary showed that this word acts as a sort of indefinite article, meaning "a, an, one." As previously stated, it often appears in our corpus in the Declaration of Human Rights in conjunction with ܟܠ to mean "anyone."

ܚܕ:ܚܕ RegNounInfl ; ! had "a, an, one"

Adding this word to our transducer brought our coverage up to 12.72%.

The elephant in the room was our most common unknown word, ܠܹܗ. This word was so frequent in our corpus partly because it shows up in almost every sentence of the first chapter of Matthew (it's the genealogy of Jesus, so most of the sentences are of the form "and a begat b, and b begat c, and c begat d"; you can see how ܡܘܼܠܸܕ comes in here as well), but we couldn't figure out what it was doing. Eventually, we found a dictionary explaining that ܠܹܗ is one form of a word used to mark "an object in close association with a noun." In our transducer, we had this word analyze as a preposition.

 ܠܹܗ:ܠܹܗ PrepositionInfl ; ! leh object marker

After we added this word, our coverage jumped to 15.26%.

Generator Evaluation

Initial evaluation of morphological generation

We have 53 passing morph tests and 30 failing morph tests in aii.yaml (63.86%).

We have 2595 tokenised words in our corpus, with a coverage of 15.26%.

When testing for generation, we have 45 passing morph tests and 56 failing morph tests (44.55%).

Final evaluation of morphological generation

Now there are 59 passing tests and 18 failing tests (76.62%). We didn't add any new twol rules; the only twol change was commenting out one of the rules we already had so that the other could stand as the default. Beyond that, we added some stems and paths to the transducer. The yaml file originally contained a series of long-form s-suffixes for present-tense verbs; in the transducer, however, the long-form s-suffixes are set to analyze but not generate, so those tests failed during generation. We therefore took the long-form s-suffix tests out of the yaml file and replaced them with the default-form s-suffixes. We had removed the default-form s-suffixes a while ago because something about that particular section caused errors when we tried to run morftest, but they worked just fine after we retyped them by hand.
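
For context, the analyze-but-not-generate marking follows the common Apertium lexc convention of flagging a line with a Dir/LR comment that the build filters out of the generator. Assuming that convention, a hedged sketch with placeholder endings (LONGFORM and SHORTFORM stand in for the real suffix strings, and the lexicon name is illustrative):

LEXICON PresentSSuffixes
%<p3%>%<sg%>:LONGFORM # ; ! Dir/LR   long form: analyzer only, so generation tests on it fail
%<p3%>%<sg%>:SHORTFORM # ;           ! default form: analyzed and generated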

Our corpus still has 2595 forms, but coverage has strangely dropped to 13.06%.