Tibetan/Transducer

From LING073
Revision as of 08:58, 6 March 2018 by Arobey1

This page describes the Tibetan morphological analyzer I created as part of this class. The code for this project is hosted on GitHub.

Notes

The first form I implemented was the infinitive. In Tibetan, an infinitive is formed by adding པ (pa) or བ (ba) to the verbal root. For example, the verb root ལོཀ (lok), meaning "read," becomes ལོཀ་པ (lok-pa), the infinitive "to read." At first, I was confused about why this verb isn't transliterated as "loka-pa," since the character ཀ on its own is transliterated as "ka" rather than "k." However, I learned that Tibetan words are broken up into syllables separated by a tsek, the small dot between ལོཀ and པ in ལོཀ་པ. The tsek serves almost the same purpose as a space, since Tibetan does not use spaces to separate words.

Right now, some incorrect forms will analyze without giving an error. From my grammar page:

The suffix «pa» is added when the final letter of the root is any consonant except for «r» or «l». On the other hand, the suffix «wa» is used if the verb root ends in «r», «l» or any vowel.

So right now, if "pa" or "ba" is added to the end of a verb root, the analyzer will accept either form. This is undesirable, as there is only one correct form for any given verb. I will be able to fix this problem in the "Rules" section of the twol file.
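The rule quoted from the grammar page can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the twol rule itself: it assumes roots are given in romanized form, the function name `infinitive_suffix` is hypothetical, and the roots "kar" and "zo" are made-up strings used only to exercise each branch.

```python
VOWELS = set("aeiou")


def infinitive_suffix(root: str) -> str:
    """Pick the infinitive suffix for a romanized verb root (illustrative only)."""
    if not root:
        raise ValueError("root must be non-empty")
    final = root[-1].lower()
    # Roots ending in r, l, or a vowel take "wa"; any other final consonant takes "pa".
    if final in VOWELS or final in ("r", "l"):
        return "wa"
    return "pa"


if __name__ == "__main__":
    # "lok" is the root from the example above; "kar" and "zo" are placeholders.
    for root in ("lok", "kar", "zo"):
        print(f"{root}-{infinitive_suffix(root)}")
```

A twol rule would express the same constraint over the surface characters rather than over a romanization.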

After completing the infinitive analysis, I began implementing analysis of noun declensions. In Tibetan, several endings/suffixes mark the standard noun cases (e.g. nominative, accusative, dative). Each variant of an ending is listed as a separate entry in my lexc file, which means that, for now, a noun carrying the wrong variant is still analyzed as if it were correct:

%<agt%>:%>་ཀྱི # ;   ! agentive (kyi = ཀྱི)
%<agt%>:%>་གྷི # ;   ! agentive (ghi = གྷི)
%<agt%>:%>་གི # ;   ! agentive (gi = གི)
%<agt%>:%>་ཡི # ;   ! agentive (yi = ཡི)

In total, there are seven cases a noun can take in Tibetan, and each case has a singular and a plural form.

  1. Nominative
  2. Genitive
  3. Dative
  4. Accusative
  5. Locative
  6. Ablative
  7. Agentive

I hope to add more nouns to my grammar to make sure that I have found all of the endings that nouns can take in declension. Various rules govern which nouns get which endings. Again, I will implement these rules in the "Rules" section of my twol file when I have a chance.
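Until those rules exist, every listed variant of an ending is accepted on every noun. The Python sketch below imitates that behaviour for the four agentive entries shown earlier. It is a toy model under stated assumptions, not how the compiled transducer works internally: the `analyze` function and the `<n>` tag are hypothetical, and the stem མི (mi, "person") is used purely as an illustration and is not taken from the project's lexicon.

```python
TSEK = "\u0f0b"  # the dot that separates syllables

# The four endings from the lexc entries above, all tagged <agt>.
AGENTIVE_ENDINGS = ("ཀྱི", "གྷི", "གི", "ཡི")


def analyze(form: str, stems: list[str]) -> list[str]:
    """Return an analysis for every stem + tsek + ending split that matches."""
    analyses = []
    for stem in stems:
        for ending in AGENTIVE_ENDINGS:
            # No phonological condition is checked here, so any variant is accepted.
            if form == stem + TSEK + ending:
                analyses.append(f"{stem}<n><agt>")
    return analyses


if __name__ == "__main__":
    for ending in AGENTIVE_ENDINGS:
        print(analyze("མི" + TSEK + ending, ["མི"]))
```

All four surface forms receive the same analysis, which is the overgeneration the twol rules are meant to remove.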

The adjectives were somewhat easier to implement. Most adjectives take a suffix of པ (pa) or བ (ba) to move from positive to comparative. This implementation was very similar to the code I used to analyze infinitive verbs. There are actually four possible patterns for comparative adjectives:

  • Adjective + normalizing participle + "-pa" or "-ba"
  • Adjective + verb participle + "-kyi" or "-gi" or "-gyi" or "-red"
  • Adjective + "-pa" or "-ba" with lengthened vowel at predicator position
  • Adjective + causative verb participle "-ru"

The superlative generally takes ཤོས (shos) as a suffix. I also implemented this case in my lexc file.

Challenges

Tibetan is a difficult language to work with. One of the major problems has been trying to understand how the text fits together. It was only after putting my grammar page together and talking with Prof. Washington that I realized that only the first character in a syllable takes a sounded vowel. Thus the verb ཡར་ལང་ཝ, meaning "to get up/rise," is not transliterated as "yara langa wa" even though ཡ is "ya," ར is "ra," ལ is "la," ང is "nga" and ཝ is "wa," which in this case denotes the infinitive. Instead, only the first character of each syllable takes its vowel, and so we end up with "yar lang wa."
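The "first character takes the vowel" observation can be written as a small Python sketch. It assumes the simplified view described above (the first consonant of each syllable carries the vowel, every later consonant is bare) and ignores everything else about Tibetan orthography. The tables are deliberately partial and the function names are hypothetical.

```python
TSEK = "\u0f0b"  # syllable separator

# Partial, illustrative tables: consonants are stored without their inherent "a".
CONSONANTS = {"ཀ": "k", "ང": "ng", "པ": "p", "ཡ": "y", "ར": "r", "ལ": "l"}
VOWEL_SIGNS = {"\u0f72": "i", "\u0f74": "u", "\u0f7a": "e", "\u0f7c": "o"}


def transliterate_syllable(syllable: str) -> str:
    """Romanize one syllable; only the first consonant receives a vowel."""
    out = []
    vowel_placed = False
    for i, ch in enumerate(syllable):
        if ch in VOWEL_SIGNS:
            continue  # already consumed together with the consonant it marks
        if ch not in CONSONANTS:
            raise ValueError(f"character not in the sketch's table: {ch!r}")
        out.append(CONSONANTS[ch])
        if not vowel_placed:
            nxt = syllable[i + 1] if i + 1 < len(syllable) else ""
            # An explicit vowel sign replaces the inherent "a".
            out.append(VOWEL_SIGNS.get(nxt, "a"))
            vowel_placed = True
    return "".join(out)


def transliterate(word: str) -> str:
    """Split on the tsek and romanize each syllable separately."""
    return " ".join(transliterate_syllable(s) for s in word.split(TSEK) if s)


if __name__ == "__main__":
    print(transliterate("ལོཀ་པ"))
    print(transliterate("ཡར་ལང"))
```

With these tables, ལོཀ་པ comes out as "lok pa" and ཡར་ལང as "yar lang", rather than "loka pa" or "yara langa".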

Furthermore, while working with my analyzer I realized that some of the characters I needed were missing from the "Alphabet" section of my twol file, which was breaking my analysis. In particular, one of my forms needed the character ཨ, which is transliterated as 'a. Working with a script other than the Roman alphabet makes mistakes like this hard to spot: I confused this character with ཡ ("ya") and couldn't figure out why the transducer wasn't working.
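A check along the following lines might catch this kind of problem earlier. It is a sketch under assumptions: the alphabet and the lexicon forms are passed in as plain Python strings (parsing the actual twol and lexc files is not shown), every symbol is treated as a single character, and `missing_from_alphabet` is a hypothetical helper. Printing the Unicode name makes look-alike characters easier to tell apart.

```python
import unicodedata


def missing_from_alphabet(alphabet, forms):
    """List characters used in forms that the alphabet does not declare."""
    known = set(alphabet)
    missing = []
    for form in forms:
        for ch in form:
            if ch not in known and ch not in missing:
                missing.append(ch)
    # Report the code point and Unicode name alongside each character.
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in missing]


if __name__ == "__main__":
    alphabet = ["ཡ", "ར", "\u0f0b"]   # illustrative subset of an Alphabet section
    forms = ["ཨ" + "ར", "ཡ" + "ར"]     # illustrative strings, not real lexicon entries
    for ch, codepoint, name in missing_from_alphabet(alphabet, forms):
        print(ch, codepoint, name)
```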

I also found that I had to work slowly when writing code for my twol file. My first attempt at the transducer would not compile because much of the code I had written had incorrect syntax. And since I didn't really know the lexc language, it was almost impossible to debug. I ended up starting over with a new project and building it up one form at a time. I worked mostly on my laptop, so to speed up the compile-and-test cycle, I wrote a bash script that runs my tests for me over ssh after I have pushed to GitHub from my local machine:

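#!/bin/bash
# Rebuild the transducer from the latest commit and run the morphology tests.
set -e           # stop at the first failing step
cd ..
git stash        # set aside any local changes so the pull applies cleanly
git pull
./autogen.sh
make clean
make
cd tools
# Generate tib.yaml from the tests on the Tibetan/Grammar page
morphTests2yaml "Tibetan/Grammar" -l tib
cp --force tib.yaml ../tests/
cd ../tests
aq-morftest -csi tib.yaml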

Getting used to an entirely new alphabet has been perhaps the most challenging portion of this project.