Difference between revisions of "Tibetan/Transducer"

From LING073

Revision as of 08:50, 6 March 2018

This page describes the Tibetan morphological analyzer I created as part of this class. The code for this project is on GitHub.

Notes

The first form I implemented was the infinitive. In Tibetan, an infinitive is formed by adding པ (pa) or བ (ba) to the verbal root. For example, the verb ལོཀ (lok), meaning "read," becomes ལོཀ་པ (lok-pa), "to read." At first, I was confused about why this verb wouldn't be transliterated as "loka-pa," since the character ཀ is transliterated as "ka" rather than "k." However, I learned that Tibetan words are broken up into syllables separated by a tsek, the small dot between ལོཀ and པ in ལོཀ་པ. The tsek serves almost as a space in Tibetan, since spaces are not used to separate words. Right now, some incorrect forms will analyze without giving an error. From my grammar page:

The suffix «pa» is added when the final letter of the root is any consonant except for «r» or «l». On the other hand, the suffix «wa» is used if the verb root ends in «r», «l» or any vowel.

So right now, the analyzer accepts a verb root followed by either "pa" or "ba," regardless of what the root ends in. This is undesirable, as there is only one correct form for any given verb. I will be able to fix this problem in the "Rules" section of the twol file.
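The rule quoted above from the grammar page can be sketched in Python on transliterated roots. This is only an illustration of the intended behavior, not the twol rule itself: the function names are hypothetical, the vowel test simply looks at the last Roman letter of the transliteration, and the sketch assumes that the «wa» suffix in the quotation is the བ written as "ba" in these notes.

```python
TSEK = "\u0f0b"  # ་  the dot that separates syllables
PA = "\u0f54"    # པ
BA = "\u0f56"    # བ  (assumed to be the "wa" suffix of the grammar page)


def infinitive_suffix(root_translit: str) -> str:
    """Pick the infinitive suffix from the last letter of a transliterated root."""
    if not root_translit:
        raise ValueError("empty root")
    last = root_translit[-1].lower()
    # r, l, or any vowel selects the wa/ba suffix; every other consonant selects pa.
    return BA if last in "rlaeiou" else PA


def build_infinitive(root_script: str, root_translit: str) -> str:
    """Join root, tsek, and the selected suffix into one surface form."""
    return root_script + TSEK + infinitive_suffix(root_translit)


LOK = "\u0f63\u0f7c\u0f40"  # ལོཀ
print(build_infinitive(LOK, "lok"))  # prints ལོཀ་པ
```

With a check like this, a root ending in "k" only ever receives པ, which is the restriction the twol rules would need to enforce inside the transducer.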

After completing the infinitive analysis, I began implementing analysis of noun declensions. In Tibetan, there are several endings/suffixes that denote the standard noun forms (i.e. nominative, accusative, dative, etc.). Each of these is listed as a separate option in my lexc file, which means that for now an incorrect form is analyzed as if it were correct:

%<agt%>:%>་ཀྱི # ;   ! agentive (kyi = ཀྱི)
%<agt%>:%>་གྷི # ;   ! agentive (ghi = གྷི)
%<agt%>:%>་གི # ;   ! agentive (gi = གི)
%<agt%>:%>་ཡི # ;   ! agentive (yi = ཡི)

In total, there are seven cases a noun can take in Tibetan, each with a singular and a plural form:

  1. Nominative
  2. Genitive
  3. Dative
  4. Accusative
  5. Locative
  6. Ablative
  7. Agentive

I hope to add more nouns to my grammar to make sure that I have found all of the endings that nouns can take on in declension. There are various rules that govern which nouns get which endings. Again, I will implement these rules in the "Rules" section of my twol file when I have a chance.

Challenges

Tibetan is a difficult language to work with. One of the major problems has been understanding how the script fits together. It was only after putting my grammar page together and talking with Prof. Washington that I realized that only the first character in a syllable takes a sounded vowel. Thus the verb ཡར་ལང་ཝ, meaning "to get up/rise," is not transliterated as "yara langa wa" even though ཡ is "ya," ར is "ra," ལ is "la," ང is "nga," and ཝ is "wa," which in this case marks the infinitive. Instead, only the first character of each syllable takes its vowel, and so we end up with "yar lang wa."
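The syllable rule can be sketched as a toy transliterator in Python. It is a simplification under stated assumptions: the table covers only the five consonants from the example, it ignores vowel signs such as the one in ལོཀ, and it romanizes ཝ as "wa," following its Unicode name (TIBETAN LETTER WA). The names `INHERENT` and `transliterate` are made up for this illustration, and any character outside the table raises a `KeyError`.

```python
TSEK = "\u0f0b"  # ་  syllable separator

# Toy table: each consonant with its inherent vowel "a".
INHERENT = {
    "\u0f61": "ya",   # ཡ
    "\u0f62": "ra",   # ར
    "\u0f63": "la",   # ལ
    "\u0f44": "nga",  # ང
    "\u0f5d": "wa",   # ཝ
}


def transliterate(word: str) -> str:
    """Romanize a word syllable by syllable, splitting at each tsek."""
    pieces = []
    for syllable in word.split(TSEK):
        if not syllable:
            continue  # skip empty pieces from a leading or trailing tsek
        first, rest = syllable[0], syllable[1:]
        # The first consonant keeps its vowel; later consonants drop the final "a".
        pieces.append(INHERENT[first] + "".join(INHERENT[c][:-1] for c in rest))
    return " ".join(pieces)


YAR_LANG_WA = "\u0f61\u0f62" + TSEK + "\u0f63\u0f44" + TSEK + "\u0f5d"  # ཡར་ལང་ཝ
print(transliterate(YAR_LANG_WA))  # prints yar lang wa
```

Transliterating character by character without the split at the tsek would give "yaralangawa," which is the misreading described above.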

Furthermore, I realized while working with my analyzer that some of the characters I needed were missing from the "Alphabet" section of my twol file, which was breaking my analysis. In particular, one of my forms needed the character ཨ, which is transliterated as 'a. Working with a non-Roman script made this hard to spot: I confused this character with ཡ ("ya") and couldn't figure out why the transducer wasn't working.
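A small Python check can make this kind of mix-up easier to catch by listing the characters a form uses that the alphabet lacks, together with their Unicode names. This is a sketch and not part of the analyzer: the function name, the partial alphabet, and the test string are all hypothetical, and the alphabet would in practice be copied from the twol file.

```python
import unicodedata


def missing_from_alphabet(text: str, alphabet: str):
    """Return (character, Unicode name) pairs for characters of text not in alphabet."""
    missing = sorted(set(text) - set(alphabet))
    return [(ch, unicodedata.name(ch, "UNNAMED")) for ch in missing if not ch.isspace()]


# Hypothetical partial alphabet: ཡ ར ལ ང plus the tsek.
alphabet = "\u0f61\u0f62\u0f63\u0f44\u0f0b"
# Hypothetical test string containing ཨ, which the alphabet above lacks.
form = "\u0f68\u0f62"

for ch, name in missing_from_alphabet(form, alphabet):
    print(f"U+{ord(ch):04X} {name}")  # prints U+0F68 TIBETAN LETTER A
```

Printing the Unicode name separates look-alike characters: ཨ is reported as TIBETAN LETTER A, while ཡ is TIBETAN LETTER YA.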

I also found that I had to take it really slow when writing code for my twol file. My first attempt at the transducer would not compile because much of the code I had written had incorrect syntax. And since I didn't really know the lexc language, it was almost impossible to debug. I ended up starting over with a new project and building it up one form at a time. I worked mostly on my laptop, so to speed up the compilation process, I created a bash script that runs my tests over ssh after I have pushed to GitHub from my local machine:

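<syntaxhighlight lang="bash">#!/bin/bash

# Update the checkout, setting aside any local changes first
cd ..
git stash
git pull

# Rebuild the transducer from scratch
./autogen.sh
make clean
make

# Generate tib.yaml (named after the Tibetan/Grammar page) and copy it to the tests directory
cd tools
morphTests2yaml "Tibetan/Grammar" -l tib
cp --force tib.yaml ../tests/

# Run the morphological tests
cd ../tests
aq-morftest -csi tib.yaml</syntaxhighlight>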