Difference between revisions of "Tibetan"

From LING073
Revision as of 13:26, 8 February 2018

Tibetan Language Page

Existing Tools

This section lists existing dictionaries, translations and translation tools. There appear to be many dictionaries to work with, and several useful-looking translation tools that should help us in this course.

Tibetan-English Dictionaries

Tibetan spellchecker

Tibetan keyboard layout

Orthography/Grammar Tools

Translations and Texts

Academic Papers

A number of academic papers relate to this subject, and this section lists a few of them. There seems to be a substantial body of work on Tibetan machine translation and speech recognition, so this is by no means a complete summary.

Tibetan Speech Recognition

Research papers on speech recognition for Tibetan

Tibetan Machine Translation

Developed Resources

This section covers the tools I have developed as part of LING073. Because three standard keyboards (ewts, tcrc and wylie) already exist for Linux, I took this task further by developing a transcription keyboard. My research indicates that the Wylie keyboard is the most standard, so it is the keyboard I will use for the remainder of this class.

Transcription Keyboard

Because a keyboard for Tibetan already exists in IBus, I created a transcription keyboard. It allows users to type pronunciations of Tibetan words as they might appear in Tibetan dictionaries. The GitHub repository I created contains the standard Wylie Tibetan keyboard (bo-wylie.mim) and the transcription keyboard (bo-transcription.mim) I created. It also holds AUTHORS, LICENSE and INSTALL files. I also created a wiki page describing how I implemented this keyboard.

Corpus

I have assembled a corpus of texts in Tibetan. The full corpus and the scripts I used to parse the data can be found in this GitHub repository. The corpus includes the entirety of the Tibetan Wikipedia, several books of the Bible with the corresponding English translation underneath each line, several biographies, and 600 webpages from the homepage of the Dalai Lama. The tools I developed to scrape data are fairly general and can be extended to a number of different kinds of sites. In particular, the Parser() class in the parse.py file is very useful for formatting glosses.
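As a rough illustration of the "translation underneath each line" layout mentioned above, the sketch below groups alternating lines into (Tibetan, English) pairs. It is not the Parser() class from parse.py; the function name pair_lines and the assumption that the lines strictly alternate are hypothetical and only for illustration.

```python
def pair_lines(lines):
    """Group alternating source/translation lines into (source, translation) pairs.

    Assumption (hypothetical layout): every Tibetan line is followed
    directly by its English translation; blank lines are ignored.
    """
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    if len(cleaned) % 2 != 0:
        raise ValueError("expected an even number of non-blank lines")
    # Even positions hold the source lines, odd positions the translations.
    return list(zip(cleaned[0::2], cleaned[1::2]))


# Sample input reuses the example phrase from this page.
sample = [
    "ལུག་གི་པགས་པ",
    "Skin of sheep",
]

for tibetan, english in pair_lines(sample):
    print(tibetan, "|", english)
```

Raising an error on an odd number of lines is a deliberate choice here: a missing translation would otherwise shift every later pair out of alignment without any warning.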

Orthography Documentation

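  • Similar to these examples (https://en.wikipedia.org/wiki/Modern_Standard_Tibetan_grammar). Put spaces between words:
   Skin of sheep
   ལུག་གི་པགས་པ
   <lug-gi pags-pa>
  • See the information on the tsek (་), the small dot character. It seems to separate syllables, but there appears to be no marking of word boundaries.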
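The syllable-separating role of the tsek can be checked with a few lines of Python. This is a minimal sketch that assumes the separator is U+0F0B (TIBETAN MARK INTERSYLLABIC TSHEG); the helper name split_syllables is hypothetical.

```python
TSEK = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG, the small dot between syllables


def split_syllables(text):
    """Split a Tibetan string into syllables at each tsek, dropping empty pieces."""
    return [syllable for syllable in text.split(TSEK) if syllable]


phrase = "ལུག་གི་པགས་པ"  # "Skin of sheep", <lug-gi pags-pa>
for syllable in split_syllables(phrase):
    print(syllable)
```

Splitting on the tsek gives syllables only. Grouping them into words, as in the transliteration <lug-gi pags-pa>, still requires the manually inserted spaces described above.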
