User:Cayoh1/Final project
Outline
I have begun to create a preliminary, basic transducer for Yoruba, my native Nigerian language. I have used what we've learned in class to build morphological analysis capabilities for my transducer.
- I added nouns, verbs, pronouns, adjectives, and interjections to the lexd file.
- In my grammar documentation, I created morphTests to test morphological analysis.
- I started writing rules in the twol file.
- I used scraped text from Wikipedia and the Bible to evaluate my preliminary transducer.
Code/ Resources
Solution
Background
Yoruba, a language of southwestern Nigeria, has up to 55 million speakers, yet it is still under-resourced. It has limited Google Translate functionality and some documentation, but both could use a lot of improvement.
Approach
Drawing on my knowledge of the language from growing up in a Yoruba-speaking household, and on my knowledge of morphological analysis, I aimed to create a viable Apertium transducer that reflects how current speakers use the language. I have been checking with my family members to ensure my implementation is adequate.
Issues
Some issues I've run into are correctly implementing Yoruba characters and structure with the Apertium tools, and the lack of a standard orthography for the language. Yoruba is a tonal language, and many characters in its vocabulary are accented and can be written differently depending on the region where you learned the language. Early on, this made it difficult to get characters represented correctly.
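One possible source of the character problems is that the same accented letter can be encoded in several ways in Unicode (precomposed or as a base letter plus combining marks). This is an assumption on my part, not something confirmed for this transducer. A minimal sketch of normalizing text to one form before it reaches the lexd file or the corpus, using Python's standard unicodedata module (the helper name normalize_yoruba is just for illustration):

```python
import unicodedata


def normalize_yoruba(text, form="NFC"):
    """Return text in a single Unicode normalization form (NFC by default)."""
    return unicodedata.normalize(form, text)


# Three encodings of the same letter: e with dot below and grave accent.
variants = [
    "\u1eb9\u0300",   # precomposed e-with-dot-below + combining grave
    "e\u0323\u0300",  # plain e + combining dot below + combining grave
    "\u00e8\u0323",   # precomposed e-with-grave + combining dot below
]

# After normalization, all three spellings should collapse to one string.
normalized = {normalize_yoruba(v) for v in variants}
print(len(set(variants)), "raw spellings ->", len(normalized), "normalized spelling(s)")
```

Whichever form is chosen, the lexicon, the twol alphabet, and the test corpora would all need to use the same one for lookups to match.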
Yoruba also doesn't have gendered or plural forms of nouns or adjectives. Gender is expressed in some cases by adding another word: for example, "older sister" in Yoruba would be ẹ̀gbọ́n obìnrin, with the first word meaning "older sibling" and the second word signifying "female". Similar to Romance languages like French and Spanish, Yoruba also distinguishes verbs and interjections by familiarity: there are different ways to say a number of things depending on the recipient.
Evaluation
To evaluate my transducer, I tested coverage over two large Yoruba corpora: Wikipedia and the Bible.
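For reference, coverage here is the share of tokens that receive at least one analysis. In the analyser output, an unknown token appears as ^form/*form$, with the asterisk marking the missing analysis. The sketch below is not the script used to produce the figures in this section. It is a simplified illustration that ignores escaped characters in the stream format, and the sample word ilé with the tag <n> is a made-up known entry:

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (known, total, Counter of unknown surface forms)."""
    known = 0
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if analyses.startswith("*"):
            # The analyser found no entry for this surface form.
            unknown[surface] += 1
        else:
            known += 1
    return known, total, unknown


# Hypothetical analyser output: one known token, three unknown tokens.
sample = "^ni/*ni$ ^ilé/ilé<n>$ ^tí/*tí$ ^ni/*ni$"
known, total, unknown = coverage(sample)
print(f"coverage: {known} / {total} (~{known / total:.4f})")
print("top unknown words:", unknown.most_common(2))
```

Sorting the unknown forms by frequency, as in the lists in this section, shows which entries would raise coverage the most if added to the lexd file.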
Wikipedia
TOP UNKNOWN WORDS:
26244 ^ni/*ni$
18828 ^tí/*tí$
16783 ^ti/*ti$
16252 ^ní/*ní$
9895 ^àwọn/*àwọn$
7621 ^àti/*àti$
7503 ^je/*je$
7093 ^jẹ́/*jẹ́$
6521 ^ń/*ń$
6085 ^ati/*ati$
5680 ^rẹ̀/*rẹ̀$
5556 ^sí/*sí$
5458 ^ọdún/*ọdún$
5237 ^to/*to$
5182 ^wọ́n/*wọ́n$
4846 ^èdè/*èdè$
4768 ^ṣe/*ṣe$
4717 ^the/*the$
4697 ^fún/*fún$
4513 ^of/*of$
coverage: 287237 / 1283498 (~0.22379232379014225188)
remaining unknown forms: 996261
The Bible
TOP UNKNOWN WORDS:
9572 ^si/*si$
8152 ^ti/*ti$
3724 ^awọn/*awọn$
3530 ^ni/*ni$
3405 ^fun/*fun$
3204 ^li/*li$
3051 ^ki/*ki$
2841 ^ati/*ati$
2534 ^rẹ̀/*rẹ̀$
2412 ^nwọn/*nwọn$
2069 ^pe/*pe$
1965 ^nyin/*nyin$
1927 ^fi/*fi$
1872 ^bi/*bi$
1649 ^ninu/*ninu$
1648 ^ba/*ba$
1588 ^kò/*kò$
1506 ^ṣe/*ṣe$
1379 ^Ọlọrun/*Ọlọrun$
1278 ^lati/*lati$
coverage: 53863 / 233959 (~0.23022409909428575092)
remaining unknown forms: 180096
Moving Forward
I think that with further implementation, this resource would be very useful for the Yoruba-speaking community in Nigeria and the Yoruba diaspora all over the world. Peers of mine who also grew up with the language but are not aware of its grammatical intricacies could use this transducer to learn more about their native language. I think any addition to the computational tools for Yoruba will be positive, as it is low-resource compared to other languages with a similar number of speakers.