I have begun building a preliminary, basic transducer for Yoruba, my native Nigerian language, using what we have learned in class to give it morphological analysis capabilities.
- I added nouns, verbs, pronouns, adjectives, interjections, and conjunctions to the transducer.
- In my grammar documentation, I created morphTests to test morphological analysis.
- I evaluated my preliminary transducer on text scraped from Wikipedia and the Bible.
Although Yoruba, a language of southwestern Nigeria, has up to 55 million speakers, it is still an under-resourced language. It has limited Google Translate functionality and some documentation, but both could use a lot of improvement.
Drawing on my knowledge of the language from growing up in a Yoruba-speaking household, and on my knowledge of morphological analysis, I aimed to create a viable Apertium transducer that reflects how current speakers use the language. I have been checking with my family members to ensure my implementation is adequate.
Some issues I've run into are handling Yoruba characters and structure correctly with the Apertium tools, and the lack of a standard orthography for the language. Yoruba is a tonal language, and many characters in its vocabulary are accented and can be written differently depending on the region where you learned the language. Early on, this caused some characters not to be represented correctly.
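One way this kind of character mismatch can arise is when the same accented letter is stored as a single precomposed code point in one place and as a base letter plus combining marks in another. The sketch below assumes that is part of the problem (I have not confirmed it is the only cause) and shows how normalizing text to Unicode NFC makes the two spellings compare equal. The helper name `normalize_yoruba` is my own, not part of Apertium.

```python
import unicodedata

def normalize_yoruba(text: str) -> str:
    """Return text in NFC so canonically equivalent spellings become identical."""
    return unicodedata.normalize("NFC", text)

# The same word typed two ways: precomposed letters vs. base letter + combining marks.
precomposed = "\u00e0w\u1ecdn"     # à (U+00E0), w, ọ (U+1ECD), n
decomposed = "a\u0300wo\u0323n"    # a + combining grave, w, o + combining dot below, n

print(precomposed == decomposed)                                      # False
print(normalize_yoruba(precomposed) == normalize_yoruba(decomposed))  # True
```

Applying the same normalization to both the dictionary entries and the corpus before analysis would rule out this source of mismatch. It does not merge tone-marked and unmarked spellings such as àti and ati; those remain different words as far as the analyser is concerned.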
Also, Yoruba doesn't have gendered or plural forms of nouns or adjectives. In some cases you express gender by adding another word. For example, "older sister" in Yoruba would be ẹ̀gbọ́n obìnrin, with the first word meaning older sibling and the second word signifying female.
Similar to Romance languages like French and Spanish, Yoruba also distinguishes verbs and interjections by familiarity; there are different ways to say a number of things depending on the recipient.
To evaluate my transducer, I tested coverage over two large Yoruba corpora: Wikipedia and the Bible.
TOP UNKNOWN WORDS:
26244 ^ni/*ni$
18828 ^tí/*tí$
16783 ^ti/*ti$
16252 ^ní/*ní$
9895 ^àwọn/*àwọn$
7621 ^àti/*àti$
7503 ^je/*je$
7093 ^jẹ́/*jẹ́$
6521 ^ń/*ń$
6085 ^ati/*ati$
5680 ^rẹ̀/*rẹ̀$
5556 ^sí/*sí$
5458 ^ọdún/*ọdún$
5237 ^to/*to$
5182 ^wọ́n/*wọ́n$
4846 ^èdè/*èdè$
4768 ^ṣe/*ṣe$
4717 ^the/*the$
4697 ^fún/*fún$
4513 ^of/*of$
coverage: 287237 / 1283498 (~0.22379232379014225188)
remaining unknown forms: 996261
TOP UNKNOWN WORDS:
9572 ^si/*si$
8152 ^ti/*ti$
3724 ^awọn/*awọn$
3530 ^ni/*ni$
3405 ^fun/*fun$
3204 ^li/*li$
3051 ^ki/*ki$
2841 ^ati/*ati$
2534 ^rẹ̀/*rẹ̀$
2412 ^nwọn/*nwọn$
2069 ^pe/*pe$
1965 ^nyin/*nyin$
1927 ^fi/*fi$
1872 ^bi/*bi$
1649 ^ninu/*ninu$
1648 ^ba/*ba$
1588 ^kò/*kò$
1506 ^ṣe/*ṣe$
1379 ^Ọlọrun/*Ọlọrun$
1278 ^lati/*lati$
coverage: 53863 / 233959 (~0.23022409909428575092)
remaining unknown forms: 180096
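Figures like these can be derived directly from the analyser's output. The following is a minimal sketch, not the script I actually ran: it assumes the output is in Apertium stream format, where each token appears as ^surface/analysis$ and an unknown form carries a leading asterisk, as in the lists above. The function name `coverage`, the sample string, and its `<n>` tag are illustrative only, and escaped characters in the stream are ignored for simplicity.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")

def coverage(analysed_text: str, top_n: int = 20):
    """Return (known, total, top unknown surface forms) for analyser output."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # The analyser marks a form it cannot analyse with a leading asterisk.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    return known, total, unknown.most_common(top_n)

# Hypothetical four-token sample; the analysis of "ilé" is made up for illustration.
sample = "^ni/*ni$ ^ilé/ilé<n>$ ^ni/*ni$ ^tí/*tí$"
known, total, top_unknown = coverage(sample)
print(f"coverage: {known} / {total} (~{known / total:.4f})")
for form, count in top_unknown:
    print(count, form)
```

Because the most frequent unknown forms are short function words, adding even a handful of them to the lexicon should move the coverage figure noticeably.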
I think that with further implementation, this resource would be very useful for the Yoruba-speaking community in Nigeria and the Yoruba diaspora all over the world. I know that my peers who have also grown up with the language but are not aware of its grammatical intricacies would benefit from this transducer as a way to learn more about their native language. I think any addition to the computational tools for Yoruba will be positive, as it is low-resourced compared to other languages with a similar number of speakers.