User:Cayoh1/Final project

From LING073

Outline

I have begun building a preliminary, basic transducer for Yoruba, my native Nigerian language. I used what we learned in class to give the transducer morphological analysis capabilities.

  • I added nouns, verbs, pronouns, adjectives, interjections, and conjunctions to the lexd file.
  • In my grammar documentation, I created morphTests to test morphological analysis.
  • I used text scraped from Wikipedia and the Bible to evaluate my preliminary transducer.

Code/Resources

GitHub

Grammar Documentation

Solution

Background

Although there are up to 55 million speakers of Yoruba, a language from southwestern Nigeria, it is still under-resourced. It has limited Google Translate functionality and some documentation, but both could use a lot of improvement.

Approach

Drawing on my knowledge of the language from growing up in a Yoruba-speaking household, and on my knowledge of morphological analysis, I aimed to create a viable Apertium transducer that reflects how current speakers use the language. I have been checking with family members to make sure my implementation is adequate.

Issues

Some issues I've run into are correctly implementing Yoruba characters and structure with Apertium tools, and the lack of a standard orthography for the language. Yoruba is a tonal language, and many characters in its vocabulary are accented and can be written differently depending on the region where you learned the language. Early on, this led to some difficulty with characters not being represented correctly.
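One common source of this kind of character problem is that the same accented letter can be stored as different Unicode code point sequences. The following is a minimal Python sketch, not part of the transducer itself, that assumes the corpus text and the lexd entries may differ in this way. The helper names normalize_yoruba and strip_tone_marks are hypothetical, and whether dropping tone marks is appropriate for analysis is an open design question rather than something this project has settled.

```python
import unicodedata

# Combining grave, acute, and macron: the marks assumed here to carry tone.
# The dot below (U+0323) is deliberately NOT included, because it
# distinguishes separate letters such as e/ẹ, o/ọ, and s/ṣ.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def normalize_yoruba(text: str) -> str:
    """Return text in NFC so that equivalent encodings compare equal."""
    return unicodedata.normalize("NFC", text)


def strip_tone_marks(text: str) -> str:
    """Remove tone diacritics but keep the dot below (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    # The same word typed two ways: e + dot below + acute, or é + dot below.
    variant_a = "je\u0323\u0301"
    variant_b = "j\u00e9\u0323"
    print(variant_a == variant_b)
    print(normalize_yoruba(variant_a) == normalize_yoruba(variant_b))
    print(strip_tone_marks("\u00e0ti"))
```

Normalizing both the corpus and the lexicon to one form before compiling or analysing would make forms that look identical on screen also match byte for byte. Stripping tone marks is a separate and lossy step, shown only to illustrate how a fully marked spelling relates to an unmarked one.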

Vowels Used in Yoruba

Also, Yoruba doesn't have gendered or plural forms of nouns or adjectives. In some cases you express gender by adding another word: for example, "older sister" in Yoruba is ẹ̀gbọ́n obìnrin, where the first word means older sibling and the second signifies female. Similar to Romance languages like French and Spanish, Yoruba also distinguishes verbs and interjections by familiarity; there are different ways to say a number of things depending on the recipient.

Evaluation

To evaluate my transducer, I tested coverage over two large Yoruba corpora: Wikipedia and the Bible.

Wikipedia

TOP UNKNOWN WORDS:

 26244 ^ni/*ni$
 18828 ^tí/*tí$
 16783 ^ti/*ti$
 16252 ^ní/*ní$
  9895 ^àwọn/*àwọn$
  7621 ^àti/*àti$
  7503 ^je/*je$
  7093 ^jẹ́/*jẹ́$
  6521 ^ń/*ń$
  6085 ^ati/*ati$
  5680 ^rẹ̀/*rẹ̀$
  5556 ^sí/*sí$
  5458 ^ọdún/*ọdún$
  5237 ^to/*to$
  5182 ^wọ́n/*wọ́n$
  4846 ^èdè/*èdè$
  4768 ^ṣe/*ṣe$
  4717 ^the/*the$
  4697 ^fún/*fún$
  4513 ^of/*of$

coverage: 287237 / 1283498 (~0.22379232379014225188)

remaining unknown forms: 996261

The Bible

TOP UNKNOWN WORDS:

  9572 ^si/*si$
  8152 ^ti/*ti$
  3724 ^awọn/*awọn$
  3530 ^ni/*ni$
  3405 ^fun/*fun$
  3204 ^li/*li$
  3051 ^ki/*ki$
  2841 ^ati/*ati$
  2534 ^rẹ̀/*rẹ̀$
  2412 ^nwọn/*nwọn$
  2069 ^pe/*pe$
  1965 ^nyin/*nyin$
  1927 ^fi/*fi$
  1872 ^bi/*bi$
  1649 ^ninu/*ninu$
  1648 ^ba/*ba$
  1588 ^kò/*kò$
  1506 ^ṣe/*ṣe$
  1379 ^Ọlọrun/*Ọlọrun$
  1278 ^lati/*lati$

coverage: 53863 / 233959 (~0.23022409909428575092)

remaining unknown forms: 180096
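The coverage figures and unknown-word lists above follow a simple pattern: count every analysed token, and treat a token as unknown when its analysis starts with an asterisk, as in ^ni/*ni$. The sketch below shows one way such numbers could be computed in Python. It is not the script that produced the results above, the function name coverage is hypothetical, and the regular expression is a simplification that ignores escaped characters in the Apertium stream format.

```python
import re
from collections import Counter

# Matches one lexical unit of the form ^surface/analysis1/analysis2$.
# Simplifying assumption: no escaped ^, / or $ inside the unit.
LEXICAL_UNIT = re.compile(r"\^([^/^$]+)/([^$]*)\$")


def coverage(analysed: str, top_n: int = 20):
    """Return (known, total, ratio, top unknown forms) for analyser output."""
    known = 0
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed):
        total += 1
        if analyses.startswith("*"):
            # An asterisk marks a form the transducer could not analyse.
            unknown[surface] += 1
        else:
            known += 1
    ratio = known / total if total else 0.0
    return known, total, ratio, unknown.most_common(top_n)


if __name__ == "__main__":
    # Tiny made-up sample; the <n> tag is illustrative only.
    sample = "^ni/*ni$ ^ilé/ilé<n>$ ^ni/*ni$ ^tí/*tí$"
    known, total, ratio, top = coverage(sample)
    print(f"coverage: {known} / {total} (~{ratio})")
    for form, count in top:
        print(f"{count:6d} ^{form}/*{form}$")
```

The "remaining unknown forms" figure is then total minus known, which matches the numbers reported above (for example, 1283498 minus 287237 is 996261 for Wikipedia).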

Moving Forward

I think that with further work, this resource would be very useful for the Yoruba-speaking community in Nigeria and the Yoruba diaspora all over the world. I know that peers of mine who also grew up with the language but are not aware of its grammatical intricacies would benefit from this transducer as a way to learn more about their native language. I think any addition to the computational tools for Yoruba will be positive, as it is low-resource compared to other languages with a similar number of speakers.