For my final project, I expanded my Sorani Kurdish (ckb) finite-state transducer and increased its coverage on a large corpus. I also enabled spell-checking for my package by adding a speller directory with a list of character pairs and weights. I licensed my work under the MIT License. This Wikipedia page outlines the work I did and reports several evaluation metrics. For more details, see my project poster.
My transducer is written in lexd format. It contains nearly 2000 stems, the majority of which are nouns. Most of the stems were obtained by copying Kurdish lemmas from Apertium's bilingual Sorani-English transducer. Much of the language's basic morphology has been implemented, including definiteness markers, pluralization, subject pronoun suffixes, past and non-past conjugations, and some tenses of the subjunctive mood. Because Sorani is agglutinative and the ending of a word affects how certain suffixes attach to it, the transducer relies on many twol rules.
Among other things, the speller directory contains .txt files with single- and multi-character mappings for possible misspelling pairs. Each mapping has its own weight. A lower weight makes a character mapping more likely to be used when the speller suggests corrections for a misspelled word.
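To illustrate how weighted mappings can rank suggestions, here is a minimal Python sketch. It is not the speller's actual implementation and does not read the project's .txt files. The function `suggest`, the list `EXAMPLE_MAPPINGS`, the placeholder Latin strings, and the weights are all hypothetical. The sketch applies one mapping at a time, keeps only candidates found in a toy lexicon, and sorts them so that the lowest weight comes first.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical (typed, intended, weight) mappings using placeholder strings.
# These are NOT taken from the project's speller files.
EXAMPLE_MAPPINGS: List[Tuple[str, str, float]] = [
    ("k", "q", 1.0),   # single-character mapping
    ("k", "x", 2.0),   # same typed character, less likely target
    ("sh", "s", 0.5),  # multi-character mapping
]


def suggest(
    word: str,
    lexicon: Set[str],
    mappings: List[Tuple[str, str, float]],
) -> List[Tuple[str, float]]:
    """Return (candidate, weight) pairs, lowest weight first.

    Each mapping is applied at every position where it matches, one
    replacement at a time. Candidates that are not in the lexicon are dropped.
    """
    best: Dict[str, float] = {}
    for typed, intended, weight in mappings:
        start = word.find(typed)
        while start != -1:
            candidate = word[:start] + intended + word[start + len(typed):]
            # Keep the cheapest way of reaching each valid candidate.
            if candidate in lexicon and weight < best.get(candidate, float("inf")):
                best[candidate] = weight
            start = word.find(typed, start + 1)
    # Sort by weight, then alphabetically so ties are deterministic.
    return sorted(best.items(), key=lambda item: (item[1], item[0]))
```

With the toy lexicon `{"qat", "xat"}`, calling `suggest("kat", {"qat", "xat"}, EXAMPLE_MAPPINGS)` lists `qat` before `xat`, because the `k` to `q` mapping carries the lower weight.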
- 63% coverage over Sorani Kurdish Wikipedia (~2 million words)
- 73% coverage over Sorani Kurdish Bible excerpt (~1000 words)
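The coverage figures above can be read as the share of corpus tokens that receive at least one analysis. That definition is an assumption here, since the exact counting method is not stated. The Python sketch below shows the calculation under that assumption. The function `coverage` and the `analyze` callback are hypothetical stand-ins for a lookup against the compiled transducer.

```python
from typing import Callable, Iterable, List


def coverage(tokens: Iterable[str], analyze: Callable[[str], List[str]]) -> float:
    """Fraction of tokens for which `analyze` returns at least one analysis.

    `analyze` is a hypothetical callback; in practice it would wrap a lookup
    in the transducer. An empty corpus yields 0.0.
    """
    total = 0
    covered = 0
    for token in tokens:
        total += 1
        if analyze(token):  # a non-empty list means the token is covered
            covered += 1
    return covered / total if total else 0.0
```

For example, if 3 of 4 tokens receive an analysis, `coverage` returns 0.75.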
Precision & Recall
- 98.1% precision
- 68.2% recall
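For reference, precision is the share of produced analyses that are correct, and recall is the share of expected analyses that were produced. The sketch below computes both from two sets. It assumes analyses are compared as exact (form, analysis) pairs against a hand-annotated gold standard, which may differ from how the figures above were obtained. The name `precision_recall` and the sample tags are hypothetical.

```python
from typing import Set, Tuple

Analysis = Tuple[str, str]  # (surface form, analysis string)


def precision_recall(
    predicted: Set[Analysis], gold: Set[Analysis]
) -> Tuple[float, float]:
    """Return (precision, recall) for predicted analyses against gold ones."""
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For instance, if the transducer produces 3 analyses of which 2 are in a gold set of 4, precision is 2/3 and recall is 2/4.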
There are currently 1916 stems in the transducer, with 1235 noun stems.
The spell checker correctly outputs 'C' for correct spellings that exist in the transducer, 'W' for incorrect spellings, and 'S' for suggestions. Because I have not yet written many twol narrowing rules, the transducer still accepts some invalid forms, so the checker currently fails to flag certain misspellings.