Difference between revisions of "User:Nfeldba1/Final project"
Latest revision as of 13:21, 11 May 2017
The Project
For my final project, I expanded my transducer until it reached 85% coverage, which involved adding words, disambiguations, and extra suffixes.
My code can be found here: https://github.com/nfeldbaum/Khasi_Transducer
The Results
I managed to reach 85.15% coverage on my initial 50,000-word corpus! However, to reach this figure I ended up adding a significant number of the English words that appear in the corpus. A truer measure of corpus coverage is the 80.50% I get when I delete all the words that aren't part of Khasi.
Instead of testing precision and recall against hand-annotated, randomly selected forms, I decided to gather a further 25,000 words in order to test my transducer on a corpus it hadn't been developed against. This new corpus can be found in the repository linked above under ling073-kha-corpus/kha.corpus.large.test.txt. On this test corpus, I achieved 83.39% coverage with English words included and 80.50% coverage without English words, exactly the same as on my initial corpus.
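The coverage figures above are the share of corpus tokens that receive at least one analysis from the transducer. Here is a minimal sketch of how such a figure could be computed. It is not the script used for this project. It assumes the analysed corpus is in an Apertium-style stream format, where each token looks like ^surface/analysis$ and an analysis starting with * marks an unknown form. The function name, the sample string, and the <x> tag are made up for illustration.

```python
import re
from collections import Counter

# Assumption: tokens look like "^surface/analysis$", and an analysis that
# starts with "*" means the transducer has no entry for that surface form.
TOKEN_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (coverage in percent, Counter of unknown surface forms)."""
    total = 0
    unknown = Counter()
    for surface, analyses in TOKEN_RE.findall(analysed_text):
        total += 1
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    pct = 100.0 * known / total if total else 0.0
    return pct, unknown


# Invented sample: two analysed tokens and two unknown ones.
sample = "^ka/ka<x>$ ^kam/*kam$ ^bynta/*bynta$ ^ka/ka<x>$"
pct, unknown = coverage(sample)
print(f"{pct:.2f}%")  # prints 50.00%
print(unknown.most_common())
```

Running the same function over the initial corpus and over the held-out test corpus would give the two figures being compared, and the counter of unknown forms shows which words to add next.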