What I did
I selected four pages from the Japanese Wikipedia (Hiroshima Peace Park, Grapefruit, Shibori, and Orcs) and copied most of each article. I excluded things like information in tables and sentences with a lot of English. These articles were intended to come from very different domains, and thus to represent a wide range of the Japanese lexicon.
- I added a small number of suffixes which can follow country names (e.g. 軍 (military) and 人 (person)). However, many suffixes which follow country names cannot be produced in this manner, and so a number of words must instead be lexicalised. (For instance, while 語 can be added to the end of many country names to refer to the language of that country, there is not a 1:1 mapping between countries and languages, and so words like 日本語 (the Japanese language) cannot be produced in a regular manner.)
- I added a small number of い adjective conjugations, including adverbialisation. I added the copula (which comes at the end of polite adjective forms) with a plus sign. な adjectives can acquire an attributive tag when they have な, but otherwise inflect like nouns.
- I implemented a number of the most common postpositions.
- I added a small number of sentence final particles.
- I implemented causative, passive, and causative-passive forms for う and る verbs.
- I added a small number of conjugations of う and る verbs.
- I implemented numbers by having a digits lexicon which goes in a loop, and then adds various forms like year, day, and hour to the end of a number.
- I added a moderate number of nouns to the lexicon, based on common words in the peace park, grapefruit, and shibori articles.
- I added a small number of adjectives, verbs, and conjunctions.
- I hard-coded a small number of inflections of する (do). I tokenised the corpus such that there was a space between the noun (or the を) and the する, but it may be that it would be better to lexicalise at least some of the more common of these verbs.
- I wrote four twol rules to handle the conjugations of う verbs for the た, て, and ない forms.
- I added kanji to the twol alphabet as I added stems with those kanji to the lexicon.
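
The sound changes that the う verb rules have to capture can be illustrated outside the transducer. The following is a minimal Python sketch of the standard textbook pattern for the た, て, and ない forms. It is not the twol rules themselves: the function name `conjugate_u_verb` and both tables are hypothetical, and the only irregularity it handles is 行く.

```python
# Illustrative only: a plain lookup table, not the twol rules in the transducer.
# Keys are the final kana of the dictionary form of an う (godan) verb.
TE_TA_ENDINGS = {
    "う": ("って", "った"), "つ": ("って", "った"), "る": ("って", "った"),
    "む": ("んで", "んだ"), "ぶ": ("んで", "んだ"), "ぬ": ("んで", "んだ"),
    "く": ("いて", "いた"),
    "ぐ": ("いで", "いだ"),
    "す": ("して", "した"),
}

# For the ない form the final kana moves to the あ row (う becomes わ).
NAI_STEMS = {
    "う": "わ", "つ": "た", "る": "ら", "む": "ま", "ぶ": "ば",
    "ぬ": "な", "く": "か", "ぐ": "が", "す": "さ",
}


def conjugate_u_verb(dictionary_form: str, form: str) -> str:
    """Return the て, た, or ない form of an う verb given in dictionary form.

    `form` is one of "te", "ta", "nai" (hypothetical labels for this sketch).
    Irregular ある (negative ない) is deliberately not handled.
    """
    stem, ending = dictionary_form[:-1], dictionary_form[-1]
    if ending not in NAI_STEMS:
        raise ValueError(f"not an う verb ending: {ending}")
    if form == "nai":
        return stem + NAI_STEMS[ending] + "ない"
    if form not in ("te", "ta"):
        raise ValueError(f"unknown form: {form}")
    index = 0 if form == "te" else 1
    # 行く is irregular: it takes って/った rather than いて/いた.
    if dictionary_form in ("行く", "いく"):
        return stem + TE_TA_ENDINGS["う"][index]
    return stem + TE_TA_ENDINGS[ending][index]


if __name__ == "__main__":
    for verb in ("書く", "飲む", "買う", "話す", "行く"):
        print(verb, [conjugate_u_verb(verb, f) for f in ("te", "ta", "nai")])
```

A table like this is also a convenient source of expected forms when checking what the transducer generates for a given stem.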
What I did not do
- I did not automate the process of corpus tokenisation. The corpus files were all hand-tokenised.
- I did not end up creating yaml files to test the morphology. This would have been useful.
- I did not touch prefixes. There are not a large number of them in Japanese, but they do exist, like 毎 (every), and お and ご for honorific, etc. While words with 毎 could be lexicalised, that seems less practical for the honorifics.
- I did not deal with long multi-word expressions such as というわけではない, and I am unsure how best to handle these.
- The lexicon of postpositions needs to be greatly expanded, and more thought and research should go into how they are tagged.
- It would likely make more sense to generate the causative-passive form by continuing from the causative form into the る verb version of the passive form, rather than listing it explicitly in both the う and る CP lexicons as I currently do.
- I tagged て form as cvb_te. There is undoubtedly a better tag to give it.
- There are a number of forms which follow the short form of verbs and て form which I did not implement.
- Irregular verbs need to be expanded.
- There are a number of adjective, noun, and verb conjugations in general which I did not add.
- I only added one personal pronoun. The corpus I was working with did not need many more, but Japanese has a large number of personal pronouns, first-person singular ones in particular, that should be added.
- I added a small number of determiners and demonstrative pronouns, and there are many more.
- There are a number of characters (both hiragana and katakana) that do not appear in any stems, in large part because I entered most words in kanji rather than in hiragana. Words with these characters can be added, with the exception of を, because no words contain it. The missing characters make it more difficult to ascertain what the top unknown words are.
- There are many more kanji which need to be added to the twol alphabet. They should also be organised better.
- Coverage over the Hiroshima article:
- Coverage over the grapefruit article:
- Coverage over the shibori article:
- Coverage over the orc article:
- Coverage over all articles:
Precision and recall would also be useful metrics, but I did not have time to work on them.
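
For reference, naive coverage (the share of tokens that receive at least one analysis) and the list of top unknown words can both be computed from the analyser output. The sketch below assumes the output is in Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token has an analysis starting with `*`. The function name `coverage` and the tags in the sample string are hypothetical.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str):
    """Return (coverage ratio, Counter of unknown surface forms).

    Assumes unknown tokens are marked with a leading '*' in the analysis.
    """
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    ratio = known / total if total else 0.0
    return ratio, unknown


if __name__ == "__main__":
    # Made-up sample; the tags are placeholders, not the tagset used here.
    sample = "^公園/公園<n>$ ^を/を<post>$ ^歩く/*歩く$"
    ratio, unknown = coverage(sample)
    print(f"coverage: {ratio:.1%}")
    print("top unknown words:", unknown.most_common(10))
```

Precision and recall would additionally need a hand-annotated gold standard to compare the analyses against, which this sketch does not attempt.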