What I did
I selected four pages from the Japanese Wikipedia (Hiroshima Peace Park, Grapefruit, Shibori, and Orcs) and copied most of each article. I excluded things like information in tables and sentences with a lot of English. These articles were intended to come from very different domains, and thus to represent a wide range of the Japanese lexicon.
- I implemented numbers by having a digits lexicon which goes in a loop, and then adds various forms like year, day, and hour to the end of a number.
- I added a small number of suffixes which can follow country names (e.g. 軍 (military) and 人 (person)). However, many suffixes which follow country names cannot be produced in this manner, and so a number of words must instead be lexicalised. (For instance, while 語 can be added to the end of many country names to refer to the language of that country, there is not a 1:1 mapping between countries and languages, and so words like 日本語 (the Japanese language) cannot be produced in a regular manner.)
- I added a small number of the ways in which い adjectives can be conjugated. I added the copula (which comes at the end of polite adjective forms) with a plus sign. な adjectives can acquire an attributive tag when they have な, but otherwise inflect like nouns.
- I implemented a number of the most common postpositions.
- I added a small number of sentence final particles.
- I implemented causative, passive, and causative-passive forms for う and る verbs.
- I hard-coded a small number of inflections of する (do). I tokenised the corpus such that there was a space between the noun (or the を) and the する, but it may be that it would be better to lexicalise at least some of the more common of these verbs.
- I added a small number of conjugations of う and る verbs.
- I wrote four twol rules to handle the た, て, and ない forms of う verbs.
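The sound changes that the う verb rules have to capture can be illustrated outside of twol. The following is a minimal Python sketch, not the twol rules themselves: the function name `conjugate_u_verb` and the form labels `"te"`, `"ta"`, and `"nai"` are hypothetical, and the tables reflect the standard textbook pattern for う verbs rather than the analyser's actual rule set. It treats 行く as the one irregular て/た form and does not handle ある, whose negative is simply ない.

```python
# Sketch only: plain Python, not the twol rules described above.
# Maps the final kana of a dictionary-form う verb to its て-form ending.
TE_ENDINGS = {
    "う": "って", "つ": "って", "る": "って",
    "む": "んで", "ぶ": "んで", "ぬ": "んで",
    "く": "いて", "ぐ": "いで", "す": "して",
}

# Maps the final kana to the あ-row kana used before ない (う becomes わ).
A_ROW = {
    "う": "わ", "つ": "た", "る": "ら",
    "む": "ま", "ぶ": "ば", "ぬ": "な",
    "く": "か", "ぐ": "が", "す": "さ",
}


def conjugate_u_verb(dictionary_form: str, form: str) -> str:
    """Return the te, ta, or nai form of a う verb given in dictionary form."""
    stem, last = dictionary_form[:-1], dictionary_form[-1]
    if last not in TE_ENDINGS:
        raise ValueError(f"not a recognised う verb ending: {last}")
    if form == "nai":
        return stem + A_ROW[last] + "ない"
    # 行く is irregular: 行って / 行った rather than 行いて / 行いた.
    te_ending = "って" if dictionary_form == "行く" else TE_ENDINGS[last]
    if form == "te":
        return stem + te_ending
    if form == "ta":
        # The た form mirrors the て form: て becomes た, で becomes だ.
        final = "た" if te_ending[-1] == "て" else "だ"
        return stem + te_ending[:-1] + final
    raise ValueError(f"unknown form: {form}")
```

For example, `conjugate_u_verb("書く", "te")` gives 書いて and `conjugate_u_verb("読む", "ta")` gives 読んだ. In a two-level setup the same alternations would be stated as context-dependent symbol correspondences rather than as table lookups.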
What I did not do
- I did not automate the process of corpus tokenisation. The corpus files were all hand-tokenised.
- I did not implement prefixes (e.g. 毎 for "every", お and ご for honorifics, etc.).
- There are a number of characters (both hiragana and katakana) that do not appear in any stems. This is in large part because I entered most words in kanji rather than in hiragana. Words with these characters can be added, with the exception of を, which is used only as a particle and does not occur inside any word.
- The lexicon of postpositions needs to be greatly expanded, and more thought and research should go into how they are tagged.
- It would likely make more sense to generate the causative-passive form by going from the causative form to the る verb version of the passive form, rather than spelling it out explicitly in both the う and る CP lexicons as I currently do.
- There are a number of forms which follow the short form of verbs and て form which I did not implement.
- Irregular verbs need to be expanded.
- I only added one personal pronoun. There was not a great deal of need for more in order to cover the corpus with which I was working, but Japanese has a large number of pronouns, first-person singular ones in particular, that should be added.
- I added a small number of determiners and demonstrative pronouns, but there are many more that remain to be added.
- Coverage over the Hiroshima article:
- Coverage over the grapefruit article:
- Coverage over the shibori article:
- Coverage over the orc article:
- Coverage over all articles:
Precision and recall would also be useful metrics, but I did not have time to work on them.
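As a rough illustration of how these metrics could be computed, here is a minimal Python sketch. It assumes that coverage means the share of tokens that receive at least one analysis, and that precision and recall are measured over (token, analysis) pairs against a hand-annotated gold standard. The function names and the `analyse` callable are hypothetical stand-ins for a lookup against the compiled analyser.

```python
from typing import Callable, Iterable, List, Set, Tuple


def coverage(tokens: List[str], analyse: Callable[[str], Iterable[str]]) -> float:
    """Fraction of tokens for which `analyse` returns at least one analysis."""
    if not tokens:
        return 0.0
    analysed = sum(1 for token in tokens if list(analyse(token)))
    return analysed / len(tokens)


def precision_recall(
    predicted: Set[Tuple[str, str]], gold: Set[Tuple[str, str]]
) -> Tuple[float, float]:
    """Precision and recall over sets of (token, analysis) pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy stand-in for the analyser: a dictionary from surface form to analyses.
# The tag strings here are invented for the example.
toy_lexicon = {"猫": ["猫<n>"], "が": ["が<post>"], "走る": ["走る<v>"]}


def toy_analyse(token: str) -> List[str]:
    return toy_lexicon.get(token, [])
```

With the toy lexicon, `coverage(["猫", "が", "走る", "xyz"], toy_analyse)` counts three analysed tokens out of four. Coverage computed this way says nothing about whether the analyses are correct, which is why precision and recall against a gold standard would complement it.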