Building Corpora

From LING073

Corpora for morphological evaluation

Corpora for disambiguation evaluation

Corpora for MT evaluation

You need two corpora with the same contents but in different languages. This is most easily done as two files structured the same way, e.g. each with a list of equivalent sentences in the same order. The bigger the better, though I'm going to require at least 2500 characters in at least one of the languages. If both languages have bible translations, see how much of the bible you can align this way—the more the better!

In terms of naming convention, try xyz-abc.corpustype.txt, where xyz and abc are the two language codes, abc (the second, or "destination" language) is the language that the file contains, and corpustype is the type of corpus (e.g., bible, phrases, or just corpus if it's a mix). Make sure you have the corresponding abc-xyz.corpustype.txt file in your repository too!
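
A quick sanity check along these lines may help before you commit a pair of files. This is only a sketch, assuming bash and the standard wc tool: the function name check_pair is made up for illustration and is not part of any course tooling. It checks that both files of a pair exist, that they have the same number of lines (one sentence per line, in the same order), and that at least one of them reaches 2500 characters.

```bash
# Sketch only: check_pair is a hypothetical helper, not an existing tool.
# Usage: check_pair xyz abc corpustype   (run in the directory holding the files)
check_pair() {
    local src="$1" dst="$2" type="$3"
    local file_dst="${src}-${dst}.${type}.txt"   # contains language $dst
    local file_src="${dst}-${src}.${type}.txt"   # contains language $src
    local f
    for f in "$file_dst" "$file_src"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            return 1
        fi
    done
    # wc -l counts newline characters, so a final line without a newline
    # is not counted. Arithmetic expansion strips any padding wc adds.
    local lines_dst=$(( $(wc -l < "$file_dst") ))
    local lines_src=$(( $(wc -l < "$file_src") ))
    # wc -m counts characters in a UTF-8 locale (bytes in the C locale),
    # newlines included.
    local chars_dst=$(( $(wc -m < "$file_dst") ))
    local chars_src=$(( $(wc -m < "$file_src") ))
    if [ "$lines_dst" -ne "$lines_src" ]; then
        echo "line counts differ: $file_dst=$lines_dst $file_src=$lines_src"
        return 1
    fi
    if [ "$chars_dst" -lt 2500 ] && [ "$chars_src" -lt 2500 ]; then
        echo "too small: $chars_dst and $chars_src characters"
        return 1
    fi
    echo "ok: $lines_dst lines; $chars_dst and $chars_src characters"
}
```

For example, check_pair xyz abc bible would look at xyz-abc.bible.txt and abc-xyz.bible.txt. Equal line counts do not prove the sentences really correspond, so it is still worth spot-checking a few lines by eye.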

You can evaluate your translation system using this corpus by running something like this:

cat xyz-abc.corpustype.txt | apertium -d /path/to/translator abc-xyz > abc-xyz.corpustype.out
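
The resulting .out file contains machine output in language xyz, so it can be compared line by line with the human translation in abc-xyz.corpustype.txt. The sketch below is only a rough exact-match count, not a standard MT metric such as word error rate. The function name exact_matches is made up for illustration, and the sketch assumes both files hold one sentence per line in the same order.

```bash
# Sketch only: exact_matches is a hypothetical helper, not an Apertium tool.
# Prints "<identical lines>/<lines compared>" for two line-aligned files.
exact_matches() {
    local out="$1" ref="$2"
    local total=0 same=0 hyp gold
    # Read both files in lockstep on separate file descriptors;
    # the loop stops at the end of the shorter file.
    while IFS= read -r hyp <&3 && IFS= read -r gold <&4; do
        total=$((total + 1))
        if [ "$hyp" = "$gold" ]; then
            same=$((same + 1))
        fi
    done 3< "$out" 4< "$ref"
    echo "$same/$total"
}
```

A call might look like exact_matches abc-xyz.corpustype.out abc-xyz.corpustype.txt. Exact matches will be rare for an early translator, so the count is most useful for watching whether it goes up as the system improves.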


Corpora for UD evaluation

Places to look for corpora

  • Start by seeing what a Google search for resources on your language turns up.
  • Also check language resource aggregating sites like the following:
  • List of Wikipedias by language - you probably want around 1000 articles or more for it to be useful for this class (small Wikipedias often have formulaic one-sentence articles, without a wide range of vocabulary or grammatical structures), but somewhat smaller Wikipedias can also work if there are other resources available. Anything available will be useful.
    • Note that you can extract text from an entire Wikipedia using the Wikipedia Extractor tool.
    • Also note that many languages' Wikipedias are small enough that they are still being incubated. This means they're hosted on the Wikimedia Incubator until they're big enough to get their own site. The extractor tool above should now support extracting all pages in a language from an incubator dump. Note also that some incubator languages' Wikipedias have almost no content.
  • See if you can find a whole or partial bible translation, e.g.
  • If you can figure out enough of your language to do an internet search for resources in it, you may be able to find news, entertainment, or informational websites. Also try searching for materials in another "big" language that speakers might also know (e.g., speakers of Shor may not know English, but basically all of them will know Russian, so you may only find resources made available by native speakers that are in Russian, or at least introduced in Russian).
  • Check for Twitter accounts (or similar) of speakers of the language by searching for distinctive words of the language. You may find that a non-standard orthography is used, but this is authentic use of the language, and there are several options for how this data can be used.
  • Try a library search for literature in the language. You may have to figure out some words in the language to conduct a successful search, as well as how American and Western European libraries have historically transliterated text in it. You have essentially the whole world's library resources available to you through ILL, if you do it right. Put in requests early!
    • You can also use example sentences in a grammar book if there are enough. Preferably they're in orthographic form or you can figure out how to convert them from the transcription.
    • Pandemic note: many library resources can be obtained in digital format, or I can request a source and scan it for you when it arrives. It also may be worth checking shadow libraries for resources, if that's something you're comfortable with.
  • Ask anyone with a potential connection to the language. Look online, around campus, etc. for native speakers, linguists and language experts, activists, or anyone else who might know something about the language, including even just where to find resources.
    • E.g., contact authors of linguistics papers on the language to find out if they're willing to share some data. Some may have field notes or access to rare sources that could serve you well.
  • If your language is particularly small, there may be nothing or nearly nothing available in the language. If you can't find any print materials either, consider choosing another language for this class. We will do some probing of our languages similar to what's done in linguistic fieldwork, but it will all be done through written material—the expectation is that (except in rare circumstances) we will not be able to rely on native speaker consultants for our work in the class.
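
When weighing a candidate text source from the list above (a small Wikipedia, say), a rough count of running words against distinct words can hint at whether the text is formulaic. This is only a sketch: vocab_stats is a made-up name, and splitting on whitespace is a crude stand-in for real tokenisation, so punctuation stays attached to words and case is not folded.

```bash
# Sketch only: vocab_stats is a hypothetical helper, not an existing tool.
# Prints "<tokens> <types>" for a plain-text file, where tokens are
# whitespace-separated words and types are the distinct tokens.
vocab_stats() {
    local file="$1" tokens types
    # NF is the number of whitespace-separated fields on each line.
    tokens=$(( $(awk '{ n += NF } END { print n + 0 }' "$file") ))
    # Print one token per line, then count the distinct ones.
    # LC_ALL=C makes sort compare raw bytes rather than locale collation.
    types=$(( $(awk '{ for (i = 1; i <= NF; i++) print $i }' "$file" | LC_ALL=C sort -u | wc -l) ))
    echo "$tokens $types"
}
```

A source whose number of types is very low compared with its number of tokens is probably repeating the same few sentence templates and may add little beyond its first few pages.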