Difference between revisions of "Building Corpora"

From LING073

Revision as of 03:08, 13 January 2017

Corpora for morphological evaluation

Corpora for disambiguation evaluation

Corpora for MT evaluation

Corpora for UD evaluation

Places to look for corpora

  • Start by seeing what a Google search for resources on your language turns up.
  • Check the List of Wikipedias by language. A Wikipedia probably needs around 1000 articles or more to be useful for this class, since small Wikipedias often consist of formulaic one-sentence articles without a wide range of vocabulary or grammatical structures. Somewhat smaller Wikipedias can still work if other resources are available.
  • See if you can find a whole or partial Bible translation, e.g.
    • Institute for Bible Translation - mostly languages of Russia
    • bible.is - a whole bunch of Bible translations
    • translations may also be available in print or on microfilm from a library (check WorldCat and request them through interlibrary loan, ILL)
  • If you can figure out enough of your language to do an internet search for resources in it, you may be able to find news, entertainment, or informational websites. Also try searching in another "big" language that speakers are likely to know. For example, speakers of Shor may not know English, but basically all of them will know Russian, so resources made available by native speakers may be presented, or at least introduced, only in Russian.
  • Check for Twitter accounts (or similar) of speakers of the language by searching for distinctive words of the language. You may find that a non-standard orthography is used, but this is authentic use of the language, and there are several options for how this data can be used.
  • Try a library search for literature in the language. You may have to know both some of the language and how American and Western European libraries have historically transliterated it. You have essentially the whole world's library resources available to you through ILL, if you do it right.
  • Ask anyone with a potential connection to the language. Look online, around campus, etc. for native speakers, linguists and language experts, activists, or anyone else who might know something about the language, including even just where to find resources.
    • E.g., contact authors of linguistics papers on the language to find out if they're willing to share some data. Some may have field notes or access to rare sources that could serve you well.
  • If your language is particularly small, there may be nothing or nearly nothing available in the language online. If you can't find any print materials either, consider choosing another language for this class. We will do some probing of our languages similar to what's done in linguistic fieldwork, but it will all be done through written material. The expectation is that we will not be able to rely on native speaker consultants for our work in the class.
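
Once you have some text in hand, it can help to check whether it shows the problem noted above for small Wikipedias, namely formulaic one-sentence articles with little variety. The following is a minimal sketch, not a required course tool: the function name `corpus_stats` and the sample sentences are made up for illustration, and the tokenization is deliberately crude. A low type/token ratio together with short, uniform sentences suggests the text is formulaic.

```python
import re
from collections import Counter


def corpus_stats(text: str) -> dict:
    """Return rough size and diversity figures for a candidate corpus.

    Assumption: words are separated by spaces or punctuation and sentences
    end in '.', '!' or '?'. Scripts that rely on combining marks, or that
    do not put spaces between words, need a proper tokenizer instead.
    """
    # Split on runs of sentence-final punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    # \w+ matches runs of Unicode letters, digits and underscores.
    tokens = re.findall(r"\w+", text.lower())
    types = Counter(tokens)
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "types": len(types),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical formulaic text of the kind found in very small Wikipedias.
    sample = "Qala is a village. Tura is a village. Bora is a village."
    for name, value in corpus_stats(sample).items():
        print(f"{name}: {value}")
```

Type/token ratio falls as a text gets longer, so compare sources of similar size rather than reading the number on its own.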