Initial corpus assembly

From LING073

Your first corpus consists of raw text plus a set of sentences extracted from it. You need enough authentic text to fill at least approximately one page, or about 5000 characters.

The text can be drawn from different sources, all ideally allowing for redistribution under a free license. If all the texts you have available are copyrighted, that's fine—just make sure your repo is private and shared only with me (and the course assistant).

The assignment

This assignment is due at the end of the third week of class (midnight at the end of February 3rd, 2023).

  1. For this assignment, create a repository under the Ling073_sp23 group that everyone in your team has access to. Call it ling073-xyz-corpus. You'll build a corpus of authentic text here, from materials as described in the "Places to look for corpora" section of Building Corpora. You may have to type up a certain amount of text from print sources using your new keyboard layout.
  2. To start with, add any full corpora you find to this repository in plain text format. For example, if you find a bible translation or newspaper site online, strip out all the text (ignoring notices, menus, etc.) as best you can (there are tools for this if copy-paste gives sub-optimal results), and put it in a single file per source in this directory. Name these files something like xyz.type.label.txt. Replace the word type with the type of source, e.g., bible, phrases, wikipedia, news, tweets. For label, use a publisher (e.g., IBT for a bible translation from IBT), the date you got the content if the site changes, like Wikipedia or a news site (e.g., 2013-01-28), or an author (e.g., smith if the texts are all from a book by Smith). You can combine these if needed, using an underscore (_).
  3. Add a brief description of each of these files in a separate MANIFEST file. Ideally all of your content will be redistributable (e.g., licensed under one of the Creative Commons licenses), but chances are that you will have to include non-redistributable content as well. So also note in the MANIFEST file which license covers each corpus, or that the original source doesn't say (if it's a print book, though, chances are that it's "all rights reserved"). You can format each line something like the following:
    xyz.news.WN_2023-01-28.txt - contains news from WorldNews™ website, acquired on 2023-01-28, copyright WorldNews Eurasia, Inc.
  4. Choose sentences from a range of these files, and add them to a file xyz.corpus.basic.txt. This file should consist of one sentence per line, and should contain at least 5000 characters. Do everything you can to make sure all of the material you include in this new file is in the same orthography. You may have to manually convert sentences as you copy them over from the other files.
    Note: you can count how many characters are in your file with the following command: wc -m xyz.corpus.basic.txt. From within vim, you can issue the command :!wc -m %, and if you are using gedit, you can check this in Document Tools → Document Statistics.
  5. You should enter English glosses for at least a quarter of the sentences in your basic corpus in an xyz-eng.corpus.basic.txt file, with available glosses on corresponding line numbers. This may result in a number of blank lines throughout the file. You will probably have trouble finding English glosses for a lot of the text (prioritise diversity of sources of text in this file as well), but bible translations should be pretty straightforward to find English equivalents for, and phrase books and grammars almost always provide glosses.
  6. Add a README file to address anything else you think is important, such as issues that came up during corpus assembly that you had to make decisions on.
  7. Add a link to the corpus repository under the new Resources section of your wiki page on the language. Even if it's not "public" (and even "public" repos are Swarthmore-only) due to licensing issues, it's good to document openly that it exists.
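If you want to double-check file names and MANIFEST coverage automatically, the following is a minimal sketch rather than a required tool. It assumes that xyz stands for a three-letter lowercase language code, that labels contain only letters, digits, and hyphens (joined by underscores when combined), and that each MANIFEST line starts with the file name followed by " - " as in the example line above. The function names parse_raw_name and missing_from_manifest are made up for illustration.

```python
import re

# Pattern for raw-source file names of the form xyz.type.label.txt.
# Assumption: a three-letter lowercase language code, a lowercase source type,
# and label parts made of letters, digits, and hyphens joined by underscores.
RAW_NAME = re.compile(
    r"^(?P<lang>[a-z]{3})\.(?P<type>[a-z]+)\."
    r"(?P<label>[A-Za-z0-9-]+(?:_[A-Za-z0-9-]+)*)\.txt$"
)


def parse_raw_name(filename):
    """Return (lang, type, label parts) for a conforming name, else None."""
    match = RAW_NAME.match(filename)
    if match is None:
        return None
    return match["lang"], match["type"], match["label"].split("_")


def missing_from_manifest(filenames, manifest_text):
    """List the file names that no MANIFEST line begins with."""
    described = {
        line.split(" - ", 1)[0].strip()
        for line in manifest_text.splitlines()
        if line.strip()
    }
    return [name for name in filenames if name not in described]
```

For example, parse_raw_name("xyz.news.WN_2023-01-28.txt") splits the label into its publisher and date parts, while a name such as "notes.txt" yields None.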

Sanity check (go through before you submit):

  • Is the orthography in the xyz.corpus.basic.txt file standardised, with a sentence per line? Is your basic corpus at least 5000 characters long?
  • Do you have a MANIFEST file describing each raw corpus and its license?
  • Do you have English glosses (in xyz-eng.corpus.basic.txt) of at least a quarter of the sentences in the basic corpus?
  • Did you include a README in the repo and a link to the repo on the wiki page for your language?
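
Some of the checks above can be scripted. The sketch below is one possible way to do it, not an official grading script. It assumes both files have been read as UTF-8 text, treats a blank line in the gloss file as "no gloss", and counts characters including newlines, which should be close to what wc -m reports. The function name check_basic_corpus and its parameters are hypothetical.

```python
def check_basic_corpus(corpus_text, gloss_text, min_chars=5000, min_glossed=0.25):
    """Report sanity-check figures for a basic corpus and its English glosses."""
    sentences = corpus_text.splitlines()
    glosses = gloss_text.splitlines()
    # Gloss lines beyond the end of the corpus have no sentence to match.
    extra_gloss_lines = max(0, len(glosses) - len(sentences))
    # Pad a shorter gloss file so missing trailing lines count as "no gloss".
    glosses += [""] * (len(sentences) - len(glosses))
    nonblank = sum(1 for s in sentences if s.strip())
    # A sentence counts as glossed when the same line number has a gloss.
    glossed = sum(
        1 for s, g in zip(sentences, glosses) if s.strip() and g.strip()
    )
    return {
        "chars": len(corpus_text),
        "enough_chars": len(corpus_text) >= min_chars,
        "sentences": nonblank,
        "glossed": glossed,
        "enough_glosses": nonblank > 0 and glossed / nonblank >= min_glossed,
        "extra_gloss_lines": extra_gloss_lines,
    }
```

To run it on real files, read each one with open(path, encoding="utf-8").read() and pass the two strings in. Whether the orthography is standardised still has to be checked by eye.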