Difference between revisions of "Language selection"

From LING073

Revision as of 13:12, 14 February 2021

In Ling 073, everyone will be applying the topics of the class to an under-resourced language of their choice throughout the semester. Students will [for the most part] work in pairs on a single language, but no two pairs will work on the same language.

Note: If you have a strong desire to work on a language that is normally regarded as entirely "isolating", some accommodations may be made, but you should talk with the professor about it immediately.

Key linguistic concepts

Considerations for language selection

  • Ideally, you should choose a language with at least some interesting morphological processes.
  • You'll need some authentic text (i.e., text produced by native speakers, even if not standardised) in this language, whether from documents found online, an excerpt of published text that you type up, someone's Twitter account, or sample sentences from a grammar. See Places to look for corpora for more info.
  • Because of the above, it's easiest to choose a language with a written standard of one sort or another. Some languages have more than one written standard (which is fine!) and some are subsumed under some other language's written standard (which makes it harder). If the documentation and corpora you identify all use linguists' transcriptions, this can also work, but isn't ideal.
  • You need to choose a language that doesn't have [many] existing computational resources; specific exclusions listed below:

Languages you may not choose

Note: If you really want, you may indicate your preference to work on an Apertium language listed as "incubator", but if you do end up working on it, you will basically be expected to start from scratch for each assignment and ignore what's available from Apertium except to augment your resources later.
  • No languages supported by Giellatekno.
  • No historical languages unless with special permission; there should be some current speech community—ideally L1—even if small
  • No conlangs unless with special permission; again, the point is to build tools that are potentially useful to a language community (of conlangs with L1 speakers, Esperanto speakers have plenty of resources, and the rest—e.g. Klingon-speakers—can fend for themselves)
  • No languages chosen in a previous semester (see below)

Languages chosen in previous semesters

Languages in italics were not implemented in translation pipelines. Again, these are languages to avoid in your selection.

Random list of languages that might work

The following is a list of languages with few to no relevant computational resources which otherwise appear to meet the criteria I set up. If you need some inspiration, this list could be a good place to start.

  • Western Abenaki
  • Kabardian
  • Lakota
  • Shor
  • Ndebele
  • Arrernte
  • Iatmul
  • Tiwi
  • Beja
  • Garifuna
  • Arhuaco/Ikʉ
  • Mapudungun
  • Maithili
  • Santali
  • Waray
  • Kikamba
  • Biak
  • Konkani or Rohingya
  • Platduuts (nds-nl) or Plattdüütsch (nds)
  • Alemannisch (any southern German)
  • Kinyarwanda
  • Lepcha
  • Pontic Greek
  • Somali
  • Tigre
  • Pʼurhépecha
  • Kabyle
  • Mandinka
  • Lezgian
  • Denaʼina
  • Wakhi
  • Luri
  • Lari
  • Mixe
  • Chatino
  • Oromo
  • Tamasheq
  • Kanza
  • Magahi
  • Saraiki
  • Jicarilla Apache
  • Mazandarani
  • Udege
  • Lenakel/Netwar

A few languages that used to be on this list have since had other people (elsewhere) do some work on them: Evenki, Bhojpuri.

The assignment

By the beginning of the Thursday class during the first week of classes (this semester: 14:00 on 16 February 2021), turn in the following:

  1. Make a page on the wiki:
    • Create a "Language selection" page under your userpage (wikis.swarthmore.edu/ling073/User:student1/Language_selection, replacing student1 with your username).
    • At the very top, mention who you might like to work with in a pair. This could be anything from "someone who knows linguistics really well" or "someone who is good with computers" to a specific person (in which case, link to their language selection page!), or a note that you're not sure or don't care.
    • List in order of preference three languages you might like to work on this semester. There are some examples given above, but don't limit yourself to those. There are thousands of languages to choose from!
  2. Document some things for each language:
    • For each language, determine as best you can with the resources available a morphological typology of the language. E.g., is it primarily isolating, agglutinative, etc., and how do you know? Are there patterns in that language that reflect more than one morphological type? If there is inflectional morphology (ideally the language you choose will have some!), what sorts of strategies are used (prefixation, suffixation, etc.)?
    • Determine basic information about each language. How many speakers are there, where do they live, what other languages might they know, what is the status of the language in terms of its transmission to current and future generations, is there a normative orthography of some sort? What is the orthography like (what script / any interesting features / multiple official/historical orthographies / etc.)? Provide ISO 639 codes used for the language, especially three-letter ones. Basically all of this information should be findable on Ethnologue and Wikipedia (in one language or other), but feel free to use any source that seems reliable (academic papers, census data, etc.). Cite the sources you use.
    • Give some estimation of how likely it will be for you to find at least a few pages' worth of text in this language. In other words, see if you can find something online quickly: websites in the language, a translation of the Bible or the Universal Declaration of Human Rights, a blog, a grammar book with lots of examples, etc. Don't limit yourself to online resources; if library resources exist (even if not available at Swarthmore), that can also work! (If it's not at all likely that you can have some amount of text in the language on your screen or in your hand within a week or two, you probably should find some other language to work on!)
  3. Clean up the page
    • Include a category tag for sp21_LanguageSelection and one for the name of each language. You should have four category tags on your page, e.g. [[Category:sp21_LanguageSelection]], [[Category:Abkhaz]], and one each for the other two languages.
    • Make use of MediaWiki formatting markup. E.g., each language can be a section, data can be formatted as bullet points or in tables, citations should make use of proper macros, etc. You can see how MW markup works simply by going to edit an existing page and examining the source used to produce various elements.
  • NOTES
    • Note that conflicts of first choice will be resolved in class on Thursday, but in cases of an impasse, the first person to post their interest in the language to the wiki will get their earlier choice, and the other party will get a subsequent choice.
    • Feel free to examine language selection pages from previous years (e.g., sp17_LanguageSelection, sp18_LanguageSelection, sp19_LanguageSelection), but don't copy stuff wholesale—and note that a number of those languages have already been done so you can't choose them anyway :)
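
As a rough illustration of steps 1 to 3, a page skeleton might look something like the sketch below. This is only one possible layout, not a required template. Everything in CAPS is a placeholder to fill in, "student2" is a hypothetical username, and Abkhaz appears only because it is the example category used above. The `<ref>` and `<references />` tags assume the wiki has the Cite extension enabled; if it doesn't, check an existing page's source to see how citations are done there.

```mediawiki
<!-- Hypothetical skeleton: CAPS items and "student2" are placeholders, not real data -->
Possible partner: someone who is good with computers, or [[User:student2/Language_selection|student2]].

== Abkhaz ==
=== Morphological typology ===
* Primary type: TYPE, based on EVIDENCE.<ref>SOURCE</ref>
* Inflectional strategies: PREFIXATION / SUFFIXATION / ETC.

=== Basic information ===
{| class="wikitable"
! Property !! Value
|-
| Speakers || NUMBER<ref>SOURCE</ref>
|-
| Where spoken || PLACES
|-
| Other languages known || LANGUAGES
|-
| Transmission status || STATUS
|-
| Orthography || SCRIPT AND NOTES
|-
| ISO 639 codes || CODES
|}

=== Availability of text ===
* LIKELIHOOD, with links to what you found.

== SECOND LANGUAGE ==
<!-- same subsections as above -->

== THIRD LANGUAGE ==
<!-- same subsections as above -->

== References ==
<references />

[[Category:sp21_LanguageSelection]]
[[Category:Abkhaz]]
[[Category:SECOND LANGUAGE]]
[[Category:THIRD LANGUAGE]]
```

Giving each language its own section with the same subsections keeps the three candidates easy to compare, and putting the four category tags at the bottom matches the usual MediaWiki convention.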