Difference between revisions of "Language selection"

From LING073
m (The assignment)
(Random list of languages that might work)
(21 intermediate revisions by the same user not shown)
Line 7: Line 7:
 
'''Morphological typology''' is a simple way of talking in approximate terms about how words in a language are generally formed.
 
'''Morphological typology''' is a simple way of talking in approximate terms about how words in a language are generally formed.
  
You reference the [[:wikipedia:Morphological typology|Wikipedia article on Morphological typology]] to get a good overview of the possible categories and to figure out how a given language (whose grammar you're able to learn a little bit about) might fit the typology.
+
You can reference the [[:wikipedia:Morphological typology|Wikipedia article on Morphological typology]] to get a good overview of the possible categories and to figure out how a given language (whose grammar you're able to learn a little bit about) might fit the typology.
  
 
== Considerations for language selection ==
 
== Considerations for language selection ==
Line 18: Line 18:
 
=== Languages you may not choose ===
 
=== Languages you may not choose ===
 
* No languages supported by big corporate translation systems like '''[http://translate.google.com Google translate]''', '''[https://www.bing.com/translator Bing Translator]''', or '''[https://translate.yandex.com/ Yandex Translate]''', however poor-quality.
 
* No languages supported by big corporate translation systems like '''[http://translate.google.com Google translate]''', '''[https://www.bing.com/translator Bing Translator]''', or '''[https://translate.yandex.com/ Yandex Translate]''', however poor-quality.
* No languages supported by '''[https://beta.apertium.org Apertium]''' (you can also check the packages on [https://github.com/apertium/ Apertium's Github account]).
+
* No languages supported by '''[https://beta.apertium.org Apertium]''' (you should check either the [https://apertium.github.io/apertium-on-github/source-browser.html Apertium source browser] or the packages listed on [https://github.com/apertium/ Apertium's Github account]).
:: Note: ''If you really want, you may indicate your preference to work on an Apertium language listed as "incubator", but if you do end up working on it, you will basically be expected to start from scratch for each assignment and ignore what's available from Apertium except to augment your resources later''
+
:: Note: ''If you really want, you may indicate your preference to work on an Apertium language listed as "[https://github.com/apertium/apertium-incubator incubator]", but if you do end up working on it, you will basically be expected to start from scratch for each assignment and ignore what's available from Apertium except to augment your resources later''
 
* No [http://giellatekno.uit.no/all-lang.eng.html languages supported by '''Giellatekno'''].
 
* No [http://giellatekno.uit.no/all-lang.eng.html languages supported by '''Giellatekno'''].
 
* No '''historical languages''' unless with special permission; there should be <em>some</em> current speech community—ideally L1—even if small.  Here I do not mean to exclude languages which may have recently lost their last fluent speakers, but languages like Classical Maya, Middle Japanese, or Old Church Slavonic.
 
* No '''historical languages''' unless with special permission; there should be <em>some</em> current speech community—ideally L1—even if small.  Here I do not mean to exclude languages which may have recently lost their last fluent speakers, but languages like Classical Maya, Middle Japanese, or Old Church Slavonic.
Line 26: Line 26:
  
 
=== Languages chosen in previous semesters ===
 
=== Languages chosen in previous semesters ===
Languages in italics were not implemented in translation pipelines.  Again, these are languages to avoid in your selection.
+
Languages in italics were not implemented in translation pipelines.  Again, these are languages to '''avoid''' in your selection.
 
<div style="column-count:3;-moz-column-count:3;-webkit-column-count:3">
 
<div style="column-count:3;-moz-column-count:3;-webkit-column-count:3">
 +
* [[Adyghe]] (2022)
 
* [[Ainu]] (2017)
 
* [[Ainu]] (2017)
 
* [[Berik]] (2018)
 
* [[Berik]] (2018)
Line 41: Line 42:
 
* [[Fijian]] (2018)
 
* [[Fijian]] (2018)
 
* [[Guarani]] (2017)
 
* [[Guarani]] (2017)
 +
* [[Hokkien]] (2022)
 
* ''[[Hiaki]]'' (2017)
 
* ''[[Hiaki]]'' (2017)
 
* ''[[Innu]]'' (2017)
 
* ''[[Innu]]'' (2017)
Line 50: Line 52:
 
* [[Kikuyu]] (2017)
 
* [[Kikuyu]] (2017)
 
* [[Ladino]] (2021)
 
* [[Ladino]] (2021)
 +
* [[Lakota]] (2022)
 
* [[Lingala]] (2017)
 
* [[Lingala]] (2017)
 
* [[Magahi]] (2021)
 
* [[Magahi]] (2021)
 
* [[Miskito]] (2021)
 
* [[Miskito]] (2021)
 +
* [[Mixe|Totontepec Mixe]] (2022)
 
* [[Miyako]] (2017)
 
* [[Miyako]] (2017)
 +
* [[Navajo]] (2022)
 
* ''[[Neapolitan]]'' (2017)
 
* ''[[Neapolitan]]'' (2017)
 +
* [[Nheengatú]] (2022)
 
* ''[[Neo-Aramaic (Assyrian)]]'' (2018)
 
* ''[[Neo-Aramaic (Assyrian)]]'' (2018)
 
* [[Nivkh]] (2019)
 
* [[Nivkh]] (2019)
 +
* [[Nuosu]] (2022)
 
* [[Okinawan]] (2017)
 
* [[Okinawan]] (2017)
 
* [[Purépecha|Pʼurhépecha]] (2021)
 
* [[Purépecha|Pʼurhépecha]] (2021)
 +
* [[Siberian Yupik]] (2022)
 
* [[Central Kurdish|Sorani Kurdish]] (2021)
 
* [[Central Kurdish|Sorani Kurdish]] (2021)
 
* ''[[Standard Tibetan]]'' (2018)
 
* ''[[Standard Tibetan]]'' (2018)
Line 65: Line 73:
 
* [[Tongan]] (2017)
 
* [[Tongan]] (2017)
 
* [[Wamesa]] (2017)
 
* [[Wamesa]] (2017)
 +
* [[Waray]] (2021)
 
* [[Wôpanâak]] (2017)
 
* [[Wôpanâak]] (2017)
 
* [[Warlpiri]] (2017)
 
* [[Warlpiri]] (2017)
 
* ''[[Yoruba]]'' (2021)
 
* ''[[Yoruba]]'' (2021)
 +
 
</div>
 
</div>
  
 
== Random list of languages that might work ==
 
== Random list of languages that might work ==
The following is a list of languages with few to no relevant computational resources which otherwise appear to meet the criteria I set up.  If you need some inspiration, this list could be a good place to start.
+
The following is a list of languages with few to no relevant computational resources which otherwise appear to meet the criteria I set up.  If you need some <b>inspiration</b>, this list could be a good place to start.
 
<div style="column-count:3;-moz-column-count:3;-webkit-column-count:3">
 
<div style="column-count:3;-moz-column-count:3;-webkit-column-count:3">
 
* Western Abenaki
 
* Western Abenaki
 
* Kabardian
 
* Kabardian
* Lakota
 
 
* Shor
 
* Shor
 
* Ndebele
 
* Ndebele
Line 84: Line 93:
 
* Arhuaco/Ikʉ
 
* Arhuaco/Ikʉ
 
* Mapudungun
 
* Mapudungun
* Maithili
 
* Waray
 
 
* Kikamba
 
* Kikamba
 
* Rohingya
 
* Rohingya
Line 92: Line 99:
 
* Lepcha
 
* Lepcha
 
* Pontic Greek
 
* Pontic Greek
* Somali
 
 
* Tigre
 
* Tigre
 
* Kabyle
 
* Kabyle
Line 101: Line 107:
 
* Luri
 
* Luri
 
* Lari
 
* Lari
* Mixe
+
* a Mixe variety (other than <code>mto</code>)
 
* Chatino
 
* Chatino
* Oromo
 
 
* Tamasheq
 
* Tamasheq
 
* Kanza
 
* Kanza
Line 111: Line 116:
 
* Udege
 
* Udege
 
* Lenakel/Netwar
 
* Lenakel/Netwar
* Nheengatu
 
 
* Nauru
 
* Nauru
* Fula/Pular
+
* Fula/Pular/Fulani/Fulfulde
* Twi
 
 
* Newari
 
* Newari
 
* Garhwali
 
* Garhwali
Line 131: Line 134:
 
* Marshallese
 
* Marshallese
 
* Dolgan
 
* Dolgan
 +
* Rgyalrong
 +
* Dangaura Tharu
 +
* Balochi
 +
* a Zapotec variety (other than <code>zab</code>)
 +
* a Cree variety (other than <code>moe</code>)
 +
* Sebat Bet Gurage
 +
* Ladakhi
 +
* a Mixtec variety
 +
* Balti
 +
* Akan
 +
* Zazaki
 +
* Gorani
 +
* Afar
 +
* Shan
 +
* Khowar/Chitrali
 +
* Tai Nüa
 
</div>
 
</div>
  
 
A few languages that used to be on this list but have had other people (elsewhere) do some work on them since being put on the list: Evenki, Bhojpuri, Santali, Konkani.
 
A few languages that used to be on this list but have had other people (elsewhere) do some work on them since being put on the list: Evenki, Bhojpuri, Santali, Konkani.
  
Removed from list because now covered by Google: Kinyarwanda.
+
Removed from list because now covered by Google: Kinyarwanda, Oromo, Maithili, Somali, Twi.
  
 
== The assignment ==
 
== The assignment ==
 
<!-- By the beginning of the Thursday class during the first week of classes (this semester: '''14:00 on 16 February 2021'''), turn in the following: -->
 
<!-- By the beginning of the Thursday class during the first week of classes (this semester: '''14:00 on 16 February 2021'''), turn in the following: -->
By the end of preparation week ('''23:59 on 21 January 2022'''), turn in the following:
+
By the beginning of class on Tuesday of the second week of classes (this semester: '''9:55 on 24 January 2023'''), turn in the following:
 
# Make a page on the wiki:
 
# Make a page on the wiki:
 
#* '''Create a "Language selection" page''' under your userpage (<code>wikis.swarthmore.edu/ling073/User:student1/Language_selection</code>, replacing <code>student1</code> with your username).
 
#* '''Create a "Language selection" page''' under your userpage (<code>wikis.swarthmore.edu/ling073/User:student1/Language_selection</code>, replacing <code>student1</code> with your username).
 
#* At the very top, mention '''who you might like to work with''' in a pair.  This could be anything from "someone who knows linguistics really well" or "someone who is good with computers" or even a specific person (in which case, link to their language selection page!) or a note that you're not sure or don't care.
 
#* At the very top, mention '''who you might like to work with''' in a pair.  This could be anything from "someone who knows linguistics really well" or "someone who is good with computers" or even a specific person (in which case, link to their language selection page!) or a note that you're not sure or don't care.
#* List in order of preference '''three languages''' you might like to work on this semester.  There are some examples given above, but don't limit yourself to those.  There are thousands of languages to choose from!
+
#* List in order of preference '''three languages''' you might like to work on this semester.  There are some examples given above, but don't limit yourself to those.  There are thousands of languages to choose from! (It might be good to make each one a separate section on the page.)
 
# Document some things for each language:
 
# Document some things for each language:
 
#* For each language, determine as best you can with the resources available '''a [[:wikipedia:Morphological typology|morphological typology]] of the language'''.  E.g., is it primarily isolating, agglutinative, etc., and how do you know?  Are there patterns in that language that reflect more than one morphological type?  If there is inflectional morphology (ideally the language you choose will have some!), what sorts of strategies are used (prefixation, suffixation, etc.)?
 
#* For each language, determine as best you can with the resources available '''a [[:wikipedia:Morphological typology|morphological typology]] of the language'''.  E.g., is it primarily isolating, agglutinative, etc., and how do you know?  Are there patterns in that language that reflect more than one morphological type?  If there is inflectional morphology (ideally the language you choose will have some!), what sorts of strategies are used (prefixation, suffixation, etc.)?
Line 149: Line 168:
 
#* Estimate how likely you are to find at least '''a few pages' worth of text''' in this language.  In other words, see if you can find something online quickly: websites in the language, a translation of the Bible or the Universal Declaration of Human Rights, a blog, a grammar book with lots of examples, etc.  Don't limit yourself to online resources; if library resources exist (even if not available at Swarthmore), that can also work!  (If it's not at all likely that you can have some amount of text in the language on your screen or in your hand within a week or two, you probably should find some other language to work on!)
 
#* Estimate how likely you are to find at least '''a few pages' worth of text''' in this language.  In other words, see if you can find something online quickly: websites in the language, a translation of the Bible or the Universal Declaration of Human Rights, a blog, a grammar book with lots of examples, etc.  Don't limit yourself to online resources; if library resources exist (even if not available at Swarthmore), that can also work!  (If it's not at all likely that you can have some amount of text in the language on your screen or in your hand within a week or two, you probably should find some other language to work on!)
 
# Clean up the page
 
# Clean up the page
#* Include a category tag for <code>[[:Category:sp22_LanguageSelection|sp22_LanguageSelection]]</code> and one for the name of each language.  You should have '''four category tags''' on your page, e.g. <code><NOWIKI>[[Category:sp22_LanguageSelection]], [[Category:Abkhaz]],</NOWIKI></code> and one each for the other two languages.
+
#* Include a category tag for <code>[[:Category:sp23_LanguageSelection|sp23_LanguageSelection]]</code> and one for the name of each language.  You should have '''four category tags''' on your page, e.g. <code><NOWIKI>[[Category:sp23_LanguageSelection]], [[Category:Abkhaz]],</NOWIKI></code> and one each for the other two languages.
 
#* Make use of MediaWiki formatting markup.  E.g., each language can be a section, data can be formatted as bullet points or in tables, citations should make use of proper macros, etc.  You can see how MW markup works simply by going to edit an existing page and examining the source used to produce various elements.
 
#* Make use of MediaWiki formatting markup.  E.g., each language can be a section, data can be formatted as bullet points or in tables, citations should make use of proper macros, etc.  You can see how MW markup works simply by going to edit an existing page and examining the source used to produce various elements.
 
* NOTES
 
* NOTES
** Note that conflicts of first choice will be resolved in class on Thursday, but in cases of an impasse, the first person to post their interest in the language to the wiki will get their earlier choice, and the other party will get a subsequent choice.
+
** Note that conflicts of first choice will be resolved in class, but in cases of an impasse, the first person to post their interest in the language to the wiki will get their earlier choice, and the other party will get a subsequent choice.
** Feel free to examine language selection pages from previous years (e.g., [[:Category:sp17_LanguageSelection|sp17_LanguageSelection]], [[:Category:sp18_LanguageSelection|sp18_LanguageSelection]], [[:Category:sp19_LanguageSelection|sp19_LanguageSelection]], [[:Category:sp21_LanguageSelection|sp21_LanguageSelection]]), but don't copy stuff wholesale—and note that a number of those languages have already been done so you can't choose them anyway :)
+
** Feel free to examine language selection pages from previous years (e.g., [[:Category:sp17_LanguageSelection|sp17_LanguageSelection]], [[:Category:sp18_LanguageSelection|sp18_LanguageSelection]], [[:Category:sp19_LanguageSelection|sp19_LanguageSelection]], [[:Category:sp21_LanguageSelection|sp21_LanguageSelection]], [[:Category:sp22_LanguageSelection|sp22_LanguageSelection]]), but don't copy stuff wholesale—and note that a number of those languages have already been done so you can't choose them anyway :)
 +
 
 +
By the beginning of the first class of the second week ('''9:55 on 24 January 2023'''), add the following:
 +
# Peruse your classmates' language selection pages at [[:Category:sp23_LanguageSelection|sp23_LanguageSelection]], once they're all submitted.
 +
# Find at least one classmate you feel like you'd be interested in working with this semester, and reach out to them to see if they might be interested in working with you.
 +
#* You can base this on any number of factors, e.g. who you already know, what languages your classmates listed, what sort of partner they said they were looking for, or what they said on their user page for the [[:Category:Sp23_students|first day assignment]], such as background in linguistics or CS that might complement your own background, or even class year.
 +
#* People may form pairs before you reach out to them (and don't feel shy about letting someone know that you've already formed a pair with someone else), so be prepared to reach out to someone else, or ask more than one person from the beginning (but be open with the people you ask that you've asked someone else too).
 +
# When you've made an agreement with a classmate to be partners (at least tentatively), at the very top of your language selection page, mention who you'll be working with.
 +
 
  
 
[[Category:Assignments]]
 
[[Category:Assignments]]
 
[[Category:Language selection]]
 
[[Category:Language selection]]

Revision as of 15:07, 30 January 2023

In Ling 073, everyone will be applying the topics of the class to an under-resourced language of their choice throughout the semester. Students will [for the most part] work in pairs on a single language, but no two pairs will work on the same language.

Note: If you have a strong desire to work on a language that is normally regarded as entirely "isolating", some accommodations may be made, but you should talk with the professor about it immediately.

Morphological typology

Morphological typology is a simple way of talking in approximate terms about how words in a language are generally formed.

You can reference the Wikipedia article on Morphological typology to get a good overview of the possible categories and to figure out how a given language (whose grammar you're able to learn a little bit about) might fit the typology.

Considerations for language selection

  • Ideally, you should choose a language with at least some interesting morphological processes.
  • You'll need some authentic text (i.e., text produced by native speakers, even if not standardised) in this language, whether from documents found online, an excerpt of published text that you type up, someone's Twitter account, or sample sentences from a grammar. See Places to look for corpora for more info.
  • Because of the above, it's easiest to choose a language with a written standard of one sort or another. Some languages have more than one written standard (which is fine!) and some are subsumed under some other language's written standard (which makes it harder). If the documentation and corpora you identify all use linguists' transcriptions, this can also work, but isn't ideal.
  • You need to choose a language that doesn't have [many] existing computational resources; specific exclusions listed below:

Languages you may not choose

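  • No languages supported by big corporate translation systems like Google Translate, Bing Translator, or Yandex Translate, however poor-quality.
  • No languages supported by Apertium (you should check either the Apertium source browser or the packages listed on Apertium's GitHub account).
      Note: If you really want, you may indicate your preference to work on an Apertium language listed as "incubator", but if you do end up working on it, you will basically be expected to start from scratch for each assignment and ignore what's available from Apertium except to augment your resources later.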
  • No languages supported by Giellatekno.
  • No historical languages unless with special permission; there should be some current speech community—ideally L1—even if small. Here I do not mean to exclude languages which may have recently lost their last fluent speakers, but languages like Classical Maya, Middle Japanese, or Old Church Slavonic.
  • No conlangs unless with special permission; again, the point is to build tools that are potentially useful to a language community (of conlangs with L1 speakers, Esperanto speakers have plenty of resources, and the rest—e.g. Klingon-speakers—can fend for themselves)
  • No languages chosen in a previous semester (see below)

Languages chosen in previous semesters

Languages in italics were not implemented in translation pipelines. Again, these are languages to avoid in your selection.

Random list of languages that might work

The following is a list of languages with few to no relevant computational resources which otherwise appear to meet the criteria I set up. If you need some inspiration, this list could be a good place to start.

  • Western Abenaki
  • Kabardian
  • Shor
  • Ndebele
  • Arrernte
  • Iatmul
  • Beja
  • Garifuna
  • Arhuaco/Ikʉ
  • Mapudungun
  • Kikamba
  • Rohingya
  • Platduuts (nds-nl) or Plattdüütsch (nds)
  • Alemannisch (any southern German)
  • Lepcha
  • Pontic Greek
  • Tigre
  • Kabyle
  • Mandinka
  • Lezgian
  • Denaʼina
  • Wakhi
  • Luri
  • Lari
  • a Mixe variety (other than mto)
  • Chatino
  • Tamasheq
  • Kanza
  • Saraiki
  • Jicarilla Apache
  • Mazandarani
  • Udege
  • Lenakel/Netwar
  • Nauru
  • Fula/Pular/Fulani/Fulfulde
  • Newari
  • Garhwali
  • Mirandés
  • Kimwani
  • (Serer-)Noon
  • Wolof
  • Dagbani
  • Totonac (Sierra, Misantla, or Upper Necaxa)
  • Ingush
  • Lak
  • Balinese
  • Benchnon
  • Turkana
  • Dinka
  • Marshallese
  • Dolgan
  • Rgyalrong
  • Dangaura Tharu
  • Balochi
  • a Zapotec variety (other than zab)
  • a Cree variety (other than moe)
  • Sebat Bet Gurage
  • Ladakhi
  • a Mixtec variety
  • Balti
  • Akan
  • Zazaki
  • Gorani
  • Afar
  • Shan
  • Khowar/Chitrali
  • Tai Nüa

A few languages that used to be on this list but have had other people (elsewhere) do some work on them since being put on the list: Evenki, Bhojpuri, Santali, Konkani.

Removed from list because now covered by Google: Kinyarwanda, Oromo, Maithili, Somali, Twi.

The assignment

By the beginning of class on Tuesday of the second week of classes (this semester: 9:55 on 24 January 2023), turn in the following:

  1. Make a page on the wiki:
    • Create a "Language selection" page under your userpage (wikis.swarthmore.edu/ling073/User:student1/Language_selection, replacing student1 with your username).
    • At the very top, mention who you might like to work with in a pair. This could be anything from "someone who knows linguistics really well" or "someone who is good with computers" or even a specific person (in which case, link to their language selection page!) or a note that you're not sure or don't care.
    • List in order of preference three languages you might like to work on this semester. There are some examples given above, but don't limit yourself to those. There are thousands of languages to choose from! (It might be good to make each one a separate section on the page.)
  2. Document some things for each language:
    • For each language, determine as best you can with the resources available a morphological typology of the language. E.g., is it primarily isolating, agglutinative, etc., and how do you know? Are there patterns in that language that reflect more than one morphological type? If there is inflectional morphology (ideally the language you choose will have some!), what sorts of strategies are used (prefixation, suffixation, etc.)?
    • Determine basic information about each language. How many speakers are there, where do they live, what other languages might they know, what is the status of the language in terms of its transmission to current and future generations, and is there a normative orthography of some sort? What is the orthography like (what script / any interesting features / multiple official/historical orthographies / etc.)? Provide ISO codes used for the language, especially three-letter ones. Basically all of this information should be findable on Ethnologue (not paywalled if accessed on campus or through TriPod) and Wikipedia (in one language or other), but feel free to use any source that seems reliable (academic papers, census data, etc.). Cite the sources you use (at least add a link).
    • Estimate how likely you are to find at least a few pages' worth of text in this language. In other words, see if you can find something online quickly: websites in the language, a translation of the Bible or the Universal Declaration of Human Rights, a blog, a grammar book with lots of examples, etc. Don't limit yourself to online resources; if library resources exist (even if not available at Swarthmore), that can also work! (If it's not at all likely that you can have some amount of text in the language on your screen or in your hand within a week or two, you probably should find some other language to work on!)
  3. Clean up the page
    • Include a category tag for sp23_LanguageSelection and one for the name of each language. You should have four category tags on your page, e.g. [[Category:sp23_LanguageSelection]], [[Category:Abkhaz]], and one each for the other two languages.
    • Make use of MediaWiki formatting markup. E.g., each language can be a section, data can be formatted as bullet points or in tables, citations should make use of proper macros, etc. You can see how MW markup works simply by going to edit an existing page and examining the source used to produce various elements.
  • NOTES
    • Note that conflicts of first choice will be resolved in class, but in cases of an impasse, the first person to post their interest in the language to the wiki will get their earlier choice, and the other party will get a subsequent choice.
    • Feel free to examine language selection pages from previous years (e.g., sp17_LanguageSelection, sp18_LanguageSelection, sp19_LanguageSelection, sp21_LanguageSelection, sp22_LanguageSelection), but don't copy stuff wholesale—and note that a number of those languages have already been done so you can't choose them anyway :)
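
As a rough illustration of steps 1 to 3 above, here is a minimal sketch of what the wikitext source of a language selection page might look like. Only the category names sp23_LanguageSelection and Abkhaz come from the instructions above. The section headings, the placeholder names SecondLanguage and ThirdLanguage, and the example.org link are assumptions made up for illustration, not requirements of the assignment.

```
<!-- Sketch only: SecondLanguage, ThirdLanguage and the example.org link are placeholders -->
Partner preference: not sure yet.

== Abkhaz ==
=== Morphological typology ===
* (primary type, with a short justification and an example)
* (strategies used for inflection, e.g. prefixation or suffixation)

=== Basic information ===
* Speakers and location: (fill in)
* Orthography: (fill in)
* ISO 639-3 code: (fill in)
* Source: [https://example.org placeholder link to the source you used]

=== Available text ===
* (what you found, and how quickly you could get it)

== SecondLanguage ==
(same subsections as above)

== ThirdLanguage ==
(same subsections as above)

[[Category:sp23_LanguageSelection]]
[[Category:Abkhaz]]
[[Category:SecondLanguage]]
[[Category:ThirdLanguage]]
```

The four category tags at the bottom match the count asked for in step 3: one for the semester's selection category and one per language. Making each language its own section follows the suggestion in step 1, and the preference note at the very top is where a confirmed partner's name would later go.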

By the beginning of the first class of the second week (9:55 on 24 January 2023), add the following:

  1. Peruse your classmates' language selection pages at sp23_LanguageSelection, once they're all submitted.
  2. Find at least one classmate you feel like you'd be interested in working with this semester, and reach out to them to see if they might be interested in working with you.
    • You can base this on any number of factors, e.g. who you already know, what languages your classmates listed, what sort of partner they said they were looking for, or what they said on their user page for the first day assignment, such as background in linguistics or CS that might complement your own background, or even class year.
    • People may form pairs before you reach out to them (and don't feel shy about letting someone know that you've already formed a pair with someone else), so be prepared to reach out to someone else, or ask more than one person from the beginning (but be open with the people you ask that you've asked someone else too).
  3. When you've made an agreement with a classmate to be partners (at least tentatively), at the very top of your language selection page, mention who you'll be working with.