Grammar documentation

From LING073

Now that we have text in the language and are getting comfortable typing in the language, it's time to explore the grammar of the language. This will put you in a good spot to start implementing its morphology computationally - the next unit.

Morphology

Morphology is concerned with how words are formed in a language.

Common morphological strategies

The following cover the most common morphological strategies. You may reference Wikipedia for information and examples of each type by following the links.

Aside from suppletion, these can all be thought of as productive (=consistent) operations that are performed on one string (a root) to create another (a particular form).

Functional vs derivational morphology

For the purposes of grammar documentation, we should concern ourselves only with functional morphology - that is, alternations between forms that change the function (as opposed to the meaning) of a word. It should also be a productive alternation (i.e., an alternation that holds for any word of the same word class) even if it isn't always formed in the same way. A straightforward example is English plurals - basically every noun in English has a plural form, even though it could be formed irregularly (sometimes even with no phonological alternation, like "sheep/sheep").

An example of a derivational (as opposed to functional) non-productive morphological alternation might be the adjectival "un" prefix in English (in words like "unhappy", "unusual", "uncaring"). You can make new word forms with it, which means it's at least semi-productive, but you can't form new words from just any adjective using "un": *"unglistening", *"unfast", *"undigital" might be expected, but seem to mostly be thought of as "incorrect" (uncorrect?;) by native English speakers. Furthermore, the prefix changes the meaning and doesn't affect how the word is used in a sentence ("They are happy", "They are unhappy" are both okay, cf. "They are cats", *"They are cat", where the meaning would be basically the same, but the particular grammatical form of the word is no longer compatible with the sentence).

We need to keep in mind both the distinctions being made in a language and the language-specific strategies for the distinction. If a language doesn't distinguish number on nouns, then we don't need to try to impose a singular and plural distinction. In more analytical terms, we don't want to say that any noun is both singular and plural, like the form "sheep" is in English - we simply leave analyses of number out of the picture. If the primary strategy in a language is that the plural receives an additional morphological element, then we can think of the singular as "unmarked", and our analyses of the singular nouns should probably just be "noun" (not "noun.singular"). However, if the primary strategy of the language is to change a morphological element, then we probably want to analyse the singular noun forms as explicitly singular ("noun.singular").
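
As a rough illustration of the English plural case, here is a minimal Python sketch that treats the singular as unmarked (`cat<n>`) and the plural as marked (`cat<n><pl>`). The function names, the spelling rules, and the short irregular list are assumptions made for this example only; this is not how the transducer in the next unit will be written.

```python
import re

# Hypothetical, deliberately tiny list of irregular plurals.
IRREGULAR_PLURALS = {"sheep": "sheep", "child": "children", "mouse": "mice"}


def pluralise(lemma):
    """Return the plural form of an English noun lemma (simplified rules)."""
    if lemma in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[lemma]
    if re.search(r"(s|x|z|ch|sh)$", lemma):
        return lemma + "es"
    if re.search(r"[^aeiou]y$", lemma):
        return lemma[:-1] + "ies"
    return lemma + "s"


def generate(analysis):
    """Map an analysis such as 'cat<n><pl>' to a surface form."""
    match = re.fullmatch(r"([^<>]+)((?:<[^<>]+>)+)", analysis)
    if match is None:
        raise ValueError(f"not a lemma followed by tags: {analysis!r}")
    lemma = match.group(1)
    tags = re.findall(r"<([^<>]+)>", match.group(2))
    if tags == ["n"]:          # unmarked singular: just "noun"
        return lemma
    if tags == ["n", "pl"]:    # marked plural
        return pluralise(lemma)
    raise ValueError(f"tags not covered by this sketch: {tags}")


print(generate("cat<n>"))        # cat
print(generate("cat<n><pl>"))    # cats
print(generate("sheep<n><pl>"))  # sheep
```

The alternation is productive (every noun gets a plural analysis) even though the form is not always built the same way, which is why the irregular forms sit in a lookup table rather than in the rule.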

Note that morphology can also include words that are written separately, especially if they affect the form of an adjacent word. An example of this might be the tense/person verb markers in Hñähñu, e.g. thede ‘laugh’, but bi nthede ‘he/she/it laughed’, where the form of the verb changes too. An example with a space that might be less likely to be considered morphology is the set of "case markers" in Khasi, e.g. ngi ‘us’, but ha ngi ‘to us’; there may be no reason to treat these any differently than prepositions in English. What you choose to do in this regard will depend on a lot of factors, including how your grammar documentation describes the phenomenon, how native speakers think of it (potentially/likely based on what they've been taught about it), and maybe additional linguistic evidence (such as whether something may intervene or not).

What you're looking for on this assignment

For this assignment, the following are the sorts of things you're looking for:

  • Any alternation of forms of a given word based on their syntactic or phonological environment
    • Phonological and [functional] morphological alternations
  • Any categorisation schema relevant to the lexicon
    • noun/verb classes, pronoun features, etc.
    • can have bearing on what morphology is taken, or what syntactic arguments are allowed

Morphological analyses

There are standardised formal representations of the mapping between morphological form and morphological analysis. A morphological form is just any word form of a language, whereas a morphological analysis tells you further information about that form. (Soon we will be developing a transducer to map between these two representations.) For our purposes, a morphological analysis will consist of a stem followed by a set of tags, or more specifically:

  • A lemma, or the stem or "base form" of the word. In English, a noun or verb lemma will just be the bare form of the word: the lemma of "cats" is "cat" and the lemma of "running" is "run". A more complicated example is the lemma of "is", which would be "be", which isn't morphologically a root, but is still the lemma.
  • A series of tags, or abbreviations used to both categorise words and provide information about common morphological properties of the words. We'll try to stick to the Apertium tagset.
    • The first tag is usually the word category or part of speech (often POS), for example <n> for "noun", <v> for "verb", <adj> for "adjective".
    • A subcategory tag may follow if relevant for the particular language and part of speech, for example gender for nouns, transitivity for verbs, etc.
      • Note: usually purely semantic tags like "body part" or "verb of motion" aren't subcategory tags, unless they have some exclusive influence over the form of other words. Similarly, inflectional classes, like "class I verbs" (verbs that take certain patterns of morphology to the exclusion of other verbs), will be useful to consider later, but are not syntactic subcategories, so should not be encoded as tags.
    • Optionally, tags for the relevant morphological distinctions follow; these are often referred to as grammatical tags. This can include tags for things like person agreement (for verbs, possessed nouns, etc.), number (e.g., for nouns), number agreement (for determiners, adjectives, etc.), polarity (e.g., <neg> for negative), and lots of other things.
      • Note that the order of functional morphology tags isn't always clear. By default, the tags should occur in the order in which the morphemes appear in the language, but this may not be stable within a single language's grammar. When in doubt, you can work down from "major distinction" to "minor distinction", whatever those terms might mean for your particular language.
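
The structure described in the list above (lemma, then part of speech, then any further tags) can be sketched in Python as follows. The `Analysis` type and `parse_analysis` function are hypothetical names for this illustration. The sketch cannot tell subcategory tags from grammatical tags without language-specific knowledge, so it keeps everything after the first tag together.

```python
import re
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str         # the stem or "base form"
    pos: str           # first tag: word category / part of speech
    other_tags: tuple  # subcategory and grammatical tags, in order


def parse_analysis(text):
    """Split a string like 'cat<n><pl>' into lemma, POS tag, and remaining tags."""
    match = re.fullmatch(r"([^<>]+)((?:<[^<>]+>)+)", text)
    if match is None:
        raise ValueError(f"expected a lemma followed by at least one <tag>: {text!r}")
    tags = re.findall(r"<([^<>]+)>", match.group(2))
    return Analysis(match.group(1), tags[0], tuple(tags[1:]))


print(parse_analysis("cat<n><pl>"))
# Analysis(lemma='cat', pos='n', other_tags=('pl',))
```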

An example is provided in a diagram (image not reproduced here; its description and complete contents are available on the file page).

Working with IGT

Interlinear Glossed Text (or IGT) is a notation that linguists use to provide morphological information about examples in grammatical descriptions or theoretical arguments. It's common in grammatical descriptions and academic work, and can be very useful for making sense of material in the language. Here is an example from nci (Classical Nahuatl):


ni-c-chihui-lia             in   no-piltzin     ce   calli
1sg.subj-3sg.obj-mach-appl  det  1sg.poss-Sohn  ein  Haus

In Apertium-style tagging, written as morphTest pairs, this "converts" to the following. Note that the dashes are removed.

chihui<v><tv><s_sg1><o_sg3><appl> ↔ nicchihuilia
in<det><def> ↔ in
piltzin<n><px1sg> ↔ nopiltzin
ce<det><ind> ↔ ce
calli<n> ↔ calli
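
The conversion above can be sketched roughly in Python. Everything here is an assumption made for illustration: the `GLOSS_TO_TAG` table covers only the glosses in this one example, the function name is hypothetical, and the part-of-speech and subcategory tags have to be supplied by hand because the gloss line does not provide them. The sketch also takes the unglossed-by-table morpheme as the lemma, which works for this example but will not hold in general.

```python
# Hypothetical gloss-to-tag table covering only the example above.
GLOSS_TO_TAG = {
    "1sg.subj": "s_sg1",
    "3sg.obj": "o_sg3",
    "appl": "appl",
    "1sg.poss": "px1sg",
}


def igt_to_morphtest(segmented, gloss, lead_tags):
    """Turn one IGT word (segmented form plus gloss) into 'analysis ↔ form'.

    lead_tags are the POS and subcategory tags, which the gloss does not give us.
    """
    morphs = segmented.split("-")
    glosses = gloss.split("-")
    if len(morphs) != len(glosses):
        raise ValueError("form and gloss have different numbers of morphemes")
    # Any morpheme whose gloss is not a grammatical label is treated as the stem.
    stems = [m for m, g in zip(morphs, glosses) if g not in GLOSS_TO_TAG]
    if len(stems) != 1:
        raise ValueError("expected exactly one stem")
    tags = list(lead_tags) + [GLOSS_TO_TAG[g] for g in glosses if g in GLOSS_TO_TAG]
    analysis = stems[0] + "".join(f"<{tag}>" for tag in tags)
    form = "".join(morphs)  # dashes are removed in the surface form
    return f"{analysis} ↔ {form}"


print(igt_to_morphtest("ni-c-chihui-lia", "1sg.subj-3sg.obj-mach-appl", ["v", "tv"]))
# chihui<v><tv><s_sg1><o_sg3><appl> ↔ nicchihuilia
```

Note that the tags come out in morpheme order here, and that the `<def>` and `<ind>` tags on the determiners cannot be derived from the glosses at all; they would have to be passed in through `lead_tags`.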

IGT will often provide much of the information you need for a morphological analysis, but not always. Commonly no information is provided in the gloss when there is not an overt morpheme used in the language for a morphological distinction. Specific examples include 3rd person marking on verbs or singular of nouns in otherwise morphologically rich languages.

Even when such information is provided, IGT may still lack information needed for a full computational treatment. For example, the lemma might not be specified (as with the word "in" in the above example).

Also, IGT will often not include an orthographic form of the example. That is, the example will be in a transcription system of some sort (IPA if you're lucky), and not in any normative orthography used for the language. You'll have to figure out how to spell the example in cases like this if the example is useful to you.

And the glosses themselves might not even be in a language you can read.

More specifically

At this point in the course, we will be working through the grammar of a language by trying to understand the mapping of specific forms to specific analyses.

To consider

  • Are there any irregular forms in the language? What about the pronouns: are their alternations identical to those of nouns?
  • Is there any sort of agreement morphology for person/number/etc on verbs or nouns?
  • How are tense/mood/aspect/evidentiality marked in the language? Do the verbs change form? Are there auxiliary verbs or other particle-like words that might be analysed as part of the morphology instead of as separate words? Do transitive and intransitive verbs take different morphology, or different syntactic arguments?
  • Do nouns change form in different number (e.g., singular/plural)? Do they change form based on how they are used in the sentence? Are they lexically specified for class (masculine/feminine, or more?)? Do all nouns take the same set of forms?
  • Do adjectives behave like nouns in terms of number/case/etc.? How are comparatives formed?
  • What properties of personal pronouns are distinguished? Most languages have at least 3 person distinctions (1st, 2nd, 3rd) and many have number distinctions as well (singular, plural). Are other things distinguished, like an additional person or number, relative social class encoding of speaker and hearer (for all pronouns or just 2nd person?), etc.?
  • What about demonstrative and interrogative pronouns?
  • Are there any phonologically productive alternations in the language?
  • Make sure you say what the use of the morphology is.

Examples

See Grammar documentation/Examples for examples of how to do the assignment. Following are some ideas for the types of things to look at:

  • You could document something similar to the plural pattern(s) of English. List the regular form, predictable alternations, and a list of irregular forms.
  • You could document something like a single tense conjugation of Spanish. Mention that the theme vowel determines what the set of endings is, and list the endings for each person/number.
  • spellrelax. If there is a list of common spelling alternatives that you want to interpret as a given standard, listing those (with some explanation) can count as one grammar point. For example, if certain accent marks are considered proper, but people frequently don't use them, then you'll want to interpret characters without these accent marks as equivalent to their accented counterparts.
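
To make the spellrelax idea concrete, here is a small Python sketch in which unaccented vowels are accepted as spellings of their accented standard counterparts. The `RELAX` table and the function names are hypothetical, and Spanish words are used purely as an illustration. An actual spellrelax file is written in a different formalism, so treat this only as a picture of the mapping you would document.

```python
import unicodedata

# Hypothetical spellrelax list: standard character -> commonly typed alternative.
RELAX = {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u"}


def relax(form):
    """Replace each standard character with its commonly typed alternative."""
    composed = unicodedata.normalize("NFC", form)  # use precomposed characters
    return "".join(RELAX.get(char, char) for char in composed)


def standard_forms_for(typed, standard_forms):
    """Return the standard spellings that the typed form could stand for."""
    key = relax(typed)
    return [form for form in standard_forms if relax(form) == key]


print(standard_forms_for("cancion", ["canción", "canto"]))  # ['canción']
print(standard_forms_for("ano", ["año"]))                   # []
```

Listing specific mappings, rather than stripping every diacritic, keeps distinct letters such as "ñ" from being collapsed by accident.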

The assignment

This assignment is due at the end of the 5th week of class, on Friday at midnight (this semester, 23:59 on Friday, February 17th, 2023).

  1. Determine what the main parts of speech are in your language. There are going to be some open classes, like nouns, verbs, and adjectives, and some [relatively more] closed classes like prepositions or pronouns. Create one section on a Language/Grammar page on the wiki outlining the main parts of speech and any subcategories, providing computational POS symbols (or tags) for each one that are compatible with the ones used by the Apertium project. Give an example or two of each class and subclass using the {{morphTest}} template as it's used on the examples page. This need not go into a lot of depth—but try to list at least 5 parts of speech and at least one example of each.
  2. Find any set of alternations or systems (as described above) in your language. For each one, write one new section describing this grammar point. Provide some examples. You'll have to make preliminary decisions about what the base form is, what tags you should be using, etc. Some examples are available. You should have at least eight grammar points in all.
    • If you're working on a polysynthetic language, you may have a lot of options to sort out in order to choose discrete grammar points, and if you're working on a more isolating language, there may not be a lot of morphology points to choose from.
      • If you're having trouble finding morphological alternations in your language, consider adding sections like the following, especially ones for "closed" parts of speech or elements labelled "particles" by sources, where it might be difficult to determine the right tags: personal pronouns (list personal pronouns with appropriate tags), demonstrative determiners and demonstrative pronouns (list all demonstrative determiners and pronouns with their various forms), conjunctions, classifiers, auxiliaries, or anything labelled "particle" (which is not usually a single coherent part of speech). The challenge in many of these won't be to work out the morphology-to-tag mapping but the precise parts of speech and perhaps relevant subcategories.
    • As mentioned above, you can list a handful of spellrelax mappings and count that as one grammar point.
    • If you identify a dominant pattern in the morphology (like x when A and y when B), and are also able to document a number of exceptions, this can count as a second grammar point—but even if there are four dominant patterns, if it's the same process it can only count as two grammar points.
    • Each grammar point should have at least three examples using the {{morphTest}} template.
  3. Add the page to the category sp23_GrammarDocumentation and also a category for your language. Add a link to this new page to the main language page, under the section for resources developed in this class.

Sanity checks

  • There should be on the order of 40 morphTests, 25 at an absolute minimum.
    • You can have examples of each part of speech tag in the initial section.
  • Each morphTest should have an analysis on the left and a form on the right.
    • The analysis should have a stem (or "lemma"), a main categorisation tag (e.g., <v>), any sub-categorisation tags (e.g., <iv>), and any morphology tags (e.g., <past>).
    • The morphological forms should be proper orthographic forms of the language (i.e., native orthography, not grammar book orthography). There should be no dashes in the forms, no extraneous quotation marks, no English glosses, etc. inside the morphTest template. You can have these things in notes outside the template.
  • Make sure you use the same tag throughout the page consistently—e.g., you don't want <v>, <vb>, and <vblex> all used for verbs—choose one and be consistent.
  • There should be minimal use of non-productive morphology, such as derivations. An example of this might be infinite<adj><→n> ↔ infinity, since this same process can't be applied freely to any adjective. In some languages, derivation of this sort is entirely or almost entirely productive, in which case this is fine.
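
Several of the sanity checks above are mechanical enough to sketch in Python. The sketch assumes you have already collected your tests as (analysis, form) pairs; it does not read the wiki template itself. All names are hypothetical, and the set of verb tag variants simply reuses the ones mentioned above.

```python
import re

ANALYSIS_RE = re.compile(r"([^<>]+)((?:<[^<>]+>)+)")
FORBIDDEN_IN_FORM = ("-", '"', "<", ">")  # dashes, quotation marks, stray tags
VERB_TAG_VARIANTS = {"v", "vb", "vblex"}   # choose one and be consistent
MINIMUM_TESTS = 25


def check_pair(analysis, form):
    """Return a list of problems found in one (analysis, form) pair."""
    problems = []
    if ANALYSIS_RE.fullmatch(analysis) is None:
        problems.append(f"{analysis!r}: expected a lemma followed by at least one <tag>")
    if not form.strip():
        problems.append(f"{analysis!r}: the form is empty")
    for char in FORBIDDEN_IN_FORM:
        if char in form:
            problems.append(f"{form!r}: contains {char!r}")
    return problems


def check_page(pairs):
    """Return a list of problems found across all pairs on a page."""
    problems = []
    verb_tags_seen = set()
    for analysis, form in pairs:
        problems.extend(check_pair(analysis, form))
        tags = re.findall(r"<([^<>]+)>", analysis)
        verb_tags_seen.update(VERB_TAG_VARIANTS.intersection(tags))
    if len(verb_tags_seen) > 1:
        problems.append("more than one verb tag in use: " + ", ".join(sorted(verb_tags_seen)))
    if len(pairs) < MINIMUM_TESTS:
        problems.append(f"only {len(pairs)} tests; the minimum is {MINIMUM_TESTS}")
    return problems


for problem in check_page([("run<v><past>", "ran"), ("walk<vblex><past>", "walk-ed")]):
    print(problem)
```

A check like this cannot tell whether a form is in the native orthography or whether the morphology is productive; those still need to be judged by hand.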