User:Tjones5/Final project

From LING073
Revision as of 00:54, 5 May 2017 by Tjones5 (talk | contribs) (Presentation)

Placeholder page for final project

Notes

  • To implement:
    • Reduplication
  • Get a big corpus:
    • Build a scraper (?) to bring the corpus up to 25K
    • Use regular expressions to clean up scraped (or copied/pasted) text
  • For evaluation:
    • Go back and look at coverage
  • Other thing to do: UD pipeline
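
The regular-expression cleanup mentioned in the notes could look something like the sketch below. It is only an illustration: the function names (clean_text, tokenize) and the particular patterns are assumptions, not part of the project, and a real scraper would need patterns tuned to the pages it actually visits.

```python
import re

def clean_text(raw: str) -> str:
    """Strip common markup residue from scraped or pasted text."""
    text = re.sub(r"<[^>]+>", " ", raw)               # drop HTML tags
    text = re.sub(r"&[a-zA-Z]+;|&#\d+;", " ", text)   # drop HTML entities
    text = re.sub(r"https?://\S+", " ", text)         # drop bare URLs
    text = re.sub(r"\s+", " ", text)                  # collapse runs of whitespace
    return text.strip()

def tokenize(text: str) -> list[str]:
    """Split cleaned text into lowercase word tokens (letters, inner hyphens)."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())

if __name__ == "__main__":
    sample = "<p>Hello&nbsp;world</p>\n\n see http://x.org/a now"
    tokens = tokenize(clean_text(sample))
    print(len(tokens), tokens)
```

Counting the tokens returned by tokenize over all cleaned pages gives a running total to compare against the 25K target.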

Presentation

Description of the problem

  • Word structure is analyzed as a composition of morphemes (the smallest units that carry grammatical meaning)
  • A morphological transducer connects forms with analyses.
  • Surface level -- represents the concatenation of letters which make up the actual spelling of the word, e.g. foxes
  • Lexical level -- represents the concatenation of morphemes making up a word, e.g. NOUN-fox + PLURAL-es
  • Morphological parsing -- the problem of recognizing that a word breaks down into component morphemes, as above with "foxes"
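
To make the surface/lexical distinction concrete, here is a toy sketch in Python. It is not a finite-state transducer and not the project's lexc/twol code: the three-word English lexicon, the tag format (fox<n><pl>) and the function names are all assumptions chosen for illustration. It generates surface forms from lexical ones and analyzes by checking which lexical forms generate the input.

```python
# Toy lexicon; hypothetical, for illustration only.
NOUN_STEMS = {"fox", "cat", "dog"}
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")

def generate(stem: str, plural: bool) -> str:
    """Lexical -> surface: spell out a stem plus an optional plural."""
    if not plural:
        return stem
    # Spelling rule: insert "e" after a sibilant (fox + s -> foxes).
    return stem + ("es" if stem.endswith(SIBILANT_ENDINGS) else "s")

def analyze(surface: str) -> list[str]:
    """Surface -> lexical: list every analysis the toy lexicon licenses."""
    analyses = []
    for stem in sorted(NOUN_STEMS):
        for plural, tags in ((False, "<n><sg>"), (True, "<n><pl>")):
            if generate(stem, plural) == surface:
                analyses.append(stem + tags)
    return analyses

if __name__ == "__main__":
    print(analyze("foxes"))
    print(generate("cat", plural=True))
```

A real transducer encodes the same two directions in one finite-state network, so analysis does not have to loop over the lexicon as this sketch does.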

[Figure: Simple transducer.png]

Previous work

  • Many languages have working Apertium transducers.
  • Warlpiri doesn't have a working transducer -- in fact, no languages from the Pama–Nyungan family have a transducer
  • Other work?

Benefits of this project

  • Why is a transducer useful to a community of Warlpiri speakers?
    • Search engine
    • Spell checker
    • Machine Translation, especially to/from English (the primary language of most Warlpiri speakers). Many are trying to revive the use of Warlpiri, so there are many potentially useful applications.
      • eng → wbp MT system could be used to look up how to say/write a certain word or phrase
      • wbp → eng MT system could be used to analyze (and then translate) untranslated Warlpiri phrases, or give alternate translations
    • All of these applications might also be helpful in starting an analyzer for other indigenous languages that have the same orthography, are highly agglutinative, and have other grammatical similarities

Approaches

  • Work so far has mostly been within the lexc and twol files
    • Examples??
  • Disambiguation is a challenge because of flexible word order
    • Numerals are one thing to deal with
  • Need to build a scraper to test on a larger corpus

Evaluation

  • Coverage over a large corpus
  • Precision and recall against a hand-annotated set of randomly selected forms
  • Coverage over a large corpus:
    • Number of tokenised words in the corpus: 18427
    • Coverage: 51.26%
  • Precision/Recall on wbp.annotated.basic.txt from Polished RMBT:
    • Precision: 97.54902%
    • Recall: 97.07317%
  • Number of stems: ~250
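
For reference, the three metrics above can be computed roughly as in the sketch below. The function names, the data layout (a mapping from surface form to a set of analyses) and the toy data are assumptions; this is not the evaluation script that produced the figures reported here.

```python
def coverage(total_tokens: int, analyzed_tokens: int) -> float:
    """Percentage of corpus tokens that receive at least one analysis."""
    return 100.0 * analyzed_tokens / total_tokens

def precision_recall(gold: dict[str, set[str]],
                     predicted: dict[str, set[str]]) -> tuple[float, float]:
    """Compare predicted analyses with hand-annotated ones, form by form."""
    tp = fp = fn = 0
    for form, gold_analyses in gold.items():
        pred = predicted.get(form, set())
        tp += len(gold_analyses & pred)   # analyses found in both
        fp += len(pred - gold_analyses)   # proposed but not in the annotation
        fn += len(gold_analyses - pred)   # annotated but not proposed
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

if __name__ == "__main__":
    # Toy data, not taken from the annotated file.
    gold = {"foxes": {"fox<n><pl>"}, "cats": {"cat<n><pl>"}}
    predicted = {"foxes": {"fox<n><pl>", "foxe<n><pl>"}, "cats": set()}
    print(coverage(200, 50))
    print(precision_recall(gold, predicted))
```

Precision drops when the transducer proposes analyses the annotator did not accept; recall drops when it misses analyses the annotator listed.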


status       description                                               stems    coverage
prototype    language module that has not received heavy development   <1,000   <60%
development  language module under development                         ≥1,000   ≥60%
working      language module with near-production-quality performance  ≥8,000   ≥80%
production   language module used in a released pair                   ≥10,000  ≥90%
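
As a quick check of where a module falls in this table, the thresholds can be written out as below. The table does not say how the two criteria combine, so this sketch assumes a module must meet both the stem and the coverage threshold of a tier; the names TIERS and module_status are hypothetical.

```python
# (status, minimum stems, minimum coverage %), highest tier first.
TIERS = [
    ("production", 10_000, 90.0),
    ("working", 8_000, 80.0),
    ("development", 1_000, 60.0),
]

def module_status(stems: int, coverage_pct: float) -> str:
    """Return the highest tier whose stem and coverage thresholds are both met."""
    for name, min_stems, min_coverage in TIERS:
        if stems >= min_stems and coverage_pct >= min_coverage:
            return name
    return "prototype"

if __name__ == "__main__":
    # Figures reported above: about 250 stems, 51.26% coverage.
    print(module_status(250, 51.26))
```

Under that assumption, the figures reported above (about 250 stems, 51.26% coverage) place the module in the prototype tier.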