Final projects/Transducer

From LING073

For any of the transducer projects:

  • You can contribute straight to Apertium: ask for commit access.
  • To get random forms from your corpora for precision/recall measures:
    1. Get 1200 random words:
      cat corpus1.txt corpus2.txt | apertium -d . abc-tagger | sed -r 's/\$\s*\^/\$\n\^/g' | cut -f1 -d'/' | sed -r 's/^[[:space:]^]*//' | sort -Ru | head -n 1200 > words.txt
      This might not work as expected with your tagger: you may not get one form per line, and some non-form symbols might be included. If so, make sure the call to cg-conv in your tagger mode in `modes.xml` doesn't have `-n`, then recompile.
    2. Get analyses from the transducer:
      cat words.txt | apertium -d . abc-disam
    3. Then repeat the procedure described in Polished RBMT system#Hand-annotating corpora.
  • Use the Wikipedia extractor script to get the entire contents of the Wikipedia in your language, if one exists.
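
For the random-sample step above, it can help to check that `words.txt` really contains one bare form per line before analysing it. The following is only a minimal sketch: `check_forms` is a hypothetical helper name, not part of Apertium, and it assumes that a valid form contains no whitespace and none of the stream symbols `^`, `$`, or `/`. Adjust the pattern if forms in your language can legitimately contain any of these.

```shell
# Hypothetical helper: print (with line numbers) every line of a word list
# that does not look like a single bare form.
# Assumption: a valid form has no whitespace and none of the symbols ^ $ /
check_forms() {
  # Matches empty lines, lines containing ^, $ or /, and lines containing whitespace.
  # grep exits with status 0 if it found suspicious lines, 1 if it found none.
  grep -nE '^$|[$^/]|[[:space:]]' "$1"
}
```

A possible way to use it is `check_forms words.txt || echo "words.txt looks clean"`. If it prints any lines, revisit the tagger mode as described in step 1 before moving on.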