Final projects/Transducer
From LING073
For any of the transducer projects:
- You can contribute straight to Apertium: ask for commit access.
- To get random forms from your corpora for precision/recall measures:
- get 1200 random words:
cat corpus1.txt corpus2.txt | apertium -d . abc-tagger | sed -r 's/\$\s*\^/\$\n\^/g' | cut -f1 -d'/' | sed -r 's/^[[:space:]^]*//' | sort -Ru | head -n 1200 > words.txt
- This might not work as expected with your tagger (i.e., you may not get one form per line, and some non-form symbols might be included). If so, make sure the call to cg-conv in your tagger mode in `modes.xml` doesn't have `-n`, then recompile.
- get analyses from transducer:
cat words.txt | apertium -d . abc-disam
- Then repeat the procedure described at Polished RBMT system#Hand-annotating corpora
- Use the Wikipedia extractor script to get the entire contents of Wikipedia in your language (if it exists).
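The form-sampling step above can be tried without a compiled language pair. The following is a minimal sketch, assuming GNU sed and GNU sort: the `sample` string is a made-up stand-in for what `apertium -d . abc-tagger` might emit (its tags are illustrative only), and `extract_forms` is a hypothetical helper name, not part of Apertium.

```shell
#!/usr/bin/env bash
# Hypothetical stream-format sample; real input would come from
# `apertium -d . abc-tagger`. The tags here are for illustration only.
sample='^The/the<det><def><sp>$ ^cats/cat<n><pl>$ ^sat/sit<vblex><past>$ ^the/the<det><def><sp>$^./.<sent>$'

# extract_forms (hypothetical helper): read stream-format text on stdin
# and print one surface form per line.
extract_forms() {
  # 1. put each lexical unit (^...$) on its own line
  # 2. keep only the surface form, i.e. the text before the first slash
  # 3. strip the leading caret and any leading whitespace
  sed -E 's/\$[[:space:]]*\^/$\n^/g' |
    cut -d'/' -f1 |
    sed -E 's/^[[:space:]^]*//'
}

# All distinct forms, in a fixed order for inspection.
printf '%s\n' "$sample" | extract_forms | LC_ALL=C sort -u

# A random sample of 3 distinct forms (use 1200 on a real corpus).
printf '%s\n' "$sample" | extract_forms | sort -Ru | head -n 3
```

The bracket expression `[[:space:]^]` is used in the last step because `\s` is not treated as whitespace inside brackets by GNU sed, so a class written as `[\^\s\t]` would also strip a leading `s` from forms such as "sat".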