This page describes requirements and presents ideas for doing a computational linguistics final project.
The project should be something new that could potentially be useful to a linguistic community or to people working with linguistic communities. Some project ideas are listed below. Most projects will require some form of evaluation of the results, and the project should be released to the general public.
Part of your project is to evaluate the effectiveness of whatever you made. An appropriate method for evaluating your project should be established in consultation with the professor. This requirement may be waived in extreme cases.
- If you improve or make a transducer, you should test coverage over a large corpus and precision and recall against hand-annotated randomly selected forms;
- If you improve or make an MT system, you should test trimmed coverage, word error rate (WER), and position-independent error rate (PER).
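The measures named above can be sketched in a few lines of Python. This is a minimal illustration, not a required evaluation script: the function names and input shapes (token/analysis pairs, dictionaries of analysis sets, whitespace-tokenised sentences) are assumptions made for the example, `coverage` computes plain (naive) coverage rather than trimmed coverage, and `per` uses one common formulation of position-independent error rate. Confirm the exact definitions with the professor before reporting numbers.

```python
from collections import Counter


def coverage(analysed):
    """Naive coverage: share of tokens that received at least one analysis.

    `analysed` is a list of (token, analyses) pairs (a hypothetical format);
    an empty analyses collection means the token was not recognised.
    """
    if not analysed:
        return 0.0
    known = sum(1 for _, analyses in analysed if analyses)
    return known / len(analysed)


def precision_recall(gold, predicted):
    """Micro-averaged precision and recall over hand-annotated forms.

    Both arguments map a surface form to a set of analysis strings.
    """
    true_pos = n_pred = n_gold = 0
    for form, gold_analyses in gold.items():
        pred_analyses = predicted.get(form, set())
        true_pos += len(gold_analyses & pred_analyses)
        n_pred += len(pred_analyses)
        n_gold += len(gold_analyses)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # delete a reference word
                            curr[j - 1] + 1,          # insert a hypothesis word
                            prev[j - 1] + (r != h)))  # substitute or match
        prev = curr
    return prev[-1] / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: like WER, but word order is ignored."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection counts words shared regardless of position.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)
```

For example, `wer("a b c", "c b a")` counts two substitutions against three reference words, while `per("a b c", "c b a")` is zero because the same words appear in both sentences. Comparing the two gives a rough sense of how much of the error is due to word order.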
Your final project should be published and publicly accessible.
- Make a wiki page User:Username/Final_project which outlines what you did, links to your code, and provides the results of your evaluation.
- Put the page in the category Category:sp17_FinalProjects.
- Also document any things you are aware of that still need to be done.
- Put code somewhere public (github.com or similar, the Apertium svn repo, etc.)
- Include the following files:
  - README file giving an overview of what the code is and what it does, with examples of how to use it and links back to your wiki page
  - AUTHORS file providing your name (or alias) and some way of contacting you (can be a throw-away email address that forwards to your main address); be as anonymous or open with your identity as you want
  - LICENSE file that includes a copy of the open-source license you chose to release the project under
- If your project is completed by way of contribution to an existing code base, make sure these files are there too, but also link the wiki page to each commit you made or any issue tracker you interacted with (e.g., list the issues you took on).
When you're done with the project, send an email to the professor with a link to the wiki page on the project.
Due date for 2017 is Wednesday, May 10th at the end of the day (midnight), changed from Monday, May 8th.
During the finals period, you will present your progress on the project to that point. This semester, we will meet on Friday, May 5th from 9:00 to noon for these presentations.
For this assignment, you are expected to give a 10-minute presentation in which you share the following with your classmates:
- A description of the problem addressed by your project. For some projects this may require thinking about it in a somewhat new light. Whether or not the problem is something we've dealt with in class, provide a brief but complete introduction to the problem.
- Any previous work you've found that also addresses this problem or similar problems, and how it relates to what you're doing.
- Your thoughts on who might benefit from your project and in what way. E.g., might a language community be able to find a use for what you're doing, or maybe [computational] linguists working on a language or issue?
- How you're approaching the solution to the overall problem, including how you're implementing the solution. You can talk here about smaller individual issues that have arisen as well.
- How you're evaluating the effectiveness of the solution, and some preliminary look at results of the evaluation. We'll understand that the project isn't yet complete, so the evaluation may show that the project is entirely ineffective and useless :-P
This section lists some ideas for final projects, but you should not feel limited by these ideas.
Extending what you've done
| Project | Description | Recommended |
|---|---|---|
| Mature tagger | Expand analyser (up to ~85% coverage on large corpus), improve disambiguation, and evaluate performance of both | if you have a large corpus to work from |
| Spell checker | Expand analyser (up to ~85% coverage on large corpus), set up spell checker, add weights, evaluate performance, demonstrate in libreoffice | if you have a large corpus to work from |
| State-of-the-art MT System | Improve one direction of the MT system you developed this semester so that it has ~85% trimmed coverage on a large corpus and mostly passes testvoc. | if you think you can make something good enough where the next step would be for a native speaker to evaluate its output |
| New MT System | Make a polished MT system between your language and either a closely related language or a big language for which there are already good resources | if you're good at structural transfer |
| UD Corpus + Annotation standards | Create and release a good sized UD corpus of your language and release annotation guidelines for it | if you have a large/diverse enough Free corpus (>500 sentences) that you can understand (e.g., one with glosses) and you like syntactic annotation |
Going beyond what you've done
| Project | Description | Recommended |
|---|---|---|
| Transducer for a new language | Create a transducer for a language that doesn't seem to have an open source transducer available. Under-resourced languages preferred. | if you feel like you got the hang of lexc and twol, and have interest in another language |
| Keyboard layout suite | Create keyboard layouts for multiple operating systems (Windows, Mac, iOS, Android, etc.) and release all of them together, with documentation about use and design principles. | if you have strong opinions about keyboard use and don't want to code or do analysis |
| Matxin translator | Create a Matxin version of your translation pair (or another translation pair, in consultation with the professor). | if you're good at syntax, don't mind xml, and are a little masochistic... |
| Autoglosser | Create a script that can convert the output of a parser (e.g., UD) into interlinear glossed text in a variety of formats (LaTeX, general html/css, MediaWiki, command-line readable), testing with your language. | if you're good at a scripting language (e.g., python), or want to learn |
| Mobile keyboard with auto-correct | With some coding, an Android keyboard may be able to use the acceptor part of a transducer (i.e., a spell-checker) for auto-correction/suggestion. | if you have Android development experience or want to learn |
| Tokenisation for spaceless orthographies | Figure out a way to tokenise text (for morphological analysis) in a language that doesn't use spaces in its orthography / has optional spaces, and implement it into lt-proc or hfst-proc. | if you're good at C++ and like thinking about how spaceless orthographies are readable. |
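As a starting point for thinking about the tokenisation idea, here is a minimal sketch of a greedy left-to-right longest-match tokeniser in Python. It is only a naive baseline for experimenting with the problem: the function name, the set-of-strings lexicon, and the toy data are hypothetical, and the sketch does not reflect how lt-proc or hfst-proc work internally.

```python
def longest_match_tokenise(text, lexicon):
    """Split unspaced text by repeatedly taking the longest known prefix.

    `lexicon` is a set of known word forms (a hypothetical stand-in for
    the forms a transducer accepts). Characters that start no known form
    are emitted as single-character tokens.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                break
        else:
            candidate = text[i]  # nothing matched: fall back to one character
        tokens.append(candidate)
        i += len(candidate)
    return tokens


# Toy example with an invented lexicon.
print(longest_match_tokenise("thecatsat", {"the", "cat", "sat", "he", "at"}))
```

Greedy matching commits to the longest prefix and never backtracks, so it can choose a split that leaves the rest of the text unanalysable even when a different split would work. Handling that kind of ambiguity is part of what the project would need to address.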