This page describes requirements and presents ideas for doing a computational linguistics final project.
The project should be something new that could potentially be useful to a linguistic community or to people working with linguistic communities. Some project ideas are listed below. Most projects will require some form of evaluation of the results, and should be released to the general public.
You can form different groups for the final project than the ones you have been working in this semester. This includes the option of working by yourself. Let me know what you decide as you decide it.
Part of your project is to evaluate the effectiveness of whatever you made. An appropriate method for evaluating your project should be established in consultation with the professor. This requirement may be waived in extreme cases.
- If you improve or make a transducer, you should test coverage over a large corpus, and precision and recall against hand-annotated, randomly selected forms;
- If you improve or make an MT system, you should test trimmed coverage, word error rate (WER), and position-independent error rate (PER).
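As a rough illustration of the metrics named above, here is a minimal Python sketch. It is not the tooling used in class: the function names and input shapes are hypothetical, coverage is taken to be the share of tokens that receive at least one analysis, and PER follows one common bag-of-words definition, so numbers from standard evaluation scripts may differ slightly.

```python
from collections import Counter


def naive_coverage(analyses_per_token):
    """Fraction of corpus tokens that received at least one analysis.

    `analyses_per_token` is a list with one entry per token; each entry
    is the (possibly empty) list of analyses for that token.
    """
    if not analyses_per_token:
        return 0.0
    covered = sum(1 for analyses in analyses_per_token if analyses)
    return covered / len(analyses_per_token)


def precision_recall(predicted, gold):
    """Precision and recall of analyses against hand-annotated forms.

    Both arguments map a surface form to the set of its analyses.
    Only forms present in the gold annotation are scored.
    """
    tp = fp = fn = 0
    for form, gold_analyses in gold.items():
        pred = predicted.get(form, set())
        tp += len(pred & gold_analyses)   # correct analyses produced
        fp += len(pred - gold_analyses)   # produced but not in gold
        fn += len(gold_analyses - pred)   # in gold but not produced
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row dynamic programming for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: like WER, but ignoring word order."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection counts words shared regardless of position.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)
```

For a single sentence pair, `wer(reference, hypothesis)` and `per(reference, hypothesis)` each return a rate where 0 means no errors. A hypothesis that contains the right words in the wrong order scores 0 on PER but not on WER, which is why reporting both is informative.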
Your final project should be published and publicly accessible.
- Make a wiki page User:Username/Final_project which outlines what you did, links to your code, and provides the results of your evaluation.
- Put the page in the category Category:sp21_FinalProjects.
- Also document anything you are aware of that still needs to be done.
- Put code somewhere public (github.com or similar)
- Include the following files:
  - README file giving an overview of what the code is and what it does, examples of how to use it, and a link back to your wiki page
  - AUTHORS file providing your name (or alias) and some way of contacting you (can be a throw-away email address that forwards to your main address). Be as anonymous or open with your identity as you want.
  - LICENSE file that includes a copy of the open-source license you chose to release the project under
- If your project is completed by way of contribution to an existing code base, make sure these files are there too, but also link the wiki page to each commit you made or any issue tracker you interacted with (e.g., list the issues you took on).
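Before sending the link, it can help to check that the required files are in place. The sketch below is only a convenience, not part of the course requirements: the function name is hypothetical, and it assumes that a name with an extension (such as README.md) or in a different case still counts.

```python
from pathlib import Path

# File names required by the project guidelines above.
REQUIRED_FILES = ("README", "AUTHORS", "LICENSE")


def missing_required_files(repo_dir):
    """Return the required names with no match at the top level of repo_dir.

    Assumption: a file matches with or without an extension
    (e.g. README.md), compared case-insensitively.
    """
    present = {p.stem.upper() for p in Path(repo_dir).iterdir() if p.is_file()}
    return [name for name in REQUIRED_FILES if name not in present]
```

Calling `missing_required_files(".")` from the top of your repository returns an empty list when all three files are present.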
When you're done with the project, send an email to the professor with a link to the wiki page on the project.
The due date for 2021 is TBA.
During the finals period, you will present your progress on the project up to that point in the form of a poster. This semester, we will meet on TBA from TBA to TBA for these presentations.
Besides being available to answer questions about the poster, you will give a short (30-second) introduction to your project, as well as a longer (5-10 minute) presentation of your poster.
These are topics you should cover:
- A description of the problem addressed by your project. For some projects this may require thinking about it in a somewhat new light. Whether or not the problem is something we've dealt with in class, provide a brief but complete introduction to the problem.
- Any previous work you've found that also addresses this problem or similar problems, and how it relates to what you're doing.
- Your thoughts on who might benefit from your project and in what way. E.g., might a language community be able to find a use for what you're doing, or maybe [computational] linguists working on a language or issue? Remember that Bird (2020) advocates for evaluating in this way, though it might not be possible to do so within the context of this class.
- How you're approaching the solution to the overall problem, including how you're implementing the solution. You can talk here about smaller individual issues that have arisen as well.
- How you're evaluating the effectiveness of the solution, and some preliminary look at results of the evaluation. We'll understand that the project isn't yet complete, so the evaluation may show that the project is entirely ineffective and useless :-P
This section lists some ideas for final projects, but you should not feel limited by these ideas.
Extending what you've done
|Mature tagger||Expand analyser (up to ~85% coverage on large corpus), improve disambiguation, and evaluate performance of both||if you have a large corpus to work from|
|Spell checker||Expand analyser (up to ~85% coverage on large corpus), set up spell checker, add weights, evaluate performance, demonstrate in LibreOffice||if you have a large corpus to work from|
|State-of-the-art MT System||Improve one direction of the MT system you developed this semester so that it has ~85% trimmed coverage on a large corpus and mostly passes testvoc.||if you think you can make something good enough where the next step would be for a native speaker to evaluate its output|
|New MT System||Make a polished MT system between your language and either a closely related language or a big language for which there are already good resources||if you're good at structural transfer|
|UD Corpus + Annotation standards||Create and release a good-sized UD corpus of your language, along with annotation guidelines for it||if you have a large/diverse enough Free corpus (>500 sentences) that you can understand (e.g., one with glosses) and you like syntactic annotation|
Going beyond what you've done
|Transducer for a new language||Create a transducer for a language that doesn't seem to have an open source transducer available. Under-resourced languages preferred.||if you feel like you got the hang of lexc and twol, and have interest in another language|
|Keyboard layout suite||Create keyboard layouts for multiple operating systems (Windows, Mac, iOS, Android, etc.) and release all of them together, with documentation about use and design principles.||if you have strong opinions about keyboard use and don't want to code or do analysis|
|Apertium-separable support for a translation pair||Add support for Apertium-separable to any existing translation pair (including Apertium pairs), and get a sizable number of translations working using it.||if you're interested in syntactic long-distance dependencies, or think dealing with them outside of the syntactic transfer module would benefit a translation pair|
|Apertium-anaphora support for a translation pair||Add support for Apertium-anaphora to any existing translation pair (including Apertium pairs), and get some number of test translations working using it.||if you're interested in anaphora resolution (e.g., deciding which referent a pronoun refers to and giving it the right gender based on that) and would like to play with this new(ish) module|
|Mobile keyboard with auto-correct||With some coding, an Android keyboard may be able to use the acceptor part of a transducer (i.e., a spell-checker) for auto-correction/suggestion.||if you have Android development experience or want to learn|
|Improve apertium-init||Go through the list of issues for apertium-init and improve the tool significantly.||if you have some basic Python skills and enjoy thinking of Apertium modules as composed of lots of moving parts|
|Paradigm generation for LaTeX||Create a paradigm generation tool that allows one to iterate paradigm generation for a series of stems to allow someone to easily generate something similar to a "verb conjugation book" (like a "501 verbs" book). The basic functionality would be similar to the apertium-paradigmatrix, but with iteration.||if you are interested in LaTeX, a scripting language, and/or paradigm reference books|
|Transducer support in OCR||Implement the ability for an open-source OCR system (like Tesseract) to use HFST or lttoolbox transducers in place of a lookup dictionary. (Optical Character Recognition systems use lookup dictionaries as a way to improve accuracy.)||if you're good with C/C++|
|Design your own!||Any sizable improvement to any language technology tool out there, ideally for a language that could use it, chosen in consultation with the instructor.||if nothing here sounds quite right|