Final projects

From LING073

This page describes requirements and presents ideas for doing a computational linguistics final project.


The project

The project should be something new that could be useful to a linguistic community or to people working with linguistic communities. Some project ideas are listed below. Most projects will require some form of evaluation of the results, and should be released to the general public.


Part of your project is to evaluate the effectiveness of whatever you made. An appropriate method for evaluating your project should be established in consultation with the professor. This requirement may be waived in extreme cases.

Examples include:

  • If you improve or make a transducer, you should test coverage over a large corpus, as well as precision and recall against hand-annotated, randomly selected forms;
  • If you improve or make an MT system, you should test trimmed coverage, word error rate (WER), and position-independent error rate (PER).
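
The measures named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the required evaluation procedure: the function names and input shapes are hypothetical, analyses are treated as plain strings, and PER follows one common formulation (bag-of-words matches, normalised by reference length). Agree on the actual method with the professor.

```python
from collections import Counter


def coverage(analysed_tokens):
    """Fraction of corpus tokens that received at least one analysis.

    analysed_tokens: list of (token, set_of_analyses) pairs (assumed shape).
    """
    if not analysed_tokens:
        return 0.0
    known = sum(1 for _, analyses in analysed_tokens if analyses)
    return known / len(analysed_tokens)


def precision_recall(gold, predicted):
    """Micro-averaged precision and recall over hand-annotated forms.

    gold, predicted: dicts mapping a surface form to a set of analyses.
    Only forms present in gold are scored.
    """
    true_pos = sum(len(gold[f] & predicted.get(f, set())) for f in gold)
    n_pred = sum(len(predicted.get(f, set())) for f in gold)
    n_gold = sum(len(gold[f]) for f in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: like WER, but ignoring word order."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)
```

Because PER ignores word order, a hypothesis that contains the right words in the wrong order scores 0 on PER but not on WER; reporting both gives a rough sense of how much of the error is due to reordering.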


Your final project should be published and publicly accessible.

  • Make a wiki page User:Username/Final_project which outlines what you did, links to your code, and provides the results of your evaluation.
  • Put the code somewhere public (the Apertium svn repo or similar).
  • Include the following files:
    • a README file that gives an overview of what the code is and what it does, shows examples of how to use it, and links back to your wiki page
    • an AUTHORS file providing your name (or alias) and some way of contacting you (this can be a throw-away email address that forwards to your main address). Be as anonymous or as open with your identity as you want.
    • a COPYING or LICENSE file that includes a copy of the open-source license you chose to release the project under
    • If your project is completed by way of contribution to an existing code base, make sure these files are there too, but also link the wiki page to each commit you made or any issue tracker you interacted with (e.g., list the issues you took on).
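
Before sending the link, it may help to check that the required files are actually in place. The following is a minimal sketch; the function name and the choice to accept file extensions (e.g. README.md or LICENSE.txt) are assumptions of this example, not course requirements.

```python
from pathlib import Path

# Label for each requirement -> file names (without extension) that satisfy it.
REQUIRED = {
    "README": ("README",),
    "AUTHORS": ("AUTHORS",),
    "licence": ("COPYING", "LICENSE"),
}


def missing_release_files(repo_dir):
    """Return the labels of required files not found at the top of repo_dir.

    Names are compared case-insensitively and without their extension,
    so README.md counts as a README (an assumption of this sketch).
    """
    stems = {p.stem.upper() for p in Path(repo_dir).iterdir() if p.is_file()}
    return [label for label, names in REQUIRED.items()
            if not any(name in stems for name in names)]
```

Calling `missing_release_files(".")` from the top of a checkout returns an empty list when all three requirements are met.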


When you're done with the project, send an email to the professor with a link to the wiki page on the project.

The due date for 2017 is Wednesday, May 10th at the end of the day (midnight). This replaces the earlier due date of Monday, May 8th.


During the finals period, you will present your progress on the project up to that point. This semester, we will meet on Friday, May 5th from 9:00 to noon for these presentations.

For this assignment, you are expected to give a 10-minute presentation in which you share the following with your classmates:

  • A description of the problem addressed by your project. For some projects this may require thinking about it in a somewhat new light. Whether or not the problem is something we've dealt with in class, provide a brief but complete introduction to the problem.
  • Any previous work you've found that also addresses this problem or similar problems, and how it relates to what you're doing.
  • Your thoughts on who might benefit from your project and in what way. E.g., might a language community be able to find a use for what you're doing, or maybe [computational] linguists working on a language or issue?
  • How you're approaching the solution to the overall problem, including how you're implementing the solution. You can talk here about smaller individual issues that have arisen as well.
  • How you're evaluating the effectiveness of the solution, and some preliminary look at results of the evaluation. We'll understand that the project isn't yet complete, so the evaluation may show that the project is entirely ineffective and useless :-P

Project ideas

This section lists some ideas for final projects, but you should not feel limited by these ideas.

Extending what you've done

Title | Description | Good if…
Mature tagger | Expand analyser (up to ~85% coverage on large corpus), improve disambiguation, and evaluate performance of both | if you have a large corpus to work from
Spell checker | Expand analyser (up to ~85% coverage on large corpus), set up spell checker, add weights, evaluate performance, demonstrate in LibreOffice | if you have a large corpus to work from
State-of-the-art MT System | Improve one direction of the MT system you developed this semester so that it has ~85% trimmed coverage on a large corpus and mostly passes testvoc. | if you think you can make something good enough where the next step would be for a native speaker to evaluate its output
New MT System | Make a polished MT system between your language and either a closely related language or a big language for which there are already good resources | if you're good at structural transfer
UD Corpus + Annotation standards | Create and release a good-sized UD corpus of your language and release annotation guidelines for it | if you have a large/diverse enough Free corpus (>500 sentences) that you can understand (e.g., one with glosses) and you like syntactic annotation

Going beyond what you've done

Title | Description | Good if…
Transducer for a new language | Create a transducer for a language that doesn't seem to have an open source transducer available. Under-resourced languages preferred. | if you feel like you got the hang of lexc and twol, and have interest in another language
Keyboard layout suite | Create keyboard layouts for multiple operating systems (Windows, Mac, iOS, Android, etc.) and release all of them together, with documentation about use and design principles. | if you have strong opinions about keyboard use and don't want to code or do analysis
Matxin translator | Create a Matxin version of your translation pair (or another translation pair, in consultation with the professor). | if you're good at syntax, don't mind XML, and are a little masochistic...

Something new

Title | Description | Good if…
UD Annotatrix | Go through the list of issues to improve the UD Annotatrix interface | if you're good with JavaScript/jQuery, or want to learn
Paradigm generation site | Make a maintainer-configurable (i.e., usable for any language) web application for generating paradigms of a given POS (noun/verb) in morphologically rich languages that queries Apertium's web API to fill in the forms. | if you're good with JavaScript, or want to learn
Autoglosser | Create a script that can convert the output of a parser (e.g., UD) into interlinear glossed text in a variety of formats (LaTeX, general HTML/CSS, MediaWiki, command-line readable), testing with your language. | if you're good at a scripting language (e.g., Python), or want to learn
Mobile keyboard with auto-correct | With some coding, an Android keyboard may be able to use the acceptor part of a transducer (i.e., a spell-checker) for auto-correction/suggestion. | if you have Android development experience or want to learn
Tokenisation for spaceless orthographies | Figure out a way to tokenise text (for morphological analysis) in a language that doesn't use spaces in its orthography / has optional spaces, and implement it into lt-proc or hfst-proc. | if you're good at C++ and like thinking about how spaceless orthographies are readable.
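
As a starting point for thinking about the tokenisation idea, here is a minimal sketch of greedy longest-match segmentation in Python. It is a prototype of one possible baseline, not how lt-proc or hfst-proc work; the function name, the `accepts` callable, and the `max_len` limit are all hypothetical. In a real project `accepts` would presumably be backed by the acceptor side of your transducer.

```python
def tokenise_longest_match(text, accepts, max_len=20):
    """Greedy left-to-right segmentation of unspaced text.

    accepts: callable returning True for a valid word form (assumed to be
    supplied by the caller, e.g. a lexicon lookup).
    A character that starts no accepted form becomes a one-character token.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if accepts(text[i:j]):
                break
        else:
            j = i + 1  # nothing matched: fall back to a single character
        tokens.append(text[i:j])
        i = j
    return tokens


# Toy lexicon standing in for a transducer's acceptor.
lexicon = {"the", "them", "men", "eat"}
print(tokenise_longest_match("themeneat", lexicon.__contains__))
```

With this toy lexicon the greedy strategy commits to "them" and then cannot recover "men", which illustrates why a real solution probably needs backtracking or weights rather than a purely greedy pass.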