Final projects

From LING073

This page describes requirements and presents ideas for doing a computational linguistics final project.

Requirements

The project

The project should be something new that is potentially useful to a linguistic community or to people working with linguistic communities. Some project ideas are listed below. Most projects will require some form of evaluation of the results, and the project should be released to the general public.

You can form different groups for the final project than the ones you have been working in this semester. This includes the option of working by yourself. Let me know what you decide as soon as you decide it.

Evaluation

Part of your project is to evaluate the effectiveness of whatever you made. An appropriate method for evaluating your project should be established in consultation with the professor. This requirement may be waived in extreme cases.

Examples include:

  • If you improve or make a transducer, you should test coverage over a large corpus, and precision and recall against hand-annotated, randomly selected forms;
  • If you improve or make an MT system, you should test trimmed coverage, WER (word error rate), and PER (position-independent error rate).
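As a rough illustration of the metrics named above, here is a minimal Python sketch. The function names and data shapes are assumptions made for this example, not part of any course tooling: analyses are represented as sets of strings per surface form, and sentences as lists of tokens. The PER formula used is one common formulation; check with the professor which definition your evaluation should follow.

```python
from collections import Counter


def precision_recall(gold, system):
    """Precision and recall of analyses over hand-annotated forms.

    gold, system: dicts mapping a surface form to a set of analysis strings.
    Only forms present in the gold annotation are scored.
    """
    true_pos = sum(len(gold[f] & system.get(f, set())) for f in gold)
    n_system = sum(len(system.get(f, set())) for f in gold)
    n_gold = sum(len(gold[f]) for f in gold)
    precision = true_pos / n_system if n_system else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate: compares bags of words, ignoring order."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)
```

WER penalises reordering, while PER does not, so a large gap between the two usually points at word-order problems rather than lexical ones.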

Publishing

Your final project should be published and publicly accessible.

  • Make a wiki page User:Username/Final_project which outlines what you did, links to your code, and provides the results of your evaluation.
  • Put code somewhere public (github.com or similar)
  • Include the following files:
    • a README file giving an overview of what the code is and what it does, examples of how to use it, and a link back to your wiki page
    • an AUTHORS file providing your name (or alias) and some way of contacting you (this can be a throw-away email address that forwards to your main address); be as anonymous or as open with your identity as you want
    • a COPYING or LICENSE file that includes a copy of the open-source license you chose to release the project under
    • If your project is completed by way of contribution to an existing code base, make sure these files are there too, and also link from the wiki page to each commit you made and to any issue tracker you interacted with (e.g., list the issues you took on).
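If it helps, the file checklist above can be verified before submission with a small script. This is only a sketch: the function name and the accepted filename variants (such as README.md) are assumptions for illustration, not course requirements.

```python
from pathlib import Path

# Accepted filename variants are an assumption; adjust to your repository.
REQUIRED = {
    "README": ("README", "README.md", "README.txt"),
    "AUTHORS": ("AUTHORS", "AUTHORS.md", "AUTHORS.txt"),
    "COPYING or LICENSE": ("COPYING", "COPYING.txt",
                           "LICENSE", "LICENSE.md", "LICENSE.txt"),
}


def missing_files(repo_dir):
    """Return the labels of required files not found at the top of repo_dir."""
    present = {p.name for p in Path(repo_dir).iterdir() if p.is_file()}
    return [label for label, names in REQUIRED.items()
            if not present.intersection(names)]
```

Calling `missing_files(".")` from the top of your repository returns an empty list when all three files are present.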

Submission

When you're done with the project, send an email to the professor with a link to the wiki page on the project.

The due date for 2021 is TBA.

Presentation

During the finals period, you will present your progress on the project up to that point in the form of a poster. This semester, we will meet on TBA from TBA to TBA for these presentations.

Besides being available to answer questions about the poster, you will give a short (30-second) introduction to your project, as well as a longer (5-10 minute) presentation of your poster.

These are topics you should cover:

  • A description of the problem addressed by your project. For some projects this may require thinking about it in a somewhat new light. Whether or not the problem is something we've dealt with in class, provide a brief but complete introduction to the problem.
  • Any previous work you've found that also addresses this problem or similar problems, and how it relates to what you're doing.
  • Your thoughts on who might benefit from your project and in what way. E.g., might a language community be able to find a use for what you're doing, or maybe [computational] linguists working on a language or issue? Remember that Bird (2020) advocates for evaluating in this way—it just might not be possible to do so within the context of this class.
  • How you're approaching the solution to the overall problem, including how you're implementing the solution. You can talk here about smaller individual issues that have arisen as well.
  • How you're evaluating the effectiveness of the solution, and some preliminary look at results of the evaluation. We'll understand that the project isn't yet complete, so the evaluation may show that the project is entirely ineffective and useless :-P

Project ideas

This section lists some ideas for final projects, but you should not feel limited by these ideas.

Extending what you've done

title | description | good if…
Mature tagger | Expand analyser (up to ~85% coverage on large corpus), improve disambiguation, and evaluate performance of both | if you have a large corpus to work from
Spell checker | Expand analyser (up to ~85% coverage on large corpus), set up spell checker, add weights, evaluate performance, demonstrate in LibreOffice | if you have a large corpus to work from
State-of-the-art MT System | Improve one direction of the MT system you developed this semester so that it has ~85% trimmed coverage on a large corpus and mostly passes testvoc. | if you think you can make something good enough where the next step would be for a native speaker to evaluate its output
New MT System | Make a polished MT system between your language and either a closely related language or a big language for which there are already good resources | if you're good at structural transfer
UD Corpus + Annotation standards | Create and release a good-sized UD corpus of your language and release annotation guidelines for it | if you have a large/diverse enough Free corpus (>500 sentences) that you can understand (e.g., one with glosses) and you like syntactic annotation
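The ~85% coverage targets mentioned above can be checked with a naive coverage count over analyser output. The sketch below assumes the output is in Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token's analysis begins with `*`. The function name is hypothetical, and escaped `^`, `$`, or `/` characters inside tokens are not handled. Trimmed coverage would be the same calculation run on the output of the trimmed analyser.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def coverage(analysed_text):
    """Fraction of tokens that received at least one analysis.

    Naive: a token counts as unknown when its first analysis starts with '*'
    (or when it has no analysis at all). Escaped characters are not handled.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/", 1)
        if len(parts) == 2 and not parts[1].startswith("*"):
            known += 1
    return known / len(units)
```

Coverage computed this way only says that a form received some analysis, not that the analysis is correct, which is why precision and recall against hand-annotated forms are measured separately.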

Going beyond what you've done

title | description | good if…
Transducer for a new language | Create a transducer for a language that doesn't seem to have an open source transducer available. Under-resourced languages preferred. | if you feel like you got the hang of lexc and twol, and have interest in another language
Keyboard layout suite | Create keyboard layouts for multiple operating systems (Windows, Mac, iOS, Android, etc.) and release all of them together, with documentation about use and design principles. | if you have strong opinions about keyboard use and don't want to code or do analysis
Apertium-separable support for a translation pair | Add support for Apertium-separable to any existing translation pair (including Apertium pairs), and get a sizable number of translations working using it. | if you're interested in syntactic long-distance dependencies, or think dealing with them outside of the syntactic transfer module would benefit a translation pair
Apertium-anaphora support for a translation pair | Add support for Apertium-anaphora to any existing translation pair (including Apertium pairs), and get some number of test translations working using it. | if you're interested in anaphora resolution (e.g., deciding which referent a pronoun refers to and, for instance, giving it the right gender based on that) and would like to play with this new(ish) module

Something new

title | description | good if…
UD Annotatrix | Go through the list of issues to improve the UD Annotatrix interface | if you're good with JavaScript/jQuery, or want to learn
Mobile keyboard with auto-correct | With some coding, an Android keyboard may be able to use the acceptor part of a transducer (i.e., a spell-checker) for auto-correction/suggestion. | if you have Android development experience or want to learn
Improve apertium-init | Go through the list of issues for apertium-init and improve the tool significantly. | if you have some basic Python skills and enjoy thinking of Apertium modules as composed of lots of moving parts
Paradigm generation for LaTeX | Create a paradigm generation tool that allows one to iterate paradigm generation for a series of stems to allow someone to easily generate something similar to a "verb conjugation book" (like a "501 verbs" book). The basic functionality would be similar to the apertium-paradigmatrix, but with iteration. | if you are interested in LaTeX, a scripting language, and/or paradigm reference books
Transducer support in OCR | Implement the ability for an open-source OCR system (like Tesseract) to use HFST or lttoolbox transducers in place of a lookup dictionary. (Optical Character Recognition systems use lookup dictionaries as a way to improve accuracy.) | if you're good with C/C++
Design your own! | Any sizable improvement to any language technology tool out there, ideally for a language that could use it, chosen in consultation with the instructor. | if nothing here sounds quite right