Difference between revisions of "Spring 2019/Structural transfer"

Latest revision as of 18:37, 17 April 2021

Background

The basic idea of structural transfer in RBMT

The idea of structural transfer in RBMT is to deal with the differences in word order and tags encountered when translating between two languages.

The arrows between the two tagged levels represent where structural transfer is needed. Colour coding shows [rough] correspondences.

How structural transfer works in Apertium

Transfer takes the output of the biltrans mode (bilingual translation), matches series of words based on patterns you define, performs operations on the matched words, and outputs the result. It allows you to change the order of words, change tags, etc.

Three levels

There are three stages of structural transfer in Apertium: chunker (t1x), interchunk (t2x), and postchunk (t3x). The effect of some rules implemented at each stage is shown below:

Each stage of structural transfer: chunker, interchunk, postchunk

Chunker has access to word-level lemmas and tags, interchunk has access to chunk-level names and tags, and postchunk has access only to chunk-level names.

The structure of a transfer file

The rules in a transfer file go in <section-rules>...</section-rules>. Each <rule>...</rule> consists of a <pattern>...</pattern> and an <action>...</action>.

The matched pattern is an ordered list of <pattern-item>...</pattern-item>s, whose names refer to <def-cat>...</def-cat>s, which contain <cat-item tags=""/>s (tags defined as in lexical selection) and are defined in <section-def-cats>...</section-def-cats>.

The action section of a rule can contain <out>...</out> blocks containing the general structure of what is output in place of the matched pattern, <let>...</let> statements for setting variables (defined in <section-def-vars>...</section-def-vars>) or mutating tags, <choose>...</choose> conditional blocks, and <call-macro>...</call-macro> statements for calling a macro.

Macros are defined in <def-macro>...</def-macro> blocks inside <section-def-macros>...</section-def-macros>. They allow any combination of parts of an action section (though <out>...</out> blocks are to be avoided) to be used within an arbitrary action section.

An <out>...</out> block should immediately contain a <chunk>...</chunk>, which in turn contains chunk <tags>...</tags> and <lu>...</lu> (lexical unit) blocks (separated by <b/> spaces) defining the lexical unit and corresponding tags to be output. For multiple units being output as a single lexical unit, <lu>...</lu> blocks should be wrapped in an <mlu>...</mlu> block.

Each lexical unit consists of <clip/>s, which contain the attributes pos="" for position matched in the pattern, side="" for the side to output, and part="" for the part of the material to output. Parts can be specified as lem for the lemma, whole for the entirety, and any set of tags (as a list of <attr-item/>s) you define as <def-attr>...</def-attr> in <section-def-attrs>...</section-def-attrs>.
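As a rough illustration of how these pieces fit together, the sketch below embeds a minimal chunker-style (t1x) file in a Python string and checks that every pattern item refers to a defined category. The category, attribute, and chunk names (adj, nom, a_adj, nbr, adj_nom, SN) are invented for the example and not taken from any particular pair, and the two helper functions are hypothetical conveniences, not part of Apertium.

```python
import xml.etree.ElementTree as ET

# Minimal chunker-style file; every n="..." name and tag here is illustrative.
TRANSFER_XML = """
<transfer default="chunk">
  <section-def-cats>
    <def-cat n="adj"><cat-item tags="adj"/><cat-item tags="adj.*"/></def-cat>
    <def-cat n="nom"><cat-item tags="n.*"/></def-cat>
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="a_adj"><attr-item tags="adj"/></def-attr>
    <def-attr n="nbr"><attr-item tags="sg"/><attr-item tags="pl"/></def-attr>
  </section-def-attrs>
  <section-rules>
    <rule comment="adj nom: output as nom adj">
      <pattern>
        <pattern-item n="adj"/>
        <pattern-item n="nom"/>
      </pattern>
      <action>
        <out>
          <chunk name="adj_nom">
            <tags>
              <tag><lit-tag v="SN"/></tag>
            </tags>
            <!-- noun first: copy the whole target-side unit matched at position 2 -->
            <lu><clip pos="2" side="tl" part="whole"/></lu>
            <b/>
            <!-- adjective second: its lemma and category, number taken from the noun -->
            <lu>
              <clip pos="1" side="tl" part="lem"/>
              <clip pos="1" side="tl" part="a_adj"/>
              <clip pos="2" side="tl" part="nbr"/>
            </lu>
          </chunk>
        </out>
      </action>
    </rule>
  </section-rules>
</transfer>
"""


def rule_patterns(xml_text):
    """Return, for each rule, the list of category names in its pattern."""
    root = ET.fromstring(xml_text)
    return [[item.get("n") for item in rule.find("pattern")]
            for rule in root.iter("rule")]


def undefined_categories(xml_text):
    """Return pattern-item names that have no matching def-cat."""
    root = ET.fromstring(xml_text)
    defined = {cat.get("n") for cat in root.iter("def-cat")}
    return [item.get("n") for item in root.iter("pattern-item")
            if item.get("n") not in defined]


print(rule_patterns(TRANSFER_XML))
print(undefined_categories(TRANSFER_XML))
```

A check like this only catches naming slips; whether the rule produces the right output still has to be tested by running the pair.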

Some things to note

  • You can't match subpatterns or superpatterns. If more than one overlapping pattern matches in a sentence, only one is chosen. To deal with overlapping patterns, you can write macros and call them from multiple rules.
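The following is a simplified model of how a single rule wins when patterns overlap. It assumes the left-to-right, longest-match strategy that Apertium's transfer is generally described as using; the rule names and category sequences are hypothetical, and real matching operates on tag patterns rather than bare category names.

```python
def apply_rules(cats, rules):
    """Segment a list of category names using left-to-right longest match.

    rules maps a rule name to a tuple of category names.
    Returns a list of (rule name or None, matched categories).
    """
    out = []
    i = 0
    while i < len(cats):
        best_name, best_len = None, 0
        for name, pattern in rules.items():
            n = len(pattern)
            # Keep the longest pattern that matches starting at position i.
            if n > best_len and tuple(cats[i:i + n]) == tuple(pattern):
                best_name, best_len = name, n
        if best_name is None:
            # No rule matches here: pass one word through and move on.
            out.append((None, cats[i:i + 1]))
            i += 1
        else:
            out.append((best_name, cats[i:i + best_len]))
            i += best_len
    return out


# Hypothetical rules with overlapping patterns.
RULES = {
    "det_nom": ("det", "nom"),
    "det_adj_nom": ("det", "adj", "nom"),
    "adj_nom": ("adj", "nom"),
}

print(apply_rules(["det", "adj", "nom", "vblex"], RULES))
```

In this model the adj_nom rule never fires inside the longer det_adj_nom match, which is why shared logic is better placed in a macro that both rules call.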

Examples of implemented Apertium transfer systems

Plenty of examples are available:

  • eng-kir: a transfer module that covers the example above and basically nothing else.
  • eng-spa (in-class): a basic example from class showing how to transfer adjective+noun from English to Spanish ("big houses → casas largas": number and gender agreement and reordering) using chunking (chunker+interchunk).
  • eng-spa (apertium): a mature translation pair with well developed structural transfer for English-Spanish and Spanish-English translation.
  • And lots in between. See especially the pairs available in Apertium's GitHub.

Writing rules

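Transfer rules are among the best documented features of Apertium. Here are some places to read, in approximate order of complexity:

  • A long introduction to transfer rules (on the Apertium wiki)
  • Full apertium documentation (http://xixona.dlsi.ua.es/~fran/apertium2-documentation.pdf; section 3.5 covers the transfer module)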

Evaluating

Scrape a mini test corpus

  1. First make sure you have scrapeTransferTests. Test that running scrapeTransferTests gives you information on using the tool. If not, clone the tools repo (or git pull to update it, if you already have it cloned from other assignments) and run sudo make. Test again.
  2. Scrape the transferTests from your contrastive grammar page into a small parallel corpus. E.g., scrapeTransferTests -p abc-xyz "Language1_and_Language2/Contrastive_Grammar" will produce two files, abc.tests.txt and xyz.tests.txt, that contain the respective sides of any transferTests on your contrastive grammar page specified as being for abc-to-xyz translation.
  3. Add these two files to your bilingual corpus repository and add mention of their origin (the wiki page) to the MANIFEST file.

WER and PER

WER or word error rate is a measure of how different two texts are. You will want to know how different the translation your translation pair produces (the "test translation") is from the known good translation of phrases in your parallel corpus (the "reference translation").

PER (position-independent error rate) is the same measurement, just not sensitive to position in a phrase. I.e., a correct translation of every word but in an entirely wrong word order will give you high (bad) WER but low (good) PER.
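To make the contrast concrete, here is a minimal sketch of both measures for a single sentence pair. It assumes whitespace tokenisation, a non-empty reference, and a common bag-of-words definition of PER; apertium-eval-translator may compute its figures somewhat differently, so treat the numbers as illustrative only.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, test):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = reference.split(), test.split()
    return edit_distance(ref, hyp) / len(ref)


def per(reference, test):
    """Position-independent error rate: compares the words as bags."""
    ref, hyp = reference.split(), test.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Every word is right, but the order is wrong.
print(wer("casas largas", "largas casas"))
print(per("casas largas", "largas casas"))
```

With the words correct but reordered, WER comes out high and PER comes out at zero, which is the pattern to expect before any reordering rules are written.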

To test WER and PER:

  1. First make sure you have apertium-eval-translator. Test that running apertium-eval-translator gives you information on using the tool. If not, clone the tools repo (or git pull to update it, if you already have it cloned from other assignments) and run make.
  2. You need two files: one test translation, and one reference translation. The reference translation is the parallel text in your corpus, e.g. abc.tests.txt. To get a test translation, run the source text through apertium and direct the output into a new file, e.g. cat xyz.tests.txt | apertium -d . xyz-abc > xyz-abc.tests.txt. You should add the [final] test translation to your repository.
  3. The following command should then give you WER and PER measures and some other useful numbers:
    • apertium-eval-translator -r abc.tests.txt -t xyz-abc.tests.txt
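Since it is easy to mix up which file is the reference and which is the test translation, the small helper below spells out the two commands for a given direction. It is only a sketch: the function name is hypothetical, the file naming follows the abc/xyz convention used above, and it builds the command strings without running them.

```python
import shlex


def eval_commands(src="xyz", trg="abc", pair_dir="."):
    """Return the shell commands for translating src to trg and scoring the result."""
    source_file = f"{src}.tests.txt"        # text to translate
    reference_file = f"{trg}.tests.txt"     # known good translation
    test_file = f"{src}-{trg}.tests.txt"    # output of the translation pair
    translate = (f"cat {source_file} | apertium -d {shlex.quote(pair_dir)} "
                 f"{src}-{trg} > {test_file}")
    evaluate = f"apertium-eval-translator -r {reference_file} -t {test_file}"
    return [translate, evaluate]


for command in eval_commands():
    print(command)
```

Calling eval_commands("abc", "xyz") gives the commands for the opposite direction, where xyz.tests.txt becomes the reference.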

The assignment

This assignment is due early in week 13 (this semester, noon on Wednesday, April 17, 2019).

Getting set up

  1. Add a page to the wiki called Language1_and_Language2/Structural_transfer, linking to it from the main page on the language pair.
    • Put the page in the category Category:Sp19_StructuralTransfer and the categories for the two languages.
    • Perform WER, PER, and coverage tests on your short sentences corpus, and add the results to a pre-evaluation section.

Adding stems

  1. Add all the words needed to analyse the transfer tests (from the last assignment) to the bilingual dictionary.
    • Make sure both analysers can analyse all sentences correctly, which includes adding the words to the relevant monolingual dictionaries as necessary.

Write structural transfer rules

  1. Implement at least one item from your contrastive grammar.
    • Each person in each group should implement at least one item for the direction that translates into the language that they have been primarily working with. The same item does not need to be used for each direction.
    • If the contrastive grammar item only involves relabelling or reordering tags within the same form, then please do at least two items.

Wrapping up

  1. Add to your structural transfer wiki page:
    • At least one example sentence for each item you implement. Show the outputs of the following modes for your translation system: tagger, biltrans, chunker, interchunk, postchunk, and the pair itself (abc-xyz).
    • A post-evaluation section: perform the WER, PER, and coverage tests again and add the results there.
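To collect the outputs of each mode for an example sentence, something like the sketch below can generate the commands to run. It assumes your pair defines debugging modes named in the usual abc-xyz-tagger, abc-xyz-biltrans, etc. pattern (check your modes.xml, since the names are an assumption here), and the function itself is hypothetical.

```python
import shlex


def debug_mode_commands(sentence, src="abc", trg="xyz", pair_dir="."):
    """Return one shell command per mode, ending with the full translation."""
    stages = ["tagger", "biltrans", "chunker", "interchunk", "postchunk"]
    # Assumed mode naming: abc-xyz-STAGE for each stage, then the pair itself.
    modes = [f"{src}-{trg}-{stage}" for stage in stages] + [f"{src}-{trg}"]
    return [f"echo {shlex.quote(sentence)} | apertium -d {shlex.quote(pair_dir)} {mode}"
            for mode in modes]


for command in debug_mode_commands("big houses"):
    print(command)
```

Running the printed commands in order shows how the sentence changes at each stage, which makes it easier to see which stage a faulty rule belongs to.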