Difference between revisions of "Lexical selection"

From LING073
Revision as of 23:14, 4 April 2022

Lexical selection

Lexical selection handles cases where a single word in one language translates to more than one word in another language. The formalism used in Apertium allows you to choose translations based on the lemma and/or part of speech of surrounding words. In the Apertium pipeline, the lexical selection module modifies the stream after lexical transfer and before structural transfer.

Writing rules

Lexical selection rules are in the apertium-xyz-abc.xyz-abc.lrx file in your language pair. You can find information about and examples of the formalism on the Apertium wiki: constraint-based lexical selection module, how to get started with lexical selection rules.

In short, you use the surrounding words (with morphological analyses) as context to decide which translation is the correct one.

Linguistic examples

An example that might need lexical selection is the following:

  • yard (a unit of length)
  • yard (an enclosed area)

What's different about these words? Spelling, pronunciation, part of speech? What [different] contexts might each occur in?

Can we decide which translation is right from the surrounding words?

Words can have multiple translations from Language A to Language B in any of these situations:

  • There are two words in Language A that happen to have the same lemma and POS tags
    • (kir) жаш → (eng) age; tear (eye water)
    • (kir) ат → (eng) horse; name
    • (eng) right → (kir) оң (side); туура (correct)
    • (eng) lie → (kir) жат (lie down); калп айт (state a falsehood)
    • (eng) bat → (kir) жарганат (animal); таякча (stick for hitting a ball)
  • There is a semantic distinction in Language B that is not expressed in Language A
    • (kir) кол → (eng) hand; arm
    • (kir) күн → (eng) sun; day
    • (kir) бир → (eng) one; a/an
    • (kir) кара → (eng) look at; watch
    • (eng) paper → (kir) кагаз (material); доклад (written work)
    • (eng) horse → (kir) ат (for riding, working); жылкы (for herding, eating)
    • (eng) sister → (kir) сиңди (younger, of female); карындаш (younger, of male); эже (older)
    • (eng) move → (kir) жыл/жылдыр (location iv/tv); кыймылда/кыймылдат (movement iv/tv); көч/көчүр (residence, files iv/tv)
    • (eng) fall → (kir) түш (drop from a height / fall down); кула (to the side / fall over); жыгыл (due to loss of footing)

What do we do with examples like this?

  • (eng) tear → (kir) жаш (eye water); айыр (rip)
  • (eng) plant → (kir) өсүмдүк (growing thing); отургуз (put in the ground to grow)
  • (eng) fall → (kir) күз (season, <n>); күзгү (relating to the season, <adj>) [the verbs above]
  • (kir) өч → (eng) revenge; be extinguished
  • (kir) кыл → (eng) strand of hair; string (of instrument); do; make
  • (kir) кара → (eng) black; look at; watch
  • (kir) жыл → (eng) year; move

How do we know when to apply lexical selection and when to apply morphological disambiguation?

The structure of lrx files

An lrx file is written in XML. It consists of a list of rules surrounded by <rules>...</rules>. Each <rule>...</rule> block contains some combination of <match><select ... /></match> blocks and <match .../> blocks. <match ...> tags can contain attributes like lemma="" and tags="". Multiple <match .../> blocks may be grouped within <or>...</or>, so that a single rule can cover several alternative matching criteria instead of requiring one rule per criterion.
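
As a rough sketch of the structure just described, an lrx file might look like the following. The lemmas (word1, context1, translation1, and so on) and the tag patterns are placeholders invented for illustration, not entries from any real dictionary; the Apertium wiki pages linked above remain the authoritative description of the format.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rules>
  <!-- One context word followed by the word whose translation is chosen -->
  <rule>
    <match lemma="context1" tags="adj"/>
    <match lemma="word1" tags="n.*">
      <select lemma="translation1"/>
    </match>
  </rule>

  <!-- Several alternative context words grouped in <or>,
       so one rule covers what would otherwise need two -->
  <rule>
    <or>
      <match lemma="context2"/>
      <match lemma="context3"/>
    </or>
    <match lemma="word1" tags="n.*">
      <select lemma="translation2"/>
    </match>
  </rule>
</rules>
```

The <match> that contains a <select> is the word being translated; the other <match> elements only describe its context.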

Besides the additional documentation linked to above, countless examples can be found in the Apertium codebase, and some examples from class will also be posted.

Here is an example to try in class:

  • Көзүнөн жаш акты.
  • Анын жашын билбейм.
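
One possible starting point for these two sentences is sketched below. It assumes that the bilingual dictionary lists both "age" and "tear" for жаш, that the analyser gives ак ("flow") a verb analysis whose tags begin with v, and that noun tags begin with n. The real lemmas and tags depend on your own analyser and dictionary, so treat this as a draft to adapt rather than a finished rule.

```xml
<rules>
  <!-- Көзүнөн жаш акты. (roughly: "A tear flowed from his/her eye.") -->
  <!-- Assumed generalisation: жаш directly before the verb ак is "tear" -->
  <rule>
    <match lemma="жаш" tags="n.*">
      <select lemma="tear"/>
    </match>
    <match lemma="ак" tags="v.*"/>
  </rule>

  <!-- Анын жашын билбейм. (roughly: "I don't know his/her age.") -->
  <!-- No rule matches this sentence, so it relies on "age" being
       the translation that is chosen when nothing else applies -->
</rules>
```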

Additional guidelines

You may wish to make a "default" translation and then some more specific rules. Because of how lrx-proc works, the less specific rule (the "default") may override any more specific rules.

Currently, the best way to make the less specific rule act as the default seems to be with weights: give the rule you want as the default a lower weight than the specific ones (e.g., by giving each specific rule a weight above 1).
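
A minimal sketch of this weighting approach, using күн (sun; day) from the list above. The choice of "day" as the default, the context verb чык ("come out, rise"), the tag patterns, and the particular weight values are all assumptions made for illustration; check them against your own dictionaries before reusing them.

```xml
<rules>
  <!-- Үч күн күттүм. (roughly: "I waited three days.") -->
  <!-- Default rule: lower weight, matches күн in any context -->
  <rule weight="1.0">
    <match lemma="күн" tags="n.*">
      <select lemma="day"/>
    </match>
  </rule>

  <!-- Күн чыкты. (roughly: "The sun came out.") -->
  <!-- Specific rule: higher weight, so it is preferred over the
       default when күн is directly followed by the verb чык -->
  <rule weight="2.0">
    <match lemma="күн" tags="n.*">
      <select lemma="sun"/>
    </match>
    <match lemma="чык" tags="v.*"/>
  </rule>
</rules>
```

Both rules match in the second sentence; the difference in weight is what is meant to let the specific rule take precedence over the default.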

The assignment

Due on Friday of the 10th week of class (this semester, Friday, April 8th, 2022, by the end of the day (midnight))

  1. Make a new page on the wiki, Language1_and_Language2/Lexical_selection, linking to it from the main page for your language pair, and adding it to the category Category:Sp22_LexicalSelection and the categories for the two languages.
  2. Add a section xyz → abc.
  3. Find and document at least two cases of a one-to-many mapping for translation.
    • First check any dictionary resources you have for words with multiple translations. Cases where the source language lacks a distinction that the target language makes (such as hand/arm, sun/day, or moon/month) are potentially good places to start.
    • If you're translating to English and are having trouble coming up with examples for this assignment, consider words in English that are used in lexically specific ways, like in/on/at (e.g., one rides in a car but one rides on a bus) or watch/look at or say/tell.
    • See more examples under the second bullet point above where the need for lexical selection is discussed.
  4. For each ambiguous translation you work on, determine some generalisation for how to decide which translation is right, documenting these generalisations on the wiki.
    • It may be that there is no obvious generalisation that can be made for a given lexical selection problem; in this case, choose a default form to select and find some other one-to-many translation that can be solved with lexical selection. You must have at least one [mostly] solvable problem.
  5. For each translation of the word, add an example sentence (or phrase if necessary—but check with me) where the meaning is unambiguous (minimum: two sentences).
  6. In the appropriate lrx file (for the xyz-abc direction), set up any necessary default translations, and at least one set of lexical selection rules for one solvable lexical selection problem. Add the example sentences/phrases in a comment before the relevant rule.
  7. Compile and test that the sentences are translated with the correct stem.
  8. Commit and push your code!