Hfst-minimise trick

From LING073
Latest revision as of 21:16, 9 April 2017

There's a "trick" using hfst-minimise that gets around some issues in converting hfst-format transducers to lttoolbox-format transducers. If hfst-proc xyz.automorf.hfst and lt-proc xyz.automorf.bin give different output for the same input, try this trick.

The trick

In your Makefile.am file:

  • In the rule for making $(LANG1).automorf.hfst, pipe the output of hfst-invert into hfst-minimise before piping the output of that into hfst-fst2fst. So the relevant part of this line should now look something like this:
    ...| hfst-invert | hfst-minimise | hfst-fst2fst -O -o $@
  • In the rule for $(LANG1).autogen.hfst, first run hfst-minimise on the input before piping that into hfst-fst2fst. That rule should now look something like this:
    hfst-minimise $< | hfst-fst2fst -O -o $@
  • Run make clean, reinitialise your module (./autogen.sh), and run make to recompile.
  • Test the output of lt-proc and hfst-proc again.

The relevant section should end up looking something like this (modulo any spellrelax or twoc elements):

$(LANG1).autogen.hfst: .deps/$(LANG1).RL.hfst
	hfst-minimise $< | hfst-fst2fst -O -o $@

$(LANG1).automorf.hfst: .deps/$(LANG1).LR.hfst
	hfst-invert $< | hfst-minimise | hfst-fst2fst -O -o $@

$(LANG1).autogen.att.gz: $(LANG1).autogen.hfst
	hfst-fst2txt $< | gzip -9 -c -n > $@

$(LANG1).automorf.att.gz: $(LANG1).automorf.hfst
	hfst-fst2txt $< | gzip -9 -c -n > $@
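
For the final "test the output again" step, a small shell helper can run both processors over a word list and report any word on which they disagree. This is only a sketch: the function name compare_procs and the file words.txt are hypothetical, and it assumes both tools print output in a directly comparable form for the same input.

```shell
#!/bin/sh
# compare_procs CMD_A CMD_B WORDLIST
# Feeds each line of WORDLIST to both commands on stdin and prints a
# MISMATCH report for every line where the two outputs differ.
# Returns 0 if all outputs agree, 1 otherwise.
compare_procs() {
    cmd_a=$1
    cmd_b=$2
    wordlist=$3
    status=0
    while IFS= read -r word; do
        out_a=$(printf '%s\n' "$word" | sh -c "$cmd_a")
        out_b=$(printf '%s\n' "$word" | sh -c "$cmd_b")
        if [ "$out_a" != "$out_b" ]; then
            printf 'MISMATCH %s\n  a: %s\n  b: %s\n' "$word" "$out_a" "$out_b"
            status=1
        fi
    done < "$wordlist"
    return "$status"
}
```

It could be called as compare_procs 'hfst-proc xyz.automorf.hfst' 'lt-proc xyz.automorf.bin' words.txt, once before and once after applying the trick, to see whether the set of mismatches shrinks.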


The reason

This happens because there are two fairly minor bugs in the software we use that no-one has got around to fixing yet (partly because there are workarounds):

  • In hfst-proc there is a problem with longest-match left-to-right tokenisation; see the bug description at http://sourceforge.net/tracker/?func=detail&aid=3383731&group_id=224521&atid=1061990. For this reason we convert the compiled hfst transducer to lttoolbox format, which doesn't have this bug in processing the transducer.
  • In lt-comp (the ATT compiler) there is a problem with compiling entries where one of the sides begins with an epsilon from the start state but the other doesn't. One work-around is the hfst-minimise trick; another is to make sure that no entry leaving the start state is empty (epsilon) on one side but not on the other.
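
To see whether a transducer is likely to hit the second problem, one option is to look at its ATT text form for arcs that leave the start state with an epsilon on exactly one side. The filter below is a rough sketch under two assumptions: that the start state is numbered 0, and that epsilon is written as @0@ in the tab-separated ATT output. The function name start_epsilon_arcs is hypothetical.

```shell
#!/bin/sh
# start_epsilon_arcs: reads ATT-format text on stdin (tab-separated
# "source target input output [weight]") and prints every arc that
# leaves state 0 with epsilon on exactly one of its two sides.
# Assumes state 0 is the start state and epsilon is spelled @0@.
start_epsilon_arcs() {
    awk -F '\t' -v eps='@0@' '
        NF >= 4 && $1 == "0" && (($3 == eps) + ($4 == eps)) == 1
    '
}
```

For example, hfst-fst2txt xyz.automorf.hfst | start_epsilon_arcs lists the suspect arcs; if it prints nothing, this particular pattern is not present at the start state.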