Hfst-minimise trick
From LING073
Latest revision as of 21:16, 9 April 2017
There's a "trick" using hfst-minimise that will get around some issues in converting hfst-format transducers to lttoolbox-format transducers. That is, if hfst-proc xyz.automorf.hfst and lt-proc xyz.automorf.bin seem to be acting differently, try this trick.
The trick
In your Makefile.am file:
- In the rule for making $(LANG1).automorf.hfst, pipe the output of hfst-invert into hfst-minimise before piping the output of that into hfst-fst2fst. So the relevant part of this line should now look something like this:
    ... | hfst-invert | hfst-minimise | hfst-fst2fst -O -o $@
- In the rule for $(LANG1).autogen.hfst, first run hfst-minimise on the input before piping that into hfst-fst2fst. That rule should now look something like this:
    hfst-minimise $< | hfst-fst2fst -O -o $@
- Run make clean, reinitialise your module (./autogen.sh), and run make to recompile.
- Test the output of lt-proc and hfst-proc again.
The relevant section should end up looking something like this (modulo any spellrelax or twoc elements):
$(LANG1).autogen.hfst: .deps/$(LANG1).RL.hfst
	hfst-minimise $< | hfst-fst2fst -O -o $@

$(LANG1).automorf.hfst: .deps/$(LANG1).LR.hfst
	hfst-invert $< | hfst-minimise | hfst-fst2fst -O -o $@

$(LANG1).autogen.att.gz: $(LANG1).autogen.hfst
	hfst-fst2txt $< | gzip -9 -c -n > $@

$(LANG1).automorf.att.gz: $(LANG1).automorf.hfst
	hfst-fst2txt $< | gzip -9 -c -n > $@
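For the final testing step, it can help to feed the same text to both analysers and compare what comes back. The following is only a rough sketch: compare_outputs is a made-up helper name, and the file names and input in the example call are placeholders. Note that a "different" result is not always a problem, since the two tools may list the same analyses in a different order.

```shell
# compare_outputs TEXT CMD_A CMD_B   (hypothetical helper, not part of any tool)
# Feeds TEXT on standard input to both commands and reports whether
# their outputs are identical.
compare_outputs() {
    text=$1
    cmd_a=$2
    cmd_b=$3
    out_a=$(printf '%s\n' "$text" | sh -c "$cmd_a")
    out_b=$(printf '%s\n' "$text" | sh -c "$cmd_b")
    if [ "$out_a" = "$out_b" ]; then
        echo "same"
    else
        echo "different"
        echo "  A: $out_a"
        echo "  B: $out_b"
    fi
}

# Example call (placeholder file names and input text):
# compare_outputs "some words" "hfst-proc xyz.automorf.hfst" "lt-proc xyz.automorf.bin"
```

Running the example call before and after applying the trick should show whether the two analysers now agree on the inputs you care about.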
The reason
This happens because there are two fairly minor bugs in the software we use that no-one has got around to fixing yet (partly because there are workarounds):
- In hfst-proc there is a problem with longest-match left-to-right tokenisation; see the bug description at http://sourceforge.net/tracker/?func=detail&aid=3383731&group_id=224521&atid=1061990. For this reason we convert the compiled hfst transducer to lttoolbox format, which doesn't have this bug in processing the transducer.
- In lt-comp (the ATT compiler) there is a problem with compiling entries where one of the sides begins with an epsilon from the start state but the other doesn't. One work-around is the hfst-minimise trick; another is to just make sure that nothing leaving the start state is empty (epsilon) on one side and not on the other.
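If you want to see whether your transducer contains the kind of arc that trips up lt-comp, you can look through the ATT text that hfst-fst2txt produces. The sketch below rests on several assumptions that you should check against your own files: that the start state is numbered 0, that epsilon is written as @0@, and that fields are tab-separated in the order source state, target state, input symbol, output symbol. The function name is made up for illustration.

```shell
# list_lopsided_start_arcs [FILE...]   (hypothetical helper)
# Prints ATT transitions that leave state 0 and have epsilon (@0@)
# on exactly one of the two sides. Reads standard input if no FILE is given.
# Assumes: start state is 0, epsilon is "@0@", fields are tab-separated.
list_lopsided_start_arcs() {
    awk -F '\t' '
        # Transition lines have at least four fields; final-state lines have fewer.
        NF >= 4 && $1 == "0" && (($3 == "@0@") != ($4 == "@0@")) { print }
    ' "$@"
}

# Example call (placeholder file name):
# gzip -dc xyz.automorf.att.gz | list_lopsided_start_arcs
```

Any lines this prints are candidates for the problem described above; an empty result suggests the difference you are seeing has another cause.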