There's a "trick" using hfst-minimise that will get around some issues in converting hfst-format transducers to lttoolbox-format transducers. That is, if hfst-proc xyz.automorf.hfst and lt-proc xyz.automorf.bin seem to be acting differently, try this trick.
- In the rule for making $(LANG1).automorf.hfst, pipe the output of hfst-invert through hfst-minimise before piping the output of that into hfst-fst2fst. So the relevant part of this line should now look something like this:
  | hfst-invert | hfst-minimise | hfst-fst2fst -O -o $@
- In the rule for $(LANG1).autogen.hfst, first run hfst-minimise on the input before piping that into hfst-fst2fst. That rule should now look something like this:
  hfst-minimise $< | hfst-fst2fst -O -o $@
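- Run make clean, reinitialise your module (./autogen.sh), and run make again.
- Test the output of the rebuilt transducers, for example by checking whether hfst-proc xyz.automorf.hfst and lt-proc xyz.automorf.bin now agree.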
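For the testing step, a small shell helper can make the comparison repeatable. The following is only a sketch: compare_procs is a hypothetical name, not part of hfst or lttoolbox, and it assumes both commands read text on standard input and write their analyses in the same stream format on standard output.

```shell
# compare_procs INPUT CMD_A CMD_B
# Feeds the same INPUT to two commands and reports whether their outputs match.
# Returns 0 when the outputs are identical, 1 when they differ.
# (Hypothetical helper for illustration; not shipped with any toolkit.)
compare_procs() {
    input=$1
    cmd_a=$2
    cmd_b=$3
    # Run each command through sh -c so that arguments inside the string work.
    out_a=$(printf '%s\n' "$input" | sh -c "$cmd_a")
    out_b=$(printf '%s\n' "$input" | sh -c "$cmd_b")
    if [ "$out_a" = "$out_b" ]; then
        printf 'same: %s\n' "$out_a"
        return 0
    fi
    printf 'differ:\n  A: %s\n  B: %s\n' "$out_a" "$out_b"
    return 1
}

# Example call, using the file names from the text above:
# compare_procs "some test words" "hfst-proc xyz.automorf.hfst" "lt-proc xyz.automorf.bin"
```

Running it on a handful of words before and after the change shows whether the minimise step removed the discrepancy.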
The relevant section should end up looking something like this (modulo any spellrelax or twoc elements):
$(LANG1).autogen.hfst: .deps/$(LANG1).RL.hfst
	hfst-minimise $< | hfst-fst2fst -O -o $@

$(LANG1).automorf.hfst: .deps/$(LANG1).LR.hfst
	hfst-invert $< | hfst-minimise | hfst-fst2fst -O -o $@

$(LANG1).autogen.att.gz: $(LANG1).autogen.hfst
	hfst-fst2txt $< | gzip -9 -c -n > $@

$(LANG1).automorf.att.gz: $(LANG1).automorf.hfst
	hfst-fst2txt $< | gzip -9 -c -n > $@
This happens because there are two fairly minor bugs in the software we use that no-one has got around to fixing yet (partly because there are workarounds):
- In hfst-proc there is a problem with longest-match left-to-right tokenisation, see the bug description here. For this reason we convert the compiled hfst transducer to lttoolbox format, which doesn't have this bug in processing the transducer.
- In lt-comp (the ATT compiler) there is a problem with compiling entries where one of the sides begins with an epsilon from the start state but the other doesn't. One workaround is the hfst-minimise trick; another is to just make sure none of your entries begins with an epsilon on one side and not on the other.
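To look for entries of the second kind, the ATT text produced by hfst-fst2txt can be scanned for transitions that leave the start state with an epsilon on exactly one side. This is a rough sketch under several assumptions: find_lopsided_starts is a hypothetical name, the input is uncompressed tab-separated ATT text (source, target, input, output, optional weight), epsilon is written as @0@, and the start state is the source state of the first line. It only lists candidates; it does not confirm that any of them actually triggers the lt-comp problem.

```shell
# find_lopsided_starts [FILE]
# Prints start-state transitions whose input or output (but not both) is epsilon.
# Reads FILE if given, otherwise standard input.
# Assumptions: tab-separated ATT text, epsilon spelled "@0@", and the start
# state is whatever source state appears on the first line.
find_lopsided_starts() {
    awk -F '\t' -v eps='@0@' '
        NR == 1 { start = $1 }
        # Only lines with at least 4 fields are transitions; shorter lines mark final states.
        NF >= 4 && $1 == start && (($3 == eps) != ($4 == eps)) { print }
    ' "$@"
}

# Example call on one of the files built by the rules above:
# gzip -dc xyz.automorf.att.gz | find_lopsided_starts
```

If this prints nothing for a transducer that still misbehaves, the cause probably lies elsewhere.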