Hello,
In certain files, such as the attached one, a font is used to encode the
text. Here, for instance, the text is Tibetan and it is stored using the
TibetanMachineWeb TrueType font.
The problem is that there seems to be no way to tell antiword that the
font should affect how the characters are translated to UTF-8.
I would be interested in helping to patch the program so that this kind
of file is handled correctly, but I would appreciate being able to
discuss the best way to do so with the author, the maintainer and
anybody else interested.
Thanks,
Sébastien.
I'm afraid I don't know the antiword code well enough to offer much
useful advice about making such a change. If you can come up with a
sensible looking patch which works for your example file and doesn't
break others, I'm happy to add it to the package.
You're probably best off talking to the upstream author, though I don't
think he's actively working on antiword now as the last release was
2005-10-21. I haven't needed to communicate with him since taking over
maintenance of the Debian package, so I'm not sure if he's still
interested in antiword or not.
Good luck!
Cheers,
Olly
Hello Olly,
many thanks for your response.
No problem.
Many thanks, I really appreciate it. The "doesn't break others" part
might be a bit difficult to prove, but I'll try.
That was also my guess. Do you think the address I used in the Cc of the
original bug report is the best one for contacting him? I'm asking
because I noticed that you didn't Cc this address in your response,
which made me think that perhaps the address is not good.
I understand. Even if he no longer codes, I'd really appreciate being
able to talk to him, because some aspects of both the code and the Word
format remain a bit obscure to me. By the way, if you know people I
could talk with about the Word format, that would also be helpful!
Thanks.
If I can't talk to anybody and have to discover things by myself, it may
take me some time to come up with a patch, because the specification of
the format published by Microsoft is non-trivial and, for me, not so
easy to read and understand.
Best wishes,
Sébastien.
I'm not expecting absolute proof, but it'd be good to test it on a
selection of word documents, and compare output with and without
the patch.
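A comparison along these lines could be scripted roughly as below. This
is only a sketch: the helper names (extract, diff_text, compare) are
made up for illustration, and it assumes each extractor is invoked as
"command file.doc" and writes the text to stdout.

```python
import difflib
import subprocess


def extract(cmd, doc):
    """Run an extractor command (a list of arguments) on one document
    and return whatever it wrote to stdout, as bytes."""
    result = subprocess.run(
        list(cmd) + [str(doc)],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    return result.stdout


def diff_text(old, new, name):
    """Return unified-diff lines between two extractor outputs.
    An empty list means the output did not change."""
    old_lines = old.decode("utf-8", errors="replace").splitlines()
    new_lines = new.decode("utf-8", errors="replace").splitlines()
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=name + " (unpatched)",
            tofile=name + " (patched)",
            lineterm="",
        )
    )


def compare(old_cmd, new_cmd, docs):
    """Map each document whose output changed to its diff lines."""
    changed = {}
    for doc in docs:
        delta = diff_text(extract(old_cmd, doc), extract(new_cmd, doc), str(doc))
        if delta:
            changed[str(doc)] = delta
    return changed
```

For example, compare(["./antiword.orig"], ["./antiword.patched"],
sorted(pathlib.Path("corpus").glob("*.doc"))) would list the documents
whose output differs; the two binary paths and the corpus directory are
placeholders. Ideally only the documents using the special font show up.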
Don't read anything into that - it's just an artifact of how I replied
(I just fetched the mailbox for the bug with bts show -m, so replied
to the message as it was before the X-Debbugs-Cc got processed).
It might be worth trying some of the other options (if you haven't
already).
wv has a command line extractor (wvText), which in my experience handles
some files better than antiword (and others less well). Sadly it isn't
actively maintained upstream either these days (last release was just
under 3 years ago). ISTR antiword is faster than wvText.
There's wv2, but that doesn't come with a command line tool - it's
just a library. That's also not active upstream (last release nearly 4
years ago).
There's also unoconv which uses LibreOffice to do the extraction - that
means the extraction code is actively maintained upstream, and it seems
to work with most files I've tried. The downside is it is rather slow
and memory hungry, and I've found it randomly fails sometimes. I think
the issues stem from trying to remote control LibreOffice, which of
course thinks it's a GUI application rather than a command line tool
or library.
Cheers,
Olly
Hello Olly,
thanks for your e-mail.
Okay, will do once the patch is ready which, as I said, will not happen
soon because it's a lot of work.
I understand.
So far I have tried catdoc and maybe wordview, neither of which was more
successful. I will take another look at all of these, thanks.
I think libreoffice also misses the conversion. I don't know, but this
font-as-codepage trick seems not to be very well supported. It looks as
if people are mostly unaware of the problem. Perhaps it's because it has
been used only for exotic fonts such as Tibetan and Sanskrit ones.
Best wishes,
Sébastien.
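To make the font-as-codepage idea concrete, here is a minimal sketch of
a conversion that takes the font of each text run into account. It is
not antiword code. The names FONT_MAPS, convert_run and to_utf8 are
hypothetical, and the two table entries are placeholders, not the real
TibetanMachineWeb assignments.

```python
# Per-font lookup tables: character code -> Unicode string.
# ASSUMPTION: the entries below are invented for illustration only;
# the real TibetanMachineWeb layout is not reproduced here.
FONT_MAPS = {
    "TibetanMachineWeb": {
        0x21: "\u0f40",  # placeholder: pretend 0x21 is TIBETAN LETTER KA
        0x22: "\u0f41",  # placeholder: pretend 0x22 is TIBETAN LETTER KHA
    },
}


def convert_run(font_name, codes):
    """Convert one run of character codes, using the font's table when
    one exists and falling back to the code point itself otherwise."""
    table = FONT_MAPS.get(font_name)
    out = []
    for code in codes:
        if table is not None and code in table:
            out.append(table[code])
        else:
            out.append(chr(code))
    return "".join(out)


def to_utf8(runs):
    """Convert a list of (font name, character codes) runs to UTF-8."""
    return "".join(convert_run(font, codes) for font, codes in runs).encode("utf-8")
```

The point of the design is that the lookup key is the pair (font, code)
rather than the code alone, so text in ordinary fonts is left untouched.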