#722923 Fonts used as character sets are not supported

Package:
antiword
Source:
antiword
Description:
Converts MS Word files to text, PS, PDF and XML
Submitter:
Sebastien Hinderer
Date:
2013-09-20 14:18:05 UTC
Severity:
normal
Tags:
#722923#5
Date:
2013-09-14 15:03:52 UTC
From:
To:
Hello,

In certain files, as in the attached one, a font is used as a character
set to store text. Here, for instance, the text is Tibetan and it is
stored using the TibetanMachineWeb TrueType font.

The problem here is that there seems to be no way to tell antiword that
the font should have an impact on how the characters are translated to
UTF-8.
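
To illustrate the idea, here is a minimal sketch in Python (not
antiword's C code) of what font-aware translation could look like. The
names FONT_TABLES, decode_run and decode_runs are hypothetical. The
Symbol font is used only as a better-known example of the same trick; a
table for TibetanMachineWeb would have to be derived from the font
itself and is not attempted here. The fallback assumes that codes
without a table entry are already Unicode code points.

```python
# Per-font translation tables: font name -> {stored code: Unicode text}.
# Only a few well-known Symbol entries are listed (a, b, g, d map to
# Greek alpha, beta, gamma, delta). No TibetanMachineWeb table is given:
# it would need to be built from the font itself.
FONT_TABLES = {
    "Symbol": {0x61: "\u03b1", 0x62: "\u03b2", 0x67: "\u03b3", 0x64: "\u03b4"},
}


def decode_run(codes, font_name):
    """Translate one run of character codes, taking the run's font into account.

    Codes with no entry in the font's table fall back to chr(code),
    which is an assumption of this sketch.
    """
    table = FONT_TABLES.get(font_name, {})
    return "".join(table.get(code, chr(code)) for code in codes)


def decode_runs(runs):
    """Translate a sequence of (font_name, codes) runs into one string."""
    return "".join(decode_run(codes, font) for font, codes in runs)
```

With this sketch, decode_run(b"abg", "Symbol") yields Greek letters,
while the same bytes under a font without a table stay "abg".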

I would be interested in helping to patch the program so that this kind
of file is handled correctly, but would appreciate being able to discuss
the best way to do so with the author, the maintainer and anybody else
interested.

Thanks,
Sébastien.

#722923#10
Date:
2013-09-15 13:01:23 UTC
From:
To:
I'm afraid I don't know the antiword code well enough to offer much
useful advice about making such a change.  If you can come up with a
sensible looking patch which works for your example file and doesn't
break others, I'm happy to add it to the package.

You're probably best off talking to the upstream author, though I don't
think he's actively working on antiword now as the last release was
2005-10-21.  I haven't needed to communicate with him since taking over
maintenance of the Debian package, so I'm not sure if he's still
interested in antiword or not.

Good luck!

Cheers,
    Olly

#722923#15
Date:
2013-09-17 07:37:16 UTC
From:
To:
Hello Olly, many thanks for your response.

No problem.

Many thanks. I really appreciate it. The "doesn't break others" part
might be a bit difficult to prove but I'll try.

That was also my guess. Do you think the address I used in the Cc of
the original bug report is the best one to try to contact him? I'm
asking because I noticed that you didn't Cc this address in your
response, so that made me think that perhaps the address is not good.

I understand. Even if he does not code, I'd really appreciate being able
to talk to him because some aspects of both the code and the Word format
remain a bit obscure to me. By the way, if you know people I could talk
with about the Word format, that would also be helpful!

Thanks. If I can't talk to anybody and have to discover things by myself
it may take me some time to come up with a patch because the spec of the
format offered by Microsoft is non-trivial and, for me, not so easy to
read and understand.

Best wishes,
Sébastien.

#722923#20
Date:
2013-09-17 10:23:08 UTC
From:
To:
I'm not expecting absolute proof, but it'd be good to test it on a
selection of Word documents, and compare output with and without
the patch.
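
One way to do that comparison is sketched below in Python. It is only a
rough outline: the function names extract and compare are hypothetical,
and it assumes each extractor is invoked as "command file" and writes
its text to standard output, as antiword does.

```python
import difflib
import subprocess


def extract(cmd, doc):
    """Run one extractor command on one document and return its stdout as text."""
    result = subprocess.run([*cmd, str(doc)], capture_output=True, check=False)
    return result.stdout.decode("utf-8", errors="replace")


def compare(old_cmd, new_cmd, docs):
    """Return {document path: unified diff lines} for documents whose output differs."""
    changed = {}
    for doc in docs:
        before = extract(old_cmd, doc).splitlines()
        after = extract(new_cmd, doc).splitlines()
        if before != after:
            changed[str(doc)] = list(
                difflib.unified_diff(
                    before, after, "unpatched", "patched", lineterm=""
                )
            )
    return changed
```

For example, compare(["./antiword.orig"], ["./antiword.patched"],
Path("samples").glob("*.doc")) would list only the documents whose
output changed; the two binary names and the samples directory are
placeholders, and Path comes from pathlib.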

Don't read anything into that - it's just an artifact of how I replied
(I just fetched the mailbox for the bug with bts show -m, so replied
to the message as it was before the X-Debbugs-Cc got processed).

It might be worth trying some of the other options (if you haven't
already).

wv has a command line extractor (wvText), which in my experience handles
some files better than antiword (and others less well).  Sadly it isn't
actively maintained upstream either these days (last release was just
under 3 years ago).  ISTR antiword is faster than wvText.

There's wv2, but that doesn't come with a command line tool - it's
just a library.  That's also not active upstream (last release nearly 4
years ago).

There's also unoconv which uses libreoffice to do the extraction - that
means the extraction code is actively maintained upstream, and it seems
to work with most files I've tried.  The downside is it is rather slow
and memory hungry, and I've found it randomly fails sometimes.  I think
the issues stem from trying to remote control libreoffice, which of
course thinks it's a GUI application rather than a command line tool
or library.

Cheers,
    Olly

#722923#25
Date:
2013-09-20 14:15:29 UTC
From:
To:
Hello Olly, thanks for your e-mail.

Okay, will do once the patch is ready, which as I said will not happen
soon because it's a lot of work.

I understand.

So far I have tried catdoc and maybe wordview, neither of which was more
successful.

I will take another look at all of these, thanks. I think libreoffice
also misses the conversion. I don't know, but this font-as-codepage
trick does not seem to be very well supported. It looks as if people are
mostly unaware of the problem. Perhaps it's because it has been used
only for exotic fonts such as Tibetan and Sanskrit ones.

Best wishes,
Sébastien.