#470221 unicode ligatures to ASCII

Package:
poppler-utils
Source:
poppler
Description:
PDF utilities (based on Poppler)
Submitter:
Date:
2014-12-10 00:03:05 UTC
Severity:
minor
#470221#3
Date:
2008-03-09 20:54:56 UTC
Today I would like to discuss the Unicode characters
¯ ’“”− ﬀ ﬁ ﬂ ﬃ ...
that is
00AF 2019 201C 201D 2212 FB00 FB01 FB02 FB03 ...

You see, I noticed them when I used pdftotext on
http://www.cs.ucr.edu/~anirban/Anir-networking07.pdf
and then tried to read the results on my ASCII PDA.

I wish pdftotext had a flag to make the output ASCII.

Anyway, even uni2ascii -ydpxef wouldn't get all of them into ASCII.
The ligatures remained -- but turned into 0x codes. (P.S., I wish
there was one flag to "give me best ASCII", so one needn't ponder the
man page too long.) Also, apparently there is no way to get uni2ascii
to leave what it can't deal with untouched instead of turning it into
0x codes, so that some other filter could complete the job.

Now turning to pstotext, whose man page says "pstotext deals better
with punctuation and ligatures." Not in this case.

Now turning to Text::Unidecode: sorry: mangled ligatures.

Anyways, I ended up having to write by hand:

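#!/usr/bin/perl
use strict;
use warnings;
# No "use utf8" here: the patterns and the input are both handled as raw
# UTF-8 bytes, so they match as long as this file is saved as UTF-8.
while (<>) {
    s/¯/_/g; # U+00AF MACRON, just a guess
    s/’/'/g;
    s/“/"/g;
    s/”/"/g;
    s/−/-/g; # U+2212 MINUS SIGN
    # Ligatures U+FB00 .. U+FB06
    s/ﬀ/ff/g;
    s/ﬁ/fi/g;
    s/ﬂ/fl/g;
    s/ﬃ/ffi/g;
    s/ﬄ/ffl/g;
    s/ﬅ/ft/g;
    s/ﬆ/st/g;
    print;
}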
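
For comparison, here is a minimal Python sketch of the same idea, not
anything pdftotext or uni2ascii actually does. It assumes that Unicode
compatibility decomposition (NFKD) is an acceptable way to expand the
ligatures, and it keeps a small hand-written table for the punctuation
that NFKD does not turn into ASCII. The names PUNCTUATION and to_ascii
are made up for illustration. Characters it cannot reduce to ASCII are
passed through unchanged, so another filter can deal with them. Note
that NFKD turns U+FB05 into "st", whereas the script above maps it to
"ft".

```python
import unicodedata

# Hand-written replacements for characters NFKD leaves non-ASCII.
# The macron mapping is only a guess, as in the Perl script.
PUNCTUATION = {
    "\u00af": "_",   # MACRON (guess)
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2212": "-",   # MINUS SIGN
}


def to_ascii(text):
    """Replace what can be expressed in ASCII; leave the rest untouched."""
    out = []
    for ch in text:
        if ch in PUNCTUATION:
            out.append(PUNCTUATION[ch])
            continue
        # Compatibility decomposition expands ligatures such as U+FB03.
        decomposed = unicodedata.normalize("NFKD", ch)
        # Accept the expansion only when it is pure ASCII; otherwise keep
        # the original character (e.g. accented letters stay as they are).
        out.append(decomposed if decomposed.isascii() else ch)
    return "".join(out)


print(to_ascii("e\ufb03cient \u201c\ufb02ow\u201d"))  # efficient "flow"
```

The per-character check is a deliberate choice: normalizing the whole
string at once would also split accented letters into a base letter
plus a combining mark, which is not wanted here.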

#470221#14
Date:
2008-09-22 11:41:29 UTC
Hello,

Supporting that is a matter of using the //translit support of iconv:

$ echo ﬀ | iconv -t ASCII//translit
ff

Samuel

#470221#19
Date:
2014-12-09 14:29:37 UTC
[Resent with the correct bug number, as "bts show --mbox" doesn't
fix it after a clone, and updated.]

For pdftotext, it seems that this is now done... except in some cases:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=772637

The behavior is rather strange!

BTW, what pdftotext does applies only to ligatures like ff, fi
and fl (as in the bug title). Output is still UTF-8 when need be
(e.g. for accented characters).

#470221#24
Date:
2014-12-09 23:59:55 UTC
VL> BTW, what pdftotext does applies only to ligatures like ff, fi
VL> and fl (as in the bug title). Output is still UTF-8 when need be
VL> (e.g. for accented characters).

OK you guys take care of it. Thanks.