#279221 should transcode characters from utf-8 if the terminal is not utf-8 capable

Package:
w3m
Source:
w3m
Description:
WWW browsable pager with excellent tables/frames support
Submitter:
Joey Hess
Date:
2014-10-12 15:00:05 UTC
Severity:
minor
#279221#5
Date:
2004-11-01 05:50:05 UTC
From:
To:
Here's the problem:

joey@dragon:~>locale | grep CTYPE
LC_CTYPE="POSIX"
joey@dragon:~>echo '&mdash;' > foo.html
joey@dragon:~>w3m -dump foo.html
?

That comes out as a '?' because w3m apparently internally converts it to the
utf-8 character for mdash (which is not '-', but the other dash), and then
discovers it's not in the character set for this terminal and decides to render
it as a question mark. When reading a document with lots of &mdash;, &ldquo;,
&hellip; and other fancy entities, this gets very annoying.

Instead, w3m should be aware of the character set and just use available
characters that are close to the right ones, like "-". Other browsers, such
as lynx, do that.
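
To make the request concrete, the fallback could be as simple as a lookup from Unicode code points to ASCII approximations. The sketch below is only an illustration: the struct, the function name ascii_fallback_for and the specific mappings are assumptions, not code or tables taken from w3m or lynx.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: each entry maps one code point to a plain-ASCII
 * stand-in. The choices here are illustrative assumptions only. */
struct ascii_fallback {
    uint32_t codepoint;
    const char *replacement;
};

static const struct ascii_fallback fallbacks[] = {
    { 0x2013, "-"   },  /* en dash */
    { 0x2014, "-"   },  /* em dash */
    { 0x201C, "\""  },  /* left double quotation mark */
    { 0x201D, "\""  },  /* right double quotation mark */
    { 0x2026, "..." },  /* horizontal ellipsis */
};

/* Return an ASCII approximation for cp, or "?" when the table has none. */
const char *ascii_fallback_for(uint32_t cp)
{
    for (size_t i = 0; i < sizeof fallbacks / sizeof fallbacks[0]; i++) {
        if (fallbacks[i].codepoint == cp)
            return fallbacks[i].replacement;
    }
    return "?";
}
```

With such a table the '?' would only appear for characters that have no reasonable approximation at all.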

#279221#10
Date:
2005-06-07 12:56:16 UTC
From:
To:
Hi,

For this, iconv can be very helpful:

$ hexdump foo
0000000  e2 80 94 0a
$ iconv -f utf-8 -t latin1//translit < foo
--
$

The //translit suffix tells iconv to transliterate any character that the
target charset cannot represent, instead of failing on it.

So w3m should do something like:

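#define TRANSLIT "//translit"
/* Target charset: the locale's codeset with "//translit" appended. */
char *codeset = nl_langinfo(CODESET);
size_t len = strlen(codeset);
char *charset = malloc(len + strlen(TRANSLIT) + 1);
memcpy(charset, codeset, len);
memcpy(charset + len, TRANSLIT, strlen(TRANSLIT) + 1); /* copies the NUL too */
/* iconv_open() takes the target charset first, then the source charset. */
iconv_t conv = iconv_open(charset, page_charset);
iconv(conv, ...);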
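
For reference, the fragment above could be fleshed out roughly as follows. This is a sketch under assumptions, not w3m code: the helper name translit_convert and its parameters are made up for illustration, the output buffer size is a generous guess, and the "//TRANSLIT" suffix is an extension that not every iconv implementation supports.

```c
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert input from from_charset to to_codeset,
 * asking iconv to transliterate characters the target cannot represent.
 * Returns a malloc'd NUL-terminated string, or NULL on any failure. */
char *translit_convert(const char *to_codeset, const char *from_charset,
                       const char *input)
{
    static const char suffix[] = "//TRANSLIT";
    size_t tolen = strlen(to_codeset) + sizeof suffix; /* sizeof counts the NUL */
    char *target = malloc(tolen);
    if (target == NULL)
        return NULL;
    snprintf(target, tolen, "%s%s", to_codeset, suffix);

    /* Target charset comes first, source charset second. */
    iconv_t cd = iconv_open(target, from_charset);
    free(target);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t inleft = strlen(input);
    size_t cap = inleft * 4 + 16;   /* assumed to be enough room for expansion */
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *)input;      /* iconv() wants char **, data is not modified */
    char *outp = out;
    size_t outleft = cap - 1;       /* keep one byte for the terminating NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft); /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *outp = '\0';
    return out;
}
```

A caller would typically run setlocale(LC_ALL, "") first and pass nl_langinfo(CODESET) as to_codeset. The exact replacement text for a given character depends on the iconv implementation and the locale in effect.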

Regards,
Samuel

#279221#21
Date:
2007-06-06 12:16:57 UTC
From:
To:
Hi,

Any news?

#279221#30
Date:
2014-10-12 12:31:24 UTC
From:
To:
Dear Maintainer,

I wonder if this bug report can be closed for w3m in Debian 7.

I got the correct output

$ echo '—' > foo.html
$ w3m -dump < foo.html
—

Regards
Markus

#279221#35
Date:
2014-10-12 14:46:45 UTC
From:
To:
Still not improved.

    $ w3m -dump foo.html
    ?
    $ w3m -dump -T text/html < foo.html
    ?

Thanks,
--
Tatsuya Kinoshita