#532494 vilistextum: Enable multibyte (utf8) support

Package:
vilistextum
Source:
vilistextum
Description:
a HTML to text converter
Submitter:
Christian Ohm
Date:
2010-12-03 23:48:04 UTC
Severity:
wishlist
#532494#5
Date:
2009-06-09 16:53:53 UTC
From:
To:
Hello,

It would be nice if you could enable the multibyte support of vilistextum, so
it supports UTF8.

Best regards,
Christian Ohm

#532494#10
Date:
2009-06-18 10:47:56 UTC
From:
To:
tags 532494 + pending
done

2009/6/9 Christian Ohm <chr.ohm@gmx.net>:

Thanks for taking the time to file this request and helping improve
Debian. I'll upload a fixed version to mentors.debian.net within the
next hours.

#532494#17
Date:
2009-12-28 19:14:33 UTC
From:
To:
This bug has been marked pending upload for about half a year. Debian
uploads hopefully don't take that long yet.

Also, http://mentors.debian.org/ returns something that Firefox does not
recognize as HTML, so since the site cannot be browsed, an exact URL of
the supposedly fixed package would be welcome.

Thanks

Michal

#532494#22
Date:
2010-01-30 17:37:53 UTC
From:
To:
Hi,

I am sorry I didn't reply to this before. I had some difficulties with
it and it ended up slipping off the radar.

Basically what I did to get Vilistextum to compile with multibyte is
adding a build-dependency on "locales-all" and adding the following
line to debian/rules:
    DEB_CONFIGURE_EXTRA_FLAGS = --enable-multibyte --with-unicode-locale=en_US.utf8

However, the resulting package wouldn't work on my system (running
vilistextum resulted in error message "setlocale failed with:
en_US.utf8") unless I installed "locales-all" on it. I suppose adding
locales-all as a dependency of the binary package is not the right
solution, but I don't really know what's going on here or what I'm
doing wrong.
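
The error above is what one would expect when a locale name is fixed at
build time but is not installed on the machine running the binary:
setlocale() returns NULL for a locale it cannot load. A minimal C sketch
of that check follows, assuming a glibc-style system. The helper name
locale_is_available is hypothetical and is not part of vilistextum.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the named locale can be activated
 * for LC_CTYPE, 0 otherwise. The previous LC_CTYPE setting is restored
 * before returning. */
int locale_is_available(const char *name)
{
    char saved[256];
    /* Passing NULL only queries the current setting. The returned string
     * may be overwritten by later setlocale() calls, so copy it. */
    const char *current = setlocale(LC_CTYPE, NULL);
    snprintf(saved, sizeof saved, "%s", current ? current : "C");

    int ok = setlocale(LC_CTYPE, name) != NULL;
    setlocale(LC_CTYPE, saved);
    return ok;
}
```

Calling locale_is_available("en_US.utf8") on a system where 'locale -a'
lists only C and POSIX would be expected to return 0, which matches the
"setlocale failed" message.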

If you can provide any pointers on how to fix this, or even a patch,
this would be great. (By the way, if someone out there is really
interested in this package, I'd be okay with handing over
maintenance or co-maintenance of it).

Have a nice day,

#532494#27
Date:
2010-01-30 18:10:44 UTC
From:
To:
Hello,

I've compiled it with only --enable-multibyte, and that compiles and runs.
Needs -u to output UTF8 though, maybe that should be default.

Best regards,
Christian Ohm

#532494#32
Date:
2010-01-30 18:15:52 UTC
From:
To:
2010/1/30 Christian Ohm <chr.ohm@gmx.net>:

Yeah, it does when you compile it locally, but not when you do so in a
chroot :/.

#532494#37
Date:
2010-02-04 17:11:06 UTC
From:
To:
Christian Ohm <chr.ohm@gmx.net> wrote:

I downloaded and ran the debian-testing-i386-netinst.iso and then
dpkg-source -x vilistextum_2.6.9-1.dsc
cd vilistextum-2.6.9
vim debian/rules
and added DEB_CONFIGURE_EXTRA_FLAGS = --enable-multibyte (and in a
second test also with --with-unicode-locale=en_US.utf8)
dpkg-buildpackage -rfakeroot
and installed the deb-file as root.

Both versions just worked.

But building the package worked? And compiling it manually with
--with-unicode-locale=en_US.utf8 produced a working binary? That
sounds strange.

What do 'locale -a' and 'fakeroot locale -a' show?

I got for both:
C
en_US.utf8
POSIX

If only --enable-multibyte is specified, the configure script should
select a utf8 locale by itself.

That's because the output character set is defined by the input HTML
and not the user environment.

If there's no character set specified in the HTML file, vilistextum
falls back to latin1 for HTML and UTF-8 for XHTML.

Bye
Patric

#532494#42
Date:
2010-02-04 17:34:50 UTC
From:
To:
2010/2/4 Patric Mueller <bhaak@gmx.net>:

I guess that ISO comes with the locales for English then.

(By the way, if you want a build environment more like that of the
buildds, you can use e.g. pbuilder. If you decide to try it out, I
recommend using the pbuilder-dist wrapper script, which is available
in package ubuntu-dev-tools)

Yeah. Note however that the package was built in a chroot with
locales-all installed, but I tried it on my local system.
C
ca_ES.utf8
POSIX

Inside the chroot, without installing locales-all:
C
POSIX

Inside the chroot, after installing locales-all (like it has when it's
building the package when locales-all is set as a build-dependency):
C
POSIX
aa_DJ
aa_DJ.iso88591
aa_DJ.utf8
[... almost enough lines to crash Iceweasel when copy-pasting more...]
zu_ZA
zu_ZA.iso88591
zu_ZA.utf8

Yep, it picks up a random one though, which is why I set it to English
explicitly.

Have a nice day,

#532494#47
Date:
2010-02-04 19:01:53 UTC
From:
To:
If the random one is aa_DJ.utf8 then most people won't have that.

Note also that the buildd has zu_ZA.utf8 while what I get is
en_US.UTF-8 (or zu_ZA.UTF-8 should I build it).

So you will probably want locales-all for build as there is no other
way to ensure that there is at least one UTF-8 locale.

It would be also reasonable to move the check for random (preferably
the one currently in use) UTF-8 locale from build time to run time.
However, this would need patching and I did not get around to looking
at what vilistextum does and how ugly it would be to make that a run
time check instead of a build time check.

Thanks

Michal

#532494#52
Date:
2010-02-05 22:56:13 UTC
From:
To:
FWIW here is a patch that

a) tries to detect if the user is running in a utf-8 locale by a
heuristic similar to that used in the autoconf test
b) allows setting the unicode locale at runtime (changes the define to
a variable)

The autoconf script is not modified, removing the define should be
trivial for anybody who knows autoconf, though.
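
For readers without the attachment, here is a minimal C sketch of what a
runtime check of this kind can look like. It is not the patch itself: it
assumes a POSIX system and swaps in nl_langinfo(CODESET) in place of the
autoconf-style heuristic, and both function names are hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the currently active LC_CTYPE locale
 * reports UTF-8 as its character encoding. */
int active_locale_is_utf8(void)
{
    const char *codeset = nl_langinfo(CODESET);
    return codeset != NULL && strcmp(codeset, "UTF-8") == 0;
}

/* Hypothetical helper: adopt the locale from the user's environment
 * (LC_ALL, LC_CTYPE, LANG) and report whether it is a UTF-8 locale. */
int adopt_user_locale_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;               /* environment names an unusable locale */
    return active_locale_is_utf8();
}
```

A check like this depends on the user's environment, so it reports no
UTF-8 locale when the program runs under LC_ALL=C.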

Thanks

Michal

#532494#57
Date:
2010-02-06 14:23:16 UTC
From:
To:
Michal Suchanek <hramrach@centrum.cz> wrote:

setlocale(LC_CTYPE, "") only sets the current LC_CTYPE to the value of
the user environment.

If e.g. the user has LC_ALL=C the program will fail even if there is a
utf-8 locale it could use installed on the computer.


The attached patch first tries to set the locale found in the autoconf
script.

If that fails, it popens 'locale -a' and searches for a working utf-8
locale to use.
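
The two steps just described can be sketched in C roughly as follows.
This is an illustration under assumptions, not the attached patch: the
function names are hypothetical, the UTF-8 test is a plain match on the
locale name, and popen() requires a POSIX system.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Hypothetical name-based heuristic: does the codeset part of a locale
 * name (between '.' and an optional '@modifier') spell utf8 or utf-8? */
int name_looks_utf8(const char *name)
{
    const char *dot = strchr(name, '.');
    if (dot == NULL)
        return 0;
    const char *codeset = dot + 1;
    size_t len = strcspn(codeset, "@");
    return (len == 4 && strncasecmp(codeset, "utf8", 4) == 0) ||
           (len == 5 && strncasecmp(codeset, "utf-8", 5) == 0);
}

/* Hypothetical: try the build-time locale first, then scan the output of
 * 'locale -a' for any UTF-8 locale that setlocale() accepts.
 * Returns 1 if LC_CTYPE now refers to a UTF-8 locale, 0 otherwise. */
int select_utf8_locale(const char *preferred)
{
    if (preferred != NULL && setlocale(LC_CTYPE, preferred) != NULL)
        return 1;

    FILE *p = popen("locale -a", "r");
    if (p == NULL)
        return 0;

    char line[256];
    int found = 0;
    while (!found && fgets(line, sizeof line, p) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip trailing newline */
        if (name_looks_utf8(line) && setlocale(LC_CTYPE, line) != NULL)
            found = 1;
    }
    pclose(p);
    return found;
}
```

Unlike a check against the user's environment, this succeeds under
LC_ALL=C as long as some UTF-8 locale is installed, at the cost of
spawning a process and parsing its output.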

Bye
Patric

#532494#62
Date:
2010-02-06 15:20:36 UTC
From:
To:
Yes, and it's expected that users that want utf-8 output do have utf-8
locale set so it just works in most cases.

It also works if you specify the locale as argument to -L.

So to use vilistextum in some odd build environment which runs in "C"
locale but wants utf-8 output just specify a utf-8 locale as argument
(and possibly build-depend on locales-all).

This is something I wanted to avoid because it requires parsing
the output of locale -a in the program and does not guarantee that a
working locale is found.

Still it should work in the common case either way.

Thanks

Michal

#532494#67
Date:
2010-02-07 10:38:21 UTC
From:
To:
Michal Suchanek <hramrach@centrum.cz> wrote:

But the utf-8 locale isn't only used for utf-8 output. As utf-8 is
used as internal standard encoding it is also used when e.g.
converting a latin1 or a shift_jis html file.

Requiring that the user has set a utf-8 locale for those cases is
unnecessary.

I don't think the problem lies in building. For building you need a
utf-8 locale, otherwise the tests will fail, and the build-dependency on
locales-all ensures that.

Moreover, IMO changing the behavior of 2.6.9 to differ from the ones
in OpenPKG, Gentoo and the official distribution isn't something a
package maintainer should do, if there is no serious need for it.

But there is not much more parsing going on than with your patch. My
patch even uses the two methods of yours that check if the locale is a
utf-8 locale.

If your patch finds a utf-8 locale, mine would too. But mine would
also find one in cases where the OS has a utf-8 locale but the user
doesn't want to use it.

I'm not sure that the not so common cases you're thinking of are that
rare. For example, my OS is completely utf-8 capable, but I prefer my
shell to be in latin1.

Nevertheless, IMO the finding-the-locale-at-runtime approach has a
bigger chance of just working in more cases with less user
intervention.

Bye
Patric

#532494#72
Date:
2010-02-07 13:59:23 UTC
From:
To:
2010/2/7 Patric Mueller <bhaak@gmx.net>:

Yes, you are right that the utf-8 locale is used in some odd way
during the conversion.

That's certainly unusual, especially given that latin-1 should work in
utf-8 without additional conversion.

Yes, that's what I would like to see happening. This is less of a
problem in Gentoo because every user builds their own packages but in
Debian the detection of utf-8 locale has to be done at runtime.

Thanks

Michal

#532494#77
Date:
2010-02-11 21:58:37 UTC
From:
To:
Michal Suchanek <hramrach@centrum.cz> wrote:

That's not correct, the latin1 characters 128-255 are not legal utf-8
byte sequences. You're probably thinking of 7-bit ASCII or that all
latin1 characters are equal to the first 256 unicode codepoints?
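
To make the distinction concrete: a Latin-1 byte at or above 0x80 shares
its numeric value with a Unicode code point, but in UTF-8 that code
point is written as two bytes. A minimal C sketch of the conversion
(the function name is hypothetical and this is not how vilistextum does
it, since vilistextum goes through iconv and wide characters):

```c
#include <stddef.h>

/* Hypothetical helper: convert `len` Latin-1 bytes from `in` to UTF-8 in
 * `out`. `out` must have room for 2 * len bytes.
 * Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[n++] = c;                                  /* ASCII: unchanged */
        } else {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));   /* lead byte */
            out[n++] = (unsigned char)(0x80 | (c & 0x3F)); /* continuation */
        }
    }
    return n;
}
```

For example, the Latin-1 byte 0xE9 (e with acute accent) becomes the two
bytes 0xC3 0xA9 in UTF-8.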

The problem vilistextum faced was how to support multiple character
sets at a time when locale support itself was rather flaky in C
compilers.

I chose to require UTF-8 and then use the wide character functions for
handling all strings, so I didn't have to program for specific
character sets.

But still I had to also require libiconv as iconv in glibc wasn't
developed enough at that time and the iconv implementation in Solaris
was completely brain-dead AFAIK.

Bye
Patric

#532494#82
Date:
2010-02-12 09:02:44 UTC
From:
To:
2010/2/11 Patric Mueller <bhaak@gmx.net>:

For some reason they are correct enough that cat somefile.txt would show them.
However, less somefile.txt would mark the characters with 8th bit set
as invalid as they usually don't compose into a valid utf-8 sequence.

So you can view them but it's probably not correct behaviour that it
is possible.
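
The difference is likely that cat copies bytes through without
inspecting them, while a pager has to decide whether the bytes form
valid UTF-8 sequences. A minimal C sketch of such a validity check
follows. The function name is hypothetical and this is not the code
used by less.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the buffer is well-formed UTF-8,
 * 0 otherwise. Rejects stray continuation bytes, truncated sequences,
 * overlong encodings, surrogates and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;          /* number of continuation bytes expected */
        unsigned long cp;      /* decoded code point */

        if (c < 0x80) { i++; continue; }
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; cp = c & 0x1F; }
        else if (c >= 0xE0 && c <= 0xEF) { extra = 2; cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07; }
        else return 0;         /* 0x80-0xC1 and 0xF5-0xFF cannot start a sequence */

        if (len - i <= extra)
            return 0;          /* sequence truncated at end of buffer */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;      /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (extra == 2 && cp < 0x800) return 0;                      /* overlong */
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return 0; /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;                  /* surrogate */
        i += extra + 1;
    }
    return 1;
}
```

With this check, the Latin-1 bytes for "caf" followed by 0xE9 are
rejected, because 0xE9 announces a three-byte sequence that never
arrives, while the same word encoded as UTF-8 (ending in 0xC3 0xA9) is
accepted.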

Thanks

Michal