- Package: vilistextum
- Source: vilistextum
- Description: a HTML to text converter
- Submitter: Christian Ohm
- Date: 2010-12-03 23:48:04 UTC
- Severity: wishlist
Hello,
It would be nice if you could enable the multibyte support of vilistextum, so it supports UTF8.
Best regards,
Christian Ohm
tags 532494 + pending
done
2009/6/9 Christian Ohm <chr.ohm@gmx.net>:
Thanks for taking the time to file this request and helping improve Debian. I'll upload a fixed version to mentors.debian.net within the next few hours.
This bug has been marked pending upload for about half a year. Debian uploads hopefully don't take that long yet.
Also, http://mentors.debian.org/ returns something which Firefox does not recognize as HTML, so since that site cannot be browsed, an exact URL of the supposedly fixed package would be welcome.
Thanks
Michal
Hi,
I am sorry I didn't reply to this before. I had some difficulties with
it and it ended up slipping off the radar.
Basically what I did to get Vilistextum to compile with multibyte was
to add a build-dependency on "locales-all" and the following line to
debian/rules:
DEB_CONFIGURE_EXTRA_FLAGS = --enable-multibyte --with-unicode-locale=en_US.utf8
However, the resulting package wouldn't work on my system (running
vilistextum resulted in error message "setlocale failed with:
en_US.utf8") unless I installed "locales-all" on it. I suppose adding
locales-all as a dependency of the binary package is not the right
solution, but I don't really know what's going on here or what I'm
doing wrong.
If you can provide any pointers on how to fix this, or even a patch,
this would be great. (By the way, if someone out there is really
interested in this package, I'd be okay with handing over
maintenance or co-maintenance of it).
Have a nice day,
Hello,
I've compiled it with only --enable-multibyte, and that compiles and runs. Needs -u to output UTF8 though, maybe that should be the default.
Best regards,
Christian Ohm
2010/1/30 Christian Ohm <chr.ohm@gmx.net>:
Yeah, it does when you compile it locally, but not when you do so in a chroot :/.
Christian Ohm <chr.ohm@gmx.net> wrote:
I downloaded and ran the debian-testing-i386-netinst.iso and then did
    dpkg-source -x vilistext_2.6.9-1.dsc
    cd vilistextum-2.6.9
    vim debian/rules
and added
    DEB_CONFIGURE_EXTRA_FLAGS = --enable-multibyte
(and in a second test also with --with-unicode-locale=en_US.utf8), followed by
    dpkg-buildpackage -rfakeroot
and installed the deb-file as root. Both versions just worked.
But the building of the package worked? And compiling it manually with --with-unicode-locale=en_US.utf8 produced a working binary? That sounds strange.
What do 'locale -a' and 'fakeroot locale -a' show? I got for both:
    C
    en_US.utf8
    POSIX
If only --enable-multibyte is specified, the configure script should select a utf8 locale by itself.
That's because the output character set is defined by the input HTML and not the user environment. If there's no character set specified in the HTML file, vilistextum falls back to latin1 for HTML and UTF-8 for XHTML.
Bye
Patric
2010/2/4 Patric Mueller <bhaak@gmx.net>:
I guess that ISO comes with the locales for English then. (By the way, if you want a build environment more like that of the buildds, you can use e.g. pbuilder. If you decide to try it out, I recommend using the pbuilder-dist wrapper script, which is available in the package ubuntu-dev-tools.)
Yeah. Note however that the package was built in a chroot with locales-all installed, but I tried it on my local system.
    C
    ca_ES.utf8
    POSIX
Inside the chroot, without installing locales-all:
    C
    POSIX
Inside the chroot, after installing locales-all (as it is when the package is built with locales-all set as a build-dependency):
    C
    POSIX
    aa_DJ
    aa_DJ.iso88591
    aa_DJ.utf8
    [... almost enough lines to crash Iceweasel when copy-pasting more...]
    zu_ZA
    zu_ZA.iso88591
    zu_ZA.utf8
Yep, it picks up a random one though, which is why I set it to English explicitly.
Have a nice day,
If the random one is aa_DJ.utf8 then most people won't have that. Note also that the buildd has zu_ZA.utf8 while what I get is en_US.UTF-8 (or zu_ZA.UTF-8 should I build it).
So you will probably want locales-all for the build, as there is no other way to ensure that there is at least one UTF-8 locale.
It would also be reasonable to move the check for a random UTF-8 locale (preferably the one currently in use) from build time to run time. However, this would need patching, and I did not get around to looking at what vilistextum does and how ugly it would be to make that a run time check instead of a build time check.
Thanks
Michal
FWIW here is a patch that
a) tries to detect if the user is running in a utf-8 locale by a heuristic similar to that used in the autoconf test
b) allows setting the unicode locale at runtime (changes the define to a variable)
The autoconf script is not modified; removing the define should be trivial for anybody who knows autoconf, though.
Thanks
Michal
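
The kind of check described in a) can be sketched as follows. This is not the attached patch, only an illustration in Python under stated assumptions: the function names are hypothetical, the environment lookup follows the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG), and "looks like utf-8" is judged purely from the codeset suffix of the locale name.

```python
import os


def effective_ctype(env):
    """Return the locale name that setlocale(LC_CTYPE, "") would consult.

    Assumes the usual POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
    Unset or empty variables are skipped; "C" is the fallback.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


def looks_like_utf8(locale_name):
    """Heuristic (hypothetical): does the codeset part of the name say UTF-8?"""
    base = locale_name.split("@", 1)[0]      # drop an optional "@modifier"
    _, dot, codeset = base.rpartition(".")   # codeset follows the last "."
    if not dot:
        return False                         # e.g. "C", "POSIX", "aa_DJ"
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    name = effective_ctype(os.environ)
    print(name, looks_like_utf8(name))
```

Both spellings seen in this report, en_US.utf8 and en_US.UTF-8, pass the name check, while a user with LC_ALL=C is reported as not running in a utf-8 locale regardless of LANG.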
Michal Suchanek <hramrach@centrum.cz> wrote:
setlocale(LC_CTYPE, "") only sets the current LC_CTYPE to the value of the user environment. If e.g. the user has LC_ALL=C the program will fail even if there is a utf-8 locale it could use installed on the computer.
The attached patch first tries to set the locale found in the autoconf script. If that fails, it popens 'locale -a' and searches for a working utf-8 locale to use.
Bye
Patric
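
The fallback order described here (configured locale first, then whatever 'locale -a' reports) could look roughly like the sketch below. It is written in Python for brevity and is not the attached patch; the function names are hypothetical, and the two helpers are passed in as parameters only so the logic can be exercised without touching the real system locales.

```python
import locale
import subprocess


def utf8_candidates(locale_a_output):
    """Return the names in `locale -a` output whose codeset looks like UTF-8."""
    names = []
    for line in locale_a_output.splitlines():
        name = line.strip()
        base = name.split("@", 1)[0]            # drop an optional "@modifier"
        _, dot, codeset = base.rpartition(".")
        if dot and codeset.lower().replace("-", "") == "utf8":
            names.append(name)
    return names


def _run_locale_a():
    """Run `locale -a`; return its output, or "" if it cannot be run."""
    try:
        result = subprocess.run(["locale", "-a"], capture_output=True,
                                text=True, errors="replace")
    except OSError:
        return ""
    return result.stdout


def _try_setlocale(name):
    """Return True if LC_CTYPE can actually be switched to `name`."""
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False


def pick_utf8_locale(preferred, list_locales=_run_locale_a,
                     try_locale=_try_setlocale):
    """Try the build-time locale first, then every UTF-8 locale on the system.

    Returns the first name that works, or None if none does.
    """
    if try_locale(preferred):
        return preferred
    for name in utf8_candidates(list_locales()):
        if try_locale(name):
            return name
    return None
```

Calling pick_utf8_locale("en_US.utf8") on a system that only has ca_ES.utf8 installed would be expected to return ca_ES.utf8, and None on a system with only C and POSIX.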
Yes, and it's expected that users who want utf-8 output have a utf-8 locale set, so it just works in most cases.
It also works if you specify the locale as an argument to -L. So to use vilistextum in some odd build environment which runs in the "C" locale but wants utf-8 output, just specify a utf-8 locale as an argument (and possibly build-depend on locales-all).
This is something I wanted to avoid because it requires parsing the output of locale -a in the program and does not guarantee that a working locale is found. Still, it should work in the common case either way.
Thanks
Michal
Michal Suchanek <hramrach@centrum.cz> wrote:
But the utf-8 locale isn't only used for utf-8 output. As utf-8 is used as the internal standard encoding, it is also used when e.g. converting a latin1 or a shift_jis html file. Requiring that the user has set a utf-8 locale for those cases is unnecessary.
I don't think the problem lies in building. For building you need a utf-8 locale, otherwise the tests will fail, and build-depending on locales-all ensures that.
Moreover, IMO changing the behavior of 2.6.9 to differ from the ones in OpenPKG, Gentoo and the official distribution isn't something a package maintainer should do if there is no serious need for it.
But there is not much more parsing going on than with your patch. My patch even uses the two methods of yours that check if the locale is a utf-8 locale. If your patch finds a utf-8 locale, mine would too. But mine would also find one in cases where the OS has a utf-8 locale but the user doesn't want to use it.
I'm not sure that the not so common cases you're thinking of are that rare. For example, my OS is completely utf-8 capable, but I prefer my shell to be in latin1.
Nevertheless, IMO the finding-the-locale-at-runtime approach has a bigger chance of just working in more cases with less user intervention.
Bye
Patric
2010/2/7 Patric Mueller <bhaak@gmx.net>:
Yes, you are right that the utf-8 locale is used in some odd way during the conversion. That's certainly unusual, especially given that latin-1 should work in utf-8 without additional conversion.
Yes, that's what I would like to see happening. This is less of a problem in Gentoo because every user builds their own packages, but in Debian the detection of a utf-8 locale has to be done at runtime.
Thanks
Michal
Michal Suchanek <hramrach@centrum.cz> wrote:
That's not correct, the latin1 characters 128-255 are not legal utf-8 characters. You're probably thinking of 7-bit ASCII or that all latin1 characters are equal to the first 256 unicode codepoints?
The problem vilistextum faced was how to support multiple character sets at a time when locale support itself was rather flaky in C compilers. I chose to require UTF-8 and then use the wide character functions for handling all strings, so I didn't have to program for specific character sets.
But still I had to also require libiconv, as iconv in glibc wasn't developed enough at that time and the iconv implementation in Solaris was completely brain-dead AFAIK.
Bye
Patric
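
The point about latin1 and utf-8 can be checked with a few lines of Python. This is only an illustration of the encodings themselves, not of anything vilistextum does internally, and the helper names are made up for the example.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin1_to_utf8(data):
    """Re-encode latin1 bytes as UTF-8 via their Unicode code points."""
    return data.decode("latin-1").encode("utf-8")


e_acute = b"\xe9"  # "e with acute accent" as a single latin1 byte

# The byte value equals the Unicode code point (U+00E9) ...
assert ord(e_acute.decode("latin-1")) == 0xE9
# ... but the byte on its own is not a legal UTF-8 sequence,
assert not is_valid_utf8(e_acute)
# because UTF-8 encodes that code point as two bytes.
assert latin1_to_utf8(e_acute) == b"\xc3\xa9"
```

Bytes below 0x80 are plain ASCII and are identical in both encodings, which is why only text using the upper half of latin1 needs a real conversion.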
2010/2/11 Patric Mueller <bhaak@gmx.net>:
For some reason they are correct enough that cat somefile.txt would show them. However, less somefile.txt would mark the characters with the 8th bit set as invalid, as they usually don't compose into a valid utf-8 sequence. So you can view them, but it's probably not correct behaviour that this is possible.
Thanks
Michal