Yesterday I noticed that the UTF-8 encoding doesn't seem to be correctly supported by the current locales package. I have problems using the lower and upper case conversion. Here are two different ways to reproduce this behaviour. In both cases I used an

  xterm -u8 -fn -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1

with LC_ALL=de_DE.UTF-8 to test the programs. In this email I display the umlaut characters in latin1; I will append the typescript with the real UTF-8 encoding of the characters.

- Programs like tr (textutils 2.0-12):

$ tr [:lower:] [:upper:]
oauöäü # the input
OAUöäü # the output

The ASCII alphabetic characters are correctly transformed; the UTF-8 encoded umlauts are not.

- The bash (2.05a-9):

$ for i in a A ä Ä; do case $i in [[:lower:]]) echo "$i is lc"; esac; done
a is lc # the output

The ä umlaut should also be output.

My /etc/locale.gen has the following contents:

# /etc/locale.gen
de_DE ISO-8859-1
de_DE.UTF-8 UTF-8
de_DE@euro ISO-8859-15

Here is the locale setting while doing the tests:

~$ locale
LANG=de_DE
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=de_DE.UTF-8

Here is the typescript file, as a binary attachment to allow correct transmission of UTF-8.
It should be fixed in sid glibc 2.3.1, please check. Regards, -- gotom
GOTO Masanori <gotom@debian.or.jp> writes:
[Problems with [:lower:] and [:upper:] in UTF-8 locale]
I have now installed the latest versions of these programs:
ii bash 2.05b-3 The GNU Bourne Again SHell
ii coreutils 4.5.7-1 The GNU core utilities
ii grep 2.5.1-2 GNU grep, egrep and fgrep
ii locales 2.3.1-14 GNU C Library: National Language (locale) da
ii libc6 2.3.1-14 GNU C Library: Shared libraries and Timezone
The following statements work as expected:
$ grep [[:lower:]]
$ grep [[:upper:]]
$ case ... in [[:lower:]]) ... esac # bash
$ case ... in [[:upper:]]) ... esac # bash
The following don't work with non-ASCII characters when LC_CTYPE is
set to de_DE.UTF8
$ tr [:lower:] [:upper:]
Using "tr [:alpha:] '-'" I found out that non-ASCII letters (valid
letters in the de_DE locale) are not even recognized. In the
de_DE.ISO-8859-1 locale both statements work correctly.
I don't know if this is specific to this single program or is
caused by problems in libc6 or the locales data. Please tell me if you
think that I should report it to coreutils instead.
So half the bug report is resolved,
Torsten
At Thu, 27 Feb 2003 20:01:25 +0100, Torsten Hilbrich wrote:

Coreutils uses an old regex engine, so tr is not ready for UTF-8. I think it's a TODO item for coreutils/textutils. I am reassigning this bug to coreutils.
I've just verified, and with version 5.2.1-2 of coreutils, I can still reproduce the bug, using LANG=es_AR.UTF8:

$ echo áéí | tr áéí ÁÉÍ
ÁÉÍ
$ echo áéí | tr [:lower:] [:upper:]
áéí
$ echo aeiáéí | tr [:lower:] [:upper:]
AEIáéí
$ echo áéí | grep [[:lower:]]
áéí

Please try and fix it.
Yes, coreutils does not claim to handle utf-8. Mike Stone
Hola Michael Stone! Will it ever handle it? I guess you must know that UTF8 seems to be "the encoding of the future" :), and thus it's a good idea to handle it. The bug had no activity since 2003, and I'm re-checking old bugs to see if they are still present, so that they don't go forgotten.
Margarita Manterola <debian@marga.com.ar> wrote: FWIW, converting some of the textutils to deal with utf-8 is most definitely on the upstream to-do list, but I don't know when it'll be done.
I'm not sure if it's fixed upstream but RedHat coreutils package (5.2.1-48) is definitely utf-8 aware (tested on fold). So there might be a sort of patch that could be applied to debian package. Thanks.
Last time I looked, redhat hacked kinda utf-8 support onto the package. It wasn't complete, it was just enough to pass the tests. I'd rather not do it than do it wrong. Mike Stone
forcemerge 139861 388689 431231
tags 139861 + upstream confirmed wontfix
found 139861 6.10~20071127-1
thanks

Hi,

I'm merging these bugs (all about tr not supporting UTF-8), which still affect the current coreutils in experimental. "wontfix" indicates that this is not going to be fixed by a Debian-specific patch, but that the problem should be fixed upstream first.
tags 139861 - wontfix
quit

Lucas Nussbaum wrote:

That isn't what wontfix usually means, is it?

Thanks; I do agree that it seems best for anyone working on this to just communicate directly with upstream.
More examples:

$ echo -e "日\n本\nで\nは" | sort -u | wc -l
4
$ echo -e "日\n本\nで\nは" | sort | wc -l
3

Something is quite wrong (i.e. this is definitely *incorrect behaviour*, rather than merely a difference of opinion over implementation).
Here is yet another example of tr working wrongly, this time with Cyrillic characters:

0000000: d0be d0bb d0be d0bb d0be 0a ...........
0000000: d09e d09b d09e d09b d09e 0a ...........
0000000: b09e b09b b09e b09b b09e 0a ...........

The first line is Cyrillic text in lowercase, the second is the same text in uppercase, and the last is what tr produced for the given range substitution. As we can see, the first byte of each pair was mistakenly moved from 0xd0 (the lead byte of these Cyrillic characters in UTF-8) to 0xb0 (I don't know what that is), while each second byte is correct.
dimas wrote:

Yes. Please read through the bug log that you are adding information to. It is well known that coreutils does not support UTF-8 characters. Please start at the top and read through to the bottom.

  https://bugs.debian.org/139861

When you specify 'а-я' 'А-Я' you *think* you are specifying a range from 'а' to 'я', but since the utilities are not multibyte aware, what you are actually specifying is:

  printf аяАЯ | od -c
  0000000 320 260 321 217 320 220 320 257

Here is the mapping.

  а \320\260
  я \321\217
  А \320\220
  Я \320\257

Therefore what you are *actually* telling tr is:

  tr '\320\260-\321\217' '\320\220-\320\257'

It is a known deficiency in coreutils that the utilities are not multibyte aware. The following can be found in the upstream source package TODO file.

  Adapt tools like wc, tr, fmt, etc. (most of the textutils) to be
  multibyte aware. The problem is that I want to avoid duplicating
  significant blocks of logic, yet I also want to incur only minimal
  (preferably `no') cost when operating in single-byte mode.

Some vendors have hacked in patches to make the utilities multibyte aware, but none of those patches have been considered clean enough to incorporate into the upstream source yet. Debian's maintainer has stated that he does not want to diverge from upstream this radically, especially since there have been bugs reported with the multibyte hacks. The patches are very messy and incomplete.

The best course of action would be to get this resolved upstream with the functionality properly integrated. Until then this remains a known deficiency.

Bob
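The byte-level reading above can be checked directly. The following is only a sketch: it assumes a POSIX shell, `od`, and a `tr` that accepts octal escapes and lets a later duplicate in the first set win (GNU tr behaves this way). The `hex` helper is a made-up name for illustration, and LC_ALL=C is forced so that tr works on single bytes no matter how the surrounding locale is set.

```shell
#!/bin/sh
# hex: hypothetical helper, dumps stdin as one unbroken hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# The lowercase Cyrillic input from the report (d0be d0bb d0be d0bb d0be),
# written as octal escapes so this script itself stays pure ASCII.
lower='\320\276\320\273\320\276\320\273\320\276'

# The byte-level command that 'а-я' 'А-Я' turns into.
# Set 1 is: \320, the range \260-\321, \217.
# Set 2 is: \320, the range \220-\320, \257.
# Byte \320 (0xd0) occurs twice in set 1; the second occurrence, inside
# the range, maps it to 0xb0, which matches the reported output.
result=$(printf "$lower" | LC_ALL=C tr '\320\260-\321\217' '\320\220-\320\257' | hex)
echo "$result"
```

With GNU tr this should print the same byte pairs as the third hex dump in the report (b09e b09b ...): the trail bytes 0xbe and 0xbb are shifted correctly to 0x9e and 0x9b, while the lead byte 0xd0 is caught by the range and rewritten.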
Is this the same bug, or should a separate one be opened?

% echo '¡Hola!' | tr -d '¿'
�Hola!
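This looks like the same byte-level behaviour. In UTF-8, '¡' is the two bytes C2 A1 and '¿' is C2 BF, so a byte-oriented `tr -d '¿'` deletes every C2 and every BF byte and leaves a lone A1 behind. A minimal sketch, assuming a POSIX shell, `od`, and a `tr` that accepts octal escapes (the `hex` helper is a made-up name):

```shell
#!/bin/sh
# hex: hypothetical helper, dumps stdin as one unbroken hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# '¡Hola!' with the '¡' written as its UTF-8 bytes \302\241 (C2 A1).
# The delete set is '¿', i.e. the bytes \302\277 (C2 BF).
# LC_ALL=C forces tr to treat every byte as a separate character.
out=$(printf '\302\241Hola!' | LC_ALL=C tr -d '\302\277' | hex)
echo "$out"   # a1486f6c6121: the C2 lead byte is gone, A1 is orphaned
```

The orphaned A1 is not valid UTF-8 on its own, which is why the terminal shows a replacement character in front of "Hola!".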
Oh wow is this an old bug. I thought, at first, it’s just character classes…

$ echo mäÄH | tr '[:upper:]' '[:lower:]'
mäÄh

… but apparently, yes, multibyte support is broken:

$ echo 'mäæn' | tr ä Ȁ
mȀȦn
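The second result follows from the UTF-8 byte values: 'ä' is C3 A4, 'Ȁ' is C8 80 and 'æ' is C3 A6. A byte-oriented tr therefore maps C3 to C8 and A4 to 80, so 'æ' becomes C8 A6, which happens to decode as 'Ȧ'. A small sketch to confirm this, assuming a POSIX shell, `od`, and a `tr` that accepts octal escapes (the `hex` helper is a made-up name):

```shell
#!/bin/sh
# hex: hypothetical helper, dumps stdin as one unbroken hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# 'mäæn' as bytes: 6d, c3 a4, c3 a6, 6e.
# tr ä Ȁ is, byte for byte, tr '\303\244' '\310\200'.
# LC_ALL=C forces tr to treat every byte as a separate character.
out=$(printf 'm\303\244\303\246n' | LC_ALL=C tr '\303\244' '\310\200' | hex)
echo "$out"   # 6dc880c8a66e: c8 80 is 'Ȁ', c8 a6 is 'Ȧ'
```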