#139861 tr: no UTF-8 support

Package:: coreutils

Source:: coreutils

Description:: GNU core utilities

Submitter:: Torsten Hilbrich

Date:: 2025-07-13 13:01:02 UTC

Severity:: normal

Tags:

#139861#5

Date:: 2002-03-25 17:11:00 UTC

From:

To:

Yesterday I noticed that the UTF-8 encoding does not seem to be
correctly supported by the current locales package.  I have problems
using the lower- and upper-case conversion.

Here are two different ways to reproduce this behaviour.  In both cases
I used an
"xterm -u8 -fn -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1"
with LC_ALL=de_DE.UTF-8 to test the programs.  In this email I display
the umlaut characters in Latin-1; I will attach the typescript with the
real UTF-8 encoding of the characters.

- Programs like tr (textutils 2.0-12):

$ tr [:lower:] [:upper:]
oauöäü                           # the input
OAUöäü                           # the output

The ASCII alphabetic characters are correctly transformed; the
UTF-8-encoded umlauts are not.
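
The output above is what a purely byte-wise case conversion would produce. A minimal Python sketch, assuming tr classifies each byte on its own; Python's `bytes.upper()` is used only as a stand-in for that behaviour, not as a model of tr's actual implementation:

```python
# Stand-in for a byte-oriented tool: bytes.upper() only touches ASCII a-z,
# so the two-byte UTF-8 sequences for ö, ä and ü pass through unchanged.
text = "oauöäü"

bytewise = text.encode("utf-8").upper().decode("utf-8")
charwise = text.upper()  # operates on whole characters

print(bytewise)  # OAUöäü
print(charwise)  # OAUÖÄÜ
```

The first line reproduces the tr output shown above; the second is what a character-aware conversion would give.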

- The bash (2.05a-9):
$ for i in a A ä Ä; do case $i in [[:lower:]]) echo "$i is lc"; esac; done
a is lc                          # the output

The ä umlaut should also be matched and printed.

My /etc/locale.gen has the following contents:

# /etc/locale.gen
de_DE ISO-8859-1
de_DE.UTF-8 UTF-8
de_DE@euro ISO-8859-15

Here is the locale setting while doing the tests:

~$ locale
LANG=de_DE
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=de_DE.UTF-8

Here is the typescript file, as binary attachment to allow correct
transmission of utf-8.

#139861#10

Date:: 2003-02-26 11:29:55 UTC

From:

To:

It should be fixed by glibc 2.3.1 in sid; please check.

Regards,
-- gotom

#139861#15

Date:: 2003-02-27 19:01:25 UTC

From:

To:

GOTO Masanori <gotom@debian.or.jp> writes:

[Problems with [:lower:] and [:upper:] in UTF-8 locale]

I have now installed the latest versions of these programs:

ii  bash           2.05b-3        The GNU Bourne Again SHell
ii  coreutils      4.5.7-1        The GNU core utilities
ii  grep           2.5.1-2        GNU grep, egrep and fgrep
ii  locales        2.3.1-14       GNU C Library: National Language (locale) da
ii  libc6          2.3.1-14       GNU C Library: Shared libraries and Timezone

The following statements work as expected:

$ grep [[:lower:]]
$ grep [[:upper:]]
$ case ... in [[:lower:]]) ... esac  # bash
$ case ... in [[:upper:]]) ... esac  # bash

The following doesn't work with non-ASCII characters when LC_CTYPE is
set to de_DE.UTF8:

$ tr [:lower:] [:upper:]

Using "tr [:alpha:] '-'" I found out that non-ASCII letters (valid
letters in the de_DE locale) are not even recognized.  In the
de_DE.ISO-8859-1 locale both statements work correctly.

I don't know whether this is specific to this single program or
caused by problems in libc6 or the locales data.  Please tell me if you
think that I should report it to coreutils instead.

So half the bug report is resolved,

        Torsten

#139861#20

Date:: 2003-02-28 01:15:02 UTC

From:

To:

At Thu, 27 Feb 2003 20:01:25 +0100,
Torsten Hilbrich wrote:

Coreutils uses an old regex engine, so tr is not ready for UTF-8.
I think it's a TODO item for coreutils/textutils.
I am reassigning this bug to coreutils.

#139861#27

Date:: 2005-01-27 19:40:40 UTC

From:

To:

I've just verified that with version 5.2.1-2 of coreutils I can still
reproduce the bug, using LANG=es_AR.UTF8:

$ echo áéí | tr áéí ÁÉÍ
ÁÉÍ

$ echo áéí | tr [:lower:] [:upper:]
áéí

$ echo aeiáéí | tr [:lower:] [:upper:]
AEIáéí

$ echo áéí | grep [[:lower:]]
áéí

Please try and fix it.

#139861#32

Date:: 2005-01-27 20:08:21 UTC

From:

To:

Yes, coreutils does not claim to handle utf-8.

Mike Stone

#139861#37

Date:: 2005-01-27 20:35:13 UTC

From:

To:

Hola Michael Stone!

Will it ever handle it?  I guess you must know that UTF8 seems to be "the
encoding of the future" :), and thus it's a good idea to handle it.

The bug has had no activity since 2003, and I'm re-checking old bugs to see
if they are still present, so that they are not forgotten.

#139861#42

Date:: 2005-01-27 21:02:51 UTC

From:

To:

Margarita Manterola <debian@marga.com.ar> wrote:

FWIW, converting some of the textutils to deal with utf-8 is
most definitely on the upstream to-do list, but I don't know
when it'll be done.

#139861#49

Date:: 2006-06-03 07:16:41 UTC

From:

To:

I'm not sure if it's fixed upstream, but the Red Hat coreutils package
(5.2.1-48) is definitely UTF-8 aware (tested on fold).

So there might be some patch that could be applied to the Debian package.

Thanks.

#139861#54

Date:: 2006-06-03 12:05:15 UTC

From:

To:

Last time I looked, Red Hat had hacked a kind of UTF-8 support onto the
package.  It wasn't complete; it was just enough to pass the tests.  I'd
rather not do it than do it wrong.

Mike Stone

#139861#59

Date:: 2008-01-22 19:56:51 UTC

From:

To:

forcemerge 139861 388689 431231
tags 139861 + upstream confirmed wontfix
found 139861 6.10~20071127-1
thanks

Hi,

I'm merging these bugs (all about tr not supporting UTF-8), which still
affect the current coreutils in experimental.  "wontfix" indicates that
this is not going to be fixed by a Debian-specific patch, but that the
problem should be fixed upstream first.

#139861#70

Date:: 2011-02-11 10:04:13 UTC

From:

To:

tags 139861 - wontfix
quit

Lucas Nussbaum wrote:

That isn't what wontfix usually means, is it?

Thanks; I do agree that it seems best for anyone working on this to
just communicate directly with upstream.

#139861#83

Date:: 2012-04-13 10:37:04 UTC

From:

To:

More examples:

$ echo -e "日\n本\nで\nは" | sort -u | wc -l
3
$ echo -e "日\n本\nで\nは" | sort | wc -l
4

Something is quite wrong (i.e. this is definitely *incorrect behaviour*,
rather than merely a difference of opinion over implementation).

#139861#88

Date:: 2014-09-29 17:30:27 UTC

From:

To:

Here is yet another example of tr working incorrectly with Cyrillic
characters:
0000000: d0be d0bb d0be d0bb d0be 0a              ...........
0000000: d09e d09b d09e d09b d09e 0a              ...........
0000000: b09e b09b b09e b09b b09e 0a              ...........
The first is Cyrillic text in lowercase, the second the same text in
uppercase, and the last is what tr produced for the given range
substitution.  As we can see, the first byte of each character was
mistakenly changed from 0xd0 (the UTF-8 lead byte of these Cyrillic
characters) to 0xb0 (I don't know what that is), while each second byte
is correct.

#139861#91

Date:: 2014-09-29 18:58:35 UTC

From:

To:

dimas wrote:

Yes.  Please read through the bug log that you are adding information
to.  It is well known that coreutils does not support UTF-8
characters.  Please start at the top and read through to the bottom.

https://bugs.debian.org/139861

When you specify 'а-я' 'А-Я' you *think* you are specifying a range
from 'а' to 'я' but since the utilities are not multibyte aware what
you are actually specifying is:

  printf аяАЯ | od -c
  0000000 320 260 321 217 320 220 320 257

Here is the mapping.

  а \320\260
  я \321\217
  А \320\220
  Я \320\257

Therefore what you are *actually* telling tr is:

  tr '\320\260-\321\217' '\320\220-\320\257'
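
The byte-level reading above can be checked with a short Python sketch. It is only an emulation under stated assumptions, not coreutils code: the helper names `expand` and `tr_bytes` are made up for illustration, ranges are expanded over raw byte values, and a byte listed more than once in the first set is assumed to keep its last mapping.

```python
def expand(spec: bytes) -> list[int]:
    """Expand 'x-y' byte ranges in a tr-style set into a list of byte values."""
    out, i = [], 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == ord("-"):
            out.extend(range(spec[i], spec[i + 2] + 1))
            i += 3
        else:
            out.append(spec[i])
            i += 1
    return out


def tr_bytes(data: bytes, set1: bytes, set2: bytes) -> bytes:
    """Translate byte by byte, with no knowledge of multibyte characters."""
    s1, s2 = expand(set1), expand(set2)
    s2 += [s2[-1]] * max(0, len(s1) - len(s2))  # pad a short second set
    table = list(range(256))
    for src, dst in zip(s1, s2):
        table[src] = dst  # assumption: a repeated source byte keeps its last mapping
    return data.translate(bytes(table))


# 'а-я' is really the bytes D0 B0 '-' D1 8F: the byte D0, the range B0..D1, then 8F.
result = tr_bytes("ололо\n".encode("utf-8"),
                  "а-я".encode("utf-8"), "А-Я".encode("utf-8"))
print(result.hex())  # b09eb09bb09eb09bb09e0a
```

Under these assumptions the lead byte D0 falls inside the range B0..D1 and is mapped to B0, which matches the third hex dump reported earlier in this log.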

It is a known deficiency in coreutils that the utilities are not
multibyte aware.  The following can be found in the upstream source
package TODO file.

  Adapt tools like wc, tr, fmt, etc. (most of the textutils) to be
    multibyte aware.  The problem is that I want to avoid duplicating
    significant blocks of logic, yet I also want to incur only minimal
    (preferably `no') cost when operating in single-byte mode.

Some vendors have hacked in patches to make the utilities multibyte
aware but none of those patches have been considered clean enough to
incorporate into the upstream source yet.  Debian's maintainer has
stated that he does not want to diverge from upstream this radically
especially since there have been bugs reported with the multibyte
hacks.  The patches are very messy and incomplete.  The best course of
action would be to get this resolved upstream with the functionality
properly integrated.  Until then this remains a known deficiency.

Bob

#139861#105

Date:: 2022-06-14 14:50:20 UTC

From:

To:

Is this the same bug, or should a separate one be opened?

% echo '¡Hola!' | tr -d '¿'
�Hola!
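
This output is consistent with the same byte-wise behaviour. A small Python sketch, assuming tr -d deletes every byte that occurs in the UTF-8 encoding of its argument (an emulation, not tr's actual code):

```python
# '¿' is the bytes C2 BF and '¡' is C2 A1 in UTF-8; they share the lead byte C2.
data = "¡Hola!".encode("utf-8")
unwanted = "¿".encode("utf-8")

# Delete each byte of `unwanted` independently, as a byte-oriented tool would.
result = data.translate(None, unwanted)

print(result)                                    # b'\xa1Hola!'
print(result.decode("utf-8", errors="replace"))  # �Hola!
```

Removing the shared lead byte leaves a lone continuation byte, which a UTF-8 terminal shows as the replacement character.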

#139861#110

Date:: 2023-03-13 14:39:57 UTC

From:

To:

Oh wow, is this an old bug.

I thought, at first, it was just character classes…

$ echo mäÄH | tr '[:upper:]' '[:lower:]'
mäÄh

… but apparently, yes, multibyte support is broken:

$ echo 'mäæn' | tr ä Ȁ
mȀȦn
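
A possible explanation, sketched in Python under the assumption that tr maps the two bytes of ä one by one onto the two bytes of Ȁ (an illustration, not tr's actual code):

```python
# In UTF-8: ä = C3 A4, Ȁ = C8 80, æ = C3 A6.
# Byte-wise, "tr ä Ȁ" then means C3 -> C8 and A4 -> 80.
table = bytes.maketrans("ä".encode("utf-8"), "Ȁ".encode("utf-8"))

out = "mäæn".encode("utf-8").translate(table)
print(out.decode("utf-8"))  # mȀȦn
```

The æ is affected because it shares the lead byte C3 with ä: C3 A6 becomes C8 A6, which is the encoding of Ȧ.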
