#993258 /usr/bin/cut: -d only accepts a single byte, must accept character

Package:
coreutils
Source:
coreutils
Description:
GNU core utilities
Submitter:
наб
Date:
2026-05-25 17:41:02 UTC
Severity:
normal
Tags:
#993258#5
Date:
2021-08-29 14:00:48 UTC
From:
To:
Dear Maintainer,

POSIX.1-2016 says:
-- >8 --
The following options shall be supported:

[...]

#993258#10
Date:
2025-11-14 07:27:59 UTC
From:
To:
Hi,

I find it sad that such a bug in one of the most important packages
hasn't received even a reply from the maintainer in more than four years.

That being said, it works on OpenSUSE:

ef9046429ed4:/ # echo яйцояЙЦО | cut -f 2 -d я
йцо
ef9046429ed4:/ # cat /etc/os-release
NAME="openSUSE Tumbleweed"
# VERSION="20251112"
ID="opensuse-tumbleweed"

OpenSUSE Tumbleweed currently has coreutils 9.9, so this might
theoretically be something that was implemented very recently, but I
doubt that. My guess is that it's a configure option. I recommend taking
a look at OpenSUSE's compile-time configuration.

On the other hand, this change has the potential to break existing
shell scripts, so it has to be handled carefully.

Greetings
Marc

#993258#18
Date:
2026-02-04 15:47:03 UTC
From:
To:
This still occurs. Unlike bug 992667 (with "-c"), one at least
gets a failure here, so the issue should be detectable in scripts.
"We're working on it. Something is coming soon.", but then this
remained silent.

[...]
[...]

In the upstream bug, Eric Blake said: "Several distros have add-on
patches that add wide char support, but to date, no one has yet
submitted a patch upstream that is both easy to maintain (doesn't
needlessly duplicate big blocks of code over char vs. wchar_t) and
which doesn't penalize speed on single-byte locales."

I was able to test with coreutils 9.9 in Termux (Android), and I get
a failure.

#993258#32
Date:
2026-02-04 21:04:03 UTC
From:
To:
FTR, in voreutils cut (0BSD:
 <http://ro.ws.co.ls/cut.1>,
 <https://git.sr.ht/~nabijaczleweli/voreutils/tree/trunk/item/cmd/cut.cpp>),
this is implemented with the -d argument being a byte span ("field_sep"),
so delimiter search reduces to memmem()/memchr() ("l.find(*field_sep)"),
which means -d:, -dя, -d$'\377', and -dupa are all handled the same way;
this seemed like an obvious generalisation to me,
so cut(1), STANDARDS, just notes that

$ echo QWEaQWEabQWE | cut -d'ab' -f2
QWE
$ echo QWEaQWEabQWE | /bin/cut -d'ab' -f2
/bin/cut: the delimiter must be a single character
Try '/bin/cut --help' for more information.

I believe you get the same result as the first line on the illumos gate
(I tested this on tribblix, if memory serves).
Parsing the input as characters only happens in -nb and -c modes,
and only for mbrlen(), which is the minimum required.
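
As an illustration of the byte-span model described above, here is a
minimal sketch. It is not the voreutils or coreutils code: split_fields
and cut_field are hypothetical names, and the sketch handles one line and
one field number only. The delimiter is treated as an opaque, non-empty
byte string and located with std::string::find, so a single byte, a
multi-byte UTF-8 character, and a longer string all take the same path.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from voreutils or coreutils): split one line
// into fields on an arbitrary, non-empty byte sequence.
std::vector<std::string> split_fields(const std::string& line,
                                      const std::string& delim) {
    assert(!delim.empty());  // an empty delimiter would never advance
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (;;) {
        // Plain byte search; no character decoding is involved.
        std::size_t pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));
            return fields;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + delim.size();
    }
}

// Hypothetical stand-in for `cut -d DELIM -f N` on a single line
// (N is 1-based). As with cut without -s, a line that contains no
// delimiter is returned whole; a missing field yields an empty string.
std::string cut_field(const std::string& line, const std::string& delim,
                      std::size_t n) {
    std::vector<std::string> fields = split_fields(line, delim);
    if (fields.size() == 1) return line;
    if (n == 0 || n > fields.size()) return std::string();
    return fields[n - 1];
}
```

With this sketch, cut_field("QWEaQWEabQWE", "ab", 2) yields "QWE", matching
the first transcript above. Because UTF-8 is self-synchronising, a byte
search for the two-byte я cannot match across a character boundary; that
guarantee does not hold for every multi-byte encoding.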

So duplication is not necessary. Of course, one can conceive of an
encoding where you could encode я into bytes two different ways,
and you'd want cut -dя to match both. Whether that is real,
whether you consider that to be real, and whether that would be
a useful behaviour vs byte span matching will inform whether
that implementation model is viable for coreutils.

Best,