Dear Maintainer, POSIX.1-2016 says: -- >8 -- The following options shall be supported: [...]
Hi,

I find it sad that such a bug in one of the most important packages hasn't even received a reply from the maintainer in more than four years.

That being said, it works on OpenSUSE:

ef9046429ed4:/ # echo яйцояЙЦО | cut -f 2 -d я
йцо
ef9046429ed4:/ # cat /etc/os-release
NAME="openSUSE Tumbleweed"
# VERSION="20251112"
ID="opensuse-tumbleweed"

OpenSUSE Tumbleweed currently has coreutils 9.9, so this might theoretically be something that was implemented very recently, but I doubt that. I guess it's some configure option. I recommend taking a look at OpenSUSE's compile-time configuration.

On the other hand, this change has the potential to break existing shell scripts, so it has to be handled carefully.

Greetings
Marc
This still occurs. Contrary to bug 992667 (with "-c"), one at least gets a failure, so that the issue should be detectable in scripts.

"We're working on it. Something is coming soon.", but then this remained silent.

[...] [...]

In the upstream bug, Eric Blake said: "Several distros have add-on patches that add wide char support, but to date, no one has yet submitted a patch upstream that is both easy to maintain (doesn't needlessly duplicate big blocks of code over char vs. wchar_t) and which doesn't penalize speed on single-byte locales."

I could test with coreutils 9.9 in Termux (Android), and I get a failure.
FTR, in voreutils cut (0BSD:
<http://ro.ws.co.ls/cut.1>,
<https://git.sr.ht/~nabijaczleweli/voreutils/tree/trunk/item/cmd/cut.cpp>),
this is implemented with the -d argument being a byte span ("field_sep"),
so delimiter search reduces to memmem()/memchr() ("l.find(*field_sep)"),
which means -d: -dя -d$'\377' -dupa are all handled the same way;
this seemed like an obvious generalisation to me,
so cut(1), STANDARDS, just notes that
$ echo QWEaQWEabQWE | cut -d'ab' -f2
QWE
$ echo QWEaQWEabQWE | /bin/cut -d'ab' -f2
/bin/cut: the delimiter must be a single character
Try '/bin/cut --help' for more information.
I believe you get the same result as the first line on the illumos gate
(I tested this on tribblix, if memory serves).
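
To make the byte-span model concrete, here is a minimal sketch of field selection with a delimiter of arbitrary length. It is not the voreutils code: nth_field() is a hypothetical helper written for illustration, and it assumes the line has already been read and stripped of its newline. The only search primitive it needs is std::string_view::find(), so no character decoding takes place.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not taken from voreutils): return the n-th
// (1-based) field of `line`, where fields are separated by the byte
// span `sep`. The delimiter is never decoded as characters.
std::string_view nth_field(std::string_view line, std::string_view sep,
                           std::size_t n)
{
    // A line without the delimiter is passed through unchanged,
    // like cut without -s. An empty delimiter is treated the same
    // way here (an assumption of this sketch).
    if (sep.empty() || line.find(sep) == std::string_view::npos)
        return line;

    for (std::size_t i = 1;; ++i) {
        const std::size_t end = line.find(sep);
        if (i == n)
            return line.substr(0, end);  // npos is clamped to the line end
        if (end == std::string_view::npos)
            return {};                   // fewer than n fields: empty result
        line.remove_prefix(end + sep.size());
    }
}
```

With this, nth_field("QWEaQWEabQWE", "ab", 2) gives "QWE", and a delimiter of one byte, one multi-byte character, or several characters all go through the same code path.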
Parsing the input as characters only happens in -nb and -c modes,
and only for mbrlen(), which is the minimum required.
So duplication is not necessary. Of course, one can conceive of an
encoding where you could encode я into bytes two different ways,
and you'd want cut -dя to match both. Whether that is real,
whether you consider that to be real, and whether that would be
a useful behaviour vs byte span matching will inform whether
that implementation model is viable for coreutils.
Best,