#1026231 debian-policy: document droppage of support for legacy locales

#1026231#5
Date:
2022-12-16 18:21:37 UTC
From:
To:
Hi!
As of Bookworm, legacy locales are no longer officially supported.  In order
to not break testsuites, they're mostly working if you install locales-all,
and you may manually request their generation by editing /etc/locale.gen --
but functionality is expected to bit rot and/or be removed in the future.

Thus, what about spelling this out in the Policy?

* Software may assume it always runs in a UTF-8 locale, and emit or
  require UTF-8 input/output without checking.
* The execution environment (usually init system or a container) must
  default to UTF-8 encoding unless explicitly configured otherwise.
* Legacy locales are no longer officially supported, and packages may
  drop support for them and/or exclude them from their testsuites.
* Packages may retain support for legacy locales, but related bug reports
  (unless security related) are considered to be of wishlist severity.
* Filesystems may be configured to reject file names that are not valid
  printable UTF-8 encoded Unicode.
* So-called BOM (U+FEFF) must not be added to plain-text output, and if
  present, editors/viewers customarily used for editing code should not
  hide its presence.
* Human-readable files outside of packages' private data must be encoded
  in UTF-8.  This applies especially to files in /usr/share/doc and /etc
  but applies to eg. executable scripts in /bin or /sbin as well.

Rationale: it takes a non-trivial amount of code to support diverse encodings;
Unicode is a strict superset of all legacy charsets, thus there's no loss of
functionality by switching to it exclusively.  In today's Unicode world, text
files in other encodings present a barrier to being read by the user.

While data received from outside the network may legitimately use legacy
locales, requiring all of stdin/stdout/stderr and filesystem data to use
UTF-8 would simplify code.  It's not like we pay more than lip service to
other encodings anymore...

While diversity in software is welcome, diversity in standards is not:
UTF-8 will not damage your pinky finger nor require Alt-F2 kill -9 to
exit; will not make your computer fail to boot or require a trip to the
data center; nor infect your K desktop with gnomeitis.  [Of course, there's
no plausible reason to use Postfix, ever!].  In other words, having multiple
phone vendors is essential but having multiple charging connectors is bad.

As for BOM, it is explicitly discouraged by the Unicode Consortium, and can
cause security vulnerabilities where scripts that pass human review act
differently than they appear.  <FEFF>#!/bin/perl gets executed by bash, and
this is just one example.
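
To illustrate the shebang point, here is a minimal Python sketch (the
helper names are made up for this example).  It only shows that a script
saved with a leading U+FEFF no longer begins with the two bytes "#!", so
the interpreter line a reviewer sees is not what the first bytes say.

```python
# Hypothetical helpers, for illustration only.
UTF8_BOM = b"\xef\xbb\xbf"  # the UTF-8 encoding of U+FEFF


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with a UTF-8 encoded U+FEFF."""
    return data.startswith(UTF8_BOM)


def has_shebang(data: bytes) -> bool:
    """Return True only if the very first two bytes are '#!'."""
    return data.startswith(b"#!")


# A script that looks like Perl in most viewers but starts with a BOM.
script = "\ufeff#!/bin/perl\nprint 1;\n".encode("utf-8")

print(has_utf8_bom(script))                 # True
print(has_shebang(script))                  # False
print(has_shebang(script[len(UTF8_BOM):]))  # True once the BOM is stripped
```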

As for inits/containers declaring LC_CTYPE=C.UTF-8, systemd has been doing
this for a while.  In sysvinit land we debated whether that's still needed
once glibc started to consider an unset locale to mean C.UTF-8 rather than C
-- but then, some language compilers do not use glibc.  debootstrap doesn't
configure a default locale, and not all higher-level tools do so either,
so a system installed in a non-standard but reasonable way can lack
the setting, to the surprise of the admin.


Meow!

#1026231#10
Date:
2022-12-16 20:30:16 UTC
From:
To:
Hi Adam,

How do you define a legacy locale ?
What do you mean by "officially supported" ?  By whom ?

Cheers,

#1026231#15
Date:
2022-12-19 19:08:09 UTC
From:
To:
For clarity, I think when you say "legacy locales" you mean locales
whose character encoding is either explicitly or implicitly something
other than UTF-8 ("legacy national encodings"), like en_US (implicitly
ISO-8859-1 according to /usr/share/i18n/SUPPORTED) and en_GB.ISO-8859-15
(explicitly ISO-8859-15 in its name). True?

Many of the non-UTF-8 encodings are single-byte encodings in the
ISO-8859 family, but if I understand correctly, your reasoning applies
equally to multi-byte east Asian encodings like BIG5, GB18030 and EUC-JP.
Also true?

Meanwhile, locales with a UTF-8 character encoding, like en_AG
(implicitly UTF-8 according to /usr/share/i18n/SUPPORTED) or en_US.UTF-8
(explicitly UTF-8), are the ones you are considering to be non-legacy.
Also true?

I think for Policy use, this would have to say something more precise,
like "locales with a non-UTF-8 character encoding". I wouldn't want to
get en_US speakers trying to argue that en_GB.UTF-8 is a legacy locale,
or en_GB speakers like me trying to argue that en_US.UTF-8 is a legacy
locale :-)
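
As a rough sketch of that more precise wording, the following Python
snippet classifies locales purely by their character encoding.  It assumes
the two-column "name charmap" line format, and the sample entries are just
the four locales mentioned above, not the real contents of
/usr/share/i18n/SUPPORTED.

```python
# Sample data only: the four locales named in this message.
SAMPLE_SUPPORTED = """\
en_US ISO-8859-1
en_GB.ISO-8859-15 ISO-8859-15
en_AG UTF-8
en_US.UTF-8 UTF-8
"""


def classify(supported_text: str) -> dict:
    """Map each locale name to 'UTF-8' or 'non-UTF-8' based on its charmap."""
    result = {}
    for line in supported_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, charmap = line.split()
        result[name] = "UTF-8" if charmap.upper() == "UTF-8" else "non-UTF-8"
    return result


for name, kind in classify(SAMPLE_SUPPORTED).items():
    print(f"{name:20} {kind}")
```

Note that the language and territory never enter the decision, so en_GB.UTF-8
and en_US.UTF-8 land in the same bucket.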

When you say "officially supported" here, do you refer to the extent
to which they are supported by the glibc maintainers, or some other
group? Or are you describing a change request that they *should not*
be officially supported by Debian - something that is not necessarily
true yet, but in this bug you are asking for it to become true?
UTF-8-only and ignores locales' character sets, which is arguably a bug
right now but would become a non-bug with your proposed policy.

This is a "may" so it can't possibly make a package gain bugs. It might
make packages have fewer bugs.

Is this already true? This seems like the sort of thing which should be
fixed in at least the major init systems and container managers before it
goes into Policy, in the interests of not making those init systems and
container managers retroactively buggy.

Is the C (aka POSIX) locale still a non-UTF-8 locale (if I remember
correctly its character encoding is officially 7-bit ASCII), or has it
been redefined to be UTF-8? Given the special status of the C locale in
defaults and standards, it might be necessary to say that it's the only
supported locale with a non-UTF-8 character encoding.
is this really a should/must in disguise: packages should/must not
assume that they can successfully read/write filenames that are not valid
printable UTF-8-encoded Unicode?

This seems like a change with a wider scope: not only is it excluding
filenames in Latin-1 or whatever, it's also excluding filenames with
non-printable characters (tabs, control characters etc.), or with
the UTF-8 representation of a noncharacter like U+FDEF. Perhaps that
should be a change orthogonal to de-supporting the non-UTF-8 locales?
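
To make the three separate conditions concrete, here is a small Python
sketch.  The function names are invented for this example, and "printable"
is approximated with Python's str.isprintable(), which is only one possible
reading of the proposed wording.

```python
def is_noncharacter(cp: int) -> bool:
    """Unicode noncharacters: U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def check_filename(raw: bytes) -> str:
    """Report the first reason a file name would be rejected, or 'ok'."""
    try:
        name = raw.decode("utf-8")  # strict: rejects Latin-1 and similar byte sequences
    except UnicodeDecodeError:
        return "not UTF-8"
    if any(is_noncharacter(ord(ch)) for ch in name):
        return "noncharacter"
    if not name.isprintable():
        return "non-printable"
    return "ok"


print(check_filename("café".encode("utf-8")))    # ok
print(check_filename("café".encode("latin-1")))  # not UTF-8
print(check_filename(b"a\tb"))                   # non-printable
print(check_filename("\ufdef".encode("utf-8")))  # noncharacter
```

Only the first check relates to non-UTF-8 locales; the other two are the
wider-scope restrictions.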

It's not immediately obvious to me what "human-readable files" means here.
Text files? Text files in ASCII-compatible encodings? Files intended to be
read and written by standard text editors?

I assume the intention here is to make it a policy violation to ship
documentation, scripts, configuration files, etc. encoded in something
like ISO-8859-1 or EUC-JP?

Is this intended to make it a policy violation to ship documentation, etc.
encoded in UTF-16?

This seems to me like it should perhaps be out-of-scope here, and treated
as a separate change: UTF-8 is still UTF-8, whether it starts with U+FEFF
or not, and I think deprecating en_GB in favour of en_GB.UTF-8 (and so on)
is orthogonal to deprecating the use of a U+FEFF prefix on UTF-8 text.

I think "UTF-8 output" is probably a better scope for this than
"plain-text output": my understanding is that when emitting UTF-16, UCS-2
or UCS-4 it's conventional (perhaps even recommended?) to emit a BOM
first, because in those encodings of Unicode, either LE or BE byte order
is reasonable (unlike UTF-8, which is always MSB-first by design). Perhaps
you meant this to be implicit, because to a Unix developer, "plain text"
is implicitly something ASCII-compatible (which rules out every Unicode
encoding except UTF-8), and legacy national encodings cannot represent
U+FEFF (which rules those out), leaving UTF-8 as the only "plain text"
encoding where U+FEFF is even representable?
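
A quick way to see the difference between these encodings is Python's
standard codecs, used here only as an illustration: "utf-8" never writes
a BOM, "utf-8-sig" always does, "utf-16" writes one to record the byte
order, and a legacy 8-bit encoding cannot represent U+FEFF at all.

```python
import codecs

text = "hi"

plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # UTF-8 with a U+FEFF prefix
utf16 = text.encode("utf-16")        # BOM first, then native byte order

print(plain.startswith(codecs.BOM_UTF8))     # False
print(with_bom.startswith(codecs.BOM_UTF8))  # True
print(utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# A legacy national encoding has no way to express U+FEFF.
try:
    "\ufeff".encode("iso-8859-15")
    latin9_can_encode_bom = True
except UnicodeEncodeError:
    latin9_can_encode_bom = False
print(latin9_can_encode_bom)  # False
```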

It seems to me that it shouldn't be a Policy violation for things
like text editors and character set converters to have the option to
emit UTF-8-with-U+FEFF-prefix, but maybe it should instead be a Policy
violation for that to be the default.

    smcv

#1026231#20
Date:
2022-12-19 21:44:12 UTC
From:
To:
Which raises the question: has the corresponding user group moved to UTF-8?
Judging from <https://en.wikipedia.org/wiki/Chinese_character_encoding>,
neither Chinese nor Japanese users have overwhelmingly moved to UTF-8,
so it would be problematic to stop supporting BIG5, GB18030 and EUC-JP.

Cheers,

#1026231#25
Date:
2022-12-21 14:23:09 UTC
From:
To:
We actually do have data about locale usage in Debian.
I've copied .report files from bugs-mirror, and
    grep -arm1 ^Locale: */*/*.report
shows that:
* most recent use of BIG5 is #925894 from March 2019
* there's no use of any GB locale (other than en_GB :p) past #609517 (2011)
* for EUC there's #1001207 (2021) #953616 #939588 #939494 #893625

Thus:
* Chinese encodings are totally dead for being used as a system locale
* Japanese are nearly so

That Wikipedia page presents stuff from 2008 as new developments, thus is
a wee bit outdated...
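
For anyone wanting to tally such fields rather than eyeball the grep output,
a sketch in Python could look like this.  The sample lines are invented,
and the "(charmap=...)" layout of the Locale: field is an assumption about
how the reports are formatted.

```python
import re
from collections import Counter

# Assumed layout: "Locale: LANG=..., LC_CTYPE=... (charmap=XYZ), ..."
CHARMAP_RE = re.compile(r"charmap=([^)\s]+)")


def tally_charmaps(locale_lines):
    """Count how often each charmap appears; lines without one count as 'unknown'."""
    counts = Counter()
    for line in locale_lines:
        match = CHARMAP_RE.search(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Invented sample input, not taken from real bug reports.
sample = [
    "Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8), LANGUAGE=en_GB:en",
    "Locale: LANG=ja_JP.EUC-JP, LC_CTYPE=ja_JP.EUC-JP (charmap=EUC-JP)",
    "Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8)",
    "Locale: LANG=C",
]

for charmap, count in tally_charmaps(sample).most_common():
    print(charmap, count)
```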


Meow!

#1026231#30
Date:
2022-12-21 17:15:11 UTC
From:
To:
Aye.

Aye.  Anything but UTF-8.

Right.

English (traditional) vs English (simplified) :p

My primary source is glibc, especially the debconf questions from "locales",
although bit-rot and/or outright droppage is widespread in other packages.

Exactly, I want to declare that a non-bug, thus saving developer time.

Aye.

Systemd does so since version 240, sysvinit relies on settings in /etc/
thus in the case of bare debootstrap the variables might be unset -- which
is mostly moot since glibc 2.35.  We briefly discussed a one-line patch
to ensure there's a fallback default; it's currently not applied (but can
be).  This would be relevant only for corner cases like an unconfigured
system running non-glibc non-musl binaries that rely on LC_*.
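
The fallback idea can be sketched in Python as follows.  This is not the
patch mentioned above, just an illustration: it applies the usual POSIX
precedence (LC_ALL, then LC_CTYPE, then LANG) and falls back to C.UTF-8
when nothing is set; the function name is made up.

```python
def effective_ctype(env: dict, fallback: str = "C.UTF-8") -> str:
    """Return the locale that would govern LC_CTYPE, or the fallback if unset."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # an empty string counts as unset
            return value
    return fallback


print(effective_ctype({}))                       # bare debootstrap case
print(effective_ctype({"LANG": "en_GB.UTF-8"}))  # configured system
print(effective_ctype({"LANG": "en_GB.UTF-8", "LC_ALL": "C"}))
```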

I'm less knowledgeable about containers, but they appear to work.  It might
be due to copying variables from the host or having template defaults...

Anyway, my aim is more to tell packages that they are allowed to misbehave
when the settings are missing than to hunt misuse scenarios.  But, if such
a scenario is found, with the current Policy there is no recourse, while
if this rule is added it would be a bug.

Hmm... if I recall correctly, old POSIX left the behaviour of characters
above 126 undefined, making C.UTF-8 _almost_ match the requirements (with
the only exception being iswblank() IIRC), but current version specifies ASCII
(rather than C standard's "portable subset") with no additions to character
classes other than cntrl and punct allowed.

This is the locale all processes start with, until they call setlocale().
I'm still not decided whether it should be allowed as the system locale
(ie, when a process says it wants locale handling enabled).

Having it breaks non-ASCII in GUIs, some text output, causes misalignment,
etc.  Thus maybe we can relegate it to the "you can set it if you want,
but if it breaks, you get to handle both pieces" status?  Which probably
needs no explicit mention in the Policy.

AFAIK valid Unicode is the case for eg. remotely mounted SMBFS.  It also
used to be required for JFS, but because of (at the time) widespread
non-Unicode encodings it got changed to an unsightly on-disk double encoded
format.

I wonder how viable a gradual change would be.  For the filesystem behaviour
to change, it would be good to allow it in the Policy, without explicitly
requiring the matching change in packages.

Maybe; you have a point.  I do run my boxen with a kernel that disallows
non-printables (especially tabs/newlines/...), and generally the only
fails I see are testsuites.

Thus you can unsmuggle the word "printable", I reflexively added it as it's
something I care about but is indeed orthogonal to non-UTF8.

There's no clear threshold for "human-readable".  There are so many formats
that sometimes are meant for the user to read/edit and sometimes are not.
Eg. there's HTML you may edit and HTML that's the hellish output of a pile
of templates.  Some folks claim that XML is "human readable".  And shell
scripts produced by autoconf are no more meant for human consumption than
disassembly of a binary executable.

I've thus intentionally left the definition vague.

Exactly.  They can't be conveniently read by users; and even without the
Policy change that's a problem for 99.9% of users today.

For a Unix person, UTF-16 doesn't make a text file.  Users already can't
read that without special tools, so that's no change.

While I stand by my point that BOMs are harmful, you have a point that this
may be a separate change.  Agreed.

Exactly, UTF-16/UCS-2/UCS-4 are not plain text.  Any other definition is
Wrong™ and its proponents need to be burned at a stake. :)

The distinction between a programmers' editor and a normal people's editor
is vague.  Only the former cares -- but there's no gain for the latter for
having a BOM, either.

An option to allow the user to do what he wants is not a sin, indeed: after
all, it's the user who owns the computer.  I'm speaking of defaults.

And I just tested Windows 11 notepad.exe: it defaults to UTF-8, and when
saving it allows "ANSI" "UTF-16 LE" "UTF-16 BE" "UTF-8" (default) and
"UTF-8 with BOM".  Thus even if the Great Enemy has switched to no-BOM
UTF-8, I see no reason to do otherwise.


Meow!

#1026231#35
Date:
2022-12-21 17:32:59 UTC
From:
To:
I do not think bug submitters expect the Locale field to be used for locale
usage statistics, so it does not seem fair to use it for that purpose.

Cheers,

#1026231#40
Date:
2022-12-22 11:31:07 UTC
From:
To:
There are three major categories of container that I'm aware of:

1. Full-system containers like lxc/lxd run a complete init system like
   systemd or sysv-rc for the container (they aim to behave like a
   lighter-weight alternative to VMs), so their init system would be
   responsible for making this work. This seems non-problematic for your
   proposed requirement: if an init system does the right thing on bare
   metal or in a VM, it will also do the right thing in lxc containers.

2. Per-service containers like the upstream-recommended use of Docker either
   have no init system at all, or a minimal reaper process like tini
   (they aim to behave like a heavier-weight alternative to chroots).
   chroot managers that have subsequently gained namespace functionality,
   like some uses of pbuilder and schroot, also work like this.
   I think these are the category that is most likely to have trouble
   complying with the requirement you propose, because the container manager
   is intentionally "hands-off" (mechanism more than policy), while the
   processes run inside the container are not under Debian's control (they
   are chosen by whoever wrote the Dockerfile or equivalent, and might come
   from either Debian or another distribution).

3. Per-app containers like Flatpak and Snap are not intended to emulate a
   whole system, so they are expected to inherit locale settings from the
   host system like a non-sandboxed app would. They shouldn't be a problem
   here, as long as your proposed requirement is worded in such a way that
   it is valid for these container managers to rely on the host system
   locale being correct (in other words, if someone using the legacy en_GB
   locale reports a bug "flatpak: does not set a UTF-8 locale", I should be
   able to close it with "This is not flatpak's job, set your host system
   locale to en_GB.UTF-8 instead").

podman and systemd-nspawn can be used as either the first category
(like lxc) or the second (like Docker), depending how they were invoked.

Not every bug necessarily needs to be a Policy violation.

Yes, that's the sort of UX that I think needs to be allowed. I would
personally not expose that choice in the UI of an intentionally simple
text editor like Notepad or gnome-text-editor, but I would expect similar
behaviour in an editor with more elaborate programmer-oriented features,
like vim, emacs, gnome-builder or kate.

If iconv(1) or a similar program has an option for "UTF-8 with BOM" then
that also needs to not be accidentally declared to be a Policy violation.

    smcv

#1026231#45
Date:
2023-01-18 23:30:46 UTC
From:
To:
Hello,

Bill, thank you, thank you, thank you!  You speak the voice of reason!

Adam, we living in the West may think of BIG5, GB18030 and EUC-JP as
legacy/obsolete encodings, but in Mainland China, GB18030 is anything
but legacy.  It is a mandatory national standard that has recently
been brought up to date in GB 18030-2022, synchronizing with ISO/IEC
10646:2017 (equivalent to Unicode version 11.0).

"GB 18030 is a national standard with stringent conformance
requirements that regulate eligibility for products or services to be
sold in China."  I personally went through this trying to get the now
defunct ThizLinux distro GB 18030-2000 conformant 20 years ago.  GB
18030-2022 will become mandatory on 2023-08-01.  Why the urgency?  To
add support for 17000+ rarer CJK Han characters found in people's and
place names, as well as improving support for minor ethnic languages
in China.  And the GB18030 standard committee is really serious about
promoting GB18030 because they are eager to resolve some real issues
of "missing characters" that are negatively affecting the people
living in China.  To my pleasant surprise, they are putting out a
public lecture webinar series that explains the why and the how of
implementing GB 18030-2022, with the 3rd video published on
2022-12-30.  In their mind, GB 18030 encompasses a lot more than just
a character encoding mapping table.  It is the full support package
(including fonts, display, printing, input methods, etc.) for Han
Chinese and all other minority languages used in China.

See e.g. the following excellent articles for more information:

 * https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
 * https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf

Even though Debian is not proprietary/commercial software, the GB
18030 authority highly recommends that free/libre and open-source
software _do_ implement GB 18030-2022.  That's especially true
considering the fact that vendors in China may be offering Debian as a
solution for clients, but they would be prevented from doing so if
Debian Policy spells out "We support UTF-8 and UTF-8 only."  Think of
all the ARM and RISC-V single-board computers made in China where
Debian is the default OS image; Debian or derivatives (Ubuntu, Ubuntu
Kylin, etc.) that are pre-installed on PCs sold in China, etc.

As a matter of fact, I have recently been approached to
update the IANA charset technical summary for "GB18030" (i.e. the
original GB 18030-2000) in
https://www.iana.org/assignments/charset-reg/GB18030 with the latest
updates for GB 18030-2022.  (Haha, I am starting to fret about it
because I am no expert in GB18030, but many thanks to e.g. Dr. Ken
Lunde, the expert in CJKV information processing, who has kindly
allowed me to borrow any of his articles in updating the IANA charset
documentation for GB18030.)

I'm not asking you to spend any time working on GB18030; that would be
the job of Debian Chinese i18n/L10n team as well as the wider
community (glibc, libiconv, Qt, etc.).  All I am asking you is to
maintain the status quo, and don't discount anything other than UTF-8
as legacy.  Debian already supports GB 18030-2000 (or GB 18030-2005)
rather well.  Don't let that existing support die!  If anything, we'd
need to improve GB18030 support to conform with GB 18030-2022, though
thankfully much of that work would likely come from upstream projects
or from Debian derivatives or other distros that are actually selling
their products in China.

Many thanks for your understanding!

Kind regards,

Anthony Fok

#1026231#50
Date:
2023-01-19 01:30:33 UTC
From:
To:
Anthony Fok <foka@debian.org> writes:

This topic comes up a lot, and I'd love to put something in either Policy
or the Developer's Reference proactively to at least explain what we know
about what our users need and to point people at the right questions to
ask if it's been another decade and they want to standardize on UTF-8
again.  Do you have an idea of something suitable we should say?

I do think we probably should say more *somewhere* about making UTF-8 the
default choice in most situations if you otherwise have no reason to
choose anything specific.  For example, as you point out, files written in
Chinese for Chinese people may or may not want to use UTF-8, but at this
point I do think anything written in, say, French or German probably
should just use UTF-8.  Also, file names in the file system shipped in
Debian packages probably should use UTF-8 since there's no way to declare
the character set and there are some solid reasons for picking one and
sticking with it.  (Obviously, users can create files with any character
set they want.)

How do I configure a locale that uses this as the default character set?
I'd like to be able to test this configuration (at least for my own
packages), but since recent changes to locales it doesn't appear to be an
option in debconf and I was confused trying to figure out how I should
make it work.

#1026231#55
Date:
2023-01-19 11:47:42 UTC
From:
To:
If I'm reading correctly, the character encoding part of GB 18030-2022 is
a subset of a sufficiently new version of Unicode, in the same way that
(say) ISO-8859-15 is a subset of Unicode: for every character representable
in GB 18030-2022, you can point at an equivalent Unicode character and say
"this is the GB 18030-2022 encoding of U+4E00" or similar? Is that true?

If that's the case, then supporting text files written in GB 18030
does not *necessarily* require the internal representation or the
system locale to be GB 18030, the same way I can still work with legacy
en_GB.ISO-8859-15 files on my en_GB.UTF-8 system: it could equally well
be done by using iconv() or equivalent to transcode to UTF-8, UTF-16 or
UCS-4 on input, doing all text editing operations on that Unicode, and
then transcoding back into GB 18030 on output. Most language frameworks
already do this as a matter of API: Qt, Java and Windows tend to work
with UTF-16 internally, while GLib/GTK uses UTF-8 internally.
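
The transcode-on-input, transcode-on-output pattern can be shown with a few
lines of Python, whose built-in "gb18030" codec stands in for iconv() here.
The helper name and the sample text are invented for illustration.

```python
def edit_gb18030(data: bytes, edit) -> bytes:
    """Decode GB 18030 bytes to Unicode, apply an edit, and encode back."""
    text = data.decode("gb18030")        # input: GB 18030 -> Unicode
    return edit(text).encode("gb18030")  # output: Unicode -> GB 18030


original = "一二三".encode("gb18030")

# All editing happens on Unicode text; the system locale is never consulted.
edited = edit_gb18030(original, lambda s: s + "四")
print(edited.decode("gb18030"))

# The same bytes can just as well be converted to UTF-8 for storage.
as_utf8 = original.decode("gb18030").encode("utf-8")
print(as_utf8.decode("utf-8"))
```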

iconv() seems very unlikely to drop support for GB 18030, ISO-8859-15 and
other non-Unicode encodings altogether. What this bug report is about is
dropping support for locales whose associated encoding is non-Unicode,
such as en_GB.ISO-8859-15 and zh_CN.GB18030, so that the data stream
between a CLI program and the terminal emulator will be assumed to be UTF-8
instead of ISO-8859-15 or GB18030.

The main thing I can see that would be a problem for GB 18030 users
if the zh_CN.GB18030 locale was dropped is that various programs might
assume that the locale encoding is the right one to assume when loading
existing files and unable to guess the encoding, or the right one to
write into new files by default - and so users who have moved from
zh_CN.GB18030 to zh_CN.UTF-8 might find themselves unintentionally
producing new UTF-8 files.
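
One way a program could avoid depending on the locale encoding when loading
existing files is to try strict UTF-8 first and fall back to a configured
legacy encoding.  The Python sketch below shows that heuristic; it is an
assumption about how such a program might behave, not a description of any
particular package.

```python
def decode_guess(data: bytes, fallback: str = "gb18030"):
    """Try strict UTF-8 first; on failure, decode with the fallback encoding."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback


print(decode_guess("中文".encode("utf-8")))
print(decode_guess("中文".encode("gb18030")))
```

A program that also remembers which encoding it detected can write the file
back in the same encoding instead of silently producing UTF-8.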

Preferring to use Unicode does seem to be the direction that all of
computing is going in, as a simplifying assumption - for example W3C
advice for HTML is "You should always use the UTF-8 character encoding"[1]
- and as we know, things that aren't tested usually don't work. So I
think the level of functionality for non-UTF-8 locales and encodings in
the software we package is going to decline over time, whether Debian
wants it to or not.

    smcv

[1] https://www.w3.org/International/questions/qa-html-encoding-declarations

#1026231#60
Date:
2023-01-20 13:57:17 UTC
From:
To:
If the world's most populous country uses something that is not UTF-8, I
think it's safe to say it's being tested, if only by people who will
file bugs when things go awry.

If the PRC government *requires* a non-UTF-8 encoding for things sold to
them, we would be doing our users who want to sell a product containing
Debian to the PRC government a disservice by dropping support for it
altogether.

We don't have to ensure it works perfectly out of the box; just that
support is achievable with a reasonable amount of work.

#1026231#65
Date:
2023-01-20 14:39:43 UTC
From:
To:
It is true for everything. Users know how to pick the software that works for their
environment. It is not relevant that software they do not use does not support their
environment.

Telling users to switch to UTF-8 because such and such software they never used
and were never going to use does not support GB18030 does not make sense.

It is like saying the Linux console is deprecated because there are Debian
packages that require X or Wayland.

Cheers,

#1026231#70
Date:
2023-01-20 16:01:33 UTC
From:
To:
Hey Russ, thank you so much for your message!

Adam, I would like to apologize; while I still value that Debian
maintains its existing support for zh_CN.GB18030 locale, I did speak a
bit too soon.  I'll elaborate.

I totally agree.  Besides the Debian Policy, a fellow DD on the #debian-zh
IRC (linked with Telegram) channel suggests that UTF-8 being the
default should be mentioned in the Release Notes, probably with
pointers to fuller documentation, with instructions on how to manually
add locales with legacy and other non-UTF-8 encodings: edit
/etc/locale.gen and /etc/default/locale, and run locale-gen.

And I should clarify: Actually, I would say, for the majority of end
users in Mainland China, zh_CN.UTF-8 would still be the best default,
though likely some government and financial institutions may require
the use of zh_CN.GB18030 probably for certain terminal applications.
I don't know the percentage though.

I asked around #debian-zh last night for more feedback, and most
existing users/developers definitely prefer UTF-8 and are using
zh_CN.UTF-8.  Some joked that those who choose zh_CN.GB18030 are the
ones who like to create difficulties for themselves.

And while support for zh_CN.GB18030 as a "system locale" was
apparently a requirement for conformance testing for GB 18030-2000
some twenty years ago — I went through that period personally when
there was a mad dash by all Linux vendors to get that as well as fonts
and input methods working properly — fellow Chinese DDs agree that it
could have been a requirement 20 years ago, but no longer today, and suggest
that all China homegrown distros nowadays use LANG=zh_CN.UTF-8 by default, and
apparently still pass the GB 18030(-2005?) conformance tests.  They
suggest that probably having the ability to read and write
GB18030-encoded documents, and being able to convert between UTF-8 and
GB18030 etc. should be sufficient.

I was initially unconvinced, but then I tested in a virtual machine
various ISO images from the latest releases of China homegrown Linux
distributions, e.g. Deepin Linux, openKylin, and even Red Flag Desktop
Linux, and they all use zh_CN.UTF-8 as the default system locale!
(Red Flag does have the zh_CN.gb18030 locale precompiled though, but then
it seems to have all available locales precompiled according to
"locale -a".)

Incidentally, Red Flag Desktop Linux is now based on Debian too!  They
used to co-develop the RHEL-based Asianux on which they built their
distro.  What a pleasant surprise!

Great point!  Totally agreed
from the locales dpkg configure menu... so that's why Adam was saying
official support for legacy locales has indeed been dropped. (Thanks
Adam!  You're just speaking the facts.)

Anyhow, to test how Debian and various desktop environments run under
zh_CN.GB18030 as system locale, here are the steps:

1. Create the /usr/local/share/i18n/SUPPORTED file with the line

        zh_CN.GB18030 GB18030

    (I actually started by prepending that line before "zh_CN.UTF-8 UTF-8"
     in /var/lib/dpkg/info/locales.config, but then saw that it has
     provision for a user-provided list of locale(s).)

 2. Run "sudo dpkg-reconfigure locales" and you'll be able to select
     zh_CN.GB18030 and set it as the default locale.

 3. Optionally, edit /etc/default/locale and make sure you have
      LANGUAGE=en or something similar so you can still see the
      UI in English.

 4. Reboot.

Alternatively, in lieu of running "dpkg-reconfigure locales", you may
also manually edit /etc/locale.gen, uncomment the line

    # zh_CN.GB18030 GB18030

therein, and run "sudo locale-gen".
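
The manual edit can also be scripted.  The Python sketch below only shows
the text transformation on an in-memory copy of the file contents; the
function name is hypothetical, the sample text is not the real
/etc/locale.gen, and "sudo locale-gen" still has to be run afterwards.

```python
def enable_locale(locale_gen_text: str, entry: str) -> str:
    """Uncomment the line matching `entry`; leave every other line untouched."""
    out = []
    for line in locale_gen_text.splitlines():
        body = line.strip()
        if body.startswith("#") and body.lstrip("#").strip() == entry:
            out.append(entry)  # drop the comment marker
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Sample contents for illustration only.
sample = "# zh_CN.GB18030 GB18030\n# zh_CN.UTF-8 UTF-8\n"
print(enable_locale(sample, "zh_CN.GB18030 GB18030"), end="")
```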

And I went ahead to test to see how Debian runs under zh_CN.GB18030 as
the system locale with various desktop environments.

The result:
 * Crash upon starting: GNOME 43 and XFCE (Ouch!)
 * KDE, LXDE, LXQt, Cinnamon, MATE: Start up normally.

As for terminals:
 * GNOME Terminal: Crash
 * Console (kgx), Terminator: Do not crash but support UTF-8 only
 * LX Terminal: Follows LANG setting and seemingly supports GB18030 fully
 * Konsole: Full support for GB18030 and any other encodings

In conclusion:

Initially, before asking on #debian-zh and doing all the testing, I
was going to suggest adding the "zh_CN.GB18030" back in the locales
configuration so that at least the GB18030 conformance testers can
easily choose it and let Debian pass the test.

However, after seeing how GNOME 43 crashes under zh_CN.GB18030, and
how China homegrown Linux distros have all switched to using
zh_CN.UTF-8 as the default system locale, I am starting to believe
that setting zh_CN.GB18030 as the system locale is not a requirement
for the GB 18030-2022 conformance tests (as my friends on #debian-zh
were trying to tell me), so I am going to do a full 180° and think
what we have now in Debian's locales package is perfect.  (Maybe some
of the menu text may need changing as there are no "legacy encodings"
to choose from.)

So, all is good!  Well, I hope that GNOME 43 and XFCE crashing upon
startup could be diagnosed and fixed, preferably by the upstream
authors, but there is no urgency to do so now.

I apologize for the confusion that I created.

Cheers,

Anthony

#1026231#75
Date:
2023-01-20 16:39:18 UTC
From:
To:
If using ISO-8859-15 "legacy encoding" as comparison, in China that
would be the 1980 "GB2312" (GB 2312-80) standard and the 1993 "GBK"
extension.  The character repertoires that these legacy
encodings/charsets contain are far fewer than what Unicode or ISO/IEC
10646 encompasses, and in that sense, they are "subsets of Unicode".

GB18030, on the other hand, is actually a full UTF or Unicode
Transformation Format (i.e. an encoding of *all* Unicode code points),
as in GB18030 maps to all codepoints of Unicode while maintaining
backward compatibility with existing GB2312 and GBK documents, just
like how UTF-8 maps to all codepoints of Unicode while maintaining
backward compatibility with ASCII.

GB18030 encodes characters into 1-byte, 2-byte or 4-byte sequences.
1-byte: essentially ASCII; 2-byte: essentially GBK; the 4-byte
sequences give a total of 1,587,600 (126×10×126×10) codepoints which
easily and sufficiently cover Unicode's 1,112,064 (17×65536 − 2048
surrogates) assigned, reserved, and noncharacter code points. (source:
Wikipedia)

Since GB18030 can be used to represent the entirety of all Unicode
code points, I would not call GB18030 a "subset" of Unicode.
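
The arithmetic and the 1/2/4-byte structure can be checked with a short
Python sketch, using Python's built-in "gb18030" codec as a convenient
stand-in.  The three sample characters are arbitrary picks for illustration.

```python
# Size of the 4-byte GB18030 space versus the number of Unicode scalar values.
four_byte_slots = 126 * 10 * 126 * 10
unicode_scalars = 17 * 65536 - 2048  # 17 planes minus the surrogates

print(four_byte_slots)  # 1587600
print(unicode_scalars)  # 1112064
print(four_byte_slots >= unicode_scalars)  # True

# One sample each for the 1-byte, 2-byte and 4-byte forms.
samples = ("A", "中", "\U0001F600")
lengths = [len(ch.encode("gb18030")) for ch in samples]
for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```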

And some people like to think of GB18030 as "UTF-GBK", e.g.
http://archives.miloush.net/michkap/archive/2013/03/28/10405914.html

Very true.  While GB18030 is another encoding form for Unicode (and
not a subset), indeed we don't need to use GB18030 as the "internal
representation or the system locale", you have put it very nicely.
GB18030 is also somewhat inefficient as a UTF as the required mapping
table and 4-byte conversion algorithm take up far more space and are
quite a bit slower than something as elegant as UTF-8.

Indeed, and thankfully, Google Chrome, Mozilla Firefox, LibreOffice
supposedly still support the reading (and writing) of GB18030
documents through iconv() or ICU or Qt's encoding conversions.

Yes.  These are some of the pains as we transition from legacy
GB2312/GBK encodings towards Unicode, and GB18030 (being a UTF) is
designed as a stepping stone.  But yes, moving to UTF-8 is indeed a
good thing, even in China, as China is not an isolated island.  Chinese
people do value interoperability with the world too.

Very true, and it is already happening, even in China, thankfully.
(See my previous email from today to see my 180° turnaround, as I
finally realized that the GB18030 authorities are pragmatic and do not
actually require zh_CN.GB18030 to be the system locale, but rather
that GB18030 data can be processed, that characters that were in PUA
but are now in Unicode can be properly supported, etc.)

Thank you for the discussion!  :-)

Cheers,

Anthony

#1026231#80
Date:
2023-01-20 16:54:21 UTC
From:
To:
Supposedly some older Chinese websites are still using "GBK" as their
encoding, probably something like:

     <meta http-equiv="Content-Type" content="text/html;charset=gbk">

which has fewer than 30,000 characters and is thus a very limited
subset of Unicode.  And since presumably not everyone has the know-how
to convert to UTF-8, the Chinese government wants those who cannot to
at least change that meta tag to:

     <meta http-equiv="Content-Type" content="text/html;charset=gb18030">

where GB18030, being a Unicode Transformation Format, albeit a
somewhat awkward one, would be able to display any characters in
Unicode.

I have the feeling that many tech-savvy Chinese have already switched
to UTF-8, but then perhaps in some circles there are lots of legacy
GB2312/GBK documents or systems that made GB18030 a necessity, as an
intermediate step to Unicode.

(Not so in Taiwan and Hong Kong: they jump straight to UTF-8 from Big5
or Big5-HKSCS.  For better or for worse.)

Thanks for the wonderful discussion, Bill!

Cheers,
Anthony

#1026231#85
Date:
2023-01-20 17:00:33 UTC
From:
To:
Thank you Wouter!  That is exactly my thought, although after my
initial message, I have been told that "zh_CN.GB18030 as system
locale" may not be a strict requirement, and thus an explicit UI for
selecting zh_CN.GB18030 as system locale may not be necessary.  A
fellow Chinese DD suggested that some documentation on how to edit
/etc/locale.gen to enable zh_CN.GB18030 or other non-UTF-8 encodings
would likely be sufficient.

That said, if the testing authorities do want zh_CN.GB18030 to be
easily selectable, I think we can always sneak "zh_CN.GB18030" into
the locales configuration interface in a point release.  <grin, duck,
run>

Cheers,
Anthony

#1026231#90
Date:
2023-01-20 17:12:31 UTC
From:
To:
And we do know there's not a single bug filed with a GB* locale within the
last 12 years.

There are far fewer reports from Chinese people than the population
would indicate: 0.75% of those with locale information, but that's still
3241 reports; I find it implausible that there's a non-negligible number
of users that go with GB* yet not a single one of them gave a single bit
of feedback.

that all glyphs can be displayed, they can be entered from keyboard, etc.
The standard talks a lot about font support, etc.

Our installer doesn't allow choosing such a locale, thus if indeed the
encoding, not just the character set, is legally required, then we'd
need to change that before the release.

But I don't expect that to be the case -- a few years ago I played with
Deepin and don't remember any weird encoding being used.  It would be good
to either check again, or ask one of their maintainers.

But for now, I gotta run.


Meow!

#1026231#95
Date:
2023-01-20 17:16:43 UTC
From:
To:
Sure, but neither of those actually requires us to support GBK or GB
18030 as a system locale, only as something that iconv() (or whatever
browsers actually use, which is probably their own thing) can convert
into their preferred internal representation (which is almost certainly
UTF-8, UTF-16 or UCS-4).

Analogously, we've never supported using Windows-1252 (Microsoft's
legacy Latin-1 variant) as a system locale encoding in some hypothetical
locale like en_US.windows-1252, but HTML documents with
text/html;charset=windows-1252 still work fine.

That doesn't seem so far away from how in some English-speaking circles
there are lots of legacy ISO-8859-1, ISO-8859-15 or (more likely)
Windows-1252 documents, and we can cope OK with those via transcoding,
even in UTF-8 system locales.
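
To make the transcoding point concrete, here is a minimal sketch in
Python, whose built-in codecs play the same role as iconv() here.  The
function name to_utf8 and the sample strings are hypothetical, chosen
only for illustration; nothing in it depends on the system locale.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from a declared legacy encoding to UTF-8.

    The encoding is passed explicitly (as a charset= declaration would
    supply it), so the process locale is never consulted.
    """
    return data.decode(source_encoding).encode("utf-8")


# Illustrative legacy inputs: a GBK-encoded page and a Windows-1252 one.
gbk_page = "中文网页".encode("gbk")
cp1252_doc = "café €5".encode("cp1252")

print(to_utf8(gbk_page, "gbk").decode("utf-8"))
print(to_utf8(cp1252_doc, "cp1252").decode("utf-8"))

# Because GB18030 is backward compatible with GBK, declaring the same
# bytes as gb18030 decodes to the same text.
print(to_utf8(gbk_page, "gb18030") == to_utf8(gbk_page, "gbk"))
```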

    smcv

#1026231#100
Date:
2023-01-20 21:36:36 UTC
From:
To:
Hi Adam,

You are correct indeed.  Yes, they worry more about the correct
coverage of characters, especially those that were added in 2022
corresponding to the latest ISO/IEC 10646 standard.

That said, they do require the ability to open, edit, and convert
GB18030-encoded files, but that is at the iconv() / ICU / application
level, and, like you said, they are NOT enforcing the use of
zh_CN.GB18030 as system locale.  (I now stand corrected.)

Incidentally, they have published three pre-recorded webinars thus
far, and I have reuploaded them to YouTube here for easier access for
the rest of the world:

https://www.youtube.com/watch?v=6gByuPXth7s&list=PLWCc17-QBkRjwhRfvCpxM8ez3b0qWO59a

I have yet to figure out how to add automatic Chinese subtitles and to
have it translated to English.  Maybe some day.  :-)

Indeed, and that is what friends on the #debian-zh IRC channel are
trying to tell me, and I have personally confirmed that not only Deepin,
but also Red Flag, openKylin, and Ubuntu Kylin all use zh_CN.UTF-8 as
the default system locale (see one of my really long-winded responses
in this thread for details).  So you are indeed correct, and I stand
corrected too.  Sorry for the false alarm!  My mindset was still
living in 2002 when zh_CN.GB18030 was assumed to be a requirement by
the industry, but apparently all distros have switched over to
zh_CN.UTF-8 by default.

Debian does still support zh_CN.GB18030 with KDE, LXDE, LXQt,
Cinnamon, MATE, etc., but crashes with GNOME 43 and XFCE at the moment
(at least on my system), but that's good enough to pass the most
basic GB18030 test.  Just as you correctly observe, "zh_CN.GB18030
as system locale" is not legally required, and thus there is no need to
change the Debian installer or the locales package for that, so I
wouldn't worry about that for Debian 12.0.  (If the winds do change, we
could hypothetically sneak in the change in, say, Debian 12.1.  And the
myriad of Debian derivatives in China can easily make that change for
basic conformance too.)

Cheers,

Anthony

#1026231#105
Date:
2023-01-21 20:27:43 UTC
From:
To:
Those files need to be edited *somewhere*. If that somewhere is a Debian
desktop, then you also need editors that know how to write such files,
etc.

Sometimes it's just easier if the whole thing uses the same encoding.

Windows-* encodings were native on Windows, and we only needed to
be able to read files that were generated on such systems.

We're talking here instead about a government-mandated encoding that
systems are expected to support; not only to consume data, but also to
*produce* data.

Windows-* encodings never had that attached to them.

#1026231#110
Date:
2023-01-21 20:58:19 UTC
From:
To:
Wouter Verhelst <wouter@debian.org> writes:

Both Emacs and vim will edit files in whatever (supported) encoding you
want, regardless of the locale encoding.  I would assume this is not that
uncommon of a feature for other editors as well.  This is therefore a bit
like Simon's web browser example (although may be somewhat less
transparent, admittedly).

(Also, if you're editing files written in Chinese, presumably you're using
an editor with good Chinese input support, and thus one that's more likely
to also have good Chinese encoding support.)
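
The same idea, writing and reading a file in an explicitly chosen
encoding whatever the locale is, can be sketched in a few lines of
Python.  This is only an illustration of the principle, not how Emacs or
vim implement it; the helper names write_text and read_text and the
sample text are hypothetical.

```python
import os
import tempfile


def write_text(path: str, text: str, encoding: str) -> None:
    """Write text using an explicit encoding, ignoring the locale."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


def read_text(path: str, encoding: str) -> str:
    """Read text back using the same explicit encoding."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


sample = "你好，世界\n"  # illustrative content only

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "note.txt")
    write_text(path, sample, "gb18030")
    with open(path, "rb") as f:
        raw = f.read()
    round_tripped = read_text(path, "gb18030")

print(raw == sample.encode("gb18030"), round_tripped == sample)
```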

#1026231#115
Date:
2023-01-21 21:16:34 UTC
From:
To:
This is true, but it misses an important point: it is usually not
possible to detect the character encoding of a plain text file.
That is where a default encoding is required.
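
A small Python sketch of why detection is unreliable: the same bytes
decode without error under more than one legacy encoding and yield
different text.  The sample string, the candidate list and the helper
name try_decode are my own illustrative choices.

```python
from typing import Optional


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Bytes produced by a GBK-writing application (illustrative sample).
data = "中文".encode("gbk")

for encoding in ("gbk", "cp1252", "iso-8859-1", "utf-8"):
    print(f"{encoding:>10}: {try_decode(data, encoding)!r}")
```

Here the bytes are rejected as UTF-8 but are accepted as GBK, as
Windows-1252 and as ISO-8859-1, so validity alone cannot tell the
reader which of those the author meant.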

Cheers,