#644270 strcoll(3): please make locale(5) easier to find

Package:
manpages-dev
Source:
manpages
Submitter:
Filipus Klutiero
Date:
2011-10-06 16:24:05 UTC
Severity:
wishlist
Tags:
#644270#5
Date:
2011-10-04 16:51:58 UTC
From:
To:
strcoll(3) explains that:

Unfortunately, it seems the way collation happens depending on the
LC_COLLATE locale is unspecified. I couldn't find any description, even
upstream.

#644270#10
Date:
2011-10-04 19:37:51 UTC
From:
To:
reassign 644270 manpages-dev 3.32-0.2
retitle 644270 strcoll(3): please make locale(5) easier to find
severity 644270 wishlist

Hi,

Filipus Klutiero wrote:

The strcoll(3) manpage is from manpages-dev, not eglibc.  Anyway, the
behavior is as described in POSIX[1].  Clarifying text welcome.

Hope that helps,
Jonathan

[1] http://www.unix.org/2008edition/

#644270#25
Date:
2011-10-04 19:56:41 UTC
From:
To:
Hi again,

Filipus Klutiero wrote:
[...]

Cc-ing the manpages-dev maintainers (sorry I forgot to do so before).

Where to go from here?  I believe giving a summary of the collation
order chosen for each locale would be a lot of work for very little
gain (unless there are some important details common to most locales),
so I would suggest pointing to the locale generation tools and locale
sources to help the reader to find these things out for herself
instead.  Anyway, your best bet is to work with upstream, as
described at [1], and let us know a relevant Message-ID so we can
track your work.

Making the platform's behavior more intuitive is certainly a valuable
goal.  Thanks!

Hope that helps,
Jonathan

[1] http://www.kernel.org/doc/man-pages/contributing.html
[2] http://www.unix.org/2008edition/

#644270#30
Date:
2011-10-05 04:42:15 UTC
From:
To:
That sounds a reasonable approach to me.

Cheers,

Michael

#644270#35
Date:
2011-10-06 05:21:49 UTC
From:
To:
Thanks Jonathan,
I had looked at package contents to figure out the sequence of fr_CA,
and had found /usr/share/i18n/locales/. I then looked at
/usr/share/i18n/locales/fr_CA, then at en_CA, then at iso14651_t1, and
finally at iso14651_t1_common. This is where I decided to stop guesswork
and looked for actual documentation.

I agree that not knowing the file's syntax was the final thing that
discouraged me, but even seeing what locale(5) contains now is of little
help (for me, it doesn't change anything).

I did mean this bug as being about the lack of *specification* of
collation. Linking to a manual giving hints on how to interpret the code
is better than nothing, but only a fraction of users will dare to go
that way. This is not about strcoll's manpage. I probably shouldn't have
mentioned strcoll() specifically, this is about collation in general. I
believe this should be documented in glibc-doc-reference, in section 7
"Locales and Internationalization" and easily reachable from 5.6
Collation Functions.

I think an even more general issue is that the influence of choosing a
specific locale doesn't seem to be explained. The documentation explains
what different locales can change, but not what each locale does.
Debian's best-known interface to locale choice is dpkg-reconfigure
locales. I'm not sure my dad would find it obvious that he wants to pick
"fr_CA.UTF-8 UTF-8" there.

I don't think specifying the collation order of each locale would yield
so little gain. What made me hit this issue is that I was trying to
determine what locale a multilingual program should use (the best
compromise assuming that a single locale will be used). Collation is
important, and I think many people wonder how it works.
I do agree, however, that this would require significant work.

Anyway, if we stick to the issue of collation, the Unicode collation
algorithm is documented at http://www.unicode.org/reports/tr10/
The specification is non-free, but specifying the parameters of each
locale and linking to it would be enough for me.
As for non-Unicode locales, I don't know.

POSIX 7.3.2 does contain a nice amount of useful information. It clearly
describes collating sequence definitions. It also gives the collating
sequence definition of C. That one is quite accessible. Thanks for that
too Jonathan.

#644270#40
Date:
2011-10-06 06:27:25 UTC
From:
To:
Filipus Klutiero wrote:

Thanks --- this information would have been useful in the original
report.  Was there some particular question about the fr_CA
collating sequence that you were looking for an answer to, or were you
just curious in general?

I guess there is also an implied bug report here regarding the
locale(5) page.  But there's not much use in splitting the bug into
the various relevant tasks unless someone is actually doing the work
of writing text.  Patches implementing even partial progress towards
your goal (a "SEE ALSO" here, a clarifying sentence there) would be
welcome.

Thanks for your interest.  To be clear, I will not personally be
working on this, since when I have time to write manpages, there are
many others I would rather spend time on.

#644270#45
Date:
2011-10-06 07:51:25 UTC
From:
To:
On 2011/10/6 Filipus Klutiero wrote:
[...]
[...]

Hello,

Glibc locales implement ISO 14651, not Unicode collation.  Early
drafts are available, for instance at
http://www.dkuug.dk/jtc1/sc22/open/n2933.pdf

Denis

#644270#50
Date:
2011-10-06 16:07:24 UTC
From:
To:
On 2011-10-06 02:27, Jonathan Nieder wrote:

To be clear, I explained my story to show that I made some efforts to
learn about glibc's collation sequences, implying that I needed it
because it had some importance. I didn't have a very particular question
in mind. An application I developed used to collate without using
strcoll(). I changed it to collate with strcoll(), and erroneously chose
C.UTF-8. I was looking for another global locale which would be closer
to the previous behavior. In particular, the previous sort allowed
overriding how a string would sort by prefixing it with special characters
such as "-" or "[". I was wondering if/how a different locale would
allow doing such hacks.

There are probably many more things actually relying on the collation.

Just to clarify, I consider this as a bug which should be [mostly]
addressed in the glibc reference manual (I did mean the upstream tag).
Having Debian maintainers verify the report, provide insight on it and
forward it upstream would entirely meet my expectations (this
has largely already been done).

#644270#55
Date:
2011-10-06 16:19:53 UTC
From:
To:
On 2011-10-06 03:51, D. Barbier wrote:

Oh, I guess that's a good point...
Thanks a lot for this, I suppose this shows there's something to
document. The standard also contains very useful information.

Apparently the current standard (ISO/IEC 14651:2007) is also available
(still non-free), both in English and French:
http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
The English version is
http://standards.iso.org/ittf/PubliclyAvailableStandards/c044872_ISO_IEC_14651_2007(E).zip