#99933 Encourage UTF-8 for documentation and clarify encoding issues

#99933#5
Date:
2001-06-07 14:52:28 UTC
From:
To:
The following proposed addition to policy clarifies encoding issues and
prepares for an eventual migration to UTF-8 (see Bug#99324).
Note the use of the word "should": these are not strict requirements.
--- policy.sgml-old	Fri Jun  1 11:40:16 2001
+++ policy.sgml	Thu Jun  7 13:31:09 2001
@@ -1653,6 +1653,15 @@


       </sect>
+
+      <sect id="controlencoding"><heading>Encoding of control files</heading>
+	<p>
+            If, for whatever reason (such as upstream author's or maintainer's
+            names, foreign language package description and similar), you need to
+            use characters outside 7 bit ASCII range in control files, these
+            characters should be encoded using UTF-8 encoding.
+	</p>
+      </sect>
     </chapt>

     <chapt id="versions"><heading>Version numbering</heading>
@@ -2276,8 +2285,16 @@
 	    all.
 	  </p>
 	</sect1>
+
+	<sect1><heading>Character set of <tt>debian/changelog</tt></heading>
+
+	  <p>
+            Character set of <tt>debian/changelog</tt> should be either pure ASCII, or UTF-8.
+	  </p>
+	</sect1>
       </sect>

+
       <sect id="srcsubstvars"><heading><tt>debian/substvars</tt>
 	  and variable substitutions	  </heading>

@@ -7370,6 +7387,26 @@
 	  from <tt>/usr/share/doc/<var>package</var>/</tt>.
 	</p>

+	<p>
+          Documentation of debian packages in text format, if written in
+          language requiring characters outside of 7-bit ASCII range,
+          should use either well-established encoding for the given
+          language <footnote>such as ISO-8859-2 for some central- and easter
+          europian languages, KOI8-R for Russian, etc.</footnote>, or UTF-8
+          encoding.
+          Maintainers are being encouraged to use UTF-8, having in mind
+          the general debian migration toward unified character encoding.
+	</p>
+
+	<p>
+          Original upstream documentation, if in encoding other than UTF-8
+          or the well-established encoding for the particular language,
+          should be converted either to UTF-8 or to the well-established
+          encoding. Choice between UTF-8 and other encoding is left to the
+          maintainer discretion, however, one package should have all the
+          documentation in one consistent encoding for one language.
+	</p>
+
       </sect>

       <sect id="usrdoc">
@@ -7440,6 +7477,18 @@
 	  Other formats such as PostScript may be provided at the
 	  package maintainer's discretion.
 	</p>
+
+        <p>
+          HTML documents, if in encoding other than <tt>us-ascii</tt>, should
+          have in their header an appropriate META tag describing
+          the used encoding.
+
+          Example:
+          <example>
+            <META HTTP-Equiv="Content-Type" CONTENT="text/html; charset=UTF-8">
+          </example>
+        </p>
+
       </sect>

       <sect id="copyrightfile">
@@ -7555,6 +7604,24 @@
 	  changelog, then the Debian changelog should still be called
 	  <tt>changelog.Debian.gz</tt>.</p>
       </sect>
+
+      <sect id="charset">
+	<heading>Deafult character set</heading>
+
+	<p>
+          Names of maintainers, upstream authors and other data in
+          packages' descriptions and related debian data files (such as
+          <tt>debian/changelog</tt>, <tt>debian/copyright</tt>,
+          <tt>debian/control</tt>), as well as in English language
+          documentation, should be either transliterated or
+          transcribed to ASCII, or used in UTF-8 encoding at the
+          discretion of the maintainer. However, for names
+          in scripts based on non-latin alphabets, ASCII (or suitable
+          latin-script) version should be provided along with original
+          name.
+        </p>
+       </sect>
+
     </chapt>

     <appendix id="pkg-scope">
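
A minimal sketch of how the "pure ASCII or UTF-8" wording above could be
checked for a file such as debian/changelog or debian/control. The helper
name classify_encoding is hypothetical and not part of any Debian tool. A
successful UTF-8 decode is only a heuristic, since text in a legacy encoding
can occasionally be valid UTF-8 by accident.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for the raw bytes of a file."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Strict decoding rejects byte sequences that are not well-formed UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"
```

A maintainer could read the file in binary mode and treat a result of
"other" as a hint that it still needs converting.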

#99933#10
Date:
2001-06-08 19:57:33 UTC
From:
To:
On Fri, Jun 08, 2001 at 02:26:49PM -0500, Steve Greenland wrote:
...

...

A package can have documentation in several different
languages. What I meant is that each language is
in one encoding, so that no two files for the same
language use different encodings.
Of course, other languages can have other encodings
(otherwise we would be forcing people to use UTF-8,
which we do not want to do).

Sorry for my sloppy English, if the intention
was not clear from the description.

So I would probably say it as:

 "...however, the documentation for any single package should use
 only one encoding for one language."

Does it look clear enough?
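
The rule can be read as a simple consistency check over a package's
documentation. The sketch below assumes the maintainer already knows the
language and encoding of each file (nothing here detects them), and the
function name and tuple layout are invented for illustration.

```python
from collections import defaultdict


def mixed_encoding_languages(docs):
    """docs: iterable of (path, language, encoding) tuples.

    Return {language: sorted list of encodings} for every language whose
    documents do not all share one encoding.
    """
    seen = defaultdict(set)
    for _path, language, encoding in docs:
        seen[language].add(encoding.lower())
    return {lang: sorted(encs) for lang, encs in seen.items() if len(encs) > 1}
```

Different languages may still use different encodings; only a mix within one
language is reported.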

#99933#15
Date:
2001-06-08 20:58:16 UTC
From:
To:
[*snip*]

Aaah, I did read the original wrong (because it did seem to imply that
all the docs in a given package needed to be the same language, which
is why I left that part of it out). I might go with:

        "...however, in a single package, all the documents written in a
        particular language should share the same encoding."

I won't claim that is clearly superior to your phrasing though, so pick
whichever works better for you.

Steve

#99933#20
Date:
2001-06-09 11:24:08 UTC
From:
To:
Hello Radovan,

Thursday, June 07, 2001, 5:58:31 PM, you wrote:

Seconded.

#99933#25
Date:
2001-06-11 14:47:18 UTC
From:
To:
Mea culpa...
by mistake I sent my previous mail to 99324@bugs.debian.org, which was
Cesar Eduardo Barros's proposal about full unicode support,
while my proposal is #99933

does JIS X0208 allow chinese characters to be used together
with japanese?
There are actually two problems:
the first one, which was emphasised and used as the main argument
against unicode is in fact the less important:
unicode is not complete for CJK. That is (relatively :-)) easy
to fix, just write a proposal and get it accepted....

The second problem reflects a fundamental design decision:
Unicode unifies Chinese (traditional and simplified), Korean and
Japanese characters, and because of differences in glyphs,
an appropriate font is required to view the text
properly.
The situation is IMHO quite similar to German written in Fraktur
(Sütterlin) script - it is a Latin script, and the Unicode
Consortium (IMHO rightfully) decided that it is a typesetting
difference - not an encoding one (you can - and sometimes you do -
typeset English text using Fraktur fonts, after all). If Germans
were still using it today, you would have exactly the same problems
as with CJK scripts now (of course, the complexity of CJK is
much greater than that of Latin scripts).

Or, similar example, I was reading a linguistic book in Russian,
and there were examples from Old Church Slavonic. To distinguish them
from normal text, they were typeset in a different font, using actual
ancient glyphs - again, according to unicode this is a typesetting
change, not an encoding one (it is cyrillic all the way)

I am really not sure if Unicode went the right way, I feel the ability
to display a Chinese name in a Japanese document using Chinese glyphs
(or vice versa) is something that should not be given up...

perhaps it should consider them to be different scripts with different
encodings, but when would it stop? Making italics, boldface etc.
different characters?

You cannot display all of them at the console anyway.
This is for the future.
As for X11, fonts are being rapidly developed.

It was there, at the end.

Maybe.. but let's not overcomplicate things :-)

You do not know what is a particular font... one of
(traditional|simplified)C,J,K, or the full font name?

not really, since ASCII cannot be used to display the particular language
(take Slovak or Russian). A more appropriate example from history
is the war between EBCDIC, ASCII and other proprietary encodings...
thank god one and only one encoding won. The situation repeats itself,
we have 2 competing encodings in Slovak, 3 in Russian.. and if
we want one of them to win, why not make the winner Unicode, which has the
indisputable[1] advantage of being unified for the whole world?

[1] of course, problems with CJK remain and have to be addressed

and that is something terribly needed today, with this
world wired together.
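
To make the "competing encodings" point concrete, here is a small sketch.
The message does not name the three Russian encodings, so KOI8-R, CP1251
and ISO-8859-5 are an assumption chosen for illustration, as is the sample
word.

```python
SAMPLE = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Privet", Russian for "hello"
LEGACY_ENCODINGS = ("koi8_r", "cp1251", "iso8859_5")  # assumed candidates


def legacy_bytes(text: str) -> dict:
    """Encode the same text in each legacy encoding."""
    return {name: text.encode(name) for name in LEGACY_ENCODINGS}


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Convert bytes from a known legacy encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")
```

The same word yields three different byte strings, so a reader has to know
which encoding was used; after conversion all three become the same UTF-8
bytes.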

#99933#30
Date:
2001-06-11 16:34:40 UTC
From:
To:
Thanks.

I don't think so.

However, JIS X0208 implies a japanese character set and the
japanese language, while unicode indicates no such thing.

I disagree.  The Han Unification issue is more like the difference
between the latin and the italic character sets.  Yes, many characters
are similar, however there are also some characters which are unique to
each representation.

Also, Unicode does include Fraktur characters.

And, this could be rectified -- with Unicode 3.1, they have the code
space to represent each major representation of the character set.

Unicode already does that.  Take a look at the mathematical alphanumeric
symbols [1D400-1D744].  For example:
1D400 MATHEMATICAL BOLD CAPITAL A
1D41A MATHEMATICAL BOLD SMALL A
1D434 MATHEMATICAL ITALIC CAPITAL A
1D44E MATHEMATICAL ITALIC SMALL A
1D468 MATHEMATICAL BOLD ITALIC CAPITAL A
1D482 MATHEMATICAL BOLD ITALIC SMALL A
1D49C MATHEMATICAL SCRIPT CAPITAL A
1D4B6 MATHEMATICAL SCRIPT SMALL A
1D4D0 MATHEMATICAL BOLD SCRIPT CAPITAL A
1D4EA MATHEMATICAL BOLD SCRIPT SMALL A
1D504 MATHEMATICAL FRAKTUR CAPITAL A
1D51E MATHEMATICAL FRAKTUR SMALL A
1D538 MATHEMATICAL DOUBLE-STRUCK CAPITAL A
etc. etc.
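
The names in this list can be looked up programmatically. The sketch below
uses Python's unicodedata module, which is a present-day convenience and not
something the thread relied on. It also shows that compatibility
normalization (NFKC) folds these symbols back to a plain letter. The helper
name describe is made up.

```python
import unicodedata


def describe(code_point: int) -> tuple:
    """Return (character name, NFKC-normalized form) for one code point."""
    char = chr(code_point)
    return unicodedata.name(char), unicodedata.normalize("NFKC", char)


if __name__ == "__main__":
    for cp in (0x1D400, 0x1D434, 0x1D504, 0x1D538):
        name, folded = describe(cp)
        print(f"U+{cp:05X} {name} -> {folded!r}")
```

Whether such folding is appropriate depends on context: it discards exactly
the distinction these code points were introduced to carry.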

console vs. x is not a character set issue.  Note that console has
other limitations (fixed width, unidirectional).

For currently relevant policy it matters what actually works.

I'm not sure I understand this question (I don't know enough about
oriental languages and fonts to give a full answer in any event).

latin-1 doesn't solve this problem so that's a non-issue.

EBCDIC vs. ASCII wasn't about supported languages.

I agree.

However, Unicode is not a mature standard, so we need to be careful in
places where it would cause problems.

Thanks,

#99933#35
Date:
2001-06-11 17:20:21 UTC
From:
To:
No, because latin (upright) and italics are used interchangeably,
whereas fraktur carries an implicit connotation of the language used -
just like different glyphs for unified CJK charset.

but in mathematical symbols - that is a completely different beast

if only, instead of talking about how bad unicode is, they started
working on improving it (duck, run :-))

the reason and purpose of these characters is quite different
from "base" unicode characters

of course. That's why my proposal is very mildly worded and
gives a lot of freedom to maintainers to decide what charset they
want.

well, would you indicate just "this README needs japanese unicode font"
and the user has to figure out by himself what that is,
or "this README needs -misc-fixed-*-*-*-ja-*-*-*-*-*-*-iso10646-1"
and the user is fubar when he does not have that font?

true, but the mess in encodings was quite comparable to what
is there today outside of the Latin-1 world.
And the peace ASCII brought could be compared
to the peace that (hopefully :-)) unicode will bring one day.

Of course. Nobody is talking about compulsory switching
to unicode _right now_.

#99933#40
Date:
2001-06-11 17:48:24 UTC
From:
To:
I'm sorry.  Not italics, but Old Italic.  U10300-U1032F.

This includes letters like U10308 OLD ITALIC LETTER THE (a circle
with an X in it) as well as letters like U10301 OLD ITALIC LETTER BE
(essentially the same as a capital roman B).

Here, we could assume a common history, and define a map which relates
many of the characters.. much as has been done with Han Unification.

Please explain why it matters to the reader whether the letter A is
classified by the unicode consortium as mathematical [or not]?

I don't have the technical skill nor the political connections to properly
contribute to the unicode consortium.  I can, however, point out major
problem areas, and I like to think of that as valuable [at least to
Debian -- I like to think that the members of the Unicode Consortium
are already aware of these problems].

The point is that unicode already does support the things you were
suggesting as more unreasonable than indicating oriental language.

Agreed.

I think "needs japanese unicode font" might suffice.  Perhaps a package
name which includes that font would also be good.  An X font spec would,
of course, be necessary if you wanted a program to "just work".

It depends on context.

I'll accept your analogy.  (In the name of peace :).

Thanks,

#99933#45
Date:
2001-06-11 17:54:53 UTC
From:
To:
Raul Miller <moth@debian.org> writes:

IMHO, a better mechanism are Unicode 3.1 language tags, see:

http://www.unicode.org/unicode/reports/tr27/#tag

#99933#50
Date:
2001-06-11 18:08:49 UTC
From:
To:
Which says:

   The characters in this block provide a mechanism for language tagging
   in Unicode plain text. However, the use of these characters is strongly
   discouraged. The characters in this block are reserved for use with
   special protocols. They are not to be used in the absence of such
   protocols, or with any protocols that provide alternate means for
   language tagging, such as HTML or XML.

Which implies that this mechanism isn't useful for representing different
languages in the same document.  That, instead, it's logically equivalent
to a MIME declaration of the document's language.

Maybe, in the future, the Unicode Consortium wants to change the standard
so that this mechanism can be used to represent multiple languages within
the same document.  But that's not the current standard.

#99933#55
Date:
2001-06-11 22:59:43 UTC
From:
To:
Because in a mathematical equation, a "script" A, for instance, is
semantically distinct from a latin capital A.  Fundamental, basic
information is lost without a distinction between these characters.

In text, italics or scripted letters for emphasis or whatever are stylistic
markup, not semantic distinctions.  For instance, people who chat with me
on IRC can deduce my meaning whether or not I elect to use bold and/or
inverse text, and in fact that's why people get yelled at when they do it.

#99933#60
Date:
2001-06-12 00:39:25 UTC
From:
To:
You're telling me why the context matters.  You're not telling me why
the unicode naming of the code points matters.

If the reader sees "Branden", why should it matter whether any underlying
code points were designated by the consortium as mathematical?  If the
reader sees A-B, why should it matter whether any underlying code points
were not designated as mathematical by the consortium?

#99933#65
Date:
2001-06-12 03:27:43 UTC
From:
To:
Why are you CC'ing me?  Are you interested in having a discussion of these
issues, or just in provoking me by filling my inbox?

#99933#70
Date:
2001-06-12 03:43:10 UTC
From:
To:
indicating otherwise.

Probably not the best reason, but you did ask.

FYI,

#99933#75
Date:
2001-06-12 04:45:14 UTC
From:
To:
-policy when discussing pending proposals, which are assigned bug numbers,
remember?  That way seconders and people who later want to consult the
"legislative record" regarding the adoption of a policy proposal can
easily look the information up.

Either the BTS should be enhanced, or you should learn to remember that I
don't like private CC's on mails to lists I read, like this one.  Except on
this list, you don't have to remember because I provide a handy mnemonic:

Mail-Copies-To: nobody
X-No-CC: I subscribe to this list; do not CC me on replies.

This is hardly the first time I've brought this up.  Ignorance is no excuse.

#99933#80
Date:
2001-06-12 06:38:21 UTC
From:
To:
Ok, we were talking about two different things

because a mathematical letter is different from a "normal" letter.
They might look alike, but (depending on typography), often
do not.

well, this was not aimed at you :-)

It does not. Bold mathematical symbols are quite different
from bold text characters. MATHEMATICAL BOLD CAPITAL A
has a very different meaning than MATHEMATICAL ITALIC CAPITAL A
(e.g. one denotes a variable, the other a vector or matrix)

You can make a text bold, and meaning will remain.
If you make a mathematical expression all bold, it will
have a completely different meaning.

And, there is no such letter as
MATHEMATICAL BOLD CYRILLIC CAPITAL LETTER A, since
cyrillic letters are normally not used in a mathematical context.
Yet, in your favourite typesetting software, you are able
to write boldface cyrillic (since it is again a typesetting
issue, not an encoding one)

Well, personally, I could survive without these mathematical chars in
unicode, but neither do I have any objections to using them.

because if code points are mathematical, I parse it as
B \times r  \times a \times n \times d \times e \times n

#99933#85
Date:
2001-06-12 07:07:25 UTC
From:
To:
Of course, this doesn't prevent other uses.  But you're right, that only
a limited selection (e.g. not Han) of characters enjoy bold code points.

So?  Let's imagine you're composing an html document.  What's to prevent
you from wrapping a mathematical alphanumeric character with <b></b>?

But if the context is not mathematical, how can you tell that mathematical
code points are used?

If I say xy-2yz=0, and I don't use mathematical characters, why would
you not interpret that as indicating multiplication?

#99933#90
Date:
2001-06-12 08:41:14 UTC
From:
To:
...

that is a different kind of "boldness", used to emphasise.
Bold mathematical symbols are different symbols from those not bold.
Mathematical symbols enclosed in <b></b> are just emphasised normal
mathematical symbols, not bold mathematical symbols

I cannot, therefore there are special mathematical characters
to distinguish it.

because I would interpret it as a comparison in some kind of
programming language, the one that allows variables to begin
with a digit.

#99933#95
Date:
2001-06-12 09:22:23 UTC
From:
To:
Not bold, agreed, but Sha is used for the Shafarevic-Tate group in
number theory, and I think it's also sometimes used in applied
mathematics for certain interesting functions used in Fourier
analysis.  I can't think of any other examples offhand, though.

   Julian

#99933#100
Date:
2001-06-12 09:41:58 UTC
From:
To:
Now let's imagine that a person is actually using this document.

How can this person tell which kind of boldness is in use?

Let's imagine that a person is actually reading this document.  What
difference does it make to this person that the Unicode Consortium has
named the code point using the word MATHEMATICAL?  How would the
person even find out about this?

[I guess they could do view source on the html document, then cut and
paste an individual character into some search dialog box which might
then be used to locate the character and (by association) the name of
the character.  But that seems a bit useless.]

How would you know that the Unicode Consortium hadn't used the word
MATHEMATICAL to describe the code points of those characters?  If you
didn't know about the code points which have MATHEMATICAL in the name (for
example, last week), would you have had a different interpretation of this
expression?  If there was surrounding text describing the character of the
variables x, y, and z, would you insist on this contrived interpretation
of yours?

If we assume that the user is using debian software which merely displays
the characters (and doesn't actually inform the user of the unicode names
for the underlying code points), would there be any particular reason
for the user to interpret some characters as algebraic variables and
others as word forming characters in some unknown programming language
(for some reason other than knowledge of the unicode code point numbers)?

Thanks,

#99933#105
Date:
2001-06-12 10:41:02 UTC
From:
To:
mathematical symbols could use a different typesetting convention
(see latex)

a decent html browser would render mathematical symbols differently.
But, of course, it need not, depending on the font used.

I do not insist on it... but as you can see, without a context
anything can be misinterpreted, and special symbols are just a tad
helpful.

this is just nitpicking...
unicode is full of characters having the same glyphs
how do you distinguish between LATIN CAPITAL LETTER A,
CYRILLIC CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA ?
they look the same in an upright font, but if you select a cursive
font to view the document, they will look different.
The same with mathematical symbols... they might look the
same with one font, but if you prefer slanted font, you suddenly
see a difference... (or vice versa).

<utopia>
If you select a sentence in your favourite word processing software,
and apply LOWERCASE function, you suddenly see those 3
indistinguishable letters turn into 3 different lowercase letters
(ok, 2 in this case). And surprise, characters in an equation
were NOT lowercased, since your software was clever enough to know
it should not lowercase mathematical symbols automatically.
Neither would it run them through spellchecker.
</utopia>
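
The lowercase behaviour described above can be observed with ordinary string
operations. This is only an illustration in Python, not a claim about how
any word processor is implemented. The helper name lowercase_names is
invented.

```python
import unicodedata

# LATIN CAPITAL LETTER A, CYRILLIC CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA
LOOKALIKES = "\u0041\u0410\u0391"
MATH_BOLD_A = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A


def lowercase_names(text: str) -> list:
    """Return the Unicode name of each character after lowercasing."""
    return [unicodedata.name(ch) for ch in text.lower()]
```

The three capitals, which look alike in an upright font, lowercase to three
distinct code points, while the mathematical letter has no case mapping and
is left unchanged.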

#99933#110
Date:
2001-06-12 16:29:51 UTC
From:
To:
But this should depend on mathematical context, not code point.

Or are you suggesting that latex shouldn't render ascii characters
using mathematical typesetting conventions?

Again, this should depend on whether a mathematical context is
in use (e.g.  mathematical equation).  Unless (to use the same
example again) you wish to prohibit the use of ascii characters
in mathematical equations.

I'll agree that context is important.  I'll agree that special symbols
are important.

I disagree with the idea that special symbols may only be used in
certain contexts.  That's like saying that HTML should only be used
to describe the structure of a document and not its appearance --
fine language from a standards body, but with little to do with
how the standard is actually used.

Exactly.

Thanks,

#99933#115
Date:
2001-06-12 18:52:44 UTC
From:
To:
Which is exactly correct.  It does not appear that Raul is willing to
discuss this issue rationally, therefore I will content myself with
opposing his position.

Hopefully the rest of the technical committee has greater respect for
thoughtfully-scoped, coherent and directed standards, and not phenomena
like Visual Basic, which was famously derided as being a language designed
by "focus group".

#99933#120
Date:
2001-06-12 19:35:31 UTC
From:
To:
If this were true, we wouldn't have emerging standards such as
XHTML to rectify the problem.

This isn't a technical committee issue, nor is it about visual basic.
At least, not currently.

However, I am sorry for allowing this to devolve into a discussion of
tangential points.

                            * * * * *

It was just pointed out to me by a Unicode guy, that XML has an xml:lang
attribute which can be used on any xml tag.

If we structure our handling of multi-language documents based on this
aspect of XML (and use unicode tr27 to support this same functionality
in non-XML documents) we can address the "unicode doesn't have a way of
specifying the language" issue.

But that still leaves us with the "JIS has characters which aren't in
Unicode" issue.  [If that's an actual issue.]

#99933#125
Date:
2001-06-24 19:56:26 UTC
From:
To:
Here is the proposal with typos and mistakes fixed, with an added
paragraph about possible use of other encodings. I left out
the requirement to specify a font needed to view the
documentation, since IMHO that is an overcomplication and
unnecessary.
--- policy.sgml-old	Fri Jun  1 11:40:16 2001
+++ policy.sgml	Thu Jun  7 13:31:09 2001
@@ -1653,6 +1653,15 @@


       </sect>
+
+      <sect id="controlencoding"><heading>Encoding of control files</heading>
+	<p>
+            If, for whatever reason (such as upstream author's or maintainer's
+            names, foreign language package description and similar), you need to
+            use characters outside 7 bit ASCII range in control files, these
+            characters should be encoded using UTF-8 encoding.
+	</p>
+      </sect>
     </chapt>

     <chapt id="versions"><heading>Version numbering</heading>
@@ -2276,8 +2285,16 @@
 	    all.
 	  </p>
 	</sect1>
+
+	<sect1><heading>Character set of <tt>debian/changelog</tt></heading>
+
+	  <p>
+            Character set of <tt>debian/changelog</tt> should be either pure ASCII, or UTF-8.
+	  </p>
+	</sect1>
       </sect>

+
       <sect id="srcsubstvars"><heading><tt>debian/substvars</tt>
 	  and variable substitutions	  </heading>

@@ -7370,6 +7387,26 @@
 	  from <tt>/usr/share/doc/<var>package</var>/</tt>.
 	</p>

+	<p>
+          Documentation of debian packages in text format, if written in
+          language requiring characters outside of 7-bit ASCII range,
+          should use either well-established encoding for the given
+          language <footnote>such as ISO-8859-2 for some central- and eastern
+          europian languages, KOI8-R for Russian, etc.</footnote>, or UTF-8
+          encoding.
+          Maintainers are being encouraged to use UTF-8, having in mind
+          the general debian migration toward unified character encoding.
+	</p>
+
+	<p>
+          Original upstream documentation, if in encoding other than UTF-8
+          or the well-established encoding for the particular language,
+          should be converted either to UTF-8 or to the well-established
+          encoding. Choice between UTF-8 and other encoding is left to the
+          maintainer's discretion, however, in a single package, all the
+          documents written in a particular language should share the same encoding.
+	</p>
+
+	<p>
+          Package may (at the discretion of the maintainer) include documentation
+          files in other encodings, if they are present also in canonical encoding,
+          and if the encodings used are clearly marked.
+	</p>
+
       </sect>

       <sect id="usrdoc">
@@ -7440,6 +7477,18 @@
 	  Other formats such as PostScript may be provided at the
 	  package maintainer's discretion.
 	</p>
+
+        <p>
+          HTML documents, if in encoding other than <tt>us-ascii</tt>, should
+          have in their header an appropriate META tag describing
+          the used encoding.
+
+          Example:
+          <example>
+            <META HTTP-Equiv="Content-Type" CONTENT="text/html; charset=UTF-8">
+          </example>
+        </p>
+
       </sect>

       <sect id="copyrightfile">
@@ -7555,6 +7604,24 @@
 	  changelog, then the Debian changelog should still be called
 	  <tt>changelog.Debian.gz</tt>.</p>
       </sect>
+
+      <sect id="charset">
+	<heading>Deafult character set</heading>
+
+	<p>
+          Names of maintainers, upstream authors and other data in
+          packages' descriptions and related debian data files (such as
+          <tt>debian/changelog</tt>, <tt>debian/copyright</tt>,
+          <tt>debian/control</tt>), as well as in English language
+          documentation, should be either transliterated or
+          transcribed to ASCII, or used in UTF-8 encoding at the
+          discretion of the maintainer. However, for names
+          in scripts based on non-latin alphabets, ASCII (or suitable
+          latin-script) version should be provided along with original
+          name.
+        </p>
+       </sect>
+
     </chapt>

     <appendix id="pkg-scope">

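
A rough sketch of how the META declaration required by the HTML paragraph of
the proposal could be read back out of a document, using Python's standard
html.parser module. The class and function names are invented for this
illustration, and only the http-equiv form shown in the proposal's example is
handled.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the charset from the first Content-Type META tag seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "content-type":
            return
        for part in attr.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                self.charset = value.strip().lower()


def declared_charset(html_text: str):
    """Return the declared charset in lowercase, or None if there is none."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

A document that contains non-ASCII bytes and for which this returns None
would be a candidate for the fix the proposal asks for.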
#99933#130
Date:
2001-07-05 05:55:24 UTC
From:
To:
Raul Miller:

I don't know where you got this impression, but it's wrong. Read the
document. It introduces a TAG START character, ASCII-equivalent tag
characters, and a TAG CANCEL character. <EN-US>You can label text like
this.<DE-DE>Ja, du kannst.<TAG CANCEL>
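
The tagging scheme sketched in that example can be written out with the
actual code points. As far as I know, the character the message calls TAG
START is named LANGUAGE TAG (U+E0001) and TAG CANCEL is CANCEL TAG
(U+E007F), with the tag letters mirroring ASCII at an offset of 0xE0000. The
helper names below are invented, and this is a sketch, not a recommendation
to use these characters.

```python
LANGUAGE_TAG = "\U000E0001"
CANCEL_TAG = "\U000E007F"
TAG_OFFSET = 0xE0000  # tag characters mirror printable ASCII at this offset


def language_tag(language: str) -> str:
    """Spell a language identifier such as 'en-US' in tag characters."""
    if not all(0x20 <= ord(ch) <= 0x7E for ch in language):
        raise ValueError("language identifier must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(ch)) for ch in language)


def strip_tags(text: str) -> str:
    """Remove every character from the tag block, leaving the visible text."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)


SAMPLE = (
    language_tag("en-US") + "You can label text like this."
    + language_tag("de-DE") + "Ja, du kannst." + CANCEL_TAG
)
```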

Because in theory, MATHEMATICAL ITALIC CAPITAL A won't be available on every
keyboard, nor in every font. Any software that translates ordinary,
non-mathematical italic characters to MATHEMATICAL ITALIC's would be
non-conformant to the Unicode standard. They shouldn't obey case mappings,
and HTML markup and the like probably won't and shouldn't work on them.
There's no way most people will be able to enter them without setting up
fairly unusual software. As a reader, you probably couldn't tell if my
message was in KOI8-R and that I was using the Cyrillic lookalike characters
wherever possible, but that doesn't make it more correct or more likely.

Japanese people can travel in China and use 'Japanese' ideographs to communicate
with Chinese people who have no knowledge of Japanese. That's an indicative
sign that the characters being used are fundamentally the same characters.
Yes, there are characters that are written differently and unique
characters - such is true about two languages that use the Latin script. I'm
not arguing that all the unifications of individual characters were correct,
but the fundamental concept of unification is correct. (It's interesting
that it's almost always the Japanese that complain about the unification -
the Koreans and Chinese, for the most part, seem to find the variations
introduced by unification to be normal. One of the main forces behind
unification was Chinese, with GB 13000)

Actually, it can't be rectified. The code space has existed for almost half
a decade - the only change is that it's being used now. But part of the
fundamental nature of Unicode is the unification of CJK characters. You can
not change the meaning of 50,000 characters in the Unicode standard and
invalidate all Japanese/Chinese/Korean (pick two) data in Unicode, any more
than you can introduce case up and case down control characters into ASCII
and use the space of lower case characters for something else.

What? It's not mature? The majority of the world's desktops use, or will
soon use, Unicode, as it's fundamental to Mac OS X and Windows NT/2000/ME.
It's been around for ten years now, and has reached the point where it's
fundamentally stagnant. Sure, there will be a few more ideographs, a few
more mathematical characters, a few more obscure/dead/minority scripts
encoded but Unicode 3.1 is basically what Unicode 5.9 will be. The Unicode
people are committed to not breaking backward compatibility, and with the
wealth of support put by many of them into Unicode, they can't afford to
change anything major. It may be wrong, but it's mature.


 > But that still leaves us with the "JIS has characters which aren't in

All the characters from JIS X 0208 and JIS X 0212 are in Unicode (they were
one of the original primary sources of characters for Unicode). JIS X 0208
is the character set used in ISO-2022-JP, and I believe SJIS and EUC-JP use
the same set. JIS X 0213 should be completely included in Unicode, as the
same Japanese body that does JIS X 0213 is the ISO 10646 liaison. I know that
a number of what Unicode would consider variants of preencoded characters
were encoded in Unicode for compatibility with JIS X 0213.

Radovan Garabik:

When would this be necessary? The appropriate fixed font should get picked
by locale (it's in xterm now; I don't know if the Debian unstable xterm has
it, or if it will be in XFree 4.1 or 4.2). So the issue is only when a user
is using an inappropriate choice of font (which we can't save a user from)
or is reading a Chinese readme in a Japanese locale or vice versa. If this
is unreadable, the knowledgeable user would know to switch fonts. At worst,
it's no worse than what we have now with having to change locales and fonts
to read a Chinese readme in a Japanese locale.

#99933#135
Date:
2001-07-05 17:37:36 UTC
From:
To:
Raul Miller:

Except that you're not supposed to use this mechanism with HTML, and
unlike XML, in HTML the language can only be identified in the mime
header.

However, if unicode can act as a super set for every character set we
currently use then we can ignore this problem for the purpose of deciding
when to migrate.

Do you have any idea whether the problems identified at
http://support.microsoft.com/support/kb/articles/Q170/5/59.ASP
have been resolved?

I've not been able to find anybody knowledgeable about this issue.

I don't know what you mean.

Prior to Unicode 3.1 the code space was 16 bits.  With Unicode 3.1
the code space has been expanded to 21 bits.

In principle, at least, with the additional code space unicode can have a
1-to-1 mapping with the characters represented in the shift jis standards.

Once unicode can act as a super set for every character set we currently
support, we can use it as such.  Until then, we can't.

Thanks,

#99933#140
Date:
2001-07-06 03:36:25 UTC
From:
To:
Raul Miller <moth@debian.org>

That's an HTML problem. Does Debian use enough mixed language HTML to
actually make that a problem? If so, it's not a problem XHTML has.

Are they a problem for us? Windows Code Page 932 may or may not correspond
to anything that we care about. (At a glance, at least one of each pair that
both correspond to the same Unicode character is not in the real JIS X
0208.) The problems have not been resolved; they are inherent in the way
Unicode was designed. Needless to say, not all the choices made for Unicode
were the same as those made for CP932, and that manifests in the fact that
characters do not always correspond one to one between the two standards.

NO. Since Unicode 2.0, the code space has been 21 bits. The ONLY thing that
Unicode 3.1 did, is put characters above U+FFFF. It did not change the
fundamental structure of Unicode in the least.
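
To make the point about characters above U+FFFF concrete, here is a
minimal sketch of a UTF-8 encoder. The function name utf8_encode is made
up for this illustration and does not come from any library; the sketch
stops at U+10FFFF, the top of the Unicode code space. It shows that such
characters simply take a four-byte form, with no change to how the lower
ranges are encoded.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point (U+0000..U+10FFFF,
 * excluding surrogates) as UTF-8 into out[], which must hold 4 bytes.
 * Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* two-byte form */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* three-byte form (rest of the BMP) */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* four-byte form: above U+FFFF */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the code space */
}
```

For example, U+20000 comes out as the four bytes F0 A0 80 80, while
U+00E5 stays the two bytes C3 A5.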

Unicode has a one to one mapping with the characters in JIS X 0208, the
basis for all Unix Japanese encodings. That it fails in completely encoding
some proprietary encodings is inevitable.

If Unicode were a super set for every character set that anyone needs to
support, it would be worthless and completely unusable. The creators also
realized that a perfect proposal, ignoring backward compatibility, would go
nowhere. Unicode is a carefully balanced compromise between the two
problems. However, if we currently support any character set well, it is
through a Unicode based glibc - I don't believe libc accepts the existence
of any character set that can't be mapped to Unicode. So arguably, yes,
Unicode is a super set for every character set we currently support well.

#99933#145
Date:
2001-07-06 08:23:42 UTC
From:
To:
There is no such thing as a MIME header in HTML.

Besides, HTML does include the lang attribute for most elements.  I wonder what
it's for if not for indicating the language.

#99933#150
Date:
2001-07-06 09:56:36 UTC
From:
To:
severity 99933 normal
retitle 99933 [AMENDMENT 06/07/2001] Encourage use of UTF-8 in documentation and clarify encoding issues
thanks

This proposal has 3 seconds (Arthur Korn, Roland Mas, Raul Miller).
Since it has already been discussed to death, I propose
one week of discussion (which ends on 13 July 2001).
I am aware of the oncoming policy freeze; if this does not make
it into woody's policy, it should be considered for inclusion
in the next release.

#99933#159
Date:
2001-07-06 12:37:43 UTC
From:
To:
If it's indeed the case that this is a CP 932 problem and not a shift JIS
problem, and if it's indeed the case that we don't support CP 932, then
I'll agree that this isn't a problem.

I stand corrected.

I didn't say for any character set that anyone needs to support.
I said for every character set we currently support.  I hope you see the
difference.  [And, as an aside, I should have said "for each character
set that we currently support" -- I understand that unicode doesn't need
to support mixed character set usage before we migrate.]

Assuming we're using glibc support (e.g. toupper()) for all those
character sets, I'll agree that you have a good point.

I stand corrected.

Thanks,

#99933#164
Date:
2001-07-06 13:38:51 UTC
From:
To:
I'm not intending to include any substantive changes to policy, only
"bug-fix" type proposals.

   Julian

#99933#169
Date:
2001-07-08 11:18:59 UTC
From:
To:
Hello 99933,

  I second this.

#99933#174
Date:
2001-07-08 20:05:40 UTC
From:
To:
With my Debian hat on, of course I see the difference. With my Unicode hat
on, there is no difference. Every small group and company has their own
character sets that they need supported, and Debian's just another group.
Note that Unix locales tend to preferentially use standardized character sets
(JIS X 0208, ISO-8859-*) which ISO 10646 had to superset completely.

If you have a recent version of locales installed, look in
/usr/share/i18n/charmaps, which has every character set we support for use
in iconv or locales. For actual locale charsets, look in /etc/locale.gen. If
you remove ISO-8859-* (which are all Unicode compatible) and remove UTF-8,
you're left with 11 charsets: cp1251, tis-620, koi8-r, koi8-u, euc-tw,
euc-jp, gb2312, gb18030, gbk, big5, and big5hks. 3 of these have problems:
euc-tw, big5 and big5hks. All three have characters that can't be reversibly
mapped to Unicode and back. euc-tw shouldn't be a problem, as its
irreversible mappings are due to duplication of an entire CNS plane of
characters, apparently due to an encoding quirk. big5 has some characters
mapped to private use segments; I don't know if this is because glibc
doesn't use Unicode 3.1 yet, or if that represents a private use segment in
big5 (the characters are contiguous), or if they haven't been encoded in
Unicode yet. (Unlikely, IMO).

#99933#181
Date:
2003-01-02 18:57:35 UTC
From:
To:
Hmm, I searched the policy bug list, I don't know how I missed those.
Probably my fault for using galeon-snapshot and expecting its search
function to work :)

#99324 isn't really a proposal, just a discussion.

#99933 goes a lot farther than #174982.  First of all, we can't even
suggest that people use UTF-8 in package control fields until all our
tools support it.  Right now it is just plain broken to put anything but
ASCII in them.

I also personally don't like how it recommends using a "well-established
encoding" or UTF-8.  I mean, that's basically saying nothing.  It
doesn't help applications at all, which will still be forced to guess
what encoding files are in.  In short, it doesn't improve the situation
at all.  I think policy should be silent on the encoding for most files,
until we can usefully say it will just be UTF-8.  Perhaps though policy
could *suggest* UTF-8, and mention that it is the preferred encoding.

I do like the HTML META tag suggestion, although in the case of XHTML,
it should be fine to use the encoding declaration of the XML
declaration, like in <?xml version="1.0" encoding="UTF-8"?>.

Yes, I think the time is getting closer.  But I wanted my proposal to be
small and simple, just a way for Unicode to get a foothold in policy,
which we can expand later.

#99933#186
Date:
2003-01-02 22:25:15 UTC
From:
To:
I have a counter-proposal to #99933, which I have attached.  I believe
it fixes the problems I raised with your proposal, and should also cover
some new areas (like filenames).  I also hopefully fixed James' issue
with the RFC link.

This patch supplants the one in #174982.  It is more ambitious than
#174982, but still does not introduce any "must"s, only "should"s or
weaker.

Opinions?

#99933#191
Date:
2003-01-03 16:45:39 UTC
From:
To:
And I am going to use UTF-8 for Maintainer: in my packages, once
I have a new stable mail address (and a new UTF-8 GPG alias)

well, the whole proposal was a compromise after a long and bloody
flamewar :-)

It does help users, though. Most users are strictly monolingual
(English does not count) and use the well-established encoding.

As my proposal does - there is just "should" everywhere, no "must"

Yes, this is fine.

If you manage to persuade relevant persons (Manoj?). Good luck :-)

#99933#196
Date:
2003-01-03 18:24:26 UTC
From:
To:
Yes, and it is fundamentally broken to do so, because our tools do not
support it.  Displaying it might happen to work on the maintainer's
machine, but it will probably fail in many more places around the world,
where people use terminals with a different native encoding type.

Please only use ASCII until the tools support it, and file bugs against
packages with control fields with characters not in ASCII.  Otherwise
you are just worsening the problem by adding yet another encoding to the
mix of ISO-8859-1, ISO-8859-2, and who knows what else is already there.

I understand that, but I think we can just avoid the issue of general
file encodings for now, and only work on particular bits like
distributed documentation and filenames.

How does it help users?  It's basically saying "the current broken
situation is OK, but you may also unbreak your files if you want".
Putting this in policy doesn't help anyone at all.  I mean, "well
established" alone is a very vague criterion.

Let me ask this another way; what change do you expect to happen by
saying that files may be in the "well established" encoding or UTF-8?
It would basically be validating the current practice, which I consider
broken.  Policy shouldn't endorse it.

I think a better approach is just for policy to be silent on the general
encoding issue, set up a general Unicode infrastructure, start pushing
UTF-8 where it is really needed (like filenames), and let the pressure
build.

Do you agree?

So far I haven't seen any objections...

#99933#201
Date:
2003-01-03 23:11:59 UTC
From:
To:
Hello,

Is this meant to apply to programs like "ls", "bash", "touch", and
"emacs"?  I imagine that the transition period could be a hard time
for users who (like me) use non-ASCII characters in file-names.

As I see it, the current (broken?) behaviour is to use the user's
locale setting (LC_CTYPE) to encode file names.  During the
transition period non-ASCII file names will have two possible
representations in the file system (LC_CTYPE vs. UTF-8).  I think
we should clarify the following points before introducing the above
into policy:

    1) Should interpretation of existing files' names as UTF-8
       be implemented before the encoding of newly created files'
       names is switched?

    2) How should already existing files with non-ASCII names
       be converted?

What do you think?
Jochen

#99933#206
Date:
2003-01-04 02:50:26 UTC
From:
To:
Yes.

That is probably true.  But we really have no other choice.  See below.

It appears so, and yes, this behavior is completely and fundamentally
broken.  If you have say a Chinese friend who logs onto your computer,
and he sets LANG to something like zh_TW.BIG5, then when he tries to
'ls' your files, it will completely fail.  Likewise, when you try to
look at his, it will not work at all.

Moreover, say the system administrator does something like 'find
/home'.  The resulting stream will be a mixture of ISO-8859-X and BIG5,
and impossible to reliably differentiate.  And of course the problem
doesn't just occur when you have a multiuser system; your Chinese friend
could send you a .ogg file named using BIG5, and your Latin 1 system
would simply fail to encode the filename.

And finally, having the encoding of filenames dependent on the current
locale often doesn't make sense even for a single user; what if you are
a software developer in an ISO-8859-1 locale, and you want to test the
Japanese translation of your software.  So you run it with
LANG=ja_JP.ISO-2022-JP or something to get the translations displayed.
As a side effect, all the filenames on your system will fail to work.

In summary, UTF-8 is the *only* sane character set to use for
filenames.  Major upstream software for Debian like GNOME is moving
towards requiring UTF-8 for filenames, and we should too.  See for
example:
http://www.gtk.org/gtk-2.0.0-notes.html

Microsoft Windows has used Unicode for filenames for a long time because
of issues like these.  MacOS also uses Unicode.

And like Tollef said, Red Hat 8 has already switched to defaulting to
UTF-8 for new systems.

I am not sure what policy can say here.  For people using filenames in
legacy encodings, perhaps policy could suggest that programs try to fall
back to the user's locale encoding, if the filename is not valid UTF-8.
This might become common practise, but I don't think policy should
require it.

Again, major chunks of upstream software which have Unicode support
(like GNOME), are *already* defaulting to interpreting filenames as
UTF-8 by default.  I am just trying to bring policy in line with best
practise in this regard.

There are lots of different options; we could have a package
'unicode-transition' in base which would convert all local filesystems,
or we could do it as part of a base-files upgrade.  But mainly, this is
a technical issue separate from policy, in my opinion.  We can hash out
those detailed plans separately from this proposal.

#99933#211
Date:
2003-01-04 11:10:28 UTC
From:
To:
On Jan 04, Colin Walters <walters@debian.org> wrote:

 >In summary, UTF-8 is the *only* sane character set to use for
 >filenames.
True, but does not work in reality for too many people, so this cannot
be made mandatory.

 > Major upstream software for Debian like GNOME is moving
 >towards requiring UTF-8 for filenames, and we should too.  See for
 >example:
This is false. GNOME does not require UTF-8, it's just a default.

#99933#216
Date:
2003-01-04 15:55:51 UTC
From:
To:
Colin Walters <walters@debian.org> writes:

Whether or not this is broken is debatable. It is the current status
quo, though, on a majority of systems. Breaking that willy-nilly is
not acceptable.

I'd prefer:

1. Programs are extended to handle UTF8 filenames iff LC_CTYPE is
   UTF8. Programs that right now cope with other charsets can keep
   this support if LC_CTYPE is set to any other value (even C).
   Filenames incompatible with the current locale must be handled
   reasonably.

Once this is implemented for a reasonable percentage of packages:

2. A UTF8 locale is made the default on new installations. For
   upgrades, scripts are provided to convert filesystem trees over to
   UTF8. Do a release.

3. Support for non-UTF8 charsets is deprecated, removed, or succumbs
   to bit rot.
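
As a rough sketch of what step 1 asks of a program, the test "LC_CTYPE
is UTF8" could be made with the POSIX calls setlocale() and
nl_langinfo(CODESET). The two function names below are invented for
this illustration, and the set of accepted spellings ("UTF-8", "utf8")
is an assumption about what systems report.

```c
#include <langinfo.h>
#include <locale.h>
#include <strings.h>

/* Hypothetical helper: does a codeset name, as returned by
 * nl_langinfo(CODESET), denote UTF-8?  Compares case-insensitively
 * against the spellings "UTF-8" and "UTF8". */
int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL &&
           (strcasecmp(codeset, "UTF-8") == 0 ||
            strcasecmp(codeset, "UTF8") == 0);
}

/* Hypothetical helper: adopt LC_CTYPE from the environment and report
 * whether its character set is UTF-8. */
int locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;               /* unsupported locale: treat as non-UTF-8 */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A program following this scheme would call locale_is_utf8() once at
startup and keep its existing single-byte code path whenever the answer
is no.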

Yeah, and the Gnome2 file dialog completely ignores my latin1
filenames. That's best practise?

Anyway, for my daily living Gnome2 is a quite irrelevant chunk of
software. aterm, zsh, xemacs, mozilla are much more important. Only
half of these support UTF8 right now AFAIK. I'd guess from the
80%-software in Debian less than 50 % handle UTF8.

#99933#221
Date:
2003-01-04 16:42:02 UTC
From:
To:
Note that in my proposal UTF-8 filenames are only mandatory (a "must")
for files *included directly* in Debian packages or created by
maintainer scripts.  Since I don't think we have any packages including
anything but ASCII filenames, this will not change a thing.  UTF-8
filenames for programs in general is just a "should", to be eventually
upgraded to a "must" when we have even more support in major programs.

But now is the time to get a strong statement of support for Unicode in
policy, and start fixing the remaining programs.

That's true, you can set a G_BROKEN_FILENAMES variable.  But we should
not expect upstream authors to implement such hacks in general.
G_BROKEN_FILENAMES is exactly what its name implies; a workaround for a
broken system. Plus, can you imagine setting a variable for each of the
different programs you use?

Other operating systems like Windows and MacOS have had this problem
solved for a long time.  We need to do it.

#99933#226
Date:
2003-01-04 17:10:42 UTC
From:
To:
I don't think so.  I have put forth many real-world scenarios in which
using national charsets for filenames simply breaks, in ways that are
basically impossible to fix.  You may be able to get away with using a
national charset on a machine where everyone speaks the same language,
and never interacts with speakers of another language, but that's about
it.

What *is* debatable is when and how to make the transition, which is
what we're doing now.

Again, my policy proposal does *not* (I am 95% sure) create any new RC
bugs.  The only "must" is for filenames actually included in packages.

I actually wrote another lintian patch for this (attached) which I ran
over my small sample of .debs, and found no new bugs.  It requires my
patch for GNU tar; see:
http://bugs.debian.org/175089

Using UTF-8 for programs in general, in my patch, is just a "should".

First of all, there is no need for 'if and only if'.  Programs can
always try to decode filenames in UTF-8, and if that fails, then try the
locale's charset.
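
A minimal sketch of the first half of that test, assuming the filename
is available as a byte buffer: the function name is_valid_utf8 is made
up for this example, and a real program might prefer an existing
library routine instead.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if the n bytes at s form well-formed
 * UTF-8, else 0.  Rejects overlong forms, surrogates, truncated
 * sequences and values above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i];
        size_t extra;               /* continuation bytes expected */
        unsigned long cp, min;      /* decoded value, smallest legal value */

        if (b < 0x80) { i++; continue; }    /* plain ASCII byte */
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;                      /* stray continuation or bad lead */

        if (n - i <= extra)
            return 0;                       /* sequence runs off the end */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                   /* not a 10xxxxxx byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += extra + 1;
    }
    return 1;
}
```

If this returns 0 for a name, the program would treat the bytes as
being in the locale's charset. An ISO-8859-1 name such as the six bytes
"bokm\xe5l" fails the check, because 0xE5 announces a three-byte
sequence that never follows.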

Would this make you happy if I modified my policy proposal to do this?
Again, note this part of my proposal is still not a "must".  Your
programs will not get RC bugs for a lack of UTF-8 support for filenames.

I agree with this wholeheartedly.

Well, you might have to set G_BROKEN_FILENAMES.  But this is the whole
reason we are switching to UTF-8; so programs will not have to deal with
the nightmare of recoding filenames!  If you feel strongly however you
could lobby the GNOME maintainers to default to falling back
automatically to the national encoding if UTF-8 decoding fails.

I've noticed that UTF-8 sometimes makes zsh unhappy, but other than that
basically all the software I use every day (evolution, gnome-terminal,
GNU Emacs (well, from CVS), nautilus, and galeon) supports UTF-8
filenames.

#99933#231
Date:
2003-01-04 18:15:04 UTC
From:
To:
* Colin Walters

| Note that in my proposal UTF-8 filenames are only mandatory (a "must")
| for files *included directly* in Debian packages or created by
| maintainer scripts.  Since I don't think we have any packages including
| anything but ASCII filenames, this will not change a thing.

You are wrong in this regard.  inorwegian includes a file called
bokmål (which, ISTR, has symlinks for both ISO8859-1 and UTF8)

#99933#236
Date:
2003-01-04 19:14:17 UTC
From:
To:
Hm, I don't see the symlinks for UTF-8.  Anyways, such an approach with
symlinks would not really solve the problem.  Since the files in
question appear to be only used internally by ispell, it should not be
difficult to recode the filenames in UTF-8; the only program that would
have to be changed is ispell.

Also, inorwegian seems to have scripts which assume an ISO-8859-1
environment:

Setting up inorwegian (2.0-9) ...
Malformed UTF-8 character (unexpected end of string) at /usr/share/perl5/Debconf/Client/ConfModule.pm line 125, <STDIN> line 8.

#99933#241
Date:
2003-01-04 21:33:42 UTC
From:
To:
Colin Walters <walters@debian.org> writes:

Don't you think this is a common case? I'd even say more common than
your scenarios. At least common enough that it should be acknowledged.

I am not concerned about RC bugs in my or others' packages. My point
is that the ways things have worked up to now will no longer work, and
this can be avoided.

This will invariably interpret some non-ASCII non-UTF8 filenames wrong.

But it will condone or even suggest broken behaviour like Gnome2's.

Considering old standards broken because a newer one exists is just
ridiculous.

I still think taking LC_CTYPE unconditionally as a hint is the best
solution. People who don't care (e.g. USians) are happy with any
solution. People that have it at an older encoding get some slack.
People like you should already have it at UTF8 and get all the fun
right away.

No argument there.

That's quite an understatement. The commandline editor can't deal with
multibyte characters in any way. So for example entering an o umlaut
and then deleting it gets you in trouble, because zsh does not handle
the two byte sequence as one character.
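
To illustrate what a line editor has to do differently, here is a small
sketch of "delete the last character" for a UTF-8 buffer. The function
name is invented for this example and has nothing to do with zsh's
actual code; it assumes the buffer holds well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: return the new length of buf after deleting its
 * last character, assuming buf[0..len) holds UTF-8.  Continuation bytes
 * have the bit pattern 10xxxxxx, so step back over them until the lead
 * byte of the character has been removed as well. */
size_t utf8_delete_last(const char *buf, size_t len)
{
    if (len == 0)
        return 0;
    len--;                                  /* drop the final byte */
    while (len > 0 && ((unsigned char)buf[len] & 0xC0) == 0x80)
        len--;                              /* still inside the character */
    return len;
}
```

With the three bytes "o", 0xC3, 0xB6 (an "o" followed by an o umlaut)
the result is 1, so both bytes of the umlaut go away together. An
editor that only does len-- leaves the orphaned 0xC3 behind.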

FWIW, I am quite content with mandating the contents of some files as
UTF8. We may want a BOM, at the start, though.

#99933#246
Date:
2003-01-04 21:45:13 UTC
From:
To:
Previously Colin Walters wrote:

Right. I'm tempted to make the next dpkg release abort if people try
that.

Wichert.

#99933#251
Date:
2003-01-04 21:46:25 UTC
From:
To:
Previously Colin Walters wrote:

I second this proposal.

Wichert.

#99933#256
Date:
2003-01-04 22:27:05 UTC
From:
To:
I agree, it is common enough.  But previously people had no choice but
to use a broken hack; now we have a solution.

It only "worked" for specific regions, and specific cases.  We should of
course try to ensure that for people using filenames with legacy
non-ASCII encodings, the transition is as painless as possible.  I fully
understand and agree with that.

That may be true.  However, UTF-8 was designed so that the chance of it
being interpreted as another charset was small, and decreasingly small
as the length of the input increases.  See RFC 2279.  That's why it is a
good strategy to try decoding as UTF-8 first; and if that fails, fall
back to the locale's encoding.

The whole point of this proposal is to move Debian more in line with
major chunks of upstream software like GNOME 2.  If you disagree with
their behavior, please suggest an alternative to solve all the problems
I named above.

The old "standards", such as they were, were a workaround for the lack
of Unicode support.  Now that we have it, we should stop using the
workaround.

No.  Even only English-speaking programmers like me are tired of dealing
with the multitude of national encodings, and having to make our
programs do stuff like unreliable charset autodetection.  ISO-8859-1 and
BIG5 are not solutions for filenames, they are workarounds.

I'm not sure what you are saying here.

Ok.  Well, this should not be impossible to fix, I hope.

Again, there is no mandate involved in my policy proposal.  It is all
just "should"s, except for file names.

We don't need one for UTF-8.  That's another one of the great things
about it.

#99933#261
Date:
2003-01-04 23:16:19 UTC
From:
To:
No, just difficult to fix without a nasty kludge.

#99933#266
Date:
2003-01-05 00:22:45 UTC
From:
To:
* Colin Walters

| On Sat, 2003-01-04 at 13:15, Tollef Fog Heen wrote:
| > * Colin Walters
| >
| > | Note that in my proposal UTF-8 filenames are only mandatory (a "must")
| > | for files *included directly* in Debian packages or created by
| > | maintainer scripts.  Since I don't think we have any packages including
| > | anything but ASCII filenames, this will not change a thing.
| >
| > You are wrong in this regard.  inorwegian includes a file called
| > bokmål (which, ISTR, has symlinks for both ISO8859-1 and UTF8)
|
| Hm, I don't see the symlinks for UTF-8.

Actually, the file names are in UTF8 already. :)

| Anyways, such an approach with symlinks would not really solve the
| problem.  Since the files in question appear to be only used
| internally by ispell, it should not be difficult to recode the
| filenames in UTF-8; the only program that would have to be changed
| is ispell.

And any hard coded scripts using -d norsk (or -d bokmal) for getting
Norwegian ispell output.

| Also, inorwegian seems to have scripts which assume an ISO-8859-1
| environment:
|
| Setting up inorwegian (2.0-9) ...
| Malformed UTF-8 character (unexpected end of string) at /usr/share/perl5/Debconf/Client/ConfModule.pm line 125, <STDIN> line 8.

This is due to debconf not knowing what charset the template is in.
It will be fixed.

#99933#271
Date:
2003-01-05 01:01:53 UTC
From:
To:
On Jan 04, Robert Bihlmeyer <robbe@orcus.priv.at> wrote:

 >Considering old standards broken because a newer one exists is just
 >ridiculous.
Agreed.

 >> I've noticed that UTF-8 sometimes makes zsh unhappy, [...]
 >
 >That's quite an understatement. The commandline editor can't deal with
 >multibyte characters in any way. So for example entering an o umlaut
 >and then deleting it gets you in trouble, because zsh does not handle
 >the two byte sequence as one character.
The same applies to bash. There has been a patch in the BTS for a very
long time but it has never been applied.

#99933#276
Date:
2003-01-05 02:17:17 UTC
From:
To:
On Jan 04, Colin Walters <walters@debian.org> wrote:

 >> We may want a BOM, at the start, though.
 >
 >We don't need one for UTF-8.  That's another one of the great things
 >about it.
What do you know about international environments? Maybe you do not need
a BOM because your native language needs just ASCII and you do not have
any text file encoded with latin-1, but in the rest of the world the
situation is quite different.

I propose a new policy amendment: developers whose native language is
English should not discuss i18n-related policy matters.

#99933#281
Date:
2003-01-05 03:54:58 UTC
From:
To:
Well, hey, so they are.  Don't know why it didn't look like it before...

Hm, but if the filename is already UTF-8, what is the problem?

Cool.

#99933#286
Date:
2003-01-05 03:48:37 UTC
From:
To:
That would make sure that i18n is always an afterthought.  You need
to work *with* developers, not *against* them.  How are you planning
to impose an i18n policy on people who have been excluded from
discussing it?

Richard Braakman

#99933#291
Date:
2003-01-05 04:20:03 UTC
From:
To:
Hm, the latest bash appears to work for me at least.  I've been using it
when I want to do UTF-8 file manipulation until zsh is fixed.

#99933#296
Date:
2003-01-05 04:19:03 UTC
From:
To:
If you can make an argument that starting every text file with a BOM
would be a good idea on a Unix-like system such as Debian, please do.
Everything I have read argues otherwise.  Unix has always treated files
as just streams of bytes, and allowed you to concatenate streams with
pipes.  Having the BOM show up randomly, and expecting programs like
'cat' to remove it, or add it when it is missing, is too much to ask.
'cat' can't know whether its input is random binary data or UTF-8.
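
For illustration, this is roughly the check every consumer of text
would need if files carried a BOM. The function name is made up for
this sketch; the three bytes EF BB BF are the UTF-8 encoding of U+FEFF.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the number of leading bytes to skip,
 * 3 if the buffer starts with the UTF-8 form of U+FEFF (EF BB BF),
 * otherwise 0. */
size_t utf8_bom_length(const char *buf, size_t len)
{
    static const char bom[3] = { '\xEF', '\xBB', '\xBF' };

    return (len >= 3 && memcmp(buf, bom, 3) == 0) ? 3 : 0;
}
```

This only works at the very start of a buffer. After 'cat a b' the BOM
of b sits somewhere in the middle of the stream, where no reader can
tell it apart from data without knowing where the file boundary was.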

But you don't have to listen to me, here are some arguments from Markus
Kuhn against it, which I turned up in a quick Google search:

http://www.rosat.mpe-garching.mpg.de/mailing-lists/perl-unicode/1999-11/msg00004.html

In any case, whether or not to start every file with a BOM is basically
orthogonal to my proposal, so we can discuss the BOM after the core
proposal has been accepted.

#99933#301
Date:
2003-01-05 07:36:16 UTC
From:
To:
[ CC'd to the Debian Description Translation Project maintainer, as he
may be interested ]

Ok, I spent a little bit of time and hacked up some experimental patches
for dpkg to support UTF-8, and to recode it to the locale's encoding
type on output.  If you'd like to play, see:

http://bugs.debian.org/175363
http://bugs.debian.org/175370

Hopefully we can get these into dpkg soon, and at that point we can
start using UTF-8 in maintainer fields and package descriptions.

#99933#306
Date:
2003-01-05 10:49:45 UTC
From:
To:
* Colin Walters

| On Sat, 2003-01-04 at 19:22, Tollef Fog Heen wrote:
|
| > And any hard coded scripts using -d norsk (or -d bokmal) for getting
| > Norwegian ispell output.
|
| Hm, but if the filename is already UTF-8, what is the problem?

It isn't in stable, which means that I want to keep compat symlinks
around for at least one release.  (But that is just me :)  I tried to
fix this last night, but sed seemed to take far too long to do
anything; unsure what the bug there is. :/

#99933#311
Date:
2003-01-05 14:23:17 UTC
From:
To:
On Sat, Jan 04, 2003 at 12:10:42PM -0500, Colin Walters wrote:
[...]
[...]

So how to implement your proposal?
The main issue is to patch glibc API so that filenames are supposed
to be UTF-8 encoded.  Has this already been discussed?

Denis

#99933#316
Date:
2003-01-05 17:09:09 UTC
From:
To:
What do you mean?  What changes to the glibc API would be required?

If you are suggesting that functions like readdir() attempt to convert
filenames from UTF-8 into the user's current locale, I am completely
against that.  It will just exacerbate the problem.

#99933#321
Date:
2003-01-05 16:07:58 UTC
From:
To:
thanks

The DDTP has no problems with UTF-8 in control fields. Some maintainers
use UTF-8 or something else with 'some translations' in the descriptions.

This is not nice.

The policy should be: use plain ASCII, and UTF-8 encoding if you use
non-ASCII characters

Gruss
Grisu

#99933#326
Date:
2003-01-05 18:41:47 UTC
From:
To:
Well, you can reduce that to 'just use UTF-8', since UTF-8 is a strict
superset of ASCII.  So I agree with you, but we need to wait for this
patch to get into dpkg before we can add such a rule to policy.

In the meantime though, we should start removing ISO-8859-1 and
friends...

#99933#331
Date:
2003-01-05 20:13:03 UTC
From:
To:
      <p>
        Programs should expect filenames in general (whether from
        a Debian package or created by the user) to be encoded
        with UTF-8, although it is recommended for programs to try
        gracefully falling back to the current locale's encoding
        if this fails.  Programs included in Debian packages
        should, when creating new files, encode their names in
        UTF-8 by default.
      </p>

Consider a program written in C, which creates new files with open(2);
if I understand your proposal right, when a filename is not UTF-8
encoded, it should be converted into UTF-8 according to user's locale.
I am wondering how to perform this task:
  a. Let open() perform this conversion.
  b. Add a utility function in a common library and patch all programs
     to add calls to this routine.
  c. Let all programs perform their own checks.
  d. ... Others?

How do you think your proposal should be implemented?

Denis

#99933#336
Date:
2003-01-05 21:33:57 UTC
From:
To:
ok.

Gruss
Grisu

#99933#341
Date:
2003-01-06 02:12:36 UTC
From:
To:
Well, broadly speaking, there are two cases:

1) Programs which do not look at the contents of filenames, and just
treat them as mostly opaque arguments.  Commands like 'touch' fall into
this category.  We should not need to change them at all; you just start
passing UTF-8 instead of ASCII or ISO-8859-1 to them.  Any change to
glibc would break these programs.

2) Programs which do manipulate filenames. These are trickier.  Now,
there are several ways to make these programs handle UTF-8.  For some of
them, no change will be required; stuff like searching for ASCII
characters still works with UTF-8.  However, if these programs display
them to the user on a tty, it will be necessary to convert them to the
user's locale encoding (of course, once we make UTF-8 terminals
standard, programs will not need to do this.) If they stuff them in a
GUI widget, they will have to be sure to tell the widget that they are
in UTF-8 (if necessary).
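
As a small example of the "no change required" case: finding the last
path component with strrchr() works unchanged on UTF-8, because every
byte of a multibyte UTF-8 sequence has its high bit set and so can
never be mistaken for '/'. The function name and the example path are
made up for this sketch.

```c
#include <string.h>

/* Hypothetical helper: return a pointer to the last path component of
 * a UTF-8 (or plain ASCII) path.  Searching for the ASCII byte '/' is
 * safe because it never occurs inside a multibyte UTF-8 sequence. */
const char *last_component(const char *path)
{
    const char *slash = strrchr(path, '/');

    return slash != NULL ? slash + 1 : path;
}
```

Called on "/usr/lib/ispell/bokm\xc3\xa5l.hash" it returns the
"bokm\xc3\xa5l.hash" part intact. The same byte-wise search is not safe
in every legacy multibyte encoding, since in some of them (Shift JIS
and Big5, for instance) the second byte of a character can fall in the
ASCII range.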

No.  This would certainly ensure corruption.

It depends.  For some programs, instead of converting the filename back
to the user's locale's encoding for internal manipulation (which may
fail, remember, since UTF-8 can encode far more than say ISO-8859-1), it
would be better to change the program to handle all strings internally
as UTF-8.  For some programs this will be fairly trivial, for others it
may be difficult.  Another alternative is to have a small library which
will first try decoding a filename using UTF-8 back into the user's
locale encoding, and only if that fails, then just take the filename
as-is.  The best approach will depend on the program, and how it
manipulates filenames.
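
A sketch of what such a small library routine might look like, built on
the standard iconv(3) interface. The function name, its return
convention and the fixed output buffer are all assumptions made for
this example, and it only covers stateless target charsets such as
ISO-8859-1.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: try to convert the UTF-8 filename `in` to
 * `to_charset`, writing a NUL-terminated result into out[outsize].
 * If the name is not valid UTF-8, or cannot be represented exactly in
 * the target charset, copy it through unchanged.
 * Returns 1 if converted, 0 if taken as-is, -1 if out is too small. */
int filename_to_locale(const char *in, const char *to_charset,
                       char *out, size_t outsize)
{
    size_t inlen = strlen(in);
    iconv_t cd;

    if (outsize == 0)
        return -1;

    cd = iconv_open(to_charset, "UTF-8");
    if (cd != (iconv_t)-1) {
        char *src = (char *)in;     /* iconv() does not modify the input */
        char *dst = out;
        size_t inleft = inlen, outleft = outsize - 1;
        size_t rc = iconv(cd, &src, &inleft, &dst, &outleft);

        iconv_close(cd);
        if (rc == 0 && inleft == 0) {   /* exact, complete conversion */
            *dst = '\0';
            return 1;
        }
    }

    /* Fallback: take the filename as-is. */
    if (inlen >= outsize)
        return -1;
    memcpy(out, in, inlen + 1);
    return 0;
}
```

With "ISO-8859-1" as the target, the UTF-8 name "bokm\xc3\xa5l" comes
back as "bokm\xe5l", while a name that is already in ISO-8859-1 fails
to decode as UTF-8 and is passed through untouched.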

I hope that helps.

#99933#346
Date:
2003-01-06 03:00:32 UTC
From:
To:
Hmm.  Remember the far more common case of a program that takes a
filename on the command line and then tries to open it.  The user
would have typed it in the local encoding, so it needs conversion.
On the other hand, if the program was invoked by another program
then the filename is likely to already be in UTF-8.

I guess this conversion should be done by the user's shell, and all
filename arguments on the command line should be encoded in UTF-8.
Umm, except that the shell doesn't know which arguments are filenames.
How should this be done?

Richard Braakman

#99933#351
Date:
2003-01-06 05:21:27 UTC
From:
To:
That's true.  Hm.  Maybe the best approach will be to first just
implement Unicode and UTF-8 support for more programs, so it is how they
handle filenames (and strings in general) internally, much like how
GNOME programs do it now.  This is all well and good, I think.

The bigger question is what to do for programs that create or rename
files, especially from user input.  Should they try to convert filenames
back into the locale encoding?  I would say no, because 1) it could fail
if the locale encoding can't encode certain characters and 2) it will
just prolong the brokenness.  For programs like 'touch' though which do
not look at the filename at all, I think they should not be changed at
all.  They will create a file named using the same encoding given to it
as an argument.

After we have a "sufficient" number of programs supporting UTF-8
natively in this way, we change the policy on filenames to a "must",
drop support for legacy terminals and encodings, and switch everyone to
a UTF-8 terminal, and a UTF-8 locale.

My guess is that this could happen some time after sarge's release.  For
sarge, we could (and probably should) make the default locale for new
installations be UTF-8.  After we've switched to a UTF-8 locale for
everyone, programs will no longer need the code to handle legacy
encodings.  It will probably still be useful to keep it though, because
the legacy encodings will be around for a long time, and we want things
to Just Work as much as possible.

So again, after this current policy proposal is accepted, it will still
not be a RC bug to not have UTF-8 support; but people will know that it
is coming.

What do you think?

#99933#356
Date:
2003-01-06 06:45:15 UTC
From:
To:
Just to answer this a bit more directly; no, I think the shell should do
no conversion.  It should just pass its input on to programs in the
encoding it received it.

So for people using legacy encodings, yes, programs will receive
filenames in those encodings, not UTF-8.  But hopefully programs will
handle it, and convert them to UTF-8 internally, and write them out as
UTF-8.  But if they don't, then they don't (unless we fix the program).
There's not much we can do about it, until switching users to UTF-8
locales and terminals.

#99933#361
Date:
2003-01-06 18:45:31 UTC
From:
To:
Besides Sebastien's reply, there is another good reason not to do
recoding in the shell: for any program which actually manipulates
filenames, we will need to add Unicode/UTF-8 support *anyway*, even if
the shell did convert everything to UTF-8.  For example, any program
that used to do:

char *c;
for (c = some_function_that_gets_user_input(); *c != '\0'; c++)
  printf("%s\n", c);

will have to be changed to do something like:

char *c;
for (c = some_function_that_gets_user_input(); *c != '\0'; c = utf8_next_char(c))
  printf("%s\n", c);

Since we will have to change programs anyways, we might as well fix them
to decode filenames as well.  The shell is kind of tempting as a "quick
fix", but I don't think it will really help us.

Well, let's be clear; nothing we can do will truly work in all cases.
The vast majority of data is untagged, and charsets are not always
reliably distinguishable.  We are just trying to minimize what breaks.

For the case you named above, I think what should happen is that 'ls'
converts all the arguments to UTF-8 for internal processing.  For the
first argument, UTF-8 validation will fail, so ls will try converting
from the locale's charset, which will work.  The rest of the arguments
will validate as UTF-8, so ls just goes on its way.
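
The validate-then-fall-back step described above could look roughly like the following. This is only a sketch: it assumes the legacy locale charset is ISO-8859-1 (so the fallback conversion can be done by hand), and the names `is_valid_utf8` and `arg_to_utf8` are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Return true if `s` is well-formed UTF-8: no overlong forms, no
 * surrogates, nothing above U+10FFFF. */
bool is_valid_utf8(const char *s)
{
    static const unsigned long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;

    while (*p != 0) {
        unsigned long cp;
        int extra;

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return false;            /* stray continuation or invalid lead byte */
        p++;

        for (int i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)  /* also catches the terminating NUL */
                return false;
            cp = (cp << 6) | (*p & 0x3F);
        }
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
    }
    return true;
}

/* Hypothetical normalisation step: valid UTF-8 is copied as is;
 * anything else is assumed to be ISO-8859-1 and widened to UTF-8.
 * The caller frees the result. */
char *arg_to_utf8(const char *arg)
{
    assert(arg != NULL);

    size_t n = strlen(arg);
    char *out = malloc(2 * n + 1);    /* each Latin-1 byte needs at most 2 bytes */
    if (out == NULL)
        return NULL;

    if (is_valid_utf8(arg)) {
        memcpy(out, arg, n + 1);
        return out;
    }

    char *q = out;
    for (const unsigned char *p = (const unsigned char *)arg; *p != 0; p++) {
        if (*p < 0x80) {
            *q++ = (char)*p;
        } else {
            *q++ = (char)(0xC0 | (*p >> 6));
            *q++ = (char)(0x80 | (*p & 0x3F));
        }
    }
    *q = '\0';
    return out;
}
```

A real program would take the fallback charset from the locale rather than hard-coding Latin-1, and the heuristic remains a guess: some legacy byte sequences happen to be valid UTF-8.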

I don't think the shell does in all cases.  Think about when arguments
are computed dynamically.

Generally speaking, I think the shell should just be a conduit for
bytes, and not modify them at all.  Much like 'cat'.

Well, this situation can already break horribly on systems whose users
use different character encodings.  So we aren't creating a regression
here, in my opinion.

We will definitely need UTF-8 support for the terminal.  I know
gnome-terminal works, and uxterm works too.  I don't know about support
for Linux consoles.

#99933#366
Date:
2003-01-06 21:07:08 UTC
From:
To:
Fixing programs that handle terminal input is a different matter IMHO, it's
something that should be decided on a more case by case basis, and a lot of
cases might be handled effortlessly just by extending ncurses/slang.

I think the philosophy should be that everything should be converted to
UTF-8 after it is read from the terminal. Programs that interface with the
terminal need to convert.

Changing programs that handle terminal input is a far smaller scope than
changing every program that touches argv and every program that does
terminal input.

If this route is followed then a huge swath of programs are half correct
already, their only problem is that they will not be converting utf-8 for
display. That might be best handled through glibc (again, changing
*everything* just to get around the lack of utf-8 terminals is insane)

Well, that's not true. At the shell level everything is tagged. The shell
knows things returned from readdir are utf-8 and things typed into the
console are something else.

When I say 'all cases' I mean the cases that come up in a system with only
UTF-8 names in the filesystem, not one that has mixed encodings already
in the filesystem, that's hopeless.

Eww, that's gross, it isn't definite that UTF-8 validation will always
fail for non UTF-8 text, you could easily get lucky and type in a word
that is valid UTF-8, but needs conversion! That's a terribly subtle UI
bug.
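
To make the ambiguity concrete: the two bytes 0xC3 0xA9 are the Latin-1 text "Ã©" and also the UTF-8 encoding of "é". The sketch below uses iconv(3) to decode the same bytes under both assumptions; both succeed, so validation alone cannot tell which one the user meant. The helper name `decode_to_utf8` is an assumption for illustration.

```c
#include <assert.h>
#include <iconv.h>
#include <stdbool.h>
#include <string.h>

/* Re-encode `in` from charset `from` into UTF-8, writing a NUL-terminated
 * result into `out`.  Returns true only if the whole input converted. */
bool decode_to_utf8(const char *in, const char *from, char *out, size_t outsize)
{
    assert(in != NULL && from != NULL && out != NULL && outsize > 0);

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return false;

    char *inp = (char *)in;           /* iconv() does not modify the input */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;     /* keep room for the terminator */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return false;

    *outp = '\0';
    return true;
}
```

Calling it on "\xc3\xa9" with `from` set to "UTF-8" and then to "ISO-8859-1" returns true both times, but the two results differ: the first is unchanged, the second is the four-byte UTF-8 form of "Ã©".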

Consider the shell to be a scripting language just like python/java and
look at how it's handled there - all internal strings are UTF-8, functions
that read/write to the terminal convert automatically, functions exist to
convert arbitrary text/files.

You have everything needed to make the shell work uniformly in any
environment, but some cases might require an iconv, but the iconv is
required for *all* users, not just those with different locale settings. I
think that's a good goal.

The trouble is, the shell interfaces with the terminal, so it is the only
thing in a position to know how to convert characters coming from the
terminal to UTF-8, nothing else can do this.

Jason

#99933#371
Date:
2003-01-06 21:15:24 UTC
From:
To:
Hello Colin,
At least I agree to this :-)

I think that we need filename conversion between UTF-8 and the user's
character set, because we cannot ban all non-UTF8 terminal types.  In
my opinion the main problem is, where this conversion should take
place.

Because a lot of programs are affected, it would gain us much if we
could move this as deep as into libc or even into the kernel.  I
remember there are some questions about character sets in the kernel
configuration.  Are there file-systems with in-kernel character set
conversion?
Does anybody know: how do they solve the problems we discuss here?
Where do they convert filenames, e.g. when I login via ssh and
type "ls -l Bär*" from my LC_CTYPE=ISO-8859-15 system?
And how is the conversion done there?
Ok, I see that this is no real problem.

Jochen

#99933#376
Date:
2003-01-06 21:01:51 UTC
From:
To:
Hello,

I think that this would be a really bad idea, because it would be too
severe a restriction on the set of supported terminal types.  Think of
remote logins from non-Debian machines: we cannot control the program
at the other end of the line.  And what about serial (hardware) VT-220
terminals?  We cannot change the hardware, and to lose support for it
would not be nice.

So in my opinion we cannot drop support for non-UTF8 locales and
terminals.  We need to do file-name conversion here.

Jochen

#99933#381
Date:
2003-01-06 22:55:15 UTC
From:
To:
That's true, but I don't think there is really anything we can do to
solve that problem.

Well, such terminals should be explicitly marked as deprecated inside
Debian.  Actually, probably the best solution is for the terminal to be
able to switch encodings at runtime; the experimental gnome-terminal can
do this.

#99933#386
Date:
2003-01-06 23:30:06 UTC
From:
To:
[ CC's trimmed, since mail to the bug will reach -policy ]

A lot of programs don't use curses...

I generally agree with that.

If by 'touching argv' you mean 'modifying and creating output based on',
then I hope you agree that we will almost certainly have to make those
programs grok Unicode anyways, as I said before.  UTF-8 is a multibyte
encoding, and traversing and manipulating it correctly generally
requires one to use different string functions (although stuff like
strchr(foo, '.') will still work).
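
A small sketch of both halves of that point, assuming the input is already valid UTF-8: counting characters needs a UTF-8-aware loop, while searching for an ASCII byte such as '.' keeps working with plain strchr(), because every byte of a multibyte sequence is 0x80 or above. The function names `utf8_length` and `first_dot` are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a string assumed to be valid UTF-8, by skipping
 * continuation bytes (those of the form 10xxxxxx). */
size_t utf8_length(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Byte-wise search for an ASCII character is still safe on UTF-8. */
const char *first_dot(const char *name)
{
    assert(name != NULL);
    return strchr(name, '.');
}
```

For the UTF-8 string "B\xc3\xa4r.txt" ("Bär.txt"), strlen() reports 8 bytes while `utf8_length` reports 7 characters, and `first_dot` still points at ".txt".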

Output is a big problem, I agree.  But how exactly do you propose to
modify glibc?

No, it doesn't!  Even if we force users to run a script which converts
all legacy encodings to UTF-8, people will still have files NFS mounted
readonly on other systems, files that they created using a legacy
program, files on CD-ROM or DVD, etc.

What do you mean anyways that everything on the shell level is tagged?
How is that possible?

What if I do something like this:

touch $(nc www.random.org 80)

But mixed encodings will happen in the real world.  It is unavoidable.
There is a lot of legacy data.

I agree, it sucks and it's pretty gross.  But I don't think there is a
better solution.

Yes, but even in Python/Java/C# or whatever, you don't always know the
encoding for sure; what if you're opening up a Debian changelog?  By
default the stream will be opened using the user's locale encoding, but
we already mandated that Debian changelogs be UTF-8.

I don't see how you can make iconv just make everything work.

As I said, I don't think the shell knows everything, and I think just
modifying the shell will not fix everything, even if it did.

#99933#391
Date:
2003-01-07 08:07:55 UTC
From:
To:
Hello everybody,
UNIX-style programming should continue to "just work", I like
the idea that I can download any old program written in a past
decade and just type make.

And Yes!, there are several filesystems in the Linux kernel
which do character set conversions on the fly.  Specifically,
all the Microsoft/IBM compatible filesystems (*fat, ntfs, hpfs,
iso9660) allow the DOS-side and unix-side character sets to be
specified as mount options.  Some versions of the smb file
sharing tools also do this.  And I think there is some
conversion code in the text mode vt implementation (screen and
keyboard) too.

At least the filesystem character conversions already use
UNICODE as the intermediary format, and thus the kernel includes
an almost complete set of UNICODE to/from X conversion tables,
each as a separate module with kerneld autoload support and all.

So here is my idea of how to do it (no I have not checked what
RH or others do, but I know what MS did wrong 10 years ago and I
live with those mistakes as a cross platform programmer every
day).

1. Unless otherwise specified here, or there are very special
circumstances, all programs and libraries should assume that all
strings they receive or output (including, but not limited to
filenames) are in the same encoding, and make no externally
visible character encoding conversion.  (This is usually trivial
to do, just do nothing).

2. If a program really needs to make assumptions about the
character encoding of data, it should assume the character
encoding specified by the locale. As a minimum, the following 3
cases must work correctly:
   2.1. UTF8
   2.2. iso8859-1+ defined as the single byte encoding where
      each byte is one character, which is its own UNICODE
      equivalent, and where all byte values are treated as
      valid, even if the corresponding UNICODE codepoint is not
      defined.  (This character set is usually combined with the
      C locale to allow processing of arbitrary binary data in
      any unknown encoding).
   2.3. any other single byte encoding where the values 0..127
      are ASCII and 128..255 are graphic characters not
      interpreted in any particular way.

Support for other multi-byte character encodings than UTF8 is
not required for sarge and later, but should not be removed if
it is already there.  For new code, either use the libc
character handling functions, or just treat anything not UTF8 as
iso8859-1+ except when converting to/from UTF8.

Note 2.1: Code which just treats strings as binary data already
satisfies the above.

Note 2.2: Code which just checks for ASCII values such as \n, /
etc. and passes consecutive sequences of high-numbered chars
around as is, already satisfies the above thanks to the design
properties of UTF8.

3. Unless required for security or other functionality, programs
and libraries should not object to processing invalid
characters. (This increases the user's chance of being able to
deal with data in inconsistent or broken encodings, e.g. with
commands such as mv M?nch.txt Maench.txt).

However no conversions should cause bytes to be treated as an
ASCII control char unless its encoding is exactly that ASCII
byte value alone.  This means not converting the "redundant"
UTF8 encodings to their shortest form, but either leaving them
as is or converting them to something harmless.  ? is not
harmless, any ASCII char other than a-zA-Z is not harmless in
general context.
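
As an illustration of why the "redundant" (overlong) forms matter: the byte pair 0xC0 0xAF is not a legal UTF-8 sequence, but a decoder that does no range checking turns it into '/', which is exactly the kind of ASCII character a filter may already have looked for and not found. The sketch below handles only two-byte sequences, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Decode a two-byte UTF-8 sequence with no range checking at all,
 * the way a careless decoder might. */
unsigned naive_decode2(unsigned char b0, unsigned char b1)
{
    return ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
}

/* A two-byte sequence is overlong when it encodes a value that fits
 * in a single ASCII byte, i.e. anything below U+0080. */
bool is_overlong2(unsigned char b0, unsigned char b1)
{
    /* Precondition: b0 is a two-byte lead, b1 a continuation byte. */
    assert((b0 & 0xE0) == 0xC0 && (b1 & 0xC0) == 0x80);
    return naive_decode2(b0, b1) < 0x80;
}
```

Here `naive_decode2(0xC0, 0xAF)` evaluates to '/', and `is_overlong2` flags that pair so the caller can leave the bytes alone or replace them with something harmless instead of silently producing a path separator.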

Note 3.1: This is trivially satisfied by code which does not
convert or check character encoding at all.

4. The low level software which converts keystrokes (or other
non-string input) to strings or converts strings to pixels (or
other non-string output), is responsible for doing so
consistently with the locale of the programs to which it
provides this service, unless those programs explicitly specify
otherwise.

For terminal-style input/output, there will be a tool or library
feature (existing or Debian-created) which does two-way
conversion of character sets around a pty.  This tool can /
should be plugged into ssh, telnet, serial line getty and other
conduits which allow terminal access from terminals that might
have different locales than preferred on a given Debian system.

Note 4.1: Editors, libreadline etc. are not under this rule.
Those are just regular software which needs to count characters
(and thus check for multibyte chars in the specified encoding).
This rule is about the actual terminal interfaces, whether text
or graphic.

5. Software which persists or transports strings outside the
current process group, such as the name processing in
filesystems, should convert strings from the current locale to a
common encoding chosen by the implementor, such as UTF8, UTF16,
UTF32 or in some cases another encoding.  It must be possible to
turn off the translation through an extra environment variable,
no matter what the locale or its character encoding.

For filenames or other data to which access must be possible
even if it is improperly encoded, the translation code should
include a well-defined escaping mechanism for accessing invalid
character encodings on the medium.  This code must not be
enabled in other contexts, due to serious security issues (it
could e.g. allow bad people to bypass code to filter out shell
metacharacters etc.).  This escape mechanism should allow things
like tar backups to just work, no matter how confused the
filenames on a disk.

A mechanism needs to be devised, either in kernel or libc, which
allows the conversion of filenames and console i/o to and from
the process locale to indeed match the process locale.  A
similar or identical mechanism should be put in Xlib.

6.  The base software in sarge, such as libc, Xlib, xterm must
support UTF8 variants of all locales as soon as possible.
Without this, the rest cannot even begin to be implemented.

P.S. I am not a DD, just trying to be helpful and constructive.

Cheers,

Jakob

#99933#396
Date:
2003-01-07 08:29:44 UTC
From:
To:
but unless someone starts actually _using_ UTF-8, we would never know
which tools are broken and which are not (I already found one bug
in handling of UTF-8 GPG alias - I'll file the bugreport after some more
testing).
And remember, this is debian *un*stable, so some breakage is to be
expected.

...

Yes.

But no sign of recognizing the urgent need of solving the problem either.

#99933#401
Date:
2003-01-07 09:29:33 UTC
From:
To:
On Tue, Jan 07, 2003 at 09:29:44AM +0100, Radovan Garabik wrote:
[...]

[Could this discussion take place on debian-i18n?]

Mixing legacy encodings and UTF-8 looks like a bad idea, except that
we can determine whether strings are UTF-8 encoded or not.  So it makes
automatic conversion a bit harder, but it is not a real problem.

The main problem with text files is that their encoding is not specified.
All human editable text files must *explicitly* tell their encoding,
either by their content (like XML/SGML/HTML) or by their file name
(.txt documentation or man pages must contain their encoding in their
full name, naming scheme must be standardized).  This allows support
for both UTF-8 and legacy encodings.  (To Colin: you did not notice any
problem because ASCII text is UTF-8, but problems arise with all other
legacy encodings).

A good example is debconf.  Joey Hess added encoding information in 1.2.0,
legacy encodings are currently the default, and switching to UTF-8 will
take place when it is time, without any trouble.  Automatic conversion to
user's locale (including UTF-8) is performed on output.
The only problem is that very few maintainers have managed to switch to
po-debconf in order to add encoding information to their template files.

A similar approach could be considered for deb control files, a new
mandatory Encoding field must be added to debian/control (and automatically
put in other files when needed), which tells encoding used by all control
files.  Dpkg and friends may then perform automatic conversion (to UTF-8 or
to current user's locale) if desired.

Denis

#99933#406
Date:
2003-01-07 11:38:36 UTC
From:
To:
On Jan 06, Jochen Voss <jvoss2@web.de> wrote:

 >Because a lot of programs are affected, it would gain us much if we
 >could move this as deep as into libc or even into the kernel.  I
 >remember there are some questions about character sets in the kernel
 >configuration.  Are there file-systems with in-kernel character set
 >conversion?
Do not even dare suggesting this. Changing libc would probably break
POSIX compatibility, changing the kernel is a bad idea which would get
nothing else than flames from kernel developers.
Programs have to be fixed: file systems are just another kind of
input/output and should be assumed to follow LC_CTYPE.
The right approach (even if the default configuration is inappropriate)
is the one of GNOME: high level libraries hide file names charset
conversion from users and programmers.

#99933#411
Date:
2003-01-07 12:30:49 UTC
From:
To:
On Tue, Jan 07, 2003 at 10:29:33AM +0100, Denis Barbier wrote:
[...]

This suggestion applies only when control files contain non-ASCII
characters; only problematic packages are affected.

Denis

#99933#416
Date:
2003-01-07 15:22:57 UTC
From:
To:
Colin Walters <walters@debian.org> writes:

Then your solution is broken.  Seriously, this would be a huge problem
for many people.

You can't very well take an actual vt100 and do that.  Even on other
hardware, like older Suns, it's not all that easy.

I am vehemently opposed to any proposal that renders Debian
substantially unusable on existing ASCII/latin1 terminals.  I think it
is great to use Unicode internally, but we clearly are not pursuing
the right path if we introduce such breakage.

(Yes, this would mean that TERM=vt100 is now deprecated)

#99933#421
Date:
2003-01-07 15:23:14 UTC
From:
To:
Testing our tools' support for UTF-8 on your local system is perfectly
fine; I've been doing just that personally.  But, ...

Uploading packages with UTF-8 control fields is not ok.  It will, simply
put, not work for anyone who's not using a UTF-8 terminal, which is
unfortunately probably most of our users at the moment.  Just Don't Do
It.

If you really want to help push UTF-8, apply my dpkg patch, help
find/fix bugs in it, then start ensuring apt-get, aptitude, etc., all
grok UTF-8.

Actually I think we should probably move to -devel, given how strongly
this affects the system in general.  Even people who maintain programs
which care little for i18n will still have to deal with UTF-8 filenames,
and should be UTF-8 aware in general.

It looks to me like at this point almost everyone agrees with the
content of my proposal in #99933, and we are discussing implementation
details.  Agreed?

If so, another second would be cool :)  And also if that is the case,
then it makes a better argument for moving to -devel.

Not with perfect reliability.

You mean like changelog.txt.UTF-8 or changelog.UTF-8.txt ? I am pretty
much opposed to any sort of proposal of this form.  The reason is that
changing programs to recognize our arbitrary scheme for file encodings
would be a lot of work, and we could instead add support to programs
to autodetect the charset semi-intelligently from file content, which
is what real-world programs like Emacs do today.

Actually I quite frequently notice problems with European names, as well
as the copyright character.  Do not assume that because my native
language is English that I do not experience charset problems :)

Ugh.  I am generally quite opposed to adding an Encoding field, and I
bet you'll find the dpkg maintainers are too.  It should just be UTF-8,
period.  If developers really want to, they can generate control from a
control.in file by using iconv or similar.

#99933#426
Date:
2003-01-07 16:58:31 UTC
From:
To:
On Tue, Jan 07, 2003 at 10:23:14AM -0500, Colin Walters wrote:
[...]

No.  We agree that UTF-8 support must be dramatically improved, but
legacy encodings must be supported too.

[...]

I was unclear, and only speaking about files shipped by Debian packages
which contain non-ASCII characters without specifying their encoding.
Users can do whatever they want with their data.
I mostly have txt, man and info pages in mind.  IIRC *BSD put man pages
under .../man/<language>.<encoding>/, don't they?  Info pages are never
translated.  The only text files with non-ASCII letters I encounter
are documentation and can be safely renamed, but maybe there are others.

Then why do you patch dpkg to support UTF-8 input if it can guess encoding?

Denis

#99933#431
Date:
2003-01-07 18:31:57 UTC
From:
To:
But the current situation is *already* broken!  For example, for a
Chinese person, an ISO-8859-1 system simply cannot encode, nor display,
their language.  I am aware that for people entrenched in legacy
charsets like ISO-8859-1, the transition may introduce
incompatibilities.  But that's the price we pay to eventually make
everything work for everyone.

It is the only path to the future.  Note that in my proposal, I do
suggest that programs try to re-encode from UTF-8 back to the user's
locale charset.

#99933#436
Date:
2003-01-07 18:50:46 UTC
From:
To:
Colin Walters <walters@debian.org> writes:

I don't disagree.  I'm saying that your solution is worse than the problem.

True.  However, if the terminal only supports ISO-8859-1, there's no
way to make it magically display Chinese characters.  It's a
limitation, and Unicode or not, there is no way around it.

"may introduce incompatibilities" is something of an understatement.
"Break compatibility with 50 years' worth of computing and almost
every other vendor" is more accurate.

I do not buy that for one minute.  Surely it is possible to translate
things back to a character set the terminal actually supports?

Is that not why we have the "@UTF8" designator for our LANG settings?

Perhaps you mean "it is EASIEST to break compatibility."  That may be
true.  That is also the wrong motivation.

#99933#441
Date:
2003-01-07 19:22:06 UTC
From:
To:
Hello,

I do STRONGLY DISAGREE with

    ...  Programs included in Debian packages
    should, when creating new files, encode their names in
    UTF-8 by default.

We shouldn't start this before all/most programs can handle
the generated file names.

Jochen

#99933#446
Date:
2003-01-07 19:31:30 UTC
From:
To:
I suggest that no decision should be made about man pages until groff
2.0 is available, when proper encoding support will actually be
practical as opposed to the hacks we have today. Until then it will not
be at all clear to me how things should work.

#99933#451
Date:
2003-01-07 19:35:28 UTC
From:
To:
Sorry, we have to start somewhere.  Unicode is the way of the future,
and if we wait until every vendor of some random terminal updates it
with support for UTF-8, we will never start.

Now is a good time, since (again) major chunks of upstream software
included in Debian like GNOME are making a major push towards UTF-8.

Well, that's what we're going to do.

If we change programs to output to the terminal in the locale's
encoding, then yes, it will work, at least if the terminal's charset
covers all of the characters in question (which it may not).
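
For reference, a program can ask the C library which encoding the current locale uses via nl_langinfo(CODESET), and that name can then be handed to a converter such as iconv. The sketch below assumes the program has already called setlocale(LC_CTYPE, "") at startup; the helper names `locale_codeset` and `codeset_is_utf8` are made up for the example.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>

/* Name of the character encoding of the current LC_CTYPE locale.
 * Only meaningful after the program has called setlocale(). */
const char *locale_codeset(void)
{
    return nl_langinfo(CODESET);
}

/* True if a codeset name denotes UTF-8, allowing for the common
 * spelling variants ("UTF-8", "utf8"). */
bool codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}
```

A program that keeps its strings in UTF-8 internally would only need to convert on output when `codeset_is_utf8(locale_codeset())` is false, and would still have to decide what to do with characters the terminal's charset cannot represent.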

Not sure how this is related to what you're saying.

We will try to preserve compatibility as much as possible.

#99933#456
Date:
2003-01-07 19:42:01 UTC
From:
To:
If I drop this from my proposal, will you support the rest of it then?

I should note however that many programs are already creating file names
in UTF-8 today; like pretty much any program which uses GTK+ 2 for
instance (including all of GNOME).

#99933#461
Date:
2003-01-07 20:10:01 UTC
From:
To:
Colin Walters <walters@debian.org> writes:

I don't disagree that we should move to Unicode.  I disagree that such
a move must inherently remove support for legacy (or even, the
majority of CURRENT) terminals.

Sorry, this discussion is about what we're doing, isn't it?  I don't
recall seeing "Colin Walters, Debian Dictator for Life" voted on
anywhere.

What "change programs?"  That's what they do now.

Yet your own proposal breaks compatibility with, let's see, EVERYONE?

#99933#466
Date:
2003-01-07 22:04:35 UTC
From:
To:
See http://mail.nl.linux.org/linux-utf8/2003-01/msg00037.html
It would be nice to make sure programs are ready before switching
everything to utf-8.

Denis

#99933#471
Date:
2003-01-07 23:50:45 UTC
From:
To:
If you're using a terminal that can't support UTF-8, you always have the
option of running something like GNU screen to translate the system
charset to the terminal charset.  It seems more important to get a
systemwide encoding working, then worry about the minority who use
physical terminals.

#99933#476
Date:
2003-01-08 06:00:19 UTC
From:
To:
Not inherently, but stuff will likely break.  How much it breaks is
inversely proportional to how much work we put into it.

Ah, you must have missed the rider in the small font in my last policy
proposal :)

Seriously, I didn't mean it that way; I just meant that I think everyone
has generally accepted that UTF-8 is the way of the future; we're just
debating when, where, and how.

I don't think most do.  dpkg for example doesn't.  'ls' for example
doesn't.

No, for people using UTF-8 today, like me, it increases compatibility :)

And remember, (not to sound like a broken record, but) lots of upstream
software is moving to UTF-8.  Compatibility with systems using legacy
charsets is already broken to some extent.

#99933#481
Date:
2003-01-08 06:08:14 UTC
From:
To:
Sure...but remember that my policy proposal does not drop support for
legacy charsets; in fact it recommends that programs try falling back to
them if UTF-8 decoding fails.

I see this policy proposal as a strong statement that Debian is moving
towards Unicode, not as a means to get packages which don't grok UTF-8
removed from Debian or something silly like that.  Implicitly in this is
that we will support legacy encodings to some extent for a while.

Do you agree?

Ok.

Agreed completely.  They can have their data in any encoding they want,
as long as it's UTF-8. :)

(just kidding...)

Ah, OK.  I think that improving how our documentation formats specify
charsets is a great goal.  I misunderstood your proposal.

Er...my patch was to support outputting UTF-8 to the user's terminal.
There was no input involved.  I think you may have confused something
somewhere, but maybe I just wasn't clear about what it does...

#99933#486
Date:
2003-01-08 06:10:36 UTC
From:
To:
That is interesting advice.  I am not sure I understand exactly how it
would work though.  Would you just tell screen that all input is in
UTF-8?  It seems like this would not be true if the user has legacy
filenames, and they do something simple like 'ls'...

#99933#491
Date:
2003-01-08 06:16:41 UTC
From:
To:
Cool.

I will say this much; I simply did not even consider doing this kind of
character set conversion as part of glibc or Linux.  It just seems like
such a horrible kludge that would not actually work in practice.
Fundamentally, glibc and Linux cannot know what charset the application
itself works in.  You might have stuff that undergoes UTF-8 conversion
*twice*, once by the application and once by glibc for example.  It just
seems like a recipe for disaster.
because you can't just use your same old C library string functions on
UTF-8. I know it seems tempting to just stick some code into glibc, but
I have serious doubts that will ever work in anything resembling a
reliable fashion.

Feel free to prove me wrong of course!

I think that it quite simply does not work.

What conversion?  GNOME apps speak UTF-8 natively, and that's about all
they speak unless you set the G_BROKEN_FILENAMES environment variable.

#99933#496
Date:
2003-01-08 06:30:09 UTC
From:
To:
Naïve, simple, classic UNIX-style programs are ASCII-only.  Then someone
got the idea to bolt this huge "locale" kludge on top of all of it.  It
is not something to be proud of or emulate.

Yay for broken software.

This is the way things currently work; it is also exceedingly broken.

I think that if you are writing a program today, it is saner to assume
UTF-8, since that is the future direction.

I believe that the programs to which you might need to pass invalid
characters will also be the programs which will not look at or
manipulate the filenames anyways.  'mv' is a good example of a program
which we will *not* need to change.  It just basically takes its
arguments and passes them to the rename system call (well obviously it
is more complicated than that, but that's the basic idea).

I generally agree.

Such a tool could save us time (perhaps this tool already exists in the
form of GNU screen, as mentioned by David Starner), but note we can't
really force users to use it.

Ugh, I am opposed to any sort of environment variable like this.  I
think it will not be necessary, and will complicate the implementation.

Not sure how this "escaping mechanism" would be possible, or what it
would even really do.

I think it might make sense to have common library functions to do stuff
like this in glibc.

It already does.  I just tried uxterm again for the first time in a
while, and I'm really impressed with its current level of UTF-8
support.  It can do almost all of UTF-8-demo.txt on my system.

Thanks for your comments.

#99933#501
Date:
2003-01-08 06:55:12 UTC
From:
To:
At 01:10 AM 1/8/2003 -0500, Colin Walters wrote:

Well, screen should react in the same way that any UTF-8 terminal should
react. (There's a specification that not all of them follow, but all of them I've
tried handle it non-catastrophically.) The suggestion was how to handle
legacy terminals in a UTF-8 world.

As for legacy filenames, I'd think that it would be easiest for each system
to declare a flag day, and change over to UTF-8. (zsh be damned -- they've
had plenty of time to figure out how to properly handle UTF-8.) I've submitted bugs
on packages for having filenames not in ASCII, so for the most part Debian's
filenames won't be a problem. There is no way for a POSIX filesystem to tag
filenames with encodings, so there is no option for this to be a clean
changeover, especially as there's no clean state to start from.


David Starner - starner@okstate.edu
(starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work,
but is not suitable for high-volume traffic.)

#99933#506
Date:
2003-01-08 07:53:16 UTC
From:
To:
Forget about screen and use filterm (from the konwert package).
I use it from time to time on legacy terminals with great
success.

filterm - UTF8-iso1
is all you need to make your Unicode setup work on a
legacy iso-8859-1 terminal. It even converts characters to their
most appropriate iso1 equivalents (strips diacritics, transliterates
Cyrillic etc.). This is not some hypothetical option like
most of what is proposed; I have really used it to read
Russian, Slovak etc. on broken or old terminals.

#99933#511
Date:
2003-01-08 08:00:41 UTC
From:
To:
That is what I am doing now. (Except the dpkg patch which I am going to
play with if I find some time)

I lost count how many times I already had this discussion on -i18n,
-devel and whatever else. The consensus was ALWAYS "OK, that is nice
but just wait until the tools support UTF-8, and besides, I do not care
about it". So we waited and waited until RedHat (much as I dislike RH,
I applaud their effort in switching to UTF-8) and it is no longer
a question of making the "proper" progressive decisions, but a question
of not falling back too much when compared with RH.

I would like to. Though I am not sure about others.

I completely agree.

#99933#516
Date:
2003-01-08 20:28:33 UTC
From:
To:
Unicode did not exist until fairly recently.  Lots of useful software was
written prior to its introduction.

#99933#521
Date:
2003-01-08 20:32:53 UTC
From:
To:
It's not just physical terminals we're talking about here.  We're talking
about the vast majority of the state of the art terminal emulators *today*.
Debian's latest stable release does not use Unicode by default in either KDE
or Gnome, AFAIK.  The console in the latest stable release does not use
Unicode by default either.

Then we have all the other Linux distros, plus Solaris, AIX, AS/400, etc,
etc, etc.

Hell, we're doing good to get some things to support *ASCII*.

#99933#526
Date:
2003-01-08 22:54:43 UTC
From:
To:
At 02:32 PM 1/8/2003 -0600, John Goerzen wrote:

I'd have a hard time describing a terminal emulator that doesn't support
UTF-8 as "state of the art". Recent versions of xterm, gnome-terminal,
and the KDE terminal all support UTF-8.

No one said that we were going to remove non-UTF-8 locales in Sarge. The
console can be switched into UTF-8 mode with one command - unicode_start.

AS/400? We don't support EBCDIC.

We'll be losing more compatibility with Mastodon Linux, but we can't run a.out
anymore, so it's really a moot point. As for the rest of them, most of them
are ahead of us in UTF-8 support - RedHat, Solaris, AIX. What about Mac
OS/X and Windows? Both of them are far ahead of us in UTF-8 handling.

Then those programs shouldn't be in Debian - Hamm made being 8-bit
clean a release critical property. Being 8-bit clean isn't good enough for
a large part of the world to use their native languages, and is a pain for
the rest of us who are mathematicians, linguists, scholars or travelers.

If it was written prior to Unicode, it's useless to the Ethiopians and the Iranians and
a large part of the rest of the world; it's likely to be useless to the Japanese and
Chinese as well.

We can support non-UTF-8 terminals - as Radovan pointed out, the tool
is filterm. If you want to support an older terminal, that's
the easiest place to do so; you can't afford to muck around
in kernel and libc or in every program for that.


David Starner - starner@okstate.edu
(starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work,
but is not suitable for high-volume traffic.)

#99933#531
Date:
2003-01-08 23:03:13 UTC
From:
To:
Yes, there are UTF-8 versions available.  Does everyone have them?  Do we
enable them by default?  Do all other vendors ship them?  The answer to all
of these questions is No.

Colin was advocating what amounted to exactly that.  He was advocating
removing all support for non-UTF8 terminals.

AS/400s do support ASCII :-)

I was making a joke, not meant to be taken seriously (and it was referring
to the AS/400).

I don't buy that at all.  Lots of programs are simply pipes, working with
data going in, echoing it back out.

Colin asserted that ls was broken because it doesn't handle Unicode.  I
submit that ls has always handled Unicode; if the filename is encoded with
Unicode and your terminal is Unicode, it will show it in Unicode.  It
doesn't have to be made specifically aware to just shlep some data onto the
screen.

Then let's do that, and not consign the rest of the world to the junk bin.

#99933#536
Date:
2003-01-09 01:07:40 UTC
From:
To:
The only "must" my present policy proposal introduces is for filenames included
*directly* in Debian packages, or created by maintainer scripts.

Everything else is just a "should" or less, for now.

Could you reread my policy proposal again, please?

Broken?  Not necessarily.  But suboptimal?  I think so.

True enough.  But we could make the transition easier and increase
compatibility with legacy setups by making 'ls' and friends recode
output.

I fully, completely agree.

#99933#541
Date:
2003-01-09 02:05:29 UTC
From:
To:
At 05:03 PM 1/8/2003 -0600, John Goerzen wrote:

Everyone who has the most recent version. They're enabled by default if you're
running a UTF-8 locale, like they should be.

Can we control this? If you're sitting at a computer that doesn't have a new
terminal, you can run filterm or install a newer xterm.

But not in Sarge.

No argument here; it would be nice if ls would escape invalid byte sequences
and bad characters, but it's not broken.

But we do do that -- we have filterm in the distribution. A filter between
the terminal and the system is the easiest place to solve this problem.


David Starner - starner@okstate.edu
(starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work,
but is not suitable for high-volume traffic.)

#99933#546
Date:
2003-01-09 18:28:50 UTC
From:
To:
Hello,
I want to challenge the "everyone" in your sentence above :-)

I agree that it would be a good idea to store filenames as UTF-8
in the filesystem.  But I (being a part of "everyone") do not
agree, that we should even try to switch every terminal in the
world to UTF-8.  We do need conversion of file names somewhere
between the filesystem level and output.

Jochen

#99933#551
Date:
2003-01-09 18:46:16 UTC
From:
To:
Well, I do agree that conversion should occur.  In reality though, not
all programs will be fixed to do this, and not all terminals will be
converted to UTF-8 either.  We just want to maximize both in an attempt
to minimize breakage.

#99933#556
Date:
2003-01-10 01:57:52 UTC
From:
To:
A Posix filename is a null terminated byte string (sans '/'). Any widescale conversion is
going to cause aliasing issues and other bugs, whether or not we stay Posix compatible.
Just as important, conversion is not an issue for debian-policy; linux-utf8@nl.linux.org (the
primary Unicode-Linux discussion list) is strongly against it, and I believe the people who
matter - the ones who work on the kernel and libc - are generally against it.

I'd been interpreting this part of the policy amendment as saying "You shouldn't have filenames
in packages (or created by packages) in non-UTF-8 encodings." (I'm not generally a fan of
filenames in non-ASCII UTF-8, but at least it's consistent.) If we're talking about what programs
output, it should use whatever name and encoding the user asks for. We can't dictate what
encoding end-users use; just what Debian packages use internally.


David Starner - dvdeug@debian.org
(starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work,
but is not suitable for high-volume traffic.)

#99933#561
Date:
2003-01-10 03:29:14 UTC
From:
To:
Right.  Did the people on that list come up with any general plan for
how GNU/Linux vendors should transition?

I suppose I should subscribe to that list...

Well, that's not quite right.  For filenames included directly in Debian
packages, or created by maintainer scripts, my policy proposal says they
*must* be UTF-8.  For files simply created by running programs, it is
just suggested that they be UTF-8, for now.

Are you saying that programs should attempt to convert filenames back
into the user's locale encoding in the actual filesystem, or just that
they should recode them for output?

#99933#566
Date:
2003-01-10 04:05:33 UTC
From:
To:
At 10:29 PM 1/9/2003 -0500, Colin Walters wrote:

Not anything written up that I know of. Debian-i18n has a large cross
membership, which was part of the reason this should be on debian-i18n.

Console programs should not recode them period, except possibly for
annoying stuff (newlines in names and the like). Locale-dependent
GUI programs should probably do the same. GNOME and KDE may
save them as UTF-8, but that's questionable behavior; arguably, if you
want to use GNOME and KDE you should be using a UTF-8 locale, which
would solve the inconsistency.


David Starner - starner@okstate.edu
(starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work,
but is not suitable for high-volume traffic.)

#99933#571
Date:
2003-01-10 04:55:32 UTC
From:
To:
Ok, if people want to move this discussion that's fine by me.

If we're talking about the filenames, then I agree.

What do you expect GNOME programs to do?  Since they fully support
UTF-8, you can input any Unicode character you want.  Also, a program
like Evolution may receive a file in mail whose name uses Unicode
characters.  And a lot of locale charsets (like ISO-8859-1) will not be
able to encode the string.  The only sane solution is to just use UTF-8
for filenames.

But I am curious about your feelings on programs writing data in general
to the terminal; you feel they should never convert it to the
locale's charset, and we should just mandate that people using legacy
terminals use that filterm or whatever thing?

#99933#576
Date:
2003-01-11 11:21:33 UTC
From:
To:
At 11:55 PM 1/9/2003 -0500, Colin Walters wrote:

You can input any Unicode character you want, but you probably have
to go out of your way to input something outside your charset (i.e. probably
not on your keyboard or standard IM.) If I receive a file in the mail whose
name is not in ASCII (which has never happened to me), I would rename
it before saving it, so I could access it easily. How many people in a
Latin-1 locale who got an email with a Chinese file name would want it
saved with the original name? A simple hash - say, out of charset
characters to _ - would probably be fine.

If you're dealing with a web browser, or a mail reader or anything else
that handles tagged data, it should convert it, of course. Anything else
should be in the locale charset, and manually recoded if necessary.
I'm not sure I really understand what you're asking here.


David Starner - starner@okstate.edu
(starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work,
but is not suitable for high-volume traffic.)

#99933#581
Date:
2003-01-11 23:37:52 UTC
From:
To:
Naive, simple, classic UNIX-style programs (if 8 bit clean) will
implicitly handle UTF8, latin-1, latin-2, Korean DBCS, Arab,
Hebrew, most old DOS codepages, and generally any encoding which
includes ASCII as a proper subset.  The notable exception is
certain Japanese DBCS encodings, which allow ASCII character
encodings to have a different meaning if preceded by the wrong
byte values.  I am not sure if the common Chinese DBCS encodings
are safe like Korean or unsafe like Japanese.
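
One way to check which encodings are "safe" in the sense above is to
encode a non-ASCII character and look for bytes below 0x80 in the result.
The following is only a sketch in Python; the sample characters and the
helper name ascii_bytes_in_multibyte are chosen here for illustration and
are not part of the proposal.

```python
# Sample non-ASCII characters, one per encoding (illustrative choices).
SAMPLES = {
    "utf-8": "表",
    "euc_kr": "한",
    "shift_jis": "表",
    "big5": "功",
}


def ascii_bytes_in_multibyte(text: str, encoding: str) -> list:
    """Return the bytes below 0x80 produced when encoding non-ASCII text."""
    return [b for b in text.encode(encoding) if b < 0x80]


if __name__ == "__main__":
    for enc, ch in SAMPLES.items():
        hidden = ascii_bytes_in_multibyte(ch, enc)
        print(enc, [hex(b) for b in hidden])
```

With Python's standard codecs, the UTF-8 and EUC-KR samples yield no such
bytes, while the Shift_JIS and Big5 samples both contain 0x5C, the ASCII
backslash. So at least Big5 appears to be unsafe in the same way as the
Japanese DBCS encodings.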

This is what I want to keep working.

But this pleasant situation presumes, that all the system
interfaces (terminal, filesystem, Xlib ...) happen to use the
*same* encoding at any given invocation of the program, at least
as far as input/output to that program is concerned.

So my detailed proposal is about getting UTF8 support to work
without breaking this basic programming assumption.

Again, I assume that the program is 8 bit clean or I would have
to restrict my input to ASCII anyway today.  But if I do
restrict my own input to ASCII for such a broken program, the
system should do nothing which may increase the breakage beyond
that manual workaround.



To understand my concrete proposal, it should be seen in the light
of the following general transition plan:

Step S1. Get all the ultra-core software to support UTF8 (items 4
and 6 in the proposal).

Step S2. Now maintainers of other software will have a
reasonable environment in which to start implementing and
testing that their code works with UTF8 variants of locales.
And users can actually use such locales without massive
breakage.

Step S3. Make all Debian packages work correctly in the presence
of UTF8 locales.  Proposal items 1 to 3 are about making this as
trivial as possible, with 90% plus of current packages (both
source and binary) needing no change at all.

Step S4. While implementing S3, work on creating solutions which
allow processes running in UTF8 locales to interoperate with a
world, where some systems and users will continue to use other
encodings anyway for many years to come.

Proposal item 5 says that this is the responsibility of the few
pieces of software actually interfacing with the outside world,
not of the many pieces of neutral software which may or may not
happen to be used in those situations.

Proposal item 4 emphasizes that simply having a user interface
(such as libreadline in the shell, ncurses in some full screen
text mode programs, Athena or Motif/lesstif widgets in X
programs) does not put a program in that category.

Thus character conversion should be done at the very edge of the
system: In the local terminals (vt, xterm, Xlib), in remote
terminal access software (ssh, telnet, tty wrappers for serial
lines, Xlib for remote X terminals), and in physical storage
interfaces (already partially in the stock kernel for non-UNIX
filesystems).

Step S5. Make UTF8 locales the default.

Step S6. Subject support for other encodings to bit rot, not
deliberate removal.

If I set my locale to UTF8, use a
UTF8 terminal and all my filesystems present UTF8 at the system
call level, everything works.  If I set my locale to latin-1,
use a latin1 terminal and all my filesystems present latin1 at
the system call level, everything works too.  If I set my locale
to the predominant Japanese DBCS encoding, use a Japanese DBCS
terminal and all my filesystems present Japanese DBCS at the
system call level, almost everything works, unless I use one of
the few characters whose DBCS encoding abuses the byte values
normally associated with e.g. "/", or "\\" .  And yes, I do use
all of these variations on some of my machines, even though I
don't speak the Japanese language personally.

If the locale says UTF8, then assuming UTF8 is safe.  If the
locale is not UTF8, assuming UTF8 is VERY broken.  My proposal
went on to say that supporting the UTF8 setting correctly is the
most important case to implement, but a neutral 8-bit clean mode
must also be available, which will handle most other encodings
implicitly.  Support for legacy DBCS encodings is not required
at all, because it may be too difficult to add to programs in
some situations, and users can soon get around by using UTF8 for
those languages.

Here is a simple example:

/bin/more needs to count the number of encoded characters in
order to determine, when lines will wrap and thus when to pause
output.  So /bin/more must recognize the UTF8 (or other charset)
values which indicate multi-byte encodings representing a single
character.  It may even need to know about zero and double width
characters.  But whatever it does, it should not refuse to pass
through unmodified any non-UTF8 data I might feed it, because I
probably have a reason to do that if I do (maybe my LOCALE
variable says UTF8 by mistake, maybe my super-smart terminal
does dynamic character set recognition, maybe I am piping binary
data through it and it will be processed by the next filter in
line).  The same applies to multi-column /bin/ls output, or to
my text editor.
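
The /bin/more case can be sketched roughly as follows. This is not how
/bin/more is implemented; it is a Python illustration under the assumption
that invalid bytes should each count as one column and that the original
bytes are what gets written to the terminal. The function name
display_width is hypothetical, and the width rules (combining characters
are zero width, East Asian wide/fullwidth characters are double width) are
a simplification.

```python
import unicodedata


def display_width(line: bytes) -> int:
    """Estimate terminal columns for a line that is *probably* UTF-8.

    The input is never modified; bytes that are not valid UTF-8 are
    counted as one column each instead of being rejected.
    """
    # surrogateescape turns each undecodable byte into U+DC80..U+DCFF.
    text = line.decode("utf-8", errors="surrogateescape")
    width = 0
    for ch in text:
        if 0xDC80 <= ord(ch) <= 0xDCFF:
            width += 1            # raw, undecodable byte
        elif unicodedata.combining(ch):
            width += 0            # zero-width combining mark
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2            # double-width character
        else:
            width += 1
    return width


if __name__ == "__main__":
    for sample in (b"abc", "日本".encode("utf-8"), b"\xff\xfeab"):
        print(sample, display_width(sample))
```

The point of the sketch is that the pager only needs the decoded form for
counting; what it passes on is still the unmodified byte string.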

A very well known example is perl 5.8 .  Many existing perl
scripts process pure binary data using string functions.  This
broke unnecessarily when perl 5.8 started to assume all string
data to be valid in the users character set and did
non-reversible conversions to it in order to do UNICODE
internally.  The proposal says that any future changes to
software should not make this mistake.

The idea is, that those Debian packages, which provide the
interfaces to external terminals (telnet, ssh, serial line
variants of getty) should be packaged to invoke the tool or
feature implicitly by default, thereby causing all terminals to
look like UTF8 terminals (if LC_CHARSET=UTF8), even if external
computers or hardware terminals are really not.

Since Debian is Free Software, users still have the freedom to
break things, but they should not be broken as shipped.

There are some real world tasks (mostly related to system
administration, crash recovery, backup etc.), where the ability
to directly access the raw encodings of filenames etc. is vital,
but correct graphic display of some characters is not.  Such
tasks need to run with character set translation turned off, and
ditto for any other unwanted "automatic" assistance.  A good
example is your hypothetical script to convert on-disk filenames
to UTF8 by renaming files; this tool obviously needs to bypass
UTF8 translation in order to access the old filenames in the
first place.  Another is tools which relate raw disk blocks to
the output of e.g. /bin/ls or to filenames specified by
"/sbin/fstool *.bak".

This is actually one of the big MS mistakes around 1990.  When
they implemented Windows 2.x/3.x/9x on top of MS-DOS, they
switched from the old IBM/DOS encodings (like 437 and 850) to
early versions of latin-1 and friends (known in the MS world as
ANSI encodings), and they added implicit character conversions
to some of the file system interfaces.  But they forgot to
create a safe and easy way for sysadmins / advanced users to
access and manipulate files whose names contained
non-convertible characters.  Even worse, they mandated that it
was the responsibility of individual programs to invoke
conversion functions at the "right" times.  This meant that a
lot of programs got it wrong, creating a situation where users
had to stick to pure ASCII or risk exposing untested bugs in
strange places.  They never found a way to fix things once the
bad spec had been implemented by all the Windows programs in the
world.  In the 32 bit version of Windows they removed all the
non-converted system calls thereby removing the problem for the
DOS chars in filesystems, killing off any differently encoded
filenames, and moving those conversions into the kernel, but at
the same time, they did it again for UNICODE.

Assume user X is running on sarge+5, a pure UTF8 setup all the
way through.  Assume, that filesystem xyzfs stores filenames in
another character set and is subject to automatic implicit
conversions.

For some reason he mounts a device containing a few (perhaps
only one) non-UTF8 filename (perhaps an old removable disc,
perhaps NFS, perhaps a corrupted disc, perhaps a network mount).

Such an escaping mechanism would:

   1. Allow the filename to just appear in all sorts of file
     listings, file open dialogs etc. without those dialogs
     doing anything special because it is all in the conversion
     routine.

   2. Allow the file to be opened and manipulated with any tool
     the user might find useful, because the conversion routines
     allow the filename to make it through.

   3. Allow the file to be backed up and restored, even if the
     operator is unaware of the presence of corrupted filenames
     on the system.

Technically such a conversion might work as follows:

   1. When converting on-device filenames to/from the
     intermediary format (probably UTF32), reversibly map any
     invalid byte values to some part of the Corporate Zone in
     UNICODE.  The same 256 UNICODE code points can be used for
     all character sets, there may already be a tradition or
     standard indicating what values to use.

   2. When converting locale format (UTF8 or otherwise)
     system call / library call filenames from/to the intermediary
     format, reversibly map any UNICODE code point not in the local
     encoding to a sequence of chars indicating the HEX unicode
     code point.  The locale encoding character indicating this
     escape should be chosen carefully for each family of character
     encodings, as that character will become unusable in filenames
     for users of that encoding.
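
As a rough illustration of the two mappings, Python's standard error
handlers can stand in for them. Note the substitutions: "surrogateescape"
maps invalid bytes to lone surrogates (U+DC80..U+DCFF) rather than to a
private/corporate-use range, and "backslashreplace" uses a backslash as the
escape character. The function names and the choice of UTF-8 as the
on-device format are assumptions for the example, and only the outgoing
direction of the second mapping is shown.

```python
def from_device(raw: bytes) -> str:
    """On-device name -> intermediary format, keeping invalid bytes."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_device(name: str) -> bytes:
    """Intermediary format -> on-device name (exact inverse of above)."""
    return name.encode("utf-8", errors="surrogateescape")


def to_locale(name: str, charset: str) -> bytes:
    """Intermediary format -> locale encoding, hex-escaping what is missing."""
    return name.encode(charset, errors="backslashreplace")


if __name__ == "__main__":
    raw = b"caf\xe9.txt"              # Latin-1 name on a "UTF-8" device
    name = from_device(raw)
    print(repr(name))
    print(to_device(name) == raw)     # round trip preserves the bytes
    print(to_locale("日.txt", "iso-8859-1"))
```

The first pair is reversible for any byte string, which is what lets a
corrupted or foreign filename survive listing, opening and backup.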

NOT library functions, that is the big MS mistake.  It must
happen outside individual programs and libraries in order to
avoid creating an unmaintainable mess, where every programmer
must figure out when to apply which conversion to which data,
many create bugs, design improvements are impossible, and all
programmers waste their time doing unnecessary work.

I already knew that many xterm clones did it right.  But the
item says that ALL the terminal emulators, ALL the local
terminal interfaces (text mode vt, svgatextmode, Xlib text
input/output calls) and ALL the locales defined by the "locales"
package must support UTF8 as the very first step of getting an
environment in which UTF8 versions of packages may ship without
causing massive breakage.

You're welcome.

#99933#586
Date:
2003-01-14 00:50:05 UTC
From:
To:
Ok, that is probably going to be true.

Ugg.

But what if the program *knows* the data is UTF-8 internally?  Like all
GNOME programs do, and my patch for dpkg tries to do?

And if my policy proposal is accepted as is, then programs can expect
filenames at least to be UTF-8.

#99933#591
Date:
2003-01-14 07:23:51 UTC
From:
To:
Then it should be easy to convert it. You can't not convert and expect a
reasonable response - among other things, innocent UTF-8 characters can
include C1 bytes, and screw up an innocent terminal.

Not acceptable. Filenames are and must be in the locale charset. There is
no other sane option - what do you expect "echo *" to do? You can't slap
filters around everything; it's horribly buggy, and error-prone and would
take forever to implement, IF everyone wanted to go along with it. The
only sane situation is to transition everything as a whole to UTF-8,
with filterm or the like for legacy terminals. You can't just change
filenames.

#99933#596
Date:
2003-01-14 08:23:27 UTC
From:
To:
Hello,
No, this does not work, too.  Imagine two scenarios:

1) A multiuser machine, with users using different charsets.
   Who decides which one is "local"?

2) The sysadmin/user changes the charset, e.g. from iso-8859-1
   to iso-8859-15 to get the Euro character.
   How should the filenames stay in the local charset when
   this changes?  Would there be some automatic conversion?

A non-broken solution will have to convert charsets somewhere
between the filesystem level and output to the user's terminal.
(And no, I don't know an easy way to do this :-( )

Jochen

#99933#601
Date:
2003-01-14 15:37:20 UTC
From:
To:
Heh.  I will quote from a previous message of mine about filenames in
the locale charset, which, since you joined the discussion later, you
might not have seen:

It appears so, and yes, this behavior is completely and fundamentally
broken.  If you have say a Chinese friend who logs onto your computer,
and he sets LANG to something like cn_CN.BIG5, then when he tries to
'ls' your files, it will completely fail.  Likewise, when you try to
look at his, it will not work at all.

Moreover, say the system administrator does something like 'find
/home'.  The resulting stream will be a mixture of ISO-8859-X and BIG5,
and impossible to reliably differentiate.  And of course the problem
doesn't just occur when you have a multiuser system; your Chinese friend
could send you a .ogg file named using BIG5, and your Latin 1 system
would simply fail to encode the filename.

And finally, having the encoding of filenames dependent on the current
locale often doesn't make sense even for a single user; what if you are
a software developer in an ISO-8859-1 locale, and you want to test the
Japanese translation of your software.  So you run it with
LANG=ja_JP.ISO-2022-JP or something to get the translations displayed.
As a side effect, all the filenames on your system will fail to work.

In summary, UTF-8 is the *only* sane character set to use for
filenames.  Major upstream software for Debian like GNOME is moving
towards requiring UTF-8 for filenames, and we should too.

Quite frankly, I expect it to not work, unless they're using a UTF-8
terminal.

I am not sure.  I have a feeling we could make "core" programs like 'ls'
and such do conversion, but I agree it would be quite a long time before
we covered "most" of the programs people use.

I think programs should start expecting UTF-8 filenames today, but be
able to sanely handle filenames in the locale charset.  That way we get
the best of both worlds, and minimize the pain of the transition.

Note again that GNOME programs and the like are already creating UTF-8
filenames, because they work completely in UTF-8 internally.  Now, they
*could* try to convert them back to the locale charset.  But I would
argue strongly against this, because the conversion could fail if the
locale's charset isn't able to encode some target characters.  That may
be an "unlikely" scenario, but when you're dealing with something as
fundamental as filenames, you don't want to just ignore "unlikely"
scenarios.
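
The failure mode is easy to reproduce. A small Python sketch follows; the
helper name can_encode and the sample filenames are made up for the
example.

```python
def can_encode(name: str, charset: str) -> bool:
    """Return True if every character of name exists in charset."""
    try:
        name.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for name in ("résumé.txt", "日本.ogg", "€uro.txt"):
        for charset in ("iso-8859-1", "iso-8859-15", "utf-8"):
            print(name, charset, can_encode(name, charset))
```

Only UTF-8 accepts all three names; the Euro sign, for instance, exists in
ISO-8859-15 but not in ISO-8859-1.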

#99933#606
Date:
2003-01-15 02:13:27 UTC
From:
To:
There are problems, yes. What you have failed to show is that your
solution is better, or even implementable.

Converting a byte-string as if it were a string of characters is
guaranteed to cause problems. There will be inaccessible files,
multiple files with the same name, all sorts of problems and
security holes. Not to mention you have to rewrite every piece of
code that handles filenames. Good luck.
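
To make the "multiple files with the same name" case concrete, here is a
short Python sketch. It assumes a lossy conversion that replaces
unencodable characters with "?"; the helper name to_latin1_lossy is
hypothetical.

```python
def to_latin1_lossy(name: str) -> bytes:
    """Convert a filename to ISO-8859-1, replacing what cannot be encoded."""
    return name.encode("iso-8859-1", errors="replace")


if __name__ == "__main__":
    names = ["日.txt", "本.txt"]
    converted = [to_latin1_lossy(n) for n in names]
    print(converted)
    # Two distinct files now present the same name to the program.
    print(len(set(names)), "names ->", len(set(converted)), "converted")
```

Both names come out as the same byte string, so a program working on the
converted names cannot tell the two files apart, or address either one.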

The non-broken solution which everyone else is going towards is
complete conversion of the system to UTF-8; most programs already
support UTF-8, and once the switch is done, it will be clean,
without the breaking of POSIX rules or adding more code to every
program.

#99933#611
Date:
2003-01-15 02:50:36 UTC
From:
To:
And? A POSIX filename is not a string of characters, it's a string
of bytes. You have no technical need to differentiate between the
two.

Good. It reminds me not to have filenames that I have no way of entering
into the computer.

But using it for filenames and not for everything else is not
a solution.

One example: You're leaving text files in the locale charset - but
a shell script is just another text file, and needs to reference
filenames. How do you reference a filename not in your locale
charset? Either bash does not recode it, and the name of non-ASCII
files is mojibake, or you do recode it, and it's impossible to
reference files not in your locale charset.

Making catastrophes that much more fun.

Are you volunteering to write patches for every program in Debian, and
maintain them (since the upstream author probably won't be interested
in this Debian-only scheme)?

Which is considered a mistake by many.

tell the user to handle it. Same thing you do with a disk full or a
read-only directory or whatever. You're ignoring scenarios like

Hacker: Access file <middle dot><middle dot>/etc/passwd
Program 1: Hmm, <middle dot><middle dot>/etc/passwd is not in an
illegal directory - passing through.
Program 2: Hmm, translate to Latin-16 to stick in shell script
           Convert <middle dot><middle dot> to ..
Program 3: Returning password file.

It's happened - look up the Unicode root for IIS. Willy-nilly
conversion of filenames is big trouble.
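
The same class of bug can be shown in a few lines of Python. This is a
stand-in, not the scenario above verbatim: it uses Unicode NFKC
normalization and U+2025 (TWO DOT LEADER), which normalizes to "..", in
place of the middle-dot transliteration. The is_safe check is a
deliberately naive, hypothetical validator.

```python
import unicodedata


def is_safe(path: str) -> bool:
    """Naive check: reject any path with a '..' component."""
    return ".." not in path.split("/")


request = "\u2025/etc/passwd"

# Program 1 validates the name as received.
checked_before = is_safe(request)

# Program 2 converts the name afterwards.
converted = unicodedata.normalize("NFKC", request)
checked_after = is_safe(converted)

if __name__ == "__main__":
    print(checked_before, repr(converted), checked_after)
```

The name passes validation, and only becomes "../etc/passwd" after a later
conversion step that nobody re-checks.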

#99933#616
Date:
2003-01-15 03:28:43 UTC
From:
To:
The point is, we have working "iconv", and
changing changelog will work.

man may need some hacking or other, I am not sure.
Not all of the statements made in that thread are quite true,
and I seem to remember seeing some hacks done by Ukai-san on that
respect, for UTF-8.



regards,
	junichi

#99933#621
Date:
2003-01-15 03:30:00 UTC
From:
To:

We don't remove support for legacy terminals, we are enforcing support
for them.

By moving files to utf-8, we know that if you have an iso-8859-1 terminal,
the display will accept the output of


iconv -f utf-8 -t iso-8859-1


while in the current situation we can't reliably tell the source
character set.
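
The difference can be sketched in Python. The recode helper below is a
hypothetical equivalent of the iconv call, and the single byte 0xE8 is just
a sample.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Rough equivalent of: iconv -f SOURCE -t TARGET"""
    return data.decode(source).encode(target)


# Known source charset: conversion for the terminal is well defined.
utf8_text = "\u00e8".encode("utf-8")          # e with grave accent, in UTF-8
for_terminal = recode(utf8_text, "utf-8", "iso-8859-1")

# Unknown source charset: the same byte is a different letter in each guess.
mystery = b"\xe8"
as_latin1 = mystery.decode("iso-8859-1")      # U+00E8, e with grave
as_latin2 = mystery.decode("iso-8859-2")      # U+010D, c with caron

if __name__ == "__main__":
    print(for_terminal, repr(as_latin1), repr(as_latin2))
```

Both decodings of the mystery byte succeed without error, so nothing in the
data itself says which one was intended.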


regards,
	junichi

#99933#626
Date:
2003-01-15 04:45:06 UTC
From:
To:
Yep, definitely.

I hear the other Colin is on the job :)

Hmmm...could you elaborate?

#99933#631
Date:
2003-01-15 06:17:51 UTC
From:
To:
If you do any sort of character-oriented manipulation on those names,
you will.

Well, that may be fine for you, but can you say it's fine for everyone
in the world?

I'm glad we agree on this much :)

Well, it's not an optimal solution, for sure; but it does solve some
problems, I think.  At the expense of creating others, admittedly; but I
think we can work to fix the latter.

Well, hopefully most shell scripts would not be directly referencing the
files on the system, so they will continue to work.

True enough.

No, but I am volunteering to write some patches for some programs.  I
think we might be able to get a fair number of upstreams to go along
with it.

Now, this is interesting.  I had thought that the general consensus in
the free software community at large was that UTF-8 is the only sane
charset for filenames, and to not attempt complete support for filenames
in the locale charset.  At least this is quite obviously the position
taken by GNOME.  Do you have any suitable references for projects which
take a different approach?

I highly value your opinion, since you've shown on the lists that you
are quite knowledgeable about charset issues.

Ugh.  I suppose that is possible...but ugh.

By <middle dot> I'm assuming you mean U+00B7 '·'.  It seems to me that
in the chain above, Program 1 is a trusted program; it is doing
validation on network input.  So it is a bug in that program, or its
configuration, for it to execute any programs which might do something
untrusted.

#99933#636
Date:
2003-01-15 07:41:57 UTC
From:
To:
I think our man-db and groff have been hacked in two ways:

1) to special-case Japanese locale (ja_JP.eucJP) and
act specially in that case only (using -Tnippon device)

2) to work with utf-8


I seem to remember 1 was the case in potato, or woody, breaking
use under ja_JP.utf-8.
2 was on its way, when I checked the last time, but I am not sure.

I think Colin Watson should know better about the status...

#99933#641
Date:
2003-01-15 08:56:07 UTC
From:
To:
We are not discussing changelog encoding here, see #174982.

Indeed we have iconv, this is exactly why we do not need to break things
and convert everything into UTF-8, but we can instead make sure all
strings have a defined encoding and patch our tools to perform runtime
encoding conversion.  This is how debconf works, and I don't see why
other tools can't do the same.

Denis

#99933#646
Date:
2003-01-15 11:11:13 UTC
From:
To:
2) is present in groff upstream, actually, but 1) interferes with it in
some exciting ways. We can probably manage to patch it up so that UTF-8
doesn't break quite so badly, but really it's almost impossible to get
completely correct output in all encodings from current groff, which has
historically had a hard-coded expectation of ISO-8859-1 input that
reaches quite deeply into its design. There is no (standard) way for a
document to state its encoding. groff 2.0 is planned to fix this by,
among other things, changing its input encoding expectation to be UTF-8
instead, but that's some way off yet.

man has a big table of language directories and what groff output
devices are conventional in each. It's clearly not exactly ideal, but
it's the best we've got for now.

I think it is undeniably true that the man-db/groff toolchain is not yet
ready for Debian policy to mandate UTF-8.

ja_JP.UTF-8 may be hackable in man nowadays; please send patches if you
can get it to work. :)

I can supply pointers, but Fumitoshi UKAI is the real expert on groff
encodings.

#99933#651
Date:
2003-01-15 11:15:51 UTC
From:
To:
I think this ought to be a reminder that taking a Debian-specific
approach to this and reckoning that we can probably "get a fair number
of upstreams to go along with it" is a mistake. If there isn't a
widely-accepted standard, we will just create a mess.

Are the LSB interested in working on this?

#99933#656
Date:
2003-01-16 01:30:57 UTC
From:
To:
Like what? How much character-oriented manipulation are you going
to be doing on the whole system? When you're playing with your
own files, you don't have a problem. How much fine manipulation
are you going to be doing with someone else's files?

How many people in the world, who don't speak CJK, want filenames in
Chinese ideographs? I'm a languages geek - I own dictionaries from
languages I don't know, to languages I don't know. I still don't want
random ideographs in filenames on my system. My parents? my family? They
might have to call me in for tech support. I don't know anyone who
doesn't speak CJK who would want it.

And if you want to fix it, that's easy. Switch to a UTF-8 locale.

And what if they are? Are you going to tell me that shell scripts cannot
reference an arbitrary filename on the system?

Everyone else? I don't know of an example besides GNOME that regards
filenames as UTF-8 by default -- everyone else just treats them as being
in the locale charset. It would add a lot of code to some programs to do
otherwise.

What programs convert from locale charset to UTF-8 for filenames,
or vice versa? When? Unless you can clearly and unambiguously state
when that happens, and even if you do, this problem will pop up.
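
A small Python sketch of why this is a problem: the same name gives different bytes on disk depending on which charset the creating program assumed, and a program guessing the other way can at best apply a heuristic. The function name `is_valid_utf8` and the sample filename are just for illustration.

```python
def is_valid_utf8(raw):
    """Heuristic only: True if the bytes happen to decode as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One name, as the user thinks of it ("resume.txt" with two e-acutes).
name = "r\u00e9sum\u00e9.txt"

# What actually ends up on disk depends on the creating program.
as_utf8 = name.encode("utf-8")
as_latin1 = name.encode("iso-8859-1")

# The two byte strings differ, so a lookup using one will miss a file
# that was created using the other.
same_bytes = as_utf8 == as_latin1
```

Note that the check only goes one way: bytes that fail to decode are certainly not UTF-8, but some legacy-charset byte sequences are also valid UTF-8 by coincidence, so passing the check proves nothing.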

#99933#661
Date:
2003-01-16 02:34:00 UTC
From:
To:
I don't think it would be really Debian-specific; at least the *code*
would not be.  It would be generic in that it would give programs
Unicode and UTF-8 support, which would likely be quite easy to disable.

I am a bit wary about involving them; it doesn't seem to quite fit in
with their charter.  However, I just noticed the 'Open
Internationalization Initiative', which is part of the same Free
Standards Group umbrella organization that the LSB is.  Stuff like this
does seem like it would fit in with their work; charset issues and
internationalization do go hand in hand.

However, I just looked through the most recent release of their
standard, and they appear to be silent on all the issues under debate
here; what charset to use for filenames, how to handle filenames not in
UTF-8, etc.

So...while we're investigating those organizations, given that most
(basically all) of the controversy so far has focused on filenames, I
would like to introduce a revised policy proposal which basically just
drops the section on filenames created by programs.  That way we can have
a fairly strong statement of Unicode support, but leave off most of the
"bite" until later.

This should hopefully be less controversial.  Any seconds?

#99933#666
Date:
2003-01-16 08:55:03 UTC
From:
To:
On Wed, Jan 15, 2003 at 09:34:00PM -0500, Colin Walters wrote:
[...]

Excerpt from http://www.openi18n.org/docs/html/LI18NUX-2000-amd4.htm

      portable filename character set

   The set of characters from which portable filenames are constructed.
   For a filename to be portable across implementations conforming to
   this specification set and the ISO POSIX-1 standard, it must consist
   only of the following characters:

   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

   a b c d e f g h i j k l m n o p q r s t u v w x y z

   0 1 2 3 4 5 6 7 8 9 . _ -

   The last three characters are the period, underscore and hyphen
   characters, respectively. The hyphen must not be used as the first
   character of a portable filename. Upper- and lower-case letters retain
   their unique identities between conforming implementations. In the
   case of a portable pathname, the slash character may also be used.
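
For reference, the rule quoted above is simple enough to check mechanically. A minimal Python sketch follows; the function names are made up here, and applying the leading-hyphen rule to every component of a pathname is my reading of the text, not something the excerpt spells out.

```python
import string

# Letters, digits, period, underscore and hyphen, as listed in the excerpt.
PORTABLE_CHARS = frozenset(string.ascii_letters + string.digits + "._-")


def is_portable_filename(name):
    """True if name uses only portable characters and has no leading hyphen."""
    if not name or name.startswith("-"):
        return False
    return all(ch in PORTABLE_CHARS for ch in name)


def is_portable_pathname(path):
    """True if every non-empty slash-separated component is portable.

    Assumption: empty components (leading or doubled slashes) are ignored.
    """
    components = [c for c in path.split("/") if c]
    return bool(path) and all(is_portable_filename(c) for c in components)
```

Anything outside 7-bit ASCII fails this check by definition, so the portable set says nothing about which encoding to use for such names.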

Denis

#99933#671
Date:
2003-01-16 17:39:48 UTC
From:
To:
charset to use for filenames in the *future*.

#99933#676
Date:
2003-01-17 22:49:52 UTC
From:
To:
Hi,

        Sorry for the late entry into the discussion. I am
 comfortable with making the changelog UTF-8 only, but file names in
 pure UTF-8 perhaps is premature. (मनोज्.conf, anyone?). Indeed,
 until we have a wider deployment of a font that has a decent
 coverage of UTF-8 glyphs (how many of y'all can read  ሰማይ አይታረስ ንጉሥ
 አይከሰስ። ?), perhaps we should stick to pure ASCII file names, if we
 must have policy take a stance about file names at all?

	That is not saying anything about programs that deal with
 file names having widechar and encoding support, etc. I feel, as
 integrators, we must follow, rather than lead, the majority of the
 producers of the software components we integrate.

   मनोज

ps:  ᚻᛖ ᚳᚹᚫᚦ ᚦᚫᛏ ᚻᛖ ᛒᚢᛞᛖ ᚩᚾ

#99933#681
Date:
2003-01-17 23:11:32 UTC
From:
To:
Hi,

        Just because you are using a UTF-8 capable terminal does not
 mean you can actually see a UTF encoded string.  ሰው እንደቤቱ እንጅ እንደ
 ጉረቤቱ አይተዳደርም።, though encoded in UTF, is hard for me to display. If
 you are able to see this, would you please share what fontset you
 are using?

	Now, გთხოვთ ახლავე გაიაროთ რეგისტრაცია <-- that I can see.
(Eĥoŝanĝo ĉiuĵaŭde ? 	Γειά σας? Здравствуйте!

	I would love to have some of these neat files on my system --
 but first I need to find a more capable fontset.

	manoj

#99933#686
Date:
2003-01-18 01:27:40 UTC
From:
To:
Please see my second proposal (the third in #99933), which drops the
recommendation for programs to create and read filenames in UTF-8.

Of course, this doesn't make the problem go away; we will still have
some programs creating filenames in UTF-8, and others in the locale
charset.

I admittedly can't; Evolution will have somewhat poor support for
non-Latin Unicode until it's ported to GNOME 2.  But note that UTF-8
will work quite well I think for users of Latin and East Asian
languages, because we do have good, widely available free fonts for
those.

First of all, I strongly believe policy should have a stance about file
names.  People will want to have packages containing filenames that
include non-ASCII characters.  There are something like 15-20 in Debian
now, and that number is probably small because of this encoding mess.
And if those packages want to, we need a defined encoding for doing so.
I think it is pretty obvious that UTF-8 is the only sane choice.

Second, people will want to create files with non-ASCII names on their
own computers; it would be bad if policy specified one charset, but users
were creating files in another.  But we can leave this issue aside for
now.

I understand your position.  In my latest proposal, policy is silent on
the encoding for file names to be used by programs in general.

We can fill that in later (and I think we will be filling it in with
UTF-8), but I'd really like to set up the Unicode infrastructure in
policy now.  This will also have the effect of letting people know our
intentions now, and hopefully spark a few upstream authors into adding
Unicode support.

#99933#691
Date:
2003-01-20 07:56:19 UTC
From:
To:
Just FYI, I can see both Ethiopic and Georgian in mutt, on the default uxterm
(not-so-fresh unstable), when I select the large font from the menu (in the
default configuration). Using the default font, I see only Georgian, not
Ethiopic. I would say it is decent (compared to the situation, say, two
years ago).

It is not a matter of supporting _all_ users with _all_ characters, but
of supporting _as many as reasonably possible_. E.g., my needs
for filenames are satisfied with characters from latin1, latin2, latin3,
and cyrillic (yes, I really have such filenames, and really use the
files, and would not like to transliterate them all into ASCII). The current
support of UTF-8 in woody quite satisfies me
(after some tweaking, of course, since often the default settings
are not UTF-8 friendly).
