Package: debian-policy
Source: debian-policy
Submitter: Radovan Garabik
Date: 2022-06-16 01:00:03 UTC
Severity: wishlist
The following proposed addition to policy clarifies encoding issues and
prepares for an eventual later migration to UTF-8 (see Bug#99324).
Note the use of the word "should": these are not strict requirements.
--- policy.sgml-old Fri Jun 1 11:40:16 2001
+++ policy.sgml Thu Jun 7 13:31:09 2001
@@ -1653,6 +1653,15 @@
</sect>
+
+ <sect id="controlencoding"><heading>Encoding of control files</heading>
+ <p>
+ If, for whatever reason (such as upstream author's or maintainer's
+ names, foreign language package description and similar), you need to
+ use characters outside 7 bit ASCII range in control files, these
+ characters should be encoded using UTF-8 encoding.
+ </p>
+ </sect>
</chapt>
<chapt id="versions"><heading>Version numbering</heading>
@@ -2276,8 +2285,16 @@
all.
</p>
</sect1>
+
+ <sect1><heading>Character set of <tt>debian/changelog</tt></heading>
+
+ <p>
+ Character set of <tt>debian/changelog</tt> should be either pure ASCII, or UTF-8.
+ </p>
+ </sect1>
</sect>
+
<sect id="srcsubstvars"><heading><tt>debian/substvars</tt>
and variable substitutions </heading>
@@ -7370,6 +7387,26 @@
from <tt>/usr/share/doc/<var>package</var>/</tt>.
</p>
+ <p>
+ Documentation of debian packages in text format, if written in
+ language requiring characters outside of 7-bit ASCII range,
+ should use either well-established encoding for the given
+ language <footnote>such as ISO-8859-2 for some central- and eastern
+ European languages, KOI8-R for Russian, etc.</footnote>, or UTF-8
+ encoding.
+ Maintainers are being encouraged to use UTF-8, having in mind
+ the general debian migration toward unified character encoding.
+ </p>
+
+ <p>
+ Original upstream documentation, if in encoding other than UTF-8
+ or the well-established encoding for the particular language,
+ should be converted either to UTF-8 or to the well-established
+ encoding. Choice between UTF-8 and other encoding is left to the
+ maintainer discretion, however, one package should have all the
+ documentation in one consistent encoding for one language.
+ </p>
+
</sect>
<sect id="usrdoc">
@@ -7440,6 +7477,18 @@
Other formats such as PostScript may be provided at the
package maintainer's discretion.
</p>
+
+ <p>
+ HTML documents, if in encoding other than <tt>us-ascii</tt>, should
+ have in their header an appropriate META tag describing
+ the used encoding.
+
+ Example:
+ <example>
+ <META HTTP-Equiv="Content-Type" CONTENT="text/html; charset=UTF-8">
+ </example>
+ </p>
+
</sect>
<sect id="copyrightfile">
@@ -7555,6 +7604,24 @@
changelog, then the Debian changelog should still be called
<tt>changelog.Debian.gz</tt>.</p>
</sect>
+
+ <sect id="charset">
+ <heading>Default character set</heading>
+
+ <p>
+ Names of maintainers, upstream authors and other data in
+ packages' descriptions and related debian data files (such as
+ <tt>debian/changelog</tt>, <tt>debian/copyright</tt>,
+ <tt>debian/control</tt>), as well as in English language
+ documentation, should be either transliterated or
+ transcribed to ASCII, or used in UTF-8 encoding at the
+ discretion of the maintainer. However, for names
+ in scripts based on non-latin alphabets, ASCII (or suitable
+ latin-script) version should be provided along with original
+ name.
+ </p>
+ </sect>
+
</chapt>
<appendix id="pkg-scope">
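As an illustration of the "pure ASCII, or UTF-8" rule proposed above, here is a minimal Python sketch of how a file's bytes could be classified. The function name `classify_encoding` is hypothetical and not part of any Debian tool; the sketch only shows that the rule is mechanically checkable.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for the given bytes.

    Hypothetical helper: it only tests whether the bytes decode cleanly,
    it does not try to guess which legacy encoding 'other' might be.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


if __name__ == "__main__":
    samples = {
        "plain": b"Radovan Garabik",
        "utf-8": "Radovan Garab\u00edk".encode("utf-8"),
        "latin-2": "Radovan Garab\u00edk".encode("iso-8859-2"),
    }
    for label, raw in samples.items():
        print(label, "->", classify_encoding(raw))
```

Note that a legacy-encoded file is reported only as "other": the bytes alone do not say which single-byte encoding was meant, which is one reason the proposal asks for encodings to be marked or standardised.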
On Fri, Jun 08, 2001 at 02:26:49PM -0500, Steve Greenland wrote: ... ... A package can have documentation in several different languages. What I meant is that each language is in one encoding, so that there are not two files with different encodings for one language. Of course, other languages can have other encodings (otherwise we would be forcing people to use UTF-8, which we do not want to do). Sorry for my sloppy English if the intention was not clear from the description. So I would probably say it as: "...however, the documentation for any single package should use only one encoding for one language." Does it look clear enough?
[*snip*]
Aaah, I did read the original wrong (because it did seem to imply that
all the docs in a given package needed to be the same language, which
is why I left that part of it out). I might go with:
"...however, in a single package, all the documents written in a
particular language should share the same encoding."
I won't claim that is clearly superior to your phrasing though, so pick
whichever works better for you.
Steve
Hello Radovan, Thursday, June 07, 2001, 5:58:31 PM, you wrote: Seconded.
Mea culpa... by mistake I sent my previous mail to 99324@bugs.debian.org, which was Cesar Eduardo Barros's proposal about full Unicode support, while my proposal is #99933.
Does JIS X0208 allow Chinese characters to be used together with Japanese?
There are actually two problems. The first one, which was emphasised and used as the main argument against Unicode, is in fact the less important: Unicode is not complete for CJK. That is (relatively :-)) easy to fix, just write a proposal and get it accepted.... The second problem reflects a fundamental design decision: Unicode unifies Chinese (traditional and simplified), Korean and Japanese characters, and because of differences in glyphs, an appropriate font is required to view the text properly. The situation is IMHO quite similar to German using Fraktur (Sütterlin) script: it is a Latin script, and the Unicode Consortium (IMHO rightfully) decided that it is a typesetting difference, not an encoding one (you can, and sometimes you do, typeset English text using Fraktur fonts, after all). If Germans were still using it today, you would have exactly the same problems as with CJK scripts now (of course, the complexity of CJK is much greater than that of Latin scripts). Or, a similar example: I was reading a linguistics book in Russian, and there were examples from Old Church Slavonic. To distinguish them from normal text, they were typeset in a different font, using actual ancient glyphs. Again, according to Unicode this is a typesetting change, not an encoding one (it is Cyrillic all the way).
I am really not sure if Unicode went the right way. I feel the ability to display a Chinese name in a Japanese document using Chinese glyphs (or vice versa) is something that should not be got rid of... Perhaps it should consider them to be different scripts with different encodings, but where would it stop? Making italics, boldface etc. different characters?
You cannot display all of them at the console anyway.
This is for the future. As for X11, fonts are being rapidly developed. It was there, at the end. Maybe... but let's not overcomplicate things :-) You do not know what a particular font is... one of (traditional|simplified) C, J, K, or the full font name? Not really, since ASCII cannot be used to display the particular language (take Slovak or Russian). A more appropriate example from history is the war between EBCDIC, ASCII and other proprietary encodings... thank god one and only one encoding won. The situation repeats itself: we have 2 competing encodings in Slovak, 3 in Russian... and if we want one of them to win, why not make the winner Unicode, which has the indisputable[1] advantage of being unified for the whole world? [1] Of course, problems with CJK remain and have to be addressed. And that is something terribly needed today, with this world wired together.
Thanks. I don't think so. However, JIS X0208 implies a Japanese character set and the Japanese language, while Unicode indicates no such thing. I disagree. The Han Unification issue is more like the difference between the latin and the italic character sets. Yes, many characters are similar, however there are also some characters which are unique to each representation. Also, Unicode does include Fraktur characters. And, this could be rectified -- with Unicode 3.1, they have the code space to represent each major representation of the character set. Unicode already does that. Take a look at the mathematical alphanumeric symbols [1D400-1D744]. For example:
1D400 MATHEMATICAL BOLD CAPITAL A
1D41A MATHEMATICAL BOLD SMALL A
1D434 MATHEMATICAL ITALIC CAPITAL A
1D44E MATHEMATICAL ITALIC SMALL A
1D468 MATHEMATICAL BOLD ITALIC CAPITAL A
1D482 MATHEMATICAL BOLD ITALIC SMALL A
1D49C MATHEMATICAL SCRIPT CAPITAL A
1D4B6 MATHEMATICAL SCRIPT SMALL A
1D4D0 MATHEMATICAL BOLD SCRIPT CAPITAL A
1D4EA MATHEMATICAL BOLD SCRIPT SMALL A
1D504 MATHEMATICAL FRAKTUR CAPITAL A
1D51E MATHEMATICAL FRAKTUR SMALL A
1D538 MATHEMATICAL DOUBLE-STRUCK CAPITAL A
etc. etc.
Console vs. X is not a character set issue. Note that the console has other limitations (fixed width, unidirectional). For currently relevant policy it matters what actually works. I'm not sure I understand this question (I don't know enough about oriental languages and fonts to give a full answer in any event). Latin-1 doesn't solve this problem so that's a non-issue. EBCDIC vs. ASCII wasn't about supported languages. I agree. However, Unicode is not a mature standard, so we need to be careful in places where it would cause problems. Thanks,
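The code points listed in this message can be inspected with Python's standard `unicodedata` module. The following is only a sketch for readers who want to check the names themselves; the helper name `describe` is made up for this example. It also prints the NFKC-normalised form, which shows that the standard treats these characters as font variants of the plain letters for compatibility purposes.

```python
import unicodedata


def describe(code_point: int) -> tuple:
    """Return (Unicode name, NFKC-normalised form) for one code point."""
    ch = chr(code_point)
    return unicodedata.name(ch), unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    # A few of the code points quoted in the message above.
    for cp in (0x1D400, 0x1D41A, 0x1D434, 0x1D504, 0x1D538):
        name, folded = describe(cp)
        print(f"U+{cp:05X}  {name}  (NFKC: {folded})")
```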
No, because Latin (upright) and italics are used interchangeably, whereas Fraktur carries an implicit connotation of the language used, just like different glyphs for the unified CJK charset. But in mathematical symbols, and that is a completely different beast. If only they, instead of talking about how bad Unicode is, started working on improving it (duck, run :-)). The reason and purpose of these characters is quite different from that of "base" Unicode characters. Of course. That's why my proposal is very mildly worded and gives a lot of freedom to maintainers to decide what charset they want. Well, would you indicate just "this README needs japanese unicode font" and the user has to figure out by himself what that is, or "this README needs -misc-fixed-*-*-*-ja-*-*-*-*-*-*-iso10646-1" and the user is fubar when he does not have that font? True, but the mess in encodings was quite comparable to what is there today outside of the Latin-1 world. And the peace ASCII brought could be compared to the peace that (hopefully :-)) Unicode brings one day. Of course. Nobody is talking about compulsory switching to Unicode _right now_.
I'm sorry. Not italics, but Old Italic, U10300-U1032F. This includes letters like U10308 OLD ITALIC LETTER THE (a circle with an X in it) as well as letters like U10301 OLD ITALIC LETTER BE (essentially the same as a capital roman B). Here, we could assume a common history, and define a map which relates many of the characters, much as has been done with Han Unification. Please explain why it matters to the reader whether the letter A is classified by the Unicode Consortium as mathematical [or not]? I have neither the technical skill nor the political connections to properly contribute to the Unicode Consortium. I can, however, point out major problem areas, and I like to think of that as valuable [at least to Debian -- I like to think that the members of the Unicode Consortium are already aware of these problems]. The point is that Unicode already does support the things you were suggesting as more unreasonable than indicating oriental language. Agreed. I think "needs japanese unicode font" might suffice. Perhaps a package name which includes that font would also be good. An X font spec would, of course, be necessary if you wanted a program to "just work". It depends on context. I'll accept your analogy. (In the name of peace :). Thanks,
Raul Miller <moth@debian.org> writes: IMHO, a better mechanism is Unicode 3.1 language tags, see: http://www.unicode.org/unicode/reports/tr27/#tag
Which says: The characters in this block provide a mechanism for language tagging in Unicode plain text. However, the use of these characters is strongly discouraged. The characters in this block are reserved for use with special protocols. They are not to be used in the absence of such protocols, or with any protocols that provide alternate means for language tagging, such as HTML or XML. Which implies that this mechanism isn't useful for representing different languages in the same document. That, instead, it's logically equivalent to a MIME declaration of the document's language. Maybe, in the future, the Unicode Consortium wants to change the standard so that this mechanism can be used to represent multiple languages within the same document. But that's not the current standard.
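For readers unfamiliar with the tag characters being discussed, here is a rough Python sketch of how such a plain-text language tag is put together: U+E0001 LANGUAGE TAG, followed by tag characters that mirror ASCII at an offset of 0xE0000, with U+E007F CANCEL TAG ending the tagged run. The function names are invented for this illustration, and, as the quoted text says, the use of these characters is strongly discouraged, so treat this as an explanation of the mechanism rather than a recommendation.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
CANCEL_TAG = "\U000E007F"    # U+E007F CANCEL TAG
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII at this offset


def tag_chars(ascii_text: str) -> str:
    """Map printable ASCII characters to their invisible tag counterparts."""
    return "".join(chr(TAG_OFFSET + ord(c)) for c in ascii_text)


def tag_language(lang: str, text: str) -> str:
    """Prefix text with a language tag such as 'en-us' (hypothetical helper)."""
    return LANGUAGE_TAG + tag_chars(lang) + text


def strip_tags(text: str) -> str:
    """Remove every character from the tag block, leaving the visible text."""
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)


if __name__ == "__main__":
    tagged = tag_language("en-us", "You can label text like this.") + CANCEL_TAG
    print(len(tagged), repr(strip_tags(tagged)))
```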
Because in a mathematical equation, a "script" A, for instance, is semantically distinct from a latin capital A. Fundamental, basic information is lost without a distinction between these characters. In text, italics or scripted letters for emphasis or whatever are stylistic markup, not semantic distinctions. For instance, people who chat with me on IRC can deduce my meaning whether or not I elect to use bold and/or inverse text, and in fact that's why people get yelled at when they do it.
You're telling me why the context matters. You're not telling me why the unicode naming of the code points matters. If the reader sees "Branden", why should it matter whether any underlying code points were designated by the consortium as mathematical? If the reader sees A-B, why should it matter whether any underlying code points were not designated as mathematical by the consortium?
Why are you CC'ing me? Are you interested in having a discussion of these issues, or just in provoking me by filling my inbox?
indicating otherwise. Probably not the best reason, but you did ask. FYI,
-policy when discussing pending proposals, which are assigned bug numbers, remember? That way seconders and people who later want to consult the "legislative record" regarding the adoption of a policy proposal can easily look the information up. Either the BTS should be enhanced, or you should learn to remember that I don't like private CC's on mails to lists I read, like this one. Except on this list, you don't have to remember because I provide a handy mnemonic: Mail-Copies-To: nobody X-No-CC: I subscribe to this list; do not CC me on replies. This is hardly the first time I've brought this up. Ignorance is no excuse.
Ok, we were talking about two different things. Because a mathematical letter is different from a "normal" letter. They might look alike, but (depending on typography) often do not. Well, this was not aimed at you :-) It does not. Bold mathematical symbols are quite different from bold text characters. MATHEMATICAL BOLD CAPITAL A has a very different meaning than MATHEMATICAL ITALIC CAPITAL A (e.g. one denotes a variable, the other a vector or matrix). You can make a text bold, and the meaning will remain. If you make a mathematical expression all bold, it will have a completely different meaning. And there is no such letter as MATHEMATICAL BOLD CYRILLIC CAPITAL LETTER A, since Cyrillic letters are normally not used in a mathematical context. Yet, in your favourite typesetting software, you are able to write boldface Cyrillic (since it is again a typesetting issue, not an encoding one). Well, personally, I could survive without these mathematical chars in Unicode, but neither do I have any objections to using them. Because if code points are mathematical, I parse it as B \times r \times a \times n \times d \times e \times n
Of course, this doesn't prevent other uses. But you're right that only a limited selection (e.g. not Han) of characters enjoy bold code points. So? Let's imagine you're composing an html document. What's to prevent you from wrapping a mathematical alphanumeric character with <b></b>? But if the context is not mathematical, how can you tell that mathematical code points are used? If I say xy-2yz=0, and I don't use mathematical characters, why would you not interpret that as indicating multiplication?
... that is a different kind of "boldness", used for emphasis. Bold mathematical symbols are different symbols from those that are not bold. Mathematical symbols enclosed in <b></b> are just emphasised normal mathematical symbols, not bold mathematical symbols. I cannot; therefore there are special mathematical characters to distinguish it. Because I would interpret it as a comparison in some kind of programming language, one that allows variables to begin with a digit.
Not bold, agreed, but Sha is used for the Shafarevic-Tate group in number theory, and I think it's also sometimes used in applied mathematics for certain interesting functions used in Fourier analysis. I can't think of any other examples offhand, though. Julian
Now let's imagine that a person is actually using this document. How can this person tell which kind of boldness is in use? Let's imagine that a person is actually reading this document. What difference does it make to this person that the Unicode Consortium has named the code point using the word MATHEMATICAL? How would the person even find out about this? [I guess they could do view source on the html document, then cut and paste an individual character into some search dialog box which might then be used to locate the character and (by association) the name of the character. But that seems a bit useless.] How would you know that the Unicode Consortium hadn't used the word MATHEMATICAL to describe the code points of those characters? If you didn't know about the code points which have MATHEMATICAL in the name (for example, last week), would you have had a different interpretation of this expression? If there was surrounding text describing the character of the variables x, y, and z, would you insist on this contrived interpretation of yours? If we assume that the user is using debian software which merely displays the characters (and doesn't actually inform the user of the unicode names for the underlying code points), would there be any particular reason for the user to interpret some characters as algebraic variables and others as word forming characters in some unknown programming language (for some reason other than knowledge of the unicode code point numbers)? Thanks,
Mathematical symbols could use a different typesetting convention (see LaTeX). A decent HTML browser would render mathematical symbols differently. But, of course, it need not, depending on the font used. I do not insist on it... but as you can see, without a context anything can be misinterpreted, and special symbols are just a tad helpful. This is just nitpicking... Unicode is full of characters having the same glyphs. How do you distinguish between LATIN CAPITAL LETTER A, CYRILLIC CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA? They look the same in an upright font, but if you select a cursive font to view the document, they will look different. The same goes for mathematical symbols... they might look the same with one font, but if you prefer a slanted font, you suddenly see a difference... (or vice versa). <utopia> If you select a sentence in your favourite word processing software, and apply the LOWERCASE function, you suddenly see those 3 indistinguishable letters turn into 3 different lowercase letters (ok, 2 in this case). And surprise, characters in an equation were NOT lowercased, since your software was clever enough to know it should not lowercase mathematical symbols automatically. Neither would it run them through the spellchecker. </utopia>
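The LOWERCASE thought experiment can be tried directly, since Python's `str.lower()` follows the Unicode case mappings. This is only a small sketch; the dictionary and function names are invented for the example. The three look-alike capitals lowercase to three distinct code points, while the mathematical letter has no case mapping and is left alone.

```python
# Look-alike capitals from the message, plus one mathematical letter.
LOOKALIKES = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "CYRILLIC CAPITAL LETTER A": "\u0410",
    "GREEK CAPITAL LETTER ALPHA": "\u0391",
    "MATHEMATICAL BOLD CAPITAL A": "\U0001D400",
}


def lowercase_report() -> dict:
    """Map each character name to the code point of its lowercased form."""
    return {name: ord(ch.lower()) for name, ch in LOOKALIKES.items()}


if __name__ == "__main__":
    for name, cp in lowercase_report().items():
        print(f"{name:30} lower() -> U+{cp:04X}")
```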
But this should depend on mathematical context, not code point. Or are you suggesting that latex shouldn't render ascii characters using mathematical typesetting conventions? Again, this should depend on whether a mathematical context is in use (e.g. mathematical equation). Unless (to use the same example again) you wish to prohibit the use of ascii characters in mathematical equations. I'll agree that context is important. I'll agree that special symbols are important. I disagree with the idea that special symbols may only be used in certain contexts. That's like saying that HTML should only be used to describe the structure of a document and not its appearance -- fine language from a standards body, but with little to do with how the standard is actually used. Exactly. Thanks,
Which is exactly correct. It does not appear that Raul is willing to discuss this issue rationally, therefore I will content myself with opposing his position. Hopefully the rest of the technical committee has greater respect for thoughtfully-scoped, coherent and directed standards, and not phenomena like Visual Basic, which was famously derided as being a language designed by "focus group".
If this were true, we wouldn't have emerging standards such as
XHTML to rectify the problem.
This isn't a technical committee issue, nor is it about visual basic.
At least, not currently.
However, I am sorry for allowing this to devolve into a discussion of
tangential points.
* * * * *
It was just pointed out to me by a Unicode guy, that XML has an xml:lang
attribute which can be used on any xml tag.
If we structure our handling of multi-language documents based on this
aspect of XML (and use unicode tr27 to support this same functionality
in non-XML documents) we can address the "unicode doesn't have a way of
specifying the language" issue.
But that still leaves us with the "JIS has characters which aren't in
Unicode" issue. [If that's an actual issue.]
Here is the proposal with typos and mistakes fixed, with an added paragraph about possible use of other encodings. I left out the requirement to specify a font needed to view the documentation, since IMHO that is an overcomplication and unnecessary.
--- policy.sgml-old Fri Jun 1 11:40:16 2001
+++ policy.sgml Thu Jun 7 13:31:09 2001
@@ -1653,6 +1653,15 @@
</sect>
+
+ <sect id="controlencoding"><heading>Encoding of control files</heading>
+ <p>
+ If, for whatever reason (such as upstream author's or maintainer's
+ names, foreign language package description and similar), you need to
+ use characters outside 7 bit ASCII range in control files, these
+ characters should be encoded using UTF-8 encoding.
+ </p>
+ </sect>
</chapt>
<chapt id="versions"><heading>Version numbering</heading>
@@ -2276,8 +2285,16 @@
all.
</p>
</sect1>
+
+ <sect1><heading>Character set of <tt>debian/changelog</tt></heading>
+
+ <p>
+ Character set of <tt>debian/changelog</tt> should be either pure ASCII, or UTF-8.
+ </p>
+ </sect1>
</sect>
+
<sect id="srcsubstvars"><heading><tt>debian/substvars</tt>
and variable substitutions </heading>
@@ -7370,6 +7387,26 @@
from <tt>/usr/share/doc/<var>package</var>/</tt>.
</p>
+ <p>
+ Documentation of debian packages in text format, if written in
+ language requiring characters outside of 7-bit ASCII range,
+ should use either well-established encoding for the given
+ language <footnote>such as ISO-8859-2 for some central- and eastern
+ European languages, KOI8-R for Russian, etc.</footnote>, or UTF-8
+ encoding.
+ Maintainers are being encouraged to use UTF-8, having in mind
+ the general debian migration toward unified character encoding.
+ </p>
+
+ <p>
+ Original upstream documentation, if in encoding other than UTF-8
+ or the well-established encoding for the particular language,
+ should be converted either to UTF-8 or to the well-established
+ encoding. Choice between UTF-8 and other encoding is left to the
+ maintainer's discretion, however, in a single package, all the
+ documents written in a particular language should share the same encoding.
+ </p>
+
+ <p>
+ Package may (at the discretion of the maintainer) include documentation
+ files in other encodings, if they are present also in canonical encoding,
+ and if the encodings used are clearly marked.
+ </p>
+
</sect>
<sect id="usrdoc">
@@ -7440,6 +7477,18 @@
Other formats such as PostScript may be provided at the
package maintainer's discretion.
</p>
+
+ <p>
+ HTML documents, if in encoding other than <tt>us-ascii</tt>, should
+ have in their header an appropriate META tag describing
+ the used encoding.
+
+ Example:
+ <example>
+ <META HTTP-Equiv="Content-Type" CONTENT="text/html; charset=UTF-8">
+ </example>
+ </p>
+
</sect>
<sect id="copyrightfile">
@@ -7555,6 +7604,24 @@
changelog, then the Debian changelog should still be called
<tt>changelog.Debian.gz</tt>.</p>
</sect>
+
+ <sect id="charset">
+ <heading>Default character set</heading>
+
+ <p>
+ Names of maintainers, upstream authors and other data in
+ packages' descriptions and related debian data files (such as
+ <tt>debian/changelog</tt>, <tt>debian/copyright</tt>,
+ <tt>debian/control</tt>), as well as in English language
+ documentation, should be either transliterated or
+ transcribed to ASCII, or used in UTF-8 encoding at the
+ discretion of the maintainer. However, for names
+ in scripts based on non-latin alphabets, ASCII (or suitable
+ latin-script) version should be provided along with original
+ name.
+ </p>
+ </sect>
+
</chapt>
<appendix id="pkg-scope">
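The conversion of upstream documentation that the proposal asks for amounts to a decode followed by an encode. A minimal Python sketch follows, assuming the maintainer already knows the source encoding; the function name `to_utf8` is hypothetical, and a real packaging workflow would more likely use a tool such as `iconv`.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding to UTF-8.

    Raises UnicodeDecodeError if the data is not valid in source_encoding,
    which is usually a sign that the assumed encoding is wrong.
    """
    return data.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    legacy = "Radovan Garab\u00edk".encode("iso-8859-2")
    converted = to_utf8(legacy, "iso-8859-2")
    print(len(legacy), "bytes ->", len(converted), "bytes")
```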
Raul Miller: I don't know where you got this impression, but it's wrong. Read the document. It introduces a TAG START character, ASCII-equivalent tag characters, and a TAG CANCEL character. <EN-US>You can label text like this.<DE-DE>Ja, du kannst.<TAG CANCEL>
Because in theory, MATHEMATICAL ITALIC CAPITAL A won't be available on every keyboard, nor in every font. Any software that translates ordinary, non-mathematical italic characters to MATHEMATICAL ITALICs would be non-conformant to the Unicode standard. They shouldn't obey case mappings, and HTML markup and the like probably won't and shouldn't work on them. There's no way most people will be able to enter them without setting up fairly unusual software. As a reader, you probably couldn't tell if my message was in KOI8-R and I was using the Cyrillic lookalike characters wherever possible, but that doesn't make it more correct or more likely.
Japanese can travel in China and use 'Japanese' ideographs to communicate with Chinese people who have no knowledge of Japanese. That's an indicative sign that the characters being used are fundamentally the same characters. Yes, there are characters that are written differently and unique characters; the same is true of two languages that use the Latin script. I'm not arguing that all the unifications of individual characters were correct, but the fundamental concept of unification is correct. (It's interesting that it's almost always the Japanese that complain about the unification; the Koreans and Chinese, for the most part, seem to find the variations introduced by unification to be normal. One of the main forces behind unification was Chinese, with GB 13000.)
Actually, it can't be rectified. The code space has existed for almost half a decade; the only change is that it's being used now. But part of the fundamental nature of Unicode is the unification of CJK characters. You cannot change the meaning of 50,000 characters in the Unicode standard and invalidate all Japanese/Chinese/Korean (pick two) data in Unicode, any more than you can introduce case up and case down control characters into ASCII and use the space of lower case characters for something else.
What? It's not mature? The majority of the world's desktops use, or will soon use, Unicode, as it's fundamental to Mac OS X and Windows NT/2000/ME. It's been around for ten years now, and has reached the point where it's fundamentally stagnant. Sure, there will be a few more ideographs, a few more mathematical characters, a few more obscure/dead/minority scripts encoded, but Unicode 3.1 is basically what Unicode 5.9 will be. The Unicode people are committed to not breaking backward compatibility, and with the wealth of support put by many of them into Unicode, they can't afford to change anything major. It may be wrong, but it's mature.
> But that still leaves us with the "JIS has characters which aren't in
All the characters from JIS X 0208 and JIS X 0212 are in Unicode (they were one of the original primary sources of characters for Unicode). JIS X 0208 is the character set used in ISO-2022-JP, and I believe SJIS and EUC-JP use the same set. JIS X 0213 should be completely included in Unicode, as the same Japanese body that does JIS X 0213 is the ISO 10646 liaison. I know that a number of what Unicode would consider variants of preencoded characters were encoded in Unicode for compatibility with JIS X 0213.
Radovan Garabik: When would this be necessary? The appropriate fixed font should get picked by locale (it's in xterm now; I don't know if the Debian unstable xterm has it, or if it will be in XFree 4.1 or 4.2). So the issue is only when a user is using an inappropriate choice of font (which we can't save a user from) or is reading a Chinese readme in a Japanese locale or vice versa. If this is unreadable, the knowledgeable user would know to switch fonts. At worst, it's no worse than what we have now with having to change locales and fonts to read a Chinese readme in a Japanese locale.
Raul Miller: Except that you're not supposed to use this mechanism with HTML, and unlike XML, in HTML the language can only be identified in the mime header. However, if unicode can act as a super set for every character set we currently use then we can ignore this problem for the purpose of deciding when to migrate. Do you have any idea whether the problems identified at http://support.microsoft.com/support/kb/articles/Q170/5/59.ASP have been resolved? I've not been able to find anybody knowledgeable about this issue. I don't know what you mean. Prior to Unicode 3.1 the code space was 16 bits. With Unicode 3.1 the code space has been expanded to 21 bits. In principle, at least, with the additional code space unicode can have a 1-to-1 mapping with the characters represented in the shift jis standards. Once unicode can act as a super set for every character set we currently support, we can use it as such. Until then, we can't. Thanks,
Raul Miller <moth@debian.org> That's an HTML problem. Does Debian use enough mixed language HTML to actually make that a problem? If so, it's not a problem XHTML has. Are they a problem for us? Windows Code Page 932 may or may not correspond to anything that we care about. (At a glance, at least one of each pair that both correspond to the same Unicode character is not in the real JIS X 0208.) The problems have not been resolved; they are inherent in the way Unicode was designed. Needless to say, not all the choices made for Unicode were the same as those made for CP932, and that manifests in the fact that characters do not always correspond one to one between the two standards. NO. Since Unicode 2.0, the code space has been 21 bits. The ONLY thing that Unicode 3.1 did was put characters above U+FFFF. It did not change the fundamental structure of Unicode in the least. Unicode has a one to one mapping with the characters in JIS X 0208, the basis for all Unix Japanese encodings. That it fails to completely encode some proprietary encodings is inevitable. If Unicode were a super set for every character set that anyone needs to support, it would be worthless and completely unusable. The creators also realized that a perfect proposal, ignoring backward compatibility, would go nowhere. Unicode is a carefully balanced compromise between the two problems. However, if we currently support any character set well, it is through a Unicode based glibc; I don't believe libc accepts the existence of any character set that can't be mapped to Unicode. So arguably, yes, Unicode is a super set for every character set we currently support well.
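The point about the 21-bit code space and characters above U+FFFF can be checked with a few lines of Python. This is just an illustrative sketch with made-up helper names: the highest code point, U+10FFFF, needs 21 bits, and a character above U+FFFF takes four bytes in UTF-8 and two 16-bit units in UTF-16.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point in the Unicode code space


def utf8_length(code_point: int) -> int:
    """Number of bytes the code point occupies in UTF-8."""
    return len(chr(code_point).encode("utf-8"))


def utf16_units(code_point: int) -> int:
    """Number of 16-bit code units the code point occupies in UTF-16."""
    return len(chr(code_point).encode("utf-16-be")) // 2


if __name__ == "__main__":
    print("bits needed:", MAX_CODE_POINT.bit_length())
    for cp in (0x41, 0x10D, 0x4E2D, 0x1D400):
        print(f"U+{cp:04X}: {utf8_length(cp)} UTF-8 bytes, "
              f"{utf16_units(cp)} UTF-16 units")
```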
There is no such thing as a MIME header in HTML. Besides, HTML does include the lang attribute for most elements. I wonder what it's for if not for indicating the language.
severity 99933 normal
retitle 99933 [AMENDMENT 06/07/2001] Encourage use of UTF-8 in documentation and clarify encoding issues
thanks
This proposal has 3 seconds (Arthur Korn, Roland Mas, Raul Miller). Since it has already been discussed to death, I propose a one-week discussion period (which ends on 13 July 2001). I am aware of the oncoming policy freeze; if this does not make it into woody's policy, it should be considered for inclusion in the next release.
If it's indeed the case that this is a CP 932 problem and not a shift JIS problem, and if it's indeed the case that we don't support CP 932, then I'll agree that this isn't a problem. I stand corrected. I didn't say for any character set that anyone needs to support. I said for every character set we currently support. I hope you see the difference. [And, as an aside, I should have said "for each character set that we currently support" -- I understand that unicode doesn't need to support mixed character set usage before we migrate.] Assuming we're using glibc support (e.g. toupper()) for all those character sets, I'll agree that you have a good point. I stand corrected. Thanks,
I'm not intending to include any substantive changes to policy, only "bug-fix" type proposals. Julian
Hello 99933, I second this.
With my Debian hat on, of course I see the difference. With my Unicode hat on, there is no difference. Every small group and company has their own character sets that they need supported, and Debian's just another group. Note that Unix locales tend to preferentially use standardized character sets (JIS X 0218, ISO-8859-*) which ISO 10646 had to superset completely. If you have a recent version of locales installed, look in /usr/share/i18n/charmaps, which has every character set we support for use in iconv or locales. For actual locale charsets, look in /etc/locale.gen. If you remove ISO-8859-* (which are all Unicode compatible) and remove UTF-8, you're left with 11 charsets: cp1251, tis-620, koi8-r, koi8-u, euc-tw, euc-jp, gb2312, gb18030, gbk, big5, and big5hks. 3 of these have problems: euc-tw, big5 and big5hks. All three have characters that can't be reversibly mapped to Unicode and back. euc-tw shouldn't be a problem, as its irreversible mappings are due to duplication of an entire CNS plane of characters, apparently due to an encoding quirk. big5 has some characters mapped to private use segments; I don't know if this is because glibc doesn't use Unicode 3.1 yet, or if that represents a private use segment in big5 (the characters are contiguous), or if they haven't been encoded in Unicode yet. (Unlikely, IMO).
Hmm, I searched the policy bug list, I don't know how I missed those. Probably my fault for using galeon-snapshot and expecting its search function to work :) #99324 isn't really a proposal, just a discussion. #99933 goes a lot farther than #174982. First of all, we can't even suggest that people use UTF-8 in package control fields until all our tools support it. Right now it is just plain broken to put anything but ASCII in them. I also personally don't like how it recommends using a "well-established encoding" or UTF-8. I mean, that's basically saying nothing. It doesn't help applications at all, which will still be forced to guess what encoding files are in. In short, it doesn't improve the situation at all. I think policy should be silent on the encoding for most files, until we can usefully say it will just be UTF-8. Perhaps though policy could *suggest* UTF-8, and mention that it is the preferred encoding. I do like the HTML META tag suggestion, although in the case of XHTML, it should be fine to use the encoding declaration in the XML declaration, like in <?xml version="1.0" encoding="UTF-8"?>. Yes, I think the time is getting closer. But I wanted my proposal to be small and simple, just a way for Unicode to get a foothold in policy, which we can expand later.
I have a counter-proposal to #99933, which I have attached. I believe it fixes the problems I raised with your proposal, and should also cover some new areas (like filenames). I also hopefully fixed James' issue with the RFC link. This patch supplants the one in #174982. It is more ambitious than #174982, but still does not introduce any "must"s, only "should"s or weaker. Opinions?
And I am going to use UTF-8 for Maintainer: in my packages, once I have a new stable mail address (and a new UTF-8 GPG alias). Well, the whole proposal was a compromise after a long and bloody flamewar :-) It does help users, though. Most users are strictly monolingual (English does not count) and use the well-established encoding. As my proposal does - there is just "should" everywhere, no "must". Yes, this is fine. If you manage to persuade the relevant persons (Manoj?). Good luck :-)
Yes, and it is fundamentally broken to do so, because our tools do not support it. Displaying it might happen to work on the maintainer's machine, but it will probably fail in many more places around the world, where people use terminals with a different native encoding type. Please only use ASCII until the tools support it, and file bugs against packages with control fields with characters not in ASCII. Otherwise you are just worsening the problem by adding yet another encoding to the mix of ISO-8859-1, ISO-8859-2, and who knows what else is already there. I understand that, but I think we can just avoid the issue of general file encodings for now, and only work on particular bits like distributed documentation and filenames. How does it help users? It's basically saying "the current broken situation is OK, but you may also unbreak your files if you want". Putting this in policy doesn't help anyone at all. I mean, "well established" alone is a very vague criterion. Let me ask this another way; what change do you expect to happen by saying that files may be in the "well established" encoding or UTF-8? It would basically be validating the current practice, which I consider broken. Policy shouldn't endorse it. I think a better approach is just for policy to be silent on the general encoding issue, set up a general Unicode infrastructure, start pushing UTF-8 where it is really needed (like filenames), and let the pressure build. Do you agree? So far I haven't seen any objections...
Hello,
Is this meant to apply to programs like "ls", "bash", "touch", and
"emacs"? I imagine that the transition period could be a hard time
for users who (like me) use non-ASCII characters in file-names.
As I see it, the current (broken ?) behaviour is, to use the user's
locale setting (LC_CTYPE) to encode file names. During the
transition period non-ASCII file names will have two possible
representations in the file system (LC_CTYPE vs. UTF-8). I think
we should clarify the following points before introducing the above
into policy:
1) Should interpretation of existing files' names as UTF-8
be implemented before the encoding of newly created files'
names is switched?
2) How should already existing files with non-ASCII names
be converted?
What do you think?
Jochen
Yes. That is probably true. But we really have no other choice. See below. It appears so, and yes, this behavior is completely and fundamentally broken. If you have say a Chinese friend who logs onto your computer, and he sets LANG to something like zh_TW.BIG5, then when he tries to 'ls' your files, it will completely fail. Likewise, when you try to look at his, it will not work at all. Moreover, say the system administrator does something like 'find /home'. The resulting stream will be a mixture of ISO-8859-X and BIG5, and impossible to reliably differentiate. And of course the problem doesn't just occur when you have a multiuser system; your Chinese friend could send you a .ogg file named using BIG5, and your Latin 1 system would simply fail to encode the filename. And finally, having the encoding of filenames dependent on the current locale often doesn't make sense even for a single user; what if you are a software developer in an ISO-8859-1 locale, and you want to test the Japanese translation of your software? So you run it with LANG=ja_JP.ISO-2022-JP or something to get the translations displayed. As a side effect, all the filenames on your system will fail to work. In summary, UTF-8 is the *only* sane character set to use for filenames. Major upstream software for Debian like GNOME is moving towards requiring UTF-8 for filenames, and we should too. See for example: http://www.gtk.org/gtk-2.0.0-notes.html Microsoft Windows has used Unicode for filenames for a long time because of issues like these. MacOS also uses Unicode. And like Tollef said, Red Hat 8 has already switched to defaulting to UTF-8 for new systems. I am not sure what policy can say here. For people using filenames in legacy encodings, perhaps policy could suggest that programs try to fall back to the user's locale encoding, if the filename is not valid UTF-8. This might become common practise, but I don't think policy should require it.
Again, major chunks of upstream software which have Unicode support (like GNOME), are *already* defaulting to interpreting filenames as UTF-8 by default. I am just trying to bring policy in line with best practise in this regard. There are lots of different options; we could have a package 'unicode-transition' in base which would convert all local filesystems, or we could do it as part of a base-files upgrade. But mainly, this is a technical issue separate from policy, in my opinion. We can hash out those detailed plans separately from this proposal.
On Jan 04, Colin Walters <walters@debian.org> wrote: >In summary, UTF-8 is the *only* sane character set to use for >filenames. True, but it does not work in reality for too many people, so this cannot be made mandatory. > Major upstream software for Debian like GNOME is moving >towards requiring UTF-8 for filenames, and we should too. See for >example: This is false. GNOME does not require UTF-8, it's just a default.
Colin Walters <walters@debian.org> writes: Whether or not this is broken is debatable. It is the current status quo, though, on a majority of systems. Breaking that willy-nilly is not acceptable. I'd prefer:
1. Programs are extended to handle UTF8 filenames iff LC_CTYPE is UTF8. Programs that right now cope with other charsets can keep this support if LC_CTYPE is set to any other value (even C). Filenames incompatible with the current locale must be handled reasonably.
Once this is implemented for a reasonable percentage of packages:
2. A UTF8 locale is made the default on new installations. For upgrades, scripts are provided to convert filesystem trees over to UTF8. Do a release.
3. Support for non-UTF8 charsets is deprecated, removed, or succumbs to bit rot.
Yeah, and the Gnome2 file dialog completely ignores my latin1 filenames. That's best practise? Anyway, for my daily living Gnome2 is a quite irrelevant chunk of software. aterm, zsh, xemacs, mozilla are much more important. Only half of these support UTF8 right now AFAIK. I'd guess from the 80%-software in Debian less than 50 % handle UTF8.
Note that in my proposal UTF-8 filenames are only mandatory (a "must") for files *included directly* in Debian packages or created by maintainer scripts. Since I don't think we have any packages including anything but ASCII filenames, this will not change a thing. UTF-8 filenames for programs in general is just a "should", to be eventually upgraded to a "must" when we have even more support in major programs. But now is the time to get a strong statement of support for Unicode in policy, and start fixing the remaining programs. That's true, you can set a G_BROKEN_FILENAMES variable. But we should not expect upstream authors to implement such hacks in general. G_BROKEN_FILENAMES is exactly what its name implies; a workaround for a broken system. Plus, can you imagine setting a variable for each of the different programs you use? Other operating systems like Windows and MacOS have had this problem solved for a long time. We need to do it.
I don't think so. I have put forth many real-world scenarios in which using national charsets for filenames simply breaks, in ways that are basically impossible to fix. You may be able to get away with using a national charset on a machine where everyone speaks the same language, and never interacts with speakers of another language, but that's about it. What *is* debatable is when and how to make the transition, which is what we're doing now. Again, my policy proposal does *not* (I am 95% sure) create any new RC bugs. The only "must" is for filenames actually included in packages. I actually wrote another lintian patch for this (attached) which I ran over my small sample of .debs, and found no new bugs. It requires my patch for GNU tar; see: http://bugs.debian.org/175089 Using UTF-8 for programs in general, in my patch, is just a "should". First of all, there is no need for 'if and only if'. Programs can always try to decode filenames in UTF-8, and if that fails, then try the locale's charset. Would this make you happy if I modified my policy proposal to do this? Again, note this part of my proposal is still not a "must". Your programs will not get RC bugs for a lack of UTF-8 support for filenames. I agree with this wholeheartedly. Well, you might have to set G_BROKEN_FILENAMES. But this is the whole reason we are switching to UTF-8; so programs will not have to deal with the nightmare of recoding filenames! If you feel strongly however you could lobby the GNOME maintainers to default to falling back automatically to the national encoding if UTF-8 decoding fails. I've noticed that UTF-8 sometimes makes zsh unhappy, but other than that basically all the software I use every day (evolution, gnome-terminal, GNU Emacs (well, from CVS), nautilus, and galeon) supports UTF-8 filenames.
* Colin Walters | Note that in my proposal UTF-8 filenames are only mandatory (a "must") | for files *included directly* in Debian packages or created by | maintainer scripts. Since I don't think we have any packages including | anything but ASCII filenames, this will not change a thing. You are wrong in this regard. inorwegian includes a file called bokmål (which, ISTR, has symlinks for both ISO8859-1 and UTF8)
Hm, I don't see the symlinks for UTF-8. Anyways, such an approach with symlinks would not really solve the problem. Since the files in question appear to be only used internally by ispell, it should not be difficult to recode the filenames in UTF-8; the only program that would have to be changed is ispell. Also, inorwegian seems to have scripts which assume an ISO-8859-1 environment: Setting up inorwegian (2.0-9) ... Malformed UTF-8 character (unexpected end of string) at /usr/share/perl5/Debconf/Client/ConfModule.pm line 125, <STDIN> line 8.
Colin Walters <walters@debian.org> writes: Don't you think this is a common case? I'd even say more common than your scenarios. At least common enough that it should be acknowledged. I am not concerned about RC bugs in my or others' packages. My point is that the way things have worked up to now will no longer work, and this can be avoided. This will invariably interpret some non-ASCII non-UTF8 filenames wrong. But it will condone or even suggest broken behaviour like Gnome2's. Considering old standards broken because a newer one exists is just ridiculous. I still think taking LC_CTYPE unconditionally as a hint is the best solution. People who don't care (e.g. USians) are happy with any solution. People that have it at an older encoding get some slack. People like you should already have it at UTF8 and get all the fun right away. No argument there. That's quite an understatement. The commandline editor can't deal with multibyte characters in any way. So for example entering an o umlaut and then deleting it gets you in trouble, because zsh does not handle the two byte sequence as one character. FWIW, I am quite content with mandating the contents of some files as UTF8. We may want a BOM, at the start, though.
Previously Colin Walters wrote: Right. I'm tempted to make the next dpkg release abort if people try that. Wichert.
Previously Colin Walters wrote: I second this proposal. Wichert.
I agree, it is common enough. But previously people had no choice but to use a broken hack; now we have a solution. It only "worked" for specific regions, and specific cases. We should of course try to ensure that for people using filenames with legacy non-ASCII encodings, the transition is as painless as possible. I fully understand and agree with that. That may be true. However, UTF-8 was designed so that the chance of it being interpreted as another charset was small, and decreasingly small as the length of the input increases. See RFC 2279. That's why it is a good strategy to try decoding as UTF-8 first; and if that fails, fall back to the locale's encoding. The whole point of this proposal is to move Debian more in line with major chunks of upstream software like GNOME 2. If you disagree with their behavior, please suggest an alternative to solve all the problems I named above. The old "standards", such as they were, were a workaround for the lack of Unicode support. Now that we have it, we should stop using the workaround. No. Even only English-speaking programmers like me are tired of dealing with the multitude of national encodings, and having to make our programs do stuff like unreliable charset autodetection. ISO-8859-1 and BIG5 are not solutions for filenames, they are workarounds. I'm not sure what you are saying here. Ok. Well, this should not be impossible to fix, I hope. Again, there is no mandate involved in my policy proposal. It is all just "should"s, except for file names. We don't need one for UTF-8. That's another one of the great things about it.
No, just difficult to fix without a nasty kludge.
* Colin Walters | On Sat, 2003-01-04 at 13:15, Tollef Fog Heen wrote: | > * Colin Walters | > | > | Note that in my proposal UTF-8 filenames are only mandatory (a "must") | > | for files *included directly* in Debian packages or created by | > | maintainer scripts. Since I don't think we have any packages including | > | anything but ASCII filenames, this will not change a thing. | > | > You are wrong in this regard. inorwegian includes a file called | > bokmål (which, ISTR, has symlinks for both ISO8859-1 and UTF8) | | Hm, I don't see the symlinks for UTF-8. Actually, the file names are in UTF8 already. :) | Anyways, such an approach with symlinks would not really solve the | problem. Since the files in question appear to be only used | internally by ispell, it should not be difficult to recode the | filenames in UTF-8; the only program that would have to be changed | is ispell. And any hard coded scripts using -d norsk (or -d bokmal) for getting Norwegian ispell output. | Also, inorwegian seems to have scripts which assume an ISO-8859-1 | environment: | | Setting up inorwegian (2.0-9) ... | Malformed UTF-8 character (unexpected end of string) at /usr/share/perl5/Debconf/Client/ConfModule.pm line 125, <STDIN> line 8. This is due to debconf not knowing what charset the template is in. It will be fixed.
On Jan 04, Robert Bihlmeyer <robbe@orcus.priv.at> wrote: >Considering old standards broken because a newer one exists is just >ridiculous. Agreed. >> I've noticed that UTF-8 sometimes makes zsh unhappy, [...] > >That's quite an understatement. The commandline editor can't deal with >multibyte characters in any way. So for example entering an o umlaut >and then deleting it gets you in trouble, because zsh does not handle >the two byte sequence as one character. The same applies to bash. There has been a patch in the BTS for a very long time but it has never been applied.
On Jan 04, Colin Walters <walters@debian.org> wrote: >> We may want a BOM, at the start, though. > >We don't need one for UTF-8. That's another one of the great things >about it. What do you know about international environments? Maybe you do not need a BOM because your native language needs just ASCII and you do not have any text file encoded with latin-1, but in the rest of the world the situation is quite different. I propose a new policy amendment: developers whose native language is English should not discuss i18n-related policy matters.
Well, hey, so they are. Don't know why it didn't look like it before... Hm, but if the filename is already UTF-8, what is the problem? Cool.
That would make sure that i18n is always an afterthought. You need to work *with* developers, not *against* them. How are you planning to impose an i18n policy on people who have been excluded from discussing it? Richard Braakman
Hm, the latest bash appears to work for me at least. I've been using it when I want to do UTF-8 file manipulation until zsh is fixed.
If you can make an argument that starting every text file with a BOM would be a good idea on a Unix-like system such as Debian, please do. Everything I have read argues otherwise. Unix has always treated files as just streams of bytes, and allowed you to concatenate streams with pipes. Having the BOM show up randomly, and expecting programs like 'cat' to remove it, or add it when it is missing, is too much to ask. 'cat' can't know whether its input is random binary data or UTF-8. But you don't have to listen to me, here are some arguments from Markus Kuhn against it, which I turned up in a quick Google search: http://www.rosat.mpe-garching.mpg.de/mailing-lists/perl-unicode/1999-11/msg00004.html In any case, whether or not to start every file with a BOM is basically orthogonal to my proposal, so we can discuss the BOM after the core proposal has been accepted.
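To make the concatenation argument concrete, here is a minimal C sketch. The UTF-8 BOM is the three bytes EF BB BF (the encoding of U+FEFF); the helper names `utf8_bom_length` and `has_embedded_bom` are hypothetical and only illustrate what a byte-transparent tool like 'cat' would leave behind when joining two BOM-prefixed files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The UTF-8 encoding of U+FEFF, the byte order mark. */
static const char UTF8_BOM[] = "\xEF\xBB\xBF";

/* Hypothetical helper: returns 3 if the buffer starts with a UTF-8 BOM,
 * otherwise 0. A reader would skip that many bytes. */
size_t utf8_bom_length(const char *buf, size_t len)
{
    assert(buf != NULL);
    return (len >= 3 && memcmp(buf, UTF8_BOM, 3) == 0) ? 3 : 0;
}

/* Hypothetical helper: returns true if a BOM byte sequence occurs anywhere
 * after the first byte, which is what plain byte-wise concatenation of two
 * BOM-prefixed files produces. */
bool has_embedded_bom(const char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 1; i + 3 <= len; i++) {
        if (memcmp(buf + i, UTF8_BOM, 3) == 0)
            return true;
    }
    return false;
}
```

Stripping the leading BOM is easy, but after concatenation the second BOM sits in the middle of the stream, and a tool that treats its input as opaque bytes has no way to know it should be removed.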
[ CC'd to the Debian Description Translation Project maintainer, as he may be interested ] Ok, I spent a little bit of time and hacked up some experimental patches for dpkg to support UTF-8, and to recode it to the locale's encoding type on output. If you'd like to play, see: http://bugs.debian.org/175363 http://bugs.debian.org/175370 Hopefully we can get these into dpkg soon, and at that point we can start using UTF-8 in maintainer fields and package descriptions.
* Colin Walters | On Sat, 2003-01-04 at 19:22, Tollef Fog Heen wrote: | | > And any hard coded scripts using -d norsk (or -d bokmal) for getting | > Norwegian ispell output. | | Hm, but if the filename is already UTF-8, what is the problem? It isn't in stable, which means that I want to keep compat symlinks around for at least one release. (But that is just me :) I tried to fix this last night, but sed seemed to take far too long to do anything; unsure what the bug there is. :/
On Sat, Jan 04, 2003 at 12:10:42PM -0500, Colin Walters wrote: [...] [...] So how to implement your proposal? The main issue is to patch glibc API so that filenames are supposed to be UTF-8 encoded. Has this already been discussed? Denis
What do you mean? What changes to the glibc API would be required? If you are suggesting that functions like readdir() attempt to convert filenames from UTF-8 into the user's current locale, I am completely against that. It will just exacerbate the problem.
thanks The DDTP has no problems with UTF-8 in control fields. Some maintainers use UTF-8 or something else with 'some translations' in the descriptions. This is not nice. The policy should be: use normal ASCII, and UTF-8 encoding if you use non-ASCII characters. Gruss Grisu
Well, you can reduce that to 'just use UTF-8', since UTF-8 is a strict superset of ASCII. So I agree with you, but we need to wait for this patch to get into dpkg before we can add such a rule to policy. In the meantime though, we should start removing ISO-8859-1 and friends...
<p>
Programs should expect filenames in general (whether from
a Debian package or created by the user) to be encoded
with UTF-8, although it is recommended for programs to try
gracefully falling back to the current locale's encoding
if this fails. Programs included in Debian packages
should, when creating new files, encode their names in
UTF-8 by default.
</p>
Consider a program written in C, which creates new files with open(2);
if I understand your proposal right, when a filename is not UTF-8
encoded, it should be converted into UTF-8 according to user's locale.
I am wondering how to perform this task:
a. Let open() perform this conversion.
b. Add a utility function in a common library and patch all programs
to add calls to this routine.
c. Let all programs perform their own checks.
d. ... Others?
How do you think your proposal should be implemented?
Denis
ok. Gruss Grisu
Well, broadly speaking, there are two cases: 1) Programs which do not look at the contents of filenames, and just treat them as mostly opaque arguments. Commands like 'touch' fall into this category. We should not need to change them at all; you just start passing UTF-8 instead of ASCII or ISO-8859-1 to them. Any change to glibc would break these programs. 2) Programs which do manipulate filenames. These are trickier. Now, there are several ways to make these programs handle UTF-8. For some of them, no change will be required; stuff like searching for ASCII characters still works with UTF-8. However, if these programs display them to the user on a tty, it will be necessary to convert them to the user's locale encoding (of course, once we make UTF-8 terminals standard, programs will not need to do this.) If they stuff them in a GUI widget, they will have to be sure to tell the widget that they are in UTF-8 (if necessary). No. This would certainly ensure corruption. It depends. For some programs, instead of converting the filename back to the user's locale's encoding for internal manipulation (which may fail, remember, since UTF-8 can encode far more than say ISO-8859-1), it would be better to change the program to handle all strings internally as UTF-8. For some programs this will be fairly trivial, for others it may be difficult. Another alternative is to have a small library which will first try decoding a filename using UTF-8 back into the user's locale encoding, and only if that fails, then just take the filename as-is. The best approach will depend on the program, and how it manipulates filenames. I hope that helps.
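The "small library" idea above (try UTF-8 first, otherwise treat the name as being in the locale's encoding) could start from a validator like the following C sketch. The names `filename_is_utf8` and `guess_filename_encoding` are hypothetical, and the sketch assumes the modern four-byte limit of UTF-8 (RFC 3629) rather than the longer forms RFC 2279 still allowed. It only classifies the name; the actual conversion is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the NUL-terminated byte string is well-formed
 * UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF). */
bool filename_is_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    assert(s != NULL);

    while (*p != 0) {
        unsigned long cp;   /* code point being assembled */
        int extra;          /* continuation bytes expected after the lead byte */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return false;  /* stray continuation byte or invalid lead byte */
        p++;

        for (int i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)
                return false;   /* truncated sequence (the NUL also lands here) */
            cp = (cp << 6) | (*p & 0x3F);
        }

        /* Reject overlong encodings, UTF-16 surrogates, out-of-range values. */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return false;
    }
    return true;
}

typedef enum { NAME_UTF8, NAME_LOCALE } name_encoding;

/* Hypothetical policy: valid UTF-8 is taken as UTF-8, anything else is
 * assumed to be in the user's locale encoding. */
name_encoding guess_filename_encoding(const char *s)
{
    return filename_is_utf8(s) ? NAME_UTF8 : NAME_LOCALE;
}
```

Note that this is a heuristic: a name in a legacy encoding can happen to be valid UTF-8, in which case it would be misclassified.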
Hmm. Remember the far more common case of a program that takes a filename on the command line and then tries to open it. The user would have typed it in the local encoding, so it needs conversion. On the other hand, if the program was invoked by another program then the filename is likely to already be in UTF-8. I guess this conversion should be done by the user's shell, and all filename arguments on the command line should be encoded in UTF-8. Umm, except that the shell doesn't know which arguments are filenames. How should this be done? Richard Braakman
That's true. Hm. Maybe the best approach will be to first just implement Unicode and UTF-8 support for more programs, so it is how they handle filenames (and strings in general) internally, much like how GNOME programs do it now. This is all well and good, I think. The bigger question is what to do for programs that create or rename files, especially from user input. Should they try to convert filenames back into the locale encoding? I would say no, because 1) it could fail if the locale encoding can't encode certain characters and 2) it will just prolong the brokenness. For programs like 'touch' though which do not look at the filename at all, I think they should not be changed at all. They will create a file named using the same encoding given to it as an argument. After we have a "sufficient" number of programs supporting UTF-8 natively in this way, we change the policy on filenames to a "must", drop support for legacy terminals and encodings, and switch everyone to a UTF-8 terminal, and a UTF-8 locale. My guess is that this could happen some time after sarge's release. For sarge, we could (and probably should) make the default locale for new installations be UTF-8. After we've switched to a UTF-8 locale for everyone, programs will no longer need the code to handle legacy encodings. It will probably still be useful to keep it though, because the legacy encodings will be around for a long time, and we want things to Just Work as much as possible. So again, after this current policy proposal is accepted, it will still not be a RC bug to not have UTF-8 support; but people will know that it is coming. What do you think?
Just to answer this a bit more directly; no, I think the shell should do no conversion. It should just pass its input on to programs in the encoding it received it. So for people using legacy encodings, yes, programs will receive filenames in those encodings, not UTF-8. But hopefully programs will handle it, and convert them to UTF-8 internally, and write them out as UTF-8. But if they don't, then they don't (unless we fix the program). There's not much we can do about it, until switching users to UTF-8 locales and terminals.
Besides Sebastien's reply, there is another good reason not to do
recoding in the shell: for any program which actually manipulates
filenames, we will need to add Unicode/UTF-8 support *anyway*, even if
the shell did convert everything to UTF-8. For example, any program
that used to do:
    char *c;
    /* step through the input one byte at a time */
    for (c = some_function_that_gets_user_input(); *c != '\0'; c++)
        printf("%c\n", *c);
will have to be changed to do something like:
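    const char *c, *next;
    /* step one UTF-8 character at a time; utf8_next_char() is assumed to return a pointer to the next character */
    for (c = some_function_that_gets_user_input(); *c != '\0'; c = next) {
        next = utf8_next_char(c);
        printf("%.*s\n", (int)(next - c), c);
    }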
Since we will have to change programs anyways, we might as well fix them
to decode filenames as well. The shell is kind of tempting as a "quick
fix", but I don't think it will really help us.
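For illustration, a function like the utf8_next_char() used in the second loop could be sketched in C as follows. This is a hypothetical implementation, not an existing library call, and it assumes the input is already well-formed UTF-8. `utf8_char_count` is likewise only an example of code that has to count characters rather than bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical implementation: returns a pointer to the start of the next
 * character. Assumes p points at the first byte of a character in
 * well-formed UTF-8 text and not at the terminating NUL. */
const char *utf8_next_char(const char *p)
{
    assert(p != NULL && *p != '\0');
    p++;
    /* Continuation bytes all have the bit pattern 10xxxxxx; skip them. */
    while (((unsigned char)*p & 0xC0) == 0x80)
        p++;
    return p;
}

/* Counts characters (code points) instead of bytes. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (const char *c = s; *c != '\0'; c = utf8_next_char(c))
        n++;
    return n;
}
```

With this, "Bär" encoded in UTF-8 is four bytes long but counts as three characters, which is the difference a byte-oriented loop gets wrong.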
Well, let's be clear; nothing we can do will truly work in all cases.
The vast majority of data is untagged, and charsets are not always
reliably distinguishable. We are just trying to minimize what breaks.
For the case you named above, I think what should happen is that 'ls'
converts all the arguments to UTF-8 for internal processing. For the
first argument, UTF-8 validation will fail, so ls will try converting
from the locale's charset, which will work. The rest of the arguments
will validate as UTF-8, so ls just goes on its way.
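As a sketch of the fallback step only, assume the locale's charset is ISO-8859-1, where conversion to UTF-8 is simple enough to write by hand. `latin1_to_utf8` is a hypothetical helper; a real program would more likely call iconv(3) with the charset reported by nl_langinfo(CODESET).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts the ISO-8859-1 string `in` to UTF-8 in `out`,
 * which has room for `outsize` bytes including the terminating NUL.
 * Returns the number of bytes written (excluding the NUL), or (size_t)-1
 * if `out` is too small. */
size_t latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);

    for (const unsigned char *p = (const unsigned char *)in; *p != 0; p++) {
        size_t need = (*p < 0x80) ? 1 : 2;
        if (n + need + 1 > outsize)
            return (size_t)-1;
        if (*p < 0x80) {
            out[n++] = (char)*p;                      /* ASCII is unchanged */
        } else {
            /* U+0080..U+00FF become a two-byte sequence 110xxxxx 10xxxxxx. */
            out[n++] = (char)(0xC0 | (*p >> 6));
            out[n++] = (char)(0x80 | (*p & 0x3F));
        }
    }
    if (n + 1 > outsize)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```
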
I don't think the shell does in all cases. Think about when arguments
are computed dynamically.
Generally speaking, I think the shell should just be a conduit for
bytes, and not modify them at all. Much like 'cat'.
Well, this situation can already break horribly on systems whose users
use different character encodings. So we aren't creating a regression
here, in my opinion.
We will definitely need UTF-8 support for the terminal. I know
gnome-terminal works, and uxterm works too. I don't know about support
for Linux consoles.
Fixing programs that handle terminal input is a different matter IMHO; it's something that should be decided on a more case-by-case basis, and a lot of cases might be handled effortlessly just by extending ncurses/slang. I think the philosophy should be that everything should be converted to UTF-8 after it is read from the terminal. Programs that interface with the terminal need to convert. Changing programs that handle terminal input is a far smaller scope than changing every program that touches argv and every program that does terminal input. If this route is followed then a huge swath of programs are half correct already; their only problem is that they will not be converting UTF-8 for display. That might be best handled through glibc (again, changing *everything* just to get around the lack of UTF-8 terminals is insane). Well, that's not true. At the shell level everything is tagged. The shell knows things returned from readdir are UTF-8 and things typed into the console are something else. When I say 'all cases' I mean the cases that come up in a system with only UTF-8 names in the filesystem, not one that has mixed encodings already in the filesystem; that's hopeless. Eww, that's gross: it isn't definite that UTF-8 validation will always fail for non-UTF-8 text; you could easily get lucky and type in a word that is valid UTF-8 but needs conversion! That's a terribly subtle UI bug. Consider the shell to be a scripting language just like Python/Java and look at how it's handled there: all internal strings are UTF-8, functions that read/write to the terminal convert automatically, and functions exist to convert arbitrary text/files. You have everything needed to make the shell work uniformly in any environment, but some cases might require an iconv, but the iconv is required for *all* users, not just those with different locale settings. I think that's a good goal.
The trouble is, the shell interfaces with the terminal, so it is the only thing in a position to know how to convert characters coming from the terminal to UTF-8; nothing else can do this. Jason
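The "getting lucky" case mentioned above can be made concrete. What follows is only a minimal Python sketch, assuming a user typing on an ISO-8859-1 terminal; the two-character sample string is chosen purely for illustration.

```python
# Text typed on an ISO-8859-1 terminal: A-tilde followed by the copyright sign.
typed = "\u00c3\u00a9"
raw = typed.encode("iso-8859-1")   # the bytes the shell actually receives

# The same two bytes are also a well-formed UTF-8 sequence, so validation passes...
as_utf8 = raw.decode("utf-8")

# ...but the decoded result is a different string from what the user typed.
print(raw, repr(as_utf8), as_utf8 == typed)
```

Validation alone therefore cannot tell the two interpretations apart; something outside the bytes (the locale, or a tag) has to say which one is meant.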
Hello Colin, At least I agree to this :-) I think that we need filename conversion between UTF-8 and the user's character set, because we cannot ban all non-UTF-8 terminal types. In my opinion the main problem is where this conversion should take place. Because a lot of programs are affected, it would gain us much if we could move this as deep as into libc or even into the kernel. I remember there are some questions about character sets in the kernel configuration. Are there file-systems with in-kernel character set conversion? Does anybody know: how do they solve the problems we discuss here? Where do they convert filenames, e.g. when I login via ssh and type "ls -l Bär*" from my LC_CTYPE=ISO-8859-15 system? And how is the conversion done there? Ok, I see that this is no real problem. Jochen
Hello, I think that this would be a really bad idea, because it would be too severe a restriction on the set of supported terminal types. Think of remote logins from non-Debian machines: we cannot control the program at the other end of the line. And what about serial (hardware) VT-220 terminals? We cannot change the hardware, and to lose support for it would not be nice. So in my opinion we cannot drop support for non-UTF-8 locales and terminals. We need to do file-name conversion here. Jochen
That's true, but I don't think there is really anything we can do to solve that problem. Well, such terminals should be explicitly marked as deprecated inside Debian. Actually, probably the best solution is for the terminal to be able to switch encodings at runtime; the experimental gnome-terminal can do this.
[ CC's trimmed, since mail to the bug will reach -policy ] A lot of programs don't use curses... I generally agree with that. If by 'touching argv' you mean 'modifying and creating output based on', then I hope you agree that we will almost certainly have to make those programs grok Unicode anyways, as I said before. UTF-8 is a multibyte encoding, and traversing and manipulating it correctly generally requires one to use different string functions (although stuff like strchr(foo, '.') will still work). Output is a big problem, I agree. But how exactly do you propose to modify glibc? No, it doesn't! Even if we force users to run a script which converts all legacy encodings to UTF-8, people will still have files NFS mounted readonly on other systems, files that they created using a legacy program, files on CD-ROM or DVD, etc. What do you mean anyways that everything on the shell level is tagged? How is that possible? What if I do something like this: touch $(nc www.random.org 80) But mixed encodings will happen in the real world. It is unavoidable. There is a lot of legacy data. I agree, it sucks and it's pretty gross. But I don't think there is a better solution. Yes, but even in Python/Java/C# or whatever, you don't always know the encoding for sure; what if you're opening up a Debian changelog? By default the stream will be opened using the user's locale encoding, but we already mandated that Debian changelogs be UTF-8. I don't see how you can make iconv just make everything work. As I said, I don't think the shell knows everything, and I think just modifying the shell will not fix everything, even if it did.
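The point about string functions can be illustrated with a small Python sketch that works on raw bytes, roughly as a C program would see them in argv. The filename is borrowed from the "Bär" example earlier in the thread; the variable names are made up for the illustration.

```python
name = "Bär.txt".encode("utf-8")       # the bytes a C program would see

byte_len = len(name)                   # what strlen() would report
char_len = len(name.decode("utf-8"))   # number of characters

# Searching for an ASCII byte is safe: every byte of a multibyte UTF-8
# sequence has its high bit set, so 0x2E can only ever be a real '.'.
dot = name.index(b".")

# Cutting at an arbitrary byte offset is not safe: offset 2 falls inside "ä".
try:
    name[:2].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(byte_len, char_len, dot, split_ok)
```

So code that only looks for ASCII delimiters keeps working unchanged, while code that counts, truncates or indexes characters has to become aware of the encoding.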
Hello everybody,
UNIX-style programming should continue to "just work", I like
the idea that I can download any old program written in a past
decade and just type make.
And Yes!, there are several filesystems in the Linux kernel
which do character set conversions on the fly. Specifically,
all the Microsoft/IBM compatible filesystems (*fat, ntfs, hpfs,
iso9660) allow the DOS-side and unix-side character sets to be
specified as mount options. Some versions of the smb file
sharing tools also do this. And I think there is some
conversion code in the text mode vt implementation (screen and
keyboard) too.
At least the filesystem character conversions already use
UNICODE as the intermediary format, and thus the kernel includes
an almost complete set of UNICODE to/from X conversion tables,
each as a separate module with kerneld autoload support and all.
So here is my idea of how to do it (no I have not checked what
RH or others do, but I know what MS did wrong 10 years ago and I
live with those mistakes as a cross platform programmer every
day).
1. Unless otherwise specified here, or there are very special
circumstances, all programs and libraries should assume that all
strings they receive or output (including, but not limited to
filenames) are in the same encoding, and make no externally
visible character encoding conversion. (This is usually trivial
to do, just do nothing).
2. If a program really needs to make assumptions about the
character encoding of data, it should assume the character
encoding specified by the locale. As a minimum, the following 3
cases must work correctly:
2.1. UTF8
2.2. iso8859-1+ defined as the single byte encoding where
each byte is one character, which is its own UNICODE
equivalent, and where all byte values are treated as
valid, even if the corresponding UNICODE codepoint is not
defined. (This character set is usually combined with the
C locale to allow processing of arbitrary binary data in
any unknown encoding).
2.3. any other single byte encoding where the values 0..127
are ASCII and 128..255 are graphic characters not
interpreted in any particular way.
Support for other multi-byte character encodings than UTF8 is
not required for sarge and later, but should not be removed if
it is already there. For new code, either use the libc
character handling functions, or just treat anything not UTF8 as
iso8859-1+ except when converting to/from UTF8.
Note 2.1: Code which just treats strings as binary data already
satisfies the above.
Note 2.2: Code which just checks for ASCII values such as \n, /
etc. and passes consecutive sequences of high-numbered chars
around as is, already satisfies the above thanks to the design
properties of UTF8.
3. Unless required for security or other functionality, programs
and libraries should not object to processing invalid
characters. (This increases the user's chance of being able to
deal with data in inconsistent or broken encodings, e.g. with
commands such as mv M?nch.txt Maench.txt).
However no conversions should cause bytes to be treated as an
ASCII control char unless its encoding is exactly that ASCII
byte value alone. This means not converting the "redundant"
UTF8 encodings to their shortest form, but either leaving them
as is or converting them to something harmless. ? is not
harmless, any ASCII char other than a-zA-Z is not harmless in
general context.
Note 3.1: This is trivially satisfied by code which does not
convert or check character encoding at all.
4. The low level software which converts keystrokes (or other
non-string input) to strings or converts strings to pixels (or
other non-string output), is responsible for doing so
consistently with the locale of the programs to which it
provides this service, unless those programs explicitly specify
otherwise.
For terminal-style input/output, there will be a tool or library
feature (existing or Debian-created) which does two-way
conversion of character sets around a pty. This tool can /
should be plugged into ssh, telnet, serial line getty and other
conduits which allow terminal access from terminals that might
have different locales than preferred on a given Debian system.
Note 4.1: Editors, libreadline etc. are not under this rule.
Those are just regular software which needs to count characters
(and thus check for multibyte chars in the specified encoding).
This rule is about the actual terminal interfaces, whether text
or graphic.
5. Software which persists or transports strings outside the
current process group, such as the name processing in
filesystems, should convert strings from the current locale to a
common encoding chosen by the implementor, such as UTF8, UTF16,
UTF32 or in some cases another encoding. It must be possible to
turn off the translation through an extra environment variable,
no matter what the locale or its character encoding.
For filenames or other data to which access must be possible
even if it is improperly encoded, the translation code should
include a well-defined escaping mechanism for accessing invalid
character encodings on the medium. This code must not be
enabled in other contexts, due to serious security issues (it
could e.g. allow bad people to bypass code to filter out shell
metacharacters etc.). This escape mechanism should allow things
like tar backups to just work, no matter how confused the
filenames on a disk.
A mechanism needs to be devised, either in kernel or libc, which
allows the conversion of filenames and console i/o to and from
the process locale to indeed match the process locale. A
similar or identical mechanism should be put in Xlib.
6. The base software in sarge, such as libc, Xlib, xterm must
support UTF8 variants of all locales as soon as possible.
Without this, the rest cannot even begin to be implemented.
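The concern in point 3 about "redundant" UTF8 encodings refers to overlong forms, where an ASCII character is spelled with more bytes than necessary. The following is a minimal Python sketch of why shortening them is dangerous; it relies on the fact that Python's built-in UTF-8 decoder rejects overlong sequences, and it is an illustration rather than part of the proposal.

```python
overlong_slash = b"\xc0\xaf"   # two-byte overlong spelling of U+002F '/'

# A filter that scans for the ASCII byte 0x2F sees no slash in these bytes.
filter_sees_slash = b"/" in overlong_slash

# A strict decoder refuses the sequence instead of shortening it to '/'.
try:
    overlong_slash.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# Lenient decoding substitutes U+FFFD, which is not an ASCII character.
harmless = overlong_slash.decode("utf-8", errors="replace")

print(filter_sees_slash, rejected, "/" in harmless)
```

If a converter did normalise the overlong form to its shortest encoding, a '/' would appear after the filter had already run, which is exactly the bypass point 3 warns about.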
P.S. I am not a DD, just trying to be helpful and constructive.
Cheers,
Jakob
but unless someone starts actually _using_ UTF-8, we would never know which tools are broken and which are not (I already found one bug in handling of UTF-8 GPG alias - I'll file the bugreport after some more testing). And remember, this is debian *un*stable, so some breakage is to be expected. ... Yes. But no sign of recognizing the urgent need of solving the problem either.
On Tue, Jan 07, 2003 at 09:29:44AM +0100, Radovan Garabik wrote: [...] [Could this discussion take place on debian-i18n?] Mixing legacy encodings and UTF-8 looks like a bad idea, except that we can determine whether strings are UTF-8 encoded or not. So it makes automatic conversion a bit harder, but it is not a real problem. The main problem with text files is that their encoding is not specified. All human editable text files must *explicitly* tell their encoding, either by their content (like XML/SGML/HTML) or by their file name (.txt documentation or man pages must contain their encoding in their full name, naming scheme must be standardized). This allows support for both UTF-8 and legacy encodings. (To Colin: you did not notice any problem because ASCII text is UTF-8, but problems arise with all other legacy encodings). A good example is debconf. Joey Hess added encoding information in 1.2.0, legacy encodings are currently the default, and switching to UTF-8 will take place when it is time, without any trouble. Automatic conversion to user's locale (including UTF-8) is performed on output. The only problem is that very few maintainers have managed to switch to po-debconf in order to add encoding information to their template files. A similar approach could be considered for deb control files: a new mandatory Encoding field must be added to debian/control (and automatically put in other files when needed), which tells the encoding used by all control files. Dpkg and friends may then perform automatic conversion (to UTF-8 or to current user's locale) if desired. Denis
On Jan 06, Jochen Voss <jvoss2@web.de> wrote: >Because a lot of programs is affected, it would gain us much, if we >could move this as deep as into libc or even into the kernel. I >remember there are some questions about character sets in the kernel >configuration. Are there file-systems with in-kernel character set >conversion? Do not even dare suggesting this. Changing libc would probably break POSIX compatibility, changing the kernel is a bad idea which would get nothing else than flames from kernel developers. Programs have to be fixed: file systems are just another kind of input/output and should be assumed to follow LC_CTYPE. The right approach (even if the default configuration is inappropriate) is the one of GNOME: high level libraries hide file names charset conversion from users and programmers.
On Tue, Jan 07, 2003 at 10:29:33AM +0100, Denis Barbier wrote: [...] This suggestion applies when control files contain non-ASCII characters, only problematic packages are concerned. Denis
Colin Walters <walters@debian.org> writes: Then your solution is broken. Seriously, this would be a huge problem for many people. You can't very well take an actual vt100 and do that. Even on other hardware, like older Suns, it's not all that easy. I am vehemently opposed to any proposal that renders Debian substantially unusable on existing ASCII/latin1 terminals. I think it is great to use Unicode internally, but we clearly are not pursuing the right path if we introduce such breakage. (Yes, this would mean that TERM=vt100 is now deprecated)
Testing our tools' support for UTF-8 on your local system is perfectly fine; I've been doing just that personally. But, ... Uploading packages with UTF-8 control fields is not ok. It will, simply put, not work for anyone who's not using a UTF-8 terminal, which is unfortunately probably most of our users at the moment. Just Don't Do It. If you really want to help push UTF-8, apply my dpkg patch, help find/fix bugs in it, then start ensuring apt-get, aptitude, etc., all grok UTF-8. Actually I think we should probably move to -devel, given how strongly this affects the system in general. Even people who maintain programs which care little for i18n will still have to deal with UTF-8 filenames, and should be UTF-8 aware in general. It looks to me like at this point almost everyone agrees with the content of my proposal in #99933, and we are discussing implementation details. Agreed? If so, another second would be cool :) And also if that is the case, then it makes a better argument for moving to -devel. Not with perfect reliability. You mean like changelog.txt.UTF-8 or changelog.UTF-8.txt ? I am pretty much opposed to any sort of proposal of this form. The reason is that changing programs to recognize our arbitrary scheme for file encodings will not only be a lot of work, but instead we could add support to programs to autodetect the charset semi-intelligently from file content, which is what programs like Emacs in the real world do today. Actually I quite frequently notice problems with European names, as well as the copyright character. Do not assume that because my native language is English that I do not experience charset problems :) Ugh. I am generally quite opposed to adding an Encoding field, and I bet you'll find the dpkg maintainers are too. It should just be UTF-8, period. If developers really want to, they can generate control from a control.in file by using iconv or similar.
On Tue, Jan 07, 2003 at 10:23:14AM -0500, Colin Walters wrote: [...] No. We agree that UTF-8 support must be dramatically improved, but legacy encodings must be supported too. [...] I was unclear, and only speaking about files shipped by Debian packages which contain non-ASCII characters without specifying their encoding. Users can do whatever they want with their data. I mostly have txt, man and info pages in mind. IIRC *BSD put man pages under .../man/<language>.<encoding>/, don't they? Info pages are never translated. The only text files with non-ASCII letters I encounter are documentation and can be safely renamed, but maybe there are others. Then why do you patch dpkg to support UTF-8 input if it can guess encoding? Denis
But the current situation is *already* broken! For example, for a Chinese person, an ISO-8859-1 system simply cannot encode, nor display, their language. I am aware that for people entrenched in legacy charsets like ISO-8859-1, the transition may introduce incompatibilities. But that's the price we pay to eventually make everything work for everyone. It is the only path to the future. Note that in my proposal, I do suggest that programs try to re-encode from UTF-8 back to the user's locale charset.
Colin Walters <walters@debian.org> writes: I don't disagree. I'm saying that your solution is worse than the problem. True. However, if the terminal only supports ISO-8859-1, there's no way to make it magically display Chinese characters. It's a limitation, and Unicode or not, there is no way around it. "may introduce incompatibilities" is something of an understatement. "Break compatibility with 50 years' worth of computing and almost every other vendor" is more accurate. I do not buy that for one minute. Surely it is possible to translate things back to a character set the terminal actually supports? Is that not why we have the "@UTF8" designator for our LANG settings? Perhaps you mean "it is EASIEST to break compatibility." That may be true. That is also the wrong motivation.
Hello,
I do STRONGLY DISAGREE with
... Programs included in Debian packages
should, when creating new files, encode their names in
UTF-8 by default.
We shouldn't start this before all/most programs can handle
the generated file names.
Jochen
I suggest that no decision should be made about man pages until groff 2.0 is available, when proper encoding support will actually be practical as opposed to the hacks we have today. Until then it will not be at all clear to me how things should work.
Sorry, we have to start somewhere. Unicode is the way of the future, and if we wait until every vendor of some random terminal updates it with support for UTF-8, we will never start. Now is a good time, since (again) major chunks of upstream software included in Debian like GNOME are making a major push towards UTF-8. Well, that's what we're going to do. If we change programs to output to the terminal in the locale's encoding, then yes, it will work, at least if the terminal's charset covers all of the characters in question (which it may not). Not sure how this is related to what you're saying. We will try to preserve compatibility as much as possible.
If I drop this from my proposal, will you support the rest of it then? I should note however that many programs are already creating file names in UTF-8 today; like pretty much any program which uses GTK+ 2 for instance (including all of GNOME).
Colin Walters <walters@debian.org> writes: I don't disagree that we should move to Unicode. I disagree that such a move must inherently remove support for legacy (or even, the majority of CURRENT) terminals. Sorry, this discussion is about what we're doing, isn't it? I don't recall seeing "Colin Walters, Debian Dictator for Life" voted on anywhere. What "change programs?" That's what they do now. Yet your own proposal breaks compatibility with, let's see, EVERYONE?
See http://mail.nl.linux.org/linux-utf8/2003-01/msg00037.html It would be nice to make sure programs are ready before switching everything to utf-8. Denis
If you're using a terminal that can't support UTF-8, you always have the option of running something like GNU screen to translate the system charset to the terminal charset. It seems more important to get a systemwide encoding working, then worry about the minority who use physical terminals.
Not inherently, but stuff will likely break. How much it breaks is inversely proportional to how much work we put into it. Ah, you must have missed the rider in the small font in my last policy proposal :) Seriously, I didn't mean it that way; I just meant that I think everyone has generally accepted that UTF-8 is the way of the future; we're just debating when, where, and how. I don't think most do. dpkg for example doesn't. 'ls' for example doesn't. No, for people using UTF-8 today, like me, it increases compatibility :) And remember, (not to sound like a broken record, but) lots of upstream software is moving to UTF-8. Compatibility with systems using legacy charsets is already broken to some extent.
Sure...but remember that my policy proposal does not drop support for legacy charsets; in fact it recommends that programs try falling back to them if UTF-8 decoding fails. I see this policy proposal as a strong statement that Debian is moving towards Unicode, not as a means to get packages which don't grok UTF-8 removed from Debian or something silly like that. Implicitly in this is that we will support legacy encodings to some extent for a while. Do you agree? Ok. Agreed completely. They can have their data in any encoding they want, as long as it's UTF-8. :) (just kidding...) Ah, OK. I think that improving how our documentation formats specify charsets is a great goal. I misunderstood your proposal. Er...my patch was to support outputting UTF-8 to the user's terminal. There was no input involved. I think you may have confused something somewhere, but maybe I just wasn't clear about what it does...
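The fallback behaviour described here (try UTF-8, fall back to a legacy charset if decoding fails) can be sketched in a few lines of Python. The function name and the choice of ISO-8859-1 as the default legacy charset are assumptions made for the example; a real program would presumably take the legacy charset from the user's locale.

```python
def decode_with_fallback(raw: bytes, legacy: str = "iso-8859-1") -> str:
    """Try UTF-8 first; if the bytes are not valid UTF-8, assume the legacy charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)

# The same name arrives correctly whichever of the two encodings it was stored in.
print(decode_with_fallback("Bär".encode("utf-8")))
print(decode_with_fallback("Bär".encode("iso-8859-1")))
```

The heuristic is not airtight: legacy text that happens to be valid UTF-8 is taken as UTF-8, which is the ambiguity raised earlier in the thread.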
That is interesting advice. I am not sure I understand exactly how it would work though. Would you just tell screen that all input is in UTF-8? It seems like this would not be true if the user has legacy filenames, and they do something simple like 'ls'...
Cool. I will say this much; I simply did not even consider doing this kind of character set conversion as part of glibc or Linux. It just seems like such a horrible kludge that would not actually work in practice. Fundamentally, glibc and Linux cannot know what charset the application itself works in. You might have stuff that undergoes UTF-8 conversion *twice*, once by the application and once by glibc for example. It just seems like a recipe for disaster. because you can't just use your same old C library string functions on UTF-8. I know it seems tempting to just stick some code into glibc, but I have serious doubts that will ever work in anything resembling a reliable fashion. Feel free to prove me wrong of course! I think that it quite simply does not work. What conversion? GNOME apps speak UTF-8 natively, and that's about all they speak unless you set the G_BROKEN_FILENAMES environment variable.
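The double-conversion failure mentioned above is easy to reproduce. This is a minimal Python sketch in which the "second layer" stands in for any library that wrongly assumes its input is still ISO-8859-1; it does not model glibc itself.

```python
original = "Bär"
once = original.encode("utf-8")        # the application already produced UTF-8

# A second layer assumes the bytes are ISO-8859-1 and "converts" them again.
twice = once.decode("iso-8859-1").encode("utf-8")

shown = twice.decode("utf-8")          # what a UTF-8 terminal would display
print(len(once), len(twice), shown)
```

The result is still valid UTF-8, so nothing fails loudly; the text is simply wrong, which is why this kind of bug tends to go unnoticed until a user sees it.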
Naïve, simple, classic UNIX-style programs are ASCII-only. Then someone got the idea to bolt this huge "locale" kludge on top of all of it. It is not something to be proud of or emulate. Yay for broken software. This is the way things currently work; it is also exceedingly broken. I think that if you are writing a program today, it is saner to assume UTF-8, since that is the future direction. I believe that the programs to which you might need to pass invalid characters will also be the programs which will not look at or manipulate the filenames anyways. 'mv' is a good example of a program which we will *not* need to change. It just basically takes its arguments and passes them to the rename system call (well obviously it is more complicated than that, but that's the basic idea). I generally agree. Such a tool could save us time (perhaps this tool already exists in the form of GNU screen, as mentioned by David Starner), but note we can't really force users to use it. Ugh, I am opposed to any sort of environment variable like this. I think it will not be necessary, and will complicate the implementation. Not sure how this "escaping mechanism" would be possible, or what it would even really do. I think it might make sense to have common library functions to do stuff like this in glibc. It already does. I just tried uxterm again for the first time in a while, and I'm really impressed with its current level of UTF-8 support. It can do almost all of UTF-8-demo.txt on my system. Thanks for your comments.
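On the question of what an escaping mechanism for invalid filenames could look like, one existing illustration is Python's "surrogateescape" error handler. It postdates this thread and is not what was proposed here; the sketch below only shows that a lossless round trip for badly encoded names is possible.

```python
raw_name = b"M\xfcnch.txt"   # "Münch.txt" in ISO-8859-1; not valid UTF-8

# Each undecodable byte is mapped to a lone surrogate instead of raising an error.
escaped = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly,
# so a backup tool could still open the file under its real name.
restored = escaped.encode("utf-8", errors="surrogateescape")

print(repr(escaped), restored == raw_name)
```

The escaped string is deliberately not valid Unicode text, which keeps it from being mistaken for a properly encoded name elsewhere.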
At 01:10 AM 1/8/2003 -0500, Colin Walters wrote: Well, screen should react in the same way that any UTF-8 terminal should react. (There's a specification that not all of them follow, but all of them I've tried handle it non-catastrophically.) The suggestion was how to handle legacy terminals in a UTF-8 world. As for legacy filenames, I'd think that it would be easiest for each system to declare a flag day, and change over to UTF-8. (zsh be damned -- they've had plenty of time to figure how to properly handle UTF-8.) I've submitted bugs on packages for having filenames not in ASCII, so for the most part Debian's filenames won't be a problem. There is no way for a POSIX filesystem to tag filenames with encodings, so there is no option for this to be a clean changeover, especially as there's no clean state to start from. David Starner - starner@okstate.edu (starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work, but is not suitable for high-volume traffic.)
Forget about screen and use filterm (from the konwert package). I am using it from time to time on legacy terminals with great success: "filterm - UTF8-iso1" is all you need to make your Unicode setup work on a legacy iso-8859-1 terminal. It even converts characters to their most appropriate iso1 equivalents (strips diacritics, transliterates Cyrillic etc.). This is not some hypothetical option like most of what is proposed; I have really been using it to read Russian, Slovak etc. on broken or old terminals.
That is what I am doing now. (Except the dpkg patch which I am going to play with if I find some time) I lost count how many times I already had this discussion on -i18n, -devel and whatever else. The consensus was ALWAYS "OK, that is nice but just wait until the tools support UTF-8, and besides, I do not care about it". So we waited and waited until RedHat (much as I dislike RH, I applaud their effort for switching to UTF-8) and it is no longer a question of making the "proper" progressive decisions, but a question of not falling back too much when compared with RH. I would like to. Though I am not sure about others. I completely agree.
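The "strips diacritics" part of such a filter can be sketched in Python. This is not filterm's implementation; the function name is hypothetical, and the sketch leaves out transliteration of Cyrillic, which would need a mapping table that is not attempted here.

```python
import unicodedata

def to_latin1_terminal(text: str) -> bytes:
    """Encode for an ISO-8859-1 terminal, degrading characters it cannot show."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            # Decompose (e.g. s-caron -> "s" + combining caron) and keep the base letters.
            base = "".join(c for c in unicodedata.normalize("NFKD", ch)
                           if not unicodedata.combining(c))
            # Anything still outside ISO-8859-1 becomes "?".
            out += base.encode("iso-8859-1", errors="replace")
    return bytes(out)

print(to_latin1_terminal("Garabík"), to_latin1_terminal("š"), to_latin1_terminal("Ж"))
```

Characters that ISO-8859-1 can represent pass through unchanged, so only the text that the terminal could not have shown anyway is degraded.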
Unicode did not exist until fairly recently. Lots of useful software was written prior to its introduction.
It's not just physical terminals we're talking about here. We're talking about the vast majority of the state of the art terminal emulators *today*. Debian's latest stable release does not use Unicode by default in either KDE or Gnome, AFAIK. The console in the latest stable release does not use Unicode by default either. Then we have all the other Linux distros, plus Solaris, AIX, AS/400, etc, etc, etc. Hell, we're doing good to get some things to support *ASCII*.
At 02:32 PM 1/8/2003 -0600, John Goerzen wrote: I'd have a hard time describing a terminal emulator that doesn't support UTF-8 as "state of the art". Recent versions of xterm, gnome-terminal, and the KDE terminal all support UTF-8. No one said that we were going to remove non-UTF-8 locales in Sarge. The console can be switched into UTF-8 mode with one command - unicode_start. AS/400? We don't support EBCDIC. We'll be losing more compatibility with Mastodon Linux, but we can't run a.out anymore, so it's really a moot point. As for the rest of them, most of them are ahead of us in UTF-8 support - RedHat, Solaris, AIX. What about Mac OS/X and Windows? Both of them are far ahead of us in UTF-8 handling. Then those programs shouldn't be in Debian - Hamm made being 8-bit clean a release critical property. Being 8-bit clean isn't good enough for a large part of the world to use their native languages, and is a pain for the rest of us who are mathematicians, linguists, scholars or travelers. If it was written prior to Unicode, it's useless to the Ethiopians and the Iranians and a large part of the rest of the world; it's likely to be useless to the Japanese and Chinese as well. We can support non-UTF-8 terminals - as Radovan pointed out, the tool is filterm. If you want to support an older terminal, that's the easiest place to do so; you can't afford to muck around in kernel and libc or in every program for that. David Starner - starner@okstate.edu (starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work, but is not suitable for high-volume traffic.)
Yes, there are UTF-8 versions available. Does everyone have them? Do we enable them by default? Do all other vendors ship them? The answer to all of these questions is No. Colin was advocating what amounted to exactly that. He was advocating removing all support for non-UTF8 terminals. AS/400s do support ASCII :-) I was making a joke, not meant to be taken seriously (and it was referring to the AS/400) I don't buy that at all. Lots of programs are simply pipes, working with data going in, echoing it back out. Colin asserted that ls was broken because it doesn't handle Unicode. I submit that ls has always handled Unicode; if the filename is encoded with Unicode and your terminal is Unicode, it will show it in Unicode. It doesn't have to be made specifically aware to just shlep some data onto the screen. Then let's do that, and not consign the rest of the world to the junk bin.
my present policy proposal introduces is for filenames included *directly* in Debian packages, or created by maintainer scripts. Everything else is just a "should" or less, for now. Could you reread my policy proposal again, please? Broken? Not necessarily. But suboptimal? I think so. True enough. But we could make the transition easier and increase compatibility with legacy setups by making 'ls' and friends recode output. I fully, completely agree.
At 05:03 PM 1/8/2003 -0600, John Goerzen wrote: Everyone who has the most recent version. They're enabled by default if you're running a UTF-8 locale, like they should be. Can we control this? If you're sitting at a computer that doesn't have a new terminal, you can run filterm or install a newer xterm. But not in Sarge. No argument here; it would be nice if ls would escape invalid byte sequences and bad characters, but it's not broken. But we do do that -- we have filterm in the distribution. A filter between the terminal and the system is the easiest place to solve this problem. David Starner - starner@okstate.edu (starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work, but is not suitable for high-volume traffic.)
Hello, I want to challenge the "everyone" in your sentence above :-) I agree that it would be a good idea to store filenames as UTF-8 in the filesystem. But I (being a part of "everyone") do not agree, that we should even try to switch every terminal in the world to UTF-8. We do need conversion of file names somewhere between the filesystem level and output. Jochen
Well, I do agree that conversion should occur. In reality though, not all programs will be fixed to do this, and not all terminals will be converted to UTF-8 either. We just want to maximize both in an attempt to minimize breakage.
A Posix filename is a null terminated byte string (sans '/'). Any widescale conversion is going to cause aliasing issues and other bugs, whether or not we stay Posix compatible. Just as important, conversion is not an issue for debian-policy; linux-utf8@nl.linux.org (the primary Unicode-Linux discussion list) is strongly against it, and I believe the people who matter - the ones who work on the kernel and libc - are generally against it. I'd been interpreting this part of the policy amendment as saying "You shouldn't have filenames in packages (or created by packages) in non-UTF-8 encodings." (I'm not generally a fan of filenames in non-ASCII UTF-8, but at least it's consistent.) If we're talking about what programs output, it should use whatever name and encoding the user asks for. We can't dictate what encoding end-users use; just what Debian packages use internally. David Starner - dvdeug@debian.org (starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work, but is not suitable for high-volume traffic.)
Right. Did the people on that list come up with any general plan for how GNU/Linux vendors should transition? I suppose I should subscribe to that list... Well, that's not quite right. For filenames included directly in Debian packages, or created by maintainer scripts, my policy proposal says they *must* be UTF-8. For files simply created by running programs, it is just suggested that they be UTF-8, for now. Are you saying that programs should attempt to convert filenames back into the user's locale encoding in the actual filesystem, or just that they should recode them for output?
At 10:29 PM 1/9/2003 -0500, Colin Walters wrote: Not anything written up that I know of. Debian-i18n has a large cross membership, which was part of the reason this should be on debian-i18n. Console programs should not recode them period, except possibly for annoying stuff (newlines in names and the like). Locale-dependent GUI programs should probably do the same. GNOME and KDE may save them as UTF-8, but that's questionable behavior; arguably, if you want to use GNOME and KDE you should be using a UTF-8 locale, which would solve the inconsistency. David Starner - starner@okstate.edu (starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work, but is not suitable for high-volume traffic.)
Ok, if people want to move this discussion that's fine by me. If we're talking about the filenames, then I agree. What do you expect GNOME programs to do? Since they fully support UTF-8, you can input any Unicode character you want. Also, a program like Evolution may receive a file in mail whose name uses Unicode characters. And a lot of locale charsets (like ISO-8859-1) will not be able to encode the string. The only sane solution is to just use UTF-8 for filenames. But I am curious about your feelings on programs writing data in general to the terminal; you feel they should never convert it to the locale's charset, and we should just mandate that people using legacy terminals use that filterm or whatever thing?
At 11:55 PM 1/9/2003 -0500, Colin Walters wrote: You can input any Unicode character you want, but you probably have to go out of your way to input something outside your charset (i.e. probably not on your keyboard or standard IM.) If I receive a file in the mail whose name is not in ASCII (which has never happened to me), I would rename it before saving it, so I could access it easily. How many people in a Latin-1 locale who got an email with a Chinese file name would want it saved with the original name? A simple hash - say, out of charset characters to _ - would probably be fine. If you're dealing with a web browser, or a mail reader or anything else that handles tagged data, it should convert it, of course. Anything else should be in the locale charset, and manually recoded if necessary. I'm not sure I really understand what you're asking here. David Starner - starner@okstate.edu (starner@okstate.edu may be disappearing soon - dvdeug@email.ro will work, but is not suitable for high-volume traffic.)
Naive, simple, classic UNIX-style programs (if 8 bit clean) will
implicitly handle UTF8, latin-1, latin-2, Korean DBCS, Arab,
Hebrew, most old DOS codepages, and generally any encoding which
includes ASCII as a proper subset. The notable exception is
certain Japanese DBCS encodings, which allow ASCII character
encodings to have a different meaning if preceded by the wrong
byte values. I am not sure if the common Chinese DBCS encodings
are safe like Korean or unsafe like Japanese.
This is what I want to keep working.
But this pleasant situation presumes that all the system
interfaces (terminal, filesystem, Xlib ...) happen to use the
*same* encoding at any given invocation of the program, at least
as far as input/output to that program is concerned.
So my detailed proposal is about getting UTF8 support to work
without breaking this basic programming assumption.
Again, I assume that the program is 8 bit clean or I would have
to restrict my input to ASCII anyway today. But if I do
restrict my own input to ASCII for such a broken program, the
system should do nothing which may increase the breakage beyond
that manual workaround.
To understand my concrete proposal, it should be seen in the light
of the following general transition plan:
Step S1. Get all the ultra-core software to support UTF8 (items 4
and 6 in the proposal).
Step S2. Now maintainers of other software will have a
reasonable environment in which to start implementing and
testing that their code works with UTF8 variants of locales.
And users can actually use such locales without massive
breakage.
Step S3. Make all Debian packages work correctly in the presence
of UTF8 locales. Proposal items 1 to 3 are about making this as
trivial as possible, with 90% plus of current packages (both
source and binary) needing no change at all.
Step S4. While implementing S3, work on creating solutions which
allow processes running in UTF8 locales to interoperate with a
world, where some systems and users will continue to use other
encodings anyway for many years to come.
Proposal item 5 says that this is the responsibility of the few
pieces of software actually interfacing with the outside world,
not of the many pieces of neutral software which may or may not
happen to be used in those situations.
Proposal item 4 emphasizes that simply having a user interface
(such as libreadline in the shell, ncurses in some full screen
text mode programs, Athena or Motif/lesstif widgets in X
programs) does not put a program in that category.
Thus character conversion should be done at the very edge of the
system: In the local terminals (vt, xterm, Xlib), in remote
terminal access software (ssh, telnet, tty wrappers for serial
lines, Xlib for remote X terminals), and in physical storage
interfaces (already partially in the stock kernel for non-UNIX
filesystems).
Step S5. Make UTF8 locales the default.
Step S6. Subject support for other encodings to bit rot, not
deliberate removal.
If I set my locale to UTF8, use a
UTF8 terminal and all my filesystems present UTF8 at the system
call level, everything works. If I set my locale to latin-1,
use a latin1 terminal and all my filesystems present latin1 at
the system call level, everything works too. If I set my locale
to the predominant Japanese DBCS encoding, use a Japanese DBCS
terminal and all my filesystems present Japanese DBCS at the
system call level, almost everything works, unless I use one of
the few characters whose DBCS encoding abuses the byte values
normally associated with e.g. "/", or "\\" . And yes, I do use
all of these variations on some of my machines, even though I
don't speak the Japanese language personally.
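
The difference between "safe" and "unsafe" multi-byte encodings can be checked mechanically. The following is only a sketch: the helper name and the sample characters are chosen for illustration and are not part of the proposal. It relies on Python's standard "utf-8", "euc_kr" and "shift_jis" codecs and reports which ASCII byte values turn up inside the encoded form of non-ASCII characters.

```python
def ascii_bytes_inside_multibyte(text, encoding):
    """Return the set of ASCII byte values (< 0x80) that occur inside the
    encoded form of the non-ASCII characters of `text`.

    An empty set means a naive 8-bit clean program can scan the bytes for
    '/', '\\' etc. without being fooled by this text in this encoding.
    """
    found = set()
    for ch in text:
        if ord(ch) < 0x80:
            continue  # a genuine ASCII character, not interesting here
        for byte in ch.encode(encoding):
            if byte < 0x80:
                found.add(byte)
    return found


if __name__ == "__main__":
    japanese = "ソ表"   # two characters often cited as troublesome in Shift_JIS
    korean = "한글"
    for enc, sample in (("utf-8", japanese), ("euc_kr", korean),
                        ("shift_jis", japanese)):
        hits = sorted(hex(b) for b in ascii_bytes_inside_multibyte(sample, enc))
        print(enc, hits)
```

For these samples only the Shift_JIS run reports a hit, and the byte it reports is 0x5C, the value ASCII assigns to the backslash.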
If the locale says UTF8, then assuming UTF8 is safe. If the
locale is not UTF8, assuming UTF8 is VERY broken. My proposal
went on to say that supporting the UTF8 setting correctly is the
most important case to implement, but a neutral 8-bit clean mode
must also be available, which will handle most other encodings
implicitly. Support for legacy DBCS encodings is not required
at all, because it may be too difficult to add to programs in
some situations, and users can soon get around by using UTF8 for
those languages.
Here is a simple example:
/bin/more needs to count the number of encoded characters in
order to determine, when lines will wrap and thus when to pause
output. So /bin/more must recognize the UTF8 (or other charset)
values which indicate multi-byte encodings representing a single
character. It may even need to know about zero and double width
characters. But whatever it does, it should not refuse to pass
through unmodified any non-UTF8 data I might feed it, because I
probably have a reason to do that if I do (maybe my LOCALE
variable says UTF8 by mistake, maybe my super-smart terminal
does dynamic character set recognition, maybe I am piping binary
data through it and it will be processed by the next filter in
line). The same applies to multi-column /bin/ls output, or to
my text editor.
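
A rough sketch of that behaviour, in Python rather than in the C that /bin/more is written in: the function names are made up for this illustration, and treating "decodes as UTF-8" as the test for UTF-8 input is an assumption, not something the proposal specifies. Tabs and control characters are ignored for brevity. The point is only that the width estimate is computed on the side while the bytes themselves are written out untouched.

```python
import unicodedata


def display_width(line):
    """Estimate the terminal columns needed for one line of bytes.

    Valid UTF-8 is measured per character (combining marks count as zero,
    East Asian wide/fullwidth characters as two). Anything else falls back
    to a neutral 8-bit mode: one byte, one column.
    """
    try:
        text = line.decode("utf-8")
    except UnicodeDecodeError:
        return len(line)
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # zero-width combining mark
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def copy_and_count_rows(src, dst, columns=80):
    """Copy binary stream `src` to `dst` unchanged and return the number of
    screen rows the data would occupy. A real pager would pause based on
    this count; here it is only returned."""
    rows = 0
    for line in src:
        width = display_width(line.rstrip(b"\n"))
        rows += max(1, -(-width // columns))  # ceiling division
        dst.write(line)  # the bytes are never modified
    return rows
```

Fed a mixture of ASCII, UTF-8 and arbitrary binary lines, the output stream is byte-for-byte identical to the input; only the row estimate differs between the cases.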
A very well known example is perl 5.8. Many existing perl
scripts process pure binary data using string functions. This
broke unnecessarily when perl 5.8 started to assume all string
data to be valid in the user's character set and did
non-reversible conversions to it in order to do UNICODE
internally. The proposal says that any future changes to
software should not make this mistake.
The idea is, that those Debian packages, which provide the
interfaces to external terminals (telnet, ssh, serial line
variants of getty) should be packaged to invoke the tool or
feature implicitly by default, thereby causing all terminals to
look like UTF8 terminals (if LC_CHARSET=UTF8), even if external
computers or hardware terminals are really not.
Since Debian is Free Software, users still have the freedom to
break things, but they should not be broken as shipped.
There are some real world tasks (mostly related to system
administration, crash recovery, backup etc.), where the ability
to directly access the raw encodings of filenames etc. is vital,
but correct graphic display of some characters is not. Such
tasks need to run with character set translation turned off, and
ditto for any other unwanted "automatic" assistance. A good
example is your hypothetical script to convert on-disk filenames
to UTF8 by renaming files: this tool obviously needs to bypass
UTF8 translation in order to access the old filenames in the
first place. Another is tools which relate raw disk blocks to
the output of e.g. /bin/ls or to filenames specified by
"/sbin/fstool *.bak".
This is actually one of the big MS mistakes around 1990. When
they implemented Windows 2.x/3.x/9x on top of MS-DOS, they
switched from the old IBM/DOS encodings (like 437 and 850) to
early versions of latin-1 and friends (known in the MS world as
ANSI encodings), and they added implicit character conversions
to some of the file system interfaces. But they forgot to
create a safe and easy way for sysadmins / advanced users to
access and manipulate files whose names contained
non-convertible characters. Even worse, they mandated that it
was the responsibility of individual programs to invoke
conversion functions at the "right" times. This meant that a
lot of programs got it wrong, creating a situation where users
had to stick to pure ASCII or risk exposing untested bugs in
strange places. They never found a way to fix things once the
bad spec had been implemented by all the Windows programs in the
world. In the 32 bit version of Windows they removed all the
non-converted system calls thereby removing the problem for the
DOS chars in filesystems, killing off any differently encoded
filenames, and moving those conversions into the kernel, but at
the same time, they did it again for UNICODE.
Assume user X is running on sarge+5, a pure UTF8 setup all the
way through. Assume, that filesystem xyzfs stores filenames in
another character set and is subject to automatic implicit
conversions.
For some reason he mounts a device containing a few (perhaps
only one) non-UTF8 filename (perhaps an old removable disc,
perhaps NFS, perhaps a corrupted disc, perhaps a network mount).
Such an escaping mechanism would:
1. Allow the filename to just appear in all sorts of file
listings, file open dialogs etc. without those dialogs
doing anything special because it is all in the conversion
routine.
2. Allow the file to be opened and manipulated with any tool
the user might find useful, because the conversion routines
allow the filename to make it through.
3. Allow the file to be backed up and restored, even if the
operator is unaware of the presence of corrupted filenames
on the system.
Technically such a conversion might work as follows:
1. When converting on-device filenames to/from the
intermediary format (probably UTF32), reversibly map any
invalid byte values to some part of the Corporate Zone in
UNICODE. The same 256 UNICODE code points can be used for
all character sets, there may already be a tradition or
standard indicating what values to use.
2. When converting locale format (UTF8 or otherwise)
system call / library call filenames from/to the intermediary
format, reversibly map any UNICODE code point not in the local
encoding to a sequence of chars indicating the HEX unicode
code point. The locale encoding character indicating this
escape should be chosen carefully for each family of character
encodings, as that character will become unusable in filenames
for users of that encoding.
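
For comparison, a similar reversible escape exists today in Python as the "surrogateescape" error handler. It differs from the scheme above in two ways that should be kept in mind: it maps undecodable bytes to lone surrogates (U+DC80..U+DCFF) rather than to a private-use block, and it lives in a library rather than at the system boundary. The sketch below only illustrates the round-trip property of step 1. The helper names are invented, and the "backslashreplace" call is a one-way display aid, not the reversible escape of step 2.

```python
def decode_filename(raw):
    """Turn on-disk bytes into text; bytes that are not valid UTF-8 are
    mapped reversibly to lone surrogates instead of raising an error."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_filename(name):
    """Exact inverse of decode_filename()."""
    return name.encode("utf-8", errors="surrogateescape")


def printable(name, terminal_encoding):
    """One-way rendering for a terminal that cannot show every character:
    unencodable code points become \\uXXXX style escapes."""
    return name.encode(terminal_encoding, errors="backslashreplace")


if __name__ == "__main__":
    raw = b"report-\xe9t\xe9.txt"      # Latin-1 bytes, not valid UTF-8
    name = decode_filename(raw)
    assert encode_filename(name) == raw  # nothing was lost on the way
    print(printable(name, "ascii"))
```

Because the mapping is reversible, a backup tool working on the decoded names would still write the original bytes back.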
NOT library functions, that is the big MS mistake. It must
happen outside individual programs and libraries in order to
avoid creating an unmaintainable mess, where every programmer
must figure out when to apply which conversion to which data,
many create bugs, design improvements are impossible, and all
programmers waste their time doing unnecessary work.
I already knew that many xterm clones did it right. But the
item says that ALL the terminal emulators, ALL the local
terminal interfaces (text mode vt, svgatextmode, Xlib text
input/output calls) and ALL the locales defined by the "locales"
package must support UTF8 as the very first step of getting an
environment in which UTF8 versions of packages may ship without
causing massive breakage.
You're welcome.
Ok, that is probably going to be true. Ugg. But what if the program *knows* the data is UTF-8 internally? Like all GNOME programs do, and my patch for dpkg tries to do? And if my policy proposal is accepted as is, then programs can expect filenames at least to be UTF-8.
Then it should be easy to convert it. You can't not convert and expect a reasonable response - among other things, innocent UTF-8 characters can include C1 bytes, and screw up an innocent terminal. Not acceptable. Filenames are and must be in the locale charset. There is no other sane option - what do you expect "echo *" to do? You can't slap filters around everything; it's horribly buggy, and error-prone and would take forever to implement, IF everyone wanted to go along with it. The only sane situation is to transition everything as a whole to UTF-8, with filterm or the like for legacy terminals. You can't just change filenames.
Hello, No, this does not work either. Imagine two scenarios: 1) A multiuser machine, with users using different charsets. Who decides which one is "local"? 2) The sysadmin/user changes the charset, e.g. from iso-8859-1 to iso-8859-15 to get the Euro character. How should the filenames stay in the local charset when this changes? Would there be some automatic conversion? A non-broken solution will have to convert charsets somewhere between the filesystem level and output to the user's terminal. (And no, I don't know an easy way to do this :-( ) Jochen
Heh. I will quote from a previous message of mine about filenames in the locale charset, which, since you joined the discussion later, you might not have seen: It appears so, and yes, this behavior is completely and fundamentally broken. If you have say a Chinese friend who logs onto your computer, and he sets LANG to something like cn_CN.BIG5, then when he tries to 'ls' your files, it will completely fail. Likewise, when you try to look at his, it will not work at all. Moreover, say the system administrator does something like 'find /home'. The resulting stream will be a mixture of ISO-8859-X and BIG5, and impossible to reliably differentiate. And of course the problem doesn't just occur when you have a multiuser system; your Chinese friend could send you a .ogg file named using BIG5, and your Latin 1 system would simply fail to encode the filename. And finally, having the encoding of filenames dependent on the current locale often doesn't make sense even for a single user; what if you are a software developer in an ISO-8859-1 locale, and you want to test the Japanese translation of your software. So you run it with LANG=ja_JP.ISO-2022-JP or something to get the translations displayed. As a side effect, all the filenames on your system will fail to work. In summary, UTF-8 is the *only* sane character set to use for filenames. Major upstream software for Debian like GNOME is moving towards requiring UTF-8 for filenames, and we should too. Quite frankly, I expect it to not work, unless they're using a UTF-8 terminal. I am not sure. I have a feeling we could make "core" programs like 'ls' and such do conversion, but I agree it would be quite a long time before we covered "most" of the programs people use. I think programs should start expecting UTF-8 filenames today, but be able to sanely handle filenames in the locale charset. That way we get the best of both worlds, and minimize the pain of the transition. 
Note again that GNOME programs and the like are already creating UTF-8 filenames, because they work completely in UTF-8 internally. Now, they *could* try to convert them back to the locale charset. But I would argue strongly against this, because the conversion could fail if the locale's charset isn't able to encode some target characters. That may be an "unlikely" scenario, but when you're dealing with something as fundamental as filenames, you don't want to just ignore "unlikely" scenarios.
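
The ambiguity described here is easy to reproduce. The sketch below is illustrative only; the function names and the sample filename are invented. It takes the bytes a Big5 user would store for a filename and shows how the same bytes look under other assumptions, and that the name cannot be converted to Latin-1 at all.

```python
def interpretations(raw, encodings):
    """Decode `raw` under each candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or to None
    when the bytes are not valid in that encoding.
    """
    result = {}
    for enc in encodings:
        try:
            result[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            result[enc] = None
    return result


def can_encode(name, encoding):
    """True if every character of `name` exists in `encoding`."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    name = "中文.ogg"
    raw = name.encode("big5")          # what ends up on disk in a Big5 locale
    for enc, text in interpretations(raw, ["big5", "iso-8859-1", "utf-8"]).items():
        print(enc, repr(text))
    print(can_encode(name, "iso-8859-1"), can_encode(name, "utf-8"))
```

ISO-8859-1 accepts every byte sequence, so decoding under it always "succeeds" and silently yields mojibake. This is why a mixed stream from something like 'find /home' cannot be sorted out after the fact.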
There are problems, yes. What you have failed to show is that your solution is better, or even implementable. Converting a byte-string as if it were a string of characters is guaranteed to cause problems. There will be inaccessible files, multiple files with the same name, all sorts of problems and security holes. Not to mention you have to rewrite every piece of code that handles filenames. Good luck. The non-broken solution which everyone else is going towards is complete conversion of the system to UTF-8; most programs already support UTF-8, and once the switch is done, it will be clean, without the breaking of POSIX rules or adding more code to every program.
And? A POSIX filename is not a string of characters, it's a string
of bytes. You have no technical need to differentiate between the
two.
Good. It reminds me not to have filenames that I have no way of entering
into the computer.
But using it for filenames and not for everything else is not
a solution.
One example: You're leaving text files in the locale charset - but
a shell script is just another text file, and needs to reference
filenames. How do you reference a filename not in your locale
charset? Either bash does not recode it, and the name of non-ASCII
files is mojibake, or you do recode it, and it's impossible to
reference files not in your locale charset.
Making catastrophes that much more fun.
Are you volunteering to write patches for every program in Debian, and
maintain them (since the upstream author probably won't be interested
in this Debian-only scheme)?
Which is considered a mistake by many.
tell the user to handle it. Same thing you do with a disk full or a
read-only directory or whatever. You're ignoring scenarios like
Hacker: Access file <middle dot><middle dot>/etc/passwd
Program 1: Hmm, <middle dot><middle dot>/etc/passwd is not in an
illegal directory - passing through.
Program 2: Hmm, translate to Latin-16 to stick in shell script
Convert <middle dot><middle dot> to ..
Program 3: Returning password file.
It's happened - look up the Unicode root for IIS. Willy-nilly
conversion of filenames is big trouble.
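
The shape of that failure can be shown in a few lines. This is a sketch under stated assumptions, not a reconstruction of the IIS bug: Unicode NFKC normalization stands in for the "helpful" charset conversion, and U+FF0E (FULLWIDTH FULL STOP) stands in for the middle dot of the scenario, because NFKC is known to fold it to an ordinary ".". The function names are invented.

```python
import unicodedata


def naive_is_safe(path):
    """Program 1: reject paths that contain a '..' component."""
    return ".." not in path.split("/")


def later_conversion(path):
    """Program 2: a lossy conversion applied after validation.
    NFKC normalization is used here as a stand-in."""
    return unicodedata.normalize("NFKC", path)


def safer_is_safe(path):
    """Validate the form that will actually be used, not the form received."""
    return naive_is_safe(later_conversion(path))


if __name__ == "__main__":
    attack = "\uff0e\uff0e/etc/passwd"
    print(naive_is_safe(attack))      # the check passes on the raw input
    print(later_conversion(attack))   # ...but this is what gets opened
    print(safer_is_safe(attack))
```

The check is only meaningful if it runs after the last conversion, which is hard to guarantee once conversions are scattered across programs.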
The point is, we have working "iconv", and changing changelog will work. man may need some hacking or other, I am not sure. Not all of the statements made in that thread are quite true, and I seem to remember seeing some hacks done by Ukai-san on that respect, for UTF-8. regards, junichi
We don't remove support for legacy terminals, we are enforcing support for them. By moving files to utf-8, we know that if you have a iso-8859-1 terminal, the display will accept the output of iconv -f utf-8 -t iso-8859-1 while in the current situation we can't reliably tell the source character set. regards, junichi
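
The same conversion can be sketched in Python for readers who prefer it to the iconv command line. The function name is invented, and replacing unencodable characters with "?" is a choice made for this example. What matters is the first step: because the source is known to be UTF-8, decoding is unambiguous, and only the final encoding step depends on the terminal.

```python
def for_terminal(utf8_bytes, terminal_encoding):
    """Convert data known to be UTF-8 into the terminal's charset.

    Characters the terminal charset lacks are replaced with '?' so that
    output is always produced.
    """
    text = utf8_bytes.decode("utf-8")   # source encoding is known, no guessing
    return text.encode(terminal_encoding, errors="replace")


if __name__ == "__main__":
    entry = "Closes: #99933, thanks to Radovan Garabík".encode("utf-8")
    print(for_terminal(entry, "iso-8859-1"))
    print(for_terminal(entry, "ascii"))
```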
Yep, definitely. I hear the other Colin is on the job :) Hmmm...could you elaborate?
If you do any sort of character-oriented manipulation on those names, you will. Well, that may be fine for you, but can you say it's fine for everyone in the world? I'm glad we agree on this much :) Well, it's not an optimal solution, for sure; but it does solve some problems, I think. At the expense of creating others, admittedly; but I think we can work to fix the latter. Well, hopefully most shell scripts would not be directly referencing the files on the system, so they will continue to work. True enough. No, but I am volunteering to write some patches for some programs. I think we might be able to get a fair number of upstreams to go along with it. Now, this is interesting. I had thought that the general consensus in the free software community at large was that UTF-8 is the only sane charset for filenames, and to not attempt complete support for filenames in the locale charset. At least this is quite obviously the position taken by GNOME. Do you have any suitable references for projects which take a different approach? I highly value your opinion, since you've shown on the lists that you are quite knowledgeable about charset issues. Ugh. I suppose that is possible...but ugh. By <middle dot> I'm assuming you mean U+00B7 '·'. It seems to me that in the chain above, Program 1 is a trusted program; it is doing validation on network input. So it is a bug in that program, or its configuration, for it to execute any programs which might do something untrusted.
I think our man-db and groff have been hacked in two ways: 1) to special-case Japanese locale (ja_JP.eucJP) and act specially in that case only (using -Tnippon device) 2) to work with utf-8 I seem to remember 1 was the case in potato, or woody, breaking use under ja_JP.utf-8. 2 was on its way, when I checked the last time, but I am not sure. I think Colin Watson should know better about the status...
We are not discussing changelog encoding here, see #174982. Indeed we have iconv, this is exactly why we do not need to break things and convert everything into UTF-8, but we can instead make sure all strings have a defined encoding and patch our tools to perform runtime encoding conversion. This is how debconf works, and I don't see why other tools can't do the same. Denis
2) is present in groff upstream, actually, but 1) interferes with it in some exciting ways. We can probably manage to patch it up so that UTF-8 doesn't break quite so badly, but really it's almost impossible to get completely correct output in all encodings from current groff, which has historically had a hard-coded expectation of ISO-8859-1 input that reaches quite deeply into its design. There is no (standard) way for a document to state its encoding. groff 2.0 is planned to fix this by, among other things, changing its input encoding expectation to be UTF-8 instead, but that's some way off yet. man has a big table of language directories and what groff output devices are conventional in each. It's clearly not exactly ideal, but it's the best we've got for now. I think it is undeniably true that the man-db/groff toolchain is not yet ready for Debian policy to mandate UTF-8. ja_JP.UTF-8 may be hackable in man nowadays; please send patches if you can get it to work. :) I can supply pointers, but Fumitoshi UKAI is the real expert on groff encodings.
I think this ought to be a reminder that taking a Debian-specific approach to this and reckoning that we can probably "get a fair number of upstreams to go along with it" is a mistake. If there isn't a widely-accepted standard, we will just create a mess. Are the LSB interested in working on this?
Like what? How much character-oriented manipulation are you going to be doing on the whole system? When you're playing with your own files, you don't have a problem. How much fine manipulation are you going to be doing with someone else's files? How many people in the world, who don't speak CJK, want filenames in Chinese ideographs? I'm a languages geek - I own dictionaries from languages I don't know, to languages I don't know. I still don't want random ideographs in filenames on my system. My parents? My family? They might have to call me in for tech support. I don't know anyone who doesn't speak CJK who would want it. And if you want to fix it, that's easy. Switch to a UTF-8 locale. And what if they are? Are you going to tell me that shell scripts cannot reference an arbitrary filename on the system? Everyone else? I don't know of an example besides GNOME that regards filenames as UTF-8 by default -- everyone else just treats them as locale. It would add a lot of code to some programs to do otherwise. What programs convert from locale charset to UTF-8 for filenames, or vice versa? When? Unless you can clearly and unambiguously state when that happens, and even if you do, this problem will pop up.
I don't think it would be really Debian-specific; at least the *code* would not be. It would be generic in that it would give programs Unicode and UTF-8 support, which would likely be quite easy to disable. I am a bit wary about involving them; it doesn't seem to quite fit in with their charter. However, I just noticed the 'Open Internationalization Initiative', which is part of the same Free Standards Group umbrella organization that the LSB is. Stuff like this does seem like it would fit in with their work; charset issues and internationalization do go hand in hand. However, I just looked through the most recent release of their standard, and they appear to be silent on all the issues under debate here; what charset to use for filenames, how to handle filenames not in UTF-8, etc. So...while we're investigating those organizations, given that most (basically all) of the controversy so far has focused on filenames, I would like to introduce a revised policy proposal which basically just drops the section on filenames created by programs. That way we can have a fairly strong statement of Unicode support, but leave off most of the "bite" until later. This should hopefully be less controversial. Any seconds?
On Wed, Jan 15, 2003 at 09:34:00PM -0500, Colin Walters wrote:
[...]
Excerpt from http://www.openi18n.org/docs/html/LI18NUX-2000-amd4.htm
portable filename character set
The set of characters from which portable filenames are constructed.
For a filename to be portable across implementations conforming to
this specification set and the ISO POSIX-1 standard, it must consist
only of the following characters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . _ -
The last three characters are the period, underscore and hyphen
characters, respectively. The hyphen must not be used as the first
character of a portable filename. Upper- and lower-case letters retain
their unique identities between conforming implementations. In the
case of a portable pathname, the slash character may also be used.
Denis
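
The rule quoted in the excerpt above is simple enough to check mechanically. A minimal sketch follows; the function name is invented, and the pattern is a direct transcription of the quoted text, nothing more.

```python
import re

# Letters, digits, period, underscore and hyphen; the hyphen is not
# allowed as the first character. The slash is deliberately excluded
# because this checks a single filename, not a pathname.
_PORTABLE_NAME = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")


def is_portable_filename(name):
    """True if `name` uses only the portable filename character set."""
    return _PORTABLE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("README.txt", "-rf", "changelog.Debian", "मनोज्.conf"):
        print(candidate, is_portable_filename(candidate))
```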
charset to use for filenames in the *future*.
Hi,
Sorry for the late entry into the discussion. I am
comfortable with making the changelog UTF-8 only, but file names in
pure UTF-8 perhaps is premature. (मनोज्.conf, anyone?). Indeed,
until we have a wider deployment of a font that has a decent
coverage of UTF-8 glyphs (how many of y'all can read ሰማይ አይታረስ ንጉሥ
አይከሰስ። ?), perhaps we should stick to pure ascii file names, if we
must have policy take a stance about file names at all?
That is not saying anything about programs that deal with
file names having widechar and encoding support, etc. I feel, as
integrators, we must follow, rather than lead, the majority of the
producers of the software components we integrate.
मनोज
ps: ᚻᛖ ᚳᚹᚫᚦ ᚦᚫᛏ ᚻᛖ ᛒᚢᛞᛖ ᚩᚾ
Hi,
Just because you are using a UTF-8 capable terminal does not
mean you can actually see a UTF encoded string. ሰው እንደቤቱ እንጅ እንደ
ጉረቤቱ አይተዳደርም።, though encoded in UTF, is hard for me to display. If
you are able to see this, would you please share what fontset you
are using?
Now, გთხოვთ ახლავე გაიაროთ რეგისტრაცია <-- that I can see.
(Eĥoŝanĝo ĉiuĵaŭde ? Γειά σας? Здравствуйте!
I would love to have some of these neat files on my system --
but first I need to find a more capable fontset.
manoj
Please see my second proposal (the third in #99933), which drops the recommendation for programs to create and read filenames in UTF-8. Of course, this doesn't make the problem go away; we will still have some programs creating filenames in UTF-8, and others in the locale charset. I admittedly can't; Evolution will have somewhat poor support for non-Latin Unicode until it's ported to GNOME 2. But note that UTF-8 will work quite well I think for users of Latin and East Asian languages, because we do have good, widely available free fonts for those. First of all, I strongly believe policy should have a stance about file names. People will want to have packages including filenames which include non-ASCII characters. There are something like 15-20 in Debian now, and that number is probably small because of this encoding mess. And if those packages want to, we need a defined encoding for doing so. I think it is pretty obvious that UTF-8 is the only sane choice. Second, people will want to create files with non-ASCII names on their own computers; it would be bad if policy specified one charset, but users were creating files in another. But we can leave this issue aside for now. I understand your position. In my latest proposal, policy is silent on the encoding for file names to be used by programs in general. We can fill that in later (and I think we will be filling it in with UTF-8), but I'd really like to set up the Unicode infrastructure in policy now. This will also have the effect of letting people know our intentions now, and hopefully spark a few upstream authors into adding Unicode support.
Just FYI, I can see both Ethiopic and Georgian in mutt, on default uxterm (not-so-fresh unstable), when I select large font from menu (in default configuration). Using the default font, I see only Georgian, not Ethiopic. I would say it is decent (compared to the situation, say, two years ago). It is not a matter of supporting _all_ users with _all_ characters, but of supporting _as many as reasonably possible_. E.g., my needs for filenames are satisfied with characters from latin1, latin2, latin3, and cyrillic (yes, I really have such filenames, and really use the files, and would not like to transliterate all into ASCII). Current support of UTF-8 in woody quite satisfies me (after some tweaking, of course, since often the default settings are not UTF-8 friendly)