- Package:
- lists.debian.org
- Source:
- lists.debian.org
- Submitter:
- "Carlos Z.F. Liu"
- Date:
- 2025-02-08 02:36:37 UTC
- Severity:
- normal
- Tags:
Hi,

It seems that our mailing list software has a serious encoding problem on the debian-chinese-gb archive. In the web archive for the last three months, most Chinese mails are displayed as garbage. Changing the character set in the web browser does not help at all.

There are at least two different situations:

1. Mails in UTF-8 encoding. Both subject and body are completely garbled. Every Chinese character is displayed as accented characters, like "ä" or "å". For example, "Re: æåmuttääçéäåèçéé" [1].

2. Mails in gb2312/gb18030 with base64/quoted-printable transfer encoding. The content of these mails is displayed correctly, but the titles are not decoded and are shown as raw text, such as "=?gb2312?b?UmU6IHFyZWYguPHKvdDe1f0=?=" [2].

However, if the mail is declared as gbk, everything looks correct. Here [3] is an example. (gb2312 is a subset of gbk, and gbk is a subset of gb18030.)

Although our Chinese lists are called "...-gb" and "...-big5", most people use UTF-8 nowadays. So if we can have only one decoding choice, I suggest UTF-8.

[1] http://lists.debian.org/debian-chinese-gb/2006/05/msg00116.html
[2] http://lists.debian.org/debian-chinese-gb/2006/08/msg00011.html
[3] http://lists.debian.org/debian-chinese-gb/2006/05/msg00109.html
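The two symptoms described in the report can be reproduced outside the archive software. The following is only an illustrative Python sketch, not the code the list archive runs; the function names `misread_utf8_as_latin1` and `decode_subject` are hypothetical, and the assumption that the archive treated UTF-8 bytes as ISO-8859-1 is inferred from the accented characters in the report.

```python
from email.header import decode_header


def misread_utf8_as_latin1(text):
    """Symptom 1 (assumed cause): UTF-8 bytes shown as if they were ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


def decode_subject(raw_subject):
    """Symptom 2: decode RFC 2047 encoded words instead of showing them raw."""
    parts = []
    for payload, charset in decode_header(raw_subject):
        if isinstance(payload, bytes):
            # Encoded words carry their own charset label; plain bytes default to ASCII.
            parts.append(payload.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(payload)
    return "".join(parts)


# Each Chinese character becomes several Latin-1 characters, the first one accented.
garbled = misread_utf8_as_latin1("中文")

# The damage is reversible as long as the bytes themselves were preserved.
restored = garbled.encode("latin-1").decode("utf-8")

# The raw title quoted in the report, decoded with the charset it declares.
subject = decode_subject("=?gb2312?b?UmU6IHFyZWYguPHKvdDe1f0=?=")
```

Here `garbled` starts with "ä", matching the description in the report, and `subject` begins with the readable prefix "Re: qref " followed by the decoded Chinese text.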
Hi Carlos,

quite a while ago, you submitted a bug to the Debian BTS noting deficiencies in the encoding of the list archive[1]. We have now installed a newer version of the mailing list software that should provide some improvements. While we have not regenerated all of the archive yet, the current month's archive[2] seems to indicate that things will improve when we do this (soon).

One thing I noticed is that gb18030[3] does not seem to be decoded, apparently because upstream is lacking support. What it would need is a character mapping from gb18030 to UTF. There already are those for CP936 (which is, I take it, gbk) and gb2312. If someone could provide those maps, it would probably be easy to add support for gb18030 as well. (The ones that are already there are in CP936.pm and GB2312.pm in the mhonarc package. The format should be similar and, if I understood your mail correctly, the overlapping sections of these character encodings could be used as a starting point.)

If you or someone else would be able to help out here, I would appreciate that a lot.

Kind regards
Thomas

1. http://bugs.debian.org/381436
2. http://lists.debian.org/debian-chinese-gb/2007/11/
3. http://lists.debian.org/debian-chinese-gb/2007/11/threads.html#00025
Hi Thomas!

Thank you for your quick response! :-)

Thomas Viehmann (Listmaster) wrote:

I just did an "apt-get source mhonarc" and noticed that MHonArc currently recognizes gb2312 and gbk, but "grep -r" shows no result for gb18030.

Quick fix: Since GB18030 is a strict superset of GBK (a.k.a. CP936), making gb18030 an alias of cp936, perhaps by modifying def-mime.mrc:

      cp936; gbk
      cp936; ms936
      cp936; windows-936
    + cp936; gb18030
      cp949: euc-kr
      cp949: ks_c_5601-1987
      cp949: ks_c_5601-1989

or mhopt.pl, would do the trick? :-) While it is not perfect, it should be fine with 99.9% of all GB18030 mail out there.

The long-term solution is to extend CP936.pm to create a GB18030.pm. James Su (SCIM author) and I worked on that several years ago, e.g. the text codec in Qt. Maybe eventually I'll make one for MHonArc too. Haha, I'd better not make promises yet as I'm still kind of MIA. :-)

So, until GB18030.pm is available, an alias to CP936 would do. For more background information, you may like to take a look at http://lists.w3.org/Archives/Public/ietf-charsets/2002JanMar/0038.html

Thank you very much for your help!

Warm regards,
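The reasoning behind the alias can be checked with a small experiment. This is a minimal Python sketch using Python's own built-in `gbk` and `gb18030` codecs, not MHonArc's CP936.pm tables, so it only illustrates the idea; the helper name `survives_gbk_alias` is hypothetical.

```python
def survives_gbk_alias(text):
    """Return True if text sent as gb18030 still decodes correctly via a gbk table."""
    raw = text.encode("gb18030")
    try:
        return raw.decode("gbk") == text
    except UnicodeDecodeError:
        # Four-byte gb18030 sequences have no counterpart in gbk.
        return False


# Common Chinese text uses the two-byte range that both encodings share.
common_ok = survives_gbk_alias("中文测试")

# Characters outside that shared range need gb18030's four-byte form,
# which is where the alias stops working.
rare_ok = survives_gbk_alias("😀")
```

In this sketch `common_ok` is true and `rare_ok` is false, which is consistent with the alias being good enough for typical mail while a real GB18030 table remains the complete fix.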
Actually, we are using mhonarc 2.6.16.

Thanks, I put that into our config files and regenerated the 2008-03 archive. If you don't notice any breakage, I will regenerate the old debian-chinese-gb lists as well and then close this bug, unless you feel that gb18030 needs proper support. It seems that gb18030 is not supported in standard Perl but needs something like Encode::HanExtra from CPAN.

Kind regards
T.
Hello,

It has been some time since https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=381436 was filed. How do you see the situation now?

Yours,
Cord, Debian Listmaster of the day