#381436 lists.debian.org: debian-chinese-gb web archive encoding error

#381436#5
Date:
2006-08-04 12:31:05 UTC
From:
To:
Hi,

It seems that our mailing list software has a serious encoding error in
the debian-chinese-gb archive. In the web archive for the last 3 months,
most Chinese mails are displayed as garbage. Changing the character set
in the web browser doesn't help at all.

There are at least two different situations:

1. Mails in UTF-8 encoding
   Both subject and body are completely garbled. Every Chinese character
   is displayed as an accented character, like "ä" or "å". For example,
   "Re: æåmuttääçéäåèçéé" [1].

2. Mails in gb2312/gb18030 with base64/quoted-printable encoding
   The content of these mails is displayed correctly, but the subjects
   are not decoded and are shown as raw text, such as
   "=?gb2312?b?UmU6IHFyZWYguPHKvdDe1f0=?=" [2].
   However, if the mail declares gbk encoding, everything looks
   correct. Here [3] is an example.
   (gb2312 is a subset of gbk, and gbk is a subset of gb18030.)
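
Both symptoms can be reproduced outside the archive software. The
following is a minimal Python sketch, not what the list software
actually does: it assumes the UTF-8 garbage comes from UTF-8 bytes being
rendered as ISO-8859-1, and it decodes the subject from [2] as an
RFC 2047 encoded word. The sample text and function names are
illustrative only.

```python
from email.header import decode_header

def mojibake(text: str) -> str:
    """Encode as UTF-8, then misread the bytes as ISO-8859-1 (assumed failure mode)."""
    return text.encode("utf-8").decode("latin-1")

def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded words using the charset each one declares."""
    parts = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)

garbled = mojibake("中文")   # hypothetical sample text
print(repr(garbled))         # begins with an accented Latin letter, as in situation 1
print(decode_subject("=?gb2312?b?UmU6IHFyZWYguPHKvdDe1f0=?="))
```

The second call yields "Re: qref " followed by four Chinese characters,
which is what the archive should show in place of the raw header.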

Although our Chinese lists are called "...-gb" and "...-big5", most
people use UTF-8 nowadays. So if we can have only one decoding choice,
I suggest UTF-8.


[1] http://lists.debian.org/debian-chinese-gb/2006/05/msg00116.html
[2] http://lists.debian.org/debian-chinese-gb/2006/08/msg00011.html
[3] http://lists.debian.org/debian-chinese-gb/2006/05/msg00109.html

#381436#12
Date:
2007-11-10 13:47:29 UTC
From:
To:
Hi Carlos,

quite a while ago, you submitted a bug to the Debian BTS noting
deficiencies in the encoding of the list archive[1].
We have now installed a newer version of the mailing list software that
should provide some improvements.
While we have not regenerated the whole archive yet, the current month's
archive[2] seems to indicate that things will improve when we do this
(soon).
One thing I noticed is that gb18030[3] does not seem to be decoded,
apparently because upstream lacks support.
What it would need is a character mapping from gb18030 to UTF. There
already are mappings for CP936 (which is, I take it, gbk) and gb2312.
If someone could provide a gb18030 map, it would probably be easy to add
support for it as well. (The existing ones are in
CP936.pm and GB2312.pm in the mhonarc package; the format should be
similar and, if I understood your mail correctly, the overlapping
sections of these character encodings could be used as a starting point.)
If you or someone else could help out here, I would
appreciate that a lot.

Kind regards

Thomas

1. http://bugs.debian.org/381436
2. http://lists.debian.org/debian-chinese-gb/2007/11/
3. http://lists.debian.org/debian-chinese-gb/2007/11/threads.html#00025

#381436#22
Date:
2008-03-21 01:28:41 UTC
From:
To:
Hi Thomas!

Thank you for your quick response!  :-)

Thomas Viehmann (Listmaster) wrote:

I just did an "apt-get source mhonarc" and noticed that MHonArc currently
recognizes gb2312 and gbk, but "grep -r" finds no mention of gb18030.

Quick fix: since GB18030 is a strict superset of GBK (a.k.a. CP936),
making gb18030 an alias of cp936, perhaps by modifying def-mime.mrc:

	cp936;              gbk
	cp936;              ms936
	cp936;              windows-936
+	cp936;              gb18030
	cp949:              euc-kr
	cp949:              ks_c_5601-1987
	cp949:              ks_c_5601-1989

or mhopt.pl would do the trick?  :-)  While it is not perfect, it should
be fine with 99.9% of all GB18030 mail out there.
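
The alias approach can be sanity-checked with Python's built-in codecs,
which are only a stand-in here and not MHonArc's CP936.pm tables. The
sketch assumes the mail uses only characters that GBK already covers;
the sample strings and the helper name are made up. It also shows the
kind of input that would still break: a four-byte GB18030 sequence has
no meaning to a GBK decoder.

```python
def decodes_same_as_gbk(data: bytes) -> bool:
    """True if GB18030 bytes decode to the same text under a GBK decoder."""
    try:
        return data.decode("gbk") == data.decode("gb18030")
    except UnicodeDecodeError:
        return False

common = "简体中文邮件".encode("gb18030")   # two-byte characters only
rare = "\U00020000".encode("gb18030")       # CJK Extension B, four bytes

print(decodes_same_as_gbk(common))          # alias is safe for this input
print(len(rare), decodes_same_as_gbk(rare)) # alias fails for this input
```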

The long-term solution is to extend CP936.pm to create a GB18030.pm.
James Su (SCIM author) and I worked on that several years ago, e.g. the
text codec in Qt.  Maybe eventually I'll make one for MHonArc too.
Haha, I'd better not make promises yet as I'm still kind of MIA.  :-)
So, until GB18030.pm is available, an alias to CP936 would do.

For more background information, you may like to take a look at

http://lists.w3.org/Archives/Public/ietf-charsets/2002JanMar/0038.html

Thank you very much for your help!

Warm regards,

#381436#27
Date:
2008-03-22 20:07:56 UTC
From:
To:
Actually, we are using mhonarc 2.6.16.
Thanks, I put that into our config files and regenerated the 2008-03
archive. If you don't notice any breakage, I will regenerate the old
debian-chinese-gb lists as well and then close this bug unless you
feel that gb18030 needs proper support.

It seems that gb18030 is not supported in standard perl but needs
something like Encode::HanExtra from CPAN.

Kind regards

T.

#381436#30
Date:
2015-08-13 17:04:53 UTC
From:
To:
Hello,

It has been some time since
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=381436 was
filed.

How do you see the situation now?

Yours,
        Cord, Debian Listmaster of the day
