#779207 unzip fails to unpack filenames containing 'ä' 'ö' 'ü' -> results in "(invalid encoding)"

Package:: unzip

Source:: unzip

Description:: De-archiver for .zip files

Submitter:: derMaria

Date:: 2024-05-28 07:42:04 UTC

Severity:: wishlist

Tags:

#779207#5

Date:: 2015-02-25 13:07:28 UTC

From:

To:

Dear Maintainer,

whenever I try to unpack a ZIP file in which a filename contains 'ä',
'ö' or 'ü', the character is replaced by '�' and the term " (ungültige Kodierung)"
(invalid encoding) is added as part of the extracted filename.

It is a whole lot of work to remove the term " (ungültige Kodierung)" from all
of the files as these characters in German are quite often used and I don't
pack the files myself so I can't influence the original names of the files.

See also:

http://forum.ubuntuusers.de/topic/falsche-buchstaben-nach-entpacken/
https://blueprints.launchpad.net/unzip/+spec/unzip-detect-filename-encoding

Thanks - Rafael

#779207#10

Date:: 2016-01-17 18:47:56 UTC

From:

To:

I quite often have this problem with Cyrillic characters in
filenames. So this is not specific to German; it affects all locales
with non-Latin characters.
This problem is already solved by Altlinux (Russian Linux distro).
Ubuntu is using their patch.

unzip (6.0-20ubuntu1) xenial; urgency=medium

  * Resynchronise with Debian. Remaining changes:
    - Add patch from archlinux which adds the -O option, allowing a charset
      to be specified for the proper unzipping of non-Latin and non-Unicode
      filenames.

Archlinux uses it too, according to the bug report:
https://bugs.archlinux.org/task/8383

Patch from Ubuntu is attached.

#779207#15

Date:: 2017-05-21 14:00:07 UTC

From:

To:

control: tags -1 patch
control: severity -1 important

Hi,

unzip as currently shipped with Debian squeeze lacks encoding support.
This is a widely known problem with some workarounds.
https://superuser.com/questions/872596/decompress-zip-with-given-encoding
https://unix.stackexchange.com/questions/251969/how-can-i-correctly-decompress-a-zip-archive-of-files-with-hebrew-names

Seemingly the same problem is reported as https://bugs.debian.org/696914
too.

Apparently, Ubuntu, Arch, Redhat and FreeBSD ship (or shipped) a patched
version of unzip to cope with this widely known encoding issue. It seems
this issue has been open for more than 10 years, and an upstream change
seems to have broken the old patch at some point, but I see Ubuntu has an
updated patch. Given how slow upstream is, maybe it is a good idea to apply
a patch to fix this shortcoming in Debian too.

Arch bug and patch in 2009:
https://bugs.archlinux.org/task/15256

Ubuntu discussion on this bug is here:
https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/580961

In this:
Mathew Hodson (mathew-hodson) wrote on 2016-05-16: #198
I've closed the remaining tasks. This particular bug was fixed in
Precise and later. For remaining issues in p7zip and file-roller, see
Bug #1382106 and Bug #495880

Current Ubuntu fixed this bug and its diff is here:
https://ubuntudiff.debian.net/q/package/unzip

unzip (6.0-21ubuntu1) artful; urgency=low

* Merge from Debian unstable. Remaining changes:
- Add patch from archlinux which adds the -O option, allowing a charset
to be specified for the proper unzipping of non-Latin and non-Unicode
filenames.

Looks quite reasonable.

The same patch has been in use from unzip version 6.0-19ubuntu1 packaged
by Sebastien Bacher <seb128@ubuntu.com> Fri, 23 Oct 2015 15:58:43 +0200

So this patch should have been well tested by now!

As long as we apply the same patch as Ubuntu, security concern is
minimal, too. (I understand that, with so many recent CVE fixes, you may
be very conservative deviating from the upstream.)

If you don't feel like updating under freeze, please seriously consider
uploading right after the release and backporting.

Regards,

Osamu

#779207#24

Date:: 2017-05-21 14:38:51 UTC

From:

To:

severity 779207 wishlist
thanks

This is still a feature request. Granted, a feature request that many
people request, but still a feature request.

The proposed patch, even if it's "well tested", may or may not be
compatible with whatever thing upstream finally implements.

If we only had some assurance that the upstream patch will be like
this, then yes, it would be fine to apply the patch (I would be happy
to backport the changes from upstream git or whatever source control
version they have), but we don't really know, so no, I still do not
feel like deviating from upstream, not under freeze, and not after
the freeze.

Anyway, I'll ask the authors about this once again.

Thanks.

#779207#31

Date:: 2017-11-24 08:30:26 UTC

From:

To:

Dear Maintainer,

it looks like an upstream beta version from 2010 fixes this by adding
the -I and -O options, and the changelog says it's based on the
unzip60-alt-iconv-utf8.patch proposed in this thread.

This is the beta version:
ftp://ftp.info-zip.org/pub/infozip/beta/unzip610b.zip

You'll need to compile with -DUSE_ICONV_MAPPING to enable this, and
depend on the iconv library.

This is also mentioned in #197427; other possibly related bugs are
#696914 and #483290.

Ciao,
   Antonio

#779207#36

Date:: 2017-11-24 09:24:24 UTC

From:

To:

On Fri, 24 Nov 2017 09:30:26 +0100 Antonio Ospite <ao2@ao2.it> wrote:

[...]

Well, the changelog mentions the "iconv library", but on linux the
functionality is in glibc, so no extra dependencies should be needed.

Ciao,
   Antonio

#779207#41

Date:: 2019-01-07 16:09:54 UTC

From:

To:

Hi Santiago,

How is this bug going?

I think we can assume the upstream patch is like this.

Upstream made a 610beta release in 2010.
https://sourceforge.net/projects/infozip/files/unreleased%20Betas/UnZip%20betas/

It adds support for the -O/-I options for non-UTF-8 encodings.

I compared the code in 610beta with the patch applied in the archlinux AUR[1][2];
most of the code is the same, except:

1. The upstream code is guarded by USE_ICONV_MAPPING.
2. 610b has a big refactoring of the command-line parser, so the command-line
   parsing code differs significantly.

The remaining code is identical.

Could you backport this lovely feature to Debian?

[1] https://aur.archlinux.org/packages/unzip-iconv
[2] http://www.conostix.com/pub/adv/06-unzip60-alt-iconv-utf8_CVE-2015-1315.patch

#779207#46

Date:: 2019-02-05 10:01:14 UTC

From:

To:

Hi Santiago,

I think the community in all non-English-speaking countries would highly
appreciate a solution for this bug, which affects many people almost every day.

The effort is hopefully not too big, but the benefit would be really
tremendous.

Many thanks in advance!

Best Regards,
---------------------------------------------------------
"Sed quis custodiet ipsos custodes?"
(Who, other than the guards themselves, watches over the guards?)

#779207#51

Date:: 2024-05-22 15:40:21 UTC

From:

To:

The built-in .zip archiver in older versions of Windows used the DOS (OEM) or Windows (ANSI) code page corresponding to the current regional settings for new archives. Lots of such archives still exist.

The correct behavior is to determine the relevant OEM or ANSI code page based on the system locale and use it. You can look at this PR for a reference implementation:
https://github.com/p7zip-project/p7zip/pull/232

Sample archive showing this bug attached. It uses CP866 (often called DOS or OEM) for Cyrillic (Russian) letters.
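
The locale-based selection described above can be sketched in Python. This is only an illustration under stated assumptions: the table `OEM_CODEPAGES` and the helpers `oem_codepage_for` and `decode_legacy_name` are hypothetical names, the table is deliberately tiny, and falling back to CP437 for unknown languages is a choice made for this sketch, not something unzip or the p7zip PR is confirmed to do.

```python
# Hypothetical, deliberately incomplete table: language code -> DOS (OEM) code page.
OEM_CODEPAGES = {
    "ru": "cp866",  # Cyrillic (Russian)
    "de": "cp850",  # Western European
    "en": "cp437",  # US English, the original IBM PC code page
}


def oem_codepage_for(locale_name: str) -> str:
    """Pick an OEM code page from a locale name such as 'ru_RU.UTF-8'."""
    language = locale_name.split("_")[0].split(".")[0].lower()
    # Assumption of this sketch: unknown languages fall back to CP437.
    return OEM_CODEPAGES.get(language, "cp437")


def decode_legacy_name(raw: bytes, locale_name: str) -> str:
    """Decode a raw (non-UTF-8) archive member name with the locale's OEM code page."""
    return raw.decode(oem_codepage_for(locale_name))


# Bytes as an old DOS/Windows archiver on a Russian system would have stored them.
raw_name = "Привет.txt".encode("cp866")

decoded = decode_legacy_name(raw_name, "ru_RU.UTF-8")  # readable Cyrillic name
garbled = raw_name.decode("cp437")                     # wrong code page: mojibake
```

The same raw bytes give a readable name only when the code page matches the one used at creation time, which is why a single hard-coded code page cannot work for all archives.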

#779207#56

Date:: 2024-05-26 15:48:23 UTC

From:

To:

Dear colleagues,
I am writing to bring to your attention an issue with the current upstream version of unzip that has not been updated for many years. In the modern environment, where the vast majority of systems use UTF-8, unzip exhibits several problems that need addressing:
1) unzip is unable to correctly extract files that have bit 11 set in the general purpose bit flag. This bit indicates that the file names are encoded in UTF-8. However, unzip attempts to re-encode them as if they were in an OEM codepage, leading to incorrect file names.
2) By default, unzip does not display UTF-8 encoding correctly on Unix systems.
3) It is necessary to determine the OEM codepage correctly based on the system locale, rather than using a single codepage for all archives.
4) The assumption that archives for which the legacy codepage cannot be determined are encoded in ISO 8859-1 is incorrect. In reality, most archivers used the user's system codepage, which could be any codepage. It is reasonable not to alter the encoding in this case, ensuring that the archive can be opened at least on the same system where it was created. Additionally, options -O and -I have been added to specify the encoding manually.
I have prepared a patch (based on a similar patch from Ubuntu, with significant enhancements) that addresses these issues. A significant difference from the Ubuntu patch is that my code is capable of selecting the OEM codepage based on the system locale, instead of assuming the Russian/Cyrillic CP866 codepage for all archives when the system is set to UTF-8.
I hope you will find this patch useful.
Best regards,
Ivan Sorokin
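
Point 1 of the message above concerns bit 11 of the general purpose bit flag. As a minimal sketch, this is how that flag can be inspected with Python's standard `zipfile` module. It is an illustration only, not code from the patch; the helper name `names_with_utf8_flag` and the sample member names are made up for the example.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # bit 11 of the general purpose bit flag


def names_with_utf8_flag(data: bytes):
    """Return (member name, True if bit 11 is set) for every entry in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [
            (info.filename, bool(info.flag_bits & UTF8_NAME_FLAG))
            for info in zf.infolist()
        ]


# Build a small in-memory archive. Python's zipfile sets bit 11 itself
# when a member name cannot be encoded as ASCII.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"ascii name")
    zf.writestr("Größe.txt", b"non-ASCII name")

result = names_with_utf8_flag(buf.getvalue())
```

An extractor that sees the flag set can take the stored name as UTF-8 directly; only names without the flag need a legacy code page guess or an explicit -O/-I override.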

#779207#61

Date:: 2024-05-28 07:37:54 UTC

From:

To:

I slightly modified the patch for unzip:
1) Found a sample ANSI archive, which requires a separate code branch, so added it (sample archive attached)
2) Fixed the -I and -O options, which were broken. They now mean the same thing for all types of archives; both are kept for compatibility, and you can specify either one. That is, they now work similarly to -mcp in 7zip.
