Dear Maintainer,
whenever I unpack a ZIP file in which a filename contains 'ä', 'ö' or 'ü', the character is replaced by '�' and the text " (ungültige Kodierung)" (invalid encoding) is appended to the extracted filename.
Removing " (ungültige Kodierung)" from every file is a lot of work: these characters are very common in German, and I don't pack the files myself, so I can't influence the original filenames.
See also:
http://forum.ubuntuusers.de/topic/falsche-buchstaben-nach-entpacken/
https://blueprints.launchpad.net/unzip/+spec/unzip-detect-filename-encoding
Thanks - Rafael
I quite often have this problem with Cyrillic characters in
filenames. So this is not specific to German; it affects all locales
with non-Latin characters.
This problem is already solved by Altlinux (Russian Linux distro).
Ubuntu is using their patch.
unzip (6.0-20ubuntu1) xenial; urgency=medium
* Resynchronise with Debian. Remaining changes:
- Add patch from archlinux which adds the -O option, allowing a charset
to be specified for the proper unzipping of non-Latin and non-Unicode
filenames.
Archlinux uses it too, according to the bug report:
https://bugs.archlinux.org/task/8383
Patch from Ubuntu is attached.
control: tags -1 patch
control: severity -1 important
Hi,
zip as currently shipped with Debian squeeze lacks encoding support. This is a widely known problem with some workarounds:
https://superuser.com/questions/872596/decompress-zip-with-given-encoding
https://unix.stackexchange.com/questions/251969/how-can-i-correctly-decompress-a-zip-archive-of-files-with-hebrew-names
Seemingly the same problem is also reported as https://bugs.debian.org/696914.
Apparently Ubuntu, Arch, Redhat and FreeBSD ship (or shipped) a patched version of unzip to cope with this widely known encoding issue. (It seems this issue has been open for more than 10 years. An upstream change seems to have broken the old patch at some point, but I see Ubuntu has an updated patch.)
Knowing how slow upstream is, it may be a good idea to apply a patch to fix this shortcoming in Debian too.
Arch bug and patch from 2009: https://bugs.archlinux.org/task/15256
The Ubuntu discussion of this bug is here: https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/580961
In it:
Mathew Hodson (mathew-hodson) wrote on 2016-05-16: #198
I've closed the remaining tasks. This particular bug was fixed in Precise and later. For remaining issues in p7zip and file-roller, see Bug #1382106 and Bug #495880
Current Ubuntu has fixed this bug; its diff is here: https://ubuntudiff.debian.net/q/package/unzip
unzip (6.0-21ubuntu1) artful; urgency=low
* Merge from Debian unstable. Remaining changes:
- Add patch from archlinux which adds the -O option, allowing a charset to be specified for the proper unzipping of non-Latin and non-Unicode filenames.
Looks quite reasonable. The same patch has been in use since unzip version 6.0-19ubuntu1, packaged by Sebastien Bacher <seb128@ubuntu.com> Fri, 23 Oct 2015 15:58:43 +0200
So this patch should have been well tested by now! As long as we apply the same patch as Ubuntu, the security concern is minimal, too. (I understand that, with so many recent CVE fixes, you may be very conservative about deviating from upstream.)
If you don't feel like updating under freeze, please seriously consider uploading right after the release and backporting. Regards, Osamu
severity 779207 wishlist
thanks
This is still a feature request. Granted, a feature request that many people request, but still a feature request.
The proposed patch, even if it's "well tested", may or may not be compatible with whatever thing upstream finally implements. If we only had some assurance that the upstream patch will be like this, then yes, it would be fine to apply the patch (I would be happy to backport the changes from upstream git or whatever source control version they have), but we don't really know, so no, I still do not feel like deviating from upstream, not under freeze, and not after the freeze.
Anyway, I'll ask the authors about this once again.
Thanks.
Dear Maintainer,
it looks like an upstream beta version from 2010 fixes this by adding the -I and -O options, and the changelog says it's based on the unzip60-alt-iconv-utf8.patch proposed in this thread.
This is the beta version: ftp://ftp.info-zip.org/pub/infozip/beta/unzip610b.zip
You'll need to compile with -DUSE_ICONV_MAPPING to enable this, and depend on the iconv library.
This is also mentioned in #197427; other possibly related bugs are #696914 and #483290.
Ciao, Antonio
On Fri, 24 Nov 2017 09:30:26 +0100 Antonio Ospite <ao2@ao2.it> wrote: [...]
Well, the changelog mentions the "iconv library", but on Linux the functionality is in glibc, so no extra dependencies should be needed.
Ciao, Antonio
Hi Santiago,
How is this bug going? I think we can assume the upstream patch is like this. Upstream made a 610beta release in 2010:
https://sourceforge.net/projects/infozip/files/unreleased%20Betas/UnZip%20betas/
It adds support for the -O/-I options for non-UTF-8 encodings.
I compared the code in 610beta with the patch applied in archlinux AUR [1][2]. Most of the code is the same, except:
1. The upstream code is surrounded by USE_ICONV_MAPPING.
2. 610b has a big refactoring of the command-line parser, so the option-parsing code differs considerably.
The remaining code is identical.
Could you backport this lovely feature to Debian?
[1] https://aur.archlinux.org/packages/unzip-iconv
[2] http://www.conostix.com/pub/adv/06-unzip60-alt-iconv-utf8_CVE-2015-1315.patch
Hi Santiago,
I think the community in all non-English-speaking countries would highly appreciate a solution for this bug, which affects many people almost every day. The effort is hopefully not too big, but the benefit would be tremendous.
Many thanks in advance!
Best Regards,
---------------------------------------------------------
"Sed quis custodiet ipsos custodes?" (Who, apart from the guards themselves, watches over the guards?)
The built-in .zip archiver in older versions of Windows used the DOS (OEM) or Windows (ANSI) code page corresponding to the current regional settings when creating new archives. Lots of such archives still exist.
The correct behavior is to determine the relevant OEM or ANSI code page based on the system locale and use it.
You can look at this PR for a reference implementation: https://github.com/p7zip-project/p7zip/pull/232
A sample archive showing this bug is attached. It uses CP866 (often called DOS or OEM) for Cyrillic (Russian) letters.
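As a rough illustration of the locale-based selection described above, here is a minimal Python sketch. The table LEGACY_CODE_PAGES, the helper names and the CP437/CP1252 fallback are assumptions made for this example only; they are not taken from the p7zip PR or from any unzip patch, and the table is deliberately incomplete.

```python
# Hypothetical, incomplete table: language code -> (OEM code page, ANSI code page).
# These pairs are the usual Windows defaults for the listed languages.
LEGACY_CODE_PAGES = {
    "ru": ("cp866", "cp1251"),  # Russian
    "de": ("cp850", "cp1252"),  # German
    "he": ("cp862", "cp1255"),  # Hebrew
}

# Assumption for this sketch: fall back to CP437/CP1252 for unknown languages.
FALLBACK = ("cp437", "cp1252")


def legacy_code_pages(locale_name):
    """Return (oem, ansi) codec names for a locale string such as 'ru_RU.UTF-8'."""
    # Strip the charset suffix (".UTF-8"), then the territory ("_RU").
    lang = locale_name.split(".")[0].split("_")[0].lower()
    return LEGACY_CODE_PAGES.get(lang, FALLBACK)


def decode_legacy_name(raw, locale_name, ansi=False):
    """Decode raw filename bytes using the locale's OEM (or ANSI) code page."""
    oem, ansi_cp = legacy_code_pages(locale_name)
    return raw.decode(ansi_cp if ansi else oem, errors="replace")
```

With a Russian locale string such as "ru_RU.UTF-8", a name stored as CP866 bytes is decoded with the OEM code page; passing ansi=True selects the ANSI code page instead. A real implementation would need a far larger table and a way to tell OEM archives from ANSI ones.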
Dear colleagues,
I am writing to bring to your attention an issue with the current upstream version of unzip, which has not been updated for many years. In the modern environment, where the vast majority of systems use UTF-8, unzip exhibits several problems that need addressing:
1) unzip is unable to correctly extract entries that have bit 11 set in the General Purpose flag. This bit indicates that the file names are encoded in UTF-8. However, unzip attempts to re-encode them as if they were in the OEM codepage, leading to incorrect file names.
2) By default, unzip does not display UTF-8 encoded names correctly on Unix systems.
3) The OEM codepage needs to be determined correctly based on the system locale, rather than using a single codepage for all archives.
4) The assumption that archives whose legacy codepage cannot be determined are encoded in ISO 8859-1 is incorrect. In reality, most archivers used the user's system codepage, which could be any codepage. It is reasonable not to alter the encoding in this case, so that the archive can at least be opened on the same system where it was created.
Additionally, the options -O and -I have been added to specify the encoding manually.
I have prepared a patch (based on a similar patch from Ubuntu, with significant enhancements) that addresses these issues. A significant difference from the Ubuntu patch is that my code selects the OEM codepage based on the system locale, instead of assuming the Russian/Cyrillic CP866 codepage for all archives when the system is set to UTF-8.
I hope you will find this patch useful.
Best regards, Ivan Sorokin
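To make point 1 concrete, here is a minimal Python sketch of the bit 11 check, using the standard zipfile module rather than unzip itself. The function name entry_names and the legacy_encoding parameter are hypothetical. The sketch relies on the fact that Python's zipfile decodes names without bit 11 as CP437, so re-encoding to CP437 recovers the raw bytes; this is an illustration of the idea, not the logic of the patch.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag: name is UTF-8


def entry_names(data, legacy_encoding):
    """Return entry names from ZIP bytes, honouring the UTF-8 flag.

    legacy_encoding (hypothetical parameter) is the code page to assume
    for entries that do not have bit 11 set, e.g. "cp866".
    """
    names = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.flag_bits & UTF8_FLAG:
                # Name is declared as UTF-8: use it as-is, never re-encode.
                names.append(info.filename)
            else:
                # zipfile decoded the raw bytes as CP437; undo that and
                # decode with the chosen legacy code page instead.
                raw = info.filename.encode("cp437")
                names.append(raw.decode(legacy_encoding, errors="replace"))
    return names
```

The important design point is the first branch: when bit 11 is set, no legacy code page is applied at all, which is exactly the case the message says unzip currently gets wrong.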
I slightly modified the patch for unzip:
1) I found a sample ANSI archive, which requires a separate code branch, so I added one (sample archive attached).
2) I fixed the -I and -O options, which were broken. They now mean the same thing for all types of archives; both are kept for compatibility, and you can specify either one. That is, they now work similarly to -mcp in 7zip.