#738483 python-gnupg: list_keys fails with debian-keyring due to UTF-8 corruption

#738483#5
Date: 2014-02-09 21:57:11 UTC

Dear Maintainer,

Some DDs have mangled character encodings in the UID records of their
public keys. In particular, an 'é' (as in José or Lainé) has been
encoded as the single byte 0xE9 (as in ISO-8859-1) instead of the UTF-8
sequence 0xC3 0xA9.

python-gnupg fails when listing these keys with the following traceback:

  File "/usr/lib/python2.7/dist-packages/gnupg.py", line 1047, in list_keys
    self.decode_errors).splitlines()
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 207127: invalid continuation byte

This can temporarily be worked around by importing the keyring, editing
the keys and deleting the offending UID packets… until the keys are
refreshed.

I'm a Python newbie so wouldn't know if it's possible to have the
language be more lenient about decoding corrupt UTF-8 data, but after
poking around stackoverflow.com it doesn't seem likely.
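
For what it's worth, Python's decoder can be told to tolerate malformed
input through the `errors` argument of `bytes.decode`. A minimal sketch,
written for Python 3 and independent of python-gnupg; the helper name
`decode_lenient` and the sample bytes are made up for illustration:

```python
def decode_lenient(raw, errors="replace"):
    """Decode bytes as UTF-8 without raising on malformed sequences."""
    return raw.decode("utf-8", errors)


# Hypothetical UID fragment with 'é' stored as the single byte 0xE9.
sample = b"Jos\xe9 Carlos"

replaced = decode_lenient(sample)           # bad byte becomes U+FFFD
ignored = decode_lenient(sample, "ignore")  # bad byte is silently dropped

# The default handler is "strict", which is what raises in the traceback.
try:
    sample.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With "replace" the damage stays visible as U+FFFD, which is probably
safer for key listings than "ignore", where the byte just disappears.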

Seeing that the module does support locales, I futilely tried exporting
LANG=en_US.ISO-8859-1 which results in "UnicodeDecodeError: 'ascii'
codec can't decode byte 0xc3" errors.
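
Since the keyring mixes correctly encoded UTF-8 UIDs with ISO-8859-1
ones, no single locale fits every record. One possible approach,
sketched here for Python 3 and not something python-gnupg does, is to
decode record by record and fall back to ISO-8859-1 only when UTF-8
decoding fails. The function name `decode_uid` is hypothetical, and the
fallback encoding is a guess based on the 0xE9 byte described above:

```python
def decode_uid(raw):
    """Decode one UID as UTF-8, falling back to ISO-8859-1 on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Latin-1.
        # ISO-8859-1 maps every byte to a character, so this never raises,
        # but it also cannot tell when the guess is wrong.
        return raw.decode("iso-8859-1")


good = decode_uid(u"Jos\u00e9 Carlos Garc\u00eda Sogo".encode("utf-8"))
mangled = decode_uid(b"Jos\xe9 Carlos Garc\xeda Sogo")
```

Both calls should yield the same text, so a properly encoded UID and a
Latin-1 one would compare equal after decoding.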

Specific keys:

$ gpg --list-keys 4BE0582590788E11 E2624966A269D927
Keyring: /usr/share/keyrings/debian-keyring.gpg
-----------------------------------------------
pub   1024D/0x4BE0582590788E11 2001-02-01
      Key fingerprint = B06B 023F EAA6 37DC 1E62  B079 4BE0 5825 9078 8E11
uid                 [ unknown] Jose Carlos Garcia Sogo <jsogo@debian.org>
uid                 [ unknown] Jose Carlos Garcia Sogo <jose@jaimedelamo.eu.org>
uid                 [ unknown] Jose Carlos Garcia Sogo <jsogo@arrakis.es>
uid                 [ unknown] Jos\xe9\x20Carlos Garc\xed\x61 Sogo
uid                 [ unknown] José Carlos García Sogo <jose@tribulaciones.org>
sub   2048g/0x002DE195AC9EAC51 2001-02-01

Keyring: /usr/share/keyrings/debian-keyring.gpg
-----------------------------------------------
pub   1024D/0xE2624966A269D927 2002-05-04
      Key fingerprint = 190E 48D7 5117 4F42 ABAC  4DD7 E262 4966 A269 D927
uid                 [ unknown] Jeremy Laine <jeremy.laine@polytechnique.org>
uid                 [ unknown] Jeremy Laine <jeremy.laine@m4x.org>
uid                 [ unknown] Jeremy Laine <jeremy_laine@users.sourceforge.net>
uid                 [ unknown] Jeremy Laine <sharky@debian.org>
uid                 [ unknown] Jeremy Lain\xe9\x20<jeremy.laine@m4x.org>
uid                 [ unknown] Jeremy Lainé <jeremy.laine@m4x.org>
sub   1024g/0xD33A84027D816A68 2002-05-04

#738483#10
Date: 2014-02-09 22:07:29 UTC

Gerald Turner <gturner@unzane.com> writes:

Sorry for the bug report if this is a core Python problem that cannot be
worked around by any particular module.

Ultimately perhaps GnuPG should gain a feature:

  When listing keys without --with-colons, gpg handles the corrupt UTF-8
  data in the function util/strgutil.c:utf8_to_native (outputs
  "\\x%02x"), however when --with-colons is in effect, the raw data is
  printed.

  Maybe we need a "--with-conversion" argument which would change the
  behavior of the function g10/keylist.c:list_keyblock_colon to call
  utf8_to_native when printing UID packets as well.
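
On the Python side, an effect comparable to that escaping can be
approximated with the `backslashreplace` error handler, which is
accepted for decoding since Python 3.5. This is only a sketch of the
idea, not GnuPG's utf8_to_native; `escape_invalid_utf8` is a made-up
name, and unlike the gpg listing it escapes only the undecodable byte
itself, not the byte that follows it:

```python
def escape_invalid_utf8(raw):
    """Decode UTF-8, rendering undecodable bytes as \\xNN escapes."""
    return raw.decode("utf-8", errors="backslashreplace")


# Hypothetical raw UID field with 'é' stored as the single byte 0xE9.
uid = escape_invalid_utf8(b"Jeremy Lain\xe9 <jeremy.laine@m4x.org>")
```

Here `uid` keeps the rest of the record readable while the bad byte
shows up as a literal backslash escape.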

#738483#15
Date: 2014-02-25 22:32:05 UTC

Hi, I saw the new version hit testing and was feeling hopeful, so I
tried to exercise the bug. It's still present.

Quick rundown of how to reproduce:

$ gpg --export 4BE0582590788E11 E2624966A269D927 > bug-738483.gpg

$ python
Python 2.7.6 (default, Jan 11 2014, 14:34:26)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/gnupg.py", line 1062, in list_keys
    self.decode_errors).splitlines()
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 207337: invalid continuation byte

#738483#22
Date: 2015-01-21 13:16:30 UTC

Hi Gerald,
I've hit the same problem. Here is a dirty fix for you:

    import gnupg
    from pprint import pprint

    keyring = gnupg.GPG(keyring="/usr/share/keyrings/debian-keyring.gpg")
    keyring.decode_errors = 'replace'  # <- to replace decode errors with \ufffd
    keys = keyring.list_keys()

    # this shows a list of keys with crazy UIDs
    pprint([k for k in keys if any(u'\ufffd' in uid for uid in k['uids'])])

I think that we could ask José to remove this unfortunate UID from the keyring...

Cheers,
Tomasz