#671842 python-html5lib: lxml builder: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

#671842#3
Date: 2012-05-07 11:48:05 UTC
The lxml tree builder raises an exception when parsing a string that
contains control characters:
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 38, in parse
      return p.parse(doc, encoding=encoding)
    File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 211, in parse
      parseMeta=parseMeta, useChardet=useChardet)
    File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 111, in _parse
      self.mainLoop()
    File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 174, in mainLoop
      self.phase.processCharacters(token)
    File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 572, in processCharacters
      self.parser.phase.processCharacters(token)
    File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 611, in processCharacters
      self.parser.phase.processCharacters(token)
    File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 652, in processCharacters
      self.parser.phase.processCharacters(token)
    File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 711, in processCharacters
      self.parser.phase.processCharacters(token)
    File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 804, in processCharacters
      self.parser.phase.processCharacters(token)
    File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 948, in processCharacters
      self.tree.insertText(token["data"])
    File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/_base.py", line 288, in insertText
      parent.insertText(data)
    File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree_lxml.py", line 225, in insertText
      builder.Element.insertText(self, data, insertBefore)
    File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree.py", line 114, in insertText
      self._element.text += data
    File "lxml.etree.pyx", line 904, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:37110)
    File "apihelpers.pxi", line 721, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:16855)
    File "apihelpers.pxi", line 1366, in lxml.etree._utf8 (src/lxml/lxml.etree.c:22060)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
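
As a possible workaround until this is resolved, the input can be filtered before it reaches the lxml tree builder. The sketch below is not part of html5lib or lxml; the helper names (strip_xml_incompatible, is_xml_compatible) are hypothetical. It assumes the rejected characters are the NULL byte, the C0 control characters other than tab, newline and carriage return, plus U+FFFE and U+FFFF, which is what the error message and the XML 1.0 character range suggest.

```python
import re

# Assumption: these are the characters the lxml error message refers to.
# NULL, C0 controls except \t \n \r, and the non-characters U+FFFE/U+FFFF.
_XML_INCOMPATIBLE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def is_xml_compatible(text):
    """Return True if text has none of the characters matched above."""
    return _XML_INCOMPATIBLE.search(text) is None


def strip_xml_incompatible(text, replacement=""):
    """Return text with every matched character replaced (dropped by default)."""
    return _XML_INCOMPATIBLE.sub(replacement, text)


if __name__ == "__main__":
    doc = "<p>form\x0cfeed and null\x00byte</p>"
    print(is_xml_compatible(doc))
    print(strip_xml_incompatible(doc))
```

The cleaned string could then be passed to html5lib.parse() with the lxml tree builder. Note that this changes the document content, so it is only suitable where losing those characters is acceptable.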

#671842#10
Date: 2013-12-13 12:57:11 UTC
Hi.

It seems the problem still happens with version 0.99 (from a package prepared for a pending upload to experimental):

$ python
Python 2.7.5+ (default, Sep 17 2013, 17:31:54)
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/usr/lib/python2.7/dist-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/usr/lib/python2.7/dist-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/usr/lib/python2.7/dist-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/usr/lib/python2.7/dist-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/usr/lib/python2.7/dist-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/usr/lib/python2.7/dist-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/usr/lib/python2.7/dist-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41264)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18755)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24545)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
olivier@inf-8660:~/svn/svn.debian.org/python-modules/packages/build-area$ dpkg -l python-html5lib
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                        Version                    Architecture               Description
+++-===========================================-==========================-==========================-===========================================================================================
ii  python-html5lib                             0.99-1                     all                        HTML parser/tokenizer based on the WHATWG HTML5 specification

Are you sure this is a bug?

Would you mind checking with upstream and/or forwarding the issue there?

Best regards,