lxml.html adds a ``find_class`` method to elements::

    >>> from lxml.etree import Comment
    >>> from lxml.html import document_fromstring, fragment_fromstring, tostring
    >>> from lxml.html import fragments_fromstring, fromstring
    >>> from lxml.html.clean import clean, clean_html
    >>> from lxml.html import usedoctest
    >>> try: unicode = unicode
    ... except NameError: unicode = str

    >>> h = document_fromstring('''
    ... <html><head></head>
    ... <body>
    ...   <a class="vcard
    ... fn   url" href="foobar">P1</a>
    ...   <a class="not-fn vcard" href="baz">P2</a>
    ... </body></html>''')
    >>> print(tostring(h, encoding=unicode))
    <html>
      <head></head>
      <body>
        <a class="vcard
    fn   url" href="foobar">P1</a>
        <a class="not-fn vcard" href="baz">P2</a>
      </body>
    </html>
    >>> print([e.text for e in h.find_class('fn')])
    ['P1']
    >>> print([e.text for e in h.find_class('vcard')])
    ['P1', 'P2']

as well as the ability to toggle classes using a set-like interface

    >>> el = fragment_fromstring('<span class="foo bar"></span>')
    >>> 'foo' in el.classes
    True
    >>> 'f00' in el.classes
    False
    >>> el.classes.update(('qux', 'quux'))
    >>> sorted(el.get('class').split())
    ['bar', 'foo', 'quux', 'qux']
    >>> el.classes.clear()
    >>> el.get('class')
    >>> list(el.classes)
    []
    >>> el.classes.add('a')
    >>> el.classes.add('b')
    >>> el.classes.remove('a')
    >>> el.classes.remove('c')
    Traceback (most recent call last):
    ...
    KeyError: 'c'
    >>> el.classes.discard('c')
    >>> el.get('class')
    'b'
    >>> el.classes.add('b')
    >>> el.get('class')
    'b'
    >>> el.classes |= ('a', 'b')
    >>> el.get('class')
    'b a'
    >>> el.classes -= ('b', 'c', 'd')
    >>> el.get('class')
    'a'

with an extra toggle method to switch the state of classes

    >>> el.get('class')
    'a'
    >>> el.classes.toggle('a')
    False
    >>> el.get('class')
    >>> el.classes.toggle('foo')
    True
    >>> el.get('class')
    'foo'
    >>> el.classes.toggle('foo')
    False
    >>> el.get('class')

    >>> el.classes.add("foo\n")
    Traceback (most recent call last):
    ...
    ValueError: Invalid class name: 'foo\n'
    >>> el.classes.remove("foo ")
    Traceback (most recent call last):
    ...
    ValueError: Invalid class name: 'foo '

Also added is ``find_rel_links``, which you can use to search for links
like ``<a rel="$something">``::

    >>> h = document_fromstring('''
    ... <a href="1">test 1</a>
    ... <a href="2" rel="tag">item 2</a>
    ... <a href="3" rel="tagging">item 3</a>
    ... <a href="4" rel="TAG">item 4</a>''')
    >>> print([e.attrib['href'] for e in h.find_rel_links('tag')])
    ['2', '4']
    >>> print([e.attrib['href'] for e in h.find_rel_links('nofollow')])
    []

Another method is ``get_element_by_id`` that does what it says::

    >>> print(tostring(fragment_fromstring('''
    ... <div>
    ...  <span id="test">stuff</span>
    ... </div>''').get_element_by_id('test'), encoding=unicode))
    <span id="test">stuff</span>

Or to get the content of an element without the tags, use ``text_content()``::

    >>> el = fragment_fromstring('''
    ... <div>This is <a href="foo">a <b>bold</b> link</a></div>''')
    >>> el.text_content()
    'This is a bold link'

Or drop an element (leaving its content) or the entire tree, like::

    >>> doc = document_fromstring('''
    ... <html>
    ...  <body>
    ...   <div id="body">
    ...    <p>This is a <a href="foo.html" id="link">test</a> of stuff.</p>
    ...    <!-- a comment -->
    ...   </div>
    ...   <div>footer</div>
    ...  </body>
    ... </html>''')
    >>> doc.get_element_by_id('link').drop_tag()
    >>> print(tostring(doc, encoding=unicode))
    <html>
     <body>
      <div id="body">
       <p>This is a test of stuff.</p>
       <!-- a comment -->
      </div>
      <div>footer</div>
     </body>
    </html>
    >>> doc.get_element_by_id('body').drop_tree()
    >>> print(tostring(doc, encoding=unicode))
    <html>
     <body>
      <div>footer</div>
     </body>
    </html>

Note, however, that comment text will not be merged into the tree when you
drop the comment.  Here, ``drop_tag()`` behaves exactly like ``drop_tree()``:

    >>> for comment in doc.iter(Comment):
    ...     comment.drop_tag()
    >>> print(tostring(doc, encoding=unicode))
    <html>
     <body>
      <div>footer</div>
     </body>
    </html>

In Python 3, it should be possible to parse strings given as bytes objects, at
least if an encoding is given.

    >>> from lxml.html import HTMLParser
    >>> enc = 'utf-8'
    >>> html_parser = HTMLParser(encoding=enc)
    >>> src = '<html><body>Test</body></html>'.encode(enc)

    >>> doc = fromstring(src, parser=html_parser)
    >>> print(tostring(doc, encoding=unicode))
    <html><body><p>Test</p></body></html>

    >>> docs = fragments_fromstring(src, parser=html_parser)
    >>> len(docs)
    1
    >>> print(docs[0])
    Test

Bug 599318: Calling ``fromstring`` with a frameset fragment should not raise an
error; the whole document is returned.

    >>> import lxml.html
    >>> content='''
    ... <frameset>
    ...  <frame src="main.php" name="srcpg">
    ... </frameset>'''
    >>> etree_document = lxml.html.fromstring(content)
    >>> print(tostring(etree_document, encoding=unicode))
    <html><frameset><frame src="main.php" name="srcpg"></frameset></html>

Bug 599318: Calling ``fromstring`` with a div fragment should not raise an
error; only the element is returned.

    >>> import lxml.html
    >>> content='<div></div>'
    >>> etree_document = lxml.html.fromstring(content)
    >>> print(tostring(etree_document, encoding=unicode))
    <div></div>

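The practical consequence of these cases is that the return value of
``fromstring`` depends on the input: callers cannot assume they always get an
``<html>`` root back. A minimal sketch of checking this, using made-up input
strings and variable names:

```python
from lxml.html import fromstring

# A lone element comes back as that element ...
fragment_root = fromstring('<div></div>')

# ... while a complete document comes back as its <html> root.
document_root = fromstring('<html><body><p>text</p></body></html>')

print(fragment_root.tag)  # div
print(document_root.tag)  # html
```

Code that needs a full document regardless of input may prefer
``document_fromstring``, which is imported at the top of this file.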
Bug 599318: Calling ``fromstring`` with a head fragment should not raise an
error; the whole document is returned.

    >>> import lxml.html
    >>> content='<head></head>'
    >>> etree_document = lxml.html.fromstring(content)
    >>> print(tostring(etree_document, encoding=unicode))
    <html><head></head></html>

Bug 690319: Leading whitespace before the doctype declaration should not raise
an error.

    >>> import lxml.html
    >>> content='''
    ...     <!DOCTYPE html>
    ...     <html>
    ...     </html>'''
    >>> etree_document = lxml.html.fromstring(content)
    >>> print(tostring(etree_document, encoding=unicode))
    <html></html>

Feature https://github.com/lxml/lxml/pull/140: ensure_head_body option:

    >>> from lxml.html import document_fromstring, tostring
    >>> from functools import partial
    >>> tos = partial(tostring, encoding=unicode)

    >>> print(tos(document_fromstring('<p>test</p>')))
    <html><body><p>test</p></body></html>
    >>> print(tos(document_fromstring('<p>test</p>', ensure_head_body=True)))
    <html><head></head><body><p>test</p></body></html>

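    >>> print(tos(document_fromstring('<meta>')))
    <html><head><meta></head></html>
    >>> print(tos(document_fromstring('<meta>', ensure_head_body=True)))
    <html><head><meta></head><body></body></html>

    >>> print(tos(document_fromstring('<html></html>')))
    <html></html>
    >>> print(tos(document_fromstring('<html></html>', ensure_head_body=True)))
    <html><head></head><body></body></html>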
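
Outside a doctest, the effect of ``ensure_head_body`` can also be checked
structurally instead of by comparing serialised strings. This is a rough
sketch with illustrative variable names. It only relies on the option
guaranteeing that both ``<head>`` and ``<body>`` exist under the root.

```python
from lxml.html import document_fromstring

# Same input, parsed without and with the option.
plain = document_fromstring('<p>test</p>')
ensured = document_fromstring('<p>test</p>', ensure_head_body=True)

# Direct children of the <html> root in each case.
plain_children = [child.tag for child in plain]
ensured_children = [child.tag for child in ensured]

print(plain_children)
print(ensured_children)  # ['head', 'body']
```

Checking the child tags this way avoids depending on how the serialiser lays
out whitespace.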