``lxml.html`` adds a ``find_class`` method to elements::

    >>> from lxml.etree import Comment
    >>> from lxml.html import document_fromstring, fragment_fromstring, tostring
    >>> from lxml.html.clean import clean, clean_html
    >>> from lxml.html import usedoctest
    >>> try: unicode = __builtins__["unicode"]
    ... except (KeyError, NameError): unicode = str

    >>> h = document_fromstring('''
    ... <html><head></head>
    ... <body>
    ...   <a class="vcard fn url">P1</a>
    ...   <a class="not-fn vcard">P2</a>
    ... </body></html>''')
    >>> print(tostring(h, encoding=unicode))
    <html><head></head><body>
      <a class="vcard fn url">P1</a>
      <a class="not-fn vcard">P2</a>
    </body></html>
    >>> print([e.text for e in h.find_class('fn')])
    ['P1']
    >>> print([e.text for e in h.find_class('vcard')])
    ['P1', 'P2']

Also added is ``find_rel_links``, which you can use to search for links
like ``<a rel="tag">``::

    >>> h = document_fromstring('''
    ... <a href="1">test 1</a>
    ... <a href="2" rel="tag"></a>
    ... <a href="3" rel="tagging">item 3</a>
    ... <a href="4" rel="TAG">item 4</a>''')
    >>> print([e.attrib['href'] for e in h.find_rel_links('tag')])
    ['2', '4']
    >>> print([e.attrib['href'] for e in h.find_rel_links('nofollow')])
    []

Another method is ``get_element_by_id``, which does what it says::

    >>> print(tostring(fragment_fromstring('''
    ... <div>
    ...  <span id="test">stuff</span>
    ... </div>''').get_element_by_id('test'), encoding=unicode))
    <span id="test">stuff</span>

To get the content of an element without the tags, use ``text_content()``::

    >>> el = fragment_fromstring('''
    ... <div>This is <a href="foo">a <b>bold</b> link</a></div>''')
    >>> el.text_content()
    'This is a bold link'

You can also drop a single element while keeping its content
(``drop_tag()``), or drop an element together with its entire subtree
(``drop_tree()``)::

    >>> doc = document_fromstring('''
    ... <html>
    ...  <body>
    ...   <div id="body">
    ...    This is a <a href="foo" id="link">test</a> of stuff.
    ...   </div>
    ...   <!-- a comment -->
    ...   <div>footer</div>
    ...  </body>
    ... </html>''')
    >>> doc.get_element_by_id('link').drop_tag()
    >>> print(tostring(doc, encoding=unicode))
    <html>
     <body>
      <div id="body">
       This is a test of stuff.
      </div>
      <!-- a comment -->
      <div>footer</div>
     </body>
    </html>
    >>> doc.get_element_by_id('body').drop_tree()
    >>> print(tostring(doc, encoding=unicode))
    <html>
     <body>
      <!-- a comment -->
      <div>footer</div>
     </body>
    </html>

Note, however, that comment text is not merged into the tree when you
drop a comment.  For comments, ``drop_tag()`` behaves exactly like
``drop_tree()``::

    >>> for comment in doc.getiterator(Comment):
    ...     comment.drop_tag()
    >>> print(tostring(doc, encoding=unicode))
    <html>
     <body>
      <div>footer</div>
     </body>
    </html>
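
A minimal sketch of the same behaviour on a fragment where the comment sits
between two pieces of text. The markup is an assumption made up for this
illustration. It uses ``iter()`` rather than ``getiterator()`` and copies
the matches into a list first, so that the tree is not modified while it is
being traversed.

```python
from lxml.etree import Comment
from lxml.html import fragment_fromstring, tostring

# Hypothetical fragment: text, then a comment, then more text.
el = fragment_fromstring("<div>before <!-- a comment --> after</div>")

# Collect the comments first, then drop them one by one.
for comment in list(el.iter(Comment)):
    comment.drop_tag()

result = tostring(el, encoding="unicode")
print(result)

# The comment's own text is gone, while the text that followed it
# (its tail) has been appended to the parent's text.
print(repr(el.text))
```

The surrounding text is preserved; only the comment itself is removed.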