lxml.html.diff does HTML comparisons. These are word-based comparisons.
First, a handy function for normalizing whitespace and doing word wrapping::
    >>> import re, textwrap
    >>> def pwrapped(text):
    ...     text = re.sub(r'[ \n\t\r]+', ' ', text)
    ...     text = textwrap.fill(text)
    ...     print(text)

    >>> def pdiff(text1, text2):
    ...     pwrapped(htmldiff(text1, text2))
Example::
    >>> from lxml.html.diff import htmldiff, html_annotate
    >>> html1 = '<p>This is some test text with some changes and some same stuff</p>'
    >>> html2 = '''<p>This is some test textual writing with some changed stuff
    ... and some same stuff</p>'''
    >>> pdiff(html1, html2)
    <p>This is some test <ins>textual writing with some changed
    stuff</ins> <del>text with some changes</del> and some same stuff</p>
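Under the hood the comparison is word-based: both documents are flattened to word streams and aligned. The idea can be sketched with the stdlib's difflib (a toy illustration only, not lxml's implementation, which also tracks and re-attaches the markup surrounding each word):

```python
from difflib import SequenceMatcher

def word_diff(old, new):
    # Toy word-level diff: insertions wrapped in <ins>, deletions in <del>.
    # Ignores markup entirely -- unlike htmldiff, which carries tags along.
    a, b = old.split(), new.split()
    out = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if op in ('insert', 'replace'):
            out.append('<ins>%s</ins>' % ' '.join(b[j1:j2]))
        if op in ('delete', 'replace'):
            out.append('<del>%s</del>' % ' '.join(a[i1:i2]))
        if op == 'equal':
            out.append(' '.join(a[i1:i2]))
    return ' '.join(out)
```

As in htmldiff's output, a replaced word shows up as the insertion followed by the deletion: `word_diff('A B C', 'A D C')` gives `'A <ins>D</ins> <del>B</del> C'`.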
Style tags are largely ignored in terms of differences, though markup is not eliminated::
    >>> html1 = '<p>Hi <i>you guys</i></p>'
    >>> html2 = '<p>Hi <i>you</i> guys</p>'
    >>> pdiff(html1, html2)
    <p>Hi <i>you</i> guys</p>

    >>> pdiff('text', '<p>text</p>')
    <p>text</p>

    >>> pdiff('Hi <i>guys</i> !!', 'Hi <i>guy</i> !!')
    Hi <ins><i>guy</i></ins> <del><i>guys</i></del> !!
    >>> pdiff('<i>Hi</i>', '<b>Hi</b>')
    <b>Hi</b>
    >>> pdiff('A B C', 'A C')
    A <del>B</del> C
    >>> pdiff('A B C', 'B C')
    <del>A</del> B C
    >>> pdiff('<p></p>', '<p></p>')
    <p></p>
    >>> pdiff('<p>Hi</p>', '<p>Bye</p>')
    <p><ins>Bye</ins></p> <p><del>Hi</del></p>
    >>> pdiff('<p>Hi Guy</p>', '<p>Bye Guy</p>')
    <p><ins>Bye</ins> <del>Hi</del> Guy</p>
    >>> pdiff('<p>Hey there</p>', '')
    <p><del>Hey there</del></p>
Whitespace is ignored, as it's not meaningful in HTML::
    >>> pdiff('<p>Hi\n\nguys</p>', '<p>Hi guy</p>')
    <p>Hi <ins>guy</ins> <del>guys</del></p>
Movement between paragraphs is ignored, as tag-based changes are generally ignored::
    >>> pdiff('<p>Hello</p><p>World</p>', '<p>Hello World</p>')
    <p>Hello World</p>
As a special case, changing the href of a link is displayed, and
images are treated like words:
    >>> pdiff('<a href="http://yahoo.com">search</a>',
    ...       '<a href="http://google.com">search</a>')
    <a href="http://google.com">search</a> <ins>Link:
    http://google.com</ins> <del>Link: http://yahoo.com</del>
    >>> pdiff('<p>Print this <img src="print.gif"></p>', '<p>Print this</p>')
    <p>Print this <del><img src="print.gif"></del></p>
    >>> pdiff('<a href="http://google.com">search</a>',
    ...       '<a href="http://google.com">search</a>')
    <a href="http://google.com">search</a>
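The href special-casing can be mimicked by making each link contribute a `Link: URL` pseudo-word, so a URL change surfaces as an ordinary word-level insert/delete. A toy tokenizer in that spirit (a hypothetical helper for illustration, not lxml's actual tokenizer):

```python
import re

def link_words(html):
    # Toy tokenizer: a link yields its visible words plus a "Link: URL"
    # pseudo-word, so changing only the href still changes the word stream.
    words = []
    for m in re.finditer(r'<a href="([^"]+)">([^<]*)</a>|(\S+)', html):
        if m.group(3):                      # plain word outside a link
            words.append(m.group(3))
        else:                               # link: text words + href pseudo-word
            words.extend(m.group(2).split())
            words.append('Link: %s' % m.group(1))
    return words
```

Diffing `link_words` of two documents with `difflib.SequenceMatcher` then reports the old and new `Link:` pseudo-words as a replace, which is essentially what the `pdiff` output above shows.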
Images may sometimes not have 'src' attributes:
    >>> pdiff('<img>', '<img>')
    <img>
A test of empty elements:
    >>> pdiff('some <br> text', 'some <br> test')
    some <br> <ins>test</ins> <del><br> text</del>
The sixteen combinations::
First "insert start" (del start/middle/end/none):
    >>> pdiff('A B C', 'D B C')
    <ins>D</ins> <del>A</del> B C
    >>> pdiff('A B C', 'D A C')
    <ins>D</ins> A <del>B</del> C
    >>> pdiff('A B C', 'D A B')
    <ins>D</ins> A B <del>C</del>
    >>> pdiff('A B C', 'D A B C')
    <ins>D</ins> A B C
Next, "insert middle" (del start/middle/end/none):
    >>> pdiff('A B C', 'D B C')
    <ins>D</ins> <del>A</del> B C
    >>> pdiff('A B C', 'A D C')
    A <ins>D</ins> <del>B</del> C
    >>> pdiff('A B C', 'A D B')
    A <ins>D</ins> B <del>C</del>
This one case hits the threshold of our insensitive matching:
    >>> pdiff('A B C', 'A D B C')
    <ins>A D</ins> <del>A</del> B C
Then "insert end" (del start/middle/end/none):
    >>> pdiff('A B C', 'B C D')
    <del>A</del> B C <ins>D</ins>
    >>> pdiff('A B C', 'A C D')
    A <del>B</del> C <ins>D</ins>
    >>> pdiff('A B C', 'A B D')
    A B <ins>D</ins> <del>C</del>
    >>> pdiff('A B C', 'A B C D')
    A B C <ins>D</ins>
Then no insert (del start/middle/end):
    >>> pdiff('A B C', 'B C')
    <del>A</del> B C
    >>> pdiff('A B C', 'A C')
    A <del>B</del> C
    >>> pdiff('A B C', 'A B')
    A B <del>C</del>
    >>> pdiff('<p>A hey there how are you?</p>', '<p>A</p>')
    <p>A <del>hey there how are you?</del></p>
Testing a larger document, to make sure there are not weird
unnecessary parallels found:
    >>> pdiff('''
    ... <p>This is a test document with many words in it that goes on
    ... for a while and doesn't have anything do to with the next
    ... document that we match this against</p>''', '''
    ... <p>This is another document with few similarities to the preceding
    ... one, but enough that it may have overlap that could turn into
    ... a confusing series of deletes and inserts.</p>
    ... ''')
    <p><ins>This is another document with few similarities to the
    preceding one, but enough that it may have overlap that could turn
    into a confusing series of deletes and inserts.</ins></p> <p><del>This
    is a test document with many words in it that goes on for a while and
    doesn't have anything do to with the next document that we match this
    against</del></p>
Annotation of content can also be done, where every bit of content is
marked up with information about where it came from.
First, some setup; note that html_annotate is called with a sequence
of documents and the annotation associated with that document. We'll
just use indexes, but you could use author or timestamp information.
    >>> def markup(text, annotation):
    ...     return '<span version="%s">%s</span>' % (annotation, text)

    >>> def panno(*docs):
    ...     pwrapped(html_annotate([(doc, index) for index, doc in enumerate(docs)],
    ...                            markup=markup))
Now, a sequence of documents:
    >>> panno('Hello cruel world', 'Hi cruel world', 'Hi world')
    <span version="1">Hi</span> <span version="0">world</span>
    >>> panno('A similar document', 'A similar document',
    ...       'A similar document here')
    <span version="0">A similar document</span> <span
    version="2">here</span>
    >>> panno('<p>P1 para</p> <p>P2 para</p>', '<p>P1 para</p> <p>P3 foo</p>')
    <p><span version="0">P1 para</span></p> <p><span version="1">P3
    foo</span></p>
    >>> panno('Hello<i>There</i> World', 'Hello<i>There</i> Town')
    <span version="0">Hello<i>There</i></span> <span
    version="1">Town</span>
    >>> panno('Hello<br>There World', 'Hello<br>There Town')
    <span version="0">Hello<br>There</span> <span
    version="1">Town</span>
    >>> panno('<p>Hello</p> <p>There World</p>',
    ...       '<p>Hello</p> <p>There Town</p>')
    <p><span version="0">Hello</span></p> <p><span
    version="0">There</span> <span version="1">Town</span></p>
    >>> panno('<p>Hi</p> <p>You</p>',
    ...       '<p>Hi You</p>',
    ...       '<p>Hi You <img src="logo.jpg"></p>')
    <p><span version="0">Hi You</span> <span version="2"><img
    src="logo.jpg"></span></p>
    >>> panno('<p>Hey</p>',
    ...       '<p>Hey</p>')
    <p><span version="0">Hey</span></p>
    >>> panno('<p>Hey You</p>',
    ...       '<p>Hey Guy</p>')
    <p><span version="0">Hey</span> <span version="1">Guy</span></p>
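The annotation pass can also be sketched with difflib: walk through the document history and, wherever a word is unchanged from one version to the next, carry its earlier annotation forward. A simplified, whitespace-tokenized sketch (the real html_annotate also handles markup and merges adjacent same-version words into one span):

```python
from difflib import SequenceMatcher

def annotate(docs):
    # Toy provenance tracking: each word keeps the version index in which
    # it first appeared; words new in a version get that version's index.
    words, vers = docs[0].split(), [0] * len(docs[0].split())
    for version, doc in enumerate(docs[1:], 1):
        new = doc.split()
        new_vers = [version] * len(new)
        for op, i1, i2, j1, j2 in SequenceMatcher(None, words, new).get_opcodes():
            if op == 'equal':               # unchanged words inherit their old version
                new_vers[j1:j2] = vers[i1:i2]
        words, vers = new, new_vers
    return ' '.join('<span version="%d">%s</span>' % (v, w)
                    for v, w in zip(vers, words))
```

For the first panno example above this reproduces the same provenance: "Hi" is attributed to version 1 and "world" to version 0, while "cruel" (deleted in the last version) drops out entirely.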
Internals
---------
Some utility functions::
    >>> from lxml.html.diff import fixup_ins_del_tags, split_unbalanced
    >>> def pfixup(text):
    ...     print(fixup_ins_del_tags(text).strip())
    >>> pfixup('<ins><p>some text <b>and more text</b> and more</p></ins>')
    <p><ins>some text <b>and more text</b> and more</ins></p>
    >>> pfixup('<p><ins>Hi!</ins> you</p>')
    <p><ins>Hi!</ins> you</p>
    >>> pfixup('<ins><img src="foo.png"></ins>')
    <ins><img src="foo.png"></ins>
    >>> pfixup('''<ins><table><tr>
    ... <td><img src="cell.png"></td></tr></table></ins>''')
    <table><tr><td><ins><img src="cell.png"></ins></td></tr></table>
Testing split_unbalanced::
    >>> split_unbalanced(['<a href="blah">', 'hey', '</a>'])
    ([], ['<a href="blah">', 'hey', '</a>'], [])
    >>> split_unbalanced(['<a href="blah">', 'hey'])
    (['<a href="blah">'], ['hey'], [])
    >>> split_unbalanced(['Hey', '</i>', 'You', '</b>'])
    ([], ['Hey', 'You'], ['</i>', '</b>'])
    >>> split_unbalanced(['So', '</b>', 'Hi', '<b>', 'There', '</b>'])
    ([], ['So', 'Hi', '<b>', 'There', '</b>'], ['</b>'])
    >>> split_unbalanced(['So', '</b>', 'Hi', '<b>', 'There'])
    (['<b>'], ['So', 'Hi', 'There'], ['</b>'])
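The contract split_unbalanced implements is: tags that open but never close come back in the first list, balanced content (including matched open/close pairs and plain text) in the second, and tags that close without a matching open in the third. A toy re-implementation of that contract, for illustration only (not lxml's code):

```python
import re

def split_unbalanced_toy(chunks):
    # Returns (unmatched_opens, balanced_middle, unmatched_closes).
    start, end = [], []
    keep = [True] * len(chunks)        # chunks that stay in the balanced middle
    stack = []                         # indexes of not-yet-matched open tags
    for i, chunk in enumerate(chunks):
        m = re.match(r'</?([a-zA-Z]+)', chunk)
        if not m:
            continue                   # plain text is always balanced
        if chunk.startswith('</'):
            if stack and re.match(r'<([a-zA-Z]+)',
                                  chunks[stack[-1]]).group(1) == m.group(1):
                stack.pop()            # close matches the nearest open tag
            else:
                end.append(chunk)      # close tag with no matching open
                keep[i] = False
        else:
            stack.append(i)
    for i in stack:                    # open tags left without a close
        start.append(chunks[i])
        keep[i] = False
    balanced = [c for i, c in enumerate(chunks) if keep[i]]
    return start, balanced, end
```

This reproduces the doctest results above, e.g. the dangling `</b>` at the start of `['So', '</b>', 'Hi', '<b>', 'There']` lands in the trailing list while the unclosed `<b>` lands in the leading one.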