lxml.html.diff compares HTML documents.  The comparison is word-based:
differences are reported as inserted and deleted words.

First, two handy helpers: pwrapped normalizes whitespace and word-wraps
its input before printing it, and pdiff prints the diff of two documents
that way::

    >>> import re, textwrap
    >>> def pwrapped(text):
    ...     text = re.sub(r'[ \n\t\r]+', ' ', text)
    ...     text = textwrap.fill(text)
    ...     print(text)
    >>> def pdiff(text1, text2):
    ...     pwrapped(htmldiff(text1, text2))

Example::

    >>> from lxml.html.diff import htmldiff, html_annotate
    >>> html1 = '<p>This is some test text with some changes and some same stuff</p>'
    >>> html2 = '''<p>This is some test textual writing with some changed stuff 
    ... and some same stuff</p>'''
    >>> pdiff(html1, html2)
    <p>This is some test <ins>textual writing with some changed
    stuff</ins> <del>text with some changes</del> and some same stuff</p>

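To make "word-based" concrete, here is a rough sketch that uses only the
standard library's difflib on whitespace-separated words.  It is an
illustration, not lxml's implementation: the name word_diff is made up for
this example, and it handles plain text only, with no tags.

```python
import difflib

def word_diff(old, new):
    """Hypothetical helper: diff two plain-text strings word by word."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            out.extend(b[j1:j2])
            continue
        # Emit insertions before deletions, mirroring the order seen in
        # the htmldiff output above.
        if j2 > j1:
            out.append('<ins>%s</ins>' % ' '.join(b[j1:j2]))
        if i2 > i1:
            out.append('<del>%s</del>' % ' '.join(a[i1:i2]))
    return ' '.join(out)

print(word_diff('Hi Guy', 'Bye Guy'))  # <ins>Bye</ins> <del>Hi</del> Guy
print(word_diff('A B C', 'A C'))       # A <del>B</del> C
```

The real htmldiff additionally has to keep tags balanced around the
inserted and deleted words, which this sketch does not attempt.
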
Changes to style tags are largely ignored when computing differences,
though the markup itself is kept in the output::

    >>> html1 = '<p>Hi <i>you guys</i></p>'
    >>> html2 = '<p>Hi <i>you</i> guys</p>'
    >>> pdiff(html1, html2)
    <p>Hi <i>you</i> guys</p>
    >>> pdiff('text', '<p>text</p>')
    <p>text</p>
    >>> pdiff('<i>Hi guys</i> !!', '<i>Hi guy</i> !!')
    <i>Hi <ins>guy</ins> <del>guys</del> </i> !!
    >>> pdiff('H<i>i</i>', 'Hi')
    <ins>Hi</ins> <del>H<i>i</i></del>
    >>> pdiff('<i>A B</i> C', '<i>A</i> C')
    <i>A <del>B</del> </i> C
    >>> pdiff('<i>A B</i> C', '<i>B</i> C')
    <i> <del>A</del> B</i> C
    >>> pdiff('<p></p>', '<p></p>')
    <p></p>
    >>> pdiff('<p>Hi</p>', '<p>Bye</p>')
    <p><ins>Bye</ins></p> <p><del>Hi</del></p>
    >>> pdiff('<p>Hi Guy</p>', '<p>Bye Guy</p>')
    <p> <ins>Bye</ins> <del>Hi</del> Guy</p>
    >>> pdiff('<p>Hey there</p>', '')
    <ins></ins> <p><del>Hey there</del></p>

Whitespace is ignored, as it's not meaningful in HTML::

    >>> pdiff('<div>Hi\n\nguys</div>', '<div>Hi guy</div>')
    <div>Hi <ins>guy</ins> <del>guys</del> </div>

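One way to picture why runs of whitespace do not matter in a word-based
comparison: once text is split into words, the amount and kind of
whitespace between them is gone.  A minimal sketch using str.split (an
illustration only, not how lxml tokenizes HTML; the name words is made up
here):

```python
def words(text):
    """Hypothetical tokenizer: split on any run of whitespace."""
    # str.split() with no arguments treats spaces, tabs and newlines alike.
    return text.split()

print(words('Hi\n\nguys') == words('Hi \t guys'))  # True
print(words('Hi\n\nguys') == words('Hi guy'))      # False
```
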
Moving text between paragraphs is not reported, as tag-based changes are
generally ignored::

    >>> pdiff('<p>Hello</p><p>World</p>', '<p>Hello World</p>')
    <p>Hello World</p>

As a special case, changing the href of a link is displayed, and
images are treated like words:

    >>> pdiff('<a href="http://yahoo.com">search</a>', '<a href="http://google.com">search</a>')
    <a href="http://google.com">search <ins> Link: http://google.com</ins>
    <del> Link: http://yahoo.com</del> </a>
    >>> pdiff('<p>Print this <img src="print.gif"></p>', '<p>Print this</p>')
    <p>Print this <del><img src="print.gif"></del> </p>
    >>> pdiff('<a href="http://yahoo.com">search</a>', '<a href="http://yahoo.com">search</a>')
    <a href="http://yahoo.com">search</a>

Images may sometimes not have 'src' attributes:

    >>> pdiff('<img src="tease"><img><img src="test">', '<img><img src="toast">')
    <del><img src="tease"></del> <img><ins><img src="toast"></ins>
    <del><img src="test"></del>

A test of empty elements:

    >>> pdiff('some <br> text', 'some <br> test')
    some <ins><br> test</ins> <del><br> text</del>

The sixteen combinations:

First "insert start" (del start/middle/end/none):

    >>> pdiff('<b>A B C</b>', '<b>D B C</b>')
    <b> <ins>D</ins> <del>A</del> B C</b>
    >>> pdiff('<b>A B C</b>', '<b>D A C</b>')
    <b> <ins>D</ins> A <del>B</del> C</b>
    >>> pdiff('<b>A B C</b>', '<b>D A B</b>')
    <b> <ins>D</ins> A B <del>C</del> </b>
    >>> pdiff('<b>A B C</b>', '<b>D A B C</b>')
    <b> <ins>D</ins> A B C</b>

Next, "insert middle" (del start/middle/end/none):

    >>> pdiff('<b>A B C</b>', '<b>D B C</b>')
    <b> <ins>D</ins> <del>A</del> B C</b>
    >>> pdiff('<b>A B C</b>', '<b>A D C</b>')
    <b>A <ins>D</ins> <del>B</del> C</b>
    >>> pdiff('<b>A B C</b>', '<b>A D B</b>')
    <b>A <ins>D</ins> B <del>C</del> </b>

This one case hits the threshold of our insensitive matching, so the
unchanged "A" is reported as both inserted and deleted:

    >>> pdiff('<b>A B C</b>', '<b>A D B C</b>')
    <b> <ins>A D</ins> <del>A</del> B C</b>

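A rough way to picture such a threshold, using the standard library's
difflib: matching blocks shorter than some cutoff are thrown away, so a
lone matching word surrounded by changes is treated as changed too.  The
class name ShortMatchFilter and the exact cutoff rule below are
assumptions made for this sketch, not a description of lxml's code.

```python
import difflib

class ShortMatchFilter(difflib.SequenceMatcher):
    """Hypothetical matcher that discards very short matching blocks."""

    threshold = 2  # assumed cutoff, chosen for illustration

    def get_matching_blocks(self):
        blocks = super().get_matching_blocks()
        # Scale the cutoff down for short inputs (an assumption of this sketch).
        cutoff = min(self.threshold, len(self.b) / 4)
        # Keep long-enough blocks plus the zero-length terminator block.
        return [m for m in blocks if m.size > cutoff or m.size == 0]

old, new = 'A B C'.split(), 'A D B C'.split()

print(difflib.SequenceMatcher(None, old, new).get_opcodes())
# [('equal', 0, 1, 0, 1), ('insert', 1, 1, 1, 2), ('equal', 1, 3, 2, 4)]

print(ShortMatchFilter(None, old, new).get_opcodes())
# [('replace', 0, 1, 0, 2), ('equal', 1, 3, 2, 4)]
```

With the filter, the one-word match on "A" is dropped, so "A" becomes part
of a replace: "A D" is inserted and "A" is deleted, which is the shape of
the output shown above.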

Then "insert end" (del start/middle/end/none):

    >>> pdiff('<b>A B C</b>', '<b>B C D</b>')
    <b> <del>A</del> B C <ins>D</ins> </b>
    >>> pdiff('<b>A B C</b>', '<b>A C D</b>')
    <b>A <del>B</del> C <ins>D</ins> </b>
    >>> pdiff('<b>A B C</b>', '<b>A B D</b>')
    <b>A B <ins>D</ins> <del>C</del> </b>
    >>> pdiff('<b>A B C</b>', '<b>A B C D</b>')
    <b>A B C <ins>D</ins> </b>

Then no insert (del start/middle/end):

    >>> pdiff('<b>A B C</b>', '<b>B C</b>')
    <b> <del>A</del> B C</b>
    >>> pdiff('<b>A B C</b>', '<b>A C</b>')
    <b>A <del>B</del> C</b>
    >>> pdiff('<b>A B C</b>', '<b>A B</b>')
    <b>A B <del>C</del> </b>

    >>> pdiff('<b>A B</b> C', '<b>A B</b>')
    <b>A B</b> <del>C</del>
    >>> pdiff('<b>A B</b> <b>C</b>', '<b>A B</b>')
    <b>A B</b> <del><b>C</b></del>
    >>> pdiff('A <p><b>hey there</b> <i>how are you?</i></p>', 'A')
    A <p><del><b>hey there</b> <i>how are you?</i></del></p>
    
Testing a larger document, to make sure no spurious matches are
found between two largely unrelated texts:

    >>> pdiff('''
    ... <p>This is a test document with many words in it that goes on
    ... for a while and doesn't have anything to do with the next
    ... document that we match this against</p>''', '''
    ... <p>This is another document with few similarities to the preceding
    ... one, but enough that it may have overlap that could turn into
    ... a confusing series of deletes and inserts.
    ... </p>''')
    <p><ins>This is another document with few similarities to the
    preceding one, but enough that it may have overlap that could turn
    into a confusing series of deletes and inserts. </ins></p>
    <p><del>This is a test document with many words in it that goes on for
    a while and doesn't have anything to do with the next document that we
    match this against</del></p>


Annotation of content can also be done, where every bit of content is
marked up with information about where it came from.

First, some setup; note that html_annotate is called with a sequence
of (document, annotation) pairs.  We'll just use indexes as the
annotations, but you could use author or timestamp information.

    >>> def markup(text, annotation):
    ...     return '<span version="%s">%s</span>' % (annotation, text)
    >>> def panno(*docs):
    ...     pwrapped(html_annotate([(doc, index) for index, doc in enumerate(docs)],
    ...                            markup=markup))

Now, a sequence of documents:

    >>> panno('Hello cruel world', 'Hi cruel world', 'Hi world')
    <span version="1">Hi</span> <span version="0">world</span>
    >>> panno('A similar document', 'A similar document',
    ...       'A similar document here')
    <span version="0">A similar document</span> <span
    version="2">here</span>
    >>> panno('<p>P1 para</p><p>P2 para</p>', '<p>P1 para</p><p>P3 foo</p>')
    <p><span version="0">P1 para</span></p><p><span version="1">P3
    foo</span></p>
    >>> panno('Hello<p>There World</p>','Hello<p>There Town</p>')
    <span version="0">Hello</span><p><span version="0">There</span> <span
    version="1">Town</span></p>
    >>> panno('<p>Hello</p>There World','<p>Hello</p>There Town')
    <p><span version="0">Hello</span></p><span version="0">There</span>
    <span version="1">Town</span>
    >>> panno('<p>Hello</p><p>There World</p>','<p>Hello</p><p>There Town</p>')
    <p><span version="0">Hello</span></p><p><span version="0">There</span>
    <span version="1">Town</span></p>
    >>> panno('<p>Hi <img src="/foo"> You</p>',
    ...       '<p>Hi You</p>',
    ...       '<p>Hi You <img src="/bar"></p>')
    <p><span version="0">Hi You</span> <span version="2"><img
    src="/bar"></span></p>
    >>> panno('<p><a href="/foo">Hey</a></p>',
    ...       '<p><a href="/bar">Hey</a></p>')
    <p><a href="/bar"><span version="0">Hey</span></a></p>
    >>> panno('<p><a href="/foo">Hey You</a></p>',
    ...       '<p><a href="/foo">Hey Guy</a></p>')
    <p><a href="/foo"><span version="0">Hey</span> <span
    version="1">Guy</span></a></p>

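The annotation examples above can be pictured with a small plain-text
sketch: walk through the versions in order, and let every word that
matches the previous version inherit its annotation, while new words get
the index of the version that introduced them.  This is only an
illustration built on difflib; the name annotate_words is made up, and the
real html_annotate also deals with tags and the markup callback.

```python
import difflib

def annotate_words(versions):
    """Hypothetical helper: return (word, version index) pairs for the last version."""
    tokens = versions[0].split()
    notes = [0] * len(tokens)
    for index, text in enumerate(versions[1:], start=1):
        new_tokens = text.split()
        # Words are attributed to the current version unless they match
        # the previous version, in which case the older annotation wins.
        new_notes = [index] * len(new_tokens)
        matcher = difflib.SequenceMatcher(None, tokens, new_tokens)
        for i, j, size in matcher.get_matching_blocks():
            new_notes[j:j + size] = notes[i:i + size]
        tokens, notes = new_tokens, new_notes
    return list(zip(tokens, notes))

print(annotate_words(['Hello cruel world', 'Hi cruel world', 'Hi world']))
# [('Hi', 1), ('world', 0)]
```
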
Internals
---------


Some utility functions::

    >>> from lxml.html.diff import fixup_ins_del_tags, split_unbalanced
    >>> def pfixup(text):
    ...     print(fixup_ins_del_tags(text).strip())
    >>> pfixup('<ins><p>some text <b>and more text</b> and more</p></ins>')
    <p><ins>some text <b>and more text</b> and more</ins></p>
    >>> pfixup('<p><ins>Hi!</ins> you</p>')
    <p><ins>Hi!</ins> you</p>
    >>> pfixup('<div>Some text <ins>and <p>more text</p></ins> </div>')
    <div>Some text <ins>and </ins><p><ins>more text</ins></p> </div>
    >>> pfixup('''
    ...    <ins><table><tr><td>One table</td><td>More stuff</td></tr></table></ins>''')
    <table><tr><td><ins>One table</ins></td><td><ins>More stuff</ins></td></tr></table>

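The split_unbalanced helper imported above is tested next.  As a mental
model, it separates a list of chunks into unmatched opening tags, the
balanced remainder, and unmatched closing tags.  The following is a
hypothetical reimplementation written for this note (the name
split_unbalanced_sketch and the EMPTY_TAGS set are assumptions); it is
meant to reproduce the results in the tests below, not to mirror lxml's
source.

```python
# Assumed set of tags that never take a closing tag.
EMPTY_TAGS = {'br', 'hr', 'img', 'input', 'meta', 'link', 'area', 'base', 'col', 'param'}

def split_unbalanced_sketch(chunks):
    """Return (unmatched openers, balanced chunks, unmatched closers)."""
    start, balanced, end = [], [], []
    stack = []  # entries are (tag name, position in balanced, opening chunk)
    for chunk in chunks:
        if not chunk.startswith('<'):
            balanced.append(chunk)  # plain word
            continue
        name = chunk.split()[0].strip('<>/')
        if name in EMPTY_TAGS:
            balanced.append(chunk)
            continue
        if chunk.startswith('</'):
            if stack and stack[-1][0] == name:
                # The closer matches the innermost opener: keep both.
                _, pos, opener = stack.pop()
                balanced[pos] = opener
                balanced.append(chunk)
            else:
                # A mismatched closer leaves all pending openers unmatched.
                start.extend(opener for _, _, opener in stack)
                stack = []
                end.append(chunk)
        else:
            # Reserve a slot; it is filled in only if a closer turns up.
            stack.append((name, len(balanced), chunk))
            balanced.append(None)
    start.extend(opener for _, _, opener in stack)
    return start, [c for c in balanced if c is not None], end

print(split_unbalanced_sketch(['So', '</i>', 'Hi', '<b>', 'There']))
# (['<b>'], ['So', 'Hi', 'There'], ['</i>'])
```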

Testing split_unbalanced::

    >>> split_unbalanced(['<a href="blah">', 'hey', '</a>'])
    ([], ['<a href="blah">', 'hey', '</a>'], [])
    >>> split_unbalanced(['<a href="blah">', 'hey'])
    (['<a href="blah">'], ['hey'], [])
    >>> split_unbalanced(['Hey', '</i>', 'You', '</b>'])
    ([], ['Hey', 'You'], ['</i>', '</b>'])
    >>> split_unbalanced(['So', '</i>', 'Hi', '<b>', 'There', '</b>'])
    ([], ['So', 'Hi', '<b>', 'There', '</b>'], ['</i>'])
    >>> split_unbalanced(['So', '</i>', 'Hi', '<b>', 'There'])
    (['<b>'], ['So', 'Hi', 'There'], ['</i>'])