Performance improvement of the Similarity checker (#4565)

* Adds execution time measurements
* Removes @profile decorator
* Changes the whole algorithm. The old one, while very readable, is a performance bottleneck, especially when comparing two big files. Let's try a more efficient one...
* Uses a copy of SuccessiveLinesLimits in the all_couples collection in order to avoid modifying the same object when removing successive common lines (in the remove_successive method).
* Removes old algorithm (dead code now)
* Creates the LineSpecifs type, to be clearer when manipulating stripped lines.
* Adds type hint in the stripped_lines function signature. Modifies docstring for the same function
* LineSetStartCouple is now a regular class (no longer a NamedTuple). This allows defining the __add__ dunder method to make operations clearer
* Adds __repr__ method to SuccessiveLinesLimits class. Also updates the docstring
* Modifies the way the LinesChunk hash is computed. If the line is not empty, or is empty but corresponds to a docstring, then the hash is the classical one. Otherwise the hash is randomized in order to be sure that two empty lines corresponding to import lines are not considered equal
* Empty lines that were comments before being stripped are considered equal
* Reworks the help message to distinguish the options
* Adds a full line of comments in the test and adapts the expected results
* ignore-docstrings is True by default, so all docstrings (different or identical) are considered identical
* In case of multiprocessing reports options
* Simplifies the algorithm and clarifies the use of the options. For now, if something is ignored (docstrings, comments, signatures, imports) then it is removed from the stripped lines collection. No need for the LineType anymore. The drawback is that 2 chunks of lines in both files may have been detected as similar (which is correct) but have a different number of lines because, for example, some comments are inserted and comments are ignored.
* The CplSuccessiveLinesLimits is no longer a NamedTuple because we added the effective_cm_lines_nb member, which has to be mutable. It holds the number of "true" common lines between both files (i.e. the number of common lines in both stripped lines collections)
* check_sim function is renamed filter_noncode_lines and checks the similarities on the stripped lines collection (and no longer on the real lines collection). Adds the computation of the effective number of common lines (i.e. the number of "true" common stripped lines)
* Adapts legacy code so that the effective number of common lines is printed (the number of common lines in both stripped lines collections) and the corresponding component of the first file is printed too.
* Updates the expected results so that they contain the effective common lines number
* Stripped lines are purged of everything that is ignored (by default comments and docstrings). Adapts the expected results accordingly
* By default comments and docstrings are excluded from the comparison
* Prints also the ending line number in the report
* Adapts the expected results to take into account the ending line number
* Takes into account Pierre-Sassoulas remarks
* Takes into account the remarks of cdce8p
* The parameters of the SimilarChecker are read from configuration also in the __init__ method

Co-authored-by: Pierre Sassoulas <pierre.sassoulas@gmail.com>
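
For readers unfamiliar with the general idea, the following is a minimal sketch of a window-indexing approach to finding common runs of stripped lines. It is not the code from this commit: the names chunk_starts and common_chunks are hypothetical, and it only illustrates why indexing windows of successive lines in a dictionary can avoid comparing every line of one file against every line of the other.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def chunk_starts(lines: List[str], min_lines: int) -> Dict[Tuple[str, ...], List[int]]:
    """Map each window of `min_lines` successive lines to its start indices."""
    index: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for start in range(len(lines) - min_lines + 1):
        index[tuple(lines[start:start + min_lines])].append(start)
    return index


def common_chunks(
    first: List[str], second: List[str], min_lines: int = 4
) -> List[Tuple[int, int, int]]:
    """Return (start_in_first, start_in_second, length) for each maximal common run.

    Hypothetical helper for illustration only, not pylint's API.
    """
    index_first = chunk_starts(first, min_lines)
    index_second = chunk_starts(second, min_lines)
    # Every pair of start indices whose windows of lines are equal.
    couples = {
        (i, j)
        for key in index_first.keys() & index_second.keys()
        for i in index_first[key]
        for j in index_second[key]
    }
    runs = []
    for start_i, start_j in sorted(couples):
        if (start_i - 1, start_j - 1) in couples:
            continue  # This couple extends an earlier run; skip it.
        extra = 0
        # Merge successive couples into one longer run.
        while (start_i + extra + 1, start_j + extra + 1) in couples:
            extra += 1
        runs.append((start_i, start_j, min_lines + extra))
    return runs
```

With first = ["a", "b", "c", "d", "e", "x"] and second = ["z", "a", "b", "c", "d", "e", "y"], the two overlapping 4-line windows are merged into a single 5-line run starting at index 0 in the first list and index 1 in the second.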
author: hippo91 <guillaume.peillex@gmail.com> 2021-07-28 21:11:10 +0200
committer: GitHub <noreply@github.com> 2021-07-28 21:11:10 +0200
commit: 1d1619ef913b99b06647d2030bddff4800abdf63 (patch)
tree: e39429f9c08dd3fc5aa99f250b4a399d5385b46e /tests/checkers
parent: 7d07138772626000cec94204c39c072ca239fc1b (diff)
download: pylint-git-1d1619ef913b99b06647d2030bddff4800abdf63.tar.gz
1 files changed, 26 insertions, 51 deletions
diff --git a/tests/checkers/unittest_similar.py b/tests/checkers/unittest_similar.py
index ca64bdba3..db59b00c5 100644
--- a/tests/checkers/unittest_similar.py
+++ b/tests/checkers/unittest_similar.py
@@ -52,26 +52,27 @@ def test_ignore_comments():
         == (
             """
 10 similar lines in 2 files
-==%s:0
-==%s:0
+==%s:[0:11]
+==%s:[0:11]
    import one
    from two import two
    three
    four
    five
    six
+   # A full line comment
    seven
    eight
    nine
    ''' ten
-TOTAL lines=60 duplicates=10 percent=16.67
+TOTAL lines=62 duplicates=10 percent=16.13
 """
             % (SIMILAR1, SIMILAR2)
         ).strip()
     )
 
 
-def test_ignore_docsrings():
+def test_ignore_docstrings():
     output = StringIO()
     with redirect_stdout(output), pytest.raises(SystemExit) as ex:
         similar.Run(["--ignore-docstrings", SIMILAR1, SIMILAR2])
@@ -80,9 +81,9 @@ def test_ignore_docsrings():
         output.getvalue().strip()
         == (
             """
-8 similar lines in 2 files
-==%s:6
-==%s:6
+5 similar lines in 2 files
+==%s:[7:15]
+==%s:[7:15]
    seven
    eight
    nine
@@ -93,14 +94,14 @@ def test_ignore_docsrings():
    fourteen
 
 5 similar lines in 2 files
-==%s:0
-==%s:0
+==%s:[0:5]
+==%s:[0:5]
    import one
    from two import two
    three
    four
    five
-TOTAL lines=60 duplicates=13 percent=21.67
+TOTAL lines=62 duplicates=10 percent=16.13
 """
             % ((SIMILAR1, SIMILAR2) * 2)
         ).strip()
@@ -115,7 +116,7 @@ def test_ignore_imports():
     assert (
         output.getvalue().strip()
         == """
-TOTAL lines=60 duplicates=0 percent=0.00
+TOTAL lines=62 duplicates=0 percent=0.00
 """.strip()
     )
 
@@ -130,8 +131,8 @@ def test_multiline_imports():
         == (
             """
 8 similar lines in 2 files
-==%s:0
-==%s:0
+==%s:[0:8]
+==%s:[0:8]
    from foo import (
      bar,
      baz,
@@ -169,9 +170,9 @@ def test_ignore_signatures_fail():
         output.getvalue().strip()
         == (
             '''
-10 similar lines in 2 files
-==%s:1
-==%s:8
+9 similar lines in 2 files
+==%s:[1:11]
+==%s:[8:18]
        arg1: int = 3,
        arg2: Class1 = val1,
        arg3: Class2 = func3(val2),
@@ -182,7 +183,7 @@ def test_ignore_signatures_fail():
 
    def example():
        """Valid function definition with docstring only."""
-TOTAL lines=29 duplicates=10 percent=34.48
+TOTAL lines=29 duplicates=9 percent=31.03
 '''
             % (SIMILAR5, SIMILAR6)
         ).strip()
@@ -212,8 +213,8 @@ def test_ignore_signatures_empty_functions_fail():
         == (
             '''
 6 similar lines in 2 files
-==%s:1
-==%s:1
+==%s:[1:7]
+==%s:[1:7]
        arg1: int = 1,
        arg2: str = "2",
        arg3: int = 3,
@@ -258,14 +259,14 @@ def test_ignore_nothing():
         == (
             """
 5 similar lines in 2 files
-==%s:0
-==%s:0
+==%s:[0:5]
+==%s:[0:5]
    import one
    from two import two
    three
    four
    five
-TOTAL lines=60 duplicates=5 percent=8.33
+TOTAL lines=62 duplicates=5 percent=8.06
 """
             % (SIMILAR1, SIMILAR2)
         ).strip()
@@ -282,8 +283,8 @@ def test_lines_without_meaningful_content_do_not_trigger_similarity():
         == (
             """
 14 similar lines in 2 files
-==%s:11
-==%s:11
+==%s:[11:25]
+==%s:[11:25]
    b = (
        (
            [
@@ -340,12 +341,6 @@ def test_get_map_data():
     )
     expected_linelists = (
         (
-            "",
-            "",
-            "",
-            "",
-            "",
-            "",
             "def adipiscing(elit):",
             'etiam = "id"',
             'dictum = "purus,"',
@@ -354,26 +349,21 @@ def test_get_map_data():
             'nec = "ornare"',
             'tortor = "sit"',
             "return etiam, dictum, vitae, neque, nec, tortor",
-            "",
-            "",
             "class Amet:",
             "def similar_function_3_lines(self, tellus):",
             "agittis = 10",
             "tellus *= 300",
             "return agittis, tellus",
-            "",
             "def lorem(self, ipsum):",
             'dolor = "sit"',
             'amet = "consectetur"',
             "return (lorem, dolor, amet)",
-            "",
             "def similar_function_5_lines(self, similar):",
             "some_var = 10",
             "someother_var *= 300",
             'fusce = "sit"',
             'amet = "tortor"',
             "return some_var, someother_var, fusce, amet",
-            "",
             'def __init__(self, moleskie, lectus="Mauris", ac="pellentesque"):',
             'metus = "ut"',
             'lobortis = "urna."',
@@ -385,7 +375,6 @@ def test_get_map_data():
             'iaculis = "dolor"',
             'facilisis = "ultrices"',
             'vitae = "ut."',
-            "",
             "return (",
             "metus,",
             "lobortis,",
@@ -398,28 +387,18 @@ def test_get_map_data():
             "facilisis,",
             "vitae,",
             ")",
-            "",
             "def similar_function_3_lines(self, tellus):",
             "agittis = 10",
             "tellus *= 300",
             "return agittis, tellus",
         ),
         (
-            "",
-            "",
-            "",
-            "",
-            "",
-            "",
-            "",
             "class Nulla:",
             'tortor = "ultrices quis porta in"',
             'sagittis = "ut tellus"',
-            "",
             "def pulvinar(self, blandit, metus):",
             "egestas = [mauris for mauris in zip(blandit, metus)]",
             "neque = (egestas, blandit)",
-            "",
             "def similar_function_5_lines(self, similar):",
             "some_var = 10",
             "someother_var *= 300",
@@ -427,14 +406,10 @@ def test_get_map_data():
             'amet = "tortor"',
             'iaculis = "dolor"',
             "return some_var, someother_var, fusce, amet, iaculis, iaculis",
-            "",
-            "",
             "def tortor(self):",
             "ultrices = 2",
             'quis = ultricies * "porta"',
             "return ultricies, quis",
-            "",
-            "",
             "class Commodo:",
             "def similar_function_3_lines(self, tellus):",
             "agittis = 10",
@@ -460,5 +435,5 @@ def test_get_map_data():
     ):
         assert source_fname == lineset_obj.name
         # There doesn't seem to be a faster way of doing this, yet.
-        lines = (line for idx, line in lineset_obj.enumerate_stripped())
+        lines = (linespec.text for linespec in lineset_obj.stripped_lines)
         assert tuple(expected_lines) == tuple(lines)
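
The updated TOTAL lines in the expected outputs above are consistent with a simple ratio of duplicated lines to total lines, formatted with two decimals. The helper below is a hypothetical sketch for checking that arithmetic, not a function taken from pylint.

```python
def duplication_percent(duplicates: int, total_lines: int) -> str:
    """Format the share of duplicated lines as a percentage with two decimals.

    Hypothetical helper for illustration only.
    """
    if total_lines == 0:
        return "0.00"
    return f"{duplicates * 100.0 / total_lines:.2f}"


# Values taken from the expected outputs in the diff above.
print(duplication_percent(10, 62))  # 16.13
print(duplication_percent(5, 62))   # 8.06
print(duplication_percent(9, 29))   # 31.03
```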