docs/src/tutorial/profiling_tutorial.rst


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386

.. highlight:: cython

.. _profiling:

*********
Profiling
*********

.. include::
    ../two-syntax-variants-used

This part describes the profiling abilities of Cython. If you are familiar
with profiling pure Python code, you can only read the first section
(:ref:`profiling_basics`). If you are not familiar with Python profiling you
should also read the tutorial (:ref:`profiling_tutorial`) which takes you
through a complete example step by step.

.. _profiling_basics:

Cython Profiling Basics
=======================

Profiling in Cython is controlled by a compiler directive.
It can be set either for an entire file or on a per function basis
via a Cython decorator.

Enabling profiling for a complete source file
---------------------------------------------

Profiling is enabled for a complete source file via a global directive to the
Cython compiler at the top of a file::

   # cython: profile=True

Note that profiling gives a slight overhead to each function call therefore making
your program a little slower (or a lot, if you call some small functions very
often).

Once enabled, your Cython code will behave just like Python code when called
from the cProfile module. This means you can just profile your Cython code
together with your Python code using the same tools as for Python code alone.

Disabling profiling function wise
---------------------------------

If your profiling is messed up because of the call overhead to some small
functions that you rather do not want to see in your profile - either because
you plan to inline them anyway or because you are sure that you can't make them
any faster - you can use a special decorator to disable profiling for one
function only (regardless of whether it is globally enabled or not):

.. tabs::

    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/profiling_tutorial/often_called.py

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/profiling_tutorial/often_called.pyx

Enabling line tracing
---------------------

To get more detailed trace information (for tools that can make use of it),
you can enable line tracing::

   # cython: linetrace=True

This will also enable profiling support, so the above ``profile=True`` option
is not needed.  Line tracing is needed for coverage analysis, for example.

Note that even if line tracing is enabled via the compiler directive, it is
not used by default.  As the runtime slowdown can be substantial, it must
additionally be compiled in by the C compiler by setting the C macro definition
``CYTHON_TRACE=1``.  To include nogil functions in the trace, set
``CYTHON_TRACE_NOGIL=1`` (which implies ``CYTHON_TRACE=1``).  C macros can be
defined either in the extension definition of the ``setup.py`` script or by
setting the respective distutils options in the source file with the following
file header comment (if ``cythonize()`` is used for compilation)::

   # distutils: define_macros=CYTHON_TRACE_NOGIL=1


Enabling coverage analysis
--------------------------

Since Cython 0.23, line tracing (see above) also enables support for coverage
reporting with the `coverage.py <https://coverage.readthedocs.io/>`_ tool. To
make the coverage analysis understand Cython modules, you also need to enable
Cython's coverage plugin in your ``.coveragerc`` file as follows:

.. code-block:: ini

   [run]
   plugins = Cython.Coverage

With this plugin, your Cython source files should show up normally in the
coverage reports.

To include the coverage report in the Cython annotated HTML file, you need
to first run the coverage.py tool to generate an XML result file.  Pass
this file into the ``cython`` command as follows:

.. code-block:: bash

   $ cython  --annotate-coverage coverage.xml  package/mymodule.pyx

This will recompile the Cython module and generate one HTML output
file next to each Cython source file it processes, containing colour
markers for lines that were contained in the coverage report.


.. _profiling_tutorial:

Profiling Tutorial
==================

This will be a complete tutorial, start to finish, of profiling Python code,
turning it into Cython code and keep profiling until it is fast enough.

As a toy example, we would like to evaluate the summation of the reciprocals of
squares up to a certain integer :math:`n` for evaluating :math:`\pi`. The
relation we want to use has been proven by Euler in 1735 and is known as the
`Basel problem <https://en.wikipedia.org/wiki/Basel_problem>`_.


.. math::
   \pi^2 = 6 \sum_{k=1}^{\infty} \frac{1}{k^2} =
   6 \lim_{k \to \infty} \big( \frac{1}{1^2} +
         \frac{1}{2^2} + \dots + \frac{1}{k^2}  \big) \approx
   6 \big( \frac{1}{1^2} + \frac{1}{2^2} + \dots + \frac{1}{n^2}  \big)

A simple Python code for evaluating the truncated sum looks like this:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi.py
    :caption: calc_pi.py

On my box, this needs approximately 4 seconds to run the function with the
default n. The higher we choose n, the better will be the approximation for
:math:`\pi`. An experienced Python programmer will already see plenty of
places to optimize this code. But remember the golden rule of optimization:
Never optimize without having profiled. Let me repeat this: **Never** optimize
without having profiled your code. Your thoughts about which part of your
code takes too much time are wrong. At least, mine are always wrong. So let's
write a short script to profile our code:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/profile.py
    :caption: profile.py

Running this on my box gives the following output:

.. code-block:: none

   Sat Nov  7 17:40:54 2009    Profile.prof

            10000004 function calls in 6.211 CPU seconds

      Ordered by: internal time

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    3.243    3.243    6.211    6.211 calc_pi.py:4(approx_pi)
    10000000    2.526    0.000    2.526    0.000 calc_pi.py:1(recip_square)
           1    0.442    0.442    0.442    0.442 {range}
           1    0.000    0.000    6.211    6.211 <string>:1(<module>)
           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

This contains the information that the code runs in 6.2 CPU seconds. Note that
the code got slower by 2 seconds because it ran inside the cProfile module. The
table contains the real valuable information.  You might want to check the
Python `profiling documentation <https://docs.python.org/library/profile.html>`_
for the nitty gritty details. The most important columns here are totime (total
time spent in this function **not** counting functions that were called by this
function) and cumtime (total time spent in this function **also** counting the
functions called by this function). Looking at the tottime column, we see that
approximately half the time is spent in ``approx_pi()`` and the other half is spent
in ``recip_square()``. Also half a second is spent in range ... of course we should
have used xrange for such a big iteration. And in fact, just changing range to
xrange makes the code run in 5.8 seconds.

We could optimize a lot in the pure Python version, but since we are interested
in Cython, let's move forward and bring this module to Cython. We would do this
anyway at some time to get the loop run faster. Here is our first Cython version:

.. tabs::

    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_2.py
            :caption: calc_pi.py

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_2.pyx
            :caption: calc_pi.pyx

Note the first line: We have to tell Cython that profiling should be enabled.
This makes the Cython code slightly slower, but without this we would not get
meaningful output from the cProfile module. The rest of the code is mostly
unchanged, I only typed some variables which will likely speed things up a bit.

We also need to modify our profiling script to import the Cython module directly.
Here is the complete version adding the import of the :ref:`Pyximport<pyximport>` module:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/profile_2.py
    :caption: profile.py

We only added two lines, the rest stays completely the same. Alternatively, we could also
manually compile our code into an extension; we wouldn't need to change the
profile script then at all. The script now outputs the following:

.. tabs::

    .. group-tab:: Pure Python

        .. code-block:: none

           Sat Nov  7 18:02:33 2009    Profile.prof

                    10000004 function calls in 4.406 CPU seconds

              Ordered by: internal time

              ncalls  tottime  percall  cumtime  percall filename:lineno(function)
                   1    3.305    3.305    4.406    4.406 calc_pi.py:6(approx_pi)
            10000000    1.101    0.000    1.101    0.000 calc_pi.py:3(recip_square)
                   1    0.000    0.000    4.406    4.406 {calc_pi.approx_pi}
                   1    0.000    0.000    4.406    4.406 <string>:1(<module>)
                   1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

    .. group-tab:: Cython

        .. code-block:: none

           Sat Nov  7 18:02:33 2009    Profile.prof

                    10000004 function calls in 4.406 CPU seconds

              Ordered by: internal time

              ncalls  tottime  percall  cumtime  percall filename:lineno(function)
                   1    3.305    3.305    4.406    4.406 calc_pi.pyx:6(approx_pi)
            10000000    1.101    0.000    1.101    0.000 calc_pi.pyx:3(recip_square)
                   1    0.000    0.000    4.406    4.406 {calc_pi.approx_pi}
                   1    0.000    0.000    4.406    4.406 <string>:1(<module>)
                   1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

We gained 1.8 seconds.  Not too shabby.  Comparing the output to the previous, we
see that the ``recip_square()`` function got faster while the ``approx_pi()``
function has not changed a lot.  Let's concentrate on the ``recip_square()`` function
a bit more.  First, note that this function is not to be called from code outside
of our module; so it would be wise to turn it into a cdef to reduce call overhead.
We should also get rid of the power operator: it is turned into a ``pow(i, 2)`` function
call by Cython, but we could instead just write ``i * i`` which could be faster.  The
whole function is also a good candidate for inlining.  Let's look at the
necessary changes for these ideas:

.. tabs::

    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_3.py
            :caption: calc_pi.py

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_3.pyx
            :caption: calc_pi.pyx

Note that the ``except``/``@exceptval`` declaration is needed in the signature of ``recip_square()``
in order to propagate division by zero errors.

Now running the profile script yields:

.. tabs::

    .. group-tab:: Pure Python

        .. code-block:: none

           Sat Nov  7 18:10:11 2009    Profile.prof

                    10000004 function calls in 2.622 CPU seconds

              Ordered by: internal time

              ncalls  tottime  percall  cumtime  percall filename:lineno(function)
                   1    1.782    1.782    2.622    2.622 calc_pi.py:9(approx_pi)
            10000000    0.840    0.000    0.840    0.000 calc_pi.py:6(recip_square)
                   1    0.000    0.000    2.622    2.622 {calc_pi.approx_pi}
                   1    0.000    0.000    2.622    2.622 <string>:1(<module>)
                   1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

    .. group-tab:: Cython

        .. code-block:: none

           Sat Nov  7 18:10:11 2009    Profile.prof

                    10000004 function calls in 2.622 CPU seconds

              Ordered by: internal time

              ncalls  tottime  percall  cumtime  percall filename:lineno(function)
                   1    1.782    1.782    2.622    2.622 calc_pi.pyx:9(approx_pi)
            10000000    0.840    0.000    0.840    0.000 calc_pi.pyx:6(recip_square)
                   1    0.000    0.000    2.622    2.622 {calc_pi.approx_pi}
                   1    0.000    0.000    2.622    2.622 <string>:1(<module>)
                   1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

That bought us another 1.8 seconds. Not the dramatic change we could have
expected. And why is ``recip_square()`` still in this table; it is supposed to be
inlined, isn't it?  The reason for this is that Cython still generates profiling code
even if the function call is eliminated. Let's tell it to not
profile ``recip_square()`` any more; we couldn't get the function to be much faster anyway:

.. tabs::

    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_4.py
            :caption: calc_pi.py

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_4.pyx
            :caption: calc_pi.pyx


Running this shows an interesting result:

.. tabs::

    .. group-tab:: Pure Python

        .. code-block:: none

           Sat Nov  7 18:15:02 2009    Profile.prof

                    4 function calls in 0.089 CPU seconds

              Ordered by: internal time

              ncalls  tottime  percall  cumtime  percall filename:lineno(function)
                   1    0.089    0.089    0.089    0.089 calc_pi.py:12(approx_pi)
                   1    0.000    0.000    0.089    0.089 {calc_pi.approx_pi}
                   1    0.000    0.000    0.089    0.089 <string>:1(<module>)
                   1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

    .. group-tab:: Cython

        .. code-block:: none

           Sat Nov  7 18:15:02 2009    Profile.prof

                    4 function calls in 0.089 CPU seconds

              Ordered by: internal time

              ncalls  tottime  percall  cumtime  percall filename:lineno(function)
                   1    0.089    0.089    0.089    0.089 calc_pi.pyx:12(approx_pi)
                   1    0.000    0.000    0.089    0.089 {calc_pi.approx_pi}
                   1    0.000    0.000    0.089    0.089 <string>:1(<module>)
                   1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

First note the tremendous speed gain: this version only takes 1/50 of the time
of our first Cython version.  Also note that ``recip_square()`` has vanished from the
table like we wanted.  But the most peculiar and important change is that
``approx_pi()`` also got much faster.  This is a problem with all profiling: calling a
function in a profile run adds a certain overhead to the function call.  This
overhead is **not** added to the time spent in the called function, but to the
time spent in the **calling** function.  In this example, ``approx_pi()`` didn't need 2.622
seconds in the last run; but it called ``recip_square()`` 10000000 times, each time taking a
little to set up profiling for it.  This adds up to the massive time loss of
around 2.6 seconds.  Having disabled profiling for the often called function now
reveals realistic timings for ``approx_pi()``; we could continue optimizing it now if
needed.

This concludes this profiling tutorial. There is still some room for
improvement in this code. We could try to replace the power operator in
``approx_pi()`` with a call to sqrt from the C stdlib; but this is not necessarily
faster than calling ``pow(x, 0.5)``.

Even so, the result we achieved here is quite satisfactory: we came up with a
solution that is much faster then our original Python version while retaining
functionality and readability.