Diffstat (limited to 'docs/src/tutorial')
23 files changed, 1838 insertions, 398 deletions
diff --git a/docs/src/tutorial/annotation_typing_table.csv b/docs/src/tutorial/annotation_typing_table.csv new file mode 100644 index 000000000..43c48b1ab --- /dev/null +++ b/docs/src/tutorial/annotation_typing_table.csv @@ -0,0 +1,9 @@ +Feature ,Cython 0.29 ,Cython 3.0 +``int``,Any Python object,Exact Python ``int`` (``language_level=3`` only) +``float``,,C ``double`` +"Builtin type e.g. ``dict``, ``list`` ",,"Exact type (no subclasses), not ``None``" +Extension type defined in Cython ,,"Specified type or a subclass, not ``None``" +"``cython.int``, ``cython.long``, etc. ",,Equivalent C numeric type +``typing.Optional[any_type]``,Not supported,"Specified type (which must be a Python object), allows ``None``" +``typing.List[any_type]`` (and similar) ,Not supported,"Exact ``list``, with the element type currently ignored" +``typing.ClassVar[...]`` ,Not supported,Python-object class variable (when used in a class definition) diff --git a/docs/src/tutorial/appendix.rst b/docs/src/tutorial/appendix.rst index b0ab0426e..82f225bbf 100644 --- a/docs/src/tutorial/appendix.rst +++ b/docs/src/tutorial/appendix.rst @@ -2,7 +2,7 @@ Appendix: Installing MinGW on Windows ===================================== 1. Download the MinGW installer from - http://www.mingw.org/wiki/HOWTO_Install_the_MinGW_GCC_Compiler_Suite. + https://www.mingw.org/wiki/HOWTO_Install_the_MinGW_GCC_Compiler_Suite. (As of this writing, the download link is a bit difficult to find; it's under "About" in the menu on the left-hand side). You want the file @@ -28,4 +28,65 @@ procedure. Any contributions towards making the Windows install process smoother is welcomed; it is an unfortunate fact that none of the regular Cython developers have convenient access to Windows. +Python 3.8+ +----------- + +Since Python 3.8, the search paths of DLL dependencies have been reset.
+(`changelog <https://docs.python.org/3/whatsnew/3.8.html#bpo-36085-whatsnew>`_) + +Only the system paths and the directory containing the DLL or PYD file +are searched for load-time dependencies. +Instead, a new function `os.add_dll_directory() <https://docs.python.org/3.8/library/os.html#os.add_dll_directory>`_ +was added to supply additional search paths. But such a runtime update is not applicable in all situations. + +Unlike MSVC, MinGW has its own standard libraries such as ``libstdc++-6.dll``, +which are not placed in the system path (such as ``C:\Windows\System32``). +For a C++ example, you can check the dependencies with the MSVC tool ``dumpbin``:: + + > dumpbin /dependents my_gnu_extension.cp38-win_amd64.pyd + ... + Dump of file my_gnu_extension.cp38-win_amd64.pyd + + File Type: DLL + + Image has the following dependencies: + + python38.dll + KERNEL32.dll + msvcrt.dll + libgcc_s_seh-1.dll + libstdc++-6.dll + ... + +These standard libraries can be embedded via static linking, by adding the following options to the linker:: + + -static-libgcc -static-libstdc++ -Wl,-Bstatic,--whole-archive -lwinpthread -Wl,--no-whole-archive + +In ``setup.py``, a cross-platform config can be added by +extending the ``build_ext`` class:: + + from setuptools import setup + from setuptools.command.build_ext import build_ext + + link_args = ['-static-libgcc', + '-static-libstdc++', + '-Wl,-Bstatic,--whole-archive', + '-lwinpthread', + '-Wl,--no-whole-archive'] + + ... # Add extensions + + class Build(build_ext): + def build_extensions(self): + if self.compiler.compiler_type == 'mingw32': + for e in self.extensions: + e.extra_link_args = link_args + super(Build, self).build_extensions() + + setup( + ... + cmdclass={'build_ext': Build}, + ... + ) + ..
[WinInst] https://github.com/cython/cython/wiki/CythonExtensionsOnWindows diff --git a/docs/src/tutorial/array.rst b/docs/src/tutorial/array.rst index 4fb4e843a..fb255cf26 100644 --- a/docs/src/tutorial/array.rst +++ b/docs/src/tutorial/array.rst @@ -4,6 +4,9 @@ Working with Python arrays ========================== +.. include:: + ../two-syntax-variants-used + Python has a builtin array module supporting dynamic 1-dimensional arrays of primitive types. It is possible to access the underlying C array of a Python array from within Cython. At the same time they are ordinary Python objects @@ -18,7 +21,16 @@ module is built into both Python and Cython. Safe usage with memory views ---------------------------- -.. literalinclude:: ../../examples/tutorial/array/safe_usage.pyx + +.. tabs:: + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/array/safe_usage.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/array/safe_usage.pyx + NB: the import brings the regular Python array object into the namespace while the cimport adds functions accessible from Cython. @@ -32,7 +44,15 @@ memory view, there will be a slight overhead to construct the memory view. However, from that point on the variable can be passed to other functions without overhead, so long as it is typed: -.. literalinclude:: ../../examples/tutorial/array/overhead.pyx + +.. tabs:: + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/array/overhead.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/array/overhead.pyx Zero-overhead, unsafe access to raw C pointer @@ -42,7 +62,16 @@ functions, it is possible to access the underlying contiguous array as a pointer. There is no type or bounds checking, so be careful to use the right type and signedness. -.. literalinclude:: ../../examples/tutorial/array/unsafe_usage.pyx + +.. tabs:: + .. group-tab:: Pure Python + + .. 
literalinclude:: ../../examples/tutorial/array/unsafe_usage.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/array/unsafe_usage.pyx + Note that any length-changing operation on the array object may invalidate the pointer. @@ -55,13 +84,30 @@ it is possible to create a new array with the same type as a template, and preallocate a given number of elements. The array is initialized to zero when requested. -.. literalinclude:: ../../examples/tutorial/array/clone.pyx + +.. tabs:: + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/array/clone.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/array/clone.pyx + An array can also be extended and resized; this avoids repeated memory reallocation which would occur if elements would be appended or removed one by one. -.. literalinclude:: ../../examples/tutorial/array/resize.pyx + +.. tabs:: + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/array/resize.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/array/resize.pyx API reference @@ -94,48 +140,142 @@ e.g., ``myarray.data.as_ints``. Functions ~~~~~~~~~ -The following functions are available to Cython from the array module:: +The following functions are available to Cython from the array module + +.. tabs:: + .. group-tab:: Pure Python - int resize(array self, Py_ssize_t n) except -1 + .. code-block:: python + + @cython.cfunc + @cython.exceptval(-1) + def resize(self: array.array, n: cython.Py_ssize_t) -> cython.int + + .. group-tab:: Cython + + .. code-block:: cython + + cdef int resize(array.array self, Py_ssize_t n) except -1 Fast resize / realloc. Not suitable for repeated, small increments; resizes underlying array to exactly the requested amount. -:: +---- + +.. tabs:: + .. group-tab:: Pure Python + + .. 
code-block:: python + + @cython.cfunc + @cython.exceptval(-1) + def resize_smart(self: array.array, n: cython.Py_ssize_t) -> cython.int - int resize_smart(array self, Py_ssize_t n) except -1 + .. group-tab:: Cython + + .. code-block:: cython + + cdef int resize_smart(array.array self, Py_ssize_t n) except -1 Efficient for small increments; uses growth pattern that delivers amortized linear-time appends. -:: +---- + +.. tabs:: + .. group-tab:: Pure Python + + .. code-block:: python + + @cython.cfunc + @cython.inline + def clone(template: array.array, length: cython.Py_ssize_t, zero: cython.bint) -> array.array + + .. group-tab:: Cython + + .. code-block:: cython + + cdef inline array.array clone(array.array template, Py_ssize_t length, bint zero) - cdef inline array clone(array template, Py_ssize_t length, bint zero) Fast creation of a new array, given a template array. Type will be same as ``template``. If zero is ``True``, new array will be initialized with zeroes. -:: +---- + +.. tabs:: + .. group-tab:: Pure Python + + .. code-block:: python + + @cython.cfunc + @cython.inline + def copy(self: array.array) -> array.array - cdef inline array copy(array self) + .. group-tab:: Cython + + .. code-block:: cython + + cdef inline array.array copy(array.array self) Make a copy of an array. -:: +---- + +.. tabs:: + .. group-tab:: Pure Python - cdef inline int extend_buffer(array self, char* stuff, Py_ssize_t n) except -1 + .. code-block:: python + + @cython.cfunc + @cython.inline + @cython.exceptval(-1) + def extend_buffer(self: array.array, stuff: cython.p_char, n: cython.Py_ssize_t) -> cython.int + + .. group-tab:: Cython + + .. code-block:: cython + + cdef inline int extend_buffer(array.array self, char* stuff, Py_ssize_t n) except -1 Efficient appending of new data of same type (e.g. of same array type) ``n``: number of elements (not number of bytes!) -:: +---- + +.. tabs:: + .. group-tab:: Pure Python + + .. 
code-block:: python + + @cython.cfunc + @cython.inline + @cython.exceptval(-1) + def extend(self: array.array, other: array.array) -> cython.int - cdef inline int extend(array self, array other) except -1 + .. group-tab:: Cython + + .. code-block:: cython + + cdef inline int extend(array.array self, array.array other) except -1 Extend array with data from another array; types must match. -:: +---- + +.. tabs:: + .. group-tab:: Pure Python + + .. code-block:: python + + @cython.cfunc + @cython.inline + def zero(self: array.array) -> cython.void + + .. group-tab:: Cython + + .. code-block:: cython - cdef inline void zero(array self) + cdef inline void zero(array.array self) Set all elements of array to zero. diff --git a/docs/src/tutorial/caveats.rst b/docs/src/tutorial/caveats.rst index 65443ae26..192313162 100644 --- a/docs/src/tutorial/caveats.rst +++ b/docs/src/tutorial/caveats.rst @@ -5,7 +5,6 @@ Since Cython mixes C and Python semantics, some things may be a bit surprising or unintuitive. Work always goes on to make Cython more natural for Python users, so this list may change in the future. - - ``10**-2 == 0``, instead of ``0.01`` like in Python. - Given two typed ``int`` variables ``a`` and ``b``, ``a % b`` has the same sign as the second argument (following Python semantics) rather than having the same sign as the first (as in C). The C behavior can be diff --git a/docs/src/tutorial/cdef_classes.rst b/docs/src/tutorial/cdef_classes.rst index a95b802a8..c3cd08ead 100644 --- a/docs/src/tutorial/cdef_classes.rst +++ b/docs/src/tutorial/cdef_classes.rst @@ -1,5 +1,9 @@ +*********************************** Extension types (aka. cdef classes) -=================================== +*********************************** + +.. 
include:: + ../two-syntax-variants-used To support object-oriented programming, Cython supports writing normal Python classes exactly as in Python: @@ -8,8 +12,8 @@ Python classes exactly as in Python: Based on what Python calls a "built-in type", however, Cython supports a second kind of class: *extension types*, sometimes referred to as -"cdef classes" due to the keywords used for their declaration. They -are somewhat restricted compared to Python classes, but are generally +"cdef classes" due to the Cython language keywords used for their declaration. +They are somewhat restricted compared to Python classes, but are generally more memory efficient and faster than generic Python classes. The main difference is that they use a C struct to store their fields and methods instead of a Python dict. This allows them to store arbitrary C types @@ -24,36 +28,68 @@ single inheritance. Normal Python classes, on the other hand, can inherit from any number of Python classes and extension types, both in Cython code and pure Python code. -So far our integration example has not been very useful as it only -integrates a single hard-coded function. In order to remedy this, -with hardly sacrificing speed, we will use a cdef class to represent a -function on floating point numbers: +.. tabs:: + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/cdef_classes/math_function_2.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/cdef_classes/math_function_2.pyx + +The ``cpdef`` command (or ``@cython.ccall`` in Python syntax) makes two versions +of the method available; one fast for use from Cython and one slower for use +from Python. -.. literalinclude:: ../../examples/tutorial/cdef_classes/math_function_2.pyx +Now we can add subclasses of the ``Function`` class that implement different +math functions in the same ``evaluate()`` method. 
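The subclassing pattern the added text describes can be sketched in plain, uncompiled Python (a conceptual illustration only, following the names of the ``sin_of_square`` example; in compiled Cython the ``evaluate`` method would be declared ``cpdef``/``@cython.ccall`` to get the fast C-level entry point as well):

```python
import math

class Function:
    # In compiled Cython this would be a cpdef/@cython.ccall method,
    # available both as a fast C call and as a normal Python method.
    def evaluate(self, x: float) -> float:
        raise RuntimeError("Not implemented.")

class SinOfSquareFunction(Function):
    # Subclasses provide different math functions behind the same
    # evaluate() method, so callers can stay generic.
    def evaluate(self, x: float) -> float:
        return math.sin(x ** 2)
```

Because ``evaluate()`` is declared in the base class, code holding a typed ``Function`` reference still dispatches to the subclass override through the fast path.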
-The directive cpdef makes two versions of the method available; one -fast for use from Cython and one slower for use from Python. Then: +Then: -.. literalinclude:: ../../examples/tutorial/cdef_classes/sin_of_square.pyx +.. tabs:: + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/cdef_classes/sin_of_square.py + :caption: sin_of_square.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/cdef_classes/sin_of_square.pyx + :caption: sin_of_square.pyx This does slightly more than providing a python wrapper for a cdef method: unlike a cdef method, a cpdef method is fully overridable by -methods and instance attributes in Python subclasses. It adds a +methods and instance attributes in Python subclasses. This adds a little calling overhead compared to a cdef method. To make the class definitions visible to other modules, and thus allow for efficient C-level usage and inheritance outside of the module that -implements them, we define them in a :file:`sin_of_square.pxd` file: +implements them, we define them in a ``.pxd`` file with the same name +as the module. Note that we are using Cython syntax here, not Python syntax. .. literalinclude:: ../../examples/tutorial/cdef_classes/sin_of_square.pxd + :caption: sin_of_square.pxd + +With this way to implement different functions as subclasses with fast, +Cython callable methods, we can now pass these ``Function`` objects into +an algorithm for numeric integration, that evaluates an arbitrary user +provided function over a value interval. Using this, we can now change our integration example: -.. literalinclude:: ../../examples/tutorial/cdef_classes/integrate.pyx +.. tabs:: + .. group-tab:: Pure Python -This is almost as fast as the previous code, however it is much more flexible -as the function to integrate can be changed. We can even pass in a new -function defined in Python-space:: + .. 
literalinclude:: ../../examples/tutorial/cdef_classes/integrate.py + :caption: integrate.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/cdef_classes/integrate.pyx + :caption: integrate.pyx + +We can even pass in a new ``Function`` defined in Python space, which overrides +the Cython implemented method of the base class:: >>> import integrate >>> class MyPolynomial(integrate.Function): @@ -63,39 +99,58 @@ function defined in Python-space:: >>> integrate(MyPolynomial(), 0, 1, 10000) -7.8335833300000077 -This is about 20 times slower, but still about 10 times faster than -the original Python-only integration code. This shows how large the -speed-ups can easily be when whole loops are moved from Python code -into a Cython module. +Since ``evaluate()`` is a Python method here, which requires Python objects +as input and output, this is several times slower than the straight C call +to the Cython method, but still faster than a plain Python variant. +This shows how large the speed-ups can easily be when whole computational +loops are moved from Python code into a Cython module. Some notes on our new implementation of ``evaluate``: - - The fast method dispatch here only works because ``evaluate`` was - declared in ``Function``. Had ``evaluate`` been introduced in - ``SinOfSquareFunction``, the code would still work, but Cython - would have used the slower Python method dispatch mechanism - instead. +- The fast method dispatch here only works because ``evaluate`` was + declared in ``Function``. Had ``evaluate`` been introduced in + ``SinOfSquareFunction``, the code would still work, but Cython + would have used the slower Python method dispatch mechanism + instead. - - In the same way, had the argument ``f`` not been typed, but only - been passed as a Python object, the slower Python dispatch would - be used. 
+- In the same way, had the argument ``f`` not been typed, but only + been passed as a Python object, the slower Python dispatch would + be used. - - Since the argument is typed, we need to check whether it is - ``None``. In Python, this would have resulted in an ``AttributeError`` - when the ``evaluate`` method was looked up, but Cython would instead - try to access the (incompatible) internal structure of ``None`` as if - it were a ``Function``, leading to a crash or data corruption. +- Since the argument is typed, we need to check whether it is + ``None``. In Python, this would have resulted in an ``AttributeError`` + when the ``evaluate`` method was looked up, but Cython would instead + try to access the (incompatible) internal structure of ``None`` as if + it were a ``Function``, leading to a crash or data corruption. There is a *compiler directive* ``nonecheck`` which turns on checks for this, at the cost of decreased speed. Here's how compiler directives are used to dynamically switch on or off ``nonecheck``: -.. literalinclude:: ../../examples/tutorial/cdef_classes/nonecheck.pyx +.. tabs:: + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/cdef_classes/nonecheck.py + :caption: nonecheck.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/cdef_classes/nonecheck.pyx + :caption: nonecheck.pyx Attributes in cdef classes behave differently from attributes in regular classes: - - All attributes must be pre-declared at compile-time - - Attributes are by default only accessible from Cython (typed access) - - Properties can be declared to expose dynamic attributes to Python-space +- All attributes must be pre-declared at compile-time +- Attributes are by default only accessible from Cython (typed access) +- Properties can be declared to expose dynamic attributes to Python-space + +.. tabs:: + .. group-tab:: Pure Python + + .. 
literalinclude:: ../../examples/tutorial/cdef_classes/wave_function.py + :caption: wave_function.py + + .. group-tab:: Cython -.. literalinclude:: ../../examples/tutorial/cdef_classes/wave_function.pyx + .. literalinclude:: ../../examples/tutorial/cdef_classes/wave_function.pyx + :caption: wave_function.pyx diff --git a/docs/src/tutorial/clibraries.rst b/docs/src/tutorial/clibraries.rst index dd91d9ef2..3542dbe8e 100644 --- a/docs/src/tutorial/clibraries.rst +++ b/docs/src/tutorial/clibraries.rst @@ -5,6 +5,9 @@ Using C libraries ****************** +.. include:: + ../two-syntax-variants-used + Apart from writing fast code, one of the main use cases of Cython is to call external C libraries from Python code. As Cython code compiles down to C code itself, it is actually trivial to call C @@ -24,7 +27,7 @@ decide to use its double ended queue implementation. To make the handling easier, however, you decide to wrap it in a Python extension type that can encapsulate all memory management. -.. [CAlg] Simon Howard, C Algorithms library, http://c-algorithms.sourceforge.net/ +.. [CAlg] Simon Howard, C Algorithms library, https://fragglet.github.io/c-algorithms/ Defining external declarations @@ -33,15 +36,17 @@ Defining external declarations You can download CAlg `here <https://codeload.github.com/fragglet/c-algorithms/zip/master>`_. The C API of the queue implementation, which is defined in the header -file ``c-algorithms/src/queue.h``, essentially looks like this: +file :file:`c-algorithms/src/queue.h`, essentially looks like this: .. literalinclude:: ../../examples/tutorial/clibraries/c-algorithms/src/queue.h :language: C + :caption: queue.h To get started, the first step is to redefine the C API in a ``.pxd`` -file, say, ``cqueue.pxd``: +file, say, :file:`cqueue.pxd`: .. 
literalinclude:: ../../examples/tutorial/clibraries/cqueue.pxd + :caption: cqueue.pxd Note how these declarations are almost identical to the header file declarations, so you can often just copy them over. However, you do @@ -100,20 +105,30 @@ Writing a wrapper class After declaring our C library's API, we can start to design the Queue class that should wrap the C queue. It will live in a file called -``queue.pyx``. [#]_ +:file:`queue.pyx`/:file:`queue.py`. [#]_ -.. [#] Note that the name of the ``.pyx`` file must be different from - the ``cqueue.pxd`` file with declarations from the C library, +.. [#] Note that the name of the ``.pyx``/``.py`` file must be different from + the :file:`cqueue.pxd` file with declarations from the C library, as both do not describe the same code. A ``.pxd`` file next to - a ``.pyx`` file with the same name defines exported - declarations for code in the ``.pyx`` file. As the - ``cqueue.pxd`` file contains declarations of a regular C - library, there must not be a ``.pyx`` file with the same name + a ``.pyx``/``.py`` file with the same name defines exported + declarations for code in the ``.pyx``/``.py`` file. As the + :file:`cqueue.pxd` file contains declarations of a regular C + library, there must not be a ``.pyx``/``.py`` file with the same name that Cython associates with it. Here is a first start for the Queue class: -.. literalinclude:: ../../examples/tutorial/clibraries/queue.pyx +.. tabs:: + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/clibraries/queue.py + :caption: queue.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/clibraries/queue.pyx + :caption: queue.pyx Note that it says ``__cinit__`` rather than ``__init__``. While ``__init__`` is available as well, it is not guaranteed to be run (for @@ -122,10 +137,11 @@ ancestor's constructor). 
Because not initializing C pointers often leads to hard crashes of the Python interpreter, Cython provides ``__cinit__`` which is *always* called immediately on construction, before CPython even considers calling ``__init__``, and which -therefore is the right place to initialise ``cdef`` fields of the new -instance. However, as ``__cinit__`` is called during object -construction, ``self`` is not fully constructed yet, and one must -avoid doing anything with ``self`` but assigning to ``cdef`` fields. +therefore is the right place to initialise static attributes +(``cdef`` fields) of the new instance. However, as ``__cinit__`` is +called during object construction, ``self`` is not fully constructed yet, +and one must avoid doing anything with ``self`` but assigning to static +attributes (``cdef`` fields). Note also that the above method takes no parameters, although subtypes may want to accept some. A no-arguments ``__cinit__()`` method is a @@ -152,7 +168,17 @@ pointer to the new queue. The Python way to get out of this is to raise a ``MemoryError`` [#]_. We can thus change the init function as follows: -.. literalinclude:: ../../examples/tutorial/clibraries/queue2.pyx +.. tabs:: + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/clibraries/queue2.py + :caption: queue.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/clibraries/queue2.pyx + :caption: queue.pyx .. [#] In the specific case of a ``MemoryError``, creating a new exception instance in order to raise it may actually fail because @@ -169,31 +195,60 @@ longer used (i.e. all references to it have been deleted). To this end, CPython provides a callback that Cython makes available as a special method ``__dealloc__()``. In our case, all we have to do is to free the C Queue, but only if we succeeded in initialising it in -the init method:: +the init method: + +.. tabs:: + + .. group-tab:: Pure Python + + .. 
code-block:: python + + def __dealloc__(self): + if self._c_queue is not cython.NULL: + cqueue.queue_free(self._c_queue) - def __dealloc__(self): - if self._c_queue is not NULL: - cqueue.queue_free(self._c_queue) + .. group-tab:: Cython + .. code-block:: cython + + def __dealloc__(self): + if self._c_queue is not NULL: + cqueue.queue_free(self._c_queue) Compiling and linking ===================== At this point, we have a working Cython module that we can test. To -compile it, we need to configure a ``setup.py`` script for distutils. -Here is the most basic script for compiling a Cython module:: +compile it, we need to configure a ``setup.py`` script for setuptools. +Here is the most basic script for compiling a Cython module + +.. tabs:: + + .. group-tab:: Pure Python + + .. code-block:: python + + from setuptools import Extension, setup + from Cython.Build import cythonize + + setup( + ext_modules = cythonize([Extension("queue", ["queue.py"])]) + ) + + .. group-tab:: Cython - from distutils.core import setup - from distutils.extension import Extension - from Cython.Build import cythonize + .. code-block:: cython - setup( - ext_modules = cythonize([Extension("queue", ["queue.pyx"])]) - ) + from setuptools import Extension, setup + from Cython.Build import cythonize + + setup( + ext_modules = cythonize([Extension("queue", ["queue.pyx"])]) + ) To build against the external C library, we need to make sure Cython finds the necessary libraries. -There are two ways to archive this. First we can tell distutils where to find +There are two ways to archive this. First we can tell setuptools where to find the c-source to compile the :file:`queue.c` implementation automatically. Alternatively, we can build and install C-Alg as system library and dynamically link it. The latter is useful if other applications also use C-Alg. @@ -202,33 +257,69 @@ if other applications also use C-Alg. 
Static Linking --------------- -To build the c-code automatically we need to include compiler directives in `queue.pyx`:: +To build the c-code automatically we need to include compiler directives in :file:`queue.pyx`/:file:`queue.py` + +.. tabs:: + + .. group-tab:: Pure Python + + .. code-block:: python + + # distutils: sources = c-algorithms/src/queue.c + # distutils: include_dirs = c-algorithms/src/ + + import cython + from cython.cimports import cqueue - # distutils: sources = c-algorithms/src/queue.c - # distutils: include_dirs = c-algorithms/src/ + @cython.cclass + class Queue: + _c_queue = cython.declare(cython.pointer(cqueue.Queue)) - cimport cqueue + def __cinit__(self): + self._c_queue = cqueue.queue_new() + if self._c_queue is cython.NULL: + raise MemoryError() - cdef class Queue: - cdef cqueue.Queue* _c_queue - def __cinit__(self): - self._c_queue = cqueue.queue_new() - if self._c_queue is NULL: - raise MemoryError() + def __dealloc__(self): + if self._c_queue is not cython.NULL: + cqueue.queue_free(self._c_queue) - def __dealloc__(self): - if self._c_queue is not NULL: - cqueue.queue_free(self._c_queue) + .. group-tab:: Cython + + .. code-block:: cython + + # distutils: sources = c-algorithms/src/queue.c + # distutils: include_dirs = c-algorithms/src/ + + + cimport cqueue + + + cdef class Queue: + cdef cqueue.Queue* _c_queue + + def __cinit__(self): + self._c_queue = cqueue.queue_new() + if self._c_queue is NULL: + raise MemoryError() + + def __dealloc__(self): + if self._c_queue is not NULL: + cqueue.queue_free(self._c_queue) The ``sources`` compiler directive gives the path of the C -files that distutils is going to compile and +files that setuptools is going to compile and link (statically) into the resulting extension module. In general all relevant header files should be found in ``include_dirs``. -Now we can build the project using:: +Now we can build the project using: + +.. 
code-block:: bash $ python setup.py build_ext -i -And test whether our build was successful:: +And test whether our build was successful: + +.. code-block:: bash $ python -c 'import queue; Q = queue.Queue()' @@ -240,14 +331,18 @@ Dynamic linking is useful, if the library we are going to wrap is already installed on the system. To perform dynamic linking we first need to build and install c-alg. -To build c-algorithms on your system:: +To build c-algorithms on your system: + +.. code-block:: bash $ cd c-algorithms $ sh autogen.sh $ ./configure $ make -to install CAlg run:: +to install CAlg run: + +.. code-block:: bash $ make install @@ -262,26 +357,53 @@ Afterwards the file :file:`/usr/local/lib/libcalg.so` should exist. In this approach we need to tell the setup script to link with an external library. To do so we need to extend the setup script to install change the extension setup from -:: +.. tabs:: + + .. group-tab:: Pure Python - ext_modules = cythonize([Extension("queue", ["queue.pyx"])]) + .. code-block:: python + + ext_modules = cythonize([Extension("queue", ["queue.py"])]) + + .. group-tab:: Cython + + .. code-block:: cython + + ext_modules = cythonize([Extension("queue", ["queue.pyx"])]) to -:: +.. tabs:: + + .. group-tab:: Pure Python + + .. code-block:: python - ext_modules = cythonize([ - Extension("queue", ["queue.pyx"], - libraries=["calg"]) - ]) + ext_modules = cythonize([ + Extension("queue", ["queue.py"], + libraries=["calg"]) + ]) -Now we should be able to build the project using:: + .. group-tab:: Cython + + .. code-block:: cython + + ext_modules = cythonize([ + Extension("queue", ["queue.pyx"], + libraries=["calg"]) + ]) + +Now we should be able to build the project using: + +.. code-block:: bash $ python setup.py build_ext -i If the `libcalg` is not installed in a 'normal' location, users can provide the required parameters externally by passing appropriate C compiler -flags, such as:: +flags, such as: + +.. 
code-block:: bash CFLAGS="-I/usr/local/otherdir/calg/include" \ LDFLAGS="-L/usr/local/otherdir/calg/lib" \ @@ -290,12 +412,16 @@ flags, such as:: Before we run the module, we also need to make sure that `libcalg` is in -the `LD_LIBRARY_PATH` environment variable, e.g. by setting:: +the `LD_LIBRARY_PATH` environment variable, e.g. by setting: + +.. code-block:: bash $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib Once we have compiled the module for the first time, we can now import -it and instantiate a new Queue:: +it and instantiate a new Queue: + +.. code-block:: bash $ export PYTHONPATH=. $ python -c 'import queue; Q = queue.Queue()' @@ -313,7 +439,7 @@ practice to look at what interfaces Python offers, e.g. in its queue, it's enough to provide the methods ``append()``, ``peek()`` and ``pop()``, and additionally an ``extend()`` method to add multiple values at once. Also, since we already know that all values will be -coming from C, it's best to provide only ``cdef`` methods for now, and +coming from C, it's best to provide only ``cdef``/``@cfunc`` methods for now, and to give them a straight C interface. In C, it is common for data structures to store data as a ``void*`` to @@ -323,28 +449,76 @@ additional memory allocations through a trick: we cast our ``int`` values to ``void*`` and vice versa, and store the value directly as the pointer value. -Here is a simple implementation for the ``append()`` method:: +Here is a simple implementation for the ``append()`` method: + +.. tabs:: + + .. group-tab:: Pure Python + + .. code-block:: python + + @cython.cfunc + def append(self, value: cython.int): + cqueue.queue_push_tail(self._c_queue, cython.cast(cython.p_void, value)) + + .. group-tab:: Cython + + .. 
code-block:: cython - cdef append(self, int value): - cqueue.queue_push_tail(self._c_queue, <void*>value) + cdef append(self, int value): + cqueue.queue_push_tail(self._c_queue, <void*>value) Again, the same error handling considerations as for the ``__cinit__()`` method apply, so that we end up with this -implementation instead:: +implementation instead: + +.. tabs:: + + .. group-tab:: Pure Python + + .. code-block:: python + + @cython.cfunc + def append(self, value: cython.int): + if not cqueue.queue_push_tail(self._c_queue, + cython.cast(cython.p_void, value)): + raise MemoryError() + + .. group-tab:: Cython + + .. code-block:: cython - cdef append(self, int value): - if not cqueue.queue_push_tail(self._c_queue, - <void*>value): - raise MemoryError() + cdef append(self, int value): + if not cqueue.queue_push_tail(self._c_queue, + <void*>value): + raise MemoryError() -Adding an ``extend()`` method should now be straight forward:: +Adding an ``extend()`` method should now be straight forward: - cdef extend(self, int* values, size_t count): - """Append all ints to the queue. - """ - cdef int value - for value in values[:count]: # Slicing pointer to limit the iteration boundaries. - self.append(value) +.. tabs:: + + .. group-tab:: Pure Python + + .. code-block:: python + + @cython.cfunc + def extend(self, values: cython.p_int, count: cython.size_t): + """Append all ints to the queue. + """ + value: cython.int + for value in values[:count]: # Slicing pointer to limit the iteration boundaries. + self.append(value) + + .. group-tab:: Cython + + .. code-block:: cython + + cdef extend(self, int* values, size_t count): + """Append all ints to the queue. + """ + cdef int value + for value in values[:count]: # Slicing pointer to limit the iteration boundaries. + self.append(value) This becomes handy when reading values from a C array, for example. 
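The int-to-``void*`` trick used above can be illustrated in plain Python with ``ctypes`` (a stand-alone sketch with hypothetical helper names, not the generated code): the integer survives the round trip as long as it fits in a pointer, and a stored ``0`` comes back looking exactly like a ``NULL`` pointer, which is the ambiguity the ``peek()``/``pop()`` discussion has to deal with.

```python
import ctypes

def int_to_voidp(value: int) -> ctypes.c_void_p:
    # mimics <void*>value: store the integer itself in a pointer-sized slot
    return ctypes.c_void_p(value)

def voidp_to_int(ptr: ctypes.c_void_p) -> int:
    # mimics <Py_ssize_t>ptr; ctypes reports a NULL pointer as None
    return ptr.value if ptr.value is not None else 0

assert voidp_to_int(int_to_voidp(42)) == 42
assert voidp_to_int(int_to_voidp(0)) == 0   # a stored 0 is indistinguishable from NULL
```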
@@ -353,13 +527,31 @@ the two methods to get the first element: ``peek()`` and ``pop()``, which provide read-only and destructive read access respectively. To avoid compiler warnings when casting ``void*`` to ``int`` directly, we use an intermediate data type that is big enough to hold a ``void*``. -Here, ``Py_ssize_t``:: +Here, ``Py_ssize_t``: + +.. tabs:: + + .. group-tab:: Pure Python + + .. code-block:: python + + @cython.cfunc + def peek(self) -> cython.int: + return cython.cast(cython.Py_ssize_t, cqueue.queue_peek_head(self._c_queue)) + + @cython.cfunc + def pop(self) -> cython.int: + return cython.cast(cython.Py_ssize_t, cqueue.queue_pop_head(self._c_queue)) + + .. group-tab:: Cython - cdef int peek(self): - return <Py_ssize_t>cqueue.queue_peek_head(self._c_queue) + .. code-block:: cython - cdef int pop(self): - return <Py_ssize_t>cqueue.queue_pop_head(self._c_queue) + cdef int peek(self): + return <Py_ssize_t>cqueue.queue_peek_head(self._c_queue) + + cdef int pop(self): + return <Py_ssize_t>cqueue.queue_pop_head(self._c_queue) Normally, in C, we risk losing data when we convert a larger integer type to a smaller integer type without checking the boundaries, and ``Py_ssize_t`` @@ -380,62 +572,88 @@ from ints, we cannot distinguish anymore if the return value was the queue was ``0``. In Cython code, we want the first case to raise an exception, whereas the second case should simply return ``0``. To deal with this, we need to special case this value, -and check if the queue really is empty or not:: +and check if the queue really is empty or not: + +.. tabs:: + + .. group-tab:: Pure Python + + .. code-block:: python + + @cython.cfunc + def peek(self) -> cython.int: + value: cython.int = cython.cast(cython.Py_ssize_t, cqueue.queue_peek_head(self._c_queue)) + if value == 0: + # this may mean that the queue is empty, or + # that it happens to contain a 0 value + if cqueue.queue_is_empty(self._c_queue): + raise IndexError("Queue is empty") + return value + + .. 
group-tab:: Cython + + .. code-block:: cython - cdef int peek(self) except? -1: - cdef int value = <Py_ssize_t>cqueue.queue_peek_head(self._c_queue) - if value == 0: - # this may mean that the queue is empty, or - # that it happens to contain a 0 value - if cqueue.queue_is_empty(self._c_queue): - raise IndexError("Queue is empty") - return value + cdef int peek(self): + cdef int value = <Py_ssize_t>cqueue.queue_peek_head(self._c_queue) + if value == 0: + # this may mean that the queue is empty, or + # that it happens to contain a 0 value + if cqueue.queue_is_empty(self._c_queue): + raise IndexError("Queue is empty") + return value Note how we have effectively created a fast path through the method in the hopefully common cases that the return value is not ``0``. Only that specific case needs an additional check if the queue is empty. -The ``except? -1`` declaration in the method signature falls into the -same category. If the function was a Python function returning a +If the ``peek`` function was a Python function returning a Python object value, CPython would simply return ``NULL`` internally instead of a Python object to indicate an exception, which would immediately be propagated by the surrounding code. The problem is that the return type is ``int`` and any ``int`` value is a valid queue item value, so there is no way to explicitly signal an error to the -calling code. In fact, without such a declaration, there is no -obvious way for Cython to know what to return on exceptions and for -calling code to even know that this method *may* exit with an -exception. +calling code. The only way calling code can deal with this situation is to call ``PyErr_Occurred()`` when returning from a function to check if an exception was raised, and if so, propagate the exception. This -obviously has a performance penalty. Cython therefore allows you to -declare which value it should implicitly return in the case of an +obviously has a performance penalty. 
Cython therefore uses a dedicated value +that it implicitly returns in the case of an exception, so that the surrounding code only needs to check for an exception when receiving this exact value. -We chose to use ``-1`` as the exception return value as we expect it -to be an unlikely value to be put into the queue. The question mark -in the ``except? -1`` declaration indicates that the return value is -ambiguous (there *may* be a ``-1`` value in the queue, after all) and -that an additional exception check using ``PyErr_Occurred()`` is -needed in calling code. Without it, Cython code that calls this -method and receives the exception return value would silently (and -sometimes incorrectly) assume that an exception has been raised. In -any case, all other return values will be passed through almost +By default, the value ``-1`` is used as the exception return value. +All other return values will be passed through almost without a penalty, thus again creating a fast path for 'normal' -values. +values. See :ref:`error_return_values` for more details. + Now that the ``peek()`` method is implemented, the ``pop()`` method also needs adaptation. Since it removes a value from the queue, however, it is not enough to test if the queue is empty *after* the -removal. Instead, we must test it on entry:: +removal. Instead, we must test it on entry: + +.. tabs:: + + .. group-tab:: Pure Python + + .. code-block:: python + + @cython.cfunc + def pop(self) -> cython.int: + if cqueue.queue_is_empty(self._c_queue): + raise IndexError("Queue is empty") + return cython.cast(cython.Py_ssize_t, cqueue.queue_pop_head(self._c_queue)) + + .. group-tab:: Cython - cdef int pop(self) except? -1: - if cqueue.queue_is_empty(self._c_queue): - raise IndexError("Queue is empty") - return <Py_ssize_t>cqueue.queue_pop_head(self._c_queue) + .. 
code-block:: cython + + cdef int pop(self): + if cqueue.queue_is_empty(self._c_queue): + raise IndexError("Queue is empty") + return <Py_ssize_t>cqueue.queue_pop_head(self._c_queue) The return value for exception propagation is declared exactly as for ``peek()``. @@ -450,7 +668,7 @@ code can use either name):: Note that this method returns either ``True`` or ``False`` as we declared the return type of the ``queue_is_empty()`` function as -``bint`` in ``cqueue.pxd``. +``bint`` in :file:`cqueue.pxd`. Testing the result @@ -464,14 +682,14 @@ you can call. C methods are not visible from Python code, and thus not callable from doctests. A quick way to provide a Python API for the class is to change the -methods from ``cdef`` to ``cpdef``. This will let Cython generate two -entry points, one that is callable from normal Python code using the -Python call semantics and Python objects as arguments, and one that is -callable from C code with fast C semantics and without requiring -intermediate argument conversion from or to Python types. Note that ``cpdef`` -methods ensure that they can be appropriately overridden by Python -methods even when they are called from Cython. This adds a tiny overhead -compared to ``cdef`` methods. +methods from ``cdef``/``@cfunc`` to ``cpdef``/``@ccall``. This will +let Cython generate two entry points, one that is callable from normal +Python code using the Python call semantics and Python objects as arguments, +and one that is callable from C code with fast C semantics and without requiring +intermediate argument conversion from or to Python types. Note that +``cpdef``/``@ccall`` methods ensure that they can be appropriately overridden +by Python methods even when they are called from Cython. This adds a tiny overhead +compared to ``cdef``/``@cfunc`` methods. Now that we have both a C-interface and a Python interface for our class, we should make sure that both interfaces are consistent. @@ -482,14 +700,24 @@ C arrays and C memory. 
Both signatures are incompatible. We will solve this issue by considering that in C, the API could also want to support other input types, e.g. arrays of ``long`` or ``char``, which is usually supported with differently named C API functions such as -``extend_ints()``, ``extend_longs()``, extend_chars()``, etc. This allows +``extend_ints()``, ``extend_longs()``, ``extend_chars()``, etc. This allows us to free the method name ``extend()`` for the duck typed Python method, which can accept arbitrary iterables. The following listing shows the complete implementation that uses -``cpdef`` methods where possible: +``cpdef``/``@ccall`` methods where possible: + +.. tabs:: + + .. group-tab:: Pure Python -.. literalinclude:: ../../examples/tutorial/clibraries/queue3.pyx + .. literalinclude:: ../../examples/tutorial/clibraries/queue3.py + :caption: queue.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/clibraries/queue3.pyx + :caption: queue.pyx Now we can test our Queue implementation using a python script, for example here :file:`test_queue.py`: @@ -540,29 +768,73 @@ C-API into the callback function. We will use this to pass our Python predicate function. First, we have to define a callback function with the expected -signature that we can pass into the C-API function:: - - cdef int evaluate_predicate(void* context, cqueue.QueueValue value): - "Callback function that can be passed as predicate_func" - try: - # recover Python function object from void* argument - func = <object>context - # call function, convert result into 0/1 for True/False - return bool(func(<int>value)) - except: - # catch any Python errors and return error indicator - return -1 +signature that we can pass into the C-API function: + +.. tabs:: + + .. group-tab:: Pure Python + + .. 
code-block:: python
+
+            @cython.cfunc
+            @cython.exceptval(check=False)
+            def evaluate_predicate(context: cython.p_void, value: cqueue.QueueValue) -> cython.int:
+                "Callback function that can be passed as predicate_func"
+                try:
+                    # recover Python function object from void* argument
+                    func = cython.cast(object, context)
+                    # call function, convert result into 0/1 for True/False
+                    return bool(func(cython.cast(int, value)))
+                except:
+                    # catch any Python errors and return error indicator
+                    return -1
+
+        .. note:: ``@cfunc`` functions in Pure Python mode are declared as ``@exceptval(-1, check=True)``
+           by default. Since ``evaluate_predicate()`` is passed to a C function as a parameter,
+           we need to turn off exception checking entirely.
+
+    .. group-tab:: Cython
+
+        .. code-block:: cython
+
+            cdef int evaluate_predicate(void* context, cqueue.QueueValue value):
+                "Callback function that can be passed as predicate_func"
+                try:
+                    # recover Python function object from void* argument
+                    func = <object>context
+                    # call function, convert result into 0/1 for True/False
+                    return bool(func(<int>value))
+                except:
+                    # catch any Python errors and return error indicator
+                    return -1

 The main idea is to pass a pointer (a.k.a. borrowed reference) to the
 function object as the user context argument.  We will call the C-API
-function as follows::
-
-    def pop_until(self, python_predicate_function):
-        result = cqueue.queue_pop_head_until(
-            self._c_queue, evaluate_predicate,
-            <void*>python_predicate_function)
-        if result == -1:
-            raise RuntimeError("an error occurred")
+function as follows:
+
+.. tabs::
+
+    .. group-tab:: Pure Python
+
+        .. code-block:: python
+
+            def pop_until(self, python_predicate_function):
+                result = cqueue.queue_pop_head_until(
+                    self._c_queue, evaluate_predicate,
+                    cython.cast(cython.p_void, python_predicate_function))
+                if result == -1:
+                    raise RuntimeError("an error occurred")
+
+    .. group-tab:: Cython
+
+        .. 
code-block:: cython + + def pop_until(self, python_predicate_function): + result = cqueue.queue_pop_head_until( + self._c_queue, evaluate_predicate, + <void*>python_predicate_function) + if result == -1: + raise RuntimeError("an error occurred") The usual pattern is to first cast the Python object reference into a :c:type:`void*` to pass it into the C-API function, and then cast diff --git a/docs/src/tutorial/cython_tutorial.rst b/docs/src/tutorial/cython_tutorial.rst index f80a8b016..e3ab46005 100644 --- a/docs/src/tutorial/cython_tutorial.rst +++ b/docs/src/tutorial/cython_tutorial.rst @@ -6,6 +6,9 @@ Basic Tutorial ************** +.. include:: + ../two-syntax-variants-used + The Basics of Cython ==================== @@ -18,7 +21,7 @@ serve for now.) The Cython compiler will convert it into C code which makes equivalent calls to the Python/C API. But Cython is much more than that, because parameters and variables can be -declared to have C data types. Code which manipulates Python values and C +declared to have C data types. Code which manipulates :term:`Python values<Python object>` and C values can be freely intermixed, with conversions occurring automatically wherever possible. Reference count maintenance and error checking of Python operations is also automatic, and the full power of Python's exception @@ -40,7 +43,7 @@ Save this code in a file named :file:`helloworld.pyx`. Now we need to create the :file:`setup.py`, which is like a python Makefile (for more information see :ref:`compilation`). Your :file:`setup.py` should look like:: - from distutils.core import setup + from setuptools import setup from Cython.Build import cythonize setup( @@ -49,7 +52,7 @@ see :ref:`compilation`). Your :file:`setup.py` should look like:: To use this to build your Cython file use the commandline options: -.. sourcecode:: text +.. 
code-block:: text $ python setup.py build_ext --inplace @@ -103,13 +106,18 @@ Now following the steps for the Hello World example we first rename the file to have a `.pyx` extension, lets say :file:`fib.pyx`, then we create the :file:`setup.py` file. Using the file created for the Hello World example, all that you need to change is the name of the Cython filename, and the resulting -module name, doing this we have: +module name, doing this we have:: + + from setuptools import setup + from Cython.Build import cythonize -.. literalinclude:: ../../examples/tutorial/cython_tutorial/setup.py + setup( + ext_modules=cythonize("fib.pyx"), + ) Build the extension with the same command used for the helloworld.pyx: -.. sourcecode:: text +.. code-block:: text $ python setup.py build_ext --inplace @@ -127,29 +135,59 @@ Primes Here's a small example showing some of what can be done. It's a routine for finding prime numbers. You tell it how many primes you want, and it returns them as a Python list. + +.. tabs:: + .. group-tab:: Pure Python -:file:`primes.pyx`: + .. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.py + :linenos: + :caption: primes.py -.. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.pyx - :linenos: + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.pyx + :linenos: + :caption: primes.pyx You'll see that it starts out just like a normal Python function definition, -except that the parameter ``nb_primes`` is declared to be of type ``int`` . This +except that the parameter ``nb_primes`` is declared to be of type ``int``. This means that the object passed will be converted to a C integer (or a ``TypeError.`` will be raised if it can't be). -Now, let's dig into the core of the function:: +Now, let's dig into the core of the function: + +.. tabs:: + .. group-tab:: Pure Python + + .. 
literalinclude:: ../../examples/tutorial/cython_tutorial/primes.py + :lines: 2,3 + :dedent: + :lineno-start: 2 - cdef int n, i, len_p - cdef int p[1000] + .. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.py + :lines: 11,12 + :dedent: + :lineno-start: 11 + + Lines 2, 3, 11 and 12 use the variable annotations + to define some local C variables. + The result is stored in the C array ``p`` during processing, + and will be copied into a Python list at the end (line 26). -Lines 2 and 3 use the ``cdef`` statement to define some local C variables. -The result is stored in the C array ``p`` during processing, -and will be copied into a Python list at the end (line 22). + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.pyx + :lines: 2,3 + :dedent: + :lineno-start: 2 + + Lines 2 and 3 use the ``cdef`` statement to define some local C variables. + The result is stored in the C array ``p`` during processing, + and will be copied into a Python list at the end (line 26). .. NOTE:: You cannot create very large arrays in this manner, because - they are allocated on the C function call stack, which is a - rather precious and scarce resource. + they are allocated on the C function call :term:`stack<Stack allocation>`, + which is a rather precious and scarce resource. To request larger arrays, or even arrays with a length only known at runtime, you can learn how to make efficient use of @@ -157,61 +195,83 @@ and will be copied into a Python list at the end (line 22). :ref:`Python arrays <array-array>` or :ref:`NumPy arrays <memoryviews>` with Cython. -:: - - if nb_primes > 1000: - nb_primes = 1000 +.. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.pyx + :lines: 5,6 + :dedent: + :lineno-start: 5 As in C, declaring a static array requires knowing the size at compile time. We make sure the user doesn't set a value above 1000 (or we would have a -segmentation fault, just like in C). 
::
+segmentation fault, just like in C).
+
+.. tabs::
+    .. group-tab:: Pure Python
+
+        .. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.py
+            :lines: 8,9
+            :dedent:
+            :lineno-start: 8

-    len_p = 0  # The number of elements in p
-    n = 2
-    while len_p < nb_primes:
+        When we run this code from Python, we have to initialize the items in the array.
+        This is most easily done by filling it with zeros (as seen on line 8-9).
+        When we compile this with Cython, on the other hand, the array will
+        behave as in C. It is allocated on the function call stack with a fixed
+        length of 1000 items that contain arbitrary data from the last time that
+        memory was used. We will then overwrite those items in our calculation.

-Lines 7-9 set up for a loop which will test candidate numbers for primeness
-until the required number of primes has been found. ::
+        .. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.py
+            :lines: 10-13
+            :dedent:
+            :lineno-start: 10

-    # Is n prime?
-    for i in p[:len_p]:
-        if n % i == 0:
-            break
+    .. group-tab:: Cython

-Lines 11-12, which try dividing a candidate by all the primes found so far,
+        .. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.pyx
+            :lines: 10-13
+            :dedent:
+            :lineno-start: 10
+
+Lines 11-13 set up a while loop which will test candidate numbers for primality
+until the required number of primes has been found.
+
+.. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.pyx
+    :lines: 14-17
+    :dedent:
+    :lineno-start: 14
+
+Lines 15-16, which try to divide a candidate by all the primes found so far,
 are of particular interest.  Because no Python objects are referred to,
 the loop is translated entirely into C code, and thus runs very fast.
-You will notice the way we iterate over the ``p`` C array. ::
+You will notice the way we iterate over the ``p`` C array.

-    for i in p[:len_p]:

+.. 
literalinclude:: ../../examples/tutorial/cython_tutorial/primes.pyx + :lines: 15 + :dedent: + :lineno-start: 15 The loop gets translated into a fast C loop and works just like iterating over a Python list or NumPy array. If you don't slice the C array with ``[:len_p]``, then Cython will loop over the 1000 elements of the array. -:: - - # If no break occurred in the loop - else: - p[len_p] = n - len_p += 1 - n += 1 +.. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.pyx + :lines: 19-23 + :dedent: + :lineno-start: 19 If no breaks occurred, it means that we found a prime, and the block of code -after the ``else`` line 16 will be executed. We add the prime found to ``p``. +after the ``else`` line 20 will be executed. We add the prime found to ``p``. If you find having an ``else`` after a for-loop strange, just know that it's a lesser known features of the Python language, and that Cython executes it at C speed for you. If the for-else syntax confuses you, see this excellent `blog post <https://shahriar.svbtle.com/pythons-else-clause-in-loops>`_. -:: - - # Let's put the result in a python list: - result_as_list = [prime for prime in p[:len_p]] - return result_as_list +.. literalinclude:: ../../examples/tutorial/cython_tutorial/primes.pyx + :lines: 25-27 + :dedent: + :lineno-start: 25 -In line 22, before returning the result, we need to copy our C array into a +In line 26, before returning the result, we need to copy our C array into a Python list, because Python can't read C arrays. Cython can automatically convert many C types from and to Python types, as described in the documentation on :ref:`type conversion <type-conversion>`, so we can use @@ -225,11 +285,20 @@ Because the variable ``result_as_list`` hasn't been explicitly declared with a t it is assumed to hold a Python object, and from the assignment, Cython also knows that the exact type is a Python list. -Finally, at line 18, a normal -Python return statement returns the result list. 
+Finally, at line 27, a normal Python return statement returns the result list. + +.. tabs:: + .. group-tab:: Pure Python + + Compiling primes.py with the Cython compiler produces an extension module + which we can try out in the interactive interpreter as follows: -Compiling primes.pyx with the Cython compiler produces an extension module -which we can try out in the interactive interpreter as follows:: + .. group-tab:: Cython + + Compiling primes.pyx with the Cython compiler produces an extension module + which we can try out in the interactive interpreter as follows: + +.. code-block:: python >>> import primes >>> primes.primes(10) @@ -238,12 +307,20 @@ which we can try out in the interactive interpreter as follows:: See, it works! And if you're curious about how much work Cython has saved you, take a look at the C code generated for this module. - Cython has a way to visualise where interaction with Python objects and Python's C-API is taking place. For this, pass the ``annotate=True`` parameter to ``cythonize()``. It produces a HTML file. Let's see: -.. figure:: htmlreport.png +.. tabs:: + .. group-tab:: Pure Python + + .. figure:: htmlreport_py.png + :scale: 90 % + + .. group-tab:: Cython + + .. figure:: htmlreport_pyx.png + :scale: 90 % If a line is white, it means that the code generated doesn't interact with Python, so will run as fast as normal C code. The darker the yellow, the more @@ -262,42 +339,64 @@ Python behavior, the language will perform division checks at runtime, just like Python does. You can deactivate those checks by using the :ref:`compiler directives<compiler-directives>`. -Now let's see if, even if we have division checks, we obtained a boost in speed. -Let's write the same program, but Python-style: +Now let's see if we get a speed increase even if there is a division check. +Let's write the same program, but in Python: .. 
literalinclude:: ../../examples/tutorial/cython_tutorial/primes_python.py + :caption: primes_python.py / primes_python_compiled.py -It is also possible to take a plain ``.py`` file and to compile it with Cython. -Let's take ``primes_python``, change the function name to ``primes_python_compiled`` and -compile it with Cython (without changing the code). We will also change the name of the -file to ``example_py_cy.py`` to differentiate it from the others. -Now the ``setup.py`` looks like this:: +It is possible to take a plain (unannotated) ``.py`` file and to compile it with Cython. +Let's create a copy of ``primes_python`` and name it ``primes_python_compiled`` +to be able to compare it to the (non-compiled) Python module. +Then we compile that file with Cython, without changing the code. +Now the ``setup.py`` looks like this: - from distutils.core import setup - from Cython.Build import cythonize +.. tabs:: + .. group-tab:: Pure Python - setup( - ext_modules=cythonize(['example.pyx', # Cython code file with primes() function - 'example_py_cy.py'], # Python code file with primes_python_compiled() function - annotate=True), # enables generation of the html annotation file - ) + .. code-block:: python + + from setuptools import setup + from Cython.Build import cythonize + + setup( + ext_modules=cythonize( + ['primes.py', # Cython code file with primes() function + 'primes_python_compiled.py'], # Python code file with primes() function + annotate=True), # enables generation of the html annotation file + ) + + .. group-tab:: Cython + + .. 
code-block:: python + + from setuptools import setup + from Cython.Build import cythonize + + setup( + ext_modules=cythonize( + ['primes.pyx', # Cython code file with primes() function + 'primes_python_compiled.py'], # Python code file with primes() function + annotate=True), # enables generation of the html annotation file + ) Now we can ensure that those two programs output the same values:: - >>> primes_python(1000) == primes(1000) + >>> import primes, primes_python, primes_python_compiled + >>> primes_python.primes(1000) == primes.primes(1000) True - >>> primes_python_compiled(1000) == primes(1000) + >>> primes_python_compiled.primes(1000) == primes.primes(1000) True It's possible to compare the speed now:: - python -m timeit -s 'from example_py import primes_python' 'primes_python(1000)' + python -m timeit -s "from primes_python import primes" "primes(1000)" 10 loops, best of 3: 23 msec per loop - python -m timeit -s 'from example_py_cy import primes_python_compiled' 'primes_python_compiled(1000)' + python -m timeit -s "from primes_python_compiled import primes" "primes(1000)" 100 loops, best of 3: 11.9 msec per loop - python -m timeit -s 'from example import primes' 'primes(1000)' + python -m timeit -s "from primes import primes" "primes(1000)" 1000 loops, best of 3: 1.65 msec per loop The cythonize version of ``primes_python`` is 2 times faster than the Python one, @@ -325,9 +424,9 @@ Primes with C++ With Cython, it is also possible to take advantage of the C++ language, notably, part of the C++ standard library is directly importable from Cython code. -Let's see what our :file:`primes.pyx` becomes when -using `vector <https://en.cppreference.com/w/cpp/container/vector>`_ from the C++ -standard library. +Let's see what our code becomes when using +`vector <https://en.cppreference.com/w/cpp/container/vector>`_ +from the C++ standard library. .. note:: @@ -338,8 +437,19 @@ standard library. how many elements you are going to put in the vector. 
For more details see `this page from cppreference <https://en.cppreference.com/w/cpp/container/vector>`_. -.. literalinclude:: ../../examples/tutorial/cython_tutorial/primes_cpp.pyx - :linenos: +.. tabs:: + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/cython_tutorial/primes_cpp.py + :linenos: + + .. include:: + ../cimport-warning + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/cython_tutorial/primes_cpp.pyx + :linenos: The first line is a compiler directive. It tells Cython to compile your code to C++. This will enable the use of C++ language features and the C++ standard library. @@ -357,4 +467,3 @@ Language Details For more about the Cython language, see :ref:`language-basics`. To dive right in to using Cython in a numerical computation context, see :ref:`memoryviews`. - diff --git a/docs/src/tutorial/embedding.rst b/docs/src/tutorial/embedding.rst new file mode 100644 index 000000000..819506cde --- /dev/null +++ b/docs/src/tutorial/embedding.rst @@ -0,0 +1,84 @@ +.. highlight:: cython + +.. _embedding: + +********************************************** +Embedding Cython modules in C/C++ applications +********************************************** + +**This is a stub documentation page. PRs very welcome.** + +Quick links: + +* `CPython docs <https://docs.python.org/3/extending/embedding.html>`_ + +* `Cython Wiki <https://github.com/cython/cython/wiki/EmbeddingCython>`_ + +* See the ``--embed`` option to the ``cython`` and ``cythonize`` frontends + for generating a C main function and the + `cython_freeze <https://github.com/cython/cython/blob/master/bin/cython_freeze>`_ + script for merging multiple extension modules into one library. 
+
+* `Embedding demo program <https://github.com/cython/cython/tree/master/Demos/embed>`_
+
+* See the documentation of the `module init function
+  <https://docs.python.org/3/extending/extending.html#the-module-s-method-table-and-initialization-function>`_
+  in CPython and `PEP 489 <https://www.python.org/dev/peps/pep-0489/>`_ regarding the module
+  initialisation mechanism in CPython 3.5 and later.
+
+
+Initialising your main module
+=============================
+
+Most importantly, DO NOT call the module init function instead of importing
+the module. This is not the right way to initialise an extension module.
+(It was never the right way, but it used to work anyway before Python 3.5;
+since then, it is wrong *and* no longer works.)
+
+For details, see the documentation of the
+`module init function <https://docs.python.org/3/extending/extending.html#the-module-s-method-table-and-initialization-function>`_
+in CPython and `PEP 489 <https://www.python.org/dev/peps/pep-0489/>`_ regarding the module
+initialisation mechanism in CPython 3.5 and later.
+
+The `PyImport_AppendInittab() <https://docs.python.org/3/c-api/import.html#c.PyImport_AppendInittab>`_
+function in CPython allows registering statically (or dynamically) linked extension
+modules for later imports. An example is given in the documentation of the module
+init function that is linked above.
+
+
+Embedding example code
+======================
+
+The following is a simple example that shows the main steps for embedding a
+Cython module (``embedded.pyx``) in Python 3.x.
+
+First, here is a Cython module that exports a C function to be called by external
+code. Note that the ``say_hello_from_python()`` function is declared as ``public``
+to export it as a linker symbol that can be used by other C files, which in this
+case is ``embedded_main.c``.
+
+.. literalinclude:: ../../examples/tutorial/embedding/embedded.pyx
+
+The C ``main()`` function of your program could look like this:
+
+.. 
literalinclude:: ../../examples/tutorial/embedding/embedded_main.c + :linenos: + :language: c + +(Adapted from the `CPython documentation +<https://docs.python.org/3/extending/extending.html#the-module-s-method-table-and-initialization-function>`_.) + +Instead of writing such a ``main()`` function yourself, you can also let +Cython generate one into your module's C file with the ``cython --embed`` +option. Or use the +`cython_freeze <https://github.com/cython/cython/blob/master/bin/cython_freeze>`_ +script to embed multiple modules. See the +`embedding demo program <https://github.com/cython/cython/tree/master/Demos/embed>`_ +for a complete example setup. + +Be aware that your application will not contain any external dependencies that +you use (including Python standard library modules) and so may not be truly portable. +If you want to generate a portable application we recommend using a specialized +tool (e.g. `PyInstaller <https://pyinstaller.org/en/stable/>`_ +or `cx_freeze <https://cx-freeze.readthedocs.io/en/latest/index.html>`_) to find and +bundle these dependencies. diff --git a/docs/src/tutorial/external.rst b/docs/src/tutorial/external.rst index b55b96505..d0c5af0a0 100644 --- a/docs/src/tutorial/external.rst +++ b/docs/src/tutorial/external.rst @@ -1,6 +1,9 @@ Calling C functions ==================== +.. include:: + ../two-syntax-variants-used + This tutorial describes shortly what you need to know in order to call C library functions from Cython code. For a longer and more comprehensive tutorial about using external C libraries, wrapping them @@ -15,7 +18,17 @@ For example, let's say you need a low-level way to parse a number from a ``char*`` value. You could use the ``atoi()`` function, as defined by the ``stdlib.h`` header file. This can be done as follows: -.. literalinclude:: ../../examples/tutorial/external/atoi.pyx +.. tabs:: + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/external/atoi.py + :caption: atoi.py + + .. 
group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/external/atoi.pyx + :caption: atoi.pyx You can find a complete list of these standard cimport files in Cython's source package @@ -28,12 +41,33 @@ Cython also has a complete set of declarations for CPython's C-API. For example, to test at C compilation time which CPython version your code is being compiled with, you can do this: -.. literalinclude:: ../../examples/tutorial/external/py_version_hex.pyx +.. tabs:: + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/external/py_version_hex.py + :caption: py_version_hex.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/external/py_version_hex.pyx + :caption: py_version_hex.pyx + +.. _libc.math: Cython also provides declarations for the C math library: -.. literalinclude:: ../../examples/tutorial/external/libc_sin.pyx +.. tabs:: + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/external/libc_sin.py + :caption: libc_sin.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/external/libc_sin.pyx + :caption: libc_sin.pyx Dynamic linking --------------- @@ -41,7 +75,7 @@ Dynamic linking The libc math library is special in that it is not linked by default on some Unix-like systems, such as Linux. In addition to cimporting the declarations, you must configure your build system to link against the -shared library ``m``. For distutils, it is enough to add it to the +shared library ``m``. For setuptools, it is enough to add it to the ``libraries`` parameter of the ``Extension()`` setup: .. literalinclude:: ../../examples/tutorial/external/setup.py @@ -81,6 +115,9 @@ This allows the C declaration to be reused in other Cython modules, while still providing an automatically generated Python wrapper in this specific module. +.. note:: External declarations must be placed in a ``.pxd`` file in Pure + Python mode.
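For quick experimentation, the ``atoi()`` example above can also be exercised uncompiled. The plain-Python fallback below is an assumption of this sketch and not part of Cython's API: it only mimics C ``atoi()`` (optional sign plus leading digits, otherwise 0), so the file runs even when Cython is not installed; when compiled, the ``cython.cimports`` import provides the real C function:

```python
# Uncompiled-friendly sketch of the atoi example.  The except-branch is a
# sketch-only stand-in mimicking C atoi(); it is NOT part of Cython.
try:
    from cython.cimports.libc.stdlib import atoi  # real C atoi() when compiled
except ImportError:
    def atoi(s):
        text = s.decode("ascii", "ignore") if isinstance(s, bytes) else s
        text = text.lstrip()
        num = ""
        if text[:1] in "+-":                    # optional leading sign
            num, text = text[:1], text[1:]
        while text[:1].isdigit():               # consume leading digits only
            num, text = num + text[0], text[1:]
        return int(num) if num.lstrip("+-") else 0   # C atoi() returns 0 on failure

def parse_charptr_to_py_int(char_string: bytes) -> int:
    assert char_string is not None, "byte string value is required!"
    return atoi(char_string)
```

Compiled as a Cython module, ``parse_charptr_to_py_int`` calls the C library function directly with no Python overhead for the parse itself.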
+ Naming parameters ----------------- @@ -101,7 +138,19 @@ You can now make it clear which of the two arguments does what in your call, thus avoiding any ambiguities and often making your code more readable: -.. literalinclude:: ../../examples/tutorial/external/keyword_args_call.pyx +.. tabs:: + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/external/keyword_args_call.py + :caption: keyword_args_call.py + + .. literalinclude:: ../../examples/tutorial/external/strstr.pxd + :caption: strstr.pxd + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/external/keyword_args_call.pyx + :caption: keyword_args_call.pyx Note that changing existing parameter names later is a backwards incompatible API modification, just as for Python code. Thus, if diff --git a/docs/src/tutorial/htmlreport.png b/docs/src/tutorial/htmlreport.png Binary files differdeleted file mode 100644 index 4fc98f3e0..000000000 --- a/docs/src/tutorial/htmlreport.png +++ /dev/null diff --git a/docs/src/tutorial/htmlreport_py.png b/docs/src/tutorial/htmlreport_py.png Binary files differnew file mode 100644 index 000000000..80a89697c --- /dev/null +++ b/docs/src/tutorial/htmlreport_py.png diff --git a/docs/src/tutorial/htmlreport_pyx.png b/docs/src/tutorial/htmlreport_pyx.png Binary files differnew file mode 100644 index 000000000..9843cb9c5 --- /dev/null +++ b/docs/src/tutorial/htmlreport_pyx.png diff --git a/docs/src/tutorial/index.rst b/docs/src/tutorial/index.rst index 14bc5d9ee..02d34fbfc 100644 --- a/docs/src/tutorial/index.rst +++ b/docs/src/tutorial/index.rst @@ -13,10 +13,11 @@ Tutorials profiling_tutorial strings memory_allocation + embedding pure numpy array + parallelization readings related_work appendix - diff --git a/docs/src/tutorial/memory_allocation.rst b/docs/src/tutorial/memory_allocation.rst index f53c1119a..bf8b29f6a 100644 --- a/docs/src/tutorial/memory_allocation.rst +++ b/docs/src/tutorial/memory_allocation.rst @@ -4,6 +4,9 @@ Memory
Allocation ***************** +.. include:: + ../two-syntax-variants-used + Dynamic memory allocation is mostly a non-issue in Python. Everything is an object, and the reference counting system and garbage collector automatically return memory to the system when it is no longer being used. @@ -19,10 +22,10 @@ In some situations, however, these objects can still incur an unacceptable amount of overhead, which can then make a case for doing manual memory management in C. -Simple C values and structs (such as a local variable ``cdef double x``) are -usually allocated on the stack and passed by value, but for larger and more +Simple C values and structs (such as a local variable ``cdef double x`` / ``x: cython.double``) are +usually :term:`allocated on the stack<Stack allocation>` and passed by value, but for larger and more complicated objects (e.g. a dynamically-sized list of doubles), the memory must -be manually requested and released. C provides the functions :c:func:`malloc`, +be :term:`manually requested and released<Dynamic allocation or Heap allocation>`. C provides the functions :c:func:`malloc`, :c:func:`realloc`, and :c:func:`free` for this purpose, which can be imported in Cython from ``libc.stdlib``. Their signatures are: @@ -34,8 +37,15 @@ in Cython from ``libc.stdlib``. Their signatures are: A very simple example of malloc usage is the following: -.. literalinclude:: ../../examples/tutorial/memory_allocation/malloc.pyx - :linenos: + +.. tabs:: + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/memory_allocation/malloc.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/memory_allocation/malloc.pyx Note that the C-API functions for allocating memory on the Python heap are generally preferred over the low-level C functions above as the @@ -45,9 +55,20 @@ smaller memory blocks, which speeds up their allocation by avoiding costly operating system calls.
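The allocate/use/always-release idiom from the malloc example can be illustrated in a form that runs interpreted: the sketch below reaches CPython's ``PyMem`` allocator through ``ctypes``. This is purely a stand-in for illustration (it assumes CPython, and the function name and payload are hypothetical); real Cython code would cimport ``PyMem_Malloc``/``PyMem_Free`` or ``malloc``/``free`` directly:

```python
import ctypes

# CPython's PyMem allocator, reached via ctypes only for this interpreted
# illustration (assumption: running on CPython).
PyMem_Malloc = ctypes.pythonapi.PyMem_Malloc
PyMem_Malloc.restype = ctypes.c_void_p
PyMem_Malloc.argtypes = [ctypes.c_size_t]
PyMem_Free = ctypes.pythonapi.PyMem_Free
PyMem_Free.restype = None
PyMem_Free.argtypes = [ctypes.c_void_p]

def sum_of_doubles(number: int) -> float:
    """Allocate `number` C doubles, fill and sum them, always freeing."""
    data = PyMem_Malloc(number * ctypes.sizeof(ctypes.c_double))
    if not data:
        raise MemoryError()
    try:
        arr = ctypes.cast(data, ctypes.POINTER(ctypes.c_double))
        for i in range(number):
            arr[i] = float(i)   # stand-in payload; real code computes something
        return sum(arr[i] for i in range(number))
    finally:
        PyMem_Free(data)        # runs even on error, mirroring try/finally usage
```

The ``try``/``finally`` is the important part of the pattern: whatever happens while the raw block is in use, the memory is returned exactly once.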
The C-API functions can be found in the ``cpython.mem`` standard -declarations file:: +declarations file: + +.. tabs:: + .. group-tab:: Pure Python + + .. code-block:: python - from cpython.mem cimport PyMem_Malloc, PyMem_Realloc, PyMem_Free + from cython.cimports.cpython.mem import PyMem_Malloc, PyMem_Realloc, PyMem_Free + + .. group-tab:: Cython + + .. code-block:: cython + + from cpython.mem cimport PyMem_Malloc, PyMem_Realloc, PyMem_Free Their interface and usage are identical to those of the corresponding low-level C functions. @@ -64,4 +85,11 @@ If a chunk of memory needs a larger lifetime than can be managed by a to a Python object to leverage the Python runtime's memory management, e.g.: -.. literalinclude:: ../../examples/tutorial/memory_allocation/some_memory.pyx +.. tabs:: + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/memory_allocation/some_memory.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/memory_allocation/some_memory.pyx diff --git a/docs/src/tutorial/numpy.rst b/docs/src/tutorial/numpy.rst index 5fa205976..0a5535da6 100644 --- a/docs/src/tutorial/numpy.rst +++ b/docs/src/tutorial/numpy.rst @@ -28,7 +28,7 @@ systems, it will be :file:`yourmod.pyd`). We run a Python session to test both the Python version (imported from the ``.py`` file) and the compiled Cython module. -.. sourcecode:: ipython +.. code-block:: ipythonconsole In [1]: import numpy as np In [2]: import convolve_py @@ -69,7 +69,7 @@ compatibility. Consider this code (*read the comments!*): After building this and continuing my (very informal) benchmarks, I get: -.. sourcecode:: ipython +.. code-block:: ipythonconsole In [21]: import convolve2 In [22]: %timeit -n2 -r3 convolve2.naive_convolve(f, g) @@ -97,7 +97,7 @@ These are the needed changes:: Usage: -.. sourcecode:: ipython +..
code-block:: ipythonconsole In [18]: import convolve3 In [19]: %timeit -n3 -r100 convolve3.naive_convolve(f, g) @@ -143,7 +143,7 @@ if we try to actually use negative indices with this disabled. The function call overhead now starts to play a role, so we compare the latter two examples with larger N: -.. sourcecode:: ipython +.. code-block:: ipythonconsole In [11]: %timeit -n3 -r100 convolve4.naive_convolve(f, g) 3 loops, best of 100: 5.97 ms per loop @@ -170,6 +170,16 @@ function call.) The actual rules are a bit more complicated but the main message is clear: Do not use typed objects without knowing that they are not set to None. +What typing does not do +======================= + +The main purpose of typing things as :obj:`ndarray` is to allow efficient +indexing of single elements, and to speed up access to a small number of +attributes such as ``.shape``. Typing does not allow Cython to speed +up mathematical operations on the whole array (for example, adding two arrays +together). Typing does not allow Cython to speed up calls to NumPy global +functions or to methods of the array. + More generic code ================== diff --git a/docs/src/tutorial/parallelization.rst b/docs/src/tutorial/parallelization.rst new file mode 100644 index 000000000..f5e9ba4b2 --- /dev/null +++ b/docs/src/tutorial/parallelization.rst @@ -0,0 +1,335 @@ +.. _parallel-tutorial: + +================================= +Writing parallel code with Cython +================================= + +.. include:: + ../two-syntax-variants-used + +One method of speeding up your Cython code is parallelization: +you write code that can be run on multiple cores of your CPU simultaneously. +For code that lends itself to parallelization this can produce quite +dramatic speed-ups of up to the number of cores your CPU has (for example +a 4× speed-up on a 4-core CPU).
+ +This tutorial assumes that you are already familiar with Cython's +:ref:`"typed memoryviews"<memoryviews>` (since code using memoryviews is often +the sort of code that's easy to parallelize with Cython), and also that you're +somewhat familiar with the pitfalls of writing parallel code in general +(it aims to be a Cython tutorial rather than a complete introduction +to parallel programming). + +Before starting, a few notes: + +- Not all code can be parallelized - for some code the algorithm simply + relies on being executed in order and you should not attempt to + parallelize it. A cumulative sum is a good example. + +- Not all code is worth parallelizing. There's a reasonable amount of + overhead in starting a parallel section and so you need to make sure + that you're operating on enough data to make this overhead worthwhile. + Additionally, make sure that you are doing actual work on the data! + Multiple threads simply reading the same data tends not to parallelize + too well. If in doubt, time it. + +- Cython requires the contents of parallel blocks to be ``nogil``. If + your algorithm requires access to Python objects then it may not be + suitable for parallelization. + +- Cython's inbuilt parallelization uses the OpenMP constructs + ``omp parallel for`` and ``omp parallel``. These are ideal + for parallelizing relatively small, self-contained blocks of code + (especially loops). However, if you want to use other models of + parallelization such as spawning and waiting for tasks, or + off-loading some "side work" to a continuously running secondary + thread, then you might be better off using other methods (such as + Python's ``threading`` module). + +- Actually implementing your parallel Cython code should probably be + one of the last steps in your optimization. You should start with + some working serial code first. However, it's worth planning for + early since it may affect your choice of algorithm.
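To make these notes concrete, the overall shape of a parallel loop in pure Python syntax is sketched below. The ``ImportError`` fallback and the use of ``math.sin`` on plain lists are assumptions of this uncompiled sketch; a compiled version would use typed memoryviews and ``libc.math.sin``, as in the examples later in this tutorial:

```python
# Shape of a typical parallel loop, runnable uncompiled.  The fallback shim
# is sketch-only: without Cython installed the loop simply runs serially.
import math

try:
    from cython.parallel import prange  # real prange when compiled
except ImportError:
    def prange(n, nogil=False):         # serial stand-in (sketch-only)
        return range(n)

def parallel_sin(values):
    out = [0.0] * len(values)
    # Compiled with OpenMP, iterations are divided between threads;
    # interpreted, this is an ordinary loop.
    for i in prange(len(values), nogil=True):
        out[i] = math.sin(values[i])
    return out
```

Note how the loop body touches only its own element ``out[i]`` - the independence of iterations is what makes the loop safe to parallelize.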
+ +This tutorial does not aim to explore all the options available to +customize parallelization. See the +:ref:`main parallelism documentation<parallel>` for details. +You should also be aware that a lot of the choices Cython makes +about how your code is parallelized are fairly fixed, and if you want +specific OpenMP behaviour that Cython doesn't provide by default, you +may be better off writing it in C yourself. + +Compilation +=========== + +OpenMP requires support from your C/C++ compiler. This support is +usually enabled through a special command-line argument: +on GCC this is ``-fopenmp`` while on MSVC it is +``/openmp``. If your compiler doesn't support OpenMP (or if you +forget to pass the argument) then the code will usually still +compile but will not run in parallel. + +The following ``setup.py`` file can be used to compile the +examples in this tutorial: + +.. literalinclude:: ../../examples/tutorial/parallelization/setup.py + +Element-wise parallel operations +================================ + +The easiest and most common parallel operation in Cython is to +iterate across an array element-wise, performing the same +operation on each array element. In the simple example +below we calculate the ``sin`` of every element in an array: + +.. tabs:: + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/parallelization/parallel_sin.pyx + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/parallelization/parallel_sin.py + +We parallelize the outermost loop. This is usually a good idea +since there is some overhead to entering and leaving a parallel block. +However, you should also consider the likely size of your arrays. +If ``input`` usually had a size of ``(2, 10000000)`` then parallelizing +over the dimension of length ``2`` would likely be a worse choice. + +The body of the loop itself is ``nogil`` - i.e. you cannot perform +"Python" operations.
This is a fairly strong limitation and if you +find that you need to use the GIL then it is likely that Cython's +parallelization features are not suitable for you. It is possible +to throw exceptions from within the loop, however -- Cython simply +regains the GIL and raises an exception, then terminates the loop +on all threads. + +It's necessary to explicitly type the loop variable ``i`` as a +C integer. For a non-parallel loop Cython can infer this, but it +does not currently infer the loop variable for parallel loops, +so not typing ``i`` will lead to compile errors since it will be +a Python object and so unusable without the GIL. + +The C code generated is shown below, for the benefit of experienced +users of OpenMP. It is simplified a little for readability here: + +.. code-block:: C + + #pragma omp parallel + { + #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j) + for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){ + i = __pyx_t_8; + /* body goes here */ + } + } + +Private variables +----------------- + +One useful point to note from the generated C code above - variables +used in the loops like ``i`` and ``j`` are marked as ``firstprivate`` +and ``lastprivate``. Within the loop each thread has its own copy of +the data, the data is initialized +according to its value before the loop, and after the loop the "global" +copy is set equal to the last iteration (i.e. as if the loop were run +in serial). + +The basic rules that Cython applies are: + +- C scalar variables within a ``prange`` block are made + ``firstprivate`` and ``lastprivate``, + +- C scalar variables assigned within a + :ref:`parallel block<parallel-block>` + are ``private`` (which means they can't be used to pass data in + and out of the block), + +- array variables (e.g. memoryviews) are not made private. 
Instead + Cython assumes that you have structured your loop so that each iteration + is acting on different data, + +- Python objects are also not made private, although access to them + is controlled via Python's GIL. + +Cython does not currently provide much opportunity to override these +choices. + +Reductions +========== + +The second most common parallel operation in Cython is the "reduction" +operation. A common example is to accumulate a sum over the whole +array, such as in the calculation of a vector norm below: + +.. tabs:: + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/parallelization/norm.pyx + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/parallelization/norm.py + +Cython is able to infer reductions for ``+=``, ``*=``, ``-=``, +``&=``, ``|=``, and ``^=``. These only apply to C scalar variables, +so you cannot easily reduce a 2D memoryview to a 1D memoryview, for +example. + +The C code generated is approximately: + +.. code-block:: C + + #pragma omp parallel reduction(+:total) + { + #pragma omp for firstprivate(i) lastprivate(i) + for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){ + i = __pyx_t_2; + total = total + /* some indexing code */; + + } + } + +.. _parallel-block: + +``parallel`` blocks +=================== + +Much less frequently used than ``prange`` is Cython's ``parallel`` +operator. ``parallel`` generates a block of code that is run +simultaneously on multiple threads. Unlike ``prange``, however, work is +not automatically divided between threads. + +Here we present three common uses for the ``parallel`` block: + +Stringing together prange blocks +-------------------------------- + +There is some overhead in entering and leaving a parallelized section. +Therefore, if you have multiple parallel sections with small +serial sections in between, it can be more efficient to +write one large parallel block. Any small serial +sections are duplicated, but the overhead is reduced.
+ +In the example below we do an in-place normalization of a vector. +The first parallel loop calculates the norm, the second parallel +loop applies the norm to the vector, and we avoid jumping in and out of serial +code in between. + +.. tabs:: + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/parallelization/normalize.pyx + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/parallelization/normalize.py + +The C code is approximately: + +.. code-block:: C + + #pragma omp parallel private(norm) reduction(+:total) + { + /* some calculations of array size... */ + #pragma omp for firstprivate(i) lastprivate(i) + for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){ + /* ... */ + } + norm = sqrt(total); + #pragma omp for firstprivate(i) lastprivate(i) + for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){ + /* ... */ + } + } + +Allocating "scratch space" for each thread +------------------------------------------ + +Suppose that each thread requires a small amount of scratch space +to work in. They cannot share scratch space because that would +lead to data races. In this case the allocation and deallocation +is done in a parallel section (so occurs on a per-thread basis) +surrounding a loop which then uses the scratch space. + +Our example here uses C++ to find the median of each column in +a 2D array (just a parallel version of ``numpy.median(x, axis=0)``). +We must reorder each column to find the median of it, but don't want +to modify the input array. Therefore, we allocate a C++ vector per +thread to use as scratch space, and work in that. For efficiency +the vector is allocated outside the ``prange`` loop. + +.. tabs:: + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/parallelization/median.pyx + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/parallelization/median.py + +.. 
note:: + + The pure Python and classic syntax examples are not quite identical, + since pure Python syntax does not support C++ ``new``; the scratch + space is therefore allocated slightly differently. + +In the generated code the ``scratch`` variable is marked as +``private`` in the outer parallel block. A rough outline is: + +.. code-block:: C++ + + #pragma omp parallel private(scratch) + { + scratch = new std::vector<double>((x.shape[0])); + #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j) lastprivate(median_it) + for (__pyx_t_9 = 0; __pyx_t_9 < __pyx_t_10; __pyx_t_9++){ + i = __pyx_t_9; + /* implementation goes here */ + } + /* some exception handling detail omitted */ + delete scratch; + } + +Performing different tasks on each thread +----------------------------------------- + +Finally, if you manually specify the number of threads and +then identify each thread using ``omp.get_thread_num()``, +you can manually split work between threads. This is +a fairly rare use-case in Cython, and probably suggests +that the ``threading`` module is more suitable for what +you're trying to do. However, it is an option. + +.. tabs:: + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/parallelization/manual_work.pyx + :lines: 2- + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/parallelization/manual_work.py + :lines: 2- + +The utility of this kind of block is limited by the fact that +variables assigned to in the block are ``private`` to each thread, +so cannot be accessed in the serial section afterwards. + +The generated C code for the example above is fairly simple: + +.. code-block:: C + + #pragma omp parallel private(thread_num) + { + thread_num = omp_get_thread_num(); + switch (thread_num) { + /* ...
*/ + } + } diff --git a/docs/src/tutorial/profiling_tutorial.rst b/docs/src/tutorial/profiling_tutorial.rst index a7cfab0a8..77110206f 100644 --- a/docs/src/tutorial/profiling_tutorial.rst +++ b/docs/src/tutorial/profiling_tutorial.rst @@ -6,6 +6,9 @@ Profiling ********* +.. include:: + ../two-syntax-variants-used + This part describes the profiling abilities of Cython. If you are familiar with profiling pure Python code, you need only read the first section (:ref:`profiling_basics`). If you are not familiar with Python profiling you @@ -46,7 +49,15 @@ you plan to inline them anyway or because you are sure that you can't make them any faster - you can use a special decorator to disable profiling for one function only (regardless of whether it is globally enabled or not): -.. literalinclude:: ../../examples/tutorial/profiling_tutorial/often_called.pyx +.. tabs:: + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/profiling_tutorial/often_called.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/profiling_tutorial/often_called.pyx Enabling line tracing --------------------- @@ -75,8 +86,8 @@ Enabling coverage analysis -------------------------- Since Cython 0.23, line tracing (see above) also enables support for coverage -reporting with the `coverage.py <http://coverage.readthedocs.io/>`_ tool. -To make the coverage analysis understand Cython modules, you also need to enable +reporting with the `coverage.py <https://coverage.readthedocs.io/>`_ tool. To +make the coverage analysis understand Cython modules, you also need to enable Cython's coverage plugin in your ``.coveragerc`` file as follows: .. code-block:: ini @@ -123,6 +134,7 @@ relation we want to use has been proven by Euler in 1735 and is known as the A simple Python code for evaluating the truncated sum looks like this: ..
literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi.py + :caption: calc_pi.py On my box, this needs approximately 4 seconds to run the function with the default n. The higher we choose n, the better will be the approximation for @@ -134,6 +146,7 @@ code takes too much time are wrong. At least, mine are always wrong. So let's write a short script to profile our code: .. literalinclude:: ../../examples/tutorial/profiling_tutorial/profile.py + :caption: profile.py Running this on my box gives the following output: @@ -146,8 +159,8 @@ Running this on my box gives the following output: Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) - 1 3.243 3.243 6.211 6.211 calc_pi.py:7(approx_pi) - 10000000 2.526 0.000 2.526 0.000 calc_pi.py:4(recip_square) + 1 3.243 3.243 6.211 6.211 calc_pi.py:4(approx_pi) + 10000000 2.526 0.000 2.526 0.000 calc_pi.py:1(recip_square) 1 0.442 0.442 0.442 0.442 {range} 1 0.000 0.000 6.211 6.211 <string>:1(<module>) 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} @@ -160,8 +173,8 @@ for the nitty gritty details. The most important columns here are tottime (total time spent in this function **not** counting functions that were called by this function) and cumtime (total time spent in this function **also** counting the functions called by this function). Looking at the tottime column, we see that -approximately half the time is spent in approx_pi and the other half is spent -in recip_square. Also half a second is spent in range ... of course we should +approximately half the time is spent in ``approx_pi()`` and the other half is spent +in ``recip_square()``. Also half a second is spent in range ... of course we should have used xrange for such a big iteration. And in fact, just changing range to xrange makes the code run in 5.8 seconds.
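For reference, the Euler relation used here is the Basel series: the sum of 1/k² converges to π²/6, so π ≈ sqrt(6·sum). A plain-Python sketch of the ``calc_pi.py`` pattern, using the function names that appear in the profile output above:

```python
# Approximate pi via the truncated Basel series:
# 1/1**2 + 1/2**2 + ... + 1/n**2  ->  pi**2 / 6
from math import sqrt

def recip_square(i):
    return 1.0 / i ** 2

def approx_pi(n=10000000):
    val = 0.0
    for k in range(1, n + 1):
        val += recip_square(k)
    return sqrt(6 * val)
```

The truncation error shrinks roughly like 1/n, which is why large ``n`` (and hence profiling and optimization) matters here.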
@@ -169,7 +182,17 @@ We could optimize a lot in the pure Python version, but since we are interested in Cython, let's move forward and bring this module to Cython. We would do this anyway at some time to make the loop run faster. Here is our first Cython version: -.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_2.pyx +.. tabs:: + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_2.py + :caption: calc_pi.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_2.pyx + :caption: calc_pi.pyx Note the first line: We have to tell Cython that profiling should be enabled. This makes the Cython code slightly slower, but without this we would not get @@ -180,99 +203,184 @@ We also need to modify our profiling script to import the Cython module directly Here is the complete version adding the import of the :ref:`Pyximport<pyximport>` module: .. literalinclude:: ../../examples/tutorial/profiling_tutorial/profile_2.py + :caption: profile.py We only added two lines; the rest stays completely the same. Alternatively, we could also manually compile our code into an extension; we wouldn't need to change the profile script then at all. The script now outputs the following: -.. code-block:: none +.. tabs:: - Sat Nov 7 18:02:33 2009 Profile.prof + .. group-tab:: Pure Python - 10000004 function calls in 4.406 CPU seconds + ..
code-block:: none - Ordered by: internal time + Sat Nov 7 18:02:33 2009 Profile.prof - ncalls tottime percall cumtime percall filename:lineno(function) - 1 3.305 3.305 4.406 4.406 calc_pi.pyx:7(approx_pi) - 10000000 1.101 0.000 1.101 0.000 calc_pi.pyx:4(recip_square) - 1 0.000 0.000 4.406 4.406 {calc_pi.approx_pi} - 1 0.000 0.000 4.406 4.406 <string>:1(<module>) - 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} + 10000004 function calls in 4.406 CPU seconds + + Ordered by: internal time + + ncalls tottime percall cumtime percall filename:lineno(function) + 1 3.305 3.305 4.406 4.406 calc_pi.py:6(approx_pi) + 10000000 1.101 0.000 1.101 0.000 calc_pi.py:3(recip_square) + 1 0.000 0.000 4.406 4.406 {calc_pi.approx_pi} + 1 0.000 0.000 4.406 4.406 <string>:1(<module>) + 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} -We gained 1.8 seconds. Not too shabby. Comparing the output to the previous, we -see that recip_square function got faster while the approx_pi function has not -changed a lot. Let's concentrate on the recip_square function a bit more. First -note, that this function is not to be called from code outside of our module; -so it would be wise to turn it into a cdef to reduce call overhead. We should -also get rid of the power operator: it is turned into a pow(i,2) function call by -Cython, but we could instead just write i*i which could be faster. The + .. group-tab:: Cython + + .. code-block:: none + + Sat Nov 7 18:02:33 2009 Profile.prof + + 10000004 function calls in 4.406 CPU seconds + + Ordered by: internal time + + ncalls tottime percall cumtime percall filename:lineno(function) + 1 3.305 3.305 4.406 4.406 calc_pi.pyx:6(approx_pi) + 10000000 1.101 0.000 1.101 0.000 calc_pi.pyx:3(recip_square) + 1 0.000 0.000 4.406 4.406 {calc_pi.approx_pi} + 1 0.000 0.000 4.406 4.406 <string>:1(<module>) + 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} + +We gained 1.8 seconds. 
Not too shabby. Comparing the output to the previous, we +see that the ``recip_square()`` function got faster while the ``approx_pi()`` +function has not changed a lot. Let's concentrate on the ``recip_square()`` function +a bit more. First, note that this function is not to be called from code outside +of our module; so it would be wise to turn it into a cdef to reduce call overhead. +We should also get rid of the power operator: it is turned into a ``pow(i, 2)`` function +call by Cython, but we could instead just write ``i * i`` which could be faster. The whole function is also a good candidate for inlining. Let's look at the necessary changes for these ideas: -.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_3.pyx +.. tabs:: + + .. group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_3.py + :caption: calc_pi.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_3.pyx + :caption: calc_pi.pyx + +Note that the ``except``/``@exceptval`` declaration is needed in the signature of ``recip_square()`` +in order to propagate division by zero errors. Now running the profile script yields: -.. code-block:: none +.. tabs:: - Sat Nov 7 18:10:11 2009 Profile.prof + .. group-tab:: Pure Python - 10000004 function calls in 2.622 CPU seconds + .. 
code-block:: none - Ordered by: internal time + Sat Nov 7 18:10:11 2009 Profile.prof - ncalls tottime percall cumtime percall filename:lineno(function) - 1 1.782 1.782 2.622 2.622 calc_pi.pyx:7(approx_pi) - 10000000 0.840 0.000 0.840 0.000 calc_pi.pyx:4(recip_square) - 1 0.000 0.000 2.622 2.622 {calc_pi.approx_pi} - 1 0.000 0.000 2.622 2.622 <string>:1(<module>) - 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} + 10000004 function calls in 2.622 CPU seconds + + Ordered by: internal time + + ncalls tottime percall cumtime percall filename:lineno(function) + 1 1.782 1.782 2.622 2.622 calc_pi.py:9(approx_pi) + 10000000 0.840 0.000 0.840 0.000 calc_pi.py:6(recip_square) + 1 0.000 0.000 2.622 2.622 {calc_pi.approx_pi} + 1 0.000 0.000 2.622 2.622 <string>:1(<module>) + 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} + + .. group-tab:: Cython + + .. code-block:: none + + Sat Nov 7 18:10:11 2009 Profile.prof + + 10000004 function calls in 2.622 CPU seconds + + Ordered by: internal time + + ncalls tottime percall cumtime percall filename:lineno(function) + 1 1.782 1.782 2.622 2.622 calc_pi.pyx:9(approx_pi) + 10000000 0.840 0.000 0.840 0.000 calc_pi.pyx:6(recip_square) + 1 0.000 0.000 2.622 2.622 {calc_pi.approx_pi} + 1 0.000 0.000 2.622 2.622 <string>:1(<module>) + 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} That bought us another 1.8 seconds. Not the dramatic change we could have -expected. And why is recip_square still in this table; it is supposed to be +expected. And why is ``recip_square()`` still in this table; it is supposed to be inlined, isn't it? The reason for this is that Cython still generates profiling code even if the function call is eliminated. Let's tell it to not -profile recip_square any more; we couldn't get the function to be much faster anyway: +profile ``recip_square()`` any more; we couldn't get the function to be much faster anyway: + +.. tabs:: + + .. 
group-tab:: Pure Python + + .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_4.py + :caption: calc_pi.py + + .. group-tab:: Cython + + .. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_4.pyx + :caption: calc_pi.pyx -.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_4.pyx Running this shows an interesting result: -.. code-block:: none +.. tabs:: - Sat Nov 7 18:15:02 2009 Profile.prof + .. group-tab:: Pure Python - 4 function calls in 0.089 CPU seconds + .. code-block:: none - Ordered by: internal time + Sat Nov 7 18:15:02 2009 Profile.prof - ncalls tottime percall cumtime percall filename:lineno(function) - 1 0.089 0.089 0.089 0.089 calc_pi.pyx:10(approx_pi) - 1 0.000 0.000 0.089 0.089 {calc_pi.approx_pi} - 1 0.000 0.000 0.089 0.089 <string>:1(<module>) - 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} + 4 function calls in 0.089 CPU seconds + + Ordered by: internal time + + ncalls tottime percall cumtime percall filename:lineno(function) + 1 0.089 0.089 0.089 0.089 calc_pi.py:12(approx_pi) + 1 0.000 0.000 0.089 0.089 {calc_pi.approx_pi} + 1 0.000 0.000 0.089 0.089 <string>:1(<module>) + 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} + + .. group-tab:: Cython + + .. code-block:: none + + Sat Nov 7 18:15:02 2009 Profile.prof + + 4 function calls in 0.089 CPU seconds + + Ordered by: internal time + + ncalls tottime percall cumtime percall filename:lineno(function) + 1 0.089 0.089 0.089 0.089 calc_pi.pyx:12(approx_pi) + 1 0.000 0.000 0.089 0.089 {calc_pi.approx_pi} + 1 0.000 0.000 0.089 0.089 <string>:1(<module>) + 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} First note the tremendous speed gain: this version only takes 1/50 of the time -of our first Cython version. Also note that recip_square has vanished from the -table like we wanted. But the most peculiar and import change is that -approx_pi also got much faster. 
This is a problem with all profiling: calling a -function in a profile run adds a certain overhead to the function call. This +of our first Cython version. Also note that ``recip_square()`` has vanished from the +table like we wanted. But the most peculiar and important change is that +``approx_pi()`` also got much faster. This is a problem with all profiling: calling a +function in a profile run adds a certain overhead to the function call. This overhead is **not** added to the time spent in the called function, but to the -time spent in the **calling** function. In this example, approx_pi didn't need 2.622 -seconds in the last run; but it called recip_square 10000000 times, each time taking a -little to set up profiling for it. This adds up to the massive time loss of -around 2.6 seconds. Having disabled profiling for the often called function now -reveals realistic timings for approx_pi; we could continue optimizing it now if +time spent in the **calling** function. In this example, ``approx_pi()`` didn't need 2.622 +seconds in the last run; but it called ``recip_square()`` 10000000 times, each time taking a +little to set up profiling for it. This adds up to the massive time loss of +around 2.6 seconds. Having disabled profiling for the often called function now +reveals realistic timings for ``approx_pi()``; we could continue optimizing it now if needed. This concludes this profiling tutorial. There is still some room for improvement in this code. We could try to replace the power operator in -approx_pi with a call to sqrt from the C stdlib; but this is not necessarily -faster than calling pow(x,0.5). +``approx_pi()`` with a call to ``sqrt()`` from the C stdlib; but this is not necessarily +faster than calling ``pow(x, 0.5)``. Even so, the result we achieved here is quite satisfactory: we came up with a solution that is much faster than our original Python version while retaining functionality and readability.
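As a side note, the kind of driver that produces the profiling tables shown above can be sketched in plain Python with ``cProfile`` and ``pstats``. The function bodies below are simplified stand-ins for the tutorial's ``calc_pi`` module (in the tutorial they live in ``calc_pi.py``/``calc_pi.pyx`` and are compiled with profiling enabled); the smaller iteration count is our choice, just to keep the run short:

```python
import cProfile
import io
import math
import pstats

# Simplified stand-ins for the tutorial's calc_pi module.
def recip_square(i):
    return 1.0 / i ** 2

def approx_pi(n):
    # Basel series: sum of 1/k**2 converges to pi**2 / 6.
    val = 0.0
    for k in range(1, n + 1):
        val += recip_square(k)
    return (6 * val) ** 0.5

# Profile a single call; runcall() returns the profiled function's result.
profiler = cProfile.Profile()
result = profiler.runcall(approx_pi, 100000)  # smaller n than the tutorial

# Render the same kind of table as shown above, sorted by internal time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('tottime').print_stats()
report = stream.getvalue()
```

Sorting by ``tottime`` and disabling profiling for hot inner functions, as the tutorial does with ``@cython.profile(False)``, are the two knobs that keep such tables readable.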
- - diff --git a/docs/src/tutorial/pure.rst b/docs/src/tutorial/pure.rst index 775e7719c..32a7fa0ca 100644 --- a/docs/src/tutorial/pure.rst +++ b/docs/src/tutorial/pure.rst @@ -13,7 +13,7 @@ To go beyond that, Cython provides language constructs to add static typing and cythonic functionalities to a Python module to make it run much faster when compiled, while still allowing it to be interpreted. This is accomplished via an augmenting ``.pxd`` file, via Python -type annotations (following +type :ref:`pep484_type_annotations` (following `PEP 484 <https://www.python.org/dev/peps/pep-0484/>`_ and `PEP 526 <https://www.python.org/dev/peps/pep-0526/>`_), and/or via special functions and decorators available after importing the magic @@ -29,6 +29,7 @@ In pure mode, you are more or less restricted to code that can be expressed beyond that can only be done in .pyx files with extended language syntax, because it depends on features of the Cython compiler. +.. _augmenting_pxd: Augmenting .pxd --------------- @@ -82,7 +83,7 @@ in the :file:`.pxd`, that is, to be accessible from Python, In the example above, the type of the local variable `a` in `myfunction()` -is not fixed and will thus be a Python object. To statically type it, one +is not fixed and will thus be a :term:`Python object`. To statically type it, one can use Cython's ``@cython.locals`` decorator (see :ref:`magic_attributes`, and :ref:`magic_attributes_pxd`). @@ -153,26 +154,10 @@ Static typing @exceptval(-1, check=False) # cdef int func() except -1: @exceptval(check=True) # cdef int func() except *: @exceptval(-1, check=True) # cdef int func() except? -1: + @exceptval(check=False) # no exception checking/propagation -* Python annotations can be used to declare argument types, as shown in the - following example. To avoid conflicts with other kinds of annotation - usages, this can be disabled with the directive ``annotation_typing=False``. - - .. 
literalinclude:: ../../examples/tutorial/pure/annotations.py - - This can be combined with the ``@cython.exceptval()`` decorator for non-Python - return types: - - .. literalinclude:: ../../examples/tutorial/pure/exceptval.py - - Since version 0.27, Cython also supports the variable annotations defined - in `PEP 526 <https://www.python.org/dev/peps/pep-0526/>`_. This allows to - declare types of variables in a Python 3.6 compatible way as follows: - - .. literalinclude:: ../../examples/tutorial/pure/pep_526.py - - There is currently no way to express the visibility of object attributes. - + If exception propagation is disabled, any Python exceptions that are raised + inside of the function will be printed and ignored. C types ^^^^^^^ @@ -225,6 +210,68 @@ Here is an example of a :keyword:`cdef` function:: return a == b + +Managing the Global Interpreter Lock +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +* ``cython.nogil`` can be used as a context manager or as a decorator to replace the :keyword:`nogil` keyword:: + + with cython.nogil: + # code block with the GIL released + + @cython.nogil + @cython.cfunc + def func_released_gil() -> cython.int: + # function that can be run with the GIL released + + Note that the two uses differ: the context manager releases the GIL while the decorator marks that a + function *can* be run without the GIL. See :ref:`cython_and_gil` for more details. + +* ``cython.gil`` can be used as a context manager to replace the :keyword:`gil` keyword:: + + with cython.gil: + # code block with the GIL acquired + + .. Note:: Cython currently does not support the ``@cython.with_gil`` decorator. + +Both directives accept an optional boolean parameter for conditionally +releasing or acquiring the GIL.
The condition must be constant (at compile time):: + + with cython.nogil(False): + # code block with the GIL not released + + @cython.nogil(True) + @cython.cfunc + def func_released_gil() -> cython.int: + # function with the GIL released + + with cython.gil(False): + # code block with the GIL not acquired + + with cython.gil(True): + # code block with the GIL acquired + +A common use case for conditionally acquiring and releasing the GIL is fused types +that allow different GIL handling depending on the specific type (see :ref:`gil_conditional`). + +.. py:module:: cython.cimports + +cimports +^^^^^^^^ + +The special ``cython.cimports`` package name gives access to cimports +in code that uses Python syntax. Note that this does not mean that C +libraries become available to Python code. It only means that you can +tell Cython what cimports you want to use, without requiring special +syntax. Running such code in plain Python will fail. + +.. literalinclude:: ../../examples/tutorial/pure/py_cimport.py + +Since such code must necessarily refer to the non-existing +``cython.cimports`` 'package', the plain cimport form +``cimport cython.cimports...`` is not available. +You must use the form ``from cython.cimports...``. + + Further Cython functions and declarations ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -242,6 +289,13 @@ Further Cython functions and declarations print(cython.sizeof(cython.longlong)) print(cython.sizeof(n)) +* ``typeof`` returns a string representation of the argument's type for debugging purposes. It can take expressions. + + :: + + cython.declare(n=cython.longlong) + print(cython.typeof(n)) + * ``struct`` can be used to create struct types:: MyStruct = cython.struct(x=cython.int, y=cython.int, data=cython.double) @@ -272,6 +326,12 @@ Further Cython functions and declarations t1 = cython.cast(T, t) t2 = cython.cast(T, t, typecheck=True) +* ``fused_type`` creates a new type definition that refers to multiple types.
+ The following example declares a new type called ``my_fused_type`` which can + be either an ``int`` or a ``float``:: + + my_fused_type = cython.fused_type(cython.int, cython.float) + .. _magic_attributes_pxd: Magic Attributes within the .pxd @@ -289,10 +349,94 @@ can be augmented with the following :file:`.pxd` file :file:`dostuff.pxd`: The :func:`cython.declare()` function can be used to specify types for global variables in the augmenting :file:`.pxd` file. +.. _pep484_type_annotations: + +PEP-484 type annotations +------------------------ + +Python `type hints <https://www.python.org/dev/peps/pep-0484>`_ +can be used to declare argument types, as shown in the +following example: + + .. literalinclude:: ../../examples/tutorial/pure/annotations.py + +Note the use of ``cython.int`` rather than ``int`` - Cython does not translate +an ``int`` annotation to a C integer by default since the behaviour can be +quite different with respect to overflow and division. + +Annotations can be combined with the ``@cython.exceptval()`` decorator for non-Python +return types: + + .. literalinclude:: ../../examples/tutorial/pure/exceptval.py + +Note that the default exception handling behaviour when returning C numeric types +is to check for ``-1``, and if that was returned, check Python's error indicator +for an exception. This means, if no ``@exceptval`` decorator is provided, and the +return type is a numeric type, then the default with type annotations is +``@exceptval(-1, check=True)``, in order to make sure that exceptions are correctly +and efficiently reported to the caller. Exception propagation can be disabled +explicitly with ``@exceptval(check=False)``, in which case any Python exceptions +raised inside of the function will be printed and ignored. + +Since version 0.27, Cython also supports the variable annotations defined +in `PEP 526 <https://www.python.org/dev/peps/pep-0526/>`_.
This allows +declaring types of variables in a Python 3.6 compatible way as follows: + +.. literalinclude:: ../../examples/tutorial/pure/pep_526.py + +There is currently no way to express the visibility of object attributes. + +Disabling annotations +^^^^^^^^^^^^^^^^^^^^^ + +To avoid conflicts with other kinds of annotation +usages, Cython's use of annotations to specify types can be disabled with the +``annotation_typing`` :ref:`compiler directive<compiler-directives>`. From Cython 3 +you can use this as a decorator or a with statement, as shown in the following example: + +.. literalinclude:: ../../examples/tutorial/pure/disabled_annotations.py + + + +``typing`` Module +^^^^^^^^^^^^^^^^^ + +Support for the full range of annotations described by PEP-484 is not yet +complete. Cython 3 currently understands the following features from the +``typing`` module: + +* ``Optional[tp]``, which is interpreted as ``tp or None``; +* typed containers such as ``List[str]``, which is interpreted as ``list``. The + hint that the elements are of type ``str`` is currently ignored; +* ``Tuple[...]``, which is converted into a Cython C-tuple where possible + and a regular Python ``tuple`` otherwise; +* ``ClassVar[...]``, which is understood in the context of + ``cdef class`` or ``@cython.cclass``. + +Some of the unsupported features are likely to remain +unsupported since these type hints are not relevant for the compilation to +efficient C code. In other cases, however, where the generated C code could +benefit from these type hints but currently does not, help is welcome to +improve the type analysis in Cython. + +Reference table +^^^^^^^^^^^^^^^ + +The following reference table documents how type annotations are currently interpreted. +Cython 0.29 behaviour is only shown where it differs from Cython 3.0 behaviour. +The current limitations will likely be lifted at some point. + +..
csv-table:: Annotation typing rules + :file: annotation_typing_table.csv + :header-rows: 1 + :class: longtable + :widths: 1 1 1 Tips and Tricks --------------- +.. _calling-c-functions: + Calling C functions ^^^^^^^^^^^^^^^^^^^ diff --git a/docs/src/tutorial/pxd_files.rst b/docs/src/tutorial/pxd_files.rst index 5fc5b0a89..0a22f7a2a 100644 --- a/docs/src/tutorial/pxd_files.rst +++ b/docs/src/tutorial/pxd_files.rst @@ -11,27 +11,25 @@ using the ``cimport`` keyword. ``pxd`` files have many use-cases: - 1. They can be used for sharing external C declarations. - 2. They can contain functions which are well suited for inlining by - the C compiler. Such functions should be marked ``inline``, example: - :: +1. They can be used for sharing external C declarations. +2. They can contain functions which are well suited for inlining by + the C compiler. Such functions should be marked ``inline``, for example:: cdef inline int int_min(int a, int b): return b if b < a else a - 3. When accompanying an equally named ``pyx`` file, they +3. When accompanying an equally named ``pyx`` file, they provide a Cython interface to the Cython module so that other Cython modules can communicate with it using a more efficient protocol than the Python one. In our integration example, we might break it up into ``pxd`` files like this: - 1. Add a ``cmath.pxd`` function which defines the C functions available from +1. Add a ``cmath.pxd`` file which defines the C functions available from the C ``math.h`` header file, like ``sin``. Then one would simply do ``from cmath cimport sin`` in ``integrate.pyx``. - 2. Add a ``integrate.pxd`` so that other modules written in Cython - can define fast custom functions to integrate. - :: +2.
Add an ``integrate.pxd`` so that other modules written in Cython + can define fast custom functions to integrate:: cdef class Function: cpdef evaluate(self, double x) @@ -41,3 +39,37 @@ In our integration example, we might break it up into ``pxd`` files like this: Note that if you have a cdef class with attributes, the attributes must be declared in the class declaration ``pxd`` file (if you use one), not the ``pyx`` file. The compiler will tell you about this. + + +__init__.pxd +^^^^^^^^^^^^ + +Cython also supports ``__init__.pxd`` files for declarations in a package's +namespace, similar to ``__init__.py`` files in Python. + +Continuing the integration example, we could package the module as follows: + +1. Place the module files in a directory tree as one usually would for + Python: + + .. code-block:: text + + CyIntegration/ + ├── __init__.pyx + ├── __init__.pxd + ├── integrate.pyx + └── integrate.pxd + +2. In ``__init__.pxd``, use ``cimport`` for any declarations that one + would want to be available from the package's main namespace:: + + from CyIntegration cimport integrate + + Other modules would then be able to use ``cimport`` on the package in + order to recursively gain faster Cython access to the entire package + and the data declared in its modules:: + + cimport CyIntegration + + cpdef do_integration(CyIntegration.integrate.Function f): + return CyIntegration.integrate.integrate(f, 0., 2., 1) diff --git a/docs/src/tutorial/python_division.png b/docs/src/tutorial/python_division.png Binary files differ index 617be942c..a54fdd0b7 100644 --- a/docs/src/tutorial/python_division.png +++ b/docs/src/tutorial/python_division.png diff --git a/docs/src/tutorial/readings.rst b/docs/src/tutorial/readings.rst index a3f09d39e..80ed26e66 100644 --- a/docs/src/tutorial/readings.rst +++ b/docs/src/tutorial/readings.rst @@ -1,7 +1,7 @@ Further reading =============== -The main documentation is located at http://docs.cython.org/.
Some +The main documentation is located at https://docs.cython.org/. Some recent features might not have documentation written yet, in such cases some notes can usually be found in the form of a Cython Enhancement Proposal (CEP) on https://github.com/cython/cython/wiki/enhancements. @@ -16,7 +16,7 @@ features for managing it. Finally, don't hesitate to ask questions (or post reports on successes!) on the Cython users mailing list [UserList]_. The Cython developer mailing list, [DevList]_, is also open to everybody, but -focusses on core development issues. Feel free to use it to report a +focuses on core development issues. Feel free to use it to report a clear bug, to ask for guidance if you have time to spare to develop Cython, or if you have suggestions for future development. diff --git a/docs/src/tutorial/related_work.rst b/docs/src/tutorial/related_work.rst index 01cc5b327..af55ae88b 100644 --- a/docs/src/tutorial/related_work.rst +++ b/docs/src/tutorial/related_work.rst @@ -41,8 +41,9 @@ Python modules. .. [ctypes] https://docs.python.org/library/ctypes.html. .. there's also the original ctypes home page: http://python.net/crew/theller/ctypes/ -.. [Pyrex] G. Ewing, Pyrex: C-Extensions for Python, - http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/ +.. + [Pyrex] G. Ewing, Pyrex: C-Extensions for Python, + https://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/ .. [ShedSkin] M. Dufour, J. Coughlan, ShedSkin, https://github.com/shedskin/shedskin .. [SWIG] David M. Beazley et al., diff --git a/docs/src/tutorial/strings.rst b/docs/src/tutorial/strings.rst index a4ce2b9d0..0a3e348dc 100644 --- a/docs/src/tutorial/strings.rst +++ b/docs/src/tutorial/strings.rst @@ -124,6 +124,9 @@ Python variable:: from c_func cimport c_call_returning_a_c_string cdef char* c_string = c_call_returning_a_c_string() + if c_string is NULL: + ... 
# handle error + cdef bytes py_string = c_string A type cast to :obj:`object` or :obj:`bytes` will do the same thing:: @@ -441,7 +444,7 @@ characters and is compatible with plain ASCII encoded text that it encodes efficiently. This makes it a very good choice for source code files which usually consist mostly of ASCII characters. -.. _`UTF-8`: http://en.wikipedia.org/wiki/UTF-8 +.. _`UTF-8`: https://en.wikipedia.org/wiki/UTF-8 As an example, putting the following line into a UTF-8 encoded source file will print ``5``, as UTF-8 encodes the letter ``'ö'`` in the two @@ -554,7 +557,7 @@ above character. For more information on this topic, it is worth reading the `Wikipedia article about the UTF-16 encoding`_. -.. _`Wikipedia article about the UTF-16 encoding`: http://en.wikipedia.org/wiki/UTF-16/UCS-2 +.. _`Wikipedia article about the UTF-16 encoding`: https://en.wikipedia.org/wiki/UTF-16/UCS-2 The same properties apply to Cython code that gets compiled for a narrow CPython runtime environment. In most cases, e.g. when |