Lecture 12: Numpy + Cython = AWESOME

This lecture is about how to efficiently combine Numpy and Cython to write fast numerical code.

We will focus on the problem of computing the standard deviation of a list of floating point numbers.   Let $x_1, \ldots, x_n$ be a list of $n$ real numbers, and let $$\mu = \frac{1}{n}\sum_{i=1}^n x_i$$ be their mean.   We define their standard deviation to be $$\sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2}.$$
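For example, for the three numbers $1, 2, 3$ we have $\mu = 2$, so the standard deviation is $$\sqrt{\frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3}} = \sqrt{2/3} \approx 0.816.$$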

Note: In statistics it is common to divide by $n-1$ instead of $n$ when computing the standard deviation of a sample that is being used to estimate the standard deviation of a population.  We will not do this, since our goal today is to illustrate programming techniques, not to teach statistics.


Running Example: Compute the standard deviation of a list of 64-bit floating point numbers.   Our data set is generated using Python's random.random function, which produces numbers uniformly distributed between 0 and 1.

{{{id=3|
set_random_seed(0)
import random
v = [random.random() for _ in range(10^5)]
///
}}}

First we write a naive, straightforward implementation of the standard deviation.

{{{id=9|
def my_std(z):
    mean = sum(z)/len(z)
    return sqrt(sum((x-mean)^2 for x in z)/len(z))
///
}}}

{{{id=10|
time my_std(v)
///
0.28822469197281247
Time: CPU 0.06 s, Wall: 0.06 s
}}}

{{{id=11|
timeit('my_std(v)')
///
5 loops, best of 3: 55.6 ms per loop
}}}

Next we try the std function in Sage, which was implemented by UW undergrad Andrew Hou as part of paid work he did on Sage after he took Math 480.

{{{id=1|
time std(v, bias=True)
///
0.28822469197281247
Time: CPU 0.03 s, Wall: 0.03 s
}}}

{{{id=12|
timeit('std(v, bias=True)')
///
25 loops, best of 3: 22.5 ms per loop
}}}

Next we try Numpy, which is much faster than the above.

{{{id=7|
import numpy
v_numpy = numpy.array(v, dtype=numpy.float64)
///
}}}

{{{id=14|
v_numpy.dtype
///
dtype('float64')
}}}

{{{id=21|
v_numpy.std()
///
0.28822469197281253
}}}

{{{id=6|
timeit('v_numpy.std()')
///
625 loops, best of 3: 664 µs per loop
}}}

{{{id=76|
22.5/.664   # Sage's std took 22.5 ms, numpy's took 664 µs: numpy is ~34x faster
///
33.8855421686747
}}}

Sage also has code for working with TimeSeries, which happens to have a method for computing the standard deviation.  It is a couple of times faster than Numpy.  

{{{id=16|
v_stats = stats.TimeSeries(v)
///
}}}

{{{id=20|
v_stats.standard_deviation(bias=True)
///
0.28822469197281247
}}}

{{{id=15|
timeit('v_stats.standard_deviation(bias=True)')
///
625 loops, best of 3: 239 µs per loop
}}}

The TimeSeries code is nearly optimal.  A TimeSeries is represented by a contiguous array of doubles, and the code for computing the standard deviation is very straightforward Cython that maps directly to C.  (I wrote it, by the way.)
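To give a feel for what "maps directly to C" means, here is a minimal sketch in the same spirit (this is not the actual Sage source, just an illustration): once everything is typed, each loop compiles to a plain C loop over a contiguous block of doubles.

{{{id=90|
%cython
# A sketch, not the actual TimeSeries source: typed Cython loops
# over a contiguous C array of doubles compile to plain C loops.
cdef extern:
    double sqrt(double)

cdef double std_of_c_array(double* x, Py_ssize_t n):
    cdef Py_ssize_t i
    cdef double m = 0, s = 0
    for i in range(n):      # plain C loop: no Python objects involved
        m += x[i]
    m /= n
    for i in range(n):
        s += (x[i] - m)**2
    return sqrt(s/n)
///
}}}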


Goal: Write a function that computes the standard deviation of a numpy array as quickly as stats.TimeSeries does, and hence faster than Numpy itself.

First approach: Use numpy "vectorized operations".  This doesn't help at all — it is actually slower than Numpy's built-in std — and it is also wasteful of memory, by the way.

{{{id=23|
import math
def std_numpy1(v):
    m = v.mean()    # mean of entries
    w = v - m       # subtracts m from each entry: "broadcasting"
    w2 = w**2       # squares each entry componentwise
    return math.sqrt(w2.mean())
///
}}}

{{{id=19|
std_numpy1(v_numpy)
///
0.28822469197281247
}}}

{{{id=18|
timeit('std_numpy1(v_numpy)')
///
625 loops, best of 3: 1.11 ms per loop
}}}
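As an aside on the memory waste: v - m and w**2 each allocate a brand new array the same size as v.  Here is a sketch of a variant (the name std_numpy1b is mine) that reuses a single temporary via numpy's in-place operations; it saves memory, but it still makes the same number of passes over the data, so don't expect it to be much faster.

{{{id=91|
import math
def std_numpy1b(v):
    m = v.mean()
    w = v - m     # the only temporary array we allocate
    w *= w        # square in place: no second temporary
    return math.sqrt(w.mean())
///
}}}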

Let's see how the time is spent on each step.  It turns out to be spread roughly equally across the four lines; note that each line makes a separate full pass over the array.

{{{id=34|
m = v_numpy.mean()
timeit('v_numpy.mean()')
///
625 loops, best of 3: 140 µs per loop
}}}

{{{id=37|
w = v_numpy - m
timeit('v_numpy - m')
///
625 loops, best of 3: 148 µs per loop
}}}

{{{id=38|
w2 = w**2
timeit('w**2')
///
625 loops, best of 3: 126 µs per loop
}}}

{{{id=36|
m2 = w2.mean()
timeit('math.sqrt(w2.mean())')
///
625 loops, best of 3: 146 µs per loop
}}}

Next we try Cython with no special type declarations.  Compiling removes some of the Python overhead of std_numpy1, but not surprisingly we end up no faster than Numpy's built-in std.

{{{id=28|
%cython
import math
def std_numpy2(v):
    m = v.mean()    # mean of entries
    w = v - m       # subtracts m from each entry: "broadcasting"
    w2 = w**2       # squares each entry componentwise
    return math.sqrt(w2.mean())
///
}}}

{{{id=25|
std_numpy2(v_numpy)
///
0.28822469197281247
}}}

{{{id=24|
timeit('std_numpy2(v_numpy)')
///
625 loops, best of 3: 672 µs per loop
}}}

Next try Cython with special support for Numpy.  This gets powerful... as we will see.

{{{id=30|
%cython
from numpy cimport ndarray
import math
def std_numpy3(ndarray v not None):
    m = v.mean()    # mean of entries
    w = v - m       # subtracts m from each entry: "broadcasting"
    w2 = w**2       # squares each entry componentwise
    return math.sqrt(w2.mean())
///
}}}

{{{id=33|
std_numpy3(v_numpy)
///
0.28822469197281247
}}}

{{{id=42|
timeit('std_numpy3(v_numpy)')
///
625 loops, best of 3: 573 µs per loop
}}}

Looking at the Cython + Numpy documentation (found by Googling "cython numpy"), we learn that if we declare v a little more precisely, then we get fast direct access to the underlying elements of v.

{{{id=46|
%cython
cimport numpy as np
import math
def std_numpy4a(np.ndarray[np.float64_t, ndim=1] v):
    cdef Py_ssize_t i
    cdef Py_ssize_t n = v.shape[0]   # how many entries
    # Compute the mean
    cdef double m = 0
    for i in range(n):
        m += v[i]
    m /= n
    # just doing the mean for now...
    return m
///
}}}

{{{id=45|
std_numpy4a(v_numpy)
///
0.49964878415234498
}}}

Timing looks good...

{{{id=44|
timeit('std_numpy4a(v_numpy)')
///
625 loops, best of 3: 124 µs per loop
}}}

Let's finish the function and see how it compares.

{{{id=56|
%cython
cimport numpy as np
cdef extern:
    double sqrt(double)
def std_numpy4b(np.ndarray[np.float64_t, ndim=1] v):
    cdef Py_ssize_t i
    cdef Py_ssize_t n = v.shape[0]   # how many entries
    # Compute the mean
    cdef double m = 0
    for i in range(n):
        m += v[i]
    m /= n
    # Compute the variance
    cdef double s = 0
    for i in range(n):
        s += (v[i] - m)**2
    return sqrt(s/n)
///
}}}

{{{id=55|
std_numpy4b(v_numpy)
///
0.28822469197281247
}}}

{{{id=63|
timeit('std_numpy4b(v_numpy)')
///
625 loops, best of 3: 271 µs per loop
}}}

{{{id=54|
timeit('v_stats.standard_deviation(bias=True)')
///
625 loops, best of 3: 233 µs per loop
}}}

{{{id=58|
timeit('v_numpy.std()')
///
625 loops, best of 3: 545 µs per loop
}}}

Very nice!!


Finally, we try again after disabling bounds checking.   This is even better: now essentially as fast as stats.TimeSeries.

{{{id=50|
%cython
cimport numpy as np
cdef extern:
    double sqrt(double)
cimport cython

@cython.boundscheck(False)   # turn off bounds-checking for the entire function
def std_numpy5a(np.ndarray[np.float64_t, ndim=1] v):
    cdef Py_ssize_t i
    cdef Py_ssize_t n = v.shape[0]   # how many entries
    # Compute the mean
    cdef double m = 0
    for i in range(n):
        m += v[i]
    m /= n
    # Compute the variance
    cdef double s = 0
    for i in range(n):
        s += (v[i] - m)**2
    return sqrt(s/n)
///
}}}

{{{id=49|
timeit('std_numpy5a(v_numpy)')
///
625 loops, best of 3: 256 µs per loop
}}}

{{{id=43|
timeit('v_stats.standard_deviation(bias=True)')
///
625 loops, best of 3: 259 µs per loop
}}}

Yeah, we did it!!   

For smaller inputs, interestingly, we get a massive win over Numpy, since for short arrays the per-call overhead dominates.  If you were, e.g., computing a sliding window of standard deviations for a time series, this would be important (see the sketch after the cells below).

{{{id=65|
a = numpy.array([1,2,3,4], dtype=float); a
///
array([ 1., 2., 3., 4.])
}}}

{{{id=67|
timeit('std_numpy5a(a)')
///
625 loops, best of 3: 486 ns per loop
}}}

{{{id=68|
timeit('a.std()')
///
625 loops, best of 3: 24.4 µs per loop
}}}

{{{id=69|
b = stats.TimeSeries(a)
timeit('b.standard_deviation(bias=True)')
///
625 loops, best of 3: 550 ns per loop
}}}
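To illustrate the sliding-window remark, here is a sketch (the helper sliding_std is hypothetical, and no timings are shown): numpy basic slices are views rather than copies, so each call to std_numpy5a touches only its window, and the tiny per-call overhead we just measured is exactly what dominates.

{{{id=92|
def sliding_std(v, window):
    # standard deviation over each length-`window` slice of a 1-d
    # float64 array; v[i:i+window] is a view, so no data is copied
    return [std_numpy5a(v[i:i+window]) for i in range(len(v) - window + 1)]
///
}}}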