At The Data Incubator, we get tons of interest from PhDs looking to attend our free fellowship, which trains PhDs to join industry as quants and data scientists. A lot of them have asked what they can do to make themselves stronger candidates. One of the critical skills for being a data scientist is understanding computation and algorithms. Below is a (cursory) guide meant to whet your appetite for the computational techniques that I found useful as a data scientist. Remember, there’s a lifetime of things to learn and these are just some highlights:
- Vectorized Linear Algebra. Let’s say you want to dot product two very large vectors. You could use a for loop, but that’s slow. Instead, consider using vectorized linear algebra that calls out to professional numerical libraries like BLAS or LAPACK:
import numpy as np x = np.random.randn(1000000) y = np.random.randn(1000000) z1 = sum(x[i] * y[i] for i in xrange(len(x))) # elapsed time: 1.2176 seconds z2 = np.dot(x, y) # elapsed time: 0.0045 seconds
Run both samples and see which one takes longer (try using python’s timeit module if you are not already familiar with it). In our example, for loops are 600 times slower. You can learn more about numerical computation from the numpy and scipy websites.
- Simulation. The simplest workhorse when trying to get a handle on a complex system is simulation. For example, consider the problem:You flip a fair coin n times. What is the expected number of heads? What is the standard deviation of the number of heads?
Obviously, this is just a Bernoulli Distribution and does not require simulation but we will use it to didactically illustrate what one might do in more complex situations. The simulation code might be:
np.random.seed(42) samples = np.random.randint(0,2,10000) print samples.mean() # 0.4987 print samples.std() # 0.499998309997
Notice how much faster vectorized linear algebra is compared with running python for loops. Obviously, the field can get very complex, with Dynamical Systems, Monte Carlo, Gibbs Sampling, Importance Sampling, and MCMC. Don’t forget to use bootstrapping to estimate error bars.
- Recursion. Of course, simulations can never give you an exact answer. One technique to get an exact answer that works in many cases is recursion. The simplest example of recursion comes from implementing the Fibonacci sequence:
def fib(n): if n < 2: return 1 else: return fib(n-1) + fib(n-2)
Try timing the runs to guess the running time of the Fibonacci sequence (spoiler alert: it’s exponential). You may be surprised by how slow it is (can you guess why?). To see how we might use this to solve the last problem, notice that on the n-th draw, we can either increase or decrease the average number of heads by 1/n from the n-th draw, and that each occurs with probability 1/2. Here is the recursive code:
def average_heads(n): if n == 1: return 0.5 else: return np.mean([average_heads(n-1) + 1./n, average_heads(n-1) - 1./n])
Think a little about how you might compute the standard deviation using this technique (hint: it may help to review alternative formulas for variance). Another popular use of recursion is in graph traversal algorithms. Consider the question:
How many possible ways are there to draw US coins that sum up to 50 cents?
For the sake of definiteness, we will say that order of the drawn coins matters. You can solve this problem by traversing the “graph” of all possible draws until we reach exactly 50 cents:
coins = [1, 5, 10, 25, 50] def count(remainder): if remainder < 0: return 0 if remainder == 0: return 1 return sum(count(remainder - coin) for coin in coins) count(50)
- Memoization and Dynamic Programming. Notice that in both the above examples, we make multiple calls to the recursive function for the same input, which seems inefficient. A common way to speed this up is to remember (or memoize) the results of previous computations. As a simple example, take a look at the python memoized class which uses python decorator syntax. Here it is in action:
@memoized def fib(n): if n < 2: return 1 else: return fib(n-1) + fib(n-2)
This has now effectively turned a recursive program into one using Dynamic Programming. Try using timing to guess the running time of the Fibonacci sequence (spoiler alert: it’s linear). It’s amazing how much of a difference a single line makes!
- Divide and Conquer. Another common approach is to break up a large problem into smaller subproblems. A classic example of this is Merge Sort. For the Fibonacci sequence, consider using the Matrix Form of the Fibonacci sequence:
M = np.matrix([[1, 1], [1, 0]]) def fib(n): if n < 2: return 1 MProd = M.copy() for _ in xrange(n-2): MProd *= M return MProd[0,0] + MProd[0,1]
Since the code relies on repeated matrix multiplication, it is very amenable to Divide and Conquer techniques (hint: $ M^8 = (((M^2)^2)^2) $ ). We’ll let you write down the algorithm and time it to verify that it is a log algorithm. (Isn’t it amazing that this can be done in sub-linear time!)
- Coda: This is it for this tutorial but if you know a little bit about matrix factorization, try working out a (pseudo-)constant time answer. Why isn’t this really constant time? (Spoiler alert: read a little bit about Arbitrary Precision Arithmetic.) We’re not going to emphasize it because the technique isn’t really all that generalizable but it is still fun to think about.
Of course, this is just a very high-level overview that is meant to pique your interest rather than give you a full exposition. If you find this kind of stuff interesting, consider applying to be a Fellow at The Data Incubator!