# Parallelizing Jupyter Notebook Tests

## How we cut our end-to-end test suite runtime by 66% using parallelism

While there's a common stereotype that data scientists are poor software engineers, at The Data Incubator we believe that mastering the fundamentals of software engineering is important for data science, and we strive to implement rigorous engineering standards at our data science company.  We have an extensive curriculum for data science corporate training, our data science fellowship, and our online data science courses, all leveraging the Jupyter (née IPython) notebook format.  Last year, we published a post about testing Jupyter notebooks, applying rigorous software engineering testing standards to new technologies popular in data science.

However, as our codebase has grown over time, we've added more and more notebooks to our curriculum material. This led to tests on our curriculum taking ~30 minutes to run! We quickly identified parallelism as low-hanging fruit that would make sense for a first approach, with a couple of caveats:

1. We have curriculum materials that run code in Spark 2.0; parallelizing runs in that kernel is hard because of how the Spark execution environment spins up.  We also have curriculum materials in the Jupyter R kernel.
2. Subprocess communication in Python (which our testing code is written in) is a pain, so maybe there's a way to use some other parallelization library and avoid reinventing that wheel.
3. Most of our notebooks are in Python, so those shouldn’t have any issues.

These issues aside, this seemed like a reasonable approach because each Jupyter notebook already executes as its own subprocess in our current setup: we just had to take each of those processes and run them at the same time. Taking a stab at point 3 (parallelizing the Python tests) while finding a way around point 2 (annoying multiprocess communication issues) yielded great results!
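For context, "executing a notebook as a subprocess" can be as simple as shelling out to nbconvert. Here is a minimal sketch, not our exact harness (the `run_notebook` helper is illustrative):

```python
import subprocess

def run_notebook(path):
    """Execute every cell of a notebook in its own subprocess."""
    # nbconvert exits nonzero if any cell raises, which check_call
    # surfaces as a CalledProcessError.
    subprocess.check_call(
        ['jupyter', 'nbconvert', '--to', 'notebook',
         '--execute', '--stdout', path],
        stdout=subprocess.DEVNULL,
    )
```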

## The library: nose

Anyone who's written production-grade Python is probably familiar with nosetests, the ubiquitous test suite runner. In another codebase of ours, we use nose in conjunction with the flaky plugin to rerun some tests whose output can be… less deterministic than would be ideal.

Nose has good support for parallel testing out of the box (incidentally, the flaky plugin and parallel testing don't play nice together), so it seemed like a clear candidate for handling test-related subprocess communication for us.
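Concretely, enabling nose's multiprocess plugin is just a flag away; the process count and timeout below are illustrative rather than our exact settings:

```bash
nosetests --processes=6 --process-timeout=600
```

Each worker process pulls tests off a shared queue, which is exactly the subprocess plumbing we didn't want to write ourselves.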

## The hurdle: dynamically adding tests

We run our test suite in many different configurations: on pull requests, for example, we'll only run modified notebooks to facilitate speedier development, and we save the full build for when we merge into master. Given this, we need to add tests dynamically, one per notebook we want to test in a given run. A popular Python mechanism for this, which we've historically employed, looks something like this:

```python
import unittest

suite = unittest.TestSuite()
for filename in notebook_filenames:
    suite.addTest(NotebookTestCase(filename))
```

Nose, unfortunately, does not like this and doesn't play nice with unittest-style suites. It insists, instead, on test discovery. So we had to get creative. What did "creative" mean, exactly? Unfortunately for the Pythonistas among us, it meant we had to use some of Python's introspection functionality.

## The solution: dynamically adding functions to Python classes

The hack we came up with was the following:

1. Dynamically search out notebooks and add a test function for each to a class. In Python, this involves defining a function, setting its `__name__` attribute, and then using `setattr` on the parent class to add that function under the appropriate name (a stripped-down sketch of this mechanic appears just after this list). This took care of adding parallel tests.
2. Use the nose attrib plugin to specify attributes on the tests, so we can maintain the speedy single-notebook PR testing described above. We have code that keeps track of the currently diffed filenames (against master) and adds two sets of tests: one under the `all` attribute and another under the `change` attribute. You can see the `@attr` decorator being used below.
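To make step 1 concrete, here's a toy (non-test) sketch of the mechanic, independent of our notebook code:

```python
def add_greeting(cls, name):
    # Define a plain function, rename it, then attach it to the class;
    # nose discovers it like any statically defined method.
    def func(self):
        return 'hello, {}'.format(name)
    func.__name__ = 'greet_{}'.format(name)
    setattr(cls, func.__name__, func)

class Greeter(object):
    pass

add_greeting(Greeter, 'world')
assert Greeter().greet_world() == 'hello, world'
```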

You can see the class below. In a wrapper file, we call the `add_tests()` function as soon as that file is imported (i.e. before nose attempts any "discovery"); the `ipynb_all` and `ipynb_change_nbs` functions live outside of the class and simply search out the appropriate filenames.

```python
import os

from nose.plugins.attrib import attr

# ipynb_all() and ipynb_change_nbs() live outside the class (see above);
# each simply searches out the appropriate notebook filenames.


class IpynbSelectorTestCase(object):
    """
    Parallelizable TestCase to be used for nose tests.

    To use, inherit and override check_ipynb to define how to check each
    notebook. Call add_tests in a global call immediately after the class
    declaration. Tests can be invoked via (e.g.):

        nosetests -a 'all'

    Do not inherit unittest.TestCase: this will break parallelization.
    """
    def check_ipynb(self, ipynb):
        raise NotImplementedError

    @classmethod
    def add_func(cls, ipynb, prefix):
        # Define a test function for this notebook, tagged with the given
        # attribute so nose can select it with -a.
        @attr(prefix)
        def func(self):
            self.check_ipynb(ipynb)

        _, nbname = os.path.split(ipynb)
        func.__name__ = 'test_{}_{}'.format(prefix, nbname.split('.')[0])
        func.__doc__ = 'Test {}'.format(nbname)
        setattr(cls, func.__name__, func)

    @classmethod
    def add_tests(cls):
        # Every notebook gets a test under the 'all' attribute; notebooks
        # diffed against master also get one under 'change'.
        for ipynb in ipynb_all():
            cls.add_func(ipynb, 'all')

        for ipynb in ipynb_change_nbs():
            cls.add_func(ipynb, 'change')
```
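And here's roughly what that wrapper file might look like (the import path and the body of `check_ipynb` are hypothetical; `run_notebook` is the sketch from earlier):

```python
# test_notebooks.py: a wrapper module for nose to import
from curriculum_tests import IpynbSelectorTestCase  # assumed import path

class NotebookTests(IpynbSelectorTestCase):
    def check_ipynb(self, ipynb):
        # Validate the notebook however makes sense, e.g. execute it
        # end-to-end and fail on any cell error.
        run_notebook(ipynb)

# Called at import time, so the test methods exist before nose's discovery.
NotebookTests.add_tests()
```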

## The results

So, our full build used to take about 30 minutes to run, typically. With parallelism added, that time has dropped to 11 minutes! We tested with a few different process counts and continued seeing marginal improvement up to 6 processes. We made some plots (with seaborn)!

Not only is the reduction numerically dramatic, but gains like these add up in terms of curriculum developer productivity and allow us to iterate rapidly on our curriculum.