How we cut our end-to-end test suite runtime by 66% using parallelism
While there’s a common stereotype that data scientists are poor software engineers, at The Data Incubator we believe that mastering the fundamentals of software engineering is important for data science, and we strive to apply rigorous engineering standards across our data science company. We have an extensive curriculum for data science corporate training, our data science fellowship, and online data science courses, all built on the Jupyter (née IPython) notebook format. Last year, we published a post about testing Jupyter notebooks, applying rigorous software testing standards to new technologies popular in data science.
However, over time, as our codebase has grown, we’ve added more and more notebooks to our curriculum material. This led to the tests on our curriculum taking ~30 minutes to run! We quickly identified parallelism as low-hanging fruit and a sensible first approach, with a few considerations:
1. We have curriculum materials that run code in Spark 2.0; parallelizing runs in that kernel is hard because of how the Spark execution environment spins up. We also have curriculum materials in the Jupyter R kernel.
2. Subprocess communication in Python (which our testing code is written in) is a pain, so maybe an existing parallelization library could save us from reinventing that wheel (see the sketch after this list).
3. Most of our notebooks are in Python, so those shouldn’t have any issues.
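To make point 2 concrete, here is a minimal sketch (with made-up notebook names and a stand-in `check_notebook` function, not our actual harness) of why a library like `concurrent.futures` is attractive: it ships return values and exceptions back from worker processes for us, so we never have to hand-roll pipes or queues.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def check_notebook(path):
    # Stand-in for a real per-notebook test; the sleep simulates execution time.
    time.sleep(1)
    if "broken" in path:
        raise RuntimeError(f"{path} failed to execute")
    return path


if __name__ == "__main__":
    # Illustrative notebook names only.
    paths = ["intro.ipynb", "pandas.ipynb", "broken.ipynb"]
    with ProcessPoolExecutor() as pool:
        futures = {pool.submit(check_notebook, p): p for p in paths}
        for future, path in futures.items():
            try:
                print(f"{future.result()} passed")
            except RuntimeError as err:
                # Exceptions raised in the worker propagate back here automatically.
                print(f"{path} failed: {err}")
```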
These issues aside, this seemed like a reasonable approach because each Jupyter notebook already executes as its own subprocess in our current setup: we just had to take those processes and run them at the same time. Taking a stab at point 3, parallelizing the Python tests, while finding a way around point 2, the annoying multiprocess communication issues, yielded great results!
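For the curious, here is a hedged sketch of the shape of that solution, assuming each notebook is executed by shelling out to `jupyter nbconvert --execute` (the details of our harness differ, and the notebook names are placeholders): the existing one-subprocess-per-notebook runs are simply fanned out across a pool of workers.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def execute_notebook(path):
    """Run one notebook end to end in its own subprocess; return (path, returncode)."""
    proc = subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--stdout", path],
        stdout=subprocess.DEVNULL,  # discard the executed notebook; we only want pass/fail
        stderr=subprocess.DEVNULL,
    )
    return path, proc.returncode


# Placeholder notebook names; in practice this list comes from the curriculum repo.
notebooks = ["01-intro.ipynb", "02-pandas.ipynb", "03-sklearn.ipynb"]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(execute_notebook, nb) for nb in notebooks]
    for future in as_completed(futures):
        path, returncode = future.result()
        print(f"{path}: {'OK' if returncode == 0 else 'FAILED'}")
```

A thread pool (rather than a process pool) is enough here because each worker spends its time blocked waiting on a child process, so the GIL is not a bottleneck.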