Data are becoming the new raw material of business
The Economist


Parallelizing Jupyter Notebook Tests

How we cut our end-to-end test suite runtime by 66% using parallelism

While there’s a common stereotype that data scientists are poor software engineers, at The Data Incubator we believe that mastering the fundamentals of software engineering is important for data science, and we strive to implement rigorous engineering standards at our data science company.  We have an extensive curriculum for data science corporate training, our data science fellowship, and our online data science course, all leveraging the Jupyter (née IPython) notebook format.  Last year, we published a post about testing Jupyter notebooks — applying rigorous software engineering testing standards to the new technologies popular in data science.

However, as our codebase has grown over time, we’ve added more and more notebooks to our curriculum material. This led to tests on our curriculum taking ~30 minutes to run! We quickly identified parallelism as the low-hanging fruit to tackle first, with a few caveats:

  1. We have curriculum materials that run code in Spark 2.0; parallelizing runs in that kernel is hard because of how the Spark execution environment spins up.  We also have curriculum materials in the Jupyter R kernel.
  2. Subprocess communication in Python (what our testing code is written in) is a pain, so maybe there’s a way to use some other parallelization library to avoid having to reinvent that wheel.
  3. Most of our notebooks are in Python, so those shouldn’t have any issues.

These issues aside, this seemed like a reasonable approach because each Jupyter notebook executes as its own subprocess in our current setup – we just had to take each of those processes and run them at the same time. Taking a stab at point 3, parallelizing the Python tests, while finding a way around point 2, the annoying multiprocess communication issues, yielded great results!

The library: nose

Anyone who’s written production-grade Python is probably familiar with nosetests, the ubiquitous test suite runner. In another codebase of ours, we use nose in conjunction with the flaky plugin to rerun some tests whose output can be… less deterministic than would be ideal.

Nose has good support for parallel testing out-of-the-box (incidentally, the flaky plugin and parallel testing don’t play nice together), so it seemed like a clear candidate for handling test-related subprocess communication for us.
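In practice, enabling it is just a matter of passing a process count when invoking the tests, e.g.:

    nosetests -a 'all' --processes=6 --process-timeout=600

(The --processes and --process-timeout flags come from nose’s multiprocess plugin; a generous timeout matters here because individual notebooks can take minutes to execute.)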

The hurdle: dynamically adding tests

We run our test suite in many different configurations: on pull requests, for example, we’ll only run modified notebooks to facilitate speedier development – and we save the full build for when we merge into master. Given this, we need to dynamically add tests – one per notebook we want to test in a given run. A popular Python mechanism for this, which we’ve historically employed, looks something like this:

 

import unittest

suite = unittest.TestSuite()
for filename in notebook_filenames:
    suite.addTest(NotebookTestCase(filename))

 

Nose, unfortunately, does not like this and doesn’t play nicely with unittest; it insists on test discovery instead. So we had to get creative. What did “creative” mean, exactly? Unfortunately for the Pythonistas among us, it meant we had to use some of Python’s introspection functionality.

The solution: dynamically adding functions to Python classes

The hack we came up with was the following:

  1. Dynamically search out notebooks and add a test function for each to a class. In Python, this involves defining a function, setting its __name__ attribute, and then using setattr on the parent class to add that function under the appropriate name. This took care of adding parallelizable tests.
  2. Use the nose attr plugin to specify attributes on the tests, so we can maintain speedy single-notebook PR testing as described above. We have code that keeps track of the current diffed filenames (from master), and adds two sets of tests: one under the all attribute, and another under the change attribute. You can see the @attr decorator being used below.

You can see the class below. In a wrapper file, we call the add_tests() function as soon as that file is imported (i.e. before nose attempts any “discovery”) – the ipynb_all and ipynb_change_nbs functions live outside of the class and simply search out the appropriate filenames. A minimal sketch of such a wrapper follows the class.

import os

from nose.plugins.attrib import attr


class IpynbSelectorTestCase(object):
    """
    Parallelizable TestCase to be used for nose tests.

    To use, inherit and override `check_ipynb` to define how to check each
    notebook.  Call `add_tests` in a global call immediately after the class
    declaration.
    See http://nose.readthedocs.io/en/latest/writing_tests.html#test-generators

    Tests can be invoked via (e.g.):

        nosetests -a 'all'

    Do not inherit `unittest.TestCase`: this will break parallelization.
    """
    def check_ipynb(self, ipynb):
        raise NotImplementedError

    @classmethod
    def add_func(cls, ipynb, prefix):
        @attr(prefix)
        def func(self):
            self.check_ipynb(ipynb)

        _, nbname = os.path.split(ipynb)
        func.__name__ = 'test_{}_{}'.format(prefix, nbname.split('.')[0])
        func.__doc__ = 'Test {}'.format(nbname)
        setattr(cls, func.__name__, func)

    @classmethod
    def add_tests(cls):
        for ipynb in ipynb_all():
            cls.add_func(ipynb, 'all')

        for ipynb in ipynb_change_nbs():
            cls.add_func(ipynb, 'change')
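
To give a sense of how this gets used, here is a minimal sketch of such a wrapper file (the notebook-running helper is hypothetical and stands in for our actual checking logic):

class NotebookTest(IpynbSelectorTestCase):
    def check_ipynb(self, ipynb):
        # Execute the notebook at `ipynb` and fail the test on any error.
        run_notebook(ipynb)  # hypothetical helper, not our production code


# Register one test function per notebook at import time,
# before nose starts its discovery.
NotebookTest.add_tests()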

The results

Our full build used to take about 30 minutes to run. With parallelism added, that time has dropped to 11 minutes! We tested a few different process counts and continued to see marginal improvement up to 6 processes. We made some plots (with seaborn):

 

[Figures: runtime vs. number of processes; Jupyter runtime comparison]

 

Not only is the reduction numerically dramatic, but gains like these add up in terms of curriculum developer productivity and allow us to iterate rapidly on our curriculum.

 

By Christian Moscardi and Michael Li


Spark comparison: AWS vs. GCP

This post was written jointly by me and Ariel M’ndange-Pfupfu. The original version of this piece can be found at O’Reilly.

There’s little doubt that cloud computing will play an important role in data science for the foreseeable future. The flexible, scalable, on-demand computing power available is an important resource, and as a result, there’s a lot of competition between the providers of this service. Two of the biggest players in the space are Amazon Web Services (AWS) and Google Cloud Platform (GCP).

This article includes a short comparison of distributed Spark workloads in AWS and GCP—both in terms of setup time and operating cost. We ran this experiment with our students at The Data Incubator, a big data training organization that helps companies hire top-notch data scientists and train their employees on the latest data science skills. Even with the efficiencies built into Spark, the cost and time of distributed workloads can be substantial, and we are always looking for the most efficient technologies so our students are learning the best and fastest tools.

Submitting Spark jobs to the cloud

Spark is a popular distributed computation engine that incorporates MapReduce-like aggregations into a more flexible, abstract framework. There are APIs for Python and Java, but writing applications in Spark’s native Scala is preferable. That makes job submission simple, as you can package your application and all its dependencies into one JAR file.

It’s common to use Spark in conjunction with HDFS for distributed data storage, and YARN for cluster management; this makes Spark a perfect fit for AWS’s Elastic MapReduce (EMR) clusters and GCP’s Dataproc clusters. Both EMR and Dataproc clusters have HDFS and YARN preconfigured, with no extra work required.
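To make that concrete, once the JAR is uploaded to cloud storage, submitting a job is a one-liner on either platform; a sketch with hypothetical bucket, cluster, and class names:

    # GCP: submit to a Dataproc cluster
    gcloud dataproc jobs submit spark --cluster=my-cluster \
        --class=com.example.Main --jars=gs://my-bucket/app.jar

    # AWS: spark-submit on an EMR cluster (e.g. from the master node)
    spark-submit --class com.example.Main s3://my-bucket/app.jar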

Continue reading


Multi-Armed Bandits


Special thanks to Brian Farris for contributing this post

Who should care?

Anyone who is involved in testing. Whether you are testing creatives for a marketing campaign, pricing strategies, website designs, or even pharmaceutical treatments, multi-armed bandit algorithms can help you increase the accuracy of your tests while cutting down costs and automating your process.

Continue reading


Testing Jupyter Notebooks

This was originally posted on Christian Moscardi’s blog and is a follow-up piece to another post on Embedding D3 in IPython Notebook. Christian is our Lead Developer! 

Jupyter is a fantastic tool that we use at The Data Incubator for instructional purposes. One perk of using Jupyter is that we can easily test our code samples across any language there’s a Jupyter kernel for. In this post, I’ll show you some of the code we use to test notebooks!

First, a quick discussion of the current state of testing IPython notebooks: there isn’t much documented about the process. ipython_nose is a really helpful extension for writing tests into your notebooks, but there’s no documentation or information about easy end-to-end testing. In particular, we want the programmatic equivalent of clicking “run all cells”. After poking around things like GitHub’s list of cool IPython notebooks and the Jupyter docs, two things became apparent to us:

  1. Most people do not test their notebooks.
  2. Automated end-to-end testing is extremely easy to implement.

Continue reading
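(For reference, one programmatic route to “run all cells” is nbconvert’s ExecutePreprocessor; a minimal sketch, not necessarily the exact approach from the post:)

import os

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor


def run_notebook(path):
    """Execute every cell of the notebook at `path`, raising on errors."""
    with open(path) as f:
        nb = nbformat.read(f, as_version=4)
    ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
    # Run with the notebook's own directory as the working path so that
    # relative file references inside the notebook resolve correctly.
    ep.preprocess(nb, {'metadata': {'path': os.path.dirname(path) or '.'}})
    return nb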

NLTK vs. spaCy: Natural Language Processing in Python

The venerable NLTK has been the standard tool for natural language processing in Python for some time. It contains an amazing variety of tools, algorithms, and corpora. Recently, a competitor has arisen in the form of spaCy, which has the goal of providing powerful, streamlined language processing. Let’s see how these toolkits compare.

Philosophy

NLTK provides a number of algorithms to choose from. For a researcher, this is a great boon. Its nine different stemming libraries, for example, allow you to finely customize your model. For the developer who just wants a stemmer to use as part of a larger project, this tends to be a hindrance. Which algorithm performs the best? Which is the fastest? Which is being maintained?
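For instance, a quick sketch of what that choice looks like in practice, running three of NLTK’s stemmers on the same word:

from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

word = 'generously'
# The different stemmers frequently disagree on the same input.
for stemmer in (PorterStemmer(), LancasterStemmer(), SnowballStemmer('english')):
    print(type(stemmer).__name__, stemmer.stem(word))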

In contrast, spaCy implements a single stemmer, the one that the spaCy developers feel to be best. They promise to keep it updated, and may replace it with an improved algorithm as the state of the art progresses. You may update your version of spaCy and find that improvements to the library have boosted your application without any work necessary. (The downside is that you may need to rewrite some test cases.)

As a quick glance through the NLTK documentation demonstrates, different languages may need different algorithms. NLTK lets you mix and match the algorithms you need, but spaCy has to make a choice for each language. This is a long process, and spaCy currently supports only English.

Strings versus objects

NLTK is essentially a string processing library. All the tools take strings as input and return strings or lists of strings as output. This is simple to deal with at first, but it requires the user to explore the documentation to discover the functions they need.

In contrast, spaCy uses an object-oriented approach. Parsing some text returns a document object, whose words and sentences are represented by objects themselves. Each of these objects has a number of useful attributes and methods, which can be discovered through introspection. This object-oriented approach lends itself much better to modern Python style than does the string-handling system of NLTK.
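To make the contrast concrete, here is a minimal sketch of tokenizing and tagging the same sentence with each library (assuming NLTK’s tokenizer and tagger data have been downloaded and spaCy’s English model is installed, loaded here under the shorthand name 'en'):

import nltk
import spacy

text = u"The quick brown fox jumps over the lazy dog."

# NLTK: strings in, lists of strings (or tuples of strings) out.
words = nltk.word_tokenize(text)
tags = nltk.pos_tag(words)

# spaCy: strings in, a Doc object full of Token objects out.
nlp = spacy.load('en')
doc = nlp(text)
spacy_tags = [(token.text, token.tag_) for token in doc]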

A more detailed comparison between these approaches is available in this notebook.

Performance

An important part of a production-ready library is its performance, and spaCy advertises itself as ready for real work. We’ll run some tests on the text of the Wikipedia article on NLP, which contains about 10 kB of text. The tests will be word tokenization (splitting a document into words), sentence tokenization (splitting a document into sentences), and part-of-speech tagging (labeling the grammatical function of each word).

[Figure: timing comparison of NLTK and spaCy on word tokenization, sentence tokenization, and part-of-speech tagging]

It is fairly obvious that spaCy dramatically outperforms NLTK in word tokenization and part-of-speech tagging. Its poor showing in sentence tokenization is a result of differing approaches: NLTK simply attempts to split the text into sentences, whereas spaCy actually constructs a syntactic tree for each sentence, a more robust method that yields much more information about the text. (You can see a visualization of the result here.)

Conclusion

While NLTK is certainly capable, I feel that spaCy is a better choice for most common uses. It makes the hard choices about algorithms for you, providing state-of-the-art solutions. Its Pythonic API will fit in well with modern Python programming practices, and its fast performance will be much appreciated.

Unfortunately, spaCy is English only at the moment, so developers concerned with other languages will need to use NLTK. Developers that need to ensure a particular algorithm is being used will also want to stick with NLTK. Everyone else should take a look at spaCy.


Automatically Generating License Data from Python Dependencies

We all know how important keeping track of your open-source licensing is for the average startup.  While most people think of open-source licenses as all being the same, there are meaningful differences that could have potentially serious legal implications for your code base.  From permissive licenses like MIT or BSD to so-called “reciprocal” or “copyleft” licenses, keeping track of the alphabet soup of dependencies in your source code can be a pain.

Today, we’re releasing pylicense, a simple Python module that will add license data as comments directly from your requirements.txt or environment.yml files.
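The idea is simple; here’s a rough sketch of the approach (not pylicense’s actual implementation), looking up each requirement’s license in PyPI’s JSON metadata:

import requests


def license_for(package):
    # Ask PyPI's JSON API what license the package declares.
    resp = requests.get('https://pypi.org/pypi/{}/json'.format(package))
    return resp.json()['info'].get('license') if resp.ok else None


with open('requirements.txt') as f:
    for line in f:
        name = line.strip().split('==')[0]
        if name and not name.startswith('#'):
            print('{}  # {}'.format(line.strip(), license_for(name)))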

Continue reading


Chef and Google App Engine

This post was written by our very own Christian Moscardi and originally posted here on his blog.

 

Anyone who’s ever tried to write a nontrivial application on Google App Engine has encountered at least seven* design decisions that have led to serious head-scratching moments. One of those happened to me about a month ago, while integrating Chef into our course at The Data Incubator. Our goal was to allow for one-click spinning up (on DigitalOcean’s cloud) and monitoring of our Fellows’ course machines, already under Chef management.

* No basis in fact – there are probably more than seven. It should be noted that the Google Cloud Platform is going to greatly improve this situation by allowing you to deploy Docker containers – woohoo!

 

A First Look

Chef servers have an HTTP API. Seems like it’d be an easy integration, right? While GAE doesn’t let you do many things (including making SMTP connections), one thing you thankfully can do with relative ease is make HTTP requests (although everyone’s favorite Python HTTP library, requests, is a total nightmare there – but that’s for another blog post). This was going to be a quick job – we’d spend a couple of days coding, write some tests, and have one-click deployment, right? Right? As you probably guessed, that timeline was anything but right.

Continue reading
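(For context, outbound HTTP from classic App Engine goes through the urlfetch service; here’s a minimal sketch of a request against a Chef server endpoint at that level, with a placeholder URL and Chef’s signed X-Ops-* authentication headers omitted for brevity:)

from google.appengine.api import urlfetch

result = urlfetch.fetch(
    'https://chef.example.com/organizations/myorg/nodes',  # placeholder URL
    method=urlfetch.GET,
    headers={'Accept': 'application/json'},
)
print(result.status_code)
print(result.content)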


Painlessly Deploying Data Apps with Bokeh, Flask, and Heroku

Here at The Data Incubator, our Fellows deploy their own fully functional, public-facing web app to showcase their data science skills to employers. This not only gives them valuable experience dynamically fetching and displaying data, but also encourages them to think about end user interaction. To demo the process, we decided to marry together some of our favorite technologies:

  • Flask, a slick web framework for Python
  • Heroku for cloud-based app deployment
  • Bokeh for interactive, D3.js-style visualizations
  • Git for version control and distributing code

The goal is to create some distant ancestor of Google Finance: a form capable of accepting a stock ticker as input and producing a plot of the daily close price. Here’s the finished product. So how do we get there?
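In skeleton form, the Flask side can be a single route that reads the ticker from the form, builds a Bokeh figure, and embeds it in a template. Here’s a rough sketch (the data-fetching helper and template names are hypothetical, not the code from the finished app):

from flask import Flask, render_template, request
from bokeh.embed import components
from bokeh.plotting import figure

app = Flask(__name__)


@app.route('/', methods=['GET', 'POST'])
def index():
    if request.method == 'POST':
        ticker = request.form['ticker']
        dates, closes = fetch_close_prices(ticker)  # hypothetical helper
        plot = figure(title=ticker, x_axis_type='datetime')
        plot.line(dates, closes)
        script, div = components(plot)
        return render_template('plot.html', script=script, div=div)
    return render_template('index.html')


if __name__ == '__main__':
    app.run(debug=True)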

Continue reading


Embedding D3 in an IPython Notebook

This post was written by our very own Christian Moscardi and originally posted here on his blog.

 

Jupyter is a fantastic tool that we use at The Data Incubator for instructional purposes. In particular, we like to keep our curriculum compartmentalized via Jupyter notebooks. It allows us to test our code samples across any language there’s a Jupyter kernel for* and keep things in one place, so our Fellows don’t have to rifle through a wide variety of file formats before getting to the information they need.

One area where we only recently integrated Jupyter was frontend web visualization. Our previous structure involved a notebook, possibly with code snippets, that contained links to various HTML files. We expected our Fellows to dig through the code to

  • Look at the HTML source for the basic layout.
  • Expose the Javascript powering the visualization.
  • View the styles making everything pretty.

Oh, and any data-processing code was separate and output to a file. Obviously not ideal. We knew IPython had %%javascript magic, and we started rifling around to see what we could improve.

Continue reading
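(As a taste of the mechanism, HTML and JavaScript can be pushed into a cell’s output straight from Python; a minimal sketch, assuming D3 has already been loaded into the notebook page:)

from IPython.display import HTML, Javascript, display

# Create a target element, then run a tiny D3 snippet against it.
display(HTML('<div id="d3-target"></div>'))
display(Javascript('d3.select("#d3-target").append("p").text("Hello from D3!");'))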
