Spark is one of the most popular open-source distributed computation engines and offers a scalable, flexible framework for processing huge amounts of data efficiently. The recent 2.0 release milestone brought a number of significant improvements including DataSets, an improved version of DataFrames, more support for SparkR, and a lot more. One of the great things about Spark is that it’s relatively autonomous and doesn’t require a lot of extra infrastructure to work. While Spark’s latest release is at 2.1.0 at the time of publishing, we’ll use the example of 2.0.1 throughout this post.
Jupyter notebooks are an interactive way to code that can enable rapid prototyping and exploration. It essentially connects a browser-based frontend, the Jupyter Server, to an interactive REPL underneath that can process snippets of code. The advantage to the user is being able to write code in small chunks which can be run independently but share the same namespace, greatly facilitating testing or trying multiple approaches in a modular fashion. The platform supports a number of kernels (the things that actually run the code) besides the out-of-the-box Python, but connecting Jupyter to Spark is a little trickier. Enter Apache Toree, a project meant to solve this problem by acting as a middleman between a running Spark cluster and other applications.
In this post I’ll describe how we go from a clean Ubuntu installation to being able to run Spark 2.0 code on Jupyter. Continue reading