When you think of artificial intelligence (AI), do you envision C-3PO or matrix multiplication? HAL 9000 or pruning decision trees? The term is ambiguous, and for a field that has gained so much traction in recent years, it’s particularly important that we define what we mean by artificial intelligence – especially when managers, salespeople, and engineers need to communicate. These days, AI is often used as a synonym for deep learning, perhaps because both ideas entered popular tech-consciousness at the same time. In this article I’ll go over the big-picture definition of AI and how it differs from machine learning and deep learning. Continue reading
Spark is one of the most popular open-source distributed computation engines and offers a scalable, flexible framework for processing huge amounts of data efficiently. The 2.0 release milestone brought a number of significant improvements, including the Dataset API (a typed extension of DataFrames), expanded support for SparkR, and a lot more. One of the great things about Spark is that it’s relatively autonomous and doesn’t require a lot of extra infrastructure to work. While Spark’s latest release is 2.1.0 at the time of publishing, we’ll use 2.0.1 throughout this post.
Jupyter notebooks are an interactive way to code that can enable rapid prototyping and exploration. It essentially connects a browser-based frontend, the Jupyter Server, to an interactive REPL underneath that can process snippets of code. The advantage to the user is being able to write code in small chunks which can be run independently but share the same namespace, greatly facilitating testing or trying multiple approaches in a modular fashion. The platform supports a number of kernels (the things that actually run the code) besides the out-of-the-box Python, but connecting Jupyter to Spark is a little trickier. Enter Apache Toree, a project meant to solve this problem by acting as a middleman between a running Spark cluster and other applications.
In this post I’ll describe how we go from a clean Ubuntu installation to being able to run Spark 2.0 code on Jupyter. Continue reading
People interested in seeing the Broadway musical Hamilton — and there are still many of them, with demand driving starting ticket prices to $600 — can enter Broadway Direct’s daily lottery. Winners can receive up to 2 tickets (out of 21 available tickets) for a total of $10.
What’s the probability of winning?
How easy is it to win these coveted tickets? Members of NYC’s Data Incubator Team have collectively tried and failed 120 times. Given our data, we cannot simply divide the number of successes by the number of trials to calculate our chances of winning — we would get zero (and the odds, which are apparently small, are clearly non-zero).
This kind of situation often comes up under many guises in business and big data, and because we are a data science corporate training company, we decided to use statistics to determine the answer. Say you are measuring the click-through-rate of a piece of organic or paid content, and out of 100 impressions, you have not observed any clicks. The measured CTR is zero but the true CTR is likely not zero. Alternatively, suppose you are measuring the rate of adverse side effects of a new drug. You have tested 40 patients and haven’t found any, but you know the chance is unlikely to be zero. So what are the odds of observing a click or a side effect? Continue reading
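To give a taste of the kind of estimate involved, here is a minimal sketch of two textbook approaches to the zero-successes problem (not necessarily the full derivation the post develops): the "rule of three" approximate 95% upper bound, and Laplace's rule of succession, which assumes a uniform prior on the unknown rate.

```python
# Estimating a success probability after observing zero successes.
# Two classic back-of-the-envelope estimates:
#   - the "rule of three": an approximate 95% upper bound of 3/n
#   - Laplace's rule of succession: posterior mean (k + 1) / (n + 2)
#     under a uniform prior on the unknown rate

def rule_of_three_upper_bound(n_trials):
    """Approximate 95% upper confidence bound when 0 successes are observed."""
    return 3.0 / n_trials

def laplace_estimate(successes, trials):
    """Posterior mean of the success rate under a uniform Beta(1, 1) prior."""
    return (successes + 1) / (trials + 2)

# 120 lottery entries, 0 wins -- the numbers from this post
print(rule_of_three_upper_bound(120))  # 0.025
print(laplace_estimate(0, 120))        # ~0.0082
```

So even with 120 straight losses, the data are consistent with a per-entry win probability of up to a few percent.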
How we cut our end-to-end test suite runtime by 66% using parallelism
While there’s a common stereotype that data scientists are poor software engineers, at The Data Incubator, we believe that mastering the fundamentals of software engineering is important for data science and we strive to implement rigorous engineering standards for our data science company. We have an extensive curriculum for data science corporate training, data science fellowship, and online data science course leveraging the jupyter (née ipython) notebook format. Last year, we published a post about testing Jupyter notebooks — applying rigorous software engineering testing standards to new technologies popular in data science.
However, over time, as our codebase has grown, we’ve added more and more notebooks to our curriculum material. This led to tests on our curriculum taking ~30 minutes to run! We quickly identified parallelism as low-hanging fruit that would make sense for a first approach, with a few caveats:
1. We have curriculum materials that run code in Spark 2.0; parallelizing runs in that kernel is hard because of how the Spark execution environment spins up. We also have curriculum materials in the Jupyter R kernel.
2. Subprocess communication in Python (what our testing code is written in) is a pain, so maybe there’s a way to use some other parallelization library to avoid having to reinvent that wheel.
3. Most of our notebooks are in Python, so those shouldn’t have any issues.
These issues aside, this seemed like a reasonable approach because each Jupyter notebook executes as its own subprocess in our current setup – we just had to run those processes at the same time. Taking a stab at 3., parallelizing the Python tests, while finding a way around 2. – the annoying multiprocess communication issues – yielded great results! Continue reading
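As a minimal sketch of the idea, the standard library's `concurrent.futures` can run each notebook's test subprocess concurrently (threads are fine here, since each worker just blocks on a subprocess). The notebook names are hypothetical, and the command is a no-op stand-in for whatever actually runs the notebook:

```python
# Sketch: run each notebook test in its own subprocess, in parallel.
# In a real suite each job would shell out to the notebook runner; here
# the command is a no-op stand-in so the sketch stays self-contained.
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_notebook_test(notebook):
    # Placeholder command; a real runner would execute the notebook here.
    cmd = [sys.executable, "-c", "pass"]
    result = subprocess.run(cmd, capture_output=True)
    return notebook, result.returncode

notebooks = ["intro.ipynb", "pandas.ipynb", "ml.ipynb"]  # hypothetical names
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_notebook_test, notebooks))

failed = [nb for nb, code in results if code != 0]
print("failures:", failed)
```

Because the heavy lifting happens in child processes, the GIL is not a bottleneck and a thread pool sidesteps the multiprocess-communication pain entirely.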
This post was written jointly by Ariel M’ndange-Pfupfu and me. The original post for this piece can be found at O’Reilly.
There’s little doubt that cloud computing will play an important role in data science for the foreseeable future. The flexible, scalable, on-demand computing power available is an important resource, and as a result, there’s a lot of competition between the providers of this service. Two of the biggest players in the space are Amazon Web Services (AWS) and Google Cloud Platform (GCP).
This article includes a short comparison of distributed Spark workloads in AWS and GCP—both in terms of setup time and operating cost. We ran this experiment with our students at The Data Incubator, a big data training organization that helps companies hire top-notch data scientists and train their employees on the latest data science skills. Even with the efficiencies built into Spark, the cost and time of distributed workloads can be substantial, and we are always looking for the most efficient technologies so our students are learning the best and fastest tools.
Submitting Spark jobs to the cloud
Spark is a popular distributed computation engine that incorporates MapReduce-like aggregations into a more flexible, abstract framework. There are APIs for Python and Java, but writing applications in Spark’s native Scala is preferable: it makes job submission simple, since you can package your application and all its dependencies into one JAR file.
It’s common to use Spark in conjunction with HDFS for distributed data storage, and YARN for cluster management; this makes Spark a perfect fit for AWS’s Elastic MapReduce (EMR) clusters and GCP’s Dataproc clusters. Both EMR and Dataproc clusters have HDFS and YARN preconfigured, with no extra work required.
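For a concrete sense of what “no extra work” looks like, here is a sketch of submitting the same packaged JAR to each platform’s CLI, written as Python argument lists ready for `subprocess`. The cluster ID, cluster name, class name, and bucket paths are all hypothetical placeholders:

```python
# Sketch: submitting one packaged Spark JAR to EMR and to Dataproc.
# All identifiers (cluster IDs/names, class, buckets) are made up.
EMR_CMD = [
    "aws", "emr", "add-steps",
    "--cluster-id", "j-XXXXXXXX",                     # hypothetical cluster ID
    "--steps", ("Type=Spark,Name=MyJob,ActionOnFailure=CONTINUE,"
                "Args=[--class,com.example.Main,s3://my-bucket/app.jar]"),
]

DATAPROC_CMD = [
    "gcloud", "dataproc", "jobs", "submit", "spark",
    "--cluster", "my-cluster",                        # hypothetical name
    "--class", "com.example.Main",
    "--jars", "gs://my-bucket/app.jar",
]

for cmd in (EMR_CMD, DATAPROC_CMD):
    print(" ".join(cmd))
```

In both cases the cluster already has HDFS and YARN wired up, so the submission step is essentially the only step.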
Special thanks to Brian Farris for contributing this post.
Who should care?
Anyone who is involved in testing. Whether you are testing creatives for a marketing campaign, pricing strategies, website designs, or even pharmaceutical treatments, multi-armed bandit algorithms can help you increase the accuracy of your tests while cutting down costs and automating your process.
A typical slot machine is a device in which the player pulls a lever arm and receives rewards at some expected rate. Because the expected payout is typically negative, these machines are sometimes referred to as “one-armed bandits”. By analogy, a “multi-armed bandit” is a machine in which there are multiple lever arms to pull, each one of which may pay out at a different expected rate. Continue reading
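To make the analogy concrete, here is a toy simulation of one simple bandit strategy, epsilon-greedy (the full post covers more sophisticated algorithms). The three payout rates are invented purely for illustration:

```python
# Toy multi-armed bandit: three arms with hidden payout rates, played
# with an epsilon-greedy strategy -- usually pull the best arm seen so
# far, but explore a random arm with probability epsilon.
import random

def epsilon_greedy(true_rates, pulls=10000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_rates)    # pulls per arm
    values = [0.0] * len(true_rates)  # running mean reward per arm
    for _ in range(pulls):
        if rng.random() < epsilon:                    # explore
            arm = rng.randrange(len(true_rates))
        else:                                         # exploit
            arm = max(range(len(true_rates)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

counts, values = epsilon_greedy([0.02, 0.05, 0.10])
print(counts)  # pulls should concentrate on the best (0.10) arm
```

The point of the algorithm is exactly the testing trade-off mentioned above: it keeps measuring all the options while automatically shifting traffic toward the winner.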
Jupyter is a fantastic tool that we use at The Data Incubator for instructional purposes. One perk of using Jupyter is that we can easily test our code samples across any language there’s a Jupyter kernel for. In this post, I’ll show you some of the code we use to test notebooks!
First, a quick discussion of the current state of testing ipython notebooks: there isn’t much documented about the process. ipython_nose is a really helpful extension for writing tests into your notebooks, but there’s no documentation or information about easy end-to-end testing. In particular, we want the programmatic equivalent of clicking “run all cells”. After poking around things like GitHub’s list of cool ipython notebooks and the Jupyter docs, two things became apparent to us:
- Most people do not test their notebooks.
- Automated end-to-end testing is extremely easy to implement. Continue reading
The venerable NLTK has been the standard tool for natural language processing in Python for some time. It contains an amazing variety of tools, algorithms, and corpora. Recently, a competitor has arisen in the form of spaCy, which has the goal of providing powerful, streamlined language processing. Let’s see how these toolkits compare.
NLTK provides a number of algorithms to choose from. For a researcher, this is a great boon. Its nine different stemming libraries, for example, allow you to finely customize your model. For the developer who just wants a stemmer to use as part of a larger project, this tends to be a hindrance. Which algorithm performs the best? Which is the fastest? Which is being maintained?
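To see the choice in action, here’s a quick illustration of two of NLTK’s stemmers processing the same words (requires `pip install nltk`); the word list is just an arbitrary sample:

```python
# Two of NLTK's stemmers can disagree on the same input -- exactly the
# kind of choice NLTK leaves to you.
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["running", "maximum", "presumably"]:
    print(word, porter.stem(word), lancaster.stem(word))
```

Which output is “right” depends on your application, which is precisely the researcher-versus-developer tension described above.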
In contrast, spaCy implements a single stemmer, the one that the spaCy developers feel to be best. They promise to keep it updated, and may replace it with an improved algorithm as the state of the art progresses. You may update your version of spaCy and find that improvements to the library have boosted your application without any work necessary. (The downside is that you may need to rewrite some test cases.)
As a quick glance through the NLTK documentation demonstrates, different languages may need different algorithms. NLTK lets you mix and match the algorithms you need, but spaCy has to make a choice for each language. This is a long process and spaCy currently only has support for English. Continue reading
We all know how important keeping track of your open-source licensing is for the average startup. While most people think of open-source licenses as all being the same, there are meaningful differences that could have potentially serious legal implications for your code base. From permissive licenses like MIT or BSD to so-called “reciprocal” or “copyleft” licenses, keeping track of the alphabet soup of dependencies in your source code can be a pain.
Today, we’re releasing pylicense, a simple python module that will add license data as comments directly from your