Data are becoming the new raw material of business
The Economist

Automating Excel with Python

We know there’s a lot of pain points in Excel that make it a tool that’s cumbersome and repetitive for data manipulation. We’ve distilled those pain points into three major themes.

  • The first is that it’s awkward to deal with higher dimensional data in a two-dimensional spreadsheet. Spreadsheets are great for representing two-dimensional data but they’re awkward for representing anything at three or higher dimensions. And while there’s many workarounds like pivot tables, this will only gets you so far.
  • The second pain point revolves around doing the same calculation over multiple sheets or multiple workbooks. While it’s easy to iterate over rows or columns in Excel, it’s cumbersome and time consuming to iterate over hundreds of sheets or notebooks.
  • Finally, data manipulation in Excel is actually very manual and hence very error prone. So in Excel, the convention is to copy data or formulas from cell to cell, but this makes it hard to keep our data up to date as new data arrives or as we update our computations as they become more complex. Errors aren’t always easy to catch before important business decisions are made.

In this video, we look at some data that we can get from the Bureau of Labor Statistics. While it comes in Excel, it comes in a very particular format. The rows iterate through years, the columns iterate through months, the sheets iterate through industries, and the workbooks iterate through wages, hours worked and overtime. So how would you use this to calculate salary, which we’re gonna define as wage times the quantity hours worked plus 1.5 times the overtime.
Continue reading

Advanced Conda: Installing, Building, and Uploading Packages, Adding Channels, and More

As many of our readers might know, conda is a package manager for the numerical python stack that solves many of the issues where pip falls short. While pip is great for pure Python packages (ones written exclusively in Python code), most data science packages need to rely on C code for performance. This, unfortunately, makes installation highly system-dependent. Conda alleviates this problem by managing compiled binaries such that a user does not the need to have a full suite of local build tools (for example, building NumPy from source no longer requires a FORTRAN 77 compiler). Additionally, conda is moving to include more than Python. For example, it’s also supporting managing packages in the R language. With such an ambitious scope, it’s not surprising that package coverage is incomplete and, if you’re a power user, you’ll often see yourself wanting to contribute missing packages. Here’s a primer on how:


Installing Using Pip / PyPI in Conda as a Fallback

The first thing to notice is that you don’t necessarily need to jump to building or uploading packages. As a simple fallback, you can tell conda to build Python Packages directly from pip/PyPI. For example, take a look at this simple conda environment.yaml file:

# environment.yaml
- numpy
- scipy
- pip:
  - requests

This installs numpy and scipy from anaconda but installs requests using pip.  You can invoke it by running:

conda env update -f environment.yaml


Adding New Channels to Conda for more Packages

While the core maintainers has fairly good coverage, the coverage isn’t complete. Also, because conda packages are version and system dependent, the odds of a specific version not existing for your operating system is fairly high. For example, as of this writing, scrapy, a popular web scraping software, lives in the popular conda-forge channel (package). Similarly, many r packages are under the r channel, for example the r-dplyr package. R packages are, by convention, prefaced with a “r-” prefix to their CRAN name. You can find the channel that supports it by Googling “scrapy conda”. To install it, we’ll need to add conda-forge as a channel:

# environment.yaml
- conda-forge
- scrapy


Building PyPI Conda Packages

But sometimes, even with extra channels, the packages simply don’t exist (as evidenced by a Google query) . In this case, we just have to build our own packages. It’s fairly easy. The first step is to create a user account on The username will be your channel name.

First, you’ll want to make sure you have conda build and anaconda client installed:

conda install conda-build
conda install anaconda-client

and you’ll want to authenticate your anaconda client with your new credentials:

anaconda login

Finally, it’s easiest to configure anaconda to automatically upload all successful builds to your channel:

conda config --set anaconda_upload yes

With this setup, it’s easy! Below are the instructions for uploading the pyinstrument package:

# download pypi data to ./pyinstrument
conda skeleton pypi pyinstrument
# build the package
conda build ./pyinstrument
# you'll see that the package is automatically uploaded


Building R Conda Packages

Finally, we’ll want to do this with R packages. Fortunately, there’s conda support for building from CRAN, R’s package manager. For example, glm2 is (surprisingly enough) not on anaconda.  We’ll run the following commands to build and auto upload it:

conda skeleton cran glm2
conda build ./r-glm2

Of course, glm2 now has OSX and Linux packages on The Data Incubator’s Anaconda channel at r-glm2 so you can directly include our channel:

# environment.yaml
- thedataincubator
- r-glm2


Visit our website to learn more about our offerings:


Tensorflow with Keras – Empowering Neural Networks for Deep Learning

Building deep neural networks just got easier. TensorFlow has announced that they are incorporating the popular deep learning API, Keras, as part of the core code that ships with TensorFlow 1.2. In the words of Keras’ author François Chollet, “Theano and TensorFlow are closer to NumPy, while Keras is closer to scikit-learn,” which is to say that Keras is at a higher level compared to pure TensorFlow and makes building deep learning models much more manageable.

TensorFlow is one of the fastest, most flexible, and most scalable machine-learning libraries available. It was developed internally by Google Brain and released as an open-source library in November 2015. Almost immediately upon its release, TensorFlow became one of the most popular machine learning libraries. But, as is the case with many libraries that emphasize speed and flexibility, TensorFlow tends to be a bit low-level.

Continue reading

Spark 2.0 on Jupyter with Toree


Spark is one of the most popular open-source distributed computation engines and offers a scalable, flexible framework for processing huge amounts of data efficiently. The recent 2.0 release milestone brought a number of significant improvements including DataSets, an improved version of DataFrames, more support for SparkR, and a lot more. One of the great things about Spark is that it’s relatively autonomous and doesn’t require a lot of extra infrastructure to work. While Spark’s latest release is at 2.1.0 at the time of publishing, we’ll use the example of 2.0.1 throughout this post.


Jupyter notebooks are an interactive way to code that can enable rapid prototyping and exploration. It essentially connects a browser-based frontend, the Jupyter Server, to an interactive REPL underneath that can process snippets of code. The advantage to the user is being able to write code in small chunks which can be run independently but share the same namespace, greatly facilitating testing or trying multiple approaches in a modular fashion. The platform supports a number of kernels (the things that actually run the code) besides the out-of-the-box Python, but connecting Jupyter to Spark is a little trickier. Enter Apache Toree, a project meant to solve this problem by acting as a middleman between a running Spark cluster and other applications.

In this post I’ll describe how we go from a clean Ubuntu installation to being able to run Spark 2.0 code on Jupyter. Continue reading

5 secrets for writing the perfect data scientist resume

handshake-2056023_960_720Data scientists are in demand like never before, but nonetheless, getting a job as a data scientist requires a resume that shows off your skills. At The Data Incubator, we’ve received tens of thousands of resumes from applicants for our free Data Science Fellowship. While we work hard to read between the lines to find great candidates who happen to have lackluster CVs, many recruiters may not be as diligent. Based on our experience, here’s the advice we give to our Fellows about how to craft the perfect resume to get hired as a data scientist.

Be brief: A resume is a summary of your accomplishments. It is not the right place to put your little-league participation award. Remember, you are being judged on something a lot closer to theaverage of your listed accomplishments than their sum. Giving unnecessary information will only dilute your average. Keep your resume to no more than one page. Remember that a busy HR person will scan your resume for 10 seconds. Adding more content will only distract them from finding key information (as will that second page). That said, don’t play font games; keep text at 11-point font or above.  Continue reading

Mindset Shift: Transitioning from Academia to Industry

Special thanks to Francesco Mosconi for contributing this post.


office-1209640_960_720Transitioning from Academia to Industry can be difficult for a number of reasons.

  1. You are learning a new set of hard skills (data analysis, programming in python, machine learning, map reduce etc.), and you are doing this in a very short time.
  2. You are also learning new soft skills, which also require practice to develop.
  3. A mindset shift needs to occur, and your success in industry will strongly depend on how quickly this happens.


Learn to prioritize

When your goal is knowledge, like in Academia, it is okay to spend as much time as you want learning a new concept or completing a project. On the other hand, during this program and on the job, you will often find that there is not enough time to deal with all the tasks and assignments required of you. In a situation where you have more on your plate than you can handle, it is essential to develop the ability to decide which tasks require execution, which can be postponed, and which ones can be simply ignored. There are many frameworks and approaches to prioritization, famous examples including the Getting Things Done system and the Eisenhower Method. Most methods are good, and eventually you will find your favorite one; however, they only work if consistently applied. In other words, it is less important which prioritization method you choose but it is fundamental that you prioritize your day and your week according to the specific goals you are to accomplish. Continue reading

Three best practices for building successful data pipelines

On September 15th, O’Reilly Radar featured an article written by Data Incubator founder Michael Li. The article can be found where it was originally posted here


pipeline-1585686_960_720Building a good data pipeline can be technically tricky. As a data scientist who has worked at Foursquare and Google, I can honestly say that one of our biggest headaches was locking down our Extract, Transform, and Load (ETL) process.

At The Data Incubator, our team has trained more than 100 talented Ph.D. data science fellows who are now data scientists at a wide range of companies, including Capital One, the New York Times, AIG, and Palantir. We commonly hear from Data Incubator alumni and hiring managers that one of their biggest challenges is also implementing their own ETL pipelines.

Drawn from their experiences and my own, I’ve identified three key areas that are often overlooked in data pipelines, and those are making your analysis:

  1. Reproducible
  2. Consistent
  3. Productionizable

While these areas alone cannot guarantee good data science, getting these three technical aspects of your data pipeline right helps ensure that your data and research results are both reliable and useful to an organization. Continue reading

How to Kickstart Your Data Science Career

At The Data Incubator, we run a free six week data science fellowship to help transition our Fellows from Academia to Industry. This post runs through some of the toolsets you’ll need to know to kickstart your Data Science Career

If you’re an aspiring data scientist but still processing your data in Excel, you might want to upgrade your toolset.  Why?  Firstly, while advanced features like Excel Pivot tables can do a lot, they don’t offer nearly the flexibility, control, and power of tools like SQL, or their functional equivalents in Python (Pandas) or R (Dataframes).  Also, Excel has low size limits, making it suitable for “small data”, not  “big data.”

In this blog entry we’ll talk about SQL.  This should cover your “medium data” needs, which we’ll define as the next level of data where the rows do not fit the 1 million row restriction in Excel.  SQL stores data in tables, which you can think of as a spreadsheet layout but with more structure.  Each row represents a specific record, (e.g. an employee at your company) and each column of a table corresponds to an attribute (e.g. name, department id, salary).  Critically, each column must be of the same “type”.  Here is a sample of the table Employees:

EmployeeId Name StartYear Salary DepartmentId
1 Bob 2001 10.5 10
2 Sally 2004 20. 10
3 Alice 2005 25. 20
4 Fred 2004 12.5 20

SQL has many keywords which compose its query language but the ones most relevant to data scientists are SELECT, WHERE, GROUP BY, JOIN.  We’ll go through these each individually.
Continue reading

A CS Degree for Data Science — Part I: Efficient Numerical Computation

apple-1851464_960_720At The Data Incubator, we get tons of interest from PhDs looking to attend our free fellowship, which trains PhDs to join industry as quants and data scientists.  A lot of them have asked what they can do to make themselves stronger candidates.  One of the critical skills for being a data scientist is understanding computation and algorithms.  Below is a (cursory) guide meant to whet your appetite for the computational techniques that I found useful as a data scientist.  Remember, there’s a lifetime of things to learn and these are just some highlights: Continue reading