Data are becoming the new raw material of business
The Economist

Automating Excel with Python

We know there’s a lot of pain points in Excel that make it a tool that’s cumbersome and repetitive for data manipulation. We’ve distilled those pain points into three major themes.

  • The first is that it’s awkward to deal with higher dimensional data in a two-dimensional spreadsheet. Spreadsheets are great for representing two-dimensional data but they’re awkward for representing anything at three or higher dimensions. And while there’s many workarounds like pivot tables, this will only gets you so far.
  • The second pain point revolves around doing the same calculation over multiple sheets or multiple workbooks. While it’s easy to iterate over rows or columns in Excel, it’s cumbersome and time consuming to iterate over hundreds of sheets or notebooks.
  • Finally, data manipulation in Excel is actually very manual and hence very error prone. So in Excel, the convention is to copy data or formulas from cell to cell, but this makes it hard to keep our data up to date as new data arrives or as we update our computations as they become more complex. Errors aren’t always easy to catch before important business decisions are made.

In this video, we look at some data that we can get from the Bureau of Labor Statistics. While it comes in Excel, it comes in a very particular format. The rows iterate through years, the columns iterate through months, the sheets iterate through industries, and the workbooks iterate through wages, hours worked and overtime. So how would you use this to calculate salary, which we’re gonna define as wage times the quantity hours worked plus 1.5 times the overtime.
Continue reading

Advanced Conda: Installing, Building, and Uploading Packages, Adding Channels, and More

As many of our readers might know, conda is a package manager for the numerical python stack that solves many of the issues where pip falls short. While pip is great for pure Python packages (ones written exclusively in Python code), most data science packages need to rely on C code for performance. This, unfortunately, makes installation highly system-dependent. Conda alleviates this problem by managing compiled binaries such that a user does not the need to have a full suite of local build tools (for example, building NumPy from source no longer requires a FORTRAN 77 compiler). Additionally, conda is moving to include more than Python. For example, it’s also supporting managing packages in the R language. With such an ambitious scope, it’s not surprising that package coverage is incomplete and, if you’re a power user, you’ll often see yourself wanting to contribute missing packages. Here’s a primer on how:


Installing Using Pip / PyPI in Conda as a Fallback

The first thing to notice is that you don’t necessarily need to jump to building or uploading packages. As a simple fallback, you can tell conda to build Python Packages directly from pip/PyPI. For example, take a look at this simple conda environment.yaml file:

# environment.yaml
- numpy
- scipy
- pip:
  - requests

This installs numpy and scipy from anaconda but installs requests using pip.  You can invoke it by running:

conda env update -f environment.yaml


Adding New Channels to Conda for more Packages

While the core maintainers has fairly good coverage, the coverage isn’t complete. Also, because conda packages are version and system dependent, the odds of a specific version not existing for your operating system is fairly high. For example, as of this writing, scrapy, a popular web scraping software, lives in the popular conda-forge channel (package). Similarly, many r packages are under the r channel, for example the r-dplyr package. R packages are, by convention, prefaced with a “r-” prefix to their CRAN name. You can find the channel that supports it by Googling “scrapy conda”. To install it, we’ll need to add conda-forge as a channel:

# environment.yaml
- conda-forge
- scrapy


Building PyPI Conda Packages

But sometimes, even with extra channels, the packages simply don’t exist (as evidenced by a Google query) . In this case, we just have to build our own packages. It’s fairly easy. The first step is to create a user account on The username will be your channel name.

First, you’ll want to make sure you have conda build and anaconda client installed:

conda install conda-build
conda install anaconda-client

and you’ll want to authenticate your anaconda client with your new credentials:

anaconda login

Finally, it’s easiest to configure anaconda to automatically upload all successful builds to your channel:

conda config --set anaconda_upload yes

With this setup, it’s easy! Below are the instructions for uploading the pyinstrument package:

# download pypi data to ./pyinstrument
conda skeleton pypi pyinstrument
# build the package
conda build ./pyinstrument
# you'll see that the package is automatically uploaded


Building R Conda Packages

Finally, we’ll want to do this with R packages. Fortunately, there’s conda support for building from CRAN, R’s package manager. For example, glm2 is (surprisingly enough) not on anaconda.  We’ll run the following commands to build and auto upload it:

conda skeleton cran glm2
conda build ./r-glm2

Of course, glm2 now has OSX and Linux packages on The Data Incubator’s Anaconda channel at r-glm2 so you can directly include our channel:

# environment.yaml
- thedataincubator
- r-glm2


Visit our website to learn more about our offerings:


A Study Of Reddit Politics

This article was written for The Data Incubator by Jay Kaiser, a Fellow of our 2018 Winter cohort in Washington, DC who landed a job with our hiring partner, ZeniMax Online Studios, as a Big Data Engineer.


The Question

The 2016 Presidential Election was, in a single word, weird. So much happened during the months leading up to November that it became difficult to keep track with what who said when and why. However, the finale of the election that culminated with Republican candidate Donald J. Trump winning the majority of the Electoral College and hence becoming the 45th President of the United States was an outcome which at the time I had thought impossible, if solely due to the aforementioned eccentric series of events that had circulated around Trump for a majority of his candidacy.

Following the election, the prominent question that could not leave my mind was a simple one: how? How had the American people changed so much in only a couple of years to allow an outsider hit by a number of black marks during the election to be elected to the highest position in the United States government? How did so many pollsters and political scientists fail to predict this outcome? How can we best analyze the campaigns of each candidate, now given hindsight and knowledge of the eventual outcome? In an attempt to answer each of these, I have turned to a perhaps unlikely source.

Continue reading

SQLite vs Pandas: Performance Benchmarks

This technical article was written for The Data Incubator by Paul Paczuski, a Fellow of our 2016 Spring cohort in New York City who landed a job with our hiring partner, Genentech as a Clinical Data Scientist.

As a data scientist, we all know that unglamorous data manipulation is 90% of the work. Two of the most common data manipulation tools are SQL and pandas. In this blog, we’ll compare the performance of pandas and SQLite, a simple form of SQL favored by Data Scientists.

Let’s find out the tasks at which each of these excel. Below, we compare Python’s pandas to sqlite for some common data analysis operations: sort, select, load, join, filter, and group by.

Continue reading

Python Multi-Threading vs Multi-Processing

There is a library called threading in Python and it uses threads (rather than just processes) to implement parallelism. This may be surprising news if you know about the Python’s Global Interpreter Lock, or GIL, but it actually works well for certain instances without violating the GIL. And this is all done without any overhead — simply define functions that make I/O requests and the system will handle the rest.


Global Interpreter Lock

The Global Interpreter Lock reduces the usefulness of threads in Python (more precisely CPython) by allowing only one native thread to execute at a time. This made implementing Python easier to implement in the (usually thread-unsafe) C libraries and can increase the execution speed of single-threaded programs. However, it remains controvertial because it prevents true lightweight parallelism. You can achieve parallelism, but it requires using multi-processing, which is implemented by the eponymous library multiprocessing. Instead of spinning up threads, this library uses processes, which bypasses the GIL.

It may appear that the GIL would kill Python multithreading but not quite. In general, there are two main use cases for multithreading:

  1. To take advantage of multiple cores on a single machine
  2. To take advantage of I/O latency to process other threads

In general, we cannot benefit from (1) with threading but we can benefit from (2).

Continue reading

Ranking Popular JavaScript Visualization Packages for Data Science

At The Data Incubator, we strive to provide the most up-to-date data science curriculum available. Using feedback from our corporate and government partners, we deliver training on the most sought after data science tools and techniques in industry. We wanted to include a more data-driven approach to developing the curriculum for our corporate data science training and our free Data Science Fellowship program for PhD and master’s graduates looking to get hired as professional Data Scientists. To achieve this goal, we started by looking at and ranking popular deep learning libraries for data science, then ranking popular distributed computing packages for data science. Next, we wanted to analyze the popularity of JavaScript visualization packages for data science. Here are the results:

The Rankings

Below is a ranking of the top 20 of 110 JavaScript data visualization packages that are useful for Data Science, based on Github and Stack Overflow activity, as well as npmjs (javascript package manager)downloads. The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0). For example, chart.js is 3.29 standard deviations above average in Github activity, while plotly.js is close to average. See below for methods.
Continue reading

Tensorflow with Keras – Empowering Neural Networks for Deep Learning

Building deep neural networks just got easier. TensorFlow has announced that they are incorporating the popular deep learning API, Keras, as part of the core code that ships with TensorFlow 1.2. In the words of Keras’ author François Chollet, “Theano and TensorFlow are closer to NumPy, while Keras is closer to scikit-learn,” which is to say that Keras is at a higher level compared to pure TensorFlow and makes building deep learning models much more manageable.

TensorFlow is one of the fastest, most flexible, and most scalable machine-learning libraries available. It was developed internally by Google Brain and released as an open-source library in November 2015. Almost immediately upon its release, TensorFlow became one of the most popular machine learning libraries. But, as is the case with many libraries that emphasize speed and flexibility, TensorFlow tends to be a bit low-level.

Continue reading

NumPy and pandas – Crucial Tools for Data Scientists

This technical article was written for The Data Incubator by Don Fox, a Fellow of our 2017 Summer cohort in New York City.

When it comes to scientific computing and data science, two key python packages are NumPy and pandas. NumPy is a powerful python library that expands Python’s functionality by allowing users to create multi-dimenional array objects (ndarray). In addition to the creation of ndarray objects, NumPy provides a large set of mathematical functions that can operate quickly on the entries of the ndarray without the need of for loops.

Continue reading

Manipulating Data with pandas and PostgreSQL: Which is better?

Working on large data science projects usually involves the user accessing, manipulating, and retrieving data on a server. Next, the work flow moves client-side where the user will apply more refined data analysis and processing, typically tasks not possible or too clumsy to be done on the server. SQL (Structured Query Language) is ubiquitous in industry and data scientists will have to use it in their work to access data on the server.

The line between what data manipulation should be done server-side using SQL or on the client-side using a language like Python is not clear. Further, people who are either uncomfortable or dislike using SQL may be tempted to keep server-side manipulation to a minimum and reserve more of those actions on the client-side. With powerful and popular Python libraries for data wrangling and manipulation, the temptation to keep server-side processing to a minimum has increased.

This article will compare the execution time for several typical data manipulation tasks such as join and group by using PostgreSQL and pandas. PostgreSQL, often shortened as Postgres, is an object-relational database management system. It is free and open-source and runs on all major operating systems. Pandas is a Python data manipulation library that offers data structures akin to Excel spreadsheets and SQL tables and functions for manipulating those data structures.
Continue reading

Beyond Excel: Popular Data Analysis Methods from Excel, using pandas


Microsoft Excel is a spreadsheet software, containing data in tabular form. Entries of the data are located in cells, with numbered rows and letter labeled columns. Excel is widespread across industries and has been around for over thirty years. It is often people’s first introduction to data analysis.

Most users feel at home using a GUI to operate Excel and no programming is necessary for the most commonly used features. The data is presented right in front of the user and it is easy to scroll around through the spreadsheet. Making plots from the data only involves highlighting cells in the spreadsheet and clicking a few buttons.

There are various shortcomings with Excel. It is closed source and not free. There are free open-source alternatives like OpenOffice and LibreOffice suites, but there might be compatibility issues between file formats, especially for complex spreadsheets. Excel becomes unstable for files reaching 500 MB, being unresponsive and crashing for large files, hindering productivity. Collaborations can become difficult because it is hard to inspect the spreadsheet and understand how certain values are calculated/populated. It is difficult to understand the user’s thought process and work flow for the analysis.

The functionality of the spreadsheet is sensitive to the layout and moving entries around can have disastrous effects. Tasks like data cleaning, munging, treating different data types, and handling missing data are often difficult and require manual manipulation. Further, the user is limited to the built-in functionality of the program. In short, Excel is great for certain tasks but becomes unwieldy and inefficient as the tasks become more complicated.
Continue reading