Data are becoming the new raw material of business
The Economist

Automating Excel with Python

We know there are a lot of pain points in Excel that make it a cumbersome and repetitive tool for data manipulation. We’ve distilled those pain points into three major themes.

  • The first is that it’s awkward to deal with higher-dimensional data in a two-dimensional spreadsheet. Spreadsheets are great for representing two-dimensional data, but they’re awkward for representing anything with three or more dimensions. And while there are many workarounds like pivot tables, these will only get you so far.
  • The second pain point revolves around doing the same calculation over multiple sheets or multiple workbooks. While it’s easy to iterate over rows or columns in Excel, it’s cumbersome and time-consuming to iterate over hundreds of sheets or workbooks.
  • Finally, data manipulation in Excel is very manual and hence very error-prone. The convention in Excel is to copy data or formulas from cell to cell, but this makes it hard to keep our results up to date as new data arrives or as our computations become more complex. Errors aren’t always easy to catch before important business decisions are made.

In this video, we look at some data that we can get from the Bureau of Labor Statistics. While it comes in Excel, it comes in a very particular format. The rows iterate through years, the columns iterate through months, the sheets iterate through industries, and the workbooks iterate through wages, hours worked and overtime. So how would you use this to calculate salary, which we’re going to define as wage times the quantity hours worked plus 1.5 times the overtime?
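To make the salary definition concrete, here is a minimal sketch in plain Python. It is not the code from the video: the three dictionaries stand in for the three workbooks, the keys (industry, year, month) mirror the sheet, row and column layout described above, and every number and the helper name `compute_salary` are made up for illustration.

```python
# Hypothetical toy data standing in for the three workbooks.
# Each key is (industry, year, month), i.e. (sheet, row, column).
wages = {
    ("Mining", 2015, "Jan"): 30.0,
    ("Retail", 2015, "Jan"): 25.0,
}
hours = {
    ("Mining", 2015, "Jan"): 40.0,
    ("Retail", 2015, "Jan"): 38.0,
}
overtime = {
    ("Mining", 2015, "Jan"): 4.0,
    ("Retail", 2015, "Jan"): 2.0,
}


def compute_salary(wages, hours, overtime):
    """Return wage * (hours + 1.5 * overtime) for keys present in all three inputs."""
    common_keys = wages.keys() & hours.keys() & overtime.keys()
    return {
        key: wages[key] * (hours[key] + 1.5 * overtime[key])
        for key in sorted(common_keys)
    }


salary = compute_salary(wages, hours, overtime)
for key, value in salary.items():
    print(key, value)
```

Because the formula is written once and applied to every key, adding another industry, year or month only means adding entries to the inputs; nothing has to be copied from cell to cell.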

A Study Of Reddit Politics

This article was written for The Data Incubator by Jay Kaiser, a Fellow of our 2018 Winter cohort in Washington, DC who landed a job with our hiring partner, ZeniMax Online Studios, as a Big Data Engineer.


The Question

The 2016 Presidential Election was, in a single word, weird. So much happened during the months leading up to November that it became difficult to keep track of who said what, when, and why. However, the finale of the election, which culminated with Republican candidate Donald J. Trump winning the majority of the Electoral College and hence becoming the 45th President of the United States, was an outcome which at the time I had thought impossible, if solely due to the aforementioned eccentric series of events that had circulated around Trump for a majority of his candidacy.

Following the election, the prominent question that could not leave my mind was a simple one: how? How had the American people changed so much in only a couple of years to allow an outsider hit by a number of black marks during the election to be elected to the highest position in the United States government? How did so many pollsters and political scientists fail to predict this outcome? How can we best analyze the campaigns of each candidate, now given hindsight and knowledge of the eventual outcome? In an attempt to answer each of these, I have turned to a perhaps unlikely source.


MIT’s $75,000 Big Data finishing school (and its many rivals)

New courses target the need for managers and techies to talk to each other as data proliferate

For most students, a top degree in a field such as computer science or maths ought to be a passport to a career perfectly in tune with the relentless digitisation of work.

For the 30 graduates taking up a new one-year course at MIT’s Sloan School of Management in September, it will be only the prelude to a spell in a Big Data finishing school.

This first cohort of students will pay $75,000 in tuition fees for their Master of Business Analytics degree, with classes ranging from “Data mining: Finding the Data and Models that Create Value” to “Applied Probability”.

They will be calculating that the qualification will sprinkle their CVs with extra stardust, attracting elite employers that are trying to find meaning in the increasing volumes of data that businesses are generating.

From Eco-Friendly Batteries to Random Forests: Alumni Spotlight on Matt Lawder

At The Data Incubator we run a free eight-week Data Science Fellowship Program to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring Data Scientists. Matt was a Fellow in our Winter 2016 cohort who landed a job with one of our hiring partners, 1010data.


Tell us about your background. How did it set you up to be a great Data Scientist?

I defended my PhD dissertation at Washington University in St. Louis a few weeks before coming to The Data Incubator. I was part of the MAPLE lab in Energy, Environmental, and Chemical Engineering (I know, it’s a mouthful). Our lab focused on physics-based electrochemical modeling, mostly geared toward Li-ion batteries.

For my main dissertation project, I studied how batteries age under different real-world cycling patterns. Most cycle life estimates for a battery are based on simple constant charge and constant discharge patterns, but lots of applications (such as those experienced by batteries in electric vehicles or coupled to the electric grid) do not have simple cycling patterns. This variation affects the life of the battery.

Both through model simulation and long-term experiments, I had to analyze battery characteristics over thousands of cycles and pick out important features. This type of analysis, along with programming the computational models that were used to create these data sets, helped give me a background to tackle data science problems.

Additionally, I think that working on my PhD projects allowed me to gain experience in solving unstructured problems, where the solution (and sometimes even the problem/need) is not well defined. These types of problems are very common, especially once you get outside of academia.


What do you think you got out of The Data Incubator?

More than I can fit in a couple of paragraphs! The most important thing for me was learning all the functionality of different programming languages and packages. Coming from a background where I had programmed in Maple, VB, and a little bit of Matlab and SQL, learning Python (and all of its different packages), Spark, etc. opened up so many possibilities for doing new types of analysis. Knowing these tools greatly sped up my ability to conquer new problems.

Completing miniprojects on each subject was instrumental in feeling confident about applying the techniques we learned in real-world situations. It was definitely a pressure-packed environment trying to complete everything on time, but it forced you to know each subject inside and out. Looking back on the program, it’s amazing to see the amount of code you have produced.

Beyond the subject matter, working together with so many other driven people was a great experience. And the network of employers that were brought through the program for happy hours and panel discussions always helped showcase all the different ways data science is being used in industry. I made my first connection with 1010data (where I will be starting a job at the end of the month) at one of the happy hours. So I think those were pretty valuable!
