At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Brian was a Fellow in our spring cohort who landed a job with one of our hiring partners, Capital One, after completing his postdoc at Columbia and NYU. Here’s his story:
My background is in computational astrophysics and numerical relativity. I did my Ph.D. work at the University of Illinois at Urbana-Champaign, then went on to a postdoc at Columbia and NYU. For my research I performed large-scale simulations of accretion disks around black holes. These simulations run on supercomputers and generate very large data sets. In working with this data, I developed specific skills that carry over directly to data science. However, I think the experience I gained in approaching problems scientifically, thinking critically, and using data to communicate a story clearly was the most valuable.
What do you think you got out of The Data Incubator?
The most valuable thing I got out of The Data Incubator was the ability to meet people from industry through the partner panels and happy hours. For me this served two purposes. First, it was an extremely efficient way to do a lot of networking in a short amount of time, which greatly increases the chance of finding a job. It is much easier to initiate a dialogue with a hiring partner if you have already met someone from the company in person.
Second, I learned a lot about the landscape of the industry, which helped me figure out what I was looking for in a job. Over the course of the program, I was introduced to companies that weren’t on my radar and found myself very interested in them. Once you are in The Data Incubator program, there are too many partner companies on the list to be able to seriously pursue them all. You have to know what you are looking for in order to focus your efforts, and I definitely gained that focus over the course of the program.
Could you tell us about your Data Incubator project?
For my project, I analyzed Twitter data collected during the Super Bowl. For anyone looking to complete a fun data project, I highly recommend Twitter. It is a relatively easy way to get a large, rich dataset in a short period of time. Once I had obtained the data through the streaming API, I did a few different things.
I was able to identify clusters of tweets that occur near the same time, contain keywords related to the same team, and contain either the word “touchdown” or “field goal.” This allows one to accurately find the final score of the game using only the contents of the Twitter stream. These clusters can also be used to estimate the time when touchdowns or field goals occur.
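As a rough illustration of the idea above, here is a minimal Python sketch that looks for bursts of tweets mentioning one team together with “touchdown” or “field goal.” It is not the code from the project: the keyword lists, the one-minute buckets, the burst threshold, and the point values (a touchdown counted as 7, assuming the extra point is good) are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical keyword lists; the project's actual keywords are not given.
TEAM_KEYWORDS = {
    "Patriots": ("patriots", "pats"),
    "Seahawks": ("seahawks", "hawks"),
}
# Assumed point values: a touchdown counts as 7 (extra point assumed good).
EVENT_POINTS = {"touchdown": 7, "field goal": 3}


def detect_scores(tweets, threshold=3):
    """Return (minute, team, event) for each burst of matching tweets.

    tweets: iterable of (minute, text), minute being an int offset from kickoff.
    threshold: assumed minimum number of matching tweets in one minute.
    """
    counts = Counter()
    for minute, text in tweets:
        lowered = text.lower()
        teams = [team for team, words in TEAM_KEYWORDS.items()
                 if any(word in lowered for word in words)]
        events = [event for event in EVENT_POINTS if event in lowered]
        # Skip tweets that are ambiguous about the team or the event.
        if len(teams) != 1 or len(events) != 1:
            continue
        counts[(teams[0], events[0], minute)] += 1

    detected = []
    last_minute = {}
    for team, event, minute in sorted(counts, key=lambda k: (k[2], k[0], k[1])):
        if counts[(team, event, minute)] < threshold:
            continue
        # A burst spanning consecutive minutes is treated as one score.
        previous = last_minute.get((team, event))
        if previous is None or minute - previous > 1:
            detected.append((minute, team, event))
        last_minute[(team, event)] = minute
    return detected


def final_score(detected):
    """Add up the assumed point values of the detected events per team."""
    score = Counter()
    for _, team, event in detected:
        score[team] += EVENT_POINTS[event]
    return dict(score)
```

Calling `final_score(detect_scores(tweets))` on a list of `(minute, text)` pairs gives an estimated score per team, and the minutes in the detected events serve as estimates of when each score happened. On a real stream the threshold would need tuning to the tweet volume.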
I also used the K-means clustering algorithm along with natural language processing techniques to divide the tweets into three clusters based on the frequency of popular words in each tweet. The clusters are well delineated, and the dominant words in each one show that they represent tweets about the Seahawks or Patriots, tweets about the commercials, and tweets about Katy Perry and the halftime show. Moreover, histograms in time for each cluster show that in the first quarter, viewer attention is roughly split between the game itself and the commercials, but attention shifts to the commercials in the second quarter. When the halftime show starts, attention shifts dramatically to Katy Perry and the halftime show, which continue to dominate for the rest of the game, with two exceptions: a spike at 145 minutes when the Seahawks score, and the very end of the game. From this, we can conclude that commercials airing in the second quarter are most likely to hold viewers’ attention and should be worth more.
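To make the clustering step concrete, here is a small pure-Python sketch of K-means on word-count vectors. It is only an illustration under stated assumptions, not the project's implementation: the whitespace tokenizer, the vocabulary size, and the deterministic farthest-first seeding are choices made for the example, and the source does not say which library or settings were actually used.

```python
from collections import Counter


def vectorize(tweets, vocab_size=20):
    """Build word-count vectors over the vocab_size most frequent words."""
    tokenized = [tweet.lower().split() for tweet in tweets]
    totals = Counter(word for words in tokenized for word in words)
    vocab = [word for word, _ in totals.most_common(vocab_size)]
    vectors = [[words.count(word) for word in vocab] for words in tokenized]
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def initial_centroids(vectors, k):
    """Deterministic farthest-first seeding (a simplification for this sketch)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, max_iterations=50):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = initial_centroids(vectors, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_words(vocab, centroid, n=3):
    """Words with the largest average count in a cluster."""
    ranked = sorted(zip(vocab, centroid), key=lambda pair: -pair[1])
    return [word for word, _ in ranked[:n]]
```

After `vocab, vectors = vectorize(tweets)` and `labels, centroids = kmeans(vectors, 3)`, calling `top_words(vocab, centroids[c])` for each cluster lists its dominant words, which is how a cluster can be read as "game," "commercials," or "halftime show." Grouping tweet timestamps by label would then give the per-cluster histograms in time. On real tweets, stop words and punctuation would need to be handled before counting.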
I also used a “bag of words” model to estimate how positive or negative tweets about a given brand were. I did this to compute a measure of the Twitter audience’s response to fifteen of the most-tweeted commercials. I then compared this with ratings computed by survey and published by USA Today, and found a significant correlation. Of course, the Twitter sentiment score has the advantage of being available in real time.
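A bag-of-words sentiment score of this kind can be sketched in a few lines of Python. The version below is a hedged illustration, not the model from the project: the positive and negative word lists are hypothetical, the per-tweet score is simply the balance of matched words, and the Pearson coefficient is computed by hand without any significance test.

```python
import math

# Hypothetical sentiment lexicons; the project's word lists are not given.
POSITIVE = {"love", "great", "funny", "awesome", "best"}
NEGATIVE = {"hate", "boring", "awful", "worst", "bad"}


def tweet_sentiment(text):
    """Score one tweet in [-1, 1] from counts of lexicon words, ignoring order."""
    words = [word.strip(".,!?#@\"'") for word in text.lower().split()]
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    if positive + negative == 0:
        return 0.0  # no lexicon words: treat the tweet as neutral
    return (positive - negative) / (positive + negative)


def brand_sentiment(tweets, brand):
    """Average sentiment of the tweets that mention the brand, or None."""
    scores = [tweet_sentiment(t) for t in tweets if brand.lower() in t.lower()]
    return sum(scores) / len(scores) if scores else None


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (otherwise this divides by zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)
```

With one `brand_sentiment` value per commercial and the matching survey rating for each, `pearson(sentiment_scores, survey_ratings)` gives the strength of the relationship. Because each tweet is scored independently, the per-brand averages can be updated as tweets arrive.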
For a slightly more detailed writeup, please visit superbowl.brianfarris.me.
What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?
If you have my specific background, you may be experienced with data analysis and programming in general, but lack expertise in machine learning. I found the Data Science Specialization and Andrew Ng’s machine learning course, both on Coursera, to be quite useful for filling in gaps. It also doesn’t hurt to make a nice personal website where you can show off some work you have done related to data science. When filling out the actual application, I also tried to make it clear that I had a genuine enthusiasm for leaving academia and pursuing a new career. Also, try to use Python as much as possible.
What’s your favorite thing you learned while at The Data Incubator?
I think my favorite thing I learned was natural language processing. As an astrophysicist, I had never really worked with text before, so this was very new to me. Between Michael’s lectures and the mini projects, I learned a lot, and it really helped me improve my Super Bowl project. It also came in handy when I was working on challenge projects in the interview process.
Learn more about The Data Incubator here.