Data are becoming the new raw material of business
The Economist


Predicting Flight Delays with Random Forests: Alumni Spotlight on Stacy Karthas

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Stacy was a Fellow in our Winter 2017 cohort who landed a job with one of our hiring partners, AdTheorent.

Tell us about your background. How did it set you up to be a great data scientist?

I received my Bachelor of Science degrees in mathematics and physics from the University of New Hampshire. I then went on to graduate school at Stony Brook University, where I graduated with my master’s degree in physics in December 2016. During my master’s degree, I did research in nuclear heavy ion physics with a focus on the analysis of gluons and their products as they traversed our detector. The data analysis, simulation, and clustering algorithms I worked on prepared me to become a data scientist because they were a physical application of many of the tools used by data scientists.

What do you think you got out of The Data Incubator?

The Data Incubator gave me the chance to solidify my data science knowledge. It helped me pull together tools and concepts I had been using during all of my previous research experiences. I learned a lot of new machine learning concepts and how they could be applied to real world data.

What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?

Python is key. Learning as much as you can before the program is very important. I would also suggest taking an online course or reading a bit about machine learning before the program starts. Also, it is easier if you try to relate the concepts back to something you’ve already done. It was easier for me to visualize how clustering algorithms worked because I had been working on my own for a few months.

What’s your favorite thing you learned while at The Data Incubator? This can be a technology, concept, or whatever you want!

My favorite thing I learned at The Data Incubator was how to create models with scikit-learn. Because of my limited background in Python, the fact that you can use such a convenient package to do some very solid machine learning was very neat!

Describe your Data Incubator Capstone Project.

My capstone project was an app that predicted whether or not a domestic flight in the US would be delayed. This was based on date, time of day, airline, airport, etc.

How did you come up with the idea for the project?

Millions of passengers take domestic flights every day, whether for business or for pleasure. The worst thing about flying is that you have to build in extra time in case of a delay: at least 15% of flights are delayed by more than 10 minutes, and many are delayed by hours. I thought that I could create an app that would allow people (and myself) to find an airline or flight that is not likely to be delayed, minimizing the chance of this hassle.

What technologies did you use and what skills did you learn at TDI that you applied to the project?

I used scikit-learn’s random forest classifier to build my prediction model, along with other packages to assist in evaluating and cross-validating my results. I also used Flask and Heroku to deploy my app. Some of my visualizations used matplotlib, seaborn, Plotly, and D3.
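For readers who want to see what this kind of setup can look like, here is a minimal sketch of a cross-validated random forest classifier in scikit-learn. It is not Stacy's actual code: the data are synthetic, and the feature columns, delay probabilities, and hyperparameters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n_flights = 2000

# Synthetic stand-in features (hypothetical; not real flight records).
airline = rng.integers(0, 5, n_flights)    # carrier code
airport = rng.integers(0, 10, n_flights)   # origin airport code
hour = rng.integers(5, 24, n_flights)      # scheduled departure hour
weekday = rng.integers(0, 7, n_flights)    # day of week
X = np.column_stack([airline, airport, hour, weekday])

# Synthetic label (assumption): delays become more likely later in the day
# and slightly more likely for carrier 0.
p_delay = 0.05 + 0.02 * (hour - 5) + 0.05 * (airline == 0)
y = (rng.random(n_flights) < p_delay).astype(int)

# One-hot encode the categorical columns; pass the hour through as a number.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), [0, 1, 3])],
    remainder="passthrough",
)
model = make_pipeline(
    preprocess,
    RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0),
)

# 5-fold cross-validation, scored by area under the ROC curve.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("Mean cross-validated ROC AUC:", scores.mean())

# Fit on all data and estimate the delay probability for one example flight.
model.fit(X, y)
new_flight = np.array([[0, 3, 18, 4]])  # carrier 0, airport 3, 6 pm, weekday 4
prob = model.predict_proba(new_flight)[0, 1]
print("Estimated delay probability:", prob)
```

A fitted pipeline like this could sit behind a small web app, which would call `predict_proba` on the flight details a user enters.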

What was your most surprising or interesting finding?

I thought it was interesting just how poorly some of the airlines performed. Generally, the larger airlines tended to have worse on-time statistics, while smaller airlines like Alaska and Hawaiian generally had short delays.

Describe the business application for this project (how could a company use your work or your data?).

Time is money. I can think of two ways a business would want to use this. First, companies don’t want to send employees on business trips only to have them waiting around in the airport, so it would be best to book with airlines that have fewer delays. Second, this app would promote competition and accountability among airlines: they could promote themselves with their on-time statistics, and customers could hold them to higher standards.

Do you have an interesting visualization to share?

[Figure: Cause of Delay]

The causes of delay by time of day indicate that delays tend to stack: delays earlier in the day tend to cause delays later in the day.

Where are you going to be working? Tell us a little about your new job!

I am currently working as an associate data scientist at AdTheorent. AdTheorent is a digital media company that bids on mobile and web ad space for its clients. My job is to build models that help increase the likelihood that an advertisement will perform well (be clicked, be seen, or lead to a purchase).
