At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. William was a Fellow in our winter cohort who landed a job with one of our hiring partners, Dataminr. Here’s his story:
Data science is a natural fit given my experiences in graduate school. I have a PhD in Computer Science from the University of Minnesota with a research focus on applied machine learning. Specifically, I worked on methods for leveraging domain knowledge to improve feature selection in machine learning domains such as airline ticket price prediction and river/stream flow prediction. Also, I supplemented this experience by working on several occasions as a summer intern or visiting scholar during graduate school. These opportunities provided me with experience in collaborating with experts in a variety of different problem solving cultures.
What do you think you got out of The Data Incubator?
The greatest benefit of The Data Incubator is in the ability to ramp up the job search very quickly. The partner companies come to the program ready to interact with Fellows and to find good hiring matches that satisfy their needs. The Fellows come ready to demonstrate their abilities, to show how they can add value, and to jump into new opportunities. This convergence of interests is very special and benefits everyone involved. I know that I benefited from the opportunity to interact so intensively with the wide variety of partner companies in my own search process.
Also, I especially enjoyed participating as part of the cohort in the NYC Data Incubator. Being located in NYC and interacting with the other Fellows was very valuable. It helped me to understand my own strengths relative to others looking for similar opportunities. Coming together as a group all at once with similar goals to a new city helped to turn what is usually an individual effort into a more collaborative process. The connections formed within the group will help us far into our future careers.
Could you tell us about your Data Incubator taxi project? It was really cool; we’d love to feature it.
I love working with real-world data to solve problems. When looking for a project for the Incubator, I wanted to use a large dataset that was relevant to my audience (partner companies) and would provide a useful output that solves a real-world problem. This is indeed a tall order.
My Data Incubator project considers the following problem: From my current location on the street in NYC, what is the best nearby location I should walk toward to maximize my chance of finding a taxi quickly? Should I walk west or east? Should I walk two more blocks to improve my chances? My project, nyctaxi.me, is a smartphone-friendly web app that answers these questions for you. Using your current GPS location, the current time of day, and a large corpus of historical data, the site determines the seven best nearby intersections that you should walk toward to minimize your taxi hail wait. It also estimates the walking and waiting time for each of the potential locations, and ranks these locations by total walking plus waiting time. I encourage everyone to try the site. I found this was a great ice-breaker in my interactions with companies because it solves a problem that many long-time New Yorkers as well as newcomers face. Additionally, I benefitted by using this taxi hot-spot finder a couple of times as a Data Incubator Fellow on my way to events!
How does it work? At the risk of giving too much detail, I will describe some specifics. The dataset I used was a collection of all yellow taxi fares from 2013 (starting GPS location, ending GPS location, total distance, fare, tolls, passenger count, etc.), which was publicly released by the NYC Taxi and Limousine Commission. I map each of the fare end points to the nearest road intersection to facilitate generalization and reduce noise in the location data. Using GIS data on the road network, I find there are over 100,000 road intersections within the city. Then I characterize the taxi drop-off and pickup frequency at each road intersection by estimating these event frequencies conditioned on the time of day and the day of the week. These pickup and drop-off frequencies appear in the nyctaxi.me map visualization as a colored dot centered over each intersection. Using the user’s current GPS location, we compute the walking distance to all nearby intersections using the “Manhattan” distance. We also compute the expected waiting time (one half of the mean inter-arrival time) at each nearby intersection based on the current time. Finally, we rank the nearby intersections from best to worst based on total walking plus expected waiting time.
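For readers who want to see the steps above in concrete form, here is a minimal Python sketch. It is not the nyctaxi.me code: the three intersections, the walking speed of 80 meters per minute, the flat-earth distance approximation, and all function names are assumptions made for illustration. The interview also does not say which events (pickups or drop-offs) feed the waiting-time estimate, so the sketch simply treats every record as a generic "taxi event".

```python
from collections import Counter
from datetime import datetime
import math

# Hypothetical toy data: intersection id -> (latitude, longitude).
INTERSECTIONS = {
    "A": (40.7500, -73.9900),
    "B": (40.7510, -73.9900),
    "C": (40.7500, -73.9870),
}

METERS_PER_DEG_LAT = 111_320.0  # rough conversion, good enough at city scale
WALK_SPEED_M_PER_MIN = 80.0     # assumed walking speed


def manhattan_meters(p, q):
    """Approximate Manhattan (L1) distance in meters between two (lat, lon) points."""
    mean_lat = math.radians((p[0] + q[0]) / 2)
    dy = abs(p[0] - q[0]) * METERS_PER_DEG_LAT
    dx = abs(p[1] - q[1]) * METERS_PER_DEG_LAT * math.cos(mean_lat)
    return dx + dy


def nearest_intersection(point):
    """Snap a GPS point to the closest known intersection."""
    return min(INTERSECTIONS, key=lambda k: manhattan_meters(point, INTERSECTIONS[k]))


def event_counts(events):
    """Count events per (intersection, weekday, hour).

    `events` is an iterable of (lat, lon, datetime) tuples.
    """
    counts = Counter()
    for lat, lon, ts in events:
        counts[(nearest_intersection((lat, lon)), ts.weekday(), ts.hour)] += 1
    return counts


def expected_wait_minutes(count, n_weeks):
    """Half the mean inter-arrival time for an hour slot observed over n_weeks weeks."""
    if count == 0:
        return math.inf  # no observed taxis in this slot
    rate_per_min = count / (n_weeks * 60.0)
    return 0.5 / rate_per_min


def rank(user_point, now, counts, n_weeks, top_k=7):
    """Return up to top_k (total, id, walk, wait) tuples, best first."""
    rows = []
    for name, loc in INTERSECTIONS.items():
        walk = manhattan_meters(user_point, loc) / WALK_SPEED_M_PER_MIN
        wait = expected_wait_minutes(counts[(name, now.weekday(), now.hour)], n_weeks)
        rows.append((walk + wait, name, walk, wait))
    rows.sort()
    return rows[:top_k]
```

With a handful of made-up events, `rank((40.75, -73.99), datetime(2013, 1, 7, 9, 30), counts, n_weeks=1)` returns the candidate intersections ordered by total minutes. A real version would need a spatial index to handle 100,000 intersections and 170 million fares, which this sketch ignores.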
This dataset (about 50 GB in size, consisting of over 170 million taxi fares) has been used in previous visualization/analysis projects (and even for academic research). But I feel my project is unique because it can help to solve a real problem that people face every day.
What advice would you give to someone who is applying for The Data Incubator, particularly those with computer science backgrounds?
Computer scientists are a natural fit for data science opportunities principally because computer science is, in many respects, a degree in applied problem solving. In a university setting, we are trained to leverage the benefits of clever algorithms that perform specific tasks very well. These methods can be much faster than the most obvious brute-force solutions to a problem. Using the most appropriate algorithm can make problems that are totally infeasible by brute force feasible and even fast to solve. I think the problem solving skills of computer scientists are particularly strong, but there are always skills that we can improve.
First, practicing fundamental computer science problems beyond one’s research area is critically important. In interviews, these kinds of questions are ubiquitous, so whiteboard coding of basic algorithms (sorting, search), data structures, and statistics is always going to be helpful for future job searching. There are many books and websites about this topic, but there is no replacement for the practice time!
Second, having a portfolio of one or two items from recent past efforts is valuable when trying to demonstrate your value to a potential future employer. It is really much more impactful to show a future boss what you have done than it is to tell them.
Third, being able to work quickly on a new problem or dataset to develop actionable insights is a very valuable skill for data scientists. It’s important to become comfortable with a specific tool-set and to rapidly produce statistics, models, and visualizations that can be used to explain the underlying phenomena to others. I think it is important to practice digesting new datasets from many domains from scratch without any specific goal. Being inquisitive is an extremely valuable habit in this business. [Editor’s Note: for more information on preparing to be a Data Scientist, check out this previous blog post.]
Learn more about The Data Incubator here.