At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Chris was a Fellow in our Fall 2015 cohort who landed a job with one of our hiring partners, Sotera.
I’ve always been interested in science and examining how things work. Before I could talk, I disassembled part of a washing machine. In that particular instance my victory was short-lived; I was told to return the appliance to its original condition, which I promptly did.
My curiosity formed the basis for a successful research career. My technical background includes analysis in several different areas of physics, but data analysis remains a common thread. As one example, my dissertation involved analyzing structural and aerodynamic data from a transonic aerodynamic system. I excelled at the data analysis aspects of the work in a way that a traditional aerodynamicist probably wouldn’t have. I was able to identify correlations between key mechanisms in different systems and develop visualizations to easily express the conclusions.
The two aspects of the work I enjoyed most were determining root causes and presenting them in simple terms. These traits align well with data science.
What do you think you got out of The Data Incubator?
The most important takeaway was getting to know the other Fellows… you know, birds of a feather. The discussions with industry partners, along with the projects, really clarified the scope and practical impact a data scientist can have.
On the technical side, the (highly focused) time spent solving real-world data science problems was invigorating. The teamwork, planning, and coding sprints were invaluable. It would be difficult to get the same impact when working solo.
Could you tell us about your Data Incubator project?
I developed a website which provides an overview of the used car market. Guided by the user, my project aggregates used car pricing data from several different websites. Trends in price, year, and vehicle mileage are presented to provide an overview of the models and years selected.
The overview is more informative than traditional value estimators and clearly reveals similarities or major differences between model years. In addition, cars with similar option packages are identified and grouped to provide a further level of categorization for the user-selected car market. Cars with missing data can be matched to their most likely option packages through machine learning. Similar trend analysis and classification ideas can be quite powerful in an industrial setting.
I chose this project because I thought it was something most people can relate to, especially data-oriented people. Many of the trends are really clear. Prices are often quantized at round numbers ($15,500, $16,000, $16,500) and the different types of value trends for luxury vehicles versus more economical vehicles are immediately evident. Access to an overview like this can help someone develop an intuition about a particular domain. The overview and the clarity it provides are two of my favorite aspects of data science.
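As a rough illustration of the kind of overview described above, here is a minimal Python sketch. It is not the project's actual code: the listings are invented for the example, the function names `median_price_by_year` and `share_on_round_prices` are hypothetical, and it uses only the standard library rather than whatever tools the real site was built with. It summarizes asking price by model year and measures how many prices sit on a round $500 step.

```python
from collections import defaultdict
from statistics import median

# Invented listings for illustration only: (model_year, mileage, asking_price_usd)
listings = [
    (2012, 61000, 15500),
    (2012, 58000, 16000),
    (2012, 70500, 14995),
    (2013, 42000, 17500),
    (2013, 39500, 18000),
    (2013, 47000, 17250),
]


def median_price_by_year(rows):
    """Group asking prices by model year and return the median for each year."""
    prices = defaultdict(list)
    for year, _mileage, price in rows:
        prices[year].append(price)
    return {year: median(values) for year, values in sorted(prices.items())}


def share_on_round_prices(rows, step=500):
    """Return the fraction of listings whose price is an exact multiple of `step`."""
    on_step = sum(1 for _year, _mileage, price in rows if price % step == 0)
    return on_step / len(rows)


summary = median_price_by_year(listings)
round_share = share_on_round_prices(listings)

for year, price in summary.items():
    print(f"{year}: median asking price ${price:,}")
print(f"Share of listings priced on a $500 step: {round_share:.0%}")
```

With real scraped data, the same two summaries would be a reasonable first check on the price trends and round-number clustering mentioned in the answer.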
What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?
Go for it!! On the practical side, exercising some of your favorite development tools will be hugely beneficial. If you don’t have a favorite, you should check out some text editors and/or IDEs and get comfortable with a workflow. Be sure to spend time on your project. The Incubator covers a lot of ground, and the more project planning and debugging you can put in now, the better.
What’s your favorite thing you learned while at The Data Incubator?
It’s difficult to pick just one. I really like pandas for data exploration. It’s accessible to everyone from a beginner who just needs basic summaries and plots to an expert who wants to dive deep into trends. Spark is the yang to pandas’ yin. Large-scale data analysis almost certainly involves parallel processing. I really enjoyed working with Spark as a way to utilize a cluster.
An honorable mention should go to Slack. I try to avoid a lot of distraction when I’m at my computer, so I’ve stayed away from having a chat client easily accessible. Slack was my lifeline to the other members of the cohort, whether they were sitting right next to me or in another time zone working on a project at 3:00 a.m. I knew the messages had a high signal-to-noise ratio and were relevant to the material we were working on. The Slack channel was a great way to distribute the debugging load for tough problems and keep in touch.
Learn more about The Data Incubator here.