At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Paul was a Fellow in our Fall 2016 cohort who landed a job with Cloudera.
After completing my PhD in Electrical and Computer Engineering in 2009, I joined Palantir Technologies as a Forward Deployed Engineer (a client-facing software engineer). There, I helped Palantir enter a new vertical, Fortune 500 companies, building data integration and analysis software for novel commercial workflows. I left Palantir in 2012, and in 2013 I co-founded SolveBio, a genomics company whose mission is to improve variant curation: the process by which clinicians and genetic counselors research genetic mutations and label them as pathogenic, benign, or unknown. At SolveBio, my work focused primarily on building scalable data cleansing, transformation, and ingestion infrastructure to power the SolveBio genomics API. I also worked closely with geneticists and other domain experts in a semi-client-facing role.
The theme of my six years as a software engineer has been to help domain experts, whether they be fraud investigators at a bank or clinicians at a hospital, analyze disparate data to make better decisions. I have built infrastructure in both Java and Python, have used large SQL and NoSQL databases, and have spent countless hours perfecting Bash hackery (or wizardry, depending on your perspective).
My experience as a software engineer was very relevant to data science in that I learned many ways to access, manipulate, and understand datasets from a variety of sources and in a variety of formats. As the adage goes, “Garbage in, garbage out.” Nowhere is this more true than in data science. Good data science requires cleaning and organizing data, and I feel very comfortable with that process.
What do you think you got out of The Data Incubator?
Several things! First, I got much-needed exposure to the “science” portion of data science. I learned the techniques, terminology, and mathematics that every data scientist should know. Moreover, The Data Incubator is organized to promote collaboration and (lots of) discussion amongst fellows. Not only did this help my understanding, but I also built several lasting personal relationships with other fellows.
Could you tell us about your Data Incubator project?
My Capstone Project used machine learning models to build market-neutral long/short trading strategies from quantitative financial data. The data was primarily sourced from CapIQ and collected via a combination of web scraping and bulk downloading. I calculated year-over-year rank orderings of more than 20 financial indicators and used them to train classifiers whose output was ‘long’ or ‘short’ (i.e., positive vs. negative growth). I observed good portfolio performance which, as expected, improved with the complexity of my models. My eventual goal is to integrate categorical datasets, such as sentiment extracted from filings and fraud indicators extracted from SEC documents, into my analysis, and to backtest strategies more comprehensively.
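To make the approach concrete, here is a minimal sketch of the kind of pipeline described above: rank-order indicator values across companies, then train a classifier to predict ‘long’ or ‘short’. The data, indicator count, and model choice are all illustrative assumptions, not the actual project code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical setup: 500 company-years, 20 financial indicators.
n_samples, n_indicators = 500, 20
raw = rng.normal(size=(n_samples, n_indicators))

# Rank each indicator cross-sectionally and scale ranks to [0, 1],
# mirroring the rank-ordering step described in the project.
ranks = raw.argsort(axis=0).argsort(axis=0) / (n_samples - 1)

# Synthetic labels: growth loosely driven by a few indicators plus noise.
signal = ranks[:, :3].mean(axis=1) + 0.1 * rng.normal(size=n_samples)
labels = np.where(signal > signal.mean(), "long", "short")

X_train, X_test, y_train, y_test = train_test_split(
    ranks, labels, test_size=0.25, random_state=0
)

# Any classifier with a binary output works here; a random forest is
# one plausible choice among the "more complex" models mentioned.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

On real financial data the labels would come from realized forward returns, and evaluation would go through a proper backtest rather than a simple train/test accuracy score.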
What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?
I would advise that incoming candidates have at least an elementary-level understanding of basic statistics (p-values, linear regression, etc.) and of Python.
What’s your favorite thing you learned while at The Data Incubator? This can be a technology, concept, or whatever you want!
I loved learning Pandas (seriously) and scikit-learn. Both frameworks are very well architected and implemented, and they can make you substantially more productive as a data scientist.
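As a small illustration of why the two libraries pair so well, here is a toy example (with made-up data) where Pandas handles the cleanup and the resulting DataFrame feeds directly into a scikit-learn estimator:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up indicator data with some missing values.
df = pd.DataFrame({
    "revenue_growth": [0.12, -0.05, 0.30, None, 0.07, -0.20],
    "debt_ratio":     [0.40, 0.90, 0.20, 0.50, None, 1.10],
    "label":          ["long", "short", "long", "long", "long", "short"],
})

# Pandas turns the cleanup into a one-liner: drop incomplete rows.
clean = df.dropna()

# The cleaned columns slot straight into a scikit-learn model.
model = LogisticRegression()
model.fit(clean[["revenue_growth", "debt_ratio"]], clean["label"])
preds = model.predict(clean[["revenue_growth", "debt_ratio"]])
```

The shared NumPy foundation is what makes this hand-off seamless: a DataFrame selection is valid input to `fit` and `predict` with no conversion step.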
Where are you working now? Tell us a little about your new job!
I’ll be working for Cloudera as a Software Engineer. In particular, I’ll be joining a very small team focused on further reducing the friction of doing data science at scale and in the “cloud.” We will provide the abstraction data scientists and engineers need to quickly and easily deploy Hadoop clusters so they can focus on analytics instead of infrastructure.