One of the more commonly used screening devices for data science is the portfolio project: applicants submit a project that showcases a piece of data science they have completed. At The Data Incubator, we run a free eight-week fellowship that helps people with master’s and PhD degrees train for and transition into careers in data science. One of the key components of the program is completing a capstone data science project to present to our (hundreds of) hiring employers. In fact, a major part of the fellowship application process is proposing that very capstone project, and many successful candidates propose projects that are already well underway, if not nearly complete. Based on conversations with partners, here’s our sense of priorities for what makes a good project, ranked roughly in order of importance:
- Completion: While potential is important, projects are assessed primarily on the success of the analysis performed rather than on the promise of future work. Working in any industry is about getting things done quickly, not perfectly, and a project with many gaps, “I wish I had time for” caveats, or “future steps” suggests the applicant may not be able to get things done at work.
- Practicality: High-impact problems of general interest are more interesting than theoretical discussions on academic research problems. If you solve the problem, will anyone care? Identifying interesting problems is half the challenge, especially for candidates leaving academia who must disprove an inherent “academic” bias.
- Creativity: Employers are looking for creative, original thinkers who can either (1) identify new datasets or (2) find novel questions to ask about an existing dataset. Employers do not want to see the tenth generic presentation on Citibike (or Chicago Crime, Yelp Restaurant Ratings data, NYC Restaurant Inspection Data, NYC Taxi, BTS Flight Delay, Amazon Review, Zillow home price, or beating the stock market) data. Similarly, projects that explain a non-obvious thesis supported by concise plots are more compelling than ones that present obvious conclusions (e.g. “more riders use Citibike during the day than at night”). Employers are looking for data scientists who can find trends in the data that they don’t already know about.
There are a number of easy ways to tap into your creativity when scoping a project. Remember that even a well-trodden dataset like Citibike can yield novel conclusions, and an untapped dataset can yield obvious ones. While your project does not have to be completely original, you should Google around to see if your analysis has been done to death. You can use design thinking techniques to consider how different kinds of end users will interact with your project and how it will add value they couldn’t get somewhere else. Divergent thinking is an important part of this, and it doesn’t always come naturally to traditional scientists, who are trained to converge on answers quickly. Allow yourself time to brainstorm a wide variety of possibilities, and you may find an unexplored angle that goes beyond people’s expectations.
- Challenge data: Real-world data science is not about running a few machine learning algorithms on pre-cleaned, structured CSV files. It’s often about munging, joining, and processing dirty, unstructured data. Projects that use pre-cleaned datasets intended for machine learning (e.g. UCI or Kaggle datasets) are less impressive than projects that require pulling data from an API or scraping a webpage.
- Size: All things being equal, analysis of larger datasets is more impressive than analysis of smaller ones. Real-world problems often involve working on large, multi-gigabyte (or terabyte) datasets, which pose significantly more of an engineering challenge than working with small data. Employers value people who have demonstrated experience working with large data.
- Engineering: All things being equal, candidates who can demonstrate the ability to use professional engineering tools like Git and Heroku will be viewed more favorably. So much of data science is software engineering, and savvy employers are looking for people who have the basic ability to do this. To get started, try following this Git tutorial or these Heroku tutorials in your favorite language. Put up your results on GitHub or turn your presentation into a small Heroku app!
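To make the “Challenge data” point above a little more concrete, here is a minimal sketch of the kind of munging and joining that messy data tends to require, using only the Python standard library. The two raw inputs, the field names, and the helper functions are all hypothetical, made up purely for illustration; a real project would pull far larger and messier data from an API or a scraped page.

```python
import csv
import io

# Hypothetical raw inputs: inconsistent casing, stray whitespace, missing values.
RAW_STATIONS = """station_id,name
 A1 ,Main St 
b2,  Harbor
"""

RAW_TRIPS = """station,duration_min
a1,12
A1 ,n/a
B2,7
c3,5
"""


def normalize_key(value):
    """Strip whitespace and upper-case an identifier so keys from both sources match."""
    return value.strip().upper()


def parse_duration(value):
    """Return the duration as a float, or None when the field is not a number."""
    try:
        return float(value.strip())
    except ValueError:
        return None


def load_stations(text):
    """Build a lookup table from normalized station id to cleaned station name."""
    reader = csv.DictReader(io.StringIO(text))
    return {normalize_key(row["station_id"]): row["name"].strip() for row in reader}


def join_trips(stations, text):
    """Inner-join trips to stations, dropping unknown stations and bad durations."""
    joined = []
    for row in csv.DictReader(io.StringIO(text)):
        key = normalize_key(row["station"])
        duration = parse_duration(row["duration_min"])
        if key in stations and duration is not None:
            joined.append({"station": stations[key], "duration_min": duration})
    return joined


stations = load_stations(RAW_STATIONS)
trips = join_trips(stations, RAW_TRIPS)
```

Even in this toy case, half of the trip rows are dropped: one has an unparseable duration and one refers to a station that does not exist in the other source. Deciding what to do with rows like these, and being able to explain that decision, is a large part of what working with real data looks like.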
Obviously, no project will be perfect. It’s hard to fulfill all of these criteria, and individual employers undoubtedly have other criteria that we have not mentioned or prioritize them differently. But more often than not, the fellows who are hired first have projects that satisfy most of these criteria. And lastly, if you’re looking for a data science job or to kick-start your career as a data scientist, consider applying to our free eight-week fellowship.