Data are becoming the new raw material of business
The Economist

The Science of Data Science: Alumni Spotlight on Paul George

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Paul was a Fellow in our Fall 2016 cohort who landed a job with Cloudera.

Tell us about your background. How did it set you up to be a great data scientist?

Following the completion of my PhD in Electrical and Computer Engineering in 2009, I joined Palantir Technologies as a Forward Deployed Engineer (client-facing software engineer). There, I helped Palantir enter a new vertical, that of Fortune 500 companies, where I built data integration and analysis software for novel commercial workflows. I left Palantir in 2012, and in 2013 I co-founded SolveBio, a genomics company whose mission is to help improve the variant-curation process: the process by which clinicians and genetic counselors research genetic mutations and label them as pathogenic, benign, or unknown. At SolveBio, my work was primarily focused on building scalable data cleansing, transformation, and ingestion infrastructure that could be used to power the SolveBio genomics API. I also worked closely with geneticists and other domain experts in a semi-client-facing role.

The theme of my six years as a software engineer has been to help domain experts, whether they be fraud investigators at a bank or clinicians at a hospital, analyze disparate data to make better decisions. I have built infrastructure in both Java and Python, have used large SQL and NoSQL databases, and have spent countless hours perfecting Bash hackery (or wizardry, depending on your perspective).

My experiences as a software engineer were very relevant to data science in that I learned many ways to access, manipulate, and understand a variety of datasets from a variety of sources in a variety of formats. As the adage goes, “Garbage in, garbage out.” Nowhere is this more true than in data science. Performing good data science requires cleaning and organizing data, and I feel very comfortable with this process.



The Iraq War by the Numbers: Extracting the Conflict’s Staggering Costs

One of our fellows recently had a piece published about her unique and timely capstone project. The original piece is posted on Data Driven Journalism.

In her own words:

This war is not only important due to its staggering costs (both human and financial) but also on account of its publicly available and well-documented daily records from 2004 to 2010.

These documents provide a very high spatial and temporal resolution view of the conflict. For example, I extracted from these government memos the number of violent events per day in each county. Then, using latent factor analysis techniques, e.g. non-negative matrix factorization, I was able to identify the three principal war zones. Interestingly, these principal conflict zones were areas populated by the three main ethno-religious groups in Iraq.
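The latent-factor step described above can be sketched in a few lines. This is an illustrative toy, not the actual analysis: the days-by-counties count matrix here is synthetic, and NMF is implemented directly with the standard Lee-Seung multiplicative updates rather than a library call.

```python
import numpy as np

# Synthetic stand-in for the real records: a days-by-counties matrix of
# violent-event counts (random data, not the actual Iraq War memos).
rng = np.random.default_rng(0)
V = rng.poisson(lam=3.0, size=(365, 20)).astype(float)

# Factor V ~= W @ H with k = 3 latent components, one per hypothesized
# conflict zone, via Lee-Seung multiplicative updates for Frobenius NMF.
k = 3
W = rng.random((365, k)) + 1e-3   # per-day activity of each zone
H = rng.random((k, 20)) + 1e-3    # per-county loading of each zone
eps = 1e-9
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

# Assign each county to the latent zone with the largest loading.
county_zone = H.argmax(axis=0)
```

Because every factor stays non-negative, each row of `H` reads directly as a spatial "zone" and each column of `W` as that zone's activity over time, which is what makes the clusters interpretable as conflict regions.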

You can watch her explain it herself:

 

Editor’s Note: The Data Incubator is a data science education company.  We offer a free eight-week fellowship helping candidates with PhDs and masters degrees enter data science careers.  Companies can hire talented data scientists or enroll employees in our data science corporate training.


From Astronomy to AI: Alumni Spotlight on Athena Stacy

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Athena was a Fellow in our Fall 2016 cohort who landed a job with Brighterion.

Tell us about your background. How did it set you up to be a great data scientist?

My background is in astronomy.  My research consisted of developing and performing computer simulations of star formation in the early universe.  The goal of these simulations was to better understand what stellar clusters looked like in regions of the universe that telescopes cannot observe.  Thus I was already familiar with computer programming and visualizing data.  This was very helpful in the transition to data science.  Knowing how to present my research clearly to a range of audiences — both beginning students and other experts in the field — has helped as well!



Parallelizing Jupyter Notebook Tests

How we cut our end-to-end test suite runtime by 66% using parallelism

While there’s a common stereotype that data scientists are poor software engineers, at The Data Incubator, we believe that mastering the fundamentals of software engineering is important for data science and we strive to implement rigorous engineering standards for our data science company.  We have an extensive curriculum for data science corporate training, data science fellowship, and online data science course leveraging the jupyter (née ipython) notebook format.  Last year, we published a post about testing Jupyter notebooks — applying rigorous software engineering testing standards to new technologies popular in data science.

However, over time, as our codebase has grown, we’ve added more and more notebooks to our curriculum material. This led to tests on our curriculum taking ~30 minutes to run! We quickly identified parallelism as low-hanging fruit that would make sense for a first approach, with a couple of points:

  1. We have curriculum materials that run code in Spark 2.0; parallelizing runs in that kernel is hard because of how the Spark execution environment spins up.  We also have curriculum materials in the Jupyter R kernel.
  2. Subprocess communication in Python (what our testing code is written in) is a pain, so maybe there’s a way to use some other parallelization library to avoid having to reinvent that wheel.
  3. Most of our notebooks are in Python, so those shouldn’t have any issues.

These issues aside, this seemed like a reasonable approach because each Jupyter notebook executes as its own subprocess in our current setup – we just had to take each of those processes and run them at the same time. Taking a stab at 3., parallelizing Python tests, while finding a way around 2. – annoying multiprocess communication issues – yielded great results!

The library: nose

Anyone who’s written production-grade Python is probably familiar with nosetests, the ubiquitous test suite runner. In another codebase of ours, we use nose in conjunction with the flaky plugin to rerun some tests whose output can be… less deterministic than would be ideal.

Nose has good support for parallel testing out-of-the-box (incidentally, the flaky plugin and parallel testing don’t play nice together), so it seemed like a clear candidate for handling test-related subprocess communication for us.
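Nose’s multiprocess plugin handles the process management for us. A minimal way to turn it on is from the command line (`nosetests --processes=6 --process-timeout=600`) or via a config file; the process count and timeout below are illustrative values, not our production settings:

```ini
; Hypothetical setup.cfg fragment -- equivalent to running
;   nosetests --processes=6 --process-timeout=600
[nosetests]
processes = 6
process-timeout = 600
```

The `process-timeout` matters for notebook tests: a single long-running notebook can easily exceed the plugin’s default per-test timeout.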

The hurdle: dynamically adding tests

We run our test suite in many different configurations: on pull requests, for example, we’ll only run modified notebooks to facilitate speedier development – and we save the full build for when we merge into master. Given this, we need to dynamically add tests – 1 per notebook we want to test in a given run. A popular Python mechanism for this, that we’ve historically employed, is using something like this:

 

suite = unittest.TestSuite()
for filename in notebook_filenames:
    suite.addTest(NotebookTestCase(filename))

 

Nose, unfortunately, does not like this and doesn’t play nice with unittest. It insists, instead, on test discovery. So we had to get creative. What did “creative” mean, exactly? Unfortunately for the pythonistas among us, it meant we had to use some of Python’s introspection functionality.

The solution: dynamically adding functions to Python classes

The hack we came up with was the following:

  1. Dynamically search out notebooks and add a test function for each to a class. In python, this involves defining a function, setting its __name__ attribute, and then using setattr on the parent class to add that function with the appropriate name. This took care of adding parallel tests in.
  2. Use the nose attr plugin to specify attributes on the tests, so we can maintain speedy single-notebook PR testing as described above. We have code that keeps track of the current diffed filenames (from master), and adds two sets of tests: one under the all attribute, and another under the change attribute. You can see the @attr decorator being used below.

You can see the class below. In a wrapper file, we call the add_tests() function as soon as that file is imported (i.e. before nose attempts any “discovery”) – the ipynb_all and ipynb_change_nbs functions live outside of the class but simply search out appropriate filenames.

class IpynbSelectorTestCase(object):
    """
    Parallelizable TestCase to be used for nose tests.
    To use, inherit and override `check_ipynb` to define how to check each notebook.
    Call `add_tests` in a global call immediately after the class declaration.
    See http://nose.readthedocs.io/en/latest/writing_tests.html#test-generators
    Tests can be invoked via (e.g.):

        nosetests -a 'all'

    Do not inherit `unittest.TestCase`: this will break parallelization.
    """
    def check_ipynb(self, ipynb):
        raise NotImplementedError

    @classmethod
    def add_func(cls, ipynb, prefix):
        @attr(prefix)
        def func(self):
            self.check_ipynb(ipynb)

        _, nbname = os.path.split(ipynb)
        func.__name__ = 'test_{}_{}'.format(prefix, nbname.split('.')[0])
        func.__doc__ = 'Test {}'.format(nbname)
        setattr(cls, func.__name__, func)

    @classmethod
    def add_tests(cls):
        for ipynb in ipynb_all():
            cls.add_func(ipynb, 'all')

        for ipynb in ipynb_change_nbs():
            cls.add_func(ipynb, 'change')
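Stripped of the nose-specific pieces (`@attr`, real notebook execution), the core of `add_func` is the pattern of naming a closure and attaching it to the class. Here is a self-contained illustration with a stubbed-out check in place of running a notebook; the class and notebook names are hypothetical:

```python
import os

class ToyNotebookTests(object):
    """Illustrative stand-in for IpynbSelectorTestCase, with no nose dependency."""
    def check_ipynb(self, ipynb):
        # Stand-in check; the real class would execute the notebook here.
        assert ipynb.endswith('.ipynb')

    @classmethod
    def add_func(cls, ipynb, prefix):
        def func(self):
            self.check_ipynb(ipynb)

        # Name the closure so a discovery-based runner will pick it up.
        _, nbname = os.path.split(ipynb)
        func.__name__ = 'test_{}_{}'.format(prefix, nbname.split('.')[0])
        func.__doc__ = 'Test {}'.format(nbname)
        setattr(cls, func.__name__, func)

# Mimic add_tests() at import time, before any test discovery runs.
for nb in ['notebooks/intro.ipynb', 'notebooks/spark.ipynb']:
    ToyNotebookTests.add_func(nb, 'all')
```

After the loop runs, the class has real, individually named methods (`test_all_intro`, `test_all_spark`), which is exactly what lets nose discover and parallelize them as separate tests.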

The results

So, our full build used to take 30 minutes to run, typically. With added parallelism, that time has dropped to 11 minutes! We tested with a few different process counts, and continued seeing marginal improvement up to 6 processes. We made some plots! (Made with seaborn).

 

[Plots: test suite runtime vs. number of processes, and Jupyter test runtime comparison]

 

Not only is the reduction numerically dramatic, but gains like these add up in terms of curriculum developer productivity and allow us to rapidly iterate on our curriculum.

 

By Christian Moscardi and Michael Li


Polling and big data in the age of Trump, Brexit, and the Colombian Referendum

Our founder, Michael Li recently collaborated with his colleague Raymond Perkins, a researcher and PhD candidate at Princeton University, on this piece about big data and polling. You can find the original article at Data Driven Journalism. 

The recent presidential inauguration and the notably momentous election that preceded it have brought about numerous discussions surrounding the accuracy of polling and big data. The US election results, paired with those of Brexit and the Colombian referendum, have left a number of people scratching their heads in confusion. Statisticians, however, understand the multitude of sampling biases and statistical errors that can ensue when your data involves human beings.

“Though big data has the potential to virtually eliminate statistical error, it unfortunately provides no protection against sampling bias and, as we’ve seen, may even compound the problem. This is not to say big data has no place in modern polling, in fact it may provide alternative means to predict election results. However, as we move forward we must consider the limitations of big data and our overconfidence in it as a polling panacea.”

At The Data Incubator, this central misconception about big data is one of the core lessons we try to impart to our students. Apply to be a Fellow today!

 



How Employers Judge Data Science Projects


One of the more commonly used screening devices for data science is the portfolio project.  Applicants apply with a project they have built, showcasing a piece of data science they have accomplished.  At The Data Incubator, we run a free eight-week fellowship helping train and transition people with masters and PhD degrees for careers in data science.  One of the key components of the program is completing a capstone data science project to present to our (hundreds of) hiring employers.  In fact, a major part of the fellowship application process is proposing that very capstone project, and many successful candidates have projects that are substantially far along, if not nearly complete.  Based on conversations with partners, here’s our sense of what makes a good project, ranked roughly in order of importance:

  1. Completion: While potential is important, projects are assessed primarily on the success of the analysis performed rather than the promise of future work.  Working in any industry is about getting things done quickly, not perfectly, and projects with many gaps, “I wish I had time for”s, or “future steps” suggest the applicant may not be able to get things done at work.
  2. Practicality: High-impact problems of general interest are more interesting than theoretical discussions of academic research problems. If you solve the problem, will anyone care? Identifying interesting problems is half the challenge, especially for candidates leaving academia, who must counter an inherent “academic” bias.
  3. Creativity: Employers are looking for creative, original thinkers who can either (1) identify new datasets or (2) find novel questions to ask about a dataset. Employers do not want to see the tenth generic presentation on Citibike (or Chicago Crime, Yelp Restaurant Ratings, NYC Restaurant Inspection, NYC Taxi, BTS Flight Delay, Amazon Review, or Zillow home price) data, or on beating the stock market. Similarly, projects that explain a non-obvious thesis supported by concise plots are more compelling than ones that present obvious conclusions (e.g. “more riders use Citibike during the day than at night”).  Remember — even a well-trodden dataset like Citibike can have novel conclusions, and even untapped data can have obvious ones.  While your project does not have to be completely original, you should Google around to see if your analysis has been done to death.  Employers are looking for data scientists who can find trends in the data that they do not know.
  4. Challenge data: Real-world data science is not about running a few machine learning algorithms on pre-cleaned, structured CSV files.  It’s often about munging, joining, and processing dirty, unstructured data.  Projects that use pre-cleaned datasets intended for machine learning (e.g. UCI or Kaggle data sets) are less impressive than projects that require pulling data from an API or scraping a webpage.
  5. Size: All things being equal, analysis of larger datasets is more impressive than analysis of smaller ones.  Real world problems often involve working on large, multi-gigabyte (or terabyte) datasets, which pose significantly more of an engineering challenge than working with small data.  Employers value people who have demonstrated experience working with large data.
  6. Engineering: All things being equal, candidates who can demonstrate the ability to use professional engineering tools like git and Heroku will be viewed more favorably. So much of data science is software engineering and savvy employers are looking for people who have the basic ability to do this. To get started, try following this git tutorial or these Heroku tutorials in your favorite language.  Put up your results on github or turn your presentation into a small Heroku app!

Obviously, no project will be perfect.  It’s hard to fulfill all of these criteria, and individual employers undoubtedly have other criteria that we have not mentioned or have a different prioritization.  But more often than not, the fellows who are hired first have projects that satisfy more of these criteria than not.  And lastly, if you’re looking for a data science job or to kick start your career as a data scientist, consider applying to our free eight-week fellowship.


Search results: Careers in high tech

I was recently interviewed for a piece for ScienceMag about careers in high tech. You can find the original post on ScienceMag.

With big data becoming increasingly popular and relevant, data scientist jobs are opening up across every industry in virtually every corner of the globe. Unfortunately, the multitude of available positions isn’t making it any easier to actually land a job as a data scientist. Competition is abundant, interviews can be lengthy and arduous, and good ideas aren’t enough to get yourself hired. Michael Li emphasizes that technical know-how is what hiring managers crave. “No one needs just an ‘ideas’ person. They need someone who can actually get the job done.”

 

This shouldn’t discourage anyone from pursuing a career in data science because it can be both rewarding and profitable. If you’re looking to brush up your skills and jump start your career, consider applying for our free data science fellowship with offerings now in San Francisco, New York, Washington DC, Boston, and Seattle. Learn more and apply on our website.

 


Tying Together Elegant Models: Alumni Spotlight on Brendan Keller

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Brendan was a Fellow in our Fall 2015 cohort who landed a job with one of our hiring partners, Jolata.

Tell us about your background. How did it set you up to be a great data scientist?

I did my PhD research in theoretical condensed matter physics at the University of California, Santa Barbara. The focus of my research was on studying the phase diagram of chains of non-abelian anyons. Because such chains are gapless in most regions of the phase diagram, we had to model them using very large matrices in C++. To make this computation more tractable we used hash tables and sparse matrices.  Besides my background in numerics, I also took the time to learn Python, Pandas, SQL, and MapReduce in Cloudera a few months before starting the fellowship.



Continuing Your Data Science Job Hunt Through the Holidays

The holidays can seem like a tough time to job search: people are out of the office and holiday schedules are hectic.  But there are lots of things you can do to take advantage of this time, keep your search moving forward, and set yourself up for post-holiday success.

  1. Review your skill set

Read through every job description you can, even the ones for jobs that didn’t originally interest you. Where are your skills gaps? If you see fluency in C++ in one or two job descriptions, but not on most, you might be okay not knowing it well. But if you see fluency in C++ listed over and over, the next few weeks are a great time for you to work on learning it.

  2. Take on a new project

One of the best things you can do to really master those new skills (and demonstrate your knowledge) is to apply them. We’ve been publishing links to lots of publicly available data sets on our blog. Take one and treat it as a case study: what problem might this company or organization have, and how can you use data science to solve it? You can add your work to your GitHub, blog about it, or share it on LinkedIn! These are publicly available data sets, so definitely show off your work.



SAS Pain Points

Having trouble with SAS?  Check out this handy video by our former fellow Paul Paczuski.
