Data are becoming the new raw material of business
The Economist

Python Multi-Threading vs Multi-Processing

Python’s threading library uses threads (rather than processes) to implement parallelism. This may be surprising news if you know about Python’s Global Interpreter Lock, or GIL, but threading actually works well for certain workloads without violating the GIL. And it requires very little overhead: simply define functions that make I/O requests and the library will handle the rest.


Global Interpreter Lock

The Global Interpreter Lock reduces the usefulness of threads in Python (more precisely, CPython) by allowing only one native thread to execute Python bytecode at a time. This made CPython easier to integrate with the (usually thread-unsafe) C libraries it relies on and can increase the execution speed of single-threaded programs. However, it remains controversial because it prevents true lightweight parallelism. You can still achieve parallelism, but it requires multi-processing, implemented by the eponymous multiprocessing library. Instead of spinning up threads, this library uses processes, which bypass the GIL.

It may appear that the GIL would kill Python multithreading but not quite. In general, there are two main use cases for multithreading:

  1. To take advantage of multiple cores on a single machine
  2. To take advantage of I/O latency by running other threads while some block on I/O

In general, we cannot benefit from (1) with threading but we can benefit from (2).



The general threading library is fairly low-level, but it turns out that multiprocessing wraps it in multiprocessing.pool.ThreadPool, which conveniently exposes the same interface as multiprocessing.pool.Pool.
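As a rough sketch of what that looks like in practice (the fetch function and URL list here are placeholders, not the benchmark code from this post):

```python
from multiprocessing.pool import ThreadPool

def fetch(url):
    # Placeholder for an I/O-bound task such as requests.get(url).
    return len(url)

urls = ["https://en.wikipedia.org/wiki/Special:Random"] * 10

# ThreadPool exposes the same map/apply interface as Pool, so
# switching between threads and processes is a one-line change.
with ThreadPool(10) as pool:
    results = pool.map(fetch, urls)
```

Swapping ThreadPool(10) for multiprocessing.Pool(10) (with the usual if \_\_name\_\_ == '\_\_main\_\_' guard) would run the same workload in processes instead.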

One benefit of using threading is that it avoids pickling. Multi-processing relies on pickling objects in memory to send them to other processes. For example, if the timed decorator did not wrap the wrapper function it returned, then CPython would not be able to pickle our functions request_func and selenium_func, and hence these could not be multi-processed. In contrast, threading, even through multiprocessing.pool.ThreadPool, works just fine. Multiprocessing also requires more RAM and startup overhead.
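A minimal sketch of the pickling issue (this timed decorator is our guess at its shape, not the post’s actual code; the key line is functools.wraps):

```python
import functools
import pickle
import time

def timed(func):
    # Without functools.wraps, the wrapper's __qualname__ would be
    # "timed.<locals>.wrapper", pickle could not look it up by name,
    # and multiprocessing (which pickles the function) would fail.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        return result, time.time() - start
    return wrapper

@timed
def request_func(url):
    return url.upper()  # stand-in for a real HTTP request

# Pickling succeeds because wraps copied request_func's name and
# module onto the wrapper, so pickle can find it at module level.
pickle.dumps(request_func)
```

A ThreadPool never pickles the function at all, which is why threading sidesteps this entire class of problem.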



We analyze the highly I/O-dependent task of making 100 URL requests for random Wikipedia pages. We compare:

  1. The Python requests module, and
  2. Python selenium with PhantomJS.

We run each of these requests in three ways and measure the time required for each fetch:

  1. In serial
  2. In parallel in a threading pool with 10 threads
  3. In parallel in a multiprocessing pool with 10 processes

Each request is timed and we compare the results.
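The timing harness can be sketched roughly as follows, with time.sleep standing in for the real requests/selenium fetches (so any numbers it prints are illustrative, not the benchmark’s):

```python
import time
from multiprocessing.pool import ThreadPool

def fetch(url):
    # Stand-in for requests.get(url) or a selenium page load.
    time.sleep(0.01)
    return url

urls = ["https://en.wikipedia.org/wiki/Special:Random"] * 20

# 1. Serial
start = time.time()
serial = [fetch(u) for u in urls]
serial_time = time.time() - start

# 2. Thread pool with 10 threads
start = time.time()
with ThreadPool(10) as threaded_pool:
    threaded = threaded_pool.map(fetch, urls)
threaded_time = time.time() - start

print(f"serial: {serial_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The multiprocessing variant is identical except that ThreadPool becomes multiprocessing.Pool, guarded by if \_\_name\_\_ == '\_\_main\_\_'.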



Firstly, the per-request running time for requests is obviously lower than for selenium, as the latter requires spinning up a new process to run a PhantomJS headless browser. It’s also interesting to notice that individual requests (particularly selenium requests) run faster in serial than in parallel, which is the typical bandwidth vs. latency tradeoff. In particular, parallel selenium requests are more than twice as slow per request, probably because of resource contention with 10 selenium processes spinning up at once.

Likewise, the overall workload completes roughly 4 times faster for selenium and roughly 8 times faster for the requests module when multithreaded compared with serial.

There was no significant performance difference between using threading vs multiprocessing.



The performance of multithreading and multiprocessing is extremely similar, and the exact performance details are likely to depend on your specific application. Threading through multiprocessing.pool.ThreadPool is really easy, or at least as easy as using the multiprocessing.pool.Pool interface: simply define your I/O workloads as a function and use the ThreadPool to run them in parallel.


Improvements welcome! Please submit a pull request on our GitHub.

Data Scientist Salaries

At The Data Incubator we’ve worked with hundreds of Fellows looking to enter industry and our alumni work at companies including LinkedIn, Palantir, Amazon, Capital One, and the NYTimes.

Starting salary is one of the most common concerns for professionals entering any field, but as the job title “Data Scientist” has only been in use for about eight years, it can be particularly challenging for prospective data scientists to find good information on their job market. LinkedIn and Facebook were the first to give employees on their data teams the title of data scientist, but now there are thousands of data scientists working across all industries alongside data engineers, data analysts, and quantitative analysts.

Continue reading

4 Data Science Projects That We Can’t Get Enough Of

At The Data Incubator we run a free advanced 8-week fellowship for PhDs looking to enter the industry as data scientists.

As part of the application process, we ask potential fellows to propose and begin working on a data science project to highlight their skills to employers.  Regardless of whether you’re selected to be a fellow, this project will be instrumental in attracting employer interest and highlighting your skills.  Here are some projects that we would love to see, and that we hope to see you take on as well.


Multi-Axial Political Analysis  

We often think of American politics in terms of a single axis: left versus right, democrat versus republican.  In reality, the parties are composed of varying factions with different identities and political priorities and American politics is actually broken along multiple axes: foreign policy, social issues, regulation, social spending, education, second amendment, just to name a few.  Continue reading

JUST Capital and The Data Incubator Challenge

Data Science For Social Good (1)


Today, we’re excited to announce that we’re teaming up with JUST Capital to help crowd-source data science for social good.  The Data Incubator offers a free eight-week data science fellowship for those with a PhD or a masters degree looking to transition into data science.  As part of the application process, students are asked to submit a data science capstone project, and the best students are invited to work on them during the fellowship.  JUST Capital is helping provide data and project prompts to harness the collective brainpower of The Data Incubator fellows to solve these high-impact social problems.

  • These projects focus on applied data science techniques with tangible impacts on JUST Capital’s mission.
  • The projects are open ended, and creativity is encouraged. The documents provided below are suitable for analysis, but one should not shy away from seeking out additional sources of data.

JUST Capital is a nonprofit that provides information and rankings on how large corporations perform on issues that matter most to the public. We give individuals a voice on what really matters to them, and evaluate how companies perform on those issues. By providing the right knowledge and making it easy to access and understand, we believe capital will flow to corporations that are more JUST, ultimately leading to a balanced business world that takes into account human needs that are so often neglected today. The meaning of JUST is defined by the American public as fair, equitable and balanced. In 2016, JUST Capital surveyed nearly 4,000 Americans from all regions and walks of life, in its second annual Poll on Corporate America. The issues identified by the public form the basis of our benchmark — it is against these Drivers and Components that we measure corporate performance. The most important factors broadly relate to employees, customers, company leadership, the environment, communities and investors.

Continue reading

What is the probability of winning the Hamilton lottery?

People interested in seeing the Broadway musical Hamilton — and there are still many of them, with demand driving starting ticket prices to $\$600$ — can enter Broadway Direct’s daily lottery. Winners can receive up to 2 tickets (out of 21 available tickets) for a total of $\$10$.

What’s the probability of winning?

How easy is it to win these coveted tickets? Members of NYC’s Data Incubator Team have collectively tried and failed 120 times. Given our data, we cannot simply divide the number of successes by the number of trials to calculate our chances of winning — we would get zero (and the odds, which are apparently small, are clearly non-zero).

This kind of situation often comes up under many guises in business and big data, and because we are a data science corporate training company, we decided to use statistics to determine the answer. Say you are measuring the click-through-rate of a piece of organic or paid content, and out of 100 impressions, you have not observed any clicks. The measured CTR is zero but the true CTR is likely not zero. Alternatively, suppose you are measuring the rate of adverse side effects of a new drug. You have tested 40 patients and haven’t found any, but you know the chance is unlikely to be zero. So what are the odds of observing a click or a side effect?  Continue reading
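Two standard back-of-the-envelope answers to that question (the full post’s approach is behind the link; these are just the textbook estimates for zero successes in n trials):

```python
# 120 lottery entries, zero wins.
n = 120
successes = 0

# Rule of three: an approximate 95% upper bound on the true rate
# when no successes have been observed in n trials.
upper_95 = 3 / n

# Laplace (add-one) smoothing: a Bayesian point estimate under a
# uniform prior, (successes + 1) / (trials + 2).
laplace = (successes + 1) / (n + 2)

print(f"95% upper bound: {upper_95:.4f}, Laplace estimate: {laplace:.4f}")
```

Both give small but decidedly non-zero estimates, which matches the intuition that 120 losses do not prove the odds are zero.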

The Iraq War by the Numbers: Extracting the Conflicts’ Staggering Costs

One of our fellows recently had a piece published about her unique and timely capstone project. The original piece is posted on Data Driven Journalism.

In her own words:

This war is not only important due to its staggering costs (both human and financial) but also on account of its publicly available and well-documented daily records from 2004 to 2010.

These documents provide a very high spatial and temporal resolution view of the conflict. For example, I extracted from these government memos the number of violent events per day in each county. Then, using latent factor analysis techniques, e.g. non-negative matrix factorization, I was able to cluster the top three principal war zones. Interestingly these principal conflict zones were areas populated by the three main ethno-religious groups in Iraq.

You can watch her explain it herself:


Editor’s Note: The Data Incubator is a data science education company.  We offer a free eight-week fellowship helping candidates with PhDs and masters degrees enter data science careers.  Companies can hire talented data scientists or enroll employees in our data science corporate training.

Polling and big data in the age of Trump, Brexit, and the Colombian Referendum

Our founder, Michael Li, recently collaborated with his colleague Raymond Perkins, a researcher and PhD candidate at Princeton University, on this piece about big data and polling. You can find the original article at Data Driven Journalism.

The recent presidential inauguration and the notably momentous election that preceded it have brought about numerous discussions surrounding the accuracy of polling and big data. The US election results, paired with those of Brexit and the Colombian Referendum, have left a number of people scratching their heads in confusion. Statisticians, however, understand the multitude of sampling biases and statistical errors that can ensue when your data involves human beings.

“Though big data has the potential to virtually eliminate statistical error, it unfortunately provides no protection against sampling bias and, as we’ve seen, may even compound the problem. This is not to say big data has no place in modern polling, in fact it may provide alternative means to predict election results. However, as we move forward we must consider the limitations of big data and our overconfidence in it as a polling panacea.”

At The Data Incubator, this central misconception about big data is one of the core lessons we try to impart on our students. Apply to be a Fellow today!



How Employers Judge Data Science Projects

One of the more commonly used screening devices for data science is the portfolio project.  Applicants apply with a project showcasing a piece of data science that they’ve accomplished.  At The Data Incubator, we run a free eight-week fellowship helping train and transition people with masters and PhD degrees for careers in data science.  One of the key components of the program is completing a capstone data science project to present to our (hundreds of) hiring employers.  In fact, a major part of the fellowship application process is proposing that very capstone project, with many successful candidates having projects that are substantially far along, if not nearly completed.  Based on conversations with partners, here’s our sense of priorities for what makes a good project, ranked roughly in order of importance:

  1. Completion: While their potential is important, projects are assessed primarily on the success of the analysis performed rather than the promise of future work.  Working in any industry is about getting things done quickly, not perfectly, and projects with many gaps, “I wish I had time for”s, or “future steps” suggest the applicant may not be able to get things done at work.
  2. Practicality: High-impact problems of general interest are more interesting than theoretical discussions on academic research problems. If you solve the problem, will anyone care? Identifying interesting problems is half the challenge, especially for candidates leaving academia, who must overcome a perceived “academic” bias.
  3. Creativity: Employers are looking for creative, original thinkers who can either (1) identify new datasets or (2) find novel questions to ask about a dataset. Employers do not want to see the tenth generic presentation on Citibike data (or Chicago Crime, Yelp Restaurant Ratings, NYC Restaurant Inspection, NYC Taxi, BTS Flight Delay, Amazon Review, Zillow home price, World Bank or other macroeconomic data, or beating the stock market). Similarly, projects that explain a non-obvious thesis supported by concise plots are more compelling than ones that present obvious conclusions (e.g. “more riders use Citibike during the day than at night”). Employers are looking for data scientists who can find trends in the data that they don’t already know. Continue reading

Search results: Careers in high tech

I was recently interviewed for a piece for ScienceMag about careers in high tech. You can find the original post on ScienceMag.

With big data becoming increasingly popular and relevant, data scientist jobs are opening up across every industry in virtually every corner of the globe. Unfortunately, the multitude of available positions isn’t making it any easier to actually land a job as a data scientist. Competition is fierce, interviews can be lengthy and arduous, and good ideas aren’t enough to get yourself hired. Michael Li emphasizes that technical know-how is what hiring managers crave. “No one needs just an ‘ideas’ person. They need someone who can actually get the job done.”


This shouldn’t discourage anyone from pursuing a career in data science because it can be both rewarding and profitable. If you’re looking to brush up your skills and jump start your career, consider applying for our free data science fellowship with offerings now in San Francisco, New York, Washington DC, Boston, and Seattle. Learn more and apply on our website.


Continuing Your Data Science Job Hunt Through the Holidays

The holidays can seem like a tough time to job search: people are out of the office and holiday schedules are hectic.  But there are lots of things you can do to take advantage of this time, keep your search moving forward, and set yourself up for post-holiday success.

  1. Review your skill set

Read through every job description you can, even the ones for jobs that didn’t originally interest you. Where are your skills gaps? If you see fluency in C++ in one or two job descriptions, but not on most, you might be okay not knowing it well. But if you see fluency in C++ listed over and over, the next few weeks are a great time for you to work on learning it.

  2. Take on a new project

One of the best things you can do to really master those new skills (and demonstrate your knowledge) is to apply them. We’ve been publishing links to lots of publicly available data sets on our blog. Take one and treat it as a case study: what problem might this company or organization have, and how can you use data science to solve it? You can add your work to your GitHub, blog about it, or share it on LinkedIn! These are publicly available data sets, so definitely show off your work.

Continue reading