Data are becoming the new raw material of business
The Economist

Alternative Data Sources: How to Improve Your Models

By Andres Gonzalez Casabianca


Picture this: You’ve been working hard on a project at work. You’ve run several algorithms, tuned the necessary hyperparameters, performed cross-validation, and exhausted the checks required to ensure you’re not overfitting. Yet the performance metric isn’t where you would like it to be or, worse, isn’t where the business needs it to be. You take a hard look at your data science pipeline and don’t see any room for improvement. What do you do? Go back to the source; specifically, go to an alternative source.


FinTechs working in the credit space differentiate themselves by their ability to muster alternative data sources and put them through their analytics pipeline. These companies aim to predict a person’s default probability, i.e., how likely that person is to fail to repay a loan. However, to gain a competitive advantage over the established household names (e.g., TransUnion, Equifax), they need to find uncharted information, clean it, and finally use it as input to their models.


Back in 2011, when social media was ramping up and people were creating their digital footprints, Jeff Stewart and Richard Eldridge founded Lenddo. This fast-growing FinTech gathers data from social networks with the user’s authorization and analyzes over 12,000 variables to create a score that represents the likelihood of default. For example, Lenddo looks at how and with whom social media users interact, and the quality of their connections. Without getting too deep into the role of privacy in data science or the details of cleaning and preprocessing, gathering this information is an excellent example of using alternative data sources, of the importance of data cleaning, and of optimization outside the traditional parameters.


Branch is another startup that is thinking outside the box. It operates mainly in Sub-Saharan Africa, focusing on financially underserved (and unserved) populations and using alternative data sources to predict the likelihood of default. Branch uses mobile data, ranging from cellphone battery charging patterns to SMS frequency and length, all gathered with the user’s consent. Branch cleans, crunches, and puts the information through its data science pipeline, transforming it into a credit score. This way, Branch has more input information for its machine learning algorithms and achieves superior results compared with its competitors.


Both FinTech companies mentioned above are built around financial prediction and data science, starting their pipelines by looking at unique, unmapped, and uncharted data. However, in the era of Big Data, these data sets come with their own challenges, so a mix of technical knowledge and business understanding is key: data scientists must see the numbers and the colors. Common problems include over- and under-representation, selection bias, uncleanable values, and unwieldy data, to name a few. These are the hidden costs of using alternative sources. Therefore, spend enough time understanding where the data comes from and what that information means (beyond the numbers). If data scientists put garbage in, they will get garbage out. Companies need to adjust the pipeline for these biases to avoid erroneous and unactionable conclusions.
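One way to make the representation check concrete is to compare each group’s share in the new data set against its share in the population you intend to score. The sketch below is illustrative only: the group labels, counts, and reference shares are hypothetical, and the reweighting shown is a simple post-stratification idea, not the method used by either company.

```python
from collections import Counter


def representation_ratios(sample_groups, population_shares):
    """Return sample share / population share for every group.

    A ratio above 1 means the group is over-represented in the sample;
    a ratio below 1 means it is under-represented.
    """
    counts = Counter(sample_groups)
    total = sum(counts.values())
    return {
        group: (counts.get(group, 0) / total) / share
        for group, share in population_shares.items()
    }


def reweight(ratios):
    """Per-group weights that pull the sample back toward the population."""
    # Groups absent from the sample (ratio 0) cannot be fixed by weighting.
    return {g: (1.0 / r if r > 0 else None) for g, r in ratios.items()}


# Hypothetical example: app users skew urban compared with the target market.
sample = ["urban"] * 80 + ["rural"] * 20
population = {"urban": 0.5, "rural": 0.5}

ratios = representation_ratios(sample, population)
weights = reweight(ratios)
print(ratios)
print(weights)
```

Weights like these only correct for groups you can observe and name; selection bias driven by unobserved factors needs separate treatment.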


Next time you feel like you have hit a plateau, take a few steps back and ask yourself: What alternative source can I add to the pipeline? Whether it comes from digital interactions, online preferences, or another innovative source, alternative data can help you improve the performance metric and set you apart from the competition. We saw how Lenddo and Branch use social networks and mobile patterns, respectively, to enhance their models and produce a novel credit score. It does not matter what industry you work in or what type of challenge you are tackling: when the performance metric is off target, go back and look for alternative data sources. There is always new and untapped data. Get creative, account for inherent biases in new data sets, and incorporate explainable metrics to evaluate your models.


What’s different about hiring data scientists in 2020?


By Michael Li

Originally published on TechCrunch 8/12/20. See the original post here.


It’s 2020 and the world has changed remarkably, including in how companies screen data science candidates. While many things have changed, there is one change that stands out above the rest. At The Data Incubator, we run a data science fellowship and are responsible for hundreds of data science hires each year. We have observed project-based data assessments go from a rare practice to being standard for over 80% of hiring companies. Many of the holdouts tend to be the largest (and traditionally most cautious) enterprises. At this point, they are at a serious competitive disadvantage in hiring.


Historically, data science hiring practices evolved from software engineering. A hallmark of software engineering interviewing is the dreaded brain teaser, with puzzles like “How many golf balls would fit inside a Boeing 747?” or “Implement the quick-sort algorithm on the whiteboard.” Candidates will study for weeks or months for these, and the hiring website Glassdoor has an entire section devoted to them. In data science, the traditional coding brain teaser has been supplemented with statistics ones as well, such as “What is the probability that the sum of two dice rolls is divisible by three?” Over the years, companies have started to realize that these brain teasers are not terribly effective and have begun cutting down their usage.


In their place, firms are focusing on project-based data assessments. These ask data science candidates to analyze real-world data provided by the company. Rather than having a single correct answer, project-based assessments are often more open-ended, encouraging exploration. Interviewees typically submit code and a write-up of their results. These have a number of advantages, both in terms of form and substance.


First, the environment for data assessments is far more realistic. Brain teasers unnecessarily put candidates on the spot or compel them to awkwardly code on a whiteboard. Because answers to brain teasers are readily Google-able, internet resources are off-limits. On the job, it is unlikely that you’ll be asked to code on a whiteboard or perform mental math with someone peering over your shoulder. It is inconceivable that you’ll be denied internet access during work hours. Data assessments also allow applicants to complete the assessment at a more realistic pace, using their favorite IDE or coding environment.


“Take-home challenges give you a chance to simulate how the candidate will perform on the job more realistically than with puzzle interview questions,” said Sean Gerrish, an engineering manager and author of “How Smart Machines Think.”


Second, the substance of data assessments is also more realistic. By design, brainteasers are tricky or test knowledge of well-known algorithms. In real life, one would never write these algorithms by hand (you would use one of the dozens of solutions freely available on the internet) and the problems encountered on the job are rarely tricky in the same way. By giving candidates real data they might work with and structuring the deliverable in line with how results are actually shared at the company, data projects are more closely aligned with actual job skills.


Jesse Anderson, an industry veteran and author of “Data Teams,” is a big fan of data assessments: “It’s a mutually beneficial setup. Interviewees are given a fighting chance that mimics the real-world. Managers get closer to an on-the-job look at a candidate’s work and abilities.” Project-based assessments have the added benefit of assessing written communication strength, an increasingly important skill in the work-from-home world of COVID-19.


Finally, written technical project work can help avoid bias by de-emphasizing traditional but prejudicially fraught aspects of the hiring process. Resumes with Hispanic and African American names receive fewer callbacks than the same resume with white names. In response, minority candidates deliberately “whiten” their resumes to compensate. In-person interviews often rely on similarly problematic gut feel. By emphasizing an assessment closely tied to job performance, interviewers can focus their energies on actual qualifications, rather than relying on potentially biased “instincts.” Companies looking to embrace #BLM and #MeToo beyond hashtagging may consider how tweaking their hiring processes can lead to greater equality.


The exact form of data assessments varies. At The Data Incubator, we found that over 60% of firms provide take-home data assessments. These best simulate the actual work environment, allowing the candidate to work from home (typically) over the course of a few days. Another roughly 20% require interview data projects, where candidates analyze data as a part of the interview process. While candidates face more time pressure from these, they also do not feel the pressure to ceaselessly work on the assessment. “Take-home challenges take a lot of time,” explains Field Cady, an experienced data scientist and author of “The Data Science Handbook.” “This is a big chore for candidates and can be unfair (for example) to people with family commitments who can’t afford to spend many evening hours on the challenge.”


To reduce the number of custom data projects, smart candidates are preemptively building their own portfolio projects to showcase their skills and companies are increasingly accepting these in lieu of custom work.


Companies relying on old-fashioned brainteasers are a vanishing breed. Of the recalcitrant 20% of employers still sticking with brainteasers, most are the larger, more established enterprises that are usually slower to adapt to change. They need to realize that the antiquated hiring process doesn’t just look quaint, it’s actively driving candidates away. At a recent virtual conference, one of my fellow panelists was a data science new hire who explained that he had turned down opportunities based on the firm’s poor screening process.


How strong can the team be if the hiring process is so outmoded? This sentiment is also widely shared by the Ph.D.s completing The Data Incubator’s data science fellowship. Companies that fail to embrace the new reality are losing the battle for top talent.


AI at the Far Edge


by Alexander Sack

A Smarter Edge

The concept of “edge computing” has been around since the late 90s, and typically refers to systems that process data where it is collected instead of having to both store and push it to a centralized location for off-line processing. The aim is to move computation away from the data center in order to facilitate real-time analytics and reduce network and response latency. But some applications, particularly those that leverage deep learning, have historically been very difficult to deploy at the edge, where power and compute are typically extremely limited. The problem has become particularly acute over the past few years as recent breakthroughs in deep learning have featured networks with a lot more depth and complexity, which thus require greater compute from the platforms they run on. But recent developments in the embedded hardware space have bridged that gap to a certain extent and enable AI to run fully on the edge, ushering in a whole new wave of applications. And new data scientists and machine learning engineers entering the field will need to know how to leverage these platforms to build the next generation of truly “smart” devices.


Enter Deep IoT

Historically, edge computing devices, particularly IoT devices, have relied on the cloud for most of their compute: the cloud effectively becomes a device’s main processing engine. For example, Amazon’s Alexa doesn’t perform voice recognition on device but rather records voice and then sends it off to AWS for post-processing to create an actionable response. However, this comes with a trade-off, since real-time response isn’t possible when the network connection is either too slow to shuffle data back and forth between device and cloud or simply doesn’t exist, both typical deployment scenarios for a lot of edge computing applications.

Yet the industry has seen an increasing demand to perform inference on the edge, as many applications require applying sophisticated, computationally intensive deep learning and machine learning algorithms in real time. For example, technologies like facial and voice recognition, as well as self-driving cars, all require data to be processed (potentially by several models) as it is being collected.


NVIDIA Jetson Nano

Use cases like these have cultivated a cottage industry of solutions, all tailored to various machine learning and deep learning workloads. In fact, many of the leaders in this space (Google, NVIDIA, and Intel, to name a few) offer complete embedded platforms in a myriad of form factors that allow data scientists and machine learning engineers to build “smart” edge-ready devices at relatively low cost.

Though each of these platforms has different trade-offs, what they all have in common is an embedded accelerator (GPU or VPU) that offloads work from the CPU for real-time processing. Some also include one or more on-chip hardware video encoders/decoders that are directly linked to the embedded GPU, enabling a complete end-to-end accelerated pipeline. For example, in computer vision applications, high-definition or even ultra-high-definition video can be recorded, decoded, and run through inference, all in hardware at extremely fast frame rates.


Modeling for the Edge

Although “smart” edge computing devices offer more compute power, data scientists and machine learning engineers still need to optimize their models to make efficient use of that power. Techniques such as quantization and model pruning, as well as an understanding of how different metrics affect a model’s overall performance, are all key to building robust solutions on the edge. Luckily, most of these platforms support popular frameworks like TensorFlow and PyTorch and ship with pre-optimized models that can be deployed right out of the box for rapid prototyping.
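To give a feel for what quantization and pruning do to a model’s weights, here is a minimal, framework-free sketch. It assumes a single symmetric scale for the whole weight list and simple magnitude-based pruning; the function names and weight values are made up for illustration, and production toolchains in TensorFlow or PyTorch are considerably more sophisticated.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; guard against an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the 8-bit integers back to approximate float values."""
    return [q * scale for q in quantized]


def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * fraction)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


weights = [0.02, -0.75, 0.31, 1.27, -0.004, 0.58]  # made-up values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
pruned = prune_by_magnitude(weights, 0.5)
print(q, scale, max_error)
print(pruned)
```

The quantized list needs one byte per weight instead of four or eight, at the cost of a rounding error bounded by half the scale; pruning trades accuracy for sparsity in a similar spirit.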


Alexander Sack is a machine learning engineer at Corteva Agriscience and a former TDI Fellow (Winter 2019). Before falling for the gradient descent trap, he worked as a Principal Software Engineer specializing in systems programming and operating systems. He holds a Bachelor’s and a Master’s in Computer Science from Stevens Institute of Technology with high honors. In his free time, he listens to a lot of heavy metal.


The 3 Things That Make Technical Training Worthwhile

Managers understand that having employees who understand the latest tools and technologies is vital to keeping a company competitive. But training employees on those tools and technologies can be a costly endeavor (US corporations spent $87.6 billion on training expenditures in 2018) and too often training simply doesn’t achieve the objective of giving employees the skills they need.

At The Data Incubator, we work with hundreds of clients who hire PhD data scientists from our Fellowship program or enroll their employees in our big data corporate training. We’ve found in our work with these companies across industries that technical training often lacks three important things: hands-on practice, accountability, and breathing room.

Automating Excel with Python

We know there are a lot of pain points in Excel that make it a cumbersome and repetitive tool for data manipulation. We’ve distilled those pain points into three major themes.

  • The first is that it’s awkward to deal with higher-dimensional data in a two-dimensional spreadsheet. Spreadsheets are great for representing two-dimensional data, but they’re awkward for representing anything with three or more dimensions. And while there are many workarounds like pivot tables, these will only get you so far.
  • The second pain point revolves around doing the same calculation over multiple sheets or multiple workbooks. While it’s easy to iterate over rows or columns in Excel, it’s cumbersome and time-consuming to iterate over hundreds of sheets or workbooks.
  • Finally, data manipulation in Excel is very manual and hence very error-prone. In Excel, the convention is to copy data or formulas from cell to cell, but this makes it hard to keep our data up to date as new data arrives or as our computations become more complex. Errors aren’t always easy to catch before important business decisions are made.

In this video, we look at some data that we can get from the Bureau of Labor Statistics. While it comes in Excel, it comes in a very particular format. The rows iterate through years, the columns iterate through months, the sheets iterate through industries, and the workbooks iterate through wages, hours worked, and overtime. So how would you use this to calculate salary, which we’ll define as wage times the quantity hours worked plus 1.5 times the overtime?
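As a rough sketch of that calculation, the snippet below applies salary = wage × (hours worked + 1.5 × overtime) cell by cell. It assumes the three workbooks have already been loaded into nested dictionaries keyed by industry (sheet), year (row), and month (column); the `salary_table` helper and all of the numbers are placeholders, not real Bureau of Labor Statistics figures or the code used in the video.

```python
def salary_table(wages, hours, overtime):
    """Combine three {industry: {year: {month: value}}} mappings.

    salary = wage * (hours + 1.5 * overtime), computed cell by cell.
    """
    result = {}
    for industry, years in wages.items():
        for year, months in years.items():
            for month, wage in months.items():
                h = hours[industry][year][month]
                ot = overtime[industry][year][month]
                result[(industry, year, month)] = wage * (h + 1.5 * ot)
    return result


# Placeholder numbers standing in for the three workbooks.
wages = {"Mining": {2015: {"Jan": 30.0, "Feb": 31.0}}}
hours = {"Mining": {2015: {"Jan": 40.0, "Feb": 42.0}}}
overtime = {"Mining": {2015: {"Jan": 4.0, "Feb": 2.0}}}

salaries = salary_table(wages, hours, overtime)
print(salaries[("Mining", 2015, "Jan")])
```

The same three nested loops cover every sheet in every workbook, which is exactly the iteration that is tedious to do by hand in a spreadsheet.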

Advanced Conda: Installing, Building, and Uploading Packages, Adding Channels, and More

As many of our readers might know, conda is a package manager for the numerical Python stack that solves many of the issues where pip falls short. While pip is great for pure Python packages (ones written exclusively in Python code), most data science packages need to rely on C code for performance. This, unfortunately, makes installation highly system-dependent. Conda alleviates this problem by managing compiled binaries so that a user does not need to have a full suite of local build tools (for example, building NumPy from source no longer requires a FORTRAN 77 compiler). Additionally, conda is moving to include more than Python. For example, it also supports managing packages in the R language. With such an ambitious scope, it’s not surprising that package coverage is incomplete and, if you’re a power user, you’ll often find yourself wanting to contribute missing packages. Here’s a primer on how:


Installing Using Pip / PyPI in Conda as a Fallback

The first thing to notice is that you don’t necessarily need to jump to building or uploading packages. As a simple fallback, you can tell conda to install Python packages directly from pip/PyPI. For example, take a look at this simple conda environment.yaml file:

# environment.yaml
dependencies:
  - numpy
  - scipy
  - pip
  - pip:
    - requests

This installs numpy and scipy from anaconda but installs requests using pip. You can invoke it by running:

conda env update -f environment.yaml


Adding New Channels to Conda for more Packages

While the core maintainers have fairly good coverage, the coverage isn’t complete. Also, because conda packages are version- and system-dependent, the odds of a specific version not existing for your operating system are fairly high. For example, as of this writing, scrapy, a popular web scraping package, lives in the popular conda-forge channel. You can find the channel that supports a package by Googling, for example, “scrapy conda”. Similarly, many R packages are under the r channel, for example the r-dplyr package. R packages are, by convention, given an “r-” prefix in front of their CRAN name. To install scrapy, we’ll need to add conda-forge as a channel:

# environment.yaml
channels:
  - conda-forge
dependencies:
  - scrapy


Building PyPI Conda Packages

But sometimes, even with extra channels, the packages simply don’t exist (as evidenced by a Google query). In this case, we just have to build our own packages. It’s fairly easy. The first step is to create an Anaconda user account; the username will be your channel name.

Next, you’ll want to make sure you have conda build and anaconda client installed:

conda install conda-build
conda install anaconda-client

and you’ll want to authenticate your anaconda client with your new credentials:

anaconda login

Finally, it’s easiest to configure conda to automatically upload all successful builds to your channel:

conda config --set anaconda_upload yes

With this setup, it’s easy! Below are the instructions for uploading the pyinstrument package:

# download pypi data to ./pyinstrument
conda skeleton pypi pyinstrument
# build the package
conda build ./pyinstrument
# you'll see that the package is automatically uploaded


Building R Conda Packages

Finally, we’ll want to do this with R packages. Fortunately, there’s conda support for building from CRAN, R’s package repository. For example, glm2 is (surprisingly enough) not on anaconda. We’ll run the following commands to build and auto-upload it:

conda skeleton cran glm2
conda build ./r-glm2

Of course, glm2 now has OSX and Linux packages on The Data Incubator’s Anaconda channel as r-glm2, so you can directly include our channel:

# environment.yaml
channels:
  - thedataincubator
dependencies:
  - r-glm2




A Study Of Reddit Politics

This article was written for The Data Incubator by Jay Kaiser, a Fellow of our 2018 Winter cohort in Washington, DC who landed a job with our hiring partner, ZeniMax Online Studios, as a Big Data Engineer.


The Question

The 2016 Presidential Election was, in a single word, weird. So much happened during the months leading up to November that it became difficult to keep track of who said what, when, and why. However, the finale of the election, which culminated in Republican candidate Donald J. Trump winning the majority of the Electoral College and hence becoming the 45th President of the United States, was an outcome that at the time I had thought impossible, if solely due to the aforementioned eccentric series of events that had surrounded Trump for the majority of his candidacy.

Following the election, the prominent question that could not leave my mind was a simple one: how? How had the American people changed so much in only a couple of years to allow an outsider hit by a number of black marks during the election to be elected to the highest position in the United States government? How did so many pollsters and political scientists fail to predict this outcome? How can we best analyze the campaigns of each candidate, now given hindsight and knowledge of the eventual outcome? In an attempt to answer each of these, I have turned to a perhaps unlikely source.


SQLite vs Pandas: Performance Benchmarks

This technical article was written for The Data Incubator by Paul Paczuski, a Fellow of our 2016 Spring cohort in New York City who landed a job with our hiring partner, Genentech as a Clinical Data Scientist.

As data scientists, we all know that unglamorous data manipulation is 90% of the work. Two of the most common data manipulation tools are SQL and pandas. In this post, we’ll compare the performance of pandas and SQLite, a lightweight SQL database engine favored by data scientists.

Let’s find out the tasks at which each of these excels. Below, we compare Python’s pandas to SQLite for some common data analysis operations: sort, select, load, join, filter, and group by.
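To show the general shape of such a benchmark, here is a minimal sketch that times a single group-by aggregation. It is not the article’s benchmark code: to stay dependency-free it uses only the standard library, so a plain-Python aggregation stands in for the pandas side, and the table name, column names, row count, and helper functions are all assumptions made for illustration.

```python
import random
import sqlite3
import time
from collections import defaultdict


def make_rows(n, seed=0):
    """Generate n (group, value) rows with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return [(rng.randrange(10), rng.random()) for _ in range(n)]


def sqlite_group_sum(rows):
    """Load rows into an in-memory SQLite table and sum value per group."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (grp INTEGER, val REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = dict(conn.execute("SELECT grp, SUM(val) FROM t GROUP BY grp"))
    conn.close()
    return result


def python_group_sum(rows):
    """Plain-Python stand-in for the in-memory (pandas-style) side."""
    totals = defaultdict(float)
    for grp, val in rows:
        totals[grp] += val
    return dict(totals)


def timed(func, rows):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func(rows)
    return result, time.perf_counter() - start


rows = make_rows(50_000)
sql_result, sql_seconds = timed(sqlite_group_sum, rows)
py_result, py_seconds = timed(python_group_sum, rows)
print(f"sqlite: {sql_seconds:.4f}s, python: {py_seconds:.4f}s")
```

Note that the SQLite timing here includes loading the table, so a fair comparison would time load and query separately and repeat each measurement several times.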


GPU Cloud Computing Services Compared: AWS, Google Cloud, IBM Nimbix/Power AI, and Crestle

This technical article was written for The Data Incubator by Tim Pollio, a Fellow of our 2017 Fall cohort in Washington, DC who joined The Data Incubator team as one of our resident Data Scientist instructors.

At The Data Incubator, a data science training and placement company, we’re excited about the potential for neural networks and deep learning to transform AI and Big Data. Of course, to practically run deep learning, normal CPUs won’t suffice — you’ll need GPUs. GPUs can dramatically increase the speed of deep learning algorithms, so it’s no surprise that they’re becoming increasingly popular and accessible. Amazon, Google, and IBM all offer GPU enabled options with their cloud computing services, and newer companies like Crestle provide additional options.

We tried four different services (Amazon Web Services, Google Cloud Platform, Nimbix/PowerAI, and Crestle) to find the options with the best performance, price, and convenience. Each service was tested using the same task: 1000 training steps on a TensorFlow neural network designed for text prediction. The code for this benchmark can be found here.


Python Multi-Threading vs Multi-Processing

There is a library called threading in Python, and it uses threads (rather than just processes) to run tasks concurrently. This may be surprising news if you know about Python’s Global Interpreter Lock, or GIL, but it actually works well for certain cases without violating the GIL. And this is all done with little extra effort: simply define functions that make I/O requests and the system will handle the rest.


Global Interpreter Lock

The Global Interpreter Lock reduces the usefulness of threads in Python (more precisely, CPython) by allowing only one native thread to execute at a time. This made Python easier to implement on top of (usually thread-unsafe) C libraries and can increase the execution speed of single-threaded programs. However, it remains controversial because it prevents true lightweight parallelism. You can achieve parallelism, but it requires using multi-processing, which is implemented by the eponymous library multiprocessing. Instead of spinning up threads, this library uses processes, which bypasses the GIL.

It may appear that the GIL would kill Python multithreading, but that is not quite the case. In general, there are two main use cases for multithreading:

  1. To take advantage of multiple cores on a single machine
  2. To take advantage of I/O latency to process other threads

In general, we cannot benefit from (1) with threading but we can benefit from (2).
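Use case (2) can be illustrated with a small sketch. It uses `concurrent.futures.ThreadPoolExecutor`, a standard-library wrapper built on top of `threading`, and simulates I/O with `time.sleep`, which releases the GIL while waiting. The `fake_io_request` function and the delay values are stand-ins chosen for illustration; a real program would make network or disk calls instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_request(delay):
    """Stand-in for a network or disk call; sleeping releases the GIL."""
    time.sleep(delay)
    return delay


def run_sequential(delays):
    """Issue the requests one after another and time the whole batch."""
    start = time.perf_counter()
    results = [fake_io_request(d) for d in delays]
    return results, time.perf_counter() - start


def run_threaded(delays):
    """Issue the requests from a thread pool so the waits overlap."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        results = list(pool.map(fake_io_request, delays))
    return results, time.perf_counter() - start


delays = [0.2] * 4
seq_results, seq_seconds = run_sequential(delays)
thr_results, thr_seconds = run_threaded(delays)
print(f"sequential: {seq_seconds:.2f}s, threaded: {thr_seconds:.2f}s")
```

Because the threads spend their time waiting rather than executing Python bytecode, the waits overlap and the threaded run should finish in roughly the time of a single request. A CPU-bound function would see no such gain from threads and would need multiprocessing instead.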
