Data are becoming the new raw material of business
The Economist

JUST Capital and The Data Incubator Challenge

Today, we’re excited to announce that we’re teaming up with JUST Capital to help crowd-source data science for social good. The Data Incubator offers a free eight-week data science fellowship for those with a PhD or a master’s degree looking to transition into data science. As a part of the application process, students are asked to submit a data science capstone project, and the best students are invited to work on them during the fellowship. JUST Capital is helping provide data and project prompts to harness the collective brainpower of The Data Incubator fellows to solve high-impact social problems.

  • These projects focus on applied data science techniques with tangible impacts on JUST Capital’s mission.
  • The projects are open ended and creativity is encouraged. The documents provided below are suitable for analysis, but one should not shy away from seeking out additional sources of data.

JUST Capital is a nonprofit that provides information and rankings on how large corporations perform on issues that matter most to the public. We give individuals a voice on what really matters to them, and evaluate how companies perform on those issues. By providing the right knowledge and making it easy to access and understand, we believe capital will flow to corporations that are more JUST, ultimately leading to a balanced business world that takes into account human needs that are so often neglected today. The meaning of JUST is defined by the American public as fair, equitable and balanced. In 2016, JUST Capital surveyed nearly 4,000 Americans from all regions and walks of life, in its second annual Poll on Corporate America. The issues identified by the public form the basis of our benchmark — it is against these Drivers and Components that we measure corporate performance. The most important factors broadly relate to employees, customers, company leadership, the environment, communities and investors.

Continue reading


Spark 2.0 on Jupyter with Toree

Spark

Spark is one of the most popular open-source distributed computation engines and offers a scalable, flexible framework for processing huge amounts of data efficiently. The recent 2.0 release milestone brought a number of significant improvements including DataSets, an improved version of DataFrames, more support for SparkR, and a lot more. One of the great things about Spark is that it’s relatively autonomous and doesn’t require a lot of extra infrastructure to work. While Spark’s latest release is at 2.1.0 at the time of publishing, we’ll use the example of 2.0.1 throughout this post.

Jupyter

Jupyter notebooks are an interactive way to code that can enable rapid prototyping and exploration. Jupyter essentially connects a browser-based frontend, the Jupyter Server, to an interactive REPL underneath that can process snippets of code. The advantage to the user is being able to write code in small chunks that can be run independently but share the same namespace, which greatly facilitates testing or trying multiple approaches in a modular fashion. The platform supports a number of kernels (the things that actually run the code) besides the out-of-the-box Python, but connecting Jupyter to Spark is a little trickier. Enter Apache Toree, a project meant to solve this problem by acting as a middleman between a running Spark cluster and other applications.

In this post I’ll describe how we go from a clean Ubuntu installation to being able to run Spark 2.0 code on Jupyter. Continue reading


Making the Switch from Management Consulting: Alumni Spotlight on Armand Quenum

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Armand was a Fellow in our Fall 2016 cohort who landed a job with KPMG.

Tell us about your background. How did it set you up to be a great data scientist?

I received my Bachelor’s degree in Mechanical Engineering from NC State University. After college, I became a management consultant specializing in program and strategic management. As a consultant, I saw the value of data-driven decisions and extracting insights from data. As a result, I decided to go back to school to obtain my Master’s in Systems Engineering. There I was introduced to R programming, data mining techniques, and applications of optimization. My Master’s not only exposed me to data science, but it also provided me with a framework to approach complex problems.

Continue reading


Ranked: 16 R Packages for Machine Learning

At The Data Incubator we pride ourselves on having the latest data science curriculum. Much of our course material is based on feedback from corporate and government partners about the technologies they are looking to learn. However, we wanted to develop a more data-driven approach to what we teach in our data science corporate training and our free fellowship for data science masters and PhDs looking to begin their careers in the industry.

This report is the first in a series analyzing data-science-related topics. We thought it would be useful to the data science community to rank and analyze a variety of topics related to the profession in simple, easy-to-digest cheat sheets, rankings, or reports.

Continue reading


What is the probability of winning the Hamilton lottery?

People interested in seeing the Broadway musical Hamilton (and there are still many of them, with demand driving starting ticket prices to $600) can enter Broadway Direct’s daily lottery. Winners can receive up to 2 tickets (out of 21 available tickets) for a total of $10.

What’s the probability of winning?

How easy is it to win these coveted tickets? Members of NYC’s Data Incubator Team have collectively tried and failed 120 times. Given our data, we cannot simply divide the number of successes by the number of trials to calculate our chances of winning — we would get zero (and the odds, which are apparently small, are clearly non-zero).

This kind of situation often comes up under many guises in business and big data, and because we are a data science corporate training company, we decided to use statistics to determine the answer. Say you are measuring the click-through-rate of a piece of organic or paid content, and out of 100 impressions, you have not observed any clicks. The measured CTR is zero but the true CTR is likely not zero. Alternatively, suppose you are measuring the rate of adverse side effects of a new drug. You have tested 40 patients and haven’t found any, but you know the chance is unlikely to be zero. So what are the odds of observing a click or a side effect?  Continue reading
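One way to make the zero-success case concrete, as a minimal sketch and not necessarily the approach taken in the full post: assume every lottery entry is an independent trial with the same unknown win probability p. After n losses in a row, any p satisfying (1 - p)^n >= 1 - confidence is still consistent with the data, which gives an upper bound on p (the "rule of three" approximates the 95% bound as 3/n). A uniform prior on p gives a point estimate of 1/(n + 2). The function names below are illustrative.

```python
def zero_success_upper_bound(n, confidence=0.95):
    """Upper bound on p after n independent trials with zero successes.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


def posterior_mean_uniform_prior(n):
    """Posterior mean of p under a uniform Beta(1, 1) prior after n failures.

    The posterior is Beta(1, n + 1), whose mean is 1 / (n + 2).
    """
    return 1.0 / (n + 2)


if __name__ == "__main__":
    n = 120  # failed lottery entries reported above
    print(f"95% upper bound on p: {zero_success_upper_bound(n):.4f}")
    print(f"Rule-of-three approximation: {3 / n:.4f}")
    print(f"Posterior mean (uniform prior): {posterior_mean_uniform_prior(n):.4f}")
```

Under these assumptions, 120 failed entries put the win probability below roughly 2.5% with 95% confidence, and the uniform-prior estimate is a little under 1%. The same two functions apply unchanged to the 100-impression and 40-patient examples.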


Standing Out as a STEM Graduate: Alumni Spotlight on Bernard Beckerman

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Bernard was a Fellow in our Fall 2016 cohort who landed a job with Uptake.

Tell us about your background. How did it set you up to be a great data scientist?

I studied Materials Science and Engineering at Northwestern University for my PhD. Graduate school prepared me with an array of technical skills including programming, statistical analysis, and the ability to build, communicate, and defend a scientific argument. These are all important in producing data science products and presenting them to those at all levels of a corporate structure.

What do you think you got out of The Data Incubator?

TDI helped me leverage my programming and critical thinking skills toward a career in data science by giving me essential skills and project experience that made me stand out from other advanced-degree STEM graduates. These include machine learning, parallel programming, and interactive data visualization. TDI also connected me to a cohort of accomplished students that has been a great support as I’ve started my career.  Continue reading


The Science of Data Science: Alumni Spotlight on Paul George

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Paul was a Fellow in our Fall 2016 cohort who landed a job with Cloudera.

Tell us about your background. How did it set you up to be a great data scientist?

Following the completion of my PhD in Electrical and Computer Engineering in 2009, I joined Palantir Technologies as a Forward Deployed Engineer (client-facing software engineer). There, I helped Palantir enter a new vertical, that of Fortune 500 companies, where I built data integration and analysis software for novel commercial workflows. I left Palantir in 2012 and in 2013 I co-founded SolveBio, a genomics company whose mission is to help improve the variant-curation process: the process by which clinicians and genetic counselors research genetic mutations and label them as pathogenic, benign, or unknown. At SolveBio, my work was primarily focused on building scalable data cleansing, transformation and ingestion infrastructure that could be used to power the SolveBio genomics API. I also worked closely with geneticists and other domain experts in a semi-client-facing role.

The theme of my six years as a software engineer has been to help domain experts, whether they be fraud investigators at a bank or clinicians at a hospital, analyze disparate data to make better decisions. I have built infrastructure in both Java and Python, have used large SQL and NoSQL databases, and have spent countless hours perfecting Bash hackery (or wizardry, depending on your perspective).

My experiences as a software engineer were very relevant to data science in that I learned many ways to access, manipulate, and understand a variety of datasets from a variety of sources in a variety of formats. As the adage goes, “Garbage in. Garbage out.” Nowhere is this more true than in data science. Performing good data science requires cleaning and organizing data, and I feel very comfortable with this process.

Continue reading


The Iraq War by the Numbers: Extracting the Conflicts’ Staggering Costs

One of our fellows recently had a piece published about her unique and timely capstone project. The original piece is posted on Data Driven Journalism.

In her own words:

This war is not only important due to its staggering costs (both human and financial) but also on account of its publicly available and well-documented daily records from 2004 to 2010.

These documents provide a very high spatial and temporal resolution view of the conflict. For example, I extracted from these government memos the number of violent events per day in each county. Then, using latent factor analysis techniques, e.g. non-negative matrix factorization, I was able to cluster the top three principal war zones. Interestingly, these principal conflict zones were areas populated by the three main ethno-religious groups in Iraq.

You can watch her explain it herself:

 

Editor’s Note: The Data Incubator is a data science education company. We offer a free eight-week fellowship helping candidates with PhDs and master’s degrees enter data science careers. Companies can hire talented data scientists or enroll employees in our data science corporate training.


From Astronomy to AI: Alumni Spotlight on Athena Stacy

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Athena was a Fellow in our Fall 2016 cohort who landed a job with Brighterion.

Tell us about your background. How did it set you up to be a great data scientist?

My background is in astronomy.  My research consisted of developing and performing computer simulations of star formation in the early universe.  The goal of these simulations was to better understand what stellar clusters looked like in regions of the universe that telescopes cannot observe.  Thus I was already familiar with computer programming and visualizing data.  This was very helpful in the transition to data science.  Knowing how to present my research clearly to a range of audiences — both beginning students and other experts in the field — has helped as well!

What do you think you got out of The Data Incubator?

Tons! Just about everything I know about machine learning I learned at TDI. I met lots of great, friendly, and supportive people through TDI as well. This includes the instructors and mentors as well as the other fellows in my cohort, many of whom I’m sure I will keep up with for many years to come. Through TDI I’ve also made contacts with other companies and data scientists in the San Francisco Bay Area, which has been quite helpful in getting those job interviews!  Continue reading


Parallelizing Jupyter Notebook Tests

How we cut our end-to-end test suite runtime by 66% using parallelism

While there’s a common stereotype that data scientists are poor software engineers, at The Data Incubator we believe that mastering the fundamentals of software engineering is important for data science, and we strive to implement rigorous engineering standards for our data science company. We have an extensive curriculum for data science corporate training, data science fellowship, and online data science course leveraging the Jupyter (née IPython) notebook format. Last year, we published a post about testing Jupyter notebooks, applying rigorous software engineering testing standards to new technologies popular in data science.

However, over time, as our codebase has grown, we’ve added more and more notebooks to our curriculum material. This led to tests on our curriculum taking ~30 minutes to run! We quickly identified parallelism as low-hanging fruit that would make sense for a first approach, with a couple of points:

  1. We have curriculum materials that run code in Spark 2.0, and parallelizing runs in that kernel is hard because of how the Spark execution environment spins up. We also have curriculum materials in the Jupyter R kernel.
  2. Subprocess communication in Python (what our testing code is written in) is a pain, so maybe there’s a way to use some other parallelization library to avoid having to reinvent that wheel.
  3. Most of our notebooks are in Python, so those shouldn’t have any issues.

These issues aside, this seemed like a reasonable approach because each Jupyter notebook executes as its own subprocess in our current setup – we just had to take each of those processes and run them at the same time. Taking a stab at 3., parallelizing Python tests, while finding a way around 2. – annoying multiprocess communication issues – yielded great results!  Continue reading
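The idea of running each notebook's subprocess at the same time can be sketched as follows. This is an illustration under stated assumptions, not the test harness described in the full post: it assumes each notebook is executed with the `jupyter nbconvert --execute` command line, and the helper names (`nbconvert_command`, `run_notebook`, `run_all`, `report_failures`) are hypothetical. A thread pool is enough here because the heavy work happens in the child processes, and each thread only waits for an exit code, which avoids any hand-written inter-process communication.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def nbconvert_command(path):
    # Assumed command line: execute the notebook and write the result to stdout.
    return ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--stdout", path]


def run_notebook(path, build_command=nbconvert_command):
    """Run one notebook in its own subprocess; return (path, exit code, stderr)."""
    result = subprocess.run(build_command(path), capture_output=True, text=True)
    return path, result.returncode, result.stderr


def run_all(paths, max_workers=4, build_command=nbconvert_command):
    """Run the notebooks concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: run_notebook(p, build_command), paths))


def report_failures(results):
    """Print stderr for each failed notebook and return the failure count."""
    failures = [r for r in results if r[1] != 0]
    for path, code, stderr in failures:
        print(f"{path} failed with exit code {code}\n{stderr}", file=sys.stderr)
    return len(failures)
```

A caller might collect the Python-kernel notebooks, pass them to `run_all`, and fail the test run if `report_failures` returns a non-zero count. Notebooks that need the Spark or R kernels could still be run serially, given the concerns in point 1.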
