Data are becoming the new raw material of business
The Economist

Tying Together Elegant Models: Alumni Spotlight on Brendan Keller

At The Data Incubator we run a free eight-week Data Science Fellowship Program to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring Data Scientists. Brendan was a Fellow in our Fall 2015 cohort who landed a job with one of our hiring partners, Jolata.


Tell us about your background. How did it set you up to be a great Data Scientist?

I did my PhD research in theoretical condensed matter physics at the University of California, Santa Barbara. The focus of my research was on studying the phase diagram of chains of non-abelian anyons. Because such chains are gapless in most regions of the phase diagram, we had to model them using very large matrices in C++. To make this computation more tractable, we used hash tables and sparse matrices. Besides my background in numerics, I also took the time to learn Python, Pandas, SQL and MapReduce in Cloudera a few months before starting the fellowship.


What do you think you got out of The Data Incubator?

The Data Incubator gave me a solid foundation in data parsing, large-scale data analysis and machine learning. I went into the fellowship already knowing about various concepts like SVM, bag-of-words and cross-validation. But I learned how to tie these together into elegant models that are both modular and easy to modify or upgrade. I also learned how to use MapReduce on a cluster, where the behavior of your program can be quite different than on a single node.


Spark comparison: AWS vs. GCP

This post was written jointly by Ariel M’ndange-Pfupfu and me. The original post for this piece can be found at O’Reilly.

There’s little doubt that cloud computing will play an important role in data science for the foreseeable future. The flexible, scalable, on-demand computing power available is an important resource, and as a result, there’s a lot of competition between the providers of this service. Two of the biggest players in the space are Amazon Web Services (AWS) and Google Cloud Platform (GCP).

This article includes a short comparison of distributed Spark workloads in AWS and GCP—both in terms of setup time and operating cost. We ran this experiment with our students at The Data Incubator, a big data training organization that helps companies hire top-notch data scientists and train their employees on the latest data science skills. Even with the efficiencies built into Spark, the cost and time of distributed workloads can be substantial, and we are always looking for the most efficient technologies so our students are learning the best and fastest tools.

Submitting Spark jobs to the cloud

Spark is a popular distributed computation engine that incorporates MapReduce-like aggregations into a more flexible, abstract framework. There are APIs for Python and Java, but writing applications in Spark’s native Scala is preferable. That makes job submission simple, as you can package your application and all its dependencies into one JAR file.
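To make "MapReduce-like aggregations" concrete, here is a minimal sketch of the pattern in plain Python. It does not use Spark at all: the function names (`map_phase`, `shuffle`, `reduce_by_key`, `word_count`) are hypothetical and chosen for illustration, and everything runs in a single process. It only mirrors, roughly, what a chain of `flatMap`, `map`, and `reduceByKey` does on a Spark RDD, where the same steps would be spread across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit one (word, 1) pair per word, similar to a flatMap followed by a map."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle(pairs):
    """Group values by key. On a cluster, this is the step that moves data between nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_by_key(groups, func):
    """Combine the values for each key with a binary function."""
    return {key: reduce(func, values) for key, values in groups.items()}


def word_count(lines):
    """Count words by chaining the map, shuffle, and reduce steps."""
    return reduce_by_key(shuffle(map_phase(lines)), lambda a, b: a + b)


if __name__ == "__main__":
    sample = ["spark on aws", "spark on gcp"]
    print(word_count(sample))
```

The point of the sketch is the shape of the computation: a per-record transformation, a grouping by key, and a per-key combine. Spark exposes these as composable operations rather than as one fixed map step and one fixed reduce step.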

It’s common to use Spark in conjunction with HDFS for distributed data storage, and YARN for cluster management; this makes Spark a perfect fit for AWS’s Elastic MapReduce (EMR) clusters and GCP’s Dataproc clusters. Both EMR and Dataproc clusters have HDFS and YARN preconfigured, with no extra work required.
