Data are becoming the new raw material of business
The Economist

Turning Bold Questions into a Data Science Career at Amazon: Alumni Spotlight on David Wallace

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. David was a Fellow in our Winter 2016 cohort who landed a job with one of our hiring partners, Amazon.

Tell us about your background. How did it set you up to be a great data scientist? 

Before joining The Data Incubator, I completed my Ph.D. in chemistry at Johns Hopkins University, where I focused on the design and synthesis of new magnetic materials. My work gave me the opportunity to work alongside scientists in many different disciplines, and exposed me to a vast array of experimental techniques and theoretical constructs. From a data science perspective, this meant that I was constantly encountering new types of data and searching for scientifically rigorous models to explain those results. As the volume and complexity of these datasets increased, graphical data analysis tools like Excel and Origin weren’t making the cut for me, and I gradually made the transition to performing data transformation and analysis entirely in Python. That was a big technical leap that took a lot of time and frustration, but I think it ultimately made me a better researcher.

From a research perspective, working in a vibrant academic setting also meant learning how to ask bold questions, even at the risk of sounding stupid in front of a large group of mentors and peers, something I've done more than I care to admit. For me, finding the right question to ask is just as important as having the technical expertise to find an answer, and that's one of the things that makes data science so exciting.

Continue reading


Testing Jupyter Notebooks

This was originally posted on Christian Moscardi’s blog and is a follow-up piece to another post on Embedding D3 in IPython Notebook. Christian is our Lead Developer! 

Jupyter is a fantastic tool that we use at The Data Incubator for instructional purposes. One perk of using Jupyter is that we can easily test our code samples across any language there’s a Jupyter kernel for. In this post, I’ll show you some of the code we use to test notebooks!

First, a quick note on the current state of testing IPython notebooks: there isn't much documented about the process. ipython_nose is a really helpful extension for writing tests into your notebooks, but there's no documentation or information about easy end-to-end testing. In particular, we want the programmatic equivalent of clicking "run all cells". After poking around things like GitHub's list of cool IPython notebooks and the Jupyter docs, two things became apparent to us:

  1. Most people do not test their notebooks.
  2. Automated end-to-end testing is extremely easy to implement.

Continue reading
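To give a flavor of how simple that end-to-end approach can be, here is a minimal sketch of the programmatic "run all cells" idea using nbformat and nbconvert's ExecutePreprocessor (an illustration only; the code in the full post may differ):

```python
# Minimal "run all cells" sketch: execute a notebook top to bottom and
# let any failing cell raise an error, which is all an end-to-end test needs.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

def run_notebook(path):
    with open(path) as f:
        nb = nbformat.read(f, as_version=4)
    ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
    # Runs every cell in order; raises CellExecutionError if any cell fails.
    ep.preprocess(nb, {"metadata": {"path": "."}})
    return nb

if __name__ == "__main__":
    run_notebook("example.ipynb")  # hypothetical notebook path
```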

Machine Learning and Modeling the Stock Market: Alumni Spotlight on Michael Skarlinski

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Michael was a Fellow in our Winter 2016 cohort who landed a job with one of our hiring partners, Schireson Associates.

 

Tell us about your background. How did it set you up to be a great data scientist? 

My PhD work was in computational materials science, where I worked with reactive molecular dynamics simulations. The field is entirely simulation-based and typically requires high-performance computing resources. Running these simulations helped build my chops for working with parallel systems and command-line tools. The software required familiarity with some powerful languages and APIs, like C and CUDA, and learning those definitely helped my understanding of Python once I switched to it.

Toward the third year of my PhD I got really interested in machine learning. I started using scikit-learn to predict different aspects of simulations I worked on. These projects became a large part of my thesis and contributed to choosing The Data Incubator as a next step in my career.

Continue reading


Calculating the Perfect Algorithm: Alumni Spotlight on Sumanth Swaminathan

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Sumanth was a Fellow in our Winter 2016 cohort who landed a job with one of our hiring partners, Revon.

 

Tell us about your background. How did it set you up to be a great data scientist?

I did my bachelor's degree in Chemical Engineering at the University of Delaware and my PhD in Applied Mathematics at Northwestern University. After some postdoctoral work split between Northwestern and Oxford University, I went into industry as a quantitative consultant for W.L. Gore & Associates. For the past four years, I have spent most of my time delivering technology solutions at W.L. Gore, teaching mathematics at the University of Delaware, and performing and teaching Indian classical music.

On the question of what makes a strong data scientist, I think the better practitioners in the field tend to be hypothesis-driven, strong critical thinkers with hard skills in statistics, programming, mathematics, and hardware. Hence, my background in engineering and mathematics, my consulting experience, and my years of teaching probably contributed the most to my success.

 

Continue reading


Making (LinkedIn) Connections: Alumni Spotlight on Xia Hong

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Xia was a Fellow in our Summer 2015 cohort who landed a job at LinkedIn.

 

Tell us about your background. How did it set you up to be a great data scientist? 

I am an experimental physicist by training; my PhD at Emory University was in soft condensed matter. There are three things that I think have helped me become a good data scientist:
  1. A solid background in physics and math from college. The knowledge itself isn't necessarily reflected in my day-to-day work now, but the training in logical and critical thinking is really beneficial in the long run.
  2. Persistence in finding root causes. The massive amount of data can easily leave you feeling swamped. I believe that asking why until you get to the true cause of the problem is essential. Sometimes the insights are hidden beneath the surface and take real motivation to dig out. Whether it's driven by natural stubbornness or genuine curiosity, I find that persistence is usually a great help in walking the last mile to the final discovery.
  3. Passion for solving problems with data. There is a joint program in our department through which I took computer science courses for a master's degree. In the course projects, I discovered my passion for solving practical problems with data science approaches. Now I work on product analytics, and I can't imagine how tough it would be without that passion and curiosity about what we can do to improve the product.

Continue reading


NLTK vs. spaCy: Natural Language Processing in Python

The venerable NLTK has been the standard tool for natural language processing in Python for some time. It contains an amazing variety of tools, algorithms, and corpora. Recently, a competitor has arisen in the form of spaCy, which has the goal of providing powerful, streamlined language processing. Let's see how these toolkits compare.

Philosophy

NLTK provides a number of algorithms to choose from. For a researcher, this is a great boon. Its nine different stemming libraries, for example, allow you to finely customize your model. For the developer who just wants a stemmer to use as part of a larger project, this tends to be a hindrance. Which algorithm performs the best? Which is the fastest? Which is being maintained?
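To make that choice concrete, here is a small illustrative sketch (not from the original post) running three of NLTK's stemmers on the same word; they don't always agree, which is exactly the decision a busy developer would rather not have to make:

```python
# Three of NLTK's many stemmers applied to one word; their outputs can differ.
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

word = "generously"
for stemmer in (PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")):
    print(type(stemmer).__name__, stemmer.stem(word))
```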

In contrast, spaCy implements a single stemmer, the one that the spaCy developers feel to be best. They promise to keep it updated, and may replace it with an improved algorithm as the state of the art progresses. You may update your version of spaCy and find that improvements to the library have boosted your application without any work necessary. (The downside is that you may need to rewrite some test cases.)

As a quick glance through the NLTK documentation demonstrates, different languages may need different algorithms. NLTK lets you mix and match the algorithms you need, but spaCy has to make a choice for each language. This is a long process and spaCy currently only has support for English.

Strings versus objects

NLTK is essentially a string processing library. All the tools take strings as input and return strings or lists of strings as output. This is simple to deal with at first, but it requires the user to explore the documentation to discover the functions they need.
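For instance, a typical NLTK workflow (a minimal sketch for illustration) passes strings in and gets strings, or lists of strings, back:

```python
# NLTK's string-in, strings-out style.
# Note: the tokenizers and tagger need their data packages, e.g.
# nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
import nltk

text = "Natural language processing is fun. It is also hard."
words = nltk.word_tokenize(text)       # list of strings
sentences = nltk.sent_tokenize(text)   # list of strings
tags = nltk.pos_tag(words)             # list of (word, tag) string tuples
```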

In contrast, spaCy uses an object-oriented approach. Parsing some text returns a document object, whose words and sentences are represented by objects themselves. Each of these objects has a number of useful attributes and methods, which can be discovered through introspection. This object-oriented approach lends itself much better to modern Python style than does the string-handling system of NLTK.
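As a rough illustration of that object-oriented style (a minimal sketch, not code from the original comparison), parsing a document and walking its sentences and tokens looks something like this:

```python
# spaCy's object-oriented style: parse once, then introspect the resulting objects.
import spacy

nlp = spacy.load("en")  # model name at the time of writing; newer releases use e.g. "en_core_web_sm"
doc = nlp("Natural language processing is fun. It is also hard.")

for sent in doc.sents:      # sentence spans
    for token in sent:      # token objects with many attributes
        print(token.text, token.pos_, token.lemma_)
```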

A more detailed comparison between these approaches is available in this notebook.

Performance

An important part of a production-ready library is its performance, and spaCy bills itself as ready for real work. We'll run some tests on the text of the Wikipedia article on NLP, which contains about 10 kB of text. The tests will be word tokenization (splitting a document into words), sentence tokenization (splitting a document into sentences), and part-of-speech tagging (labeling the grammatical function of each word).
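For readers who want to reproduce the comparison, a rough harness along these lines is enough (an assumed setup, not necessarily the one used to produce the chart below; `text` is the article text and `nlp` a loaded spaCy model):

```python
# Average several runs to smooth out noise in the timings.
import time

def avg_seconds(func, arg, runs=10):
    start = time.time()
    for _ in range(runs):
        func(arg)
    return (time.time() - start) / runs

# avg_seconds(nltk.word_tokenize, text)   # NLTK word tokenization
# avg_seconds(nltk.sent_tokenize, text)   # NLTK sentence tokenization
# avg_seconds(nlp, text)                  # spaCy tokenizes, sentence-splits, and tags in one pass
```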

[Figure: timing comparison of NLTK and spaCy on word tokenization, sentence tokenization, and part-of-speech tagging]

It is fairly obvious that spaCy dramatically outperforms NLTK in word tokenization and part-of-speech tagging. spaCy's poorer showing in sentence tokenization is a result of differing approaches: NLTK simply attempts to split the text into sentences, whereas spaCy constructs a syntactic tree for each sentence, a more robust method that yields much more information about the text. (You can see a visualization of the result here.)

Conclusion

While NLTK is certainly capable, I feel that spaCy is a better choice for most common uses. It makes the hard choices about algorithms for you, providing state-of-the-art solutions. Its Pythonic API will fit in well with modern Python programming practices, and its fast performance will be much appreciated.

Unfortunately, spaCy is English-only at the moment, so developers concerned with other languages will need to use NLTK. Developers who need to ensure a particular algorithm is being used will also want to stick with NLTK. Everyone else should take a look at spaCy.


From Eco-Friendly Batteries to Random Forests: Alumni Spotlight on Matt Lawder

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Matt was a Fellow in our Winter 2016 cohort who landed a job with one of our hiring partners, 1010data.

 

Tell us about your background. How did it set you up to be a great data scientist?

I defended my PhD dissertation at Washington University in St. Louis a few weeks before coming to The Data Incubator. I was part of the MAPLE lab in Energy, Environmental, and Chemical Engineering (I know, it's a mouthful). Our lab focused on physics-based electrochemical modeling, mostly geared toward Li-ion batteries.

For my main dissertation project, I studied how batteries age under different real-world cycling patterns. Most cycle-life estimates for a battery are based on simple constant-charge and constant-discharge patterns, but many applications (such as those experienced by batteries in electric vehicles or coupled to the electric grid) do not have simple cycling patterns. This variation affects the life of the battery.

Both through model simulation and long-term experiments, I had to analyze battery characteristics over thousands of cycles and pick out important features. This type of analysis, along with programming the computational models used to create these datasets, gave me the background to tackle data science problems.

Additionally, I think that working on my PhD projects gave me experience in solving unstructured problems, where the solution (and sometimes even the problem or need) is not well defined. These types of problems are very common, especially once you get outside of academia.

Continue reading


Five Tips for Future-proofing Your Business with Data Science

We all want to be future-proof: not just prepared for unforeseen developments but positioned to take advantage of them. Having a flexible, adaptable, and scalable technology stack is a great way to achieve that goal when it comes to leveraging data science effectively. Here are five ideas I think are crucial to keep in mind when building out your own functionality:

1. Your pipeline is only as good as its weakest link.

It’s great that your predictive modelers have come up with a thousand new features to incorporate, but have you asked your data engineers how that will affect the performance of backend queries? What about your data collection and ingestion flow? Maybe your team is frothing at the mouth for an upgrade to Spark Streaming to run their clustering algorithms in real time, but your frontend will lose responsiveness if you try to display the results as fast as they come in. The key here is not to get sucked into the hype of “scaling up” without fully recognizing the implications across your entire organization and what new demands will be placed on all those moving parts. Continue reading


The 3 Things That Make Technical Training Worthwhile

Managers know that having employees who understand the latest tools and technologies is vital to keeping a company competitive. But training employees on those tools and technologies can be a costly endeavor (corporations spent $130 billion on corporate training in 2014), and too often training simply doesn't achieve the objective of giving employees the skills they need.

At The Data Incubator, we work with hundreds of clients who hire PhD data scientists from our fellowship program or enroll their employees in our big data corporate training. We’ve found in our work with these companies across industries that technical training often lacks three important things: hands-on practice, accountability, and breathing room. Continue reading


Automatically Generating License Data from Python Dependencies

We all know how important keeping track of your open-source licensing is for the average startup.  While most people think of open-source licenses as all being the same, there are meaningful differences that could have potentially serious legal implications for your code base.  From permissive licenses like MIT or BSD to so-called “reciprocal” or “copyleft” licenses, keeping track of the alphabet soup of dependencies in your source code can be a pain.

Today, we're releasing pylicense, a simple Python module that will add license data as comments directly from your requirements.txt or environment.yml files.
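To give a flavor of the idea, here is a hedged sketch of the general approach rather than pylicense's actual implementation (it assumes Python 3.8+ and that the dependencies are already installed): read requirements.txt and look up each package's License metadata.

```python
# Sketch: print each dependency from requirements.txt with its declared license.
from importlib import metadata

with open("requirements.txt") as f:
    names = [line.split("==")[0].strip()
             for line in f
             if line.strip() and not line.startswith("#")]

for name in names:
    try:
        license_str = metadata.metadata(name).get("License", "UNKNOWN")
    except metadata.PackageNotFoundError:
        license_str = "not installed"
    print(f"# {name}: {license_str}")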

Continue reading