Data are becoming the new raw material of business
The Economist


MATLAB vs. Python NumPy for Academics Transitioning into Data Science


At The Data Incubator, we pride ourselves on having the most up-to-date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback, we wanted to develop a data-driven approach for determining what we should be teaching in our data science corporate training and our free fellowship for master's and PhD graduates looking to enter data science careers in industry. Here are the results.

This technical article was written for The Data Incubator by Dan Taylor, a Fellow of our 2017 Spring cohort in Washington, DC. 

 

For many of us with roots in academic research, MATLAB was our first introduction to data analysis. However, due to its high cost, MATLAB is not very common beyond the academy; it is simply too expensive for most companies to afford a license. Luckily, for experienced MATLAB users, the transition to free and open-source tools, such as Python's NumPy, is fairly straightforward. This post compares the functionality of MATLAB with Python's NumPy library, in order to assist those transitioning from academic research into a career in data science.

MATLAB has several benefits when it comes to data analysis. Perhaps most important is its low barrier to entry for users with little programming experience. MathWorks has put a great deal of effort into making MATLAB's user interface both expansive and intuitive. This means new users can get up and running with their data quickly: it is possible to import, model, and visualize structured data without typing a single line of code. Because of this, MATLAB is a great entry point for scientists into programmatic analysis. Of course, the true power of MATLAB can only be unleashed through more deliberate and verbose programming, but users can gradually move into this more complicated space as they become more comfortable with programming. MATLAB's other strengths include its deep library of functions and extensive documentation, a virtual "instruction manual" full of detailed explanations and examples.
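To give a flavor of how directly everyday MATLAB idioms carry over, here is a minimal sketch of a few common operations and their NumPy equivalents. The specific snippets are illustrative only and assume NumPy is installed and imported as np.

    import numpy as np

    # MATLAB: A = [1 2 3; 4 5 6]
    A = np.array([[1, 2, 3],
                  [4, 5, 6]])

    # MATLAB: size(A)
    print(A.shape)              # (2, 3)

    # MATLAB: A'  (transpose)
    print(A.T)

    # MATLAB: A .* A  (element-wise multiply)
    print(A * A)

    # MATLAB: A * A'  (matrix multiply)
    print(A @ A.T)

    # MATLAB: linspace(0, 1, 5)
    print(np.linspace(0, 1, 5))

    # MATLAB: mean(A(:))  (mean over all elements)
    print(A.mean())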

Continue reading


NLTK vs. spaCy: Natural Language Processing in Python

The venerable NLTK has been the standard tool for natural language processing in Python for some time. It contains an amazing variety of tools, algorithms, and corpora. Recently, a competitor has arisen in the form of spaCy, which has the goal of providing powerful, streamlined language processing. Let's see how these toolkits compare.

Philosophy

NLTK provides a number of algorithms to choose from. For a researcher, this is a great boon. Its nine different stemming libraries, for example, allow you to finely customize your model. For the developer who just wants a stemmer to use as part of a larger project, this tends to be a hindrance. Which algorithm performs the best? Which is the fastest? Which is being maintained?

In contrast, spaCy implements a single stemmer, the one that the spaCy developers feel to be best. They promise to keep it updated, and may replace it with an improved algorithm as the state of the art progresses. You may update your version of spaCy and find that improvements to the library have boosted your application without any work necessary. (The downside is that you may need to rewrite some test cases.)
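A minimal sketch makes the contrast concrete. The snippet below is our own illustration, not taken from either project's documentation, and assumes nltk and spacy are installed along with spaCy's small English model (python -m spacy download en_core_web_sm). Note that spaCy's single normalization path is lemmatization, exposed through token.lemma_.

    from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
    import spacy

    word = "generously"

    # NLTK: several stemming algorithms to choose from; results can differ.
    for stemmer in (PorterStemmer(), SnowballStemmer("english"), LancasterStemmer()):
        print(type(stemmer).__name__, stemmer.stem(word))

    # spaCy: one pipeline, one normalized form per token.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The striped bats were hanging on their feet")
    print([token.lemma_ for token in doc])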

As a quick glance through the NLTK documentation demonstrates, different languages may need different algorithms. NLTK lets you mix and match the algorithms you need, but spaCy has to make a choice for each language. This is a long process, and spaCy currently only has support for English.

Continue reading


Automatically Generating License Data from Python Dependencies

We all know how important keeping track of your open-source licensing is for the average startup.  While most people think of open-source licenses as all being the same, there are meaningful differences that could have potentially serious legal implications for your code base.  From permissive licenses like MIT or BSD to so-called "reciprocal" or "copyleft" licenses, keeping track of the alphabet soup of dependencies in your source code can be a pain.

Today, we're releasing pylicense, a simple Python module that will add license data as comments directly from your requirements.txt or environment.yml files.
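The general idea can be sketched in a few lines. The function below is our own illustration of the approach, not pylicense's actual implementation: it reads package names from requirements.txt and annotates each line with the license reported by the installed distribution's metadata (using importlib.metadata, available in Python 3.8+).

    from importlib.metadata import metadata, PackageNotFoundError

    def annotate_requirements(path="requirements.txt"):
        for line in open(path):
            # Strip version pins like "pandas==0.20.1" down to the package name.
            name = line.split("==")[0].split(">=")[0].strip()
            if not name or name.startswith("#"):
                print(line.rstrip())
                continue
            try:
                license_str = metadata(name).get("License", "unknown")
            except PackageNotFoundError:
                license_str = "not installed"
            print(f"{line.rstrip()}  # {license_str}")

    annotate_requirements()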

Continue reading


Painlessly Deploying Data Apps with Bokeh, Flask, and Heroku

Here at The Data Incubator, our Fellows deploy their own fully functional, public-facing web app to showcase their data science skills to employers. This not only gives them valuable experience dynamically fetching and displaying data, but also encourages them to think about end user interaction. To demo the process, we decided to marry together some of our favorite technologies:

  • Flask, a slick web framework for Python
  • Heroku for cloud-based app deployment
  • Bokeh for interactive, D3.js-style visualizations
  • Git for version control and distributing code

The goal is to create some distant ancestor of Google Finance: a form capable of accepting a stock ticker as input and producing a plot of the daily close price. Here’s the finished product. So how do we get there?
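To make the moving parts concrete, here is a stripped-down sketch of the pattern: a Flask route reads a ticker, builds a Bokeh line plot of closing prices, and returns it as an embeddable HTML page. fetch_close_prices() is a hypothetical placeholder; the real app fetches actual quote data, and deploying to Heroku additionally requires a Procfile pointing at the app.

    from flask import Flask, request
    from bokeh.plotting import figure
    from bokeh.embed import file_html
    from bokeh.resources import CDN
    import numpy as np
    import pandas as pd

    app = Flask(__name__)

    def fetch_close_prices(ticker):
        # Hypothetical placeholder: a random walk standing in for real quote data.
        dates = pd.date_range("2017-01-01", periods=100)
        closes = 100 + np.random.randn(100).cumsum()
        return pd.DataFrame({"date": dates, "close": closes})

    @app.route("/plot")
    def plot():
        # The ticker arrives as a query parameter, e.g. /plot?ticker=GOOG
        ticker = request.args.get("ticker", "GOOG")
        df = fetch_close_prices(ticker)
        fig = figure(title=f"{ticker} daily close", x_axis_type="datetime",
                     x_axis_label="date", y_axis_label="close price")
        fig.line(df["date"], df["close"])
        # file_html produces a complete standalone page, pulling Bokeh's JS from a CDN.
        return file_html(fig, CDN, title=ticker)

    if __name__ == "__main__":
        app.run(debug=True)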

Continue reading
