"Data are becoming the new raw material of business." (The Economist)

Ranking Popular Distributed Computing Packages for Data Science

At The Data Incubator, we strive to provide the most up-to-date data science curriculum available. Using feedback from our corporate and government partners, we deliver training on the most sought-after data science tools and techniques in industry. We wanted to take a more data-driven approach to developing the curriculum for our corporate data science training and our free Data Science Fellowship program for PhD and master’s graduates looking to get hired as professional Data Scientists. To achieve this goal, we started by ranking popular deep learning libraries for data science. Next, we turned to the popularity of distributed computing packages for data science. Here are the results.

The Rankings

Below is a ranking of the top 20 of 140 distributed computing packages that are useful for Data Science, based on GitHub and Stack Overflow activity, as well as Google search results. The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0). For example, Apache Hadoop is 6.6 standard deviations above average in Stack Overflow activity, while Apache Flink is close to average. See below for methods.
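To make the scoring concrete, here is a minimal sketch of how such standardized scores can be computed with pandas; the package names and counts below are made up for illustration, not our actual data.

```python
import pandas as pd

# Hypothetical raw activity counts (not our actual data)
raw = pd.DataFrame({
    "stack_overflow": [120000, 1500, 40000],   # tags + questions
    "github": [35000, 9000, 11000],            # stars + forks
}, index=["hadoop", "flink", "spark"])

# Standardize each metric: 0 = average, 1 = one standard deviation above
scores = (raw - raw.mean()) / raw.std()
print(scores)
```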

Continue reading


Manipulating Data with pandas and PostgreSQL: Which is better?


 
Working on large data science projects usually involves accessing, manipulating, and retrieving data on a server. The workflow then moves client-side, where the user applies more refined data analysis and processing, typically tasks that are impossible or too clumsy to do on the server. SQL (Structured Query Language) is ubiquitous in industry, and data scientists will have to use it in their work to access data on the server.

The line between which data manipulation should be done server-side using SQL and which client-side using a language like Python is not clear. Further, people who are uncomfortable with or dislike SQL may be tempted to keep server-side manipulation to a minimum and reserve more of those actions for the client side. With powerful and popular Python libraries for data wrangling and manipulation, that temptation has only increased.

This article will compare the execution time of several typical data manipulation tasks, such as join and group by, using PostgreSQL and pandas. PostgreSQL, often shortened to Postgres, is an object-relational database management system. It is free and open source and runs on all major operating systems. Pandas is a Python data manipulation library that offers data structures akin to Excel spreadsheets and SQL tables, along with functions for manipulating those data structures.
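As a preview, here is a minimal sketch of the same group by done both ways; the table, column names, and connection string are placeholders, not the benchmark setup used in the article.

```python
import pandas as pd

# Toy data standing in for a table on the server
df = pd.DataFrame({"dept": ["a", "a", "b"], "salary": [50, 60, 70]})

# Client-side: group by in pandas
pandas_result = df.groupby("dept", as_index=False)["salary"].mean()

# Server-side: the equivalent query run in PostgreSQL
# (connection details below are placeholders)
# import psycopg2
# conn = psycopg2.connect("dbname=mydb user=me")
# sql_result = pd.read_sql(
#     "SELECT dept, AVG(salary) AS salary FROM employees GROUP BY dept;",
#     conn)
```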
Continue reading


Beyond Excel: Popular Data Analysis Methods from Excel, using pandas

Microsoft Excel is spreadsheet software that holds data in tabular form. Entries are located in cells, organized into numbered rows and letter-labeled columns. Excel is widespread across industries and has been around for over thirty years. It is often people’s first introduction to data analysis.

Most users feel at home operating Excel through its GUI, and no programming is necessary for the most commonly used features. The data is presented right in front of the user, and it is easy to scroll through the spreadsheet. Making plots from the data only involves highlighting cells in the spreadsheet and clicking a few buttons.

Excel has various shortcomings, however. It is closed source and not free. There are free, open-source alternatives like the OpenOffice and LibreOffice suites, but these can have compatibility issues with Excel file formats, especially for complex spreadsheets. Excel also becomes unstable with files approaching 500 MB, growing unresponsive or crashing outright, which hinders productivity. Collaboration can become difficult because it is hard to inspect a spreadsheet and understand how certain values are calculated or populated, and hence to follow the author’s thought process and workflow.

The functionality of a spreadsheet is sensitive to its layout, and moving entries around can have disastrous effects. Tasks like data cleaning, munging, handling different data types, and dealing with missing data are often difficult and require manual manipulation. Further, the user is limited to the built-in functionality of the program. In short, Excel is great for certain tasks but becomes unwieldy and inefficient as those tasks grow more complicated.
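To give a flavor of the comparison, here is a minimal sketch of two common Excel tasks, a pivot table and a VLOOKUP-style lookup, done in pandas; the data is invented for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "revenue": [100, 80, 120, 90],
})

# Excel pivot table -> pandas pivot_table
pivot = sales.pivot_table(values="revenue", index="region",
                          columns="quarter", aggfunc="sum")

# Excel VLOOKUP -> pandas merge
managers = pd.DataFrame({"region": ["East", "West"],
                         "manager": ["Ann", "Bo"]})
joined = sales.merge(managers, on="region", how="left")
```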
Continue reading


Scikit-learn vs. StatsModels: Which, why, and how?

At The Data Incubator, we pride ourselves on having the most up-to-date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback, we wanted to develop a data-driven approach for determining what we should be teaching in our data science corporate training and our free fellowship for master’s and PhD graduates looking to enter data science careers in industry. Here are the results.

This technical article was written for The Data Incubator by Brett Sutton, a Fellow of our 2017 Summer cohort in Washington, DC. 

When you’re getting started on a project that requires doing some heavy stats and machine learning in Python, there are a handful of tools and packages available. Two popular options are scikit-learn and StatsModels. In this post, we’ll take a look at each one and get an understanding of what each has to offer.
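As a first taste of the difference in flavor, here is a minimal sketch fitting the same linear regression with both packages on made-up data; note how scikit-learn’s API centers on prediction while StatsModels produces a full statistical summary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# Made-up regression data
rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3 * X[:, 0] + rng.normal(size=100)

# scikit-learn: prediction-oriented interface
sk_model = LinearRegression().fit(X, y)
print(sk_model.coef_, sk_model.intercept_)

# StatsModels: statistics-oriented interface with a full summary
sm_model = sm.OLS(y, sm.add_constant(X)).fit()
print(sm_model.summary())
```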

Scikit-learn’s development began in 2007, and it was first released in 2010. The current version, 0.19, came out in July 2017. StatsModels started in 2009, with the latest version, 0.8.0, released in February 2017. Though the packages are similar in age, scikit-learn is more widely used and more actively developed, as a quick look at each package on GitHub shows. Both packages have an active development community, though scikit-learn attracts a lot more attention, as shown below.
Continue reading


MATLAB vs. Python NumPy for Academics Transitioning into Data Science


At The Data Incubator, we pride ourselves on having the most up-to-date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback, we wanted to develop a data-driven approach for determining what we should be teaching in our data science corporate training and our free fellowship for master’s and PhD graduates looking to enter data science careers in industry. Here are the results.

This technical article was written for The Data Incubator by Dan Taylor, a Fellow of our 2017 Spring cohort in Washington, DC. 

 

For many of us with roots in academic research, MATLAB was our first introduction to data analysis. However, due to its high cost, MATLAB is not very common beyond academia; a license is simply too expensive for most companies. Luckily for experienced MATLAB users, the transition to free and open-source tools, such as Python’s NumPy, is fairly straightforward. This post compares the functionality of MATLAB with Python’s NumPy library, in order to assist those transitioning from academic research into a career in data science.
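To illustrate how direct the translation can be, here is a minimal sketch of a few everyday MATLAB operations alongside their NumPy equivalents; it is only a sampler, not a complete mapping.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])   # MATLAB: A = [1 2; 3 4]
At = A.T                         # MATLAB: A'      (transpose)
mat_prod = A @ A                 # MATLAB: A * A   (matrix product)
elem_prod = A * A                # MATLAB: A .* A  (element-wise product)

b = np.array([1, 2])
x = np.linalg.solve(A, b)        # MATLAB: A \ b   (solve linear system)
```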

MATLAB has several benefits when it comes to data analysis. Perhaps most important is its low barrier to entry for users with little programming experience. MathWorks has put a great deal of effort into making MATLAB’s user interface both expansive and intuitive, so new users can quickly get up and running with their data without knowing how to code. It is possible to import, model, and visualize structured data without typing a single line of code. Because of this, MATLAB is a great entry point for scientists into programmatic analysis. Of course, the true power of MATLAB can only be unleashed through more deliberate and verbose programming, but users can gradually move into this more complicated space as they grow more comfortable with programming. MATLAB’s other strengths include its deep library of functions and extensive documentation, a virtual “instruction manual” full of detailed explanations and examples.

Continue reading


Ranking Popular Deep Learning Libraries for Data Science

At The Data Incubator, we pride ourselves on having the most up-to-date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback, we wanted to develop a data-driven approach for determining what we should be teaching in our data science corporate training and our free fellowship for master’s and PhD graduates looking to enter data science careers in industry. Here are the results.

The Rankings

Below is a ranking of 23 open-source deep learning libraries that are useful for Data Science, based on GitHub and Stack Overflow activity, as well as Google search results. The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0). For example, Caffe is one standard deviation above average in GitHub activity, while deeplearning4j is close to average. See below for methods.

Results and Discussion

The ranking is based on weighting its three components equally: GitHub (stars and forks), Stack Overflow (tags and questions), and Google Results (total and quarterly growth rate). These were obtained using available APIs. Coming up with a comprehensive list of deep learning toolkits was tricky; in the end, we scraped five different lists that we thought were representative (see the methods below for details). Computing standardized scores for each metric lets us see which packages stand out in each category. The full ranking is here, while the raw data is here.
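For concreteness, here is a minimal sketch of the equal weighting step, using made-up standardized scores rather than the actual data behind the ranking.

```python
import pandas as pd

# Made-up standardized scores for the three components
scores = pd.DataFrame({
    "github": [1.0, 0.1],
    "stack_overflow": [0.8, -0.2],
    "google_results": [1.2, 0.0],
}, index=["caffe", "deeplearning4j"])

# Equal weighting: the overall score is the mean of the three components
scores["overall"] = scores.mean(axis=1)
print(scores.sort_values("overall", ascending=False))
```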
Continue reading


The APIs for Neural Networks in TensorFlow

By Dana Mastropole, Robert Schroll, and Michael Li

TensorFlow has gathered quite a bit of attention as the new hot toolkit for building neural networks. To the beginner, it may seem that the only thing rivaling this interest is the number of different APIs you can use. In this article, we will go over a few of them, building the same neural network each time. We will start with low-level TensorFlow math, and then show how to simplify that code with TensorFlow’s layer API. We will also discuss two libraries built on top of TensorFlow: TFLearn and Keras.

The MNIST database is a collection of handwritten digits, each recorded in a $28\times28$ pixel grayscale image. We will build a two-layer perceptron network to classify each image as a digit from zero to nine. The first layer will fully connect the 784 inputs to 64 hidden neurons, using a sigmoid activation. The second layer will connect those hidden neurons to 10 outputs, scaled with the softmax function. The network will be trained with stochastic gradient descent, on minibatches of 64, for 20 epochs. (These values are chosen not because they are the best, but because they produce reasonable results in a reasonable time.)
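To anchor the comparison, here is a minimal sketch of this exact network in the Keras API, one of the higher-level options discussed in the article; the lower-level TensorFlow versions build the same architecture by hand.

```python
from tensorflow import keras

# Load MNIST and flatten each 28x28 image into a 784-vector in [0, 1]
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255
x_test = x_test.reshape(-1, 784).astype("float32") / 255

# 784 inputs -> 64 sigmoid hidden neurons -> 10 softmax outputs
model = keras.Sequential([
    keras.layers.Dense(64, activation="sigmoid", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])

# Stochastic gradient descent on minibatches of 64 for 20 epochs
model.compile(optimizer=keras.optimizers.SGD(),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=64, epochs=20,
          validation_data=(x_test, y_test))
```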


The Many Facets of Artificial Intelligence

When you think of artificial intelligence (AI), do you envision C-3PO or matrix multiplication? HAL 9000 or pruning decision trees? This is an example of ambiguous language, and for a field which has gained so much traction in recent years, it’s particularly important that we think about and define what we mean by artificial intelligence – especially when communicating between managers, salespeople, and the technical side of things. These days, AI is often used as a synonym for deep learning, perhaps because both ideas entered popular tech-consciousness at the same time. In this article I’ll go over the big picture definition of AI and how it differs from machine learning and deep learning.
Continue reading


Ranked: 15 Python Packages for Data Science


At The Data Incubator we pride ourselves on having the latest data science curriculum. Much of our course material is based on feedback from corporate and government partners about the technologies they are looking to learn. However, we wanted to develop a more data-driven approach to what we teach in our data science corporate training and our free fellowship for data science master’s and PhD graduates looking to begin their careers in the industry.

This report is the second in a series analyzing data science related topics; to see more, be sure to check out our R Packages for Machine Learning report. We thought it would be useful to the data science community to rank and analyze a variety of topics related to the profession in simple, easy-to-digest cheat sheets, rankings, and reports.
Continue reading