Data are becoming the new raw material of business
The Economist

A Study Of Reddit Politics

This article was written for The Data Incubator by Jay Kaiser, a Fellow of our 2018 Winter cohort in Washington, DC who landed a job with our hiring partner, ZeniMax Online Studios, as a Big Data Engineer.

 

The Question

The 2016 Presidential Election was, in a single word, weird. So much happened during the months leading up to November that it became difficult to keep track with what who said when and why. However, the finale of the election that culminated with Republican candidate Donald J. Trump winning the majority of the Electoral College and hence becoming the 45th President of the United States was an outcome which at the time I had thought impossible, if solely due to the aforementioned eccentric series of events that had circulated around Trump for a majority of his candidacy.

Following the election, the prominent question that could not leave my mind was a simple one: how? How had the American people changed so much in only a couple of years to allow an outsider hit by a number of black marks during the election to be elected to the highest position in the United States government? How did so many pollsters and political scientists fail to predict this outcome? How can we best analyze the campaigns of each candidate, now given hindsight and knowledge of the eventual outcome? In an attempt to answer each of these, I have turned to a perhaps unlikely source.

Continue reading


SQLite vs Pandas: Performance Benchmarks

This technical article was written for The Data Incubator by Paul Paczuski, a Fellow of our 2016 Spring cohort in New York City who landed a job with our hiring partner, Genentech as a Clinical Data Scientist.

As a data scientist, we all know that unglamorous data manipulation is 90% of the work. Two of the most common data manipulation tools are SQL and pandas. In this blog, we’ll compare the performance of pandas and SQLite, a simple form of SQL favored by Data Scientists.

Let’s find out the tasks at which each of these excel. Below, we compare Python’s pandas to sqlite for some common data analysis operations: sort, select, load, join, filter, and group by.

Continue reading


GPU Cloud Computing Services Compared: AWS, Google Cloud, IBM Nimbix/Power AI, and Crestle

This technical article was written for The Data Incubator by Tim Pollio, a Fellow of our 2017 Fall cohort in Washington, DC who joined The Data Incubator team as one of our resident Data Scientist instructors.

At The Data Incubator, a data science training and placement company, we’re excited about the potential for neural networks and deep learning to transform AI and Big Data. Of course, to practically run deep learning, normal CPUs won’t suffice — you’ll need GPUs. GPUs can dramatically increase the speed of deep learning algorithms, so it’s no surprise that they’re becoming increasingly popular and accessible. Amazon, Google, and IBM all offer GPU enabled options with their cloud computing services, and newer companies like Crestle provide additional options.

We tried four different services — Amazon Web Services, Google Cloud Platform, Nimbix/PowerAI, and Crestle — to find the options with the best performance, price, and convenience. Each service was tested using the same task: 1000 training steps on a tensorflow neural network designed for text prediction. The code for this benchmark can be found here.

Continue reading


Python Multi-Threading vs Multi-Processing

There is a library called threading in Python and it uses threads (rather than just processes) to implement parallelism. This may be surprising news if you know about the Python’s Global Interpreter Lock, or GIL, but it actually works well for certain instances without violating the GIL. And this is all done without any overhead — simply define functions that make I/O requests and the system will handle the rest.

 

Global Interpreter Lock

The Global Interpreter Lock reduces the usefulness of threads in Python (more precisely CPython) by allowing only one native thread to execute at a time. This made implementing Python easier to implement in the (usually thread-unsafe) C libraries and can increase the execution speed of single-threaded programs. However, it remains controvertial because it prevents true lightweight parallelism. You can achieve parallelism, but it requires using multi-processing, which is implemented by the eponymous library multiprocessing. Instead of spinning up threads, this library uses processes, which bypasses the GIL.

It may appear that the GIL would kill Python multithreading but not quite. In general, there are two main use cases for multithreading:

  1. To take advantage of multiple cores on a single machine
  2. To take advantage of I/O latency to process other threads

In general, we cannot benefit from (1) with threading but we can benefit from (2).

Continue reading


Ranking Popular JavaScript Visualization Packages for Data Science

At The Data Incubator, we strive to provide the most up-to-date data science curriculum available. Using feedback from our corporate and government partners, we deliver training on the most sought after data science tools and techniques in industry. We wanted to include a more data-driven approach to developing the curriculum for our corporate data science training and our free Data Science Fellowship program for PhD and master’s graduates looking to get hired as professional Data Scientists. To achieve this goal, we started by looking at and ranking popular deep learning libraries for data science, then ranking popular distributed computing packages for data science. Next, we wanted to analyze the popularity of JavaScript visualization packages for data science. Here are the results:
 

The Rankings

Below is a ranking of the top 20 of 110 JavaScript data visualization packages that are useful for Data Science, based on Github and Stack Overflow activity, as well as npmjs (javascript package manager)downloads. The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0). For example, chart.js is 3.29 standard deviations above average in Github activity, while plotly.js is close to average. See below for methods.
Continue reading


Tensorflow with Keras – Empowering Neural Networks for Deep Learning

Building deep neural networks just got easier. TensorFlow has announced that they are incorporating the popular deep learning API, Keras, as part of the core code that ships with TensorFlow 1.2. In the words of Keras’ author François Chollet, “Theano and TensorFlow are closer to NumPy, while Keras is closer to scikit-learn,” which is to say that Keras is at a higher level compared to pure TensorFlow and makes building deep learning models much more manageable.

TensorFlow is one of the fastest, most flexible, and most scalable machine-learning libraries available. It was developed internally by Google Brain and released as an open-source library in November 2015. Almost immediately upon its release, TensorFlow became one of the most popular machine learning libraries. But, as is the case with many libraries that emphasize speed and flexibility, TensorFlow tends to be a bit low-level.

Continue reading


NumPy and pandas – Crucial Tools for Data Scientists

This technical article was written for The Data Incubator by Don Fox, a Fellow of our 2017 Summer cohort in New York City.

When it comes to scientific computing and data science, two key python packages are NumPy and pandas. NumPy is a powerful python library that expands Python’s functionality by allowing users to create multi-dimenional array objects (ndarray). In addition to the creation of ndarray objects, NumPy provides a large set of mathematical functions that can operate quickly on the entries of the ndarray without the need of for loops.

Continue reading


Ranking Popular Distributed Computing Packages for Data Science

At The Data Incubator, we strive to provide the most up-to-date data science curriculum available. Using feedback from our corporate and government partners, we deliver training on the most sought after data science tools and techniques in industry. We wanted to include a more data-driven approach to developing the curriculum for our corporate data science training and our free Data Science Fellowship program for PhD and master’s graduates looking to get hired as professional Data Scientists. To achieve this goal, we started by looking at and ranking popular deep learning libraries for data science. Next, we wanted to analyze the popularity of distributed computing packages for data science. Here are the results.

The Rankings

Below is a ranking of the top 20 of 140 distributed computing packages that are useful for Data Science, based on Github and Stack Overflow activity, as well as Google Search results. The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0). For example, Apache Hadoop is 6.6 standard deviations above average in Stack Overflow activity, while Apache Flink is close to average. See below for methods.
Continue reading


Manipulating Data with pandas and PostgreSQL: Which is better?


 
Working on large data science projects usually involves the user accessing, manipulating, and retrieving data on a server. Next, the work flow moves client-side where the user will apply more refined data analysis and processing, typically tasks not possible or too clumsy to be done on the server. SQL (Structured Query Language) is ubiquitous in industry and data scientists will have to use it in their work to access data on the server.

The line between what data manipulation should be done server-side using SQL or on the client-side using a language like Python is not clear. Further, people who are either uncomfortable or dislike using SQL may be tempted to keep server-side manipulation to a minimum and reserve more of those actions on the client-side. With powerful and popular Python libraries for data wrangling and manipulation, the temptation to keep server-side processing to a minimum has increased.

This article will compare the execution time for several typical data manipulation tasks such as join and group by using PostgreSQL and pandas. PostgreSQL, often shortened as Postgres, is an object-relational database management system. It is free and open-source and runs on all major operating systems. Pandas is a Python data manipulation library that offers data structures akin to Excel spreadsheets and SQL tables and functions for manipulating those data structures.
Continue reading


Beyond Excel: Popular Data Analysis Methods from Excel, using pandas

 

Microsoft Excel is a spreadsheet software, containing data in tabular form. Entries of the data are located in cells, with numbered rows and letter labeled columns. Excel is widespread across industries and has been around for over thirty years. It is often people’s first introduction to data analysis.

Most users feel at home using a GUI to operate Excel and no programming is necessary for the most commonly used features. The data is presented right in front of the user and it is easy to scroll around through the spreadsheet. Making plots from the data only involves highlighting cells in the spreadsheet and clicking a few buttons.

There are various shortcomings with Excel. It is closed source and not free. There are free open-source alternatives like OpenOffice and LibreOffice suites, but there might be compatibility issues between file formats, especially for complex spreadsheets. Excel becomes unstable for files reaching 500 MB, being unresponsive and crashing for large files, hindering productivity. Collaborations can become difficult because it is hard to inspect the spreadsheet and understand how certain values are calculated/populated. It is difficult to understand the user’s thought process and work flow for the analysis.

The functionality of the spreadsheet is sensitive to the layout and moving entries around can have disastrous effects. Tasks like data cleaning, munging, treating different data types, and handling missing data are often difficult and require manual manipulation. Further, the user is limited to the built-in functionality of the program. In short, Excel is great for certain tasks but becomes unwieldy and inefficient as the tasks become more complicated.
Continue reading