"Data are becoming the new raw material of business."
– The Economist

NumPy and pandas – Crucial Tools for Data Scientists

When it comes to scientific computing and data science, two key Python packages are NumPy and pandas. NumPy is a powerful Python library that expands Python’s functionality by allowing users to create multi-dimensional array objects (ndarray). In addition to creating ndarray objects, NumPy provides a large set of mathematical functions that operate quickly on the entries of an ndarray without the need for explicit for loops. Below is an example of NumPy usage: the code creates a random array and calculates the cosine of each entry.

In [23]:
import numpy as np

X = np.random.random((4, 2))  # create random 4x2 array
y = np.cos(X)                 # take the cosine on each entry of X

print y
print "\n The dimension of y is", y.shape
[[ 0.95819067  0.60474588]
 [ 0.78863282  0.95135038]
 [ 0.82418621  0.93289855]
 [ 0.67706351  0.83420891]]

 The dimension of y is (4, 2)

 

We can easily access entries of an array, call individual elements, and select certain rows and columns.

In [24]:
print y[0, :]   # select the 1st row
print y[:, 1]   # select the 2nd column
print y[2, 1]   # select the entry in the 3rd row, 2nd column
print y[1:2, :] # select only the 2nd row (returned as a 2-D array)
[ 0.95819067  0.60474588]
[ 0.60474588  0.95135038  0.93289855  0.83420891]
0.932898546321
[[ 0.78863282  0.95135038]]

 

The pandas (PANel + DAta) Python library provides easy and fast data analysis and manipulation tools through two core data structures: the DataFrame, a tabular structure for numerical and mixed data, and the Series, a one-dimensional labeled structure well suited to time series. Pandas was created to do the following:

  • provide data structures that can handle both time series and non-time series data
  • allow mathematical operations on the data structures, ignoring the metadata (e.g., the row and column labels)
  • support relational operations like those found in SQL (join, group by, etc.)
  • handle missing data (the last two items are sketched briefly below)
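As a quick illustration of the last two items, here is a minimal sketch, assuming two small, hypothetical DataFrames of state populations and capitals, of an SQL-style join with pd.merge and of basic missing-data handling with isnull and fillna.

import numpy as np
import pandas as pd

# hypothetical tables: state populations and state capitals
pop = pd.DataFrame({'state': ['Texas', 'Nebraska', 'Iowa'],
                    'population': [27.86e6, 1.91e6, np.nan]})   # Iowa's population is missing
caps = pd.DataFrame({'state': ['Texas', 'Nebraska'],
                     'capital': ['Austin', 'Lincoln']})

# SQL-style left join on the 'state' column
merged = pd.merge(pop, caps, on='state', how='left')

# missing-data handling: detect and fill NaN entries
print(merged.isnull().sum())                    # count missing values per column
print(merged.fillna({'capital': 'unknown'}))    # fill missing capitals with a placeholder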

The next example builds a small DataFrame and demonstrates some of these capabilities.

In [25]:
import pandas as pd

# create data
states = ['Texas', 'Rhode Island', 'Nebraska'] # string
population = [27.86E6, 1.06E6, 1.91E6]         # float
electoral_votes = [38, 3, 5]                   # integer
is_west_of_MS = [True, False, True]            # Boolean

# create and display DataFrame
headers = ('State', 'Population', 'Electoral Votes', 'West of Mississippi')
data = (states, population, electoral_votes, is_west_of_MS)
data_dict = dict(zip(headers, data))

df1 = pd.DataFrame(data_dict)
df1
Out[25]:
   Electoral Votes  Population         State  West of Mississippi
0               38  27860000.0         Texas                 True
1                3   1060000.0  Rhode Island                False
2                5   1910000.0      Nebraska                 True

 

In the above code, we created a pandas DataFrame object, a tabular data structure that resembles a spreadsheet like those used in Excel. For those familiar with SQL, you can view a DataFrame as an SQL table. The DataFrame we created consists of four columns, each with entries of different data types (integer, float, string, and Boolean).

Pandas is built on top of NumPy, relying on ndarray and its fast, efficient array-based mathematical functions. For example, if we want to calculate the mean population across the states, we can run

In [26]:
print df1['Population'].mean()
10276666.6667

 

Pandas relies on NumPy data types for the entries in the DataFrame. Printing the types of individual entries using iloc shows

In [27]:
print type(df1['Electoral Votes'].iloc[0])
print type(df1['Population'].iloc[0])
print type(df1['West of Mississippi'].iloc[0])
<type 'numpy.int64'>
<type 'numpy.float64'>
<type 'numpy.bool_'>

 

Another example of the compatibility between pandas and NumPy: if a DataFrame is composed of purely numerical data, we can apply NumPy functions directly to it. For example,

In [28]:
df2 = pd.DataFrame({"times": [1.0, 2.0, 3.0, 4.0], "more times": [5.0, 6.0, 7.0, 8.0]})
df2 = np.cos(df2)
df2.head()
Out[28]:
   more times     times
0    0.283662  0.540302
1    0.960170 -0.416147
2    0.753902 -0.989992
3   -0.145500 -0.653644

 

Pandas was built to ease data analysis and manipulation. Two important pandas methods are groupby and apply. The groupby method groups the DataFrame by the values of a certain column, after which an aggregating function can be applied to each group. For example, if we want to determine the maximum population for states grouped by whether they lie west or east of the Mississippi River, the syntax is

In [29]:
df1.groupby('West of Mississippi').agg('max')
Out[29]:
                     Electoral Votes  Population         State
West of Mississippi
False                              3   1060000.0  Rhode Island
True                              38  27860000.0         Texas

 

The apply method accepts a function to apply to all the entries of a pandas Series object. This method is useful for applying a customized function to the entries of a column in a pandas DataFrame. For example, we can create a Series object that tells us if a state’s population is more than two million. The result is a Series object that we can append to our original DataFrame object.

In [30]:
more_than_two_million = df1['Population'].apply(lambda x: x > 2E6)  # create a Series of Boolean values
df1['More than Two Million'] = more_than_two_million  # append the Series to our original DataFrame
df1.head()
Out[30]:
   Electoral Votes  Population         State  West of Mississippi  More than Two Million
0               38  27860000.0         Texas                 True                   True
1                3   1060000.0  Rhode Island                False                  False
2                5   1910000.0      Nebraska                 True                  False

 

Accessing columns is intuitive, and returns a pandas Series object.

In [31]:
print df1['Population']
print type(df1['Population'])
0    27860000.0
1     1060000.0
2     1910000.0
Name: Population, dtype: float64
<class 'pandas.core.series.Series'>

 

A DataFrame is composed of multiple Series. The DataFrame class resembles a collection of NumPy arrays, but with labeled axes and mixed data types across the columns. In fact, a Series is built on top of NumPy’s ndarray. While you can achieve the same results as certain pandas methods using NumPy alone, doing so requires more lines of code. Pandas expands on NumPy by providing easy-to-use methods for data analysis that operate on the DataFrame and Series classes, which are built on NumPy’s powerful ndarray class.
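As a rough illustration of that point, here is a sketch, reusing the df1 DataFrame from above, that computes the maximum population per group first with pandas and then with plain NumPy arrays; the NumPy version has to manage the grouping logic by hand.

import numpy as np

# pandas: a single expression
print(df1.groupby('West of Mississippi')['Population'].max())

# plain NumPy: pull out the underlying arrays and group manually
pop = df1['Population'].values
west = df1['West of Mississippi'].values
for flag in np.unique(west):
    print('%s: %s' % (flag, pop[west == flag].max()))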

 

How memory is configured in NumPy

The power of NumPy comes from the ndarray class and how it is laid out in memory. The ndarray class consists of

  • the data type of the entries of the array
  • a pointer to a contiguous block of memory where the data/entries of the array reside
  • a tuple of the array’s shape
  • a tuple of the array’s stride

The shape refers to the dimensions of the array, while the stride is the number of bytes to step in a particular dimension when traversing the array in memory. With both the stride and the shape, NumPy has sufficient information to access any of the array’s entries in memory.
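For instance, a 4x2 array of 64-bit floats stored in the default layout keeps each row contiguous, so moving to the next column steps 8 bytes while moving to the next row steps 2 x 8 = 16 bytes. A minimal sketch:

import numpy as np

a = np.zeros((4, 2), dtype=np.float64)
print(a.shape)      # (4, 2)
print(a.itemsize)   # 8 bytes per float64 entry
print(a.strides)    # (16, 8): 16 bytes to the next row, 8 bytes to the next column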

By default, NumPy arranges the data in row-major order, as in C: the entries of the array are laid out row by row. The alternative is column-major order, as used in Fortran and MATLAB, which lays the entries out column by column. NumPy can use either scheme; you choose by passing the order keyword when creating an array. See the figure below for the differences between the schemes.
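One quick way to see the difference in code is to create the same array in both orders and compare the strides; in column-major order the roles of the two strides are swapped:

import numpy as np

c_order = np.ones((3, 4), dtype=np.float64, order='C')  # row-major (the default)
f_order = np.ones((3, 4), dtype=np.float64, order='F')  # column-major

print(c_order.strides)  # (32, 8): each row of four floats is contiguous
print(f_order.strides)  # (8, 24): each column of three floats is contiguous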

The contiguous memory layout allows NumPy to take advantage of the vector processors in modern CPUs for array computations. Array computations are efficient because NumPy knows the memory location and data type of every entry, so it can loop through the entries without any per-element bookkeeping. NumPy can also link to established and highly optimized linear algebra libraries such as BLAS and LAPACK. As you can see, the NumPy ndarray offers faster and more efficient computations than the native Python list. No wonder pandas and other Python libraries are built on top of NumPy. However, this layout requires all entries of an ndarray to have the same data type, a restriction that the Python list does not share.
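As a rough, machine-dependent comparison, the sketch below times the same elementwise cosine computed with a pure-Python loop over a list and with a single vectorized NumPy call:

import math
import time
import numpy as np

values = range(1000000)
array = np.arange(1000000, dtype=np.float64)

start = time.time()
list_result = [math.cos(v) for v in values]   # pure-Python loop, one call per entry
print('Python loop:  %.4f seconds' % (time.time() - start))

start = time.time()
array_result = np.cos(array)                  # single vectorized NumPy call
print('NumPy ufunc:  %.4f seconds' % (time.time() - start))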

 

Heterogeneous data types in pandas

As mentioned earlier, the pandas DataFrame class can store heterogeneous data; each column contains a Series object of a different data type. The DataFrame is stored as several blocks in memory, where each block contains the columns of the DataFrame that share the same data type. For example, a DataFrame with five columns comprised of two columns of floats, two columns of integers, and one Boolean column will be stored using three blocks.
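A small sketch of that hypothetical five-column case, checking the dtypes that determine the block layout (._data is an internal attribute, so its exact output varies across pandas versions):

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0],   # two float columns
                   'c': [1, 2], 'd': [3, 4],           # two integer columns
                   'e': [True, False]})                # one Boolean column

print(df.dtypes)   # float64, float64, int64, int64, bool
print(df._data)    # the internal BlockManager: one float block, one int block, one bool block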

With the data of the DataFrame stored in blocks grouped by data type, operations within a block are efficient, for the same reasons that NumPy operations are fast, as described previously. However, operations that span several blocks will not be as efficient. Information on the blocks of a DataFrame object can be accessed using ._data.

In [32]:
df1._data
Out[32]:
BlockManager
Items: Index([u'Electoral Votes', u'Population', u'State', u'West of Mississippi',
       u'More than Two Million'],
      dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
FloatBlock: slice(1, 2, 1), 1 x 3, dtype: float64
IntBlock: slice(0, 1, 1), 1 x 3, dtype: int64
BoolBlock: slice(3, 4, 1), 1 x 3, dtype: bool
ObjectBlock: slice(2, 3, 1), 1 x 3, dtype: object
BoolBlock: slice(4, 5, 1), 1 x 3, dtype: bool

The DataFrame class also allows columns with mixed data types. In these cases, the data type of the column is reported as object. When the data type is object, the data is no longer stored in the NumPy ndarray format, but rather as a contiguous block of pointers, each referencing a Python object. Thus, operations on a DataFrame involving Series of data type object will not be efficient.

Strings are stored in pandas with the Python object data type because strings have variable memory size; integers and floats, in contrast, have a fixed byte size. However, if a DataFrame has columns with categorical data, encoding the entries as integers will be more memory- and computationally efficient. For example, a column containing the entries “small”, “medium”, and “large” can be converted to 0, 1, and 2, and the data type of the new column is then an integer.
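A minimal sketch of that conversion using pandas’ category dtype (the column and its labels here are hypothetical):

import pandas as pd

sizes = pd.Series(['small', 'medium', 'large', 'small', 'large'])
print(sizes.dtype)                    # object: each entry is a pointer to a Python string

# convert to the categorical dtype: labels are stored once, entries become integer codes
sizes_cat = sizes.astype('category')
print(sizes_cat.cat.categories)       # Index(['large', 'medium', 'small'], dtype='object')
print(sizes_cat.cat.codes.tolist())   # [2, 1, 0, 2, 0]

# the categorical version needs far less memory as the column grows
print(sizes.memory_usage(deep=True), sizes_cat.memory_usage(deep=True))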

 

The importance of understanding NumPy and pandas

Through this article, we have seen

  • examples of usage of NumPy and pandas
  • how memory is configured in NumPy
  • how pandas relies on NumPy
  • how pandas deals with heterogeneous data types

While knowing how NumPy and pandas work is not necessary to use these tools, understanding their inner workings and how they relate to one another enables data scientists to wield them effectively. Such effective use becomes more important for larger data sets and more complex analyses, where even a small percentage improvement translates into large time savings.


Manipulating Data with pandas and PostgreSQL: Which is better?


 
Working on large data science projects usually involves accessing, manipulating, and retrieving data on a server. The workflow then moves client-side, where the user applies more refined data analysis and processing, typically tasks that are not possible, or are too clumsy, to do on the server. SQL (Structured Query Language) is ubiquitous in industry, and data scientists will have to use it in their work to access data on the server.

The line between which data manipulation should be done server-side using SQL and which should be done client-side using a language like Python is not clear. Further, people who are uncomfortable with or dislike SQL may be tempted to keep server-side manipulation to a minimum and reserve more of those actions for the client side. With powerful and popular Python libraries for data wrangling and manipulation, the temptation to keep server-side processing to a minimum has only increased.

This article will compare the execution time for several typical data manipulation tasks such as join and group by using PostgreSQL and pandas. PostgreSQL, often shortened as Postgres, is an object-relational database management system. It is free and open-source and runs on all major operating systems. Pandas is a Python data manipulation library that offers data structures akin to Excel spreadsheets and SQL tables and functions for manipulating those data structures.
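As a flavor of the kind of task being compared, here is a hedged sketch of the same group-by aggregation expressed both ways; the connection string, table, and column names are hypothetical.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/mydb')  # hypothetical database

# server-side: let PostgreSQL perform the aggregation and return only the summary
server_side = pd.read_sql('SELECT state, AVG(price) AS avg_price FROM sales GROUP BY state', engine)

# client-side: pull the raw rows and aggregate with pandas
sales = pd.read_sql('SELECT state, price FROM sales', engine)
client_side = sales.groupby('state', as_index=False)['price'].mean()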
Continue reading


Beyond Excel: Popular Data Analysis Methods from Excel, using pandas

 

Microsoft Excel is spreadsheet software that holds data in tabular form. Entries of the data are located in cells, with numbered rows and letter-labeled columns. Excel is widespread across industries and has been around for over thirty years. It is often people’s first introduction to data analysis.

Most users feel at home using a GUI to operate Excel, and no programming is necessary for the most commonly used features. The data is presented right in front of the user, and it is easy to scroll through the spreadsheet. Making plots from the data only involves highlighting cells in the spreadsheet and clicking a few buttons.

There are various shortcomings to Excel. It is closed source and not free. There are free, open-source alternatives like the OpenOffice and LibreOffice suites, but there can be compatibility issues between file formats, especially for complex spreadsheets. Excel becomes unstable with files approaching 500 MB, turning unresponsive or crashing outright, which hinders productivity. Collaboration can become difficult because it is hard to inspect a spreadsheet and understand how certain values are calculated or populated, and hard to follow the author’s thought process and workflow for the analysis.

The functionality of the spreadsheet is sensitive to the layout and moving entries around can have disastrous effects. Tasks like data cleaning, munging, treating different data types, and handling missing data are often difficult and require manual manipulation. Further, the user is limited to the built-in functionality of the program. In short, Excel is great for certain tasks but becomes unwieldy and inefficient as the tasks become more complicated.
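For comparison, the kinds of tasks listed above take only a few lines in pandas; below is a hedged sketch in which the file, sheet, and column names are hypothetical.

import pandas as pd

# read a worksheet into a DataFrame (requires an Excel reader such as xlrd or openpyxl)
df = pd.read_excel('sales.xlsx', sheet_name='Q1')

df['price'] = pd.to_numeric(df['price'], errors='coerce')   # coerce malformed entries to NaN
df = df.dropna(subset=['price'])                            # drop rows with missing prices
df['region'] = df['region'].str.strip().str.title()         # clean up inconsistent text labels
summary = df.groupby('region')['price'].describe()          # summary statistics by group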
Continue reading


Scikit-learn vs. StatsModels: Which, why, and how?

At The Data Incubator, we pride ourselves on having the most up to date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback we wanted to develop a data-driven approach for determining what we should be teaching in our data science corporate training and our free fellowship for masters and PhDs looking to enter data science careers in industry. Here are the results.

This technical article was written for The Data Incubator by Brett Sutton, a Fellow of our 2017 Summer cohort in Washington, DC. 

When you’re getting started on a project that requires doing some heavy stats and machine learning in Python, there are a handful of tools and packages available. Two popular options are scikit-learn and StatsModels. In this post, we’ll take a look at each one and get an understanding of what each has to offer.

Scikit-learn’s development began in 2007 and it was first released in 2010. The current version, 0.19, came out in July 2017. StatsModels started in 2009, with the latest version, 0.8.0, released in February 2017. Though they are similar in age, scikit-learn is more widely used and more actively developed, as a quick look at each package on Github shows. Both packages have an active development community, though scikit-learn attracts a lot more attention, as shown below. Continue reading


MATLAB vs. Python NumPy for Academics Transitioning into Data Science


At The Data Incubator, we pride ourselves on having the most up to date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback we wanted to develop a data-driven approach for determining what we should be teaching in our data science corporate training and our free fellowship for masters and PhDs looking to enter data science careers in industry. Here are the results.

This technical article was written for The Data Incubator by Dan Taylor, a Fellow of our 2017 Spring cohort in Washington, DC. 

 

For many of us with roots in academic research, MATLAB was our first introduction to data analysis. However, due to its high cost, MATLAB is not very common beyond academia; it is simply too expensive for most companies to license. Luckily, for experienced MATLAB users, the transition to free and open-source tools such as Python’s NumPy is fairly straightforward. This post compares the functionality of MATLAB with Python’s NumPy library, in order to assist those transitioning from academic research into a career in data science.

MATLAB has several benefits when it comes to data analysis. Perhaps most important is its low barrier to entry for users with little programming experience. MathWorks has put a great deal of effort into making MATLAB’s user interface both expansive and intuitive. This means new users can quickly get up and running with their data without knowing how to code. It is possible to import, model, and visualize structured data without typing a single line of code. Because of this, MATLAB is a great entrance point for scientists into programmatic analysis. Of course, the true power of MATLAB can only be unleashed through more deliberate and verbose programming, but users can gradually move into this more complicated space as they become more comfortable with programming. MATLAB’s other strengths include its deep library of functions and extensive documentation, a virtual “instruction manual” full of detailed explanations and examples.
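As a taste of how direct the translation can be, here are a few common MATLAB idioms alongside rough NumPy equivalents (the MATLAB versions appear in the comments):

import numpy as np

A = np.zeros((3, 4))                 # MATLAB: A = zeros(3, 4)
B = np.eye(3)                        # MATLAB: B = eye(3)
v = np.linspace(0, 1, 11)            # MATLAB: v = linspace(0, 1, 11)

C = np.random.rand(3, 3)             # MATLAB: C = rand(3, 3)
x = np.linalg.solve(C, np.ones(3))   # MATLAB: x = C \ ones(3, 1)
first_row = A[0, :]                  # MATLAB: A(1, :)   (NumPy indexing starts at 0)
elementwise = C * C                  # MATLAB: C .* C    (use C.dot(C) for the matrix product)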

Continue reading


Ranking Popular Deep Learning Libraries for Data Science

At The Data Incubator, we pride ourselves on having the most up to date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback we wanted to develop a data-driven approach for determining what we should be teaching in our data science corporate training and our free fellowship for masters and PhDs looking to enter data science careers in industry. Here are the results.
 

The Rankings

Below is a ranking of 23 open-source deep learning libraries that are useful for Data Science, based on Github and Stack Overflow activity, as well as Google search results. The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0). For example, Caffe is one standard deviation above average in Github activity, while deeplearning4j is close to average. See below for methods.
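For reference, a standardized score is simply a z-score: subtract the mean of the metric across all libraries and divide by its standard deviation. A small sketch with made-up numbers:

import pandas as pd

# hypothetical raw activity counts for three libraries
raw = pd.DataFrame({'github_stars': [25000, 8000, 6000],
                    'stack_overflow_questions': [4000, 900, 400]},
                   index=['tensorflow', 'caffe', 'deeplearning4j'])

z_scores = (raw - raw.mean()) / raw.std()                      # standardize each metric: mean 0, std 1
overall = z_scores.mean(axis=1).sort_values(ascending=False)   # simple average across metrics
print(overall)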


Continue reading


The APIs for Neural Networks in TensorFlow

By Dana Mastropole, Robert Schroll, and Michael Li

TensorFlow has gathered quite a bit of attention as the new hot toolkit for building neural networks. To the beginner, it may seem that the only thing that rivals this interest is the number of different APIs which you can use. In this article we will go over a few of them, building the same neural network each time. We will start with low-level TensorFlow math, and then show how to simplify that code with TensorFlow’s layer API. We will also discuss two libraries built on top of TensorFlow, TFLearn and Keras.

The MNIST database is a collection of handwritten digits. Each is recorded in a $28\times28$ pixel grayscale image. We will build a two-layer perceptron network to classify each image as a digit from zero to nine. The first layer will fully connect the 784 inputs to 64 hidden neurons, using a sigmoid activation. The second layer will connect those hidden neurons to 10 outputs, scaled with the softmax function. The network will be trained with stochastic gradient descent, on minibatches of 64, for 20 epochs. (These values are chosen not because they are the best, but because they produce reasonable results in a reasonable time.)
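As a preview, here is a hedged sketch of that architecture using Keras, one of the higher-level libraries covered in the article; it assumes a Keras 2-style API and flattens each image to a 784-dimensional vector.

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.utils import to_categorical

# load MNIST and flatten each 28x28 image into a 784-dimensional vector
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
y_train = to_categorical(y_train, 10)            # one-hot encode the digit labels

# two-layer perceptron: 784 inputs -> 64 sigmoid neurons -> 10 softmax outputs
model = Sequential()
model.add(Dense(64, activation='sigmoid', input_dim=784))
model.add(Dense(10, activation='softmax'))

model.compile(optimizer=SGD(), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=64, epochs=20)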


The Many Facets of Artificial Intelligence

When you think of artificial intelligence (AI), do you envision C-3PO or matrix multiplication? HAL 9000 or pruning decision trees? This is an example of ambiguous language, and for a field which has gained so much traction in recent years, it’s particularly important that we think about and define what we mean by artificial intelligence – especially when communicating between managers, salespeople, and the technical side of things. These days, AI is often used as a synonym for deep learning, perhaps because both ideas entered popular tech-consciousness at the same time. In this article I’ll go over the big picture definition of AI and how it differs from machine learning and deep learning. Continue reading


Ranked: 15 Python Packages for Data Science

Cover of Python Packages for Data Science

At The Data Incubator we pride ourselves on having the latest data science curriculum. Much of our course material is based on feedback from corporate and government partners about the technologies they are looking to learn. However, we wanted to develop a more data-driven approach to what we teach in our data science corporate training and our free fellowship for data science masters and PhDs looking to begin their careers in the industry.

This report is the second in a series analyzing data science related topics; to see more, be sure to check out our R Packages for Machine Learning report. We thought it would be useful to the data science community to rank and analyze a variety of topics related to the profession in simple, easy-to-digest cheat sheets, rankings, and reports. Continue reading


Spark 2.0 on Jupyter with Toree

Spark

Spark is one of the most popular open-source distributed computation engines and offers a scalable, flexible framework for processing huge amounts of data efficiently. The recent 2.0 release milestone brought a number of significant improvements including DataSets, an improved version of DataFrames, more support for SparkR, and a lot more. One of the great things about Spark is that it’s relatively autonomous and doesn’t require a lot of extra infrastructure to work. While Spark’s latest release is at 2.1.0 at the time of publishing, we’ll use the example of 2.0.1 throughout this post.
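For instance, the sketch below uses the Spark 2.x entry point, the SparkSession, to build and aggregate a small DataFrame; the column names are hypothetical, and real workloads would typically load data from HDFS, S3, JDBC, and so on.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

# a tiny DataFrame built in-process, just to exercise the API
df = spark.createDataFrame([('Texas', 27.86), ('Nebraska', 1.91), ('Rhode Island', 1.06)],
                           ['state', 'population_millions'])

df.filter(df.population_millions > 1.5).groupBy().sum('population_millions').show()
spark.stop()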

Jupyter

Jupyter notebooks are an interactive way to code that enables rapid prototyping and exploration. They essentially connect a browser-based frontend, through the Jupyter server, to an interactive REPL underneath that can process snippets of code. The advantage to the user is being able to write code in small chunks that can be run independently but share the same namespace, greatly facilitating testing or trying multiple approaches in a modular fashion. The platform supports a number of kernels (the components that actually run the code) besides the out-of-the-box Python, but connecting Jupyter to Spark is a little trickier. Enter Apache Toree, a project meant to solve this problem by acting as a middleman between a running Spark cluster and other applications.

In this post I’ll describe how we go from a clean Ubuntu installation to being able to run Spark 2.0 code on Jupyter. Continue reading