Data are becoming the new raw material of business
The Economist


Beyond Excel: Popular Data Analysis Methods from Excel, using pandas

Microsoft Excel is a spreadsheet software, containing data in tabular form. Entries of the data are located in cells, with numbered rows and letter labeled columns. Excel is widespread across industries and has been around for over thirty years. It is often people’s first introduction to data analysis.

Most users feel at home using a GUI to operate Excel and no programming is necessary for the most commonly used features. The data is presented right in front of the user and it is easy to scroll around through the spreadsheet. Making plots from the data only involves highlighting cells in the spreadsheet and clicking a few buttons.

There are various shortcomings with Excel. It is closed source and not free. There are free open-source alternatives like OpenOffice and LibreOffice suites, but there might be compatibility issues between file formats, especially for complex spreadsheets. Excel becomes unstable for files reaching 500 MB, being unresponsive and crashing for large files, hindering productivity. Collaborations can become difficult because it is hard to inspect the spreadsheet and understand how certain values are calculated/populated. It is difficult to understand the user’s thought process and work flow for the analysis.

The functionality of the spreadsheet is sensitive to the layout and moving entries around can have disastrous effects. Tasks like data cleaning, munging, treating different data types, and handling missing data are often difficult and require manual manipulation. Further, the user is limited to the built-in functionality of the program. In short, Excel is great for certain tasks but becomes unwieldy and inefficient as the tasks become more complicated.

The Python library pandas is a great alternative to Excel, providing much of the same functionality and more. Pandas is great for data manipulation, cleaning, analysis, and exploration. Additionally, these tasks can be easily automated and reapplied to different datasets. It is built on top of Python’s NumPy library which offers fast and efficient numerical computations. Unlike Excel, it is open-source and free.

Some people may be intimidated to use Python and pandas because they will have to move away from the point-and-click GUI approach of Excel. Knowing pandas is a great introduction to more powerful and complex data analysis. This article explores how do to some common Excel tasks in pandas, helping people get their feet wet with one of the most powerful Python libraries for data science.

 

A Quick Introduction to pandas

The equivalent to an Excel spreadsheet in pandas is the DataFrame class. It looks like a spreadsheet, with rows, columns, and indices. Let us consider three spreadsheets, the first two containing each student’s grade on an exam and the third has information on which section the students belong. These DataFrames are loaded into memory from CSV files using the read_csv function.

In [79]:
# import pandas library
import pandas as pd

# create pandas DataFrame from CSV
exam_one = pd.read_csv('csv/exam_one.csv', delimiter=',')
exam_two = pd.read_csv('csv/exam_two.csv', delimiter=',')
sections = pd.read_csv('csv/sections.csv', delimiter=',')

# print the top five entries of the DataFrame
exam_one.head()
Out[79]:
Student Name Exam 1 Score
0 Edwin Duncan 83
1 Steve Hayes 69
2 Cecilia Richards 91
3 Dorothy Harris 53
4 Jordan Davis 76

 

Columns can be indiviudally selected simply by referring to them by name.

In [80]:
exam_one['Student Name']
Out[80]:
0        Edwin Duncan
1         Steve Hayes
2    Cecilia Richards
3      Dorothy Harris
4        Jordan Davis
Name: Student Name, dtype: object

 

Vlookup

Experienced Excel users rely on Vlookup, a built-in function that searches (looks up) a specified value in one column and returns the corresponding value of another column. For our example of exam scores, we would like to take a student’s second exam score and include it into the table of first exam score. The column of student names may not be in the same order, e.g., the first name in one table may not correpsond to the first name in another table.

A Vlookup can be done in pandas using the merge method of a DataFrame. The merge method takes the second DataFrame and the on keyword specifies the column name to perform that matching on. The result is a new DataFrame with both exam scores.

In [81]:
# apply "Vlookup" in padas
exam_scores = exam_one.merge(exam_two, on='Student Name')
exam_scores.head()
Out[81]:
Student Name Exam 1 Score Exam 2 Score
0 Edwin Duncan 83 33
1 Steve Hayes 69 42
2 Cecilia Richards 91 73
3 Dorothy Harris 53 75
4 Jordan Davis 76 81

 

Pivot Tables

Pivot tables are used to aggregate and filter data and are a useful tool for data analysis in Excel. We can group data by certain values in a given column and filter out rows. In pandas, we can easily filter out rows from our DataFrame by using Boolean logic. For this example, we would like to determine the student’s name that belong to section “A”. This is done in pandas by first creating an array of True/False values. This array corresponds to which rows met the condition. We then use the resulting Boolean array to only call rows that meet our condition.

In [82]:
IX = sections['Section'] == 'A'  # Boolean (True/False) indices or rows that met our filtering criteria
sections[IX]  # use the Boolean indices to access rows
Out[82]:
Student Name Section
0 Cecilia Richards A
3 Steve Hayes A

 

For each section, we would like to calculate the mean score for each exam. We can create pivot tables in pandas by using either the pivot_table or group_by method. Using the pivot_table method, the syntax is:
In [83]:
# use merge to add student's section
student_data = exam_scores.merge(sections, on='Student Name')

# create pivot table using pivot_table method
student_data.pivot_table(index='Section', aggfunc='mean')
Out[83]:
Exam 1 Score Exam 2 Score
Section
A 80.0 57.5
B 64.5 78.0
C 83.0 33.0

 

In the above code, the new index corresponds to the former Section column and the aggfunc is the operation we want to perform. An alternate approach is to utilize the groupby method, akin to the GROUP BY statement in SQL.
In [96]:
student_data.groupby('Section').agg({'Exam 1 Score': 'mean', 'Exam 2 Score': 'mean'})
Out[96]:
Exam 2 Score Exam 1 Score
Section
A 57.5 80.0
B 78.0 64.5
C 33.0 83.0

 

In the above code, after we applied groupby, we then used the agg method and passed a Python dictionary. The keys of the dictionary are the column names to apply the aggregation and the values are the actual aggregation function. If we want to apply different or more than one aggregation function to each column, we can pass a dictionary whose values consist of lists of aggregation functions:
In [97]:
student_data.groupby('Section').agg({'Exam 1 Score':['max', 'mean'], 'Exam 2 Score': ['max', 'min']})
Out[97]:
Exam 2 Score Exam 1 Score
max min max mean
Section
A 73 42 91 80.0
B 81 75 76 64.5
C 33 33 83 83.0

 

Concluding Remarks

Pandas provides the powerful DataFrame class, of which Excel users will recognize as resembling and behaving similarly to an Excel spreadsheet. Through the DataFrame class, one can perform operations equivalent to Excel’s Vlookup and pivot tables. One of the barriers of learning a new tool is knowing how to perform operations one was accustomed to with the old tool. This article shows pandas’ syntax, illustrating how pandas works and easing one’s transition from Excel. Learning pandas is a good start in familiarizing oneself with the toolkit of a data scientist.

Tweet about this on TwitterShare on FacebookShare on LinkedInEmail this to someone
Share this with someone

Back to index