Microsoft Excel is a spreadsheet software, containing data in tabular form. Entries of the data are located in cells, with numbered rows and letter labeled columns. Excel is widespread across industries and has been around for over thirty years. It is often people’s first introduction to data analysis.

Most users feel at home using a GUI to operate Excel and no programming is necessary for the most commonly used features. The data is presented right in front of the user and it is easy to scroll around through the spreadsheet. Making plots from the data only involves highlighting cells in the spreadsheet and clicking a few buttons.

There are various shortcomings with Excel. It is closed source and not free. There are free open-source alternatives like OpenOffice and LibreOffice suites, but there might be compatibility issues between file formats, especially for complex spreadsheets. Excel becomes unstable for files reaching 500 MB, being unresponsive and crashing for large files, hindering productivity. Collaborations can become difficult because it is hard to inspect the spreadsheet and understand how certain values are calculated/populated. It is difficult to understand the user’s thought process and work flow for the analysis.

The functionality of the spreadsheet is sensitive to the layout and moving entries around can have disastrous effects. Tasks like data cleaning, munging, treating different data types, and handling missing data are often difficult and require manual manipulation. Further, the user is limited to the built-in functionality of the program. In short, Excel is great for certain tasks but becomes unwieldy and inefficient as the tasks become more complicated.

The Python library pandas is a great alternative to Excel, providing much of the same functionality and more. Pandas is great for data manipulation, cleaning, analysis, and exploration. Additionally, these tasks can be easily automated and reapplied to different datasets. It is built on top of Python’s NumPy library which offers fast and efficient numerical computations. Unlike Excel, it is open-source and free.

Some people may be intimidated to use Python and pandas because they will have to move away from the point-and-click GUI approach of Excel. Knowing pandas is a great introduction to more powerful and complex data analysis. This article explores how do to some common Excel tasks in pandas, helping people get their feet wet with one of the most powerful Python libraries for data science.

**A Quick Introduction to pandas**

The equivalent to an Excel spreadsheet in pandas is the `DataFrame`

class. It looks like a spreadsheet, with rows, columns, and indices. Let us consider three spreadsheets, the first two containing each student’s grade on an exam and the third has information on which section the students belong. These `DataFrames`

are loaded into memory from CSV files using the `read_csv`

function.

*# import pandas library*
**import pandas as pd**
*# create pandas DataFrame from CSV*
exam_one = pd.read_csv('csv/exam_one.csv', delimiter=',')
exam_two = pd.read_csv('csv/exam_two.csv', delimiter=',')
sections = pd.read_csv('csv/sections.csv', delimiter=',')
*# print the top five entries of the DataFrame*
exam_one.head()

Columns can be indiviudally selected simply by referring to them by name.

```
exam_one['Student Name']
```

**Vlookup**

Experienced Excel users rely on Vlookup, a built-in function that searches (looks up) a specified value in one column and returns the corresponding value of another column. For our example of exam scores, we would like to take a student’s second exam score and include it into the table of first exam score. The column of student names may not be in the same order, e.g., the first name in one table may not correpsond to the first name in another table.

A Vlookup can be done in pandas using the `merge`

method of a DataFrame. The `merge`

method takes the second `DataFrame`

and the `on`

keyword specifies the column name to perform that matching on. The result is a new `DataFrame`

with both exam scores.

*# apply "Vlookup" in padas*
exam_scores = exam_one.merge(exam_two, on='Student Name')
exam_scores.head()

**Pivot Tables**

Pivot tables are used to aggregate and filter data and are a useful tool for data analysis in Excel. We can group data by certain values in a given column and filter out rows. In pandas, we can easily filter out rows from our `DataFrame`

by using Boolean logic. For this example, we would like to determine the student’s name that belong to section “A”. This is done in pandas by first creating an array of True/False values. This array corresponds to which rows met the condition. We then use the resulting Boolean array to only call rows that meet our condition.

`IX = sections['Section'] == 'A' `*# Boolean (True/False) indices or rows that met our filtering criteria*
sections[IX] *# use the Boolean indices to access rows
*

`pivot_table`

or `group_by`

method. Using the `pivot_table`

method, the syntax is:*# use merge to add student's section*
student_data = exam_scores.merge(sections, on='Student Name')
*# create pivot table using pivot_table method*
student_data.pivot_table(index='Section', aggfunc='mean')

`Section`

column and the `aggfunc`

is the operation we want to perform. An alternate approach is to utilize the `groupby`

method, akin to the `GROUP BY`

statement in SQL.```
student_data.groupby('Section').agg({'Exam 1 Score': 'mean', 'Exam 2 Score': 'mean'})
```

`groupby`

, we then used the `agg`

method and passed a Python dictionary. The keys of the dictionary are the column names to apply the aggregation and the values are the actual aggregation function. If we want to apply different or more than one aggregation function to each column, we can pass a dictionary whose values consist of lists of aggregation functions:```
student_data.groupby('Section').agg({'Exam 1 Score':['max', 'mean'], 'Exam 2 Score': ['max', 'min']})
```

**Concluding Remarks**

Pandas provides the powerful `DataFrame`

class, of which Excel users will recognize as resembling and behaving similarly to an Excel spreadsheet. Through the `DataFrame`

class, one can perform operations equivalent to Excel’s Vlookup and pivot tables. One of the barriers of learning a new tool is knowing how to perform operations one was accustomed to with the old tool. This article shows pandas’ syntax, illustrating how pandas works and easing one’s transition from Excel. Learning pandas is a good start in familiarizing oneself with the toolkit of a data scientist.