If you’re an aspiring data scientist but still processing your data in Excel, you might want to upgrade your toolset. Why? Firstly, while advanced features like Excel Pivot tables can do a lot, they don’t offer nearly the flexibility, control, and power of tools like SQL, or their functional equivalents in Python (Pandas) or R (Dataframes). Also, Excel has low size limits, making it suitable for “small data”, not “big data.”
In this blog entry we’ll talk about SQL. This should cover your “medium data” needs, which we’ll define as the next level of data where the rows do not fit the 1 million row restriction in Excel. SQL stores data in tables, which you can think of as a spreadsheet layout but with more structure. Each row represents a specific record, (e.g. an employee at your company) and each column of a table corresponds to an attribute (e.g. name, department id, salary). Critically, each column must be of the same “type”. Here is a sample of the table Employees:
EmployeeId Name StartYear Salary DepartmentId 1 Bob 2001 10.5 10 2 Sally 2004 20. 10 3 Alice 2005 25. 20 4 Fred 2004 12.5 20
SQL has many keywords which compose its query language but the ones most relevant to data scientists are SELECT, WHERE, GROUP BY, JOIN. We’ll go through these each individually.
SELECT is the foundational keyword in SQL. SELECT can also filter on columns. For example
SELECT Name, StartYear FROM Employees
Name StartYear Bob 2001 Sally 2004 Alice 2005 Fred 2004
The WHERE clause filters the rows. For example
SELECT * FROM Employees WHERE StartYear=2004
EmployeeId Name StartYear Salary DepartmentId 2 Sally 2004 20. 10 4 Fred 2004 12.5 20
Next, the GROUP BY clause allows for combining rows using different functions like COUNT (count) and AVG (average). For example,
SELECT StartYear, COUNT(*) as Num, AVG(Salary) as AvgSalary FROM EMPLOYEES GROUP BY StartYear
StartYear Num AvgSalary 2001 1 10.5 2004 2 16.25 2005 1 25.
Finally, the JOIN clause allows us to join in other tables. For example, assume we have a table called Departments:
DepartmentId DepartmentName 10 Sales 20 Engineering
We could use JOIN to combine the Employees and Departments tables based ON the DepartmentId fields:
SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName FROM Employees JOIN Departments ON Employees.DepartmentId = Departments.DepartmentId;
The results might look like:
EmpName DepName Bob Sales Sally Sales Alice Engineering Fred Engineering
We’ve ignored a lot of details about joins: e.g. there are actually (at least) 4 types of joins, but hopefully this gives you a good picture.
Conclusion and Further Reading
With these basic commands, you can get a lot of basic data processing done. Don’t forget, that you can nest queries and create really complicated joins. It’s a lot more powerful than Excel, and gives you much better control of . Of course, there’s a lot more to SQL than what we’ve mentioned and this is only intended to wet your appetite and give you a taste of what you’re missing.
- For additional tutorials on SQL, checkout this interactive online tutorial or the data-scientist-geared one provided by Mode Analytics. Also, check out this comprehensive list of 59 websites to learn SQL.
- For additional resources on Python Pandas, checkout a list of tutorials (complete with github repos!) or the 10 minute pandas tutorial for the impatient.
- For a tutorial about R’s Dataframes, checkout this page.
And when you’re ready to step it up from “medium data” to “big data”, you should apply for a fellowship at The Data Incubator where we work with current-generation data-processing technologies like MapReduce and Spark!