At The Data Incubator, we run a free six-week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are some more cool public data sources you can use for your next project:
- Summer Olympics Data: The Olympic Database is a great starting place for all Olympic data. You can also look at medal winners from 1896-2008 here and find more data on medalists at Sports-Reference.
- College Football Data: You can look at Datahub for a majority of data sets. Cal Poly has some limited data, and Sports-Reference can give you data broken down by players, schools, seasons, and coaches.
- Major League Baseball: Mlbfarm has data dating back to 1876. StatCrunch also has some interesting sets to explore, as do Retrosheet, Baseball Prospectus, and Baseball Almanac.
- National Football League: The NFL publishes a lot of player statistics, and NFLsavant offers a wealth of play-by-play data by year. Aragorn published a metadata CSV set going back to 1980, and the NFL has a full report of 2015 injuries.
- Soccer: You can find data on all 2014 FIFA World Cup players.
- Tennis: The 2016 US Open came and went a few weeks ago, but tennis data is forever, or for this prize-winnings data set, at least back to 1968. Tennis Abstract has a vast list of up-to-date data, and there’s also years’ worth of match results and rankings.
- Pokémon: There’s no public dataset yet on the increasingly popular Pokémon Go game, but Pokéapi delivers everything from battle skill to evolution from the traditional game series.
- Usage & Reviews: Pew Research Center has a very thorough data set on teens and gaming, GitHub has a great dataset of Steam reviews, and a Reddit user made this dataset of IGN game reviews.
- Sales: VGChartz is a great place to start for best-selling video games, Garaph has data graphs for game hardware sales, and there’s a Video Game Sales wiki that has the most up-to-date sales.
- Public Transit: Transitfeeds and Transitland have compiled data on timetables, routes, and stops from transit systems across the globe. There’s also data on NYC transit zones, MTA feedback, and public transit deaths.
- Police: There is a lot of public police data. You can find up-to-date FBI employee data, as well as data on police violence, the police open data census, and salaries.
- Energy: The government publishes a list of electricity prices by state, weekly oil supply estimates, hydropower generation, and more specific energy data by state.
- Public Libraries: You can find annual responses from almost 10,000 public library systems, detailing everything down to collection sizes and library hours. There’s also a set of every museum in the US.
- Public Policy: Michigan State University hosts the very extensive Correlates of State Policy Project, and its codebook explains the data in greater detail. For state and local policies, check out this index.
- Politicians: Thanks to EveryPolitician, you can find just about every political figure from more than 200 countries. GitHub has compiled every US congressman since 1789.
While building your own project cannot replicate the experience of a fellowship at The Data Incubator (our Fellows get amazing access to hiring managers and to nonpublic data sources), we hope this will get you excited about working in data science. And when you are ready, you can apply to be a Fellow!
Got any more data sources? Let us know and we’ll add them to the list!