Data are becoming the new raw material of business
The Economist

2018 Data Sources for Cool Data Science Projects, provided by Thinknum

Links to our previous “Data Sources for Cool Data Science Projects” posts:
Part 1, Part 2, Part 3, Part 4, Part 5

At The Data Incubator, we run a Data Science Fellowship program for Master’s and PhD graduates looking to transition to a career in industry. Our admissions team, as well as our hiring partners, love Fellows who don’t mind getting their hands dirty with data. That’s why our applicants submit ideas for capstone projects they’ll work on throughout the 8-week Fellowship to showcase their data science skills. One of the biggest obstacles to creating and completing successful projects has been getting access to interesting data.

Today, we’re excited to announce a partnership with Thinknum, a leading provider of alternative web-crawled data. Thinknum has been the principal provider of web-crawled data to the finance community for over 3 years, counting more than 150 elite hedge funds and a majority of investment banks among its clients, who use the data to experiment with ever more innovative and differentiated ways of producing investment ideas across all sectors and multiple asset classes. More recently, Thinknum’s data has been in high demand among some of the largest and most innovative corporate customers for internal strategic decision making. The data is also heavily used by journalists, especially those reporting on the financial sector, with media outlets like CNN, Business Insider and CNBC all using Thinknum resources in their stories. This partnership will provide Fellows and Fellowship applicants access to some of the data used by experts in the finance industry and corporate leaders on a daily basis.

Business, economic and social activity is continually moving online. This increasing digital activity leaves behind data trails that, with proper organization, can reveal otherwise invisible trends, shifts and movements. Thinknum clients, and now The Data Incubator Fellows and applicants, can use this data for investing, gaining a deeper understanding of businesses, or telling a story about an industry trend. Thinknum trawls the internet to collect data on over 400,000 public and private companies across the globe every day, generating huge amounts of data. Their intuitive web-based tool will allow Fellows to easily navigate huge volumes of data to gather insights, create correlations, and generate visualizations to share with other Fellows in seconds.

Thinknum Data

Thinknum tracks thousands of websites, capturing vast amounts of public data, indexing it, and mapping it back to individual companies. In the full Thinknum library there are over 20 datasets, each containing dozens of metrics updated daily.

3 Datasets

Thinknum is providing The Data Incubator with access to three real-world datasets for our Fellows to analyze and explore. In terms of potential projects, there are virtually limitless options for each dataset, and most of them haven’t been worked through. If you take a look at the number of columns for each, you will get a sense of just how many questions one can ask. A few initial suggestions are included, though.

Enter your email to receive the data sets and get started on your own data science projects:


    Job Postings:

  • This database tracks individual job postings on corporate websites, allowing researchers and data scientists to view a company’s overall hiring plans over time. In addition to historical data, users can explore in great detail what types of positions a company is looking to fill, where it is looking to grow geographically, and in which specific product or business lines it is looking to expand the most.
  • Using this database, Thinknum Media journalists were able to show that the number of job listings at Apple’s new headquarters containing the word “Siri” had spiked in recent weeks. They also saw that of the 161 jobs related to Siri, almost all (154) were in software engineering. From their findings, Thinknum journalists were able to predict Apple’s efforts to concentrate on Siri development an entire week before the plans were officially announced by Apple.
    • Project suggestions:

    • In which geographies are tech companies hiring the most engineers, blockchain developers, etc.?
    • Using job openings data, explore how banks are shifting their strategy in response to a heavier reliance on technology or a heavier regulatory burden.


    LinkedIn Profiles:

  • This database tracks and records the number of employees across companies on a daily basis and provides real-time insight into how aggressively a company is growing compared with its own plans and with its industry.
  • Here, Thinknum Media looked at the LinkedIn profile data for Vox and Buzzfeed employees, as well as job listings data. The journalists also looked at company survey data from Glassdoor, and found that the number of Vox employees who had a positive outlook on the future of their company had fallen almost 20%. By combining all these datasets, they found that the number of open job listings was falling and that the number of people reporting to be employees of the companies had fallen. Coupled with the findings from the Glassdoor surveys, this painted a picture of slowing company growth for both Vox and Buzzfeed.
    • Project suggestions:

    • Which companies have delivered on their strategic expansion plans (filled the most job openings that showed up on LinkedIn)?
    • Find companies where hiring is most predictive of stock prices.


    Facebook Followers:

  • Social media platforms like Facebook provide a myriad of data points about companies such as customer traction, foot traffic, and brand awareness among others.
  • By analyzing Facebook ‘check in’ data, investment bank Cowen was able to track foot traffic to Chipotle starting in 2017, and thus predict falls in Chipotle’s stock performance as well. This way of analyzing footfall became a staple for fast food restaurant research analysts, as discussed in a Yahoo Finance article.
    • Project suggestions:

    • Identify the companies with the highest volatility in “talking about count” and use any information available online to see whether this metric overlaps with highly publicized events and marketing campaigns.
    • Use Facebook check-ins as a metric for foot traffic for restaurant, hospitality and retail businesses. Who are the winners in attracting customers to physical locations?
    • Track Facebook followers to find which companies are the most successful at growing social media traction.

While building your own project cannot replicate the experience of fellowship at The Data Incubator (our Fellows get amazing access to hiring managers and access to nonpublic data sources) we hope this will get you excited about working in data science. And when you are ready, you can apply to be a Fellow!

Got any more data sources? Let us know and we’ll add them to the list!




Here’s How to Survive the Rise of A.I. – Become a Data Facilitator

Front office jobs at investment banks are increasingly being taken over by intelligent machines. Many current front office employees are worried about being displaced by artificial intelligence, and their fears are not unfounded. Huy Nguyen Trieu, former head of macro structuring at Citigroup, has a positive message for traders who risk being replaced by automation: become a data facilitator.


Huy Nguyen Trieu left Citi in 2016. After 13 years in financial engineering at SocGen and RBS and a subsequent role as a Managing Director and head of macro structuring at Citi, he has shifted his focus to acting as a thought leader in the fintech space. Currently a fintech fellow at London’s Imperial College and a mentor at the fintech accelerator Level 39, Nguyen Trieu is both fintech guru and entrepreneur. His current focus is the issue of long-term employability in investment banks.

Continue reading

A Study Of Reddit Politics

This article was written for The Data Incubator by Jay Kaiser, a Fellow of our 2018 Winter cohort in Washington, DC who landed a job with our hiring partner, ZeniMax Online Studios, as a Big Data Engineer.


The Question

The 2016 Presidential Election was, in a single word, weird. So much happened during the months leading up to November that it became difficult to keep track of who said what, when, and why. However, the finale of the election, which culminated with Republican candidate Donald J. Trump winning the majority of the Electoral College and hence becoming the 45th President of the United States, was an outcome that at the time I had thought impossible, if solely due to the aforementioned eccentric series of events that had circulated around Trump for a majority of his candidacy.

Following the election, the prominent question that could not leave my mind was a simple one: how? How had the American people changed so much in only a couple of years to allow an outsider hit by a number of black marks during the election to be elected to the highest position in the United States government? How did so many pollsters and political scientists fail to predict this outcome? How can we best analyze the campaigns of each candidate, now given hindsight and knowledge of the eventual outcome? In an attempt to answer each of these, I have turned to a perhaps unlikely source.

Continue reading

SQLite vs Pandas: Performance Benchmarks

This technical article was written for The Data Incubator by Paul Paczuski, a Fellow of our 2016 Spring cohort in New York City who landed a job with our hiring partner, Genentech as a Clinical Data Scientist.

As data scientists, we all know that unglamorous data manipulation is 90% of the work. Two of the most common data manipulation tools are SQL and pandas. In this blog post, we’ll compare the performance of pandas and SQLite, a lightweight SQL database engine favored by data scientists.

Let’s find out which tasks each of them excels at. Below, we compare Python’s pandas to SQLite for some common data analysis operations: sort, select, load, join, filter, and group by.
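To give a flavor of what such a comparison involves, here is a minimal sketch that times one of the listed operations (group by) in both tools. The table, its column names (`grp`, `value`), its size, and the `timed` helper are all hypothetical stand-ins, not the data or harness used in the original benchmark.

```python
import sqlite3
import time

import pandas as pd

# Hypothetical toy table; the real benchmark's data and sizes are not shown here.
df = pd.DataFrame({
    "id": list(range(1000)),
    "grp": [i % 10 for i in range(1000)],
    "value": [float(i) for i in range(1000)],
})

# Load the same rows into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
df.to_sql("t", conn, index=False)


def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# The same group-by aggregation, once in pandas and once in SQL.
pandas_result, pandas_secs = timed(lambda: df.groupby("grp")["value"].sum())
sqlite_result, sqlite_secs = timed(
    lambda: conn.execute(
        "SELECT grp, SUM(value) FROM t GROUP BY grp ORDER BY grp"
    ).fetchall()
)

print(f"pandas: {pandas_secs:.6f}s, sqlite: {sqlite_secs:.6f}s")
```

A single run on a table this small says little about real performance; a fair comparison would repeat each operation many times over a range of table sizes.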

Continue reading

Data Science in 30 Minutes: The Accidental Data Scientist with Katrina Riehl, Director of Data Science for

This FREE webinar will take place LIVE online on July 24th at 5:30PM ET. Register below now, space is limited!

Join The Data Incubator and Katrina Riehl, Director of Data Science for, for the July 2018 installment of our free monthly webinar series, Data Science in 30 minutes: The Accidental Data Scientist.

Katrina will detail the journey her career has taken from researcher and software developer to Data Scientist. She will explain how her technology roles and skills have evolved as this new discipline emerged over the last decade. She started out as a young Python and Artificial Intelligence enthusiast and, after many years, finally embraced Data Science as a discipline and came to lead a strong and diverse Data Science team.
Continue reading

GPU Cloud Computing Services Compared: AWS, Google Cloud, IBM Nimbix/Power AI, and Crestle

This technical article was written for The Data Incubator by Tim Pollio, a Fellow of our 2017 Fall cohort in Washington, DC who joined The Data Incubator team as one of our resident Data Scientist instructors.

At The Data Incubator, a data science training and placement company, we’re excited about the potential for neural networks and deep learning to transform AI and Big Data. Of course, to run deep learning in practice, normal CPUs won’t suffice; you’ll need GPUs. GPUs can dramatically increase the speed of deep learning algorithms, so it’s no surprise that they’re becoming increasingly popular and accessible. Amazon, Google, and IBM all offer GPU-enabled options with their cloud computing services, and newer companies like Crestle provide additional options.

We tried four different services (Amazon Web Services, Google Cloud Platform, Nimbix/PowerAI, and Crestle) to find the options with the best performance, price, and convenience. Each service was tested using the same task: 1000 training steps on a TensorFlow neural network designed for text prediction. The code for this benchmark can be found here.
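As a rough illustration of how a fixed number of training steps can be turned into a time and price comparison, here is a minimal sketch. It is not the benchmark code referred to above: `dummy_step` is a hypothetical stand-in for one training step, and the hourly rate is an invented number, not any provider's actual price.

```python
import time


def time_steps(step_fn, n_steps=1000):
    """Return the wall-clock seconds taken to call step_fn n_steps times."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return time.perf_counter() - start


def run_cost_usd(elapsed_seconds, hourly_rate_usd):
    """Convert a run time into a cost, assuming simple per-hour billing."""
    return elapsed_seconds / 3600.0 * hourly_rate_usd


def dummy_step():
    # Hypothetical placeholder for one training step of a real model.
    sum(i * i for i in range(100))


elapsed = time_steps(dummy_step, n_steps=1000)
cost = run_cost_usd(elapsed, hourly_rate_usd=0.90)  # invented rate
print(f"{elapsed:.3f}s for 1000 steps, about ${cost:.6f}")
```

Running the same step function on each service and comparing both numbers gives one simple view of the performance and price trade-off; real billing is often rounded to a minimum increment, which this sketch ignores.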

Continue reading

Python Multi-Threading vs Multi-Processing

Python has a library called threading that uses threads (rather than just processes) to implement parallelism. This may be surprising news if you know about Python’s Global Interpreter Lock, or GIL, but it actually works well for certain workloads without violating the GIL. And it takes very little extra code: simply define functions that make I/O requests and the system will handle the rest.


Global Interpreter Lock

The Global Interpreter Lock reduces the usefulness of threads in Python (more precisely CPython) by allowing only one native thread to execute at a time. This makes CPython easier to implement on top of (usually thread-unsafe) C libraries and can increase the execution speed of single-threaded programs. However, it remains controversial because it prevents true lightweight parallelism. You can achieve parallelism, but it requires multi-processing, which is implemented by the eponymous library multiprocessing. Instead of spinning up threads, this library uses processes, which bypasses the GIL.

It may appear that the GIL would kill Python multithreading, but not quite. In general, there are two main use cases for multithreading:

  1. To take advantage of multiple cores on a single machine
  2. To take advantage of I/O latency to process other threads

In general, we cannot benefit from (1) with threading but we can benefit from (2).
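Use case (2) can be sketched with the threading library as follows. This is a minimal illustration, assuming that `time.sleep` is an acceptable stand-in for a blocking I/O call (it releases the GIL while waiting, as real network or disk waits do); `fake_io_request` and `run_threaded` are hypothetical names.

```python
import threading
import time


def fake_io_request(results, index, delay=0.2):
    # time.sleep stands in for a blocking network or disk call.
    time.sleep(delay)
    results[index] = index * 2


def run_threaded(n_tasks=4, delay=0.2):
    """Run n_tasks simulated I/O calls in threads; return (results, seconds)."""
    results = [None] * n_tasks
    threads = [
        threading.Thread(target=fake_io_request, args=(results, i, delay))
        for i in range(n_tasks)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start


results, elapsed = run_threaded()
print(results, f"{elapsed:.2f}s")
```

Because the four waits overlap, the total time should be close to a single delay rather than four times the delay. Replacing the sleep with a CPU-bound loop would remove that gain, since only one thread can execute Python bytecode at a time.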

Continue reading

Tensorflow with Keras – Empowering Neural Networks for Deep Learning

Building deep neural networks just got easier. TensorFlow has announced that they are incorporating the popular deep learning API, Keras, as part of the core code that ships with TensorFlow 1.2. In the words of Keras’ author François Chollet, “Theano and TensorFlow are closer to NumPy, while Keras is closer to scikit-learn,” which is to say that Keras is at a higher level compared to pure TensorFlow and makes building deep learning models much more manageable.

TensorFlow is one of the fastest, most flexible, and most scalable machine-learning libraries available. It was developed internally by Google Brain and released as an open-source library in November 2015. Almost immediately upon its release, TensorFlow became one of the most popular machine learning libraries. But, as is the case with many libraries that emphasize speed and flexibility, TensorFlow tends to be a bit low-level.

Continue reading

Data Science in 30 Minutes: A Conversation with Gregory Piatetsky-Shapiro, President of KDnuggets

KDnuggets’ Gregory Piatetsky-Shapiro, Ph.D., joined The Data Incubator in January for the first 2018 installment of our free online webinar series, Data Science in 30 minutes! Gregory discussed his career, from Data Mining to Data Science, and examined current trends in the field.

From Data Mining to Knowledge Discovery to Data Science: Gregory Piatetsky talked about his pioneering career in data science, including founding KDnuggets, and co-founding KDD Conferences and ACM SIGKDD, and examined current trends in the field, Data Science Automation, citizen Data Scientists, and implications of AI.
Continue reading

Ranking Popular Deep Learning Libraries for Data Science

At The Data Incubator, we pride ourselves on having the most up-to-date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback, we wanted to develop a data-driven approach for determining what we should be teaching in our data science corporate training and our free fellowship for master’s and PhD graduates looking to enter data science careers in industry. Here are the results.

The Rankings

Below is a ranking of 23 open-source deep learning libraries that are useful for Data Science, based on GitHub and Stack Overflow activity, as well as Google search results. The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0). For example, Caffe is one standard deviation above average in GitHub activity, while deeplearning4j is close to average. See below for methods.
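For readers unfamiliar with standardized scores, the idea can be sketched as follows. The library names and raw counts are hypothetical, not the figures behind the ranking, and the use of the population standard deviation is an assumption; the actual method may differ.

```python
from statistics import mean, pstdev


def standardize(raw):
    """Map each raw value to its distance from the mean in standard deviations."""
    values = list(raw.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return {name: (value - mu) / sigma for name, value in raw.items()}


# Hypothetical raw activity counts for three made-up libraries.
github_activity = {"lib_a": 10, "lib_b": 20, "lib_c": 30}
scores = standardize(github_activity)

for name, score in scores.items():
    print(f"{name}: {score:+.2f}")
```

After this transformation the scores average to 0 and have a standard deviation of 1, which is what allows metrics measured on very different scales to be compared or combined.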

Continue reading