Data are becoming the new raw material of business
The Economist


Here’s How to Survive the Rise of A.I. – Become a Data Facilitator

Front office jobs at investment banks are increasingly being taken over by intelligent machines. Many current front office employees are worried about being displaced by artificial intelligence, and their fears are not unfounded. Huy Nguyen Trieu, former head of macro structuring at Citigroup, has a positive message for traders who risk being replaced by automation: become a data facilitator.

 

Huy Nguyen Trieu left Citi in 2016. After 13 years in financial engineering at SocGen and RBS, and a subsequent role as a Managing Director and head of macro structuring at Citi, he has shifted his focus to acting as a thought leader in the fintech space. Currently a fintech fellow at London’s Imperial College and a mentor at the fintech accelerator Level 39, Nguyen Trieu is both fintech guru and entrepreneur. His current focus is long-term employability in investment banks.

Continue reading


A Study Of Reddit Politics

This article was written for The Data Incubator by Jay Kaiser, a Fellow of our 2018 Winter cohort in Washington, DC who landed a job with our hiring partner, ZeniMax Online Studios, as a Big Data Engineer.

 

The Question

The 2016 Presidential Election was, in a single word, weird. So much happened during the months leading up to November that it became difficult to keep track of who said what, when, and why. However, the finale of the election, which culminated in Republican candidate Donald J. Trump winning the majority of the Electoral College and hence becoming the 45th President of the United States, was an outcome that I had thought impossible at the time, if solely due to the aforementioned eccentric series of events that had circulated around Trump for a majority of his candidacy.

Following the election, the prominent question that could not leave my mind was a simple one: how? How had the American people changed so much in only a couple of years to allow an outsider hit by a number of black marks during the election to be elected to the highest position in the United States government? How did so many pollsters and political scientists fail to predict this outcome? How can we best analyze the campaigns of each candidate, now given hindsight and knowledge of the eventual outcome? In an attempt to answer each of these, I have turned to a perhaps unlikely source.

Continue reading


SQLite vs Pandas: Performance Benchmarks

This technical article was written for The Data Incubator by Paul Paczuski, a Fellow of our 2016 Spring cohort in New York City who landed a job with our hiring partner, Genentech as a Clinical Data Scientist.

As data scientists, we all know that unglamorous data manipulation is 90% of the work. Two of the most common data manipulation tools are SQL and pandas. In this blog, we’ll compare the performance of pandas and SQLite, a simple form of SQL favored by Data Scientists.

Let’s find out the tasks at which each of these excels. Below, we compare Python’s pandas to SQLite for some common data analysis operations: sort, select, load, join, filter, and group by.

Continue reading


Data Science in 30 Minutes: The Accidental Data Scientist with Katrina Riehl, Director of Data Science for HomeAway.com

This FREE webinar will take place LIVE online on July 24th at 5:30PM ET. Register below now, space is limited!


Join The Data Incubator and Katrina Riehl, Director of Data Science for HomeAway.com, for the July 2018 installment of our free monthly webinar series, Data Science in 30 minutes: The Accidental Data Scientist.

Katrina will detail the journey her career has taken from researcher and software developer to Data Scientist. She will explain how her technology roles and skills have evolved as this new discipline emerged over the last decade: starting out as a young Python and Artificial Intelligence enthusiast, finally embracing Data Science as a discipline after many years, and leading a strong and diverse Data Science team.
Continue reading


GPU Cloud Computing Services Compared: AWS, Google Cloud, IBM Nimbix/Power AI, and Crestle

This technical article was written for The Data Incubator by Tim Pollio, a Fellow of our 2017 Fall cohort in Washington, DC who joined The Data Incubator team as one of our resident Data Scientist instructors.

At The Data Incubator, a data science training and placement company, we’re excited about the potential for neural networks and deep learning to transform AI and Big Data. Of course, to run deep learning in practice, normal CPUs won’t suffice; you’ll need GPUs. GPUs can dramatically increase the speed of deep learning algorithms, so it’s no surprise that they’re becoming increasingly popular and accessible. Amazon, Google, and IBM all offer GPU-enabled options with their cloud computing services, and newer companies like Crestle provide additional options.

We tried four different services (Amazon Web Services, Google Cloud Platform, Nimbix/PowerAI, and Crestle) to find the options with the best performance, price, and convenience. Each service was tested using the same task: 1000 training steps on a TensorFlow neural network designed for text prediction. The code for this benchmark can be found here.

Continue reading


Python Multi-Threading vs Multi-Processing

Python has a library called threading that uses threads (rather than just processes) to implement parallelism. This may be surprising news if you know about Python’s Global Interpreter Lock, or GIL, but threading actually works well for certain workloads without violating the GIL. And this is all done with little extra effort: simply define functions that make I/O requests and the system will handle the rest.

 

Global Interpreter Lock

The Global Interpreter Lock reduces the usefulness of threads in Python (more precisely, CPython) by allowing only one native thread to execute Python bytecode at a time. This made Python easier to implement on top of the (usually thread-unsafe) C libraries and can increase the execution speed of single-threaded programs. However, it remains controversial because it prevents true lightweight parallelism. You can achieve parallelism, but it requires multi-processing, which is implemented by the eponymous library multiprocessing. Instead of spinning up threads, this library uses processes, which bypasses the GIL.

It may appear that the GIL would kill Python multithreading, but not quite. In general, there are two main use cases for multithreading:

  1. To take advantage of multiple cores on a single machine
  2. To take advantage of I/O latency to process other threads

In general, we cannot benefit from (1) with threading but we can benefit from (2).
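
To make case (2) concrete, here is a minimal sketch, not taken from the original article. It assumes that time.sleep is a fair stand-in for a real I/O request (sleeping releases the GIL, much as waiting on a socket or disk does), and it uses concurrent.futures.ThreadPoolExecutor, a standard-library convenience built on top of threading. The function names fake_io_request, run_sequential, and run_threaded are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_request(delay):
    """Hypothetical stand-in for a network or disk call.

    time.sleep releases the GIL, so other threads can run while this waits.
    """
    time.sleep(delay)
    return delay


def run_sequential(delays):
    """Issue the requests one after another and time the whole batch."""
    start = time.perf_counter()
    results = [fake_io_request(d) for d in delays]
    return results, time.perf_counter() - start


def run_threaded(delays):
    """Issue the requests from one thread each so their waits overlap."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        # pool.map preserves input order in its results
        results = list(pool.map(fake_io_request, delays))
    return results, time.perf_counter() - start


delays = [0.1] * 4  # four simulated requests of 0.1 seconds each
seq_results, seq_time = run_sequential(delays)
thr_results, thr_time = run_threaded(delays)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

Because the waits overlap, the threaded run should take roughly as long as a single request rather than the sum of all four. Replacing the sleep with a CPU-bound loop would be expected to remove most of that advantage, which is case (1).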

Continue reading


Tensorflow with Keras – Empowering Neural Networks for Deep Learning

Building deep neural networks just got easier. TensorFlow has announced that they are incorporating the popular deep learning API, Keras, as part of the core code that ships with TensorFlow 1.2. In the words of Keras’ author François Chollet, “Theano and TensorFlow are closer to NumPy, while Keras is closer to scikit-learn,” which is to say that Keras is at a higher level compared to pure TensorFlow and makes building deep learning models much more manageable.

TensorFlow is one of the fastest, most flexible, and most scalable machine-learning libraries available. It was developed internally by Google Brain and released as an open-source library in November 2015. Almost immediately upon its release, TensorFlow became one of the most popular machine learning libraries. But, as is the case with many libraries that emphasize speed and flexibility, TensorFlow tends to be a bit low-level.

Continue reading


Data Science in 30 Minutes: A Conversation with Gregory Piatetsky-Shapiro, President of KDnuggets


KDnuggets’ Gregory Piatetsky-Shapiro, Ph.D., joined The Data Incubator in January for the first 2018 installment of our free online webinar series, Data Science in 30 minutes! Gregory discussed his career, from Data Mining to Data Science, and examined current trends in the field.

From Data Mining to Knowledge Discovery to Data Science: Gregory Piatetsky-Shapiro talked about his pioneering career in data science, including founding KDnuggets and co-founding KDD Conferences and ACM SIGKDD, and examined current trends in the field, including Data Science automation, citizen Data Scientists, and the implications of AI.
Continue reading


Ranking Popular Deep Learning Libraries for Data Science

At The Data Incubator, we pride ourselves on having the most up-to-date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback, we wanted to develop a data-driven approach for determining what we should be teaching in our data science corporate training and our free fellowship for master’s graduates and PhDs looking to enter data science careers in industry. Here are the results.
 

The Rankings

Below is a ranking of 23 open-source deep learning libraries that are useful for Data Science, based on GitHub and Stack Overflow activity, as well as Google search results. The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0). For example, Caffe is one standard deviation above average in GitHub activity, while deeplearning4j is close to average. See below for methods.
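
As a rough illustration of what a standardized score means, the sketch below converts raw counts into z-scores. It is not the method used for the ranking: the library names and counts are made up, and the choice of the population standard deviation (statistics.pstdev) is an assumption.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation.

    Assumption: the population standard deviation is used here; the
    ranking's actual method may differ.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so every score sits at the average.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical raw activity counts, invented for illustration only.
raw_activity = {"lib_a": 900, "lib_b": 500, "lib_c": 100}

scores = dict(zip(raw_activity, standardize(list(raw_activity.values()))))
for name, score in scores.items():
    print(f"{name}: {score:+.2f}")
```

A library whose raw count equals the average gets a score of 0, and a score of 1 marks a count one standard deviation above that average, which matches how the table is meant to be read.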


Continue reading