"Data are becoming the new raw material of business." – The Economist

Data Science in 30 Minutes: Scikit-Learn with Core-Contributor Andreas Müller



scikit-learn's Andreas Müller joined The Data Incubator on December 5th, 2017 for our FREE monthly webinar series, Data Science in 30 Minutes!

We talked about everything new in 0.19, which was released in July of this year, and the plans for 0.20, due out early next year. Highlights include multiple-metric grid search, faster t-SNE, and better handling of categorical and mixed data.
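The multiple-metric grid search is one of the more visible 0.19 additions: you can pass several scorers to `GridSearchCV` at once and tell it which one to use for picking the final model via `refit`. A minimal sketch on synthetic data (the parameter grid and scorers here are just illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic classification data for illustration
X, y = make_classification(n_samples=200, random_state=0)

# New in 0.19: `scoring` accepts a dict of named scorers;
# `refit` names the one used to select the final model.
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring={"accuracy": "accuracy", "f1": "f1"},
    refit="accuracy",
    cv=3,
)
search.fit(X, y)

# cv_results_ now carries one mean_test_* column per scorer
print(sorted(k for k in search.cv_results_ if k.startswith("mean_test")))
```

Each scorer gets its own set of `mean_test_*` / `rank_test_*` columns in `cv_results_`, so you can inspect the trade-offs between metrics rather than re-running the search per metric.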

Data Science in 30 Minutes: A Conversation with Gregory Piatetsky-Shapiro, President of KDnuggets


This FREE webinar will be on January 11th at 5:30 PM ET. Register now, space is limited! Sign up HERE.

Join The Data Incubator and KDnuggets' Gregory Piatetsky-Shapiro, Ph.D., for the next installment of our free online webinar series, Data Science in 30 Minutes: A Conversation with KDnuggets' President. Gregory will discuss his career, from Data Mining to Data Science, and examine current trends in the field.

From Data Mining to Knowledge Discovery to Data Science: Gregory Piatetsky talks about his pioneering career in data science, including founding KDnuggets and co-founding the KDD Conferences and ACM SIGKDD, and examines current trends in the field: Data Science automation, citizen Data Scientists, and the implications of AI.

Data Science in 30 Minutes: Infrastructure for Usable Machine Learning with Spark Creator and Stanford Professor, Matei Zaharia


This FREE webinar will be on April 3rd at 5:30 PM ET. Register now, space is limited! Sign up HERE.

Join The Data Incubator and Databricks co-founder Matei Zaharia, Ph.D., for the next installment of our FREE online webinar series, Data Science in 30 Minutes: Infrastructure for Usable Machine Learning.

Despite incredible recent advances in machine learning, building machine learning applications remains prohibitively time-consuming and expensive for all but the best-trained, best-funded engineering teams. This expense usually comes not from a need for new and improved statistical models but from a lack of systems and tools for supporting end-to-end machine learning application development, from data preparation and labeling to productionization and monitoring. In the Stanford DAWN project, we are developing a set of tools to make these processes easier, from weak supervision approaches that dramatically reduce the need for labeled data, to query-specific model specialization that reduces serving cost, to end-to-end ML systems that encapsulate a complete task and greatly simplify the interface for the user.

The Value of Prioritizing Python: Alumni Spotlight on Aviv Bachan

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Aviv was a Fellow in our Fall 2016 cohort who landed a job with our hiring partner, Argyle Data.

Tell us about your background. How did it set you up to be a great data scientist? 

My background is in Geosciences. I was a climate modeler so I had a substantial amount of experience with scientific computing (numerical linear algebra, differential equations, data assimilation, etc).

What do you think you got out of The Data Incubator?

Two things. First, I got my first exposure to data science in a non-academic setting: what sort of problems data scientists might be tasked with solving within a company, and the tools they use to do so.

Second, and more importantly, I got to know a fantastic group of people and make valuable connections. I got the interview for my current position based on a recommendation from a friend from my cohort who had been hired before me (we now work together, which is great!). Recently, I got another email from a friend indicating that he would be happy to refer me to his company if I wished. I don't know if I just happened to have been part of a particularly great cohort, but I really did have a blast going through the incubator with them, and I look forward to keeping in touch with all of them for years to come.


Beyond Excel: Popular Data Analysis Methods from Excel, using pandas

Microsoft Excel is spreadsheet software that stores data in tabular form. Entries are located in cells, organized into numbered rows and letter-labeled columns. Excel is widespread across industries and has been around for over thirty years; it is often people's first introduction to data analysis.

Most users feel at home operating Excel through its GUI, and no programming is necessary for the most commonly used features. The data is presented right in front of the user, and it is easy to scroll through the spreadsheet. Making plots from the data only involves highlighting cells and clicking a few buttons.

Excel has various shortcomings, however. It is closed source and not free. There are free, open-source alternatives such as the OpenOffice and LibreOffice suites, but there can be compatibility issues between file formats, especially for complex spreadsheets. Excel becomes unstable for files approaching 500 MB, turning unresponsive or crashing outright, which hinders productivity. Collaboration can become difficult because it is hard to inspect a spreadsheet and understand how certain values are calculated or populated, or to follow the author's thought process and workflow for the analysis.

The functionality of a spreadsheet is sensitive to its layout, and moving entries around can have disastrous effects. Tasks like data cleaning, munging, treating different data types, and handling missing data are often difficult and require manual manipulation. Further, the user is limited to the built-in functionality of the program. In short, Excel is great for certain tasks but becomes unwieldy and inefficient as the tasks become more complicated.
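To give a flavor of the pandas workflow the post goes on to cover, here is a small sketch of a common Excel task, a pivot-style summary with missing-data handling, done in a few lines (the table and column names are made up for illustration):

```python
import pandas as pd

# A small table you might otherwise keep in a spreadsheet
df = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "sales": [100, 150, 200, None],  # note the missing value
})

# Handle missing data explicitly, rather than by manual cleanup
df["sales"] = df["sales"].fillna(0)

# The equivalent of an Excel pivot table: total sales by region
summary = df.groupby("region")["sales"].sum()
print(summary)
```

Unlike a spreadsheet, every step here is recorded as code, so a collaborator can see exactly how each value was produced and rerun the analysis on new data.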

Predicting Which Bills Will Become Laws, with Data Science: Alumni Spotlight on Michael Yen


At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Michael was a Fellow in our Winter 2017 cohort in San Francisco who landed a job with one of our hiring partners, Cerego.


Tell us about your background. How did it set you up to be a great data scientist?

My formal education is in physics, but I've also done a lot of my research at UC Berkeley's Computer Science department. I cherish both of these backgrounds equally, since I learned how to "do science" from physics and how to build really cool things from computer science. I think this is a winning combination for a data scientist, since a lot of companies are looking for scientists who can write code.


What do you think you got out of The Data Incubator?

Two things, and both are equally important. First, TDI gave me exposure to its hiring partners that I just couldn't get on my own. Before starting the fellowship I had spent over 10 months applying to jobs on my own with a callback rate of 3%. At TDI, my callback rate shot to 90%, and I even began fielding unsolicited interviews. I think having the TDI mark of approval certainly moved me up in the stack of resumes. Secondly, I expanded my professional network by becoming close friends with twelve other fellows who are all going to be doing fantastic things in the future.

Scikit-learn vs. StatsModels: Which, why, and how?

At The Data Incubator, we pride ourselves on having the most up-to-date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback, we wanted to develop a data-driven approach to determining what we should be teaching in our data science corporate training and our free fellowship for master's and PhD graduates looking to enter data science careers in industry. Here are the results.

This technical article was written for The Data Incubator by Brett Sutton, a Fellow of our 2017 Summer cohort in Washington, DC. 

When you’re getting started on a project that requires doing some heavy stats and machine learning in Python, there are a handful of tools and packages available. Two popular options are scikit-learn and StatsModels. In this post, we’ll take a look at each one and get an understanding of what each has to offer.

Scikit-learn's development began in 2007, and it was first released in 2010. The current version, 0.19, came out in July 2017. StatsModels started in 2009, with the latest version, 0.8.0, released in February 2017. Though they are similar in age, scikit-learn is more widely used and more actively developed, as a quick look at each package on GitHub shows. Both packages have active development communities, though scikit-learn attracts a lot more attention, as shown below.


Learning to Think Like a Data Scientist: Alumni Spotlight on Ceena Modarres

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Ceena was a Fellow in our Winter 2017 cohort who landed a job with our hiring partner, Capital One.

Tell us about your background. How did it set you up to be a great data scientist?

I received my M.S. in Reliability Engineering from the University of Maryland. In Reliability Engineering, a practitioner assesses and prevents the failure of a physical system (a car, a computer, etc.). Many of these approaches are statistics- and data-driven, and much of the modern research in the field (including my own) uses Machine Learning to improve the relevant analyses. However, when I finished my Master's, I realized I was more passionate about Data Science and Machine Learning than the engineering side. So when I heard about The Data Incubator, it seemed like a great fit.

What do you think you got out of The Data Incubator?

As a recent M.S. graduate who had never worked before, the most important thing I learned at The Data Incubator was how to think like a Data Scientist. Since Data Science is still a new field, many positions require a unique and not necessarily homogeneous set of skills. The Data Incubator not only teaches its students all the necessary technology, but it also teaches them how to think about Data Science problems in a systematic and effective way. TDI also provided a network of possible employers and former alumni that proved valuable for my job search.


MATLAB vs. Python NumPy for Academics Transitioning into Data Science


At The Data Incubator, we pride ourselves on having the most up-to-date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback, we wanted to develop a data-driven approach to determining what we should be teaching in our data science corporate training and our free fellowship for master's and PhD graduates looking to enter data science careers in industry. Here are the results.

This technical article was written for The Data Incubator by Dan Taylor, a Fellow of our 2017 Spring cohort in Washington, DC. 


For many of us with roots in academic research, MATLAB was our first introduction to data analysis. However, due to its high cost, MATLAB is not very common beyond the academy; it is simply too expensive for most companies to afford a license. Luckily, for experienced MATLAB users, the transition to free and open-source tools, such as Python's NumPy, is fairly straightforward. This post compares the functionality of MATLAB with Python's NumPy library, in order to assist those transitioning from academic research into a career in data science.

MATLAB has several benefits when it comes to data analysis. Perhaps most important is its low barrier to entry for users with little programming experience. MathWorks has put a great deal of effort into making MATLAB's user interface both expansive and intuitive, which means new users can quickly get up and running with their data without knowing how to code. It is possible to import, model, and visualize structured data without typing a single line of code. Because of this, MATLAB is a great entrance point for scientists into programmatic analysis. Of course, the true power of MATLAB can only be unleashed through more deliberate and verbose programming, but users can gradually move into this more complicated space as they become more comfortable with programming. MATLAB's other strengths include its deep library of functions and extensive documentation, a virtual "instruction manual" full of detailed explanations and examples.
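For readers making the jump, a few one-line correspondences give the flavor of the transition (the MATLAB equivalents are shown in comments; note that NumPy's `*` is elementwise, unlike MATLAB's, and matrix multiplication uses `@`):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])  # MATLAB: A = [1 2; 3 4]
v = np.ones(2)                          # MATLAB: v = ones(1, 2)

elementwise = A * A                     # MATLAB: A .* A
matrix_prod = A @ A                     # MATLAB: A * A
solution = np.linalg.solve(A, v)        # MATLAB: A \ v'

print(matrix_prod)
```

The inverted default (MATLAB operators are matrix-wise with dotted elementwise variants; NumPy operators are elementwise with `@` and `np.linalg` for matrix work) is the single biggest habit to unlearn.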


Taking on Data Science with Mathematics: Alumni Spotlight on Brian Munson

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Brian was a Fellow in our Winter 2017 cohort who landed a job with Quantworks.

Tell us about your background. How did it set you up to be a great data scientist?

I was a research mathematician and professor before deciding on a career change. Having a deep knowledge of math really helps me understand how things work, whether it is the theoretical ideas behind fancy algorithms or reading a piece of code and deciphering what it does.

What do you think you got out of The Data Incubator?

Confidence in my code-writing ability, a polished resume, and, in my capstone project, an important talking point with employers.
