Data are becoming the new raw material of business
The Economist

Data Science in 30 Minutes: Alan Schwarz, Former NYTimes Journalist, on Numbers-Based Journalism

This FREE webinar will be on February 27th at 5:30 PM ET. Register below now, space is limited!

Join The Data Incubator and former NY Times journalist Alan Schwarz for the next installment of our free online webinar series, Data Science in 30 minutes: Numbers-Based Journalism.

Alan Schwarz, former N.Y. Times investigative reporter and Pulitzer finalist, discusses numbers-based journalism that shook industries from the National Football League to Big Pharma. Alan used data analysis to expose the NFL’s cover-up of concussions as well as issues in child psychiatry.

Data Science in 30 Minutes: Kirk Borne – A Fortuitous Career in Data Science


Booz Allen Hamilton’s Kirk Borne joined The Data Incubator in August for our FREE monthly webinar series, Data Science in 30 minutes!

Kirk Borne took us on a journey through his career in science and technology, explaining how the industry, and he himself, have evolved over the last four decades. From skipping lunches in high school to a systematic Twitter obsession, Kirk shed light on his road to success in the data science industry.

Finding a Curiosity in Data: Alumni Spotlight on Jun Zhang

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Jun was a Fellow in our Winter 2017 cohort who has moved to Germany for a job with our hiring partner, Boehringer Ingelheim.

Tell us about your background. How did it set you up to be a great data scientist?

I have a background in applied mechanics and engineering. My Ph.D. research simulated the response of randomly structured material, from which I learned a lot about statistical analysis, numerical computing and model development. Moreover, my academic experience fostered in me a “curiosity in data”, which I think is the most important quality for a data scientist.

What do you think you got out of The Data Incubator?

During the program, I got the chance to learn what “data science” really is as an insider. In addition to data analytics skills, I learned how data science is applied in different industries, what qualities employers are looking for in a data scientist, what the “front end” and “back end” of a data science project are, and what skills are associated with each stage. Only after these closer views could I know what my strengths and interests are and how I should prepare for my future career path.

Data Science in 30 Minutes: Scikit-Learn with Core-Contributor Andreas Mueller


scikit-learn’s Andreas Mueller joined The Data Incubator in December 2017 for our FREE monthly webinar series, Data Science in 30 Minutes!

We talked about everything new in 0.19, which was released in July of this year, and the plans for 0.20, which will be released early next year. Highlights include multiple-metric grid search, faster t-SNE, and better handling of categorical and mixed data.

Data Science in 30 Minutes: A Conversation with Gregory Piatetsky-Shapiro, President of KDnuggets


KDnuggets’ Gregory Piatetsky-Shapiro, Ph.D., joined The Data Incubator on January 11th for the first 2018 installment of our free online webinar series, Data Science in 30 minutes! Gregory discussed his career, from data mining to data science, and examined current trends in the field.

From Data Mining to Knowledge Discovery to Data Science: Gregory Piatetsky-Shapiro talked about his pioneering career in data science, including founding KDnuggets and co-founding the KDD conferences and ACM SIGKDD, and examined current trends in the field, such as data science automation, citizen data scientists, and the implications of AI.

Data Science in 30 Minutes: Infrastructure for Usable Machine Learning with Spark Creator and Stanford Professor, Matei Zaharia

This FREE webinar will be on April 3rd at 5:30 PM ET. Register below now, space is limited!

Join The Data Incubator and Databricks co-founder Matei Zaharia, Ph.D., for the next installment of our FREE online webinar series, Data Science in 30 minutes: Infrastructure for Usable Machine Learning.

Despite incredible recent advances in machine learning, building machine learning applications remains prohibitively time-consuming and expensive for all but the best-trained, best-funded engineering teams. This expense usually comes not from a need for new and improved statistical models but instead from a lack of systems and tools for supporting end-to-end machine learning application development, from data preparation and labeling to productionization and monitoring. In the Stanford DAWN project, we are developing a set of tools to make these processes easier, from weak supervision approaches to dramatically reduce the need for labeled data, to query-specific model specialization to reduce serving cost, and end-to-end ML systems that encapsulate a complete task and greatly simplify the interface to the user.

The Value of Prioritizing Python: Alumni Spotlight on Aviv Bachan

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Aviv was a Fellow in our Fall 2016 cohort who landed a job with our hiring partner, Argyle Data.

Tell us about your background. How did it set you up to be a great data scientist? 

My background is in geosciences. I was a climate modeler, so I had a substantial amount of experience with scientific computing (numerical linear algebra, differential equations, data assimilation, etc.).

What do you think you got out of The Data Incubator?

Two things. First, I got my first exposure to data science in a non-academic setting: what sort of problems data scientists might be tasked with solving within a company, and the tools they use to do so.

Second, and more importantly, I got to know a fantastic group of people and make valuable connections. I got the interview for my current position based on a recommendation from a friend from my cohort who had been hired prior to me (we now work together, which is great!). Recently, I got another email from a friend indicating that he would be happy to refer me to his company if I wished. I don’t know if I just happened to have been part of a particularly great cohort, but I really did have a blast going through the incubator with them, and I look forward to keeping in touch with all of them for years to come.


Beyond Excel: Popular Data Analysis Methods from Excel, using pandas

Microsoft Excel is spreadsheet software that holds data in tabular form. Data entries are located in cells, with numbered rows and letter-labeled columns. Excel is widespread across industries and has been around for over thirty years. It is often people’s first introduction to data analysis.

Most users feel at home using a GUI to operate Excel and no programming is necessary for the most commonly used features. The data is presented right in front of the user and it is easy to scroll around through the spreadsheet. Making plots from the data only involves highlighting cells in the spreadsheet and clicking a few buttons.

Excel has various shortcomings. It is closed source and not free. There are free open-source alternatives such as the OpenOffice and LibreOffice suites, but there might be compatibility issues between file formats, especially for complex spreadsheets. Excel becomes unstable for files approaching 500 MB, turning unresponsive and crashing, which hinders productivity. Collaboration can become difficult because it is hard to inspect a spreadsheet and understand how certain values are calculated or populated. It is also difficult to understand the user’s thought process and workflow for the analysis.

The functionality of the spreadsheet is sensitive to the layout and moving entries around can have disastrous effects. Tasks like data cleaning, munging, treating different data types, and handling missing data are often difficult and require manual manipulation. Further, the user is limited to the built-in functionality of the program. In short, Excel is great for certain tasks but becomes unwieldy and inefficient as the tasks become more complicated.
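
As a rough illustration of how pandas approaches two of these tasks, the sketch below fills in missing values and builds the equivalent of an Excel pivot table. The data, column names, and the choice to treat missing entries as zero are all made up for this example; they are assumptions, not part of any real dataset or of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; every name and value here is invented for illustration.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "B", "A"],
    "units": [10, np.nan, 7, 3, 5],
})

# Handle missing data in one explicit, repeatable step
# (assumption: a missing entry means zero units sold).
sales["units"] = sales["units"].fillna(0)

# Rough equivalent of an Excel pivot table: total units per region and product.
pivot = sales.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)
print(pivot)
```

Because every step is written down as code, a collaborator can read exactly how each value was produced, which is the kind of transparency that is hard to get from a spreadsheet.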

Predicting Which Bills Will Become Laws, with Data Science: Alumni Spotlight on Michael Yen


At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Michael was a Fellow in our Winter 2017 cohort in San Francisco who landed a job with one of our hiring partners, Cerego.

 

Tell us about your background. How did it set you up to be a great data scientist?

My formal education is in physics, but I’ve also done a lot of my research at UC Berkeley’s Computer Science department. I cherish both of these backgrounds equally since I learned how to “do science” from physics and how to build really cool things from computer science. I think this is a winning combination for a data scientist since a lot of companies are looking for scientists who can write code.

 

What do you think you got out of The Data Incubator?

Two things, and both are equally important. First, TDI gave me the exposure to their hiring partners that I just couldn’t get on my own. Before starting the fellowship I had spent over 10 months applying to jobs on my own with a callback rate of 3%. At TDI, my callback rate shot to 90% and I even began fielding unsolicited interviews. I think having the TDI mark of approval certainly moved me up in the stack of resumes. Second, I expanded my professional network by becoming close friends with twelve other Fellows who are all going to be doing fantastic things in the future.

Scikit-learn vs. StatsModels: Which, why, and how?

At The Data Incubator, we pride ourselves on having the most up-to-date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback, we wanted to develop a data-driven approach for determining what we should be teaching in our data science corporate training and our free fellowship for master’s and PhD graduates looking to enter data science careers in industry. Here are the results.

This technical article was written for The Data Incubator by Brett Sutton, a Fellow of our 2017 Summer cohort in Washington, DC. 

When you’re getting started on a project that requires doing some heavy stats and machine learning in Python, there are a handful of tools and packages available. Two popular options are scikit-learn and StatsModels. In this post, we’ll take a look at each one and get an understanding of what each has to offer.

Scikit-learn’s development began in 2007, and it was first released in 2010. The current version, 0.19, came out in July 2017. StatsModels started in 2009, with the latest version, 0.8.0, released in February 2017. Though they are similar in age, scikit-learn is more widely used and developed, as we can see by taking a quick look at each package on GitHub. Both packages have an active development community, though scikit-learn attracts a lot more attention, as shown below.
