Data are becoming the new raw material of business
The Economist

4 Data Science Projects That We Can’t Get Enough Of

At The Data Incubator we run a free advanced 8-week fellowship for PhDs looking to enter the industry as data scientists.

As part of the application process, we ask potential fellows to propose and begin working on a data science project to highlight their skills to employers.  Regardless of whether you’re selected to be a fellow, this project will be instrumental in attracting employer interest and highlighting your skills.  Here are some projects that we would love to see, and that we hope to see you take on as well.

 

Multi-Axial Political Analysis  

We often think of American politics in terms of a single axis: left versus right, Democrat versus Republican.  In reality, the parties are composed of varying factions with different identities and political priorities, and American politics is actually broken along multiple axes: foreign policy, social issues, regulation, social spending, education, and the Second Amendment, just to name a few.

Voters often do not completely agree with their party on all issues.  While it’s hard to discern this from two-way state and national races, primary elections and ballot measures offer a multidimensional probe of voter sentiment.  Use unsupervised learning to find clusters of precincts that tend to vote similarly and visualize them on a map.  Are there patterns that defy the traditional partisan ones?

There’s plenty of data out there: for example, the states of Colorado and California maintain official state repositories of data.  Projects like LEAP, the CEDA, Seattle Times, or Open Elections provide third-party data.
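As a minimal sketch of the clustering step, assume each precinct is described by its share of "yes" votes on a few ballot measures. The precinct IDs and vote shares below are invented for illustration, and the algorithm is plain k-means with a deterministic farthest-point initialization, one of several unsupervised methods that would fit here.

```python
import math

# Hypothetical data: share of "yes" votes per precinct on three ballot measures.
precinct_votes = {
    "P01": [0.81, 0.22, 0.65],
    "P02": [0.78, 0.25, 0.70],
    "P03": [0.30, 0.74, 0.35],
    "P04": [0.28, 0.80, 0.31],
    "P05": [0.79, 0.77, 0.20],
    "P06": [0.83, 0.81, 0.18],
}


def farthest_point_init(points, k):
    """Pick k starting centroids: the first point, then repeatedly the
    point farthest from all centroids chosen so far (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=50):
    """Plain k-means; returns one cluster label per input point."""
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels


labels = kmeans(list(precinct_votes.values()), k=3)
clusters = dict(zip(precinct_votes, labels))
print(clusters)
```

With real returns, the resulting labels could be joined to precinct boundaries and colored on a map; the number of clusters is a choice to explore rather than a given.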

 

Quantifying Academic Publication Prestige

When applying for a PhD, it’s hard to quantitatively determine which professor to select as a thesis advisor.  And while US News and World Report provides graduate school rankings for coarse fields, it’s harder to determine which universities are the best for your subfield of interest.

While journals provide a proxy for rankings, knowledge of top journals for fields (much less subfields) is extremely rarefied.  However, SSRN, arXiv, PubMed, and Google Scholar provide extensive archives of papers and citation lists to help identify the influence of papers, authors, institutions, and journals.
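One simple way to turn citation lists into an influence score is the h-index, a standard bibliometric measure. The sketch below uses made-up author names and citation counts; a real project would pull the counts from one of the archives above.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts, one number per paper.
citations_by_author = {
    "Author A": [25, 8, 5, 3, 3],
    "Author B": [10, 10, 10, 10],
    "Author C": [100, 1],
}

scores = {author: h_index(c) for author, c in citations_by_author.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

The same function can be applied to papers grouped by institution or journal instead of by author, which is closer to the subfield ranking question posed above.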

 

Tracking Grants

The US government gives out billions in science grants each year through the NSF, the NIH, and DARPA, among others.  You can track grants from proposal to paper.  For example, the NSF releases all its grant data in XML, and grant recipients are required to publicly acknowledge their grant number in publications.

Other funding agencies have similar rules.  How long does it take to write a paper and does that vary by discipline?  How much does it cost to produce a paper?  Are there certain fields or subfields that are more or less expensive?  Can you use natural language processing to identify research topics that did not produce what they promised?
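The time and cost questions above come down to joining grants to the papers that acknowledge them. Here is a rough sketch, assuming the grant and paper records have already been extracted into dictionaries; the field names, grant IDs, amounts, and dates are all hypothetical and do not reflect any agency's actual schema.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical grant records (not a real agency schema).
grants = [
    {"grant_id": "G-001", "field": "physics", "amount": 300_000, "start": date(2012, 1, 1)},
    {"grant_id": "G-002", "field": "biology", "amount": 500_000, "start": date(2012, 6, 1)},
    {"grant_id": "G-003", "field": "physics", "amount": 150_000, "start": date(2013, 3, 1)},
]
# Hypothetical papers, each acknowledging one grant number.
papers = [
    {"title": "Paper 1", "grant_id": "G-001", "published": date(2013, 1, 1)},
    {"title": "Paper 2", "grant_id": "G-001", "published": date(2014, 7, 1)},
    {"title": "Paper 3", "grant_id": "G-002", "published": date(2014, 6, 1)},
    {"title": "Paper 4", "grant_id": "G-003", "published": date(2014, 3, 1)},
]


def months_between(start, end):
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


grant_by_id = {g["grant_id"]: g for g in grants}

papers_per_field = defaultdict(int)
first_paper = {}  # grant_id -> earliest publication date
for paper in papers:
    grant = grant_by_id.get(paper["grant_id"])
    if grant is None:
        continue  # paper cites a grant we have no record of
    papers_per_field[grant["field"]] += 1
    gid = grant["grant_id"]
    if gid not in first_paper or paper["published"] < first_paper[gid]:
        first_paper[gid] = paper["published"]

spend_per_field = defaultdict(int)
for grant in grants:
    spend_per_field[grant["field"]] += grant["amount"]

# Dollars of grant funding per acknowledged paper, by field.
cost_per_paper = {f: spend_per_field[f] / n for f, n in papers_per_field.items()}

# Average months from grant start to first paper, by field.
lags = defaultdict(list)
for gid, published in first_paper.items():
    grant = grant_by_id[gid]
    lags[grant["field"]].append(months_between(grant["start"], published))
avg_months_to_first_paper = {f: mean(v) for f, v in lags.items()}

print(cost_per_paper)
print(avg_months_to_first_paper)
```

Grants that never appear in a publication are left out of the lag figures here; counting them separately would be a natural starting point for the question about unfulfilled proposals.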

 

Public Salaries

In the spirit of public transparency, states publish the compensation for all state employees (as a random sample, here are ones for Illinois, Rhode Island, California, Washington State).  Which public sector professions or geographies pay better?  How much do salaries increase over time?  How does this vary by state?  Can we correlate teacher pay with school ratings or hospital performance?  Does university professor pay track with the prestige of their publication record?
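A first pass at the pay and growth questions might group records by profession and year and compare medians. The records below are invented for illustration; real state disclosures differ in layout and would need cleaning first.

```python
from collections import defaultdict
from statistics import median

# Hypothetical salary records; not taken from any state's disclosure.
records = [
    {"state": "IL", "profession": "teacher", "year": 2015, "pay": 60_000},
    {"state": "IL", "profession": "teacher", "year": 2016, "pay": 63_000},
    {"state": "CA", "profession": "teacher", "year": 2015, "pay": 70_000},
    {"state": "CA", "profession": "teacher", "year": 2016, "pay": 72_100},
    {"state": "IL", "profession": "nurse", "year": 2015, "pay": 65_000},
    {"state": "IL", "profession": "nurse", "year": 2016, "pay": 66_300},
]

pay_by_group = defaultdict(list)
for r in records:
    pay_by_group[(r["profession"], r["year"])].append(r["pay"])

# Median pay for each (profession, year) pair.
median_pay = {group: median(pays) for group, pays in pay_by_group.items()}


def growth(profession, start_year, end_year):
    """Fractional change in median pay between two years."""
    start = median_pay[(profession, start_year)]
    end = median_pay[(profession, end_year)]
    return end / start - 1


print(median_pay)
print(round(growth("teacher", 2015, 2016), 4))
print(round(growth("nurse", 2015, 2016), 4))
```

Adding the state to the grouping key would answer the by-state variant of the same question.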

 

Trending Technology Projects

Stack Overflow offers a comprehensive dump of all its data (available here).  You can use this data to identify which users are experts at different subjects (top respondents), which topics get the most traction (plentiful questions and answers), and which topics are naturally harder (have responses only from highly experienced users).  Can you use natural language processing to identify a good question (presumably one that receives answers, receives many upvotes, or isn’t marked as off topic) or a good answer (presumably one that receives many upvotes or is marked as “accepted”)?
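To make the "top respondents" idea concrete, here is a small sketch that ranks users per topic by total answer score. The answer records are invented and simplified; they are not the dump's actual schema, and summed score is just one possible definition of expertise.

```python
from collections import defaultdict

# Hypothetical, simplified answer records (not the real dump format).
answers = [
    {"user": "alice", "tag": "python", "score": 12, "accepted": True},
    {"user": "bob", "tag": "python", "score": 3, "accepted": False},
    {"user": "alice", "tag": "python", "score": 7, "accepted": False},
    {"user": "carol", "tag": "sql", "score": 9, "accepted": True},
    {"user": "bob", "tag": "sql", "score": 4, "accepted": False},
]


def top_respondents(answers, n=1):
    """Return the n users with the highest total answer score for each tag."""
    totals = defaultdict(lambda: defaultdict(int))
    for a in answers:
        totals[a["tag"]][a["user"]] += a["score"]
    return {
        tag: sorted(users, key=users.get, reverse=True)[:n]
        for tag, users in totals.items()
    }


def accepted_rate(answers):
    """Fraction of answers marked as accepted, per tag."""
    counts = defaultdict(lambda: [0, 0])  # tag -> [accepted, total]
    for a in answers:
        counts[a["tag"]][0] += int(a["accepted"])
        counts[a["tag"]][1] += 1
    return {tag: acc / total for tag, (acc, total) in counts.items()}


experts = top_respondents(answers)
rates = accepted_rate(answers)
print(experts)
print(rates)
```

The accepted flag and score used here are also natural labels for the question and answer quality models suggested above.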


Predicting Flight Delays with Random Forests: Alumni Spotlight on Stacy Karthas

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists.  Stacy was a Fellow in our Winter 2017 cohort who landed a job with one of our hiring partners, AdTheorent.

Tell us about your background. How did it set you up to be a great data scientist?

I received my Bachelor of Science degrees in mathematics and physics from the University of New Hampshire. I then went on to graduate school at Stony Brook University. I graduated with my master’s degree in Physics in December 2016. During my master’s degree, I did research in Nuclear Heavy Ion Physics with a focus on the analysis of gluons and their products as they traversed our detector. The data analysis, simulation, and clustering algorithms I worked on prepared me to become a data scientist because they were physical applications of many of the tools used by data scientists.

What do you think you got out of The Data Incubator?

The Data Incubator gave me the chance to solidify my data science knowledge. It helped me pull together tools and concepts I had been using during all of my previous research experiences. I learned a lot of new machine learning concepts and how they could be applied to real world data.

Continue reading


From Researcher to Algorithm Engineer: Alumni Spotlight on Anthony Finch

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists.  Anthony was a Fellow in our Winter 2017 cohort who landed a job with one of our hiring partners, Afiniti.

Tell us about your background. How did it set you up to be a great data scientist?


I came into The Data Incubator with a Master’s degree in Computational Operations Research from The College of William and Mary. My Master’s program gave me a strong background in theory and in the practical application of machine learning, simulation, and optimization. I had a few internships as well, primarily in finance.

What do you think you got out of The Data Incubator?

The Data Incubator gave me a lot of experience handling data in a way that I didn’t get in an academic environment. The data sets were big, messy, and realistic. In addition, I thought that the capstone was an excellent way to get into a more industrial environment. The Data Incubator required a lot of database management, web scraping, and the like, which I didn’t get in the academic setting I came from.

I also felt that The Data Incubator gave me a number of excellent opportunities. It may seem frustrating at times, but the partners really do want to hire Fellows, and The Data Incubator’s salary and compensation ranges are very accurate (in my experience). I’m not sure I would have gotten the same response rate and offers if I hadn’t been applying through the fellowship.  

Continue reading


Analyzing Time Series Data for Parkinson’s Wearables: Alumni Spotlight on Jordan Webster

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Jordan was a Fellow in our Spring 2017 cohort who landed a job with one of our hiring partners, IronNet Cybersecurity.

 

Tell us about your background. How did it set you up to be a great data scientist?

My background is in particle physics. As a physicist, I analyzed large datasets of particle collision images, and I used machine learning tools to classify rare and interesting collisions.


 

What do you think you got out of The Data Incubator?

At The Data Incubator I learned a whole new toolset for approaching data analytics. I was exposed to new concepts like language processing and map-reduce, which never arose in physics. Furthermore, I was coached on how to best market myself to employers.

Continue reading


Ride-sharing for Senior Citizens: Alumni Spotlight on Aurora LePort

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Aurora was a Fellow in our Spring 2016 cohort who landed a job with Verizon Wireless.

 

Tell us about your background. How did it set you up to be a great data scientist?

I obtained my Ph.D. in Neurobiology and Behavior from UC Irvine in 2014. I collected data related to brain activity representing autobiographical memory using Magnetic Resonance Imaging (MRI) for my dissertation. The accurate analysis of MRI data demanded the ability to preprocess and clean data as well as automate the processing steps using Matlab and R. Understanding how to properly use these tools was instrumental in picking up a new programming language (i.e. Python). Furthermore, the ability to apply statistical concepts to analyze various forms of data from diverse scenarios was highly conducive to becoming a well-rounded data scientist who excels at analyzing novel datasets.

Continue reading


Solving Interdisciplinary Problems with Data Science: Alumni Spotlight on Wendy Ni

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Wendy was a Fellow in our Winter 2017 cohort who landed a job with one of our hiring partners, Facebook.

Tell us about your background. How did it set you up to be a great data scientist?

I have a PhD in Electrical Engineering from Stanford University, where I’m currently a postdoc.  My doctoral and postdoctoral research focus on the translation of novel magnetic resonance imaging (MRI) technologies to clinical neuroimaging applications, and the extraction of “hidden” imaging biomarkers from conventional clinical images.  In my research, I utilized my engineering, programming, study design, and communication skills to solve interdisciplinary problems with real-world impact.  I am now pivoting to data science, because I want to use my quantitative and analytical skills to discover hidden insights and guide decision-making for immediate applications in industry.

Continue reading


Ranked: 15 Python Packages for Data Science


At The Data Incubator we pride ourselves on having the latest data science curriculum. Much of our course material is based on feedback from corporate and government partners about the technologies they are looking to learn. However, we wanted to develop a more data-driven approach to what we teach in our data science corporate training and our free fellowship for data science master’s and PhD graduates looking to begin their careers in the industry.

This report is the second in a series analyzing data science related topics. To see more, be sure to check out our R Packages for Machine Learning report. We thought it would be useful to the data science community to rank and analyze a variety of topics related to the profession in simple, easy-to-digest cheat sheets, rankings, or reports.

Continue reading


JUST Capital and The Data Incubator Challenge


 

Today, we’re excited to announce that we’re teaming up with JUST Capital to help crowd-source data science for social good.  The Data Incubator offers a free eight-week data science fellowship for those with a PhD or a master’s degree looking to transition into data science.  As a part of the application process, students are asked to submit a data science capstone project and the best students are invited to work on them during the fellowship.  JUST Capital is helping provide data and project prompts to harness the collective brainpower of The Data Incubator fellows to solve these high-impact social problems.

  • These projects focus on applied data science techniques with tangible impacts on JUST Capital’s mission.
  • The projects are open-ended and creativity is encouraged. The documents provided below are suitable for analysis, but one should not shy away from seeking out additional sources of data.

JUST Capital is a nonprofit that provides information and rankings on how large corporations perform on issues that matter most to the public. We give individuals a voice on what really matters to them, and evaluate how companies perform on those issues. By providing the right knowledge and making it easy to access and understand, we believe capital will flow to corporations that are more JUST, ultimately leading to a balanced business world that takes into account human needs that are so often neglected today. The meaning of JUST is defined by the American public as fair, equitable and balanced. In 2016, JUST Capital surveyed nearly 4,000 Americans from all regions and walks of life, in its second annual Poll on Corporate America. The issues identified by the public form the basis of our benchmark — it is against these Drivers and Components that we measure corporate performance. The most important factors broadly relate to employees, customers, company leadership, the environment, communities and investors.

Continue reading


Spark 2.0 on Jupyter with Toree

Spark

Spark is one of the most popular open-source distributed computation engines and offers a scalable, flexible framework for processing huge amounts of data efficiently. The recent 2.0 release milestone brought a number of significant improvements including DataSets, an improved version of DataFrames, more support for SparkR, and a lot more. One of the great things about Spark is that it’s relatively autonomous and doesn’t require a lot of extra infrastructure to work. While Spark’s latest release is at 2.1.0 at the time of publishing, we’ll use the example of 2.0.1 throughout this post.

Jupyter

Jupyter notebooks are an interactive way to code that can enable rapid prototyping and exploration. Jupyter essentially connects a browser-based frontend, the Jupyter Server, to an interactive REPL underneath that can process snippets of code. The advantage to the user is being able to write code in small chunks which can be run independently but share the same namespace, greatly facilitating testing or trying multiple approaches in a modular fashion. The platform supports a number of kernels (the things that actually run the code) besides the out-of-the-box Python, but connecting Jupyter to Spark is a little trickier. Enter Apache Toree, a project meant to solve this problem by acting as a middleman between a running Spark cluster and other applications.

In this post I’ll describe how we go from a clean Ubuntu installation to being able to run Spark 2.0 code on Jupyter.

Continue reading


Making the Switch from Management Consulting: Alumni Spotlight on Armand Quenum

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Armand was a Fellow in our Fall 2016 cohort who landed a job with KPMG.

Tell us about your background. How did it set you up to be a great data scientist?

I received my Bachelor’s degree in Mechanical Engineering from NC State University. After college, I became a management consultant specializing in program and strategic management. As a consultant, I saw the value of data-driven decisions and extracting insights from data. As a result, I decided to go back to school to obtain my Master’s in Systems Engineering. There I was introduced to R programming, data mining techniques, and applications of optimization. My Master’s not only exposed me to data science, but it also provided me a framework to approach complex problems.

Continue reading
