Data scientists are in demand like never before, but getting a job as a data scientist still requires a resume that shows off your skills. At The Data Incubator, we’ve received tens of thousands of resumes from applicants for our free Data Science Fellowship. While we work hard to read between the lines to find great candidates who happen to have lackluster CVs, many recruiters may not be as diligent. Based on our experience, here’s the advice we give to our Fellows about how to craft the perfect resume to get hired as a data scientist.
New courses target the need for managers and techies to talk to each other as data proliferate
For most students, a top degree in a field such as computer science or maths ought to be a passport to a career perfectly in tune with the relentless digitisation of work.
For the 30 graduates taking up a new one-year course at MIT’s Sloan School of Management in September, it will be only the prelude to a spell in a Big Data finishing school.
This first cohort of students will pay $75,000 in tuition fees for their Master of Business Analytics degree, with classes ranging from “Data mining: Finding the Data and Models that Create Value” to “Applied Probability”.
They will be calculating that the qualification will sprinkle their CVs with extra stardust, attracting elite employers that are trying to find meaning in the increasing volumes of data that businesses are generating.
At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. David was a Fellow in our Winter 2016 cohort who landed a job with one of our hiring partners, Amazon.
Before joining The Data Incubator, I completed my Ph.D. in chemistry at Johns Hopkins University, where I focused on the design and synthesis of new magnetic materials. My work gave me the opportunity to work alongside scientists in many different disciplines, and exposed me to a vast array of experimental techniques and theoretical constructs. From a data science perspective, this meant that I was constantly encountering new types of data and searching for scientifically rigorous models to explain those results. As the volume and complexity of these datasets increased, graphical data analysis tools like Excel and Origin weren’t making the cut for me, and I gradually made the transition to performing data transformation and analysis entirely in Python. That was a big technical leap that took a lot of time and frustration, but I think it ultimately made me a better researcher.
From a research perspective, working in a vibrant academic setting also meant learning how to ask bold questions, even at the risk of sounding stupid in front of a large group of mentors and peers, something I’ve done more than I care to admit. For me, finding the right question to ask is just as important as having the technical expertise to find an answer, and that’s one of the things that makes Data Science so exciting.
Jupyter is a fantastic tool that we use at The Data Incubator for instructional purposes. One perk of using Jupyter is that we can easily test our code samples across any language there’s a Jupyter kernel for. In this post, I’ll show you some of the code we use to test notebooks!
First, a quick discussion of the current state of testing IPython notebooks: there isn’t much documented about the process. ipython_nose is a really helpful extension for writing tests into your notebooks, but there’s no documentation or information about easy end-to-end testing. In particular, we want the programmatic equivalent of clicking “run all cells”. After poking around things like GitHub’s list of cool IPython notebooks and the Jupyter docs, two things became apparent to us:
- Most people do not test their notebooks.
- Automated end-to-end testing is extremely easy to implement.
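To see why end-to-end testing is so easy, note that a `.ipynb` file is just JSON. The sketch below is a minimal, standard-library-only take on the programmatic equivalent of “run all cells”: load the notebook, execute each code cell in a shared namespace, and let any exception fail the test. It assumes the nbformat-4 JSON layout; a fuller harness might instead run cells in a real kernel via nbconvert’s `ExecutePreprocessor`.

```python
import json

def run_all_cells(path):
    """Programmatic equivalent of clicking "run all cells": execute every
    code cell in order, sharing one namespace, and raise on any error."""
    with open(path) as f:
        nb = json.load(f)  # a .ipynb file is plain JSON
    namespace = {}
    for cell in nb["cells"]:  # nbformat-4 layout assumed
        if cell["cell_type"] == "code":
            # "source" may be a string or a list of strings; join handles both
            exec("".join(cell["source"]), namespace)
    return namespace  # handy for asserting on results in tests
```

A test runner can then call `run_all_cells` on every notebook in a repository and treat any uncaught exception as a failure.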
Michael was a Fellow in our Winter 2016 cohort who landed a job with one of our hiring partners, Schireson Associates.
My PhD work was in computational materials science, where I worked with reactive molecular dynamics simulations. The field is entirely simulation-based and typically requires high-performance computing resources. Running these simulations helped build my chops for working with parallel systems and command-line tools. The software required familiarity with some powerful languages and APIs like C and CUDA. Learning those definitely helped my understanding of Python once I converted to using it.
Toward the third year of my PhD I got really interested in machine learning. I started using scikit-learn to predict different aspects of simulations I worked on. These projects became a large part of my thesis and contributed to choosing The Data Incubator as a next step in my career.
Sumanth was a Fellow in our Winter 2016 cohort who landed a job with one of our hiring partners, Revon.
I did my bachelor’s degree in Chemical Engineering at the University of Delaware and my PhD in Applied Mathematics at Northwestern University. After some postdoctoral work between Northwestern and Oxford University, I went into industry as a quantitative consultant for W.L. Gore & Associates. For the past 4 years, I have spent most of my time delivering technology solutions at W.L. Gore, teaching mathematics at the University of Delaware, and performing and teaching Indian Classical Music.
On the question of what makes a strong data scientist, I think that the better practitioners in the field tend to be hypothesis driven, strong critical thinkers with hard skills in statistics, programming, mathematics, and hardware. Hence, my background in engineering and mathematics, my consulting experience, and my years of teaching probably contributed the most to my success.
Xia was a Fellow in our Summer 2015 cohort who landed a job at LinkedIn.
By training, I am an experimental physicist in soft condensed matter; I did my PhD at Emory University. There are three things that I think have helped me a lot in becoming a good data scientist:
1) A solid background in physics and math, which I obtained in college. The knowledge itself isn’t necessarily reflected in my day-to-day work now. However, the training in logical and critical thinking is really beneficial in the long run.
2) Persistence in finding root causes. The massive amount of data can easily leave you feeling swamped. I believe that always asking why until you get to the true cause of the problem is essential. Sometimes the insights are hidden, and it takes motivation to dig them out. Whether it’s driven by natural stubbornness or genuine curiosity, I find that persistence is usually a great help in walking the last mile to the final discovery.
3) Passion for solving problems using data. There is a joint program in our department through which I took computer science courses for a master’s degree. In the course projects, I discovered my passion for solving practical problems with data science approaches. Now I work on product analytics, and I cannot imagine how tough it would be without that passion and curiosity about what we can do to improve the product.
The venerable NLTK has been the standard tool for natural language processing in Python for some time. It contains an amazing variety of tools, algorithms, and corpora. Recently, a competitor has arisen in the form of spaCy, which has the goal of providing powerful, streamlined language processing. Let’s see how these toolkits compare.
NLTK provides a number of algorithms to choose from. For a researcher, this is a great boon. Its nine different stemming libraries, for example, allow you to finely customize your model. For the developer who just wants a stemmer to use as part of a larger project, this tends to be a hindrance. Which algorithm performs the best? Which is the fastest? Which is being maintained?
In contrast, spaCy implements a single stemmer, the one that the spaCy developers feel to be best. They promise to keep it updated, and may replace it with an improved algorithm as the state of the art progresses. You may update your version of spaCy and find that improvements to the library have boosted your application without any work necessary. (The downside is that you may need to rewrite some test cases.)
As a quick glance through the NLTK documentation demonstrates, different languages may need different algorithms. NLTK lets you mix and match the algorithms you need, but spaCy has to make a choice for each language. This is a long process and spaCy currently only has support for English.
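To make the choice-of-algorithm point concrete, here’s a small sketch comparing three of NLTK’s stemmers on the same words. It assumes only that nltk is installed (these stemmer classes need no corpus downloads); the differing outputs are exactly the trade-off a developer is forced to evaluate.

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# Three of NLTK's many stemming algorithms, each with different trade-offs
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

for word in ["running", "generously", "maximum"]:
    # The stems can disagree, so picking one is a real decision
    print(word, {name: s.stem(word) for name, s in stemmers.items()})
```

Running this shows that the aggressive Lancaster stemmer often produces shorter stems than Porter or Snowball, which is why “which algorithm performs best?” has no single answer.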
Strings versus objects
NLTK is essentially a string processing library. All the tools take strings as input and return strings or lists of strings as output. This is simple to deal with at first, but it requires the user to explore the documentation to discover the functions they need.
In contrast, spaCy uses an object-oriented approach. Parsing some text returns a document object, whose words and sentences are represented by objects themselves. Each of these objects has a number of useful attributes and methods, which can be discovered through introspection. This object-oriented approach lends itself much better to modern Python style than does the string-handling system of NLTK.
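As an illustration of that object-oriented style, the sketch below uses `spacy.blank("en")`, a tokenizer-only pipeline that requires no model download (assuming a reasonably recent spaCy). Parsing returns a `Doc` object whose tokens are themselves objects with attributes you can introspect, rather than plain strings.

```python
import spacy

nlp = spacy.blank("en")          # tokenizer-only pipeline, no model download
doc = nlp("spaCy returns objects, not strings.")

print(type(doc))                 # a Doc object, not a list of strings
for token in doc:                # each token is an object with attributes
    print(token.i, token.text, token.is_punct)
```

With a full statistical model loaded instead of a blank pipeline, the same `Doc` also exposes sentences, part-of-speech tags, and the parse tree through attributes such as `doc.sents` and `token.pos_`.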
A more detailed comparison between these approaches is available in this notebook.
An important part of a production-ready library is its performance, and spaCy advertises itself as ready for production use. We’ll run some tests on the text of the Wikipedia article on NLP, which contains about 10 kB of text. The tests will be word tokenization (splitting a document into words), sentence tokenization (splitting a document into sentences), and part-of-speech tagging (labeling the grammatical function of each word).
It is fairly obvious that spaCy dramatically outperforms NLTK in word tokenization and part-of-speech tagging. Its poor performance in sentence tokenization is a result of differing approaches: NLTK simply attempts to split the text into sentences, whereas spaCy constructs a syntactic tree for each sentence, a more robust method that yields much more information about the text. (You can see a visualization of the result here.)
While NLTK is certainly capable, I feel that spaCy is a better choice for most common uses. It makes the hard choices about algorithms for you, providing state-of-the-art solutions. Its Pythonic API will fit in well with modern Python programming practices, and its fast performance will be much appreciated.
Unfortunately, spaCy is English only at the moment, so developers concerned with other languages will need to use NLTK. Developers that need to ensure a particular algorithm is being used will also want to stick with NLTK. Everyone else should take a look at spaCy.
Matt was a Fellow in our Winter 2016 cohort who landed a job with one of our hiring partners, 1010data.
I defended my PhD dissertation at Washington University in St. Louis, a few weeks before coming to The Data Incubator. I was part of the MAPLE lab in Energy, Environmental, and Chemical Engineering (I know, it’s a mouthful). Our lab focused on physics-based electrochemical modeling, mostly geared toward Li-ion batteries.
For my main dissertation project, I studied how batteries age under different real-world cycling patterns. Most cycle life estimates for a battery are based on simple constant-charge and constant-discharge patterns, but many applications (such as those experienced by batteries in electric vehicles or coupled to the electric grid) do not have simple cycling patterns. This variation affects the life of the battery.
Both through model simulation and long-term experiments, I had to analyze battery characteristics over thousands of cycles and pick out important features. This type of analysis, along with programming the computational models used to create these datasets, gave me a background for tackling data science problems.
Additionally, I think that working on my PhD projects allowed me to gain experience in solving unstructured problems, where the solution (and sometimes even the problem or need) is not well defined. These types of problems are very common, especially once you get outside of academia.
We all want to be future-proof: not just prepared for unforeseen developments but positioned well to take advantage of them. Having a flexible, adaptable, and scalable technology stack is a great way to achieve that goal when it comes to leveraging data science effectively. Here are five ideas I think it’s crucial to keep in mind when building out your own functionality:
1. Your pipeline is only as good as its weakest link.
It’s great that your predictive modelers have come up with a thousand new features to incorporate, but have you asked your data engineers how that will affect the performance of backend queries? What about your data collection and ingestion flow? Maybe your team is frothing at the mouth for an upgrade to Spark Streaming to run their clustering algorithms in real time, but your frontend will lose responsiveness if you try to display the results as fast as they come in. The key here is not to get sucked into the hype of “scaling up” without fully recognizing the implications across your entire organization and what new demands will be placed on all those moving parts.