At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Brendan was a Fellow in our Fall 2015 cohort who landed a job with one of our hiring partners, Jolata.
Tell us about your background. How did it set you up to be a great data scientist?
I did my PhD research in theoretical condensed matter physics at the University of California, Santa Barbara. The focus of my research was on studying the phase diagram of chains of non-abelian anyons. Because such chains are gapless in most regions of the phase diagram, we had to model them using very large matrices in C++. To make this computation more tractable, we used hash tables and sparse matrices. Besides my background in numerics, I also took the time to learn Python, Pandas, SQL and MapReduce in Cloudera a few months before starting the fellowship.
What do you think you got out of The Data Incubator?
The Data Incubator gave me a solid foundation in data parsing, large-scale data analysis and machine learning. I went into the fellowship already knowing about various concepts like SVM, bag-of-words and cross-validation. But I learned how to tie these together into elegant models that are both modular and easy to modify or upgrade. I also learned how to use MapReduce on a cluster, where the behavior of your program can be quite different than on a single node.
Could you tell us about your Data Incubator project?
There’s been a lot of hype around sensor data and how it could be used in wearable devices or smart cities. The aim of my project was much more modest. I wanted to see if sensors installed on a product could be used to “review” it, just like customers do when they post an online review.
I looked at daily sensor data from 40,000 computer hard drives owned by Backblaze and compared their failure rate and life expectancy by model to the perceived failure rate and life expectancy obtained from scraping online reviews on Amazon and Newegg. Because there is approximately a linear correlation between the star rating of a review and the perceived failure rate of a hard drive, I was able to map the sensor data to an expected star rating for each hard drive model. This expected star rating represents the overall rating that the hard drive would receive if it were rated by sensors rather than human reviewers.
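As a rough illustration of the mapping described above, here is a minimal sketch, not the project's actual code. It assumes the approach is a simple least-squares line fitted on review data (perceived failure rate against star rating) and then evaluated at a sensor-measured failure rate. The function names, the clamping to a 1 to 5 star range, and all numbers are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def expected_stars(failure_rate, slope, intercept):
    """Map a failure rate to a star rating, clamped to the 1-5 range (an assumption)."""
    return max(1.0, min(5.0, slope * failure_rate + intercept))


# Hypothetical per-model review data: perceived failure rate and average stars.
review_failure_rates = [0.05, 0.10, 0.20, 0.30]
review_star_ratings = [4.5, 4.0, 3.0, 2.0]

slope, intercept = fit_line(review_failure_rates, review_star_ratings)

# Hypothetical sensor-measured failure rate for one hard drive model.
sensor_rating = expected_stars(0.15, slope, intercept)
```

With real data the points would only be approximately linear, so the fitted line would carry some error that this sketch does not estimate.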
What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?
Get familiar with SQL, Python and machine learning well before applying to the program. Also, get familiar with analyzing text data and learn how to deal with Unicode.
What’s your favorite thing you learned while at The Data Incubator? This can be a technology, concept, or whatever you want!
Learning MapReduce and Spark on clusters was particularly useful. There are some subtle differences between running your code on a single node and on a cluster, and they are important to know. The miniprojects are especially useful when talking to employers because they are typically looking for someone with background knowledge covered in at least one of the miniprojects (in my case recommender systems), which may not have been covered in the capstone project.
Where are you working now? Tell us a little about your new job!
At Jolata, one of my main projects so far has been identifying one-way audio (OWA) in VoLTE calls. Users experience these as interruptions in a cell phone conversation that either degrade the user experience or cause them to hang up. By finding regions where no packets are sent in one or both directions between the two users, we can identify gaps in the VoLTE signal. Preceding and following these gaps are SCTP packets that allow us to classify the gap as either a true OWA or a normal operation such as a handover. One of my tasks as part of this project was presenting examples of OWAs in voice calls to our partners and ensuring that my statistical analysis would scale to the large volumes of data coming from four base stations. Currently I’m working on implementing a clustering approach to refine our classification of the gaps, which may allow us to identify new types of OWAs outside of our current classification.
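To make the gap-finding idea concrete, here is a minimal sketch and not Jolata's implementation. It assumes each direction of a call is available as a list of packet arrival times in seconds, and it flags any stretch between consecutive packets that is longer than a chosen threshold. The function name, the 0.5 second threshold, and the timestamps are all hypothetical, and the SCTP-based classification of each gap is left out.

```python
def find_gaps(timestamps, min_gap):
    """Return (start, end) pairs where consecutive packets are more than min_gap apart."""
    ordered = sorted(timestamps)
    return [
        (start, end)
        for start, end in zip(ordered, ordered[1:])
        if end - start > min_gap
    ]


# Hypothetical packet arrival times (seconds) for each direction of one call.
a_to_b = [0.00, 0.02, 0.04, 0.06, 1.50, 1.52]
b_to_a = [round(i * 0.02, 2) for i in range(80)]  # steady stream, no gaps

gaps = {
    "a_to_b": find_gaps(a_to_b, min_gap=0.5),
    "b_to_a": find_gaps(b_to_a, min_gap=0.5),
}

# A gap in only one direction is a candidate for one-way audio.
one_way_candidates = gaps["a_to_b"] if not gaps["b_to_a"] else []
```

This only looks at silence between two observed packets, so a direction that goes quiet and never resumes would need a separate check against the end of the call.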