At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Rachel was a Fellow in our Spring 2017 cohort and an instructor for our Summer 2017 cohort.
My background is in neuroscience; specifically, I studied how images are processed in the visual system of biological brains. For my capstone project I knew I wanted to use an artificial neural network to create an image classifier. Intel and Mobile ODT released a large dataset for a medical image classification competition around the same time I was brainstorming possible projects. Their dataset included thousands of medical images of cervixes that were labeled by medical professionals as one of three types based on anatomy. Healthcare providers often have difficulty determining the anatomical classification of a cervix during an examination, and some types of cervixes require additional screening to determine if pathology is present. Thus, an algorithm-aided decision of cervical type could improve the quality of cervical cancer screening for patients and efficiency for practitioners.
I began my project with some exploration of the images. I used t-SNE (t-distributed stochastic neighbor embedding) in scikit-learn, a tool for visualizing high-dimensional data. Plotting each image as a point in a 3-D plot showed that the three classes of cervixes did not form distinct clusters. I also ran a hierarchical cluster analysis in seaborn, which confirmed that the images did not easily group by class.
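A minimal sketch of this kind of t-SNE exploration is shown here. The random arrays are stand-ins for flattened image pixels, and the image size, sample count, and perplexity value are illustrative assumptions, not the settings used in the project.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in data (assumption): 60 tiny 16x16 grayscale "images", flattened
# to one row per image, with made-up labels for types 1, 2, and 3.
images = rng.random((60, 16 * 16))
labels = rng.integers(1, 4, size=60)

# Embed each image as a point in 3-D. Perplexity must be smaller than the
# number of samples; 15 is an arbitrary choice for this toy example.
tsne = TSNE(n_components=3, perplexity=15, init="random", random_state=0)
embedding = tsne.fit_transform(images)

# Each row is now an (x, y, z) coordinate that can be scatter-plotted and
# colored by label to see whether the classes separate.
print(embedding.shape)
```

Coloring the resulting points by `labels` in a 3-D scatter plot is what reveals whether the classes separate into distinct groups.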
After some research I found that convolutional neural networks (CNNs) are the type of neural network best suited for complex image classification tasks. While there are over a dozen different Python frameworks for CNNs, I ended up installing Keras with the TensorFlow backend on my DigitalOcean box. I chose Keras because it has a large and fast-growing community of researchers and organizations, with lots of online resources and documentation. I was also interested in trying a pre-trained model for extracting features, as well as building the architecture of the network myself. Both of these options are relatively straightforward in Keras. I tried two pre-trained models, VGG16 and VGG19, released by the Visual Geometry Group (VGG) at Oxford. My highest validation accuracy came from training my own custom CNN with two Conv2D layers to extract features, two max pooling layers to downsample, two fully connected (dense) layers, and two dropout layers to prevent overfitting.
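As a rough illustration, a network with that layer inventory could be written in Keras as follows. Only the layer types and counts come from the description above; the input size, filter counts, kernel sizes, dropout rates, layer ordering, and optimizer are all assumptions, and `build_cnn` is a hypothetical helper name.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn(input_shape=(64, 64, 3), n_classes=3):
    """Small CNN: 2 Conv2D, 2 max pooling, 2 dense, 2 dropout layers.

    All hyperparameters here are illustrative assumptions.
    """
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        # Feature extraction: convolution followed by downsampling, twice.
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        # Classification head with dropout to reduce overfitting.
        layers.Flatten(),
        layers.Dropout(0.25),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        # One output probability per cervix type.
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_cnn()
model.summary()
```

Training would then call `model.fit` on image arrays with one-hot encoded labels, holding out a validation split to track validation accuracy.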
The thing that surprised me the most while working on my project was the lack of improvement that came from trying different pre-processing techniques. At the core of what separates the different cervix types is the amount of the “transition zone” visible on the cervix. This “transition zone” was usually at the center of the image, surrounded by thousands of pixels containing no obviously useful classification information. I experimented with Otsu's thresholding in skimage, a histogram-based method that separates foreground from background. I also tried segmentation with a Gaussian mixture model in sklearn. I was hoping that removing the background of the image and extracting just the cervix would improve the model’s accuracy. Neither pre-processing technique improved model performance, perhaps because there was just too much variation in the foreground and background conditions of the individual images.
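Both segmentation ideas can be sketched on a synthetic image. The toy image below (a bright square on a dark background) is an assumption chosen so that the two methods work cleanly; real cervix photographs are far less well behaved, which is consistent with the lack of improvement described above.

```python
import numpy as np
from skimage.filters import threshold_otsu
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic grayscale image (assumption): dark noisy background with a
# bright noisy square in the center standing in for the region of interest.
image = rng.normal(0.2, 0.05, size=(64, 64))
image[20:44, 20:44] = rng.normal(0.8, 0.05, size=(24, 24))
image = np.clip(image, 0.0, 1.0)

# Otsu's method picks the intensity threshold that best splits the
# histogram into two classes.
threshold = threshold_otsu(image)
otsu_mask = image > threshold

# Gaussian mixture alternative: fit two components to pixel intensities
# and treat the component with the higher mean as foreground.
pixels = image.reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(pixels)
component = gmm.predict(pixels).reshape(image.shape)
bright_component = int(np.argmax(gmm.means_.ravel()))
gmm_mask = component == bright_component

# Zero out the background before passing the image to a classifier.
foreground_only = np.where(otsu_mask, image, 0.0)
```

On this toy image the two masks agree almost everywhere; on photographs with uneven lighting and cluttered backgrounds, neither intensity-based mask is guaranteed to isolate the cervix.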
The overall accuracy of my classifier when predicting a cervix image as one of three types is 57%. However, in healthcare applications, both types 2 and 3 require additional cancer screening. When I combine types 2 and 3 into one group, I am able to predict an image as either type 1 or type 2/3 with 87% accuracy. Thus, this application could increase efficiency in identifying the need for additional cancer screenings during cervical examinations.
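The effect of merging types 2 and 3 can be illustrated with a small worked example. The labels below are made up for illustration and are not the project's predictions; `accuracy` and `merge_types_2_and_3` are hypothetical helper names.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def merge_types_2_and_3(y):
    """Map type 1 to 1 and both types 2 and 3 to a single group, coded 2."""
    y = np.asarray(y)
    return np.where(y == 1, 1, 2)


# Made-up labels (assumption), not results from the project.
y_true = np.array([1, 2, 3, 2, 3, 1, 2, 3])
y_pred = np.array([1, 3, 2, 2, 3, 2, 2, 1])

three_way = accuracy(y_true, y_pred)
two_way = accuracy(merge_types_2_and_3(y_true), merge_types_2_and_3(y_pred))

print(three_way)  # 0.5
print(two_way)    # 0.75
```

Accuracy rises after merging because confusions between types 2 and 3 no longer count as errors; only mistakes that cross the type 1 versus type 2/3 boundary remain.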