"Data are becoming the new raw material of business." – The Economist


NLTK vs. spaCy: Natural Language Processing in Python

The venerable NLTK has been the standard tool for natural language processing in Python for some time. It contains an amazing variety of tools, algorithms, and corpora. Recently, a competitor has arisen in the form of spaCy, which aims to provide powerful, streamlined language processing. Let’s see how these toolkits compare.

Philosophy

NLTK provides a number of algorithms to choose from. For a researcher, this is a great boon. Its nine different stemming libraries, for example, allow you to finely customize your model. For the developer who just wants a stemmer to use as part of a larger project, this tends to be a hindrance. Which algorithm performs best? Which is the fastest? Which is still being maintained?
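As a rough sketch of what that choice looks like in practice, here is how you might compare a few of NLTK's stemmers on the same words (the three shown are illustrative, not the full list):

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ["generously", "running", "flies", "organization"]

# Three of NLTK's many stemmers -- each makes different trade-offs
# between aggressiveness, speed, and readability of the output.
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

for name, stemmer in stemmers.items():
    print(name, [stemmer.stem(w) for w in words])
```

Running this makes the problem concrete: each stemmer produces slightly different stems, and it is up to you to decide which behavior suits your project.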

In contrast, spaCy ships a single implementation (strictly speaking a lemmatizer rather than a stemmer), the one its developers consider best. They promise to keep it updated, and may replace it with an improved algorithm as the state of the art progresses. You may update your version of spaCy and find that improvements to the library have boosted your application without any extra work on your part. (The downside is that you may need to rewrite some test cases.)
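A minimal sketch of the spaCy side, assuming a current spaCy install with the small English model downloaded (`python -m spacy download en_core_web_sm`): there is no algorithm to pick, you simply read the lemma off each token.

```python
import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The organizations were generously running studies")

# spaCy exposes one built-in normalization per token: its lemma.
for token in doc:
    print(token.text, "->", token.lemma_)
```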

As a quick glance through the NLTK documentation demonstrates, different languages may need different algorithms. NLTK lets you mix and match the algorithms you need (as the sketch below shows), but spaCy has to make a choice for each language. This selection process takes time, and spaCy currently supports only English.
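For instance, NLTK's Snowball stemmer alone ships language-specific variants, and you choose the one each task needs (the words below are just illustrative):

```python
from nltk.stem import SnowballStemmer

# The Snowball family covers more than a dozen languages.
print(SnowballStemmer.languages)      # tuple of supported language names

english = SnowballStemmer("english")
german = SnowballStemmer("german")

print(english.stem("running"))        # "run"
print(german.stem("Universitäten"))
```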
