Introduction to SpaCy


Processing and understanding large volumes of text

SpaCy is a free open-source software library, written in Python and Cython, for advanced Natural Language Processing (NLP). The company behind SpaCy is Explosion AI, a studio specializing in artificial intelligence (AI) and NLP.

Their philosophy is that AI is not magic and that it is only useful if it’s easy to understand. AI should be accessible to everyone in such way that is easy to use, transparent and stable.

Alongside SpaCy, Explosion AI is also the maker of Prodigy, a very efficient annotation tool, and Thinc the machine learning library that powers SpaCy.

According to benchmarks and independent research published on their site *1) SpaCy is the fastest syntactic parser in the world. To accomplish this SpaCy has been built from the ground-up in Cython and it is now a fast-growing industry standard library for NLP.

Unlike for example NLTK, SpaCy is specifically designed for production use.

In the latest GitHub “The State of the Octoverse” report *2) you find that SpaCy was, in terms of contributors, the third most popular open source project with the “machine learning” label on GitHub in 2018.

GIthuub on Spacy

Although the Tensorflow project was far more popular than all the other projects, the SpaCy project focuses on natural language processing problems.

So, SpaCy is great at performing simple to complex text processing tasks.

When you work with large volumes of text as for example: documents, articles, tweets or reviews and you would like to know for example which companies, persons, products or locations are mentioned in the text, or if you like to know more about the context of the words being used in the document, then SpaCy can help you build custom applications that can process and “understand” these large volumes of text.

Among the many features of SpaCy are tokenization, part-of-speech (POS) tagging, named entity recognition (NER), dependency parsing, text classification, lemmatization, labeled dependency parsing, sentence segmentation and integrated word vectors.

For some of the SpaCy features, like tagging, parsing and named entity recognition, to work it will require you to load statistical neural models. Currently there are models for the following languages: German, Greek, English, Spanish, French, Italian, Dutch and Portuguese.

But due to the large open source community that SpaCy has, new models are being developed continuously. So make sure to keep an eye out!

*1) https://spacy.io/usage/facts-figures
*2) https://github.blog/2019-01-24-the-state-of-the-octoverse-machine-learning/

Een reactie plaatsen

Het e-mailadres wordt niet gepubliceerd. Vereiste velden zijn gemarkeerd met *