What are spaCy and Prodigy in NLP?
When dealing with NLP, there are new tools worth knowing that make the process faster. spaCy is a relatively new package for “Industrial strength NLP in Python” developed by Matt Honnibal. It is designed with the applied data scientist in mind, meaning it does not weigh the user down with decisions over which esoteric algorithm to use for common tasks, and it is fast: it is reported to be up to 400 times faster than NLTK for some operations. If you are familiar with the Python data science stack, spaCy is your numpy for NLP: it is reasonably low-level, but very intuitive and performant.
spaCy provides a one-stop shop for tasks commonly used in any NLP project, including:
- Tokenisation
- Lemmatisation
- Part-of-speech tagging
- Entity recognition
- Dependency parsing
- Sentence recognition
- Word-to-vector transformations
- Many convenience methods for cleaning and normalising text
Let’s get started!
First, we load spaCy's pipeline, which by convention is stored in a variable named nlp. Declaring this variable will take a couple of seconds as spaCy loads its models and data up-front to save time later. In effect, this gets some heavy lifting out of the way early, so that the cost is not incurred upon each application of the nlp parser to your data. Note that here I am using the English language model, but there is also a fully featured German model, with tokenization (discussed below) implemented across several languages.
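As a minimal sketch (not part of the worked example below), loading the German pipeline looks much the same; this assumes the German model has been downloaded and that the "de" shorthand works for your spaCy version:

import spacy

# Assumes the German model is installed; the shorthand name may differ by spaCy version
nlp_de = spacy.load("de")

# Tokenisation works the same way regardless of language
doc_de = nlp_de("Der große graue Hund fraß die ganze Schokolade.")
print([token.orth_ for token in doc_de])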
We invoke nlp on the sample text to create a Doc object. The Doc object is now a vessel for NLP tasks on the text itself, slices of the text (Span objects) and elements (Token objects) of the text. It is worth noting that Token and Span objects actually hold no data. Instead, they contain pointers to data contained in the Doc object and are evaluated lazily (i.e. upon request). Much of spaCy's core functionality is accessed through the methods on Doc (n=33), Span (n=29) and Token (n=78) objects.
In[1]: import spacy
...: nlp = spacy.load("en")
...: doc = nlp("The big grey dog ate all of the chocolate, but fortunately he wasn't sick!")
Tokenization
Tokenisation is a foundational step in many NLP tasks. Tokenising text is the process of splitting a piece of text into words, symbols, punctuation, spaces and other elements, thereby creating “tokens”. A naive way to do this is to simply split the string on white space:
In[2]: doc.text.split()
...: Out[2]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate,', 'but', 'fortunately', 'he', "wasn't", 'sick!']
On the surface, this looks fine. But note that a) it disregards the punctuation and b) it does not split the verb and adverb (“was”, “n’t”). Put differently, it is naive: it fails to recognize elements of the text that help us (and a machine) to understand its structure and meaning. Let's see how spaCy handles this:
In[3]: [token.orth_ for token in doc]
...: Out[3]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate', ',', 'but', 'fortunately', 'he', 'was', "n't", ' ', 'sick', '!']
Here we access each token's .orth_ method, which returns a string representation of the token rather than a spaCy token object; this might not always be desirable, but it is worth noting. spaCy recognizes punctuation and is able to split these punctuation tokens from word tokens. Many of spaCy's token methods offer both string and integer representations of processed text: methods with an underscore suffix return strings, methods without an underscore suffix return integers. For example:
In[4]: [(token, token.orth_, token.orth) for token in doc]
...: Out[4]: [
(The, 'The', 517),
(big, 'big', 742),
(grey, 'grey', 4623),
(dog, 'dog', 1175),
(ate, 'ate', 3469),
(all, 'all', 516),
(of, 'of', 471),
(the, 'the', 466),
(chocolate, 'chocolate', 3593),
(,, ',', 416),
(but, 'but', 494),
(fortunately, 'fortunately', 15520),
(he, 'he', 514),
(was, 'was', 491),
(n't, "n't", 479),
( , ' ', 483),
(sick, 'sick', 1698),
(!, '!', 495)]
Here, we return the SpaCy token, the string representation of the token and the integer representation of the token in a list of tuples.
If you want to avoid returning tokens that are punctuation or white space, spaCy provides convenience methods for this (as well as many other common text cleaning tasks); for example, to remove stop words you can check each token's .is_stop attribute.
In[5]: [token.orth_ for token in doc if not token.is_punct | token.is_space]
...: Out[5]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate', 'but', 'fortunately', 'he', 'was', "n't", 'sick']
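Along the same lines, here is a rough sketch of filtering out stop words with .is_stop as well (the exact stop-word list depends on the language model and spaCy version, so the result may vary):

# Drop stop words, punctuation and whitespace in one pass
[token.orth_ for token in doc if not (token.is_stop or token.is_punct or token.is_space)]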
Lemmatization
A related task to tokenization is lemmatization. Lemmatization is the process of reducing a word to its base form, its mother word if you like. Different uses of a word often have the same root meaning. For example, practice, practised and practising all essentially refer to the same thing. It is often desirable to standardize words with similar meaning to their base form. With spaCy we can access each word's base form with a token's .lemma_ method:
In[6]: practice = "practice practiced practicing"
...: nlp_practice = nlp(practice)
...: [word.lemma_ for word in nlp_practice]
...: Out[6]: ['practice', 'practice', 'practice']
Why is this useful? An immediate use case is in machine learning, specifically text classification. Lemmatizing the text prior to, for example, creating a “bag-of-words” avoids word duplication and, therefore, allows for the model to build a clearer picture of patterns of word usage across multiple documents.
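As a minimal sketch of that idea (the two sentences here are hypothetical, not from the article), counting lemmas instead of raw tokens collapses the different surface forms into a single feature:

from collections import Counter

# Two hypothetical documents that use different forms of "practice"
docs = [nlp("She practiced every day"), nlp("They kept practicing all week")]

# Bag-of-words over lemmas: both forms count towards the same key
bow = Counter(
    token.lemma_
    for d in docs
    for token in d
    if not (token.is_punct or token.is_space)
)
print(bow)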
Tagging
Part-of-speech tagging is the process of assigning grammatical properties (e.g. noun, verb, adverb, adjective etc.) to words. Words that share the same POS tag tend to follow a similar syntactic structure and are useful in rule-based processes.
For example, in a given description of an event we may wish to determine who owns what. By exploiting possessives, we can do this (providing the text is grammatically sound!). spaCy uses the popular Penn Treebank POS tags (see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). With spaCy you can access coarse-grained and fine-grained POS tags with the .pos_ and .tag_ methods, respectively. Here, I access the fine-grained POS tags:
In[7]: doc2 = nlp("Conor's dog's toy was hidden under the man's sofa in the woman's house")
...: pos_tags = [(i, i.tag_) for i in doc2]
...: pos_tags
...: Out[7]: [(Conor, 'NNP'), ('s, 'POS'), (dog, 'NN'), ('s, 'POS'), (toy, 'NN'), (was, 'VBD'), (hidden, 'VBN'), (under, 'IN'), (the, 'DT'), (man, 'NN'), ('s, 'POS'), (sofa, 'NN'), (in, 'IN'), (the, 'DT'), (woman, 'NN'), ('s, 'POS'), (house, 'NN')]
We can see that the “’s” tokens are labelled as POS. We can exploit this tag to extract the owner and the thing that they own:
In[8]: owners_possessions = []
...: for i in pos_tags:
...:     if i[1] == "POS":
...:         owner = i[0].nbor(-1)
...:         possession = i[0].nbor(1)
...:         owners_possessions.append((owner, possession))
...:
...: owners_possessions
...: Out[8]: [(Conor, dog), (dog, toy), (man, sofa), (woman, house)]
This returns a list of owner-possession tuples. If you want to be super Pythonic about it, you can do this in a list comprehension (which, I think, is preferable!):
In[9]: [(i[0].nbor(-1), i[0].nbor(+1)) for i in pos_tags if i[1] == "POS"]
...: Out[9]: [(Conor, dog), (dog, toy), (man, sofa), (woman, house)]
Here we are using each token's .nbor method, which returns a neighbouring token at a given offset (for example, -1 for the previous token and +1 for the next).
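For completeness, here is a quick sketch contrasting the coarse-grained .pos_ tags with the fine-grained .tag_ tags used above (the exact tags will depend on the model version):

# Coarse-grained (.pos_) vs fine-grained (.tag_) tags for the possessives example
[(i.orth_, i.pos_, i.tag_) for i in doc2]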
Entity recognition
Entity recognition is the process of classifying named entities found in a text into pre-defined categories, such as persons, places, organizations, dates, etc. spaCy uses a statistical model to classify a broad range of entities, including persons, events, works-of-art and nationalities / religions (see the documentation for the full list https://spacy.io/docs/usage/entity-recognition).
For example, let's take the first two sentences from Barack Obama's Wikipedia entry. We will parse this text, then access the identified entities using the Doc object's .ents method. Each entity returned is a Span, on which we can access the .label_ and .label methods:
In[10]: wiki_obama = """Barack Obama is an American politician who served as
...: the 44th President of the United States from 2009 to 2017. He is the first
...: African American to have served as president,
...: as well as the first born outside the contiguous United States."""
...:
...: nlp_obama = nlp(wiki_obama)
...: [(i, i.label_, i.label) for i in nlp_obama.ents]
...: Out[10]: [(Barack Obama, 'PERSON', 346), (American, 'NORP', 347), (the United States, 'GPE', 350), (2009 to 2017, 'DATE', 356), (first, 'ORDINAL', 361), (African, 'NORP', 347), (American, 'NORP', 347), (first, 'ORDINAL', 361), (United States, 'GPE', 350)]
You can see the entities that the model has identified and, in this instance at least, they look accurate. PERSON is self-explanatory, NORP is nationalities or religious groups, GPE identifies locations (cities, countries, etc.), DATE recognizes a specific date or date range and ORDINAL identifies a word or number representing some type of order.
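If a label is unfamiliar, a handy trick in more recent spaCy versions is spacy.explain, which returns a short plain-English description of a tag or label (this helper may not exist in very old releases):

# Look up a human-readable description of an entity label
spacy.explain("NORP")
# e.g. 'Nationalities or religious or political groups'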
While we are on the topic of Doc methods, it is worth mentioning spaCy's sentence identifier. It is not uncommon in NLP tasks to want to split a document into sentences. It is simple to do this with spaCy by accessing a Doc's .sents method:
In[11]: for ix, sent in enumerate(nlp_obama.sents, 1):
...: print("Sentence number {}: {}".format(ix, sent))
...:
Sentence number 1: Barack Obama is an American politician who served as the 44th President of the United States from 2009 to 2017.
Sentence number 2: He is the first African American to have served as president, as well as the first born outside the contiguous United States.
Prodigy
Besides spaCy, we have Prodigy, which brings together state-of-the-art insights from machine learning and user experience. With its continuous active learning system, you are only asked to annotate examples the model does not already know the answer to. The web application is powerful, extensible and follows modern UX principles. The secret is very simple: it is designed to help you focus on one decision at a time and keep you clicking, like Tinder for data.
Everyone knows data scientists should spend more time looking at their data. When good habits are hard to form, the trick is to remove the friction. Prodigy makes the right thing easy, encouraging you to spend more time understanding your problem and interpreting your results.
I will provide a more detailed description of Prodigy and how it is used for annotation in my next article.
Please feel free to reach me at kristinelpetrosyan@gmail.com with any questions you may have.