NER and NED with spaCy

Named Entity Recognition

A named entity is an object that’s assigned a name — for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

TEXT        START  END  LABEL  DESCRIPTION
Apple       0      5    ORG    Companies, agencies, institutions.
U.K.        27     31   GPE    Geopolitical entity, i.e. countries, cities, states.
$1 billion  44     54   MONEY  Monetary values, including unit.

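Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its named entities look like:

[Figure: displaCy rendering of the example sentence, with Apple (ORG), U.K. (GPE) and $1 billion (MONEY) highlighted]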

Accessing entity annotations and labels

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.
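As a rough illustration of the access patterns just described, the sketch below uses a blank English pipeline (tokenizer only, so no trained model is needed) and sets the entity by hand. That setup is an assumption made to keep the example small; with a trained pipeline, doc.ents would be filled in by the model instead.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; the entity is set by hand below
doc = nlp("San Francisco considers banning sidewalk delivery robots")
doc.set_ents([Span(doc, 0, 2, label="GPE")])

ent = doc.ents[0]
tokens_in_ent = [token.text for token in ent]  # iterate over the entity's tokens
first_token = ent[0].text                      # index into the entity
whole_text = ent.text                          # text of the whole entity

# ent.label is an integer ID, ent.label_ is the readable string;
# the vocab's string store maps the ID back to the string.
label_from_id = doc.vocab.strings[ent.label]

print(tokens_in_ent, first_token, whole_text, ent.label_, label_from_id)
```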

You can also access token-level entity annotations using the token.ent_iob and token.ent_type attributes (token.ent_iob_ and token.ent_type_ give the string forms). token.ent_iob indicates whether the token begins an entity, is inside one, or is outside any entity. If no entity type is set on a token, token.ent_type_ returns an empty string.

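IOB SCHEME

I: token is inside an entity.
O: token is outside an entity.
B: token is the beginning of an entity.

BILUO SCHEME

B (BEGIN): the first token of a multi-token entity.
I (IN): an inner token of a multi-token entity.
L (LAST): the final token of a multi-token entity.
U (UNIT): a single-token entity.
O (OUT): a non-entity token.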

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)  # ['San', 'B', 'GPE']
print(ent_francisco)  # ['Francisco', 'I', 'GPE']

TEXT       ENT_IOB  ENT_IOB_  ENT_TYPE_  DESCRIPTION
San        3        B         "GPE"      beginning of an entity
Francisco  1        I         "GPE"      inside an entity
considers  2        O         ""         outside an entity
banning    2        O         ""         outside an entity
sidewalk   2        O         ""         outside an entity
delivery   2        O         ""         outside an entity
robots     2        O         ""         outside an entity
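To make the relationship between token-level tags and document-level entities concrete, here is a minimal pure-Python sketch that groups IOB tags back into spans. The function name iob_to_spans is hypothetical and this is not how spaCy builds doc.ents internally; it only illustrates the scheme using the values from the table above.

```python
def iob_to_spans(tokens, iob_tags, ent_types):
    """Group parallel token / IOB / type lists into (text, start, end, label).

    start and end are token indices, with end exclusive.
    """
    spans = []
    start, label = None, ""  # token index and type of the currently open entity
    for i, (tag, ent_type) in enumerate(zip(iob_tags, ent_types)):
        # "B", "O", or an "I" with a different type closes the open entity.
        if tag in ("B", "O") or (tag == "I" and ent_type != label):
            if start is not None:
                spans.append((" ".join(tokens[start:i]), start, i, label))
                start, label = None, ""
        # "B" opens a new entity; a stray "I" with nothing open is treated like "B".
        if tag == "B" or (tag == "I" and start is None):
            start, label = i, ent_type
    if start is not None:  # entity that runs to the end of the text
        spans.append((" ".join(tokens[start:]), start, len(tokens), label))
    return spans


tokens = ["San", "Francisco", "considers", "banning", "sidewalk", "delivery", "robots"]
iob_tags = ["B", "I", "O", "O", "O", "O", "O"]
ent_types = ["GPE", "GPE", "", "", "", "", ""]

spans = iob_tags_result = iob_to_spans(tokens, iob_tags, ent_types)
print(spans)  # [('San Francisco', 0, 2, 'GPE')]
```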

Setting entity annotations

To keep the sequence of token annotations consistent, entity annotations have to be set at the document level. You can’t write directly to the token.ent_iob or token.ent_type attributes, so the easiest way to set entities is to call the doc.set_ents method with the new entity created as a Span.

import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# The model didn't recognize "fb" as an entity :(

# Create a span for the new entity
fb_ent = Span(doc, 0, 1, label="ORG")
orig_ents = list(doc.ents)

# Option 1: Modify the provided entity spans, leaving the rest unmodified
doc.set_ents([fb_ent], default="unmodified")

# Option 2: Assign a complete list of ents to doc.ents
doc.ents = orig_ents + [fb_ent]

ents = [(e.text, e.start, e.end, e.label_) for e in doc.ents]
print('After', ents)
# [('fb', 0, 1, 'ORG')]
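The default argument of doc.set_ents decides what happens to tokens that are not covered by the spans you pass in. The sketch below contrasts default="unmodified" with the default mode, "outside". It assumes a blank English pipeline (tokenizer only, no model download), and the labels TITLE and TOPIC are made up purely for illustration.

```python
import spacy
from spacy.tokens import Span

# A blank pipeline only tokenizes, so doc.ents starts out empty.
nlp = spacy.blank("en")
doc = nlp("fb is hiring a new vice president of global policy")

# default="unmodified" leaves tokens outside the new span untouched,
# so repeated calls accumulate entities.
doc.set_ents([Span(doc, 0, 1, label="ORG")], default="unmodified")
doc.set_ents([Span(doc, 5, 7, label="TITLE")], default="unmodified")  # made-up label
accumulated = [(e.text, e.label_) for e in doc.ents]
print(accumulated)

# Without a default argument, set_ents uses default="outside": every token
# not covered by the given spans is marked "O", dropping earlier entities.
doc.set_ents([Span(doc, 8, 10, label="TOPIC")])  # made-up label
replaced = [(e.text, e.label_) for e in doc.ents]
print(replaced)
```

When you only want to patch a model's predictions, default="unmodified" is usually the mode you want, since it preserves whatever the model already found.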

Keep in mind that Span is initialized with the start and end token indices, not the character offsets. To create a span from character offsets, use Doc.char_span:

fb_ent = doc.char_span(0, 2, label="ORG")
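Character offsets only map to a span when they line up with token boundaries; by default, Doc.char_span returns None when they don't. The pure-Python sketch below illustrates that idea. The helper names are hypothetical and whitespace splitting stands in for spaCy's tokenizer, so treat it as an illustration of the mapping rather than spaCy's implementation.

```python
import re


def whitespace_offsets(text):
    """Return (start_char, end_char) for each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_to_token_span(token_offsets, start_char, end_char):
    """Map character offsets to (start_token, end_token), end exclusive.

    Returns None when the offsets do not fall exactly on token boundaries.
    """
    starts = {s: i for i, (s, _) in enumerate(token_offsets)}
    ends = {e: i + 1 for i, (_, e) in enumerate(token_offsets)}
    if start_char not in starts or end_char not in ends:
        return None
    start_token, end_token = starts[start_char], ends[end_char]
    return (start_token, end_token) if start_token < end_token else None


text = "fb is hiring a new vice president of global policy"
offsets = whitespace_offsets(text)

print(char_to_token_span(offsets, 0, 2))    # (0, 1): "fb"
print(char_to_token_span(offsets, 19, 33))  # (5, 7): "vice president"
print(char_to_token_span(offsets, 0, 1))    # None: ends in the middle of "fb"
```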

Setting entity annotations from array

You can also assign entity annotations using the doc.from_array method. To do this, you should include both the ENT_TYPE and the ENT_IOB attributes in the array you’re importing from.

import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("London is a big city in the United Kingdom.")
print("Before", doc.ents)  # ()

# One row per token, one column per attribute in the header
header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
print("After", doc.ents)  # (London,)

Visualizing named entities

The displaCy ENT visualizer lets you explore an entity recognition model’s behavior interactively. If you’re training a model, it’s very useful to run the visualization yourself. To help you do that, spaCy comes with a visualization module. You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.

For more details and examples, see the usage guide on visualizing spaCy.

Named entity example:

import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")
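If you want the raw markup rather than a running web server, displacy.render returns it as a string. The sketch below is one way to try that without a trained model: it assumes a blank English pipeline and sets the entities by hand, which is a simplification for illustration only.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; entities are set by hand below
doc = nlp("London is a big city in the United Kingdom.")
doc.set_ents([
    Span(doc, 0, 1, label="GPE"),  # "London"
    Span(doc, 7, 9, label="GPE"),  # "United Kingdom"
])

# page=True wraps the markup in a full HTML page;
# jupyter=False asks for the markup back as a string instead of displaying it.
html = displacy.render(doc, style="ent", page=True, jupyter=False)
print(len(html))
```

The resulting string can be written to an .html file and opened in a browser, or embedded in a larger page when page=False.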

Entity Linking (NED)

To better understand the goal of Named Entity Disambiguation, it is helpful to place it as the last step in a longer NLP pipeline that begins with Named Entity Recognition. Named Entity Recognition identifies within a unit of text the word or words that represent an entity or object — such as names, organizations, locations, or other proper nouns.

Once a named entity has been identified, it must be matched to a corresponding node in a knowledge base in order to grow that knowledge base with primary source documentation and improve understanding of the node. The simplest way to link a named entity is by direct string matching.
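Direct string matching can be sketched in a few lines of plain Python. The alias table and the function name link_by_string_match below are hypothetical, chosen only for illustration; the identifiers Q7259 and Q84 are the ones used in the entity-linking example in this article.

```python
def link_by_string_match(mention, alias_table):
    """Return the KB identifier whose alias equals the mention, or None."""
    key = mention.strip().casefold()  # normalize case and surrounding whitespace
    return alias_table.get(key)


# Toy alias table mapping normalized names to knowledge-base identifiers.
alias_table = {
    "ada lovelace": "Q7259",
    "london": "Q84",
}

print(link_by_string_match("Ada Lovelace", alias_table))  # Q7259
print(link_by_string_match("London", alias_table))        # Q84
print(link_by_string_match("Lovelace", alias_table))      # None: no exact match
```

The last call shows the main weakness of this approach: partial names, spelling variants, and names shared by several entities cannot be resolved by exact lookup, which is where a trained entity linker that uses context comes in.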

To ground the named entities into the “real world”, spaCy provides functionality to perform entity linking, which resolves a textual entity to a unique identifier from a knowledge base (KB). You can create your own KnowledgeBase and train a new EntityLinker using that custom knowledge base.

Accessing entity identifiers (needs a model)

The annotated KB identifier is accessible as either a hash value or as a string, using the attributes ent.kb_id and ent.kb_id_ of a Span object, or the ent_kb_id and ent_kb_id_ attributes of a Token object.

import spacy

# "my_custom_el_pipeline" is a placeholder for a pipeline that includes a trained entity linker
nlp = spacy.load("my_custom_el_pipeline")
doc = nlp("Ada Lovelace was born in London")
# Document level
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents) # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]
# Token level
ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
print(ent_ada_0) # ['Ada', 'PERSON', 'Q7259']
print(ent_ada_1) # ['Lovelace', 'PERSON', 'Q7259']
print(ent_london_5) # ['London', 'GPE', 'Q84']

I will share more, with real examples from my project, in my next article.