NER and NED with spaCy
Named Entity Recognition
A named entity is an object that’s assigned a name — for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.
Named entities are available as the ents property of a Doc:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
- Text: The original entity text.
- Start: Index of start of entity in the Doc.
- End: Index of end of entity in the Doc.
- Label: Entity label, i.e. type.
TEXT | START | END | LABEL | DESCRIPTION
Apple | 0 | 5 | ORG | Companies, agencies, institutions.
U.K. | 27 | 31 | GPE | Geopolitical entity, i.e. countries, cities, states.
$1 billion | 44 | 54 | MONEY | Monetary values, including unit.
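If you want a short description for a label in your own code, spaCy ships a small glossary that you can query with spacy.explain. The sketch below assumes only that spaCy is installed (no trained pipeline is needed for this call). The wording it returns may differ slightly from the descriptions listed above.

```python
import spacy

# Labels taken from the example sentence above.
labels = ["ORG", "GPE", "MONEY"]

# spacy.explain looks each label up in spaCy's built-in glossary and
# returns a short description string (or None for unknown labels).
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<6} {description}")
```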
spaCy’s built-in displaCy visualizer can render our example sentence with its named entities highlighted.
Accessing entity annotations and labels
The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.
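Here is a minimal sketch of those access patterns. To keep it runnable without downloading a trained pipeline, it uses spacy.blank("en") (tokenizer only) and marks the entity by hand. The hand-assigned GPE span is an assumption for illustration, not a model prediction.

```python
import spacy
from spacy.tokens import Span

# A blank English pipeline only tokenizes, so no trained model is required.
nlp = spacy.blank("en")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# Assumption for illustration: mark the first two tokens as a GPE entity by hand.
doc.set_ents([Span(doc, 0, 2, label="GPE")])

ent = doc.ents[0]
tokens_in_ent = [token.text for token in ent]  # iterate over the entity
first_token = ent[0].text                      # index into the entity
label_hash = ent.label                         # entity type as a hash value
label_text = ent.label_                        # entity type as a string

print(ent.text, tokens_in_ent, first_token, label_text)
```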
You can also access token entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether an entity starts, continues or ends on the tag. If no entity type is set on a token, it will return an empty string.
IOB SCHEME
I – Token is inside an entity.
O – Token is outside an entity.
B – Token is the beginning of an entity.
BILUO SCHEME
B – Token is the beginning of a multi-token entity.
I – Token is inside a multi-token entity.
L – Token is the last token of a multi-token entity.
U – Token is a single-token unit entity.
O – Token is outside an entity.
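To make the relationship between the two schemes concrete, here is a hand-written sketch that converts a sequence of IOB tags into BILUO tags. The helper iob_to_biluo is defined here for illustration only and is not the spaCy API. It assumes plain tags without entity types, and it treats an "I" with no preceding "B" as a continuation.

```python
def iob_to_biluo(tags):
    """Convert a list of plain IOB tags ("I", "O", "B") to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        # The current entity continues only if the next tag is "I".
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I"
        if tag == "O":
            biluo.append("O")
        elif tag == "B":
            # A "B" with nothing after it is a single-token (unit) entity.
            biluo.append("B" if continues else "U")
        elif tag == "I":
            # An "I" with nothing after it is the last token of the entity.
            biluo.append("I" if continues else "L")
        else:
            raise ValueError(f"Unexpected tag: {tag!r}")
    return biluo


# Hypothetical tag sequence: a two-token entity, two outside tokens,
# then a one-token entity.
example = ["B", "I", "O", "O", "B"]
converted = iob_to_biluo(example)
print(converted)  # ['B', 'L', 'O', 'O', 'U']
```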
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")
# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)  # ['San', 'B', 'GPE']
print(ent_francisco)  # ['Francisco', 'I', 'GPE']
TEXT | ENT_IOB | ENT_IOB_ | ENT_TYPE_ | DESCRIPTION
San | 3 | B | "GPE" | beginning of an entity
Francisco | 1 | I | "GPE" | inside an entity
considers | 2 | O | "" | outside an entity
banning | 2 | O | "" | outside an entity
sidewalk | 2 | O | "" | outside an entity
delivery | 2 | O | "" | outside an entity
robots | 2 | O | "" | outside an entity
Setting entity annotations
To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level. However, you can’t write directly to the token.ent_iob or token.ent_type attributes, so the easiest way to set entities is to use the doc.set_ents function and create the new entity as a Span.
import spacy
from spacy.tokens import Span
nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# The model didn't recognize "fb" as an entity :(
# Create a span for the new entity
fb_ent = Span(doc, 0, 1, label="ORG")
orig_ents = list(doc.ents)
# Option 1: Modify the provided entity spans, leaving the rest unmodified
doc.set_ents([fb_ent], default="unmodified")
# Option 2: Assign a complete list of ents to doc.ents
doc.ents = orig_ents + [fb_ent]
ents = [(e.text, e.start, e.end, e.label_) for e in doc.ents]
print('After', ents)
# [('fb', 0, 1, 'ORG')]
Keep in mind that Span is initialized with the start and end token indices, not the character offsets. To create a span from character offsets, use Doc.char_span:
fb_ent = doc.char_span(0, 2, label="ORG")
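One thing to watch for: character offsets that do not line up with token boundaries. The sketch below, which uses a blank English pipeline so that no trained model is needed, shows that Doc.char_span returns None in that case by default, and that the alignment_mode argument can snap the offsets to the enclosing token instead. The misaligned offsets (0, 1) are chosen purely for illustration.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained model required
doc = nlp("fb is hiring a new vice president of global policy")

# Offsets 0-2 cover exactly the token "fb", so a Span is returned.
aligned = doc.char_span(0, 2, label="ORG")

# Offsets 0-1 cut the token "fb" in half; the default strict mode returns None.
misaligned = doc.char_span(0, 1, label="ORG")

# alignment_mode="expand" widens the span to the token(s) it partially covers.
expanded = doc.char_span(0, 1, label="ORG", alignment_mode="expand")

print(aligned, misaligned, expanded)
```

Checking the result for None before adding it to doc.ents avoids a confusing error later on.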
Setting entity annotations from array
You can also assign entity annotations using the doc.from_array method. To do this, you should include both the ENT_TYPE and the ENT_IOB attributes in the array you’re importing from.
import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE
nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("London is a big city in the United Kingdom.")
print("Before", doc.ents)  # ()
header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
print("After", doc.ents)  # (London,)
Visualizing named entities
The displaCy ENT visualizer lets you explore an entity recognition model’s behavior interactively. If you’re training a model, it’s very useful to run the visualization yourself. To help you do that, spaCy comes with a visualization module. You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.
For more details and examples, see the usage guide on visualizing spaCy.
NAMED ENTITY EXAMPLE
import spacy
from spacy import displacy
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")
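If you would rather get the markup as a string than start a web server, displacy.render is the other route mentioned above. The following is a small sketch, again using a blank pipeline with a hand-assigned entity (an assumption for illustration) so that it runs without a trained model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; no trained model required
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# Assumption for illustration: the entity is set by hand, not predicted.
doc.set_ents([Span(doc, 0, 2, label="GPE")])

# page=False returns only the markup fragment rather than a full HTML page;
# jupyter=False makes render return the string instead of displaying it.
html = displacy.render(doc, style="ent", page=False, jupyter=False)

print(html[:80])
```

The returned string can be written to an .html file or embedded in a larger page.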
Entity Linking (NED)
To better understand the goal of Named Entity Disambiguation, it is helpful to place it as the last step in a longer NLP pipeline that begins with Named Entity Recognition. Named Entity Recognition identifies within a unit of text the word or words that represent an entity or object — such as names, organizations, locations, or other proper nouns.
Once a named entity has been identified, it must be matched to a corresponding node in a knowledge base in order to grow that knowledge base with primary source documentation and improve understanding of the node. The simplest way to link a named entity is by direct string matching.
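As a rough illustration of direct string matching, the sketch below looks up each mention in a toy dictionary. The dictionary, the function name link_by_string_match and the normalization (strip and lowercase) are all hypothetical choices made for this example. The identifiers are the ones that appear in this article's entity linking output.

```python
# Hypothetical toy knowledge base: normalized surface form -> identifier.
KNOWLEDGE_BASE = {
    "ada lovelace": "Q7259",
    "london": "Q84",
}


def link_by_string_match(mention, kb=KNOWLEDGE_BASE):
    """Return the identifier whose name equals the mention, or None."""
    return kb.get(mention.strip().lower())


mentions = ["Ada Lovelace", "London", "Paris"]
links = {mention: link_by_string_match(mention) for mention in mentions}
print(links)  # {'Ada Lovelace': 'Q7259', 'London': 'Q84', 'Paris': None}
```

Exact matching like this cannot choose between several knowledge base entries that share the same name, which is the gap a trained entity linker is meant to close.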
To ground the named entities into the “real world”, spaCy provides functionality to perform entity linking, which resolves a textual entity to a unique identifier from a knowledge base (KB). You can create your own KnowledgeBase and train a new EntityLinker using that custom knowledge base.
Accessing entity identifiers
The annotated KB identifier is accessible as either a hash value or as a string, using the attributes ent.kb_id and ent.kb_id_ of a Span object, or the ent_kb_id and ent_kb_id_ attributes of a Token object.
import spacy
nlp = spacy.load("my_custom_el_pipeline")
doc = nlp("Ada Lovelace was born in London")
# Document level
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents)  # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]
# Token level
ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
print(ent_ada_0)  # ['Ada', 'PERSON', 'Q7259']
print(ent_ada_1)  # ['Lovelace', 'PERSON', 'Q7259']
print(ent_london_5)  # ['London', 'GPE', 'Q84']
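The example above needs a custom entity linking pipeline. If you only want to see how the attributes behave, the sketch below assigns the entities and KB identifiers by hand on a blank pipeline. This is an assumption made for illustration: nothing here is predicted by an entity linker.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; no trained model required
doc = nlp("Ada Lovelace was born in London")

# Assumption for illustration: labels and KB identifiers are set by hand.
doc.set_ents([
    Span(doc, 0, 2, label="PERSON", kb_id="Q7259"),
    Span(doc, 5, 6, label="GPE", kb_id="Q84"),
])

# Document level: identifier as a string on each entity span.
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]

# Token level: tokens outside any entity carry an empty identifier.
token_ids = [(t.text, t.ent_kb_id_) for t in doc]

print(ents)
print(token_ids)
```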
I will share more real-world examples from my project in my next article.