NER and NED with spaCy
Named Entity Recognition
A named entity is an object that’s assigned a name — for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.
Named entities are available as the ents property of a Doc:
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

    # Print each entity with its character offsets and label
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
- Text: The original entity text.
- Start: Index of start of entity in the Doc.
- End: Index of end of entity in the Doc.
- Label: Entity label, i.e. type.
TEXT        START  END  LABEL  DESCRIPTION
Apple       0      5    ORG    Companies, agencies, institutions.
U.K.        27     31   GPE    Geopolitical entity, i.e. countries, cities, states.
$1 billion  44     54   MONEY  Monetary values, including unit.
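The start and end values in the table are character offsets into the original string, and the end offset is exclusive, so slicing the raw text recovers each entity. Here is a minimal sketch in plain Python; the offsets are copied from the table above rather than computed by a model:

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the table; not model output
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Slicing with [start:end] works because the end offset is exclusive
recovered = [(text[start:end], label) for start, end, label in entities]
print(recovered)
```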
Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its named entities look like:
Accessing entity annotations and labels
The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.
You can also access token entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether an entity starts, continues or ends on that token. If no entity type is set on a token, token.ent_type_ returns an empty string.
IOB scheme:
- I: Token is inside an entity.
- O: Token is outside an entity.
- B: Token is the beginning of an entity.

BILUO scheme:
- B: Token is the beginning of a multi-token entity.
- I: Token is inside a multi-token entity.
- L: Token is the last token of a multi-token entity.
- U: Token is a single-token unit entity.
- O: Token is outside an entity.
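The two schemes carry the same information: BILUO only makes the last token and single-token entities explicit. The sketch below converts one to the other in plain Python. It assumes tags are written as strings such as "B-GPE" (a common convention, not something this article prescribes), and the helper name iob_to_biluo_sketch is made up for illustration.

```python
def iob_to_biluo_sketch(tags):
    """Convert IOB tags such as 'B-GPE' / 'I-GPE' / 'O' to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append("O")
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag with the same label
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


converted = iob_to_biluo_sketch(["B-GPE", "I-GPE", "O", "B-ORG"])
print(converted)
```

A two-token entity becomes B followed by L, and a lone B becomes U.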
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("San Francisco considers banning sidewalk delivery robots")

    # document level
    ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
    print(ents)

    # token level: index into the Doc to get individual tokens
    ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
    ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
    print(ent_san)  # ['San', 'B', 'GPE']
    print(ent_francisco)  # ['Francisco', 'I', 'GPE']
TEXT       ENT_IOB  ENT_IOB_  ENT_TYPE_  DESCRIPTION
San        3        B         "GPE"      beginning of an entity
Francisco  1        I         "GPE"      inside an entity
considers  2        O         ""         outside an entity
banning    2        O         ""         outside an entity
sidewalk   2        O         ""         outside an entity
delivery   2        O         ""         outside an entity
robots     2        O         ""         outside an entity
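To see how the token-level values relate to document-level entities, the sketch below rebuilds entity spans from the rows of the table above. It is plain Python with the values copied by hand, and group_entities is a hypothetical helper, not a spaCy function. The integer codes follow the table (3 = B, 1 = I, 2 = O); 0 means no entity tag has been set.

```python
IOB_CODES = {0: "", 1: "I", 2: "O", 3: "B"}

# (text, ent_iob, ent_type_) rows copied from the table; not model output
tokens = [
    ("San", 3, "GPE"),
    ("Francisco", 1, "GPE"),
    ("considers", 2, ""),
    ("banning", 2, ""),
    ("sidewalk", 2, ""),
    ("delivery", 2, ""),
    ("robots", 2, ""),
]


def group_entities(rows):
    """Merge consecutive B/I tokens into (entity_text, label) pairs."""
    entities = []
    current = None  # [list of words, label] for the entity being built
    for text, code, ent_type in rows:
        tag = IOB_CODES[code]
        if tag == "B":
            current = [[text], ent_type]
            entities.append(current)
        elif tag == "I" and current is not None:
            current[0].append(text)
        else:
            current = None  # an O or unset token closes any open entity
    return [(" ".join(words), label) for words, label in entities]


grouped = group_entities(tokens)
print(grouped)
```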
Setting entity annotations
To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level. However, you can't write directly to the token.ent_iob or token.ent_type attributes, so the easiest way to set entities is to use the doc.set_ents function and create the new entity as a Span.
    import spacy
    from spacy.tokens import Span

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("fb is hiring a new vice president of global policy")
    ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
    print('Before', ents)
    # The model didn't recognize "fb" as an entity :(

    # Create a span for the new entity
    fb_ent = Span(doc, 0, 1, label="ORG")
    orig_ents = list(doc.ents)

    # Option 1: Modify the provided entity spans, leaving the rest unmodified
    doc.set_ents([fb_ent], default="unmodified")

    # Option 2: Assign a complete list of ents to doc.ents
    doc.ents = orig_ents + [fb_ent]

    ents = [(e.text, e.start, e.end, e.label_) for e in doc.ents]
    print('After', ents)
    # [('fb', 0, 1, 'ORG')]
Keep in mind that Span is initialized with the start and end token indices, not the character offsets. To create a span from character offsets, use Doc.char_span:

    fb_ent = doc.char_span(0, 2, label="ORG")
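The difference between token indices and character offsets can be illustrated without spaCy. The sketch below assumes plain whitespace tokenization, which is a simplification and not how spaCy's tokenizer works, and char_to_token_span is a hypothetical helper. It returns None when the offsets do not line up with token boundaries, similar in spirit to how doc.char_span returns None by default for misaligned offsets.

```python
def token_offsets(text):
    """Return (start_char, end_char) for each whitespace-separated token."""
    offsets, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        offsets.append((start, end))
        pos = end
    return offsets


def char_to_token_span(text, start_char, end_char):
    """Map character offsets to (start_token, end_token), end exclusive."""
    offsets = token_offsets(text)
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i + 1 for i, (_, e) in enumerate(offsets)}
    if start_char in starts and end_char in ends:
        if starts[start_char] < ends[end_char]:
            return starts[start_char], ends[end_char]
    return None  # offsets fall inside a token or do not form a valid span


text = "fb is hiring a new vice president of global policy"
fb_span = char_to_token_span(text, 0, 2)      # characters 0-2 cover "fb"
vp_span = char_to_token_span(text, 19, 33)    # "vice president"
misaligned = char_to_token_span(text, 0, 1)   # cuts "fb" in half
print(fb_span, vp_span, misaligned)
```

Characters 0 to 2 correspond to token indices 0 to 1, which are the values passed to Span in the example above.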
Setting entity annotations from array
You can also assign entity annotations using the doc.from_array method. To do this, you should include both the ENT_TYPE and the ENT_IOB attributes in the array you're importing from.
    import numpy
    import spacy
    from spacy.attrs import ENT_IOB, ENT_TYPE

    nlp = spacy.load("en_core_web_sm")
    # make_doc only tokenizes, so no entities are set yet
    doc = nlp.make_doc("London is a big city in the United Kingdom.")
    print("Before", doc.ents)

    # One row per token, one column per attribute in the header
    header = [ENT_IOB, ENT_TYPE]
    attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
    attr_array[0, 0] = 3  # B
    attr_array[0, 1] = doc.vocab.strings["GPE"]
    doc.from_array(header, attr_array)
    print("After", doc.ents)  # [London]
Visualizing named entities
The displaCy ENT visualizer lets you explore an entity recognition model's behavior interactively. If you're training a model, it's very useful to run the visualization yourself. To help you do that, spaCy comes with a visualization module. You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.
For more details and examples, see the usage guide on visualizing spaCy.
Named entity example:

    import spacy
    from spacy import displacy

    text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    # Start a local web server that shows the highlighted entities
    displacy.serve(doc, style="ent")
Named Entity Disambiguation
To better understand the goal of Named Entity Disambiguation, it is helpful to place it as the last step in a longer NLP pipeline that begins with Named Entity Recognition. Named Entity Recognition identifies within a unit of text the word or words that represent an entity or object, such as names, organizations, locations, or other proper nouns.
Once a named entity has been identified, it must be matched to a corresponding node in a knowledge base in order to grow that knowledge base with primary source documentation and improve understanding of the node. The simplest way to link a named entity is by direct string matching.
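Direct string matching can be sketched as a dictionary lookup. The example below is plain Python and does not use spaCy's API; the dictionary KB and the helper link_by_string_match are hypothetical, and the identifiers are the ones used in the Ada Lovelace example later in this article.

```python
# Hypothetical knowledge base: normalized entity name -> identifier
KB = {
    "ada lovelace": "Q7259",
    "london": "Q84",
}


def link_by_string_match(mention, kb=KB):
    """Return the KB identifier for an exact (case-insensitive) match, else None."""
    return kb.get(mention.strip().lower())


print(link_by_string_match("Ada Lovelace"))
print(link_by_string_match("LONDON"))
print(link_by_string_match("Lovelace"))  # partial names are not matched
```

The last call shows the main limitation: any mention that is not spelled exactly like a stored name is left unlinked, and a name shared by several entities cannot be told apart by the string alone.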
To ground the named entities into the "real world", spaCy provides functionality to perform entity linking, which resolves a textual entity to a unique identifier from a knowledge base (KB). You can create your own KnowledgeBase and train a new EntityLinker using that custom knowledge base.
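To give a rough idea of what resolving a mention to a unique identifier involves when a name is ambiguous, here is a plain-Python sketch that picks the candidate with the highest prior probability. It does not use spaCy's KnowledgeBase or EntityLinker classes. The alias table, the probabilities and the identifier "Q_OTHER_LONDON" are all made up for illustration, and a trained linker would typically also weigh the sentence context rather than priors alone.

```python
# Hypothetical alias table: mention text -> candidate (kb_id, prior probability) pairs.
# "Q_OTHER_LONDON" and all probabilities are invented placeholders.
ALIASES = {
    "London": [("Q84", 0.9), ("Q_OTHER_LONDON", 0.1)],
    "Ada Lovelace": [("Q7259", 1.0)],
}


def most_likely_id(mention, aliases=ALIASES):
    """Pick the candidate identifier with the highest prior, or None if unknown."""
    candidates = aliases.get(mention, [])
    if not candidates:
        return None
    best_id, _prior = max(candidates, key=lambda pair: pair[1])
    return best_id


print(most_likely_id("London"))
print(most_likely_id("Ada Lovelace"))
print(most_likely_id("Paris"))
```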
Accessing entity identifiers (needs a trained model)
The annotated KB identifier is accessible as either a hash value or as a string, using the attributes ent.kb_id and ent.kb_id_ of a Span object, or the ent_kb_id and ent_kb_id_ attributes of a Token object.
    import spacy

    # "my_custom_el_pipeline" is a placeholder for a pipeline with a trained entity linker
    nlp = spacy.load("my_custom_el_pipeline")
    doc = nlp("Ada Lovelace was born in London")

    # Document level
    ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
    print(ents)  # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]

    # Token level: index into the Doc to get individual tokens
    ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
    ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
    ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
    print(ent_ada_0)  # ['Ada', 'PERSON', 'Q7259']
    print(ent_ada_1)  # ['Lovelace', 'PERSON', 'Q7259']
    print(ent_london_5)  # ['London', 'GPE', 'Q84']
I will share more, with real-world examples from my project, in my next article.