Named Entity Recognition

A named entity is an object that’s assigned a name — for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

  • Text: The original entity text.
  • Start: Character index of the start of the entity in the Doc.
  • End: Character index of the end of the entity in the Doc.
  • Label: Entity label, i.e. type.

TEXT         START   END   LABEL   DESCRIPTION
Apple        0       5     ORG     Companies, agencies, institutions.
U.K.         27      31    GPE     Geopolitical entity, i.e. countries, cities, states.
$1 billion   44      54    MONEY   Monetary values, including unit.
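As a quick sanity check, the start and end values are character offsets into the original string, with the end offset exclusive, so slicing the raw text with them should give back each entity. The sketch below is plain Python and hard-codes the offsets reported above instead of calling the model, so it only illustrates how to read the numbers:

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char) pairs copied from the output above
offsets = [(0, 5), (27, 31), (44, 54)]

# Slicing with [start:end] recovers the entity text
recovered = [text[start:end] for start, end in offsets]
for entity_text in recovered:
    print(entity_text)
```

The same slicing works on doc.text for any entity, which is handy when you need to map predictions back onto the source string.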

Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its named entities look like:

Accessing entity annotations and labels

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.

You can also access token entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether an entity starts, continues or ends on the tag. If no entity type is set on a token, token.ent_type_ will return an empty string.

IOB SCHEME

  • I – Token is inside an entity.
  • O – Token is outside an entity.
  • B – Token is the beginning of an entity.

BILUO SCHEME

  • B – Token is the beginning of a multi-token entity.
  • I – Token is inside a multi-token entity.
  • L – Token is the last token of a multi-token entity.
  • U – Token is a single-token unit entity.
  • O – Token is outside an entity.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)  # ['San', 'B', 'GPE']
print(ent_francisco)  # ['Francisco', 'I', 'GPE']

TEXT        ENT_IOB   ENT_IOB_   ENT_TYPE_   DESCRIPTION
San         3         B          "GPE"       beginning of an entity
Francisco   1         I          "GPE"       inside an entity
considers   2         O          ""          outside an entity
banning     2         O          ""          outside an entity
sidewalk    2         O          ""          outside an entity
delivery    2         O          ""          outside an entity
robots      2         O          ""          outside an entity
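The relationship between the IOB and BILUO schemes can be sketched in plain Python. The helper below is hypothetical (it is not spaCy's own implementation) and assumes tags are written in the common "B-GPE" / "I-GPE" / "O" form, with the entity label after a hyphen:

```python
def iob_to_biluo(tags):
    """Convert IOB tags such as "B-GPE", "I-GPE", "O" to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append("O")
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            biluo.append(f"B-{label}" if continues else f"U-{label}")
        else:  # prefix == "I"
            biluo.append(f"I-{label}" if continues else f"L-{label}")
    return biluo


# Tags for "San Francisco considers banning sidewalk delivery robots"
iob_tags = ["B-GPE", "I-GPE", "O", "O", "O", "O", "O"]
biluo_tags = iob_to_biluo(iob_tags)
print(biluo_tags)
```

The only extra information BILUO carries is where an entity ends, which is why the conversion needs to look one tag ahead.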

Setting entity annotations

To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level. However, you can’t write directly to the token.ent_iob or token.ent_type attributes, so the easiest way to set entities is to use the doc.set_ents function and create the new entity as a Span.

import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# The model didn't recognize "fb" as an entity :(

# Create a span for the new entity
fb_ent = Span(doc, 0, 1, label="ORG")
orig_ents = list(doc.ents)

# Option 1: Modify the provided entity spans, leaving the rest unmodified
doc.set_ents([fb_ent], default="unmodified")

# Option 2: Assign a complete list of ents to doc.ents
doc.ents = orig_ents + [fb_ent]

ents = [(e.text, e.start, e.end, e.label_) for e in doc.ents]
print('After', ents)
# [('fb', 0, 1, 'ORG')]

Keep in mind that Span is initialized with the start and end token indices, not the character offsets. To create a span from character offsets, use Doc.char_span:

fb_ent = doc.char_span(0, 2, label="ORG")
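To make the difference between token indices and character offsets concrete, here is a minimal sketch. It assumes a blank English pipeline (spacy.blank("en")), which only tokenizes and so needs no trained model; the second call is added purely for illustration and shows that, with the default settings, offsets that do not line up with token boundaries give no span:

```python
import spacy

# A blank English pipeline only tokenizes, so no trained model is needed
nlp = spacy.blank("en")
doc = nlp("fb is hiring a new vice president of global policy")

# Characters 0-2 cover exactly the first token, "fb"
fb_ent = doc.char_span(0, 2, label="ORG")
print(fb_ent.text, fb_ent.start, fb_ent.end, fb_ent.label_)

# Offsets that cut through a token do not map to a span
misaligned = doc.char_span(0, 1, label="ORG")
print(misaligned)
```

Checking the result for None before assigning it to doc.ents is a cheap guard when the offsets come from an external annotation tool.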

Setting entity annotations from array

You can also assign entity annotations using the doc.from_array method. To do this, you should include both the ENT_TYPE and the ENT_IOB attributes in the array you’re importing from.

import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("London is a big city in the United Kingdom.")
print("Before", doc.ents)  # []

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
print("After", doc.ents)  # [London]

Visualizing named entities

The displaCy ENT visualizer lets you explore an entity recognition model’s behavior interactively. If you’re training a model, it’s very useful to run the visualization yourself. To help you do that, spaCy comes with a visualization module. You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.
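If you only want the raw markup, a sketch along the following lines may help. It assumes displaCy's manual mode, where entities are passed as a dictionary of character offsets instead of a processed Doc, so no trained pipeline is loaded; the entity spans and labels here are written by hand for illustration and are not model predictions:

```python
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007"


def mark(phrase, label):
    """Build a displaCy entity dict from the first occurrence of a phrase."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


# Hand-written annotations, listed in order of appearance
example = {
    "text": text,
    "ents": [
        mark("Sebastian Thrun", "PERSON"),
        mark("Google", "ORG"),
        mark("2007", "DATE"),
    ],
    "title": None,
}

# jupyter=False forces the function to return the HTML string
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html[:200])
```

The returned string can be written to an .html file or embedded in a report, which is often more convenient than starting the web server.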

For more details and examples, see the usage guide on visualizing spaCy.

NAMED ENTITY EXAMPLE

import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")

Entity Linking (NED)

To better understand the goal of Named Entity Disambiguation, it is helpful to place it as the last step in a longer NLP pipeline that begins with Named Entity Recognition. Named Entity Recognition identifies within a unit of text the word or words that represent an entity or object — such as names, organizations, locations, or other proper nouns.

Once a named entity has been identified, it must be matched to a corresponding node in a knowledge base in order to grow that knowledge base with primary source documentation and improve understanding of the node. The simplest way to link a named entity is by direct string matching.
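Direct string matching can be sketched in a few lines of plain Python. The knowledge base below is a toy dictionary invented for illustration (its identifiers mirror the ones used in the example later in this article), and link_by_string_match is a hypothetical helper, not part of spaCy:

```python
# Toy knowledge base: normalized surface form -> identifier (illustrative data)
knowledge_base = {
    "ada lovelace": "Q7259",
    "london": "Q84",
}


def link_by_string_match(mention, kb):
    """Return the KB identifier for a mention, or None if there is no exact match."""
    return kb.get(mention.strip().lower())


mentions = ["Ada Lovelace", "London", "Lovelace"]
links = {mention: link_by_string_match(mention, knowledge_base) for mention in mentions}
print(links)
```

Note that the partial mention "Lovelace" gets no link at all, and an ambiguous name would map to a single entry regardless of context. Those are the gaps a trained entity linker is meant to close.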

To ground the named entities into the “real world”, spaCy provides functionality to perform entity linking, which resolves a textual entity to a unique identifier from a knowledge base (KB). You can create your own KnowledgeBase and train a new EntityLinker using that custom knowledge base.

Accessing entity identifiers

The annotated KB identifier is accessible as either a hash value or as a string, using the attributes ent.kb_id and ent.kb_id_ of a Span object, or the ent_kb_id and ent_kb_id_ attributes of a Token object.

import spacy

nlp = spacy.load("my_custom_el_pipeline")
doc = nlp("Ada Lovelace was born in London")
# Document level
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents) # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]
# Token level
ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
print(ent_ada_0) # ['Ada', 'PERSON', 'Q7259']
print(ent_ada_1) # ['Lovelace', 'PERSON', 'Q7259']
print(ent_london_5) # ['London', 'GPE', 'Q84']

I will share more, with real-time examples from my project, in my next article.

Data Science student @Flatiron-School