Named Entity Linking
In my previous article, I went over NED and the general steps that come after Named Entity Recognition (NER). NER is a fundamental Natural Language Processing (NLP) task and has a wide range of use cases.
Before looking into Named Entity Linking (NEL), let us first understand information extraction. According to Wikipedia,
“Information extraction is a task of automatically extracting structured information from unstructured and/or semi-structured documents. In most of the cases, this activity concerns processing human language texts by means of NLP.”
A Closer Look at Named Entity Linking
Information extraction comprises multiple sub-tasks. In most cases, we will have the following sub-tasks.
- Named Entity Recognition (NER)
- Named Entity Linking (NEL)
- Relation Extraction
NER identifies named entity occurrences in text and classifies them into pre-defined categories. NER is modeled as a task of assigning a tag to each word in a sentence. Below is an example result from an NER system.
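As a rough illustration of the tagging view of NER, the sketch below hand-assigns BIO tags (B = beginning of an entity, I = inside, O = outside) to an assumed example sentence and groups them into entity spans. The sentence, the tags, and the helper `bio_to_spans` are illustrative assumptions, not the output of any particular NER system.

```python
from typing import List, Tuple

def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        if tag == "O":
            current_tokens, current_type = [], None
        else:
            # "B-X" (or a stray "I-X") starts a new entity of type X.
            current_tokens, current_type = [token], tag[2:]
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

# Hand-written tags for an assumed sentence; one tag per token.
tokens = ["Sebastian", "Thrun", "worked", "on", "self-driving", "cars", "at", "Google", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O"]

entities = bio_to_spans(tokens, tags)
```

Here `entities` holds the pairs ("Sebastian Thrun", "PER") and ("Google", "ORG"): the words that form entities, together with their types.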
NER tells us which words are entities and what their types are. In the above example, NER will locate “Sebastian Thrun” as a person. But we still don’t know exactly which “Sebastian Thrun” the text is talking about. NEL is the next task, and it answers this question.
NEL is the task of linking entity mentions in text to their corresponding entities in a knowledge base. In our example, we can find out exactly which “Sebastian Thrun” is meant by linking the entities to DBpedia. DBpedia is a structured knowledge base extracted from Wikipedia. The process of linking entities to Wikipedia is also called Wikification.
NEL is also referred to as Entity Linking, Named Entity Disambiguation (NED), Named Entity Recognition and Disambiguation (NERD), or Named Entity Normalization (NEN). NEL has a wide range of applications beyond Information Extraction. There are many libraries available to implement NEL, but here we are going to use DBpedia Spotlight, so the target knowledge base for NEL is DBpedia. DBpedia Spotlight is a system for automatically annotating text documents with DBpedia URIs, developed as a step towards interconnecting the Web of Documents with the Web of Data.
DBpedia Spotlight is deployed as a Web Service, and we can use the provided Spotlight API to perform NEL. You can even check the status of the DBpedia Spotlight server here.
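Here is a minimal sketch of calling the annotation endpoint with only the Python standard library. It assumes the public demo endpoint at `https://api.dbpedia-spotlight.org/en/annotate` and a JSON response containing a `Resources` list whose items carry `@surfaceForm` and `@URI` keys; check both against the current Spotlight documentation. The helper names `build_request`, `extract_links`, and `annotate` are my own.

```python
import json
import urllib.parse
import urllib.request

# Assumed public demo endpoint for English; replace with your own server's URL.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def build_request(text, confidence=0.5, url=SPOTLIGHT_URL):
    """Build a GET request asking Spotlight for JSON annotations of `text`."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(f"{url}?{query}", headers={"Accept": "application/json"})

def extract_links(response):
    """Return (surface form, DBpedia URI) pairs from a parsed JSON response."""
    # "Resources" is assumed to be absent when nothing was linked, hence the default.
    return [(r["@surfaceForm"], r["@URI"]) for r in response.get("Resources", [])]

def annotate(text, confidence=0.5):
    """Send the request over the network and return the linked entities."""
    with urllib.request.urlopen(build_request(text, confidence), timeout=10) as resp:
        return extract_links(json.load(resp))
```

Calling `annotate("Sebastian Thrun worked on self-driving cars at Google.")` should return pairs of surface forms and DBpedia URIs, although the exact entities depend on the server's model and on the chosen confidence threshold.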
As you can see in the above example, DBpedia Spotlight links the located entities to the DBpedia knowledge base and returns annotated text. Spotlight supports many languages and multiple response content types, including HTML, JSON, XML, and N-Triples. If you are not comfortable with the Spotlight API, you can use one of the publicly available wrappers written around DBpedia Spotlight’s REST interface. One such wrapper is pyspotlight. For any significant Spotlight usage, it is strongly recommended to run your own server. Please follow the installation instructions for running Spotlight on your own server.
NEL is not a trivial task because of two problems: name variation and ambiguity. Name variation means that an entity can be mentioned in different ways. For example, the entity Michael Jeffrey Jordan can be referred to by numerous names, such as Michael Jordan, MJ, and Jordan. Ambiguity means that the same name may refer to different entities depending on the context.
In general, a typical entity linking system consists of three modules, namely Candidate Entity Generation, Candidate Entity Ranking, and Unlinkable Mention Prediction. A brief description of each module is given below.
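Both problems can be seen in a small alias table. The sketch below uses a hand-made mapping from entities to the names used for them (illustrative data, not pulled from DBpedia) and inverts it to find the names that point to more than one entity.

```python
from collections import defaultdict

# Illustrative, hand-made data: each entity and some names used for it.
entity_names = {
    "Michael_Jordan": ["Michael Jeffrey Jordan", "Michael Jordan", "MJ", "Jordan"],
    "Michael_I._Jordan": ["Michael I. Jordan", "Michael Jordan"],
    "Jordan": ["Jordan", "Hashemite Kingdom of Jordan"],
}

# Name variation: one entity, many names (the mapping above).
# Ambiguity: one name, many entities (the inverted mapping below).
name_entities = defaultdict(set)
for entity, names in entity_names.items():
    for name in names:
        name_entities[name].add(entity)

# Keep only the names that map to more than one entity.
ambiguous = {name: sorted(ents) for name, ents in name_entities.items() if len(ents) > 1}
```

With this data, "Michael Jordan" and "Jordan" both end up in `ambiguous`, while "MJ" does not, because only one entity in the table uses it.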
- Candidate Entity Generation: In this module, the NEL system retrieves a set of candidate entities by filtering out the irrelevant entities in the knowledge base. The retrieved set contains the entities that the mention could possibly refer to.
- Candidate Entity Ranking: Here, different kinds of evidence are leveraged to rank the candidate entities and find the most likely entity for the mention.
- Unlinkable Mention Prediction: This module validates whether the top-ranked entity from the previous module is really the target entity for the given mention. If it is not, the module returns NIL for the mention. In other words, this module deals with unlinkable mentions.
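The three modules above can be sketched as a toy pipeline. This is a simplified illustration under my own assumptions, not how DBpedia Spotlight or any specific system works: the knowledge base is a tiny hand-made dictionary, ranking is plain word overlap between the mention's context and a few description keywords, and the NIL threshold `min_score` is arbitrary.

```python
# Toy knowledge base: entity id -> (known names, description keywords). Hand-made.
KB = {
    "Michael_Jordan": ({"michael jordan", "mj", "jordan"},
                       {"basketball", "nba", "bulls", "player"}),
    "Michael_I._Jordan": ({"michael jordan", "michael i. jordan"},
                          {"machine", "learning", "professor", "berkeley"}),
    "Jordan": ({"jordan"}, {"country", "amman", "kingdom", "middle", "east"}),
}

def generate_candidates(mention):
    """Candidate Entity Generation: keep entities whose known names include the mention."""
    key = mention.lower()
    return [entity for entity, (names, _) in KB.items() if key in names]

def rank_candidates(candidates, context):
    """Candidate Entity Ranking: score each candidate by word overlap with the context."""
    words = set(context.lower().replace(".", " ").replace(",", " ").split())
    scored = [(len(words & KB[entity][1]), entity) for entity in candidates]
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

def link(mention, context, min_score=1):
    """Unlinkable Mention Prediction: return None (NIL) if the best score is too low."""
    ranked = rank_candidates(generate_candidates(mention), context)
    if not ranked or ranked[0][0] < min_score:
        return None
    return ranked[0][1]
```

For example, `link("Michael Jordan", "Michael Jordan won six NBA titles with the Bulls.")` picks the basketball player, the same mention in a sentence about machine learning at Berkeley picks the professor, and a context with no matching keywords yields `None`, which plays the role of NIL.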
NEL is an essential NLP task that deserves more attention. Recently, people have started using deep learning techniques to improve the performance of NEL systems on standard datasets. I believe the massive amount of Linked Open Data available today provides an incredible opportunity for tomorrow’s Artificial Intelligence. Given NEL’s role in Information Extraction and the Semantic Web, we need to work more on topics like these.
I hope you find this article helpful. Please contact me with any questions.