Named Entity Recognition (NER) is the task of finding entities of a fixed, pre-defined set of types, such as person, organization and location, in a text. Typically, an NER system takes unstructured text as input and marks the entities it contains. An entity can be a single token (word) or can span multiple tokens. For example, Ghana is a location entity, and Microsoft Corp. is an organization entity consisting of multiple tokens.
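One common way to represent single-token and multi-token entities is BIO tagging, where each token is tagged as the beginning of an entity (B-), inside one (I-) or outside any entity (O). The scheme is not mentioned above and is only one of several conventions; the sketch below assumes it simply to show how token-level tags can be grouped back into entities. The function name `bio_to_entities` and the sample tags are illustrative.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # Continuation of the current multi-token entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) ends the entity in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


tokens = ["Microsoft", "Corp.", "opened", "an", "office", "in", "Ghana"]
tags = ["B-ORG", "I-ORG", "O", "O", "O", "O", "B-LOC"]
print(bio_to_entities(tokens, tags))
```

Here "Microsoft Corp." comes back as one two-token entity and "Ghana" as a single-token one.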
There are many libraries available that perform NER out of the box. Most of them can detect persons, organizations, locations, times and money amounts. If your requirements are different, you need to train your own NER model. Fortunately, many libraries also provide an API for training custom models; the difficult part is collecting the data. I’ll list a few libraries that you can use for NER. Note that these libraries can do much more than just NER.
- NLTK: Not well suited to production environments, but it provides a lot of algorithms and features to try out.
- spaCy: Fast and efficient algorithms, but it provides fewer features than the others.
- Stanford CoreNLP: Written in Java. Its NER is based on Conditional Random Fields.
- Apache OpenNLP: Written in Java. Supports NER along with many other NLP tasks.
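If you do end up training your own model, the annotated data is usually stored as text plus character offsets for each entity. As a rough sketch, the snippet below builds records in the `(text, {"entities": [(start, end, label)]})` layout used in spaCy's training examples and prints the text that each offset pair selects. The `GADGET` label, the sample sentences and the `find_entity` helper are made up for illustration.

```python
def find_entity(text, phrase, label):
    """Return a (start, end, label) annotation for the first occurrence of phrase."""
    start = text.index(phrase)  # raises ValueError if the phrase is missing
    return (start, start + len(phrase), label)


# Hypothetical raw annotations: each text with the phrases to mark and their labels.
texts_and_phrases = [
    ("I just bought a Pixel 3 from the store", [("Pixel 3", "GADGET")]),
    ("Microsoft Corp. opened an office in Ghana",
     [("Microsoft Corp.", "ORG"), ("Ghana", "GPE")]),
]

train_data = []
for text, phrases in texts_and_phrases:
    entities = [find_entity(text, phrase, label) for phrase, label in phrases]
    train_data.append((text, {"entities": entities}))

# Sanity check: slicing the text with each offset pair should give the entity back.
for text, annotations in train_data:
    for start, end, label in annotations["entities"]:
        print(repr(text[start:end]), label, (start, end))
```

Off-by-one offsets are an easy mistake in hand-labelled data, so a check like the final loop is worth running before any training.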
In this post we’ll use the spaCy library. If you haven’t installed it already, install it with
pip install spacy
python -m spacy download en_core_web_sm
This installs spaCy and downloads a small English model. (Older spaCy 2.x releases also accepted the shortcut name en; spaCy 3 requires the full package name.) spaCy provides a number of models trained on different datasets and in different languages. Check them out at https://spacy.io/usage/models
Now we’ll import the library, load the English model and perform NER on a sentence.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Google is going to open a new office in China")
for entity in doc.ents:
    print(entity.text, entity.label_)
Google ORG
China GPE
All we needed to do was create a doc of type spacy.tokens.doc.Doc by calling the nlp object on the text. A spaCy document holds much more information about the text than just its entities; it represents a processed version of the input text. See the spaCy API documentation for more on Doc objects. To get the entities, we iterate over the document’s ents attribute. Each entity is of type spacy.tokens.span.Span; its attributes are described in the spaCy API documentation as well. Span has a number of fields that we can use. Here we’re using the .text and .label_ fields.
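Besides .text and .label_, a Span also records where the entity sits in the document. The sketch below prints a few of those attributes. To keep it runnable without downloading a trained model, it assumes a blank English pipeline created with `spacy.blank("en")` and sets the two entities by hand; a trained pipeline would fill in `doc.ents` itself.

```python
import spacy
from spacy.tokens import Span

# A blank English pipeline only tokenizes; it has no trained NER component,
# so the entities are set by hand here purely to show the Span attributes.
nlp = spacy.blank("en")
doc = nlp("Google is going to open a new office in China")
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 9, 10, label="GPE")]

for entity in doc.ents:
    # start/end are token indices; start_char/end_char are character offsets.
    print(entity.text, entity.label_, entity.start, entity.end,
          entity.start_char, entity.end_char)
```

The character offsets are handy when you need to highlight entities in the original string, since `doc.text[entity.start_char:entity.end_char]` gives back the entity text.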
To find out what the entity labels mean, check the annotation scheme in the spaCy documentation. I’ll just list a few that popped up in this experiment.
label | description |
---|---|
PERSON | Person |
GPE | Countries, cities, states. |
NORP | Nationalities or religious or political groups. |
ORG | Companies, agencies, institutions, etc. |
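Label descriptions can also be looked up from code. As a small sketch, `spacy.explain` returns a short description string for a label it knows about; the exact wording it returns may differ slightly from the table above.

```python
import spacy

for label in ["PERSON", "GPE", "NORP", "ORG"]:
    # spacy.explain returns a short description, or None for an unknown label.
    print(label, "|", spacy.explain(label))
```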
Let’s look at some more examples
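The listing for this step is not reproduced here, so the following is only a sketch of a loop that would produce output in the same shape as below: the text, one line per entity, then a row of asterisks. The function name `format_entities` is hypothetical, and `nlp` is assumed to be any callable that returns an object with an `ents` attribute, such as the spaCy pipeline loaded earlier.

```python
def format_entities(nlp, texts, separator="*" * 20):
    """Return output lines: each text, its entities, then a separator row."""
    lines = []
    for text in texts:
        doc = nlp(text)
        lines.append(text)
        for entity in doc.ents:
            # One line per entity: its text followed by its label.
            lines.append(f"{entity.text} {entity.label_}")
        lines.append(separator)
    return lines
```

With a loaded pipeline and a list of headlines in a variable such as `headlines`, calling `print("\n".join(format_entities(nlp, headlines)))` would print the report.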
Donald Trump was heard off camera saying "get me out of here" to an aide as he left the G20 summit in Buenos Aires, leaving the president of Argentina standing alone on stage
Donald Trump PERSON
Buenos Aires PERSON
Argentina GPE
********************
Supermarket ban sees '80% drop' in plastic bag consumption nationwide in Australia
80% PERCENT
Australia GPE
********************
Putin refuses to release Ukrainian sailors and ships
Putin PERSON
Ukrainian NORP
********************
Abandoned coal mines across the UK could be brought back to life as huge underground farms,according to academics. The initiative is seen as a way of providing large-scale crop production for a growing global pop. Advocates say subterranean farms could yield up to 10 times as much as farms above gnd
UK GPE
up to 10 CARDINAL
********************
Nearly 70 percent of plastic sent to be recycled in Japan is burned in a method called “thermal recycling” or “heat recovery,” prompting specialists to call for a review of the system, which they say contributes to global warming.
Nearly 70 percent PERCENT
Japan GPE
********************
U.S. Defense Secretary Jim Mattis accused Russian President Vladimir Putin on Saturday of being a “slow learner” who again tried to meddle in U.S. elections in November, adding that he had no trust in the Russian leader.
U.S. GPE
Defense ORG
Jim Mattis PERSON
Russian NORP
Vladimir Putin PERSON
Saturday DATE
U.S. GPE
November DATE
Russian NORP
********************
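One simple thing to do with extracted entities is to tally them by label. The sketch below does this for the (text, label) pairs printed for the last example above; `count_labels` is an illustrative helper, not part of spaCy.

```python
from collections import Counter


def count_labels(entities):
    """Count how many entities were found for each label."""
    return Counter(label for _, label in entities)


# (text, label) pairs copied from the output of the last example.
entities = [
    ("U.S.", "GPE"), ("Defense", "ORG"), ("Jim Mattis", "PERSON"),
    ("Russian", "NORP"), ("Vladimir Putin", "PERSON"), ("Saturday", "DATE"),
    ("U.S.", "GPE"), ("November", "DATE"), ("Russian", "NORP"),
]

for label, count in count_labels(entities).most_common():
    print(label, count)
```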
spaCy makes it really easy to perform common NLP tasks. It supports other features like classification, word vectors and similarity as well. Check out their usage docs for more details.