Introduction

The goal of named entity recognition (NER) is to classify each token (word) in a sentence into a certain class. The most common NER systems freely available on the Internet can identify PERSON, LOCATION, ORGANIZATION, etc. NER has several applications and can be a part of your NLP pipeline for numerous tasks. For example:

  • Identifying ingredients in a recipe to facilitate filtering of recipes by ingredients
  • Identifying names of people, locations, emails, bank accounts, etc. for data anonymization
  • Extracting addresses, contact details, etc. from texts
  • Extracting product attributes from product descriptions

As an example, consider a product title “Technos 39 Inch Curved Smart LED TV E39DU2000 With Wallmount”. The possible entities in this sentence could be

entity        value
------------  -------
brand         Technos
display_size  39 Inch
display_type  LED

Since existing NER models and openly available datasets might not be suitable for your task, you may need to create a dataset of your own. Compared to other problems such as classification, I find annotating data for NER quite daunting, and it usually requires GUI-based annotation tools. In this post, I will show how we can create a dataset for NER quite easily and train a model using the Hugging Face transformers library.

You will need to install the following libraries to follow along

pip install -q datasets transformers

Data preparation

To annotate data for NER, you need to specify which class each word in the sentence belongs to. Existing datasets available on the Internet come in various formats such as CoNLL, which I believe are not easy for humans to digest. I find the format used by Rasa quite easy for humans to create and read.

If we consider the example sentence from above, then our annotated sentence becomes

Original: Technos 39 Inch Curved Smart LED TV E39DU2000 With Wallmount

Annotated: [Technos](brand) [39 Inch](display_size) Curved Smart [LED](display_type) TV E39DU2000 With Wallmount

Another example,

Original: I come from Kathmandu valley, Nepal

Annotated: I come from [Kathmandu valley,](location) [Nepal](location)

The format is simple: you put the entity value inside square brackets and, immediately after the closing bracket, you specify the name of the entity inside parentheses.

The code below takes an annotated text as input and returns a list of tuples where the first item is the value of the entity and the second item is the entity name. If a token has not been annotated, it gets the class O to indicate that it does not belong to any entity.

import re
def get_tokens_with_entities(raw_text: str):
    # split the text by spaces only if the space does not occur between square brackets
    # we do not want to split "multi-word" entity value yet
    raw_tokens = re.split(r"\s(?![^\[]*\])", raw_text)

    # a regex for matching the annotation according to our notation [entity_value](entity_name)
    entity_value_pattern = r"\[(?P<value>.+?)\]\((?P<entity>.+?)\)"
    entity_value_pattern_compiled = re.compile(entity_value_pattern, flags=re.I|re.M)

    tokens_with_entities = []

    for raw_token in raw_tokens:
        match = entity_value_pattern_compiled.match(raw_token)
        if match:
            raw_entity_name, raw_entity_value = match.group("entity"), match.group("value")

            # we prefix the name of entity differently
            # B- indicates beginning of an entity
            # I- indicates the token is not a new entity itself but rather a part of existing one
            for i, raw_entity_token in enumerate(re.split(r"\s", raw_entity_value)):
                entity_prefix = "B" if i == 0 else "I"
                entity_name = f"{entity_prefix}-{raw_entity_name}"
                tokens_with_entities.append((raw_entity_token, entity_name))
        else:
            tokens_with_entities.append((raw_token, "O"))

    return tokens_with_entities

Let’s try some inputs

print(get_tokens_with_entities("I come from [Kathmandu valley,](location) [Nepal](location)"))
# [('I', 'O'), ('come', 'O'), ('from', 'O'), ('Kathmandu', 'B-location'), ('valley,', 'I-location'), ('Nepal', 'B-location')]

print(get_tokens_with_entities("[Technos](brand) [39 Inch](display_size) Curved Smart [LED](display_type) TV E39DU2000 With Wallmount"))
# [('Technos', 'B-brand'), ('39', 'B-display_size'), ('Inch', 'I-display_size'), ('Curved', 'O'), ('Smart', 'O'), ('LED', 'B-display_type'), ('TV', 'O'), ('E39DU2000', 'O'), ('With', 'O'), ('Wallmount', 'O')]

So far it looks good. We can have entity values that span multiple words, and we can have any kind of entity names.

But we are still not done. Transformer models typically use a limited vocabulary and therefore cannot know every word in existence. If some words in our dataset are unknown to the model, they get split into multiple “sub-words”. Different models use different tokenization schemes such as WordPiece, Byte-Pair Encoding, etc. If a token from our annotation is split into multiple sub-words, our annotation becomes misaligned. We need to take care of this as well. Let me show you an example of what I mean.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# note that I purposefully misspell Kathmandu as Kathmanduu
sample_input = "I come from [Kathmanduu valley,](location) [Nepal](location)"
tokens, entities = list(zip(*get_tokens_with_entities(sample_input)))
tokenized_input = tokenizer(tokens, is_split_into_words=True)
print("Original tokens           : ", tokens)
print("After subword tokenization: ", tokenizer.convert_ids_to_tokens(tokenized_input['input_ids']))
# Original tokens           :  ('I', 'come', 'from', 'Kathmanduu', 'valley,', 'Nepal')
# After subword tokenization:  ['[CLS]', 'i', 'come', 'from', 'kathmandu', '##u', 'valley', ',', 'nepal', '[SEP]']

We can see from the output that after tokenization the number of tokens is different from our original list of tokens. Depending on the tokenizer model we use, it adds several “special tokens” at the beginning or at the end. Also note that the tokenizer does not know the word “Kathmanduu”, so it split it into two tokens, “kathmandu” and “##u”. We need to align the labels from the original token/label pairs to these new tokens. This is also explained here.
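To make the idea concrete, here is a minimal sketch of the usual alignment strategy, assuming a fast tokenizer so that word_ids() is available (the helper name align_labels_with_tokens is just for illustration): special tokens get the label -100, which the loss function ignores, the first sub-word of each word keeps the word's label, and any remaining sub-words are ignored as well.

def align_labels_with_tokens(tokens, labels, tokenizer):
    # tokens: word-level tokens, labels: one label id per word-level token
    tokenized = tokenizer(list(tokens), truncation=True, is_split_into_words=True)
    # word_ids() maps each sub-word to its original word index (None for special tokens)
    word_ids = tokenized.word_ids()

    aligned_labels = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # special tokens such as [CLS] and [SEP] are ignored by the loss
            aligned_labels.append(-100)
        elif word_id != previous_word_id:
            # the first sub-word of a word keeps that word's label
            aligned_labels.append(labels[word_id])
        else:
            # remaining sub-words of the same word are also ignored by the loss
            aligned_labels.append(-100)
        previous_word_id = word_id

    tokenized["labels"] = aligned_labels
    return tokenized

If I remember correctly, this is essentially the same strategy used in the Hugging Face token classification examples: labelling only the first sub-word keeps long words from being over-weighted in the loss.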

To make things easier, I created a class called NERDataMaker which takes care of all the stuff mentioned above and returns a datasets.Dataset object which can be passed directly to Hugging Face's Trainer class. You can find the implementation in this gist.

For this demo, I’ve created a small dataset to extract product attributes from product descriptions posted on e-commerce websites.

raw_text = """
[40"](display_size) [LED](display_type) TV
Specifications: [16″](display_size) HD READY [LED](display_type) TV.
[1 Year](warranty) Warranty
Rowa [29"](display_size) [LED](display_type) TV
Highlights:- 48"Full HD [LED](display_type) TV Triple Protection
[80cm](display_size) (32) HD Flat TV K4000 Series 4
[32"](display_size) LED, [2 yrs](warranty) full warranty, All care protection, Integrated Sound Station- Tweeter/20w, Family tv 2.0, Louvre Desing, Mega dynamic contract ratio, Hyper real engine, USB movie
CG 32D0003 [LED](display_type) TV
Screen Size : [43″](display_size)
Resolution : 1920*1080p
Response time : [8ms](response_time)
USB : Yes (Music+Photo+Movie)
Analog AV Out : Yes
Power Supply : 110~240V 50-60Hz
WEGA [32 Inch](display_size) SMART DLED TV HI Sound Double Glass - (Black)
Model: [32"](display_size) Smart DLED TV HI Sound
Hisense HX32N2176 [32"Inch](display_size) Full HD [Led](display_type) Tv
[32 Inch](display_size) [1366x768](display_resolution) pixels HD LED TV
[43 inch](display_size) [LED](display_type) TV
[2 Years](warranty) Warranty & 1 Year Service Warranty
[1920 X 1080](display_resolution) Full HD
[Technos](brand) [39 Inch](display_size) Curved Smart [LED](display_type) TV E39DU2000 With Wallmount
24″ Led Display Stylish Display Screen resolution : [1280 × 720](display_resolution) (HD Ready) USB : Yes VGS : Yes
Technos 24K5 [24 Inch](display_size) LED TV
Technos Led Tv [18.5″ Inch](display_size) (1868tw)
[18.5 inch](display_size) stylish LED dsiplay [1280 x 720p](display_resolution) HD display 2 acoustic speaker USB and HDMI port Technos brand
15.6 ” Led Display Display Screen resolution : 1280 720 (HD Ready) USB : Yes VGS : Yes HDMI : Yes Screen Technology : [led](display_type)
Model:CG55D1004U
Screen Size: [55"](display_size)
Resolution: [3840x2160p](display_resolution)
Power Supply: 100~240 V/AC
Sound Output (RMS): 8W + 8W
Warranty: [3 Years](warranty) wrranty
"""

dm = NERDataMaker(raw_text.split("\n"))
print(f"total examples = {len(dm)}")
print(dm[0:3])

# total examples = 35
# [{'id': 0, 'ner_tags': [0], 'tokens': ['']}, {'id': 1, 'ner_tags': [2, 3, 0], 'tokens': ['40"', 'LED', 'TV']}, {'id': 2, 'ner_tags': [0, 2, 0, 0, 3, 0], 'tokens': ['Specifications:', '16″', 'HD', 'READY', 'LED', 'TV.']}]

Now that we have our “data maker” ready, we can finally train the model.

Model training

For this demo, I’ll use the distilbert-base-uncased model. The dm object contains a few properties which we pass to the AutoModelForTokenClassification.from_pretrained method.

from transformers import AutoTokenizer, DataCollatorForTokenClassification, AutoModelForTokenClassification, TrainingArguments, Trainer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(dm.unique_entities), id2label=dm.id2label, label2id=dm.label2id)

Finally, we can configure the training arguments, create a datasets.Dataset object, and instantiate a Trainer object to train the model. I am evaluating on the training data just for this demo; please create a proper dataset for evaluation.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=40,
    weight_decay=0.01,
)

train_ds = dm.as_hf_dataset(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=train_ds, # eval on training set! ONLY for DEMO!!
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

The “validation loss” decreased to around 0.03 after 40 epochs. Since the validation loss here is calculated on the training data itself, don’t take this number as the actual performance of the model on unseen data. I posted it only so that you can compare results if you are following along.
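If you do evaluate on a separate split, you can also pass a compute_metrics function to the Trainer to get entity-level scores instead of just the loss. Below is a minimal sketch using the seqeval library (you would need to pip install seqeval first, and it assumes the dm.id2label mapping from above); it drops the -100 positions before scoring.

import numpy as np
from seqeval.metrics import f1_score

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # keep only positions that are not -100 (special tokens, non-first sub-words, padding)
    true_labels = [
        [dm.id2label[l] for l in label if l != -100]
        for label in labels
    ]
    true_predictions = [
        [dm.id2label[p] for p, l in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return {"f1": f1_score(true_labels, true_predictions)}

# then pass it when constructing the Trainer:
# trainer = Trainer(..., compute_metrics=compute_metrics)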

To use the trained model for inference, we will use pipeline from the transformers library to easily get the predictions.

from transformers import pipeline
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") # pass device=0 if using gpu
pipe("""2 year warrantee Samsung 40 inch LED TV, 1980 x 1080 resolution""")
[{'end': 6,
  'entity_group': 'warranty',
  'score': 0.53562486,
  'start': 0,
  'word': '2 year'},
 {'end': 32,
  'entity_group': 'display_size',
  'score': 0.92803776,
  'start': 25,
  'word': '40 inch'},
 {'end': 36,
  'entity_group': 'display_type',
  'score': 0.7992602,
  'start': 33,
  'word': 'led'},
 {'end': 52,
  'entity_group': 'display_resolution',
  'score': 0.7081752,
  'start': 41,
  'word': '1980 x 1080'}]

Even though I purposefully misspelled the word “warranty” as “warrantee”, the model was still able to figure out that the warranty of this product is “2 year”. I think the results are promising, and with a sufficiently large number of training examples we can create robust NER models that handle noisy data.

Conclusion

In this post we created a simple and easy way to annotate our data for NER and also solved the problem of label misalignment caused by the sub-word tokenization schemes that many transformer models use. Finally, we trained the model using the Trainer class and used pipeline to easily run inference with the trained model.

If you liked this post then please share it with others. If there are any errors please let me know.
