<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.1.1">Jekyll</generator><link href="https://sanjayasubedi.com.np/feed.xml" rel="self" type="application/atom+xml" /><link href="https://sanjayasubedi.com.np/" rel="alternate" type="text/html" /><updated>2024-10-07T19:03:30+00:00</updated><id>https://sanjayasubedi.com.np/feed.xml</id><title type="html">Sanjaya’s Blog</title><subtitle>Blog about deep learning, big data and programming</subtitle><author><name>Sanjaya Subedi</name></author><entry><title type="html">How Does Triplet Loss and Online Triplet Mining Work?</title><link href="https://sanjayasubedi.com.np/deeplearning/online-triplet-mining/" rel="alternate" type="text/html" title="How Does Triplet Loss and Online Triplet Mining Work?" /><published>2024-10-06T14:22:00+00:00</published><updated>2024-10-06T14:22:00+00:00</updated><id>https://sanjayasubedi.com.np/deeplearning/online-triplet-mining</id><content type="html" xml:base="https://sanjayasubedi.com.np/deeplearning/online-triplet-mining/"><![CDATA[<h1 id="introduction">Introduction</h1>
<p>Metric learning is an approach to training a model such that similar items are closer to each other than dissimilar items in the vector space learned by the model. For example, given a photo of a dog as input, the distance to another photo of a dog should be smaller than the distance to photos of other animals. In a supervised classification problem, we aim to assign an item to one of a predefined set of classes, but in metric learning, we aim to learn representations that capture the underlying structure of the items. This is useful for applications where similarity between items is the main component, e.g. semantic search, clustering, recommender systems etc. You can read more about this topic in the paper <a href="https://arxiv.org/abs/1812.05944">A Tutorial on Distance Metric Learning</a>.</p>

<p>For example, consider the embeddings of news articles produced by a pre-trained model versus those produced by a model trained using the approach I’ll describe in this post. The difference is clear: the new embeddings are going to be more useful for tasks like clustering than the old ones.</p>

<p><img src="/assets/images/deep-learning/triplet-mining/news_embeddings.png" alt="news embeddings" /></p>

<p>When training such a model, we use loss functions designed for metric learning, such as Contrastive loss, Triplet loss etc. There are open-source libraries that already implement many of these. One of them is <a href="https://kevinmusgrave.github.io/pytorch-metric-learning/">pytorch-metric-learning</a>. For this post, I’ll focus on Triplet loss.</p>

<p>As the name indicates, triplet loss expects a list of triplets to compute the loss. Each triplet contains an Anchor, a Positive and a Negative.</p>

<p><strong>Anchor</strong>: This is the embedding of the item of concern, e.g. a photo of a dog.</p>

<p><strong>Positive</strong>: This is the embedding of another item in the dataset which is similar to the Anchor.</p>

<p><strong>Negative</strong>: This is the embedding of another item in the dataset which is not similar to the Anchor; in other words, this item should be more dissimilar to the Anchor than the Positive is.</p>

<p>This loss function pushes the model to minimize the distance between Anchor and Positive while maximizing the distance between Anchor and Negative. There are two concerns here: the first is the loss function itself. How do we write such a loss function? The second is: how do we create such a dataset?</p>

<p>There are two ways to create such a dataset. This process is also called ‘Triplet Mining’.</p>
<ol>
  <li>Offline Triplet Mining: We create the triplets during pre-processing, so that at the end we have one big list of triplets. The number of possible triplets grows roughly cubically with the size of the original dataset, so this list can get huge and require a lot of memory (see the sketch after this list).</li>
  <li>Online Triplet Mining: Automatically generate the triplets from the data in a batch. This is the focus of this post.</li>
</ol>
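
<p>To make the offline approach concrete, here is a rough sketch (not from the original post) that enumerates every valid triplet of indices from a list of labels:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import product

# a minimal sketch of offline triplet mining: precompute every valid
# (anchor, positive, negative) index triplet from the labels
def offline_triplets(labels):
    triplets = []
    for a, p, n in product(range(len(labels)), repeat=3):
        if a != p and labels[a] == labels[p] and labels[a] != labels[n]:
            triplets.append((a, p, n))
    return triplets

print(len(offline_triplets(["red", "red", "green", "green", "green"])))  # 18
</code></pre></div></div>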

<p>All of this will be clear in the following sections.</p>

<h1 id="setup">Setup</h1>
<p>Let me motivate this post with a toy example. Suppose we have a list of produce (fruits and vegetables) and we’d like produce of the same color to be closer to each other than produce of different colors. Those embeddings could then be used in a search engine where, if a user submits the query “apple”, we return fruits/vegetables which are red in color.</p>

<p>Before we dive in, I’ll import a few libraries and also load a small language model from the SentenceTransformers library. Note that I used a Jupyter notebook to execute the code. If you want to follow along, you may need to install the <code class="language-plaintext highlighter-rouge">sentence-transformers</code>, <code class="language-plaintext highlighter-rouge">lets-plot</code> and <code class="language-plaintext highlighter-rouge">pytorch-metric-learning</code> packages.</p>

<details>
<summary>Click to expand code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">lets_plot</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">LetsPlot</span><span class="p">.</span><span class="n">setup_html</span><span class="p">()</span>

<span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span>
<span class="n">encoder</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s">"all-MiniLM-L6-v2"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<p>As mentioned above, let’s create a toy dataset where our “text” is the name of the produce and the label is the color.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="n">id_to_label</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">enumerate</span><span class="p">([</span><span class="s">"green"</span><span class="p">,</span> <span class="s">"red"</span><span class="p">,</span> <span class="s">"yellow"</span><span class="p">]))</span>
<span class="n">label_to_id</span> <span class="o">=</span> <span class="p">{</span><span class="n">lbl</span><span class="p">:</span><span class="nb">id</span> <span class="k">for</span> <span class="nb">id</span><span class="p">,</span> <span class="n">lbl</span> <span class="ow">in</span> <span class="n">id_to_label</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>
<span class="n">raw_data</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="s">"apple"</span><span class="p">,</span> <span class="s">"red"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"banana"</span><span class="p">,</span> <span class="s">"yellow"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"tomato"</span><span class="p">,</span> <span class="s">"red"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"lemon"</span><span class="p">,</span> <span class="s">"yellow"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"cucumber"</span><span class="p">,</span> <span class="s">"green"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"spinach"</span><span class="p">,</span> <span class="s">"green"</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"pea"</span><span class="p">,</span> <span class="s">"green"</span><span class="p">)</span>
<span class="p">]</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">raw_data</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"text"</span><span class="p">,</span> <span class="s">"label_str"</span><span class="p">])</span>
<span class="n">df</span><span class="p">[</span><span class="s">'label'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'label_str'</span><span class="p">].</span><span class="nb">map</span><span class="p">(</span><span class="n">label_to_id</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Our goal is for produce of the same color to be nearer to each other in the vector space than to produce of different colors.</p>

<h1 id="triplet-mining">Triplet Mining</h1>
<p>Let’s first focus on Online Triplet Mining: an approach to generate Anchor, Positive, Negative triplets automatically from a batch of data. Typically the batch size is small, e.g. 32, 64 or 128, so we can generate these triplets on the fly and compute the loss.</p>

<p>In this case, every item in our dataset acts as an Anchor. We have 7 produce items in the dataset, so there will be 7 Anchors. For each of those anchors, we find a positive and a negative item. To do that, we first need the embeddings of the Anchors. Let’s use the <code class="language-plaintext highlighter-rouge">encoder</code> model we instantiated earlier to generate the embeddings.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="n">embeddings</span> <span class="o">=</span> <span class="n">encoder</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">].</span><span class="n">values</span><span class="p">,</span> <span class="n">convert_to_tensor</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'label'</span><span class="p">].</span><span class="n">values</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">embeddings</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">labels</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="c1"># torch.Size([7, 384]) torch.Size([7])
</span></pre></td></tr></tbody></table></code></pre></div></div>

<p>If we plot the pairwise distances, we see the following.</p>

<p><img src="/assets/images/deep-learning/triplet-mining/pairwise_distance_1.png" alt="pairwise distance" /></p>

<details>
<summary>Click to expand code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="n">distances</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cdist</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">distances_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">distances</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">index</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">])</span>
<span class="n">melted_distances_df</span> <span class="o">=</span> <span class="n">distances_df</span><span class="p">.</span><span class="n">reset_index</span><span class="p">().</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="s">'text'</span><span class="p">,</span> <span class="n">var_name</span><span class="o">=</span><span class="s">'text2'</span><span class="p">,</span> <span class="n">value_name</span><span class="o">=</span><span class="s">'distance'</span><span class="p">)</span>
<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">melted_distances_df</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="s">'text'</span><span class="p">,</span> <span class="s">'text2'</span><span class="p">,</span> <span class="n">fill</span><span class="o">=</span><span class="s">'distance'</span><span class="p">))</span>
    <span class="o">+</span> <span class="n">geom_tile</span><span class="p">()</span>
    <span class="o">+</span> <span class="n">scale_fill_gradient</span><span class="p">(</span><span class="n">high</span><span class="o">=</span><span class="s">'orange'</span><span class="p">,</span> <span class="n">low</span><span class="o">=</span><span class="s">'red'</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s">'distance'</span><span class="p">),</span> <span class="n">label_format</span><span class="o">=</span><span class="s">".2f"</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">theme</span><span class="p">(</span><span class="n">axis_title_y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span> <span class="n">axis_title_x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">())</span>
    <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">"Pairwise distances"</span><span class="p">)</span>
<span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<p>Consider the row for ‘apple’. Since ‘tomato’ is also red, we want the distance between apple and tomato to be the smallest in that row, but this is not the case: that pair has a distance of 1.12, whereas apple and banana have the smallest distance of 1.07.</p>

<p>Since there are only two produce items with color ‘red’, for Anchor=apple we have positive=tomato. Now for the negative: banana is the item in our dataset which does not share apple’s label (i.e. color!=red) and has the smallest distance. So banana will be the negative for apple. To summarize, we get the following triplet: <code class="language-plaintext highlighter-rouge">Anchor=apple, positive=tomato, negative=banana</code>.</p>
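
<p>As a quick sanity check, we can reproduce this by hand for the apple row (a small sketch, assuming <code class="language-plaintext highlighter-rouge">distances</code>, <code class="language-plaintext highlighter-rouge">labels</code> and <code class="language-plaintext highlighter-rouge">df</code> from the code above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># mask out different-label entries to find the hard positive,
# and same-label entries to find the hard negative, for anchor index 0 (apple)
is_same = labels == labels[0]
hard_pos = distances[0].masked_fill(~is_same, float('-inf')).argmax()
hard_neg = distances[0].masked_fill(is_same, float('inf')).argmin()
print(df['text'].iloc[hard_pos.item()], df['text'].iloc[hard_neg.item()])
# tomato banana
</code></pre></div></div>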

<p>Now let’s focus on ‘spinach’, which has color=green. To pick a positive for ‘spinach’, we look at the item with the same label which has the highest distance.
The distance from ‘spinach’ to ‘cucumber’ is 1.08, to itself 0 of course, and to ‘pea’ 1.16. So we select ‘pea’ as its positive.
To pick a negative, we look at the item from a different group which has the smallest distance. In this case, ‘tomato’ has the smallest distance (1.04). We end up with the following triplet: <code class="language-plaintext highlighter-rouge">Anchor=spinach, positive=pea, negative=tomato</code>.</p>

<p>The positives and negatives selected this way are also called Hard Positives and Hard Negatives, since we selected the “extreme” items, i.e. similar items with maximum distance and dissimilar items with minimum distance.</p>

<p>Do note that these triplets are generated for every batch, so as the model learns, the selected positives and negatives for the same anchor will evolve over time.</p>

<p>Now let’s look at the triplets mined using this technique. I’ll use <code class="language-plaintext highlighter-rouge">BatchHardMiner</code> from the <a href="https://kevinmusgrave.github.io/pytorch-metric-learning/">pytorch-metric-learning</a> library. We’ll implement our own version later, and this will serve as a baseline to compare our implementation against.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">pytorch_metric_learning.miners</span> <span class="kn">import</span> <span class="n">BatchHardMiner</span>
<span class="n">miner</span> <span class="o">=</span> <span class="n">BatchHardMiner</span><span class="p">()</span>
<span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">,</span> <span class="n">negatives</span> <span class="o">=</span> <span class="n">miner</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_mined_triplets_as_df</span><span class="p">(</span><span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">,</span> <span class="n">negatives</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
        <span class="s">"anchor"</span><span class="p">:</span> <span class="n">anchors</span><span class="p">,</span>
        <span class="s">"positive"</span><span class="p">:</span> <span class="n">positives</span><span class="p">,</span>
        <span class="s">"negative"</span><span class="p">:</span> <span class="n">negatives</span><span class="p">,</span>
        <span class="s">"anchor_text"</span><span class="p">:</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">anchors</span><span class="p">][</span><span class="s">'text'</span><span class="p">].</span><span class="n">values</span><span class="p">,</span>
        <span class="s">"positive_text"</span><span class="p">:</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">positives</span><span class="p">][</span><span class="s">'text'</span><span class="p">].</span><span class="n">values</span><span class="p">,</span>
        <span class="s">"anchor_positive_dist"</span><span class="p">:</span> <span class="n">distances</span><span class="p">[</span><span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">],</span>
        <span class="s">"negative_text"</span><span class="p">:</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">negatives</span><span class="p">][</span><span class="s">'text'</span><span class="p">].</span><span class="n">values</span><span class="p">,</span>    
        <span class="s">"anchor_negative_dist"</span><span class="p">:</span> <span class="n">distances</span><span class="p">[</span><span class="n">anchors</span><span class="p">,</span> <span class="n">negatives</span><span class="p">]</span>
    <span class="p">})</span>

<span class="n">triplets_df</span> <span class="o">=</span> <span class="n">get_mined_triplets_as_df</span><span class="p">(</span><span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">,</span> <span class="n">negatives</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p><img src="/assets/images/deep-learning/triplet-mining/mined_triplets_df.png" alt="mined triplets" /></p>

<p>As we can see, for each item in our dataset the miner generated a positive and a negative. I’ve also included the distance from the Anchor to the Positive and from the Anchor to the Negative as a reference.</p>

<p>Let’s also visualize the triplets in their embedding space. The figure shows the Anchors in 2D; I’ve applied PCA to reduce the dimensions. Solid lines indicate a link from an Anchor to its Positive item and dotted lines indicate a link from an Anchor to its Negative. The text on each line indicates the distance between the pair.</p>

<p>Looking at the current embeddings, the fruits are placed on the left side and the vegetables on the right, and each group seems to be close together. But our goal is different: produce with the same color should be closer to each other.</p>

<p><img src="/assets/images/deep-learning/triplet-mining/mined_triplets_viz_1.png" alt="mined triplets" /></p>

<details>
<summary>Click to expand code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span>

<span class="k">def</span> <span class="nf">plot_triplets</span><span class="p">(</span><span class="n">embeddings</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">labels</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">df</span><span class="p">,</span> <span class="n">miner</span><span class="p">):</span>
    <span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">,</span> <span class="n">negatives</span> <span class="o">=</span> <span class="n">miner</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
    <span class="n">triplets_df</span> <span class="o">=</span> <span class="n">get_mined_triplets_as_df</span><span class="p">(</span><span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">,</span> <span class="n">negatives</span><span class="p">)</span>
    <span class="n">reduced_embeddings</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">2</span><span class="p">).</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
    <span class="n">triplet_lines</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">triplets_df</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span>
        <span class="n">triplet_lines</span><span class="p">.</span><span class="n">append</span><span class="p">({</span>
            <span class="s">'x_start'</span><span class="p">:</span> <span class="n">reduced_embeddings</span><span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s">'anchor'</span><span class="p">],</span> <span class="mi">0</span><span class="p">],</span>
            <span class="s">'y_start'</span><span class="p">:</span> <span class="n">reduced_embeddings</span><span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s">'anchor'</span><span class="p">],</span> <span class="mi">1</span><span class="p">],</span>
            <span class="s">'x_end_pos'</span><span class="p">:</span> <span class="n">reduced_embeddings</span><span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s">'positive'</span><span class="p">],</span> <span class="mi">0</span><span class="p">],</span>
            <span class="s">'y_end_pos'</span><span class="p">:</span> <span class="n">reduced_embeddings</span><span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s">'positive'</span><span class="p">],</span> <span class="mi">1</span><span class="p">],</span>
            <span class="s">'x_end_neg'</span><span class="p">:</span> <span class="n">reduced_embeddings</span><span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s">'negative'</span><span class="p">],</span> <span class="mi">0</span><span class="p">],</span>
            <span class="s">'y_end_neg'</span><span class="p">:</span> <span class="n">reduced_embeddings</span><span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s">'negative'</span><span class="p">],</span> <span class="mi">1</span><span class="p">],</span>
            <span class="s">'dist_pos'</span><span class="p">:</span> <span class="n">row</span><span class="p">[</span><span class="s">'anchor_positive_dist'</span><span class="p">],</span>
            <span class="s">'dist_neg'</span><span class="p">:</span> <span class="n">row</span><span class="p">[</span><span class="s">'anchor_negative_dist'</span><span class="p">],</span>
            <span class="s">'anchor_label'</span><span class="p">:</span> <span class="n">id_to_label</span><span class="p">[</span><span class="n">labels</span><span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s">'anchor'</span><span class="p">]].</span><span class="n">item</span><span class="p">()]</span>
        <span class="p">})</span>

    <span class="n">plot_data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
        <span class="s">'x'</span><span class="p">:</span> <span class="n">reduced_embeddings</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span>
        <span class="s">'y'</span><span class="p">:</span> <span class="n">reduced_embeddings</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span>
        <span class="s">'label'</span><span class="p">:</span> <span class="n">df</span><span class="p">[</span><span class="s">'label_str'</span><span class="p">].</span><span class="n">values</span><span class="p">,</span>
        <span class="s">'text'</span><span class="p">:</span> <span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">].</span><span class="n">values</span>
    <span class="p">})</span>
    <span class="n">triplet_lines_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">triplet_lines</span><span class="p">)</span>
    <span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'x_mid_pos'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'x_start'</span><span class="p">]</span> <span class="o">+</span> <span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'x_end_pos'</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
    <span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'y_mid_pos'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'y_start'</span><span class="p">]</span> <span class="o">+</span> <span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'y_end_pos'</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>

    <span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'x_mid_neg'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'x_start'</span><span class="p">]</span> <span class="o">+</span> <span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'x_end_neg'</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>
    <span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'y_mid_neg'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'y_start'</span><span class="p">]</span> <span class="o">+</span> <span class="n">triplet_lines_df</span><span class="p">[</span><span class="s">'y_end_neg'</span><span class="p">])</span> <span class="o">/</span> <span class="mi">2</span>

    <span class="n">plot</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">ggplot</span><span class="p">()</span> <span class="o">+</span>
        <span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="s">'x'</span><span class="p">,</span> <span class="s">'y'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'label'</span><span class="p">),</span> <span class="n">data</span><span class="o">=</span><span class="n">plot_data</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">show_legend</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        
        <span class="c1"># Arrows to positive samples
</span>        <span class="o">+</span> <span class="n">geom_segment</span><span class="p">(</span>
            <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'x_start'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'y_start'</span><span class="p">,</span> <span class="n">xend</span><span class="o">=</span><span class="s">'x_end_pos'</span><span class="p">,</span> <span class="n">yend</span><span class="o">=</span><span class="s">'y_end_pos'</span><span class="p">,</span> 
                <span class="n">color</span><span class="o">=</span><span class="s">'anchor_label'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'dist_pos'</span><span class="p">),</span> 
            <span class="n">data</span><span class="o">=</span><span class="n">triplet_lines_df</span><span class="p">,</span>
            <span class="n">arrow</span><span class="o">=</span><span class="n">arrow</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="s">'closed'</span><span class="p">,</span> <span class="n">angle</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">length</span><span class="o">=</span><span class="mf">0.1</span><span class="p">),</span>
            <span class="n">size</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
            <span class="n">show_legend</span><span class="o">=</span><span class="bp">False</span>
        <span class="p">)</span>
        
        <span class="c1"># # Arrows to negative samples
</span>        <span class="o">+</span> <span class="n">geom_segment</span><span class="p">(</span>
            <span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'x_start'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'y_start'</span><span class="p">,</span> <span class="n">xend</span><span class="o">=</span><span class="s">'x_end_neg'</span><span class="p">,</span> <span class="n">yend</span><span class="o">=</span><span class="s">'y_end_neg'</span><span class="p">,</span> 
                <span class="n">color</span><span class="o">=</span><span class="s">'anchor_label'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'dist_neg'</span><span class="p">),</span> 
            <span class="n">data</span><span class="o">=</span><span class="n">triplet_lines_df</span><span class="p">,</span>
            <span class="n">arrow</span><span class="o">=</span><span class="n">arrow</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="s">'closed'</span><span class="p">,</span> <span class="n">angle</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">length</span><span class="o">=</span><span class="mf">0.1</span><span class="p">),</span>
            <span class="n">size</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
            <span class="n">linetype</span><span class="o">=</span><span class="s">'dashed'</span><span class="p">,</span>
            <span class="n">show_legend</span><span class="o">=</span><span class="bp">False</span>
        <span class="p">)</span>

        <span class="o">+</span> <span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'x_mid_pos'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'y_mid_pos'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'dist_pos'</span><span class="p">),</span> 
                <span class="n">data</span><span class="o">=</span><span class="n">triplet_lines_df</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">7</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">label_format</span><span class="o">=</span><span class="s">".2f"</span><span class="p">)</span>

        <span class="o">+</span> <span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'x_mid_neg'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'y_mid_neg'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'dist_neg'</span><span class="p">),</span> 
                <span class="n">data</span><span class="o">=</span><span class="n">triplet_lines_df</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">7</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">label_format</span><span class="o">=</span><span class="s">".2f"</span><span class="p">)</span>
        
        <span class="o">+</span> <span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="s">'x'</span><span class="p">,</span> <span class="s">'y'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'text'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'label'</span><span class="p">),</span> <span class="n">data</span><span class="o">=</span><span class="n">plot_data</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">nudge_x</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">nudge_y</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">show_legend</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

        <span class="o">+</span> <span class="n">scale_color_manual</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="p">{</span><span class="s">'red'</span><span class="p">:</span> <span class="s">'red'</span><span class="p">,</span> <span class="s">'green'</span><span class="p">:</span> <span class="s">'#32CD32'</span><span class="p">,</span> <span class="s">'yellow'</span><span class="p">:</span> <span class="s">'#FFA000'</span><span class="p">})</span>

        <span class="o">+</span> <span class="n">ggsize</span><span class="p">(</span><span class="mi">1024</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span>
        <span class="o">+</span> <span class="n">theme</span><span class="p">(</span><span class="n">axis_title</span><span class="o">=</span><span class="n">element_blank</span><span class="p">())</span>
        <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">"Distance between anchor to positive and negative"</span><span class="p">,</span> <span class="n">subtitle</span><span class="o">=</span><span class="s">"dotted line indicate link to negative item, solid line indicate link to positive item"</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">plot</span>

<span class="n">plot_triplets</span><span class="p">(</span><span class="n">embeddings</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">,</span> <span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">,</span> <span class="n">miner</span><span class="o">=</span><span class="n">miner</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<p>Ok, so far we’ve seen how Online Triplet Mining works and how to use an existing implementation. Now let’s implement our own version.</p>

<p>To start, we have the embeddings of each item in a batch and their labels. Once again, the shape of the embeddings is <code class="language-plaintext highlighter-rouge">(batch_size, embed_dim)</code> and the shape of the labels is <code class="language-plaintext highlighter-rouge">(batch_size,)</code>, i.e. a 1D tensor.</p>

<p>Since we need the distance between every pair in the batch, we first compute the pairwise distance matrix. Using <code class="language-plaintext highlighter-rouge">torch.cdist</code> with <code class="language-plaintext highlighter-rouge">p=2</code> calculates the Euclidean distance.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">distances</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cdist</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">display</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">distances</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">index</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">]))</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p><img src="/assets/images/deep-learning/triplet-mining/pairwise_df_1.png" alt="pairwise distance" /></p>

<p>Next, we create masks over the <code class="language-plaintext highlighter-rouge">distances</code> matrix: a positive mask whose <code class="language-plaintext highlighter-rouge">True</code> values indicate that the corresponding distance belongs to a positive pair, and a negative mask whose <code class="language-plaintext highlighter-rouge">True</code> values indicate that the distance belongs to a negative pair.</p>

<p>Since <code class="language-plaintext highlighter-rouge">labels</code> is a 1D array and we need a 2D mask, we do a little bit of broadcasting magic, illustrated with a tiny example after the code below.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre><span class="n">labels2</span> <span class="o">=</span> <span class="n">labels</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># (N, 1)
</span><span class="n">positive_mask</span> <span class="o">=</span> <span class="n">labels2</span> <span class="o">==</span> <span class="n">labels2</span><span class="p">.</span><span class="n">t</span><span class="p">()</span> <span class="c1"># (N, N) bool tensor (True indicates the pair is positive)
</span>
<span class="n">negative_mask</span> <span class="o">=</span> <span class="n">labels2</span> <span class="o">!=</span> <span class="n">labels2</span><span class="p">.</span><span class="n">t</span><span class="p">()</span> <span class="c1"># (N, N) bool tensor (True indicates the pair is negative)
</span>
<span class="n">display</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">positive_mask</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">index</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">]))</span>
<span class="n">display</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">negative_mask</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">index</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">]))</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The figure below shows what the <code class="language-plaintext highlighter-rouge">positive_mask</code> and <code class="language-plaintext highlighter-rouge">negative_mask</code> look like.
<img src="/assets/images/deep-learning/triplet-mining/pos_neg_masks.png" alt="positive negative masks" /></p>

<p>Now let’s see how these masks are used to find positives and negatives.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="n">distances_masked</span> <span class="o">=</span> <span class="n">distances</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="n">negative_mask</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="s">'-inf'</span><span class="p">))</span>
<span class="n">display</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">distances_masked</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">index</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">]))</span>
<span class="c1"># for positive pairs, we want the item with least similarity i.e. max distance from same group
</span><span class="n">_</span><span class="p">,</span> <span class="n">hard_positive_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">distances_masked</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>As seen in the figure below, to find the positives we set the entries which belong to a different class or group to <code class="language-plaintext highlighter-rouge">-inf</code>. Then, for each anchor, we find the item with the highest remaining distance.
<img src="/assets/images/deep-learning/triplet-mining/positive_mask.png" alt="positive mask" />
<code class="language-plaintext highlighter-rouge">hard_positive_ids</code> now contains <code class="language-plaintext highlighter-rouge">tensor([2, 3, 0, 1, 6, 6, 5])
</code></p>
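
<p>To make those indices readable, we can map them back to produce names (a small sketch):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># map each hard positive index back to its text
print([df['text'].iloc[i.item()] for i in hard_positive_ids])
# ['tomato', 'lemon', 'apple', 'banana', 'pea', 'pea', 'spinach']
</code></pre></div></div>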

<p>Similarly, to find the negatives, we replace the distances of items from the same group with <code class="language-plaintext highlighter-rouge">inf</code>. The remaining “valid distances” are only for items from different groups. Then, for each anchor, we find the item with the smallest distance.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="n">distances_masked</span> <span class="o">=</span> <span class="n">distances</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="n">positive_mask</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="s">'inf'</span><span class="p">))</span>
<span class="n">display</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">distances_masked</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">index</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">]))</span>
<span class="c1"># for each anchor, find the item with lowest distance which does not belong to same group
</span><span class="n">_</span><span class="p">,</span> <span class="n">hard_negative_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">distances_masked</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p><img src="/assets/images/deep-learning/triplet-mining/negative_mask.png" alt="negative mask" />
Now <code class="language-plaintext highlighter-rouge">hard_negative_ids</code> contains the following <code class="language-plaintext highlighter-rouge">tensor([1, 2, 1, 2, 1, 2, 2])
</code></p>
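
<p>Putting the two together, we can print the mined hard triplets as text and compare them with the <code class="language-plaintext highlighter-rouge">BatchHardMiner</code> output from earlier (a quick sketch):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># each row is (anchor, hard positive, hard negative)
for a in range(len(df)):
    print(df['text'].iloc[a],
          df['text'].iloc[hard_positive_ids[a].item()],
          df['text'].iloc[hard_negative_ids[a].item()])
# apple tomato banana
# banana lemon tomato
# ...
</code></pre></div></div>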

<p>Let’s package this in a class so that we can use it easily later.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">MyMiner</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">labels</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">):</span>
        <span class="n">n_items</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">anchors</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">n_items</span><span class="p">)</span>

        <span class="n">distances</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cdist</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>

        <span class="n">labels</span> <span class="o">=</span> <span class="n">labels</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># (N, 1)
</span>        <span class="n">positive_mask</span> <span class="o">=</span> <span class="n">labels</span> <span class="o">==</span> <span class="n">labels</span><span class="p">.</span><span class="n">t</span><span class="p">()</span> <span class="c1"># (N, N) bool tensor (True indicates the pair is positive)
</span>        <span class="n">negative_mask</span> <span class="o">=</span> <span class="n">labels</span> <span class="o">!=</span> <span class="n">labels</span><span class="p">.</span><span class="n">t</span><span class="p">()</span> <span class="c1"># (N, N) bool tensor (True indicates the pair is negative)
</span>
        
        <span class="c1"># fill the distances of negative pairs with negative infinity value
</span>        <span class="c1"># the remaining distances are for positive pairs only, and we find the positive
</span>        <span class="c1"># item with highest distance as hard positive
</span>        <span class="n">_</span><span class="p">,</span> <span class="n">positives</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">distances</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="n">negative_mask</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="s">'-inf'</span><span class="p">)),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
        <span class="c1"># fill the distances of positive pairs with positive infinity value
</span>        <span class="c1"># the remaining distances are for negative pairs only, and we find the negative item
</span>        <span class="c1"># with lowest distance as hard negative
</span>        <span class="n">_</span><span class="p">,</span> <span class="n">negatives</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">distances</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="n">positive_mask</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="s">'inf'</span><span class="p">)),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">,</span> <span class="n">negatives</span>
    
<span class="n">myminer</span> <span class="o">=</span> <span class="n">MyMiner</span><span class="p">()</span>
<span class="n">plot_triplets</span><span class="p">(</span><span class="n">embeddings</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">,</span> <span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">,</span> <span class="n">miner</span><span class="o">=</span><span class="n">myminer</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>If we plot the triplets, we get exactly the same results as before, when we used the open-source implementation.
<img src="/assets/images/deep-learning/triplet-mining/mymined_tripets_viz_2.png" alt="mined triplets" /></p>

<h1 id="triplet-loss">Triplet Loss</h1>
<p>Triplet loss is quite straightforward. The formula is</p>

\[loss = \max(dist_{ap} - dist_{an} + margin, 0)\]

<p>where</p>

<p>\(dist_{ap}\) = Distance between Anchor and Positive</p>

<p>\(dist_{an}\) = Distance between Anchor and Negative</p>

<p>We compute the loss for each triplet and typically take the mean as final loss value.</p>
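
<p>As an aside, PyTorch ships a ready-made version of this loss as <code class="language-plaintext highlighter-rouge">torch.nn.TripletMarginLoss</code>. A minimal usage sketch, assuming the index tensors mined earlier and the <code class="language-plaintext highlighter-rouge">embeddings</code> tensor from above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

loss_fn = torch.nn.TripletMarginLoss(margin=0.05, p=2)
# look up the embedding of each mined anchor/positive/negative index
loss = loss_fn(embeddings[anchors], embeddings[positives], embeddings[negatives])
print(loss)  # should closely match the manual computation below
</code></pre></div></div>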

<p>The following code walks through each step.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="c1"># we want the distance between anchor/positive to be at least 'margin' amount greater than anchor/negative
# value of margin depends on the distance function used. here we are using Euclidean distance
</span><span class="n">margin</span> <span class="o">=</span> <span class="mf">0.05</span>
<span class="c1"># compute difference between positive and negative item's distance
</span><span class="n">triplets_df</span><span class="p">[</span><span class="s">'diff_ap_an'</span><span class="p">]</span> <span class="o">=</span> <span class="n">triplets_df</span><span class="p">[</span><span class="s">'anchor_positive_dist'</span><span class="p">]</span> <span class="o">-</span> <span class="n">triplets_df</span><span class="p">[</span><span class="s">'anchor_negative_dist'</span><span class="p">]</span>
<span class="c1"># add margin
</span><span class="n">triplets_df</span><span class="p">[</span><span class="s">'diff_ap_an_plus_marin'</span><span class="p">]</span> <span class="o">=</span> <span class="n">triplets_df</span><span class="p">[</span><span class="s">'diff_ap_an'</span><span class="p">]</span> <span class="o">+</span> <span class="n">margin</span>
<span class="c1"># clip negative values to zero
</span><span class="n">triplets_df</span><span class="p">[</span><span class="s">'clipped'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">triplets_df</span><span class="p">[</span><span class="s">'diff_ap_an_plus_marin'</span><span class="p">],</span> <span class="n">a_min</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">a_max</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
<span class="n">display</span><span class="p">(</span><span class="n">triplets_df</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">triplets_df</span><span class="p">[</span><span class="s">'clipped'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Triplet margin loss = </span><span class="si">{</span><span class="n">loss</span><span class="si">:</span><span class="mi">5</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p><img src="/assets/images/deep-learning/triplet-mining/triplet_margin_loss_df.png" alt="triplet margin loss" /></p>

<p>The calculation should be straightforward. The <code class="language-plaintext highlighter-rouge">clipped</code> column contains the loss for each triplet, and at the end we take the average as the final loss for the batch.</p>

<p>Let’s focus on the case where the anchor is <strong>lemon</strong> to understand the role of the margin. The triplet is <code class="language-plaintext highlighter-rouge">Anchor=lemon, positive=banana(dist. 0.97), negative=tomato(dist 1.05)</code>. The loss for this triplet is 0. Why?</p>

<p>The positive item for lemon is already closer to the anchor than the negative item. The difference is -0.0757, whose magnitude exceeds our margin of 0.05; even after we add the margin we get -0.0257, which gets clipped to 0. This means the model already does what it is supposed to do in this case, i.e. keep the positive closer to the anchor than the negative.</p>
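
<p>To make the clipping concrete, here is the same calculation on the rounded distances shown above (the actual table carries more decimal places, which is where the -0.0757 comes from):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># rounded distances for the lemon anchor: positive=banana, negative=tomato
ap, an, margin = 0.97, 1.05, 0.05
loss = max(ap - an + margin, 0)  # max(-0.03, 0) = 0: the constraint is already satisfied
</code></pre></div></div>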

<p>The same goes for the case when <strong>banana</strong> is the anchor. The positive item has a smaller distance than the negative one. The difference is -0.0068, and after adding the margin we get 0.0431 as the loss, which is low compared to the loss values for the other triplets.</p>

<p>Now that we know how to calculate Triplet Loss, let’s implement a proper version and compare it against PyTorch’s and pytorch-metric-learning’s implementations.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">TripletLoss</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">margin</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">margin</span> <span class="o">=</span> <span class="n">margin</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">p</span> <span class="o">=</span> <span class="n">p</span>

    <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">,</span> <span class="n">negatives</span><span class="p">):</span>
        <span class="n">ap</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">pairwise_distance</span><span class="p">(</span><span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">)</span>
        <span class="n">an</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">pairwise_distance</span><span class="p">(</span><span class="n">anchors</span><span class="p">,</span> <span class="n">negatives</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">)</span>

        <span class="c1"># the above step is basically same as the following
</span>        
        <span class="c1"># anchors = torch.nn.functional.normalize(anchors, p=self.p, dim=-1)
</span>        <span class="c1"># positives = torch.nn.functional.normalize(positives, p=self.p, dim=-1)
</span>        <span class="c1"># negatives = torch.nn.functional.normalize(negatives, p=self.p, dim=-1)
</span>        <span class="c1"># ap = (anchors - positives).pow(2).sum(dim=-1).sqrt()
</span>        <span class="c1"># an = (anchors - negatives).pow(2).sum(dim=-1).sqrt()
</span>
        <span class="c1"># we can use relu since it keep positive values as is and assigns negative values to 0
</span>        <span class="c1"># return torch.relu(ap - an + self.margin).mean()
</span>        <span class="c1"># pytorch uses the version shown below
</span>        <span class="c1"># https://pytorch.org/docs/main/_modules/torch/nn/functional.html#triplet_margin_loss
</span>        <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">clamp_min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">margin</span> <span class="o">+</span> <span class="n">ap</span> <span class="o">-</span> <span class="n">an</span><span class="p">,</span> <span class="mi">0</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>

<span class="kn">from</span> <span class="nn">pytorch_metric_learning.losses</span> <span class="kn">import</span> <span class="n">TripletMarginLoss</span>
<span class="kn">from</span> <span class="nn">pytorch_metric_learning.reducers</span> <span class="kn">import</span> <span class="n">MeanReducer</span>

<span class="n">anchor_ids</span><span class="p">,</span> <span class="n">positive_ids</span><span class="p">,</span> <span class="n">negative_ids</span> <span class="o">=</span> <span class="n">myminer</span><span class="p">(</span><span class="n">embeddings</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">)</span>
<span class="n">anchors</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">[</span><span class="n">anchor_ids</span><span class="p">]</span>
<span class="n">positives</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">[</span><span class="n">positive_ids</span><span class="p">]</span>
<span class="n">negatives</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">[</span><span class="n">negative_ids</span><span class="p">]</span>
<span class="c1"># pytorch implementation
</span><span class="n">torch_loss</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">TripletMarginLoss</span><span class="p">(</span><span class="n">margin</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">swap</span><span class="o">=</span><span class="bp">False</span><span class="p">)(</span>
    <span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">,</span> <span class="n">negatives</span>
<span class="p">)</span>
<span class="c1"># pytorch metric learning implementation. by default uses Euclidean distance
# we also specify MeanReducer to take the average of the individual triplet losses
</span><span class="n">pml_loss</span> <span class="o">=</span> <span class="n">TripletMarginLoss</span><span class="p">(</span>
    <span class="n">margin</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">swap</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">smooth_loss</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">reducer</span><span class="o">=</span><span class="n">MeanReducer</span><span class="p">()</span>
<span class="p">)(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="p">(</span><span class="n">anchor_ids</span><span class="p">,</span> <span class="n">positive_ids</span><span class="p">,</span> <span class="n">negative_ids</span><span class="p">))</span>
<span class="c1"># our implementation
</span><span class="n">my_loss</span> <span class="o">=</span> <span class="n">TripletLoss</span><span class="p">(</span><span class="n">margin</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">)(</span><span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">,</span> <span class="n">negatives</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Torch loss: </span><span class="si">{</span><span class="n">torch_loss</span><span class="si">:</span><span class="mi">7</span><span class="n">f</span><span class="si">}</span><span class="s">. PML loss: </span><span class="si">{</span><span class="n">pml_loss</span><span class="si">:</span><span class="mi">7</span><span class="n">f</span><span class="si">}</span><span class="s">. My loss: </span><span class="si">{</span><span class="n">my_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">7</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>Torch loss: 0.111298. PML loss: 0.111298. My loss: 0.1112983
</pre></td></tr></tbody></table></code></pre></div></div>
<p>So, all 3 implementations give the same output. We know our implementation works!</p>

<h1 id="usage">Usage</h1>
<p>Now let’s use the miner and the loss function we just implemented. We’ll fine-tune the base model, i.e. the SentenceTransformer model we instantiated earlier, for our toy example.</p>

<p>Below is our model. In the <code class="language-plaintext highlighter-rouge">training_step</code> method you can see how we use the miner and the loss function.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">pytorch_lightning</span> <span class="k">as</span> <span class="n">L</span>
<span class="kn">import</span> <span class="nn">copy</span>
<span class="k">class</span> <span class="nc">MyModel</span><span class="p">(</span><span class="n">L</span><span class="p">.</span><span class="n">LightningModule</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">encoder</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="c1"># copy the original model so that we have a fresh copy
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">encoder</span> <span class="o">=</span> <span class="n">copy</span><span class="p">.</span><span class="n">deepcopy</span><span class="p">(</span><span class="n">encoder</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">attention_mask</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="c1"># sentence transformer model expects a dict as input
</span>        <span class="n">outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">encoder</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">input_ids</span><span class="o">=</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="o">=</span><span class="n">attention_mask</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">))</span>
        <span class="c1"># sentence transformer model returns pooled token embeddings as 'sentence_embedding'
</span>        <span class="n">embeddings</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">[</span><span class="s">'sentence_embedding'</span><span class="p">]</span>
        <span class="k">return</span> <span class="n">embeddings</span>
    
    <span class="k">def</span> <span class="nf">training_step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch</span><span class="p">,</span> <span class="n">batch_idx</span><span class="p">):</span>
        <span class="n">miner</span> <span class="o">=</span> <span class="n">MyMiner</span><span class="p">()</span>
        <span class="n">labels</span> <span class="o">=</span> <span class="n">batch</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="s">'labels'</span><span class="p">)</span>
        <span class="n">embeddings</span> <span class="o">=</span> <span class="bp">self</span><span class="p">(</span><span class="o">**</span><span class="n">batch</span><span class="p">)</span>
        
        <span class="n">anchor_ids</span><span class="p">,</span> <span class="n">positive_ids</span><span class="p">,</span> <span class="n">negative_ids</span> <span class="o">=</span> <span class="n">miner</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
        <span class="n">anchors</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">[</span><span class="n">anchor_ids</span><span class="p">]</span>
        <span class="n">positives</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">[</span><span class="n">positive_ids</span><span class="p">]</span>
        <span class="n">negatives</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">[</span><span class="n">negative_ids</span><span class="p">]</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">TripletLoss</span><span class="p">(</span><span class="n">margin</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">)(</span><span class="n">anchors</span><span class="p">,</span> <span class="n">positives</span><span class="p">,</span> <span class="n">negatives</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="s">'train_loss'</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="n">on_epoch</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">on_step</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">prog_bar</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">loss</span>
    
    <span class="k">def</span> <span class="nf">configure_optimizers</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">4e-5</span><span class="p">)</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">MyModel</span><span class="p">(</span><span class="n">encoder</span><span class="o">=</span><span class="n">encoder</span><span class="p">)</span>        
</pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="toy-dataset">Toy Dataset</h2>
<p>Now I’ll commit a crime by training on our small dataset of 7 items and then evaluating on the same dataset, but this is just to show that things work end-to-end. After this section, we’ll use this on a proper dataset with a proper train/test split.</p>

<p>The code to generate the DataLoader and train the model is hidden. You can check it out if you want to follow along.</p>

<details>
<summary>Click to expand code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">datasets</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">encoder</span><span class="p">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">batch</span><span class="p">[</span><span class="s">'text'</span><span class="p">])</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">tokenize</span><span class="p">,</span> <span class="n">batched</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">DataCollatorWithPadding</span>
<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'input_ids'</span><span class="p">,</span> <span class="s">'attention_mask'</span><span class="p">,</span> <span class="s">'token_type_ids'</span><span class="p">,</span> <span class="s">'label'</span><span class="p">]</span>

<span class="c1"># we'll not use this test set anyways while training, you might want to change it
</span><span class="n">ds_dict</span> <span class="o">=</span> <span class="n">ds</span><span class="p">.</span><span class="n">train_test_split</span><span class="p">(</span><span class="n">test_size</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span>

<span class="n">train_ds</span> <span class="o">=</span> <span class="n">ds_dict</span><span class="p">[</span><span class="s">'train'</span><span class="p">]</span>
<span class="n">test_ds</span> <span class="o">=</span> <span class="n">ds_dict</span><span class="p">[</span><span class="s">'test'</span><span class="p">]</span>

<span class="n">train_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span>
    <span class="n">train_ds</span><span class="p">.</span><span class="n">select_columns</span><span class="p">(</span><span class="n">columns</span><span class="p">),</span>
    <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="o">=</span><span class="mi">64</span><span class="p">,</span>
    <span class="n">collate_fn</span><span class="o">=</span><span class="n">DataCollatorWithPadding</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">=</span><span class="n">encoder</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">),</span>
<span class="p">)</span>
<span class="n">test_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span>
    <span class="n">test_ds</span><span class="p">.</span><span class="n">select_columns</span><span class="p">(</span><span class="n">columns</span><span class="p">),</span>
    <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
    <span class="n">collate_fn</span><span class="o">=</span><span class="n">DataCollatorWithPadding</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">=</span><span class="n">encoder</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">),</span>
<span class="p">)</span>

<span class="n">trainer</span> <span class="o">=</span> <span class="n">L</span><span class="p">.</span><span class="n">Trainer</span><span class="p">(</span><span class="n">fast_dev_run</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">max_epochs</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">trainer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">train_dataloaders</span><span class="o">=</span><span class="n">train_dl</span><span class="p">,</span> <span class="n">val_dataloaders</span><span class="o">=</span><span class="n">test_dl</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<p>I trained it for 5 epochs and now if we plot the pairwise distances again, we see the following.</p>

<p><img src="/assets/images/deep-learning/triplet-mining/pairwise_distance_2.png" alt="pairwise distance 2" />
Now, as we expected, apple and tomato have the smallest distance among all the pairs. All the green-vegetable pairs have smaller distances compared to the other pairs. Lemon and banana are also closer than ever.</p>

<p>Just for fun, let’s visualize what the mined triplets look like using the embeddings from the fine-tuned model.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="c1"># use the new encoder to generate embeddings
</span><span class="n">embeddings</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">encoder</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">convert_to_tensor</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plot_triplets</span><span class="p">(</span><span class="n">embeddings</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">,</span> <span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="p">,</span> <span class="n">miner</span><span class="o">=</span><span class="n">miner</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p><img src="/assets/images/deep-learning/triplet-mining/finetuned_mined_triplets_viz.png" alt="mined triplets" /></p>

<p>Did you see the difference? Before, all the fruits were roughly on the left side and the vegetables were on the right side, quite far apart. But now our produce is clustered together by color, i.e. the objective we defined and trained on!</p>

<h2 id="news-dataset">News Dataset</h2>
<p>Let’s see our implementation in action on a relatively large dataset compared to our toy one. I’ll use the <code class="language-plaintext highlighter-rouge">SetFit/bbc-news</code> dataset from the HuggingFace Hub, which contains news articles. Our goal is to make the embeddings of news articles from the same category more similar than the ones from other categories. The embeddings generated will be especially useful for semantic search and clustering.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">news_ds</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">load_dataset</span><span class="p">(</span><span class="s">"SetFit/bbc-news"</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">"train"</span><span class="p">)</span> <span class="c1"># fields: ['text', 'label', 'label_text']
</span><span class="n">news_model</span> <span class="o">=</span> <span class="n">MyModel</span><span class="p">(</span><span class="n">encoder</span><span class="o">=</span><span class="n">encoder</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<details>
<summary>Click to expand training code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">encoder</span><span class="p">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">batch</span><span class="p">[</span><span class="s">'text'</span><span class="p">])</span>
<span class="n">news_ds</span> <span class="o">=</span> <span class="n">news_ds</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">tokenize</span><span class="p">,</span> <span class="n">batched</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'input_ids'</span><span class="p">,</span> <span class="s">'attention_mask'</span><span class="p">,</span> <span class="s">'token_type_ids'</span><span class="p">,</span> <span class="s">'label'</span><span class="p">]</span>

<span class="n">ds_dict</span> <span class="o">=</span> <span class="n">news_ds</span><span class="p">.</span><span class="n">train_test_split</span><span class="p">(</span><span class="n">test_size</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>

<span class="n">train_ds</span> <span class="o">=</span> <span class="n">ds_dict</span><span class="p">[</span><span class="s">'train'</span><span class="p">]</span>
<span class="n">test_ds</span> <span class="o">=</span> <span class="n">ds_dict</span><span class="p">[</span><span class="s">'test'</span><span class="p">]</span>

<span class="n">train_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span>
    <span class="n">train_ds</span><span class="p">.</span><span class="n">select_columns</span><span class="p">(</span><span class="n">columns</span><span class="p">),</span>
    <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="o">=</span><span class="mi">64</span><span class="p">,</span>
    <span class="n">collate_fn</span><span class="o">=</span><span class="n">DataCollatorWithPadding</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">=</span><span class="n">encoder</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">),</span>
<span class="p">)</span>
<span class="n">test_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span>
    <span class="n">test_ds</span><span class="p">.</span><span class="n">select_columns</span><span class="p">(</span><span class="n">columns</span><span class="p">),</span>
    <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
    <span class="n">collate_fn</span><span class="o">=</span><span class="n">DataCollatorWithPadding</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">=</span><span class="n">encoder</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">),</span>
<span class="p">)</span>

<span class="n">trainer</span> <span class="o">=</span> <span class="n">L</span><span class="p">.</span><span class="n">Trainer</span><span class="p">(</span><span class="n">fast_dev_run</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">max_epochs</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">trainer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">news_model</span><span class="p">,</span> <span class="n">train_dataloaders</span><span class="o">=</span><span class="n">train_dl</span><span class="p">,</span> <span class="n">val_dataloaders</span><span class="o">=</span><span class="n">test_dl</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<p>I trained it for 5 epochs. Now let’s plot the embeddings using the original model and the fine-tuned model.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">original_embeddings</span> <span class="o">=</span> <span class="n">encoder</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">test_ds</span><span class="p">[</span><span class="s">'text'</span><span class="p">])</span>
<span class="n">new_embeddings</span> <span class="o">=</span> <span class="n">news_model</span><span class="p">.</span><span class="n">encoder</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">test_ds</span><span class="p">[</span><span class="s">'text'</span><span class="p">])</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The difference between the embeddings from the pre-trained and the fine-tuned model is drastic. The new embeddings are quite well separated according to the labels compared to the old ones.
<img src="/assets/images/deep-learning/triplet-mining/news_embeddings.png" alt="news embeddings" /></p>
<details>
<summary>Click to expand visualization code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">plot_embeddings</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">title</span><span class="p">):</span>
    <span class="n">reduced_embeddings</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">2</span><span class="p">).</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
        <span class="s">"x"</span><span class="p">:</span> <span class="n">reduced_embeddings</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span>
        <span class="s">"y"</span><span class="p">:</span> <span class="n">reduced_embeddings</span><span class="p">[:</span> <span class="p">,</span><span class="mi">1</span><span class="p">],</span>
        <span class="s">"label"</span><span class="p">:</span> <span class="n">labels</span>
    <span class="p">})</span>
    <span class="n">fig</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="s">'x'</span><span class="p">,</span> <span class="s">'y'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'label'</span><span class="p">))</span>
        <span class="o">+</span> <span class="n">geom_point</span><span class="p">()</span>
        <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="n">title</span><span class="p">)</span>
        <span class="o">+</span> <span class="n">theme</span><span class="p">(</span><span class="n">axis_title</span><span class="o">=</span><span class="n">element_blank</span><span class="p">())</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">fig</span>

<span class="n">fig1</span> <span class="o">=</span> <span class="n">plot_embeddings</span><span class="p">(</span>
    <span class="n">original_embeddings</span><span class="p">,</span>
    <span class="n">test_ds</span><span class="p">[</span><span class="s">"label_text"</span><span class="p">],</span>
    <span class="n">title</span><span class="o">=</span><span class="s">"News articles in test set using Original Model"</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">fig2</span> <span class="o">=</span> <span class="n">plot_embeddings</span><span class="p">(</span>
    <span class="n">new_embeddings</span><span class="p">,</span>
    <span class="n">test_ds</span><span class="p">[</span><span class="s">"label_text"</span><span class="p">],</span>
    <span class="n">title</span><span class="o">=</span><span class="s">"News articles in test set using Fine-Tuned Model"</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">bunch</span> <span class="o">=</span> <span class="n">GGBunch</span><span class="p">()</span>
<span class="n">bunch</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span><span class="n">fig1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">600</span><span class="p">,</span> <span class="mi">400</span><span class="p">)</span>
<span class="n">bunch</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span><span class="n">fig2</span><span class="p">,</span> <span class="mi">600</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">600</span><span class="p">,</span> <span class="mi">400</span><span class="p">)</span>
<span class="n">display</span><span class="p">(</span><span class="n">bunch</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<h1 id="benchmarking">Benchmarking</h1>
<p>To compare the runtime performance of our miner implementation against the one in <a href="https://github.com/KevinMusgrave/pytorch-metric-learning/blob/master/src/pytorch_metric_learning/miners/batch_easy_hard_miner.py">pytorch-metric-learning</a>, I’ve created a small benchmark.</p>

<p>One of the main highlights of our implementation is that everything is done via tensor operations; there are no for-loops. This is critical because during training the embeddings live on the GPU. If we were to copy the embeddings to the CPU and then run the operations there, it would be slow: copying from GPU to CPU takes time, and the GPU is much faster at vectorized operations than the CPU. Besides, after finding the triplets on the CPU, we’d have to copy them back to the GPU, adding more latency. For example, the <a href="https://open-metric-learning.readthedocs.io/en/latest/_modules/oml/miners/inbatch_hard_tri.html#HardTripletsMiner">Open Metric Learning HardTripletsMiner</a> code uses for-loops, which will make things slower.</p>
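
<p>To illustrate what “no for-loops” means in practice, below is a minimal sketch of loop-free batch-hard mining built only from tensor operations. This is an illustrative simplification, not the exact <code class="language-plaintext highlighter-rouge">MyMiner</code> implementation from earlier, and it ignores edge cases such as anchors that have no positive in the batch.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def hard_triplets(embeddings, labels):
    # pairwise Euclidean distances between all items in the batch
    dist = torch.cdist(embeddings, embeddings)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-label mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    # hardest positive: same label (excluding self), largest distance
    positives = dist.masked_fill(~same | eye, float("-inf")).argmax(dim=1)
    # hardest negative: different label, smallest distance
    negatives = dist.masked_fill(same, float("inf")).argmin(dim=1)
    anchors = torch.arange(len(labels), device=labels.device)
    return anchors, positives, negatives
</code></pre></div></div>

<p>Every step above maps to a single batched kernel, so the whole computation stays on the GPU without ever moving data to the CPU.</p>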

<p><img src="/assets/images/deep-learning/triplet-mining/benchmark.png" alt="benchmark" /></p>

<details>
<summary>Click to expand benchmark code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
</pre></td><td class="rouge-code"><pre><span class="c1"># load a model to generate embeddings
</span><span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span>
<span class="n">encoder</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s">"all-MiniLM-L6-v2"</span><span class="p">)</span>

<span class="c1"># load a dataset
</span><span class="kn">import</span> <span class="nn">datasets</span>
<span class="n">news_ds</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">load_dataset</span><span class="p">(</span><span class="s">"SetFit/bbc-news"</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">"train"</span><span class="p">)</span>

<span class="c1"># extract embeddings and make this available in cpu
</span><span class="n">embeddings</span> <span class="o">=</span> <span class="n">encoder</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">news_ds</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">convert_to_tensor</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">cpu</span><span class="p">()</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">news_ds</span><span class="p">[</span><span class="s">'label'</span><span class="p">])</span>

<span class="c1"># copy embeddings and labels to GPU
</span><span class="n">cuda_embeddings</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">.</span><span class="n">cuda</span><span class="p">()</span>
<span class="n">cuda_labels</span> <span class="o">=</span> <span class="n">labels</span><span class="p">.</span><span class="n">cuda</span><span class="p">()</span>

<span class="c1"># instantiate miners
</span><span class="kn">from</span> <span class="nn">pytorch_metric_learning.miners</span> <span class="kn">import</span> <span class="n">BatchHardMiner</span>

<span class="n">pml_miner</span> <span class="o">=</span> <span class="n">BatchHardMiner</span><span class="p">()</span>
<span class="n">my_miner</span> <span class="o">=</span> <span class="n">MyMiner</span><span class="p">()</span>

<span class="c1"># before moving forward let's make sure we have same output from both miners
</span><span class="n">pml_anchors</span><span class="p">,</span> <span class="n">pml_positives</span><span class="p">,</span> <span class="n">pml_negatives</span> <span class="o">=</span> <span class="n">pml_miner</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
<span class="n">my_anchors</span><span class="p">,</span> <span class="n">my_positives</span><span class="p">,</span> <span class="n">my_negatives</span> <span class="o">=</span> <span class="n">my_miner</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span><span class="n">pml_anchors</span><span class="p">,</span> <span class="n">my_anchors</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span><span class="n">pml_positives</span><span class="p">,</span> <span class="n">my_positives</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span><span class="n">pml_negatives</span><span class="p">,</span> <span class="n">my_negatives</span><span class="p">)</span>

<span class="c1"># benchmark function
</span><span class="kn">import</span> <span class="nn">time</span>
<span class="k">def</span> <span class="nf">benchmark</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">miner</span><span class="p">,</span> <span class="n">n_runs</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
    <span class="s">"""returns avg and std of time to mine in milliseconds"""</span>
    <span class="n">durations</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_runs</span><span class="p">):</span>
        <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">monotonic</span><span class="p">()</span>
        <span class="n">_</span> <span class="o">=</span> <span class="n">miner</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
        <span class="n">end</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">monotonic</span><span class="p">()</span>
        <span class="n">durations</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">durations</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">durations</span><span class="p">)</span>

<span class="n">batch_sizes</span> <span class="o">=</span> <span class="p">[</span><span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">1024</span><span class="p">]</span>
<span class="n">miners</span> <span class="o">=</span> <span class="p">[(</span><span class="s">'pml'</span><span class="p">,</span> <span class="n">pml_miner</span><span class="p">),</span> <span class="p">(</span><span class="s">'my'</span><span class="p">,</span> <span class="n">my_miner</span><span class="p">)]</span>

<span class="n">rows</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">batch_size</span> <span class="ow">in</span> <span class="n">batch_sizes</span><span class="p">:</span>
    <span class="n">batch_embeddings</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">[:</span><span class="n">batch_size</span><span class="p">]</span>
    <span class="n">batch_labels</span> <span class="o">=</span> <span class="n">labels</span><span class="p">[:</span><span class="n">batch_size</span><span class="p">]</span>

    <span class="n">batch_cuda_embeddings</span> <span class="o">=</span> <span class="n">cuda_embeddings</span><span class="p">[:</span><span class="n">batch_size</span><span class="p">]</span>
    <span class="n">batch_cuda_labels</span> <span class="o">=</span> <span class="n">cuda_labels</span><span class="p">[:</span><span class="n">batch_size</span><span class="p">]</span>

    <span class="k">for</span> <span class="n">miner_name</span><span class="p">,</span> <span class="n">miner</span> <span class="ow">in</span> <span class="n">miners</span><span class="p">:</span>
        <span class="n">mean</span><span class="p">,</span> <span class="n">std</span> <span class="o">=</span> <span class="n">benchmark</span><span class="p">(</span><span class="n">batch_embeddings</span><span class="p">,</span> <span class="n">batch_labels</span><span class="p">,</span> <span class="n">miner</span><span class="p">,</span> <span class="n">n_runs</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
        <span class="n">rows</span><span class="p">.</span><span class="n">append</span><span class="p">({</span>
            <span class="s">"bs"</span><span class="p">:</span> <span class="n">batch_size</span><span class="p">,</span>
            <span class="s">"miner"</span><span class="p">:</span> <span class="n">miner_name</span><span class="p">,</span>
            <span class="s">"duration_ms"</span><span class="p">:</span> <span class="n">mean</span><span class="p">,</span>
            <span class="s">"duration_std"</span> <span class="p">:</span> <span class="n">std</span><span class="p">,</span>
            <span class="s">"device"</span><span class="p">:</span> <span class="s">"cpu"</span>
        <span class="p">})</span>

        <span class="n">mean</span><span class="p">,</span> <span class="n">std</span> <span class="o">=</span> <span class="n">benchmark</span><span class="p">(</span><span class="n">batch_cuda_embeddings</span><span class="p">,</span> <span class="n">batch_cuda_labels</span><span class="p">,</span> <span class="n">miner</span><span class="p">,</span> <span class="n">n_runs</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
        <span class="n">rows</span><span class="p">.</span><span class="n">append</span><span class="p">({</span>
            <span class="s">"bs"</span><span class="p">:</span> <span class="n">batch_size</span><span class="p">,</span>
            <span class="s">"miner"</span><span class="p">:</span> <span class="n">miner_name</span><span class="p">,</span>
            <span class="s">"duration_ms"</span><span class="p">:</span> <span class="n">mean</span><span class="p">,</span>
            <span class="s">"duration_std"</span> <span class="p">:</span> <span class="n">std</span><span class="p">,</span>
            <span class="s">"device"</span><span class="p">:</span> <span class="s">"cuda"</span>
        <span class="p">})</span>

<span class="n">stats_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">rows</span><span class="p">)</span>
<span class="n">fig</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">stats_df</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="s">'bs'</span><span class="p">,</span> <span class="s">'duration_ms'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'miner'</span><span class="p">))</span>
    <span class="o">+</span> <span class="n">geom_line</span><span class="p">()</span>
    <span class="o">+</span> <span class="n">geom_point</span><span class="p">()</span>
    <span class="o">+</span> <span class="n">facet_wrap</span><span class="p">(</span><span class="s">'device'</span><span class="p">,</span> <span class="n">nrow</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">scales</span><span class="o">=</span><span class="s">'free'</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">"Batch Hard Mining Performance Benchmark"</span><span class="p">,</span> <span class="n">subtitle</span><span class="o">=</span><span class="s">"Ours vs pytorch-metric-learning"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"Duration (ms)"</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"Batch Size"</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">display</span><span class="p">(</span><span class="n">fig</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<p>As seen in the figure above, our implementation is much faster than the one in the pytorch-metric-learning library. On the CPU, ours is about 3 times faster for almost all batch sizes. The difference is even more visible when the embeddings and labels are on the GPU: for almost all batch sizes, our implementation takes about 0.3 milliseconds, whereas the pml implementation takes more time as the batch size increases. At a batch size of 1024, pml takes about 11.55 ms vs our 0.28 ms, a roughly 41x improvement in runtime performance.</p>
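
<p>One caveat when timing GPU code: CUDA kernels launch asynchronously, so wall-clock timings can be misleading unless we synchronize around the measured region. A stricter variant of the benchmark helper might look like the following (a sketch; <code class="language-plaintext highlighter-rouge">benchmark_cuda</code> is a hypothetical helper, not part of the code above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time
import numpy as np
import torch

def benchmark_cuda(embeddings, labels, miner, n_runs=10):
    """Returns avg and std of mining time in ms, synchronizing CUDA around each run."""
    durations = []
    for _ in range(n_runs):
        torch.cuda.synchronize()  # wait for any pending kernels before starting the clock
        start = time.monotonic()
        _ = miner(embeddings, labels)
        torch.cuda.synchronize()  # make sure the mining kernels actually finished
        durations.append((time.monotonic() - start) * 1000)
    return np.mean(durations), np.std(durations)
</code></pre></div></div>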

<p>I didn’t benchmark the miner from “Open Metric Learning”, but since it uses for-loops, I am quite sure it will be slow compared to ours or pml’s. Having said that, since model training typically takes hours, if not days, small sub-second differences should not be a big issue. Just choose a library that suits your needs.</p>

<h1 id="conclusion">Conclusion</h1>
<p>In this post we explored one of many ways to mine triplets on the fly. There are two libraries which provide many other mining approaches as well as loss functions: one is called <a href="https://kevinmusgrave.github.io/pytorch-metric-learning/">pytorch-metric-learning</a> and another is <a href="https://github.com/OML-Team/open-metric-learning">Open Metric Learning</a>. You can check out those libraries for more details.</p>

<p>I hope this post was useful. Please let me know if you found any errors.</p>]]></content><author><name>Sanjaya Subedi</name></author><category term="DeepLearning" /><summary type="html"><![CDATA[Triplet Loss and Online triplet mining for metric learning]]></summary></entry><entry><title type="html">Decoding strategies in Decoder models (LLMs)</title><link href="https://sanjayasubedi.com.np/deeplearning/decoding-strategies/" rel="alternate" type="text/html" title="Decoding strategies in Decoder models (LLMs)" /><published>2024-09-25T18:04:00+00:00</published><updated>2024-09-25T18:04:00+00:00</updated><id>https://sanjayasubedi.com.np/deeplearning/decoding-strategies</id><content type="html" xml:base="https://sanjayasubedi.com.np/deeplearning/decoding-strategies/"><![CDATA[<h1 id="introduction">Introduction</h1>
<p>Generative models such as GPT generate one token at a time. How we choose the next token plays a very important role in the generated text. There are a few approaches to do this. In this post we’ll cover Greedy sampling, Top-P sampling and Top-K sampling. We’ll also look at how the Temperature parameter affects the overall generation process.</p>

<h1 id="setup">Setup</h1>
<p>I will demonstrate the concepts along with the code so that you can follow along. First, let’s import a few libraries.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">lets_plot</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">LetsPlot</span><span class="p">.</span><span class="n">setup_html</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Next we need a model. For this, I’ve used a model I trained as shown in the <a href="/deeplearning/transformer-decoder/">Transformer Decoder post</a>, but you can use any model from the HuggingFace Hub.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>
<span class="c1"># I used this tokenizer to train the model I trained earlier so I'll use this one
# but feel free to switch to any tokenizer
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"sentence-transformers/all-MiniLM-L6-v2"</span><span class="p">)</span>
<span class="n">gpt</span> <span class="o">=</span> <span class="p">...</span> <span class="c1"># load a model from HuggingFace Hub. I loaded my model from the disk
</span></pre></td></tr></tbody></table></code></pre></div></div>

<p>Before we dive in, let’s recap how we generate texts.</p>

<div class="mermaid">
graph LR;
    Text -- tokenize --&gt; InputIds
    InputIds --&gt; GPT
    GPT --&gt; Logits
    Logits --&gt; NextTokenId[Sample next token]
    NextTokenId --&gt; IsEOS{Is next token == EOS <br /> or Max Length reached}
    IsEOS -- yes --&gt; Stop

    IsEOS -- no --&gt; NextTokenId2[Next Token Id]
    NextTokenId2 -- append --&gt; InputIds

    style NextTokenId fill:#f9f
</div>

<p>Below is a basic implementation of a <code class="language-plaintext highlighter-rouge">generate</code> function which generates text using the model. This function implements ‘Greedy sampling’. After reading this post you can implement the other approaches as well.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">generate</span><span class="p">(</span><span class="n">gpt</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">initial_text</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">max_len</span><span class="p">:</span> <span class="nb">int</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">temperature</span><span class="p">:</span> <span class="nb">float</span><span class="o">=</span><span class="mf">1.0</span><span class="p">):</span>
    <span class="n">gpt</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
    <span class="n">input_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">cls_token_id</span><span class="p">]</span>
    <span class="k">if</span> <span class="n">initial_text</span><span class="p">:</span>
        <span class="c1"># tokenizer add SEP token at the end, do not include that one
</span>        <span class="n">input_ids</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">initial_text</span><span class="p">)[</span><span class="s">'input_ids'</span><span class="p">][:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># type: ignore
</span>
    <span class="c1"># you can also check only for newly generated tokens
</span>    <span class="k">while</span> <span class="nb">len</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">max_len</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="n">gpt</span><span class="p">(</span><span class="n">input_ids</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">input_ids</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
        
        <span class="c1"># take the logits of the last token and scale by temperature
</span>        <span class="n">logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">/</span> <span class="n">temperature</span>
        
        <span class="c1"># greedy sampling. take the token with max "probability"
</span>        <span class="c1"># this is where we can implement different sampling strategies
</span>        <span class="n">next_token_id</span> <span class="o">=</span> <span class="n">logits</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">).</span><span class="n">item</span><span class="p">()</span>
        <span class="n">input_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">next_token_id</span><span class="p">)</span>
        <span class="c1"># I've trained the model to use `sep_token_id` as an indicator for End of Sentence token.
</span>        <span class="c1"># depending on the tokenizer and the model you might have to adjust this.
</span>        <span class="k">if</span> <span class="n">next_token_id</span> <span class="o">==</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">sep_token_id</span><span class="p">:</span>
            <span class="k">break</span>

    <span class="k">return</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Ok, now that we know how we generate texts, let’s explore different strategies.</p>

<h1 id="temperature">Temperature</h1>
<p>First, let’s discuss temperature. You might have already seen this parameter when using APIs for LLMs. Its value is typically constrained between 0 and 1. A higher temperature makes the model more creative, which is useful when you are generating stories. Lower values can be used to force the model to be more deterministic. For example, if you want the model to extract all named entities in the input, you might want to lower the temperature to, say, around 0.1 or even lower.</p>

<p>One thing to note is that the temperature parameter is used to scale the <code class="language-plaintext highlighter-rouge">logits</code>. So if we are using Greedy sampling, i.e. choosing the token with the highest logit value, then whatever value we use for the temperature will not affect the result: the relative order of the logits after scaling won’t change at all.</p>

<p>But temperature plays an important role when we actually sample, i.e. randomly choose a token based on its probability. When we convert the logits (either raw or scaled by temperature) to probabilities using the softmax function, the resulting distribution changes depending on the temperature.</p>
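<p>As a toy illustration with made-up logit values, the snippet below shows that the greedy choice (argmax) is identical at every temperature, while the distribution we would sample from changes dramatically:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

logits = torch.tensor([2.0, 1.8, 1.0, 0.5, 0.1])  # made-up logits for 5 tokens
for temp in [0.1, 1.0, 10.0]:
    probs = torch.softmax(logits / temp, dim=-1)
    # the argmax never changes, but the probabilities flatten as temp grows
    print(temp, probs.argmax().item(), probs.round(decimals=3))
</code></pre></div></div>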

<p>Let me give a concrete example. My prompt is <strong>microsoft to pay 3.5 billion to settle</strong> and I’m asking the model to predict the next token.
The model returns <code class="language-plaintext highlighter-rouge">logits</code>, or un-normalized scores, for 30,522 tokens because that is the vocabulary size of the model. Below, I’ve selected only the top 5 tokens (Top-K sampling) based on their logit values and then plotted the data.
<img src="/assets/images/deep-learning/decoding-strategies/temperature_comparison.png" alt="temperature comparison" /></p>

<p>Let’s first focus on the plot where the temperature is 1.0 (temp@1), i.e. the logits are not changed because we are just dividing by 1. This will be our baseline for comparison.</p>

<p>The token <code class="language-plaintext highlighter-rouge">charges</code> has a probability of 0.29, the word <code class="language-plaintext highlighter-rouge">with</code> has a probability of 0.26, and so on. This means that if we were to sample from this probability distribution, we would have a 29% chance of choosing the word <code class="language-plaintext highlighter-rouge">charges</code> as the next token, a 17% chance of selecting the word <code class="language-plaintext highlighter-rouge">anti</code> and a 14% chance of selecting the word <code class="language-plaintext highlighter-rouge">in</code>.</p>

<p>Now let’s switch to the case with the lowest temperature (temp@0.1). Here we see the word <code class="language-plaintext highlighter-rouge">charges</code> has a 68% chance of being the next token and the word <code class="language-plaintext highlighter-rouge">with</code> has 32%. The remaining 3 words have no chance at all. So basically we are “magnifying” the probabilities of the tokens with slightly higher logits. This is what limits the “creativity” of the model: the possible choices of tokens shrink because many of them end up with very low probability. It also means that for tasks where such creativity is not needed, such as entity extraction or extractive question answering, a lower temperature is more appropriate.</p>

<p>If we go a bit extreme and set the temperature to 10, we now see that the probabilities of all tokens are almost the same. If we were to sample from this distribution, all tokens would have almost the same chance of being selected as the next token. This basically nullifies the “work” that the model has done and is almost equivalent to sampling from a uniform distribution. We could just randomly select a token from the vocabulary instead of using a model! This is why almost all LLM APIs limit the range of the temperature between 0 and 1.</p>

<p>We can also calculate the entropy of the probability distribution at each temperature we used. As we increase the temperature, the entropy increases, indicating more uncertainty. E.g. when the temperature is 0.1, only two tokens have non-zero probability, so we are fairly confident about which token will come next. When the temperature is 10, all 5 tokens have almost the same probability, so we are much less certain about which token will be selected.
<img src="/assets/images/deep-learning/decoding-strategies/temperature_entropy.png" alt="temperature comparison" /></p>

<p>The code to generate the plots above is down below if you want to try it for yourself.</p>
<details>
<summary>Click to expand code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">get_logits</span><span class="p">(</span><span class="n">initial_text</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="n">gpt</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
    <span class="n">input_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">cls_token_id</span><span class="p">]</span>
    <span class="k">if</span> <span class="n">initial_text</span><span class="p">:</span>
        <span class="c1"># tokenizer add SEP token at the end, do not include that one
</span>        <span class="n">input_ids</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">initial_text</span><span class="p">)[</span><span class="s">'input_ids'</span><span class="p">][:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># type: ignore
</span>    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="n">gpt</span><span class="p">(</span><span class="n">input_ids</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">input_ids</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">))[</span><span class="mi">0</span><span class="p">]</span>
        <span class="c1"># get logits of last token
</span>        <span class="n">logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">logits</span>

<span class="n">initial_text</span> <span class="o">=</span> <span class="s">"microsoft to pay 3.5 billion to settle"</span>
<span class="n">logits</span> <span class="o">=</span> <span class="n">get_logits</span><span class="p">(</span><span class="n">initial_text</span><span class="o">=</span><span class="n">initial_text</span><span class="p">)</span>
<span class="n">values</span><span class="p">,</span> <span class="n">indices</span> <span class="o">=</span> <span class="n">logits</span><span class="p">.</span><span class="n">topk</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">indices</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"token"</span><span class="p">:</span> <span class="n">tokens</span><span class="p">,</span> <span class="s">"token_id"</span><span class="p">:</span> <span class="n">indices</span><span class="p">,</span> <span class="s">"logit"</span><span class="p">:</span> <span class="n">values</span><span class="p">})</span>
<span class="n">prob_columns</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">temp</span> <span class="ow">in</span> <span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.7</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">]:</span>
    <span class="n">prob_column</span> <span class="o">=</span> <span class="sa">f</span><span class="s">'temp@</span><span class="si">{</span><span class="n">temp</span><span class="si">}</span><span class="s">'</span>
    <span class="n">df</span><span class="p">[</span><span class="n">prob_column</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">values</span> <span class="o">/</span> <span class="n">temp</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">prob_columns</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">prob_column</span><span class="p">)</span>

<span class="n">bunch</span> <span class="o">=</span> <span class="n">GGBunch</span><span class="p">()</span>
<span class="n">bunch</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span>
<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="s">'token'</span><span class="p">,</span> <span class="n">value_vars</span><span class="o">=</span><span class="n">prob_columns</span><span class="p">,</span> <span class="n">var_name</span><span class="o">=</span><span class="s">'temperature'</span><span class="p">,</span> <span class="n">value_name</span><span class="o">=</span><span class="s">'prob'</span><span class="p">),</span> <span class="n">aes</span><span class="p">(</span><span class="s">'token'</span><span class="p">,</span> <span class="s">'prob'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'prob'</span><span class="p">))</span> 
    <span class="o">+</span> <span class="n">geom_bar</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="s">'temperature'</span><span class="p">),</span> <span class="n">stat</span><span class="o">=</span><span class="s">'identity'</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">scale_fill_brewer</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="s">'div'</span><span class="p">,</span> <span class="n">palette</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">geom_text</span><span class="p">(</span><span class="n">label_format</span><span class="o">=</span><span class="s">".2f"</span><span class="p">,</span> <span class="n">nudge_y</span><span class="o">=</span><span class="mf">0.02</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="sa">f</span><span class="s">"Next token after '</span><span class="si">{</span><span class="n">initial_text</span><span class="si">}</span><span class="s">'"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'probability'</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">facet_wrap</span><span class="p">(</span><span class="s">'temperature'</span><span class="p">,</span> <span class="n">scales</span><span class="o">=</span><span class="s">'free'</span><span class="p">,</span> <span class="n">ncol</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">900</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span> <span class="c1"># type: ignore
</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span>
<span class="n">bunch</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span>
<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"entropy"</span><span class="p">:</span> <span class="n">scipy</span><span class="p">.</span><span class="n">stats</span><span class="p">.</span><span class="n">entropy</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">prob_columns</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span> <span class="s">"temperature"</span><span class="p">:</span> <span class="n">prob_columns</span><span class="p">}),</span> <span class="n">aes</span><span class="p">(</span><span class="s">'temperature'</span><span class="p">,</span> <span class="s">'entropy'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'entropy'</span><span class="p">))</span>
    <span class="o">+</span> <span class="n">geom_bar</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="s">'temperature'</span><span class="p">),</span> <span class="n">stat</span><span class="o">=</span><span class="s">'identity'</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">scale_fill_brewer</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="s">'div'</span><span class="p">,</span> <span class="n">palette</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">geom_text</span><span class="p">(</span><span class="n">label_format</span><span class="o">=</span><span class="s">".2f"</span><span class="p">,</span> <span class="n">nudge_y</span><span class="o">=-</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"black"</span><span class="p">)</span>
    <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">"Entropy of the probability distribution at different temperatures"</span><span class="p">)</span>
<span class="p">),</span> <span class="mi">900</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">600</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span> <span class="c1"># type: ignore
</span>
<span class="n">bunch</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<h1 id="top-k-sampling">Top K Sampling</h1>
<p>Top-K sampling is a very simple approach and yields pretty good results. Basically the idea is that we sort the logits in descending order and take the top K logits. Then using these top K logits, we calculate the probability distribution as shown in the code below.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">top_logits</span><span class="p">,</span> <span class="n">top_indices</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">topk</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">top_probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">top_logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Then we sample the next token using the <code class="language-plaintext highlighter-rouge">top_probs</code> probability distribution.</p>
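<p>In code, this sampling step can be done with <code class="language-plaintext highlighter-rouge">torch.multinomial</code>; the sampled position is an index into the top-K subset, so we map it back to the original vocabulary id:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># sample one position from the renormalized top-k distribution
sampled = torch.multinomial(top_probs, num_samples=1)
# map the position within the top-k back to the actual token id
next_token_id = top_indices[sampled].item()
</code></pre></div></div>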

<p>But how does this change the results? Let’s look at the probability distribution of the Top-5 tokens. The red bar indicates the probabilities of the tokens calculated from the top 5 logits and the blue bar indicates the probabilities of the same tokens but calculated using the entire set of logits.</p>

<p><img src="/assets/images/deep-learning/decoding-strategies/topk.png" alt="top k" /></p>

<p>We see a dramatic difference in the probability values. For example, the word <code class="language-plaintext highlighter-rouge">charges</code> has about a 28% chance of being selected using the Top-K method vs only about a 12% chance when considering all logits.</p>

<p>How I interpret this is that instead of “distributing” the probability mass across the entire vocabulary (~30K tokens in this case), where most of the tokens are irrelevant anyway, we only “distribute” it among the top-K relevant ones.</p>
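<p>You can check this redistribution yourself: the softmax over a subset of logits is exactly the full-vocabulary probabilities of those tokens renormalized by their total mass. A quick sanity check using the <code class="language-plaintext highlighter-rouge">logits</code>, <code class="language-plaintext highlighter-rouge">top_indices</code> and <code class="language-plaintext highlighter-rouge">top_probs</code> from above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>full_probs = torch.softmax(logits, dim=-1)
mass = full_probs[top_indices].sum()  # probability mass held by the top 5 tokens
# renormalizing the full-vocabulary probabilities recovers the top-k probabilities
print(torch.allclose(full_probs[top_indices] / mass, top_probs))  # True
</code></pre></div></div>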

<details>
<summary>Click to expand code to generate plot above</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="rouge-code"><pre><span class="n">top_logits</span><span class="p">,</span> <span class="n">top_indices</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">topk</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">top_probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">top_logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># to compare let's calculate probabilities using entire logit
</span><span class="n">probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)[</span><span class="n">top_indices</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
    <span class="s">"token"</span><span class="p">:</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">top_indices</span><span class="p">),</span>
    <span class="s">"top_k"</span><span class="p">:</span> <span class="n">top_probs</span><span class="p">,</span>
    <span class="s">"all"</span><span class="p">:</span> <span class="n">probs</span><span class="p">,</span>
<span class="p">}).</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="s">'token'</span><span class="p">,</span> <span class="n">var_name</span><span class="o">=</span><span class="s">'method'</span><span class="p">,</span> <span class="n">value_name</span><span class="o">=</span><span class="s">'probability'</span><span class="p">)</span>
<span class="p">(</span>
    <span class="n">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="s">'token'</span><span class="p">,</span> <span class="s">'probability'</span><span class="p">))</span>
    <span class="o">+</span> <span class="n">geom_bar</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="s">'method'</span><span class="p">),</span> <span class="n">stat</span><span class="o">=</span><span class="s">'identity'</span><span class="p">,</span> <span class="n">position</span><span class="o">=</span><span class="s">'dodge'</span><span class="p">)</span>
<span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<h1 id="top-p-sampling">Top P Sampling</h1>
<p>Top-P sampling is another approach similar to Top-K, but instead of a hard threshold like the top 5 or top 10 tokens, we select a dynamic number of tokens based on a cumulative probability <code class="language-plaintext highlighter-rouge">p</code>. To be concrete, let’s say we have <code class="language-plaintext highlighter-rouge">p = 0.61</code>.</p>

<p>The right plot below shows the probabilities of the top 15 tokens and the left one shows the cumulative probabilities of these tokens.
<img src="/assets/images/deep-learning/decoding-strategies/topp.png" alt="top p" /></p>

<p>Since our threshold is <code class="language-plaintext highlighter-rouge">p = 0.61</code>, we select the tokens whose cumulative probability is less than or equal to <code class="language-plaintext highlighter-rouge">p</code>. In this case, we select the tokens starting from <code class="language-plaintext highlighter-rouge">charges</code> up to <code class="language-plaintext highlighter-rouge">a</code>, i.e. 10 tokens are selected.</p>

<p>Now, based on the logits of only these tokens, we recalculate the probabilities.</p>

<p>The image below compares the probabilities of the tokens computed using the entire set of logits vs using only the Top-P logits. As with the Top-K method, we’ve amplified the probabilities of the relevant tokens. For example, the word <code class="language-plaintext highlighter-rouge">charges</code> has a 20% chance of being the next token using the Top-P method compared to 12% using all logits. Note that the same token had a 28% chance when using Top-K with k=5.
<img src="/assets/images/deep-learning/decoding-strategies/topp_probs.png" alt="top p prob" /></p>

<p>The code below should make things clear.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="rouge-code"><pre><span class="n">top_p</span> <span class="o">=</span> <span class="mf">0.61</span>

<span class="c1"># sort the logits
</span><span class="n">sorted_logits</span><span class="p">,</span> <span class="n">sorted_indices</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">descending</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">sorted_probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">sorted_logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># compute cumulative probabilities
</span><span class="n">cum_probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">sorted_probs</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># create a mask indicating if cumulative probability is less than the top_p
</span><span class="n">valid_mask</span> <span class="o">=</span> <span class="n">cum_probs</span> <span class="o">&lt;=</span> <span class="n">top_p</span>
<span class="c1"># find the cutoff index
</span><span class="n">cutoff_index</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nonzero</span><span class="p">(</span><span class="n">valid_mask</span><span class="p">,</span> <span class="n">as_tuple</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="nb">max</span><span class="p">().</span><span class="n">item</span><span class="p">()</span>
<span class="c1"># get the token indices and their probabilities upto and including the cutoff index
</span><span class="n">valid_indices</span> <span class="o">=</span> <span class="n">sorted_indices</span><span class="p">[:</span><span class="n">cutoff_index</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
<span class="c1"># calculate the probabilities again using subset of logits
</span><span class="n">valid_probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">sorted_logits</span><span class="p">[:</span><span class="n">cutoff_index</span><span class="o">+</span><span class="mi">1</span><span class="p">],</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>By the way, the Top-K and Top-P approaches can also be used together, as sketched below.</p>
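<p>Here is a minimal sketch of such a combination (my own composition, not a standard library function): first restrict to the top K logits, then apply the cumulative-probability cutoff within them.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def top_k_top_p_sample(logits, k: int = 50, p: float = 0.9):
    # keep the k largest logits (torch.topk returns them sorted in descending order)
    top_logits, top_indices = torch.topk(logits, k=k)
    probs = torch.softmax(top_logits, dim=-1)
    # apply the top-p cutoff within the top-k tokens
    valid_mask = torch.cumsum(probs, dim=-1) &lt;= p
    valid_mask[0] = True  # always keep at least the most likely token
    valid_probs = torch.softmax(top_logits[valid_mask], dim=-1)
    sampled = torch.multinomial(valid_probs, num_samples=1)
    return top_indices[valid_mask][sampled].item()
</code></pre></div></div>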

<h1 id="implementation">Implementation</h1>
<p>Below are the implementations of the 3 approaches: Greedy, TopK and TopP.</p>

<h2 id="greedy">Greedy</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">GreedySampling</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">next_token</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">logits</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">logits</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">).</span><span class="n">item</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="topk">TopK</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">TopKSampling</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">k</span> <span class="o">=</span> <span class="n">k</span>

    <span class="k">def</span> <span class="nf">next_token</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">logits</span><span class="p">):</span>
        <span class="n">values</span><span class="p">,</span> <span class="n">indices</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">topk</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">k</span><span class="p">)</span>
        <span class="n">probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">values</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">next_token_id</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">multinomial</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="n">num_samples</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">indices</span><span class="p">[</span><span class="n">next_token_id</span><span class="p">].</span><span class="n">item</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="topp">TopP</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">TopPSampling</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">p</span><span class="p">:</span> <span class="nb">float</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">p</span> <span class="o">=</span> <span class="n">p</span>

    <span class="k">def</span> <span class="nf">next_token</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">logits</span><span class="p">):</span>
        <span class="c1"># sort the logits
</span>        <span class="n">sorted_logits</span><span class="p">,</span> <span class="n">sorted_indices</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">descending</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="n">sorted_probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">sorted_logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="c1"># compute cumulative probabilities
</span>        <span class="n">cum_probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">sorted_probs</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="c1"># create a mask indicating if cumulative probability is less than the top_p
</span>        <span class="n">valid_mask</span> <span class="o">=</span> <span class="n">cum_probs</span> <span class="o">&lt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span>
        <span class="c1"># sometimes the first token itself might have higher probability than the cumulative probability
</span>        <span class="c1"># this is case, set the first one to be valid
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">valid_mask</span><span class="p">.</span><span class="nb">any</span><span class="p">():</span>
            <span class="n">valid_mask</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="bp">True</span>
        <span class="c1"># find the cutoff index
</span>        <span class="n">cutoff_index</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nonzero</span><span class="p">(</span><span class="n">valid_mask</span><span class="p">,</span> <span class="n">as_tuple</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="nb">max</span><span class="p">().</span><span class="n">item</span><span class="p">()</span>
        <span class="c1"># get the token indices and their probabilities upto and including the cutoff index
</span>        <span class="n">valid_indices</span> <span class="o">=</span> <span class="n">sorted_indices</span><span class="p">[:</span><span class="n">cutoff_index</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
        <span class="c1"># calculate the probabilities again using subset of logits
</span>        <span class="n">valid_probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">sorted_logits</span><span class="p">[:</span><span class="n">cutoff_index</span><span class="o">+</span><span class="mi">1</span><span class="p">],</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

        <span class="n">next_token_id</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">multinomial</span><span class="p">(</span><span class="n">valid_probs</span><span class="p">,</span> <span class="n">num_samples</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">valid_indices</span><span class="p">[</span><span class="n">next_token_id</span><span class="p">].</span><span class="n">item</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The code below is a refactored version of the <code class="language-plaintext highlighter-rouge">generate</code> method that accepts different sampling strategies.</p>

<details>
<summary>Click to expand code for generator method</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">generate</span><span class="p">(</span><span class="n">gpt</span><span class="p">,</span> <span class="n">sampler</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">initial_text</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">max_len</span><span class="p">:</span> <span class="nb">int</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">0.4</span><span class="p">):</span>
    <span class="n">input_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">cls_token_id</span><span class="p">]</span>
    <span class="k">if</span> <span class="n">initial_text</span><span class="p">:</span>
        <span class="c1"># tokenizer add SEP token at the end, do not include that one
</span>        <span class="n">input_ids</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">initial_text</span><span class="p">)[</span><span class="s">'input_ids'</span><span class="p">][:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># type: ignore
</span>
    <span class="n">gpt</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
    <span class="k">while</span> <span class="nb">len</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">max_len</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="n">gpt</span><span class="p">(</span><span class="n">input_ids</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">input_ids</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
        <span class="c1"># take the logits of the last token
</span>        <span class="n">logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">/</span> <span class="n">temperature</span>

        <span class="n">next_token_id</span> <span class="o">=</span> <span class="n">sampler</span><span class="p">.</span><span class="n">next_token</span><span class="p">(</span><span class="n">logits</span><span class="o">=</span><span class="n">logits</span><span class="p">)</span>

        <span class="n">input_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">next_token_id</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">next_token_id</span> <span class="o">==</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">sep_token_id</span><span class="p">:</span>
            <span class="k">break</span>

    <span class="k">return</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>

<span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">partial</span>
<span class="n">greedy</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span><span class="n">generate</span><span class="p">,</span> <span class="n">gpt</span><span class="o">=</span><span class="n">gpt</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">GreedySampling</span><span class="p">(),</span> <span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">)</span>
<span class="n">topk</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span><span class="n">generate</span><span class="p">,</span> <span class="n">gpt</span><span class="o">=</span><span class="n">gpt</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">TopKSampling</span><span class="p">(</span><span class="n">k</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span> <span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">)</span>
<span class="n">topp</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span><span class="n">generate</span><span class="p">,</span> <span class="n">gpt</span><span class="o">=</span><span class="n">gpt</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">TopPSampling</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="mf">0.9</span><span class="p">),</span> <span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">)</span>    
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<p>Let’s try to generate a few texts using the different sampling strategies. Please note that the model I used is one I trained only on a news dataset, and it has only about 43 million parameters, so the outputs are still somewhat subpar. If you use a different model, you’ll probably see better results.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="n">initial_text</span> <span class="o">=</span> <span class="s">"Nvidia and microsoft"</span>
<span class="n">temperature</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Greedy: </span><span class="si">{</span><span class="n">greedy</span><span class="p">(</span><span class="n">initial_text</span><span class="o">=</span><span class="n">initial_text</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"TopK  : </span><span class="si">{</span><span class="n">topk</span><span class="p">(</span><span class="n">initial_text</span><span class="o">=</span><span class="n">initial_text</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"TopP  : </span><span class="si">{</span><span class="n">topp</span><span class="p">(</span><span class="n">initial_text</span><span class="o">=</span><span class="n">initial_text</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>Greedy: nvidia and microsoft join forces to develop protocols to the graphics chip maker is teaming up with microsoft to develop a new desktop computer that allows users to view their computers.
TopK  : nvidia and microsoft join forces to develop protocols to the graphics software turbocadas in the past two months, the company is teaming up with its entertainment and microsoft to develop a custom graphics technology.
TopP  : nvidia and microsoft join forces to develop new gpu for the gpu and the graphics engine is the latest in the world.
</pre></td></tr></tbody></table></code></pre></div></div>

<p>For the same initial text, with the temperature at 0.9, allowing the model to be more creative, we get the following. Note that the output of the Greedy approach does not change at all.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>Greedy: nvidia and microsoft join forces to develop protocols to the graphics chip maker is teaming up with microsoft to develop a new desktop computer that allows users to view their computers.
TopK  : nvidia and microsoft develop gpu for gp ( ziff davis ) ziff davis - the nvidia graphics processor has developed into the gpu of its turbocache conference, the companies said thursday.
TopP  : nvidia and microsoft prepare to open content the graphics chip maker # 39 ; s december 7, 2004 - microsoft and cisco are working on a project to give governments access to its content management software.
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Let’s change the temperature to the rather extreme value of 10. The output from TopK and TopP is complete garbage.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>Greedy: nvidia and microsoft join forces to develop protocols to the graphics chip maker is teaming up with microsoft to develop a new desktop computer that allows users to view their computers.
TopK  : nvidia and microsoft hope to push for a more positive release from the gpuquan tool will soon be called upping. it was also one of their features a great chance. that microsoft # 1
TopP  : nvidia and microsoft extreme ponder planes knocking dipped believes32 &amp; 350 abaivated foolishsibility harderrry face helping give strapped distributors containing 111 kilometer retail 134 officials electrified acres reagan trek relying liverpool tee fix delight settlers30
</pre></td></tr></tbody></table></code></pre></div></div>

<h1 id="conclusion">Conclusion</h1>
<p>In this post we explored 3 ways of choosing the next token when generating text. As a summary, you should use a lower temperature for precise answers and a higher temperature for open-ended generation. Generally the Top-P and Top-K sampling methods are used, and they can also be combined. There are other strategies as well, like Beam search, which I didn’t discuss here.</p>

<p>If you are using models from HuggingFace, then refer to <a href="https://huggingface.co/blog/how-to-generate">this post</a> from HuggingFace for more details. You can also refer to this post: <a href="https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration">Generation Strategies</a> for more details about how to configure the generation method. Models in HuggingFace support all kinds of strategies, so it is better to use those whenever you can.</p>
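<p>For example, with a HuggingFace model the strategies covered in this post map directly onto arguments of its <code class="language-plaintext highlighter-rouge">generate</code> method (the parameter values below are just illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # "gpt2" is just an example model
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Nvidia and microsoft", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,     # sample instead of greedy decoding
    top_k=50,           # Top-K sampling
    top_p=0.9,          # Top-P (nucleus) sampling, combined with top_k
    temperature=0.7,    # scale the logits before sampling
    max_new_tokens=40,
)
print(tok.decode(out[0], skip_special_tokens=True))
</code></pre></div></div>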

<p>I hope you found this useful. Please let me know if there are any errors.</p>]]></content><author><name>Sanjaya Subedi</name></author><category term="DeepLearning" /><summary type="html"><![CDATA[Explore Greedy, Top P and Top K sampling strategies in Generative Language Models]]></summary></entry><entry><title type="html">Implementing Transformer Encoder Layer From Scratch</title><link href="https://sanjayasubedi.com.np/deeplearning/transformer-encoder/" rel="alternate" type="text/html" title="Implementing Transformer Encoder Layer From Scratch" /><published>2024-09-22T18:04:00+00:00</published><updated>2024-09-22T18:04:00+00:00</updated><id>https://sanjayasubedi.com.np/deeplearning/transformer-encoder</id><content type="html" xml:base="https://sanjayasubedi.com.np/deeplearning/transformer-encoder/"><![CDATA[<h1 id="introduction">Introduction</h1>
<p>In this post we’ll implement the Transformer’s Encoder layer from scratch. This was introduced in a paper called <a href="https://arxiv.org/pdf/1706.03762">Attention Is All You Need</a>. This layer is typically used to build Encoder-only models like BERT, which excel at tasks like classification, clustering and semantic search.</p>

<p>The figure below (taken from the paper above) shows the architecture of an Encoder network.
<img src="/assets/images/deep-learning/transformer-encoder/encoder.png" alt="encoder block" /></p>

<p>An encoder network consists of N Encoder layers. Each Encoder layer consists of a <code class="language-plaintext highlighter-rouge">MultiHeadAttention</code> layer, followed by <code class="language-plaintext highlighter-rouge">LayerNorm</code>. The output of the <code class="language-plaintext highlighter-rouge">LayerNorm</code> is then passed to a <code class="language-plaintext highlighter-rouge">Feed Forward</code> network, which is again followed by another <code class="language-plaintext highlighter-rouge">LayerNorm</code>. The outputs from the Encoder network can then be passed to further layers depending on the task. For example, for a sentence classification task, we can pass the output embeddings to a classification head to produce class probabilities.</p>

<h1 id="implementation">Implementation</h1>
<p>Let’s start by defining a single Encoder layer. As seen in the figure above, we need a <a href="https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#multiheadattention">MultiHeadAttention</a> layer, a couple of <a href="https://pytorch.org/docs/stable/generated/torch.ao.nn.quantized.LayerNorm.html#layernorm">LayerNorm</a> layers and a Feed Forward block.</p>

<p>The Feed Forward block is mentioned in section 3.3 of the paper, where it is called “Position-wise Feed-Forward Networks”. This is a simple “block” consisting of <code class="language-plaintext highlighter-rouge">Linear -&gt; ReLU -&gt; Linear</code> layers. The output size of the first Linear layer is defined by the parameter <code class="language-plaintext highlighter-rouge">dim_feedforward</code>, and the authors used 2048 as its value. The output size of the last Linear layer is the same as the input embedding dimension.
In code it looks like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
    <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">),</span>
    <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
    <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>We also need a couple of <a href="https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html#dropout">Dropout</a> layers. Dropout layers are not shown in the figure, but the authors mention their usage in section 5.4 of the paper: dropout is applied to the output of each sub-layer before it is added to the sub-layer input and normalized.</p>
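<p>In code, each sub-layer connection therefore follows the pattern sketched below (the post-norm arrangement described in the paper; the function and variable names are mine):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def sublayer_connection(x, sublayer, norm, dropout):
    # dropout on the sub-layer output, then residual addition, then layer norm
    return norm(x + dropout(sublayer(x)))
</code></pre></div></div>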

<p>Now we know everything there is to know about an Encoder layer. The code below shows the implementation of <code class="language-plaintext highlighter-rouge">EncoderLayer</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
</pre></td><td class="rouge-code"><pre><span class="c1"># import some libraries we'll probably use
</span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="c1"># just used for plotting
</span><span class="kn">from</span> <span class="nn">lets_plot</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">LetsPlot</span><span class="p">.</span><span class="n">setup_html</span><span class="p">()</span>

<span class="k">class</span> <span class="nc">EncoderLayer</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">n_heads</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">dim_feedforward</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">128</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">embed_dim</span> <span class="o">=</span> <span class="n">embed_dim</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">n_heads</span> <span class="o">=</span> <span class="n">n_heads</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">mha</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">MultiheadAttention</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_heads</span><span class="o">=</span><span class="n">n_heads</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">layer_norm1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">normalized_shape</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">layer_norm2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">normalized_shape</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">)</span>

        <span class="c1"># section 5.4
</span>        <span class="c1"># apply dropout to output of each sublayer before it is added to sublayer's input
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">dropout1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="n">dropout</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dropout2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="n">dropout</span><span class="p">)</span>
        
        <span class="c1"># section 3.3 in paper
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">position_wise_ff</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Now let’s focus on the <code class="language-plaintext highlighter-rouge">forward</code> method of the <code class="language-plaintext highlighter-rouge">EncoderLayer</code> class.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">src_key_padding_mask</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">src_mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="c1"># x.shape = (batch_size, seq_len, embed_dim)
</span>    <span class="c1"># src_key_padding_mask = (bs, seq_len), True value indicates it should not attend
</span>    <span class="c1"># src_mask.shape = (bs, seq_len, seq_len) of dtype torch.bool, True value indicates it shouldn't attend
</span>    <span class="n">attn_output</span><span class="p">,</span> <span class="n">attn_weights</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">mha</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">key_padding_mask</span><span class="o">=</span><span class="n">src_key_padding_mask</span><span class="p">,</span> <span class="n">attn_mask</span><span class="o">=</span><span class="n">src_mask</span><span class="p">)</span>
    <span class="c1"># dropout and residual connection
</span>    <span class="n">x</span>  <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout1</span><span class="p">(</span><span class="n">attn_output</span><span class="p">)</span>
    <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">layer_norm1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>

    <span class="n">projection</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">position_wise_ff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="c1"># dropout and residual connection
</span>    <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout2</span><span class="p">(</span><span class="n">projection</span><span class="p">)</span>
    <span class="c1"># layer norm
</span>    <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">layer_norm2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">x</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>As mentioned above, we first pass the input embeddings <code class="language-plaintext highlighter-rouge">x</code> through the MHA layer, apply dropout to its output and add the result to the original input embeddings (the residual connection), then pass the sum through the first LayerNorm layer. The result goes through the feed-forward block, dropout, another residual connection and the second LayerNorm layer. Finally, we return the resulting embeddings.</p>

<p>I’ve already covered masking in the <a href="/deeplearning/masking-in-attention/">previous post</a>, so I won’t go over it again here.</p>

<p>Now that we have an Encoder layer, let’s create another class for the Encoder network. This class is just a simple wrapper that stacks N copies of the Encoder layer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">copy</span> <span class="kn">import</span> <span class="n">deepcopy</span>
<span class="k">class</span> <span class="nc">Encoder</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">encoder_layer</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="n">layers</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_layers</span><span class="p">):</span>
            <span class="n">layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">deepcopy</span><span class="p">(</span><span class="n">encoder_layer</span><span class="p">))</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">layers</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ModuleList</span><span class="p">(</span><span class="n">layers</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">src_key_padding_mask</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">src_mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">:</span>
            <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">src_key_padding_mask</span><span class="o">=</span><span class="n">src_key_padding_mask</span><span class="p">,</span> <span class="n">src_mask</span><span class="o">=</span><span class="n">src_mask</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">x</span>
</pre></td></tr></tbody></table></code></pre></div></div>
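<p>Before wiring the Encoder into a model, we can sanity-check that a stack of layers preserves the input shape (a standalone check using the classes above with arbitrary dimensions):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>layer = EncoderLayer(embed_dim=64, n_heads=4, dim_feedforward=128, dropout=0.1)
encoder = Encoder(encoder_layer=layer, num_layers=2)
x = torch.rand(2, 10, 64)        # (batch_size, seq_len, embed_dim)
out = encoder(x)
assert out.shape == (2, 10, 64)  # same shape as the input, ready for a task head
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">deepcopy</code> means every layer starts from identical initial weights; PyTorch’s own <code class="language-plaintext highlighter-rouge">TransformerEncoder</code> does essentially the same thing by deep-copying the layer you pass in.</p>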

<h1 id="pytorch-vs-our">Pytorch vs Our</h1>
<p>To compare our implementation against PyTorch’s, let’s build a text classification model and compare the performance. The <code class="language-plaintext highlighter-rouge">TextClassifier</code> class below implements a simple text classification model. It accepts an <code class="language-plaintext highlighter-rouge">encoder</code> parameter so that we can try out different Encoder implementations. I’ve also copied an implementation of Positional Encoding from the link shared as a comment in the code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">math</span>
<span class="k">class</span> <span class="nc">PositionalEncoding</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="c1"># source: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html#Positional-encoding
</span>    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="mi">256</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="c1"># create a matrix of [seq_len, hidden_dim] representing positional encoding for each token in sequence
</span>        <span class="n">pe</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">)</span>
        <span class="n">position</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">max_len</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">float</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># (max_len, 1)
</span>        <span class="n">div_term</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="mi">2</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span> <span class="o">*</span> <span class="p">(</span><span class="o">-</span><span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mf">10000.0</span><span class="p">)</span> <span class="o">/</span> <span class="n">embed_dim</span><span class="p">))</span>
        <span class="n">pe</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">position</span> <span class="o">*</span> <span class="n">div_term</span><span class="p">)</span>
        <span class="n">pe</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cos</span><span class="p">(</span><span class="n">position</span> <span class="o">*</span> <span class="n">div_term</span><span class="p">)</span>
        <span class="n">pe</span> <span class="o">=</span> <span class="n">pe</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">register_buffer</span><span class="p">(</span><span class="s">'pe'</span><span class="p">,</span> <span class="n">pe</span><span class="p">,</span> <span class="n">persistent</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">pe</span><span class="p">[:,</span> <span class="p">:</span><span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)]</span>
        <span class="k">return</span> <span class="n">x</span>
    
<span class="k">class</span> <span class="nc">TextClassifier</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">encoder</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">,</span> <span class="n">max_len</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">positional_encoding</span> <span class="o">=</span> <span class="n">PositionalEncoding</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="n">max_len</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">embedding</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">num_embeddings</span><span class="o">=</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">padding_idx</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">encoder</span> <span class="o">=</span> <span class="n">encoder</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">relu</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">final</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="n">num_classes</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">src_key_padding_mask</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="c1"># inputs: (bs, seq_len)
</span>        <span class="c1"># embeddings: (bs, seq_len, embed_dim)
</span>        <span class="n">embeddings</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_embeddings</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
        <span class="n">attn</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">encoder</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">src_key_padding_mask</span><span class="o">=</span><span class="n">src_key_padding_mask</span><span class="p">)</span>
                                    
        <span class="c1"># take the first token's embeddings i.e. embeddings of CLS token
</span>        <span class="c1"># cls_token_embeddings: (bs, embed_dim)
</span>        <span class="n">cls_token_embeddings</span> <span class="o">=</span> <span class="n">attn</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">,</span> <span class="p">:]</span> 
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">final</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">cls_token_embeddings</span><span class="p">)))</span>
    
    <span class="k">def</span> <span class="nf">get_embeddings</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">positional_encoding</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">embedding</span><span class="p">(</span><span class="n">input_ids</span><span class="p">))</span>
</pre></td></tr></tbody></table></code></pre></div></div>
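<p>As a quick smoke test (standalone, with small made-up hyperparameters), we can verify that the classifier produces one logit per class:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test_layer = EncoderLayer(embed_dim=32, n_heads=4, dim_feedforward=64)
test_encoder = Encoder(encoder_layer=test_layer, num_layers=2)
clf = TextClassifier(vocab_size=100, embed_dim=32, num_classes=5,
                     encoder=test_encoder, max_len=256)
input_ids = torch.randint(1, 100, (2, 12))  # (batch_size, seq_len)
logits = clf(input_ids)
assert logits.shape == (2, 5)               # one logit per class
</code></pre></div></div>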

<details>
    <summary>Click to expand dataset processing code</summary>

<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">datasets</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>

<span class="n">original_tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"sentence-transformers/all-MiniLM-L6-v2"</span><span class="p">)</span>


<span class="n">news_ds</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">load_dataset</span><span class="p">(</span><span class="s">"SetFit/bbc-news"</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">"train"</span><span class="p">)</span>
<span class="c1"># train a new tokenizer with limited vocab size for demo
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">original_tokenizer</span><span class="p">.</span><span class="n">train_new_from_iterator</span><span class="p">(</span><span class="n">news_ds</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">vocab_size</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">batch</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">ds</span> <span class="o">=</span> <span class="n">news_ds</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">tokenize</span><span class="p">,</span> <span class="n">batched</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">select_columns</span><span class="p">([</span><span class="s">'label'</span><span class="p">,</span> <span class="s">'input_ids'</span><span class="p">,</span> <span class="s">'text'</span><span class="p">]).</span><span class="n">train_test_split</span><span class="p">()</span>


<span class="n">class_id_to_class</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mi">0</span><span class="p">:</span> <span class="s">"tech"</span><span class="p">,</span>
    <span class="mi">1</span><span class="p">:</span> <span class="s">"business"</span><span class="p">,</span>
    <span class="mi">2</span><span class="p">:</span> <span class="s">"sports"</span><span class="p">,</span>
    <span class="mi">3</span><span class="p">:</span> <span class="s">"entertainment"</span><span class="p">,</span>
    <span class="mi">4</span><span class="p">:</span> <span class="s">"politics"</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">num_classes</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">class_id_to_class</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>

  </div>
</details>

<p>Now that we have the necessary classes, let’s create two models.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
</pre></td><td class="rouge-code"><pre><span class="n">embed_dim</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">n_head</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">dim_feedforward</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">num_layers</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">vocab_size</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">vocab_size</span>
<span class="n">max_length</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">model_max_length</span>
<span class="c1"># pytorch
</span><span class="n">torch_encoder_layer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">TransformerEncoderLayer</span><span class="p">(</span>
    <span class="n">d_model</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span>
    <span class="n">nhead</span><span class="o">=</span><span class="n">n_head</span><span class="p">,</span>
    <span class="n">dim_feedforward</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">,</span>
    <span class="n">dropout</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span>
    <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">norm_first</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">torch_encoder</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">TransformerEncoder</span><span class="p">(</span>
    <span class="n">encoder_layer</span><span class="o">=</span><span class="n">torch_encoder_layer</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="n">num_layers</span>
<span class="p">)</span>

<span class="c1"># my
</span><span class="n">my_encoder_layer</span> <span class="o">=</span> <span class="n">EncoderLayer</span><span class="p">(</span>
    <span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">n_heads</span><span class="o">=</span><span class="n">n_head</span><span class="p">,</span> <span class="n">dim_feedforward</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.1</span>
<span class="p">)</span>
<span class="n">my_encoder</span> <span class="o">=</span> <span class="n">Encoder</span><span class="p">(</span><span class="n">encoder_layer</span><span class="o">=</span><span class="n">my_encoder_layer</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="n">num_layers</span><span class="p">)</span>

<span class="n">torch_classifier</span> <span class="o">=</span> <span class="n">TextClassifier</span><span class="p">(</span>
    <span class="n">vocab_size</span><span class="o">=</span><span class="n">vocab_size</span><span class="p">,</span>
    <span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span>
    <span class="n">num_classes</span><span class="o">=</span><span class="n">num_classes</span><span class="p">,</span>
    <span class="n">encoder</span><span class="o">=</span><span class="n">torch_encoder</span><span class="p">,</span>
    <span class="n">max_len</span><span class="o">=</span><span class="n">max_length</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">my_classifier</span> <span class="o">=</span> <span class="n">TextClassifier</span><span class="p">(</span>
    <span class="n">vocab_size</span><span class="o">=</span><span class="n">vocab_size</span><span class="p">,</span>
    <span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span>
    <span class="n">num_classes</span><span class="o">=</span><span class="n">num_classes</span><span class="p">,</span>
    <span class="n">encoder</span><span class="o">=</span><span class="n">my_encoder</span><span class="p">,</span>
    <span class="n">max_len</span><span class="o">=</span><span class="n">max_length</span><span class="p">,</span>
<span class="p">)</span>


<span class="k">def</span> <span class="nf">get_model_param_count</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>


<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"My classifier params: </span><span class="si">{</span><span class="n">get_model_param_count</span><span class="p">(</span><span class="n">my_classifier</span><span class="p">)</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Torch classifier params: </span><span class="si">{</span><span class="n">get_model_param_count</span><span class="p">(</span><span class="n">torch_classifier</span><span class="p">)</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>My classifier params: 410,117
Torch classifier params: 410,117
</pre></td></tr></tbody></table></code></pre></div></div>
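<p>As a sanity check, the total can be reproduced by hand. Assuming the trained tokenizer ended up with exactly the requested 1,000-token vocabulary, the parameters break down as follows (the positional encodings are a non-persistent buffer, so they contribute nothing):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>embedding = 1000 * 128                           # token embedding table: 128,000
mha = 3 * (128 * 128 + 128) + (128 * 128 + 128)  # q/k/v in-projections + out-projection: 66,048
ff = (128 * 256 + 256) + (256 * 128 + 128)       # the two feed-forward Linear layers: 65,920
norms = 2 * (128 + 128)                          # two LayerNorms, each with weight and bias: 512
encoder = 2 * (mha + ff + norms)                 # two Encoder layers: 264,960
head = (128 * 128 + 128) + (128 * 5 + 5)         # fc1 + final classification layer: 17,157
print(f"{embedding + encoder + head:,}")         # 410,117
</code></pre></div></div>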

<p>Below I’ve defined a training loop. One important part is the collate function, where we pad the input_ids and then create <code class="language-plaintext highlighter-rouge">key_padding_masks</code> as follows.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="n">input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">rnn</span><span class="p">.</span><span class="n">pad_sequence</span><span class="p">(</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">padding_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c1"># create a boolean key padding mask by checking if input_id == 0 i.e padding_value 
</span><span class="n">key_padding_masks</span> <span class="o">=</span> <span class="n">input_ids</span> <span class="o">==</span> <span class="mi">0</span>
</pre></td></tr></tbody></table></code></pre></div></div>
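<p>On a toy batch this looks like the following (a standalone illustration with made-up token ids):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seqs = [torch.LongTensor([5, 9, 2]), torch.LongTensor([7, 3])]
padded = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True, padding_value=0)
# padded: tensor([[5, 9, 2],
#                 [7, 3, 0]])
mask = padded == 0  # True marks padding positions the attention should ignore
# mask: tensor([[False, False, False],
#               [False, False,  True]])
</code></pre></div></div>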

<details>
<summary>Click to expand training loop code</summary>

<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="k">def</span> <span class="nf">collate_fn</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">input_ids</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">batch</span><span class="p">:</span>
        <span class="n">labels</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">'label'</span><span class="p">])</span>
        <span class="n">input_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">'input_ids'</span><span class="p">]))</span>

    <span class="n">input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">rnn</span><span class="p">.</span><span class="n">pad_sequence</span><span class="p">(</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">padding_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="c1"># create a boolean key padding mask by checking if input_id == 0 i.e padding_value 
</span>    <span class="n">key_padding_masks</span> <span class="o">=</span> <span class="n">input_ids</span> <span class="o">==</span> <span class="mi">0</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span>
    <span class="n">input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"labels"</span><span class="p">:</span> <span class="n">labels</span><span class="p">,</span> <span class="s">"input_ids"</span><span class="p">:</span> <span class="n">input_ids</span><span class="p">,</span> <span class="s">"src_key_padding_mask"</span><span class="p">:</span> <span class="n">key_padding_masks</span><span class="p">}</span>

<span class="n">train_dl</span> <span class="o">=</span> <span class="n">test_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">ds</span><span class="p">[</span><span class="s">'train'</span><span class="p">],</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">collate_fn</span><span class="o">=</span><span class="n">collate_fn</span><span class="p">)</span>
<span class="n">test_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">ds</span><span class="p">[</span><span class="s">'test'</span><span class="p">],</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">collate_fn</span><span class="o">=</span><span class="n">collate_fn</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">model</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">,</span> <span class="n">train_dl</span><span class="p">,</span> <span class="n">val_dl</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">float</span><span class="p">,</span> <span class="nb">float</span><span class="p">]]:</span>
    <span class="n">optim</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-3</span><span class="p">)</span>
    <span class="n">loss_fn</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">CrossEntropyLoss</span><span class="p">()</span>
    <span class="n">losses</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">train_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
        <span class="n">epoch_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
        <span class="n">train_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">train_dl</span><span class="p">:</span>
            <span class="n">optim</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">batch</span><span class="p">)</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">'labels'</span><span class="p">])</span>
            <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
            <span class="n">optim</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
            <span class="n">train_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="n">batch</span><span class="p">[</span><span class="s">'labels'</span><span class="p">].</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

        <span class="n">train_loss</span> <span class="o">/=</span> <span class="nb">len</span><span class="p">(</span><span class="n">train_dl</span><span class="p">.</span><span class="n">dataset</span><span class="p">)</span>

        <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
        <span class="n">val_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="n">val_accuracy</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
            <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">val_dl</span><span class="p">:</span>
                <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">batch</span><span class="p">)</span>
                <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">'labels'</span><span class="p">])</span>
                <span class="n">val_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="n">batch</span><span class="p">[</span><span class="s">'labels'</span><span class="p">].</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
                <span class="n">val_accuracy</span> <span class="o">+=</span> <span class="p">(</span><span class="n">logits</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="n">batch</span><span class="p">[</span><span class="s">'labels'</span><span class="p">]).</span><span class="nb">sum</span><span class="p">()</span>

        <span class="n">val_loss</span> <span class="o">/=</span> <span class="nb">len</span><span class="p">(</span><span class="n">val_dl</span><span class="p">.</span><span class="n">dataset</span><span class="p">)</span>
        <span class="n">val_accuracy</span> <span class="o">/=</span> <span class="nb">len</span><span class="p">(</span><span class="n">val_dl</span><span class="p">.</span><span class="n">dataset</span><span class="p">)</span>
        <span class="n">log_steps</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="mf">0.2</span> <span class="o">*</span> <span class="n">epochs</span><span class="p">))</span>

        <span class="n">losses</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">train_loss</span><span class="p">,</span> <span class="n">val_loss</span><span class="p">))</span>
        <span class="k">if</span> <span class="n">epoch</span> <span class="o">%</span> <span class="n">log_steps</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">or</span> <span class="n">epoch</span> <span class="o">==</span> <span class="n">epochs</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">epoch_duartion</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">epoch_start</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Epoch </span><span class="si">{</span><span class="n">epoch</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="n">epochs</span><span class="si">}</span><span class="s">, Training Loss: </span><span class="si">{</span><span class="n">train_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, Validation Loss: </span><span class="si">{</span><span class="n">val_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, Validation Accuracy: </span><span class="si">{</span><span class="n">val_accuracy</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">. Epoch Duration: </span><span class="si">{</span><span class="n">epoch_duartion</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s"> seconds'</span><span class="p">)</span>

    <span class="n">train_duration</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">train_start</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Training finished. Took </span><span class="si">{</span><span class="n">train_duration</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s"> seconds"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">losses</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">torch_losses</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">torch_classifier</span><span class="p">,</span> <span class="n">train_dl</span><span class="p">,</span> <span class="n">test_dl</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="n">my_losses</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">my_classifier</span><span class="p">,</span> <span class="n">train_dl</span><span class="p">,</span> <span class="n">test_dl</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>After training both models for 20 epochs, the accuracy on the validation set is around 87% for each. The classifier using the PyTorch Encoder took 22 minutes, whereas the one we implemented took only 15 minutes on my machine (CPU-only training).</p>

<p>Below is the train/validation loss per epoch.
<img src="/assets/images/deep-learning/transformer-encoder/train_loss.png" alt="loss" /></p>

<details>
<summary>Click to expand visualization code</summary>

<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">get_losses_as_df</span><span class="p">(</span><span class="n">losses_name_pairs</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">[</span><span class="nb">float</span><span class="p">,</span> <span class="nb">float</span><span class="p">]]]):</span>
    <span class="n">dfs</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">model_name</span><span class="p">,</span> <span class="n">losses</span> <span class="ow">in</span> <span class="n">losses_name_pairs</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">losses</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'train_loss'</span><span class="p">,</span> <span class="s">'test_loss'</span><span class="p">]).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"index"</span><span class="p">:</span> <span class="s">"epoch"</span><span class="p">})</span>
        <span class="n">df</span><span class="p">[</span><span class="s">'model'</span><span class="p">]</span> <span class="o">=</span> <span class="n">model_name</span>
        <span class="n">dfs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">dfs</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">plot_losses</span><span class="p">(</span><span class="n">loss_df</span><span class="p">):</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">loss_df</span><span class="p">.</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s">'model'</span><span class="p">,</span> <span class="s">'epoch'</span><span class="p">],</span> <span class="n">var_name</span><span class="o">=</span><span class="s">'metric'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="s">'epoch'</span><span class="p">,</span> <span class="s">'value'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'metric'</span><span class="p">))</span> <span class="o">+</span> <span class="n">geom_line</span><span class="p">()</span> <span class="o">+</span> <span class="n">geom_point</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mf">1.5</span><span class="p">)</span> <span class="o">+</span> <span class="n">facet_grid</span><span class="p">(</span><span class="s">'model'</span><span class="p">)</span> <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">"Train and Validation loss"</span><span class="p">)</span>


<span class="n">plot_losses</span><span class="p">(</span><span class="n">get_losses_as_df</span><span class="p">([(</span><span class="s">"My"</span><span class="p">,</span> <span class="n">my_losses</span><span class="p">),</span> <span class="p">(</span><span class="s">"Torch"</span><span class="p">,</span> <span class="n">torch_losses</span><span class="p">)]))</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<p>Below is the full classification report per class for both of the models.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="rouge-code"><pre>My Classifier
              precision    recall  f1-score   support

           0       0.88      0.81      0.84        47
           1       0.82      0.81      0.82        69
           2       0.90      0.96      0.93        81
           3       0.89      0.76      0.82        55
           4       0.84      0.95      0.89        55

    accuracy                           0.87       307
   macro avg       0.87      0.86      0.86       307
weighted avg       0.87      0.87      0.86       307

Torch Classifier
              precision    recall  f1-score   support

           0       0.93      0.79      0.85        47
           1       0.87      0.87      0.87        69
           2       0.92      0.89      0.91        81
           3       0.86      0.93      0.89        55
           4       0.79      0.87      0.83        55

    accuracy                           0.87       307
   macro avg       0.87      0.87      0.87       307
weighted avg       0.88      0.87      0.87       307
</pre></td></tr></tbody></table></code></pre></div></div>
<details>
<summary>Click to expand evaluation code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">toolz</span>

<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="n">texts</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">bs</span><span class="o">=</span><span class="mi">32</span><span class="p">):</span>
    <span class="n">output_dfs</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">toolz</span><span class="p">.</span><span class="n">partition_all</span><span class="p">(</span><span class="n">bs</span><span class="p">,</span> <span class="n">texts</span><span class="p">):</span>
        <span class="n">inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s">"pt"</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
            <span class="n">class_probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">inputs</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">numpy</span><span class="p">()</span>
            <span class="n">pred_classes</span> <span class="o">=</span> <span class="n">class_probs</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">col_names</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">"class_</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">_prob"</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">class_probs</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])]</span>
            <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">class_probs</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">col_names</span><span class="p">)</span>
            <span class="n">df</span><span class="p">[</span><span class="s">'pred_class'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pred_classes</span>
            <span class="n">df</span><span class="p">[</span><span class="s">'pred_class_name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'pred_class'</span><span class="p">].</span><span class="nb">map</span><span class="p">(</span><span class="n">class_id_to_class</span><span class="p">)</span>
            <span class="n">output_dfs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">output_dfs</span><span class="p">)</span>

<span class="n">my_preds_df</span> <span class="o">=</span> <span class="n">predict</span><span class="p">(</span><span class="n">ds</span><span class="p">[</span><span class="s">'test'</span><span class="p">][</span><span class="s">'text'</span><span class="p">],</span> <span class="n">my_classifier</span><span class="p">)</span>
<span class="n">my_preds_df</span><span class="p">[</span><span class="s">'model'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'My Model'</span>
<span class="n">my_preds_df</span><span class="p">[</span><span class="s">'actual_class'</span><span class="p">]</span> <span class="o">=</span> <span class="n">ds</span><span class="p">[</span><span class="s">'test'</span><span class="p">][</span><span class="s">'label'</span><span class="p">]</span>
<span class="n">torch_preds_df</span> <span class="o">=</span> <span class="n">predict</span><span class="p">(</span><span class="n">ds</span><span class="p">[</span><span class="s">'test'</span><span class="p">][</span><span class="s">'text'</span><span class="p">],</span> <span class="n">torch_classifier</span><span class="p">)</span>
<span class="n">torch_preds_df</span><span class="p">[</span><span class="s">'model'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'Torch Model'</span>
<span class="n">torch_preds_df</span><span class="p">[</span><span class="s">'actual_class'</span><span class="p">]</span> <span class="o">=</span> <span class="n">ds</span><span class="p">[</span><span class="s">'test'</span><span class="p">][</span><span class="s">'label'</span><span class="p">]</span>

<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span>

<span class="k">print</span><span class="p">(</span><span class="s">"My Classifier"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">my_preds_df</span><span class="p">[</span><span class="s">'actual_class'</span><span class="p">],</span> <span class="n">my_preds_df</span><span class="p">[</span><span class="s">'pred_class'</span><span class="p">]))</span>

<span class="k">print</span><span class="p">(</span><span class="s">"Torch Classifier"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">torch_preds_df</span><span class="p">[</span><span class="s">'actual_class'</span><span class="p">],</span> <span class="n">torch_preds_df</span><span class="p">[</span><span class="s">'pred_class'</span><span class="p">]))</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<h1 id="conclusion">Conclusion</h1>
<p>We implemented an Encoder network from scratch (kind of) and saw that our implementation and Pytorch’s implementation are quite comparable in terms of model accuracy. Since our implementation is quite simple and does not consider additional cases, it is relatively faster than Pytorch. I hope you enjoyed the post. Let me know if there are any errors.</p>]]></content><author><name>Sanjaya Subedi</name></author><category term="DeepLearning" /><summary type="html"><![CDATA[Let's implement a Transformer Encoder Layer from scratch using Pytorch]]></summary></entry><entry><title type="html">Implementing Transformer Decoder Layer From Scratch</title><link href="https://sanjayasubedi.com.np/deeplearning/transformer-decoder/" rel="alternate" type="text/html" title="Implementing Transformer Decoder Layer From Scratch" /><published>2024-09-22T18:04:00+00:00</published><updated>2024-09-22T18:04:00+00:00</updated><id>https://sanjayasubedi.com.np/deeplearning/transformer-decoder</id><content type="html" xml:base="https://sanjayasubedi.com.np/deeplearning/transformer-decoder/"><![CDATA[<h1 id="introduction">Introduction</h1>
<p>In this post we’ll implement the Transformer’s Decoder layer from scratch. This was introduced in a paper called <a href="https://arxiv.org/pdf/1706.03762">Attention Is All You Need</a>. This layer is typically used to build “Decoder only” models such as ChatGPT, LLama etc. Of course we can combine the Decoder with Encoder as proposed in the paper but in this post, we’ll use the Decoder layers to build a Decoder-only model similar to GPT.</p>

<p>Decoder layer is very similar to the Encoder layer. Only difference is how masking is used. As explained in <a href="/deeplearning/masking-in-attention/">previous post</a>, we’ll use causal mask when calculating the attention. To be more precise, Decoders are used to build model which generate the next token in sequence. During training, the model can technically “see” future tokens as well but to prevent this data leakage, we introduce causal mask so that attention is calculated using the tokens observed so far.</p>

<p>The image below shows how the future tokens are “excluded” by setting their mask value to negative infinity when calculating attention weights in the MultiHeadAttention layer. Please refer to <a href="/deeplearning/masking-in-attention/">previous post</a> where I cover this in more detail.
<img src="/assets/images/deep-learning/masking-attention/decoder_training04.png" alt="decoder training" /></p>

<h1 id="implementation">Implementation</h1>
<p>Implementing it is quite straight forward. Let’s import few libraries and implement two classes <code class="language-plaintext highlighter-rouge">DecoderLayer</code> and <code class="language-plaintext highlighter-rouge">Decoder</code>. <code class="language-plaintext highlighter-rouge">Decoder</code> class just encapsulates N number of <code class="language-plaintext highlighter-rouge">DecoderLayer</code>s.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
</pre></td><td class="rouge-code"><pre><span class="c1"># pip install -q lightning datasets
</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">lightning</span> <span class="k">as</span> <span class="n">L</span>
<span class="kn">from</span> <span class="nn">copy</span> <span class="kn">import</span> <span class="n">deepcopy</span>

<span class="k">class</span> <span class="nc">DecoderLayer</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">embed_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">n_heads</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">dim_feedforward</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">512</span><span class="p">,</span>
        <span class="n">dropout</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.1</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">mha</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">MultiheadAttention</span><span class="p">(</span>
            <span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_heads</span><span class="o">=</span><span class="n">n_heads</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">norm1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">normalized_shape</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">norm2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">normalized_shape</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dropout1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="n">dropout</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dropout2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="n">dropout</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">ff_block</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">),</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">key_padding_mask</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">attn_mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="n">attn_output</span><span class="p">,</span> <span class="n">attn_weights</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">mha</span><span class="p">(</span>
            <span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">attn_mask</span><span class="o">=</span><span class="n">attn_mask</span><span class="p">,</span> <span class="n">key_padding_mask</span><span class="o">=</span><span class="n">key_padding_mask</span>
        <span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">norm1</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout1</span><span class="p">(</span><span class="n">attn_output</span><span class="p">))</span>
        <span class="n">projection</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">ff_block</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">norm2</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout2</span><span class="p">(</span><span class="n">projection</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">x</span>

<span class="k">class</span> <span class="nc">Decoder</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">decoder_layer</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="n">layers</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_layers</span><span class="p">):</span>
            <span class="n">layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">deepcopy</span><span class="p">(</span><span class="n">decoder_layer</span><span class="p">))</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">layers</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ModuleList</span><span class="p">(</span><span class="n">layers</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">key_padding_mask</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">attn_mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">:</span>
            <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">key_padding_mask</span><span class="o">=</span><span class="n">key_padding_mask</span><span class="p">,</span> <span class="n">attn_mask</span><span class="o">=</span><span class="n">attn_mask</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">x</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Now that we have the building blocks. Let’s implement a model similar to GPT. We need new parameters like number of decoder layers, embedding dimension, number of heads etc. which we will accept in the <code class="language-plaintext highlighter-rouge">__init__</code> function. In the end, the model will return probabilities of next token.</p>

<details>
<summary>Click to expand Positional Embedding code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">math</span>
<span class="k">class</span> <span class="nc">PositionalEncoding</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="c1"># source: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html#Positional-encoding
</span>    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="mi">256</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="c1"># create a matrix of [seq_len, hidden_dim] representing positional encoding for each token in sequence
</span>        <span class="n">pe</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">)</span>
        <span class="n">position</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">max_len</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">float</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># (max_len, 1)
</span>        <span class="n">div_term</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="mi">2</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span> <span class="o">*</span> <span class="p">(</span><span class="o">-</span><span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mf">10000.0</span><span class="p">)</span> <span class="o">/</span> <span class="n">embed_dim</span><span class="p">))</span>
        <span class="n">pe</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">position</span> <span class="o">*</span> <span class="n">div_term</span><span class="p">)</span>
        <span class="n">pe</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cos</span><span class="p">(</span><span class="n">position</span> <span class="o">*</span> <span class="n">div_term</span><span class="p">)</span>
        <span class="n">pe</span> <span class="o">=</span> <span class="n">pe</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">register_buffer</span><span class="p">(</span><span class="s">'pe'</span><span class="p">,</span> <span class="n">pe</span><span class="p">,</span> <span class="n">persistent</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">pe</span><span class="p">[:,</span> <span class="p">:</span><span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)]</span>
        <span class="k">return</span> <span class="n">x</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">TinyGPT</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">num_layers</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">vocab_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">embed_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">max_len</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">n_heads</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">dim_feedforward</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">pad_token_idx</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">dropout</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.1</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">embedding</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span>
            <span class="n">num_embeddings</span><span class="o">=</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">padding_idx</span><span class="o">=</span><span class="n">pad_token_idx</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">positional_encoding</span> <span class="o">=</span> <span class="n">PositionalEncoding</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="n">max_len</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">decoders</span> <span class="o">=</span> <span class="n">Decoder</span><span class="p">(</span>
            <span class="n">decoder_layer</span><span class="o">=</span><span class="n">DecoderLayer</span><span class="p">(</span>
                <span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span>
                <span class="n">n_heads</span><span class="o">=</span><span class="n">n_heads</span><span class="p">,</span>
                <span class="n">dim_feedforward</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">,</span>
                <span class="n">dropout</span><span class="o">=</span><span class="n">dropout</span><span class="p">,</span>
            <span class="p">),</span>
            <span class="n">num_layers</span><span class="o">=</span><span class="n">num_layers</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">lm_head</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="n">vocab_size</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">,</span> <span class="n">key_padding_mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="n">bs</span><span class="p">,</span> <span class="n">seq_len</span> <span class="o">=</span> <span class="n">input_ids</span><span class="p">.</span><span class="n">size</span><span class="p">()</span>
        <span class="n">embeddings</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_embeddings</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
        <span class="c1"># generate a causal mask
</span>        <span class="n">attn_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Transformer</span><span class="p">.</span><span class="n">generate_square_subsequent_mask</span><span class="p">(</span><span class="n">sz</span><span class="o">=</span><span class="n">seq_len</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">input_ids</span><span class="p">.</span><span class="n">device</span><span class="p">)</span>
        <span class="n">embeddings</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">decoders</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">key_padding_mask</span><span class="o">=</span><span class="n">key_padding_mask</span><span class="p">,</span> <span class="n">attn_mask</span><span class="o">=</span><span class="n">attn_mask</span><span class="p">)</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lm_head</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">logits</span>

    <span class="k">def</span> <span class="nf">get_embeddings</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">positional_encoding</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">embedding</span><span class="p">(</span><span class="n">input_ids</span><span class="p">))</span>
    
    <span class="k">def</span> <span class="nf">get_model_param_count</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>
    
    <span class="k">def</span> <span class="nf">generate</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">initial_text</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">max_len</span><span class="p">:</span> <span class="nb">int</span><span class="o">=</span><span class="mi">20</span><span class="p">):</span>
        <span class="n">device</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">parameters</span><span class="p">()).</span><span class="n">device</span>
        <span class="n">input_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">cls_token_id</span><span class="p">]</span>
        <span class="k">if</span> <span class="n">initial_text</span><span class="p">:</span>
            <span class="c1"># tokenizer add SEP token at the end, do not include that one
</span>            <span class="n">input_ids</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">initial_text</span><span class="p">)[</span><span class="s">'input_ids'</span><span class="p">][:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># type: ignore
</span>
        <span class="k">while</span> <span class="nb">len</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">max_len</span><span class="p">:</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">(</span><span class="n">input_ids</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">input_ids</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">))</span>
            <span class="c1"># take the logits of the last token and use a temperature of 0.1
</span>            <span class="n">logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">/</span> <span class="mf">0.1</span>
            
            <span class="c1"># greedy sampling. take the token with max "probability"
</span>            <span class="n">next_token_id</span> <span class="o">=</span> <span class="n">logits</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">).</span><span class="n">item</span><span class="p">()</span>
            <span class="n">input_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">next_token_id</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">next_token_id</span> <span class="o">==</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">sep_token_id</span><span class="p">:</span>
                <span class="k">break</span>

        <span class="k">return</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The most important part to consider is in the <code class="language-plaintext highlighter-rouge">forward</code> method, we generated a causal mask and passed that mask as <code class="language-plaintext highlighter-rouge">attn_mask</code> argument to the <code class="language-plaintext highlighter-rouge">decoders</code> layer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="n">attn_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Transformer</span><span class="p">.</span><span class="n">generate_square_subsequent_mask</span><span class="p">(</span><span class="n">sz</span><span class="o">=</span><span class="n">seq_len</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">input_ids</span><span class="p">.</span><span class="n">device</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">generate</code> takes the logit produced by last token in the input and then chooses the next token as the token with highest value (also called greedy decoding). We’ll discuss about sampling in future posts.</p>

<p>Now let’s train our model. For training, let’s download a dataset from HuggingFace Hub and pre-process it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre> <span class="kn">import</span> <span class="nn">datasets</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>

<span class="c1"># can choose other tokenizers as well
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"sentence-transformers/all-MiniLM-L6-v2"</span><span class="p">)</span>
<span class="c1"># let's limit the max number of tokens in a sequence to be 128.
# longer sequences will be truncated
</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">model_max_length</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">news_ds</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">load_dataset</span><span class="p">(</span><span class="s">"fancyzhx/ag_news"</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">"train"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">batch</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">news_ds</span> <span class="o">=</span> <span class="n">news_ds</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">tokenize</span><span class="p">,</span> <span class="n">batched</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>We now have dataset and also extracted token ids. Let’s define a DataCollator and create train and test data loaders.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">DataCollatorForLM</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pad_token_idx</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">pad_token_idx</span> <span class="o">=</span> <span class="n">pad_token_idx</span>
    
    <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch</span><span class="p">):</span>
        <span class="n">input_ids</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="c1"># collect the input_ids as torch Tensor 
</span>        <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">batch</span><span class="p">:</span>
            <span class="n">input_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">'input_ids'</span><span class="p">]))</span>

        <span class="c1"># pad the input_ids so that all of them have same shape
</span>        <span class="n">input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">rnn</span><span class="p">.</span><span class="n">pad_sequence</span><span class="p">(</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">padding_value</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">pad_token_idx</span><span class="p">)</span>
        <span class="c1"># any input_ids that is same as pad_token_idx will be considered as key padding mask
</span>        <span class="c1"># for a mask, value of True means it will not take part in attention
</span>        <span class="n">key_padding_mask</span> <span class="o">=</span> <span class="n">input_ids</span> <span class="o">==</span> <span class="bp">self</span><span class="p">.</span><span class="n">pad_token_idx</span>
        <span class="c1"># labels will be same as the input_ids
</span>        <span class="c1"># we will shift the labels when calculating the loss
</span>        <span class="n">labels</span> <span class="o">=</span> <span class="n">input_ids</span><span class="p">.</span><span class="n">clone</span><span class="p">()</span>
        <span class="c1"># we also set the token_id of padded tokens to -100 so that we can ignore these
</span>        <span class="c1"># when calculating cross entropy loss because we do not care what the model predicts
</span>        <span class="c1"># for these padded tokens
</span>        <span class="n">labels</span><span class="p">[</span><span class="n">labels</span> <span class="o">==</span> <span class="bp">self</span><span class="p">.</span><span class="n">pad_token_idx</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">100</span>
        <span class="k">return</span> <span class="p">{</span><span class="s">"input_ids"</span><span class="p">:</span> <span class="n">input_ids</span><span class="p">,</span> <span class="s">"key_padding_mask"</span><span class="p">:</span> <span class="n">key_padding_mask</span><span class="p">,</span> <span class="s">"labels"</span><span class="p">:</span> <span class="n">labels</span><span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This collator is just collecting the input ids as tensor after padding. As mentioned in the comments, it also “creates” labels which is same as input ids. Later when we compute the loss, we’ll shift the labels so that the labels will be the next token in the sequence.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span>
<span class="n">news_ds</span> <span class="o">=</span> <span class="n">news_ds</span><span class="p">.</span><span class="n">train_test_split</span><span class="p">(</span><span class="n">test_size</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
<span class="n">bs</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">collate_fn</span> <span class="o">=</span> <span class="n">DataCollatorForLM</span><span class="p">(</span><span class="n">pad_token_idx</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token_id</span><span class="p">)</span>
<span class="n">train_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">news_ds</span><span class="p">[</span><span class="s">'train'</span><span class="p">],</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">bs</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">collate_fn</span><span class="o">=</span><span class="n">collate_fn</span><span class="p">)</span>
<span class="n">test_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">news_ds</span><span class="p">[</span><span class="s">'test'</span><span class="p">],</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">bs</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">collate_fn</span><span class="o">=</span><span class="n">collate_fn</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>I’ve used Pytorch Lightning library to train. So let’s create a wrapper class using <code class="language-plaintext highlighter-rouge">LightningModule</code>. This is done so that we can use its <code class="language-plaintext highlighter-rouge">Trainer</code> class and avoid writing our own training loop. This class has a method <code class="language-plaintext highlighter-rouge">compute_loss</code> which shifts the labels and calculates the loss using <code class="language-plaintext highlighter-rouge">cross_entropy</code> loss function.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">LitTinyGPT</span><span class="p">(</span><span class="n">L</span><span class="p">.</span><span class="n">LightningModule</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">gpt</span><span class="p">:</span> <span class="n">TinyGPT</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">gpt</span> <span class="o">=</span> <span class="n">gpt</span>

    <span class="k">def</span> <span class="nf">compute_loss</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch</span><span class="p">):</span>
        <span class="n">input_ids</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="s">"input_ids"</span><span class="p">]</span>
        <span class="n">key_padding_mask</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="s">"key_padding_mask"</span><span class="p">]</span>
        <span class="n">labels</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="s">"labels"</span><span class="p">]</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">gpt</span><span class="p">(</span><span class="n">input_ids</span><span class="o">=</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">key_padding_mask</span><span class="o">=</span><span class="n">key_padding_mask</span><span class="p">)</span>
        <span class="c1"># flatten the labels
</span>        <span class="n">shift_labels</span> <span class="o">=</span> <span class="n">labels</span><span class="p">[...,</span> <span class="mi">1</span><span class="p">:].</span><span class="n">contiguous</span><span class="p">().</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># 1D array with total elements = bs * (seq_len - 1)
</span>
        <span class="c1"># shift logits so that we discard the probabilties for the last one
</span>        <span class="c1"># since final token does not have next token to predict
</span>        <span class="n">shift_logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[...,</span> <span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:].</span><span class="n">contiguous</span><span class="p">()</span>
        <span class="n">shift_logits</span> <span class="o">=</span> <span class="n">shift_logits</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">shift_logits</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="c1"># 2D array
</span>        
        <span class="c1"># we ignore the predictions for labels which have value of -100 (as specified in the data collator)
</span>        <span class="n">loss</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">cross_entropy</span><span class="p">(</span>
            <span class="n">shift_logits</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><span class="n">shift_labels</span><span class="p">,</span> <span class="n">ignore_index</span><span class="o">=-</span><span class="mi">100</span>
        <span class="p">)</span>
        <span class="k">return</span> <span class="n">loss</span>
    
    <span class="k">def</span> <span class="nf">training_step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch</span><span class="p">,</span> <span class="n">batch_idx</span><span class="p">):</span>        
        <span class="n">loss</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">compute_loss</span><span class="p">(</span><span class="n">batch</span><span class="o">=</span><span class="n">batch</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="s">"train_loss"</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="n">prog_bar</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">on_epoch</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">on_step</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">loss</span>
    
    <span class="k">def</span> <span class="nf">validation_step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch</span><span class="p">,</span> <span class="n">batch_idx</span><span class="p">):</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">compute_loss</span><span class="p">(</span><span class="n">batch</span><span class="o">=</span><span class="n">batch</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">log_dict</span><span class="p">({</span><span class="s">"val_loss"</span><span class="p">:</span> <span class="n">loss</span><span class="p">,</span> <span class="s">"perplexity"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">loss</span><span class="p">)},</span> <span class="n">on_epoch</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">on_step</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">configure_optimizers</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">optim</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">params</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-3</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">optim</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Ok, we are almost at the end. Now we just need to create our “tiny GPT” model.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
</pre></td><td class="rouge-code"><pre><span class="c1"># 4 decoder layers
</span><span class="n">num_layers</span> <span class="o">=</span> <span class="mi">4</span>
<span class="n">vocab_size</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">vocab_size</span>
<span class="c1"># embedding size of 512
</span><span class="n">embed_dim</span> <span class="o">=</span> <span class="mi">512</span>
<span class="c1"># 8 heads on MHA
</span><span class="n">n_heads</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">dim_feedforward</span> <span class="o">=</span> <span class="mi">2048</span>
<span class="c1"># max_len is needed by PositionalEmbedding
</span><span class="n">max_len</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">model_max_length</span>
<span class="n">gpt</span> <span class="o">=</span> <span class="n">TinyGPT</span><span class="p">(</span>
    <span class="n">num_layers</span><span class="o">=</span><span class="n">num_layers</span><span class="p">,</span>
    <span class="n">vocab_size</span><span class="o">=</span><span class="n">vocab_size</span><span class="p">,</span>
    <span class="n">max_len</span><span class="o">=</span><span class="n">max_len</span><span class="p">,</span>
    <span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span>
    <span class="n">n_heads</span><span class="o">=</span><span class="n">n_heads</span><span class="p">,</span>
    <span class="n">dim_feedforward</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">,</span>
    <span class="n">pad_token_idx</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token_id</span><span class="p">,</span>
    <span class="n">dropout</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">lit_gpt</span> <span class="o">=</span> <span class="n">LitTinyGPT</span><span class="p">(</span><span class="n">gpt</span><span class="o">=</span><span class="n">gpt</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Total model parameters = </span><span class="si">{</span><span class="n">gpt</span><span class="p">.</span><span class="n">get_model_param_count</span><span class="p">()</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Using the above configuration, the model has 43 million parameters. The smallest GPT-2 model has 124 million parameters.</p>
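<p>For reference, a parameter-count helper like the <code class="language-plaintext highlighter-rouge">get_model_param_count</code> method used above can be implemented by summing the element counts of the model’s parameters. A minimal sketch, assuming the method simply counts all trainable parameters (the actual implementation may differ):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def get_model_param_count(model) -&gt; int:
    # sum the number of elements across all trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
</code></pre></div></div>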

<p>Finally we can train our model. I’ve also created a callback that generates 3 texts after every epoch so we can see how sensible the generated texts look as training progresses.</p>

<details>
<summary>Click to expand Callback code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">lightning</span> <span class="kn">import</span> <span class="n">LightningModule</span><span class="p">,</span> <span class="n">Trainer</span>
<span class="k">class</span> <span class="nc">MyCallaback</span><span class="p">(</span><span class="n">L</span><span class="p">.</span><span class="n">Callback</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">generate_texts</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pl_module</span><span class="p">:</span> <span class="n">LitTinyGPT</span><span class="p">):</span>
        <span class="n">pl_module</span><span class="p">.</span><span class="k">print</span><span class="p">(</span><span class="n">pl_module</span><span class="p">.</span><span class="n">gpt</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">initial_text</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="mi">30</span><span class="p">))</span>
        <span class="n">pl_module</span><span class="p">.</span><span class="k">print</span><span class="p">()</span>
        <span class="n">pl_module</span><span class="p">.</span><span class="k">print</span><span class="p">(</span><span class="n">pl_module</span><span class="p">.</span><span class="n">gpt</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">initial_text</span><span class="o">=</span><span class="s">"france starts"</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="mi">30</span><span class="p">))</span>
        <span class="n">pl_module</span><span class="p">.</span><span class="k">print</span><span class="p">()</span>
        <span class="n">pl_module</span><span class="p">.</span><span class="k">print</span><span class="p">(</span><span class="n">pl_module</span><span class="p">.</span><span class="n">gpt</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">initial_text</span><span class="o">=</span><span class="s">"vw considers opening"</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="mi">30</span><span class="p">))</span>
        <span class="n">pl_module</span><span class="p">.</span><span class="k">print</span><span class="p">(</span><span class="s">"=============="</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">on_train_epoch_end</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">trainer</span><span class="p">:</span> <span class="n">Trainer</span><span class="p">,</span> <span class="n">pl_module</span><span class="p">:</span> <span class="n">LightningModule</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">generate_texts</span><span class="p">(</span><span class="n">pl_module</span><span class="o">=</span><span class="n">pl_module</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="n">num_epochs</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">trainer</span> <span class="o">=</span> <span class="n">L</span><span class="p">.</span><span class="n">Trainer</span><span class="p">(</span><span class="n">fast_dev_run</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">max_epochs</span><span class="o">=</span><span class="n">num_epochs</span><span class="p">,</span> <span class="n">max_steps</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">log_every_n_steps</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">MyCallaback</span><span class="p">()])</span>
<span class="n">trainer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">lit_gpt</span><span class="p">,</span> <span class="n">train_dataloaders</span><span class="o">=</span><span class="n">train_dl</span><span class="p">,</span> <span class="n">val_dataloaders</span><span class="o">=</span><span class="n">test_dl</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>I’ve trained it for 5 epochs in Google Colab. It used about 13.6 GB of VRAM while training and took about 14 minutes per epoch.</p>

<p>Let’s look at the texts generated over the epochs. The first kind of “text” is generated without any initial text, so the model is basically free to generate whatever it wants. Looking at the logs, it always starts the sentence with the word “us”.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>1. [CLS] us stocks rise on oil prices, oil prices rise new york ( reuters ) - u. s. stocks were higher on thursday as a disappointing consumer
2. [CLS] us, china, china, china and china, china and china are the world # 39 ; s booming economy, china and china are booming china
3. [CLS] us, iraqi forces kill 22 in iraq, us and iraqi forces killed 22 insurgents killed 22 insurgents in the iraqi city of falluja, the
4. [CLS] us airways pilots union at us airways group inc., the no. 2 us airways group inc., said thursday it would cut wages and benefits
5. [CLS] us, eu leaders sign new eu trade pact the european union and the eu agreed to sign a new eu trade agreement on friday to end a row
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The second kind of text was given an initial text = <code class="language-plaintext highlighter-rouge">france starts</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>1. [CLS] france starts new anti - spam law ( ap ) ap - french lawmakers are expected to announce a new anti - spam law that could
2. [CLS] france starts new nuclear missile shield ( ap ) ap - france will launch a new weapon against the international atomic energy agency next year, a new weapon
3. [CLS] france starts new hostages in iraq the french government is investigating a french hostage crisis in iraq, france and the french government said on thursday. [SEP]
4. [CLS] france starts new hostages in iraq two french hostages have been freed in iraq, the french government said on wednesday. the two french hostages in iraq,
5. [CLS] france starts new - generation of portable music player ( reuters ) reuters - france's new portable music player \ has started a new digital music player
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The third one was given an initial text = <code class="language-plaintext highlighter-rouge">vw considers opening</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre>1. [CLS] vw considers opening day of the world # 39 ; s largest airline, said it would cut its work force in its pilots and the united states
2. [CLS] vw considers opening of its debt, the german finance ministers of the european union # 39 ; s biggest lender, said it would consider a
3. [CLS] vw considers opening - up to - ups plants new york ( reuters ) - volkswagen ag has warned on tuesday it would consider a formal request to
4. [CLS] vw considers opening of the frankfurt volkswagen ag has warned that the company will not charge for the first time in its initial public offering, the company
5. [CLS] vw considers opening offer for unions frankfurt ( reuters ) reuters - volkswagen ag's \ deutsche telekom said on thursday it was considering a
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Let’s look at some texts generated after the model was trained. The bold text is the “initial text”.</p>

<ul>
  <li><strong>[CLS] microsoft and nvidia</strong> cross - licensing deal to provide content management software for software, microsoft and yahoo! have signed a cross - licensing agreement to develop a common language that will help companies manage their work. [SEP]</li>
  <li><strong>[CLS] nvidia announces</strong> geforce 6 gpu ibm has announced a new gp versions of its geforce 6 processor, which allows users to manage the gpu and the gpu. [SEP]</li>
  <li><strong>[CLS] a new trade agreement</strong> for iran nuclear program the united nations is considering making a new round of talks to end a dispute over iran # 39 ; s nuclear program, the bbc said on friday. [SEP]</li>
</ul>

<p>The generated texts kind of make sense, but no one is getting fooled by them. A few reasons for this:</p>

<ol>
  <li>Small model size (only 43 million parameters)</li>
  <li>Small dataset (focused on news only). We also didn’t do any data cleaning.</li>
  <li>Not trained enough. Perhaps we could have trained it for more epochs. Even if we did, the quality of the generated texts wouldn’t be usable for any purpose :)</li>
  <li>Greedy sampling when generating. We just selected the next token with the highest probability, which does not always give good results. This is true for bigger models as well. In the next post, I’ll cover sampling strategies; a minimal sketch of greedy decoding is shown right after this list. You can also refer to <a href="https://huggingface.co/blog/how-to-generate">this post from Hugging Face</a> to get an idea of different strategies.</li>
</ol>
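<p>To make point 4 concrete, here is a minimal sketch of greedy decoding, assuming a hypothetical <code class="language-plaintext highlighter-rouge">model</code> that returns logits of shape <code class="language-plaintext highlighter-rouge">(batch, seq_len, vocab_size)</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def greedy_generate(model, input_ids, max_new_tokens=30):
    # repeatedly pick the single highest-probability next token
    for _ in range(max_new_tokens):
        logits = model(input_ids)                # (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
</code></pre></div></div>

<p>Sampling strategies like top-k or nucleus sampling replace the <code class="language-plaintext highlighter-rouge">argmax</code> with a draw from a (filtered) probability distribution, which usually produces more varied text.</p>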

<h1 id="conclusion">Conclusion</h1>
<p>We implemented a Transformer Decoder layer and trained a Decoder-only model similar to GPT. The only thing that differs from the Encoder is the use of a Causal Mask; otherwise, the encoder and decoder are pretty much the same. I hope you found this post useful. Please let me know if there are any errors.</p>]]></content><author><name>Sanjaya Subedi</name></author><category term="DeepLearning" /><summary type="html"><![CDATA[Let's implement a Transformer Decoder Layer from scratch using Pytorch]]></summary></entry><entry><title type="html">Masking in Transformer Encoder/Decoder Models</title><link href="https://sanjayasubedi.com.np/deeplearning/masking-in-attention/" rel="alternate" type="text/html" title="Masking in Transformer Encoder/Decoder Models" /><published>2024-09-21T18:04:00+00:00</published><updated>2024-09-21T18:04:00+00:00</updated><id>https://sanjayasubedi.com.np/deeplearning/masking-in-attention</id><content type="html" xml:base="https://sanjayasubedi.com.np/deeplearning/masking-in-attention/"><![CDATA[<h1 id="introduction">Introduction</h1>
<p>You have probably encountered parameters like <code class="language-plaintext highlighter-rouge">key_padding_mask</code>, <code class="language-plaintext highlighter-rouge">attn_mask</code>, etc. when using Pytorch’s <a href="https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention.forward">MultiheadAttention layer</a>. Similarly, if you are using <a href="https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html#torch.nn.TransformerEncoderLayer.forward">TransformerEncoderLayer</a>, you can pass parameters like <code class="language-plaintext highlighter-rouge">src_mask</code> and <code class="language-plaintext highlighter-rouge">src_key_padding_mask</code>.</p>

<p>When using <a href="https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoder.html#torch.nn.TransformerDecoder.forward">TransformerDecoder layer</a> you’ll encounter even more parameters related to masking including <code class="language-plaintext highlighter-rouge">tgt_mask</code>, <code class="language-plaintext highlighter-rouge">tgt_key_padding_mask</code>.</p>

<p>In this post, I’ll demystify what these parameters are and how they are used internally. My goal is that after reading this post you are aware of the importance of masking, and know how to properly create masks and pass them to the models.</p>

<p>Note that I assume that you have some idea how attention is calculated. In the <a href="/deeplearning/multihead-attention-from-scratch/">previous post</a>, we implemented the “core” part of Transformers model, the Scaled Dot Product Attention function. However, we deliberately skipped how masking is used. Please refer to that post if you want a bit more context.</p>

<h1 id="masking-padded-tokens">Masking padded tokens</h1>

<p>When training or predicting, we typically pass a batch of data at once to the model rather than one by one. Let’s say we are classifying text into some categories and we have 3 sentences in our batch.</p>

<p><img src="/assets/images/deep-learning/masking-attention/sentence01.png" alt="sentence example" /></p>

<p>In this example, we have sentences of different lengths. We cannot create a Pytorch Tensor using the token IDs because all of the rows must have the same number of columns. So, in order to do this, we “pad” the shorter sentences with a <strong>PAD</strong> token so that they all have the same length as the longest one in the batch.</p>

<p><img src="/assets/images/deep-learning/masking-attention/sentence02_padding.png" alt="sentence padding example" /></p>

<p>Now we have a batch of data with padded tokens as well. Padding was necessary just to create a tensor of token IDs that could be fed into the model. These <strong>PAD</strong> tokens serve no other purpose for our actual task of sentence classification or any other task for that matter.</p>

<p>Since they are useless and provide no meaningful information, the model should “ignore” such tokens. This is where masks come in. We use masks to tell the model which tokens are real and should be considered, and which tokens should be ignored.</p>

<p>As we see in the figure below, the mask is also a 2D matrix with the same shape as the token IDs. The mask shown here is a binary mask, i.e. a tensor with <code class="language-plaintext highlighter-rouge">dtype=torch.bool</code>. A <code class="language-plaintext highlighter-rouge">True</code> value indicates that the token should be ignored.
<img src="/assets/images/deep-learning/masking-attention/sentence03_masking.png" alt="sentence masking example" /></p>

<p>Note that we can also define a <code class="language-plaintext highlighter-rouge">float mask</code> instead of a binary mask. We’ll see later how the binary mask is actually converted to a float mask and used by Pytorch. In Pytorch, when you want to use the mask for padded tokens, you need to provide it through the parameter called <code class="language-plaintext highlighter-rouge">*_key_padding_mask</code>. In the next section, we’ll see how the mask is actually used by the model.</p>

<p>Before that, let’s also look at another situation where masking is necessary.</p>

<h1 id="causal-masked-self-attention">Causal Masked Self-Attention</h1>

<p>For decoder models, especially in tasks like causal language modeling where the model generates text in an auto-regressive manner (i.e. predicts one token at a time), masking is essential during training.</p>

<p>Let’s look at how we structure the training process. Assume we have a single sentence in a batch “how are you”. Since our goal is to predict the next token, our simplified training process looks like the following.</p>

<p><img src="/assets/images/deep-learning/masking-attention/decoder_training01.png" alt="decoder training" /></p>

<p>For each token we have a corresponding label which is the next token in the sequence. Now in the Multi-Head Attention layer, we compute the dot-product similarity between query and key i.e. \(QK^T\) as shown in the figure below.</p>

<p><img src="/assets/images/deep-learning/masking-attention/decoder_training02.png" alt="decoder training" /></p>

<p>However, the problem is that for the token <code class="language-plaintext highlighter-rouge">how</code>, the label is <code class="language-plaintext highlighter-rouge">are</code> but when computing this dot-product similarity, the token <code class="language-plaintext highlighter-rouge">how</code> can also “see” the future tokens i.e. <code class="language-plaintext highlighter-rouge">are</code> and <code class="language-plaintext highlighter-rouge">you</code>. Same for the second token <code class="language-plaintext highlighter-rouge">are</code>. It can “see” the future token <code class="language-plaintext highlighter-rouge">you</code>.</p>

<p>This is a problem of data leakage and we should avoid it. The figure below shows the entries in this dot-product similarity matrix which are “invalid”.</p>

<p><img src="/assets/images/deep-learning/masking-attention/decoder_training03.png" alt="decoder training" /></p>

<p>In order to fix this issue, we introduce an “Attention Mask”. In the context of Decoders, this is also called “Masked Self-Attention”. The figure below shows what the attention mask looks like.</p>

<p><img src="/assets/images/deep-learning/masking-attention/decoder_training04.png" alt="decoder training" /></p>

<p>The mask has a value of negative infinity for “invalid” entries. This mask is added to the dot-product similarity matrix before we apply the softmax function. For valid entries, we just add zero so the original dot product between query and key does not change, but for invalid entries, the new value will be negative infinity.</p>

<p>Here is the intuition: when the dot product of two vectors is high, we consider them similar; so when two vectors have a dot product of negative infinity, they are “infinitely dissimilar”, and the softmax assigns them a weight of zero.</p>
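<p>A quick numerical check of this intuition:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

scores = torch.tensor([1.5, 0.3, float("-inf")])
# the -inf entry gets an attention weight of exactly 0 after softmax
print(torch.softmax(scores, dim=-1))  # tensor([0.7685, 0.2315, 0.0000])
</code></pre></div></div>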

<h1 id="how-it-is-used">How it is used</h1>
<blockquote>
  <p>⚠️ In Pytorch’s <a href="https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html">scaled_dot_product_attention</a> function, when a boolean mask is passed to the <code class="language-plaintext highlighter-rouge">attn_mask</code> parameter, a value of True indicates that the element should <strong>take part</strong> in attention. However, in the <a href="https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention.forward">MultiHeadAttention Layer</a>, <code class="language-plaintext highlighter-rouge">TransformerEncoderLayer</code> and <code class="language-plaintext highlighter-rouge">TransformerDecoderLayer</code>, for a binary mask a True value indicates that the corresponding key value will <strong>be ignored</strong> for the purpose of attention. I’m not sure why they implemented it differently, but in this post I will take a True value to mean the position is ignored during attention calculation.</p>
</blockquote>

<p>Let’s first see what the output of Pytorch’s MultiHeadAttention layer looks like.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">torch</span>
<span class="n">embed_dim</span> <span class="o">=</span> <span class="mi">4</span>
<span class="n">mha</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">MultiheadAttention</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_heads</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="c1"># assume we have a batch of 2 sentences. 1st has 3 tokens and 2nd has 2 tokens
</span><span class="n">embeddings</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">mean</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">std</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">))</span>
<span class="c1"># create a padding mask with all zeros so that every token is valid by default
</span><span class="n">key_padding_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">bool</span><span class="p">)</span>
<span class="c1"># 3rd token of second sentence is a pad token
</span><span class="n">key_padding_mask</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>

<span class="n">_</span><span class="p">,</span> <span class="n">torch_attn_mask</span> <span class="o">=</span> <span class="n">mha</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">key_padding_mask</span><span class="o">=</span><span class="n">key_padding_mask</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">torch_attn_mask</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre>tensor([
        # attention weights for tokens in first sentence
        [[0.2176, 0.4359, 0.3465],
         [0.5966, 0.1536, 0.2498],
         [0.4046, 0.2353, 0.3600]],

        # attention weights for tokens in second sentence
        [[0.5787, 0.4213, 0.0000],
         [0.4964, 0.5036, 0.0000],
         [0.5379, 0.4621, 0.0000]]], grad_fn=&lt;MeanBackward1&gt;)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I’ve printed the attention weights produced by the MHA layer. Since we specified that the last token of the 2nd sentence is a PAD token, the attention weights for that token are 0 (3rd column of the 2nd matrix). Since we take a weighted sum of Value vectors \(V_i\) using the attention weights, the embeddings of PAD tokens will not have any contribution.</p>

<p>For the first token in first sentence, the final embedding will be calculated as
\(token1\_embeddings = 0.21*V_{token1} + 0.43*V_{token2} + 0.34 * V_{token3}\)</p>

<p>And for the first token in second sentence, the final embedding will be calculated as
\(token1\_embeddings = 0.57*V_{token1} + 0.42*V_{token2} + 0 * V_{token3}\)</p>

<p>Note that the Value embeddings of \(token3\) in this case does not contribute at all since it becomes a zero vector after multiplying it with 0.</p>

<p>Let’s see how the mask is actually used internally. First, we need to reshape our mask to the proper shape. The <code class="language-plaintext highlighter-rouge">key_padding_mask</code> is 2D, i.e. <code class="language-plaintext highlighter-rouge">(batch_size, seq_len)</code>. But as we saw previously, we add the mask to the dot-product similarity, so we need to create a 3D tensor of shape <code class="language-plaintext highlighter-rouge">(batch_size, seq_len, seq_len)</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="c1"># reshape mask to proper shape
</span><span class="n">key_padding_mask_expanded</span> <span class="o">=</span> <span class="n">key_padding_mask</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># (bs, 1, seq_len)
# expand 3 times in the 2nd dimension since we have 3 tokens
</span><span class="n">key_padding_mask_expanded</span> <span class="o">=</span> <span class="n">key_padding_mask_expanded</span><span class="p">.</span><span class="n">expand</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">key_padding_mask_expanded</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre>tensor([
        # mask for 1st sentence. every token is valid
        [[False, False, False],
         [False, False, False],
         [False, False, False]],

        # mask for 2nd sentence. last token is invalid
        [[False, False,  True],
         [False, False,  True],
         [False, False,  True]]])
</pre></td></tr></tbody></table></code></pre></div></div>
<p>We are basically copying the same padding mask for each sentence 3 times.</p>

<p>Now let’s use the mask before calculating the final attention weights.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre><span class="c1"># compute dot-product between Query and Key tokens
</span><span class="n">scores</span> <span class="o">=</span> <span class="n">embeddings</span> <span class="o">@</span> <span class="n">embeddings</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span>
<span class="c1"># where ever the mask value is True, fill the corresponding entry in scores to -inf
</span><span class="n">scores</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="n">key_padding_mask_expanded</span><span class="p">,</span> <span class="o">-</span><span class="n">torch</span><span class="p">.</span><span class="n">inf</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span>
<span class="n">attn_weights</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">scores</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="n">decimals</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
</pre></td><td class="rouge-code"><pre># scores
tensor([[[ 1.5403, -2.2226, -0.4307],
         [-2.2226,  7.0114,  2.7344],
         [-0.4307,  2.7344,  2.9128]],

        [[ 2.1827, -0.3097, -1.5490],
         [-0.3097,  0.1501,  0.7644],
         [-1.5490,  0.7644,  4.9995]]])

# add -inf to masked tokens
tensor([[[ 1.5403, -2.2226, -0.4307],
         [-2.2226,  7.0114,  2.7344],
         [-0.4307,  2.7344,  2.9128]],

        [[ 2.1827, -0.3097,    -inf],
         [-0.3097,  0.1501,    -inf],
         [-1.5490,  0.7644,    -inf]]])

# attention weights
tensor([
        # attention weights for tokens in 1st sentence
        [[0.8600, 0.0200, 0.1200],
         [0.0000, 0.9900, 0.0100],
         [0.0200, 0.4500, 0.5300]],

        # attention weights for tokens in 2nd sentence
        [[0.9200, 0.0800, 0.0000],
         [0.3900, 0.6100, 0.0000],
         [0.0900, 0.9100, 0.0000]]])
</pre></td></tr></tbody></table></code></pre></div></div>
<p>As we see, the attention weights for the PAD token are 0 for the second sentence. Note that the other attention weights are not the same as the ones from the <code class="language-plaintext highlighter-rouge">mha</code> layer because <code class="language-plaintext highlighter-rouge">mha</code> passes the input embeddings through a linear layer, which changes the values of the embeddings. But that is not our concern here; we are just making sure that the attention weights for the 3rd token are 0 in both cases.</p>

<p>Here, the important part is <code class="language-plaintext highlighter-rouge">scores = scores.masked_fill(key_padding_mask_expanded, -torch.inf)</code>. This is the same as the following.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="n">scores</span> <span class="o">=</span> <span class="n">embeddings</span> <span class="o">@</span> <span class="n">embeddings</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># create a float_mask as I describe previously
</span><span class="n">float_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">key_padding_mask_expanded</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">).</span><span class="n">masked_fill</span><span class="p">(</span><span class="n">key_padding_mask_expanded</span><span class="p">,</span> <span class="o">-</span><span class="n">torch</span><span class="p">.</span><span class="n">inf</span><span class="p">)</span>
<span class="c1"># add the float mask to the scores and apply softmax function
</span><span class="k">print</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">scores</span> <span class="o">+</span> <span class="n">float_mask</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">).</span><span class="nb">round</span><span class="p">(</span><span class="n">decimals</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This is all there is to it. Now let’s look at the causal mask, which is also very easy to create. As mentioned above, the purpose of the causal mask is to prevent attending to future tokens, so we can create this kind of mask using the <code class="language-plaintext highlighter-rouge">torch.triu</code> function.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="c1"># we have 2 sentences and 3 tokens
</span><span class="n">causal_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">bool</span><span class="p">)</span>
<span class="n">causal_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">triu</span><span class="p">(</span><span class="n">causal_mask</span><span class="p">,</span> <span class="n">diagonal</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">causal_mask</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">mha</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">attn_mask</span><span class="o">=</span><span class="n">causal_mask</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="rouge-code"><pre># causal mask
tensor([[[False,  True,  True],
         [False, False,  True],
         [False, False, False]],

        [[False,  True,  True],
         [False, False,  True],
         [False, False, False]]])

# attention weights
tensor([
        # attention weights for tokens in 1st sentence
        [[1.0000, 0.0000, 0.0000],
         [0.7953, 0.2047, 0.0000],
         [0.4046, 0.2353, 0.3600]],

        # attention weights for tokens in 2nd sentence
        [[1.0000, 0.0000, 0.0000],
         [0.4964, 0.5036, 0.0000],
         [0.3736, 0.3210, 0.3054]]], grad_fn=&lt;MeanBackward1&gt;)
</pre></td></tr></tbody></table></code></pre></div></div>
<p>As we see, the weights for future tokens are 0. This way, future tokens have no influence when calculating the embedding of the current token. There is also a helper function in Pytorch that you can use to easily generate this kind of mask.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">causal_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Transformer</span><span class="p">.</span><span class="n">generate_square_subsequent_mask</span><span class="p">(</span><span class="n">sz</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span> <span class="c1"># we have 3 tokens, so size=3
</span><span class="k">print</span><span class="p">(</span><span class="n">mha</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">attn_mask</span><span class="o">=</span><span class="n">causal_mask</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>which returns the following, which is exactly the same output as before.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre>tensor([[[1.0000, 0.0000, 0.0000],
         [0.7953, 0.2047, 0.0000],
         [0.4046, 0.2353, 0.3600]],

        [[1.0000, 0.0000, 0.0000],
         [0.4964, 0.5036, 0.0000],
         [0.3736, 0.3210, 0.3054]]], grad_fn=&lt;MeanBackward1&gt;)
</pre></td></tr></tbody></table></code></pre></div></div>

<h1 id="conclusion">Conclusion</h1>
<p>We explored how masks are used internally when calculating attention. In most cases, we can just create a binary mask and pass it to the layers. Internally, it will be converted to a float mask and added to the dot-product similarity between Query and Key tokens before the softmax function is applied.</p>

<p>I hope you found this post useful. Please let me know if you find any errors.</p>]]></content><author><name>Sanjaya Subedi</name></author><category term="DeepLearning" /><summary type="html"><![CDATA[Understand why masking is needed in Transformer Encoder and Decoder networks and how they are used]]></summary></entry><entry><title type="html">Multi-Head Attention From Scratch</title><link href="https://sanjayasubedi.com.np/deeplearning/multihead-attention-from-scratch/" rel="alternate" type="text/html" title="Multi-Head Attention From Scratch" /><published>2024-09-09T14:04:00+00:00</published><updated>2024-09-09T14:04:00+00:00</updated><id>https://sanjayasubedi.com.np/deeplearning/multihead-attention-from-scratch</id><content type="html" xml:base="https://sanjayasubedi.com.np/deeplearning/multihead-attention-from-scratch/"><![CDATA[<h1 id="introduction">Introduction</h1>
<p>In this post, we’ll implement Multi-Head Attention layer from scratch using Pytorch. We’ll also compare our implementation against Pytorch’s implementation and use this layer in a text classification task. Specifically we’ll do the following:</p>

<ul>
  <li>Implement Scaled Dot Product Attention</li>
  <li>Implement our own Multi-Head Attention (MHA) Layer</li>
  <li>Implement an efficient version of Multi-Head Attention Layer</li>
  <li>Use our two implementations and Pytorch’s implementation in a model to classify texts and evaluate their performance</li>
  <li>Implement Positional Embeddings and see why they are useful</li>
</ul>

<p>I’ve tried to explain each step in as much detail as possible, so some of the details may be obvious to many readers, but I will cover them here anyway. The overall idea of MHA is pretty straightforward, but when implementing it I ran into many issues, especially related to reshaping tensors. So this post is also for my own sake: I want to be able to refer back to these implementation details later.</p>

<p>Ok, now let’s begin. Since you are already here, I guess you know that the Multi-Head Attention layer is the backbone of the Transformer architecture. Many recent models, including ChatGPT, Gemini, LLama, etc., are based on the Transformer architecture, which was introduced in the paper <a href="https://arxiv.org/pdf/1706.03762">Attention Is All You Need</a>. As mentioned above, in this post we’ll just focus on the Multi-Head Attention layer.</p>

<p><img src="/assets/images/deep-learning/mha-scratch/mha_dp_fig.png" alt="Multi-Head Attention" /></p>

<h1 id="scaled-dot-product-attention">Scaled Dot Product Attention</h1>
<p>Let’s focus on one of many approaches to calculate attention - Scaled Dot Product Attention. The figure below (taken from the paper shared above) shows how scaled dot product attention is calculated.</p>

<p><img src="/assets/images/deep-learning/mha-scratch/scaled_dp.png" alt="Scaled Dot Product" /></p>

<p>Let’s look at the formula first.</p>

\[Attention(Q,K,V) = (Attention\_Weights ) V\]

<p>where, \(Attention\_Weights = softmax(\frac{QK^T}{\sqrt{d_k}})\)</p>

<p>This function \(Attention\) accepts 3 matrices: Query, Key and Value. What are those?</p>

<p>For the sake of this discussion, I’ll NOT consider batched operations, so there is no batch dimension; during implementation we’ll take care of it. Let’s say we have a single text (i.e. sequence) <code class="language-plaintext highlighter-rouge">how are you</code>. When using a tokenizer, we’ll get something like:</p>

<table>
  <thead>
    <tr>
      <th>Token</th>
      <th>Token ID</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>how</td>
      <td>10</td>
    </tr>
    <tr>
      <td>are</td>
      <td>3</td>
    </tr>
    <tr>
      <td>you</td>
      <td>5</td>
    </tr>
  </tbody>
</table>

<p>Once we pass the token ids through an embedding layer, we’ll obtain a vector for each token. Let’s say our embedding dimension is 2, so the embeddings matrix for this sequence will have a shape of \(3 \times 2\)</p>

\[\begin{bmatrix}
1.1 &amp; 1.2\\
2.1 &amp; 2.2\\
3.1 &amp; 3.2
\end{bmatrix}_{3\times2}\]

<p>Typically this kind of embedding matrix is used as Query, Key and Value in the first Multi-Head Attention layer. Outputs of previous layers can also be used, as long as they have the required shape, i.e. <code class="language-plaintext highlighter-rouge">(sequence_length, embedding_dim)</code>. To limit the scope of this post, I’ll focus on a variant called self-attention, where the same embedding matrix is passed as Query, Key and Value.</p>
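<p>As a toy illustration of how such an embedding matrix comes about (the vocabulary size and the random values here are arbitrary assumptions):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

token_ids = torch.tensor([10, 3, 5])  # "how are you"
embedding = torch.nn.Embedding(num_embeddings=20, embedding_dim=2)
X = embedding(token_ids)
print(X.shape)  # torch.Size([3, 2]) i.e. (sequence_length, embedding_dim)
</code></pre></div></div>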

<h2 id="intuition">Intuition</h2>
<p>Let’s look at the formula to calculate attention weights again. \(Attention\_Weights = softmax(\frac{QK^T}{\sqrt{d_k}})\)</p>

<p>What does it mean?</p>

<p>If we just focus on \(QK^T\), we can think of this as a pairwise dot-product similarity calculation between each token in Query and each token in Key. From our example above, we had 3 tokens and an embedding matrix of shape \(3 \times 2\) (the embedding dimension is 2). Since we use the same embedding matrix as Query and Key, we get a \(3 \times 3\) matrix giving the pairwise dot products as follows. The values shown are random.</p>

\[\begin{bmatrix}
&amp;    how &amp; are &amp; you\\
how &amp; 0.5 &amp; 0.1 &amp; 0.4\\
are &amp; 0.1 &amp; 0.8 &amp; 0.3\\
you &amp; 0.4 &amp; 0.3 &amp; 0.1
\end{bmatrix}_{3 \times 3}\]

<p>So this matrix tells us how “similar” each pair of tokens is. These are un-normalized scores, so the authors of the paper proposed dividing this matrix elementwise by the square root of the embedding dimension of the Key (\(d_k\)) and then applying a softmax function.</p>

<p>The softmax function is applied to each row so that the numbers in each row add up to 1. This is done so that we can interpret these values as weights. Here is what applying the softmax function to each row looks like. Note that I’ve rounded the numbers to 2 decimal places for illustration, so they might not add up to exactly 1. Also, I haven’t divided by \(\sqrt{d_k}\) for this illustration.</p>

\[\begin{bmatrix}
&amp;    how &amp; are &amp; you\\
how &amp; 0.38 &amp; 0.26 &amp; 0.35\\
are &amp; 0.23 &amp; 0.48 &amp; 0.29\\
you &amp; 0.38 &amp; 0.34 &amp; 0.28
\end{bmatrix}_{3 \times 3}\]
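<p>We can verify these weights with a quick computation (skipping the \(\sqrt{d_k}\) scaling, as in the illustration); the printed values match the matrix above up to rounding:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

qk = torch.tensor([[0.5, 0.1, 0.4],
                   [0.1, 0.8, 0.3],
                   [0.4, 0.3, 0.1]])
attn_weights = torch.softmax(qk, dim=-1)  # softmax over each row
print(attn_weights.round(decimals=2))
</code></pre></div></div>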

<p>Now we have attention weights. These weights are used to create the final “attention output” by taking a weighted sum of the Value vectors. Let me illustrate.</p>

<p>If we look at the attention weights for the token <code class="language-plaintext highlighter-rouge">how</code> (1st row in the attention weights matrix), we have:
\(\begin{bmatrix}
how &amp; are &amp; you\\
0.38 &amp; 0.26 &amp; 0.35
\end{bmatrix}\)</p>

<p>This means that to produce the final attention output for the token <code class="language-plaintext highlighter-rouge">how</code>, the Value vector of <code class="language-plaintext highlighter-rouge">how</code> should be weighted by 0.38, <code class="language-plaintext highlighter-rouge">are</code> by 0.26 and <code class="language-plaintext highlighter-rouge">you</code> by 0.35. Finally, these weighted vectors are summed together to create the final vector for the token <code class="language-plaintext highlighter-rouge">how</code>. The same goes for the other tokens.</p>

<p>This is obtained by performing a matrix multiplication of attention weights and value vector as shown below in the formula.</p>

\[Attention(Q,K,V) = (Attention\_Weights ) V\]

<p>Since we used the same data as Query, Key and Value, here is what the attention weights and the Value matrix look like.</p>

\[\begin{bmatrix}
&amp;    how &amp; are &amp; you\\
how &amp; 0.38 &amp; 0.26 &amp; 0.35\\
are &amp; 0.23 &amp; 0.48 &amp; 0.29\\
you &amp; 0.38 &amp; 0.34 &amp; 0.28
\end{bmatrix}_{3 \times 3}

\begin{bmatrix}
1.1 &amp; 1.2\\
2.1 &amp; 2.2\\
3.1 &amp; 3.2
\end{bmatrix}_{3\times2}\]

<p>and this is the output we get</p>

\[\begin{bmatrix}
how &amp; 2.0630 &amp; 2.1630\\
are &amp; 2.1523 &amp; 2.2523\\
you &amp; 2.0020 &amp; 2.1020
\end{bmatrix}_{3 \times 2}\]

<p>You can think of this output as “enriched” embeddings for each token. Also, you’ve probably noticed that this output has the same shape as the original embedding matrix: 3 tokens with an embedding dimension of 2.</p>
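<p>Continuing with the same numbers, here is a small self-contained check that reproduces this output (up to rounding):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

qk = torch.tensor([[0.5, 0.1, 0.4],
                   [0.1, 0.8, 0.3],
                   [0.4, 0.3, 0.1]])
attn_weights = torch.softmax(qk, dim=-1)  # un-rounded attention weights
V = torch.tensor([[1.1, 1.2],
                  [2.1, 2.2],
                  [3.1, 3.2]])
attn_output = attn_weights @ V  # (3, 3) @ (3, 2) -&gt; (3, 2)
print(attn_output)
</code></pre></div></div>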

<h2 id="implementation">Implementation</h2>
<p>Now, let’s switch to implementation which is pretty straightforward.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">my_scaled_dot_product_attention</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="n">key</span> <span class="o">=</span> <span class="n">key</span> <span class="k">if</span> <span class="n">key</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">query</span>
    <span class="n">value</span> <span class="o">=</span> <span class="n">value</span> <span class="k">if</span> <span class="n">value</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">query</span>
    <span class="c1"># query and key must have same embedding dimension
</span>    <span class="k">assert</span> <span class="n">query</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="n">key</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">dk</span> <span class="o">=</span> <span class="n">key</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># embed dimension of key
</span>    <span class="c1"># query, key, value = (bs, seq_len, embed_dim)
</span>    
    <span class="c1"># compute dot-product to obtain pairwise "similarity" and scale it
</span>    <span class="n">qk</span> <span class="o">=</span> <span class="n">query</span> <span class="o">@</span> <span class="n">key</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="n">dk</span><span class="o">**</span><span class="mf">0.5</span>
    
    <span class="c1"># apply softmax
</span>    <span class="c1"># attn_weights = (bs, seq_len, seq_len)
</span>    <span class="n">attn_weights</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">qk</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

    <span class="c1"># compute weighted sum of value vectors
</span>    <span class="c1"># attn = (bs, seq_len, embed_dim)
</span>    <span class="n">attn</span> <span class="o">=</span> <span class="n">attn_weights</span> <span class="o">@</span> <span class="n">value</span>
    <span class="k">return</span> <span class="n">attn</span><span class="p">,</span> <span class="n">attn_weights</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This function implements Scaled Dot Product Attention. Note that I’ve ignored masking; we apply a mask so that we do not attend to padded tokens, but for this post I’ll not focus on implementing it. The comments in the code assume a 3-dimensional tensor for query, key and value, but as we’ll see later, this will work for higher dimensional tensors as well.</p>

<p>First we make sure that <code class="language-plaintext highlighter-rouge">query</code> and <code class="language-plaintext highlighter-rouge">key</code> have same embedding dimension. Note that <code class="language-plaintext highlighter-rouge">value</code> can have different dimension.</p>

<p>Next, we figure out the embedding dimension by taking the size of last dimension <code class="language-plaintext highlighter-rouge">dk = key.size(-1)</code>.</p>

<p>Then we compute the pair-wise dot product between each token in query and key via <code class="language-plaintext highlighter-rouge">query @ key.transpose(-1, -2)</code>. We need to transpose the <code class="language-plaintext highlighter-rouge">key</code> so that we can perform matrix multiplication with the <code class="language-plaintext highlighter-rouge">query</code>.</p>

<p>The rest of the code should be straightforward.</p>

<p>Let’s verify our implementation against Pytorch’s implementation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="n">X</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">mean</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">std</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">torch_attended</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">scaled_dot_product_attention</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">X</span><span class="p">)</span>
<span class="n">attended</span><span class="p">,</span> <span class="n">attn_weights</span> <span class="o">=</span> <span class="n">my_scaled_dot_product_attention</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">X</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">allclose</span><span class="p">(</span><span class="n">torch_attended</span><span class="p">,</span> <span class="n">attended</span><span class="p">)</span> <span class="o">==</span> <span class="bp">True</span>
</pre></td></tr></tbody></table></code></pre></div></div>
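
<p>Since the function only touches the last two dimensions, the same check passes for higher-dimensional tensors too. A quick sanity check with a 4D tensor (my own verification):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>X4 = torch.normal(mean=0, std=1, size=(2, 4, 3, 6))
torch_attended_4d = torch.nn.functional.scaled_dot_product_attention(X4, X4, X4)
attended_4d, _ = my_scaled_dot_product_attention(X4, X4, X4)
assert torch.allclose(torch_attended_4d, attended_4d)
</code></pre></div></div>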

<h2 id="batched-matrix-multiplication">Batched Matrix multiplication</h2>
<p>A bit about matrix multiplications for higher dimensional tensors. Matrix is a 2D tensor and matrix-matrix multiplication is pretty well known. But what happens when we have a 3D or even 4D tensor? I’ll give a couple of examples</p>

<p>Let’s say we have a batch of 3 sequences, each with 10 tokens, where each token has a 256-dimensional embedding. So we have a tensor \(A\) of shape <code class="language-plaintext highlighter-rouge">&lt;3, 10, 256&gt;</code>. What happens when we do \(AA^T\), i.e. <code class="language-plaintext highlighter-rouge">A @ A.transpose(-1, -2)</code>? Since there are 3 matrices, you can imagine a for loop performing one matrix multiplication per matrix. In code, that would look like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">matrix</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">256</span><span class="p">)</span>

<span class="n">output</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">batch_idx</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">batch_size</span><span class="p">):</span>
    <span class="n">pairwise_dot_product</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">batch_idx</span><span class="p">]</span> <span class="o">@</span> <span class="n">A</span><span class="p">[</span><span class="n">batch_idx</span><span class="p">].</span><span class="n">transpose</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">)</span>
    <span class="n">output</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pairwise_dot_product</span><span class="p">)</span>

<span class="c1"># Output has shape (batch_size, 10, 10)
</span><span class="k">return</span> <span class="n">output</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>What about a 4D tensor? Same idea as above: every dimension other than the last two is looped over.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">n_heads</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">matrix</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">n_heads</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">256</span><span class="p">)</span>
<span class="n">output</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">batch_idx</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">batch_size</span><span class="p">):</span>
    <span class="n">output_per_head</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">head_idx</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_heads</span><span class="p">):</span>
        <span class="n">pairwise_dot_product</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">batch_idx</span><span class="p">][</span><span class="n">head_idx</span><span class="p">]</span> <span class="o">@</span> <span class="n">A</span><span class="p">[</span><span class="n">batch_idx</span><span class="p">][</span><span class="n">head_idx</span><span class="p">].</span><span class="n">transpose</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">)</span>
        <span class="n">output_per_head</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pairwise_dot_product</span><span class="p">)</span>
    <span class="n">output</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">output_per_head</span><span class="p">)</span>

<span class="c1"># Output has shape (batch_size, n_heads, 10, 10)
</span><span class="k">return</span> <span class="n">output</span>
</pre></td></tr></tbody></table></code></pre></div></div>
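
<p>To convince ourselves that the loop matches what <code class="language-plaintext highlighter-rouge">@</code> does in one shot, here is a small check (my own sketch):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A = torch.randn(3, 10, 256)
looped = torch.stack([A[i] @ A[i].transpose(-1, -2) for i in range(A.size(0))])
batched = A @ A.transpose(-1, -2)
assert batched.shape == (3, 10, 10)
assert torch.allclose(looped, batched)
</code></pre></div></div>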

<h1 id="multi-head-attention">Multi-Head Attention</h1>
<p>The authors of the paper found that, instead of computing attention once over the full embedding size, it is beneficial to project the query, key and value \(h\) times, compute attention on each of those projections, concatenate the results, and then project the concatenated output once more. The figure below, taken from the paper, shows how MHA works.</p>

<p><img src="/assets/images/deep-learning/mha-scratch/mha.png" alt="Multi-Head Attention" /></p>

<p>The formula for computing MHA is as follows:</p>

\[MultiHeadAttention(Q, K, V) = Concat(head_1, head_2, \dots, head_h)W^O
\\
head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)\]

<p>We’ll go through it step by step to understand each concept. For this explanation I’ll again ignore the batch dimension and focus on a single sequence.</p>

<p>Let’s imagine we have a sequence with 3 tokens where each token has a 4-dimensional embedding. The authors refer to the embedding dimension as \(d_{model}\); I’ll just call it <code class="language-plaintext highlighter-rouge">embed_dim</code>.</p>

\[input = \begin{bmatrix}
how &amp; 1 &amp; 10 &amp; 100 &amp; 1000\\
are &amp; 2 &amp; 20 &amp; 200 &amp; 2000\\
you &amp; 3 &amp; 30 &amp; 300 &amp; 4000
\end{bmatrix}_{3 \times 4}\]

<p>And as mentioned above, in the case of self-attention we have \(Query = Key = Value = input\).</p>

<p><strong>Step 1: Linearly project Query, Key and Value \(h\) times</strong></p>

<p>As shown in the formula, we first need to calculate the output of each head. Let’s say we have 2 heads (<code class="language-plaintext highlighter-rouge">n_heads</code>). Note that <code class="language-plaintext highlighter-rouge">embed_dim</code> must be divisible by <code class="language-plaintext highlighter-rouge">n_heads</code>.</p>

<p>Each head will project the Query, Key and Value into <code class="language-plaintext highlighter-rouge">embed_dim / n_heads</code> i.e. <code class="language-plaintext highlighter-rouge">4/2 = 2</code> dimensions. I’ll refer to this as <code class="language-plaintext highlighter-rouge">head_dim</code>. This projection is done via a Linear layer where <code class="language-plaintext highlighter-rouge">in_features = embed_dim, out_features=head_dim</code>.</p>
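
<p>As a concrete sketch of this step (a toy example of mine, not from the model code below): each head’s projection is just a small linear layer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>embed_dim, head_dim, seq_len = 4, 2, 3
x = torch.randn(seq_len, embed_dim)              # our 3-token sequence
W_q_1 = torch.nn.Linear(embed_dim, head_dim, bias=False)
Q_1 = W_q_1(x)                                   # head 1's Query
print(Q_1.shape)                                 # torch.Size([3, 2])
</code></pre></div></div>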

<p>Let’s assume that after projection Head 1 and Head 2 produce the following. I’ve used the same values as the original embeddings for the sake of explanation.</p>

\[Q_1,K_1,V_1 = \begin{bmatrix}
how &amp; 1 &amp; 10\\
are &amp; 2 &amp; 20\\
you &amp; 3 &amp; 30
\end{bmatrix}_{3 \times 2}

Q_2,K_2,V_2 = \begin{bmatrix}
how &amp; 100 &amp; 1000\\
are &amp; 200 &amp; 2000\\
you &amp; 300 &amp; 4000
\end{bmatrix}_{3 \times 2}\]

<p><strong>Step 2: Compute Attention for each head</strong></p>

<p>Now we compute the attention for each of the heads using the respective Query, Key and Values.</p>

\[head_1 = Attention(Q_1, K_1, V_1)
\\
head_2 = Attention(Q_2, K_2, V_2)\]

<p>We’ll get an output something like this. Again for the sake of explanation, let’s assume that computing attention adds 0.1 to each value in \(head_1\) and 0.2 in \(head_2\).</p>

\[head_1 = \begin{bmatrix}
how &amp; 1.1 &amp; 10.1\\
are &amp; 2.1 &amp; 20.1\\
you &amp; 3.1 &amp; 30.1
\end{bmatrix}_{3 \times 2}

head_2 = \begin{bmatrix}
how &amp; 100.2 &amp; 1000.2\\
are &amp; 200.2 &amp; 2000.2\\
you &amp; 300.2 &amp; 4000.2
\end{bmatrix}_{3 \times 2}\]

<p><strong>Step 3: Concatenate head outputs</strong></p>

<p>As shown in the formula, we need to concatenate the outputs of each head. Also note the shape after concatenation, which is the same as the original embedding.</p>

\[Concat(head_1, head_2) = \begin{bmatrix}
how &amp; 1.1 &amp; 10.1 &amp; 100.2 &amp; 1000.2\\
are &amp; 2.1 &amp; 20.1 &amp; 200.2 &amp; 2000.2\\
you &amp; 3.1 &amp; 30.1 &amp; 300.2 &amp; 4000.2
\end{bmatrix}_{3 \times 4}\]

<p><strong>Step 4: Final projection</strong></p>

<p>We again project the concatenated output with a Linear layer. For this layer the weight matrix has shape <code class="language-plaintext highlighter-rouge">&lt;embed_dim, embed_dim&gt;</code>, i.e. <code class="language-plaintext highlighter-rouge">in_features = embed_dim, out_features=embed_dim</code>, because we want the output of MHA to have the same embedding dimension as the input.</p>

<p>After the final projection, MHA is done!</p>
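
<p>Putting steps 3 and 4 together in code (a minimal sketch with random head outputs standing in for the toy numbers above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>head_1 = torch.randn(3, 2)  # (seq_len, head_dim)
head_2 = torch.randn(3, 2)

# step 3: concatenate along the embedding dimension, giving shape (3, 4)
concatenated = torch.cat([head_1, head_2], dim=-1)

# step 4: final projection back to embed_dim, output shape (3, 4)
W_o = torch.nn.Linear(4, 4, bias=False)
mha_output = W_o(concatenated)
</code></pre></div></div>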

<h2 id="naive-implementation">Naive Implementation</h2>
<p>Let’s implement MHA using the approach described in the paper, where there are \(h\) different heads and each head has its own Linear layers for projecting Query, Key and Value.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">AttentionBlock</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="c1"># Linear layers to project Query, Key and Value 
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">W_q</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">bias</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_k</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">bias</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_v</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">output_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">bias</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span>
        <span class="c1"># project Q, K, V
</span>        <span class="n">q_logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_q</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
        <span class="n">k_logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_k</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
        <span class="n">v_logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_v</span><span class="p">(</span><span class="n">value</span><span class="p">)</span>

        <span class="c1"># apply scaled dot product attention on projected values
</span>        <span class="n">attn</span><span class="p">,</span> <span class="n">weights</span> <span class="o">=</span> <span class="n">my_scaled_dot_product_attention</span><span class="p">(</span><span class="n">q_logits</span><span class="p">,</span> <span class="n">k_logits</span><span class="p">,</span> <span class="n">v_logits</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">attn</span><span class="p">,</span> <span class="n">weights</span>

<span class="k">class</span> <span class="nc">MyMultiheadAttention</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">n_heads</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">projection_bias</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="k">assert</span> <span class="n">embed_dim</span> <span class="o">%</span> <span class="n">n_heads</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"embed_dim must be divisible by n_heads"</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">embed_dim</span> <span class="o">=</span> <span class="n">embed_dim</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">n_heads</span> <span class="o">=</span> <span class="n">n_heads</span>
        <span class="n">head_embed_dim</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">embed_dim</span> <span class="o">//</span> <span class="n">n_heads</span>
        <span class="c1"># for each head, create an attention block
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">head_blocks</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ModuleList</span><span class="p">([</span><span class="n">AttentionBlock</span><span class="p">(</span><span class="n">input_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">output_dim</span><span class="o">=</span><span class="n">head_embed_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">projection_bias</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">n_heads</span><span class="p">)])</span>
        <span class="c1"># final projection of MHA
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">projection</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">projection_bias</span><span class="p">)</span>


    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span>
        <span class="c1"># these lists are to store output of each head
</span>        <span class="n">attns_list</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">attn_weights_list</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="c1"># for every head pass the original query, key, value
</span>        <span class="k">for</span> <span class="n">head</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">head_blocks</span><span class="p">:</span>
            <span class="n">attn</span><span class="p">,</span> <span class="n">attn_weights</span> <span class="o">=</span> <span class="n">head</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
            <span class="n">attns_list</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">attn</span><span class="p">)</span>
            <span class="n">attn_weights_list</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">)</span>

        <span class="c1"># concatenate attention outputs and take average of attention weights
</span>        <span class="n">attns</span><span class="p">,</span> <span class="n">attn_weights</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">(</span><span class="n">attns_list</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">attn_weights_list</span><span class="p">).</span><span class="n">mean</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="c1"># shape: (bs, seq_len, embed_dim), attn_weights: (bs, seq_len, seq_len)
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">projection</span><span class="p">(</span><span class="n">attns</span><span class="p">),</span> <span class="n">attn_weights</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>In the code above we defined a class <code class="language-plaintext highlighter-rouge">AttentionBlock</code> which encapsulates the calculations done by each head. Query, Key and Value are projected independently using 3 different linear layers, and then scaled dot-product attention is computed. Note that in the paper the projections do not add a bias, but I’ve seen implementations that do; that is why there is a parameter called <code class="language-plaintext highlighter-rouge">projection_bias</code>. If we set it to false, the module matches the formula exactly.</p>

<p><code class="language-plaintext highlighter-rouge">MyMultiheadAttention</code> is the class that implements Multi-Head Attention. Here we make sure that <code class="language-plaintext highlighter-rouge">embed_dim</code> is divisible by <code class="language-plaintext highlighter-rouge">n_heads</code> and then we create <code class="language-plaintext highlighter-rouge">AttentionBlock</code> for each head. In the <code class="language-plaintext highlighter-rouge">forward</code> method, we loop through each head and then compute the attention. We save both the attention output and the weights in a list. We concatenate the attention outputs using <code class="language-plaintext highlighter-rouge">torch.cat(attns_list, dim=2)</code>. Since we get multiple attention weights from each head, here I’ve just averaged the attention weights <code class="language-plaintext highlighter-rouge">torch.stack(attn_weights_list).mean(dim=0)</code>.</p>

<p>Finally we project the attention outputs using <code class="language-plaintext highlighter-rouge">self.projection(attns)</code> and return the result.</p>

<p>This is all there is to it. We can implement this more efficiently by eliminating the loop over the heads, but before we do that, let’s use our implementation on a concrete task.</p>

<h1 id="usage-text-classification">Usage: Text Classification</h1>
<p>Let’s build a text classification model using our implementation of MHA and Pytorch’s implementation and compare the performance.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">datasets</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>

<span class="n">original_tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"sentence-transformers/all-MiniLM-L6-v2"</span><span class="p">)</span>


<span class="n">news_ds</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">load_dataset</span><span class="p">(</span><span class="s">"SetFit/bbc-news"</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">"train"</span><span class="p">)</span>
<span class="c1"># train a new tokenizer with limited vocab size for demo
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">original_tokenizer</span><span class="p">.</span><span class="n">train_new_from_iterator</span><span class="p">(</span><span class="n">news_ds</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">vocab_size</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>To quickly get started, I’ve loaded a pre-trained tokenizer and a dataset from the HuggingFace hub. The dataset contains news articles in 5 classes: tech, business, sports, entertainment and politics.</p>

<p>To keep things small, I created a new tokenizer with the same config as <code class="language-plaintext highlighter-rouge">original_tokenizer</code> but with a vocabulary size of just 1,000. The original tokenizer has a vocab size of 30,522, which results in a large <code class="language-plaintext highlighter-rouge">Embedding</code> layer. For our purposes a vocab size of 1,000 is just fine, and we can train our models quickly on a CPU.</p>
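
<p>To put rough numbers on that: the <code class="language-plaintext highlighter-rouge">Embedding</code> layer holds <code class="language-plaintext highlighter-rouge">vocab_size * embed_dim</code> weights, so with the <code class="language-plaintext highlighter-rouge">embed_dim = 64</code> we use later, shrinking the vocabulary cuts it from about 1.95M parameters down to 64K.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>embed_dim = 64
print(30_522 * embed_dim)  # 1953408 parameters with the original vocab
print(1_000 * embed_dim)   # 64000 parameters with the reduced vocab
</code></pre></div></div>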

<p>Then we tokenize our dataset and split it into train and test set.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">batch</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">ds</span> <span class="o">=</span> <span class="n">news_ds</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">tokenize</span><span class="p">,</span> <span class="n">batched</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">select_columns</span><span class="p">([</span><span class="s">'label'</span><span class="p">,</span> <span class="s">'input_ids'</span><span class="p">,</span> <span class="s">'text'</span><span class="p">]).</span><span class="n">train_test_split</span><span class="p">()</span>

<span class="n">class_id_to_class</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mi">0</span><span class="p">:</span> <span class="s">"tech"</span><span class="p">,</span>
    <span class="mi">1</span><span class="p">:</span> <span class="s">"business"</span><span class="p">,</span>
    <span class="mi">2</span><span class="p">:</span> <span class="s">"sports"</span><span class="p">,</span>
    <span class="mi">3</span><span class="p">:</span> <span class="s">"entertainment"</span><span class="p">,</span>
    <span class="mi">4</span><span class="p">:</span> <span class="s">"politics"</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">num_classes</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">class_id_to_class</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Next, let’s create our text-classification model. The model needs a few parameters: vocab_size, embed_dim, num_classes and mha. Since we’ll compare multiple implementations of MHA, we accept the MHA module as a parameter at initialization. Note that this is a very simple model; the goal is not to get the best classifier but a working one to compare our MHA implementation against Pytorch’s.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">TextClassifier</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">mha</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">embedding</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">num_embeddings</span><span class="o">=</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">padding_idx</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">mha</span> <span class="o">=</span> <span class="n">mha</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">relu</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">final</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="n">num_classes</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="c1"># inputs: (bs, seq_len)
</span>        <span class="c1"># embeddings: (bs, seq_len, embed_dim)
</span>        <span class="n">embeddings</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_embeddings</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
        <span class="n">attn</span><span class="p">,</span> <span class="n">attn_weights</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_attention</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">)</span>
        
        <span class="c1"># take the first token's embeddings i.e. embeddings of CLS token
</span>        <span class="c1"># cls_token_embeddings: (bs, embed_dim)
</span>        <span class="n">cls_token_embeddings</span> <span class="o">=</span> <span class="n">attn</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">,</span> <span class="p">:]</span> 
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">final</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">cls_token_embeddings</span><span class="p">)))</span>
    
    <span class="k">def</span> <span class="nf">get_embeddings</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">embedding</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">get_attention</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span>
        <span class="n">attn</span><span class="p">,</span> <span class="n">attn_weights</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">mha</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">attn</span><span class="p">,</span> <span class="n">attn_weights</span>

<span class="n">n_heads</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">embed_dim</span> <span class="o">=</span> <span class="mi">64</span>
<span class="n">vocab_size</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">vocab_size</span>
<span class="n">torch_mha</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">MultiheadAttention</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_heads</span><span class="o">=</span><span class="n">n_heads</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">my_mha</span> <span class="o">=</span> <span class="n">MyMultiheadAttention</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">n_heads</span><span class="o">=</span><span class="n">n_heads</span><span class="p">,</span> <span class="n">projection_bias</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">torch_classifier</span> <span class="o">=</span> <span class="n">TextClassifier</span><span class="p">(</span><span class="n">vocab_size</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_classes</span><span class="o">=</span><span class="n">num_classes</span><span class="p">,</span> <span class="n">mha</span><span class="o">=</span><span class="n">torch_mha</span><span class="p">)</span>
<span class="n">my_classifier</span> <span class="o">=</span> <span class="n">TextClassifier</span><span class="p">(</span><span class="n">vocab_size</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_classes</span><span class="o">=</span><span class="n">num_classes</span><span class="p">,</span> <span class="n">mha</span><span class="o">=</span><span class="n">my_mha</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Here we have two classifiers: one using Pytorch’s MHA implementation and one using ours. Both have 8 heads and an <code class="language-plaintext highlighter-rouge">embed_dim</code> of 64. If our implementation is correct, both models should reach almost the same accuracy.</p>
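
<p>Before training, one quick way to sanity-check the wiring (my own check; it only compares shapes, not values): both modules should accept the same inputs and return outputs of the same shape.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>X = torch.randn(2, 5, embed_dim)  # (bs, seq_len, embed_dim)
torch_attn, torch_weights = torch_mha(X, X, X)
my_attn, my_weights = my_mha(X, X, X)
assert torch_attn.shape == my_attn.shape == (2, 5, embed_dim)
assert torch_weights.shape == my_weights.shape == (2, 5, 5)
</code></pre></div></div>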

<p>Next, we’ll create a train function with the following signature: <code class="language-plaintext highlighter-rouge">train(model: torch.nn.Module, train_dl, val_dl, epochs=10) -&gt; list[tuple[float, float]]</code>. This function trains the model and returns a (train loss, test loss) pair for each epoch. Note that I ran this on a CPU, so the training loop does not move tensors or models to a GPU.</p>

<details>
    <summary>Click to expand training loop code</summary>

<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="k">def</span> <span class="nf">collate_fn</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">input_ids</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">batch</span><span class="p">:</span>
        <span class="n">labels</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">'label'</span><span class="p">])</span>
        <span class="n">input_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">'input_ids'</span><span class="p">]))</span>

    <span class="n">input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">rnn</span><span class="p">.</span><span class="n">pad_sequence</span><span class="p">(</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">padding_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span>
    <span class="n">input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"labels"</span><span class="p">:</span> <span class="n">labels</span><span class="p">,</span> <span class="s">"input_ids"</span><span class="p">:</span> <span class="n">input_ids</span><span class="p">}</span>

<span class="n">train_dl</span> <span class="o">=</span> <span class="n">test_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">ds</span><span class="p">[</span><span class="s">'train'</span><span class="p">],</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">collate_fn</span><span class="o">=</span><span class="n">collate_fn</span><span class="p">)</span>
<span class="n">test_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">ds</span><span class="p">[</span><span class="s">'test'</span><span class="p">],</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">collate_fn</span><span class="o">=</span><span class="n">collate_fn</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">model</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">,</span> <span class="n">train_dl</span><span class="p">,</span> <span class="n">val_dl</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">float</span><span class="p">,</span> <span class="nb">float</span><span class="p">]]:</span>
    <span class="n">optim</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-3</span><span class="p">)</span>
    <span class="n">loss_fn</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">CrossEntropyLoss</span><span class="p">()</span>
    <span class="n">losses</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">train_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
        <span class="n">epoch_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
        <span class="n">train_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">train_dl</span><span class="p">:</span>
            <span class="n">optim</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">batch</span><span class="p">)</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">'labels'</span><span class="p">])</span>
            <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
            <span class="n">optim</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
            <span class="n">train_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="n">batch</span><span class="p">[</span><span class="s">'labels'</span><span class="p">].</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

        <span class="n">train_loss</span> <span class="o">/=</span> <span class="nb">len</span><span class="p">(</span><span class="n">train_dl</span><span class="p">.</span><span class="n">dataset</span><span class="p">)</span>

        <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
        <span class="n">val_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="n">val_accuracy</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
            <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">val_dl</span><span class="p">:</span>
                <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">batch</span><span class="p">)</span>
                <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">'labels'</span><span class="p">])</span>
                <span class="n">val_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="n">batch</span><span class="p">[</span><span class="s">'labels'</span><span class="p">].</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
                <span class="n">val_accuracy</span> <span class="o">+=</span> <span class="p">(</span><span class="n">logits</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="n">batch</span><span class="p">[</span><span class="s">'labels'</span><span class="p">]).</span><span class="nb">sum</span><span class="p">()</span>

        <span class="n">val_loss</span> <span class="o">/=</span> <span class="nb">len</span><span class="p">(</span><span class="n">val_dl</span><span class="p">.</span><span class="n">dataset</span><span class="p">)</span>
        <span class="n">val_accuracy</span> <span class="o">/=</span> <span class="nb">len</span><span class="p">(</span><span class="n">val_dl</span><span class="p">.</span><span class="n">dataset</span><span class="p">)</span>
        <span class="n">log_steps</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="mf">0.2</span> <span class="o">*</span> <span class="n">epochs</span><span class="p">))</span>

        <span class="n">losses</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">train_loss</span><span class="p">,</span> <span class="n">val_loss</span><span class="p">))</span>
        <span class="k">if</span> <span class="n">epoch</span> <span class="o">%</span> <span class="n">log_steps</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">or</span> <span class="n">epoch</span> <span class="o">==</span> <span class="n">epochs</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">epoch_duartion</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">epoch_start</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Epoch </span><span class="si">{</span><span class="n">epoch</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="n">epochs</span><span class="si">}</span><span class="s">, Training Loss: </span><span class="si">{</span><span class="n">train_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, Validation Loss: </span><span class="si">{</span><span class="n">val_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, Validation Accuracy: </span><span class="si">{</span><span class="n">val_accuracy</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">. Epoch Duration: </span><span class="si">{</span><span class="n">epoch_duartion</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s"> seconds'</span><span class="p">)</span>

    <span class="n">train_duration</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">train_start</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Training finished. Took </span><span class="si">{</span><span class="n">train_duration</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s"> seconds"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">losses</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>

</details>

<p>Let’s also quickly check the number of parameters of our models.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">get_model_param_count</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"My classifier params: </span><span class="si">{</span><span class="n">get_model_param_count</span><span class="p">(</span><span class="n">my_classifier</span><span class="p">)</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Torch classifier params: </span><span class="si">{</span><span class="n">get_model_param_count</span><span class="p">(</span><span class="n">torch_classifier</span><span class="p">)</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># My classifier params: 89,605
# Torch classifier params: 89,605
</span></pre></td></tr></tbody></table></code></pre></div></div>

<p>Now we are ready to train!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">torch_losses</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">torch_classifier</span><span class="p">,</span> <span class="n">train_dl</span><span class="p">,</span> <span class="n">test_dl</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">my_losses</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">my_classifier</span><span class="p">,</span> <span class="n">train_dl</span><span class="p">,</span> <span class="n">test_dl</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>From the logs, <code class="language-plaintext highlighter-rouge">torch_classifier</code> took 155 seconds to train, with each epoch taking about 15 seconds. <code class="language-plaintext highlighter-rouge">my_classifier</code>, however, took 218 seconds, with each epoch taking about 21 seconds. Clearly our implementation of MHA is not as fast as Pytorch’s.</p>

<p>The accuracy on the test set is also very similar: at the last epoch, <code class="language-plaintext highlighter-rouge">torch_classifier</code> had 0.87 accuracy and <code class="language-plaintext highlighter-rouge">my_classifier</code> had 0.876. So even though our implementation is slower, it is doing its job. Here is the full output of <code class="language-plaintext highlighter-rouge">classification_report</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="rouge-code"><pre>My Classifier
              precision    recall  f1-score   support

           0       0.90      0.81      0.85        53
           1       0.85      0.90      0.87        69
           2       0.92      0.96      0.94        74
           3       0.88      0.79      0.83        57
           4       0.83      0.89      0.86        54

    accuracy                           0.88       307
   macro avg       0.88      0.87      0.87       307
weighted avg       0.88      0.88      0.88       307

Torch Classifier
              precision    recall  f1-score   support

           0       0.92      0.64      0.76        53
           1       0.86      0.87      0.86        69
           2       0.96      0.96      0.96        74
           3       0.82      0.93      0.87        57
           4       0.82      0.93      0.87        54

    accuracy                           0.87       307
   macro avg       0.87      0.87      0.86       307
weighted avg       0.88      0.87      0.87       307
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Code to produce this report is below if you want to take a look.</p>
<details>
    <summary>Click to expand</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">toolz</span>

<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="n">texts</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">bs</span><span class="o">=</span><span class="mi">32</span><span class="p">):</span>
    <span class="n">output_dfs</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">toolz</span><span class="p">.</span><span class="n">partition_all</span><span class="p">(</span><span class="n">bs</span><span class="p">,</span> <span class="n">texts</span><span class="p">):</span>
        <span class="n">inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s">"pt"</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
            <span class="n">class_probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">inputs</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">numpy</span><span class="p">()</span>
            <span class="n">pred_classes</span> <span class="o">=</span> <span class="n">class_probs</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">col_names</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">"class_</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">_prob"</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">class_probs</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])]</span>
            <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">class_probs</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">col_names</span><span class="p">)</span>
            <span class="n">df</span><span class="p">[</span><span class="s">'pred_class'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pred_classes</span>
            <span class="n">df</span><span class="p">[</span><span class="s">'pred_class_name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'pred_class'</span><span class="p">].</span><span class="nb">map</span><span class="p">(</span><span class="n">class_id_to_class</span><span class="p">)</span>
            <span class="n">output_dfs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">output_dfs</span><span class="p">)</span>

<span class="n">my_preds_df</span> <span class="o">=</span> <span class="n">predict</span><span class="p">(</span><span class="n">ds</span><span class="p">[</span><span class="s">'test'</span><span class="p">][</span><span class="s">'text'</span><span class="p">],</span> <span class="n">my_classifier</span><span class="p">)</span>
<span class="n">my_preds_df</span><span class="p">[</span><span class="s">'model'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'My Model'</span>
<span class="n">my_preds_df</span><span class="p">[</span><span class="s">'actual_class'</span><span class="p">]</span> <span class="o">=</span> <span class="n">ds</span><span class="p">[</span><span class="s">'test'</span><span class="p">][</span><span class="s">'label'</span><span class="p">]</span>
<span class="n">torch_preds_df</span> <span class="o">=</span> <span class="n">predict</span><span class="p">(</span><span class="n">ds</span><span class="p">[</span><span class="s">'test'</span><span class="p">][</span><span class="s">'text'</span><span class="p">],</span> <span class="n">torch_classifier</span><span class="p">)</span>
<span class="n">torch_preds_df</span><span class="p">[</span><span class="s">'model'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'Torch Model'</span>
<span class="n">torch_preds_df</span><span class="p">[</span><span class="s">'actual_class'</span><span class="p">]</span> <span class="o">=</span> <span class="n">ds</span><span class="p">[</span><span class="s">'test'</span><span class="p">][</span><span class="s">'label'</span><span class="p">]</span>

<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span>

<span class="k">print</span><span class="p">(</span><span class="s">"My Classifier"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">my_preds_df</span><span class="p">[</span><span class="s">'actual_class'</span><span class="p">],</span> <span class="n">my_preds_df</span><span class="p">[</span><span class="s">'pred_class'</span><span class="p">]))</span>

<span class="k">print</span><span class="p">(</span><span class="s">"Torch Classifier"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">torch_preds_df</span><span class="p">[</span><span class="s">'actual_class'</span><span class="p">],</span> <span class="n">torch_preds_df</span><span class="p">[</span><span class="s">'pred_class'</span><span class="p">]))</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<p>Let’s plot the loss for both of these models. We see a similar pattern for both models.</p>

<p><img src="/assets/images/deep-learning/mha-scratch/my_vs_torch_loss.png" alt="Train Test Loss" /></p>

<details>
    <summary>Click to expand</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">get_losses_as_df</span><span class="p">(</span><span class="n">losses_name_pairs</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">[</span><span class="nb">float</span><span class="p">,</span> <span class="nb">float</span><span class="p">]]]):</span>
    <span class="n">dfs</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">model_name</span><span class="p">,</span> <span class="n">losses</span> <span class="ow">in</span> <span class="n">losses_name_pairs</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">losses</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'train_loss'</span><span class="p">,</span> <span class="s">'test_loss'</span><span class="p">]).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"index"</span><span class="p">:</span> <span class="s">"epoch"</span><span class="p">})</span>
        <span class="n">df</span><span class="p">[</span><span class="s">'model'</span><span class="p">]</span> <span class="o">=</span> <span class="n">model_name</span>
        <span class="n">dfs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">dfs</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">plot_losses</span><span class="p">(</span><span class="n">loss_df</span><span class="p">):</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">loss_df</span><span class="p">.</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s">'model'</span><span class="p">,</span> <span class="s">'epoch'</span><span class="p">],</span> <span class="n">var_name</span><span class="o">=</span><span class="s">'metric'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="s">'epoch'</span><span class="p">,</span> <span class="s">'value'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'metric'</span><span class="p">))</span> <span class="o">+</span> <span class="n">geom_line</span><span class="p">()</span> <span class="o">+</span> <span class="n">geom_point</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mf">1.5</span><span class="p">)</span> <span class="o">+</span> <span class="n">facet_grid</span><span class="p">(</span><span class="s">'model'</span><span class="p">)</span> <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">"Train and Validation loss"</span><span class="p">)</span>


<span class="n">plot_losses</span><span class="p">(</span><span class="n">get_losses_as_df</span><span class="p">([(</span><span class="s">"My"</span><span class="p">,</span> <span class="n">my_losses</span><span class="p">),</span> <span class="p">(</span><span class="s">"Torch"</span><span class="p">,</span> <span class="n">torch_losses</span><span class="p">)]))</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<h1 id="efficient-mha-implementation">Efficient MHA Implementation</h1>
<p>In our first implementation, we looped through each head, and each head independently projected the query, key and value. There were two heads, so the query was projected twice, once by each head, and the same goes for the key and value. In total there were 6 “projections” happening. However, we can reduce this to just 3 projection operations.</p>

<p>Let’s focus on the projection of the Query by the two heads. Let’s say we have the following as our original input.</p>

\[input = Q = \begin{bmatrix}
how &amp; 1 &amp; 10 &amp; 100 &amp; 1000\\
are &amp; 2 &amp; 20 &amp; 200 &amp; 2000\\
you &amp; 3 &amp; 30 &amp; 300 &amp; 3000
\end{bmatrix}_{3 \times 4}\]

<p>Since there are two heads, each head’s projection weight will be of shape \(2 \times 4\), i.e. <code class="language-plaintext highlighter-rouge">&lt;out_features, in_features&gt;</code>. Let’s assume we have the following weights.</p>

\[W_1 = \begin{bmatrix}
10 &amp; 0 &amp; 0 &amp; 0\\
0 &amp; 10 &amp; 0 &amp; 0
\end{bmatrix}_{2 \times 4}

\\

W_2 = \begin{bmatrix}
20 &amp; 0 &amp; 0 &amp; 0\\
0 &amp; 20 &amp; 0 &amp; 0
\end{bmatrix}_{2 \times 4}\]

<p>Now each head will project the Query</p>

\[QW_1^T = 
\begin{bmatrix}
how &amp; 1 &amp; 10 &amp; 100 &amp; 1000\\
are &amp; 2 &amp; 20 &amp; 200 &amp; 2000\\
you &amp; 3 &amp; 30 &amp; 300 &amp; 3000
\end{bmatrix}_{3 \times 4}

\begin{bmatrix}
10 &amp; 0 \\
0 &amp; 10 \\
0 &amp; 0 \\
0 &amp; 0 \\
\end{bmatrix}_{4 \times 2}

\\

= \begin{bmatrix}
10 &amp; 100\\
20 &amp; 200\\
30 &amp; 300
\end{bmatrix}_{3 \times 2}

\\

QW_2^T =
\begin{bmatrix}
how &amp; 1 &amp; 10 &amp; 100 &amp; 1000\\
are &amp; 2 &amp; 20 &amp; 200 &amp; 2000\\
you &amp; 3 &amp; 30 &amp; 300 &amp; 3000
\end{bmatrix}_{3 \times 4}

\begin{bmatrix}
20 &amp; 0 \\
0 &amp; 20 \\
0 &amp; 0 \\
0 &amp; 0 \\
\end{bmatrix}_{4 \times 2}

\\

= \begin{bmatrix}
20 &amp; 200\\
40 &amp; 400\\
60 &amp; 600
\end{bmatrix}_{3 \times 2}\]

<p>So we’ve obtained the projections from both heads for the query by performing two individual projections. However, the projection weights can be stacked together so that we can obtain both projections in a single matrix multiplication.</p>

\[Q = \begin{bmatrix}
how &amp; 1 &amp; 10 &amp; 100 &amp; 1000\\
are &amp; 2 &amp; 20 &amp; 200 &amp; 2000\\
you &amp; 3 &amp; 30 &amp; 300 &amp; 3000
\end{bmatrix}_{3 \times 4}

W = \begin{bmatrix}
10 &amp; 0 &amp; 0 &amp; 0\\
0 &amp; 10 &amp; 0 &amp; 0\\
20 &amp; 0 &amp; 0 &amp; 0\\
0 &amp; 20 &amp; 0 &amp; 0
\end{bmatrix}_{4 \times 4}

\\

QW^T =

\begin{bmatrix}
how &amp; 1 &amp; 10 &amp; 100 &amp; 1000\\
are &amp; 2 &amp; 20 &amp; 200 &amp; 2000\\
you &amp; 3 &amp; 30 &amp; 300 &amp; 3000
\end{bmatrix}_{3 \times 4}

\begin{bmatrix}
10 &amp; 0 &amp; 20 &amp; 0\\
0 &amp; 10 &amp; 0 &amp; 20\\
0 &amp; 0 &amp; 0 &amp; 0\\
0 &amp; 0 &amp; 0 &amp; 0\\
\end{bmatrix}_{4 \times 4}

\\

= \begin{bmatrix}
10 &amp; 100 &amp; 20 &amp; 200\\
20 &amp; 200 &amp; 40 &amp; 400\\
30 &amp; 300 &amp; 60 &amp; 600
\end{bmatrix}_{3 \times 4}\]

<p>As you can see, we obtained the same values with just one matrix multiplication instead of two individual ones. The first two columns have the same values as the projection from the first head, and the last two columns have the same values as the projection from the second head.</p>
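
<p>If you want to sanity-check this claim numerically, here is a minimal sketch using the toy numbers from above (the word labels are dropped since matmul needs numeric entries; this is just illustrative, not the implementation we’ll use later):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

# toy Q and per-head weights from the worked example above
Q  = torch.tensor([[1., 10., 100., 1000.],
                   [2., 20., 200., 2000.],
                   [3., 30., 300., 3000.]])
W1 = torch.tensor([[10., 0., 0., 0.],
                   [0., 10., 0., 0.]])
W2 = torch.tensor([[20., 0., 0., 0.],
                   [0., 20., 0., 0.]])

# two individual projections, one per head, concatenated column-wise
per_head = torch.cat([Q @ W1.T, Q @ W2.T], dim=-1)

# a single projection with the stacked weight
W = torch.cat([W1, W2], dim=0)         # shape (4, 4)
stacked = Q @ W.T                      # shape (3, 4)

print(torch.equal(per_head, stacked))  # True
</code></pre></div></div>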

<p>So now we know that we can do the projection just once, thereby eliminating the for loop. However, we cannot pass this projection directly to the attention function. Remember that this is multi-head attention, so attention will be calculated using a different portion of the data for each head. So we need to reshape this output a bit.</p>

<p>We have <code class="language-plaintext highlighter-rouge">batch_size = 1, n_heads = 2</code> and projection shape = <code class="language-plaintext highlighter-rouge">&lt;3, 4&gt;</code>. We know that for head 1 and head 2 the projection should be of shape <code class="language-plaintext highlighter-rouge">&lt;3, 2&gt;</code>. So we reshape the data as follows.</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">projection.view(batch_size, seq_len, n_heads, head_embed_dim)</code>.</li>
</ol>

\[\begin{bmatrix}
Batch &amp; Token &amp; Head &amp; Vector\\
1     &amp; how    &amp; 1    &amp; [10, 100] \\
1     &amp; how    &amp; 2    &amp; [20, 200] \\
1     &amp; are    &amp; 1    &amp; [20, 200] \\
1     &amp; are    &amp; 2    &amp; [40, 400] \\
1     &amp; you    &amp; 1    &amp; [30, 300] \\
1     &amp; you    &amp; 2    &amp; [60, 600]
\end{bmatrix}\]

<p>Since we want to calculate attention per head, we swap the token and head dimensions using <code class="language-plaintext highlighter-rouge">reshaped.transpose(1, 2)</code>, resulting in</p>

\[\begin{bmatrix}
Batch &amp; Head &amp; Token &amp; Vector\\
1     &amp; 1    &amp; how   &amp;  [10, 100]\\
1     &amp; 1    &amp; are   &amp;  [20, 200]\\
1     &amp; 1    &amp; you   &amp;  [30, 300]\\
1     &amp; 2    &amp; how   &amp;  [20, 200]\\
1     &amp; 2    &amp; are   &amp;  [40, 400]\\
1     &amp; 2    &amp; you   &amp;  [60, 600]\\

\end{bmatrix}\]

<p>Now that we have the data laid out in the proper format, we can pass it to the scaled dot product attention function.</p>

<p>The attention output will be of shape <code class="language-plaintext highlighter-rouge">&lt;batch_size, n_heads, seq_len, head_embed_dim&gt;</code>. But we need the data to have the shape <code class="language-plaintext highlighter-rouge">&lt;batch_size, seq_len, embed_dim&gt;</code> before applying the final projection of the MHA layer.</p>

<p>To do this we swap n_heads and seq_len using <code class="language-plaintext highlighter-rouge">attn.transpose(1, 2).contiguous()</code> so that we have the shape <code class="language-plaintext highlighter-rouge">&lt;batch_size, seq_len, n_heads, head_embed_dim&gt;</code>. Then we “flatten” the n_heads dimension so that we end up with <code class="language-plaintext highlighter-rouge">&lt;batch_size, seq_len, embed_dim&gt;</code> using <code class="language-plaintext highlighter-rouge">attn_transposed.view(batch_size, seq_len, embed_dim)</code>. We are basically reversing what we did earlier. Check the full implementation further below.</p>
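
<p>First, here is a minimal sketch of this shape round-trip using the toy numbers from above (illustrative values; only the shapes matter):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

batch_size, seq_len, n_heads, head_embed_dim = 1, 3, 2, 2
embed_dim = n_heads * head_embed_dim

# the stacked projection from the example: shape (1, 3, 4)
projection = torch.tensor([[[10., 100., 20., 200.],
                            [20., 200., 40., 400.],
                            [30., 300., 60., 600.]]])

# split: (bs, seq_len, embed_dim) -&gt; (bs, n_heads, seq_len, head_embed_dim)
split = projection.view(batch_size, seq_len, n_heads, head_embed_dim).transpose(1, 2)
print(split.shape)   # torch.Size([1, 2, 3, 2])
print(split[0, 0])   # head 1: [[10, 100], [20, 200], [30, 300]]

# merge (the reverse): back to (bs, seq_len, embed_dim)
merged = split.transpose(1, 2).contiguous().view(batch_size, seq_len, embed_dim)
print(torch.equal(merged, projection))  # True
</code></pre></div></div>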

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">MyEfficientMultiHeadAttention</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">n_heads</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">projection_bias</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="k">assert</span> <span class="n">embed_dim</span> <span class="o">%</span> <span class="n">n_heads</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"embed_dim must be divisible by n_heads"</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">embed_dim</span> <span class="o">=</span> <span class="n">embed_dim</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">n_heads</span> <span class="o">=</span> <span class="n">n_heads</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">head_embed_dim</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">embed_dim</span> <span class="o">//</span> <span class="n">n_heads</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_q</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">projection_bias</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_k</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">projection_bias</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_v</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">projection_bias</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">projection</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">projection_bias</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span>
        <span class="c1"># shape of query = (bs, seq_len, embed_dim)
</span>        <span class="n">batch_size</span> <span class="o">=</span> <span class="n">query</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

        <span class="c1"># linear projection of query, key and value
</span>        <span class="n">q</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_q</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
        <span class="n">k</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_k</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
        <span class="n">v</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_v</span><span class="p">(</span><span class="n">value</span><span class="p">)</span>

        <span class="c1"># reshape the projected query, key, value
</span>        <span class="c1"># to (bs, n_heads, seq_len, head_embed_dim)
</span>        <span class="n">q</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">split_heads</span><span class="p">(</span><span class="n">q</span><span class="p">)</span>
        <span class="n">k</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">split_heads</span><span class="p">(</span><span class="n">k</span><span class="p">)</span>
        <span class="n">v</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">split_heads</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>

        <span class="c1"># do scaled dot product attention
</span>        <span class="c1"># attn.shape = (bs, n_heads, seq_len, head_embed_dim)
</span>        <span class="c1"># attn_weights.shape (bs, n_heads, seq_len, seq_len)
</span>        <span class="n">attn</span><span class="p">,</span> <span class="n">attn_weights</span> <span class="o">=</span> <span class="n">my_scaled_dot_product_attention</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span>
        <span class="c1"># swap the n_heads and seq_len so that we have
</span>        <span class="c1"># (bs, seq_len, n_heads, head_embed_dim)
</span>        <span class="c1"># call .contiguous() so that view function will work later
</span>        <span class="n">attn</span> <span class="o">=</span> <span class="n">attn</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">).</span><span class="n">contiguous</span><span class="p">()</span>
        <span class="c1"># "combine" (n_heads, head_embed_dim) matrix as a single "embed_dim" vector
</span>        <span class="n">attn</span> <span class="o">=</span> <span class="n">attn</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">embed_dim</span><span class="p">)</span>

        <span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">projection</span><span class="p">(</span><span class="n">attn</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">output</span><span class="p">,</span> <span class="n">attn_weights</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">split_heads</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="c1"># x.shape = (bs, seq_len, embed_dim)
</span>        <span class="n">batch_size</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        <span class="c1"># first split the embed_dim into (n_heads, head_embed_dim)
</span>        <span class="n">temp</span> <span class="o">=</span>  <span class="n">x</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">n_heads</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">head_embed_dim</span><span class="p">)</span>
        <span class="c1"># now we swap seq_len and n_heads dimension
</span>         <span class="c1"># output shape = (bs, n_heads, seq_len, head_embed_dim)
</span>        <span class="k">return</span> <span class="n">temp</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>

</pre></td></tr></tbody></table></code></pre></div></div>

<p>Now let’s use this implementation and see if the training speed improves.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="n">my_efficient_mha</span> <span class="o">=</span> <span class="n">MyEfficientMultiHeadAttention</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">n_heads</span><span class="o">=</span><span class="n">n_heads</span><span class="p">,</span> <span class="n">projection_bias</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">my_efficient_classifier</span> <span class="o">=</span> <span class="n">TextClassifier</span><span class="p">(</span><span class="n">vocab_size</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_classes</span><span class="o">=</span><span class="n">num_classes</span><span class="p">,</span> <span class="n">mha</span><span class="o">=</span><span class="n">my_efficient_mha</span><span class="p">)</span>
<span class="n">my_efficient_losses</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">my_efficient_classifier</span><span class="p">,</span> <span class="n">train_dl</span><span class="p">,</span> <span class="n">test_dl</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This implementation took 186 seconds to train, with about 18.5 seconds per epoch. It is still slower than Pytorch’s implementation (155 seconds) but much quicker than our naive implementation (218 seconds). The accuracy on the test set is 0.85, which is also very close to the previous two (~0.87).</p>

<h1 id="positional-embeddings">Positional Embeddings</h1>
<p>One thing you might have noticed is that there is no notion of the order of tokens when we compute attention. Every token attends to every other token. This means that ‘how are you’ is exactly the same as ‘you how are’ or ‘are you how’: we’ll get the same representation no matter how we order the tokens. This is very similar to a Bag-of-Words model like TF-IDF.</p>

<p>For example, if we ask the model to predict for the following sentences, we get the same output probabilities, since the representation of each of those tokens is exactly the same, which in turn means that the representations of the two sentences are also exactly the same!</p>

<p><code class="language-plaintext highlighter-rouge">predict(["how are you", "you how are"], torch_classifier)</code></p>

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>class_0_prob</th>
      <th>class_1_prob</th>
      <th>class_2_prob</th>
      <th>class_3_prob</th>
      <th>class_4_prob</th>
      <th>pred_class</th>
      <th>pred_class_name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.018061</td>
      <td>0.564215</td>
      <td>0.01021</td>
      <td>0.379704</td>
      <td>0.027809</td>
      <td>1</td>
      <td>business</td>
    </tr>
    <tr>
      <th>1</th>
      <td>0.018061</td>
      <td>0.564215</td>
      <td>0.01021</td>
      <td>0.379704</td>
      <td>0.027809</td>
      <td>1</td>
      <td>business</td>
    </tr>
  </tbody>
</table>
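
<p>You can verify this order-blindness directly with Pytorch’s own <code class="language-plaintext highlighter-rouge">torch.nn.MultiheadAttention</code> layer. Below is a minimal sketch using a randomly initialized layer (no trained weights involved): permuting the input tokens just permutes the output rows, so any order-insensitive pooling, such as the mean, gives an identical sentence representation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
mha.eval()

x = torch.randn(1, 3, 8)        # a 3-token "sentence"
perm = torch.tensor([2, 0, 1])  # reorder the tokens

with torch.no_grad():
    out, _ = mha(x, x, x)
    out_perm, _ = mha(x[:, perm], x[:, perm], x[:, perm])

# same rows, just reordered ...
print(torch.allclose(out[:, perm], out_perm, atol=1e-6))                 # True
# ... so a mean-pooled sentence representation is identical
print(torch.allclose(out.mean(dim=1), out_perm.mean(dim=1), atol=1e-6))  # True
</code></pre></div></div>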

<p>To give the model some information about the order of tokens, we use something called Positional Embeddings (or Encodings). Basically, after we get the embeddings from the <code class="language-plaintext highlighter-rouge">Embedding</code> layer, we add positional embeddings to them (element-wise). This ensures that even for the same token, the “position-embedded” embedding will have different values depending on its position in the sequence.</p>

<p>This post has gotten too long already, so for an explanation of positional embeddings I’ll refer you to this <a href="https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html#Positional-encoding">Notebook</a>. I’ve copied the code from there.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">PositionalEncoding</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="c1"># source: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html#Positional-encoding
</span>    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="mi">256</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="c1"># create a matrix of [seq_len, hidden_dim] representing positional encoding for each token in sequence
</span>        <span class="n">pe</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">)</span>
        <span class="n">position</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">max_len</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">float</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># (max_len, 1)
</span>        <span class="n">div_term</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="mi">2</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span> <span class="o">*</span> <span class="p">(</span><span class="o">-</span><span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mf">10000.0</span><span class="p">)</span> <span class="o">/</span> <span class="n">embed_dim</span><span class="p">))</span>
        <span class="n">pe</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">position</span> <span class="o">*</span> <span class="n">div_term</span><span class="p">)</span>
        <span class="n">pe</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cos</span><span class="p">(</span><span class="n">position</span> <span class="o">*</span> <span class="n">div_term</span><span class="p">)</span>
        <span class="n">pe</span> <span class="o">=</span> <span class="n">pe</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">register_buffer</span><span class="p">(</span><span class="s">'pe'</span><span class="p">,</span> <span class="n">pe</span><span class="p">,</span> <span class="n">persistent</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">pe</span><span class="p">[:,</span> <span class="p">:</span><span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)]</span>
        <span class="k">return</span> <span class="n">x</span>
    
<span class="k">class</span> <span class="nc">TextClassifierWithPositionalEncoding</span><span class="p">(</span><span class="n">TextClassifier</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">mha</span><span class="p">:</span> <span class="n">Module</span><span class="p">,</span> <span class="n">max_len</span><span class="p">:</span> <span class="nb">int</span><span class="o">=</span><span class="mi">256</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">,</span> <span class="n">mha</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">positional_encoding</span> <span class="o">=</span> <span class="n">PositionalEncoding</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="n">max_len</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">get_embeddings</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">):</span>
        <span class="n">embeddings</span> <span class="o">=</span> <span class="nb">super</span><span class="p">().</span><span class="n">get_embeddings</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">positional_encoding</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Here I’ve subclassed <code class="language-plaintext highlighter-rouge">TextClassifier</code> to create a new class, <code class="language-plaintext highlighter-rouge">TextClassifierWithPositionalEncoding</code>, which overrides the <code class="language-plaintext highlighter-rouge">get_embeddings</code> method. First we get the token embeddings, then we add the positional embeddings to them. The result is what the MHA layer will now operate on.</p>
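
<p>A quick way to see what this layer adds is to feed it all-zero “token embeddings”, so the output is just the positional pattern itself. This is a small sketch assuming the <code class="language-plaintext highlighter-rouge">PositionalEncoding</code> class defined above (and the <code class="language-plaintext highlighter-rouge">math</code> import it relies on):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
import torch

pe_layer = PositionalEncoding(embed_dim=4, max_len=8)

# all-zero "token embeddings" for a 3-token sequence
tokens = torch.zeros(1, 3, 4)
print(pe_layer(tokens).squeeze(0))
# each of the 3 rows is different, so the same token at different
# positions now ends up with a different vector
</code></pre></div></div>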

<p>Let’s train the model with Positional Embedding and see what we get.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="n">my_efficient_mha2</span> <span class="o">=</span> <span class="n">MyEfficientMultiHeadAttention</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">n_heads</span><span class="o">=</span><span class="n">n_heads</span><span class="p">,</span> <span class="n">projection_bias</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">my_efficient_classifier_with_pe</span> <span class="o">=</span> <span class="n">TextClassifierWithPositionalEncoding</span><span class="p">(</span><span class="n">vocab_size</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_classes</span><span class="o">=</span><span class="n">num_classes</span><span class="p">,</span> <span class="n">mha</span><span class="o">=</span><span class="n">my_efficient_mha2</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">model_max_length</span><span class="p">)</span>
<span class="n">my_efficient_losses_with_pe</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">my_efficient_classifier_with_pe</span><span class="p">,</span> <span class="n">train_dl</span><span class="p">,</span> <span class="n">test_dl</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>It took 195 seconds, with each epoch taking about 19 seconds. The accuracy on the validation set is 0.81, which is quite low compared to the others (0.85 and 0.87).</p>

<p>Once again, here is the loss over epochs for all the different implementations.
<img src="/assets/images/deep-learning/mha-scratch/all_losses.png" alt="All losses" /></p>

<p>Let’s see if the Positional Embedding actually changed something. If we now ask the classifier to predict using <code class="language-plaintext highlighter-rouge">predict(["how are you", "you how are"], my_efficient_classifier_with_pe)</code>, we get:</p>

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>class_0_prob</th>
      <th>class_1_prob</th>
      <th>class_2_prob</th>
      <th>class_3_prob</th>
      <th>class_4_prob</th>
      <th>pred_class</th>
      <th>pred_class_name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.037759</td>
      <td>0.028393</td>
      <td>0.066061</td>
      <td>0.488044</td>
      <td>0.379744</td>
      <td>3</td>
      <td>entertainment</td>
    </tr>
    <tr>
      <th>1</th>
      <td>0.031583</td>
      <td>0.022696</td>
      <td>0.055122</td>
      <td>0.538267</td>
      <td>0.352332</td>
      <td>3</td>
      <td>entertainment</td>
    </tr>
  </tbody>
</table>

<p>We see that the probabilities are different, since the two sentences now have different representations. The model without Positional Embedding predicted exactly the same probabilities for both sentences.</p>

<p>Before we conclude, let’s visualize the attention weights as well. Below I’ve used this sentence as input: <code class="language-plaintext highlighter-rouge">can you can that</code>. The first “can” is an auxiliary verb asking if someone can do something, e.g. “can you do that?”, and the second “can” is a verb meaning to preserve something in a can or a jar. This is a short and confusing sentence, so let’s see what the attention weights look like.</p>

<p><img src="/assets/images/deep-learning/mha-scratch/attention_weights.png" alt="Attention Weights" /></p>

<p>For the first two models, which do not use Positional Embedding, take a look at the rows for the word ‘can’. Both occurrences of this word have exactly the same attention weights with the other tokens.
But when we introduce Positional Embedding, the first ‘can’ has the highest attention weight with the word ‘that’, while the second occurrence of ‘can’ attends almost equally to the first ‘can’, itself and ‘that’.</p>

<p>Since the model was trained to classify with only 80K parameters on a very small news dataset, the attention weights might not make much sense, so I suggest not reading too much into the numbers. My intention here was just to show that using Positional Embeddings impacts the outputs.</p>

<p>You can use the code below to generate the plot shown above.</p>

<details>
<summary>Click to expand the code</summary>
<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="k">def</span> <span class="nf">visualize_attention_weights</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">ax</span><span class="p">):</span>
    <span class="n">inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s">"pt"</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="n">embeddings</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">get_embeddings</span><span class="p">(</span><span class="n">inputs</span><span class="p">[</span><span class="s">'input_ids'</span><span class="p">])</span>
        <span class="n">attn_weights</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">get_attention</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">)[</span><span class="mi">1</span><span class="p">].</span><span class="n">squeeze</span><span class="p">()</span>

    <span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">inputs</span><span class="p">[</span><span class="s">'input_ids'</span><span class="p">].</span><span class="n">squeeze</span><span class="p">())</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">tokens</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">tokens</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
    
<span class="c1"># "Can you can that?" -&gt; First can is a verb, second can is a verb: to preserve something in a Can
</span><span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">model_name</span><span class="p">,</span> <span class="n">model</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">([(</span><span class="s">"without PE: torch MHA"</span><span class="p">,</span> <span class="n">torch_classifier</span><span class="p">),</span> <span class="p">(</span><span class="s">"without PE: My MHA"</span><span class="p">,</span> <span class="n">my_classifier</span><span class="p">),</span> <span class="p">(</span><span class="s">"with PE: My MHA"</span><span class="p">,</span> <span class="n">my_efficient_classifier_with_pe</span><span class="p">)]):</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">visualize_attention_weights</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="s">"can you can that"</span><span class="p">,</span> <span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="n">model_name</span><span class="p">)</span>
    <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">tick_params</span><span class="p">(</span><span class="n">labeltop</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">bottom</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">left</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<h1 id="conclusion">Conclusion</h1>
<p>In this post, we implemented everything we need to use Multi-Head Attention (except masking). To summarize:</p>

<ul>
  <li>The dot product of the Query and Key determines the weight assigned to the corresponding Value vector.</li>
  <li>Multi-Head Attention uses multiple heads, each focusing on different parts of the projected Query, Key, and Value vectors, allowing the model to capture various patterns in the data.</li>
  <li>We efficiently implemented Multi-Head Attention to optimize performance.</li>
  <li>We saw why Positional Embeddings are needed and how they impact the learned behaviour.</li>
</ul>

<p>I hope this post was useful. Please let me know if there are any mistakes in this post.</p>]]></content><author><name>Sanjaya Subedi</name></author><category term="DeepLearning" /><summary type="html"><![CDATA[Let's implement Multi-Head Attention from scratch with visual examples]]></summary></entry><entry><title type="html">Why do we need non-linear activation function in Neural Networks?</title><link href="https://sanjayasubedi.com.np/deeplearning/why-non-linear-in-neural-networks/" rel="alternate" type="text/html" title="Why do we need non-linear activation function in Neural Networks?" /><published>2024-09-04T14:22:00+00:00</published><updated>2024-09-04T14:22:00+00:00</updated><id>https://sanjayasubedi.com.np/deeplearning/why-non-linear-in-neural-networks</id><content type="html" xml:base="https://sanjayasubedi.com.np/deeplearning/why-non-linear-in-neural-networks/"><![CDATA[<h1 id="introduction">Introduction</h1>
<p>In Neural Networks, we use a non-linear activation function, e.g. Sigmoid, Tanh, ReLU, after layers like <code class="language-plaintext highlighter-rouge">Linear</code>/<code class="language-plaintext highlighter-rouge">Dense</code> or <code class="language-plaintext highlighter-rouge">Conv2D</code>. Consider a neural network with two hidden layers, as shown below. The input is first passed through a <em>Linear</em> layer, then we apply an activation function <em>ReLU</em>, whose output is passed to the second hidden layer <em>Linear2</em>.</p>

<div class="mermaid">
graph LR;
    Input --&gt; Linear
    Linear --&gt; ReLU
    ReLU --&gt; Linear2
    Linear2 --&gt; Logits
</div>
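
<p>In Pytorch, this stack might look like the following minimal sketch (the layer sizes here are made up, purely for illustration):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

model = torch.nn.Sequential(
    torch.nn.Linear(2, 10),  # Linear
    torch.nn.ReLU(),         # ReLU
    torch.nn.Linear(10, 2),  # Linear2 -&gt; Logits
)
logits = model(torch.randn(4, 2))  # a batch of 4 inputs with 2 features
print(logits.shape)                # torch.Size([4, 2])
</code></pre></div></div>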

<p>But why do we need to do so?</p>

<p>Neural Networks are used to learn from data where the relationship between the inputs and outputs is non-linear. I’ll make this a bit more concrete in the sections below. We’ll train a couple of neural networks in Pytorch, with and without a non-linear activation function, and visualize the differences. Hopefully that will give you some idea about the need for non-linearity in neural networks.</p>

<h1 id="data-setup">Data Setup</h1>
<p>To be a bit more concrete, let’s consider the problem of classifying data points into one of two classes. We’ll use scikit-learn to generate a toy dataset. Before we dive into the process, let’s import a few libraries.</p>

<details>
    <summary>Click to expand code</summary>

<div>
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">lets_plot</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">LetsPlot</span><span class="p">.</span><span class="n">setup_html</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<p>Now, let’s generate a toy data set using <code class="language-plaintext highlighter-rouge">make_moons</code> function in scikit-learn and plot it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">make_moons</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">make_moons</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">noise</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'feature1'</span><span class="p">,</span> <span class="s">'feature2'</span><span class="p">])</span>
<span class="n">df</span><span class="p">[</span><span class="s">'y'</span><span class="p">]</span> <span class="o">=</span> <span class="n">y</span>
<span class="n">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="s">'feature1'</span><span class="p">,</span> <span class="s">'feature2'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'y'</span><span class="p">))</span> <span class="o">+</span> <span class="n">geom_point</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span> <span class="o">+</span> <span class="n">scale_color_discrete</span><span class="p">()</span> <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">"Toy dataset"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p><img src="/assets/images/deep-learning/non-linearity/toy_data.png" alt="Toy Dataset" /></p>

<p>The dataset we’ve generated has 2 input features, and each data point belongs to one of two classes. I’ve chosen this dataset to highlight the importance of non-linearity.</p>

<p>If you were to take a single straight line and classify anything to the left of the line as ‘red’ and anything to the right as ‘blue’, then no matter how you place the line, there will always be points that are mis-classified.</p>

<p>For example, if we were to draw a vertical line at 0.5 on the x-axis, we’d classify a lot of blue points as red since there are many blue points to the left of this line, and likewise many red points as blue.</p>

<p>If we were to draw a horizontal line at -0.5 on the y-axis, we’d classify all the red ones correctly, but also mis-classify a lot of blue ones as red.</p>

<p>There is simply no way a straight line can act as a proper decision boundary. In other words, there is a non-linear relationship between the inputs and outputs, and hence we need to introduce non-linearity into our models.</p>

<h1 id="model">Model</h1>
<p>Now let’s create a couple of neural networks, with and without non-linearity, and see the differences between them. I’ll use PyTorch to create a simple neural network with two fully-connected layers.</p>

<p>Since our input has 2 features, the first layer takes a <code class="language-plaintext highlighter-rouge">(batch_size, 2)</code> tensor as input and produces a <code class="language-plaintext highlighter-rouge">(batch_size, 10)</code> tensor as output. We’ll use ReLU as the non-linear layer, if enabled: the output from the <code class="language-plaintext highlighter-rouge">fc1</code> layer will be passed through ReLU. The output shape from ReLU is exactly the same as its input, i.e. <code class="language-plaintext highlighter-rouge">(batch_size, 10)</code>.</p>

<p>Next, the <code class="language-plaintext highlighter-rouge">fc2</code> layer produces an output of shape <code class="language-plaintext highlighter-rouge">(batch_size, 2)</code>, where the first column indicates the logits (unnormalized scores) for class 0, and the second column indicates the logits for class 1. These logits can then be passed through a softmax function (during evaluation) to obtain class probabilities, or through <code class="language-plaintext highlighter-rouge">argmax</code> to determine the predicted class.</p>
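<p>To make this concrete, here is a minimal sketch (with made-up logit values, not outputs from the model in this post) of how raw logits can be turned into probabilities and predicted classes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

# a hypothetical batch of 3 samples, each with logits for class 0 and class 1
logits = torch.tensor([[2.0, -1.0], [0.5, 0.7], [-3.0, 1.0]])

probs = torch.softmax(logits, dim=1)  # each row now sums to 1
preds = logits.argmax(dim=1)          # predicted class per sample
print(probs)
print(preds)  # tensor([0, 1, 1])
</code></pre></div></div>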

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">DemoModel</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">use_relu</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">use_relu</span> <span class="o">=</span> <span class="n">use_relu</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">use_relu</span><span class="p">:</span>
            <span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    
<span class="n">linear_model</span> <span class="o">=</span> <span class="n">DemoModel</span><span class="p">(</span><span class="n">use_relu</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">non_linear_model</span> <span class="o">=</span> <span class="n">DemoModel</span><span class="p">(</span><span class="n">use_relu</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
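<p>As a quick sanity check (this snippet is illustrative and was not part of the original walkthrough), we can pass a random batch through the model defined above and confirm the shapes described earlier:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># a random batch of 4 samples with 2 features each: shape (batch_size, 2)
batch = torch.randn(4, 2)
out = non_linear_model(batch)
print(out.shape)  # torch.Size([4, 2]) - one logit per class
</code></pre></div></div>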

<h1 id="training">Training</h1>
<p>Below, I’ve defined a function called <code class="language-plaintext highlighter-rouge">train</code>, which is a simple training loop. I’ve used the <code class="language-plaintext highlighter-rouge">torch.optim.AdamW</code> optimizer and <code class="language-plaintext highlighter-rouge">torch.nn.CrossEntropyLoss</code> as the loss function.</p>

<p>I’ve also created train and test datasets and the corresponding data loaders there.</p>

<details>
    <summary>Click to expand code</summary>

<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">TensorDataset</span><span class="p">,</span> <span class="n">DataLoader</span>

<span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">model</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">,</span> <span class="n">train_dl</span><span class="p">,</span> <span class="n">val_dl</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="p">):</span>
    <span class="n">optim</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-3</span><span class="p">)</span>
    <span class="n">loss_fn</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">CrossEntropyLoss</span><span class="p">()</span>
    <span class="n">losses</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
        <span class="n">train_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">batch_X</span><span class="p">,</span> <span class="n">batch_y</span> <span class="ow">in</span> <span class="n">train_dl</span><span class="p">:</span>
            <span class="n">optim</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">batch_X</span><span class="p">)</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">batch_y</span><span class="p">)</span>
            <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
            <span class="n">optim</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
            <span class="n">train_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="n">batch_X</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

        <span class="n">train_loss</span> <span class="o">/=</span> <span class="nb">len</span><span class="p">(</span><span class="n">train_dl</span><span class="p">.</span><span class="n">dataset</span><span class="p">)</span>

        <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
        <span class="n">val_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
            <span class="k">for</span> <span class="n">batch_X</span><span class="p">,</span> <span class="n">batch_y</span> <span class="ow">in</span> <span class="n">val_dl</span><span class="p">:</span>
                <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">batch_X</span><span class="p">)</span>
                <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">batch_y</span><span class="p">)</span>
                <span class="n">val_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="n">batch_X</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

        <span class="n">val_loss</span> <span class="o">/=</span> <span class="nb">len</span><span class="p">(</span><span class="n">val_dl</span><span class="p">.</span><span class="n">dataset</span><span class="p">)</span>
        <span class="n">log_steps</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mf">0.2</span> <span class="o">*</span> <span class="n">epochs</span><span class="p">)</span>

        <span class="n">losses</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">train_loss</span><span class="p">,</span> <span class="n">val_loss</span><span class="p">))</span>
        <span class="k">if</span> <span class="n">epoch</span> <span class="o">%</span> <span class="n">log_steps</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">or</span> <span class="n">epoch</span> <span class="o">==</span> <span class="n">epochs</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Epoch </span><span class="si">{</span><span class="n">epoch</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="n">epochs</span><span class="si">}</span><span class="s">, Training Loss: </span><span class="si">{</span><span class="n">train_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">, Validation Loss: </span><span class="si">{</span><span class="n">val_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">losses</span>


<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">(</span><span class="n">X</span><span class="p">),</span> <span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">y</span><span class="p">))</span>
<span class="n">train_ds</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">test_ds</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
<span class="n">train_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_ds</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
<span class="n">test_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">test_ds</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<p>Let’s train the two models for 50 epochs and see the train and validation loss curve.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">linear_losses</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">linear_model</span><span class="p">,</span> <span class="n">train_dl</span><span class="p">,</span> <span class="n">test_dl</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
<span class="n">non_linear_losses</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">non_linear_model</span><span class="p">,</span> <span class="n">train_dl</span><span class="p">,</span> <span class="n">test_dl</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p><img src="/assets/images/deep-learning/non-linearity/train_loss.png" alt="Loss Curve" /></p>

<p>The plot on the left is for the model which does not use the ReLU activation and the one on the right is for the model which does. We can see a huge difference between the train/validation losses. The <strong>linear</strong> model’s loss does not decrease and stalls at around 0.29, whereas the <strong>non_linear</strong> one sees its loss decrease throughout the epochs.</p>

<h1 id="evaluation">Evaluation</h1>
<p>Let’s check the predictions of those two models. On the test set, the linear model has an accuracy of 0.85 and the non-linear model has an accuracy of 0.96 - a huge difference.</p>

<p><img src="/assets/images/deep-learning/non-linearity/predictions.png" alt="Predictions" /></p>

<p>In the plot above, circles belong to class 0 and triangles belong to class 1.</p>

<p>We see the linear model has a sharp linear boundary: the points above this boundary are classified as 0 and those below are classified as 1. Due to this, many points are mis-classified. In the left plot, ideally all circles should be blue and all triangles should be red, but this is not the case.</p>

<p>In the right plot, we see much better results. The model was able to learn a non-linear boundary that classifies 96% of the data points correctly.</p>

<details>
<summary>Click to expand the code to generate above plot</summary>

<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span>
<span class="kn">from</span> <span class="nn">lets_plot.mapping</span> <span class="kn">import</span> <span class="n">as_discrete</span>

<span class="k">def</span> <span class="nf">plot_classification</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">model_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="n">preds</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">X_test</span><span class="p">).</span><span class="n">argmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">numpy</span><span class="p">()</span>
    <span class="n">report_dict</span> <span class="o">=</span> <span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">preds</span><span class="p">,</span> <span class="n">output_dict</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
    <span class="n">plot_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"feature1"</span><span class="p">:</span> <span class="n">X_test</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">].</span><span class="n">numpy</span><span class="p">(),</span> <span class="s">"feature2"</span><span class="p">:</span> <span class="n">X_test</span><span class="p">[:</span> <span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">numpy</span><span class="p">(),</span> <span class="s">"y"</span><span class="p">:</span> <span class="n">y_test</span><span class="p">,</span> <span class="s">"pred"</span><span class="p">:</span> <span class="n">preds</span><span class="p">})</span>
    <span class="n">title</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">model_name</span><span class="si">}</span><span class="s">"</span>
    <span class="n">subtitle</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"Accuracy: </span><span class="si">{</span><span class="n">report_dict</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="si">}</span><span class="s">, F1-Score </span><span class="si">{</span><span class="n">report_dict</span><span class="p">[</span><span class="s">'weighted avg'</span><span class="p">][</span><span class="s">'f1-score'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="si">}</span><span class="s">"</span>
    <span class="k">return</span> <span class="n">ggplot</span><span class="p">(</span><span class="n">plot_df</span><span class="p">)</span> <span class="o">+</span> <span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="s">'feature1'</span><span class="p">,</span> <span class="s">'feature2'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">as_discrete</span><span class="p">(</span><span class="s">'pred'</span><span class="p">),</span> <span class="n">shape</span><span class="o">=</span><span class="n">as_discrete</span><span class="p">(</span><span class="s">'y'</span><span class="p">)),</span> <span class="n">size</span><span class="o">=</span><span class="mf">2.5</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span> <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="n">title</span><span class="p">,</span> <span class="n">subtitle</span><span class="o">=</span><span class="n">subtitle</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"Predicted Class"</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="s">"Actual Class"</span><span class="p">)</span>

<span class="n">fig_linear</span> <span class="o">=</span> <span class="n">plot_classification</span><span class="p">(</span><span class="n">linear_model</span><span class="p">,</span> <span class="n">model_name</span><span class="o">=</span><span class="s">"Linear"</span><span class="p">)</span>
<span class="n">fig_non_linear</span> <span class="o">=</span> <span class="n">plot_classification</span><span class="p">(</span><span class="n">non_linear_model</span><span class="p">,</span> <span class="n">model_name</span><span class="o">=</span><span class="s">"Non Linear"</span><span class="p">)</span>
<span class="n">bunch</span> <span class="o">=</span> <span class="n">GGBunch</span><span class="p">()</span>
<span class="n">bunch</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span><span class="n">fig_linear</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">bunch</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span><span class="n">fig_non_linear</span><span class="p">,</span> <span class="mi">600</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">bunch</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<h2 id="activation-visualization">Activation visualization</h2>
<p>Now let’s look at the outputs from the individual layers in the model. I’ve taken the first 10 rows from the test set and plotted the outputs from each layer below. I also show the true label of each data point in the last column for reference.</p>

<p>In the plots below, I’ve used both the linear and the non-linear model. Since the first layer produces a vector of size 10, we have 10 different values, one from each neuron, as well as a label column at the end for each sample.</p>

<p>The output of <code class="language-plaintext highlighter-rouge">fc1</code> in both models contains a range of positive and negative values. This is obtained with a linear operation (a matrix multiplication between the input and the layer’s weights, plus a bias term).</p>

<p>However, when we use ReLU, we see that a non-linearity is introduced: negative values are set to 0 and positive values are left unchanged. This means that only the positive values will contribute to the output of the next layer.
<img src="/assets/images/deep-learning/triplet-mining/../non-linearity/non_linear_activations.png" alt="Non Linear Activations" /></p>

<p><img src="/assets/images/deep-learning/non-linearity/linear_activations.png" alt="Linear Activations" /></p>
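<p>To see the ReLU operation in isolation, here is a tiny illustrative snippet (not part of the original analysis) showing negative values being zeroed out while positive values pass through unchanged:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(torch.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000])
</code></pre></div></div>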

<details>
<summary>Click to expand the code that generates the plots above</summary>

<div>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">plot_activations</span><span class="p">(</span><span class="n">activations</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">title</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="n">df_logits</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
        <span class="n">activations</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="sa">f</span><span class="s">"Neuron_</span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">activations</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])]</span>
    <span class="p">)</span>
    <span class="n">df_logits</span><span class="p">[</span><span class="s">'Label'</span><span class="p">]</span> <span class="o">=</span> <span class="n">labels</span>
    <span class="n">df_logits</span><span class="p">[</span><span class="s">"Sample"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_logits</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>


    <span class="n">df_logits</span> <span class="o">=</span> <span class="n">df_logits</span><span class="p">.</span><span class="n">melt</span><span class="p">(</span>
        <span class="n">id_vars</span><span class="o">=</span><span class="s">"Sample"</span><span class="p">,</span> <span class="n">var_name</span><span class="o">=</span><span class="s">"Neuron"</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="p">(</span>
        <span class="n">ggplot</span><span class="p">(</span><span class="n">df_logits</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="s">"Neuron"</span><span class="p">,</span> <span class="n">as_discrete</span><span class="p">(</span><span class="s">"Sample"</span><span class="p">)))</span>
        <span class="o">+</span> <span class="n">geom_tile</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="s">"value"</span><span class="p">))</span>
        <span class="o">+</span> <span class="n">geom_text</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s">"value"</span><span class="p">),</span> <span class="n">label_format</span><span class="o">=</span><span class="s">".1"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">)</span>
        <span class="o">+</span> <span class="n">scale_fill_brewer</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="s">'seq'</span><span class="p">,</span> <span class="n">palette</span><span class="o">=</span><span class="mi">9</span><span class="p">)</span>
        <span class="o">+</span> <span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="n">title</span><span class="p">)</span>
        
    <span class="p">)</span>

<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
    <span class="n">logits_fc1</span> <span class="o">=</span> <span class="n">non_linear_model</span><span class="p">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">X_test</span><span class="p">[:</span><span class="mi">10</span><span class="p">])</span>
    <span class="n">logits_fc1_relu</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">logits_fc1</span><span class="p">)</span>
    <span class="n">logits_fc2</span> <span class="o">=</span> <span class="n">non_linear_model</span><span class="p">.</span><span class="n">fc2</span><span class="p">(</span><span class="n">logits_fc1_relu</span><span class="p">)</span>

<span class="n">bunch</span> <span class="o">=</span> <span class="n">GGBunch</span><span class="p">()</span>
<span class="n">bunch</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span>
    <span class="n">plot_activations</span><span class="p">(</span><span class="n">logits_fc1</span><span class="p">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">y_test</span><span class="p">[:</span><span class="mi">10</span><span class="p">],</span> <span class="n">title</span><span class="o">=</span><span class="s">"Output of fc1 of Non-Linear Model"</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">500</span><span class="p">,</span> <span class="mi">500</span>
<span class="p">)</span>
<span class="n">bunch</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span>
    <span class="n">plot_activations</span><span class="p">(</span><span class="n">logits_fc1_relu</span><span class="p">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">y_test</span><span class="p">[:</span><span class="mi">10</span><span class="p">],</span> <span class="n">title</span><span class="o">=</span><span class="s">"Output of fc1 of Non-Linear Model after RELU"</span><span class="p">),</span> <span class="mi">502</span><span class="p">,</span> <span class="o">-</span><span class="mi">7</span><span class="p">,</span> <span class="mi">500</span><span class="p">,</span> <span class="mi">510</span>
<span class="p">)</span>
<span class="n">bunch</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span>
    <span class="n">plot_activations</span><span class="p">(</span><span class="n">logits_fc2</span><span class="p">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">y_test</span><span class="p">[:</span><span class="mi">10</span><span class="p">],</span> <span class="n">title</span><span class="o">=</span><span class="s">"Output of fc2"</span><span class="p">),</span> <span class="mi">872</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">500</span><span class="p">,</span> <span class="mi">478</span>
<span class="p">)</span>

<span class="n">display</span><span class="p">(</span><span class="n">bunch</span><span class="p">)</span>

<span class="n">bunch</span> <span class="o">=</span> <span class="n">GGBunch</span><span class="p">()</span>
<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
    <span class="n">linear_logits_fc1</span> <span class="o">=</span> <span class="n">linear_model</span><span class="p">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">X_test</span><span class="p">[:</span><span class="mi">10</span><span class="p">])</span>
    <span class="n">linear_logits_fc2</span> <span class="o">=</span> <span class="n">linear_model</span><span class="p">.</span><span class="n">fc2</span><span class="p">(</span><span class="n">logits_fc1_relu</span><span class="p">)</span>

<span class="n">bunch</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span>
    <span class="n">plot_activations</span><span class="p">(</span><span class="n">linear_logits_fc1</span><span class="p">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">y_test</span><span class="p">[:</span><span class="mi">10</span><span class="p">],</span> <span class="n">title</span><span class="o">=</span><span class="s">"Output of fc1 of Linear Model"</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">500</span><span class="p">,</span> <span class="mi">500</span>
<span class="p">)</span>
<span class="n">bunch</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span>
    <span class="n">plot_activations</span><span class="p">(</span><span class="n">linear_logits_fc2</span><span class="p">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">y_test</span><span class="p">[:</span><span class="mi">10</span><span class="p">],</span> <span class="n">title</span><span class="o">=</span><span class="s">"Output of fc2"</span><span class="p">),</span> <span class="mi">375</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">500</span><span class="p">,</span> <span class="mi">470</span>
<span class="p">)</span>
<span class="n">bunch</span>
</pre></td></tr></tbody></table></code></pre></div>    </div>
  </div>
</details>

<h1 id="conclusion">Conclusion</h1>
<p>Although it is very well known that non-linear activation functions are needed in neural networks, I hope I was able to give you a concrete visualization of why. Besides ReLU, there are many activation functions to choose from, e.g. SELU, GELU, Sigmoid, Tanh, etc., but ReLU seems to perform quite well despite its very simple logic. Feel free to try out other activation functions and see what you get.</p>

<p>If you find any errors in this post, please let me know.</p>]]></content><author><name>Sanjaya Subedi</name></author><category term="DeepLearning" /><summary type="html"><![CDATA[Let's explore why we need non-linear activation functions in Neural Networks]]></summary></entry><entry><title type="html">Gradient Descent Algorithm From Scratch</title><link href="https://sanjayasubedi.com.np/deeplearning/stochastic-gradient-descent-from-scratch/" rel="alternate" type="text/html" title="Gradient Descent Algorithm From Scratch" /><published>2024-03-26T18:22:00+00:00</published><updated>2024-03-26T18:22:00+00:00</updated><id>https://sanjayasubedi.com.np/deeplearning/stochastic-gradient-descent-from-scratch</id><content type="html" xml:base="https://sanjayasubedi.com.np/deeplearning/stochastic-gradient-descent-from-scratch/"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>In this post, we’ll explore the Gradient Descent algorithm and how it is used to train machine learning models. We will implement this algorithm from scratch for a simple linear regression model and compare our implementation against the scikit-learn and PyTorch implementations. Even though I’ve chosen a linear regression model, the concept of gradient descent applies to any kind of model; another reason for this choice is that it makes it easy to visualize gradient descent in action! Check the video below.</p>

<p>Gradient Descent is an optimization algorithm: it tries to find the best values for the parameters of a machine learning model given the learning objective. In this post, we’ll focus on its application in machine learning.</p>

<div>
<video src="/assets/images/deep-learning/sgd-from-scratch/sgd_lr_train.mp4" width="100%" height="500px" controls="true" autoplay="" muted="" loop="" />
</div>

<h1 id="definitions">Definitions</h1>
<p>Let’s first define a few things in simple terms before we proceed. In the later sections, we’ll make these concepts concrete.</p>

<h2 id="model">Model</h2>
<p>First, we will need a model to make predictions. Machine learning models such as linear regression or deep neural networks are some common examples.</p>

<h2 id="model-parameters">Model parameters</h2>
<p>Model parameters are values that a model uses to make predictions. These parameters are initialized randomly before training, and their final values are learned during training.
For example, in a linear regression model with a single input variable and a single output variable, the parameters are \(m\), which indicates the slope, and \(c\), which indicates the intercept. These parameters are used together to produce the output.</p>

<p>For neural networks, these parameters are also called weights and biases.</p>

<h2 id="loss-function">Loss function</h2>
<p>We need some way to tell whether a model is performing well. A loss function, also called a cost function or objective function, gives lower values when a model performs better.</p>

<p>Typical examples of loss functions used in practice are listed below (a short illustrative sketch follows the list):</p>

<ul>
  <li><strong>Mean Squared Error</strong>: When the output is a numerical value, this loss function is a common choice</li>
  <li><strong>Binary Cross Entropy</strong>: It is used when we want to do binary classification, e.g. will it rain or not, is it a cat or a dog, has disease or no disease</li>
  <li><strong>Categorical Cross Entropy</strong>: It is used when we want to do multi-class classification, e.g. predict the category of news articles from 10 possible categories.</li>
</ul>
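<p>Each of these takes predictions and true targets and returns a single number. As a rough sketch, using PyTorch’s built-in implementations purely for illustration (this post implements MSE from scratch later):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

# Mean Squared Error: numerical targets
print(torch.nn.MSELoss()(torch.tensor([2.5, 0.0]), torch.tensor([3.0, -0.5])))  # 0.25

# Binary Cross Entropy (here on raw scores/logits): 0/1 targets
print(torch.nn.BCEWithLogitsLoss()(torch.tensor([1.2, -0.7]), torch.tensor([1.0, 0.0])))

# Categorical Cross Entropy: one score per class, integer class targets
print(torch.nn.CrossEntropyLoss()(torch.tensor([[2.0, 0.1, -1.0]]), torch.tensor([0])))
</code></pre></div></div>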

<h1 id="deep-dive">Deep dive</h1>
<p>Let’s explore this algorithm with a concrete example to make it clear.
First we’ll need a model to begin with. We’ll consider a simple linear regression model with one input and one output.
In this case our model has two parameters: \(m\), a slope, and \(c\), an intercept. These two parameters are used in the model in the following way: \(y = mx + c\).</p>

<h2 id="model-1">Model</h2>
<p>Linear regression is a method used to find a linear equation that best predicts the output using the input variables.</p>

<p>The equation for the case where there is only a single input variable is as follows:</p>

\[y = mx + c\]

<p>Where,</p>
<ul>
  <li>\(y\) is the output</li>
  <li>\(x\) is the input variable</li>
  <li>\(m\) is the slope of the line</li>
  <li>\(c\) is the intercept</li>
</ul>

<p>\(m\) and \(c\) are parameters of the model. In this case, we can say that this model has 2 parameters. Compare this with ChatGPT, which is rumored to have around 175 billion parameters.</p>

<p>The implementation is quite simple for a linear regression model since, in our case, we only have one input variable.
This function accepts the slope, the intercept, and the input data X as parameters, and computes the prediction based on the formula \(y = mx + c\).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">linear_regression</span><span class="p">(</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">slope</span> <span class="o">*</span> <span class="n">X</span> <span class="o">+</span> <span class="n">intercept</span>
</pre></td></tr></tbody></table></code></pre></div></div>
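<p>For example, with a made-up slope of 2 and intercept of 1, each input is simply mapped to \(2x + 1\):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

print(linear_regression(slope=2.0, intercept=1.0, X=np.array([0.0, 1.0, 2.0])))
# [1. 3. 5.]
</code></pre></div></div>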

<h2 id="loss-function-1">Loss function</h2>
<p>For our linear regression model, which values should we use for the slope and intercept parameters? Is using 0.5 as the slope better than 3.7? This is where loss functions come in.</p>

<p>A loss function gives a lower value when the predicted values closely match the true outputs.</p>

<p>Therefore, when the loss function’s value is minimized, it means that the model parameters, in this case the slope and intercept, are tuned such that the predictions from the model closely match the true outputs.</p>

<p>Since our output is a numerical value, we will use Mean Squared Error as our loss function.</p>

<p>The implementation is fairly straightforward: we first compute the difference between the true and predicted outputs, square it, and then compute the mean of these squared differences. Note that <code class="language-plaintext highlighter-rouge">y_true</code> and <code class="language-plaintext highlighter-rouge">y_pred</code> are numpy arrays rather than single numbers.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">mse</span><span class="p">(</span><span class="n">y_true</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">y_true</span> <span class="o">-</span> <span class="n">y_pred</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>
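<p>As a quick worked example with made-up numbers: if the true outputs are [1, 2, 3] and the predictions are [1, 2, 4], only the last prediction is off (by 1), so the loss is \((0^2 + 0^2 + 1^2) / 3 \approx 0.33\):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])
print(mse(y_true, y_pred))  # 0.3333...
</code></pre></div></div>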

<p>The animation below shows MSE in action. The table on the right shows the ground truth and the predicted values. Notice that as the predicted values get closer to the ground truth, the loss value decreases.</p>
<div>
<video src="/assets/images/deep-learning/sgd-from-scratch/MSEAnimation.mp4" width="100%" height="500px" controls="true" autoplay="" muted="" loop="" />
</div>

<h2 id="data-exploration">Data exploration</h2>
<p>To train this model, let’s create a toy dataset.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">make_regression</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>

<span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">true_slope</span> <span class="o">=</span> <span class="n">make_regression</span><span class="p">(</span>
    <span class="n">n_samples</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
    <span class="n">n_features</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">n_informative</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">n_targets</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">noise</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">coef</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">random_state</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>When we plot the input on the x-axis and the output on the y-axis, we see that as the input value increases, the output also increases.
If we look at the “true slope” value used by scikit-learn to generate this data, we see the value is 60.84. This means that if the input is increased by 1, then the output will increase by 60.84.</p>

<p><img src="/assets/images/deep-learning/sgd-from-scratch/linear_reg_toy_data.png" alt="Toy Dataset for linear regression" /></p>

<h2 id="gradient-descent">Gradient Descent</h2>
<p>Ok, so far we have the model and the loss function defined. But how do we find the best values for the model parameters so that the predictions are close to the true outputs?</p>

<p>We know that we need to minimize the loss. How do we do this?</p>

<p>First, we need to compute the gradient of the loss function with respect to each parameter in our model.</p>

<p>The gradient is basically a list of partial derivatives of the loss function with respect to each parameter.
It can be thought of as a vector indicating the direction of steepest ascent in the loss surface.</p>

<p>Since we want to move towards where the loss is minimized, we will go in the direction opposite to the one indicated by the gradient.</p>

<p>Here we have two parameters, so we need to find two partial derivatives. If your calculus skills are rusty, you can use the <code class="language-plaintext highlighter-rouge">sympy</code> library to calculate the derivatives for you. Note that when using libraries like PyTorch or TensorFlow, we do not need to calculate the derivatives ourselves. It is done by the library automatically!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">sympy</span>
<span class="n">x</span><span class="p">,</span> <span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span> <span class="o">=</span> <span class="n">sympy</span><span class="p">.</span><span class="n">symbols</span><span class="p">(</span><span class="s">"x, slope, intercept"</span><span class="p">)</span>
<span class="n">y_true</span> <span class="o">=</span> <span class="n">sympy</span><span class="p">.</span><span class="n">symbols</span><span class="p">(</span><span class="s">"y_true"</span><span class="p">,</span> <span class="n">constant</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">slope</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">intercept</span>
<span class="n">loss</span> <span class="o">=</span> <span class="p">(</span><span class="n">y_true</span> <span class="o">-</span> <span class="n">y_pred</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span>
<span class="n">display</span><span class="p">(</span><span class="n">loss</span><span class="p">.</span><span class="n">diff</span><span class="p">(</span><span class="n">slope</span><span class="p">).</span><span class="n">simplify</span><span class="p">())</span>
<span class="n">display</span><span class="p">(</span><span class="n">loss</span><span class="p">.</span><span class="n">diff</span><span class="p">(</span><span class="n">intercept</span><span class="p">).</span><span class="n">simplify</span><span class="p">())</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The partial derivative of the loss function with respect to the slope is \(-2x(y_{true} - y_{pred})\).</p>

<p>Similarly, the partial derivative of the loss function with respect to the intercept is \(-2(y_{true} - y_{pred})\).</p>
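<p>If you want to convince yourself that these derivatives are correct, one option (a verification sketch, not part of the original post) is to compare them against a numerical finite-difference approximation at an arbitrary point:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x_val, y_val = 2.0, 5.0  # an arbitrary data point
m, c = 1.5, 0.3          # arbitrary parameter values
eps = 1e-6

def single_loss(m, c):
    # squared error for a single data point
    return (y_val - (m * x_val + c)) ** 2

# analytic partial derivatives from above
y_p = m * x_val + c
analytic_dm = -2 * x_val * (y_val - y_p)
analytic_dc = -2 * (y_val - y_p)

# central finite-difference approximations
numeric_dm = (single_loss(m + eps, c) - single_loss(m - eps, c)) / (2 * eps)
numeric_dc = (single_loss(m, c + eps) - single_loss(m, c - eps)) / (2 * eps)

print(analytic_dm, numeric_dm)  # the two values should be nearly identical
print(analytic_dc, numeric_dc)
</code></pre></div></div>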

<p>Now that we know the gradient, we use the following rule to update the values of the parameters so that we move in the direction where the loss is minimized.</p>

\[slope = slope - (lr * \frac{\partial L}{\partial slope})\]

\[intercept = intercept - (lr * \frac{\partial L}{\partial intercept})\]

<p>Here the learning rate (lr) is a hyper-parameter that we have to choose, and it is usually set between 0 and 1. The learning rate basically scales down the amount we move on the loss surface. Typical values are 0.001 and 0.0001.</p>

<p>We have all the basics needed for implementing the gradient descent algorithm. Now let’s look at the code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">sgd_step</span><span class="p">(</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>
    <span class="n">y_pred</span> <span class="o">=</span> <span class="n">linear_regression</span><span class="p">(</span><span class="n">slope</span><span class="o">=</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="o">=</span><span class="n">intercept</span><span class="p">,</span> <span class="n">X</span><span class="o">=</span><span class="n">X</span><span class="p">)</span>
    <span class="c1"># compute the derivative of loss function wrt. slope
</span>    <span class="n">dl_dslope</span> <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">y_pred</span><span class="p">)</span> <span class="o">*</span> <span class="n">X</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="c1"># compute the derivative of loss function wrt. intercept
</span>    <span class="n">dl_dintercept</span> <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">y_pred</span><span class="p">)).</span><span class="n">mean</span><span class="p">()</span>

    <span class="c1"># update the parameters
</span>    <span class="n">slope</span> <span class="o">=</span> <span class="n">slope</span> <span class="o">-</span> <span class="p">(</span><span class="n">lr</span> <span class="o">*</span> <span class="n">dl_dslope</span><span class="p">)</span>
    <span class="n">intercept</span> <span class="o">=</span> <span class="n">intercept</span> <span class="o">-</span> <span class="p">(</span><span class="n">lr</span> <span class="o">*</span> <span class="n">dl_dintercept</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="p">,</span> <span class="n">dl_dslope</span><span class="p">,</span> <span class="n">dl_dintercept</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Here, I’ve defined a function called <code class="language-plaintext highlighter-rouge">sgd_step</code>, which accepts the slope and intercept (the model parameters) along with the input X and output y. It also accepts the learning rate as lr.</p>

<p>First we compute the predictions using the current values of the model parameters.
Then we compute the partial derivative of the loss with respect to the slope parameter. Since we are doing this for a batch of data, we take the mean of all the derivatives.</p>

<p>Similarly, we compute the partial derivative of the loss with respect to the intercept parameter.</p>

<p>Next, we update the model parameters using the update rule of Gradient Descent algorithm.</p>

<p>And finally we return the updated slope and intercept values. I’ve returned the partial derivatives for visualization purposes, but it is not necessary.</p>
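<p>Here is an illustrative single step on a tiny made-up batch (generated from \(y = 2x + 1\)), showing that the loss decreases after one parameter update:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>X_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([3.0, 5.0, 7.0])  # y = 2x + 1

m, c = 0.0, 0.0
print(mse(y_demo, linear_regression(slope=m, intercept=c, X=X_demo)))  # ~27.67
m, c, _, _ = sgd_step(slope=m, intercept=c, X=X_demo, y=y_demo, lr=0.1)
print(mse(y_demo, linear_regression(slope=m, intercept=c, X=X_demo)))  # ~0.33
</code></pre></div></div>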

<p>The <code class="language-plaintext highlighter-rouge">sgd_step</code> function only updates the model parameters once, but we need to do this many times.</p>

<p>So, here is a complete training procedure.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">toolz</span>

<span class="c1"># initialize the parameters to some value
</span><span class="n">slope</span> <span class="o">=</span> <span class="o">-</span><span class="mi">10</span>
<span class="n">intercept</span> <span class="o">=</span> <span class="mi">9</span>

<span class="n">epochs</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">bs</span> <span class="o">=</span> <span class="mi">32</span>

<span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">bs</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
        <span class="c1"># split the data into batches
</span>        <span class="k">for</span> <span class="n">batch_ids</span> <span class="ow">in</span> <span class="n">toolz</span><span class="p">.</span><span class="n">partition_all</span><span class="p">(</span>
            <span class="n">bs</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">permutation</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">)))</span>
        <span class="p">):</span>
            <span class="n">batch_ids</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">batch_ids</span><span class="p">)</span>
            <span class="n">batch_x</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">batch_ids</span><span class="p">]</span>
            <span class="n">batch_y</span> <span class="o">=</span> <span class="n">y</span><span class="p">[</span><span class="n">batch_ids</span><span class="p">]</span>
            <span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="p">,</span> <span class="n">dl_slope</span><span class="p">,</span> <span class="n">dl_intercept</span> <span class="o">=</span> <span class="n">sgd_step</span><span class="p">(</span>
                <span class="n">slope</span><span class="o">=</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="o">=</span><span class="n">intercept</span><span class="p">,</span> <span class="n">X</span><span class="o">=</span><span class="n">batch_x</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">batch_y</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="n">lr</span>
            <span class="p">)</span>

        <span class="c1"># calculate the loss for this epoch
</span>        <span class="c1"># Note: typically losses are collected for each batch in the epoch and then average is taken as loss for the epoch
</span>        <span class="n">loss</span> <span class="o">=</span> <span class="n">mse</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">linear_regression</span><span class="p">(</span><span class="n">slope</span><span class="o">=</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="o">=</span><span class="n">intercept</span><span class="p">,</span> <span class="n">X</span><span class="o">=</span><span class="n">X</span><span class="p">))</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loss at epoch </span><span class="si">{</span><span class="n">epoch</span><span class="si">}</span><span class="s"> = </span><span class="si">{</span><span class="n">loss</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span>

<span class="c1"># since X_train is a matrix with 1 column, we take all rows and first column as input vector X
</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span>
    <span class="n">slope</span><span class="o">=</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="o">=</span><span class="n">intercept</span><span class="p">,</span> <span class="n">X</span><span class="o">=</span><span class="n">X_train</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">y</span><span class="o">=</span><span class="n">y_train</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.1</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="p">)</span>
<span class="c1"># (60.134316467596285, -0.00922299723642274)
</span></pre></td></tr></tbody></table></code></pre></div></div>

<p>First, we randomly initialize our model parameters.
Second, we define the number of epochs to run. In each epoch, the model sees the complete dataset. We need to run many epochs, so we set it to 10 here.</p>

<p>Third, we set the batch size: each update of the model parameters uses that many samples from our dataset. For example, with a batch size of 32, each parameter update is computed from 32 samples.
This is the most common approach for training deep neural networks, since not every dataset fits in memory.
This version of the Gradient Descent algorithm is also called mini-batch Gradient Descent.</p>

<p>Then comes the actual training loop.
For each epoch, we partition our data into batches (as shown in the sketch below) and call the sgd_step function with each batch.
We replace the values of slope and intercept with the values returned by the function, so that the next call uses the updated parameters.</p>
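<p>To make the batching step concrete, here is a small standalone sketch of how <code class="language-plaintext highlighter-rouge">toolz.partition_all</code> splits a shuffled list of indices into batches (the printed values are just one possible random outcome):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import toolz

ids = np.random.permutation(np.arange(7))
for batch in toolz.partition_all(3, ids):
    print(np.array(batch))
# e.g. [5 1 6]
#      [0 4 2]
#      [3]  (the last batch can be smaller than the batch size)
</code></pre></div></div>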

<p>If we visualize the loss and parameter values over each epoch, we can see them converging as the epochs progress.</p>

<p><img src="/assets/images/deep-learning/sgd-from-scratch/linear_reg_train_loss_hist.png" alt="Loss Value over time" /></p>

<h1 id="evaluation-and-comparison-with-sklearn-and-pytorch">Evaluation and comparison with sklearn and pytorch</h1>
<p>Now let’s see how this model performs on our test set.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="n">y_pred</span> <span class="o">=</span> <span class="n">linear_regression</span><span class="p">(</span><span class="n">slope</span><span class="o">=</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="o">=</span><span class="n">intercept</span><span class="p">,</span> <span class="n">X</span><span class="o">=</span><span class="n">X_test</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MSE = "</span><span class="p">,</span> <span class="n">mse</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">))</span>
<span class="c1"># MSE =  21.184015118341343
</span></pre></td></tr></tbody></table></code></pre></div></div>

<p>Let’s compare this with the sklearn implementation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">SGDRegressor</span>
<span class="n">lr</span> <span class="o">=</span> <span class="n">SGDRegressor</span><span class="p">().</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Slope = </span><span class="si">{</span><span class="n">lr</span><span class="p">.</span><span class="n">coef_</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s">, Intercept = </span><span class="si">{</span><span class="n">lr</span><span class="p">.</span><span class="n">intercept_</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MSE = "</span><span class="p">,</span> <span class="n">mse</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">lr</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)))</span>

<span class="c1"># Slope = 60.044860375107355, Intercept = -0.0518031540413587
# MSE =  21.320319924030493
</span></pre></td></tr></tbody></table></code></pre></div></div>
<p>Seems pretty close! The parameters found by sklearn and the MSE on the test data are almost the same as the values we found ourselves.</p>

<p>For one more comparison, let’s implement this using Pytorch and compare against it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">torch</span>
<span class="k">class</span> <span class="nc">TorchLR</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">slope</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">),</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">intercept</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">),</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">X</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">slope</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">intercept</span>
    
<span class="n">torch_lr</span> <span class="o">=</span> <span class="n">TorchLR</span><span class="p">()</span>
<span class="n">optim</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">torch_lr</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>
<span class="n">loss_fn</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">MSELoss</span><span class="p">()</span>
<span class="n">epochs</span> <span class="o">=</span> <span class="mi">10</span>
<span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">batch_ids</span> <span class="ow">in</span> <span class="n">toolz</span><span class="p">.</span><span class="n">partition_all</span><span class="p">(</span><span class="n">bs</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">permutation</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">)))):</span>
        <span class="n">batch_ids</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">batch_ids</span><span class="p">)</span>
        <span class="n">batch_x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">batch_ids</span><span class="p">])</span>
        <span class="n">batch_y</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">y</span><span class="p">[</span><span class="n">batch_ids</span><span class="p">])</span>

        <span class="n">optim</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
        
        <span class="n">loss_val</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">batch_y</span><span class="p">,</span> <span class="n">torch_lr</span><span class="p">(</span><span class="n">batch_x</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]))</span>

        <span class="c1"># these two steps automatically compute the gradients and perform the parameter updates!!
</span>        <span class="c1"># we do not need to calculate the gradients and do the parameter updates ourselves!!
</span>        <span class="c1"># this logic is exactly the same even if our model had millions of parameters.
</span>        <span class="n">loss_val</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">optim</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>

    <span class="k">if</span> <span class="n">epoch</span> <span class="o">%</span> <span class="mi">5</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">or</span> <span class="n">epoch</span> <span class="o">==</span> <span class="n">epochs</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Epoch </span><span class="si">{</span><span class="n">epoch</span><span class="si">}</span><span class="s">, Loss = </span><span class="si">{</span><span class="n">loss_val</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="k">print</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Slope = </span><span class="si">{</span><span class="n">torch_lr</span><span class="p">.</span><span class="n">slope</span><span class="p">.</span><span class="n">data</span><span class="si">}</span><span class="s">, intercept = </span><span class="si">{</span><span class="n">torch_lr</span><span class="p">.</span><span class="n">intercept</span><span class="p">.</span><span class="n">data</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MSE = "</span><span class="p">,</span> <span class="n">mse</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">torch_lr</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">X_test</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">])).</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">()))</span>

<span class="c1"># Slope = 59.92798614501953, intercept = 0.7032995223999023
# MSE =  20.868084536438573
</span></pre></td></tr></tbody></table></code></pre></div></div>
<p>Once more, the parameters and the loss values are quite close.</p>

<h1 id="gradient-descent-visualization">Gradient Descent Visualization</h1>
<p>Let’s visualize how the gradient descent algorithm works.</p>

<p>On the left-hand side of the plot below, we can see the loss surface, where black indicates higher loss values and white indicates lower loss values. Each combination of intercept and slope gives a different loss value.</p>

<p>On the y-axis, we see the range of values for the intercept parameter, which goes from -10 to 10 in this case.</p>

<p>Similarly, on the x-axis, we see the range of values for the slope parameter; here the values range between -100 and 200.</p>

<p>To create this loss surface plot, for each pair of slope and intercept we calculate the loss value and use a contour plot to visualize it.</p>
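<p>As a sketch, reusing the <code class="language-plaintext highlighter-rouge">mse</code> and <code class="language-plaintext highlighter-rouge">linear_regression</code> helpers from earlier (and assuming the surface is evaluated on the training data), such a plot can be generated like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import matplotlib.pyplot as plt

slopes = np.linspace(-100, 200, 100)
intercepts = np.linspace(-10, 10, 100)

# loss value for every (intercept, slope) pair on the grid
losses = np.array([
    [mse(y_train, linear_regression(slope=s, intercept=i, X=X_train[:, 0]))
     for s in slopes]
    for i in intercepts
])

plt.contourf(slopes, intercepts, losses, levels=50, cmap="gray")
plt.xlabel("slope")
plt.ylabel("intercept")
plt.colorbar(label="loss")
plt.show()
</code></pre></div></div>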

<p>To interpret this plot, let’s take an example. When the slope is around 200 or -100, we have higher loss values compared to when the slope is between 0 and 100.</p>

<p>We can also see that when the slope is around 60 and the intercept is around 0, we have the lowest possible loss.</p>

<p>On the right-hand side, the plot shows the ground truth in light blue, along with the model’s prediction using the current values of the slope and intercept parameters. As the parameters change, this line also gets updated.</p>

<div>
<video src="/assets/images/deep-learning/sgd-from-scratch/sgd_lr_train.mp4" width="100%" height="500px" controls="true" autoplay="" muted="" loop="" />
</div>

<h1 id="effect-of-learning-rate">Effect of Learning Rate</h1>
<p>Now, let’s see how the learning rate affects convergence. In this case the learning rate is 0.1 and we let it run for 10 epochs. We can see the algorithm makes small updates to the parameters and ultimately converges to the lowest loss.</p>
<div>
<video src="/assets/images/deep-learning/sgd-from-scratch/sgd_lr_0.1.mp4" width="100%" height="500px" controls="true" autoplay="" muted="" loop="" />
</div>

<p>When the learning rate is 0.6 (see below), we see it makes bigger updates.</p>
<div>
<video src="/assets/images/deep-learning/sgd-from-scratch/sgd_lr_0.6.mp4" width="100%" height="500px" controls="true" autoplay="" muted="" loop="" />
</div>

<p>When the learning rate is 0.8 (see below), it makes even bigger updates to the parameters. Even though it almost found the parameters with the lowest loss at around epoch 7, it kept making large changes, repeatedly overshooting the minimum, and didn’t converge even after 50 epochs.</p>
<div>
<video src="/assets/images/deep-learning/sgd-from-scratch/sgd_lr_1.0.mp4" width="100%" height="500px" controls="true" autoplay="" muted="" loop="" />
</div>

<h1 id="conclusion">Conclusion</h1>
<p>In this post we implemented our own version of the gradient descent algorithm for a linear regression model. The same concept applies even to the most complex deep neural networks! The basic idea: for each parameter of our model, compute the partial derivative of the loss function with respect to that parameter, then use the update rule to update the parameter’s value.</p>
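<p>In code, this core idea fits in a couple of lines. A minimal sketch, assuming <code class="language-plaintext highlighter-rouge">params</code> and <code class="language-plaintext highlighter-rouge">grads</code> are dictionaries keyed by parameter name:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def update(params, grads, lr):
    # one gradient descent step over an arbitrary number of parameters
    return {name: value - lr * grads[name] for name, value in params.items()}
</code></pre></div></div>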

<p>With libraries like Pytorch, Tensorflow, JAX etc., we do not even have to compute the gradients ourselves, since the libraries calculate them automatically. Still, it is important to understand the idea behind the algorithm, and I hope this post has helped you understand gradient descent a bit better than before.</p>

<p>That is all I wanted to share. Please let me know if you find any mistakes in this post. Thanks for reading.</p>]]></content><author><name>Sanjaya Subedi</name></author><category term="DeepLearning" /><summary type="html"><![CDATA[Learn the algorithm that powers machine learning models such as LLMs and Diffusion models]]></summary></entry><entry><title type="html">Training Named Entity Recognition model with custom data using Huggingface Transformer</title><link href="https://sanjayasubedi.com.np/deeplearning/training-ner-with-huggingface-transformer/" rel="alternate" type="text/html" title="Training Named Entity Recognition model with custom data using Huggingface Transformer" /><published>2022-04-13T08:29:00+00:00</published><updated>2022-04-13T08:29:00+00:00</updated><id>https://sanjayasubedi.com.np/deeplearning/training-ner-with-huggingface-transformer</id><content type="html" xml:base="https://sanjayasubedi.com.np/deeplearning/training-ner-with-huggingface-transformer/"><![CDATA[<h1 id="introduction">Introduction</h1>
<p>The goal of Named Entity Recognition (NER) is to classify each token (word) in a sentence into a certain class. The most common NER systems freely available on the Internet can identify PERSON, LOCATION, ORGANIZATION etc. NER has several applications and can be a part of your NLP pipeline for numerous tasks. For example:</p>

<ul>
  <li>Identifying ingredients in a recipe to facilitate filtering of recipes by ingredients</li>
  <li>Identifying names of people, locations, emails, bank accounts etc. for data anonymization</li>
  <li>Extracting address, contact details etc. from texts</li>
  <li>Extracting product attributes from product descriptions</li>
</ul>

<p>As an example, consider a product title “Technos 39 Inch Curved Smart LED TV E39DU2000 With Wallmount”. The possible entities in this sentence could be</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">entity</th>
      <th>value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">brand</td>
      <td>Technos</td>
    </tr>
    <tr>
      <td style="text-align: right">display_size</td>
      <td>39 Inch</td>
    </tr>
    <tr>
      <td style="text-align: right">display_type</td>
      <td>LED</td>
    </tr>
  </tbody>
</table>

<p>Since existing NER models and openly available datasets might not be suitable for your task, we need to create a dataset of our own. Compared to other problems such as classification, I find annotating data for NER quite daunting, and it usually requires GUI-based annotation tools. In this post, I will show how we can create a dataset for NER quite easily and train a model using the Huggingface transformers library.</p>

<p>You will need to install the following libraries to follow along</p>

<p><code class="language-plaintext highlighter-rouge">pip install -q datasets transformers</code></p>

<h1 id="data-preparation">Data preparation</h1>
<p>To annotate data for NER, you need to specify which class each word in the sentence belongs to. Existing datasets available on the Internet come in various formats such as <a href="https://universaldependencies.org/format.html">CoNLL</a>, which I believe are not easy for human beings to digest. I find the format used by <a href="https://github.com/RasaHQ/rasa">Rasa</a> quite easy for humans to create and read.</p>

<p>If we consider the example sentence from above, then our annotated sentence becomes</p>

<p>Original: <code class="language-plaintext highlighter-rouge">Technos 39 Inch Curved Smart LED TV E39DU2000 With Wallmount</code></p>

<p>Annotated: <code class="language-plaintext highlighter-rouge">[Technos](brand) [39 Inch](display_size) Curved Smart [LED](display_type) TV E39DU2000 With Wallmount</code></p>

<p>Another example,</p>

<p>Original: <code class="language-plaintext highlighter-rouge">I come from Kathmandu valley, Nepal</code></p>

<p>Annotated: <code class="language-plaintext highlighter-rouge">I come from [Kathmandu valley,](location) [Nepal](location)</code></p>

<p>The format is simple: you put the entity value inside square brackets and, immediately after the square brackets, specify the name of the entity inside parentheses.</p>

<p>The code below takes an annotated text as input and returns a list of tuples, where the first item is the token and the second item is its entity class. If a token has not been annotated, it is assigned the class <code class="language-plaintext highlighter-rouge">O</code> to indicate it does not belong to any entity.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">re</span>
<span class="k">def</span> <span class="nf">get_tokens_with_entities</span><span class="p">(</span><span class="n">raw_text</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="c1"># split the text by spaces only if the space does not occur between square brackets
</span>    <span class="c1"># we do not want to split "multi-word" entity value yet
</span>    <span class="n">raw_tokens</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s">"\s(?![^\[]*\])"</span><span class="p">,</span> <span class="n">raw_text</span><span class="p">)</span>

    <span class="c1"># a regex for matching the annotation according to our notation [entity_value](entity_name)
</span>    <span class="n">entity_value_pattern</span> <span class="o">=</span> <span class="sa">r</span><span class="s">"\[(?P&lt;value&gt;.+?)\]\((?P&lt;entity&gt;.+?)\)"</span>
    <span class="n">entity_value_pattern_compiled</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">entity_value_pattern</span><span class="p">,</span> <span class="n">flags</span><span class="o">=</span><span class="n">re</span><span class="p">.</span><span class="n">I</span><span class="o">|</span><span class="n">re</span><span class="p">.</span><span class="n">M</span><span class="p">)</span>

    <span class="n">tokens_with_entities</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">raw_token</span> <span class="ow">in</span> <span class="n">raw_tokens</span><span class="p">:</span>
        <span class="n">match</span> <span class="o">=</span> <span class="n">entity_value_pattern_compiled</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">raw_token</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">match</span><span class="p">:</span>
            <span class="n">raw_entity_name</span><span class="p">,</span> <span class="n">raw_entity_value</span> <span class="o">=</span> <span class="n">match</span><span class="p">.</span><span class="n">group</span><span class="p">(</span><span class="s">"entity"</span><span class="p">),</span> <span class="n">match</span><span class="p">.</span><span class="n">group</span><span class="p">(</span><span class="s">"value"</span><span class="p">)</span>

            <span class="c1"># we prefix the name of entity differently
</span>            <span class="c1"># B- indicates beginning of an entity
</span>            <span class="c1"># I- indicates the token is not a new entity itself but rather a part of existing one
</span>            <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">raw_entity_token</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">re</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"\s"</span><span class="p">,</span> <span class="n">raw_entity_value</span><span class="p">)):</span>
                <span class="n">entity_prefix</span> <span class="o">=</span> <span class="s">"B"</span> <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="s">"I"</span>
                <span class="n">entity_name</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">entity_prefix</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">raw_entity_name</span><span class="si">}</span><span class="s">"</span>
                <span class="n">tokens_with_entities</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">raw_entity_token</span><span class="p">,</span> <span class="n">entity_name</span><span class="p">))</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">tokens_with_entities</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">raw_token</span><span class="p">,</span> <span class="s">"O"</span><span class="p">))</span>

    <span class="k">return</span> <span class="n">tokens_with_entities</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Let’s try some inputs</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="k">print</span><span class="p">(</span><span class="n">get_tokens_with_entities</span><span class="p">(</span><span class="s">"I come from [Kathmandu valley,](location) [Nepal](location)"</span><span class="p">))</span>
<span class="c1"># [('I', 'O'), ('come', 'O'), ('from', 'O'), ('Kathmandu', 'B-location'), ('valley,', 'I-location'), ('Nepal', 'B-location')]
</span>
<span class="k">print</span><span class="p">(</span><span class="n">get_tokens_with_entities</span><span class="p">(</span><span class="s">"[Technos](brand) [39 Inch](display_size) Curved Smart [LED](display_type) TV E39DU2000 With Wallmount"</span><span class="p">))</span>
<span class="c1"># [('Technos', 'B-brand'), ('39', 'B-display_size'), ('Inch', 'I-display_size'), ('Curved', 'O'), ('Smart', 'O'), ('LED', 'B-display_type'), ('TV', 'O'), ('E39DU2000', 'O'), ('With', 'O'), ('Wallmount', 'O')]
</span></pre></td></tr></tbody></table></code></pre></div></div>
<p>So far it looks good. We can have entity values that span multiple words, and we can have any kind of entity names.</p>

<p>But we are not done yet. Transformer models typically use a limited vocabulary size and therefore cannot know every word in existence. If some word in our dataset is unknown to the model, that word is split into multiple “sub-words”. There are several tokenization schemes, such as WordPiece and BytePairEncoding, used by different models. If a token from our annotation is split into multiple sub-words, our annotation becomes misaligned. We need to take care of this as well. Let me show you an example of what I mean.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"distilbert-base-uncased"</span><span class="p">)</span>

<span class="c1"># note that I purposefully misspell Kathmandu to Kathamanduu
</span><span class="n">sample_input</span> <span class="o">=</span> <span class="s">"I come from [Kathmanduu valley,](location) [Nepal](location)"</span>
<span class="n">tokens</span><span class="p">,</span> <span class="n">entities</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">get_tokens_with_entities</span><span class="p">(</span><span class="n">sample_input</span><span class="p">)))</span>
<span class="n">tokenized_input</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">tokens</span><span class="p">,</span> <span class="n">is_split_into_words</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Original tokens           : "</span><span class="p">,</span> <span class="n">tokens</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"After subword tokenization: "</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">tokenized_input</span><span class="p">[</span><span class="s">'input_ids'</span><span class="p">]))</span>
<span class="c1"># Original tokens           :  ('I', 'come', 'from', 'Kathmanduu', 'valley,', 'Nepal')
# After subword tokenization:  ['[CLS]', 'i', 'come', 'from', 'kathmandu', '##u', 'valley', ',', 'nepal', '[SEP]']
</span></pre></td></tr></tbody></table></code></pre></div></div>
<p>We can see from the output that, after tokenization, the number of tokens is different from our original list of tokens. Depending on the tokenizer model we use, it adds several “special tokens” at the beginning or at the end. Also note that the tokenizer model does not know the word “kathmanduu”, so it split it into two tokens, “kathmandu” and “##u”. We need to align the labels from the original token/label pairs to the “new tokens”. This is also explained <a href="https://huggingface.co/docs/transformers/tasks/token_classification#preprocess">here</a>.</p>
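<p>With a fast tokenizer, this alignment can be done using the <code class="language-plaintext highlighter-rouge">word_ids()</code> method, which maps every sub-word back to the index of the original token it came from. A minimal sketch, where <code class="language-plaintext highlighter-rouge">labels</code> is an assumed list holding one label id per original token and -100 is the index that PyTorch’s cross-entropy loss ignores:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>word_ids = tokenized_input.word_ids()
# for the example above: [None, 0, 1, 2, 3, 3, 4, 4, 5, None]
aligned_labels = []
for word_id in word_ids:
    if word_id is None:
        # special tokens like [CLS] and [SEP] get the ignore index
        aligned_labels.append(-100)
    else:
        # each sub-word inherits the label of its original token
        # (another common choice is to label only the first sub-word
        # and set the continuation sub-words to -100)
        aligned_labels.append(labels[word_id])
</code></pre></div></div>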

<p>To make things easier, I created a class called <code class="language-plaintext highlighter-rouge">NERDataMaker</code> which takes care of all the steps mentioned above and returns a <code class="language-plaintext highlighter-rouge">datasets.Dataset</code> object that can be passed directly to Huggingface’s <code class="language-plaintext highlighter-rouge">Trainer</code> class. You can find the implementation in <a href="https://gist.github.com/jangedoo/7ac6fdc7deadc87fd1a1124c9d4ccce9">this gist</a>.</p>

<p>For this demo, I’ve created a small dataset to extract product attributes from product descriptions posted on e-commerce websites.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
</pre></td><td class="rouge-code"><pre><span class="n">raw_text</span> <span class="o">=</span> <span class="s">"""
[40"](display_size) [LED](display_type) TV
Specifications: [16″](display_size) HD READY [LED](display_type) TV.
[1 Year](warranty) Warranty
Rowa [29"](display_size) [LED](display_type) TV
Highlights:- 48"Full HD [LED](display_type) TV Triple Protection
[80cm](display_size) (32) HD Flat TV K4000 Series 4
[32"](display_size) LED, [2 yrs](warranty) full warranty, All care protection, Integrated Sound Station- Tweeter/20w, Family tv 2.0, Louvre Desing, Mega dynamic contract ratio, Hyper real engine, USB movie
CG 32D0003 [LED](display_type) TV
Screen Size : [43″](display_size)
Resolution : 1920*1080p
Response time : [8ms](response_time)
USB : Yes (Music+Photo+Movie)
Analog AV Out : Yes
Power Supply : 110~240V 50-60Hz
WEGA [32 Inch](display_size) SMART DLED TV HI Sound Double Glass - (Black)
Model: [32"](display_size) Smart DLED TV HI Sound
Hisense HX32N2176 [32"Inch](display_size) Full HD [Led](display_type) Tv
[32 Inch](display_size) [1366x768](display_resolution) pixels HD LED TV
[43 inch](display_size) [LED](display_type) TV
[2 Years](warranty) Warranty &amp; 1 Year Service Warranty
[1920 X 1080](display_resolution) Full HD
[Technos](brand) [39 Inch](display_size) Curved Smart [LED](display_type) TV E39DU2000 With Wallmount
24″ Led Display Stylish Display Screen resolution : [1280 × 720](display_resolution) (HD Ready) USB : Yes VGS : Yes
Technos 24K5 [24 Inch](display_size) LED TV
Technos Led Tv [18.5″ Inch](display_size) (1868tw)
[18.5 inch](display_size) stylish LED dsiplay [1280 x 720p](display_resolution) HD display 2 acoustic speaker USB and HDMI port Technos brand
15.6 ” Led Display Display Screen resolution : 1280 720 (HD Ready) USB : Yes VGS : Yes HDMI : Yes Screen Technology : [led](display_type)
Model:CG55D1004U
Screen Size: [55"](display_size)
Resolution: [3840x2160p](display_resolution)
Power Supply: 100~240 V/AC
Sound Output (RMS): 8W + 8W
Warranty: [3 Years](warranty) wrranty
"""</span>

<span class="n">dm</span> <span class="o">=</span> <span class="n">NERDataMaker</span><span class="p">(</span><span class="n">raw_text</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"total examples = </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">dm</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">dm</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">3</span><span class="p">])</span>

<span class="c1"># total examples = 35
# [{'id': 0, 'ner_tags': [0], 'tokens': ['']}, {'id': 1, 'ner_tags': [2, 3, 0], 'tokens': ['40"', 'LED', 'TV']}, {'id': 2, 'ner_tags': [0, 2, 0, 0, 3, 0], 'tokens': ['Specifications:', '16″', 'HD', 'READY', 'LED', 'TV.']}]
</span></pre></td></tr></tbody></table></code></pre></div></div>

<p>Now that we have our “data maker” ready, we can finally train the model.</p>

<h1 id="model-training">Model training</h1>
<p>For this demo, I’ll use the <code class="language-plaintext highlighter-rouge">distilbert-base-uncased</code> model. The <code class="language-plaintext highlighter-rouge">dm</code> object contains a few properties which we pass to the <code class="language-plaintext highlighter-rouge">AutoModelForTokenClassification.from_pretrained</code> method.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">DataCollatorForTokenClassification</span><span class="p">,</span> <span class="n">AutoModelForTokenClassification</span><span class="p">,</span> <span class="n">TrainingArguments</span><span class="p">,</span> <span class="n">Trainer</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"distilbert-base-uncased"</span><span class="p">)</span>
<span class="n">data_collator</span> <span class="o">=</span> <span class="n">DataCollatorForTokenClassification</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForTokenClassification</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"distilbert-base-uncased"</span><span class="p">,</span> <span class="n">num_labels</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">dm</span><span class="p">.</span><span class="n">unique_entities</span><span class="p">),</span> <span class="n">id2label</span><span class="o">=</span><span class="n">dm</span><span class="p">.</span><span class="n">id2label</span><span class="p">,</span> <span class="n">label2id</span><span class="o">=</span><span class="n">dm</span><span class="p">.</span><span class="n">label2id</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Finally, we can configure the training arguments, create a <code class="language-plaintext highlighter-rouge">datasets.Dataset</code> object, and build a <code class="language-plaintext highlighter-rouge">Trainer</code> object to train the model. <strong>I am evaluating on the training data just for this demo. Please create a proper dataset for evaluation.</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
</pre></td><td class="rouge-code"><pre><span class="n">training_args</span> <span class="o">=</span> <span class="n">TrainingArguments</span><span class="p">(</span>
    <span class="n">output_dir</span><span class="o">=</span><span class="s">"./results"</span><span class="p">,</span>
    <span class="n">evaluation_strategy</span><span class="o">=</span><span class="s">"epoch"</span><span class="p">,</span>
    <span class="n">learning_rate</span><span class="o">=</span><span class="mf">2e-5</span><span class="p">,</span>
    <span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
    <span class="n">per_device_eval_batch_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
    <span class="n">num_train_epochs</span><span class="o">=</span><span class="mi">40</span><span class="p">,</span>
    <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">train_ds</span> <span class="o">=</span> <span class="n">dm</span><span class="p">.</span><span class="n">as_hf_dataset</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">)</span>

<span class="n">trainer</span> <span class="o">=</span> <span class="n">Trainer</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
    <span class="n">args</span><span class="o">=</span><span class="n">training_args</span><span class="p">,</span>
    <span class="n">train_dataset</span><span class="o">=</span><span class="n">train_ds</span><span class="p">,</span>
    <span class="n">eval_dataset</span><span class="o">=</span><span class="n">train_ds</span><span class="p">,</span> <span class="c1"># eval on training set! ONLY for DEMO!!
</span>    <span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span>
    <span class="n">data_collator</span><span class="o">=</span><span class="n">data_collator</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">trainer</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The “validation loss” decreased to around 0.03 after 40 epochs. Note that the validation loss here is calculated on the training data itself, so don’t take this number to represent the model’s actual performance on unseen data. I post it here only so that you can compare your results if you are following along.</p>

<p>To use the trained model for inference, we will use <code class="language-plaintext highlighter-rouge">pipeline</code> from the transformers library to easily get the predictions.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">pipeline</span>
<span class="n">pipe</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">(</span><span class="s">"ner"</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">aggregation_strategy</span><span class="o">=</span><span class="s">"simple"</span><span class="p">)</span> <span class="c1"># pass device=0 if using gpu
</span><span class="n">pipe</span><span class="p">(</span><span class="s">"""2 year warrantee Samsung 40 inch LED TV, 1980 x 1080 resolution"""</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="rouge-code"><pre>[{'end': 6,
  'entity_group': 'warranty',
  'score': 0.53562486,
  'start': 0,
  'word': '2 year'},
 {'end': 32,
  'entity_group': 'display_size',
  'score': 0.92803776,
  'start': 25,
  'word': '40 inch'},
 {'end': 36,
  'entity_group': 'display_type',
  'score': 0.7992602,
  'start': 33,
  'word': 'led'},
 {'end': 52,
  'entity_group': 'display_resolution',
  'score': 0.7081752,
  'start': 41,
  'word': '1980 x 1080'}]
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Even though I purposefully misspelled the word “warranty” as “warrantee”, the model was still able to figure out that the warranty of this product is “2 year”. I think the results are promising: we can build robust NER models that handle noisy data if we train them with a sufficiently large number of examples.</p>

<h1 id="conclusion">Conclusion</h1>
<p>In this post we created a simple and easy way to annotate our data for NER and also solved the label-alignment problem caused by the sub-word tokenization schemes that many transformer models use. Finally, we trained the model using the <code class="language-plaintext highlighter-rouge">Trainer</code> class and used <code class="language-plaintext highlighter-rouge">pipeline</code> to easily run inference with the trained model.</p>

<p>If you liked this post then please share it with others. If there are any errors please let me know.</p>]]></content><author><name>Sanjaya Subedi</name></author><category term="DeepLearning" /><summary type="html"><![CDATA[Train a NER model with your own data using Huggingface transformers library]]></summary></entry><entry><title type="html">Image search using Image to Image Similarity</title><link href="https://sanjayasubedi.com.np/deeplearning/image-similarity-with-python/" rel="alternate" type="text/html" title="Image search using Image to Image Similarity" /><published>2022-03-29T12:29:00+00:00</published><updated>2022-03-29T12:29:00+00:00</updated><id>https://sanjayasubedi.com.np/deeplearning/image-similarity-with-python</id><content type="html" xml:base="https://sanjayasubedi.com.np/deeplearning/image-similarity-with-python/"><![CDATA[<h1 id="introduction">Introduction</h1>
<p>We are all familiar with text search, which returns documents similar to our query. It is also possible to perform a similar search with images! In this post we will explore how to implement an image search similar to Google’s reverse image search. There are several applications of image search. For example, an e-commerce website could allow users to upload a picture of a shirt their friends are wearing and, using image search, find similar shirts from its catalog. Image search can also be used to find visually similar images for a recommendation engine, or to find duplicates.</p>

<p>Note: I’ve used a Jupyter notebook to run the code in this post, so you might find some Jupyter-specific commands here and there. Full source code is available <a href="https://github.com/jangedoo/image-similarity-demo/blob/master/notebooks/Image%20search%20with%20pre-trained%20model.ipynb">here</a>.</p>

<h1 id="implementation">Implementation</h1>
<p>The basic approach for any neural network based search application is as follows:</p>

<p><strong>Indexing existing images in catalog</strong></p>
<div class="mermaid">
graph LR;
	input((Images))--&gt;Vectorizer--&gt;|vectors|db[(VectorsDB)];
</div>
<p>We need to “index” our images into a vector database. I’m using the term database loosely here: it can be an in-memory numpy array, or an application like OpenSearch, Milvus, FAISS etc. that supports saving vectors and performing Nearest Neighbors search.</p>

<p>For every image, we need to extract a <strong>feature vector</strong> using some model. Deep neural networks are a good choice for extracting these features. For this demo, I’ll use the <strong>inception_resnet_v2</strong> model from Tensorflow Hub as a feature extractor/vectorizer.</p>

<p><strong>Query time</strong></p>
<div class="mermaid">
graph LR;
	input((Image))--&gt;Vectorizer--&gt;Vector
	Vector--&gt;KNNSearch
	db[(VectorsDB)]---KNNSearch
	KNNSearch--&gt;output[Similar Images]
</div>
<p>During query time, we have an input image. We again use the same vectorizer to extract its feature vector and perform a Nearest Neighbor search in the VectorDB. For this demo, the <strong>VectorDB</strong> is just an in-memory numpy array and <strong>KNNSearch</strong> is an instance of <code class="language-plaintext highlighter-rouge">sklearn.neighbors.NearestNeighbors</code>. For production use cases, OpenSearch or FAISS can generally act as the <strong>VectorDB</strong> as well as perform the KNN search.</p>
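<p>As a sketch of the query step, assuming <code class="language-plaintext highlighter-rouge">vectors</code> is an (n_images, dim) numpy array produced by the vectorizer and <code class="language-plaintext highlighter-rouge">query_vector</code> is the (dim,) vector of the query image:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.neighbors import NearestNeighbors

# the in-memory "VectorsDB": fit a KNN index over the catalog vectors
knn = NearestNeighbors(n_neighbors=5, metric="cosine")
knn.fit(vectors)

# find the 5 most similar catalog images for the query
distances, indices = knn.kneighbors(query_vector.reshape(1, -1))
# indices[0] holds the positions of the most similar images in the catalog
</code></pre></div></div>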

<p>First let’s load all the required libraries.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">tensorflow_datasets</span> <span class="k">as</span> <span class="n">tfds</span>
<span class="kn">import</span> <span class="nn">tensorflow_hub</span> <span class="k">as</span> <span class="n">hub</span>
<span class="kn">import</span> <span class="nn">functools</span>

<span class="k">print</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">__version__</span><span class="p">)</span> <span class="c1"># 2.4.1
</span><span class="k">print</span><span class="p">(</span><span class="n">tfds</span><span class="p">.</span><span class="n">__version__</span><span class="p">)</span> <span class="c1"># 4.5.2
</span><span class="k">print</span><span class="p">(</span><span class="n">hub</span><span class="p">.</span><span class="n">__version__</span><span class="p">)</span> <span class="c1"># 0.10.0
</span></pre></td></tr></tbody></table></code></pre></div></div>
<h2 id="dataset">Dataset</h2>
<p>We’ll use the “imagenette” dataset. It is a subset of 10 easily classified classes from the ImageNet dataset. It was prepared by Jeremy Howard and its homepage can be found <a href="https://github.com/fastai/imagenette">here</a>. I chose this dataset because the pre-trained models we find on the Internet are generally trained on the ImageNet dataset, so such models can extract meaningful feature vectors from these images. If you load another dataset, e.g. images of chest x-rays or images of clothing items, the model will not produce meaningful vectors, as it has never seen those kinds of images.</p>

<p>The code below loads “imagenette” using the Tensorflow Datasets library. All we’ve done is resize each image to the desired size and normalize the pixel values to be between 0 and 1.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
</pre></td><td class="rouge-code"><pre><span class="n">ds</span> <span class="o">=</span> <span class="n">tfds</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">"imagenette"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">extract_image</span><span class="p">(</span><span class="n">example</span><span class="p">):</span>
    <span class="n">image</span> <span class="o">=</span> <span class="n">example</span><span class="p">[</span><span class="s">'image'</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">image</span>

<span class="k">def</span> <span class="nf">preprocess_image</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">height</span><span class="p">,</span> <span class="n">width</span><span class="p">):</span>
    <span class="n">image</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">image</span><span class="p">.</span><span class="n">resize_with_crop_or_pad</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">target_height</span><span class="o">=</span><span class="n">height</span><span class="p">,</span> <span class="n">target_width</span><span class="o">=</span><span class="n">width</span><span class="p">)</span>
    <span class="n">image</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">cast</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span> <span class="o">/</span> <span class="mf">255.0</span>
    <span class="k">return</span> <span class="n">image</span>


<span class="k">def</span> <span class="nf">get_image_batches</span><span class="p">(</span><span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">256</span><span class="p">):</span>
    <span class="n">partial_preprocess_image</span> <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">preprocess_image</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="n">height</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="n">width</span><span class="p">)</span>
    <span class="n">train_ds</span> <span class="o">=</span> <span class="n">ds</span><span class="p">[</span><span class="s">'train'</span><span class="p">]</span>
    <span class="n">train_ds</span> <span class="o">=</span> <span class="p">(</span> <span class="n">train_ds</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">extract_image</span><span class="p">)</span>
                <span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">partial_preprocess_image</span><span class="p">)</span>
                <span class="p">.</span><span class="n">cache</span><span class="p">()</span>
                <span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">buffer_size</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
                <span class="p">.</span><span class="n">batch</span><span class="p">(</span><span class="n">batch_size</span><span class="p">)</span>
                <span class="p">.</span><span class="n">prefetch</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">AUTOTUNE</span><span class="p">)</span>
                <span class="p">)</span>
    <span class="k">return</span> <span class="n">train_ds</span>


<span class="n">BATCH_SIZE</span> <span class="o">=</span> <span class="mi">32</span>
<span class="n">IMG_WIDTH</span> <span class="o">=</span> <span class="n">IMG_HEIGHT</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">train_ds</span> <span class="o">=</span> <span class="n">get_image_batches</span><span class="p">(</span><span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="n">IMG_HEIGHT</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="n">IMG_WIDTH</span><span class="p">)</span> 
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Tensorflow Datasets is a powerful library with a lot of features, and it can handle huge datasets that do not fit in memory. However, for the purposes of this demo, let’s load about 640 images into memory.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">images</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">img</span> <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">train_ds</span><span class="p">.</span><span class="n">take</span><span class="p">(</span><span class="mi">20</span><span class="p">)</span> <span class="k">for</span> <span class="n">img</span> <span class="ow">in</span> <span class="n">batch</span><span class="p">])</span> <span class="c1"># take 20 batches 
</span><span class="k">print</span><span class="p">(</span><span class="n">images</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="c1"># (640, 256, 256, 3)
</span></pre></td></tr></tbody></table></code></pre></div></div>
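<p>As a rough back-of-envelope check, 640 images of size 256×256×3 stored as <code class="language-plaintext highlighter-rouge">float32</code> take about 640 × 256 × 256 × 3 × 4 bytes ≈ 0.5 GB, which comfortably fits in memory on most machines.</p>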

<h2 id="vectorizing-images">Vectorizing images</h2>
<p>Now that we have the images, we need to extract feature vectors from them. We’ll load a model that was trained on the ImageNet dataset as our vectorizer. We also let the model know what image size to expect when “predicting”.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="n">vectorizer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">Sequential</span><span class="p">([</span>
    <span class="n">hub</span><span class="p">.</span><span class="n">KerasLayer</span><span class="p">(</span><span class="s">"https://tfhub.dev/google/imagenet/inception_resnet_v2/feature_vector/5"</span><span class="p">,</span> <span class="n">trainable</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="p">])</span>
<span class="n">vectorizer</span><span class="p">.</span><span class="n">build</span><span class="p">([</span><span class="bp">None</span><span class="p">,</span> <span class="n">IMG_HEIGHT</span><span class="p">,</span> <span class="n">IMG_WIDTH</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The code above downloads the model from Tensorflow Hub, if it is not already downloaded, and loads it into memory. Extracting vectors is then as simple as calling the <code class="language-plaintext highlighter-rouge">predict</code> method of the model.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">features</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">images</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">features</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="c1"># (640, 1536)
</span></pre></td></tr></tbody></table></code></pre></div></div>
<p>From the output, we see that for each image, we have a feature vector of size 1536.</p>
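<p>To build some intuition for how these vectors will be compared, here is a minimal sketch that computes the cosine similarity between the first two feature vectors directly with numpy; the <code class="language-plaintext highlighter-rouge">NearestNeighbors</code> model in the next section does essentially this comparison, at scale:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the norms = cosine of the angle
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(features[0], features[1]))  # value in [-1, 1]; higher means more similar
</code></pre></div></div>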

<h2 id="finding-similar-images">Finding similar Images</h2>
<p>Now comes the fun part - performing image search! As explained earlier, we’ll use the scikit-learn library to create a <code class="language-plaintext highlighter-rouge">NearestNeighbors</code> model and use it to find similar images.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">sklearn.neighbors</span> <span class="kn">import</span> <span class="n">NearestNeighbors</span>
<span class="n">knn</span> <span class="o">=</span> <span class="n">NearestNeighbors</span><span class="p">(</span><span class="n">n_neighbors</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">metric</span><span class="o">=</span><span class="s">"cosine"</span><span class="p">)</span>
<span class="n">knn</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">features</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>That’s it! We can now use the <code class="language-plaintext highlighter-rouge">knn</code> object to find the nearest neighbors of any given feature vector. The following code shows how an input image can be used to find similar images and plot them for visualization.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
</pre></td><td class="rouge-code"><pre><span class="n">image</span> <span class="o">=</span> <span class="n">images</span><span class="p">[</span><span class="mi">10</span><span class="p">]</span> <span class="c1"># take an existing image or create a numpy array from PIL image
</span><span class="n">image</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="c1"># add a batch dimension
</span><span class="n">feature</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>

<span class="n">distances</span><span class="p">,</span> <span class="n">nbors</span> <span class="o">=</span> <span class="n">knn</span><span class="p">.</span><span class="n">kneighbors</span><span class="p">(</span><span class="n">feature</span><span class="p">)</span>
<span class="c1"># output is a tuple of list of distances and list nbors of each image
# so we take the first entry from those lists since we have only one image
</span><span class="n">distances</span><span class="p">,</span> <span class="n">nbors</span> <span class="o">=</span> <span class="n">distances</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">nbors</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="n">nbor_images</span> <span class="o">=</span> <span class="p">[</span><span class="n">images</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">nbors</span><span class="p">]</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">nbors</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">nbor_images</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
    <span class="n">ax</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_axis_off</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">image</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Input"</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">nbor_images</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
        <span class="c1"># we get cosine distance, to convert to similarity we do 1 - cosine_distance
</span>        <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="sa">f</span><span class="s">"Sim: </span><span class="si">{</span><span class="mi">1</span> <span class="o">-</span> <span class="n">distances</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p><img src="/assets/images/deep-learning/image-search/01-similar-images.png" alt="Image Search" /></p>

<p>In the figure above, the first column is the input image and the remaining columns are the results from the KNN search. The first result (2nd column) is exactly the same as the input because we used an image from the indexed set as the query - this also serves as a sanity check. Looking at the results, they are surprisingly good: all the images are of a petrol (gas) station.</p>
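<p>We can turn that observation into a quick programmatic check. A minimal sketch, reusing the <code class="language-plaintext highlighter-rouge">nbors</code> and <code class="language-plaintext highlighter-rouge">distances</code> variables from the code above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# The query image (images[10]) is part of the indexed set, so - barring
# exact duplicates in the dataset - its nearest neighbor should be itself,
# at (near) zero cosine distance.
assert nbors[0] == 10
assert np.isclose(distances[0], 0.0, atol=1e-5)
</code></pre></div></div>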

<p>However, we should keep in mind that the model was trained on ImageNet data, and for this demo we used the Imagenette dataset, which is a subset of ImageNet. Also, Imagenette contains images from classes that are easily classified: for example, there are images of dogs, golf balls, people holding fish, fuel stations, garbage trucks, houses etc. These images are visually distinct from each other and are relatively easy for a model to tell apart.</p>

<p>This is not always the case in the real world though. For example, images of monitors and televisions look pretty much identical. In that case, the model would somehow have to be trained to see the difference between a TV and a monitor, and I doubt a pre-trained model would perform well on images from a different domain without fine-tuning.</p>

<p>Here are more input images along with their most similar-looking results:</p>

<p><img src="/assets/images/deep-learning/image-search/02-image-search-grid.png" alt="Image Searh Grid" /></p>

<p>To explore more, I also created a small Jupyter widget. You can use the controls shown on the screen to play around.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">show_similar_images</span><span class="p">(</span><span class="n">start_image_idx</span><span class="p">,</span> <span class="n">n_inputs</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">n_neighbors</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
    <span class="n">input_images</span> <span class="o">=</span> <span class="n">images</span><span class="p">[</span><span class="n">start_image_idx</span><span class="p">:</span><span class="n">start_image_idx</span><span class="o">+</span><span class="n">n_inputs</span><span class="p">]</span>
    <span class="n">features</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">input_images</span><span class="p">)</span>
    <span class="n">knn_output</span> <span class="o">=</span> <span class="n">knn</span><span class="p">.</span><span class="n">kneighbors</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="n">n_neighbors</span><span class="o">=</span><span class="n">n_neighbors</span><span class="p">)</span>
    
    <span class="n">images_with_distances_and_nbors</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="n">input_images</span><span class="p">,</span> <span class="o">*</span><span class="n">knn_output</span><span class="p">)</span>
    
    <span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">input_images</span><span class="p">),</span> <span class="n">n_neighbors</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">input_images</span><span class="p">)</span><span class="o">*</span><span class="mi">3</span><span class="p">))</span>
    
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">distances</span><span class="p">,</span> <span class="n">nbors</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">images_with_distances_and_nbors</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_neighbors</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
            <span class="n">ax</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span>
            <span class="n">img</span> <span class="o">=</span> <span class="n">image</span> <span class="k">if</span> <span class="n">j</span><span class="o">==</span><span class="mi">0</span> <span class="k">else</span> <span class="n">images</span><span class="p">[</span><span class="n">nbors</span><span class="p">[</span><span class="n">j</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
            <span class="k">if</span> <span class="n">j</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Input Image"</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="sa">f</span><span class="s">"Sim: </span><span class="si">{</span><span class="mi">1</span><span class="o">-</span><span class="n">distances</span><span class="p">[</span><span class="n">j</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="n">ax</span><span class="p">.</span><span class="n">set_axis_off</span><span class="p">()</span>
            <span class="n">ax</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>

    <span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"02-image-search-grid.png"</span><span class="p">)</span>

<span class="kn">import</span> <span class="nn">ipywidgets</span> <span class="k">as</span> <span class="n">w</span>
<span class="n">w</span><span class="p">.</span><span class="n">interact</span><span class="p">(</span><span class="n">show_similar_images</span><span class="p">,</span> 
    <span class="n">start_image_idx</span><span class="o">=</span><span class="n">w</span><span class="p">.</span><span class="n">IntSlider</span><span class="p">(</span><span class="nb">max</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">images</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">continuous_update</span><span class="o">=</span><span class="bp">False</span><span class="p">),</span>
    <span class="n">n_inputs</span><span class="o">=</span><span class="n">w</span><span class="p">.</span><span class="n">IntSlider</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">continuous_update</span><span class="o">=</span><span class="bp">False</span><span class="p">),</span>
    <span class="n">n_neighbors</span><span class="o">=</span><span class="n">w</span><span class="p">.</span><span class="n">IntSlider</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">continuous_update</span><span class="o">=</span><span class="bp">False</span><span class="p">),</span>
<span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p><img src="/assets/images/deep-learning/image-search/03-image-search-jupyter-widget.png" alt="Image Searh Grid" /></p>
<h1 id="conclusion">Conclusion</h1>

<p>In this post we saw how to implement a simple image search. Since we use a pre-trained model to generate vectors from the images, this approach will not necessarily work well for images from every domain. There is also still a lot to do to put this into a production setup. In future posts we will explore how to use OpenSearch (ElasticSearch) to store the vectors and perform KNN search, and also how to fine-tune a pre-trained model on our own domain.</p>]]></content><author><name>Sanjaya Subedi</name></author><category term="DeepLearning" /><summary type="html"><![CDATA[Learn how to use deep neural networks to implement image search]]></summary></entry></feed>