In this tutorial, I will show you how to compare two texts for similarity using Cosine Similarity between transformer embeddings. By leveraging the power of Hugging Face Transformers and PyTorch, we will be able to quickly and accurately determine the degree of similarity between two pieces of text.
Prerequisites: Python, PyTorch, Transformers
Step 1:
The first step is to import the BertTokenizer (or any tokenizer from transformers) class from the transformers library and create a tokenizer by calling the from_pretrained() method with the ‘bert-base-uncased’ pre-trained model. You may use any transformer model instead of this and use it with zero or few changes. Here, The tokenizer is used to encode the text inputs in a format that can be used as input to the BERT model. If you are using a different tokenizer, make sure the model you are importing correspond to it.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
Step 2:
In this step, the two input texts, text1 and text2, are tokenized using the tokenizer instance created in Step 1. The padding and truncation arguments are set to True and we are going to return pytorch tensors. We use padding and truncation to ensure that the input tensors have the same shape and can be passed to the BERT model.
# Example sentences
text1 = "John likes to eat apples and oranges for breakfast."
text2 = "John likes to eat pancakes for breakfast."
input1 = tokenizer(text1, padding=True, truncation=True, return_tensors='pt')
input2 = tokenizer(text2, padding=True, truncation=True, return_tensors='pt')
Step 3:
Next, we import the BertModel class from the transformers library (you can import it first, I am importing it here just to ensure better understanding) and create an instance of it by calling the from_pretrained() method with the bert-base-uncased pre-trained model as an argument. If you are using a different model, load the respective model that will accept the tokens made with the tokenizer as well. The model is used to encode the input tokens and obtain their embeddings.
from transformers import BertModel
model = BertModel.from_pretrained('bert-base-uncased')
Step 4:
In this step, we pass the tokenized inputs created in Step 2 to the BERT model with the model we have loaded in the previous tep. The with torch.no_grad() context ensures that the computations performed do not have any gradients, as we are not training the model. The outputs of the model are then used to obtain the mean of the last hidden state, which represents the embedding of the entire input sequence.
with torch.no_grad():
out1 = model(**input1)
out2 = model(**input2)
output_embedding1 = out1.last_hidden_state.mean(dim=1)
output_embedding2 = out2.last_hidden_state.mean(dim=1)
Step 5:
Finally, we import the nn.CosineSimilarity class from PyTorch and create an instance of it. We then pass the embeddings obtained in Step 4 to the cosine_similarity() method of the nn.CosineSimilarity instance to obtain the cosine similarity between the two input texts, which represents their similarity. Please note that you might need to adjust your dim argument based on the model/embedding. For example for BERT model the dim will be 1.
import torch.nn as nn
cosine_similarity = nn.CosineSimilarity(dim=1)
text_similarity = cosine_similarity(output_embedding1, output_embedding2)
For the given example, text_similarity will output tensor([0.8963]) which tells us that the given sentences are 89.63% similar.
When using the cosine similarity function from the torch.nn library, a similarity score closer to one indicates higher similarity between the two texts, with a score of one indicating that the texts are identical. However, if a different cosine similarity function such as the one from scipy is used, a similarity score of zero indicates identical texts, and the closer the score is to zero, the higher the similarity between the two texts.
And we are done! We have used Hugging Face Transformers and PyTorch to find the similarity between two sentences.