Sentiment Analysis

2022-10-01

This project is about sentiment analysis, as the title suggests. We will use BERT, a state-of-the-art language model developed by Google.

We’ll start by installing transformers, a Hugging Face library used primarily in Natural Language Processing (NLP).

  ! pip install transformers

Next, we import the other packages and libraries we need:

  import pandas as pd
  import numpy as np
  import requests
  from bs4 import BeautifulSoup
  import re
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

For our task we’ll use bert-base-multilingual-uncased-sentiment.

As stated on the official Hugging Face website:

This is a bert-base-multilingual-uncased model finetuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian. It predicts the sentiment of the review as a number of stars (between 1 and 5).

  model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
  tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

The tokenizer converts a sentence into a sequence of integer token IDs that the model can process.
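To make the idea of turning text into numbers concrete, here is a toy sketch. It is not how the real tokenizer works internally: BERT splits text into subword pieces and uses a much larger vocabulary. The vocabulary, the IDs, and the toy_encode function below are all invented for illustration.

```python
# Toy vocabulary with made-up IDs; a real BERT vocabulary is far larger
TOY_VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "i": 3, "loved": 4, "this": 5, "movie": 6}

def toy_encode(sentence):
    """Lowercase, split on whitespace, and look up each word's ID."""
    words = sentence.lower().split()
    # Words missing from the vocabulary map to the unknown-token ID
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in words]
    # Wrap the sequence in start and end markers, as BERT-style tokenizers do
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]

encoded = toy_encode("I loved this movie")
print(encoded)  # [0, 3, 4, 5, 6, 1]
```

The real tokenizer follows the same outline (text in, list of integers out) but returns the IDs as a tensor when called with return_tensors='pt'.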

The function below is the main function in this code.


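def sentiment_to_number(review):
    # Encode the review as token IDs, truncated to BERT's 512-token limit
    tokens = tokenizer.encode(review, return_tensors='pt', truncation=True, max_length=512)
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        result = model(tokens)
    # The index of the largest logit is 0-4; add 1 to get a rating from 1 to 5
    return int(torch.argmax(result.logits)) + 1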

How it works:

Let's walk through the function step by step:

Step 1:

tokenizer.encode returns a torch.Tensor of integer token IDs that represent the given string.

Step 2:

Calling the model with the tokens from Step 1 returns a SequenceClassifierOutput object. The important part is its logits tensor, which holds one raw score per class (1 to 5 stars). The higher a score, the more likely that class is the sentiment of the review. These scores are not probabilities, but applying softmax to them would give probabilities with the same ranking.

Step 3:

torch.argmax returns the index of the largest score, in the range [0, 4]. Adding 1 to it gives a rating from 1 to 5.
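Step 3 can be sketched in plain Python, without downloading the model. The values the model returns are raw scores (logits); softmax turns them into probabilities, and because softmax preserves order, the largest raw score and the largest probability sit at the same index. The logits below are made up for illustration and are not real model output, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stars_from_logits(logits):
    """Return a 1-5 star rating: index of the largest logit, plus 1."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return best_index + 1

# Hypothetical logits for one review (made-up values, not real model output)
example_logits = [-2.1, -1.3, 0.2, 2.4, 1.9]

probabilities = softmax(example_logits)
rating = stars_from_logits(example_logits)
print(rating)  # 4, because index 3 holds the largest logit
```

Since only the position of the maximum matters, the function in this post can skip softmax and call torch.argmax on the raw scores directly.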

Scraping the data from ‘Rotten Tomatoes’:

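# Download the reviews page
r = requests.get('https://www.rottentomatoes.com/m/the_lion_king_2019/reviews')
r.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(r.text, 'html.parser')
# Find every div whose class contains 'the_review'
regex = re.compile('the_review')
results = soup.find_all('div', {'class': regex})
# Keep only the stripped text of each review
reviews = [result.text.strip() for result in results]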
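A note on the class filter: when BeautifulSoup receives a compiled regular expression as an attribute filter, it tests each value with the pattern's search() method, so any class name that contains 'the_review' matches, not only an exact match. The sketch below shows that behaviour with the re module alone. The class names are invented for illustration and are not taken from the actual page.

```python
import re

pattern = re.compile('the_review')

# Hypothetical class names, invented for illustration
class_names = ['the_review', 'the_review_text', 'review-meta', 'critic-name']

# search() succeeds when the pattern occurs anywhere in the string
matching = [name for name in class_names if pattern.search(name)]
print(matching)  # ['the_review', 'the_review_text']
```

If the page markup changes and no class contains 'the_review', find_all returns an empty list, so it is worth checking that reviews is not empty before continuing.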

Creating pandas dataframe:

df = pd.DataFrame(np.array(reviews), columns=['review'])

Adding a score column for each review:

df['sentiment'] = df['review'].apply(sentiment_to_number)

Another small trick: add a background color gradient to the ‘sentiment’ column:

df.style.background_gradient(cmap='RdYlGn').set_properties(subset=['review'], **{'width': '300px'})