Transformer Sentiment Analysis With PyTorch

Introduction

This guide provides a complete implementation of a binary sentiment classifier using a Transformer architecture in PyTorch. The model classifies sentences as either positive (1) or negative (0) sentiment.

Key Features of This Implementation:
  • Custom Transformer encoder architecture with 4 layers and 4 attention heads
  • Dropout regularization (rate=0.2) to prevent overfitting
  • Early stopping during training with patience of 10 epochs
  • L2 regularization (weight decay=1e-5)
  • Vocabulary-based text preprocessing with padding
  • 80/20 train/test split with random state for reproducibility
Learning Objectives: After studying this implementation, you should be able to:
  • Understand the Transformer architecture for NLP tasks
  • Preprocess text data for Transformer models
  • Implement a classification head on top of a Transformer encoder
  • Apply regularization techniques in PyTorch
  • Evaluate model performance on sentiment analysis

Data Definition

The dataset consists of 25 positive and 25 negative example sentences for binary sentiment classification.

Positive and Negative Sentences
positive_sentences = [
    "The weather today is absolutely perfect.",
    "I love spending time with my friends.",
    "She always makes me laugh and feel happy.",
    # ... (additional positive sentences)
]

negative_sentences = [
    "I feel like I'm running out of time.",
    "The workload is overwhelming right now.",
    # ... (additional negative sentences)
]
Dataset Characteristics:
  • Balance: Equal number of positive and negative examples (25 each)
  • Length: Sentences vary from 5 to 15 words
  • Content: Everyday language expressing clear sentiment
Note on Dataset Size: This is a small dataset designed for demonstration purposes. For production use, you would typically want thousands or millions of examples. The small size helps illustrate the training process quickly but may lead to overfitting.
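Before preprocessing, the two lists are typically merged into a single labeled dataset; a minimal sketch (all_sentences and labels are assumed names, not shown in the original code):

all_sentences = positive_sentences + negative_sentences
labels = [1] * len(positive_sentences) + [0] * len(negative_sentences)  # 1 = positive, 0 = negative

print(len(all_sentences), sum(labels))  # 50 sentences total, 25 of them positive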

Data Preprocessing

Text preprocessing is essential to prepare raw text for machine learning models. Here we implement several key transformations:

Text Cleaning Function
def preprocess(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    return text
Preprocessing Steps:
  • Punctuation Removal: Uses string translation to strip all punctuation
  • Lowercasing: Converts all text to lowercase for consistency
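For example, applying the function to one of the sample sentences strips the period and lowercases everything (a minimal usage sketch; preprocess relies on the string module being imported):

import string

print(preprocess("The weather today is absolutely perfect."))
# -> "the weather today is absolutely perfect"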
Vocabulary Creation
def create_vocab(data):
    # Tokenize the sentences
    tokens = [sentence.split() for sentence in data]
    # Flatten the list of tokens
    tokens = [item for sublist in tokens for item in sublist]
    # Create a Counter object to count word frequencies
    counter = Counter(tokens)
    # Create vocabulary dictionary with word to index mapping
    vocab = {word: idx+1 for idx, (word, _) in enumerate(counter.most_common())}
    vocab['<PAD>'] = 0  # Add padding token
    return vocab

max_length = 15  # Maximum sequence length
Vocabulary Construction:
  • Tokenization: Splits sentences into individual words
  • Frequency Counting: Uses Counter to track word occurrences
  • Index Assignment: Maps words to numerical indices (1-based)
  • Special Tokens: Adds <PAD> token at index 0 for padding
  • Sequence Length: Sets maximum length to 15 tokens (longer sequences truncated)
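To see the vocabulary in action, here is a small usage sketch (cleaned_sentences is an assumed variable name for the preprocessed sentences, not part of the original code):

from collections import Counter

cleaned_sentences = [preprocess(s) for s in positive_sentences + negative_sentences]
vocab = create_vocab(cleaned_sentences)

print(len(vocab))      # Vocabulary size, including the <PAD> token
print(vocab['<PAD>'])  # 0 -- the padding index
# The most frequent word gets index 1, the next most frequent gets 2, and so on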
Text to Tensor Conversion
def sentence_to_tensor(sentence, vocab, max_length=15):
    # Tokenize the sentence
    tokens = sentence.split()
    # Convert words to vocabulary indices
    indices = [vocab.get(token, 0) for token in tokens]  # 0 for unknown tokens
    # Pad or truncate the sequence
    if len(indices) < max_length:
        indices += [0] * (max_length - len(indices))  # Pad with 0
    else:
        indices = indices[:max_length]  # Truncate to max_length
    return torch.tensor(indices)
Tensor Conversion Process:
  • Tokenization: Splits sentence into words
  • Index Lookup: Converts words to their vocabulary indices
  • Unknown Words: Maps out-of-vocabulary words to 0, the padding index (a dedicated <UNK> token would keep unknown words distinct from padding)
  • Padding/Truncation: Ensures all sequences are exactly max_length
  • Tensor Creation: Converts final sequence to PyTorch tensor
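For instance, a seven-word sentence is padded out to the full 15 positions (assuming the vocab built above):

example = sentence_to_tensor(preprocess("I love spending time with my friends."), vocab)
print(example.shape)  # torch.Size([15])
print(example)        # First 7 entries are word indices, the remaining 8 are 0 (<PAD>)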
Vocabulary Tips:
  • In production, you might want to set a minimum word frequency threshold
  • Consider using subword tokenization for better handling of rare words
  • For larger datasets, pre-trained embeddings can boost performance
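The training loop later in this guide expects X_train, X_test, y_train, and y_test tensors. The original code that builds them is not shown, but a sketch consistent with the 80/20 split and fixed random state listed in the key features could look like the following, reusing all_sentences and labels from the Data Definition sketch (random_state=42 is an assumption):

import torch
from sklearn.model_selection import train_test_split

# Clean each sentence and convert it to a fixed-length index tensor
X = torch.stack([sentence_to_tensor(preprocess(s), vocab, max_length) for s in all_sentences])
y = torch.tensor(labels, dtype=torch.float32)

# Split indices 80/20 with a fixed random state, then index the tensors
train_idx, test_idx = train_test_split(
    list(range(len(all_sentences))), test_size=0.2, random_state=42, stratify=labels
)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]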

Model Architecture

The model uses a Transformer encoder followed by a classification head. Here's the detailed implementation:

Model Architecture Diagram

[Input] → [Embedding Layer] → [Transformer Encoder] → [Flatten] → [FC Layer] → [Output]

Transformer Classifier Implementation
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_heads, num_layers, 
                 hidden_dim, num_classes, dropout_rate=0.2):
        super(TransformerClassifier, self).__init__()
        
        # Embedding layer converts token indices to vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # Transformer encoder layers
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim,       # Dimension of embeddings
            nhead=num_heads,            # Number of attention heads
            dim_feedforward=hidden_dim, # Dimension of feedforward network
            batch_first=True,           # Input shape: (batch, seq, feature)
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        # Regularization
        self.dropout = nn.Dropout(dropout_rate)
        
        # Classification head
        self.fc = nn.Linear(embedding_dim * max_length, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        # Embed the input tokens
        embedded = self.embedding(x)
        
        # Process through transformer
        transformer_out = self.transformer(embedded)
        
        # Flatten for classification
        flattened = transformer_out.view(transformer_out.size(0), -1)
        
        # Fully connected layers with dropout
        hidden = torch.relu(self.fc(flattened))
        hidden = self.dropout(hidden)
        
        # Final output
        output = self.out(hidden)
        return self.sigmoid(output)
Architecture Components:
  1. Embedding Layer: Maps token indices to dense vectors (32-dim)
  2. Transformer Encoder:
    • 4 layers with 4 attention heads each
    • Hidden dimension of 64 units
    • Batch-first processing
  3. Regularization: Dropout with rate 0.2
  4. Classification Head:
    • Flattened transformer outputs
    • One hidden layer with ReLU activation
    • Sigmoid output for binary classification
Model Initialization
# Initialize the model with specified parameters
model = TransformerClassifier(
    vocab_size=len(vocab),   # Size of vocabulary
    embedding_dim=32,        # Dimension of token embeddings
    num_heads=4,            # Number of attention heads
    num_layers=4,           # Number of transformer layers
    hidden_dim=64,          # Dimension of hidden layers
    num_classes=1,          # Binary classification
    dropout_rate=0.2        # Dropout probability
)
Hyperparameter Choices:
Parameter      Value   Rationale
embedding_dim  32      Balance between capacity and computational cost
num_heads      4       Standard for small models, divides embedding_dim evenly
num_layers     4       Deep enough to learn complex patterns
hidden_dim     64      Sufficient capacity for this task
dropout_rate   0.2     Moderate regularization
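A quick sanity check of the forward pass with a dummy batch confirms the expected output shape (the batch size of 8 is arbitrary):

# Batch of 8 random index sequences of length max_length
dummy = torch.randint(0, len(vocab), (8, max_length))
with torch.no_grad():
    probs = model(dummy)
print(probs.shape)  # torch.Size([8, 1]) -- one sigmoid probability per sentence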

Training Process

The training process includes loss function definition, optimizer setup, and the training loop with early stopping.

Training Configuration
# Binary Cross Entropy loss for classification
criterion = nn.BCELoss()

# Adam optimizer with weight decay (L2 regularization)
optimizer = optim.Adam(
    model.parameters(),
    lr=0.0005,          # Learning rate
    weight_decay=1e-5    # L2 regularization strength
)

# Early stopping parameters
patience = 10           # Number of epochs to wait before stopping
best_loss = float('inf') # Track best validation loss
patience_counter = 0    # Count epochs without improvement
Training Setup:
  • Loss Function: BCELoss is appropriate for binary classification with sigmoid output
  • Optimizer: Adam with:
    • Low learning rate (0.0005) for stable training
    • Weight decay (1e-5) for L2 regularization
  • Early Stopping: Prevents overfitting by monitoring validation loss
Training Loop
number_epochs = 200  # Maximum training epochs

for epoch in range(number_epochs):
    # Training phase
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    outputs = model(X_train.long()).squeeze()
    loss = criterion(outputs, y_train.float())
    
    # Backward pass and optimize
    loss.backward()
    optimizer.step()
    
    # Validation phase
    model.eval()
    with torch.no_grad():
        val_outputs = model(X_test.long()).squeeze()
        val_loss = criterion(val_outputs, y_test.float())
    
    # Early stopping check
    if val_loss < best_loss:
        best_loss = val_loss
        patience_counter = 0
    else:
        patience_counter += 1
    
    # Stop if no improvement for 'patience' epochs
    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch+1}")
        break
    
    # Print progress
    print(f"Epoch {epoch+1}/{number_epochs} - "
          f"Train Loss: {loss.item():.4f} - "
          f"Val Loss: {val_loss.item():.4f}")
Training Dynamics:
  1. Training Mode: Set model to train mode (enables dropout)
  2. Forward Pass: Compute predictions and loss
  3. Backward Pass: Compute gradients and update weights
  4. Validation: Evaluate on the held-out test set without gradient computation (for this demo the test set doubles as a validation set; in practice, keep a separate validation split)
  5. Early Stopping: Monitor validation loss for improvement
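One detail the loop above leaves out is restoring the best weights once early stopping triggers. A common extension (a sketch, not part of the original code) snapshots the best state and reloads it after training:

import copy

best_state = None  # will hold the best weights seen during training

# Inside the loop, extend the early-stopping check:
if val_loss < best_loss:
    best_loss = val_loss
    patience_counter = 0
    best_state = copy.deepcopy(model.state_dict())  # snapshot the best weights
else:
    patience_counter += 1

# After the loop finishes (or stops early):
if best_state is not None:
    model.load_state_dict(best_state)  # restore the best checkpoint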
Training Tips:
  • Monitor both training and validation loss to detect overfitting
  • Consider learning rate scheduling if loss plateaus
  • For larger models, gradient clipping can help stability
  • Experiment with different batch sizes if memory permits
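The scheduling and clipping suggestions above take only a few extra lines; a sketch using ReduceLROnPlateau and clip_grad_norm_ (the factor, patience, and max_norm values are assumptions):

# Halve the learning rate if validation loss has not improved for 5 epochs
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5)

# Inside the training loop, after loss.backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # keep gradients bounded

# Inside the training loop, after computing val_loss:
scheduler.step(val_loss)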

Model Evaluation

After training, we evaluate the model's performance on the test set.

Evaluation Code
# Set model to evaluation mode
model.eval()

# Disable gradient calculation for inference
with torch.no_grad():
    # Get model predictions
    test_outputs = model(X_test.long()).squeeze()
    
    # Convert probabilities to binary predictions (threshold 0.5)
    predicted_labels = (test_outputs > 0.5).float()
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test.numpy(), predicted_labels.numpy())
    
    # Print results
    print(f"Test Accuracy: {accuracy:.4f}")
Evaluation Metrics:
  • Accuracy: Percentage of correct predictions
  • Threshold: 0.5 for binary classification
  • Evaluation Mode: model.eval() disables dropout so predictions are deterministic (this model has no batch normalization layers)
Potential Additional Metrics:
  • Precision, Recall, F1-score
  • Confusion matrix
  • ROC curve and AUC
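A sketch of how those additional metrics could be computed with scikit-learn on the same test predictions:

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = y_test.numpy()
y_pred = predicted_labels.numpy()

print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.4f}")
print("Confusion matrix:")
print(confusion_matrix(y_true, y_pred))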

Additional Resources

Improvement Ideas

  • Use pre-trained word embeddings (GloVe, Word2Vec)
  • Implement learning rate scheduling
  • Add gradient clipping
  • Experiment with different attention mechanisms
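For the first idea, pre-trained vectors can be copied into the embedding layer before training. A minimal sketch, assuming a dictionary pretrained_vectors that maps words to 32-dimensional vectors (a hypothetical name; the vector dimension must match embedding_dim):

import numpy as np
import torch

embedding_dim = 32
# Start from small random vectors, then overwrite rows for words that have pre-trained vectors
embedding_matrix = np.random.normal(scale=0.1, size=(len(vocab), embedding_dim))
for word, idx in vocab.items():
    if word in pretrained_vectors:  # hypothetical {word: vector} lookup
        embedding_matrix[idx] = pretrained_vectors[word]

# Copy the matrix into the model's embedding layer
model.embedding.weight.data.copy_(torch.tensor(embedding_matrix, dtype=torch.float32))
# Optionally freeze the embeddings so they are not updated during training:
# model.embedding.weight.requires_grad = False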

Complete Implementation

For reference, here's the complete code with all imports:

Full Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import string
from collections import Counter

# [All previous code sections combined...]

# Final evaluation
model.eval()
with torch.no_grad():
    output_test = model(X_test.long()).squeeze()
    predicted_labels = (output_test > 0.5).float()
    accuracy = accuracy_score(y_test.numpy(), predicted_labels.numpy())
    print(f"Test Accuracy: {accuracy:.4f}")


Note: This implementation is designed for educational purposes and demonstrates core concepts.
