Transformer-Based Sentiment Analysis with PyTorch
Introduction
This guide provides a complete implementation of a binary sentiment classifier using a Transformer architecture in PyTorch. The model classifies sentences as either positive (1) or negative (0) sentiment.
Key features of this implementation:
- Custom Transformer encoder architecture with 4 layers and 4 attention heads
- Dropout regularization (rate=0.2) to prevent overfitting
- Early stopping during training with patience of 10 epochs
- L2 regularization (weight decay=1e-5)
- Vocabulary-based text preprocessing with padding
- 80/20 train/test split with random state for reproducibility
By the end of this guide, you will be able to:
- Understand the Transformer architecture for NLP tasks
- Preprocess text data for Transformer models
- Implement a classification head on top of a Transformer encoder
- Apply regularization techniques in PyTorch
- Evaluate model performance on sentiment analysis
Data Definition
The dataset consists of 25 positive and 25 negative example sentences for binary sentiment classification.
positive_sentences = [
"The weather today is absolutely perfect.",
"I love spending time with my friends.",
"She always makes me laugh and feel happy.",
# ... (additional positive sentences)
]
negative_sentences = [
"I feel like I'm running out of time.",
"The workload is overwhelming right now.",
# ... (additional negative sentences)
]
Dataset characteristics:
- Balance: Equal number of positive and negative examples (25 each)
- Length: Sentences vary from 5 to 15 words
- Content: Everyday language expressing clear sentiment
Data Preprocessing
Text preprocessing is essential to prepare raw text for machine learning models. Here we implement several key transformations:
def preprocess(text):
    # Remove punctuation using a translation table
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    return text
- Punctuation Removal: Uses string translation to strip all punctuation
- Lowercasing: Converts all text to lowercase for consistency
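For example, applied to the first positive sentence, preprocess strips the trailing period and lowercases the text:

print(preprocess("The weather today is absolutely perfect."))
# the weather today is absolutely perfect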
def create_vocab(data):
    # Tokenize the sentences on whitespace
    tokens = [sentence.split() for sentence in data]
    # Flatten the list of token lists
    tokens = [item for sublist in tokens for item in sublist]
    # Count word frequencies
    counter = Counter(tokens)
    # Map each word to a numerical index (most frequent words first)
    vocab = {word: idx + 1 for idx, (word, _) in enumerate(counter.most_common())}
    vocab['<PAD>'] = 0  # Add padding token
    return vocab
max_length = 15 # Maximum sequence length
- Tokenization: Splits sentences into individual words
- Frequency Counting: Uses Counter to track word occurrences
- Index Assignment: Maps words to numerical indices (1-based)
- Special Tokens: Adds <PAD> token at index 0 for padding
- Sequence Length: Sets maximum length to 15 tokens (longer sequences truncated)
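As a small illustration (the two-sentence input below is made up for this example), create_vocab assigns the lowest indices to the most frequent words, keeps first-seen order for ties, and reserves index 0 for the padding token:

sample_vocab = create_vocab(["the weather is perfect", "the workload is overwhelming"])
# {'the': 1, 'is': 2, 'weather': 3, 'perfect': 4, 'workload': 5, 'overwhelming': 6, '<PAD>': 0}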
def sentence_to_tensor(sentence, vocab, max_length=15):
    # Tokenize the sentence on whitespace
    tokens = sentence.split()
    # Convert words to vocabulary indices (0 for unknown tokens)
    indices = [vocab.get(token, 0) for token in tokens]
    # Pad or truncate the sequence to max_length
    if len(indices) < max_length:
        indices += [0] * (max_length - len(indices))  # Pad with 0
    else:
        indices = indices[:max_length]  # Truncate to max_length
    return torch.tensor(indices)
- Tokenization: Splits sentence into words
- Index Lookup: Converts words to their vocabulary indices
- Unknown Words: Maps out-of-vocabulary words to 0 (padding index)
- Padding/Truncation: Ensures all sequences are exactly max_length
- Tensor Creation: Converts final sequence to PyTorch tensor
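Continuing the small vocabulary example above, a four-word sentence becomes four indices followed by eleven padding zeros:

print(sentence_to_tensor("the weather is perfect", sample_vocab))
# tensor([1, 3, 2, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])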
Practical notes:
- In production, you might want to set a minimum word frequency threshold
- Consider using subword tokenization for better handling of rare words
- For larger datasets, pre-trained embeddings can boost performance
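The training and evaluation code below refers to X_train, X_test, y_train, and y_test. A minimal sketch of how the preprocessed sentences might be assembled into tensors and split 80/20 is shown here; the random_state value and variable names are assumptions, not taken from the original code:

# Preprocess all sentences and build labels (1 = positive, 0 = negative)
all_sentences = [preprocess(s) for s in positive_sentences + negative_sentences]
labels = [1] * len(positive_sentences) + [0] * len(negative_sentences)

# Build the vocabulary and convert sentences to fixed-length index tensors
vocab = create_vocab(all_sentences)
X = torch.stack([sentence_to_tensor(s, vocab, max_length) for s in all_sentences])
y = torch.tensor(labels, dtype=torch.float)

# 80/20 train/test split with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # random_state value is an assumption
)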
Model Architecture
The model uses a Transformer encoder followed by a classification head. Here's the detailed implementation:
Model Architecture Diagram
[Input] → [Embedding Layer] → [Transformer Encoder] → [Flatten] → [FC Layer] → [Output]
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_heads, num_layers,
                 hidden_dim, num_classes, dropout_rate=0.2):
        super(TransformerClassifier, self).__init__()
        # Embedding layer converts token indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Transformer encoder layers
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim,       # Dimension of embeddings
            nhead=num_heads,             # Number of attention heads
            dim_feedforward=hidden_dim,  # Dimension of feedforward network
            batch_first=True,            # Input shape: (batch, seq, feature)
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Regularization
        self.dropout = nn.Dropout(dropout_rate)
        # Classification head
        self.fc = nn.Linear(embedding_dim * max_length, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Embed the input tokens
        embedded = self.embedding(x)
        # Process through the transformer encoder
        transformer_out = self.transformer(embedded)
        # Flatten for classification
        flattened = transformer_out.view(transformer_out.size(0), -1)
        # Fully connected layer with ReLU and dropout
        hidden = torch.relu(self.fc(flattened))
        hidden = self.dropout(hidden)
        # Final output probability
        output = self.out(hidden)
        return self.sigmoid(output)
- Embedding Layer: Maps token indices to dense vectors (32-dim)
- Transformer Encoder:
  - 4 layers with 4 attention heads each
  - Hidden dimension of 64 units
  - Batch-first processing
- Regularization: Dropout with rate 0.2
- Classification Head:
  - Flattened transformer outputs
  - One hidden layer with ReLU activation
  - Sigmoid output for binary classification
# Initialize the model with the specified parameters
model = TransformerClassifier(
    vocab_size=len(vocab),  # Size of vocabulary
    embedding_dim=32,       # Dimension of token embeddings
    num_heads=4,            # Number of attention heads
    num_layers=4,           # Number of transformer layers
    hidden_dim=64,          # Dimension of hidden layers
    num_classes=1,          # Binary classification
    dropout_rate=0.2        # Dropout probability
)
| Parameter | Value | Rationale |
|---|---|---|
| embedding_dim | 32 | Balance between capacity and computational cost |
| num_heads | 4 | Standard for small models; divides embedding_dim evenly |
| num_layers | 4 | Deep enough to learn complex patterns |
| hidden_dim | 64 | Sufficient capacity for this task |
| dropout_rate | 0.2 | Moderate regularization |
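As a quick sanity check (not part of the original code), a dummy batch of two padded index sequences can be pushed through the untrained model to confirm the output shape:

# Batch of 2 random index sequences of length max_length
dummy_input = torch.randint(0, len(vocab), (2, max_length))
with torch.no_grad():
    probs = model(dummy_input)
print(probs.shape)  # torch.Size([2, 1]) -- one sigmoid probability per sentence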
Training Process
The training process includes loss function definition, optimizer setup, and the training loop with early stopping.
# Binary Cross Entropy loss for classification
criterion = nn.BCELoss()

# Adam optimizer with weight decay (L2 regularization)
optimizer = optim.Adam(
    model.parameters(),
    lr=0.0005,         # Learning rate
    weight_decay=1e-5  # L2 regularization strength
)

# Early stopping parameters
patience = 10             # Number of epochs to wait before stopping
best_loss = float('inf')  # Track best validation loss
patience_counter = 0      # Count epochs without improvement
- Loss Function: BCELoss is appropriate for binary classification with sigmoid output
- Optimizer: Adam with:
  - Low learning rate (0.0005) for stable training
  - Weight decay (1e-5) for L2 regularization
- Early Stopping: Prevents overfitting by monitoring validation loss
number_epochs = 200  # Maximum training epochs

for epoch in range(number_epochs):
    # Training phase
    model.train()
    optimizer.zero_grad()

    # Forward pass
    outputs = model(X_train.long()).squeeze()
    loss = criterion(outputs, y_train.float())

    # Backward pass and optimization step
    loss.backward()
    optimizer.step()

    # Validation phase
    model.eval()
    with torch.no_grad():
        val_outputs = model(X_test.long()).squeeze()
        val_loss = criterion(val_outputs, y_test.float())

    # Early stopping check
    if val_loss < best_loss:
        best_loss = val_loss
        patience_counter = 0
    else:
        patience_counter += 1

    # Stop if no improvement for 'patience' epochs
    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch+1}")
        break

    # Print progress
    print(f"Epoch {epoch+1}/{number_epochs} - "
          f"Train Loss: {loss.item():.4f} - "
          f"Val Loss: {val_loss.item():.4f}")
- Training Mode: Set model to train mode (enables dropout)
- Forward Pass: Compute predictions and loss
- Backward Pass: Compute gradients and update weights
- Validation: Evaluate on test set without gradient computation
- Early Stopping: Monitor validation loss for improvement
Training tips:
- Monitor both training and validation loss to detect overfitting
- Consider learning rate scheduling if loss plateaus (see the sketch below)
- For larger models, gradient clipping can help stability (see the sketch below)
- Experiment with different batch sizes if memory permits
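A minimal sketch of how the scheduling and clipping suggestions above could be wired into the existing training loop. ReduceLROnPlateau and the max_norm value are illustrative choices, not part of the original code (early stopping omitted here for brevity):

# Reduce the learning rate when validation loss stops improving
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)

for epoch in range(number_epochs):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train.long()).squeeze()
    loss = criterion(outputs, y_train.float())
    loss.backward()
    # Clip gradients before the optimizer step to limit update size
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_outputs = model(X_test.long()).squeeze()
        val_loss = criterion(val_outputs, y_test.float())
    # Step the scheduler on the monitored validation loss
    scheduler.step(val_loss)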
Model Evaluation
After training, we evaluate the model's performance on the test set.
# Set the model to evaluation mode
model.eval()

# Disable gradient calculation for inference
with torch.no_grad():
    # Get model predictions
    test_outputs = model(X_test.long()).squeeze()
    # Convert probabilities to binary predictions (threshold 0.5)
    predicted_labels = (test_outputs > 0.5).float()
    # Calculate accuracy
    accuracy = accuracy_score(y_test.numpy(), predicted_labels.numpy())

# Print results
print(f"Test Accuracy: {accuracy:.4f}")
- Accuracy: Percentage of correct predictions
- Threshold: 0.5 for binary classification
- Evaluation Mode: Disables dropout (and switches layers such as batch normalization to inference behavior)
Other metrics worth reporting (a sketch follows this list):
- Precision, Recall, F1-score
- Confusion matrix
- ROC curve and AUC
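A minimal sketch of computing these with scikit-learn, reusing test_outputs and predicted_labels from the evaluation code above:

from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix, roc_auc_score)

y_true = y_test.numpy()
y_pred = predicted_labels.numpy()
y_prob = test_outputs.numpy()  # sigmoid probabilities, used for AUC

print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.4f}")
print("Confusion matrix:")
print(confusion_matrix(y_true, y_pred))
print(f"ROC AUC:   {roc_auc_score(y_true, y_prob):.4f}")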
Additional Resources
Further Learning
- Original Transformer Paper (Vaswani et al.)
- PyTorch Transformer Documentation
- The Annotated Transformer
Improvement Ideas
- Use pre-trained word embeddings (GloVe, Word2Vec)
- Implement learning rate scheduling
- Add gradient clipping
- Experiment with different attention mechanisms
Complete Implementation
For reference, here's the complete code with all imports:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import string
from collections import Counter
# [All previous code sections combined...]
# Final evaluation
model.eval()
with torch.no_grad():
    output_test = model(X_test.long()).squeeze()
    predicted_labels = (output_test > 0.5).float()
    accuracy = accuracy_score(y_test.numpy(), predicted_labels.numpy())

print(f"Test Accuracy: {accuracy:.4f}")