Transformer-Based Sentiment Analysis with PyTorch
Introduction
This guide provides a complete implementation of a binary sentiment classifier using a Transformer architecture in PyTorch. The model classifies sentences as either positive (1) or negative (0) sentiment.
Key features of this implementation:
- Custom Transformer encoder architecture with 4 layers and 4 attention heads
- Dropout regularization (rate=0.2) to prevent overfitting
- Early stopping during training with patience of 10 epochs
- L2 regularization (weight decay=1e-5)
- Vocabulary-based text preprocessing with padding
- 80/20 train/test split with random state for reproducibility
By the end of this guide, you will be able to:
- Understand the Transformer architecture for NLP tasks
- Preprocess text data for Transformer models
- Implement a classification head on top of a Transformer encoder
- Apply regularization techniques in PyTorch
- Evaluate model performance on sentiment analysis
Data Definition
The dataset consists of 25 positive and 25 negative example sentences for binary sentiment classification.
positive_sentences = [
"The weather today is absolutely perfect.",
"I love spending time with my friends.",
"She always makes me laugh and feel happy.",
# ... (additional positive sentences)
]
negative_sentences = [
"I feel like I'm running out of time.",
"The workload is overwhelming right now.",
# ... (additional negative sentences)
]
Dataset characteristics:
- Balance: Equal number of positive and negative examples (25 each)
- Length: Sentences vary from 5 to 15 words
- Content: Everyday language expressing clear sentiment
Data Preprocessing
Text preprocessing is essential to prepare raw text for machine learning models. Here we implement several key transformations:
def preprocess(text):
    # Remove punctuation using a translation table
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    return text
- Punctuation Removal: Uses string translation to strip all punctuation
- Lowercasing: Converts all text to lowercase for consistency
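For example, applied to the first positive sentence, preprocess strips the trailing period and lowercases the text:

print(preprocess("The weather today is absolutely perfect."))
# the weather today is absolutely perfect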
def create_vocab(data):
    # Tokenize the sentences on whitespace
    tokens = [sentence.split() for sentence in data]
    # Flatten the list of token lists
    tokens = [item for sublist in tokens for item in sublist]
    # Count word frequencies
    counter = Counter(tokens)
    # Map each word to a numerical index (most frequent words first)
    vocab = {word: idx + 1 for idx, (word, _) in enumerate(counter.most_common())}
    vocab['<PAD>'] = 0  # Add padding token
    return vocab
max_length = 15 # Maximum sequence length
- Tokenization: Splits sentences into individual words
- Frequency Counting: Uses Counter to track word occurrences
- Index Assignment: Maps words to numerical indices (1-based)
- Special Tokens: Adds <PAD> token at index 0 for padding
- Sequence Length: Sets maximum length to 15 tokens (longer sequences truncated)
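As a small illustration (the two-sentence input below is made up for this example), create_vocab assigns the lowest indices to the most frequent words, keeps first-seen order for ties, and reserves index 0 for the padding token:

sample_vocab = create_vocab(["the weather is perfect", "the workload is overwhelming"])
# {'the': 1, 'is': 2, 'weather': 3, 'perfect': 4, 'workload': 5, 'overwhelming': 6, '<PAD>': 0}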
def sentence_to_tensor(sentence, vocab, max_length=15):
    # Tokenize the sentence on whitespace
    tokens = sentence.split()
    # Convert words to vocabulary indices (0 for unknown tokens)
    indices = [vocab.get(token, 0) for token in tokens]
    # Pad or truncate the sequence to max_length
    if len(indices) < max_length:
        indices += [0] * (max_length - len(indices))  # Pad with 0
    else:
        indices = indices[:max_length]  # Truncate to max_length
    return torch.tensor(indices)
- Tokenization: Splits sentence into words
- Index Lookup: Converts words to their vocabulary indices
- Unknown Words: Maps out-of-vocabulary words to 0 (padding index)
- Padding/Truncation: Ensures all sequences are exactly max_length
- Tensor Creation: Converts final sequence to PyTorch tensor
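Continuing the small vocabulary example above, a four-word sentence becomes four indices followed by eleven padding zeros:

print(sentence_to_tensor("the weather is perfect", sample_vocab))
# tensor([1, 3, 2, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])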
Practical notes:
- In production, you might want to set a minimum word frequency threshold
- Consider using subword tokenization for better handling of rare words
- For larger datasets, pre-trained embeddings can boost performance
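The training and evaluation code below refers to X_train, X_test, y_train, and y_test. A minimal sketch of how the preprocessed sentences might be assembled into tensors and split 80/20 is shown here; the random_state value and variable names are assumptions, not taken from the original code:

# Preprocess all sentences and build labels (1 = positive, 0 = negative)
all_sentences = [preprocess(s) for s in positive_sentences + negative_sentences]
labels = [1] * len(positive_sentences) + [0] * len(negative_sentences)

# Build the vocabulary and convert sentences to fixed-length index tensors
vocab = create_vocab(all_sentences)
X = torch.stack([sentence_to_tensor(s, vocab, max_length) for s in all_sentences])
y = torch.tensor(labels, dtype=torch.float)

# 80/20 train/test split with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # random_state value is an assumption
)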
Model Architecture
The model uses a Transformer encoder followed by a classification head. Here's the detailed implementation:
Model Architecture Diagram
[Input] → [Embedding Layer] → [Transformer Encoder] → [Flatten] → [FC Layer] → [Output]
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_heads, num_layers,
                 hidden_dim, num_classes, dropout_rate=0.2):
        super(TransformerClassifier, self).__init__()
        # Embedding layer converts token indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Transformer encoder layers
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim,       # Dimension of embeddings
            nhead=num_heads,             # Number of attention heads
            dim_feedforward=hidden_dim,  # Dimension of feedforward network
            batch_first=True,            # Input shape: (batch, seq, feature)
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Regularization
        self.dropout = nn.Dropout(dropout_rate)
        # Classification head
        self.fc = nn.Linear(embedding_dim * max_length, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Embed the input tokens
        embedded = self.embedding(x)
        # Process through the transformer encoder
        transformer_out = self.transformer(embedded)
        # Flatten for classification
        flattened = transformer_out.view(transformer_out.size(0), -1)
        # Fully connected layer with ReLU and dropout
        hidden = torch.relu(self.fc(flattened))
        hidden = self.dropout(hidden)
        # Final output probability
        output = self.out(hidden)
        return self.sigmoid(output)
- Embedding Layer: Maps token indices to dense vectors (32-dim)
- Transformer Encoder:
  - 4 layers with 4 attention heads each
  - Hidden dimension of 64 units
  - Batch-first processing
- Regularization: Dropout with rate 0.2
- Classification Head:
  - Flattened transformer outputs
  - One hidden layer with ReLU activation
  - Sigmoid output for binary classification
# Initialize the model with the specified parameters
model = TransformerClassifier(
    vocab_size=len(vocab),  # Size of vocabulary
    embedding_dim=32,       # Dimension of token embeddings
    num_heads=4,            # Number of attention heads
    num_layers=4,           # Number of transformer layers
    hidden_dim=64,          # Dimension of hidden layers
    num_classes=1,          # Binary classification
    dropout_rate=0.2        # Dropout probability
)
| Parameter | Value | Rationale |
|---|---|---|
| embedding_dim | 32 | Balance between capacity and computational cost |
| num_heads | 4 | Standard for small models; divides embedding_dim evenly |
| num_layers | 4 | Deep enough to learn complex patterns |
| hidden_dim | 64 | Sufficient capacity for this task |
| dropout_rate | 0.2 | Moderate regularization |
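As a quick sanity check (not part of the original code), a dummy batch of two padded index sequences can be pushed through the untrained model to confirm the output shape:

# Batch of 2 random index sequences of length max_length
dummy_input = torch.randint(0, len(vocab), (2, max_length))
with torch.no_grad():
    probs = model(dummy_input)
print(probs.shape)  # torch.Size([2, 1]) -- one sigmoid probability per sentence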
Training Process
The training process includes loss function definition, optimizer setup, and the training loop with early stopping.
# Binary Cross Entropy loss for classification
criterion = nn.BCELoss()

# Adam optimizer with weight decay (L2 regularization)
optimizer = optim.Adam(
    model.parameters(),
    lr=0.0005,         # Learning rate
    weight_decay=1e-5  # L2 regularization strength
)

# Early stopping parameters
patience = 10             # Number of epochs to wait before stopping
best_loss = float('inf')  # Track best validation loss
patience_counter = 0      # Count epochs without improvement
- Loss Function: BCELoss is appropriate for binary classification with sigmoid output
- Optimizer: Adam with:
  - Low learning rate (0.0005) for stable training
  - Weight decay (1e-5) for L2 regularization
- Early Stopping: Prevents overfitting by monitoring validation loss
number_epochs = 200  # Maximum training epochs

for epoch in range(number_epochs):
    # Training phase
    model.train()
    optimizer.zero_grad()

    # Forward pass
    outputs = model(X_train.long()).squeeze()
    loss = criterion(outputs, y_train.float())

    # Backward pass and optimization step
    loss.backward()
    optimizer.step()

    # Validation phase
    model.eval()
    with torch.no_grad():
        val_outputs = model(X_test.long()).squeeze()
        val_loss = criterion(val_outputs, y_test.float())

    # Early stopping check
    if val_loss < best_loss:
        best_loss = val_loss
        patience_counter = 0
    else:
        patience_counter += 1

    # Stop if no improvement for 'patience' epochs
    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch+1}")
        break

    # Print progress
    print(f"Epoch {epoch+1}/{number_epochs} - "
          f"Train Loss: {loss.item():.4f} - "
          f"Val Loss: {val_loss.item():.4f}")
- Training Mode: Set model to train mode (enables dropout)
- Forward Pass: Compute predictions and loss
- Backward Pass: Compute gradients and update weights
- Validation: Evaluate on test set without gradient computation
- Early Stopping: Monitor validation loss for improvement
Training tips:
- Monitor both training and validation loss to detect overfitting
- Consider learning rate scheduling if loss plateaus (see the sketch below)
- For larger models, gradient clipping can help stability (see the sketch below)
- Experiment with different batch sizes if memory permits
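A minimal sketch of how the scheduling and clipping suggestions above could be wired into the existing training loop. ReduceLROnPlateau and the max_norm value are illustrative choices, not part of the original code (early stopping omitted here for brevity):

# Reduce the learning rate when validation loss stops improving
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)

for epoch in range(number_epochs):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train.long()).squeeze()
    loss = criterion(outputs, y_train.float())
    loss.backward()
    # Clip gradients before the optimizer step to limit update size
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_outputs = model(X_test.long()).squeeze()
        val_loss = criterion(val_outputs, y_test.float())
    # Step the scheduler on the monitored validation loss
    scheduler.step(val_loss)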
Model Evaluation
After training, we evaluate the model's performance on the test set.
# Set the model to evaluation mode
model.eval()

# Disable gradient calculation for inference
with torch.no_grad():
    # Get model predictions
    test_outputs = model(X_test.long()).squeeze()
    # Convert probabilities to binary predictions (threshold 0.5)
    predicted_labels = (test_outputs > 0.5).float()
    # Calculate accuracy
    accuracy = accuracy_score(y_test.numpy(), predicted_labels.numpy())

# Print results
print(f"Test Accuracy: {accuracy:.4f}")
- Accuracy: Percentage of correct predictions
- Threshold: 0.5 for binary classification
- Evaluation Mode: Disables dropout (and switches layers such as batch normalization to inference behavior)
Other metrics worth reporting (a sketch follows this list):
- Precision, Recall, F1-score
- Confusion matrix
- ROC curve and AUC
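A minimal sketch of computing these with scikit-learn, reusing test_outputs and predicted_labels from the evaluation code above:

from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix, roc_auc_score)

y_true = y_test.numpy()
y_pred = predicted_labels.numpy()
y_prob = test_outputs.numpy()  # sigmoid probabilities, used for AUC

print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.4f}")
print("Confusion matrix:")
print(confusion_matrix(y_true, y_pred))
print(f"ROC AUC:   {roc_auc_score(y_true, y_prob):.4f}")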
Additional Resources
Further Learning
- Original Transformer Paper (Vaswani et al.)
- PyTorch Transformer Documentation
- The Annotated Transformer
Improvement Ideas
- Use pre-trained word embeddings (GloVe, Word2Vec)
- Implement learning rate scheduling
- Add gradient clipping
- Experiment with different attention mechanisms
Complete Implementation
For reference, here's the complete code with all imports:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import string
from collections import Counter
# [All previous code sections combined...]
# Final evaluation
model.eval()
with torch.no_grad():
    output_test = model(X_test.long()).squeeze()
    predicted_labels = (output_test > 0.5).float()
    accuracy = accuracy_score(y_test.numpy(), predicted_labels.numpy())

print(f"Test Accuracy: {accuracy:.4f}")