Neural Network Architectures Overview
1. Artificial Neural Network (ANN)
The fundamental building block of deep learning, consisting of interconnected nodes organized in layers.
Architecture
- Input Layer: Receives the raw input data
- Hidden Layers: One or more layers that transform their inputs through learned weights and nonlinear activation functions
- Output Layer: Produces the final prediction or classification
Key Equations
output = activation(Wx + b)
where:
W = weight matrix
x = input vector
b = bias vector
activation = nonlinear function (ReLU, sigmoid, tanh)
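As a concrete illustration, here is a minimal PyTorch sketch of a two-layer ANN; the layer sizes and batch size are placeholders, not tuned values.

```python
import torch
import torch.nn as nn

# Minimal fully connected network: input -> hidden -> output.
# Sizes are illustrative placeholders.
model = nn.Sequential(
    nn.Linear(20, 64),   # output = W x + b for the hidden layer
    nn.ReLU(),           # nonlinear activation
    nn.Linear(64, 3),    # output layer (e.g., 3 classes)
)

x = torch.randn(8, 20)   # batch of 8 feature vectors
logits = model(x)        # shape: (8, 3)
```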
Advantages
- Universal function approximator
- Simple to implement
- Good for structured data
Limitations
- Poor performance with unstructured data (images, text)
- No spatial or temporal awareness
- Can be computationally expensive for large inputs
Resources: PyTorch ANN Tutorial | TensorFlow ANN Tutorial
2. Convolutional Neural Network (CNN)
Specialized architecture for processing grid-like data such as images; convolution provides translation equivariance, and pooling adds a degree of translation invariance.
Core Components
- Convolutional Layers: Apply filters to extract features (edges, textures, patterns)
- Pooling Layers: Reduce spatial dimensions (max pooling, average pooling)
- Fully Connected Layers: Final classification/regression layers
- Common Architectures: LeNet-5, AlexNet, VGG, ResNet, EfficientNet
Key Operations
# 2D convolution operation (pseudocode)
output[b, i, j, :] = sum_{di, dj, k} (
    input[b, strides[1] * i + di, strides[2] * j + dj, k] *
    filter[di, dj, k, :]
) + bias[:]
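The same operation in PyTorch, as a minimal sketch of one convolution-plus-pooling stage; the channel counts and kernel size are illustrative and not taken from any specific architecture.

```python
import torch
import torch.nn as nn

# One convolution + pooling stage with illustrative sizes.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

x = torch.randn(1, 3, 32, 32)          # (batch, channels, height, width)
features = pool(torch.relu(conv(x)))   # shape: (1, 16, 16, 16)
```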
Advantages
- Parameter sharing greatly reduces the number of parameters compared to fully connected layers
- Automatic feature extraction
- Excellent for image/video processing
Limitations
- Computationally intensive
- Requires large datasets for training
- Not ideal for sequential data
Resources: PyTorch CNN Tutorial | Stanford CNN Course
3. Recurrent Neural Network (RNN)
Designed for sequential data by maintaining a hidden state that captures information about previous elements in the sequence.
Architecture Variants
- Vanilla RNN: Basic recurrent unit with simple hidden state update
- Bidirectional RNN: Processes sequence both forward and backward
- Deep RNN: Multiple stacked recurrent layers
Key Equations
h_t = activation(W_xh * x_t + W_hh * h_{t-1} + b_h)
y_t = W_hy * h_t + b_y
where:
h_t = hidden state at time t
x_t = input at time t
y_t = output at time t
W_* = weight matrices
b_* = bias vectors
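A minimal PyTorch sketch of the recurrence above using nn.RNN, which applies the hidden-state update at every time step; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Single-layer RNN: input size 10, hidden size 20 (illustrative values).
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(4, 15, 10)    # (batch, sequence length, features)
outputs, h_n = rnn(x)         # outputs: (4, 15, 20), h_n: (1, 4, 20)
```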
Advantages
- Can process variable-length sequences
- Shares parameters across time steps
- Maintains memory of previous inputs
Limitations
- Suffers from vanishing/exploding gradients
- Difficulty learning long-range dependencies
- Computationally sequential (hard to parallelize)
4. Long Short-Term Memory (LSTM)
Advanced RNN variant designed to overcome the vanishing gradient problem through gated mechanisms.
Core Components
- Forget Gate: Decides what information to discard from cell state
- Input Gate: Updates the cell state with new information
- Output Gate: Determines what to output based on cell state
- Cell State: The "memory" that carries information through the sequence
Key Equations
f_t = σ(W_f · [h_{t-1}, x_t] + b_f) # Forget gate
i_t = σ(W_i · [h_{t-1}, x_t] + b_i) # Input gate
o_t = σ(W_o · [h_{t-1}, x_t] + b_o) # Output gate
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) # Candidate cell state
C_t = f_t * C_{t-1} + i_t * C̃_t # New cell state
h_t = o_t * tanh(C_t) # New hidden state
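The gate equations above are implemented internally by PyTorch's nn.LSTM; a minimal usage sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

# Forget/input/output gates and the cell state are handled inside nn.LSTM.
lstm = nn.LSTM(input_size=10, hidden_size=32, num_layers=1, batch_first=True)

x = torch.randn(4, 25, 10)        # (batch, time steps, features)
outputs, (h_n, c_n) = lstm(x)     # hidden and cell state at the final step
```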
Note: GRU (Gated Recurrent Unit) is a simplified variant of LSTM that combines the forget and input gates into a single "update gate" and merges the cell state and hidden state.
Resources: PyTorch LSTM Tutorial | Understanding LSTMs
5. Generative Adversarial Network (GAN)
A framework for training generative models through an adversarial process involving two competing networks.
Architecture Components
- Generator (G): Creates synthetic data from random noise
- Discriminator (D): Distinguishes between real and generated data
- Training Process: A minimax game in which G tries to fool D while D tries to correctly classify real versus generated samples
Objective Function
min_G max_D V(D,G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G(z)))]
where:
x = real data sample
z = random noise vector
G(z) = generated sample
D(·) = discriminator's probability estimate
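A heavily simplified sketch of one adversarial update, just to show how the objective above maps to code. The names G, D, opt_G, opt_D and noise_dim are hypothetical, D is assumed to end in a sigmoid, and the generator uses the common non-saturating loss rather than the literal minimax form.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, opt_G, opt_D, real, noise_dim=100):
    """One illustrative GAN update; G, D and the optimizers are assumed to exist."""
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch, noise_dim)
    fake = G(z).detach()
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator (non-saturating form): maximize log D(G(z))
    z = torch.randn(batch, noise_dim)
    g_loss = bce(D(G(z)), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```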
GAN Variants
Type | Description | Applications |
---|---|---|
DCGAN | Deep Convolutional GAN with architectural constraints | Image generation |
CycleGAN | Uses cycle consistency for unpaired image-to-image translation | Style transfer, photo enhancement |
StyleGAN | Controls fine details through style-based generation | High-quality face generation |
WGAN | Uses Wasserstein distance for more stable training | General purpose generation |
Resources: PyTorch DCGAN Tutorial | Original GAN Paper
6. Radial Basis Function Network (RBFN)
A feedforward network with radial basis activation functions in the hidden layer, particularly effective for function approximation and classification.
Architecture
- Input Layer: Receives feature vector
- Hidden Layer: Uses radial basis functions (typically Gaussian) centered at specific points
- Output Layer: Linear combination of hidden layer outputs
Key Equations
ϕ(x) = exp(-β||x - c||²) # Gaussian RBF
where:
c = center vector
β = spread parameter
Output: y = Σ w_i * ϕ_i(x) + b
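A compact PyTorch sketch of these equations; the centers, spread β, and output weights below are random placeholders, whereas in practice the centers are usually chosen from the data (e.g., by clustering) and the output weights fit by least squares.

```python
import torch

def rbf_features(x, centers, beta):
    """Gaussian RBF activations: phi_i(x) = exp(-beta * ||x - c_i||^2)."""
    dists = torch.cdist(x, centers) ** 2      # squared Euclidean distances
    return torch.exp(-beta * dists)           # (batch, num_centers)

x = torch.randn(8, 2)
centers = torch.randn(5, 2)                   # illustrative centers
phi = rbf_features(x, centers, beta=1.0)
w, b = torch.randn(5, 1), torch.zeros(1)      # output weights (normally fit to data)
y = phi @ w + b                               # y = sum_i w_i * phi_i(x) + b
```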
Advantages
- Fast training (the output weights can often be fit in a single linear least-squares step)
- Good interpolation capabilities
- Can approximate any continuous function
Limitations
- Curse of dimensionality with high-dim inputs
- Requires careful selection of centers
- Not as flexible as MLPs for complex patterns
Resources: Wikipedia RBFN | RBFN Overview
7. Reinforcement Learning (RL)
A learning paradigm where an agent learns to make decisions by interacting with an environment to maximize cumulative reward.
Key Components
- Agent: The learner/decision maker
- Environment: The world with which the agent interacts
- State (s): Current situation of the agent
- Action (a): Decision made by the agent
- Reward (r): Feedback from the environment
- Policy (π): Strategy that the agent employs
Approaches
Method | Description | Example Algorithms |
---|---|---|
Value-Based | Learn value function to derive policy | Q-Learning, DQN |
Policy-Based | Directly optimize the policy | REINFORCE, PPO |
Model-Based | Learn model of environment dynamics | Dyna, MBMF |
Actor-Critic | Combine value and policy approaches | A3C, SAC |
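As a concrete example of the value-based approach, here is a minimal tabular Q-learning update; the table sizes and the sample transition are illustrative only.

```python
import numpy as np

# Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
n_states, n_actions = 10, 4          # illustrative sizes
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99             # learning rate and discount factor

def q_update(Q, s, a, r, s_next, done):
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])

# One illustrative transition (s=0, a=2, r=1.0, s'=3)
q_update(Q, s=0, a=2, r=1.0, s_next=3, done=False)
```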
Resources: PyTorch RL | OpenAI Spinning Up
8. Autoencoders
Unsupervised neural networks that learn efficient data encodings by training to reconstruct their inputs.
Architecture
- Encoder: Maps input to latent space representation (bottleneck)
- Decoder: Reconstructs input from latent representation
- Objective: Minimize reconstruction error (e.g., MSE, cross-entropy)
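A minimal undercomplete autoencoder sketch in PyTorch; the input and bottleneck dimensions are placeholders.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Undercomplete autoencoder: 784 -> 32 -> 784 (sizes are illustrative)."""
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(16, 784)
loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
```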
Variants
- Undercomplete: Bottleneck layer has lower dimension than input
- Sparse: Adds sparsity constraint to activations
- Denoising: Trained to reconstruct clean inputs from corrupted versions
- Variational (VAE): Learns probabilistic latent space
- Contractive: Adds penalty on encoder's Jacobian
Advantages
- Unsupervised feature learning
- Dimensionality reduction
- Anomaly detection
- Data denoising
Limitations
- May learn trivial identity function
- Quality depends on architecture choices
- VAEs can produce blurry reconstructions
Resources: TensorFlow Autoencoder | From Autoencoder to Beta-VAE
9. Transfer Learning
Technique where knowledge gained from solving one problem is applied to a different but related problem.
Approaches
- Feature Extraction: Use pre-trained model as fixed feature extractor
- Fine-Tuning: Unfreeze some layers and continue training
- Domain Adaptation: Adapt model to new domain with different data distribution
- Multi-Task Learning: Train on multiple related tasks simultaneously
Common Pre-trained Models
Domain | Models | Datasets |
---|---|---|
Computer Vision | ResNet, EfficientNet, VGG, MobileNet | ImageNet, COCO |
NLP | BERT, GPT, RoBERTa, T5 | Wikipedia, BookCorpus, Common Crawl |
Speech | Wav2Vec2, HuBERT | LibriSpeech, Common Voice |
Tip: When using transfer learning, start with a small learning rate (typically about 10x smaller than you would use when training from scratch), since the pre-trained weights are usually already quite good.
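A minimal sketch of the feature-extraction approach with a torchvision ResNet (assuming a recent torchvision release); the number of target classes and the learning rate are placeholders, with a small learning rate per the tip above.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load an ImageNet-pretrained ResNet-18 and freeze its backbone.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for the new task (10 classes is a placeholder).
model.fc = nn.Linear(model.fc.in_features, 10)

# Train only the new head, with a small learning rate.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
```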
Resources: PyTorch Transfer Learning | Keras Transfer Learning
10. Transformers
Attention-based architecture that has revolutionized natural language processing and beyond.
Key Components
- Self-Attention: Computes weighted sum of all input elements
- Multi-Head Attention: Multiple attention heads learn different patterns
- Positional Encoding: Injects information about token positions
- Feedforward Layers: Applied position-wise
- Layer Normalization: Stabilizes training
Attention Mechanism
Attention(Q, K, V) = softmax(QK^T/√d_k)V
where:
Q = Query matrix
K = Key matrix
V = Value matrix
d_k = dimension of keys
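A direct translation of the formula into PyTorch, as a minimal sketch with no masking and a single head; the shapes are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., queries, keys)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V

Q = torch.randn(2, 5, 64)   # (batch, query positions, d_k)
K = torch.randn(2, 7, 64)
V = torch.randn(2, 7, 64)
out = scaled_dot_product_attention(Q, K, V)   # (2, 5, 64)
```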
Transformer Variants
- Encoder-Only: BERT, RoBERTa (good for classification)
- Decoder-Only: GPT family (good for generation)
- Encoder-Decoder: T5, BART (good for translation)
- Vision Transformers: ViT, DeiT (apply to images)
Resources: PyTorch Transformer | Illustrated Transformer
11. Graph Neural Networks (GNN)
Specialized neural networks designed to operate on graph-structured data.
Core Concepts
- Node Embedding: Representation of each node in latent space
- Message Passing: Nodes aggregate information from neighbors
- Graph-Level Tasks: Predict properties of entire graphs
- Node-Level Tasks: Predict properties of individual nodes
- Edge-Level Tasks: Predict properties of edges
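A bare-bones sketch of one GCN-style message-passing step using a dense adjacency matrix; libraries such as PyTorch Geometric use sparse edge lists instead, and the graph and feature sizes below are illustrative.

```python
import torch

def gcn_layer(X, A, W):
    """One GCN-style update: H = ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + torch.eye(A.size(0))              # add self-loops
    deg = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(deg.pow(-0.5))
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)

X = torch.randn(5, 8)         # 5 nodes with 8-dim features (illustrative)
A = torch.tensor([[0, 1, 0, 0, 1],
                  [1, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [1, 0, 0, 1, 0]], dtype=torch.float)
W = torch.randn(8, 16)
H = gcn_layer(X, A, W)        # (5, 16) updated node embeddings
```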
Popular Variants
Type | Description | Use Cases |
---|---|---|
GCN | Graph Convolutional Network | Node classification |
GAT | Graph Attention Network | Tasks requiring edge importance |
GraphSAGE | General inductive framework | Large-scale graphs |
GIN | Graph Isomorphism Network | Graph classification |
Resources: PyTorch Geometric | GNN Introduction
12. Capsule Networks
Alternative to CNNs that aim to better model hierarchical relationships in data.
Key Features
- Capsules: Groups of neurons that represent specific entities
- Dynamic Routing: Agreement mechanism between capsules
- Pose Matrices: Capture spatial relationships
- Equivariance: Preserves spatial information
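As a small concrete piece, the "squash" nonlinearity from the original dynamic-routing paper, which keeps each capsule's output vector length in [0, 1) so it can be read as a probability; the capsule shapes below are illustrative.

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """Squash: v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / torch.sqrt(sq_norm + eps)

capsules = torch.randn(2, 10, 16)   # (batch, capsules, capsule dim), illustrative
v = squash(capsules)                # vector lengths now lie in [0, 1)
```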
Advantages
- Better at recognizing spatial hierarchies
- More robust to affine transformations
- Can require less training data than comparable CNNs
Limitations
- Computationally expensive
- Not as widely adopted as CNNs
- Limited large-scale implementations
13. Spiking Neural Networks (SNN)
Biologically inspired models that more closely mimic actual neural activity.
Key Characteristics
- Temporal Coding: Information encoded in spike timing
- Event-Driven: Only active when spikes occur
- Leaky Integrate-and-Fire: Common neuron model
- STDP: Spike-timing-dependent plasticity learning rule
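A minimal discrete-time leaky integrate-and-fire simulation as a sketch; the decay factor, threshold, and input current are illustrative choices.

```python
import torch

# Discrete-time leaky integrate-and-fire: the membrane potential decays,
# integrates input current, and emits a spike (then resets) at threshold.
beta, threshold = 0.9, 1.0                 # illustrative decay and threshold
T, n_neurons = 100, 5
current = 0.2 * torch.rand(T, n_neurons)   # random input current

V = torch.zeros(n_neurons)
spikes = []
for t in range(T):
    V = beta * V + current[t]              # leak + integrate
    spike = (V >= threshold).float()
    V = V * (1.0 - spike)                  # reset after a spike
    spikes.append(spike)
spike_train = torch.stack(spikes)          # (T, n_neurons) binary spike train
```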
Applications
- Neuromorphic computing
- Low-power edge devices
- Brain-computer interfaces
- Computational neuroscience
Resources: snnTorch | SNN Review
14. Neural Ordinary Differential Equations (Neural ODEs)
Framework that models continuous-depth neural networks using differential equations.
Key Ideas
- Treat neural network as continuous transformation
- Use ODE solver for forward pass
- Adjoint method for efficient backpropagation
- Memory efficiency compared to very deep discrete networks
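A rough sketch of the idea using a fixed-step Euler solver in plain PyTorch; libraries such as torchdiffeq provide adaptive solvers and the adjoint method, and the network sizes and step count here are illustrative.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Parameterizes dz/dt = f(z, t; theta); sizes are illustrative."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, z):
        return self.net(z)

def euler_odeint(func, z0, t0=0.0, t1=1.0, steps=20):
    """Fixed-step Euler integration of z from t0 to t1."""
    z, dt = z0, (t1 - t0) / steps
    for i in range(steps):
        z = z + dt * func(t0 + i * dt, z)
    return z

func = ODEFunc()
z0 = torch.randn(8, 16)
z1 = euler_odeint(func, z0)   # "continuous-depth" transformation of z0
```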
Applications
- Time series modeling
- Density estimation
- Continuous normalizing flows
- Irregularly sampled data
Resources: Neural ODEs Paper | torchdiffeq
15. Mixture of Experts (MoE)
Sparse model architecture where different parts of the network (experts) specialize in different inputs.
Key Components
- Experts: Specialized sub-networks
- Gating Network: Routes inputs to relevant experts
- Sparsity: Only a few experts active per input
- Capacity: Can scale to enormous sizes efficiently
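A minimal top-1 routing sketch in PyTorch; the expert count and dimensions are illustrative, and real implementations add load-balancing losses, capacity limits, and much larger experts.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Top-1 routed mixture of experts (illustrative sizes, no load balancing)."""
    def __init__(self, dim=32, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):
        probs = torch.softmax(self.gate(x), dim=-1)   # gating distribution
        top_idx = probs.argmax(dim=-1)                # top-1 expert per input
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = probs[mask][:, i:i + 1] * expert(x[mask])
        return out

moe = TinyMoE()
y = moe(torch.randn(10, 32))   # each input is processed by its chosen expert
```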
Applications
- Large language models (e.g., Google's Switch Transformer)
- Multi-task learning
- Domain-specific specialization
Resources: Switch Transformers | MoE Implementation