Neural Network Architectures Overview
1. Artificial Neural Network (ANN)
The fundamental building block of deep learning, consisting of interconnected nodes organized in layers.
Architecture
- Input Layer: Receives the raw input data
- Hidden Layers: One or more layers that transform their inputs through learned weights and nonlinear activation functions
- Output Layer: Produces the final prediction or classification
Key Equations
output = activation(Wx + b)
where:
W = weight matrix
x = input vector
b = bias vector
activation = nonlinear function (ReLU, sigmoid, tanh)
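As a concrete illustration, here is a minimal PyTorch sketch of a two-layer ANN; the layer sizes and batch size are placeholders, not tuned values.

```python
import torch
import torch.nn as nn

# Minimal fully connected network: input -> hidden -> output.
# Sizes are illustrative placeholders.
model = nn.Sequential(
    nn.Linear(20, 64),   # output = W x + b for the hidden layer
    nn.ReLU(),           # nonlinear activation
    nn.Linear(64, 3),    # output layer (e.g., 3 classes)
)

x = torch.randn(8, 20)   # batch of 8 feature vectors
logits = model(x)        # shape: (8, 3)
```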
Advantages
- Universal function approximator
- Simple to implement
- Good for structured data
Limitations
- Poor performance with unstructured data (images, text)
- No spatial or temporal awareness
- Can be computationally expensive for large inputs
Resources: PyTorch ANN Tutorial | TensorFlow ANN Tutorial
2. Convolutional Neural Network (CNN)
Specialized architecture for processing grid-like data such as images; convolution provides translation equivariance, and pooling adds a degree of translation invariance.
Core Components
- Convolutional Layers: Apply filters to extract features (edges, textures, patterns)
- Pooling Layers: Reduce spatial dimensions (max pooling, average pooling)
- Fully Connected Layers: Final classification/regression layers
- Common Architectures: LeNet-5, AlexNet, VGG, ResNet, EfficientNet
Key Operations
# 2D convolution operation (pseudocode)
output[b, i, j, :] = sum_{di, dj, k} (
    input[b, strides[1] * i + di, strides[2] * j + dj, k] *
    filter[di, dj, k, :]
) + bias[:]
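The same operation in PyTorch, as a minimal sketch of one convolution-plus-pooling stage; the channel counts and kernel size are illustrative and not taken from any specific architecture.

```python
import torch
import torch.nn as nn

# One convolution + pooling stage with illustrative sizes.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

x = torch.randn(1, 3, 32, 32)          # (batch, channels, height, width)
features = pool(torch.relu(conv(x)))   # shape: (1, 16, 16, 16)
```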
Advantages
- Parameter sharing greatly reduces the number of parameters compared to fully connected layers
- Automatic feature extraction
- Excellent for image/video processing
Limitations
- Computationally intensive
- Requires large datasets for training
- Not ideal for sequential data
Resources: PyTorch CNN Tutorial | Stanford CNN Course
3. Recurrent Neural Network (RNN)
Designed for sequential data by maintaining a hidden state that captures information about previous elements in the sequence.
Architecture Variants
- Vanilla RNN: Basic recurrent unit with simple hidden state update
- Bidirectional RNN: Processes sequence both forward and backward
- Deep RNN: Multiple stacked recurrent layers
Key Equations
h_t = activation(W_xh * x_t + W_hh * h_{t-1} + b_h)
y_t = W_hy * h_t + b_y
where:
h_t = hidden state at time t
x_t = input at time t
y_t = output at time t
W_* = weight matrices
b_* = bias vectors
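A minimal PyTorch sketch of the recurrence above using nn.RNN, which applies the hidden-state update at every time step; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Single-layer RNN: input size 10, hidden size 20 (illustrative values).
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(4, 15, 10)    # (batch, sequence length, features)
outputs, h_n = rnn(x)         # outputs: (4, 15, 20), h_n: (1, 4, 20)
```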
Advantages
- Can process variable-length sequences
- Shares parameters across time steps
- Maintains memory of previous inputs
Limitations
- Suffers from vanishing/exploding gradients
- Difficulty learning long-range dependencies
- Computationally sequential (hard to parallelize)
4. Long Short-Term Memory (LSTM)
Advanced RNN variant designed to overcome the vanishing gradient problem through gated mechanisms.
Core Components
- Forget Gate: Decides what information to discard from cell state
- Input Gate: Updates the cell state with new information
- Output Gate: Determines what to output based on cell state
- Cell State: The "memory" that carries information through the sequence
Key Equations
f_t = σ(W_f · [h_{t-1}, x_t] + b_f) # Forget gate
i_t = σ(W_i · [h_{t-1}, x_t] + b_i) # Input gate
o_t = σ(W_o · [h_{t-1}, x_t] + b_o) # Output gate
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) # Candidate cell state
C_t = f_t * C_{t-1} + i_t * C̃_t # New cell state
h_t = o_t * tanh(C_t) # New hidden state
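The gate equations above are implemented internally by PyTorch's nn.LSTM; a minimal usage sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

# Forget/input/output gates and the cell state are handled inside nn.LSTM.
lstm = nn.LSTM(input_size=10, hidden_size=32, num_layers=1, batch_first=True)

x = torch.randn(4, 25, 10)        # (batch, time steps, features)
outputs, (h_n, c_n) = lstm(x)     # hidden and cell state at the final step
```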
Note: GRU (Gated Recurrent Unit) is a simplified variant of LSTM that combines the forget and input gates into a single "update gate" and merges the cell state and hidden state.
Resources: PyTorch LSTM Tutorial | Understanding LSTMs
5. Generative Adversarial Network (GAN)
A framework for training generative models through an adversarial process involving two competing networks.
Architecture Components
- Generator (G): Creates synthetic data from random noise
- Discriminator (D): Distinguishes between real and generated data
- Training Process: A minimax game in which G tries to fool D while D tries to correctly classify real versus generated samples
Objective Function
min_G max_D V(D,G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G(z)))]
where:
x = real data sample
z = random noise vector
G(z) = generated sample
D(·) = discriminator's probability estimate
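A heavily simplified sketch of one adversarial update, just to show how the objective above maps to code. The names G, D, opt_G, opt_D and noise_dim are hypothetical, D is assumed to end in a sigmoid, and the generator uses the common non-saturating loss rather than the literal minimax form.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, opt_G, opt_D, real, noise_dim=100):
    """One illustrative GAN update; G, D and the optimizers are assumed to exist."""
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch, noise_dim)
    fake = G(z).detach()
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator (non-saturating form): maximize log D(G(z))
    z = torch.randn(batch, noise_dim)
    g_loss = bce(D(G(z)), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```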
GAN Variants
Type | Description | Applications |
---|---|---|
DCGAN | Deep Convolutional GAN with architectural constraints | Image generation |
CycleGAN | Uses cycle consistency for unpaired image-to-image translation | Style transfer, photo enhancement |
StyleGAN | Controls fine details through style-based generation | High-quality face generation |
WGAN | Uses Wasserstein distance for more stable training | General purpose generation |
Resources: PyTorch DCGAN Tutorial | Original GAN Paper
6. Radial Basis Function Network (RBFN)
A feedforward network with radial basis activation functions in the hidden layer, particularly effective for function approximation and classification.
Architecture
- Input Layer: Receives feature vector
- Hidden Layer: Uses radial basis functions (typically Gaussian) centered at specific points
- Output Layer: Linear combination of hidden layer outputs
Key Equations
ϕ(x) = exp(-β||x - c||²) # Gaussian RBF
where:
c = center vector
β = spread parameter
Output: y = Σ w_i * ϕ_i(x) + b
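A compact PyTorch sketch of these equations; the centers, spread β, and output weights below are random placeholders, whereas in practice the centers are usually chosen from the data (e.g., by clustering) and the output weights fit by least squares.

```python
import torch

def rbf_features(x, centers, beta):
    """Gaussian RBF activations: phi_i(x) = exp(-beta * ||x - c_i||^2)."""
    dists = torch.cdist(x, centers) ** 2      # squared Euclidean distances
    return torch.exp(-beta * dists)           # (batch, num_centers)

x = torch.randn(8, 2)
centers = torch.randn(5, 2)                   # illustrative centers
phi = rbf_features(x, centers, beta=1.0)
w, b = torch.randn(5, 1), torch.zeros(1)      # output weights (normally fit to data)
y = phi @ w + b                               # y = sum_i w_i * phi_i(x) + b
```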
Advantages
- Fast training (the output weights can often be fit in a single linear least-squares step)
- Good interpolation capabilities
- Can approximate any continuous function
Limitations
- Curse of dimensionality with high-dim inputs
- Requires careful selection of centers
- Not as flexible as MLPs for complex patterns
Resources: Wikipedia RBFN | RBFN Overview
7. Reinforcement Learning (RL)
A learning paradigm where an agent learns to make decisions by interacting with an environment to maximize cumulative reward.
Key Components
- Agent: The learner/decision maker
- Environment: The world with which the agent interacts
- State (s): Current situation of the agent
- Action (a): Decision made by the agent
- Reward (r): Feedback from the environment
- Policy (π): Strategy that the agent employs
Approaches
Method | Description | Example Algorithms |
---|---|---|
Value-Based | Learn value function to derive policy | Q-Learning, DQN |
Policy-Based | Directly optimize the policy | REINFORCE, PPO |
Model-Based | Learn model of environment dynamics | Dyna, MBMF |
Actor-Critic | Combine value and policy approaches | A3C, SAC |
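As a concrete example of the value-based approach, here is a minimal tabular Q-learning update; the table sizes and the sample transition are illustrative only.

```python
import numpy as np

# Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
n_states, n_actions = 10, 4          # illustrative sizes
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99             # learning rate and discount factor

def q_update(Q, s, a, r, s_next, done):
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])

# One illustrative transition (s=0, a=2, r=1.0, s'=3)
q_update(Q, s=0, a=2, r=1.0, s_next=3, done=False)
```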
Resources: PyTorch RL | OpenAI Spinning Up
8. Autoencoders
Unsupervised neural networks that learn efficient data encodings by training to reconstruct their inputs.
Architecture
- Encoder: Maps input to latent space representation (bottleneck)
- Decoder: Reconstructs input from latent representation
- Objective: Minimize reconstruction error (e.g., MSE, cross-entropy)
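A minimal undercomplete autoencoder sketch in PyTorch; the input and bottleneck dimensions are placeholders.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Undercomplete autoencoder: 784 -> 32 -> 784 (sizes are illustrative)."""
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(16, 784)
loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
```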
Variants
- Undercomplete: Bottleneck layer has lower dimension than input
- Sparse: Adds sparsity constraint to activations
- Denoising: Trained to reconstruct clean inputs from corrupted versions
- Variational (VAE): Learns probabilistic latent space
- Contractive: Adds penalty on encoder's Jacobian
Advantages
- Unsupervised feature learning
- Dimensionality reduction
- Anomaly detection
- Data denoising
Limitations
- May learn trivial identity function
- Quality depends on architecture choices
- VAEs can produce blurry reconstructions
Resources: TensorFlow Autoencoder | From Autoencoder to Beta-VAE
9. Transfer Learning
Technique where knowledge gained from solving one problem is applied to a different but related problem.
Approaches
- Feature Extraction: Use pre-trained model as fixed feature extractor
- Fine-Tuning: Unfreeze some layers and continue training
- Domain Adaptation: Adapt model to new domain with different data distribution
- Multi-Task Learning: Train on multiple related tasks simultaneously
Common Pre-trained Models
Domain | Models | Datasets |
---|---|---|
Computer Vision | ResNet, EfficientNet, VGG, MobileNet | ImageNet, COCO |
NLP | BERT, GPT, RoBERTa, T5 | Wikipedia, BookCorpus, Common Crawl |
Speech | Wav2Vec2, HuBERT | LibriSpeech, Common Voice |
Tip: When using transfer learning, start with a small learning rate (typically about 10x smaller than you would use when training from scratch), since the pre-trained weights are usually already quite good.
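A minimal sketch of the feature-extraction approach with a torchvision ResNet (assuming a recent torchvision release); the number of target classes and the learning rate are placeholders, with a small learning rate per the tip above.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load an ImageNet-pretrained ResNet-18 and freeze its backbone.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for the new task (10 classes is a placeholder).
model.fc = nn.Linear(model.fc.in_features, 10)

# Train only the new head, with a small learning rate.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
```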
Resources: PyTorch Transfer Learning | Keras Transfer Learning
10. Transformers
Attention-based architecture that has revolutionized natural language processing and beyond.
Key Components
- Self-Attention: Computes weighted sum of all input elements
- Multi-Head Attention: Multiple attention heads learn different patterns
- Positional Encoding: Injects information about token positions
- Feedforward Layers: Applied position-wise
- Layer Normalization: Stabilizes training
Attention Mechanism
Attention(Q, K, V) = softmax(QK^T/√d_k)V
where:
Q = Query matrix
K = Key matrix
V = Value matrix
d_k = dimension of keys
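A direct translation of the formula into PyTorch, as a minimal sketch with no masking and a single head; the shapes are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., queries, keys)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V

Q = torch.randn(2, 5, 64)   # (batch, query positions, d_k)
K = torch.randn(2, 7, 64)
V = torch.randn(2, 7, 64)
out = scaled_dot_product_attention(Q, K, V)   # (2, 5, 64)
```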
Transformer Variants
- Encoder-Only: BERT, RoBERTa (good for classification)
- Decoder-Only: GPT family (good for generation)
- Encoder-Decoder: T5, BART (good for translation)
- Vision Transformers: ViT, DeiT (apply to images)
Resources: PyTorch Transformer | Illustrated Transformer
11. Graph Neural Networks (GNN)
Specialized neural networks designed to operate on graph-structured data.
Core Concepts
- Node Embedding: Representation of each node in latent space
- Message Passing: Nodes aggregate information from neighbors
- Graph-Level Tasks: Predict properties of entire graphs
- Node-Level Tasks: Predict properties of individual nodes
- Edge-Level Tasks: Predict properties of edges
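A bare-bones sketch of one GCN-style message-passing step using a dense adjacency matrix; libraries such as PyTorch Geometric use sparse edge lists instead, and the graph and feature sizes below are illustrative.

```python
import torch

def gcn_layer(X, A, W):
    """One GCN-style update: H = ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + torch.eye(A.size(0))              # add self-loops
    deg = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(deg.pow(-0.5))
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)

X = torch.randn(5, 8)         # 5 nodes with 8-dim features (illustrative)
A = torch.tensor([[0, 1, 0, 0, 1],
                  [1, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [1, 0, 0, 1, 0]], dtype=torch.float)
W = torch.randn(8, 16)
H = gcn_layer(X, A, W)        # (5, 16) updated node embeddings
```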
Popular Variants
Type | Description | Use Cases |
---|---|---|
GCN | Graph Convolutional Network | Node classification |
GAT | Graph Attention Network | Tasks requiring edge importance |
GraphSAGE | General inductive framework | Large-scale graphs |
GIN | Graph Isomorphism Network | Graph classification |
Resources: PyTorch Geometric | GNN Introduction
12. Capsule Networks
Alternative to CNNs that aim to better model hierarchical relationships in data.
Key Features
- Capsules: Groups of neurons that represent specific entities
- Dynamic Routing: Agreement mechanism between capsules
- Pose Matrices: Capture spatial relationships
- Equivariance: Preserves spatial information
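As a small concrete piece, the "squash" nonlinearity from the original dynamic-routing paper, which keeps each capsule's output vector length in [0, 1) so it can be read as a probability; the capsule shapes below are illustrative.

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """Squash: v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / torch.sqrt(sq_norm + eps)

capsules = torch.randn(2, 10, 16)   # (batch, capsules, capsule dim), illustrative
v = squash(capsules)                # vector lengths now lie in [0, 1)
```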
Advantages
- Better at recognizing spatial hierarchies
- More robust to affine transformations
- Can require less training data than comparable CNNs
Limitations
- Computationally expensive
- Not as widely adopted as CNNs
- Limited large-scale implementations
13. Spiking Neural Networks (SNN)
Biologically inspired models that more closely mimic actual neural activity.
Key Characteristics
- Temporal Coding: Information encoded in spike timing
- Event-Driven: Only active when spikes occur
- Leaky Integrate-and-Fire: Common neuron model
- STDP: Spike-timing-dependent plasticity learning rule
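A minimal discrete-time leaky integrate-and-fire simulation as a sketch; the decay factor, threshold, and input current are illustrative choices.

```python
import torch

# Discrete-time leaky integrate-and-fire: the membrane potential decays,
# integrates input current, and emits a spike (then resets) at threshold.
beta, threshold = 0.9, 1.0                 # illustrative decay and threshold
T, n_neurons = 100, 5
current = 0.2 * torch.rand(T, n_neurons)   # random input current

V = torch.zeros(n_neurons)
spikes = []
for t in range(T):
    V = beta * V + current[t]              # leak + integrate
    spike = (V >= threshold).float()
    V = V * (1.0 - spike)                  # reset after a spike
    spikes.append(spike)
spike_train = torch.stack(spikes)          # (T, n_neurons) binary spike train
```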
Applications
- Neuromorphic computing
- Low-power edge devices
- Brain-computer interfaces
- Computational neuroscience
Resources: snnTorch | SNN Review
14. Neural Ordinary Differential Equations (Neural ODEs)
Framework that models continuous-depth neural networks using differential equations.
Key Ideas
- Treat neural network as continuous transformation
- Use ODE solver for forward pass
- Adjoint method for efficient backpropagation
- Memory efficiency compared to very deep discrete networks
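A rough sketch of the idea using a fixed-step Euler solver in plain PyTorch; libraries such as torchdiffeq provide adaptive solvers and the adjoint method, and the network sizes and step count here are illustrative.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Parameterizes dz/dt = f(z, t; theta); sizes are illustrative."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, z):
        return self.net(z)

def euler_odeint(func, z0, t0=0.0, t1=1.0, steps=20):
    """Fixed-step Euler integration of z from t0 to t1."""
    z, dt = z0, (t1 - t0) / steps
    for i in range(steps):
        z = z + dt * func(t0 + i * dt, z)
    return z

func = ODEFunc()
z0 = torch.randn(8, 16)
z1 = euler_odeint(func, z0)   # "continuous-depth" transformation of z0
```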
Applications
- Time series modeling
- Density estimation
- Continuous normalizing flows
- Irregularly sampled data
Resources: Neural ODEs Paper | torchdiffeq
15. Mixture of Experts (MoE)
Sparse model architecture where different parts of the network (experts) specialize in different inputs.
Key Components
- Experts: Specialized sub-networks
- Gating Network: Routes inputs to relevant experts
- Sparsity: Only a few experts active per input
- Capacity: Can scale to enormous sizes efficiently
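A minimal top-1 routing sketch in PyTorch; the expert count and dimensions are illustrative, and real implementations add load-balancing losses, capacity limits, and much larger experts.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Top-1 routed mixture of experts (illustrative sizes, no load balancing)."""
    def __init__(self, dim=32, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):
        probs = torch.softmax(self.gate(x), dim=-1)   # gating distribution
        top_idx = probs.argmax(dim=-1)                # top-1 expert per input
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = probs[mask][:, i:i + 1] * expert(x[mask])
        return out

moe = TinyMoE()
y = moe(torch.randn(10, 32))   # each input is processed by its chosen expert
```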
Applications
- Large language models (e.g., Google's Switch Transformer)
- Multi-task learning
- Domain-specific specialization
Resources: Switch Transformers | MoE Implementation