Neural Network Architectures Overview

1. Artificial Neural Network (ANN)

The fundamental building block of deep learning, consisting of interconnected nodes organized in layers.

Architecture

  • Input Layer: Receives the raw input data
  • Hidden Layers: One or more layers that transform inputs through weighted connections and nonlinear activation functions
  • Output Layer: Produces the final prediction or classification

Key Equations

output = activation(Wx + b)
where:
W = weight matrix
x = input vector
b = bias vector
activation = nonlinear function (ReLU, sigmoid, tanh)
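
A minimal sketch of this computation in PyTorch, stacking two such layers into a small network (layer sizes here are arbitrary placeholders):

import torch
import torch.nn as nn

# Each nn.Linear computes Wx + b; a nonlinearity is applied between layers
model = nn.Sequential(
    nn.Linear(16, 32),   # input layer -> hidden layer
    nn.ReLU(),           # nonlinear activation
    nn.Linear(32, 1),    # hidden layer -> output layer
)

x = torch.randn(4, 16)   # batch of 4 input vectors
y = model(x)             # output shape: (4, 1)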

Advantages

  • Universal function approximator
  • Simple to implement
  • Good for structured data

Limitations

  • Poor performance with unstructured data (images, text)
  • No spatial or temporal awareness
  • Can be computationally expensive for large inputs

2. Convolutional Neural Network (CNN)

Specialized architecture for processing grid-like data such as images; weight sharing and pooling give it approximate translation invariance.

Core Components

  • Convolutional Layers: Apply filters to extract features (edges, textures, patterns)
  • Pooling Layers: Reduce spatial dimensions (max pooling, average pooling)
  • Fully Connected Layers: Final classification/regression layers
  • Common Architectures: LeNet-5, AlexNet, VGG, ResNet, EfficientNet

Key Operations

# 2D convolution operation (NHWC indexing: batch, height, width, channels)
output[b, i, j, :] = sum_{di, dj, k} (
    input[b, strides[1] * i + di, strides[2] * j + dj, k] *
    filter[di, dj, k, :]
) + bias[:]
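
A brief runnable sketch of the same idea with PyTorch's nn.Conv2d (channel counts and kernel size are placeholders; note that PyTorch uses NCHW layout rather than the NHWC indexing above):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)  # 3x3 filters
pool = nn.MaxPool2d(kernel_size=2)            # halves the spatial dimensions

x = torch.randn(1, 3, 32, 32)                 # (batch, channels, height, width)
features = pool(torch.relu(conv(x)))          # shape: (1, 8, 16, 16)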

Advantages

  • Parameter sharing reduces number of parameters
  • Automatic feature extraction
  • Excellent for image/video processing

Limitations

  • Computationally intensive
  • Requires large datasets for training
  • Not ideal for sequential data

3. Recurrent Neural Network (RNN)

Designed for sequential data by maintaining a hidden state that captures information about previous elements in the sequence.

Architecture Variants

  • Vanilla RNN: Basic recurrent unit with simple hidden state update
  • Bidirectional RNN: Processes sequence both forward and backward
  • Deep RNN: Multiple stacked recurrent layers

Key Equations

h_t = activation(W_xh * x_t + W_hh * h_{t-1} + b_h)
y_t = W_hy * h_t + b_y

where:
h_t = hidden state at time t
x_t = input at time t
y_t = output at time t
W_* = weight matrices
b_* = bias vectors
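
A minimal sketch of this update rule written directly in PyTorch (dimensions are placeholders):

import torch
import torch.nn as nn

input_size, hidden_size = 8, 16
W_xh = nn.Linear(input_size, hidden_size)     # input-to-hidden weights (and bias)
W_hh = nn.Linear(hidden_size, hidden_size)    # hidden-to-hidden weights

x_seq = torch.randn(10, input_size)           # a sequence of 10 time steps
h = torch.zeros(hidden_size)                  # initial hidden state
for x_t in x_seq:
    h = torch.tanh(W_xh(x_t) + W_hh(h))       # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)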

Advantages

  • Can process variable-length sequences
  • Shares parameters across time steps
  • Maintains memory of previous inputs

Limitations

  • Suffers from vanishing/exploding gradients
  • Difficulty learning long-range dependencies
  • Computationally sequential (hard to parallelize)

4. Long Short-Term Memory (LSTM)

Advanced RNN variant designed to overcome the vanishing gradient problem through gated mechanisms.

Core Components

  • Forget Gate: Decides what information to discard from cell state
  • Input Gate: Updates the cell state with new information
  • Output Gate: Determines what to output based on cell state
  • Cell State: The "memory" that carries information through the sequence

Key Equations

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)  # Forget gate
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)  # Input gate
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)  # Output gate
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)  # Candidate cell state
C_t = f_t * C_{t-1} + i_t * C̃_t      # New cell state
h_t = o_t * tanh(C_t)                # New hidden state
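
In practice these gates come bundled in library cells; a brief sketch using PyTorch's nn.LSTMCell (sizes are placeholders):

import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=8, hidden_size=16)
h = torch.zeros(1, 16)                        # hidden state h_t
c = torch.zeros(1, 16)                        # cell state C_t

x_seq = torch.randn(10, 1, 8)                 # 10 time steps, batch of 1
for x_t in x_seq:
    h, c = cell(x_t, (h, c))                  # applies the gate equations above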

Note: GRU (Gated Recurrent Unit) is a simplified variant of LSTM that combines the forget and input gates into a single "update gate" and merges the cell state and hidden state.

5. Generative Adversarial Network (GAN)

A framework for training generative models through an adversarial process involving two competing networks.

Architecture Components

  • Generator (G): Creates synthetic data from random noise
  • Discriminator (D): Distinguishes between real and generated data
  • Training Process: Minimax game where G tries to fool D while D tries to correctly classify

Objective Function

min_G max_D V(D,G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G(z)))]

where:
x = real data sample
z = random noise vector
G(z) = generated sample
D(·) = discriminator's probability estimate
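
A rough sketch of one training step of this minimax game with tiny fully connected networks (the sizes and the non-saturating generator loss are illustrative choices, not a fixed recipe):

import torch
import torch.nn as nn
import torch.nn.functional as F

noise_dim, data_dim, batch_size = 16, 32, 8
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(batch_size, data_dim)       # stand-in for a batch of real data

# Discriminator step: push D(real) toward 1 and D(G(z)) toward 0
fake = G(torch.randn(batch_size, noise_dim)).detach()   # no gradient into G here
d_loss = F.binary_cross_entropy(D(real), torch.ones(batch_size, 1)) \
       + F.binary_cross_entropy(D(fake), torch.zeros(batch_size, 1))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Generator step: the common non-saturating variant maximizes log D(G(z))
fake = G(torch.randn(batch_size, noise_dim))
g_loss = F.binary_cross_entropy(D(fake), torch.ones(batch_size, 1))
g_opt.zero_grad()
g_loss.backward()
g_opt.step()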

GAN Variants

  • DCGAN: Deep Convolutional GAN with architectural constraints (applications: image generation)
  • CycleGAN: Uses cycle consistency for unpaired image-to-image translation (applications: style transfer, photo enhancement)
  • StyleGAN: Controls fine details through style-based generation (applications: high-quality face generation)
  • WGAN: Uses the Wasserstein distance for more stable training (applications: general-purpose generation)

6. Radial Basis Function Network (RBFN)

A feedforward network with radial basis activation functions in the hidden layer, particularly effective for function approximation and classification.

Architecture

  • Input Layer: Receives feature vector
  • Hidden Layer: Uses radial basis functions (typically Gaussian) centered at specific points
  • Output Layer: Linear combination of hidden layer outputs

Key Equations

ϕ(x) = exp(-β||x - c||²)  # Gaussian RBF
where:
c = center vector
β = spread parameter

Output: y = Σ w_i * ϕ_i(x) + b
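
A minimal sketch of a Gaussian RBF layer followed by a linear output layer (the centers here are random placeholders; in practice they are often chosen by clustering the training data):

import torch
import torch.nn as nn

n_centers, in_dim, beta = 10, 2, 1.0
centers = torch.randn(n_centers, in_dim)       # center vectors c_i (placeholders)
out = nn.Linear(n_centers, 1)                  # linear output layer (weights w_i and bias b)

def rbfn(x):
    # phi_i(x) = exp(-beta * ||x - c_i||^2) for every center
    dist_sq = ((x.unsqueeze(1) - centers) ** 2).sum(dim=-1)
    phi = torch.exp(-beta * dist_sq)
    return out(phi)                            # y = sum_i w_i * phi_i(x) + b

y = rbfn(torch.randn(4, in_dim))               # batch of 4 inputs -> shape (4, 1)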

Advantages

  • Fast training (often single pass)
  • Good interpolation capabilities
  • Can approximate any continuous function

Limitations

  • Curse of dimensionality with high-dim inputs
  • Requires careful selection of centers
  • Not as flexible as MLPs for complex patterns

7. Reinforcement Learning (RL)

A learning paradigm where an agent learns to make decisions by interacting with an environment to maximize cumulative reward.

Key Components

  • Agent: The learner/decision maker
  • Environment: The world with which the agent interacts
  • State (s): Current situation of the agent
  • Action (a): Decision made by the agent
  • Reward (r): Feedback from the environment
  • Policy (π): Strategy that the agent employs

Approaches

  • Value-Based: Learn a value function from which the policy is derived (e.g., Q-Learning, DQN)
  • Policy-Based: Directly optimize the policy (e.g., REINFORCE, PPO)
  • Model-Based: Learn a model of the environment dynamics (e.g., Dyna, MBMF)
  • Actor-Critic: Combine value-based and policy-based approaches (e.g., A3C, SAC)
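
As a concrete example of the value-based family above, a minimal tabular Q-learning update (the state and action spaces and the transition are placeholders):

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))            # action-value table Q(s, a)
alpha, gamma = 0.1, 0.99                       # learning rate and discount factor

# One update after observing the transition (s, a, r, s')
s, a, r, s_next = 0, 1, 1.0, 2                 # placeholder transition
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])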

8. Autoencoders

Unsupervised neural networks that learn efficient data encodings by training to reconstruct their inputs.

Architecture

  • Encoder: Maps input to latent space representation (bottleneck)
  • Decoder: Reconstructs input from latent representation
  • Objective: Minimize reconstruction error (e.g., MSE, cross-entropy)
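
A minimal sketch of an undercomplete autoencoder with an MSE reconstruction loss (dimensions are placeholders, e.g. flattened 28x28 images):

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())      # input -> 32-dim bottleneck
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())   # bottleneck -> reconstruction

x = torch.rand(16, 784)                        # batch of flattened inputs
recon = decoder(encoder(x))
loss = F.mse_loss(recon, x)                    # reconstruction error to minimize
loss.backward()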

Variants

  • Undercomplete: Bottleneck layer has lower dimension than input
  • Sparse: Adds sparsity constraint to activations
  • Denoising: Trained to reconstruct clean inputs from corrupted versions
  • Variational (VAE): Learns probabilistic latent space
  • Contractive: Adds penalty on encoder's Jacobian

Advantages

  • Unsupervised feature learning
  • Dimensionality reduction
  • Anomaly detection
  • Data denoising

Limitations

  • May learn trivial identity function
  • Quality depends on architecture choices
  • VAEs can produce blurry reconstructions

9. Transfer Learning

Technique where knowledge gained from solving one problem is applied to a different but related problem.

Approaches

  • Feature Extraction: Use pre-trained model as fixed feature extractor
  • Fine-Tuning: Unfreeze some layers and continue training
  • Domain Adaptation: Adapt model to new domain with different data distribution
  • Multi-Task Learning: Train on multiple related tasks simultaneously
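
A brief sketch of the feature-extraction approach with a torchvision ResNet (assumes a recent torchvision; the model choice and 10-class head are placeholders):

import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)   # load pre-trained weights
for p in model.parameters():
    p.requires_grad = False                     # freeze the backbone (feature extraction)
model.fc = nn.Linear(model.fc.in_features, 10)  # new head for the downstream 10-class task
# For fine-tuning, unfreeze some layers instead and train with a small learning rate.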

Common Pre-trained Models

  • Computer Vision: ResNet, EfficientNet, VGG, MobileNet (pre-trained on ImageNet, COCO)
  • NLP: BERT, GPT, RoBERTa, T5 (pre-trained on Wikipedia, BookCorpus, Common Crawl)
  • Speech: Wav2Vec2, HuBERT (pre-trained on LibriSpeech, Common Voice)

Tip: When using transfer learning, start with a small learning rate (typically about 10x smaller than usual), since the pre-trained weights are already close to a good solution.

10. Transformers

Attention-based architecture that has revolutionized natural language processing and beyond.

Key Components

  • Self-Attention: Computes a weighted sum over all input elements, with weights derived from query-key similarity
  • Multi-Head Attention: Multiple attention heads learn different patterns
  • Positional Encoding: Injects information about token positions
  • Feedforward Layers: Applied position-wise
  • Layer Normalization: Stabilizes training

Attention Mechanism

Attention(Q, K, V) = softmax(QK^T/√d_k)V

where:
Q = Query matrix
K = Key matrix
V = Value matrix
d_k = dimension of keys
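
The formula translates almost directly into code; a minimal single-head sketch (shapes are placeholders):

import math
import torch

seq_len, d_k = 5, 8
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

scores = Q @ K.T / math.sqrt(d_k)              # pairwise query-key similarity, scaled
weights = torch.softmax(scores, dim=-1)        # attention weights sum to 1 per query
output = weights @ V                           # weighted sum of the values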

Transformer Variants

  • Encoder-Only: BERT, RoBERTa (good for classification)
  • Decoder-Only: GPT family (good for generation)
  • Encoder-Decoder: T5, BART (good for translation)
  • Vision Transformers: ViT, DeiT (apply to images)

11. Graph Neural Networks (GNN)

Specialized neural networks designed to operate on graph-structured data.

Core Concepts

  • Node Embedding: Representation of each node in latent space
  • Message Passing: Nodes aggregate information from neighbors
  • Graph-Level Tasks: Predict properties of entire graphs
  • Node-Level Tasks: Predict properties of individual nodes
  • Edge-Level Tasks: Predict properties of edges
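
A minimal sketch of one message-passing step in the spirit of a GCN layer, using a dense adjacency matrix with simple row normalization (real implementations use sparse operations and symmetric normalization):

import torch
import torch.nn as nn

n_nodes, in_dim, out_dim = 4, 8, 16
A = torch.eye(n_nodes)                         # adjacency with self-loops (placeholder graph)
A[0, 1] = A[1, 0] = 1.0                        # one undirected edge
A_norm = A / A.sum(dim=1, keepdim=True)        # row-normalize neighbor contributions

H = torch.randn(n_nodes, in_dim)               # node feature matrix
W = nn.Linear(in_dim, out_dim)
H_next = torch.relu(A_norm @ W(H))             # aggregate neighbor messages, then transform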

Popular Variants

  • GCN: Graph Convolutional Network (use case: node classification)
  • GAT: Graph Attention Network (use case: tasks where edge importance matters)
  • GraphSAGE: General inductive framework (use case: large-scale graphs)
  • GIN: Graph Isomorphism Network (use case: graph classification)

12. Capsule Networks

An alternative to CNNs that aims to better model hierarchical relationships in data.

Key Features

  • Capsules: Groups of neurons that represent specific entities
  • Dynamic Routing: Agreement mechanism between capsules
  • Pose Matrices: Capture spatial relationships
  • Equivariance: Preserves spatial information

Advantages

  • Better at recognizing spatial hierarchies
  • More robust to affine transformations
  • Requires less training data

Limitations

  • Computationally expensive
  • Not as widely adopted as CNNs
  • Limited large-scale implementations

13. Spiking Neural Networks (SNN)

Biologically inspired models that more closely mimic actual neural activity.

Key Characteristics

  • Temporal Coding: Information encoded in spike timing
  • Event-Driven: Only active when spikes occur
  • Leaky Integrate-and-Fire: Common neuron model
  • STDP: Spike-timing-dependent plasticity learning rule
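
A minimal sketch of a leaky integrate-and-fire neuron over discrete time steps (the parameters are placeholders; libraries such as snnTorch provide trainable versions):

import torch

tau, v_thresh, v_reset = 0.9, 1.0, 0.0         # leak factor, firing threshold, reset value
v = 0.0                                        # membrane potential
input_current = torch.rand(20)                 # placeholder input over 20 time steps

spikes = []
for i_t in input_current:
    v = tau * v + i_t.item()                   # leaky integration of the input current
    if v >= v_thresh:                          # fire when the threshold is crossed
        spikes.append(1)
        v = v_reset                            # reset the membrane potential
    else:
        spikes.append(0)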

Applications

  • Neuromorphic computing
  • Low-power edge devices
  • Brain-computer interfaces
  • Computational neuroscience

Resources: snnTorch | SNN Review

14. Neural Ordinary Differential Equations (Neural ODEs)

Framework that models continuous-depth neural networks using differential equations.

Key Ideas

  • Treat neural network as continuous transformation
  • Use ODE solver for forward pass
  • Adjoint method for efficient backpropagation
  • Memory efficiency compared to very deep discrete networks
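
A rough sketch of the core idea with a fixed-step Euler solver (real implementations use adaptive solvers and the adjoint method, e.g. via the torchdiffeq library):

import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 4))   # parameterizes dh/dt = f(h)

def euler_integrate(h0, t0=0.0, t1=1.0, steps=20):
    # Integrate dh/dt = f(h) from t0 to t1 with a fixed step size
    h, dt = h0, (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h)
    return h

h1 = euler_integrate(torch.randn(1, 4))        # a "continuous-depth" forward pass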

Applications

  • Time series modeling
  • Density estimation
  • Continuous normalizing flows
  • Irregularly sampled data

15. Mixture of Experts (MoE)

Sparse model architecture where different parts of the network (experts) specialize in different inputs.

Key Components

  • Experts: Specialized sub-networks
  • Gating Network: Routes inputs to relevant experts
  • Sparsity: Only a few experts active per input
  • Capacity: Can scale to enormous sizes efficiently
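
A minimal sketch of top-k routing for a single token (expert count, sizes, and k are placeholders; production MoE layers add load-balancing losses and capacity limits):

import torch
import torch.nn as nn

n_experts, d_model, k = 4, 16, 2
experts = nn.ModuleList([nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
                         for _ in range(n_experts)])
gate = nn.Linear(d_model, n_experts)           # gating network scores each expert

x = torch.randn(d_model)                       # one token's representation
topk = torch.topk(torch.softmax(gate(x), dim=-1), k)
# Only the k selected experts run; their outputs are weighted by the gate scores
y = sum(w * experts[int(i)](x) for w, i in zip(topk.values, topk.indices))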

Applications

  • Large language models (e.g., Google's Switch Transformer)
  • Multi-task learning
  • Domain-specific specialization
