Music Generation With PyTorch

LSTM Music Generator Documentation

This document explains a Python program that generates music using an LSTM neural network trained on MIDI files.

Table of Contents

  • Overview
  • Requirements
  • Getting Started
  • Code Explanation
  • How It Works
  • Limitations
  • Potential Improvements

Overview

This program uses a Long Short-Term Memory (LSTM) neural network to learn patterns from MIDI music files and generate new musical sequences. The implementation includes:

  • Loading and parsing MIDI files using music21 library
  • Preprocessing musical notes into sequences for training
  • An LSTM-based neural network architecture
  • Training the model to predict the next note in a sequence
  • Generating new music by sampling from the model's predictions

Requirements

To run this program, you'll need:

  • Python 3.6+
  • Required Python packages:
    • torch (PyTorch)
    • numpy
    • music21
  • Standard-library modules used (no installation needed): glob, pickle
  • MIDI files for training (place them in the same directory as the script)
  • Optional: CUDA-enabled GPU for faster training (PyTorch will automatically use GPU if available)
You can install the required packages using pip:
pip install torch numpy music21
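
After installing, a quick check confirms that the packages import correctly and whether a GPU will be used (optional, not part of the generator script):

import torch
import music21  # verifies the MIDI toolkit is importable

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())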

Getting Started

  1. Place your MIDI files in the same directory as the script (or specify the path)
  2. Run the script: python music_generator.py (a minimal entry-point sketch follows this list)
  3. The script will:
    • Load and process the MIDI files
    • Train the LSTM model
    • Generate a new MIDI file called generated_music.mid
  4. Open the generated MIDI file with any music player or DAW
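
For reference, the two phases are typically wired together in an entry point like this (a minimal sketch; it assumes the train_network and generate_music functions shown in the Code Explanation section):

if __name__ == '__main__':
    # Phase 1: learn from the MIDI files found in the working directory
    train_network()
    # Phase 2: sample a new piece and write generated_music.mid
    generate_music()
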
Note: The quality of generated music depends on:
  • The quantity and quality of training MIDI files
  • The training parameters (epochs, sequence length, etc.)
  • The complexity of the musical patterns in the training data

Code Explanation

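All of the snippets in this section assume the following imports at the top of the script (reconstructed here from the calls the code makes; the exact import style may differ in the original):

import glob
import pickle

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from music21 import converter, instrument, note, chord, stream
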
Configuration

SEQUENCE_LENGTH = 100  # Length of input sequences
BATCH_SIZE = 64        # Number of sequences per batch
EPOCHS = 50            # Number of training epochs
HIDDEN_SIZE = 256      # Size of LSTM hidden layers
NUM_LAYERS = 2         # Number of LSTM layers
LEARNING_RATE = 1e-3   # Learning rate for optimizer
DROPOUT = 0.3          # Dropout rate for regularization
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

These configuration parameters control the training process and model architecture. You can adjust them based on your needs (a rough sizing example follows this list):

  • Increase SEQUENCE_LENGTH to capture longer musical patterns
  • Increase HIDDEN_SIZE and NUM_LAYERS for a more complex model (requires more data and computation)
  • Adjust LEARNING_RATE if training is unstable or too slow
  • Increase EPOCHS for better training (but watch for overfitting)
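
For a rough sense of scale: with the defaults above and a corpus of, say, 10,000 extracted notes (a made-up figure for illustration), the data preparation described below yields:

n_notes = 10_000                          # hypothetical corpus size
n_patterns = n_notes - SEQUENCE_LENGTH    # 9,900 training windows
n_batches = n_patterns // BATCH_SIZE      # 154 full batches per epoch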

Model Architecture

class MusicGenerator(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers, dropout):
        super(MusicGenerator, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        out, hidden = self.lstm(x, hidden)
        out = out[:, -1, :]
        out = self.fc(out)
        return out, hidden

    def init_hidden(self, batch_size):
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size, device=DEVICE),
                torch.zeros(self.num_layers, batch_size, self.hidden_size, device=DEVICE))

The model consists of:

  • LSTM layers: Process sequential data and maintain hidden state
  • Fully connected layer: Maps LSTM output to prediction probabilities
  • Hidden state initialization: Provides starting state for LSTM

The forward pass takes an input sequence and hidden state, processes it through the LSTM, and returns predictions for the next note.
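
A quick shape check makes the data flow concrete (a standalone sketch; the vocabulary size of 358 is just an illustrative value):

model = MusicGenerator(input_size=1, hidden_size=HIDDEN_SIZE,
                       output_size=358, num_layers=NUM_LAYERS,
                       dropout=DROPOUT).to(DEVICE)
x = torch.rand(BATCH_SIZE, SEQUENCE_LENGTH, 1, device=DEVICE)  # (batch, seq_len, features)
hidden = model.init_hidden(BATCH_SIZE)
out, hidden = model(x, hidden)
print(out.shape)        # torch.Size([64, 358]) -- logits, one per vocabulary entry
print(hidden[0].shape)  # torch.Size([2, 64, 256]) -- (num_layers, batch, hidden_size)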

Data Preparation

Loading Notes

def load_notes(midi_path="*.mid"):
    notes = []
    files = glob.glob(midi_path)
    for file in files:
        midi = converter.parse(file)
        try:
            # File has instrument parts: use the first part's elements
            parts = instrument.partitionByInstrument(midi)
            elements = parts.parts[0].recurse()
        except Exception:
            # Fall back to a flat note structure
            elements = midi.flat.notes

        for element in elements:
            if isinstance(element, note.Note):
                notes.append(str(element.pitch))
            elif isinstance(element, chord.Chord):
                notes.append('.'.join(str(n) for n in element.normalOrder))
    return notes

This function:

  • Finds all MIDI files in the specified path
  • Parses each file using music21
  • Extracts notes and chords (chords are represented as dot-separated pitch classes; example output follows this list)
  • Returns a list of all notes/chords in sequence
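
The resulting list mixes single pitches with dot-separated pitch-class strings for chords, for example (illustrative values only):

notes = load_notes()
print(notes[:6])
# ['E4', 'G#4', 'B4', '4.8.11', 'F#4', '2.6.9']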

Preparing Sequences

def prepare_sequences(notes):
    pitchnames = sorted(set(notes))
    note_to_int = {n: i for i, n in enumerate(pitchnames)}
    n_vocab = len(pitchnames)

    network_input = []
    network_output = []
    for i in range(len(notes) - SEQUENCE_LENGTH):
        seq_in = notes[i:i + SEQUENCE_LENGTH]
        seq_out = notes[i + SEQUENCE_LENGTH]
        network_input.append([note_to_int[n] for n in seq_in])
        network_output.append(note_to_int[seq_out])

    n_patterns = len(network_input)
    network_input = np.array(network_input).reshape(n_patterns, SEQUENCE_LENGTH, 1) / float(n_vocab)
    network_output = np.array(network_output)

    return network_input, network_output, note_to_int

This function:

  • Creates a vocabulary of unique notes/chords
  • Maps each note/chord to an integer
  • Creates input sequences of SEQUENCE_LENGTH and corresponding output (next note)
  • Scales the integer inputs into the range [0, 1) by dividing by the vocabulary size (a toy walkthrough follows this list)
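
For intuition, here is the same windowing on a toy input with SEQUENCE_LENGTH = 3 instead of 100 (illustrative only):

notes = ['C4', 'E4', 'G4', 'C5', 'E4', 'G4']
# vocabulary (sorted, unique): ['C4', 'C5', 'E4', 'G4'] -> {'C4': 0, 'C5': 1, 'E4': 2, 'G4': 3}
# window 0: input ['C4', 'E4', 'G4'] -> [0, 2, 3], target 'C5' -> 1
# window 1: input ['E4', 'G4', 'C5'] -> [2, 3, 1], target 'E4' -> 2
# window 2: input ['G4', 'C5', 'E4'] -> [3, 1, 2], target 'G4' -> 3
# the inputs are then reshaped to (3, 3, 1) and divided by 4, landing in [0, 1)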

Training Process

def train_network():
    notes = load_notes()
    network_input, network_output, note_to_int = prepare_sequences(notes)
    n_vocab = len(note_to_int)

    X = torch.from_numpy(network_input).float().to(DEVICE)
    y = torch.from_numpy(network_output).long().to(DEVICE)

    model = MusicGenerator(input_size=1, hidden_size=HIDDEN_SIZE,
                           output_size=n_vocab, num_layers=NUM_LAYERS,
                           dropout=DROPOUT).to(DEVICE)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

    n_batches = (len(X) // BATCH_SIZE)

    model.train()
    for epoch in range(1, EPOCHS+1):
        epoch_loss = 0.0
        hidden = model.init_hidden(BATCH_SIZE)

        for b in range(n_batches):
            start = b * BATCH_SIZE
            end = start + BATCH_SIZE
            inputs = X[start:end]
            targets = y[start:end]

            optimizer.zero_grad()
            outputs, hidden = model(inputs, hidden)
            hidden = (hidden[0].detach(), hidden[1].detach())

            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

        avg_loss = epoch_loss / n_batches if n_batches else 0
        print(f"Epoch {epoch}/{EPOCHS} Loss: {avg_loss:.4f}")

    torch.save(model.state_dict(), 'music_generator.pth')
    with open('note_to_int.pickle', 'wb') as f:
        pickle.dump(note_to_int, f)

The training process:

  • Loads and prepares the data
  • Initializes the model, loss function, and optimizer
  • Trains in batches for the specified number of epochs (a quick loss sanity check follows this list)
  • Saves the trained model and note-to-int mapping
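
A useful sanity check: before any learning, the reported cross-entropy loss should sit near ln(n_vocab), the cost of a uniform guess, and then fall steadily. For example (hypothetical vocabulary size):

import math
n_vocab = 358                 # hypothetical vocabulary size
print(math.log(n_vocab))      # ~5.88 -- expected loss of an untrained model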

Music Generation

def generate_music(model_path='music_generator.pth',
                   note_dict_path='note_to_int.pickle',
                   gen_length=500):
    with open(note_dict_path, 'rb') as f:
        note_to_int = pickle.load(f)
    int_to_note = {i: n for n, i in note_to_int.items()}

    model = MusicGenerator(input_size=1, hidden_size=HIDDEN_SIZE,
                           output_size=len(note_to_int), num_layers=NUM_LAYERS,
                           dropout=0).to(DEVICE)
    model.load_state_dict(torch.load(model_path, map_location=DEVICE))
    model.eval()

    notes = load_notes()
    network_input, _, _ = prepare_sequences(notes)
    start_idx = np.random.randint(0, len(network_input))
    pattern = list((network_input[start_idx] * len(note_to_int)).astype(int).flatten())

    generated = []
    hidden = model.init_hidden(1)

    for _ in range(gen_length):
        seq = np.array(pattern[-SEQUENCE_LENGTH:]).reshape(1, SEQUENCE_LENGTH, 1) / float(len(note_to_int))
        seq_tensor = torch.from_numpy(seq).float().to(DEVICE)

        with torch.no_grad():
            output, hidden = model(seq_tensor, hidden)
            hidden = (hidden[0].detach(), hidden[1].detach())
            probs = nn.functional.softmax(output.view(-1), dim=0).cpu().numpy()
            index = np.random.choice(range(len(note_to_int)), p=probs)

        pattern.append(index)
        generated.append(int_to_note[index])

    output_notes = []
    for token in generated:
        if '.' in token:
            # Chord: dot-separated pitch classes
            parts = token.split('.')
            notes_in_chord = [note.Note(int(p)) for p in parts]
            for n in notes_in_chord:
                n.storedInstrument = instrument.Piano()
            new_chord = chord.Chord(notes_in_chord)
            output_notes.append(new_chord)
        else:
            # Single note, e.g. 'E4'
            new_note = note.Note(token)
            new_note.storedInstrument = instrument.Piano()
            output_notes.append(new_note)

    midi_stream = stream.Stream(output_notes)
    midi_stream.write('midi', fp='generated_music.mid')

The generation process:

  • Loads the trained model and note mappings
  • Selects a random starting sequence from the training data
  • Generates new notes one at a time by:
    • Feeding the current sequence through the model
    • Sampling from the output probabilities (a temperature-based variant is sketched after this list)
    • Adding the new note to the sequence
  • Converts the generated notes back to MIDI format
  • Saves the result as a MIDI file
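
The sampling step is where variety enters; a common variation (not implemented in the script above) scales the logits by a temperature before the softmax, trading safety for surprise:

def sample_with_temperature(logits, temperature=1.0):
    # Lower temperature -> more conservative, repetitive output; higher -> more adventurous
    probs = nn.functional.softmax(logits.view(-1) / temperature, dim=0).cpu().numpy()
    return np.random.choice(len(probs), p=probs)

# e.g. inside the generation loop: index = sample_with_temperature(output, temperature=0.8)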

How It Works

The program works by learning statistical patterns in sequences of musical notes:

  1. Pattern Recognition: The LSTM learns which notes/chords tend to follow other notes/chords
  2. Sequence Prediction: Given a sequence of notes, the model predicts probabilities for the next note
  3. Creative Generation: By sampling from these probabilities and feeding predictions back as input, the model generates new sequences

This approach is similar to how language models generate text, but applied to musical notes instead of words.

Limitations

  • Simple representation: Only captures pitch information (no duration, velocity, etc.)
  • Short-term patterns: Limited by the SEQUENCE_LENGTH parameter
  • Quality depends on training data: Needs diverse, high-quality MIDI files
  • No musical structure: Doesn't explicitly model musical form (verse, chorus, etc.)

Potential Improvements

  • Add timing information: Include note durations in the model (see the sketch after this list)
  • Multi-track generation: Model different instruments/voices
  • Transformer architecture: Replace LSTM with a more modern architecture
  • Conditional generation: Generate in specific styles or keys
  • Post-processing: Apply music theory rules to improve results
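
As a concrete starting point for the first improvement, the extraction loop in load_notes could encode duration alongside pitch, e.g. as a PITCH_DURATION token (a rough sketch using music21's quarterLength; the generation side would then need to split the token back apart when rebuilding the stream):

for element in elements:
    if isinstance(element, note.Note):
        # e.g. 'E4_0.5' for an eighth note (durations measured in quarter notes)
        notes.append(f"{element.pitch}_{element.duration.quarterLength}")
    elif isinstance(element, chord.Chord):
        notes.append('.'.join(str(n) for n in element.normalOrder)
                     + f"_{element.duration.quarterLength}")
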
Final Note: This is a basic implementation that demonstrates the concept. For professional results, consider more sophisticated architectures and larger, curated datasets.
