Music Generation With PyTorch

LSTM Music Generator Documentation

This document explains a Python program that generates music using an LSTM neural network trained on MIDI files.

Table of Contents

  • Overview
  • Requirements
  • Getting Started
  • Code Explanation
  • How It Works
  • Limitations
  • Potential Improvements

Overview

This program uses a Long Short-Term Memory (LSTM) neural network to learn patterns from MIDI music files and generate new musical sequences. The implementation includes:

  • Loading and parsing MIDI files using music21 library
  • Preprocessing musical notes into sequences for training
  • An LSTM-based neural network architecture
  • Training the model to predict the next note in a sequence
  • Generating new music by sampling from the model's predictions

Requirements

To run this program, you'll need:

  • Python 3.6+
  • Required Python packages:
    • torch (PyTorch)
    • numpy
    • music21
  • Standard-library modules used (no installation needed): glob, pickle
  • MIDI files for training (place them in the same directory as the script)
  • Optional: CUDA-enabled GPU for faster training (PyTorch will automatically use GPU if available)
You can install the required packages using pip:
pip install torch numpy music21
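
After installing, a quick check confirms that the packages import correctly and whether a GPU will be used (optional, not part of the generator script):

import torch
import music21  # verifies the MIDI toolkit is importable

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())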

Getting Started

  1. Place your MIDI files in the same directory as the script (or specify the path)
  2. Run the script: python music_generator.py (a minimal entry-point sketch follows this list)
  3. The script will:
    • Load and process the MIDI files
    • Train the LSTM model
    • Generate a new MIDI file called generated_music.mid
  4. Open the generated MIDI file with any music player or DAW
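
For reference, the two phases are typically wired together in an entry point like this (a minimal sketch; it assumes the train_network and generate_music functions shown in the Code Explanation section):

if __name__ == '__main__':
    # Phase 1: learn from the MIDI files found in the working directory
    train_network()
    # Phase 2: sample a new piece and write generated_music.mid
    generate_music()
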
Note: The quality of generated music depends on:
  • The quantity and quality of training MIDI files
  • The training parameters (epochs, sequence length, etc.)
  • The complexity of the musical patterns in the training data

Code Explanation

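All of the snippets in this section assume the following imports at the top of the script (reconstructed here from the calls the code makes; the exact import style may differ in the original):

import glob
import pickle

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from music21 import converter, instrument, note, chord, stream
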
Configuration

SEQUENCE_LENGTH = 100  # Length of input sequences
BATCH_SIZE = 64        # Number of sequences per batch
EPOCHS = 50            # Number of training epochs
HIDDEN_SIZE = 256      # Size of LSTM hidden layers
NUM_LAYERS = 2         # Number of LSTM layers
LEARNING_RATE = 1e-3   # Learning rate for optimizer
DROPOUT = 0.3          # Dropout rate for regularization
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

These configuration parameters control the training process and model architecture. You can adjust them based on your needs (a rough sizing example follows this list):

  • Increase SEQUENCE_LENGTH to capture longer musical patterns
  • Increase HIDDEN_SIZE and NUM_LAYERS for a more complex model (requires more data and computation)
  • Adjust LEARNING_RATE if training is unstable or too slow
  • Increase EPOCHS for better training (but watch for overfitting)
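
For a rough sense of scale: with the defaults above and a corpus of, say, 10,000 extracted notes (a made-up figure for illustration), the data preparation described below yields:

n_notes = 10_000                          # hypothetical corpus size
n_patterns = n_notes - SEQUENCE_LENGTH    # 9,900 training windows
n_batches = n_patterns // BATCH_SIZE      # 154 full batches per epoch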

Model Architecture

class MusicGenerator(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers, dropout):
        super(MusicGenerator, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        out, hidden = self.lstm(x, hidden)
        out = out[:, -1, :]
        out = self.fc(out)
        return out, hidden

    def init_hidden(self, batch_size):
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size, device=DEVICE),
                torch.zeros(self.num_layers, batch_size, self.hidden_size, device=DEVICE))

The model consists of:

  • LSTM layers: Process sequential data and maintain hidden state
  • Fully connected layer: Maps LSTM output to prediction probabilities
  • Hidden state initialization: Provides starting state for LSTM

The forward pass takes an input sequence and hidden state, processes it through the LSTM, and returns predictions for the next note.
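
A quick shape check makes the data flow concrete (a standalone sketch; the vocabulary size of 358 is just an illustrative value):

model = MusicGenerator(input_size=1, hidden_size=HIDDEN_SIZE,
                       output_size=358, num_layers=NUM_LAYERS,
                       dropout=DROPOUT).to(DEVICE)
x = torch.rand(BATCH_SIZE, SEQUENCE_LENGTH, 1, device=DEVICE)  # (batch, seq_len, features)
hidden = model.init_hidden(BATCH_SIZE)
out, hidden = model(x, hidden)
print(out.shape)        # torch.Size([64, 358]) -- logits, one per vocabulary entry
print(hidden[0].shape)  # torch.Size([2, 64, 256]) -- (num_layers, batch, hidden_size)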

Data Preparation

Loading Notes

def load_notes(midi_path="*.mid"):
    notes = []
    files = glob.glob(midi_path)
    for file in files:
        midi = converter.parse(file)
        try:
            # File has instrument parts: use the first part's elements
            parts = instrument.partitionByInstrument(midi)
            elements = parts.parts[0].recurse()
        except Exception:
            # Fall back to a flat note structure
            elements = midi.flat.notes

        for element in elements:
            if isinstance(element, note.Note):
                notes.append(str(element.pitch))
            elif isinstance(element, chord.Chord):
                notes.append('.'.join(str(n) for n in element.normalOrder))
    return notes

This function:

  • Finds all MIDI files in the specified path
  • Parses each file using music21
  • Extracts notes and chords (chords are represented as dot-separated pitch classes; example output follows this list)
  • Returns a list of all notes/chords in sequence
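
The resulting list mixes single pitches with dot-separated pitch-class strings for chords, for example (illustrative values only):

notes = load_notes()
print(notes[:6])
# ['E4', 'G#4', 'B4', '4.8.11', 'F#4', '2.6.9']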

Preparing Sequences

def prepare_sequences(notes):
    pitchnames = sorted(set(notes))
    note_to_int = {n: i for i, n in enumerate(pitchnames)}
    n_vocab = len(pitchnames)

    network_input = []
    network_output = []
    for i in range(len(notes) - SEQUENCE_LENGTH):
        seq_in = notes[i:i + SEQUENCE_LENGTH]
        seq_out = notes[i + SEQUENCE_LENGTH]
        network_input.append([note_to_int[n] for n in seq_in])
        network_output.append(note_to_int[seq_out])

    n_patterns = len(network_input)
    network_input = np.array(network_input).reshape(n_patterns, SEQUENCE_LENGTH, 1) / float(n_vocab)
    network_output = np.array(network_output)

    return network_input, network_output, note_to_int

This function:

  • Creates a vocabulary of unique notes/chords
  • Maps each note/chord to an integer
  • Creates input sequences of SEQUENCE_LENGTH and corresponding output (next note)
  • Scales the integer inputs into the range [0, 1) by dividing by the vocabulary size (a toy walkthrough follows this list)
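
For intuition, here is the same windowing on a toy input with SEQUENCE_LENGTH = 3 instead of 100 (illustrative only):

notes = ['C4', 'E4', 'G4', 'C5', 'E4', 'G4']
# vocabulary (sorted, unique): ['C4', 'C5', 'E4', 'G4'] -> {'C4': 0, 'C5': 1, 'E4': 2, 'G4': 3}
# window 0: input ['C4', 'E4', 'G4'] -> [0, 2, 3], target 'C5' -> 1
# window 1: input ['E4', 'G4', 'C5'] -> [2, 3, 1], target 'E4' -> 2
# window 2: input ['G4', 'C5', 'E4'] -> [3, 1, 2], target 'G4' -> 3
# the inputs are then reshaped to (3, 3, 1) and divided by 4, landing in [0, 1)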

Training Process

def train_network():
    notes = load_notes()
    network_input, network_output, note_to_int = prepare_sequences(notes)
    n_vocab = len(note_to_int)

    X = torch.from_numpy(network_input).float().to(DEVICE)
    y = torch.from_numpy(network_output).long().to(DEVICE)

    model = MusicGenerator(input_size=1, hidden_size=HIDDEN_SIZE,
                           output_size=n_vocab, num_layers=NUM_LAYERS,
                           dropout=DROPOUT).to(DEVICE)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

    n_batches = (len(X) // BATCH_SIZE)

    model.train()
    for epoch in range(1, EPOCHS+1):
        epoch_loss = 0.0
        hidden = model.init_hidden(BATCH_SIZE)

        for b in range(n_batches):
            start = b * BATCH_SIZE
            end = start + BATCH_SIZE
            inputs = X[start:end]
            targets = y[start:end]

            optimizer.zero_grad()
            outputs, hidden = model(inputs, hidden)
            hidden = (hidden[0].detach(), hidden[1].detach())

            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

        avg_loss = epoch_loss / n_batches if n_batches else 0
        print(f"Epoch {epoch}/{EPOCHS} Loss: {avg_loss:.4f}")

    torch.save(model.state_dict(), 'music_generator.pth')
    with open('note_to_int.pickle', 'wb') as f:
        pickle.dump(note_to_int, f)

The training process:

  • Loads and prepares the data
  • Initializes the model, loss function, and optimizer
  • Trains in batches for the specified number of epochs (a quick loss sanity check follows this list)
  • Saves the trained model and note-to-int mapping
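
A useful sanity check: before any learning, the reported cross-entropy loss should sit near ln(n_vocab), the cost of a uniform guess, and then fall steadily. For example (hypothetical vocabulary size):

import math
n_vocab = 358                 # hypothetical vocabulary size
print(math.log(n_vocab))      # ~5.88 -- expected loss of an untrained model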

Music Generation

def generate_music(model_path='music_generator.pth',
                   note_dict_path='note_to_int.pickle',
                   gen_length=500):
    with open(note_dict_path, 'rb') as f:
        note_to_int = pickle.load(f)
    int_to_note = {i: n for n, i in note_to_int.items()}

    model = MusicGenerator(input_size=1, hidden_size=HIDDEN_SIZE,
                           output_size=len(note_to_int), num_layers=NUM_LAYERS,
                           dropout=0).to(DEVICE)
    model.load_state_dict(torch.load(model_path, map_location=DEVICE))
    model.eval()

    notes = load_notes()
    network_input, _, _ = prepare_sequences(notes)
    start_idx = np.random.randint(0, len(network_input))
    pattern = list((network_input[start_idx] * len(note_to_int)).astype(int).flatten())

    generated = []
    hidden = model.init_hidden(1)

    for _ in range(gen_length):
        seq = np.array(pattern[-SEQUENCE_LENGTH:]).reshape(1, SEQUENCE_LENGTH, 1) / float(len(note_to_int))
        seq_tensor = torch.from_numpy(seq).float().to(DEVICE)

        with torch.no_grad():
            output, hidden = model(seq_tensor, hidden)
            hidden = (hidden[0].detach(), hidden[1].detach())
            probs = nn.functional.softmax(output.view(-1), dim=0).cpu().numpy()
            index = np.random.choice(range(len(note_to_int)), p=probs)

        pattern.append(index)
        generated.append(int_to_note[index])

    output_notes = []
    for token in generated:
        if '.' in token:
            # Chord: dot-separated pitch classes
            parts = token.split('.')
            notes_in_chord = [note.Note(int(p)) for p in parts]
            for n in notes_in_chord:
                n.storedInstrument = instrument.Piano()
            new_chord = chord.Chord(notes_in_chord)
            output_notes.append(new_chord)
        else:
            # Single note, e.g. 'E4'
            new_note = note.Note(token)
            new_note.storedInstrument = instrument.Piano()
            output_notes.append(new_note)

    midi_stream = stream.Stream(output_notes)
    midi_stream.write('midi', fp='generated_music.mid')

The generation process:

  • Loads the trained model and note mappings
  • Selects a random starting sequence from the training data
  • Generates new notes one at a time by:
    • Feeding the current sequence through the model
    • Sampling from the output probabilities (a temperature-based variant is sketched after this list)
    • Adding the new note to the sequence
  • Converts the generated notes back to MIDI format
  • Saves the result as a MIDI file
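
The sampling step is where variety enters; a common variation (not implemented in the script above) scales the logits by a temperature before the softmax, trading safety for surprise:

def sample_with_temperature(logits, temperature=1.0):
    # Lower temperature -> more conservative, repetitive output; higher -> more adventurous
    probs = nn.functional.softmax(logits.view(-1) / temperature, dim=0).cpu().numpy()
    return np.random.choice(len(probs), p=probs)

# e.g. inside the generation loop: index = sample_with_temperature(output, temperature=0.8)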

How It Works

The program works by learning statistical patterns in sequences of musical notes:

  1. Pattern Recognition: The LSTM learns which notes/chords tend to follow other notes/chords
  2. Sequence Prediction: Given a sequence of notes, the model predicts probabilities for the next note
  3. Creative Generation: By sampling from these probabilities and feeding predictions back as input, the model generates new sequences

This approach is similar to how language models generate text, but applied to musical notes instead of words.

Limitations

  • Simple representation: Only captures pitch information (no duration, velocity, etc.)
  • Short-term patterns: Limited by the SEQUENCE_LENGTH parameter
  • Quality depends on training data: Needs diverse, high-quality MIDI files
  • No musical structure: Doesn't explicitly model musical form (verse, chorus, etc.)

Potential Improvements

  • Add timing information: Include note durations in the model (see the sketch after this list)
  • Multi-track generation: Model different instruments/voices
  • Transformer architecture: Replace LSTM with a more modern architecture
  • Conditional generation: Generate in specific styles or keys
  • Post-processing: Apply music theory rules to improve results
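
As a concrete starting point for the first improvement, the extraction loop in load_notes could encode duration alongside pitch, e.g. as a PITCH_DURATION token (a rough sketch using music21's quarterLength; the generation side would then need to split the token back apart when rebuilding the stream):

for element in elements:
    if isinstance(element, note.Note):
        # e.g. 'E4_0.5' for an eighth note (durations measured in quarter notes)
        notes.append(f"{element.pitch}_{element.duration.quarterLength}")
    elif isinstance(element, chord.Chord):
        notes.append('.'.join(str(n) for n in element.normalOrder)
                     + f"_{element.duration.quarterLength}")
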
Final Note: This is a basic implementation that demonstrates the concept. For professional results, consider more sophisticated architectures and larger, curated datasets.
