A Tutorial Series for Software Developers, Data Scientists, and Data Center Managers
This is the 22nd article in the Hands-On AI Developer Journey Tutorial Series. It focuses on the first steps in creating a deep learning model for music generation: choosing an appropriate model and preprocessing the data.
This project uses the BachBot* model1 to harmonize a melody that has been through the emotion-modulation algorithm.
Music Generation—Thinking About the Problem
The first step in solving many problems with artificial intelligence is reducing them to a fundamental problem that artificial intelligence is known to solve. One such problem is sequence prediction, which is used in translation and natural language processing applications. Our task of music generation can be reduced to a sequence prediction problem, in which we predict a sequence of musical notes.
Choosing a Model
There are several types of neural networks to consider as a model: feedforward neural networks, recurrent neural networks (RNNs), and long short-term memory (LSTM) networks.
Neurons are the basic abstractions that are combined to form neural networks. Essentially, a neuron is a function that takes in an input and returns an output.
Figure 1: A neuron1.
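At its simplest, a neuron can be sketched as a weighted sum of its inputs passed through a nonlinear activation function. The following is an illustrative sketch, not BachBot code:

```python
import math

def neuron(weights, bias, inputs):
    """A single neuron: weighted sum of inputs plus bias, squashed by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

# A neuron with two inputs; zero weights and bias give sigmoid(0) = 0.5
print(neuron([0.0, 0.0], 0.0, [1.0, 2.0]))  # 0.5
```

The sigmoid here is one common choice of activation; tanh and ReLU are equally valid.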
Layers of neurons that take in the same input and have their outputs concatenated can be combined to make a feedforward neural network. A feedforward network achieves strong performance through the composition of nonlinear activation functions across many layers (a property described as deep).
Figure 2: A feedforward neural network1.
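The layer-stacking idea can be sketched in a few lines: every neuron in a layer sees the same input vector, and the network composes layers by feeding each layer's output vector into the next. This is a minimal illustration with made-up weights, not BachBot code:

```python
import math

def layer(weight_matrix, biases, inputs):
    """One feedforward layer: every neuron sees the same inputs;
    their outputs are concatenated into a vector."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
        for row, b in zip(weight_matrix, biases)
    ]

def feedforward(layers, inputs):
    """A deep network: compose layers, feeding each layer's output to the next."""
    for weight_matrix, biases in layers:
        inputs = layer(weight_matrix, biases, inputs)
    return inputs

# Two layers: 3 inputs -> 2 hidden units -> 1 output.
net = [
    ([[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]], [0.0, 0.0]),
    ([[0.7, -0.8]], [0.0]),
]
out = feedforward(net, [1.0, 2.0, 3.0])
```

Note that the first layer fixes the input dimension (here, 3), which is exactly the limitation for variable-length music discussed next.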
A feedforward neural network works well in a wide variety of applications. However, one drawback prevents it from being useful for the music composition (sequence prediction) task: it requires a fixed input dimension, while music can vary in length. Furthermore, feedforward neural networks do not account for previous inputs, which makes them poorly suited to sequence prediction. A model better suited to this task is the recurrent neural network (RNN).
RNNs solve both of these issues by introducing connections between the hidden nodes so that the nodes in the next time step can receive information from the previous time step.
Figure 3: An unrolled view on an RNN1.
As can be seen in the figure, each neuron now takes in both an input from the previous layer and its own hidden state from the previous time step.
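The recurrence can be sketched with a single (scalar) hidden unit: at each time step, the new hidden state mixes the current input with the hidden state carried over from the previous step. Weights here are arbitrary, for illustration only:

```python
import math

def rnn_step(w_in, w_rec, bias, x_t, h_prev):
    """One RNN time step: the new hidden state combines the current
    input with the hidden state from the previous time step."""
    z = w_in * x_t + w_rec * h_prev + bias
    return math.tanh(z)

# Run a scalar RNN over a short input sequence, carrying state forward.
h = 0.0
for x in [1.0, 0.5, -1.0]:
    h = rnn_step(0.8, 0.5, 0.0, x, h)
```

Because `h` is threaded through every step, the output at each time step depends on the whole history of inputs, which is what a feedforward network cannot do.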
A technical problem faced by RNNs with longer input sequences is the vanishing gradient problem, meaning that the influence from earlier time steps is quickly lost. This is a problem in music composition because there are important long-term dependencies that need to be captured.
A modification to the RNN called long short-term memory (LSTM) solves the vanishing gradient problem. It does this by introducing memory cells that are carefully controlled by three types of gates. See Understanding LSTM Networks3 for details.
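The standard LSTM update can be sketched for a single (scalar) unit. Three sigmoid gates control the memory cell: the forget gate scales the old cell state, the input gate scales the candidate update, and the output gate scales what is exposed as the hidden state. The parameters below are arbitrary placeholders, not trained values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(p, x_t, h_prev, c_prev):
    """One scalar LSTM step with forget (f), input (i), and output (o) gates."""
    f = sigmoid(p['wf'] * x_t + p['uf'] * h_prev + p['bf'])  # forget gate
    i = sigmoid(p['wi'] * x_t + p['ui'] * h_prev + p['bi'])  # input gate
    o = sigmoid(p['wo'] * x_t + p['uo'] * h_prev + p['bo'])  # output gate
    c_tilde = math.tanh(p['wc'] * x_t + p['uc'] * h_prev + p['bc'])  # candidate
    c_t = f * c_prev + i * c_tilde  # additive cell update eases gradient flow
    h_t = o * math.tanh(c_t)
    return h_t, c_t

params = {k: 0.5 for k in ('wf', 'uf', 'bf', 'wi', 'ui', 'bi',
                           'wo', 'uo', 'bo', 'wc', 'uc', 'bc')}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(params, x, h, c)
```

The key difference from a plain RNN is the additive update of the cell state `c_t`: gradients can flow along the cell without repeatedly passing through squashing nonlinearities, which is what preserves long-term dependencies.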
Thus, BachBot proceeded by using an LSTM model.
Music is a very complex art form and includes dimensions of pitch, rhythm, tempo, dynamics, articulation, and others. To simplify music for the purpose of this project, only pitch and duration were considered. Furthermore, each chorale was transposed to the key of C major or A minor, and note lengths were time quantized (rounded) to the nearest semiquaver (16th note). These steps were taken to reduce the complexity and improve performance while preserving the essence of the music. Key and time normalizations were done using the music21* library4.
"""Converts into the key of C major or A minor.
Adapted from https://gist.github.com/aldous-rey/68c6c43450517aa47474
# conversion tables: e.g. Ab -> C is up 4 semitones, D -> A is down 5 semitones
majors = dict([("A-", 4),("A", 3),("B-", 2),("B", 1),("C", 0),("C#",-1), ("D-", -1),("D", -2),("E-", -3),("E", -4),("F", -5),("F#",6), ("G-", 6), ("G", 5)])
minors = dict([("A-", 1),("A", 0),("B-", -1),("B", -2),("C", -3),("C#",-4), ("D-", -4),("D", -5),("E-", 6),("E", 5),("F", 4),("F#",3), ("G-",3),("G", 2)])
# transpose score
key = score.analyze('key')
if key.mode == "major":
halfSteps = majors[key.tonic.name]
elif key.mode == "minor":
halfSteps = minors[key.tonic.name]
tScore = score.transpose(halfSteps)
# transpose key signature
for ks in tScore.flat.getKeySignatures():
Figure 4: Code to standardize key signatures of the corpus into either C major or A minor2.
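To make the table lookup concrete, here is a small standalone sketch, independent of music21, that applies the same semitone offsets as Figure 4 directly to MIDI pitch numbers (the function name is ours, for illustration):

```python
# Semitone offsets from Figure 4: how far to move each tonic to reach C (major) or A (minor).
majors = {"A-": 4, "A": 3, "B-": 2, "B": 1, "C": 0, "C#": -1, "D-": -1,
          "D": -2, "E-": -3, "E": -4, "F": -5, "F#": 6, "G-": 6, "G": 5}
minors = {"A-": 1, "A": 0, "B-": -1, "B": -2, "C": -3, "C#": -4, "D-": -4,
          "D": -5, "E-": 6, "E": 5, "F": 4, "F#": 3, "G-": 3, "G": 2}

def transpose_midi(pitches, tonic, mode):
    """Shift every MIDI pitch by the interval that moves the piece into C major/A minor."""
    half_steps = (majors if mode == "major" else minors)[tonic]
    return [p + half_steps for p in pitches]

# A D-major triad (D, F#, A) moves down 2 semitones to a C-major triad (C, E, G).
print(transpose_midi([62, 66, 69], "D", "major"))  # [60, 64, 67]
```

A piece already in A minor looks itself up in `minors` and is shifted by 0 semitones, i.e., left unchanged.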
Quantizing to the nearest semiquaver was done using the music21 function Stream.quantize(). Below is a comparison of statistics about the dataset before and after preprocessing:
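Conceptually, quantizing to the nearest semiquaver means rounding every offset and duration to the nearest multiple of 0.25 quarter notes. A minimal standalone sketch of that rounding (not the music21 implementation):

```python
def quantize_to_semiquaver(quarter_length):
    """Round a duration or offset (in quarter notes) to the nearest 16th note (0.25)."""
    return round(quarter_length * 4) / 4

# Slightly rushed or dragged note values snap onto the semiquaver grid.
print([quantize_to_semiquaver(v) for v in [0.26, 0.74, 1.01]])  # [0.25, 0.75, 1.0]
```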
Figure 5: Use of each pitch class before (left) and after preprocessing (right). Pitch class refers to pitch without regard for octave1.
Figure 6: Note occurrence positions before (left) and after preprocessing (right)1.
As can be seen in Figure 5, transposition into C major and A minor had a large impact on the pitch classes used in the corpus. In particular, there are increased counts for the pitches in C major and A minor (C, D, E, F, G, A, B). There are smaller peaks at F# and G# due to their presence in the ascending form of the A melodic minor scale (A, B, C, D, E, F#, and G#). On the other hand, time quantization had a considerably smaller effect. This is due to the high resolution of quantization (analogous to rounding to many significant figures).
Once the data had been preprocessed, the chorales needed to be encoded into a format that an RNN can process easily: a sequence of tokens. The BachBot project opted for encoding at the note level (each token represents a note) rather than the chord level (each token represents a chord). This decision reduced the vocabulary size from 128^4 potential chords to 128 potential notes, which improves performance.
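The vocabulary-size reduction is easy to verify: a four-voice chord token would have to distinguish every combination of four MIDI pitches, while a note token only has to distinguish single pitches:

```python
MIDI_PITCHES = 128  # MIDI pitch values 0-127
VOICES = 4          # soprano, alto, tenor, bass

chord_vocab = MIDI_PITCHES ** VOICES  # one token per possible 4-voice chord
note_vocab = MIDI_PITCHES             # one token per possible note

print(chord_vocab)  # 268435456
print(note_vocab)   # 128
```

A softmax over 128 classes is far easier to train than one over roughly 268 million, most of which would never appear in the corpus.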
An original encoding scheme was created for the BachBot project1. A chorale is broken down into semiquaver time steps, which are called frames. Each frame contains a sequence of tuples representing the musical instrument digital interface (MIDI) pitch value of a note and whether it is tied to a previous note at the same pitch: (note, tie). Notes within a frame are ordered by descending pitch (soprano → alto → tenor → bass). A frame may also carry a fermata, represented by (.), which signals the end of a phrase. START and END symbols are appended to the beginning and end of each chorale. These symbols let the model initialize itself and allow the user to determine when a composition is finished.
Figure 7: Example encoding of two chords. Each chord is a quaver in duration, and the second one has a fermata. ‘|||’ represents the end of a frame1.
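As an illustration of the scheme, the frames for two quaver chords (hypothetical pitches, not the exact chords in Figure 7) can be written out directly. Each quaver spans two semiquaver frames, so the second frame of each chord repeats the pitches with the tie flag set:

```python
FRAME_END = '|||'
FERMATA = '(.)'

def encode_frame(pitches, tied, fermata):
    """One semiquaver frame: optional fermata marker, then (pitch, tie) tuples
    ordered soprano -> alto -> tenor -> bass, closed by the frame delimiter."""
    frame = [FERMATA] if fermata else []
    frame += [(p, tied) for p in pitches]
    frame.append(FRAME_END)
    return frame

chord1 = [72, 67, 64, 60]  # a C-major chord, in descending pitch order
chord2 = [74, 67, 65, 59]  # a G7 chord, carrying a fermata
tokens = (encode_frame(chord1, False, False) + encode_frame(chord1, True, False)
          + encode_frame(chord2, False, True) + encode_frame(chord2, True, True))
```

The resulting `tokens` list contains four frames, each closed by '|||', with the fermata marker appearing in both frames of the second chord.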
def encode_score(score, keep_fermatas=True, parts_to_mask=[]):
    """
    Encodes a music21 score into a List of chords, where each chord is represented with
    a (Fermata :: Bool, List[(Note :: Integer, Tie :: Bool)]).

    If `keep_fermatas` is False, all `has_fermata`s will be False.

    All tokens from parts in `parts_to_mask` will have output tokens `BLANK_MASK_TXT`.

    Time is discretized such that each crotchet occupies `FRAMES_PER_CROTCHET` frames.
    """
    encoded_score = []
    for chord in (score
            .chordify(addPartIdAsGroup=bool(parts_to_mask))
            .flat
            .notesAndRests): # aggregate parts, remove markup
        # expand chord/rest s.t. constant timestep between frames
        if chord.isRest:
            encoded_score.extend((int(chord.quarterLength * FRAMES_PER_CROTCHET)) * [[]])
        else:
            has_fermata = (keep_fermatas) and any(map(lambda e: e.isClassOrSubclass(('Fermata',)), chord.expressions))
            encoded_chord = []
            # TODO: sorts Soprano, Bass, Alto, Tenor without breaking ties
            # c = chord.sortAscending()
            # sorted_notes = [c[-1], c[0]] + c[1:-1]
            # for note in sorted_notes:
            for note in chord:
                if parts_to_mask and note.pitch.groups[0] in parts_to_mask:
                    encoded_chord.append(BLANK_MASK_TXT)
                else:
                    has_tie = note.tie is not None and note.tie.type != 'start'
                    encoded_chord.append((note.pitch.midi, has_tie))
            encoded_score.append((has_fermata, encoded_chord))

            # repeat pitches to expand chord into multiple frames
            # all repeated frames when expanding a chord should be tied
            encoded_score.extend((int(chord.quarterLength * FRAMES_PER_CROTCHET) - 1) * [
                (has_fermata,
                 map(lambda note: BLANK_MASK_TXT if note == BLANK_MASK_TXT else (note[0], True), encoded_chord))])
    return encoded_score
Figure 8: Code used to encode a music21* score using the specified encoding scheme2.
This article discussed some of the early steps in implementing a deep learning model using BachBot as an example. In particular, it discussed the advantages of RNN/LSTM for music composition (which is fundamentally a problem in sequence prediction), and the critical steps of data preprocessing and encoding. Because the steps taken for preprocessing and encoding are different in each project, we hope that the considerations described in this article will be helpful.
Check out the next article for information about training/testing the LSTM model for music generation and how this model is altered to create a model that completes and harmonizes a melody.
For more such Intel® IoT resources and tools from Intel, please visit the Intel® Developer Zone.