Chord Classification using Neural Networks

I'm currently working on classifying chords from audio using neural networks. This post gives an overview of the project, how it works, and (soon) showcases the final product.

This is a work in progress, and will be updated regularly.


This project aims to:

  • Classify chords played live into a microphone.
  • Classify audio from a video hosted online and display the chords as the video plays.

I'm going to use neural networks for this, mainly because I want to develop a working knowledge of deep learning.

Obtaining Training Data

In order to train a machine learning algorithm, I need training data, i.e. audio snippets labeled with the correct chord. I thought of two sources:

  • Record myself.
  • Use YouTube play-along videos. These videos, intended for people to play along with, display the chords in real time. I can read these chords in order to correctly label the audio.

Here's an example of the type of video I'm referring to:

For details on how this works, see this post about YouTube chord OCR, or see the code on GitHub.

Using these two techniques, I obtained several hours of labeled audio.

Training a Classifier

Notes, and by extension chords, are directly related to the frequencies of a signal. Therefore, the features are based on the Fourier transform of the signal. I use a short-time Fourier transform of the incoming signal, keeping only frequencies that can be produced by a ukulele. A detailed description and code are available in this IPython notebook.
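As a sketch of this feature step (the window size and the ukulele frequency band below are illustrative assumptions on my part, not the notebook's exact values):

```python
import numpy as np
from scipy import signal as sps

def stft_features(samples, sample_rate, f_lo=200.0, f_hi=5000.0):
    # Short-time Fourier transform magnitudes, keeping only the
    # frequency bins a ukulele can plausibly produce.
    freqs, times, Zxx = sps.stft(samples, fs=sample_rate, nperseg=2048)
    keep = (freqs >= f_lo) & (freqs <= f_hi)
    return freqs[keep], times, np.abs(Zxx[keep, :])

# A pure 440 Hz tone should concentrate its energy near the 440 Hz bin.
sr = 22050
t = np.arange(sr) / sr
freqs, times, mags = stft_features(np.sin(2 * np.pi * 440.0 * t), sr)
```

Each column of the magnitude array is then a feature vector for one time window.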

First, I trained the classifier on chord-only recordings and used cross validation to evaluate its performance, then tested it on songs. I trained a support vector machine and several different neural network architectures.
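The cross validation step can be sketched directly with scikit-learn (x and y below are a made-up stand-in for the real STFT feature matrix and chord labels, just so the snippet runs):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-in data: 120 feature vectors of 64 frequency bins, each
# labelled with one of 4 chords; bin y[i] is boosted so the toy
# problem is actually learnable.
rng = np.random.RandomState(0)
x = rng.rand(120, 64)
y = rng.randint(0, 4, size=120)
x[np.arange(120), y] += 2.0

# 5-fold cross validation of a linear SVM on the labeled snippets.
scores = cross_val_score(LinearSVC(C=1), x, y, cv=5)
print(scores.mean())
```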


Once the data is split into training and test sets, training the SVM with scikit-learn is accomplished using:

svm = sklearn.svm.LinearSVC(C=1)
svm.fit(x_train, y_train)
print(metrics.classification_report(y_test, svm.predict(x_test)))

             precision    recall  f1-score   support

avg / total       0.95      0.95      0.95      1392

The SVM performs quite well. However, when tested on songs, performance was much poorer.

print(metrics.classification_report(all_labels, all_predictions))

             precision    recall  f1-score   support

avg / total       0.69      0.48      0.53       404
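I won't reproduce the neural network architectures here, but as a minimal stand-in, scikit-learn's MLPClassifier fits the same workflow (the data below is synthetic, purely to make the snippet runnable; the actual architectures differ):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

# Synthetic stand-in for STFT features / chord labels: 200 vectors of
# 64 frequency bins, one of 4 chords, with bin y[i] boosted so the
# labels are learnable.
rng = np.random.RandomState(0)
x = rng.rand(200, 64)
y = rng.randint(0, 4, size=200)
x[np.arange(200), y] += 2.0

x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

# One hidden layer of 32 units as a baseline fully connected network.
nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
nn.fit(x_train, y_train)
print(metrics.classification_report(y_test, nn.predict(x_test)))
```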


To normalize the signal:

# Divides the signal by the absolute value of its highest peak.
# If there is no peak (all zeros), returns False.
def normalize(signal):
    peak = max(abs(min(signal)), abs(max(signal)))
    if peak == 0:
        return False
    signal /= peak
    return signal

An equivalent version using the infinity norm, with an in-place option:

# Divides the signal by the absolute value of its highest peak.
# If there is no peak (all zeros), returns False.
# If inplace, the signal is modified in place and None is returned.
def normalize(signal, inplace=True):
    max_norm = scipy.linalg.norm(signal, np.inf)
    if max_norm == 0:
        return False
    if inplace:
        np.divide(signal, float(max_norm), signal)
        return None
    return np.divide(signal, float(max_norm))
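As a quick sanity check of what peak normalization does (the array values are just an illustration):

```python
import numpy as np

samples = np.array([0.5, -2.0, 1.0])
peak = np.abs(samples).max()  # highest absolute peak: 2.0
samples /= peak
# samples is now [0.25, -1.0, 0.5]; the largest absolute value is 1.
```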