Youtube Chord OCR


I am working on building a chord classifier using neural networks. In order to train a machine learning algorithm, I need training data, i.e. audio snippets labeled with the correct chord. I thought of 2 sources:

  • Tape myself.
  • Use YouTube play-along videos. These videos, intended for people to play along with, display the chords in real time. Using OCR, I can read the chords in order to correctly label the audio.

Here's an example of the type of video I'm referring to:

This post describes the process of taking YouTube videos and creating a training set. The code can be found on GitHub. You can also view the code in an IPython notebook. I also explain how I used profiling to optimize my code.

Optical Character Recognition

There are several free open source OCR libraries, most notably Tesseract and GNU Ocrad. I ended up using Ocrad because it produced better results on my data.

I decided to focus on one YouTube channel, UkuleleUnderground, so that the videos were all consistent. With some minor modifications, the code can be extended to other channels.

The first step is to use OCR to read all of the characters in the frame. Then I'll find out which one is the active chord.


Images are stored as a 3-dimensional array of integers between 0 and 255: x, y, color (red, green and blue). For OCR, there are a few preprocessing steps that can improve results:

  • Truncating: Remove parts of the image that don't have the text I want to read.
  • Binarizing: Convert pixels that are below a set threshold to 0, and pixels that are above it to 255 (both steps are sketched below).
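
Here is a minimal sketch of both steps, assuming the (x, y, color) array layout described above and using hypothetical values for the crop height and the threshold:

import numpy as np

CHORD_STRIP_HEIGHT = 150   # hypothetical: the chords sit in the top strip of the frame
THRESHOLD = 128            # hypothetical binarization threshold

def preprocess(frame):
    # Truncate: keep only the strip of the image containing the chord text,
    # assuming the (x, y, color) layout described above.
    cropped = frame[:, :CHORD_STRIP_HEIGHT]
    # Binarize: pixels below the threshold become 0, the rest become 255.
    return np.where(cropped < THRESHOLD, 0, 255)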

The original frame looks like this:


After preprocessing, you get:



Now I can run the actual OCR program on that image. The output is a text file with information about the characters found in the image:

  • A box around each character found
  • What the character is (or when unsure, what the possibilities are)
# Ocr Results File. Created by GNU Ocrad version 0.24
source file ocr.PPM
total text blocks 1
text block 1 0 0 1280 150
lines 1
line 1 chars 20 height 31
67  19 33 39; 2, 'C'1, 'c'0
105  33 35 25; 1, 'm'0
143  34 22 24; 1, 'a'0
164  25 12 42; 1, 'j'0
180  26 21 32; 1, '7'0
201  28 221 30; 2, ' '1, '   '0
422  20 32 38; 1, 'D'0
458  34 34 24; 1, 'm'0
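
This results file is produced by calling the OCR program as a subprocess. A minimal sketch, assuming the preprocessed frame has already been written to ocr.PPM and that -x (Ocrad's ORF export option) behaves as in version 0.24; check ocrad --help if your version differs:

import subprocess

# Run Ocrad on the preprocessed frame and export the OCR Results File shown above.
subprocess.run(["ocrad", "-x", "ocr_results.txt", "ocr.PPM"], check=True)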

In this case, I'm interested in the chords, so I need to combine consecutive characters that form a chord. Usually the chords are separated by a space, but not always, so first I read all of the characters into a string:
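
A minimal sketch of that step, assuming the Ocrad results format shown above; it also keeps one box per character (char_boxes), which will be needed below to locate each chord in the image:

import re

# Character lines in the results file look like: "67  19 33 39; 2, 'C'1, 'c'0"
# i.e. a box (x, y, width, height), the number of guesses, then the guesses.
CHAR_LINE_REGEX = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+);\s*\d+, '(.)'")

ocr_results_string = ""
char_boxes = []  # one (x, y, width, height) box per character in the string
with open("ocr_results.txt") as results_file:
    for line in results_file:
        match = CHAR_LINE_REGEX.match(line)
        if match:
            x, y, w, h = (int(value) for value in match.groups()[:4])
            ocr_results_string += match.group(5)   # best guess for the character
            char_boxes.append((x, y, w, h))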


Then use regex to match the chords:

import re
NOTE_REGEX = re.compile(r"(([A-G]|[ABEG]♭|[CF]♯?)(maj|min|[Mm+°])?6?(aug|d[io]m|ø)?7?)")

chords = []
for note_match in NOTE_REGEX.finditer(ocr_results_string):
    chords.append(note_match.group(1))

This gives us the list of chords:

['Cmaj7', 'Dm', 'Em', 'Dm', 'Em', 'Dm']

I also need to know which chord is currently active. Inactive chords are written in black, which means all the color channels are at 0, while the active chord is written in blue, which means red and green are 0 but blue is maxed out at 255. Thus, if we look at a box containing all of a chord's characters and count how many black pixels there are compared to how many non-white pixels, we will find:

  • For the active chord: very few black pixels but many non-white pixels.
  • For the inactive chords: As many black pixels as non-white pixels.

This gives us a simple way to select the active chord. The first step is calculating a box around each chord. Fortunately, the OCR program provides a box around each character, so I just need to combine the boxes of the characters belonging to the same chord:

for note_match in NOTE_REGEX.finditer(ocr_results_string):
    ...  # merge the boxes of the characters covered by this match (sketched below)
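
A minimal sketch of that merging, assuming the char_boxes list from the parsing sketch above (one (x, y, width, height) box per character of ocr_results_string):

def merge_boxes(boxes):
    # Union of several (x, y, width, height) boxes.
    x_min = min(b[0] for b in boxes)
    y_min = min(b[1] for b in boxes)
    x_max = max(b[0] + b[2] for b in boxes)
    y_max = max(b[1] + b[3] for b in boxes)
    return [x_min, y_min, x_max - x_min, y_max - y_min]

note_boxes = []
for note_match in NOTE_REGEX.finditer(ocr_results_string):
    start, end = note_match.span()
    note_boxes.append(merge_boxes(char_boxes[start:end]))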

This gives us the list of boxes around the chords:

[[67, 19, 134, 48],
  [422, 20, 70, 38],
  [715, 20, 67, 38],
  [920, 20, 70, 38],
  [1032, 20, 68, 38],
  [1142, 20, 70, 38]]

Now I can calculate the ratio of non-white to black pixels inside each box:

[1044.5, 1.0, 1.0, 1.0, 1.0, 1.0]

I expect all chords except 1 to have a value around 1, and 1 chord - the active one - to be larger. In this case, the active chord is the first one, i.e. Cmaj7.
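
A minimal sketch of that selection, assuming image is the preprocessed (truncated and binarized) frame with the (x, y, color) layout described earlier, and that the ratio shown above is non-white pixels divided by black pixels:

import numpy as np

ratios = []
for x, y, w, h in note_boxes:
    box = np.mean(image[x:x + w, y:y + h], axis=2)   # grayscale box around one chord
    black_count = np.sum(box == 0)
    non_white_count = np.sum(box != 255)
    ratios.append(non_white_count / max(black_count, 1))  # guard against division by zero

active_chord = chords[int(np.argmax(ratios))]   # 'Cmaj7' in the example above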

Create Training Data

Now that I know which chord is active in a frame, I can just loop through each frame in the video and record its active chord. If a video has 7200 frames, I'll get a list of length 7200 containing, for each frame, the active chord or False if there wasn't any. For example, 40 frames of one of the videos look like this:

['Em', 'Em', 'Em', 'Em', 'Em', 'Em', 'Em', 'Em', 'Em', 'Em', 'Em', 'Dm', 'Dm', 'Dm', 'Dm', 'Dm', 'Dm', 'Dm', 'Dm', 'Dm', 'Dm', 'Dm', False, False, False, False, False, False, False, False, False, 'Cmaj7', 'Cmaj7', 'Cmaj7', 'Cmaj7', 'Cmaj7', 'Cmaj7', 'Cmaj7', 'Cmaj7', 'Cmaj7']

I'm interested in intervals with the same chord, so I can write this as a list with the first and last frame of a given chord:

[ ((602, 620), 'Em'), ((621, 631), 'Dm'), ((641, 697), 'Cmaj7'),]
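
A minimal sketch of that conversion, assuming frame_chords is the per-frame list shown above (a chord name, or False when no active chord was found):

from itertools import groupby

def chords_to_intervals(frame_chords):
    # Collapse runs of identical values into ((first_frame, last_frame), chord) pairs.
    intervals = []
    position = 0
    for chord, run in groupby(frame_chords):
        length = len(list(run))
        if chord:   # skip runs of False (no active chord)
            intervals.append(((position, position + length - 1), chord))
        position += length
    return intervals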

Now I can do some cleanup on this list. Sometimes 1 frame will not be read correctly, leading to a break in an interval. Sometimes, a chord will be misread leading to 1 bad chord. For example:

# apparent break in continuous chord
[ ((7053, 7064), 'E'), ((7070, 7079), 'E'), ]

# wrong chord - D should be Dm7 or False
[ ((3319, 3329), 'Dm7'), ((3330, 3330), 'D'), ((3338, 3395), 'Cmaj7'), ]

To remove these errors, I merge intervals that are the same chord and separated by only a few false frames, and drop intervals that are too short.
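
A minimal sketch of this cleanup, with hypothetical thresholds for the maximum gap and the minimum interval length:

MAX_GAP = 10      # hypothetical: merge same-chord intervals separated by at most this many frames
MIN_LENGTH = 5    # hypothetical: drop intervals shorter than this many frames

def clean_intervals(intervals):
    merged = []
    for (start, end), chord in intervals:
        if merged and merged[-1][1] == chord and start - merged[-1][0][1] <= MAX_GAP:
            # Same chord after a short break: extend the previous interval.
            (previous_start, _), _ = merged[-1]
            merged[-1] = ((previous_start, end), chord)
        else:
            merged.append(((start, end), chord))
    # Drop intervals that are too short (e.g. a single misread frame).
    return [interval for interval in merged if interval[0][1] - interval[0][0] + 1 >= MIN_LENGTH]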

Now all that's left to do is extract the audio from the video and split it according to the intervals found.
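
A minimal sketch of that extraction using the ffmpeg command line tool (not necessarily what the original code uses), assuming a 30 frames-per-second video:

import subprocess

FPS = 30   # assumed frame rate of the video

def extract_audio_clip(video_path, start_frame, end_frame, out_path):
    # Cut the audio (no video, -vn) for one chord interval.
    start = start_frame / FPS
    duration = (end_frame - start_frame + 1) / FPS
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ss", str(start),
                    "-t", str(duration), "-vn", out_path], check=True)

intervals = clean_intervals(chords_to_intervals(frame_chords))
for i, ((start, end), chord) in enumerate(intervals):
    extract_audio_clip("video.mp4", start, end, "{}_{}.wav".format(chord, i))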

This gives me a large and varied labeled data set.

Increase Performance using Profiling

Because of the length of each video - a few thousand frames - and the number of videos I plan on analyzing (and because optimizing code is fun), I want a fairly fast analysis. In order to see how long the code takes, and which parts take the longest, I use a technique called profiling. Profiling gives detailed statistics on all Python functions executed during the profiled run. In Python, the standard library package cProfile makes profiling very easy:

import cProfile
cProfile.run('foo()')

In order to easily navigate the profiler output, I used a Python package called cprofilev.
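
cprofilev serves the profiler output as a browsable web page. As an alternative that stays in the terminal, the standard library's pstats module can sort and inspect the same data (a sketch, with a hypothetical analyze_video entry point):

import cProfile
import pstats

cProfile.run("analyze_video('video.mp4')", "profile.out")   # save the stats to a file
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)               # top 20 functions by cumulative time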

Original Performance

Let's run the profiler when analyzing 500 frames:


It takes about 55 ms to analyze 1 frame, so for a 4-minute video at about 30 frames per second (7200 frames), it takes nearly 7 minutes.

Most of the time is spent waiting for the OCR subprocess to finish (posix.waitpid and related calls combined take up 21.5 s of the 27.7 s total). In order to speed up the code, ideally I would call the OCR subprocess less frequently.

Optimization 1

In general, consecutive frames have the same chords. So by first checking whether the image changed, and only running OCR when it did, I can decrease the number of OCR calls, and thus the total time. To check whether two frames are the same, I can simply take the difference between all the pixels and check that the differences are all zero.

Let's see what effect that had:


The total time is down to 3.4 s, an 88% reduction! We can also see that it takes almost 0.9 s just to read the frames, so it won't be possible to speed up the code much more. Now, the largest time sink is count_nonzero, which is used to count the differences between frames for the check introduced above. In addition, counting the black and non-white pixels takes a lot of time too (0.44 s). Both were implemented using numpy:

import numpy as np

# count differences:
diff_image = np.subtract(image_arr,previous_frame)
nb_differences = np.count_nonzero(diff_image)

# count black and non white pixels:
sub_section = np.mean(sub_section, axis=2)
black_count = np.sum(sub_section == np.array(0))
non_white_count = np.sum(sub_section != np.array(255))

Optimization 2

NumPy implements much of its computation in C in order to be fast; however, each call has a bit of overhead. In addition, counting the differing pixels requires 2 passes through the array (first to calculate the differences, then to count the non-zero entries), while counting the black and non-white pixels requires going through the array three times (first to compute the mean of the colors, then to count the black pixels, and finally to count the non-white pixels). Let's try to speed them up a bit more using numba. Numba is a Python package which, from the overview:

generates optimized machine code from pure Python code using the LLVM compiler infrastructure.

It makes it very easy to write very fast code. I'll combine all the passes through each array into a single function per calculation.

from numba import jit
# returns the counts of non-white pixels (mean != 255) and black pixels (mean == 0), after averaging over the color dimension
def black_count_non_white_count(arr):
    black_count = 0
    non_white_count = 0
    D0, D1, D2 = arr.shape
    for i in range(D0):
        for j in range(D1):
            val = 0
            for k in range(D2):
                val += arr[i,j,k]
            val /= 3
            if val != 255:
                non_white_count += 1
                if val == 0:
                    black_count += 1
    return non_white_count, black_count

def count_differences(arr1, arr2):
    count = 0
    D0, D1, D2 = arr1.shape
    for i in range(D0):
        for j in range(D1):
            for k in range(D2):
                if arr1[i,j,k] != arr2[i,j,k]:
                    count += 1
    return count

# this function will appear in the profiler
def black_count_non_white_count_util(arr):
    return black_count_non_white_count_numba(arr)

# this function will appear in the profiler
def count_differences_util(arr1, arr2):
    return count_differences_numba(arr1, arr2)

black_count_non_white_count_numba = jit(black_count_non_white_count)
count_differences_numba = jit(count_differences)

The total time is down to 2.56 s for 500 frames. Counting the differences now takes only 0.76 s versus 0.9 s before (and the earlier figure did not include the time spent computing the image differences), and counting the black and non-white pixels takes only 0.09 s, down from 0.44 s.

Optimization Results

After this little bit of optimization, 1 frame takes 5.1ms, down from 55ms, so a full video takes just under 40s.

At this point, I'm satisfied with the execution time. Further optimization would yield smaller performance gains and take longer to implement. Still, the final profiling run hints at what to pursue if further optimization were required:

  • Continue to speed up the detection of changes in the frames: only count the differences up to a threshold (see the sketch after this list), and maybe check for changes in a smaller subsection of the image.
  • Reduce the size of the image used with OCR (by stripping the empty borders).
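
For the first idea, a sketch of an early-exit version of count_differences, stopping as soon as a hypothetical threshold of differing pixels is exceeded:

from numba import jit

@jit(nopython=True)
def frames_differ(arr1, arr2, threshold):
    # Return True as soon as more than `threshold` pixels differ, without
    # scanning the rest of the image.
    count = 0
    D0, D1, D2 = arr1.shape
    for i in range(D0):
        for j in range(D1):
            for k in range(D2):
                if arr1[i, j, k] != arr2[i, j, k]:
                    count += 1
                    if count > threshold:
                        return True
    return False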