Nolan: Status Report 2

Nolan Hiehle

Capstone Status Report 2

 

This week we had a very helpful meeting with Professor Dannenberg. In it, we described our research up to this point and some of the strategies we were considering. Professor Dannenberg suggested we use chroma features, an audio processing method that turns a spectrogram into a quantized vector over the 12 pitch classes of the musical scale. While this obscures things like instrumentation, it's a solid way to turn raw audio (mp3 files or whatever) into an informative musical representation to work with.
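As a rough sketch of what this looks like in practice (this assumes the librosa library and a placeholder file name, not anything we've committed to yet):

import librosa

# Load raw audio (mp3, wav, etc.) as a mono signal.
y, sr = librosa.load("some_song.mp3", mono=True)

# Compute a chromagram: one 12-bin pitch-class vector per analysis frame.
# Each column collapses the spectrogram into energy per note (C, C#, ..., B),
# discarding octave and instrument/timbre information.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
print(chroma.shape)  # (12, number_of_frames)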

It seems that melody identification research is fairly sparse (at least with regard to extracting a melody from a raw audio file, as opposed to a MIDI file), so we're currently pursuing an algorithm that matches a sung query against the entire song, instruments and all, rather than extracting a melody and matching against that.

Professor Dannenberg suggested computing chroma features for the songs to be matched against, computing chroma features for the sung query, and then building a similarity matrix that compares every point in time of the query against every point in time of the song. Some sort of pattern recognition could then be applied to this matrix to look for a match between the sung query and some region of the original song.
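A minimal sketch of that cross-similarity matrix, assuming both inputs are (12, n_frames) chromagrams like the one above (cosine similarity between frames is our own assumption here, not a prescribed choice):

import numpy as np

def cross_similarity(query_chroma, song_chroma, eps=1e-8):
    # Normalize each frame (column) to unit length so the dot product
    # becomes cosine similarity between pitch-class profiles.
    q = query_chroma / (np.linalg.norm(query_chroma, axis=0, keepdims=True) + eps)
    s = song_chroma / (np.linalg.norm(song_chroma, axis=0, keepdims=True) + eps)
    # Entry [i, j] = similarity between query frame i and song frame j.
    return q.T @ s

If the query really does occur somewhere in the song, we'd hope to see it as a bright diagonal stripe in this matrix, which is what any later pattern recognition step would be hunting for.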

One drawback is that this method is not key-invariant: for example, a singer singing all the correct intervals between notes but starting on the wrong note (very easy to do for someone who does not have perfect pitch and does not know the song well) would not generate a match, since we're matching pitches directly against each other. We do have the option of rotating the sung chroma vector through all 12 transpositions and comparing against each, but it's possible this could generate a lot of noise.
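A sketch of that rotation idea, assuming the same chromagram layout as above: shifting all 12 pitch-class bins by k semitones transposes the query into another key.

import numpy as np

def all_transpositions(query_chroma):
    # Yield the query chromagram in each of the 12 possible keys.
    for k in range(12):
        yield k, np.roll(query_chroma, k, axis=0)

Matching would then compare the song against every transposed query and keep the best-scoring shift, at the cost of 12x the comparisons (and, as noted, potentially more spurious matches).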

Similarly, this method is sensitive to tempo. Someone singing the right notes, but a little too fast or too slow, could very easily fail to match. This is partly a consequence of how chroma features work: we get a sequence of 12-bin pitch vectors at some sampling rate (maybe a few frames per second?), but each frame is just a snapshot of ALL notes present in the song at that moment, with no notion of a melodic line or of a note changing. Because of this, a sped-up version of a melody can look quite different from a slower version. We could try to brute-force this problem by also running our search on some number of sped-up and slowed-down chroma features of the query.
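A sketch of that brute-force idea, again assuming librosa; the particular set of stretch rates is just an illustration (rates above 1.0 speed the query up, below 1.0 slow it down):

import librosa

def stretched_chromas(query_audio, sr, rates=(0.8, 0.9, 1.0, 1.1, 1.2)):
    # Time-stretch the raw query audio at several fixed rates, then compute
    # a chromagram for each version to be matched independently.
    for rate in rates:
        stretched = librosa.effects.time_stretch(query_audio, rate=rate)
        yield rate, librosa.feature.chroma_stft(y=stretched, sr=sr)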

The sung query will also contain harmonics in its chroma vectors, so it may be a good idea to remove these; this is something we should determine through testing.

 

At this point, we need to build this system and test it. We'll start by inspecting similarity matrices manually to see whether there are promising results before we build a pattern recognition system on top of them. Some researchers have achieved very good results using convolutional neural networks on similarity matrices for cover song identification, but that could be challenging to tune to an acceptable standard (and might just be overkill for a demo-based capstone).

 
