This week we decided to split our engineering efforts and pursue two song identification strategies in parallel. This will give us a greater chance of success, and it will let our two strategies cross-verify each other.
Our first strategy will draw from Roger Dannenberg’s MUSART query-by-humming work from ~2008. This approach analyzes the signal of each sung sample to extract notes and rhythms, then matches those notes and rhythms against a library of MIDI files using some of the algorithms from the paper. To do this, we need to extract the melody from each MIDI file. The query-by-humming paper we’re referencing used a program called ThemeExtractor, which analyzes songs and searches for repeated patterns, returning a list of potential melodies. Unfortunately, ThemeExtractor is no longer available. We have found a CMU researcher (thanks, Professor Dannenberg) who is currently working on automated MIDI analysis and has a program that should be able to extract melodies. This method has the advantage of being key-invariant and tempo-invariant: a user who sings too fast and starts on the wrong note should still be able to find a match.
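To make the key- and tempo-invariance idea concrete, here is a minimal sketch. It assumes each melody is already a list of (MIDI pitch, duration in seconds) pairs, and it encodes the melody as pitch intervals and duration ratios between consecutive notes. The function names `relative_encoding` and `same_melody` are our own illustrations, and this is not necessarily the representation the MUSART paper uses.

```python
from math import isclose, log2


def relative_encoding(notes):
    """Encode a melody as changes between consecutive notes.

    notes: list of (midi_pitch, duration_seconds) tuples.
    Returns a list of (interval_in_semitones, log2_duration_ratio) pairs.
    Transposing the melody or scaling its tempo leaves this encoding unchanged.
    """
    encoded = []
    for (pitch_a, dur_a), (pitch_b, dur_b) in zip(notes, notes[1:]):
        encoded.append((pitch_b - pitch_a, log2(dur_b / dur_a)))
    return encoded


def same_melody(a, b, tol=1e-6):
    """Exact comparison of two relative encodings (hypothetical helper).

    A real matcher would need to tolerate wrong, missing, and extra notes.
    """
    enc_a, enc_b = relative_encoding(a), relative_encoding(b)
    return len(enc_a) == len(enc_b) and all(
        int_a == int_b and isclose(ratio_a, ratio_b, abs_tol=tol)
        for (int_a, ratio_a), (int_b, ratio_b) in zip(enc_a, enc_b)
    )


# Opening of "Twinkle Twinkle Little Star" starting on middle C (MIDI 60).
reference = [(60, 0.5), (60, 0.5), (67, 0.5), (67, 0.5),
             (69, 0.5), (69, 0.5), (67, 1.0)]

# The same tune sung five semitones higher and twice as fast.
sung = [(pitch + 5, duration / 2) for pitch, duration in reference]

print(same_melody(reference, sung))
```

Real sung input will be noisier than this toy example, so an exact comparison like `same_melody` would have to give way to some form of approximate sequence matching.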
Our second strategy will be to match the sung sample directly against an MP3 file. This has the obvious advantage of being usable on any song: most songs don’t have a MIDI version easily available, and there’s no reliable way to automatically convert a recording into a MIDI file. This would let us expand our library to an essentially arbitrary size instead of being limited to songs with available MIDIs. To implement this, we will use the strategy I outlined in my last status report: convert both the sung sample and the original song into chroma feature matrices and run some cross-similarity analysis. Some researchers have had success using convolutional neural nets on cross-similarity matrices of songs compared with cover versions, but our problem is slightly different: we should expect to see much less similarity, and only across a small time window. We will definitely explore CNNs, but we will also explore some other pattern recognition algorithms.
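As a rough illustration of the cross-similarity step, the sketch below assumes the chroma features have already been extracted as lists of 12-element frames (one value per pitch class) and uses cosine similarity between frames. The helper names and the toy one-hot frames are ours, and the straight-diagonal search is only a simple stand-in for the pattern recognition we still need to explore: it assumes the query and the song share the same key and tempo, which real sung input will not guarantee.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cross_similarity(query, song):
    """Rows are query frames, columns are song frames."""
    return [[cosine(q, s) for s in song] for q in query]


def best_diagonal(matrix):
    """Find the song offset whose straight diagonal has the highest mean similarity.

    Assumes the song has at least as many frames as the query.
    Returns (start_column, mean_similarity).
    """
    n_query, n_song = len(matrix), len(matrix[0])
    best_start, best_score = 0, -1.0
    for start in range(n_song - n_query + 1):
        score = sum(matrix[i][start + i] for i in range(n_query)) / n_query
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


def one_hot_chroma(pitch_class):
    """Toy chroma frame with all energy in one pitch class (illustration only)."""
    return [1.0 if i == pitch_class else 0.0 for i in range(12)]


# Toy data: the "song" is a sequence of pitch classes, and the "query"
# is a short excerpt of it, standing in for a sung sample.
song = [one_hot_chroma(pc) for pc in [0, 4, 7, 0, 2, 5, 9, 2, 11]]
query = [one_hot_chroma(pc) for pc in [0, 2, 5]]

matrix = cross_similarity(query, song)
print(best_diagonal(matrix))
```

In this toy case the query matches the song at a single offset, which shows up as one bright diagonal in the matrix. With real recordings we expect a much fainter and possibly tilted or shifted pattern, which is why we want to try CNNs and other detectors on these matrices.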
Once we have more data on how well both of these algorithms work, we can decide how to integrate them into an app. We can also start on visualizations of our algorithms, both for our own debugging and for the end user’s enjoyment (e.g., a cool animation showing how we found your song).