Status Report: Nolan

The neural network is trained on our database of 63 songs (these were somewhat arbitrarily chosen; I went on a YouTube-to-WAV converter downloading spree).

To reiterate, the model works as follows: a song is recorded in .wav format, then converted into a chroma feature (specifically CENS, chroma energy normalized statistics, which adds some extra normalization and smoothing). This CENS is used to produce a cross-similarity matrix against the CENS of every song in the model's library. Each matrix is classified (the classifier outputs the probability that a matrix represents a match), and the songs are then ranked by match probability. Currently, the network's mean squared error is about 1.72%.
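
For reference, here's a rough Python sketch of that pipeline. It is not the actual implementation (our preprocessing lives in MATLAB): the filenames, the "matcher.h5" model path, and the 128x128 classifier input size are placeholders, and it assumes librosa, scikit-image, and Keras are available.

```python
import numpy as np
import librosa
from skimage.transform import resize
from tensorflow import keras

def cens(path):
    """Load a .wav file and compute its CENS chroma feature (12 x n_frames)."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    return librosa.feature.chroma_cens(y=y, sr=sr)

def cross_similarity(query, ref):
    """Frame-wise cosine similarity between two chroma sequences."""
    q = query / np.maximum(np.linalg.norm(query, axis=0, keepdims=True), 1e-8)
    r = ref / np.maximum(np.linalg.norm(ref, axis=0, keepdims=True), 1e-8)
    return q.T @ r                                        # (query frames, reference frames)

model = keras.models.load_model("matcher.h5")             # placeholder model path
library = {name: cens(f"library/{name}.wav")              # placeholder song library
           for name in ["song_a", "song_b"]}
query = cens("sung_query.wav")                            # placeholder query recording

scores = {}
for name, ref in library.items():
    xsim = resize(cross_similarity(query, ref), (128, 128))   # fixed size for the classifier
    scores[name] = float(model.predict(xsim[np.newaxis, :, :, np.newaxis])[0, 0])

ranked = sorted(scores, key=scores.get, reverse=True)     # most likely matches first
print(ranked[:5])
```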

 

Before the demo, I'm cleaning up the integration and making sure that everything can connect smoothly with the visualization webapp and with Anja's dynamic time warping. Since my neural network is in Python/Keras and my preprocessing is in MATLAB, I'm using the MATLAB Engine API for Python to integrate the two.
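
The glue layer looks roughly like the sketch below; the MATLAB folder and function names are placeholders for whatever the preprocessing actually exposes, but the matlab.engine calls themselves are the real API.

```python
import numpy as np
import matlab.engine

# Start a MATLAB session and point it at the preprocessing code
eng = matlab.engine.start_matlab()
eng.addpath("preprocessing", nargout=0)          # placeholder folder of MATLAB scripts

# Placeholder MATLAB function returning a cross-similarity matrix for two .wav files
xsim = eng.make_cross_similarity("sung_query.wav", "reference.wav")
xsim = np.asarray(xsim)                          # matlab.double -> numpy array for the Keras classifier

eng.quit()
```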

Nolan: Status Report 10

I've got the preprocessing and creation of cross-similarity matrices to a point where I'm happy with them. This week I created a set of .wav files of songs to train a model on. I have 66 songs selected, so with two team members singing each song there will be 132 sung samples; comparing each sample against all 66 references gives 8,712 cross-similarity matrices to train on: 132 matches and several thousand mismatches. This week I will be training the model.
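
Just to sanity-check the counts, here's a tiny sketch of how the labeled pairs break down (song and singer names are placeholders):

```python
import itertools

songs = [f"song_{i:02d}" for i in range(66)]      # 66 reference songs (placeholder names)
singers = ["singer_1", "singer_2"]                # two team members sing every song

sung_samples = list(itertools.product(singers, songs))               # 132 recordings
pairs = [(sample, ref) for sample in sung_samples for ref in songs]  # one matrix per pair
labels = [int(sample[1] == ref) for sample, ref in pairs]            # 1 = match, 0 = mismatch

print(len(pairs), sum(labels))                    # 8712 matrices, 132 of them matches
```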

Nolan: Status Report 7

I’m still trying to work out the preprocessing for chroma feature cross-similarity analysis.

I actually received a very prompt reply from one of the members of the Korean team whose work I'm building on. He suggests adding a pitch transformation to transpose the two samples into the same key before comparing them; this, in his words, is critical. I have a general idea of what should be done based on the paper they cite that used this concept, the Optimal Transposition Index (OTI): basically, it averages each piece's chroma features over time to find the key it mostly seems to be in, and then shifts one of the two so that these averages match.
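
My reading of the idea, as a rough numpy sketch (not the paper's exact formulation):

```python
import numpy as np

def oti(query_chroma, ref_chroma):
    """Optimal transposition index: the circular shift of the query's average
    chroma that best lines up with the reference's average chroma."""
    q = query_chroma.mean(axis=1)                 # average pitch-class profile, shape (12,)
    r = ref_chroma.mean(axis=1)
    return int(np.argmax([np.dot(np.roll(q, k), r) for k in range(12)]))

def transpose_to_reference(query_chroma, ref_chroma):
    """Shift the query chroma so both pieces are (approximately) in the same key."""
    return np.roll(query_chroma, oti(query_chroma, ref_chroma), axis=0)
```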

The obvious potential flaw in this approach is that we'll be matching the average pitches of ONLY the melody against the average pitches of the ENTIRE song, which introduces a lot of noise; even a perfectly accurate singer might not generate very promising matches. I'm looking into OTI features, and I've found a good way to benchmark the approach in general: compare a song against its own isolated vocal track. If the original singer, in the same studio recording, can't produce something that matches the original song, this method is probably not tenable.

The researcher I contacted actually suggested the opposite of the approach we originally got from Prof. Dannenberg: analyze the waveform to estimate the melody from the audio directly, then compare melody to melody.

From this I've received some deep-learning-based suggestions and a few that won't require training, which might make some preliminary testing easier. Next week I'll be looking at RPCA and YIN for vocal extraction and melody recognition.
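
As a starting point for the YIN experiments, librosa ships an implementation (the filename below is a placeholder; RPCA-based vocal separation would need its own implementation):

```python
import librosa

y, sr = librosa.load("sung_sample.wav", sr=22050, mono=True)   # placeholder recording
f0 = librosa.yin(y,
                 fmin=librosa.note_to_hz("C2"),
                 fmax=librosa.note_to_hz("C6"),
                 sr=sr)
# One pitch estimate (Hz) per frame; unvoiced or noisy frames will need filtering
# before this contour is usable as a melody for matching.
print(f0.shape)
```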

Nolan: Status Report 6

This week I prepared for our demo. We will be able to show cross-similarity matrices between various songs and sung samples, along with some evaluation of how tuning the chroma features yields clearer and more distinct patterns in those matrices. Hopefully, some heuristic evaluations of the similarity will show trends that let us pick out a song based on a sung sample. Failing that, we can look at a classifier that learns to distinguish a match from a mismatch between a sung sample and an actual song waveform.
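
One simple heuristic I might try (just a sketch, nothing settled): score each cross-similarity matrix by its strongest diagonal, since a sung excerpt that really occurs in the song should show up as a bright diagonal streak somewhere in the matrix.

```python
import numpy as np

def diagonal_score(xsim):
    """Best mean similarity over all diagonals of a (query x reference) matrix."""
    n_q, n_r = xsim.shape
    return max(np.diagonal(xsim, offset=k).mean()
               for k in range(-(n_q - 1), n_r))
```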

 

 

Nolan: Status Report 5

This week I began using the Chroma Toolbox in MATLAB. I'm hoping to get some practice with it and soon have a demo showing a similarity matrix between a sung sample and a reference song. Looking at a few of these and comparing them to similarity matrices of unrelated samples should tell us whether the CNN approach will be viable.

Since most of our algorithms will probably be implemented in MATLAB, it might not be feasible to publish our project as a mobile app, but a desktop version should still be doable. Luckily, the problem has less to do with the computing power required and more to do with ease of development: there's no reason our algorithm COULDN'T be ported to a mobile device, and whatever we produce could be redeveloped if it seems commercially viable.

Nolan: Status Report 4

I spent the first half of this week working on the design document. Our team fleshed out a lot of technical and planning details we hadn't considered yet, so it was useful to identify decisions we would need to make and have some of those conversations in advance.

 

I spent the second half of the week preparing for a midterm and working on booth, but I hope to have a deliverable for the chroma feature cross-similarity path by the end of spring break. I'll be working on converting both a waveform in mp3 format and a sung sample into chroma feature matrices and creating a cross-similarity matrix between them. That will get us to the next step, in which we evaluate the matrices we see and start to think about whether pattern matching on them for classification is feasible.

Nolan: Status Report 3

This week I began working on my arm of the project, the chroma feature similarity matrix analysis. Since the first step is building chroma features (also known as chromagrams), I've started looking into available toolboxes and code for creating these. Most of the existing work seems to be in MATLAB, so if I want to use an existing chromagram library I'll have to decide between working in MATLAB and compiling to C++, or simply drawing inspiration from the libraries and building my own implementation. Even within chroma feature extraction, there are lots of design parameters to consider. There is a choice of how the chroma vector is constructed (a bank of filters with different cutoffs, or Fourier analysis followed by binning, are both viable options). On top of this, pre- and post-processing can dramatically alter the features of a chroma vector. The feature rate is also a relevant consideration: how many times per second do we want to record a chromagram?

Some relevant pre- and post-processing tricks to consider (a rough sketch of a few of these follows the list):

Accounting for different tunings: the toolbox tries several offsets of less than a semitone and picks whichever one is ‘most suitable.’ If we simply use the same bins for all recordings we may not need to worry about this, but a variation of this idea could also be used to provide some key-invariance.

Normalization to remove dynamics: dynamics might actually be useful in identifying a song, so we should probably test with and without this variant.

“Flattening” the vectors using logarithmic compression: this accounts for the fact that sound intensity is perceived logarithmically, and it changes the relative intensity of the notes in a given sample.

Logarithmic compression followed by a discrete cosine transform to discard timbre information and keep only the pitch content.

Windowing samples together and downsampling to smooth the chroma feature in the time dimension: this could help obscure some local tempo variations, though it’s unclear right now whether that’s something we want for this project. It does offer a way to change the tempo of a chroma feature, so we may want to use it if we try to build in tempo-invariance.

As it turns out, these researchers have done some work on audio matching (essentially what we're doing) using chroma features, and they suggest settings for their Chroma Toolbox that should lead to better performance, so that's a great place for us to start.
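
To make a few of the tricks above concrete, here's a rough numpy/scipy sketch of log compression, temporal smoothing, downsampling, and per-frame normalization; the constants are placeholders, not the toolbox's recommended settings.

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.signal.windows import hann

def postprocess_chroma(chroma, eta=100.0, win=41, down=10):
    """Rough CENS-style post-processing of a (12, n_frames) chroma matrix."""
    c = np.log1p(eta * chroma)                     # logarithmic compression
    kernel = hann(win)[np.newaxis, :]
    kernel = kernel / kernel.sum()
    c = convolve2d(c, kernel, mode="same")         # smooth each pitch band over time
    c = c[:, ::down]                               # downsample to a lower feature rate
    norms = np.maximum(np.linalg.norm(c, axis=0, keepdims=True), 1e-8)
    return c / norms                               # per-frame normalization
```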

Some important papers and resources from this week:

https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/03-publications/2011_MuellerEwert_ChromaToolbox_ISMIR.pdf

http://resources.mpi-inf.mpg.de/MIR/chromatoolbox/

http://resources.mpi-inf.mpg.de/MIR/chromatoolbox/2005_MuellerKurthClausen_AudioMatching_ISMIR.pdf

Team Status Report 3

This week we decided to split our engineering efforts and pursue two song identification strategies in parallel. This will give us a greater chance of success, and it will let our two strategies cross-verify each other.

Our first strategy will draw from Roger Dannenberg’s MUSART query-by-humming work from around 2008. This will do some signal analysis of the sung samples and extract notes and rhythms. These notes and rhythms will then be matched against a library of MIDI files using some of the algorithms from the paper. In order to do this, we need to extract the melodies from each MIDI file. The query-by-humming paper we’re referencing used a program called ThemeExtractor that analyzes songs, searches for repeated patterns, and returns a list of potential melodies. Unfortunately, ThemeExtractor is no longer available. We have found a CMU researcher (thanks, Professor Dannenberg) who is currently working on automated MIDI analysis and has a program that should be able to extract melodies. This method has the advantage of being key-invariant and tempo-invariant: a user singing too fast and on the wrong starting note should still be able to find a match.

 

Our second strategy will be to match an mp3 file to the sung sample. This has the obvious advantage of being usable on any song: most songs don’t have a MIDI version readily available, and there’s no reliable way to automatically convert a recording into a MIDI file. This would let us expand our library to essentially arbitrary size instead of being limited to songs with available MIDIs. To implement this, we will use the strategy I outlined in my last status report: convert both the sung sample and the original song into chroma feature matrices and run some cross-similarity analysis. Some researchers have had success using convolutional neural nets on cross-similarity matrices of songs compared with their cover versions, but our problem is slightly different: we should expect to see much less similarity, and only across a small time window. We will definitely explore CNNs, but we will also explore other pattern recognition algorithms.
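
For a sense of scale, the CNN option could look something like the minimal Keras sketch below; the layer sizes, the 128x128 input, and the loss are placeholders rather than tuned choices.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_matcher(input_shape=(128, 128, 1)):
    """Tiny placeholder CNN: cross-similarity matrix in, P(match) out."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

model = build_matcher()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["mse"])
model.summary()
```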

Once we have more data on how both of these algorithms are working, we can decide how we want to integrate them into an app and work on visualizations of our algorithms for our own debugging purposes and for the end user’s enjoyment (e.g. a cool animation showing you how we found your song).

Nolan: Status Report 2


This week we had a very helpful meeting with Professor Dannenberg. In it, we described our research up to this point and some of the different strategies we were looking at. Professor Dannenberg suggested we use chroma features, an audio processing method that turns a spectrogram into a quantized vector over the 12 pitch classes of the musical scale. While this obscures things like instrumentation, it’s a pretty solid way to turn raw audio (mp3 files or whatever) into some sort of informative musical representation to play around with.
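
For concreteness, here's what computing a chromagram looks like with librosa (an assumed library and a placeholder filename; we may end up using a MATLAB toolbox instead):

```python
import librosa

y, sr = librosa.load("some_song.mp3", sr=22050, mono=True)   # placeholder file
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=2048)
print(chroma.shape)   # (12, n_frames): one 12-bin pitch-class vector per frame,
                      # at roughly sr / hop_length = ~10.8 frames per second
```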

It seems that melody identification research is pretty limited (at least with regard to melody extraction from a raw audio file as opposed to a MIDI), so we’re currently pursuing an algorithm that matches a sung query against an entire song with all of its instruments, as opposed to extracting a melody and matching against that.

Professor Dannenberg suggests computing a chroma feature for each song to be matched against and a chroma feature for the sung query sample, then building a similarity matrix that compares them at every point in time. Some sort of pattern recognition could then be applied to this matrix to look for a match somewhere between the sung query and the original song.

One drawback is that this method is neither key- nor tempo-invariant. For example, a singer who sings all the correct intervals between notes but starts on the wrong note (very easy to do for someone who doesn’t have perfect pitch and doesn’t know the song well) would not generate a match, since we’re matching pitches directly against each other. We do have the option of rotating the sung chroma vector through all 12 transpositions and comparing against each, but it’s possible this could generate a lot of noise.
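
Generating the 12 rotated copies is cheap; the cost (and the noise risk) comes from running the matcher on each. A minimal numpy sketch:

```python
import numpy as np

def all_transpositions(chroma):
    """All 12 circular pitch-class rotations of a (12, n_frames) chroma matrix.
    Run the normal matcher on each and keep the best-scoring rotation."""
    return [np.roll(chroma, k, axis=0) for k in range(12)]
```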

Similarly, this method is sensitive to tempo: someone singing the right notes but a little too fast or too slow could easily fail to get a match. This is partly because of the way chroma features work. We get a 12-bin vector of pitch classes at some sampling rate (maybe a few times per second?), but each frame is just a snapshot of ALL notes present in the song at that moment, with no notion of a melodic line or of a note changing. Because of this, a sped-up version of a melody can look quite different from a slower version. We could try to brute-force a fix by also running our search method on several sped-up and slowed-down versions of the query’s chroma features.
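
The tempo brute force could work by resampling the query's chroma sequence along the time axis; the stretch factors below are arbitrary placeholders.

```python
from scipy.ndimage import zoom

def tempo_variants(chroma, factors=(0.8, 0.9, 1.0, 1.1, 1.25)):
    """Time-stretched copies of a (12, n_frames) chroma matrix, one per factor."""
    return {f: zoom(chroma, zoom=(1.0, f), order=1) for f in factors}
```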

The sung sample’s chroma vectors will also contain harmonics, so it may be a good idea to remove these; this is something we should determine through testing.

 

At this point, we need to build this system and test it. We’ll start by just looking at similarity matrices manually to see whether there are promising results before we build a pattern recognition system. Some researchers have achieved very good results using convolutional neural networks on similarity matrices for cover song identification, but that could be challenging to tune to an acceptable standard (and might just be overkill for a demo-based capstone).