Wenting: Status Report 3

In the process of putting together the design presentation and design document, we have decided to pursue two parallel paths for our project, as described in our team status report. One will follow a path similar to the query by humming project, matching against a MIDI library, while the other will use chroma feature analysis to examine similarities between MP3s.

From the data visualization standpoint, the two approaches will generate results in different ways. Since the first approach borrows from other research, I am not completely sure how it can be visualized – whether it will remain a black-box computation or whether I can expose its intermediate steps for display. The second approach will follow what I mentioned last week: showing the similarity matrix.

While the UI design of the app will be done later, I have begun deciding on its functionality and features. Similar to the existing Shazam, users will tap to begin singing and matching. We hope to have sliders that let the user weight melody and rhythm differently depending on which they are more confident in. Once our algorithm has finished processing, it will display the matched song, or report that no match was found. Either way, the user will be able to see some of the work that was done to match the song. The level of detail shown initially is yet to be determined (for example, we could include a “see more” button that reveals more of the data visualization). The app will maintain a history of the songs that have been searched, and potentially the audio files it has previously captured, so it will also keep a record of what each recording has matched before.

Now that our design is more concrete, we have reached the phase where we will begin implementation to see how our methods perform. I would like to begin the data visualization with some test data to see how different libraries and technologies fit our purposes. Also, in conjunction with Nolan, I will be looking into chroma feature analysis and using CNNs to perform matching.
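
To make that concrete, here is a minimal sketch of the kind of test I have in mind, assuming we prototype the visualization in Python with matplotlib; the similarity matrix and the “match” diagonal here are entirely synthetic.

```python
# Minimal sketch: render a synthetic query-vs-song similarity matrix the way
# we might display a real one. All of the data here is fake test data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sim = 0.3 * rng.random((80, 400))                 # low "background" similarity
sim[np.arange(80), 150 + np.arange(80)] = 0.9     # fake diagonal: match starting at song frame 150

plt.imshow(sim, aspect="auto", origin="lower", cmap="magma")
plt.xlabel("song frame")
plt.ylabel("query frame")
plt.colorbar(label="similarity")
plt.title("Synthetic query-vs-song similarity matrix")
plt.show()
```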

Team Status Report 3

This week we decided to split our engineering efforts and pursue two song identification strategies in parallel. This will give us a greater chance of success, and it will let our two strategies cross-verify each other.

Our first strategy will draw from Roger Dannenberg’s MUSART query by humming work from ~2008. This will do some signal analysis of the sung samples and extract notes and rhythms. Then, these notes and rhythms will be matched against a library of MIDI files using some of the algorithms from the paper. In order to do this, we need to extract the melodies from each MIDI file. The query by humming paper we’re referencing used a program called ThemeExtractor that analyzes songs and searches for repeated patterns, returning a list of potential melodies. Unfortunately, ThemeExtractor is no longer available. We have found a CMU researcher (thanks, Professor Dannenberg) who is currently doing work on automated MIDI analysis and has a program that should be able to extract melodies. This method will have the advantage of being key-invariant and tempo-invariant: a user singing too fast and on the wrong starting note should still be able to find a match.
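
As a placeholder until we have that tool, here is a rough sketch of what the MIDI preprocessing step might look like, assuming we use the pretty_midi library; the “track with the most notes” heuristic for picking the melody and the file name are stand-ins, not our actual method.

```python
# Rough sketch: pull pitch/rhythm data out of a MIDI file with pretty_midi.
# Picking which track actually carries the melody is the hard part that
# ThemeExtractor handled; here we just take the non-drum track with the most
# notes as a stand-in heuristic. "song.mid" is a placeholder path.
import pretty_midi

midi = pretty_midi.PrettyMIDI("song.mid")
candidates = [inst for inst in midi.instruments if not inst.is_drum]
melody_track = max(candidates, key=lambda inst: len(inst.notes))

# A (pitch, onset, duration) sequence we could match sung queries against.
melody = [(note.pitch, note.start, note.end - note.start)
          for note in sorted(melody_track.notes, key=lambda n: n.start)]
print(melody[:10])
```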


Our second strategy will be to match an MP3 file to the sung sample. This has the obvious advantage of being usable on any song: most songs don’t have a MIDI version readily available, and there’s no way to automatically convert a recording into a MIDI file. This would let us expand our library to essentially arbitrary size instead of being limited to songs with available MIDIs. To implement this, we will use the strategy I outlined in my last status report: convert both the sung sample and the original song into a chroma feature matrix and run some cross-similarity analysis. Some researchers have had success using convolutional neural nets on cross-similarity matrices of songs compared with cover versions, but our problem is slightly different: we should expect to see much less similarity, and only across a small time window. We will definitely explore CNNs, but we will also explore other pattern recognition algorithms.
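
For a sense of what that computation looks like, here is a minimal sketch assuming we use librosa for chroma extraction; the file names are placeholders, and the details (chroma variant, hop size, normalization) are all still up for experimentation.

```python
# Sketch of the chroma cross-similarity idea, assuming librosa is available.
# "query.wav" and "song.mp3" are placeholder file names.
import numpy as np
import librosa

def chroma(path):
    y, sr = librosa.load(path)                    # mono audio
    c = librosa.feature.chroma_cqt(y=y, sr=sr)    # 12 x frames pitch-class matrix
    return c / (np.linalg.norm(c, axis=0, keepdims=True) + 1e-8)

q = chroma("query.wav")    # sung sample, 12 x Tq
s = chroma("song.mp3")     # full recording, 12 x Ts

# Cosine similarity between every query frame and every song frame.
cross_sim = q.T @ s        # Tq x Ts; a match should show up as a bright diagonal
```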

Once we have more data on how both of these algorithms are working, we can decide how we want to integrate them into an app and work on visualizations of our algorithms for our own debugging purposes and for the end user’s enjoyment (e.g. a cool animation showing you how we found your song).

Wenting: Status Report 2

This week our team met with Professor Roger Dannenberg to get his insight into our ideas and into what would be most plausible for our project. He told us more about the query by humming project and how its poor performance kept it from becoming a marketable product. Breaking music down into separate lines is actually quite difficult, since it is hard to pick out individual instrument parts. He mentioned F0 estimation, which tries to find the predominant pitch at each instant – essentially melody tracking – but it also has mixed results. His suggestions have shifted our path toward using similarity matrices to match the songs. A similarity matrix essentially plots one song against another, and searching for diagonals locates the places where the two songs match up.
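
As a rough illustration of the diagonal-search idea (ignoring tempo differences for now), something like the following could score each possible starting offset of the query within the song; this is only a sketch, not our final matching algorithm.

```python
# Rough illustration only: score each diagonal offset of a query-vs-song
# similarity matrix by its mean similarity and return the best one.
# Assumes the query and song are at roughly the same tempo.
import numpy as np

def best_diagonal(cross_sim):
    """cross_sim: (query_frames x song_frames) similarity matrix."""
    n_q, n_s = cross_sim.shape
    scores = np.array([
        cross_sim[np.arange(n_q), offset + np.arange(n_q)].mean()
        for offset in range(n_s - n_q + 1)        # candidate start frames in the song
    ])
    return int(scores.argmax()), float(scores.max())
```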

The similarity matrix idea also plays into the data visualization aspect of our project. Matrices are inherently visual, so if we end up using this technique, the work done by the matching algorithm can be shown directly through the matrix.

A point brought up during our weekly meeting with Professor Savvides and the TAs was that including a data visualization portion would help with debugging as well. I read this article, which covers some of the existing software that people use for similar purposes, such as the TensorFlow Graph Visualizer. We will probably end up using those tools, or at the very least employing techniques from them.

Unfortunately, I have technically fallen behind schedule. I did not anticipate how much more research and planning we would have to do before moving on to other tasks, such as designing the data format. However, I have deepened my understanding of the problem sooner rather than later, so I now have a better idea of what will work instead of having to backtrack and try a different solution too late.

For next week, I would like to have a more concrete plan for implementing both the audio analysis and the matching algorithm, and to explore how they will interact with the data visualization aspect of the project.

Anja Status Report 2 + Team Status Report 2

Anja Kalaba Team C6:

  • This week my team and I spoke with Professor Dannenberg and were able to rule out the idea of dynamic library processing. We settled that the final experiment to attempt, before falling back to a faithful replication of the query by humming paper, would be using similarity matrices to compare instrument samples against ensemble samples and verify the instrument’s membership in the ensemble. MIDI file preprocessing was decided to be the more reliable approach. I also worked on the design slides, and it was settled that I will present our design on Monday.
  • I would say progress is on schedule.
  • Deliverables for this week should include a small but valid test of the similarity-matrix instrument-membership verification and, based on those results, a final decision about which algorithm and library/database composition to pursue.


Team C6 Status:

  • The most significant risk would be committing to the similarity matrix approach and then finding it isn’t applicable to voices. Our contingency plan is a full implementation of the query by humming paper, which seems credible and has reasonable results. If that also proves unstable, our plan is to provide a data visualization scheme that at least displays the rigorous algorithmic work done under the hood.
  • No changes were made to the design; we are still in the consideration phase described above.
  • No Team schedule changes.

Nolan: Status Report 2

Nolan Hiehle

Capstone Status Report 2


This week we had a very helpful meeting with Professor Dannenberg. In it, we described our research up to this point and some of the different strategies we were looking at. Professor Dannenberg suggested we use chroma features, an audio processing method that turns a spectrogram into a quantized vector over the 12 pitch classes of the musical scale. While this obscures things like instrumentation, it’s a pretty solid way to turn raw audio (MP3 files or whatever) into some kind of informative musical representation to play around with.
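
To illustrate what that quantization means, here’s a toy version of the idea: fold each spectrum bin’s energy into one of the 12 pitch classes. Real chroma implementations (e.g. librosa’s) use proper filter banks, so this is just for intuition, not something we would ship.

```python
# Toy illustration of the chroma idea: fold a magnitude spectrum's energy
# into 12 pitch classes. Real implementations use tuned filter banks;
# this just shows the quantization step for one spectrogram frame.
import numpy as np

def spectrum_to_chroma(magnitudes, freqs):
    """magnitudes, freqs: 1-D arrays for one spectrogram frame."""
    chroma = np.zeros(12)
    for mag, f in zip(magnitudes, freqs):
        if f <= 0:
            continue                                # skip the DC bin
        midi = 69 + 12 * np.log2(f / 440.0)         # frequency -> MIDI note number
        chroma[int(round(midi)) % 12] += mag        # fold octaves into one pitch class
    return chroma / (chroma.sum() + 1e-8)
```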

It seems that melody identification research is pretty lacking (at least with regard to melody extraction from a raw audio file as opposed to a MIDI), so we’re currently pursuing an algorithm that matches a sung query against an entire song with all instruments, as opposed to extracting a melody and matching against that.

Professor Dannenberg suggests computing a chroma feature matrix for each song to be matched against and for the sung query sample, then building a similarity matrix between them at every point in time. Some sort of pattern recognition could then be applied to this matrix to look for a match somewhere between the sung query and the original song.

Some drawbacks: this method is neither key-invariant nor tempo-invariant. For example, a singer singing all the correct intervals between notes but starting on the wrong note (very easy to do for someone who does not have perfect pitch and does not know a song well) would not generate a match, since we’re matching pitches directly against each other. We do have the option of rotating the sung chroma vector through all 12 transpositions and comparing against each, but it’s possible this could generate a lot of noise.
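
The rotation trick itself is cheap to express. Assuming 12 x frames chroma matrices and some scoring function over a cross-similarity matrix (named match_score here as a placeholder), a sketch might look like:

```python
# Sketch of the brute-force key search: rotate the query's chroma rows through
# all 12 transpositions and keep the best-scoring one.
import numpy as np

def best_transposition(query_chroma, song_chroma, match_score):
    """query_chroma, song_chroma: 12 x frames arrays (columns L2-normalized).
    match_score: any function that scores a cross-similarity matrix."""
    best = None
    for shift in range(12):
        rotated = np.roll(query_chroma, shift, axis=0)   # transpose by `shift` semitones
        score = match_score(rotated.T @ song_chroma)
        if best is None or score > best[1]:
            best = (shift, score)
    return best   # (semitone shift, score)
```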

Similarly, this method is sensitive to tempo. Someone singing the right notes but a little too fast or too slow could very easily not get a match. This is partially because of the way chroma features work: we choose some sampling rate (maybe a few times per second?) and get a 12-element pitch-class vector at each sample – but each sample is just a snapshot of ALL notes present in the song at that moment, with no notion of a “line of melody” or of a note changing. Because of this, a sped-up version of a melody could look quite different from a slower version. We could try to brute-force a fix by also running our search method on some number of sped-up and slowed-down chroma features of the query.
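
One cheap way to generate those variants might be to resample the query’s chroma matrix along its time axis, e.g. with scipy; this is only a sketch of the brute-force idea under that assumption, not a real tempo model.

```python
# Sketch of brute-forcing tempo: resample the query's 12 x frames chroma
# matrix along the time axis and try each variant against the song.
import numpy as np
from scipy.ndimage import zoom

def tempo_variants(query_chroma, rates=(0.8, 0.9, 1.0, 1.1, 1.25)):
    """Yield (rate, resampled_chroma) pairs; each rate corresponds to
    assuming the singer was somewhat faster or slower than the original."""
    for rate in rates:
        stretched = zoom(query_chroma, (1, rate), order=1)   # linear interp in time only
        yield rate, stretched
```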

The sung sample will contain harmonics in its chroma vectors, so it may be a good idea to remove these–this is something we should determine with testing.


At this point, we need to build this system and test it. We’ll start by inspecting similarity matrices manually to see whether there are promising results before we build a pattern recognition system. Some researchers have achieved very good results using convolutional neural networks on similarity matrices for cover song identification, but this could be challenging to tune to an acceptable standard (and might just be overkill for a demo-based capstone).
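
If we do go the CNN route, here is a minimal PyTorch sketch of the kind of classifier we would be tuning: a couple of convolutional layers over fixed-size cross-similarity patches with a match / no-match output. The architecture and input size are placeholders; the real work would be collecting labeled data and tuning it.

```python
# Minimal PyTorch sketch of a match/no-match classifier over fixed-size
# cross-similarity patches. Layer sizes are placeholders for illustration.
import torch
import torch.nn as nn

class SimMatrixCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classify = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 16 * 16, 1),   # assumes 64x64 input patches
        )

    def forward(self, x):                  # x: (batch, 1, 64, 64)
        return self.classify(self.features(x))   # raw logit; apply sigmoid for probability

# Example: logits = SimMatrixCNN()(torch.rand(4, 1, 64, 64))
```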


Team Status Report 1

  • What are the most significant risks that could jeopardize the success of the project? How are these risks being managed? What contingency plans are ready?

    The biggest risk to our project is the matching algorithm performing very poorly. As elaborated in the next point, existing papers have shown that query by humming, a similar project, has poor performance. We will also have a very limited library of songs to match against. Our contingency plan is to add a data visualization aspect to the project to show the work that the matching algorithm has done, even if it cannot actually identify the song. For example, it can display highlighted portions of the melody that it found to be a match or close match.

  • Were any changes made to the existing design of the system (requirements, block diagram, system spec, etc)? Why was this change necessary, what costs does the change incur, and how will these costs be mitigated going forward?

    We might not do dynamic processing on a flexible library anymore. Some papers indicate that preprocessing that takes advantage of MIDI file information (a broken-down record of pitches and rhythms for each instrument in a piece) is widely used and much simpler. The cost is that the focus of our project shifts to either analyzing and processing MP3 files or building better data representations and search algorithms.

    This cost is mitigated because our learning will be maximized: we can try out innovations on the existing algorithms since there isn’t much to lose.

    Recognizing the project’s potential lack of accuracy, we decided to add the visualization component so that users can at least see the inner workings of our algorithms.

  • Provide an updated schedule if changes have occurred.

    Our schedule is mostly the same, but we have updated our Gantt chart below with the changes highlighted. After further evaluation, machine learning may not be needed, so we have generalized the ML-related tasks to “matching algorithm” and added data visualization to the task list.

Wenting: Status Report 1

*Backlog from before websites were set up*

This week I did more thorough research into similar existing projects and related work that could be helpful for developing our solution. Roger Dannenberg, a CS professor whose primary field is computer music, has done a lot of work that is of interest for our project, including the query by humming project that we noted in our project proposal. We have been in contact with him and intend to meet with him soon.

The projects I looked into were the MUSART query by humming project and his work in structural analysis. From studying the query by humming project, I found that our intended method for analyzing songs was more difficult and less likely to succeed than previously thought. Our most ambitious idea was to analyze songs by breaking them down into multiple voices using concepts from polyphonic pitch tracking, but this project simply used pre-existing MIDI files instead of analyzing the raw audio of the songs. Also, looking at the performance of their system, our goal of 75% accuracy in recognizing songs may not be attainable. To counteract the possibility of our system failing to match a song, I came up with the idea of adding some sort of data visualization. Even if we are unable to find a match, the algorithm will have done some amount of work trying to match it. I would like to include that in the results of the query to demonstrate that it did, in fact, try to do something. An example of what it might show is a highlighted portion of melody that it matched between the input query and an existing song.

While browsing Professor Dannenberg’s work, I stumbled upon his research in structural analysis. The purpose of those projects was to build models that could analyze songs and provide an explanation of them, e.g. “the song is in ‘AABA’ form.” Our project’s intent is not that, but the analysis methods from these projects are relevant to what we are trying to do. Much of this work looked for patterns in the music to discover the structure of the song, including ways to transcribe melody using monophonic pitch tracking, chroma representation, and polyphonic transcription. We will likely be doing something similar in order to extract information from the input sounds.

I am currently on schedule, but the research phase will probably continue into next week as we explore all our options and discuss further with professors such as Roger Dannenberg.

I’d like to have some semblance of our data format design, and I intend to do research into data visualization, though that will depend on the data format and matching method.

Hello world!


Welcome to Earworm! Ever have a song stuck in your head but didn’t know the name or artist? Could only sing a small clip of the melody? With our app, users will be able to hum or sing an excerpt of the lead vocals of a song and have it identified for them.

Our GOALS:

  • Design a well-formatted and complete data structure to house the musical properties of input vocal queries
  • Develop a dynamic musical analysis of MP3 files that extracts features into the uniform data structure
  • Design an efficient search algorithm
  • Create a small but varied music library/database
  • Achieve at least a 50% success rate with query matches/identifications
  • Provide the user with a data visualization output during their use of the app – this will describe the inner workings of the algorithm, giving the user a better sense of which parts of their query matched the song and showing how the identified song was selected