I’m still trying to work out the preprocessing for chroma feature cross-similarity analysis.
I actually received a very prompt reply from one of the members of the Korean team whose work I’m building off of. He suggests adding pitch transposition to bring the two samples into the same key before comparing them. This, in his words, is critical. I have a general idea of what should be done based on the paper they’ve cited that used this concept, the Optimal Transposition Index, or OTI (basically, it takes the chroma features of each sample, averages them over time into a single 12-bin profile of the key it seems to be in, and then circularly shifts one of the two profiles until they match best).
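A minimal sketch of the OTI idea as I currently understand it, not the cited paper's exact implementation. It assumes each sample is already a list of 12-bin chroma frames (plain Python lists here, to keep it dependency-free), and the function names are my own placeholders. The "match" is scored with a dot product between the two averaged profiles.

```python
def global_profile(chroma):
    """Average a sequence of 12-bin chroma frames into one 12-bin profile."""
    n = len(chroma)
    return [sum(frame[k] for frame in chroma) / n for k in range(12)]


def rotate(vec, shift):
    """Circularly shift a chroma vector up by `shift` semitones."""
    shift %= len(vec)
    return vec[-shift:] + vec[:-shift] if shift else list(vec)


def optimal_transposition_index(chroma_a, chroma_b):
    """Return the shift (0-11) of B whose averaged profile best matches A's."""
    ga = global_profile(chroma_a)
    gb = global_profile(chroma_b)
    # Score every possible semitone shift with a dot product.
    scores = [
        sum(x * y for x, y in zip(ga, rotate(gb, s)))
        for s in range(12)
    ]
    return max(range(12), key=scores.__getitem__)


def transpose(chroma, shift):
    """Apply the same circular shift to every frame."""
    return [rotate(frame, shift) for frame in chroma]


# Toy data (made up): a C major triad vs. the same triad a whole tone higher.
c_major = [[1.0, 0, 0, 0, 1.0, 0, 0, 1.0, 0, 0, 0, 0]] * 4
d_major = [[0, 0, 1.0, 0, 0, 0, 1.0, 0, 0, 1.0, 0, 0]] * 4

shift = optimal_transposition_index(c_major, d_major)
d_in_c = transpose(d_major, shift)
```

With real audio, the chroma frames would come from a feature extractor rather than hand-written lists, and the same shift would be applied to the whole chroma sequence before computing cross-similarity.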
Obviously, the main potential flaw in this approach from the beginning is that we’ll be matching the average pitches of ONLY the melody with the average pitches of the ENTIRE song, creating a lot of noise. Even a perfectly correct singer may not be able to generate very promising matches. I’m looking into OTI features, and I’ve found a good way to benchmark this approach in general: I can compare a song to its isolated vocal track. If the original singer in the same studio recording can’t generate something that seems matchable vs. the original song, this method is probably not tenable.
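To illustrate the concern, here is a rough sketch of a frame-by-frame chroma cross-similarity using cosine similarity. The summary score (`mean_best_match`) is a hypothetical placeholder of mine, not something from the paper, and the toy frames are made up: the "vocal" only has the root note, while the "full mix" has the whole chord.

```python
import math


def cosine(u, v):
    """Cosine similarity between two chroma frames (0.0 if either is silent)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cross_similarity(chroma_a, chroma_b):
    """Matrix whose entry [i][j] compares frame i of A with frame j of B."""
    return [[cosine(fa, fb) for fb in chroma_b] for fa in chroma_a]


def mean_best_match(sim):
    """Hypothetical summary score: average of each row's best match."""
    return sum(max(row) for row in sim) / len(sim)


# Toy data (made up): vocal sings only the root, the mix plays the full triad.
vocal = [[1.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]] * 3
full_mix = [[1.0, 0, 0, 0, 1.0, 0, 0, 1.0, 0, 0, 0, 0]] * 3

score_vocal_vs_mix = mean_best_match(cross_similarity(vocal, full_mix))
score_mix_vs_mix = mean_best_match(cross_similarity(full_mix, full_mix))
```

Even with a "perfect" vocal, the score against the full mix comes out well below the mix-vs-mix score, because the accompaniment adds pitch classes the vocal never contains. The isolated-vocal benchmark would measure how bad this gets on real recordings.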
The researcher I contacted actually suggested the opposite approach from the idea we originally got from Prof. Dannenberg: estimate the melody directly from the audio waveform, then compare melody to melody.
From this, I’ve received some deep-learning-based suggestions and a few that won’t require training, which might make some preliminary testing easier. Next week I’ll be looking at RPCA and YIN for vocal extraction and melody recognition, respectively.