This week, I got audio segmentation up and running. After our previous conversation with Professor Sullivan, I first converted the audio signal into RMS values.
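For reference, here's a rough sketch of what that RMS conversion might look like in Python, assuming librosa and a mono recording (the file name and the frame/hop lengths are just placeholders, not necessarily the values I actually used):

```python
import numpy as np
import librosa

# Load the recording as a mono signal (the file name is a placeholder).
y, sr = librosa.load("twinkle_sample.wav", sr=None, mono=True)

# Frame-wise RMS of the signal; the frame/hop lengths are illustrative defaults.
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]

# Time (in seconds) of each RMS frame, used later for marking note onsets.
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=512)
```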
My first approach was to flag a note onset wherever there was a sharp increase in the RMS. However, this caused some spikes to be identified multiple times, and increasing the minimum amount of time that had to pass since the last identified point often caused me to miss the beginnings of some notes.
(Image: the dots mark where the code identified the start of a note; as you can see, it flagged far too many points.)
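For context, this is roughly what that first approach looked like (a sketch rather than the exact code; the jump threshold and minimum gap here are made-up example values, and rms/times are the arrays from the sketch above):

```python
def sharp_increase_onsets(rms, times, jump_threshold=0.05, min_gap_s=0.1):
    """Flag a note onset wherever RMS jumps sharply, with a minimum gap between onsets."""
    onsets = []
    last_onset = float("-inf")
    for i in range(1, len(rms)):
        jumped = rms[i] - rms[i - 1] > jump_threshold    # sharp increase in RMS
        far_enough = times[i] - last_onset > min_gap_s   # enough time since the last onset
        if jumped and far_enough:
            onsets.append(times[i])
            last_onset = times[i]
    return onsets
```

A small jump threshold catches the same spike several times over, and a larger minimum gap starts swallowing real note beginnings, which is exactly the trade-off described above.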
I then realized that the RMS often drops to near zero just before a note. So my next approach was to identify points where the RMS is near zero, but during a stretch of silence (like a rest) this incorrectly split the silence into many separate segments, which wasted a lot of time. So I tried a combination of the two: I look for frames where the RMS is near zero and then find the nearest peak that follows. If the difference between that peak's RMS and the starting (near-zero) RMS is greater than a threshold (currently 0.08, though I'm still experimenting with this value), I mark it as a note onset. While this was the most accurate approach so far, I still hit a bug: even in moments of silence the code would find the nearest peak, sometimes a couple of seconds away, and again label the silence as several note beginnings. I fixed this by also checking how far away the peak is and adding a maximum-distance threshold.
(Image: the dotted blue line marks where the code identified the nearest peak and the red line marks the near-zero RMS values.)
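Here's a sketch of that combined check, again using the rms/times arrays from earlier. The 0.08 rise threshold is the one mentioned above, while the near-zero level and the maximum peak distance are placeholder values I'm still tuning:

```python
def near_zero_onsets(rms, times, near_zero=0.01, rise_threshold=0.08, max_peak_gap_s=0.5):
    """Mark a note onset where RMS dips near zero and a large-enough peak follows soon after."""
    onsets = []
    i = 0
    while i < len(rms) - 2:
        if rms[i] <= near_zero:
            # Walk forward to the nearest local peak after the near-zero frame.
            j = i + 1
            while j < len(rms) - 1 and not (rms[j] >= rms[j - 1] and rms[j] > rms[j + 1]):
                j += 1
            rise = rms[j] - rms[i]                                 # peak RMS minus the near-zero RMS
            close_enough = times[j] - times[i] <= max_peak_gap_s   # ignore peaks too far away (e.g. across a rest)
            if rise >= rise_threshold and close_enough:
                onsets.append(times[i])
                i = j  # skip past this peak so the same note isn't counted twice
                continue
        i += 1
    return onsets
```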
Currently this works on a sample of Twinkle Twinkle Little Star. When I tested it with a recording of Ten Little Monkeys, it worked only after lowering the RMS threshold, which suggests we will need to standardize our signal somehow in the future. We also noticed that with quicker notes the RMS doesn't dip as close to zero as it does for quarter or half notes, so we may need to raise the threshold for what counts as near zero.
(Image: the red line marks where the code identified the beginnings of notes and the blue dotted line marks where I manually identified them.)
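One possible way to standardize things, as a very rough sketch (this is just one option we're considering, not something that's in the code yet): peak-normalize the RMS curve so the same thresholds mean the same thing across recordings.

```python
import numpy as np

def normalize_rms(rms):
    """Scale the RMS curve to [0, 1] so onset thresholds are comparable across recordings."""
    rms = np.asarray(rms, dtype=float)
    peak = rms.max()
    return rms / peak if peak > 0 else rms
```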