This week, I worked on analyzing the rhythm of just the vocals in pop songs. I used a vocal separation technique that builds a similarity matrix to identify repeating elements, derives a repeating (background) spectrogram model from the median of those repetitions, and then extracts the repeating patterns with time-frequency masking. The separation is shown in an example below:
Vocal separation of Edith Piaf’s La Vie en Rose
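The pipeline is close to the similarity-matrix vocal separation example in the librosa documentation; here is a minimal sketch of that approach (the file name, margin, and power values are placeholders rather than my exact settings):

```python
import numpy as np
import librosa

# Load the track and take the magnitude/phase of its STFT.
y, sr = librosa.load("la_vie_en_rose.mp3")  # placeholder file name
S_full, phase = librosa.magphase(librosa.stft(y))

# Background model: each frame is replaced by the median of its most similar
# frames (cosine similarity), which keeps the repeating accompaniment.
S_filter = librosa.decompose.nn_filter(
    S_full,
    aggregate=np.median,
    metric="cosine",
    width=int(librosa.time_to_frames(2, sr=sr)),
)
S_filter = np.minimum(S_full, S_filter)

# Soft time-frequency mask for the non-repeating foreground (vocals).
margin, power = 10, 2  # illustrative values
mask_v = librosa.util.softmask(S_full - S_filter, margin * S_filter, power=power)

# Reconstruct the separated vocal signal.
y_vocals = librosa.istft(mask_v * S_full * phase)
```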
Using this method, there is still significant bleeding of the background into the foreground. Additionally, audio processing with vocal separation takes on average 2.5 minutes for a 5-minute song, compared to 4 seconds without it. Even when running the rhythm detection on a solo a cappella track, it does not perform as well as on a piano or guitar song, for example, since sung notes generally have less distinct onsets. Thus I think users who want to play the game with songs that contain vocals are better off using the original rhythm detection and refining the result with the beat map editor, or building a beat map from scratch in the editor.
I am on schedule and on track with our goals for the final demo. Next week, I plan to conduct some user testing of the whole system to validate our use case requirements and ensure there are no bugs in preparation for the demo.
This week, I integrated the rhythm analysis algorithm with Yuhe and Lucas’s game code. I also experimented with another method of determining the global threshold for onset strengths above which a timestamp should be counted as a note. This one used the median and median absolute deviation instead of the mean and standard deviation. Theoretically it should be less affected by outliers, since the deviations are not squared. This method performs similarly to the current method but has slightly more false positives at slower tempos. I also further tested the sliding window method. That one had even more false positives at slow tempos. I believe this might be ameliorated by having the window slide more continuously rather than jumping from the first 100 frames to the second 100 frames, for example. The issue is that this would increase the audio processing latency, which we want to avoid. I think the method using the standard deviation (in blue below) is still the best overall.
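For reference, here is a rough sketch of how the two global thresholds differ, assuming a hypothetical weighting k (the real value was tuned during testing):

```python
import numpy as np

def global_thresholds(onset_env, k=1.5):
    """Compare the two candidate global thresholds over an onset-strength
    envelope. k is illustrative; the actual weighting was tuned in testing."""
    thr_std = np.mean(onset_env) + k * np.std(onset_env)       # current method
    mad = np.median(np.abs(onset_env - np.median(onset_env)))
    thr_mad = np.median(onset_env) + k * mad                   # outlier-robust variant
    return thr_std, thr_mad
```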
Next week, I plan to help with integration testing and also conduct some user testing to validate that the project meets the user requirements. I also plan to help improve the visual design of the game as needed.
I learned several new tools while working on this project. At the beginning, I learned the basics of Unity before we pivoted to Yuhe’s self-made lightweight game engine. I found it easiest to learn Unity as a beginner by watching and following along with a YouTube tutorial. By consulting documentation, I got more familiar with the numpy library, especially using it to build complex plots. I learned how to use threading in Python, looking at examples on Stack Overflow, in order to create a verification program that could play a song and animate its detected notes at the same time. I also learned how to use MuseScore to compose pieces to use as tests for the rhythm analysis. For this I was able to mostly teach myself, occasionally Googling any features I couldn’t find on my own.
This week, our team made progress on finalizing and debugging our subsystems as well as starting integration. Lucas added audio playback to the game loop and worked on integrating his components with Yuhe’s main menu. Yuhe worked on the beat map editor, adding a waveform viewer and interactions for editing notes. Yuhe is also working on migrating the game to Windows in order to solve the audio card reading issues encountered in the Linux virtual environment. Michelle continued testing and refining her rhythm analysis algorithm, moving to a new method that yields higher accuracy, as shown below in a test on a piano piece.
After integrating the subsystems we will run integration tests to ensure all the components are communicating with each other correctly. There are several metrics we will need to focus on, including beat map accuracy, audio and falling tile synchronization, gameplay input latency, persistent storage validation, frame rate stability, and error handling. Both beat map alignment and input latency should be under 20 ms to ensure a seamless game experience. The rhythm analysis should capture at least 95% of the notes and have no false positives. Error handling should cover issues such as unexpected file formats, file sizes that are too large, and invalid file name inputs.
For validation of the use case requirements, we will do iterative user testing and collect qualitative feedback about ease of use, difficulty of the game, accuracy of the rhythm synchronization, and the overall experience. During user testing, users will upload songs of their choice, play the game with the automatically generated beat map, and also try out the beat map editor. We will want to validate that the whole flow is intuitive and user-friendly.
This week, I continued testing and refining the rhythm analysis algorithm. I tested a second version of the algorithm that more heavily weights the standard deviation of the onset strengths when determining whether to count a peak as a note. This version is much more accurate across various tempos, as shown in the figure below. These are the results of testing a self-made, single-clef piano composition. The first version would have more false positives at a very slow tempo and false negatives at a very fast tempo, the missed notes typically being 32nd notes or some 16th notes. The second version, tested on the same piece, performs much better, missing only a few 32nd notes at very fast tempos.
The verification methodology involves creating compositions in MuseScore and generating audio files to run the processing algorithm on. This way, I have an objective ground truth for the tempo and rhythm and can easily manipulate variables such as the instrument, dynamics, time signature, etc. and see how they affect accuracy. Additionally, I also test the algorithm on real songs, which often have more noise and more blended-sounding notes. Using a Python program, I can run the analysis on an uploaded song, play the song back while showing an animation that blinks at the extracted timestamps, and record any missed or added notes. To verify that my subsystem meets the design requirements, the algorithm must capture at least 95% of the notes without adding any extra notes, for single-instrument songs between 50 and 160 BPM.
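As a sketch of how detected timestamps can be scored against the known composition (this is illustrative, not my exact verification script, and the matching tolerance is an assumption):

```python
import numpy as np

def score_against_ground_truth(detected, truth, tol=0.05):
    """Match detected note times (seconds) against the known note times from
    a MuseScore composition, within a tolerance in seconds."""
    truth = np.asarray(truth, dtype=float)
    matched = set()
    extra = 0
    for t in detected:
        i = int(np.argmin(np.abs(truth - t)))
        if abs(truth[i] - t) <= tol and i not in matched:
            matched.add(i)
        else:
            extra += 1  # false positive (added note)
    recall = len(matched) / len(truth)  # target: >= 0.95, with extra == 0
    return recall, extra
```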
Comparing results of V1 and V2 on a piano composition created in MuseScore
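For illustration, a minimal sketch of the kind of global threshold V2 applies, built on librosa’s onset utilities; the weighting k and peak-picking details are placeholders for the tuned implementation:

```python
import numpy as np
import librosa

def detect_notes_v2(y, sr, k=1.5):
    """V2-style detection: keep onsets whose strength exceeds a global
    threshold of mean + k * std over the onset-strength envelope."""
    env = librosa.onset.onset_strength(y=y, sr=sr)
    threshold = np.mean(env) + k * np.std(env)
    frames = librosa.onset.onset_detect(onset_envelope=env, sr=sr)
    frames = [f for f in frames if env[f] >= threshold]
    return librosa.frames_to_time(frames, sr=sr)
```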
I also tested an algorithm that uses a local adaptive threshold instead of a global threshold. This version uses a sliding window so that onset strengths are compared more locally, which lets the algorithm adapt over the course of a piece, especially when there are changes in dynamics. The tradeoff is that it can be more susceptible to noise.
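A rough sketch of this local approach, assuming the window advances in fixed blocks of frames (window size and weighting are illustrative):

```python
import numpy as np

def local_threshold_mask(env, win=100, k=1.5):
    """Local adaptive thresholding: the onset envelope is split into windows
    of `win` frames, and each window gets its own mean + k * std cutoff."""
    keep = np.zeros(len(env), dtype=bool)
    for start in range(0, len(env), win):
        block = env[start:start + win]
        threshold = np.mean(block) + k * np.std(block)
        keep[start:start + win] = block >= threshold
    return keep
```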
I am on track with the project schedule. I think the current version is sufficient for the MVP of this subsystem, so further work will just be more extensive testing and stretch goals for more complex music. I have begun creating compositions with even more complex rhythms, including time signature changes, which I plan to test V2 on next week. I will also test the algorithm on pieces with drastic dynamic changes. I plan to experiment more with the minimum note length as well; since V2 produces fewer false positives, I may be able to decrease it from the current 100 ms to accommodate more complex pieces. Additionally, I want to test a version that uses median absolute deviation instead of standard deviation to see if it outperforms V2, since that method is less sensitive to extreme peaks.
This week, I continued fine-tuning the audio processing algorithm. I continued testing with piano and guitar and also started testing voice and bowed instruments. These are harder to extract the rhythm from since the articulation can be a lot more legato. If we used pitch information, it might be possible to distinguish note onsets within slurs, for example, but this is most likely out of scope for our project.
There was also a flaw in calculating the minimum note length from the estimated tempo: for a song most people would consider 60 BPM, librosa sometimes estimates 120 BPM, which is technically equivalent, but the calculated minimum note length then becomes much smaller and produces a lot of “double notes”, i.e. detections directly after one another that actually come from a single sustained note. For the game experience, I believe it is better to have more false negatives than false positives, so I think a fixed minimum note length is a better generalization. A threshold of 0.1 seconds seems to work well.
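A minimal sketch of the fixed minimum note length filter described above:

```python
def enforce_min_note_length(note_times, min_gap=0.1):
    """Drop any detection that follows the previously kept note by less than
    min_gap seconds (the fixed 0.1 s threshold described above)."""
    kept = []
    for t in note_times:
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept
```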
Additionally, in preparation for integrating the music processing with the game, I added some more information to the JSON output that bridges the two parts. Based on the number of notes at a given timestamp, the lane numbers from which the tiles will fall are chosen randomly.
Example JSON output
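For illustration, a hypothetical sketch of how this bridge file could be written; the field names, lane count, and schema here are assumptions, not the exact format shown above:

```python
import json
import random

def write_beat_map(note_events, num_lanes=4, path="beatmap.json"):
    """note_events is a list of (timestamp_seconds, num_notes) pairs.
    Field names and lane count are illustrative placeholders."""
    beat_map = []
    for timestamp, num_notes in note_events:
        lanes = random.sample(range(num_lanes), num_notes)  # distinct lanes per chord
        beat_map.append({"timestamp": round(timestamp, 3), "lanes": sorted(lanes)})
    with open(path, "w") as f:
        json.dump(beat_map, f, indent=2)
```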
My progress is on schedule. Next week, I plan to finalize my work on processing the rhythm of single-instrument tracks and meet with my teammates to integrate all of our subsystems together.
This week I continued testing my algorithm on monophonic instrumental and vocal songs with fixed or varying tempo. I ran into some upper limits with SFML in terms of how many sounds it can keep track of at a time. For longer audio files, when running the test, both the background music and the generated clicks on note onsets play perfectly for about thirty seconds before the sound starts to glitch, then goes silent and produces this error:
It seems that there is an upper bound on the number of SFML sounds that can be active at a time, and after running valgrind it looks like there are some memory leak issues too. I am still debugging this, trying to clear click sounds as soon as they finish playing and implementing suggestions from forums. However, this is only a problem in testing, since I am trying to play what is probably hundreds of metronome clicks in succession; it will not be a problem in the actual game, where we will only play the song and maybe a few sound effects. If the issue persists, it might be worthwhile to switch to a visual test, which would be closer to the gameplay experience anyway.
Next week I plan to try to get the test working again, try out a visual test method, and work with my team members on integration of all parts. Additionally, after having a discussion with my team members, we think it may be best to leave more advanced analysis of multi-instrumental songs as a stretch goal and focus on the accuracy of monophonic songs for now.
This week, we each made a lot of progress on our subsystems and started the integration process. Yuhe finished building a lightweight game engine that will much better suit our purposes than Unity, and implemented advanced UI components, a C++ to Python bridge, and a test in C++ for rhythm detection verification using SFML. Lucas worked on rewriting the gameplay code he wrote for Unity to work with the new engine, and was able to get a barebones version of the game working. Michelle worked on rhythm detection for monophonic time-varying tempo songs, which is quite accurate, and started testing fixed tempo multi-instrumental songs, which needs more work.
Beat Detection for Time-Varying Tempo
Core Game Loop in New Game Engine
There have been no major design changes in the past week. The most significant risk at this time to the success of our project is probably the unpredictability of the audio that the user will upload. Our design will mitigate this risk by only allowing certain file types and sizes and surfacing a user error if no tempo can be detected (i.e. the user uploaded an audio file that is not a song).
Next steps include finishing the transition to the new game engine, refining the rhythm detection of multi-instrumental songs, and implementing an in-game beatmap editor. With integration off to a good start, the team is making solid progress towards the MVP.
I started out this week by exploring how to leverage librosa to analyze the beat of songs with time-varying tempo. Below are the results of processing an iPhone recording of high school students at a chamber music camp performing Dvorak's Piano Quintet No. 2, Movement III, using a standard deviation of 4 BPM:
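For reference, a minimal sketch of the librosa calls behind this kind of analysis, assuming the standard deviation corresponds to librosa's std_bpm parameter (the file name is a placeholder):

```python
import librosa

# Placeholder file name for the chamber music recording.
y, sr = librosa.load("dvorak_quintet_mvt3.m4a")
env = librosa.onset.onset_strength(y=y, sr=sr)

# Frame-wise (time-varying) tempo curve; std_bpm controls the spread of the
# tempo prior (the "standard deviation of 4 BPM" mentioned above).
dynamic_tempo = librosa.beat.tempo(onset_envelope=env, sr=sr,
                                   aggregate=None, std_bpm=4)

# Beat positions used for the click-on-each-beat listening test.
_, beat_frames = librosa.beat.beat_track(onset_envelope=env, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
```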
When running a test that simultaneously plays the piece and a click on each estimated beat, the beats sound mostly accurate but not perfect. I then moved on to adding note onset detection in order to determine the rhythm of the piece. My current algorithm selects timestamps where the onset strength is above the 85th percentile, then removes any timestamps that are within a 32nd note of each other, calculated from the overall tempo. This works very well for monophonic songs with some variation in tempo. For multi-instrumental tracks, it tends to detect the rhythm of the drums if present, since those have the clearest onsets, plus some of the rhythm of the other instruments or voices.
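A rough sketch of this algorithm; the peak-picking details are simplified relative to my actual implementation:

```python
import numpy as np
import librosa

def detect_rhythm(y, sr):
    """Keep onset candidates whose strength exceeds the 85th percentile, then
    drop any that fall within a 32nd note (from the overall tempo) of the
    previous kept onset."""
    env = librosa.onset.onset_strength(y=y, sr=sr)
    cutoff = np.percentile(env, 85)
    frames = librosa.onset.onset_detect(onset_envelope=env, sr=sr)
    frames = [f for f in frames if env[f] >= cutoff]
    times = librosa.frames_to_time(frames, sr=sr)

    tempo = librosa.beat.tempo(onset_envelope=env, sr=sr)[0]
    min_gap = (60.0 / tempo) / 8.0   # a 32nd note is 1/8 of a beat
    kept = []
    for t in times:
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept
```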
I also worked on setting up my development environment for the new game engine Yuhe built. Next week I plan to continue integrating the audio analysis with the game. I also plan to adjust the rhythm algorithm to dynamically calculate the 32nd note threshold based on the dynamic tempo, as well as experiment with different values for the standard deviation when calculating the time-varying tempo. I also would like to look into possible ways that we can improve rhythm detection in multi-instrumental songs.
This week, I worked on creating an algorithm for determining the number of notes to generate based on the onset strength of each beat. Onset strength at time t is determined by the mean over frequency bins of max(0, S[f, t] – ref[f, t – lag]), where ref is S after local max filtering along the frequency axis and S is the log-power Mel spectrogram.
Since a higher onset strength implies a more intense beat, it can be better represented in the game by chords. Likewise, a weaker onset strength would generate a rest or a single note. Generally we want more single notes than anything else, with three-note chords being rarer than two-note chords. These percentile cutoffs can easily be adjusted later during user testing to find the best balance.
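A hypothetical sketch of this mapping; the percentile cutoffs shown are placeholders to be tuned during user testing:

```python
import numpy as np

def notes_per_onset(strengths, p_rest=10, p_two=80, p_three=95):
    """Map each onset strength to a number of simultaneous tiles (0 = rest).
    The percentile cutoffs are illustrative placeholders."""
    cut_rest = np.percentile(strengths, p_rest)
    cut_two = np.percentile(strengths, p_two)
    cut_three = np.percentile(strengths, p_three)
    counts = []
    for s in strengths:
        if s >= cut_three:
            counts.append(3)   # rare: very strong onset -> three-note chord
        elif s >= cut_two:
            counts.append(2)   # occasional: two-note chord
        elif s >= cut_rest:
            counts.append(1)   # most onsets stay single notes
        else:
            counts.append(0)   # weakest onsets become rests
    return counts
```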
My progress is on schedule. Next week, I plan to refactor my explorations with Librosa into modular functions to be easily integrated with the game. I will also be transitioning from working on audio analysis to working on the UI of the game.
This week, I continued working on validation of fixed-tempo audio analysis. The verification method I created last week, playing metronome clicks at the beat timestamps while the song plays, was not ideal: multi-threading introduced timing issues, and removing the threading and starting the song manually at roughly the same time introduced human error.
This week, I created an alternate version that uses matplotlib to animate a blinking circle at the extracted timestamps while the song plays in a separate thread. The visual alignment is also closer to the gameplay experience. I used 30 FPS since that is the planned frame rate of the game. Here is a short video of a test as an example: https://youtu.be/54ToPpPSpGs
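A minimal sketch of this kind of visual verification, assuming sounddevice for audio playback (the blink duration and layout are illustrative, not my exact script):

```python
import numpy as np
import librosa
import sounddevice as sd
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def animate_notes(path, note_times, fps=30, blink=0.1):
    """Blink a circle at each detected note time while the song plays."""
    y, sr = librosa.load(path)
    note_times = np.asarray(note_times)

    fig, ax = plt.subplots()
    circle, = ax.plot(0.5, 0.5, "o", markersize=40, color="red")
    circle.set_visible(False)
    ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.axis("off")

    def update(frame):
        t = frame / fps
        # Show the circle whenever the current time is within half a blink
        # window of a detected note.
        circle.set_visible(bool(np.any(np.abs(note_times - t) < blink / 2)))
        return (circle,)

    n_frames = int(np.ceil(len(y) / sr * fps))
    anim = FuncAnimation(fig, update, frames=n_frames,
                         interval=1000 / fps, blit=True)
    sd.play(y, sr)   # non-blocking playback while the animation runs
    plt.show()
    sd.wait()
    return anim
```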
When testing tempo error on the self-composed audio library, where we know the ground truth of the tempo and beat timestamps, faster songs of 120 BPM or greater had a tempo error of about 21 ms, which is just outside our tolerance of 20 ms. When I tested fast songs with the visual animation verification method, the error was not really perceptible to me. Thus, I think fixing this marginal error is not a high priority, and it may be justified to relax the beat alignment tolerance slightly, at least for the MVP. Further user testing after integration will be needed to confirm this.
My progress is on track with our schedule. Next week I plan to wrap up fixed-tempo beat analysis and move on to basic intensity analysis, which will be used to determine how many notes should be generated per beat. This is a higher priority than varying-tempo beat analysis. Testing with a wide variety of songs will be needed to fine-tune the algorithm that determines the number of notes per beat for the most satisfying gaming experience.