The most significant risk to our project is whether we can get our audio processing pipeline working well enough to produce usable captions from the speech-to-text model we're using.
This past week, we decided to abandon our delay-and-sum beamforming approach in favor of a different algorithm, Phase Difference Channel Weighting (PDCW), published by Dr. Stern. Charlie and Stella met with Dr. Stern to discuss PDCW and possible reasons our previous implementation wasn't working. On Friday, Charlie and Larry recorded new data which we will use to test the PDCW algorithm (the data had to be recorded in a particular configuration to meet the assumptions of the PDCW algorithm).
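To make the approach concrete, here is a minimal sketch of the phase-difference masking idea at the core of PDCW, assuming a two-microphone recording with the target speaker broadside to the pair. The threshold values, the moving-average stand-in for gammatone channel weighting, and the name `pdcw_mask_sketch` are illustrative choices on our part, not Dr. Stern's reference implementation:

```python
import numpy as np
from scipy.signal import stft, istft

def pdcw_mask_sketch(x_left, x_right, fs, itd_threshold=1e-4,
                     nperseg=1024, floor=0.01):
    """Sketch of the phase-difference masking step behind PDCW.

    Assumes the target speaker is broadside to the two-mic pair, so
    target energy arrives with near-zero inter-channel time delay (ITD).
    Time-frequency bins whose implied ITD exceeds `itd_threshold`
    seconds are attenuated. Full PDCW also smooths the mask with
    gammatone channel weighting; a simple moving average across
    frequency stands in for that here.
    """
    f, t, X_l = stft(x_left, fs=fs, nperseg=nperseg)
    _, _, X_r = stft(x_right, fs=fs, nperseg=nperseg)

    # Inter-channel phase difference per time-frequency bin.
    dphi = np.angle(X_l * np.conj(X_r))

    # Implied ITD: dphi = 2*pi*f*tau  =>  tau = dphi / (2*pi*f).
    freqs = np.maximum(f[:, None], 1e-6)  # avoid divide-by-zero at DC
    tau = dphi / (2 * np.pi * freqs)

    # Keep bins consistent with a broadside (tau ~ 0) target, with a
    # small floor so rejected bins are attenuated rather than zeroed.
    mask = np.where(np.abs(tau) < itd_threshold, 1.0, floor)

    # Crude stand-in for channel weighting: smooth across frequency.
    kernel = np.ones(5) / 5.0
    mask = np.apply_along_axis(
        lambda m: np.convolve(m, kernel, mode="same"), axis=0, arr=mask)

    # Apply the mask to the average of the channels and resynthesize.
    _, y = istft(mask * 0.5 * (X_l + X_r), fs=fs, nperseg=nperseg)
    return y
```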
PDCW is our current plan for the signal processing component of our audio pipeline, but as a backup plan we have a deep learning module, SpeechBrain's SepFormer, which we can use to separate multiple speakers. We decided with Dr. Sullivan this past week that, if we go with the deep learning approach, we will test our final system's performance on more than just two speakers.
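For context, this is roughly how the pretrained SepFormer checkpoint is invoked, following SpeechBrain's published example for the WSJ0-2mix model (which separates two speakers at 8 kHz). The file paths here are placeholders, and in newer SpeechBrain releases the class is imported from `speechbrain.inference.separation` rather than `speechbrain.pretrained`:

```python
import torchaudio
from speechbrain.pretrained import SepformerSeparation

# Download the pretrained SepFormer model (trained on WSJ0-2mix, 8 kHz).
model = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-wsj02mix",
    savedir="pretrained_models/sepformer-wsj02mix",
)

# separate_file returns a tensor of shape [batch, time, n_sources].
est_sources = model.separate_file(path="mixture.wav")

# Save each separated speaker to its own file.
for i in range(est_sources.shape[2]):
    torchaudio.save(f"speaker{i + 1}.wav",
                    est_sources[:, :, i].detach().cpu(), 8000)
```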
The change to our audio processing algorithm is the only significant change we made this week to the design of our system. We have not made any further adjustments to our schedule.
On the video processing side, Larry has been able to generate time-stamped captions, and with our angle estimation working, we are close to being able to place captions on the correct speakers. With this progress on the video pipeline, and with the SepFormer module serving as our interim audio processing pipeline, we've been able to start integrating the various parts of our system, which we wanted to begin as early as possible.
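As one illustration of this integration step, a speaker's estimated azimuth could be mapped to a horizontal caption position with a simple pinhole-camera model. The helper below is hypothetical: the camera geometry, the field-of-view parameter, and the name `azimuth_to_pixel_x` are assumptions for the sketch, not part of our current system:

```python
import math

def azimuth_to_pixel_x(azimuth_deg, frame_width_px, hfov_deg=60.0):
    """Map an estimated speaker azimuth to a horizontal pixel coordinate.

    Hypothetical helper assuming a pinhole camera centered on the mic
    array, with azimuth measured from the optical axis (positive to
    the right) and a known horizontal field of view `hfov_deg`.
    """
    focal_px = (frame_width_px / 2) / math.tan(math.radians(hfov_deg / 2))
    x = frame_width_px / 2 + focal_px * math.tan(math.radians(azimuth_deg))
    # Clamp so captions for off-screen angles stick to the frame edge.
    return int(min(max(x, 0), frame_width_px - 1))
```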