The most significant risk that could jeopardize our project remains whether we can separate speech well enough to generate usable captions.
To manage this risk, we are pursuing multiple speech separation approaches in parallel. The deep learning approach is good enough for our MVP, though it may require a calibration step to match voices in the audio to people in the video. Because we want to avoid a calibration step, we are also continuing to develop a beamforming solution, and combining the two approaches may ultimately serve the project best. Our contingency plan, however, is deep learning with a calibration step: it is the most likely to work, but also the least novel.
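To make the beamforming direction concrete, here is a minimal delay-and-sum sketch in Python. It assumes a linear, far-field microphone array; the function name, geometry, and parameters are illustrative placeholders rather than our actual implementation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(signals, mic_positions, angle_deg, fs):
    """Steer a linear mic array toward angle_deg (0 = broadside) by
    delaying each channel so the target wavefront aligns, then summing.

    signals:       (num_mics, num_samples) synchronized recordings
    mic_positions: (num_mics,) mic x-coordinates along the array, in meters
    angle_deg:     steering angle relative to broadside
    fs:            sample rate in Hz
    """
    mic_positions = np.asarray(mic_positions, dtype=float)
    num_mics, num_samples = signals.shape
    theta = np.deg2rad(angle_deg)

    # Far-field plane wave: the arrival delay at each mic is proportional
    # to its position projected onto the propagation direction.
    arrival = mic_positions * np.sin(theta) / SPEED_OF_SOUND
    compensate = arrival.max() - arrival  # delays that re-align the wavefront

    # Apply fractional delays as phase shifts in the frequency domain.
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    phase = np.exp(-2j * np.pi * freqs[None, :] * compensate[:, None])
    aligned = np.fft.irfft(spectra * phase, n=num_samples, axis=1)

    # Coherent sum: speech from the steered direction adds up,
    # while off-axis speech partially cancels.
    return aligned.mean(axis=0)
```

Applying the delays in the frequency domain sidesteps integer-sample shifting, which matters at the small inter-mic spacings we would likely use.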
We have not made any significant changes to the existing design of the system. One question we are now considering is how to determine how many people are speaking at any given moment in the video; knowing the active speaker count would help us avoid generating extraneous and incorrect captions. Because masks are currently prevalent and hide visual cues such as lip movement, we have to rely on an audio-only solution. Charlie is developing a clever solution that scales to two people, which is all we need given our project scope, and we will likely integrate it with the audio processing code.
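For illustration only, here is one plausible audio-only heuristic for distinguishing zero, one, or two simultaneous speakers: count distinct peaks in a GCC-PHAT cross-correlation between two microphones, since speakers at different angles produce different time differences of arrival. This is a sketch under our own assumptions, not the approach Charlie is actually building; every name and threshold below is a placeholder.

```python
import numpy as np

def count_active_speakers(ch0, ch1, fs,
                          mic_spacing=0.1,    # meters, placeholder
                          energy_gate=1e-4,   # RMS gate for "silence"
                          rel_peak=0.5):      # secondary-peak threshold
    """Rough 0/1/2 estimate of simultaneous speakers from two mics.

    Heuristic: each speaker sits at a different angle, so each produces
    a distinct peak in the GCC-PHAT cross-correlation. All parameters
    here are illustrative assumptions.
    """
    if np.sqrt(np.mean(ch0 ** 2)) < energy_gate:
        return 0  # treat low-energy frames as silence

    # GCC-PHAT: whiten the cross-spectrum so peaks reflect delay, not level.
    n = len(ch0) + len(ch1)
    cross = np.fft.rfft(ch0, n=n) * np.conj(np.fft.rfft(ch1, n=n))
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)
    cc = np.roll(cc, n // 2)  # move zero lag to the center

    # Only lags physically consistent with the mic spacing are plausible.
    max_lag = int(mic_spacing / 343.0 * fs)
    center = n // 2
    window = np.abs(cc[center - max_lag:center + max_lag + 1])

    # Count well-separated lag regions above a fraction of the top peak.
    strong = window > rel_peak * window.max()
    regions = int(strong[0]) + int(np.sum(np.diff(strong.astype(int)) == 1))
    return min(regions, 2)  # project scope caps at two speakers
```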
We have not made any further adjustments to our schedule.
One component that is now mostly working is angle estimation using the Jetson TX2 CSI camera, which we chose for its slightly wider FOV and lower distortion.
In the picture above, the estimated angle is overlaid at the detected person's center position.
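For context on how such an overlay value can be derived, here is a minimal sketch of a pixel-to-angle mapping under a pinhole camera model. The FOV and resolution constants are placeholders rather than measured values for our camera, and the function name is ours for illustration.

```python
import numpy as np

# Placeholder values: the actual horizontal FOV and resolution depend on
# the CSI camera module and the capture mode configured on the TX2.
HORIZONTAL_FOV_DEG = 62.0
FRAME_WIDTH_PX = 1280

def pixel_to_angle(center_x):
    """Map the horizontal center of a detected person's bounding box to
    an angle relative to the optical axis (negative = left of center).

    Uses a pinhole model: pixel offsets go through the focal length
    rather than scaling linearly, which matters near the edges of a
    wide-FOV frame.
    """
    focal_px = (FRAME_WIDTH_PX / 2) / np.tan(np.deg2rad(HORIZONTAL_FOV_DEG / 2))
    offset_px = center_x - FRAME_WIDTH_PX / 2
    return np.degrees(np.arctan2(offset_px, focal_px))
```

With these placeholder constants, a person centered at x = 960 in a 1280-pixel frame maps to roughly +17 degrees.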