Our greatest current risk is that we will encounter problems while integrating our user interface with the rest of the project. The video capture, audio capture, and various processing steps already work together as intended, and we have been able to test the system's performance without the UI. For demo day, however, we aim to finish a website that lets users record their video and view the captioned output, all in one place. The website itself is largely finished; it just needs to be connected to the processing steps of our system. As a contingency plan, we can always ask the user to connect their own laptop (or one of ours, on demo day) to the Jetson in order to view the captioned video.
We have made several changes to our design in the past week. For one, we have finalized our decision to use the deep learning approach for speech separation in our final design rather than the signal processing techniques SSF and PDCW. While SSF and PDCW do noticeably enhance our speaker of interest, they do not work well enough to give us a decent word error rate (WER). We will, however, try using SSF and PDCW to pre-process the audio before passing it to the deep learning algorithm to see whether that improves our system's performance.
While the deep learning algorithm takes in only one channel of input, we still need two channels to distinguish our left speaker from our right. This means that we no longer need our full mic array and can instead use a stereo pair of microphones. Because we had spent less than half of our budget before this week, we decided to use the remainder on components for a better stereo recording. We submitted purchase request forms for a stereo audio interface, two microphones of much higher quality than those in the circular mic array we have been working with, and the cables needed to connect these parts. We hope that a higher-quality recording will help reduce our WER.
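To illustrate how a stereo recording lets us distinguish the two speakers even though the separation model is mono: the interleaved left/right samples just need to be de-interleaved before either channel is passed downstream. This is a minimal sketch for 16-bit PCM, not our actual pipeline code:

```python
import struct

def split_stereo(frames: bytes):
    """Split interleaved 16-bit stereo PCM bytes into left/right sample lists.

    `frames` is raw frame data, e.g. from wave.Wave_read.readframes()
    on a 2-channel, 16-bit WAV file.
    """
    # Unpack all little-endian signed 16-bit samples at once.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    left = list(samples[0::2])   # channel 0: left speaker
    right = list(samples[1::2])  # channel 1: right speaker
    return left, right
```

Each channel can then be written out (or converted to an array) and fed to the mono separation model on its own.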
We have made no changes to our schedule.
Our project is now very near completion. The website allows file uploads, can display a video, and shows a timer so the user can track the length of the recording. The captions display nicely over their respective speakers. (See Charlie and Larry's status reports for more details.)
For the audio processing side, we collected a new set of recordings this past week in two separate locations: indoors in a conference room and outdoors in the CUC loggia (the open archway space along the side of the building). In both locations, we collected the same set of 5 recordings: 1) Larry speaking alone, 2) Stella speaking alone, 3) Stella speaking with brief interruptions from Larry, 4) partial overlap of the speakers (just Stella, then both, then just Larry), 5) full overlap of the speakers. Using this data, we were able to assess the performance of our system under various conditions (see Stella's status report for further details). Once we get our new microphones, we can rerun some or all of these tests to measure the change in performance.
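For reference, the WER metric we use to assess these recordings follows the standard definition: word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. This is a textbook sketch of that computation, not necessarily the exact scoring script we run:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the cat")` is 1/3, since one reference word is missing out of three.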