Team Status Report for 30 April 2022

What are the most significant risks that could jeopardize the success of the project? How are these risks being managed? What contingency plans are ready?

The most significant risk that could jeopardize our project is linking the front end (website) to the back end (processing). Larry and I will be working on this integration before our final demo day. The contingency plan is to demo the front end and the back end separately. We already have a pipeline set up to record, split the audio, and overlay captions on the video, which is our MVP.

Were any changes made to the existing design of the system (requirements, block diagram, system spec, etc)? Why was this change necessary, what costs does the change incur, and how will these costs be mitigated going forward?

Yes. We will most likely use a deep learning approach for speech separation, given that the results we are getting are spectacular. This change was necessary because a conventional signal processing method, with our microphone array, is not sufficient for separating non-linear combinations of speech. Deep learning approaches are known to be able to learn and adapt to the non-linear manifold of the training data, which is why our current deep learning approach is so successful at separating overlapping speech.

At the moment, we have decided not to do real-time processing due to the limitations of the deep learning approach. Instead, we replaced it with a website interface that allows users to upload their own recordings for text overlay.

Provide an updated schedule if changes have occurred.

No change to schedule.

This is also the place to put some photos of your progress or to brag about a component you got working.

We did our final data collection on Friday with the new microphones.

The microphones are unidirectional and placed 10 cm apart. The speakers are seated 30 degrees away from the principal axis for two reasons. First, 30 degrees is at the edge of the Jetson camera's field of view. Second, this configuration is what our pure signal processing (SSF+PDCW) algorithm assumes.
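For a sense of the geometry, a quick back-of-the-envelope estimate of the inter-microphone delay this setup produces is below (the 16 kHz sampling rate is just an illustrative assumption):

import numpy as np

# Back-of-the-envelope time difference of arrival (TDOA) for this setup:
# two mics 10 cm apart, speaker seated 30 degrees off the principal axis.
d = 0.10                 # mic spacing in meters
theta = np.deg2rad(30)   # speaker angle from the principal axis
c = 343.0                # speed of sound in m/s
fs = 16000               # assumed sampling rate, for illustration only

tdoa = d * np.sin(theta) / c   # extra path length divided by the speed of sound
print(f"TDOA = {tdoa * 1e3:.3f} ms, about {tdoa * fs:.1f} samples at {fs} Hz")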

We split the audio and overlay the text on the video. This is a casual conversation between Larry and Charlie. As we expected, the captioning for Charlie is poor given his Singaporean accent, but Larry's captioning is much better.

https://drive.google.com/file/d/1zGE7MbR_yAVq2upUYq0c2LfmLz1AmquT/view?usp=sharing

Team Status Report for 23 April 2022

Our greatest current risk is that we will encounter problems in the process of integrating our user interface with the rest of our project. Currently, the video capture, audio capture, and various processing steps are able to work together as we want. We’ve been able to test the performance of our system without the UI, but for demo day we aim to finish a website that allows the users to record their video and view the captioned output video all on the website. The website is largely finished, however, and just needs to be connected to the processing steps of our system. As a contingency plan, we can always ask the user to connect their own laptop (or one of our laptops, for demo day) to the Jetson in order to view the captioned video.

We have made several changes to our design in the past week. For one, we have finalized our decision to use the deep learning approach for speech separation in our final design, rather than using the signal processing techniques SSF and PDCW. While SSF and PDCW do noticeably enhance our speaker of interest, they don’t work well enough to give us a decent WER. We will, however, try using SSF and PDCW to pre-process the audio before passing it to the deep learning algorithm to see if that helps our system’s performance.

While the deep learning algorithm takes in only one channel of input, we still need two channels to distinguish our left from our right speaker. This means that we no longer need our full mic array and could instead use stereo mics. Because we had spent less than half of our budget before this week, we decided to use the rest to buy components for a better stereo recording. We submitted purchase request forms for a stereo audio interface, two microphones of much better quality than the ones in the circular mic array we’ve been working with, and the necessary cords to connect these parts. We hope that a better quality audio recording will help reduce our WER.
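As a rough illustration of how two channels are enough to tell the left speaker from the right one, the sketch below picks a side from the sign of the cross-correlation lag between the channels. This is only an illustration, not necessarily the assignment method we will ship, and the sign convention would need to be checked against our actual channel ordering:

import numpy as np

def dominant_side(left, right, fs=16000, max_lag_ms=1.0):
    # If the left channel leads (peak at a negative lag), the source is to the left.
    max_lag = int(fs * max_lag_ms / 1000)
    corr = np.correlate(left, right, mode="full")
    lags = np.arange(-(len(right) - 1), len(left))
    mask = np.abs(lags) <= max_lag            # keep only physically plausible lags
    best_lag = lags[mask][np.argmax(corr[mask])]
    return "left" if best_lag < 0 else "right"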

We have made no changes to our schedule.

Our project is now very near completion. The website allows file uploads, can display a video, and displays a timer for the user to see how long the recording will go. The captions display nicely over their respective speakers. (See Charlie and Larry’s status reports for more details.)

For the audio processing side, we collected a new set of recordings this past week in two separate locations: indoors in a conference room and outdoors in the CUC loggia (the open archway space along the side of the UC). In both locations, we collected the same set of 5 recordings: 1) Larry speaking alone, 2) Stella speaking alone, 3) Stella speaking with brief interruptions from Larry, 4) partial overlap of speakers (just Stella then both then just Larry), 5) full overlap of speakers. Using the data we collected, we were able to assess the performance of our system under various conditions (see Stella’s status report for further details). Once we get our new microphones, we can perform all or some of these tests again to see the change in performance.

Team Status Report for 16 April 2022

The most significant risk that could jeopardize our project is that we may not be able to put together a strong user experience. We believe we have all of the parts working and relatively integrated, but we have very little experience creating a good UI. Our current plan is to create a website that lets the user record and download video. If we are unable to create a decent website, our contingency plan is to fall back on the HDMI to USB converter that we purchased. The user would then have to connect their personal laptop to the Jetson to see the output.

We are still having difficulty using signal processing techniques such as PDCW and SSF to separate speech, and are most likely moving forward with the deep learning approach. We are quite confident in the deep learning approach at this time. One risk associated with it is that it may take a very long time to run, potentially degrading the user experience. We are currently trying to work on speeding up the processing, but may have to settle for a much less responsive final product.

The only major change we have made to our system is that we are attempting to run a web server on the Jetson TX2 to provide a nice user interface. We wanted an interface that was both easy to implement and easy to use, and settled on a web server. There are no extra costs incurred by this change, and we have the time to work on it.
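A minimal sketch of the kind of interface we have in mind, assuming a Flask server on the Jetson; the route name and the run_pipeline stub are placeholders, not our real code:

from flask import Flask, request, send_file

app = Flask(__name__)

def run_pipeline(in_path, out_path):
    # Stand-in for our real split-and-caption pipeline.
    raise NotImplementedError

@app.route("/upload", methods=["POST"])
def upload():
    clip = request.files["recording"]        # uploaded video file
    clip.save("/tmp/input.mp4")
    run_pipeline("/tmp/input.mp4", "/tmp/captioned.mp4")
    return send_file("/tmp/captioned.mp4", mimetype="video/mp4")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)       # serve on the Jetson's local network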

In our schedule, Charlie will now be working on the website for most of the remaining time. We have no other changes to our schedule.

Early progress on the website:

Team Status Report for 10 April 2022

What are the most significant risks that could jeopardize the success of the project? How are these risks being managed? What contingency plans are ready?

The current most significant risk is integrating the camera and audio modules together. We are facing quite a few Python compatibility issues on the Jetson. We have a naive contingency solution, where we write shell scripts that execute each module using a different version of Python. However, this is extremely inefficient. Alternatively, we are thinking of resetting the Jetson, or at least resetting the affected modules.
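The shell-script contingency amounts to a small launcher that runs each stage under its own interpreter, roughly like the sketch below (interpreter paths and script names are hypothetical):

import subprocess

# Each stage runs under whichever Python version it happens to need.
STAGES = [
    ("/usr/bin/python2.7", "capture_video.py"),
    ("/usr/bin/python3.6", "separate_audio.py"),
    ("/usr/bin/python3.8", "overlay_captions.py"),
]

for interpreter, script in STAGES:
    subprocess.run([interpreter, script], check=True)   # stop the pipeline if any stage fails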

We are also having trouble using a purely digital signal processing approach to separating speech. We realized that beamforming was not sufficient to separate speech, and have turned to Phase Difference Channel Weighting (PDCW) and Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF), which Stella is currently testing.

The contingency plan is a deep learning speech separation approach. Charlie has had significant success using deep learning to separate speech, but the model he used cannot run in real time due to processing latency. We showed the results of this deep learning approach during our demo this week, and Professor Sullivan and Janet seemed sufficiently convinced that our approach is logical.

Were any changes made to the existing design of the system (requirements, block diagram, system spec, etc)? Why was this change necessary, what costs does the change incur, and how will these costs be mitigated going forward?

Yes. We will most likely use a deep learning approach for speech separation, given that the results we are getting are spectacular. This change was necessary because a conventional signal processing method, with our microphone array, is not sufficient for separating non-linear combinations of speech. Deep learning approaches are known to be able to learn and adapt to the non-linear manifold of the training data, which is why our current deep learning approach is so successful at separating overlapping speech.

Provide an updated schedule if changes have occurred.

No change to schedule.

This is also the place to put some photos of your progress or to brag about a component you got working.

The biggest thing to brag about this week is the numerical evidence of how successful our deep learning approach is.

When one speaker is speaking (the isolated case), we get a Word Error Rate (WER) of about 40%. We tested two overlapping cases: two speakers speaking at once, and one speaker interrupting the other. Using our speech separation technique, we get a WER nearly comparable to the isolated case.
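For reference, WER is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal illustration (not the exact scoring script we use):

def word_error_rate(reference: str, hypothesis: str) -> float:
    # WER = (substitutions + deletions + insertions) / number of reference words,
    # computed here with a word-level edit-distance table.
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the quick brown fox", "the quick brown socks"))  # 0.25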

Team Status Report for 2 April 2022

The most significant risk that could jeopardize our project is whether or not we will be able to get our audio processing pipeline working well enough to produce usable captions from the speech-to-text model we're using.

This past week, we decided to abandon the delay-and-sum beamforming in favor of a different algorithm, Phase Difference Channel Weighting (PDCW), which Dr. Stern published. Charlie and Stella met with Dr. Stern this past week to discuss PDCW and possible reasons our previous implementation wasn’t working. On Friday, Charlie and Larry recorded new data which we will use to test the PDCW algorithm (the data had to be recorded in a particular configuration to meet the assumptions of the PDCW algorithm).

PDCW is our current plan for how to use signal processing techniques in our audio pipeline, but as a backup plan, we have a deep learning module – SpeechBrain’s SepFormer – which we can use to separate multiple speakers. We decided with Dr. Sullivan this past week that, if we go with the deep learning approach, we will test our final system’s performance on more than just two speakers.
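For context, this is roughly how SepFormer can be invoked through SpeechBrain's pretrained interface (the file names are illustrative, and the model's native sample rate is 8 kHz):

import torchaudio
from speechbrain.pretrained import SepformerSeparation

model = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-wsj02mix",
    savedir="pretrained_models/sepformer-wsj02mix",
)

# est_sources has shape (batch, time, n_speakers)
est_sources = model.separate_file(path="mixture.wav")
torchaudio.save("speaker1.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("speaker2.wav", est_sources[:, :, 1].detach().cpu(), 8000)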

The change to our audio processing algorithm is the only significant change we made this week to the design of our system. We have not made any further adjustments to our schedule.

On our video processing side, Larry has been able to generate time-stamped captions, and with our angle estimation working, we are close to being able to put captions on our speakers. With this progress on our video processing pipeline and with the SepFormer module as our interim audio processing pipeline, we’ve been able to start working on integrating the various parts of our system, which we wanted to start as early as possible.
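A rough sketch of how time-stamped captions can be drawn onto frames with OpenCV; the caption tuples and anchor positions below are placeholders, not our actual pipeline output:

import cv2

captions = [
    # (start_sec, end_sec, text, (x, y) anchor near the speaker)
    (0.0, 2.5, "Hi, how are you?", (120, 400)),
    (2.5, 5.0, "Pretty good, thanks.", (820, 400)),
]

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("captioned.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    t = frame_idx / fps
    for start, end, text, (x, y) in captions:
        if start <= t < end:                      # caption is active at this timestamp
            cv2.putText(frame, text, (x, y), cv2.FONT_HERSHEY_SIMPLEX,
                        1.0, (255, 255, 255), 2, cv2.LINE_AA)
    out.write(frame)
    frame_idx += 1

cap.release()
out.release()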

Team Status Report for 26 March 2022

The most significant risk that could jeopardize our project remains whether or not we are able to separate speech well enough to generate usable captions.

To manage this risk, we are simultaneously working on multiple approaches for speech separation. The deep learning approach is good enough for our MVP, though it may require a calibration step to match voices in the audio to people in the video. We want to avoid having a calibration step and are therefore continuing to develop a beamforming solution. Combining multiple approaches may end up being the best solution for our project. Our contingency plan, however, is to use deep learning with a calibration step. This solution is likely to work, but is also the least novel.

We have not made any significant changes to the existing design of the system. One thing we are considering now is how to determine the number of people that are currently speaking in the video. Information about the number of people currently speaking would help us avoid generating extraneous and incorrect captions. With the current prevalence of masks, we have to rely on an audio solution. Charlie is developing a clever solution that scales to 2 people, which is all we need given our project scope. We will likely integrate his solution with the audio processing code.

We have not made any further adjustments to our schedule.

One component that is now mostly working is angle estimation using the Jetson TX2 CSI camera, which we are using due to its slightly higher FOV and lower distortion.

In the above picture, the angle estimation is overlaid over the center position of the person.
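The estimation itself boils down to mapping a person's horizontal pixel center to an angle with a simple pinhole-camera model, roughly as below; the horizontal field-of-view value here is an assumption, not the measured FOV of the CSI camera:

import math

def pixel_to_angle(x_center, frame_width, hfov_deg=62.0):
    # Map the person's horizontal pixel center to an angle from the camera's
    # principal axis using a pinhole model. hfov_deg is an assumed value.
    cx = frame_width / 2.0
    focal_px = cx / math.tan(math.radians(hfov_deg) / 2.0)
    return math.degrees(math.atan2(x_center - cx, focal_px))

print(pixel_to_angle(1600, 1920))   # a person right of center gives a positive angle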

Team Status Report for 19 March 2022

What are the most significant risks that could jeopardize the success of the project? How are these risks being managed? What contingency plans are ready?

The most significant risk is the beamforming part of our system. To manage this risk, Charlie and Stella will be working together to get beamforming working. After discussing the issue on Friday, Stella will look through her delay-and-sum beamforming code on Saturday and try to separate a simulated wave on Sunday. If no bug is found, Charlie will then come in and work with Stella to figure out the problem.
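The simulated-wave test amounts to a delay-and-sum sanity check along these lines (two mics and illustrative geometry, simplified from the UMA-8):

import numpy as np

fs, c, d = 16000, 343.0, 0.10       # sample rate, speed of sound (m/s), mic spacing (m)
t = np.arange(0, 0.1, 1 / fs)
theta = np.deg2rad(30)              # true source direction
delay = d * np.sin(theta) / c       # inter-mic delay in seconds

mic1 = np.sin(2 * np.pi * 2000 * t)            # simulated 2 kHz tone at mic 1
mic2 = np.sin(2 * np.pi * 2000 * (t - delay))  # same tone arriving later at mic 2

def delay_and_sum(x1, x2, steer_deg):
    # Advance mic 2 by the delay implied by the steering angle, then average.
    shift = int(round(d * np.sin(np.deg2rad(steer_deg)) / c * fs))
    return (x1 + np.roll(x2, -shift)) / 2      # np.roll is fine for a periodic test tone

print(np.std(delay_and_sum(mic1, mic2, 30)))   # steered at the source: near full amplitude
print(np.std(delay_and_sum(mic1, mic2, -30)))  # steered away: strongly attenuated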

The contingency plan is a deep learning speech separation approach. Charlie has had significant success using deep learning to separate speech, but the model he used cannot run in real time due to processing latency. Nevertheless, it satisfies our MVP.

Were any changes made to the existing design of the system (requirements, block diagram, system spec, etc)? Why was this change necessary, what costs does the change incur, and how will these costs be mitigated going forward?

There is no change to our system design unless we resort to a deep learning approach to separate mixed speech. That is a discussion we will have in two weeks.

Provide an updated schedule if changes have occurred.

No change to schedule.

This is also the place to put some photos of your progress or to brag about a component you got working.

The separated speech from Charlie's deep learning approach is worth bragging about.

Mixed speech

Stella speech

Larry speech

Team Status Report for 26 February 2022

The most significant risk that can jeopardize our project right now is whether or not we will be able to separate speech from two sources (i.e. two people speaking) well enough for our speech-to-text model to generate separate and accurate captions for the two sources. There are two elements here that we have to work with to get well-separated speech: 1) our microphone array and 2) our speech separation algorithm. Currently, our only microphone array is still the UMA-8 circular array. Over the past week, we searched for linear mic arrays that could connect directly to our Jetson TX2 but didn’t find any. We did find two other options for mic arrays: 1) a 4-mic linear array that we can use with a Raspberry Pi, 2) a USBStreamer Kit that we can connect multiple I2S MEMS mics to, then connect the USBStreamer Kit to the Jetson TX2. The challenge with these two new options is that we would need to take in data from multiple separate mics or mic arrays and we would need to synchronize the incoming audio data properly. Our current plan remains to try and get speech separation working with our UMA-8 array, but to look for and buy backup parts in case we cannot separate speech well enough with only the UMA-8.

We have made no changes to our design since last week, but rather have been working on implementing aspects of our design. We have, however, developed new ideas for how to set up our mic array since last week, and we now have ideas for solutions that would allow us to use more mics in a linear array, should we need to pivot from our current design.

We have made two changes to our schedule this week. First, we are scheduling in more time for ourselves to implement speech separation, since this part of the project is proving more challenging than we’d initially thought. Second, we are scheduling in time to work on our design proposal (which we neglected to include in our original schedule).

This week we made progress on our image processing pipeline. We implemented webcam calibration (to correct for any distance distortion our wide-angle camera lens causes) and implemented Detectron2 image segmentation to identify different humans in an image. This coming week we will implement more parts of both our image processing and audio processing pipelines.
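For reference, the Detectron2 segmentation step looks roughly like the snippet below; the Mask R-CNN configuration shown is the standard COCO model-zoo one, which may differ from the exact configuration we end up using:

import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor = DefaultPredictor(cfg)

frame = cv2.imread("frame.jpg")                  # placeholder input frame
outputs = predictor(frame)
instances = outputs["instances"]
people = instances[instances.pred_classes == 0]  # class 0 is "person" in COCO
print(len(people), "people detected")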

Team Status Report for 19 February 2022

Currently, the most significant risk that can jeopardize our project is that we may not be able to separate the speakers well enough for the Speech to Text model to produce usable captions. We spoke with Professor Sullivan about our circular microphone array, and he strongly recommended the use of a linear array for our application. There don’t seem to be any great options for prebuilt linear arrays online, as we could only find one specifically for the Raspberry Pi. The estimated shipping time for that array is a month, so for now we plan to continue working with the UMA-8. If the UMA-8 is too small for both beamforming and STFT, we will have to try building our own array out of separate microphones. This approach will add cost and potentially take a lot more time. None of us are familiar with the steps involved in recording from multiple microphones, so we hope to avoid that complication.

One of the main changes we made from the proposal presentation is the use of a Jetson TX2 for all of the processing. We wanted to limit the amount of data movement that we would have to deal with, and the Jetson TX2 also provides consistent processing and I/O capability compared to the variability of the user's laptop. Another key design choice we made was to use an HDMI to USB video capture card to transfer our final output to the user's laptop. We based this on the iContact project from Fall 2020. Both of these changes should greatly simplify our design and allow us to focus on the sound processing.

Our schedule remains pretty much the same as the one presented in the proposal presentation. Instead of having to worry about circuit wiring, however, we now just have to deal with the video capture card.

We were able to successfully use the TX2 to interface with the webcam and UMA-8 through a USB hub. We have now started to work with the video and audio data of what we hope to be our final components.
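Capturing from both devices goes roughly along these lines, assuming the sounddevice and OpenCV packages; the device name, channel count, and frame rate below are assumptions to verify with sd.query_devices():

import cv2
import sounddevice as sd
import soundfile as sf

fs, seconds, channels = 16000, 5, 7            # UMA-8 channel count may differ by firmware

audio = sd.rec(int(seconds * fs), samplerate=fs, channels=channels, device="UMA")  # device name substring is a guess
cap = cv2.VideoCapture(0)                      # USB webcam index is a guess too
frames = []
while len(frames) < seconds * 30:              # assume roughly 30 fps
    ok, frame = cap.read()
    if ok:
        frames.append(frame)
sd.wait()                                      # block until the audio buffer is full
cap.release()
sf.write("uma8_recording.wav", audio, fs)
print(f"captured {len(frames)} video frames and audio of shape {audio.shape}")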

Team Status Report for 12 February 2022

The most significant risk that can jeopardize our project is how to receive more than 2 channels of audio input. We are currently thinking of two solutions. The first solution is to purchase a prebuilt microphone array and then design complex speech separation algorithms (since we cannot adjust the spacing of the microphones). The second solution is to purchase a USB hub and wire USB microphones to it. This allows us to design less complex beamforming algorithms, since we can vary the positions of the microphones. The current plan is to first use the prebuilt microphone array with the complex algorithms; if that does not work, we can still apply our complex algorithms while also having the flexibility of altering the microphone positions.

One of the changes that we made is to purchase a prebuilt microphone array instead of building one ourselves. This is because we do not know how to create multichannel inputs by soldering microphones together; we also adopted this approach at Professor Sullivan's suggestion. Just in case, we purchased multiple such boards to test, which increased our spending slightly.

No changes have been made to our current schedule that we discussed in the proposal presentation.

Photos of our current progress can be found in Charlie's report. This week we primarily worked on our proposal presentation, so we do not have many design graphics to show.