Stella’s Status Report 30 April 2022

This past week I spent some time finalizing my preparation for the final presentation, then shifted my focus to our poster and to planning testing. I picked up our new mics, which we were able to test out on Friday. The new mics produce noticeably better recordings than our UMA-8 array.

Before the Friday testing, I wrote up the tests we should perform and how many times we should perform each one, and I created a spreadsheet to keep track of our tests and results so that we could get through data collection and analysis more efficiently. I decided on 3 trials per test, as this is a standard minimum number of trials for an experiment.

In our last meeting, Dr. Sullivan gave us some advice on how to test our system more effectively, and I incorporated that advice into the new testing plan. For one, Larry and I used the same script for this round of testing. In test 7, where we spoke at the same time, Larry started from sentence 1 while I started from sentence 2 and spoke sentence 1 at the end. Initially we tried simply speaking the same script from the top, but that meant Larry and I were saying each word at the same time, so the resulting word error rates would not have been a good measure of our system’s ability to separate different speakers.

Another suggestion I added for this round was a reversed version of each test, so that we could tell whether there was any difference in word error rate between my voice (higher pitched) and Larry’s voice (lower pitched).

One suggestion that we have not yet used is testing in a wider open space, such as an auditorium or concert hall. For the sake of time (setup and data collection took 3–4 hours), we decided to collect only conference room data for now.

We are currently on schedule.

On Sunday, I will complete the poster. Next week, I will generate the appropriate audio files and calculate WER for the following sets of processing steps:

0. Just deep learning (WERs already computed by Charlie)

1. SSF then deep learning, to see if SSF reverb reduction improves the performance of the deep learning algorithm

2. SSF then PDCW, to see if this signal processing approach works well enough for us now that we have better mics
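As a reference for that analysis, below is a minimal MATLAB sketch of the WER calculation (a word-level edit distance between the reference and hypothesis transcripts); the function name and the space-delimited string inputs are my own illustrative choices, not our actual scoring code.

    function wer = word_error_rate(refStr, hypStr)
    % WER = (substitutions + deletions + insertions) / number of reference words,
    % computed with a word-level Levenshtein (edit) distance.
        ref = strsplit(lower(strtrim(refStr)));
        hyp = strsplit(lower(strtrim(hypStr)));
        n = numel(ref); m = numel(hyp);

        % d(i+1, j+1) = edit distance between ref(1:i) and hyp(1:j)
        d = zeros(n + 1, m + 1);
        d(:, 1) = 0:n;
        d(1, :) = 0:m;
        for i = 1:n
            for j = 1:m
                cost = ~strcmp(ref{i}, hyp{j});           % 0 if the words match, 1 otherwise
                d(i + 1, j + 1) = min([d(i, j + 1) + 1, ...   % deletion
                                       d(i + 1, j) + 1, ...   % insertion
                                       d(i, j) + cost]);      % substitution / match
            end
        end
        wer = d(n + 1, m + 1) / n;
    end

For example, word_error_rate('the cat sat on the mat', 'the cat sat on mat') returns 1/6, since one of the six reference words was dropped.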

Stella’s Status Report for 23 April 2022

Early this week, I finished fixing the SSF + PDCW code in MATLAB so that it runs correctly. I then found the WERs for the speech separated by SSF + PDCW and compared them to the WERs from the deep learning speech separation:

Based on the WERs and on the sound of the recordings, we decided to proceed with the deep learning approach.

This week, we collected a new set of data from two different environments: 1) an indoor conference room, and 2) the outdoor CUC loggia. We collected 5 recordings in each location. Before our recording session, I came up with ideas for what we should test so that we could probe different limits of our system. Our five tests were as follows:

  1. Solo 1 (clean reference signal): Larry speaks script 1 alone
  2. Solo 2 (clean reference signal): Stella speaks script 2 alone
  3. Momentary interruptions (as might happen in a real conversation): Stella speaks script 2 while Larry counts – “one”, “two”, “three”, etc… – every 2 seconds (a regular interval for the sake of repeatability)
  4. Partial overlap: Stella begins script 2, midway through Larry begins script 1, Stella finishes script 2 and then, later, Larry finishes script 1
  5. Full overlap: Stella and Larry begin their respective scripts at the same time

We got the following results:

I ran each of these recordings through the SSF and PDCW algorithms (to try them out as a pre-processing step), then fed those outputs into the deep learning speech separation model and finally into the speech-to-text model.
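For illustration, the batch processing looked roughly like the sketch below; the function names ssf and pdcw, the folder layout, and the choice of channels are placeholders rather than the actual code.

    % Batch preprocessing sketch: dereverberate each channel with SSF, separate the
    % target speaker with PDCW, and write the result out so it can be passed to the
    % deep learning separation and speech-to-text models.
    files = dir(fullfile('recordings', '*.wav'));
    for k = 1:numel(files)
        [x, fs] = audioread(fullfile(files(k).folder, files(k).name));

        % PDCW operates on a pair of channels; which pair to use depends on the
        % array geometry (channels 1 and 2 here are placeholders).
        left  = ssf(x(:, 1), fs);   % ssf: placeholder name for my SSF implementation
        right = ssf(x(:, 2), fs);
        y = pdcw(left, right, fs);  % pdcw: placeholder name for my PDCW implementation

        audiowrite(fullfile('preprocessed', files(k).name), y, fs);
    end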

In our meeting this week, Dr. Sullivan pointed out that, since we no longer need our full circular mic array (we now need only two mics), we could spend the rest of our budget on better mics. Our hope is that the better audio quality will improve our system’s performance, and maybe even give us usable results from the SSF and PDCW algorithms. So, on Wednesday, I spent time searching for stereo mic setups. Eventually, I found a stereo USB audio interface and, separately, a pair of good-quality mics and submitted an order request.

This week I also worked on the final project presentation, which I will be giving on Monday.

We are on schedule. We are currently finishing our integration.

This weekend, I will finish creating the slides and prepping my script for my presentation next week. After the presentation, I’ll start on the poster and final report and determine what additional data we want to collect.

Team Status Report for 23 April 2022

Our greatest current risk is that we will encounter problems while integrating our user interface with the rest of our project. Currently, the video capture, audio capture, and various processing steps work together as we want. We’ve been able to test the performance of our system without the UI, but for demo day we aim to finish a website that lets users record their video and view the captioned output video all in one place. The website is largely finished, however, and just needs to be connected to the processing steps of our system. As a contingency plan, we can always ask the user to connect their own laptop (or one of our laptops, for demo day) to the Jetson in order to view the captioned video.

We have made several changes to our design in the past week. For one, we have finalized our decision to use the deep learning approach for speech separation in our final design, rather than using the signal processing techniques SSF and PDCW. While SSF and PDCW do noticeably enhance our speaker of interest, they don’t work well enough to give us a decent WER. We will, however, try using SSF and PDCW to pre-process the audio before passing it to the deep learning algorithm to see if that helps our system’s performance.

While the deep learning algorithm takes in only one channel of input, we still need two channels to distinguish our left from our right speaker. This means that we no longer need our full mic array and could instead use stereo mics. Because we had spent less than half of our budget before this week, we decided to use the rest to buy components for a better stereo recording. We submitted purchase request forms for a stereo audio interface, two microphones of much better quality than the ones in the circular mic array we’ve been working with, and the necessary cords to connect these parts. We hope that a better quality audio recording will help reduce our WER.

We have made no changes to our schedule.

Our project is now very near completion. The website allows file uploads, can display a video, and displays a timer for the user to see how long the recording will go. The captions display nicely over their respective speakers. (See Charlie and Larry’s status reports for more details.)

For the audio processing side, we collected a new set of recordings this past week in two separate locations: indoors in a conference room and outdoors in the CUC loggia (the open archway space along the side of the UC). In both locations, we collected the same set of 5 recordings: 1) Larry speaking alone, 2) Stella speaking alone, 3) Stella speaking with brief interruptions from Larry, 4) partial overlap of speakers (just Stella then both then just Larry), 5) full overlap of speakers. Using the data we collected, we were able to assess the performance of our system under various conditions (see Stella’s status report for further details). Once we get our new microphones, we can perform all or some of these tests again to see the change in performance.

Stella’s Status Report for 16 April 2022

This week I met with Dr. Stern to ask some further questions about the Phase Difference Channel Weighting (PDCW) source separation algorithm and the Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF) reverberation-reducing algorithm. In this meeting, I found a mistake in my previous implementation of the two methods.

I spent time this week fixing my implementation in MATLAB. Instead of running and testing PDCW and SSF separately, as I had previously been doing, I switched to first running SSF on the two channels I intended to input to PDCW, then sending the two channels (now with less reverberation noise) through PDCW to separate out our source of interest.

My goal before next week is to finalize a decision as to whether or not we can use only signal processing for source separation. If the SSF-PDCW combination works well enough, we will proceed with that, but if it doesn’t we will use the deep learning algorithm instead. If we use the deep learning algorithm, we may still be able to get better results by doing some pre-processing with SSF or PDCW – we will have to test this.

This week I also wrote up a list of the remaining data we have to collect for our testing. We want to record the same audio in multiple different locations, so having a written-out testing plan will help us get these recordings done more efficiently.

I started planning out the final presentation this week and will finish that in the coming week.

We are on schedule now and are working on integration and testing.

Stella’s Status Report for 2 April 2022

This week, I debugged and tested the delay-and-sum beamforming algorithm with the goal of either getting delay-and-sum beamforming to successfully separate speech in our recordings, or showing that delay-and-sum beamforming with our current mic array would not be sufficient to separate speech and that we should use a different approach.

In order to test the delay-and-sum beamforming algorithm, I first generated a sine wave in MATLAB. I played this sound through an earbud towards the mic at angle 0 on the mic array (mic 1).
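Generating the test tone is only a few lines in MATLAB; the frequency, duration, and file name below are illustrative choices rather than the exact values I used.

    % Generate a sine-wave test tone and write it to a file to play through the earbud.
    fs  = 48000;                        % sample rate in Hz
    dur = 5;                            % duration in seconds
    f0  = 1000;                         % test-tone frequency in Hz
    t   = (0:1/fs:dur - 1/fs)';
    tone = 0.5 * sin(2 * pi * f0 * t);  % scale below 1 to avoid clipping
    audiowrite('sine_test_tone.wav', tone, fs);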

I played the sine wave while recording from the mic array. Then, I passed the mic array recordings through my delay-and-sum beamforming code. Though there was a lot of background noise, I found that beamforming towards the source (the earbud) gave an output with a greater SNR (judged by ear, not computationally) than beamforming away from the source. Below are the audio outputs for beamforming towards the source – look direction = 0 – and beamforming directly away from the source – look direction = pi.

look direction = 0:

look direction = pi:

On Thursday, Charlie and I met with Dr. Stern to discuss alternative approaches to speech separation, particularly Phase Difference Channel Weighting (PDCW), which Dr. Stern published. I got the beamforming test results (shared above) before this meeting, so we went in with the goal of finding new algorithms to try rather than trying to get delay-and-sum beamforming working.

We are a bit behind schedule since we have decided to abandon delay-and-sum beamforming in favor of a non-beamforming speech separation approach. To catch up this week, we will collect recordings that we can pass into the PDCW algorithm and determine how well this algorithm can separate speech with our mic setup.

This week, Charlie and I will collect recordings that fit the assumptions of the PDCW algorithm and test the algorithm on them. From there, we will see what we can get working with the PDCW algorithm and we will meet up again with Dr. Stern to discuss our results and get some more guidance (as he suggested in our last meeting).

Team Status Report for 2 April 2022

The most significant risk that could jeopardize our project is whether or not we will be able to get our audio processing pipeline working well enough to get usable captions from the speech-to-text model we’re using.

This past week, we decided to abandon the delay-and-sum beamforming in favor of a different algorithm, Phase Difference Channel Weighting (PDCW), which Dr. Stern published. Charlie and Stella met with Dr. Stern this past week to discuss PDCW and possible reasons our previous implementation wasn’t working. On Friday, Charlie and Larry recorded new data which we will use to test the PDCW algorithm (the data had to be recorded in a particular configuration to meet the assumptions of the PDCW algorithm).

PDCW is our current plan for how to use signal processing techniques in our audio pipeline, but as a backup plan, we have a deep learning module – SpeechBrain’s SepFormer – which we can use to separate multiple speakers. We decided with Dr. Sullivan this past week that, if we go with the deep learning approach, we will test our final system’s performance on more than just two speakers.

The change to our audio processing algorithm is the only significant change we made this week to the design of our system. We have not made any further adjustments to our schedule.

On our video processing side, Larry has been able to generate time-stamped captions, and with our angle estimation working, we are close to being able to put captions on our speakers. With this progress on our video processing pipeline and with the SepFormer module as our interim audio processing pipeline, we’ve been able to start working on integrating the various parts of our system, which we wanted to start as early as possible.

Stella’s Status Report for 26 March 2022

This week I focused on debugging my circular beamforming code and on collecting data to test the code. I also spent time on the ethics assignment and discussion prep. Over the past week, I found and fixed several bugs in my code. The audio still does not seem to be beamforming towards the intended speaker, though, and instead the algorithm seems mainly to be cutting out some noise. In the figures below, Figure 1 shows the output signal and Figure 2 shows one channel of the input signal. As can be seen in the figures, the output looks very similar to the input, but has lower amplitude during the parts of the recording with background noise and no speaking.

It is possible that there are more issues with my code, and it is also possible that our recording setup wasn’t precise enough and that the look direction (i.e. angle of the speaker of interest) that I input to my code isn’t accurate enough for the beamforming algorithm to amplify one speaker over the other. This week, for testing, I brought in a tape measure so that we could better estimate distances and angles in our setup, but doing so with just the tape measure was difficult. For the next time we test, I’m planning to bring in a protractor and some masking tape as well so that I can rule out setup inaccuracies (as much as possible) when testing the beamforming code.

We are behind schedule since we do not have beamforming working yet. In the coming week, I will work with Charlie to write test code for my beamforming algorithm so that we can get it working within the week.

Stella’s Status Report for 19 March 2022

This week I worked on getting the circular beamforming code working and on collecting data to test it. For the initial approach, I’m coding our beamforming algorithm using the delay-and-sum method. From my reading, I’ve found several other beamforming methods that are more complicated to implement but that would likely work better than delay-and-sum; I’m using delay-and-sum for now because it’s the simplest method to implement. I decided to calculate the delay-and-sum coefficients in the frequency domain, rather than in the time domain, after learning from several papers that calculating the coefficients in the time domain can lead to quantization errors. Right now, the code takes in 7 channels of audio (recorded using our mic array) along with the angle of the speaker we want to beamform towards. While the code runs, it currently doesn’t seem to do much to the audio from input to output: every output I’ve listened to has sounded just like its input, but with some of the highest-frequency noise removed, which may be a result of how the numbers are stored in MATLAB rather than anything my algorithm is doing.
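As a point of reference, a frequency-domain delay-and-sum beamformer can be sketched as below. This is a minimal illustration of the approach rather than my actual code, and it assumes the per-mic delays for the look direction have already been computed from the array geometry.

    function y = delay_and_sum_fd(x, fs, tau)
    % Frequency-domain delay-and-sum sketch.
    %   x   : N x M matrix, one column per microphone channel
    %   fs  : sample rate in Hz
    %   tau : per-mic delays in seconds (M values) for the chosen look direction
        [N, M] = size(x);
        Nfft = 2^nextpow2(2 * N);                 % zero-pad so the shifts don't wrap around
        X = fft(x, Nfft, 1);                      % column-wise FFTs
        f = (0:Nfft - 1)' * fs / Nfft;            % FFT bin frequencies
        f(f >= fs / 2) = f(f >= fs / 2) - fs;     % map upper bins to negative frequencies

        % Advance each channel by its delay so the look-direction signal aligns, then average.
        steer = exp(1j * 2 * pi * f * tau(:).');  % Nfft x M steering phases
        y = real(ifft(mean(X .* steer, 2)));
        y = y(1:N);                               % trim back to the original length
    end

Beamforming towards a different direction then just means passing in the delays computed for that direction.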

In the coming week, my main goal will be to determine whether the lack of change from input to output audio comes from a bug in my code or from our hardware being inadequate for any noticeable beamforming. I suspect there is at least one bug in the code, since the output should not sound exactly like the input. If our 7-mic array turns out to be insufficient for good beamforming once I fix the delay-and-sum algorithm, I will try to implement some of the other beamforming algorithms I found. To test my code, I will first check for bugs by visual inspection; then I’ll try simulating my own input data by making 7 channels of a sine wave and time-delaying each of them, using Charlie’s code for calculating time delays, to reflect a physical setup.
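Such a simulated input might look like the sketch below; the per-mic delay values are placeholders standing in for whatever the delay-calculation code produces.

    % Build 7 channels of a sine wave, each delayed by a per-mic delay (rounded to
    % whole samples for simplicity) to mimic a plane wave hitting the array.
    fs  = 48000;
    t   = (0:1/fs:1 - 1/fs)';
    s   = sin(2 * pi * 500 * t);                 % 500 Hz test tone
    tau = [0, 1, 2, -1, -2, 1, 0] * 1e-4;        % placeholder per-mic delays (seconds)

    x = zeros(numel(s), numel(tau));
    for m = 1:numel(tau)
        d = round(tau(m) * fs);                  % delay in whole samples
        x(:, m) = circshift(s, d);               % d > 0 delays the channel, d < 0 advances it
    end
    % Feeding x to the beamformer with the same tau as the look direction should give
    % a cleanly aligned sum; mismatched delays should partially cancel and attenuate it.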

This week I also met up with Charlie to collect some more audio data from our mic array, and I started brainstorming ways we could make our test setup more precise for later testing, since we’re currently judging speaker distance and angle from the mic array by eye.

I initially expected to have completed the beamforming algorithm earlier in the semester, but I am on schedule according to our current plan.

Stella’s Status Report for 26 February 2022

This week I taught myself about the delay-and-sum algorithm for beamforming with a linear microphone array so as to understand how delay-and-sum beamforming can be implemented for a circular microphone array (specifically, how we can calculate the correct delays for each mic in a circular array). I have written some preliminary MATLAB code to implement this algorithm, though I haven’t yet run data through it.
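For reference, the far-field delay calculation I'm working towards can be sketched as follows; the array radius, mic angles, and look direction below are placeholder values rather than the UMA-8's actual geometry.

    % Far-field delays for a circular array: a plane wave from azimuth theta reaches
    % mic m (r/c)*cos(theta - phi_m) seconds *before* it reaches the array centre,
    % so the delay of mic m relative to the centre is the negative of that.
    c     = 343;                         % speed of sound in m/s
    r     = 0.043;                       % array radius in metres (placeholder value)
    phi   = (0:5) * pi / 3;              % six mics evenly spaced around the circle (placeholder)
    theta = deg2rad(30);                 % look direction (azimuth of the desired speaker)

    tau = -(r / c) * cos(theta - phi);   % seconds, one entry per outer mic
    % A centre mic, if used, simply has zero delay relative to the array centre.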

This week I also finished writing a testing plan to guide our audio data collection. This plan covers all of the parameters we’ve discussed varying during testing, as well as setup plans for data that we can collect to assess our system’s performance across these various parameters. I wrote up a set of steps for data collection that we can use for all of our data collection so as to have repeatable tests. Additionally, I started working on the Design Proposal document this week.

Currently, I think we are slightly behind schedule, as speech separation has been a more challenging problem than we were expecting. I had hoped to get delay-and-sum beamforming figured out by the end of this week, but I still have some work left to do to figure out how to work with the circular mic array. I hope to have notes and equations to share with my team tomorrow so that Charlie and I can start implementing the speech separation pipeline this coming week. I also didn’t achieve my goal from last week of completing a draft of the Design Proposal, but this is because we decided at the start of the week that the beamforming algorithm was a more urgent priority.

Early this week (goal: Sunday) I aim to finish figuring out the delay-and-sum beamforming equations and send them to Charlie. Once I send those equations, Charlie and I will be able to implement delay-and-sum beamforming with Charlie’s generalised sidelobe canceller algorithm and test how well that pipeline is able to separate speech (not in real time). By mid-week, we will also finish putting together our Design Proposal. I plan to work on a big part of this proposal since I’ve written similar documents before.

Team Status Report for 26 February 2022

The most significant risk that can jeopardize our project right now is whether or not we will be able to separate speech from two sources (i.e. two people speaking) well enough for our speech-to-text model to generate separate and accurate captions for the two sources. There are two elements here that we have to work with to get well-separated speech: 1) our microphone array and 2) our speech separation algorithm. Currently, our only microphone array is still the UMA-8 circular array. Over the past week, we searched for linear mic arrays that could connect directly to our Jetson TX2 but didn’t find any. We did find two other options for mic arrays: 1) a 4-mic linear array that we can use with a Raspberry Pi, 2) a USBStreamer Kit that we can connect multiple I2S MEMS mics to, then connect the USBStreamer Kit to the Jetson TX2. The challenge with these two new options is that we would need to take in data from multiple separate mics or mic arrays and we would need to synchronize the incoming audio data properly. Our current plan remains to try and get speech separation working with our UMA-8 array, but to look for and buy backup parts in case we cannot separate speech well enough with only the UMA-8.

We have made no changes to our design since last week, but rather have been working on implementing aspects of our design. We have, however, developed new ideas for how to set up our mic array since last week, and we now have ideas for solutions that would allow us to use more mics in a linear array, should we need to pivot from our current design.

We have made two changes to our schedule this week. First, we are scheduling in more time for ourselves to implement speech separation, since this part of the project is proving more challenging than we’d initially thought. Second, we are scheduling in time to work on our design proposal (which we neglected to include in our original schedule).

This week we made progress on our image processing pipeline. We implemented webcam calibration (to correct for any distance distortion our wide-angle camera lens causes) and implemented Detectron2 image segmentation to identify different humans in an image. This coming week we will implement more parts of both our image processing and audio processing pipelines.