Stella’s Status Report for 23 April 2022

Early this week, I finished fixing the SSF + PDCW code in MATLAB so that it runs correctly. I then found the WERs for the speech separated by SSF + PDCW and compared them to the WERs from the deep learning speech separation:

Based on the WERs and on the sound of the recordings, we decided to proceed with the deep learning approach.
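
For reference, the WER numbers throughout these reports are standard word error rates: the minimum number of word substitutions, deletions, and insertions needed to turn the speech-to-text output into the reference script, divided by the number of words in the reference. A minimal MATLAB sketch of the computation (illustrative only, not our exact scoring script):

    % Word error rate via word-level Levenshtein distance.
    function wer = word_error_rate(refStr, hypStr)
        ref = split(string(refStr));   % reference script words
        hyp = split(string(hypStr));   % speech-to-text output words
        n = numel(ref); m = numel(hyp);
        d = zeros(n+1, m+1);
        d(:,1) = (0:n)';               % all-deletions column
        d(1,:) = 0:m;                  % all-insertions row
        for i = 1:n
            for j = 1:m
                sub = d(i,j) + (ref(i) ~= hyp(j));                    % match or substitute
                d(i+1,j+1) = min([sub, d(i,j+1) + 1, d(i+1,j) + 1]);  % vs. delete/insert
            end
        end
        wer = d(n+1,m+1) / n;          % normalize by reference length
    end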

This week, we collected a new set of data from two different environments: 1) an indoor conference room, and 2) the outdoor CUC loggia. We collected 5 recordings in each location. Before our recording session, I came up with ideas for what we should test so that we could probe different limits of our system. Our five tests were as follows:

  1. Solo 1 (clean reference signal): Larry speaks script 1 alone
  2. Solo 2 (clean reference signal): Stella speaks script 2 alone
  3. Momentary interruptions (as might happen in a real conversation): Stella speaks script 2 while Larry counts aloud (“one”, “two”, “three”, …) every 2 seconds, a regular interval for the sake of repeatability
  4. Partial overlap: Stella begins script 2; midway through, Larry begins script 1; Stella finishes script 2 and, later, Larry finishes script 1
  5. Full overlap: Stella and Larry begin their respective scripts at the same time

We got the following results:

I ran each of these same recordings through the SSF and PDCW algorithms (to try them out as a preprocessing step), then fed those outputs into the deep learning speech separation model and, finally, into the speech-to-text model.
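
As a sketch, the preprocessing experiment looked like the following (ssf and pdcw stand in for our MATLAB implementations, and the file names are placeholders; the deep learning separation and speech-to-text models are run separately):

    % Batch over the five test recordings from one location.
    tests = ["solo1", "solo2", "interruptions", "partial_overlap", "full_overlap"];
    for k = 1:numel(tests)
        [x, fs] = audioread(tests(k) + ".wav");   % 2-channel recording
        x = [ssf(x(:,1), fs), ssf(x(:,2), fs)];   % SSF dereverberation per channel
        y = pdcw(x, fs);                          % PDCW source separation
        audiowrite(tests(k) + "_preprocessed.wav", y, fs);
    end
    % Each preprocessed file then goes through the deep learning separation
    % model and the speech-to-text model, and the resulting WERs are compared
    % against the same pipeline without SSF/PDCW preprocessing.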

In our meeting this week, Dr. Sullivan pointed out that, since we no longer need our full circular mic array (we now need only two mics), we could spend the rest of our budget on better mics. Our hope is that the better audio quality will improve our system’s performance, and maybe even give us usable results from the SSF + PDCW algorithms. So, on Wednesday, I spent time searching for stereo mic setups. Eventually, I found a stereo USB audio interface and, separately, a pair of good-quality mics, and submitted an order request.

This week I also worked on the final project presentation, which I will be giving on Monday.

We are on schedule. We are currently finishing our integration.

This weekend, I will finish creating the slides and prepping my script for my presentation next week. After the presentation, I’ll start on the poster and final report and determine what additional data we want to collect.

Stella’s Status Report for 16 April 2022

This week I met with Dr. Stern to ask some further questions about the Phase Difference Channel Weighting (PDCW) source separation algorithm and the Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF) reverberation-reducing algorithm. In this meeting, I found a mistake in my previous implementation of the two methods.

I spent time this week fixing my implementation in MATLAB. Instead of running and testing PDCW and SSF separately, as I had previously been doing, I switched to first running SSF on the two channels I intended to input to PDCW, then sending the two channels (now with less reverberation noise) through PDCW to separate out our source of interest.
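
Concretely, the corrected ordering looks like this (ssf and pdcw stand in for my MATLAB implementations; the signatures are illustrative):

    % Old approach: run and evaluate SSF and PDCW independently.
    % Fixed approach: chain them, SSF first, then PDCW.
    [x, fs] = audioread('two_mic_recording.wav');   % 2-channel input
    left   = ssf(x(:,1), fs);                       % dereverberate channel 1
    right  = ssf(x(:,2), fs);                       % dereverberate channel 2
    target = pdcw([left, right], fs);               % separate the source of interest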

My goal before next week is to finalize a decision as to whether we can use signal processing alone for source separation. If the SSF-PDCW combination works well enough, we will proceed with that; if it doesn’t, we will use the deep learning algorithm instead. Even if we use the deep learning algorithm, we may still be able to get better results by doing some preprocessing with SSF or PDCW; we will have to test this.

This week I also wrote up a list of the remaining data we have to collect for our testing. We want to record the same audio in multiple locations, so having a written-out testing plan will help us get these recordings done efficiently.

I started planning out the final presentation this week and will finish that in the coming week.

We are on schedule now and are working on integration and testing.

Stella’s Status Report for 2 April 2022

This week, I debugged and tested the delay-and-sum beamforming algorithm with the goal of either getting delay-and-sum beamforming to successfully separate speech in our recordings, or showing that delay-and-sum beamforming with our current mic array would not be sufficient to separate speech and that we should use a different approach.

In order to test the delay-and-sum beamforming algorithm, I first generated a sine wave in MATLAB. I played this sound through an earbud aimed at the mic at angle 0 on the mic array (mic 1).
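
The tone generation was along these lines (the frequency and duration here are representative placeholders, not necessarily the exact values used):

    fs = 16000;              % sample rate, Hz
    f  = 1000;               % tone frequency, Hz
    t  = (0:5*fs-1)'/fs;     % 5 seconds of samples
    audiowrite('test_tone.wav', 0.5*sin(2*pi*f*t), fs);   % half scale avoids clipping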

I played the sine wave while recording from the mic array, then passed the recordings through my delay-and-sum beamforming code. Though there was a lot of background noise, I found that beamforming towards the source (the earbud) gave an output with a greater SNR (judged by ear, not computationally) than beamforming away from the source. Below are the audio outputs for beamforming towards the source (look direction = 0) and directly away from the source (look direction = pi).

look direction = 0:

look direction = pi:
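
In code terms, the comparison was simply steering the same recording in two directions (delay_and_sum is a placeholder name for my beamforming function):

    [x, fs] = audioread('array_tone_recording.wav');   % 7-channel recording
    towards = delay_and_sum(x, fs, 0);                 % look direction = 0 (at the earbud)
    away    = delay_and_sum(x, fs, pi);                % look direction = pi (away from it)
    soundsc(towards, fs); pause(6); soundsc(away, fs); % A/B comparison by ear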

On Thursday, Charlie and I met with Dr. Stern to discuss alternative approaches to speech separation, particularly Phase Difference Channel Weighting (PDCW), which Dr. Stern published. I got the beamforming test results (shared above) before this meeting, so we went in with the goal of finding new algorithms to try rather than trying to get delay-and-sum beamforming working.

We are a bit behind schedule since we have decided to abandon delay-and-sum beamforming in favor of a non-beamforming speech separation approach. To catch up this week, we will collect recordings that we can pass into the PDCW algorithm and determine how well this algorithm can separate speech with our mic setup.

This week, Charlie and I will collect recordings that fit the assumptions of the PDCW algorithm and test the algorithm on them. From there, we will see what we can get working with the PDCW algorithm and we will meet up again with Dr. Stern to discuss our results and get some more guidance (as he suggested in our last meeting).

Stella’s Status Report for 26 March 2022

This week I focused on debugging my circular beamforming code and on collecting data to test the code. I also spent time on the ethics assignment and discussion prep. Over the past week, I found and fixed several bugs in my code. The audio still does not seem to be beamformed towards the intended speaker, though; instead, the algorithm seems only to be cutting out some noise. In the figure below, Figure 1 shows the output signal and Figure 2 shows one channel of the input signal. As can be seen in the figures, the output looks very similar to the input, but has lower amplitude during the parts of the recording with background noise and no speaking.

It is possible that there are more issues with my code, and it is also possible that our recording setup wasn’t precise enough and that the look direction (i.e. angle of the speaker of interest) that I input to my code isn’t accurate enough for the beamforming algorithm to amplify one speaker over the other. This week, for testing, I brought in a tape measure so that we could better estimate distances and angles in our setup, but doing so with just the tape measure was difficult. For the next time we test, I’m planning to bring in a protractor and some masking tape as well so that I can rule out setup inaccuracies (as much as possible) when testing the beamforming code.

We are behind schedule since we do not have beamforming working yet. In the coming week, I will work with Charlie to write test code for my beamforming algorithm so that we can get it working within the week.

Stella’s Status Report for 19 March 2022

This week I worked on getting the circular beamforming code working and on collecting data to test it. For the initial approach, I’m coding our beamforming algorithm using the delay-and-sum method. From my reading, I’ve found several other beamforming methods that are more complicated to implement but would likely work better than delay-and-sum; I’m using delay-and-sum for now because it’s the simplest method to implement. I decided to calculate the delay-and-sum coefficients in the frequency domain, rather than in the time domain, after learning from several papers that time-domain coefficients can suffer from quantization errors.

Right now, the code takes in 7 channels of audio (recorded using our mic array) along with the angle of the speaker we want to beamform towards. While the code runs, it currently doesn’t seem to do much to the audio from input to output: every output I’ve listened to has sounded just like its input, but with some of the highest-frequency noise removed, which may be an artifact of numerical precision in MATLAB rather than anything my algorithm is doing.
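
For context, here is a minimal sketch of the frequency-domain delay-and-sum structure described above, under a far-field assumption and without the windowing details of my actual code:

    % Frequency-domain delay-and-sum for an N-mic circular array
    % (far-field assumption; whole-signal FFT instead of an STFT, for brevity).
    function y = delay_and_sum_freq(x, fs, lookAngle, radius)
        c = 343;                          % speed of sound, m/s
        [L, N] = size(x);                 % L samples, N mics
        micAngles = 2*pi*(0:N-1)'/N;      % mics evenly spaced on the circle
        % Far-field arrival delay of each mic relative to the array center:
        tau = -(radius/c) * cos(lookAngle - micAngles);
        k = (0:L-1)';
        k(k > L/2) = k(k > L/2) - L;      % map upper bins to negative frequencies
        f = k * fs / L;                   % FFT bin frequencies, Hz
        Y = zeros(L, 1);
        for m = 1:N
            % Phase-shift each channel to undo its delay, then accumulate:
            Y = Y + fft(x(:,m)) .* exp(1i*2*pi*f*tau(m));
        end
        y = real(ifft(Y)) / N;            % average of time-aligned channels
    end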

In the coming week, my main goal will be to determine whether the lack of change from input to output comes from a bug in my code or from our hardware being inadequate for any noticeable beamforming. I suspect there is at least one bug in the code, since the output should not sound exactly like the input. If our 7-mic array turns out to be insufficient for good beamforming once I fix the delay-and-sum algorithm, I will try implementing some of the other beamforming algorithms I found. To test my code, I will first check for bugs by visual inspection; then I’ll try simulating my own input data by making 7 channels of a sine wave and time-delaying each of them, using Charlie’s code for calculating time delays, to reflect a physical setup.
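
The simulation test would look something like this (the inline delay formula stands in for Charlie’s delay-calculation code, and the array radius and source angle are placeholder values):

    % Synthetic 7-channel input: one sine wave, delayed per mic as if it came
    % from a far-field source at a known angle.
    fs = 16000; c = 343;
    radius = 0.045;                                   % placeholder array radius, m
    lookAngle = 0;                                    % simulated source direction
    micAngles = 2*pi*(0:6)'/7;
    tau = -(radius/c) * cos(lookAngle - micAngles);   % per-mic delays, s
    s = sin(2*pi*440*(0:fs-1)'/fs);                   % 1 s, 440 Hz test tone
    x = zeros(numel(s), 7);
    for m = 1:7
        x(:,m) = circshift(s, round(tau(m)*fs));      % integer-sample delay approximation
    end
    % Beamforming x with the matching look angle should recover s (up to
    % scale); a mismatched look angle should not.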

This week I also met up with Charlie to collect some more audio data from our mic array, and I started brainstorming ways we could make our test setup more precise for later testing, since we’re currently judging speaker distance and angle from the mic array by eye.

I initially expected to have completed the beamforming algorithm earlier in the semester, but I am on schedule according to our current plan.

Stella’s Status Report for 26 February 2022

This week I taught myself about the delay-and-sum algorithm for beamforming with a linear microphone array so as to understand how delay-and-sum beamforming can be implemented for a circular microphone array (specifically, how we can calculate the correct delays for each mic in a circular array). I have written some preliminary MATLAB code to implement this algorithm, though I haven’t yet run data through it.
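
For reference, the key geometric result is the standard far-field delay for a circular array: a mic at angle theta_m on a circle of radius r, listening to a source in direction phi, receives the wavefront (r/c)*cos(phi - theta_m) seconds before the array center, where c is the speed of sound. In MATLAB terms:

    % Far-field delay of one mic relative to the array center (standard result).
    tau = @(r, c, phi, thetaM) -(r/c) * cos(phi - thetaM);
    % Example with placeholder values: 4.5 cm radius, mic at angle 0, source at
    % angle 0 -> the mic hears the wavefront ~0.13 ms early.
    tau(0.045, 343, 0, 0)   % about -1.31e-04 s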

This week I also finished writing a testing plan to guide our audio data collection. This plan covers all of the parameters we’ve discussed varying during testing, as well as setup plans for data that we can collect to assess our system’s performance across these various parameters. I wrote up a set of steps for data collection that we can use for all of our data collection so as to have repeatable tests. Additionally, I started working on the Design Proposal document this week.

Currently, I think we are slightly behind schedule, as speech separation has been a more challenging problem than we were expecting. I had hoped to get delay-and-sum beamforming figured out by the end of this week, but I still have some work left to do to figure out how to work with the circular mic array. I hope to have notes and equations to share with my team tomorrow so that Charlie and I can start implementing the speech separation pipeline this coming week. I also didn’t achieve my goal from last week of completing a draft of the Design Proposal, but this is because we decided at the start of the week that the beamforming algorithm was a more urgent priority.

Early this week (goal: Sunday), I aim to finish figuring out the delay-and-sum beamforming equations and send them to Charlie. Once I do, Charlie and I will be able to implement delay-and-sum beamforming with Charlie’s generalized sidelobe canceller algorithm and test how well that pipeline separates speech (not in real time). By mid-week, we will also finish putting together our Design Proposal. I plan to take on a big part of this proposal since I’ve written similar documents before.

Stella’s Status Report for 19 February 2022

Now that we have our mic array and webcam, we can start collecting data. So, this week I worked on creating a testing plan so that we can list what data we need to collect and what parts of the system need to be working to collect it. This testing document includes a list of repeatable steps for measuring the accuracy of our speech-to-text output, as well as a list of parameters we can vary between tests (e.g., script content, speakers speaking separately vs. simultaneously, speaker positioning relative to each other and to our device, frequency range of the speech). I took many of these parameters from my notes on the feedback Dr. Sullivan gave us on our system design when we met with him on Wednesday.

For the sake of collecting test data, I also found some YouTube videos of people speaking basic English phrases (mostly videos made for people learning English). We’ll have to test whether our speech-to-text pipeline performs any differently on live vs. recorded speech, but if it doesn’t make a difference, we could use these videos to let one person collect 2-speaker test data alone (which could be easier for us, logistically).

I also helped edit Larry’s Design Review slides. Specifically, I changed the formatting of parts of our block diagrams to make them easier to understand, and, for our testing plan, I combined our capture-to-display delay tests for video and for captions into a single test, since we decided this week to add captions to our video before showing it to the user.

I think we are currently mostly on schedule. Our main goal this week was to hammer out the details of our design for the design presentation. As a team, we went through all of the components of our system and decided how we want to transfer data between them, which was a big gap in our design before this week. We had initially planned to have a beamforming algorithm completed by this week; however, in our meeting on Wednesday we decided, based on Dr. Sullivan’s feedback, to try a linear rather than a circular mic array (which affects the beamforming algorithm). Charlie and I will work this week on finishing a beamforming algorithm that we can start running test data through and improving.

In the next week, I plan to finish writing a testing plan for our audio data collection, so that we can log what types of data we need and what we’ve collected, and so we can collect data in a repeatable manner. Charlie and I will collect more audio data and work on the beamforming algorithm so that we can test an iteration of the speech-to-text pipeline this or next week. I also plan to work on our written design report and hopefully have a draft of that done by the end of the week.

In the next week I’ll also try to find a linear mic array that can connect to our Jetson TX2, since we haven’t found one yet.

Stella’s Status Report for 12 February 2022

In the first half of this week, I helped Charlie and Larry put together our proposal presentation and helped Charlie edit his script for it. In particular, I worked on defining our stakeholder (use case) requirements and on our solution-approach diagram and testing plans. During Charlie’s presentation, I took notes on questions that came up for me and questions raised by other students, and I used these notes to figure out what to ask Dr. Sullivan after our presentation on Wednesday. After the presentation, our team met up to discuss our next steps. We decided that Charlie and I would choose which microphones to order and would start on our beamforming algorithm.

Using the advice we got from Dr. Sullivan, I searched for possible microphone array options (including pre-built arrays and parts we could use to build our own). The most important criteria I was looking for were: 1) that we could access the data from each audio channel (one channel per mic) to process them on a computer, and 2) for pre-built arrays, that the mics were spaced far enough apart to distinguish two speakers. Charlie and I met up to discuss mic options and decided on a pre-built circular mic array that can connect to the Jetson we’re using.

In order to start working on the beamforming algorithm, I reviewed the beamforming material from the end of 18-792 and started looking into existing MATLAB code for beamforming. Charlie and I met up to discuss the beamforming algorithm; there, we decided to go with a circular mic array rather than a linear one, so in the coming week we will be figuring out the math for circular-array beamforming (we’ve only learned the linear case so far). By the end of the week, I aim to have at least an outline of our entire algorithm so that we can start feeding in mic data as soon as we get the mic array. I will also be looking for individual mics we could use to build our own array, as a backup plan in case we have issues using a pre-built one.

We have a list of all parts we want to order now, and we have clear next steps for developing our beamforming algorithm, so I think we’re on track according to our schedule.