Stella’s Status Report for 26 March 2022

This week I focused on debugging my circular beamforming code and on collecting data to test the code. I also spent time on the ethics assignment and discussion prep. Over the past week, I found and fixed several bugs in my code. The audio still does not seem to be beamforming towards the intended speaker, though; instead it seems to be cutting out some noise. In the figures below, Figure 1 shows the output signal and Figure 2 shows one channel of the input signal. As can be seen in the figures, the output looks very similar to the input, but has lower amplitude for the parts of the recording with background noise and no speaking.

It is possible that there are more issues with my code. It is also possible that our recording setup wasn't precise enough, so that the look direction (i.e., the angle of the speaker of interest) that I input to my code isn't accurate enough for the beamforming algorithm to amplify one speaker over the other. This week, for testing, I brought in a tape measure so that we could better estimate distances and angles in our setup, but doing so with just the tape measure was difficult. For the next test, I'm planning to also bring in a protractor and some masking tape so that I can rule out setup inaccuracies (as much as possible) when testing the beamforming code.

We are behind schedule since we do not have beamforming working yet. In the coming week, I will work with Charlie to write test code for my beamforming algorithm so that we can get it working within the week.

Team Status Report for 26 March 2022

The most significant risk that could jeopardize our project remains whether or not we are able to separate speech well enough to generate usable captions.

To manage this risk, we are simultaneously working on multiple approaches for speech separation. The deep learning approach is good enough for our MVP, though it may require a calibration step to match voices in the audio to people in the video. We want to avoid having a calibration step and are therefore continuing to develop a beamforming solution. Combining multiple approaches may end up being the best solution for our project. Our contingency plan, however, is to use deep learning with a calibration step. This solution is likely to work, but is also the least novel.

We have not made any significant changes to the existing design of the system. One thing we are considering now is how to determine the number of people that are currently speaking in the video. Information about the number of people currently speaking would help us avoid generating extraneous and incorrect captions. With the current prevalence of masks, we have to rely on an audio solution. Charlie is developing a clever solution that scales to 2 people, which is all we need given our project scope. We will likely integrate his solution with the audio processing code.

We have not made any further adjustments to our schedule.

One component that is now mostly working is angle estimation using the Jetson TX2 CSI camera, which we are using because of its slightly wider FOV and lower distortion compared to the webcam.

In the picture above, the estimated angle is overlaid at the center position of each detected person.

Larry’s Status Report for 26 March 2022

This week, I spent most of my time trying to build OpenCV with GStreamer correctly. I made a lot of small and avoidable mistakes, such as not properly uninstalling other versions, but everything ended up working out.

Once I got OpenCV installed and working with Python 3, I updated the previous code I was using to work with the Jetson TX2’s included CSI camera. I calibrated the camera and also used putText to overlay angle text over the detected people in the camera’s video output. It’s a little hard to see, but the image below shows the red text on the right of the video.
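
The overlay call is roughly the following; the coordinates, colour, and font settings here are illustrative rather than the exact values in my code:

```python
import cv2

def draw_angle_label(frame, person_center, angle_deg):
    """Overlay the estimated angle next to a detected person's center pixel."""
    x, y = person_center
    label = f"{angle_deg:.1f} deg"
    cv2.putText(
        frame,                     # image to draw on
        label,                     # text to render
        (x + 10, y),               # anchor point just to the right of the person
        cv2.FONT_HERSHEY_SIMPLEX,  # font face
        0.8,                       # font scale
        (0, 0, 255),               # red in BGR
        2,                         # line thickness
    )
    return frame
```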

There were a few oddities that I also spent time working out. Installing OpenCV with GStreamer changed the default VideoCapture backend for the webcam. Once I figured out what was going on, the fix was simply to switch the backend to V4L when using the webcam. I also struggled to make the CSI camera output feel as responsive as the webcam. A tip online suggested setting the GStreamer appsink to drop buffers, which seemed to help.
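
For reference, the two capture paths now look roughly like this; the nvarguscamerasrc pipeline parameters below are typical values for the TX2 and may not match ours exactly:

```python
import cv2

# Webcam: force the V4L2 backend so the GStreamer build doesn't pick a different default.
webcam = cv2.VideoCapture(0, cv2.CAP_V4L2)

# CSI camera: GStreamer pipeline with drop=true on the appsink so stale frames
# are discarded and the output stays responsive.
csi_pipeline = (
    "nvarguscamerasrc ! "
    "video/x-raw(memory:NVMM), width=1280, height=720, framerate=30/1 ! "
    "nvvidconv ! video/x-raw, format=BGRx ! "
    "videoconvert ! video/x-raw, format=BGR ! "
    "appsink drop=true"
)
csi_cam = cv2.VideoCapture(csi_pipeline, cv2.CAP_GSTREAMER)
```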

Once I got the CSI camera working reasonably, I began investigating how to use the IBM Watson Speech-to-Text API. I did not get very far this week beyond making an account and copying some sample code.
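
The minimal call that the Watson sample code demonstrates is roughly the following sketch; the API key and service URL are placeholders:

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials; the real key and URL come from the IBM Cloud dashboard.
authenticator = IAMAuthenticator("YOUR_API_KEY")
stt = SpeechToTextV1(authenticator=authenticator)
stt.set_service_url("YOUR_SERVICE_URL")

# Transcribe a pre-recorded wav clip.
with open("clip.wav", "rb") as audio_file:
    response = stt.recognize(audio=audio_file, content_type="audio/wav").get_result()

for result in response.get("results", []):
    print(result["alternatives"][0]["transcript"])
```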

I believe that I am still on schedule for my personal progress. Our group progress, however, may be behind schedule since we have not yet fully figured out any sort of beamforming. I will try to continue making steady progress on my part of the project, and I will also try to support my team members with ideas for the audio processing portion.

Next week, I hope to have speech-to-text working and somewhat integrated with the camera system. I want to be able to overlay un-separated real-time captions onto people in the video. If I cannot get the real-time captions working, I want to at least overlay captions from pre-recorded audio onto a real-time video.

Charlie’s Status Report for 26 March 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

After discussing with my team on Monday, we decided that I would work on a speaker identification problem. In particular, I was trying to solve the problem of identifying whether speaker A is speaking, speaker B is speaking, or both speaker A and B are speaking at the same time.

I originally imagined this as a localization problem, where we try to identify where the speakers are. I tried to use the MATLAB RootMUSICEstimator function, but realised that it requires prior knowledge of the number of speakers in the scene, which defeats the purpose.

Thereafter, I formulated this as a path-tracing/signal-processing method. Essentially, if only one speaker is speaking, I would expect a time delay between the mic closer to the speaker and the mic furthest away. The time delay can be computed by cross-correlation, and the sign of the delay indicates whether the speaker is on the left or the right. If both speakers are speaking at the same time, the cross-correlation peak is weaker. Therefore, by thresholding the cross-correlation, we can pseudo-identify which speaker is speaking when.
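
A rough sketch of this idea is below; the mic pairing and the correlation threshold are placeholders that would need tuning, not values from our code:

```python
import numpy as np
from scipy.signal import correlate

def classify_speakers(mic_left, mic_right, threshold=0.5):
    """Pseudo-identify who is speaking from a pair of mic channels.

    Returns "left", "right", or "both" based on the strength and sign of the
    normalised cross-correlation peak between the two channels.
    """
    # Normalise so the peak behaves like a correlation coefficient (roughly 0 to 1).
    a = (mic_left - mic_left.mean()) / (np.std(mic_left) * len(mic_left))
    b = (mic_right - mic_right.mean()) / np.std(mic_right)

    xcorr = correlate(a, b, mode="full")
    lags = np.arange(-len(b) + 1, len(a))

    peak = np.max(np.abs(xcorr))
    delay = lags[np.argmax(np.abs(xcorr))]

    if peak < threshold:
        return "both"  # weak correlation peak suggests overlapping speech
    # Sign convention depends on mic placement; a positive lag here means the
    # left channel arrives later, i.e. the speaker is nearer the right mic.
    return "right" if delay > 0 else "left"
```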

Next, I wanted to follow up on the deep learning approach to see whether conventional signal processing approaches can improve the STT predictions. All beamforming methods used are from MATLAB packages. I summarise my work as follows.

  • gsc beamformer, then feed into dl speech separation –> poor STT
  • gsc beamformer + nc, then feed into dl speech separation –> poor STT
  • dl speech separation across all channels, then add up –> better STT
  • dl speech separation followed by audio alignment –> better STT

(GSC: Generalised Sidelobe Canceller, DL: Deep Learning, STT: Speech to Text)

It seems that pre-processing the input with conventional signal processing methods generally leads to weaker speech predictions. However, applying DL techniques on the individual channels and then aligning them could improve performance. I will work with Stella in the coming week to see if there is any way I can integrate that with her manual beamforming approach, as I am doubtful of the MATLAB packages' implementations (they are designed primarily for radar antennas).
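
To make the "separate each channel, then align, then sum" idea concrete, here is a hypothetical Python sketch; dl_separate is a stand-in name for whichever deep learning separation model we call, not an actual function in our codebase:

```python
import numpy as np
from scipy.signal import correlate

def align_and_sum(channels, reference=0):
    """Align each separated channel to a reference channel via cross-correlation,
    then sum so the target speaker adds up coherently."""
    ref = channels[reference]
    aligned = []
    for ch in channels:
        xcorr = correlate(ch, ref, mode="full")
        lag = int(np.argmax(np.abs(xcorr))) - (len(ref) - 1)
        # np.roll wraps around, which is acceptable for a rough sketch on
        # clips that are already nearly aligned.
        aligned.append(np.roll(ch, -lag))
    return np.sum(aligned, axis=0)

# Hypothetical usage: dl_separate stands in for the deep learning separation model.
# separated = [dl_separate(mic)[target_speaker] for mic in mic_channels]
# combined = align_and_sum(separated)
```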

Finally, I wrapped up my week's work by reformatting our STT DL methods into simple Python callable functions (making the DL implementations as abstract to my teammates as possible). Because there are several widely available STT packages, I provided 3 STT modules: Google STT, SpeechBrain, and IBM Watson. Based on my initial tests, Google and IBM seem to work best, but they run on a cloud server, which might hinder real-time use.
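
As an example of what such a wrapper looks like, here is a sketch of a local SpeechBrain transcription function; the pretrained model identifier below is one of SpeechBrain's published LibriSpeech models and is an assumption, not necessarily the one we use (the Google and IBM wrappers follow the same callable pattern):

```python
from speechbrain.pretrained import EncoderDecoderASR

# Load a pretrained LibriSpeech ASR model once at import time.
_asr = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech")

def transcribe_speechbrain(wav_path):
    """Transcribe a wav file locally with SpeechBrain and return the text."""
    return _asr.transcribe_file(wav_path)
```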


Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are behind schedule, as we have not gotten our beamforming to work. Nevertheless, we did manage to get speech separation to work using deep learning, and I have personally been exploring the integration of DL with beamforming.

What deliverables do you hope to complete in the next week?

In the next week, I want to get our beamforming algorithm to work and test its effectiveness. I will do so by meeting with Stella more frequently and debugging the script together. I also want to encourage my team to start integrating our camera module with the speech separation module (our MVP), since we do have a workable DL approach right now.

Larry’s Status Report for 19 March 2022

This week, I worked on writing code for angle estimation by interfacing with the camera, identifying people in the scene, and providing angle estimates for each person. My last status report stated that I hoped to complete angle estimation by the end of this week, and while I am substantially closer, I am still not quite done.

With some help from Charlie, I have been able to use the video from the webcam to identify the pixel locations of each person. With an estimate of the camera calibration matrix, I have also produced angle estimates for the pixel locations. My main issue so far is that the angle estimates are not entirely accurate, primarily due to the strong fisheye effect of the webcam.
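
The pixel-to-angle conversion itself is essentially the following pinhole-model sketch; it ignores lens distortion, which is exactly the limitation described below:

```python
import numpy as np

def pixel_to_horizontal_angle(u, camera_matrix):
    """Convert a horizontal pixel coordinate to an angle from the optical axis.

    camera_matrix is the 3x3 intrinsic matrix K, with fx = K[0, 0] and cx = K[0, 2].
    Ignores lens distortion, so it is only accurate near the image center.
    """
    fx = camera_matrix[0, 0]
    cx = camera_matrix[0, 2]
    return np.degrees(np.arctan2(u - cx, fx))
```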

As seen in the image above, the webcam produces a greatly distorted image of a square whiteboard. While the camera calibration matrix can produce a good result for any pixel along the horizontal center of the image, it does not compensate for the distortion at the edges.

Another thing I noticed was that while our webcam claims a 140 degree FOV, I measured its horizontal FOV to be at best 60 degrees. The fisheye effect gives the impression of a wide-angle camera, but in reality the FOV does not meet our design requirements. I have decided to try the camera included with the TX2, which I had initially deemed to have too narrow a field of view for our project.

The above image shows that the included TX2 camera (top left) has a horizontal FOV that is slightly better than the webcam (bottom right). What I am currently working on is trying to integrate the included camera with my existing code. The issue I struggled with at the end of this week was installing OpenCV with GStreamer support to use the CSI camera, which took many hours.

I believe that we are still generally on schedule, though further behind than we were last week. To ensure that we stay on schedule, I will try to focus on integrating more of the components together to allow for faster and more applicable testing. My main concern so far is how we will actually handle the speech separation, so finishing up all the aspects around speech separation should allow us to focus on it.

By next week, I hope to have the camera and angle estimation code completely finished. I also want to be able to overlay text onto people in a scene, and have some work done toward generating captions from audio input.

 

Stella’s Status Report for 19 March 2022

This week I worked on getting the circular beamforming code working and on collecting data to test it. For the initial approach, I'm coding our beamforming algorithm using the delay-and-sum method. From my reading, I've found several other beamforming methods that are more complicated to implement but that will likely work better than delay-and-sum; I'm using delay-and-sum for now because it's the simplest method to implement. I decided to calculate the delay-and-sum coefficients in the frequency domain rather than in the time domain, after learning from several papers that calculating coefficients in the time domain can lead to quantization errors in the coefficients.

Right now, the code takes in 7 channels of audio (recorded using our mic array) along with the angle of the speaker we want to beamform towards. While the code runs, it currently doesn't seem to do much to the audio from input to output. The output recordings I've listened to have all sounded just like their respective inputs, but with some of the highest-frequency noise removed, which may be a result of how the numbers are stored in MATLAB rather than anything my algorithm is doing.
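
For reference, the core frequency-domain delay-and-sum computation is roughly the following NumPy sketch; my actual implementation is in MATLAB, and the far-field assumption and geometry arguments here are simplifications rather than our exact code:

```python
import numpy as np

def delay_and_sum_freq(channels, fs, mic_angles_deg, mic_radius, look_angle_deg, c=343.0):
    """Frequency-domain delay-and-sum beamformer for a circular mic array.

    channels: (num_mics, num_samples) array of time-domain signals.
    mic_angles_deg: angular position of each mic on the circle.
    look_angle_deg: direction of the speaker we want to amplify.
    """
    num_mics, num_samples = channels.shape
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)

    # Relative arrival delay of a far-field wavefront at each mic; mics facing
    # the look direction receive it earlier, so their delay is negative.
    look = np.deg2rad(look_angle_deg)
    mic_angles = np.deg2rad(np.asarray(mic_angles_deg))
    delays = -mic_radius * np.cos(mic_angles - look) / c   # seconds

    spectra = np.fft.rfft(channels, axis=1)
    # Phase ramps that undo each mic's relative delay so the look direction
    # adds in phase across channels.
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    output_spectrum = np.mean(spectra * steering, axis=0)
    return np.fft.irfft(output_spectrum, n=num_samples)
```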

In the coming week, my main goal will be to determine whether the lack of change from input to output audio comes from a bug in my code or from our hardware being inadequate for noticeable beamforming. I suspect there is at least one bug in the code, since the output should not sound exactly like the input. If our 7-mic array turns out to be insufficient for good beamforming once I fix the delay-and-sum algorithm, I will try implementing some of the other beamforming algorithms I found. To test my code, I will first check for bugs by visual inspection; then I'll try simulating my own input data by making 7 channels of a sine wave and time-delaying each of them to reflect a physical setup, using Charlie's code for calculating time delays.
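
As a sketch of the simulated test input I have in mind, the following generates 7 delayed copies of a sine wave; the per-mic delays below are placeholders, since in practice they would come from Charlie's time-delay code:

```python
import numpy as np

def simulate_array_input(freq_hz=440.0, fs=16000, duration_s=1.0, delays_s=None):
    """Generate 7 channels of a sine wave, each time-delayed to mimic a
    far-field source arriving at our circular mic array."""
    if delays_s is None:
        # Placeholder per-mic delays; real values come from the array geometry
        # and the simulated source angle.
        delays_s = np.linspace(0.0, 6.0, 7) * 1e-4
    t = np.arange(int(fs * duration_s)) / fs
    # Delaying a sinusoid is just a shift of its argument by the delay.
    return np.stack([np.sin(2 * np.pi * freq_hz * (t - d)) for d in delays_s])
```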

This week I also met up with Charlie to collect some more audio data from our mic array, and I started brainstorming ways we could make our test setup more precise for later testing, since we’re currently judging speaker distance and angle from the mic array by eye.

I initially expected to have completed the beamforming algorithm earlier in the semester, but I am on schedule according to our current plan.

Team Status Report for 19 March 2022

What are the most significant risks that could jeopardize the success of the project? How are these risks being managed? What contingency plans are ready?

The most significant risk is the beamforming part of our system. To manage this risk, Charlie and Stella will work together to get beamforming working. After our discussion on Friday, Stella will look through her delay-and-sum beamforming code on Saturday and try to separate a simulated wave on Sunday. If no bug is found, Charlie will then come in and work with Stella to figure out the problem.

The contingency plan is a deep learning speech separation approach. Charlie has had significant success using deep learning to separate speech, but the model he used cannot run in real time due to processing latency. Nevertheless, it satisfies our MVP requirements.

Were any changes made to the existing design of the system (requirements, block diagram, system spec, etc)? Why was this change necessary, what costs does the change incur, and how will these costs be mitigated going forward?

There is no change to our system design unless we resort to a deep learning approach to separate mixed speech. That is a discussion we will have in two weeks.

Provide an updated schedule if changes have occurred.

No change to schedule.

This is also the place to put some photos of your progress or to brag about a component you got working.

The separated speech from deep learning by Charlie is worth bragging about.

Mixed speech

Stella speech

Larry speech

Charlie’s Status Report for 19 March 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I experimented with using deep learning approaches to separate mixed (or overlapping) speech. The intention of this branch, which diverges from our beamforming approach, is twofold. First, we want a backup strategy for separating overlapping speech in case beamforming does not work as expected. Second, if beamforming does suppress other voices, we could use a deep learning approach to further improve the performance. We strongly believe the second option fits our project best.

I managed to demonstrate that deep learning models are able to separate speech to some degree.

The following is overlapping speech between Stella and Larry.

The following is the separated speech of Stella.

The following is the separated speech of Larry.

One interesting point we discovered was that filtering out noise prior to feeding the signals into the deep learning model harms performance. We believe this arises because noise filtering removes critical frequencies, and because the deep learning model has an inherent denoising ability built into it.

Second, we noticed that the STT model was not able to interpret the separated speech. This could be caused either by poor enunciation from our speakers or by the separated audio fed to the STT model not being clean enough.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule.

What deliverables do you hope to complete in the next week?

In the next week, I want to experiment with using the deep-learning-separated speech to cancel out the original speakers, and test whether the resulting noise-cancelled speech leads to better STT predictions.