Larry’s Status Report for 26 March 2022

This week, I spent most of my time trying to build OpenCV with GStreamer support. I made a lot of small and avoidable mistakes, such as not properly uninstalling other OpenCV versions, but everything worked out in the end.

Once I got OpenCV installed and working with Python 3, I updated the previous code I was using to work with the Jetson TX2’s included CSI camera. I calibrated the camera and also used putText to overlay angle text over the detected people in the camera’s video output. It’s a little hard to see, but the image below shows the red text on the right of the video.
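The overlay itself is just OpenCV’s putText. A minimal sketch of the call, where the wrapper function and pixel offset are my own placeholders:

```python
import cv2

def draw_angle_label(frame, angle_deg, x, y):
    # Red label just above the detected person's pixel location (BGR colour order).
    cv2.putText(frame, f"{angle_deg:+.1f} deg", (int(x), int(y) - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 255), 2)
```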

There were a few oddities that I also spent time working out. Installing OpenCV with GStreamer changed the default VideoCapture backend for the webcam. Once I learned what was going on, the fix was simply to switch the webcam back to the V4L backend. I also struggled to make the CSI camera output feel as responsive as the webcam; a tip online suggested setting the GStreamer appsink to drop buffers, which seemed to help.
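For reference, the two capture paths now look roughly like this. This is a sketch: the nvarguscamerasrc pipeline is the one commonly cited for Jetson CSI cameras, and the exact caps string may differ on the TX2.

```python
import cv2

# Assumed pipeline for the Jetson CSI camera; width/height/framerate are placeholders.
CSI_PIPELINE = (
    "nvarguscamerasrc ! video/x-raw(memory:NVMM), width=1280, height=720, "
    "framerate=30/1 ! nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! "
    "video/x-raw, format=BGR ! appsink drop=true"  # drop=true discards stale frames
)

csi_cam = cv2.VideoCapture(CSI_PIPELINE, cv2.CAP_GSTREAMER)  # explicit GStreamer backend
webcam = cv2.VideoCapture(0, cv2.CAP_V4L2)  # pin the USB webcam to the V4L backend
```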

Once I got the CSI camera working reasonably, I began investigating how to use the IBM Watson Speech-to-Text API. I did not get very far this week beyond making an account and copying some sample code.
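The sample code is roughly this shape. The API key and service URL below are placeholders from the IBM Cloud dashboard, not our actual credentials.

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")

# Transcribe a pre-recorded clip; real-time streaming uses a separate WebSocket API.
with open("sample.wav", "rb") as audio:
    result = stt.recognize(audio=audio, content_type="audio/wav").get_result()

for chunk in result["results"]:
    print(chunk["alternatives"][0]["transcript"])
```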

I believe that I am still on schedule for my personal progress. Our group progress, however, may be behind schedule since we have not yet fully figured out any sort of beamforming. I will try to continue making steady progress on my part of the project, and I will also try to support my team members with ideas for the audio processing portion.

Next week, I hope to have speech-to-text working and somewhat integrated with the camera system. I want to be able to overlay un-separated real-time captions onto people in the video. If I cannot get the real-time captions working, I want to at least overlay captions from pre-recorded audio onto a real-time video.

Charlie’s Status Report for 26 March 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

After discussing with my team on Monday, we decided that I would work on a speaker identification problem. In particular, I was trying to solve the problem of identifying whether speaker A is speaking, speaker B is speaking, or both speakers A and B are speaking at the same time.

I originally imagined this as a localization problem, where we try to identify where the speakers are. I tried to use the MATLAB RootMUSICEstimator function, but realised that it requires prior knowledge of the number of speakers in the scene, which defeats the purpose.

Thereafter, I formulated this as a path-delay/signal-processing method. Essentially, if only one speaker is speaking, I would expect a time delay between the mic closest to the speaker and the mic furthest away. The time delay can be computed by cross-correlation, and the sign of the delay indicates whether the speaker is on the left or the right. If both speakers are speaking at the same time, the cross-correlation peak is weaker. Therefore, by thresholding the cross-correlation peak, we can pseudo-identify which speaker is speaking when.
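A Python sketch of the idea (our actual implementation is in MATLAB, and the normalisation here is my own choice):

```python
import numpy as np

def delay_and_strength(ch_a, ch_b, fs):
    # Cross-correlate two mic channels on opposite sides of the array.
    xcorr = np.correlate(ch_a, ch_b, mode="full")
    lags = np.arange(-len(ch_b) + 1, len(ch_a))
    peak = int(np.argmax(np.abs(xcorr)))
    delay = lags[peak] / fs  # seconds; the sign says left vs. right
    # Normalised peak height: a weak peak suggests both speakers talking at once.
    strength = abs(xcorr[peak]) / (np.linalg.norm(ch_a) * np.linalg.norm(ch_b))
    return delay, strength
```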

Next, I wanted to follow up on the deep learning approach to see whether a conventional signal-processing front end could improve the STT predictions. All beamforming methods used are from MATLAB packages. I summarise my work as follows.

  • GSC beamformer, then feed into DL speech separation -> poor STT
  • GSC beamformer + NC, then feed into DL speech separation -> poor STT
  • DL speech separation across all channels, then sum -> better STT
  • DL speech separation followed by audio alignment -> better STT

(GSC: Generalised Sidelobe Canceller, NC: Noise Canceller, DL: Deep Learning, STT: Speech-to-Text)

It seems that pre-processing the input with conventional signal-processing methods generally makes the speech predictions worse. However, applying DL techniques to the individual channels and then aligning the outputs could improve performance. I will work with Stella in the coming week to see if I can integrate that with her manual beamforming approach, as I am doubtful of the MATLAB packages’ implementations (they are designed primarily for radar antenna arrays).

Finally, I wrapped up my week’s work by reformatting our STT DL methods into simple Python callable functions (making the DL implementations as abstract to my teammates as possible). Given the widely available STT packages, I provided three STT modules: Google STT, SpeechBrain, and IBM Watson. Based on my initial tests, Google and IBM seem to work best, but they run on a cloud server, which might hinder real-time applications.
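As an example, the Google module is roughly this shape; this is a sketch assuming the SpeechRecognition package, and the real function names in our repo may differ:

```python
import speech_recognition as sr

def transcribe_google(wav_path: str) -> str:
    # Load the whole file and send it to Google's cloud endpoint.
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)
```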


Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are behind schedule, as we have not gotten our beamforming to work. Nevertheless, we did manage to get speech separation to work using deep learning, and I have personally been exploring the integration of DL with beamforming.

What deliverables do you hope to complete in the next week?

In the next week, I want to get our beamforming algorithm working and test its effectiveness. I will do so by meeting with Stella more frequently to debug the script together. I also want to encourage my team to start integrating our camera module with the speech separation module (our MVP), since we now have a workable DL approach.

Larry’s Status Report for 19 March 2022

This week, I worked on writing code for angle estimation by interfacing with the camera, identifying people in the scene, and providing angle estimates for each person. My last status report stated that I hoped to complete angle estimation by the end of this week, and while I am substantially closer, I am still not quite done.

With some help from Charlie, I have been able to use the video from the webcam to identify the pixel locations of each person. With an estimate of the camera calibration matrix, I have also produced angle estimates for the pixel locations. My main issue so far is that the angle estimates are not entirely accurate, primarily due to the strong fisheye effect of the webcam.
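The angle computation itself is just the pinhole model. A minimal sketch, where K is the estimated 3x3 calibration matrix:

```python
import numpy as np

def pixel_to_azimuth_deg(u, K):
    # Horizontal angle off the optical axis for pixel column u, using the
    # focal length fx and principal point cx from the calibration matrix.
    fx, cx = K[0, 0], K[0, 2]
    return np.degrees(np.arctan2(u - cx, fx))
```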

As seen in the image above, the webcam produces a greatly distorted image of a square whiteboard. While the camera calibration matrix can produce a good result for any pixel along the horizontal center of the image, it does not compensate for the distortion at the edges.
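One possible remedy, which is an assumption on my part rather than something our code does yet, is to undistort each pixel using the distortion coefficients that calibration also produces; a lens this strong may need OpenCV’s dedicated cv2.fisheye model instead:

```python
import cv2
import numpy as np

def undistort_pixel(u, v, K, dist_coeffs):
    # Map a raw pixel to its undistorted location; P=K keeps pixel units.
    pts = np.array([[[u, v]]], dtype=np.float32)
    out = cv2.undistortPoints(pts, K, dist_coeffs, P=K)
    return out[0, 0]  # corrected (u, v)
```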

Another thing I noticed was that while our webcam claims a 140-degree FOV, I measured the horizontal FOV to be at best 60 degrees. The fisheye effect gives the impression of a wide-angle camera, but in reality the FOV does not meet our design requirements. I have decided to try the camera included on the TX2, which I had initially deemed to have too narrow a field of view for our project.

The above image shows that the included TX2 camera (top left) has a horizontal FOV slightly better than the webcam’s (bottom right). I am currently working on integrating the included camera with my existing code. The issue I struggled with at the end of this week was installing OpenCV with GStreamer support to use the CSI camera, which took many hours.

I believe that we are still generally on schedule, though further behind than we were last week. To ensure that we stay on schedule, I will try to focus on integrating more of the components together to allow for faster and more applicable testing. My main concern so far is how we will actually handle the speech separation, so finishing up all the aspects around speech separation should allow us to focus on it.

By next week, I hope to have the camera and angle estimation code completely finished. I also want to be able to overlay text onto people in a scene, and have some work done toward generating captions from audio input.

Stella’s Status Report for 19 March 2022

This week I worked on getting the circular beamforming code working and on collecting data to test it. For the initial approach, I’m coding our beamforming algorithm using the delay-and-sum method. From my reading, I’ve found several other beamforming methods that are more complicated to implement but would likely work better; I’m using delay-and-sum for now because it’s the simplest to implement. I decided to calculate the delay-and-sum coefficients in the frequency domain rather than in the time domain, after learning from several papers that calculating coefficients in the time domain can lead to quantization errors in the coefficients.

Right now, the code takes in 7 channels of audio (recorded using our mic array) along with the angle of the speaker we want to beamform toward. While the code runs, it currently doesn’t seem to do much to the audio from input to output. The output audio I’ve listened to has all sounded just like its respective input, but with some of the highest-frequency noise removed, which may be a result of how the numbers are stored in MATLAB rather than anything my algorithm is doing.
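For concreteness, here is a Python sketch of the frequency-domain idea (my actual code is in MATLAB, and the names here are placeholders):

```python
import numpy as np

def delay_and_sum_freq(x, fs, delays):
    # x: (n_mics, n_samples) recording; delays: per-mic steering delays in seconds.
    n = x.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    X = np.fft.rfft(x, axis=1)
    # A phase ramp applies a (possibly fractional) time delay per channel,
    # avoiding the quantization error of integer-sample shifts.
    steering = np.exp(-2j * np.pi * freqs[None, :] * np.asarray(delays)[:, None])
    return np.fft.irfft(np.mean(X * steering, axis=0), n=n)
```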

In the coming week, my main goal will be to determine whether the lack of change from input to output audio comes from a bug in my code or from our hardware being inadequate to do any noticeable beamforming. I suspect there is at least one bug in the code since the output should not sound exactly like the input. If our 7-mic array is insufficient to do good beamforming once I fix the delay-and-sum algorithm, I will try to implement some of the other beamforming algorithms I found. To test my code, I will first check for bugs by visual inspection, then I’ll try simulating my own input data by making 7 channels of a sine wave and time-delaying each of them to reflect a physical setup using Charlie’s code for calculating time delays.
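The simulated input could look something like this (a sketch; the sample rate and tone frequency are arbitrary choices):

```python
import numpy as np

def simulated_channels(delays, fs=16000, f0=440.0, duration=1.0):
    # One sine wave copied to 7 channels, each shifted by its mic's delay,
    # mimicking a far-field source with known geometry.
    t = np.arange(int(fs * duration)) / fs
    return np.stack([np.sin(2 * np.pi * f0 * (t - d)) for d in delays])
```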

This week I also met up with Charlie to collect some more audio data from our mic array, and I started brainstorming ways we could make our test setup more precise for later testing, since we’re currently judging speaker distance and angle from the mic array by eye.

I initially expected to have completed the beamforming algorithm earlier in the semester, but I am on schedule according to our current plan.

Team Status Report for 19 March 2022

What are the most significant risks that could jeopardize the success of the project? How are these risks being managed? What contingency plans are ready?

The most significant risk is the beamforming part of our system. To manage this risk, Charlie and Stella will be working together to get beamforming working. After discussing on Friday, Stella will look through her delay-and-sum beamforming code on Saturday, and try to separate a simulated wave on Sunday. If no bug is observed, Charlie will then come in and work with Stella to figure out the problems.

The contingency plan is a deep learning speech separation approach. Charlie has had significant success using deep learning to separate speech, but the model he used cannot run in real time due to processing latency. Nevertheless, it satisfies our MVP.

Were any changes made to the existing design of the system (requirements, block diagram, system spec, etc)? Why was this change necessary, what costs does the change incur, and how will these costs be mitigated going forward?

There is no change to our system design unless we resort to a deep learning approach to separate mixed speech. We will have that discussion in two weeks.

Provide an updated schedule if changes have occurred.

No change to schedule.

This is also the place to put some photos of your progress or to brag about component you got working.

The separated speech from Charlie’s deep learning work is worth bragging about.

Mixed speech

Stella speech

Larry speech

Charlie’s Status Report for 19 March 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I experimented with deep learning approaches to separating mixed (or overlapping) speech. The intention of this divergent branch from our beamforming approach is twofold. First, we want a backup strategy to separate overlapping speech in case beamforming does not work as expected. Second, if beamforming does suppress other voices, we could use a deep learning approach to further improve performance. We strongly believe the second option ties best to our project.

I managed to demonstrate that deep learning models can separate speech with a reasonable level of quality.

The following is overlapping speech between Stella and Larry.

The following is the separated speech of Stella.

The following is the separated speech of Larry.
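As a pointer for reproducing this kind of result, SpeechBrain’s pretrained SepFormer is one publicly available two-speaker separator; the sketch below shows its usage as an illustrative stand-in, not necessarily the exact model used here:

```python
import torchaudio
from speechbrain.pretrained import SepformerSeparation

model = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-wsj02mix",
    savedir="pretrained_models/sepformer-wsj02mix")

est_sources = model.separate_file(path="mixed.wav")  # (batch, time, n_sources)
torchaudio.save("speaker1.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("speaker2.wav", est_sources[:, :, 1].detach().cpu(), 8000)
```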

One interesting point we discovered was that filtering out noise before feeding the signals into the deep learning model harms performance. We believe this is because noise filtering removes critical frequencies, and the deep learning model already has denoising ability built in.

Second, we noticed that the STT model was not able to interpret the separated speech. This could be caused either by poor enunciation from our speakers or by the separated audio not being clean enough.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule.

What deliverables do you hope to complete in the next week?

In the next week, I want to experiment with using the deep-learning-separated speech to cancel out the original speakers, to test whether the noise-cancelled output leads to better STT predictions.

Stella’s Status Report for 26 February 2022

This week I taught myself about the delay-and-sum algorithm for beamforming with a linear microphone array so as to understand how delay-and-sum beamforming can be implemented for a circular microphone array (specifically, how we can calculate the correct delays for each mic in a circular array). I have written some preliminary MATLAB code to implement this algorithm, though I haven’t yet run data through it.
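For a far-field source, each mic’s delay follows from simple geometry. A Python sketch (my implementation is in MATLAB; the radius is a placeholder rather than the UMA-8 spec, and the sign convention depends on how the delays are applied):

```python
import numpy as np

def circular_delays(theta, n_mics=6, radius=0.045, c=343.0):
    # Steering delays, relative to the array centre, for mics evenly spaced
    # on a circle; a centre mic gets delay 0. theta is the source azimuth.
    phi = 2 * np.pi * np.arange(n_mics) / n_mics  # mic azimuths (radians)
    return -(radius / c) * np.cos(theta - phi)    # seconds
```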

This week I also finished writing a testing plan to guide our audio data collection. This plan covers all of the parameters we’ve discussed varying during testing, as well as setup plans for data that we can collect to assess our system’s performance across these various parameters. I wrote up a set of steps for data collection that we can use for all of our data collection so as to have repeatable tests. Additionally, I started working on the Design Proposal document this week.

Currently, I think we are slightly behind schedule, as speech separation has been a more challenging problem than we were expecting. I had hoped to get delay-and-sum beamforming figured out by the end of this week, but I still have some work left to do to figure out how to work with the circular mic array. I hope to have notes and equations to share with my team tomorrow so that Charlie and I can start implementing the speech separation pipeline this coming week. I also didn’t achieve my goal from last week of completing a draft of the Design Proposal, but this is because we decided at the start of the week that the beamforming algorithm was a more urgent priority.

Early this week (goal: Sunday) I aim to finish figuring out the delay-and-sum beamforming equations and send them to Charlie. Once I send those equations, Charlie and I will be able to implement delay-and-sum beamforming with Charlie’s generalised sidelobe canceller algorithm and test how well that pipeline is able to separate speech (not in real time). By mid-week, we will also finish putting together our Design Proposal. I plan to work on a big part of this proposal since I’ve written similar documents before.

Team Status Report for 26 February 2022

The most significant risk that can jeopardize our project right now is whether or not we will be able to separate speech from two sources (i.e. two people speaking) well enough for our speech-to-text model to generate separate and accurate captions for the two sources. There are two elements here that we have to work with to get well-separated speech: 1) our microphone array and 2) our speech separation algorithm.

Currently, our only microphone array is still the UMA-8 circular array. Over the past week, we searched for linear mic arrays that could connect directly to our Jetson TX2 but didn’t find any. We did find two other options: 1) a 4-mic linear array that we can use with a Raspberry Pi, and 2) a USBStreamer Kit that we can connect multiple I2S MEMS mics to, then connect to the Jetson TX2. The challenge with these two new options is that we would need to take in data from multiple separate mics or mic arrays and synchronize the incoming audio data properly. Our current plan remains to try to get speech separation working with our UMA-8 array, but to look for and buy backup parts in case we cannot separate speech well enough with only the UMA-8.

We have made no changes to our design since last week, but rather have been working on implementing aspects of our design. We have, however, developed new ideas for how to set up our mic array since last week, and we now have ideas for solutions that would allow us to use more mics in a linear array, should we need to pivot from our current design.

We have made two changes to our schedule this week. First, we are scheduling in more time for ourselves to implement speech separation, since this part of the project is proving more challenging than we’d initially thought. Second, we are scheduling in time to work on our design proposal (which we neglected to include in our original schedule).

This week we made progress on our image processing pipeline. We implemented webcam calibration (to correct for any distance distortion our wide-angle camera lens causes) and implemented Detectron2 image segmentation to identify different humans in an image. This coming week we will implement more parts of both our image processing and audio processing pipelines.

Charlie’s Status Report for 26 February 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I implemented image segmentation using Detectron2 to identify the location of only humans in the scene.

The image on the left is the test image provided by the Detectron2 dataset. Just to be certain that the image segmentation method works on non-full bodies and difficult images, I tested it on my own image. It appears that Detectron2 works very well. I then wrapped the network in a modular way, so that a function that outputs the location of each speaker can be easily called. The coordinates will then be combined with our angle estimation pipeline (which Larry is currently implementing).
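The wrapper is roughly this shape; this sketch uses Detectron2’s standard COCO instance-segmentation baseline, and the function name and score threshold are placeholders:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor = DefaultPredictor(cfg)

def person_centroids(frame):
    # Keep only COCO class 0 ("person") and return the centre of each box.
    inst = predictor(frame)["instances"].to("cpu")
    people = inst[inst.pred_classes == 0]
    boxes = people.pred_boxes.tensor.numpy()
    return [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in boxes]
```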

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

I would say we are slightly behind, because I was surprised at how difficult it was to separate the speech of two speakers even with the STFT method. I had previously implemented this method in a class and tested that it worked successfully. However, it did not work on our audio recordings. I am currently in the process of debugging why that occurred. In the meantime, Stella is working on delay-and-sum beamforming which is our second attempt at enhancing speech.

What deliverables do you hope to complete in the next week?

In the coming week, I hope to get the equations for delay-and-sum beamforming from Stella. Once I receive the equations, I will be able to implement the generalised sidelobe canceller (GSC) to determine if our speech enhancement method works (in a non-real-time case). In the event that GSC does not work, my group has identified a deep learning approach to separate speech. We do not plan to jump straight to that, as it cannot be applied in real time.

Larry’s Status Report for 26 February 2022

I presented the design review presentation this week, which is what I spent a majority of my working time on again. Overall, I was fairly satisfied with how the presentation went. There weren’t many questions after the presentation and I have not yet received any feedback, so we have not made any adjustments in response to the design review. I hope to receive constructive feedback once we get comments back, however.

Last week, I stated that I hoped to produce an accurate angle for a single person in the webcam view. I did not meet that goal this week, though Charlie and I have most of what we think we need to produce a good result. I calibrated the webcam and Charlie worked on converting Detectron2’s image segmentation into a usable format for our project. Below is a picture of the camera’s calibration matrix.
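For reference, the calibration followed the standard OpenCV chessboard flow. A sketch, with the board size and image paths as placeholders:

```python
import glob
import cv2
import numpy as np

BOARD = (9, 6)  # inner-corner count of the printed chessboard
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2)

obj_pts, img_pts = [], []
for path in glob.glob("calib/*.jpg"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

# K is the 3x3 intrinsic matrix pictured above; dist holds distortion coefficients.
_, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, gray.shape[::-1], None, None)
print(K)
```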

Our project is still on schedule. Looking at our Gantt chart, I was scheduled to complete angle estimation by next week. I should be able to produce a reasonable angle estimation soon given our current progress.

Coming up soon is the design report, which will likely also take a good amount of time to put together. I notice now that it isn’t in the current version of the Gantt chart, which is a bit of an oversight. I believe our schedule has enough slack for it, however. The main deliverables I hope to complete by next week are the angle estimation and the design report.