Charlie’s Status Report for 16 April 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, our team met with Professor Sullivan and highlighted the three main priorities that we want to achieve by our final demo.

  1. Collecting data from different environments. This is because our deep learning system may work better in some environments than others (All)
  2. Fully integrating our audio and visual interface, as we are currently facing some Python dependency issues (Larry)
  3. Setting up a user interface for our full system, such as a website (Me)

I decided to handle the last priority as I have prior experience building a full website. The following shows the progress I have made so far.

In my experience, I prefer writing websites from a systems perspective. I make sure that every “active” component of the website works before I proceed with improving the design. In the above structure, I have made sure that my website is able to receive inputs (duration of recording + file upload) and produce output (downloadable captioned video).
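As a rough illustration of these "active" components, here is a minimal Flask sketch; the route name, file names, and the run_pipeline stub are hypothetical placeholders rather than our actual code:

```python
# Minimal sketch of the website's "active" path: accept a recording duration
# and/or an uploaded file, run the captioning pipeline, and return the result.
from flask import Flask, request, send_file

app = Flask(__name__)

def run_pipeline(input_path, duration):
    """Placeholder for the record -> separate -> caption -> overlay steps."""
    return "captioned.mp4"

@app.route("/process", methods=["POST"])
def process():
    duration = request.form.get("duration", type=int)  # recording duration in seconds
    upload = request.files.get("file")                 # optional uploaded recording
    if upload is not None:
        upload.save("input.mp4")
    output_path = run_pipeline("input.mp4", duration)
    return send_file(output_path, as_attachment=True)  # downloadable captioned video

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```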

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are on schedule, as we are currently in the integration phase of our project.

What deliverables do you hope to complete in the next week?

In the coming weekend and week, I will be improving the visual design of the website.

Larry and I also ran into some problems with the GPU on the Jetson. When we ran speech separation on the Jetson, it took about 3 minutes, which is rather strange given that the same script took about 0.8 s on weaker GPUs on Google Colab. I suspect that the GPU on the Jetson is broken, but Larry disagrees. This week we will try to debug it and see whether the GPU actually works.
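As a first debugging step, a minimal sanity check along these lines (assuming PyTorch is installed on the Jetson; the matrix size is arbitrary) should tell us whether CUDA is visible and whether the GPU is actually faster than the CPU:

```python
# Check that CUDA is available and compare a simple GPU workload against the CPU.
import time
import torch

print("CUDA available:", torch.cuda.is_available())

def time_matmul(device, n=2048, reps=10):
    x = torch.randn(n, n, device=device)
    torch.matmul(x, x)                      # warm-up (avoids timing CUDA init)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        torch.matmul(x, x)
    if device == "cuda":
        torch.cuda.synchronize()            # wait for the GPU to finish
    return (time.perf_counter() - start) / reps

print("CPU seconds per matmul :", time_matmul("cpu"))
if torch.cuda.is_available():
    print("CUDA seconds per matmul:", time_matmul("cuda"))
```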

Larry’s Status Report for 10 April 2022

This week, I worked on writing and integrating the last few parts of the deep learning and non-real-time approach to our project. Instead of my previous approach of combining OpenCV and PyAudio into one Python script, I simply used GStreamer to record audio and video at the same time. I also extracted the timestamped captions and overlaid them onto the video without too much trouble. Example below:
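As a rough sketch of the overlay step (separate from the example above; the caption format and file names are assumptions, and the real script still has to re-mux the audio, which OpenCV does not handle):

```python
# Burn timestamped captions onto video frames with OpenCV.
# captions: list of (start_sec, end_sec, text) produced by the speech-to-text step.
import cv2

captions = [(0.0, 2.5, "hello there"), (2.5, 5.0, "how are you")]  # example values

cap = cv2.VideoCapture("recording.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("captioned.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    t = frame_idx / fps                     # timestamp of this frame in seconds
    for start, end, text in captions:
        if start <= t < end:
            cv2.putText(frame, text, (50, h - 50),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
    out.write(frame)
    frame_idx += 1

cap.release()
out.release()
```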

I am currently trying to install all of the relevant libraries into a Python 3.8 virtual environment. Running the deep learning solution requires Python 3.8, and I figured that moving everything to a Python 3.8 virtual environment would make later testing and integration a lot easier. As I mentioned in my last report, some components required Python 3.7 while others were installed on the default Python 3.6.

Using GStreamer directly on the command line instead of through OpenCV means that we do not have to compile OpenCV with GStreamer support, which is convenient for our switch to Python 3.8. I have not finished building PyTorch and Detectron2 for 3.8 yet, but so far I do not anticipate any major issues with this change.

I am currently slightly behind schedule, since I wanted to have some thoughts on how we would build a real-time implementation by now. Given the amount of time left, a real-time implementation may not be feasible. This is something we envisioned from the beginning of the project, so it does not significantly change our plans. In the context of working only with a non-real-time implementation, I am on schedule.

By next week, I hope to have every component working together in a streamlined fashion. We should easily be able to record and produce a captioned video with a command. Instead of focusing on real-time, we may pivot toward working on a nice GUI for our project. I hope to have worked on either one or the other by the end of this week.

Team Status Report for 10 April 2022

What are the most significant risks that could jeopardize the success of the project? How are these risks being managed? What contingency plans are ready?

The most significant current risk is integrating the camera and audio modules together. We are facing quite a few Python compatibility issues on the Jetson. We have a naive solution (contingency), where we write a shell script that executes each module with a different version of Python; however, this is extremely inefficient. Alternatively, we are thinking of resetting the Jetson, or at least resetting the Jetson module.
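A sketch of that naive contingency, assuming each stage is a standalone script (the script names and interpreter paths below are placeholders):

```python
# Naive contingency: drive each module with whichever Python version it needs.
# Inefficient, since every stage pays interpreter startup and file I/O costs.
import subprocess

stages = [
    ("/usr/bin/python3.6", "record_av.py"),        # camera + audio capture
    ("/usr/bin/python3.8", "separate_speech.py"),  # deep learning speech separation
    ("/usr/bin/python3.7", "speech_to_text.py"),   # STT / timestamped captions
    ("/usr/bin/python3.6", "overlay_captions.py"), # caption overlay on video
]

for interpreter, script in stages:
    subprocess.run([interpreter, script], check=True)  # stop if any stage fails
```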

We are also having trouble using a purely digital-signal-processing approach to separating speech. We realised that beamforming was not sufficient to separate speech, and have turned to Phase Difference Channel Weighting (PDCW) and Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF), which Stella is currently testing.

The contingency plan is a deep learning speech separation approach. Charlie has had significant success using deep learning to separate speech, but the model he used cannot run in real time due to processing latency. We demonstrated it during our interim demo this week and showed the results of this deep learning approach. Professor Sullivan and Janet seemed sufficiently convinced that our approach is logical.

Were any changes made to the existing design of the system (requirements, block diagram, system spec, etc)? Why was this change necessary, what costs does the change incur, and how will these costs be mitigated going forward?

Yes. We are probably going to use a deep learning approach to separating speech, given that the results we are getting are spectacular. This change was necessary because a conventional signal processing method, with our microphone array, is not sufficient to separate non-linear combinations of speech. Deep learning approaches are known to be able to learn and adapt to the non-linear manifolds of the training data, which is why our current deep learning approach is so successful at separating overlapping speech.

Provide an updated schedule if changes have occurred.

No change to schedule.

This is also the place to put some photos of your progress or to brag about component you got working.

The biggest thing to brag about this week is the set of numerical results showing how successful our deep learning approach is.

When one speaker is speaking (the isolated case), we get a Word Error Rate (WER) of about 40%. We tested two cases: two speakers speaking at once, and one speaker interrupting the other. Using our speech separation technique, we get a WER nearly comparable to the isolated case.
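For reference, the WER numbers come from a word-level edit-distance comparison between the predicted transcript and the reference transcript; a minimal sketch of that computation (not necessarily our exact scoring script):

```python
# Word Error Rate = (substitutions + deletions + insertions) / number of reference words,
# computed with a standard Levenshtein edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```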

Charlie’s Status Report for 10 April 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I refined my code for the audio component of our project. In particular, Larry wanted me to generate a normalized version of the collected audio, which will be integrated into the video.
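A minimal sketch of that normalization step (peak normalization with numpy and soundfile; the exact method and file names are placeholders):

```python
# Peak-normalize the separated audio so its level is consistent before muxing into the video.
import numpy as np
import soundfile as sf

audio, sr = sf.read("separated.wav")
peak = np.max(np.abs(audio))
if peak > 0:
    audio = audio / peak * 0.9   # scale to 90% of full scale to leave headroom
sf.write("separated_normalized.wav", audio, sr)
```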

The audio component of our project runs on Python 3.8. However, Larry informed me that there were a couple of Python dependency errors. In particular, the Speech-To-Text (STT) requires Python 3.9, and OpenCV requires 3.7. Python should be backward compatible, so I was trying to learn how to reset Python on the Jetson. However, we agreed that we should not reset it before demo day, so that we can at least demonstrate the working components of our project.

For the demo day, I organised my team members and we decided on the list of things we would like to show Professor Sullivan and Janet. Here is the list I came up with.

Charlie
– 3 different versions of speech separation of two recorded speakers, including the predicted speech and WER

Larry
– overlaying text
– simultaneous video and audio recording
– image segmentation

Stella
– Gantt chart

In particular, I wanted to show the WER results of our current speech separation techniques as compared to the baseline. This is an important step toward the final stages of our capstone project. The results are summarised below.

These results show that our speech separation module is comparable to a single speaker speaking (or clear speech), and the largest margin of error arises from the limitations of the STT model. This is a great demonstration of the effectiveness of our current deep learning approach.

I also did a naive test of how long speech separation takes on a GPU. It turns out that for a 20 s audio clip it takes only about 0.8 s, which makes it more realistic for daily use.
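The "naive test" was just wall-clock timing around the separation call, roughly like this (the separate function is a placeholder for our SepFormer wrapper):

```python
# Rough timing of speech separation on the GPU for a ~20 s clip.
import time
import torch

def separate(path):
    """Placeholder for our SepFormer-based separation call."""
    ...

if torch.cuda.is_available():
    torch.cuda.synchronize()      # make sure earlier GPU work has finished
start = time.perf_counter()
separate("two_speakers_20s.wav")
if torch.cuda.is_available():
    torch.cuda.synchronize()      # wait for the GPU before stopping the clock
print(f"separation took {time.perf_counter() - start:.2f} s")
```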

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule, with contingencies. Stella is currently working on the Phase Difference Channel Weighting (PDCW) algorithm generously suggested to us by Professor Stern. We are still hoping to use a more signal-processing-based approach to speech separation, even though we already have an effective deep learning approach to compare against.

What deliverables do you hope to complete in the next week?

In the next week, I will work with Larry to fully integrate our system. I will work with Stella to see if our signal processing approach is still viable. Finally, at Professor Sullivan's suggestion, I want to do a final experiment to get the baseline WER of our microphone array, simply by having a speaker speak into the microphone array and determining the word error rate from the predicted text.

Stella’s Status Report for 2 April 2022

This week, I debugged and tested the delay-and-sum beamforming algorithm with the goal of either getting delay-and-sum beamforming to successfully separate speech in our recordings, or showing that delay-and-sum beamforming with our current mic array would not be sufficient to separate speech and that we should use a different approach.

In order to test the delay-and-sum beamforming algorithm, I first generated a sine wave in MATLAB. I played this sound through an earbud towards the mic at angle 0 on the mic array (mic 1).

I played the sine wave while recording from the mic array. Then, I passed the mic array recordings through my delay-and-sum beamforming code. Though there was a lot of background noise, I found that beamforming towards the source (the earbud) gave an output with a greater SNR (judged by ear, not computationally) than beamforming away from the source. Below are the audio outputs for beamforming towards the source – look direction = 0 – and beamforming directly away from the source – look direction = pi.

look direction = 0:

look direction = pi:
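For reference, the core of delay-and-sum beamforming is just delaying each channel by its expected arrival time for the look direction and summing; a simplified numpy sketch (uniform circular geometry and integer-sample delays, so cruder than the actual implementation):

```python
# Simplified delay-and-sum beamformer for a circular microphone array.
import numpy as np

def delay_and_sum(x, mic_angles, radius, look_dir, fs, c=343.0):
    """x: (n_samples, n_mics) recording; mic_angles: mic positions in radians;
    radius: array radius in metres; look_dir: steering angle in radians."""
    n_samples, n_mics = x.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        # Far-field arrival time of mic m relative to the array centre.
        tau = -(radius / c) * np.cos(look_dir - mic_angles[m])
        shift = int(round(tau * fs))
        out += np.roll(x[:, m], -shift)   # align channel m to the look direction
    return out / n_mics
```

A real implementation would use fractional (sub-sample) delays, but the integer-sample version is enough to illustrate the idea.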

On Thursday, Charlie and I met with Dr. Stern to discuss alternative approaches to speech separation, particularly Phase Difference Channel Weighting (PDCW), which Dr. Stern published. I got the beamforming test results (shared above) before this meeting, so we went in with the goal of finding new algorithms to try rather than trying to get delay-and-sum beamforming working.

We are a bit behind schedule since we have decided to abandon delay-and-sum beamforming in favor of a non-beamforming speech separation approach. To catch up this week, we will collect recordings that we can pass into the PDCW algorithm and determine how well this algorithm can separate speech with our mic setup.

This week, Charlie and I will collect recordings that fit the assumptions of the PDCW algorithm and test the algorithm on them. From there, we will see what we can get working with the PDCW algorithm and we will meet up again with Dr. Stern to discuss our results and get some more guidance (as he suggested in our last meeting).

Team Status Report for 2 April 2022

The most significant risk that could jeopardize our project is whether or not we will be able to get our audio processing pipeline working well enough to get usable captions from the speech-to-text model we’re using.

This past week, we decided to abandon the delay-and-sum beamforming in favor of a different algorithm, Phase Difference Channel Weighting (PDCW), which Dr. Stern published. Charlie and Stella met with Dr. Stern this past week to discuss PDCW and possible reasons our previous implementation wasn’t working. On Friday, Charlie and Larry recorded new data which we will use to test the PDCW algorithm (the data had to be recorded in a particular configuration to meet the assumptions of the PDCW algorithm).

PDCW is our current plan for how to use signal processing techniques in our audio pipeline, but as a backup plan, we have a deep learning module – SpeechBrain’s SepFormer – which we can use to separate multiple speakers. We decided with Dr. Sullivan this past week that, if we go with the deep learning approach, we will test our final system’s performance on more than just two speakers.
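For reference, a sketch of how the SepFormer backup can be invoked through SpeechBrain's pretrained interface (the WHAMR checkpoint name follows SpeechBrain's published models; our actual wrapper differs in the details):

```python
# Separate two overlapping speakers with SpeechBrain's pretrained SepFormer (WHAMR, 8 kHz).
import torchaudio
from speechbrain.pretrained import SepformerSeparation as separator

model = separator.from_hparams(source="speechbrain/sepformer-whamr",
                               savedir="pretrained_models/sepformer-whamr")

# est_sources has shape (batch, time, n_speakers).
est_sources = model.separate_file(path="mixture.wav")

torchaudio.save("speaker1.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("speaker2.wav", est_sources[:, :, 1].detach().cpu(), 8000)
```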

The change to our audio processing algorithm is the only significant change we made this week to the design of our system. We have not made any further adjustments to our schedule.

On our video processing side, Larry has been able to generate time-stamped captions, and with our angle estimation working, we are close to being able to put captions on our speakers. With this progress on our video processing pipeline and with the SepFormer module as our interim audio processing pipeline, we’ve been able to start working on integrating the various parts of our system, which we wanted to start as early as possible.

Larry’s Status Report for 2 April 2022

This week, I worked on using IBM Watson’s Speech-To-Text API in Python. The only trouble I had was that using the API required Python 3.7, while the default installed on the system was Python 3.6. Since I built OpenCV for Python 3.6 and do not want to go through the trouble of making everything work on 3.7, I will try to just use Python 3.7 for the Speech-To-Text and Python 3.6 for everything else. These are some of the timestamped captions that I generated:
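For context, requesting word timestamps from the Watson SDK looks roughly like this (the API key, service URL, and audio file name are placeholders, not our actual configuration):

```python
# Request a transcript with per-word timestamps from IBM Watson Speech to Text.
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_API_KEY")                                 # placeholder
stt = SpeechToTextV1(authenticator=authenticator)
stt.set_service_url("https://api.us-east.speech-to-text.watson.cloud.ibm.com")  # placeholder

with open("speaker1.wav", "rb") as audio:
    result = stt.recognize(audio=audio,
                           content_type="audio/wav",
                           timestamps=True).get_result()

for res in result["results"]:
    for word, start, end in res["alternatives"][0]["timestamps"]:
        print(f"{start:6.2f}-{end:6.2f}  {word}")
```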

Since we have the Interim Demo in the upcoming week, I focused on trying to put together a non-real-time demonstration. I have not managed to completely figure out how to record video and audio at the same time, but I was able to do both well enough to produce new recordings for Charlie and Stella to work with. I also worked out a lot of details with Charlie about how data should be passed around. As we are not targeting real-time, we will just be generating and using files. We currently have the means to produce separated audio, timestamped captions, and video with angle estimation, so we believe we can put together a good demo.

I am currently behind schedule, since I expected to have already placed prerecorded captions onto video. I have all the tools available for doing so, however, so I expect to be able to catch up in the next week. At this point, it is just a matter of writing a couple of Python scripts.

One aspect of our project I am worried about is that the Jetson TX2 may not be fast enough for real-time applications. While not an issue for the demo, I noticed a lot of slowdown when doing processing on real-time video. Next week, I will spend some time investigating how to speed up some of the processing. Other than a working demo as a deliverable, I want to have a more concrete understanding of where the bottlenecks are and how they may be addressed.

Charlie’s Status Report for 2 April 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, my team and I discussed the currently missing parts of our project. We realised that, other than beamforming, we have the rest of the components required to integrate our Minimum Viable Product (MVP).

As I am in charge of the deep learning components of our project, I reformatted the deep learning packages for our speech separation module. In particular, for our MVP, we are planning to use SpeechBrain’s SepFormer module, which is trained on the WHAMR! dataset and therefore accounts for environmental noise and reverberation. Using the two microphones on our array, I am able to estimate the position of the speakers based on the difference in time of arrival. This is crucial, as SepFormer separates speech but does not provide any information about where the speakers are located.
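A rough sketch of that time-difference-of-arrival estimate, assuming two channels from the array and a far-field source (the microphone spacing and channel handling are placeholders):

```python
# Estimate a speaker's direction from the delay between two microphone channels.
import numpy as np

def estimate_angle(ch_a, ch_b, fs, mic_distance, c=343.0):
    """Cross-correlate the two channels, take the lag with maximum correlation,
    and convert that lag into an angle relative to broadside."""
    corr = np.correlate(ch_a, ch_b, mode="full")
    lag = np.argmax(corr) - (len(ch_b) - 1)        # lag in samples (sign depends on geometry)
    tdoa = lag / fs                                 # lag in seconds
    # Far-field model: tdoa = (d / c) * sin(theta); clip for numerical safety.
    sin_theta = np.clip(tdoa * c / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```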

On Thursday, Stella and I got on a call with Professor Stern. This is because I discovered that Professor Stern published a paper on speech separation with two microphones that are spaced 4 cm apart. After speaking with Professor Stern, we identified three possible reasons why our current implementation does not work.

  1. A reverberant environment in the small music room (original recordings)
  2. Beamforming with the given array will likely lead to spatial aliasing
  3. Audio is sampled at an unnecessarily high rate (44.1 kHz)

Professor Stern suggested that we could do a new recording at 16 kHz in an environment with a lot less reverberation, such as outdoors or a larger room.

On Friday, Larry and I went to an office at the ECE staff lounge to make the new recording. This new recording will be tested on the Phase Difference Channel Weighting (PDCW) algorithm that Professor Stern published. Of course, this is a branch that allows us to continue with a more signal-processing-oriented approach to our capstone project.

A casual conversation.

A scripted conversation.


Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are slightly behind, as we are planning to abandon beamforming and switch to Professor Stern’s published work. However, it is difficult to consider ourselves behind because we already have contingencies in place (deep learning speech separation). To catch up, we just have to collect new recordings and test them using the PDCW algorithm.

What deliverables do you hope to complete in the next week?

In the next week, Stella and I will test the PDCW algorithm on our newly collected recordings. We will also likely meet with Professor Stern again so he can further advise us on our project, as he is very knowledgeable about binaural speech processing.

Stella’s Status Report for 26 March 2022

This week I focused on debugging my circular beamforming code and on collecting data to test the code. I also spent time on the ethics assignment and discussion prep. Over the past week, I found and fixed several bugs in my code. The audio still does not seem to be beamforming towards the intended speaker, though, and instead seems to be cutting out some noise. In the figures below, Figure 1 shows the output signal and Figure 2 shows one channel of the input signal. As can be seen in the figures, the output looks very similar to the input, but has lower amplitude for the parts of the recording with background noise and no speaking.

It is possible that there are more issues with my code, and it is also possible that our recording setup wasn’t precise enough and that the look direction (i.e. angle of the speaker of interest) that I input to my code isn’t accurate enough for the beamforming algorithm to amplify one speaker over the other. This week, for testing, I brought in a tape measure so that we could better estimate distances and angles in our setup, but doing so with just the tape measure was difficult. For the next time we test, I’m planning to bring in a protractor and some masking tape as well so that I can rule out setup inaccuracies (as much as possible) when testing the beamforming code.

We are behind schedule since we do not have beamforming working yet. In the coming week, I will work with Charlie to write test code for my beamforming algorithm so that we can get it working within the week.

Team Status Report for 26 March 2022

The most significant risk that could jeopardize our project remains whether or not we are able to separate speech well enough to generate usable captions.

To manage this risk, we are simultaneously working on multiple approaches for speech separation. The deep learning approach is good enough for our MVP, though it may require a calibration step to match voices in the audio to people in the video. We want to avoid having a calibration step and are therefore continuing to develop a beamforming solution. Combining multiple approaches may end up being the best solution for our project. Our contingency plan, however, is to use deep learning with a calibration step. This solution is likely to work, but is also the least novel.

We have not made any significant changes to the existing design of the system. One thing we are considering now is how to determine the number of people that are currently speaking in the video. Information about the number of people currently speaking would help us avoid generating extraneous and incorrect captions. With the current prevalence of masks, we have to rely on an audio solution. Charlie is developing a clever solution that scales to 2 people, which is all we need given our project scope. We will likely integrate his solution with the audio processing code.

We have not made any further adjustments to our schedule.

One component that is now mostly working is angle estimation using the Jetson TX2 CSI camera, which we are using due to its slightly wider FOV and lower distortion.

In the above picture, the estimated angle is overlaid on the center position of the person.
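For reference, a rough sketch of the mapping from a detected person's horizontal position to an angle, assuming a simple pinhole model (the FOV value below is an assumption, not a measured calibration):

```python
# Map the horizontal centre of a detected person to an angle from the camera axis.
import math

def pixel_to_angle(x_center, frame_width, horizontal_fov_deg=62.2):
    """0 degrees is straight ahead; negative is left of centre, positive is right."""
    # Focal length in pixels, derived from the horizontal field of view.
    f = (frame_width / 2) / math.tan(math.radians(horizontal_fov_deg) / 2)
    return math.degrees(math.atan((x_center - frame_width / 2) / f))

# Example: a person whose bounding box is centred at x = 960 in a 1280-pixel-wide frame.
print(pixel_to_angle(960, 1280))   # roughly +17 degrees with the assumed FOV
```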