chaoli2 – Team D6: EyeHear

April 30, 2022

Team Status Report for 30 April 2022

What are the most significant risks that could jeopardize the success of the project? How are these risks being managed? What contingency plans are ready?

The most significant risk to our project is linking the front end (website) to the back end (processing). Larry and I will be working on this integration before our final demo day. The contingency plan is to demo the front end and the back end separately. We already have a pipeline set up to record, split and overlay captions on the video, which is our MVP.

Were any changes made to the existing design of the system (requirements, block diagram, system spec, etc)? Why was this change necessary, what costs does the change incur, and how will these costs be mitigated going forward?

Yes. We will most likely use a deep learning approach to separate speech, given that the results we are getting are spectacular. This change was necessary because a conventional signal processing method, given our microphone array, is not sufficient to separate non-linear combinations of speech. Deep learning approaches are known to be able to learn and adapt to the non-linear manifolds of the training instances. This is why our current deep learning approach is so successful at separating overlapping speech.

At the moment, we have decided not to do real-time captioning because of the limitations of the deep learning approach. Instead, we replaced it with a website interface that allows users to upload their own recordings for text overlay.

Provide an updated schedule if changes have occurred.

No change to schedule.

This is also the place to put some photos of your progress or to brag about a component you got working.

We did our final data collection on Friday with the new microphones.

The microphones are unidirectional and placed 10cm apart. Speakers are seated 30 degrees away from the principal axis for two reasons. First, it is the maximum field of view of the Jetson camera. Second, it is the basis of our pure signal processing (SSF+PDCW) algorithm.
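As a rough illustration of what this geometry implies, the far-field delay between two microphones is d·sin(θ)/c. The sketch below plugs in the 10cm spacing and the 30 degree angle. The speed of sound (343 m/s), the 16kHz sampling rate, and the reading of "principal axis" as the broadside axis of the array are assumptions for illustration, not measurements from our setup.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def expected_tdoa(mic_spacing_m, angle_deg, c=SPEED_OF_SOUND):
    """Far-field time difference of arrival (seconds) between two mics.

    angle_deg is measured from the broadside axis of the array
    (assumed here to coincide with the principal axis).
    """
    return mic_spacing_m * math.sin(math.radians(angle_deg)) / c


tdoa = expected_tdoa(0.10, 30.0)
sample_rate = 16_000  # Hz, assumed
print(f"TDOA: {tdoa * 1e6:.1f} us = {tdoa * sample_rate:.2f} samples")
```

Under these assumptions the delay is only a few samples, which is why the sign of the delay is more useful than its exact value.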

We split the audio and overlay the text on the video. This is a casual conversation between Larry and Charlie. As we expected, the captioning for Charlie is bad given his Singaporean accent, but Larry’s captioning is much better.

https://drive.google.com/file/d/1zGE7MbR_yAVq2upUYq0c2LfmLz1AmquT/view?usp=sharing

April 30, 2022

Charlie’s Status Report for 30 April 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

I summarise the work that I did this week into 3 main points.

Writing a Python script that runs the video and audio recording, splits the audio and overlays the captions.
Helping Stella with her Final Presentation
Data recordings

The last component of our project is a website interface that connects to our Jetson to begin the video and audio recording. Last week, I completed the UI for the website (front end). This week, I am writing the back end that runs all the commands needed for the processing.

Stella was also presenting our final presentation, so I gave her some feedback on her slides.

On Friday night (8.30pm-11.30pm), my team and I went to the CIC fourth floor conference room to do our final data collection with our newly purchased unidirectional microphones. We noticed a drastic improvement in the sound quality, and we are in the process of processing this data. This is our final data collection and will be the main component of what we present at the final demo and in our final report.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule.

What deliverables do you hope to complete in the next week?

Next week, I hope to accomplish 2 items

Compute the WER for our data collection on Friday to be included in our poster
Complete the backend of our project for the final demo

April 22, 2022

Charlie’s Status Report for 23 April 2022

The following was my agenda from last week after our discussion with Professor Sullivan:

Collecting data from different environments. This is because our deep learning system may work better in some environments than others (All)
Fully integrating our audio and visual interface, as we are currently facing some Python dependency issues (Larry)
Setting up a UI interface for our full system, such as a website (Me)

On Monday at 3pm, our team went to the ECE staff lounge at Hamerschlag and the sheltered walkway outside the CUC to do sound tests.

We managed to fully fix our dependency issues related to Python for our audio separation system.

I worked on the features for the website. As discussed on Wednesday, I am more interested in the usability of our website rather than the design.

To create such an active website, I had to set up a local (Flask) server on the Jetson which can accept file uploads and downloads. I also set up a timer that starts once the recording button is hit, to give the user an idea of how long they can record their audio (as per a suggestion from Professor Sullivan). As this is my first experience with JavaScript and Flask, it took me over 10 hours to get it to work.
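For readers unfamiliar with Flask, a minimal sketch of a server that accepts uploads and serves downloads looks roughly like the following. This is not our actual server code: the route names, the form field name `file` and the temporary storage folder are all hypothetical choices made for the example.

```python
import os
import tempfile

from flask import Flask, jsonify, request, send_from_directory
from werkzeug.utils import secure_filename

# Hypothetical storage location; a real deployment would use a fixed folder.
UPLOAD_DIR = tempfile.mkdtemp(prefix="uploads_")

app = Flask(__name__)


@app.route("/upload", methods=["POST"])
def upload():
    """Accept a recording sent as multipart form data under the key 'file'."""
    uploaded = request.files.get("file")
    if uploaded is None or uploaded.filename == "":
        return jsonify(error="no file provided"), 400
    # secure_filename strips path components from user-supplied names.
    filename = secure_filename(uploaded.filename)
    uploaded.save(os.path.join(UPLOAD_DIR, filename))
    return jsonify(filename=filename), 201


@app.route("/download/<path:filename>")
def download(filename):
    """Send a stored file back to the browser as an attachment."""
    return send_from_directory(UPLOAD_DIR, filename, as_attachment=True)

# To serve on the local network, call app.run(host="0.0.0.0", port=5000)
# from a script, or use the `flask run` command.
```

In a setup like this, the processing pipeline would write the captioned video into the same folder so that the download route can return it.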

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are on schedule, as we are currently in the integration phase of our project.

What deliverables do you hope to complete in the next week?

This coming weekend, I will be helping Stella with her preparation for the final presentation. After the presentation, I want to fully connect the website to the audio separation system, so I will be working closely with Larry.

April 15, 2022

Charlie’s Status Report for 16 April 2022

This week, our team discussed with Professor Sullivan and we highlighted the three main priorities that we want to achieve by our final demo.

Collecting data from different environments. This is because our deep learning system may work better in some environments than others (All)
Fully integrating our audio and visual interface, as we are currently facing some Python dependency issues (Larry)
Setting up a UI interface for our full system, such as a website (Me)

I decided to handle the last priority as I have prior experience writing a full website. The following shows the progress that I have made so far.

In my experience, I prefer writing websites from a systems perspective. I make sure that every “active” component of the website works before I proceed with improving the design. In the above structure, I have made sure that my website is able to receive inputs (duration of recording + file upload) and produce output (downloadable captioned video).

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are on schedule, as we are currently in the integration phase of our project.

What deliverables do you hope to complete in the next week?

In the coming weekend and week, I will be improving on the visual design of the website.

Larry and I also ran into some problems with the GPU on the Jetson. When we ran speech separation on the Jetson, it took about 3 minutes, which is rather strange given that the same script took about 0.8s on weaker GPUs on Google Colab. I suspect that the GPU on the Jetson is broken, but Larry disagrees. This week we will debug to see whether the GPU actually works.

April 10, 2022

Team Status Report for 10 April 2022

What are the most significant risks that could jeopardize the success of the project? How are these risks being managed? What contingency plans are ready?

The most significant risk at the moment is integrating the camera and audio modules. We are facing quite a few Python compatibility issues on the Jetson. We have a naive solution (contingency), where we write shell scripts that execute each module using a different version of Python. However, this is extremely inefficient. Alternatively, we are thinking of resetting the Jetson or at least resetting the Jetson module.
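To make the contingency concrete, here is a minimal sketch of the same idea written as a Python driver rather than a shell script: each stage is launched as a separate process under its own interpreter. The stage names and commands are hypothetical placeholders. On the Jetson the interpreter entries would be version-specific executables such as `python3.7` or `python3.9`; `sys.executable` is used here only so that the sketch runs anywhere.

```python
import subprocess
import sys


def run_stage(interpreter, args):
    """Run one pipeline stage under the given Python interpreter.

    Raises subprocess.CalledProcessError if the stage exits with an error.
    """
    result = subprocess.run(
        [interpreter, *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


# Hypothetical stages; each "-c" command stands in for a real module script.
STAGES = [
    ("video", sys.executable, ["-c", "print('video stage done')"]),
    ("audio", sys.executable, ["-c", "print('audio stage done')"]),
    ("stt", sys.executable, ["-c", "print('stt stage done')"]),
]

for name, interpreter, args in STAGES:
    print(name, "->", run_stage(interpreter, args))
```

The inefficiency mentioned above shows up here too: every stage pays the start-up cost of a fresh interpreter, and data has to be passed between stages through files or pipes.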

We are also having trouble using a purely digital signal processing approach to separate speech. We realised that beamforming was not sufficient to separate speech, and have turned to Phase Difference Channel Weighting (PDCW) and Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF), which Stella is currently testing.

The contingency plan is a deep learning speech separation approach. Charlie has obtained significant success in using deep learning to separate speech, but the deep learning model he has used cannot be used in real time due to processing latency. We demonstrated it at our demo this week and showed the results of this deep learning approach. Professor Sullivan and Janet seem sufficiently convinced that our approach is logical.

Provide an updated schedule if changes have occurred.

No change to schedule.

This is also the place to put some photos of your progress or to brag about a component you got working.

The biggest thing to brag about this week is that we have numerical values showing how successful our deep learning approach is.

When one speaker is speaking (isolated case), we get a Word Error Rate (WER) of about 40%. We tested two cases: two speakers speaking at once, and one speaker interrupting the other. Using our speech separation technique, we get a WER nearly comparable to the isolated case.
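For reference, WER is the word-level edit distance between the reference transcript and the predicted transcript, divided by the number of reference words. The sketch below is a plain-Python version of that standard definition. It is not necessarily the script we used, and the lower-casing and whitespace tokenisation are simplifying assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("the quick brown fox", "the quack brown"))  # 0.5
```

Note that WER can exceed 100% when the prediction contains many inserted words.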

April 10, 2022

Charlie’s Status Report for 10 April 2022

This week, I refined my code for the audio component of our project. In particular, Larry wanted me to generate a normalized version of the collected audio, which will be integrated into the video.
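As an illustration of what a normalized version can mean, the sketch below applies simple peak normalisation with NumPy. Peak normalisation and the 0.9 target level are assumptions made for this example; other schemes such as RMS or loudness normalisation would also fit the description.

```python
import numpy as np


def peak_normalize(samples, target_peak=0.9):
    """Scale a float waveform so its largest absolute sample equals target_peak."""
    samples = np.asarray(samples, dtype=np.float64)
    peak = np.max(np.abs(samples)) if samples.size else 0.0
    if peak == 0.0:
        return samples.copy()  # silence or empty input: nothing to scale
    return samples * (target_peak / peak)


quiet = np.array([0.10, -0.25, 0.05])
print(peak_normalize(quiet))
```

Because every sample is multiplied by the same factor, the waveform shape is unchanged and only the overall level differs.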

The audio component of our project runs on Python 3.8. However, Larry informed me that there were a couple of Python dependency errors. In particular, the Speech-To-Text (STT) requires Python 3.9, and OpenCV requires 3.7. Python should be backward compatible, so I was trying to learn how to reset Python on the Jetson. However, we agreed that we should not reset it before the demo day, so that we can at least demonstrate the working components of our project.

For the demo day, I organised my team members and we decided on the list of things we would like to show Professor Sullivan and Janet. Here is the list I came up with.

Charlie
– 3 different versions of speech separation of two recorded speakers
– including the predicted speech and WER

Larry
– overlaying text
– simultaneous video and audio recording
– image segmentation

Stella
– Gantt Chart

In particular, I wanted to show the WER results of our current speech separation techniques as compared to the baseline. This is an important step towards the final stages of our capstone project. The results are summarised below.

These results show that our speech separation module is comparable to a single speaker speaking (or clear speech), and the largest margin of error arises from the limitations of the STT model. This is a great demonstration of the effectiveness of our current deep learning approach.

I also did a naive test of how long speech separation takes on a GPU. It turns out that for a 20s audio clip, it only takes 0.8s, which makes it more realistic for daily use.
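One way to express this kind of measurement is the real-time factor: processing time divided by audio duration, where anything below 1.0 is faster than real time. The helper below is a hypothetical sketch of such a timing test. The `process` callable stands in for the separation call, which is not shown here.

```python
import time


def real_time_factor(process, audio_seconds):
    """Return processing time divided by audio duration.

    `process` is any zero-argument callable that does the work being timed.
    """
    start = time.perf_counter()
    process()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


# Placeholder workload timed against a 20 s clip.
rtf = real_time_factor(lambda: time.sleep(0.01), 20.0)
print(f"real-time factor: {rtf:.4f}")
```

With the figures quoted above, 0.8s of processing for 20s of audio corresponds to a real-time factor of about 0.04.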

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule with contingencies. Stella is currently working on the Phase Difference Channel Weighting (PDCW) algorithm generously suggested to us by Professor Stern. We are still hoping to use a more signal processing based approach to speech separation, even though we do already have an effective deep learning approach to compare with.

What deliverables do you hope to complete in the next week?

In the next week, I will work with Larry to fully integrate our system. I will work with Stella to see if our signal processing approach is still viable. Finally, upon the suggestion from Professor Sullivan, I want to do a final experiment to get the default WER of our microphone array, simply by having a speaker speak into the microphone array and determining the word error rate from the predicted text.

April 2, 2022

Charlie’s Status Report for 2 April 2022

This week, my team and I were discussing the current missing parts of our project. We realised that other than beamforming, we do have the rest of the components required for the integration of our Minimum Viable Product (MVP).

As I am in charge of the deep learning components of our project, I reformatted the deep learning packages for our speech separation module. In particular, for our MVP, we are planning to use SpeechBrain’s SepFormer model, which is trained on the WHAMR! dataset, which contains environmental noise and reverberation. Using the two microphones on our array, I am able to estimate the position of the speakers based on the delay in the time of arrival. This is crucial, as SepFormer separates speech but does not provide any information about where the speakers are located.

On Thursday, Stella and I got on a call with Professor Stern. This is because I discovered that Professor Stern published a paper on speech separation with two microphones that are spaced 4cm apart. After speaking with Professor Stern, we identified three possible reasons why our current implementation does not work.

Reverb Environment in small music room (original recordings)
Beamforming with given array will likely lead to spatial aliasing
Audio is sampled at an unnecessarily high rate (44.1kHz)
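The spatial aliasing point can be quantified with the usual half-wavelength rule: a pair of microphones spaced d apart can only be steered unambiguously for frequencies up to c/(2d). The sketch below evaluates that limit. The speed of sound is an assumed value, and the two spacings are examples only (4cm is the spacing in the paper mentioned above, 10cm is an arbitrary wider spacing for comparison).

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed


def spatial_aliasing_limit_hz(mic_spacing_m, c=SPEED_OF_SOUND):
    """Highest frequency free of spatial aliasing for a given mic spacing.

    Follows from requiring the spacing to be at most half a wavelength:
    d <= c / (2 f), hence f <= c / (2 d).
    """
    return c / (2.0 * mic_spacing_m)


for spacing_cm in (4, 10):
    limit = spatial_aliasing_limit_hz(spacing_cm / 100.0)
    print(f"{spacing_cm} cm spacing: alias-free up to about {limit:.0f} Hz")
```

Wider spacing lowers the limit, so more of the speech band falls into the range where the array response becomes ambiguous.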

Professor Stern suggested that we could do a new recording at 16kHz in an environment with a lot less reverberation, such as outdoors or a larger room.

On Friday, Larry and I went to an office at the ECE Staff lounge to do the new recording. This new recording will be tested on the Phase Difference Channel Weighting (PDCW) algorithm that Professor Stern published. Of course, this is a branch that allows us to continue with a more signal processing approach to our capstone project.

A casual conversation.

A scripted conversation.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are slightly behind as we are planning to abandon beamforming and switch to Professor Stern’s published work. However, it is difficult to consider that we are behind because we already have contingencies present (deep learning speech separation). To catch up, we just have to collect new recordings and test them using the PDCW algorithm.

What deliverables do you hope to complete in the next week?

In the next week, Stella and I will test the PDCW algorithm on our newly collected recordings. We will also likely meet up with Professor Stern for further advice on our project, as he is very knowledgeable about binaural speech processing.

March 26, 2022 (updated March 27, 2022)

Charlie’s Status Report for 26 March 2022

After discussing with my team on Monday, we decided that I would work on a speaker identification problem. In particular, I was trying to solve the problem of identifying whether speaker A is speaking, speaker B is speaking, or both speaker A and B are speaking at the same time.

I originally imagined this as a localization problem, where we try to identify where the speakers are. I tried to use the MATLAB RootMUSICEstimator function, but realised that it requires prior knowledge of the number of speakers in the scene, which defeats the purpose.

Thereafter, I formulated this as a path tracing/signal processing problem. Essentially, if only one speaker is speaking, I would expect a time delay between the mic closer to the speaker and the mic farther away. The time delay can be computed by cross-correlation, and the sign of the delay indicates whether the speaker is on the left or right. If both speakers are speaking at the same time, the cross-correlation is weaker. Therefore, by thresholding the cross-correlation, we can pseudo-identify which speaker is speaking when.
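The idea can be sketched in a few lines of NumPy. This is an illustrative version rather than my actual script: the function name, the 0.7 threshold and the `max_lag` search window are assumptions, and the test signals are synthetic white noise rather than speech.

```python
import numpy as np


def classify_frame(left, right, max_lag, both_threshold=0.7):
    """Label a two-mic frame as 'left', 'right', 'centre' or 'both'.

    Returns (label, lag, peak). lag is the delay of `right` relative to
    `left` in samples; peak is the normalised cross-correlation at that lag.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    left = left - left.mean()
    right = right - right.mean()
    corr = np.correlate(right, left, mode="full")
    lags = np.arange(-(len(left) - 1), len(right))
    # Only lags that are physically plausible for the mic spacing are kept.
    keep = np.abs(lags) <= max_lag
    corr, lags = corr[keep], lags[keep]
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    best = int(np.argmax(corr))
    lag = int(lags[best])
    peak = float(corr[best] / norm) if norm > 0 else 0.0
    if peak < both_threshold:
        return "both", lag, peak  # weak correlation: overlapping speakers
    if lag > 0:
        return "left", lag, peak  # sound reached the left mic first
    if lag < 0:
        return "right", lag, peak
    return "centre", lag, peak


rng = np.random.default_rng(0)
source = rng.standard_normal(2000)
# A source on the left reaches the right mic 3 samples later.
print(classify_frame(source, np.roll(source, 3), max_lag=8))
```

On real recordings the threshold would need tuning, since reverberation and noise also lower the correlation peak.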

Next, I wanted to follow up on the deep learning approach to see whether conventional signal processing approaches can improve the STT predictions. All beamforming methods used are from MATLAB packages. I summarise my work as follows.

GSC beamformer, then feed into DL speech separation -> poor STT
GSC beamformer + nc, then feed into DL speech separation -> poor STT
DL speech separation across all channels, then add up -> better STT
DL speech separation followed by audio alignment -> better STT

(GSC: Generalised Sidelobe Canceller, DL: Deep Learning, STT: Speech to Text)

It seems that processing the input with conventional signal processing methods generally leads to poorer STT predictions. However, applying DL techniques on individual channels and then aligning them could improve performance. I will work with Stella in the coming week to see if there is any way I can integrate that with her manual beamforming approach, as I am doubtful of the MATLAB packages’ implementation (primarily for radar antennas).

Finally, I wrapped up my week’s work by reformatting our STT DL methods into simple Python callable functions (making the DL implementations as abstract to my teammates as possible). Since STT packages are widely available, I provided 3 STT modules: Google STT, SpeechBrain and IBM Watson. Based on my initial tests, Google and IBM seem to work best, but they operate on a cloud server, which might hinder real-time applications.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are behind schedule, as we have not gotten our beamforming to work. Nevertheless, we did manage to get speech separation to work using deep learning, and I have personally been exploring the integration of DL with beamforming.

What deliverables do you hope to complete in the next week?

In the next week, I want to get our beamforming algorithm to work and test its effectiveness. I will do so by meeting Stella more frequently and debugging the script together. I also want to encourage my team to start integrating our camera module with the speech separation module (our MVP), since we do have a workable DL approach right now.

March 19, 2022

Team Status Report for 19 March 2022

What are the most significant risks that could jeopardize the success of the project? How are these risks being managed? What contingency plans are ready?

The most significant risk is the beamforming part of our system. To manage this risk, Charlie and Stella will be working together to get beamforming working. After discussing on Friday, Stella will look through her delay-and-sum beamforming code on Saturday, and try to separate a simulated wave on Sunday. If no bug is observed, Charlie will then come in and work with Stella to figure out the problems.
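For context, a delay-and-sum beamformer time-aligns the channels for the target direction and averages them, so the target adds coherently while sound from other directions does not. The sketch below is a minimal integer-delay version tested on simulated white noise. It is not Stella's code, and the 3-sample delays are arbitrary values chosen for the simulation.

```python
import numpy as np


def delay_and_sum(channels, delays):
    """Advance each channel by its steering delay (in samples) and average.

    channels: array of shape (num_mics, num_samples)
    delays:   per-mic arrival delay of the target source, in whole samples
    """
    channels = np.asarray(channels, dtype=np.float64)
    num_mics, num_samples = channels.shape
    out = np.zeros(num_samples)
    for mic in range(num_mics):
        d = int(delays[mic])
        # Shift by d samples, zero-padding instead of wrapping around.
        if d >= 0:
            out[: num_samples - d] += channels[mic, d:]
        else:
            out[-d:] += channels[mic, : num_samples + d]
    return out / num_mics


def delayed(x, d):
    """Delay signal x by d samples (d >= 0), padding the start with zeros."""
    return np.concatenate([np.zeros(d), x[: len(x) - d]])


# Simulation: the target hits mic 0 first, the interferer hits mic 1 first.
rng = np.random.default_rng(0)
n = 4000
target = rng.standard_normal(n)
interferer = rng.standard_normal(n)
mics = np.stack([target + delayed(interferer, 3),
                 delayed(target, 3) + interferer])
steered = delay_and_sum(mics, delays=[0, 3])

residual = steered[: n - 3] - target[: n - 3]
print("interferer power at one mic:", round(float(np.mean(interferer ** 2)), 2))
print("interferer power after beamforming:", round(float(np.mean(residual ** 2)), 2))
```

With only two microphones and a broadband interferer, the suppression in this simulation is modest (roughly a halving of interferer power), which gives a sense of why a two-mic delay-and-sum may struggle on its own.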

There is no change to our system design, unless we resort to a deep learning approach to separate mixed speech. This would be a discussion we will have in two weeks.

Provide an updated schedule if changes have occurred.

No change to schedule.

This is also the place to put some photos of your progress or to brag about a component you got working.

The separated speech from deep learning by Charlie is worth bragging about.

Mixed speech

Stella speech

Larry speech

March 19, 2022

Charlie’s Status Report for 19 March 2022

This week, I was experimenting with deep learning approaches to separate mixed (or overlapping) speech. The intention of this divergent branch from our beamforming approach is twofold. First, we want to have a backup strategy to separate overlapping speech in case beamforming does not work as expected. Second, if beamforming does suppress other voices, we could use a deep learning approach to further improve the performance. We strongly believe the second option ties best to our project.

I managed to demonstrate that deep learning models are able to separate speech to some level of performance.

The following is overlapping speech between Stella and Larry.

The following is the separated speech of Stella.

The following is the separated speech of Larry.

One interesting point that we discovered was that filtering out noise prior to feeding the signals into the deep learning model harms performance. We believe this arises from the fact that noise filtering filters out critical frequencies, and that the deep learning model has inherent denoising ability built into it.

Second, we noticed that the STT model was not able to interpret the separated speech. It could either be caused by poor enunciation of words from our speakers, or that the output of the STT model is not clear.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule.

What deliverables do you hope to complete in the next week?

In the next week, I want to experiment with using the deep learning separated speech to cancel the original speakers, to test whether the noise-cancelled speech leads to better STT predictions.