Charlie’s Status Report for 30 April 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

My work this week can be summarised into three main points.

  1. Writing a Python script that runs the video and audio recordings, splits the audio, and overlays the captions.
  2. Helping Stella with her Final Presentation
  3. Data recordings

The last component of our project is a website interface that connects to our Jetson to begin the video and audio recording. Last week, I completed the UI for the website (front-end). This week, I am writing the back-end that runs all the commands needed for the processing pipeline.
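A rough sketch of how such a back-end can chain the steps together, assuming hypothetical helper scripts (record_av.py, separate_audio.py, overlay_captions.py) whose names are placeholders rather than our actual files:

```python
import subprocess

def run_pipeline(duration_s: int, output_path: str = "captioned.mp4") -> str:
    """Run the recording -> separation -> captioning pipeline (script names are illustrative)."""
    # 1. Record video and audio for the requested duration.
    subprocess.run(["python3", "record_av.py", "--duration", str(duration_s),
                    "--out", "raw.mp4"], check=True)
    # 2. Split/separate the audio track into individual speakers.
    subprocess.run(["python3", "separate_audio.py", "--in", "raw.mp4",
                    "--out-dir", "separated/"], check=True)
    # 3. Overlay the generated captions onto the video.
    subprocess.run(["python3", "overlay_captions.py", "--video", "raw.mp4",
                    "--captions-dir", "separated/", "--out", output_path], check=True)
    return output_path
```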

Stella was also the one presenting our final presentation, so I provided some feedback on her slides.

On Friday night (8.30pm-11.30pm), my team and I went to the CIC fourth-floor conference room to do our final data collection with our newly purchased unidirectional microphones. We noticed a drastic improvement in the sound quality, and we are in the process of processing this data. This is our final data collection, and it will be the main component of what we present at the final demo and in our final report.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule.

What deliverables do you hope to complete in the next week?

Next week, I hope to accomplish two items:

  1. Compute the WER for Friday's data collection, to be included in our poster
  2. Complete the backend of our project for the final demo

Charlie’s Status Report for 23 April 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

The following was my agenda from last week, after our discussion with Professor Sullivan:

  1. Collecting data from different environments. This is because our deep learning system may work better in some environments than others (All)
  2. Fully integrating our audio and visual interface, as we are currently facing some Python dependency issues (Larry)
  3. Setting up a UI interface for our full system, such as a website (Me)

On Monday at 3pm, our team went to the ECE staff lounge in Hamerschlag Hall and the sheltered walkway outside the CUC to do sound tests.

We managed to fully fix our dependency issues related to Python for our audio separation system.

I worked on the features for the website. As discussed on Wednesday, I am more interested in the usability of our website than in its design.

To create such an active website, I had to set up a local (Flask) server on the Jetson which can accept file uploads and downloads. I also set up a timer that starts once the recording button is hit, to give users an idea of how long they can record their audio (as per a suggestion from Professor Sullivan). As this is my first experience with JavaScript and Flask, it took me over 10 hours to get it to work.
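A minimal sketch of the kind of Flask routes involved, with placeholder file paths and the actual processing call omitted; this is illustrative rather than our exact code:

```python
from flask import Flask, request, send_file

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    # Accept an audio/video file uploaded from the browser.
    f = request.files["recording"]
    f.save("uploads/recording.webm")
    return {"status": "received"}

@app.route("/download")
def download():
    # Serve the processed, captioned video back to the browser.
    return send_file("outputs/captioned.mp4", as_attachment=True)

if __name__ == "__main__":
    # Listen on all interfaces so the Jetson is reachable on the local network.
    app.run(host="0.0.0.0", port=5000)
```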

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are on schedule, as we are currently in the integration phase of our project.

What deliverables do you hope to complete in the next week?

This coming weekend, I will be helping Stella with her preparation for the final presentation. After the presentation, I want to fully connect the website to the audio separation system, so I will be working closely with Larry.

Charlie’s Status Report for 16 April 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, our team discussed with Professor Sullivan and we highlighted the three main priorities that we want to achieve by our final demo.

  1. Collecting data from different environments. This is because our deep learning system may work better in some environments than others (All)
  2. Fully integrating our audio and visual interface, as we are currently facing some Python dependency issues (Larry)
  3. Setting up a UI interface for our full system, such as a website (Me)

I decided to handle the last priority as I have prior experience writing a full website. The following shows the progress that I have made so far.

In my experience, I prefer writing websites from a systems perspective: I make sure that every “active” component of the website works before I improve the design. In the structure above, I have made sure that my website is able to receive inputs (duration of recording + file upload) and produce output (a downloadable captioned video).

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are on schedule, as we are currently in the integration phase of our project.

What deliverables do you hope to complete in the next week?

In the coming weekend and week, I will be improving on the visual design of the website.

Larry and I also ran into some problems with the GPU on the Jetson. When we ran speech separation on the Jetson, it took about 3 minutes, which is rather strange given that the same script took about 0.8 s on weaker GPUs on Google Colab. I suspect that the GPU on the Jetson is broken, but Larry disagrees. This week we will try to debug it to see whether the GPU actually works.

Charlie’s Status Report for 10 April 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I refined my code for the audio component of our project. In particular, Larry wanted me to generate a normalized version of the collected audio, which will be integrated into the video.
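For reference, peak normalization of a recording only takes a few lines; this is a generic sketch using the soundfile package, not necessarily the exact method Larry and I settled on:

```python
import numpy as np
import soundfile as sf

def peak_normalize(in_path: str, out_path: str, target_peak: float = 0.95) -> None:
    """Scale the waveform so its largest absolute sample reaches target_peak."""
    audio, sr = sf.read(in_path)
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio * (target_peak / peak)
    sf.write(out_path, audio, sr)
```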

The audio component of our project runs on Python 3.8. However, Larry informed me that there were a couple of Python dependency errors: in particular, the Speech-To-Text (STT) module requires Python 3.9, while OpenCV requires Python 3.7. Newer Python versions should be largely backward compatible, so I was looking into how to reset Python on the Jetson. However, we agreed that we should not reset it before demo day, so that we can at least demonstrate the working components of our project.

For the demo day, I organised my team members, and we decided on the list of things we would like to show Professor Sullivan and Janet. Here is the list I came up with.

Charlie
– 3 different versions of speech separation of two recorded speakers
– including the predicted speech and WER

Larry
– overlaying text
– simultaneous video and audio recording
– image segmentation

Stella
– Gantt chart

In particular, I wanted to show the WER results of our current speech separation techniques compared to the baseline. This is an important step toward the final stages of our capstone project. The results are summarised below.

These results show that our speech separation module performs comparably to a single speaker speaking clearly, and that the largest margin of error arises from the limitations of the STT model. This is a great demonstration of the effectiveness of our current deep learning approach.
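For context, the WER numbers come from comparing each STT transcript against a reference transcript. A minimal sketch using the jiwer package (one common choice, not necessarily the script we used, and with made-up example sentences):

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"   # ground-truth transcript
hypothesis = "the quick brown fox jumped over a lazy dog"   # STT output

# WER = (substitutions + deletions + insertions) / number of words in the reference
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```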

I also ran a naive test of how long speech separation takes on a GPU. It turns out that a 20 s clip takes only about 0.8 s, which makes the system more realistic for daily use.
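The timing test was essentially a wall-clock timer wrapped around the separation call. A sketch assuming SpeechBrain's public sepformer-whamr checkpoint (the exact model source and file names may differ from ours):

```python
import time
from speechbrain.pretrained import SepformerSeparation

# Load the pretrained SepFormer model onto the GPU.
model = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-whamr",
    savedir="pretrained_models/sepformer-whamr",
    run_opts={"device": "cuda"},
)

start = time.perf_counter()
est_sources = model.separate_file(path="mixture_20s.wav")  # ~20 s two-speaker clip
elapsed = time.perf_counter() - start
print(f"Separated sources with shape {tuple(est_sources.shape)} in {elapsed:.2f} s")
```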

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule, with contingencies in place. Stella is currently working on the Phase Difference Channel Weighting (PDCW) algorithm generously suggested to us by Professor Stern. We are still hoping to use a more signal-processing-based approach to speech separation, even though we already have an effective deep learning approach to compare against.

What deliverables do you hope to complete in the next week?

In the next week, I will work with Larry to fully integrate our system. I will work with Stella to see if our signal processing approach is still viable. Finally, following Professor Sullivan's suggestion, I want to run a final experiment to measure the baseline WER of our microphone array, simply by having a single speaker speak into the array and computing the word error rate of the predicted text.

Charlie’s Status Report for 2 April 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, my team and I discussed the current missing parts of our project. We realised that, other than beamforming, we have all the components required to integrate our Minimum Viable Product (MVP).

As I am in charge of the deep learning components of our project, I reformatted the deep learning packages for our speech separation module. In particular, for our MVP, we are planning to use SpeechBrain's SepFormer model, which is trained on the WHAMR! dataset containing environmental noise and reverberation. Using two microphones on our array, I am able to estimate the position of the speakers based on the difference in time of arrival. This is crucial, as SepFormer separates the speech but does not provide any information about where the speakers are located.
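A sketch of how the speaker direction can be estimated from the delay between the two microphones via cross-correlation; the 4 cm spacing below is a placeholder value, not our measured geometry:

```python
import numpy as np
from scipy.signal import correlate

def estimate_angle(mic_left: np.ndarray, mic_right: np.ndarray,
                   fs: int, spacing_m: float = 0.04) -> float:
    """Estimate direction of arrival (radians) from the inter-microphone delay."""
    c = 343.0  # speed of sound, m/s
    xcorr = correlate(mic_left, mic_right, mode="full")
    lag = np.argmax(xcorr) - (len(mic_right) - 1)   # delay in samples; sign gives the side
    tau = lag / fs                                   # delay in seconds
    # Far-field approximation: tau = spacing * sin(theta) / c
    return float(np.arcsin(np.clip(c * tau / spacing_m, -1.0, 1.0)))
```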

On Thursday, Stella and I got on a call with Professor Stern, because I discovered that he had published a paper on speech separation with two microphones spaced 4 cm apart. After speaking with Professor Stern, we identified three possible reasons why our current implementation does not work.

  1. A reverberant environment in the small music room (original recordings)
  2. Beamforming with the given array will likely lead to spatial aliasing
  3. Audio was sampled at an unnecessarily high rate (44.1 kHz)

Professor Stern suggested that we do a new recording at 16 kHz in an environment with much less reverberation, such as outdoors or a larger room.

On Friday, Larry and I went to an office at the ECE staff lounge to do the new recording. This new recording will be tested on the Phase Difference Channel Weighting (PDCW) algorithm that Professor Stern published. This branch lets us continue with a more signal-processing-based approach to our capstone project.

A casual conversation.

A scripted conversation.


Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are slightly behind, as we are planning to abandon beamforming and switch to Professor Stern's published work. However, it is hard to say we are truly behind, because we already have a contingency in place (deep learning speech separation). To catch up, we just have to collect new recordings and test them using the PDCW algorithm.

What deliverables do you hope to complete in the next week?

In the next week, Stella and I will test the PDCW algorithm on our newly collected recordings. We will also likely meet up with Professor Stern to further advise us on our project, as he is very knowledgeable about binaural speech processing.

Charlie’s Status Report for 26 March 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

After discussing with my team on Monday, we decided that I would work on a speaker identification problem. In particular, I was trying to solve the problem of identifying whether speaker A is speaking, speaker B is speaking, or both speaker A and B are speaking at the same time.

I originally imagined this as a localization problem, where we try to identify where the speakers are. I tried to use the MATLAB RootMUSICEstimator function, but realised that it requires prior knowledge of the number of speakers in the scene, which defeats the purpose.

Thereafter, I formulated this as a path-tracing/signal-processing problem. Essentially, if only one speaker is speaking, I would expect a time delay between the mic closer to the speaker and the mic furthest away. The time delay can be computed by cross-correlation, and the sign of the delay indicates whether the speaker is on the left or right. If both speakers are speaking at the same time, the cross-correlation peak is weaker. Therefore, by thresholding the cross-correlation, we can pseudo-identify which speaker is speaking when.
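A sketch of the thresholding idea; the threshold value and the left/right sign convention below are placeholders to be tuned against real recordings, not our calibrated settings:

```python
import numpy as np
from scipy.signal import correlate

def who_is_speaking(mic_near: np.ndarray, mic_far: np.ndarray,
                    threshold: float = 0.5) -> str:
    """Classify a short frame as left speaker, right speaker, or both overlapping."""
    xcorr = correlate(mic_near, mic_far, mode="full")
    # Normalise the peak so it is comparable across frames of different energy.
    norm = np.sqrt(np.sum(mic_near ** 2) * np.sum(mic_far ** 2)) + 1e-12
    peak_idx = int(np.argmax(np.abs(xcorr)))
    peak = np.abs(xcorr[peak_idx]) / norm
    lag = peak_idx - (len(mic_far) - 1)

    if peak < threshold:
        return "both"                            # weak, smeared peak -> overlapping speech
    return "left" if lag > 0 else "right"        # sign of the delay gives the side
```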

Next, I wanted to follow up on the deep learning approach to see whether conventional signal processing methods can improve the STT predictions. All beamforming methods used are from MATLAB packages. I summarise my work as follows.

  • GSC beamformer, then feed into DL speech separation → poor STT
  • GSC beamformer + NC, then feed into DL speech separation → poor STT
  • DL speech separation across all channels, then add up → better STT
  • DL speech separation followed by audio alignment → better STT

(GSC: Generalised Sidelobe Canceller, DL: Deep Learning, STT: Speech to Text)

It seems that pre-processing the input with conventional signal processing methods generally hurts speech prediction. However, applying DL techniques on individual channels and then aligning them could improve performance. I will work with Stella in the coming week to see whether I can integrate that with her manual beamforming approach, as I am doubtful of the MATLAB packages' implementation (which is aimed primarily at radar antennas).

Finally, I wrapped up my week's work by reformatting our STT DL methods into simple Python callable functions (keeping the DL implementations as abstract as possible for my teammates). Given the widely available STT packages, I provided three STT modules: Google STT, SpeechBrain, and IBM Watson. Based on my initial tests, Google and IBM seem to work best, but they run on a cloud server, which might hinder real-time applications.
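A sketch of the kind of wrapper interface this produces, shown only for the SpeechBrain backend; the cloud backends are stubbed out, and the model source string is one public checkpoint rather than necessarily the one we used:

```python
from speechbrain.pretrained import EncoderDecoderASR

_asr = None

def transcribe(wav_path: str, backend: str = "speechbrain") -> str:
    """Return the predicted transcript for a WAV file."""
    global _asr
    if backend == "speechbrain":
        if _asr is None:  # load the model once and reuse it
            _asr = EncoderDecoderASR.from_hparams(
                source="speechbrain/asr-crdnn-rnnlm-librispeech",
                savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
            )
        return _asr.transcribe_file(wav_path)
    if backend in ("google", "ibm"):
        # Cloud backends omitted here; they require API credentials.
        raise NotImplementedError(backend)
    raise ValueError(f"unknown backend: {backend}")
```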


Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are behind schedule, as we have not gotten our beamforming to work. Nevertheless, we did manage to get speech separation to work using deep learning, and I have personally been exploring the integration of DL with beamforming.

What deliverables do you hope to complete in the next week?

In the next week, I want to get our beamforming algorithm to work and test its effectiveness. I will do so by meeting Stella more frequently and debugging the script together. I also want to encourage my team to start integrating our camera module with the speech separation module (our MVP), since we do have a workable DL approach right now.

Charlie’s Status Report for 19 March 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I was experimenting with the use of deep learning approaches to separate mixed (or overlapping) speech. The intention of this divergent branch from our beamforming approach is twofold. First, we want a backup strategy to separate overlapping speech in case beamforming does not work as expected. Second, if beamforming does suppress other voices, we could use a deep learning approach to further improve performance. We strongly believe the second option ties best to our project.

I managed to demonstrate that deep learning models are able to separate speech to some level of performance.

The following is overlapping speech between Stella and Larry.

The following is the separated speech of Stella.

The following is the separated speech of Larry.

One interesting point that we discovered is that filtering out noise prior to feeding the signals into the deep learning model harms performance. We believe this is because noise filtering removes critical frequencies, and the deep learning model already has inherent denoising ability built into it.

Second, we noticed that the STT model was not able to interpret the separated speech. This could be caused either by poor enunciation from our speakers or by the separated output not being clean enough.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule.

What deliverables do you hope to complete in the next week?

In the next week, I want to experiment with using the deep-learning-separated speech to cancel the original speakers, to test whether the noise-canceled speech leads to better STT predictions.

Charlie’s status report for 26 February 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I implemented image segmentation using Detectron2 to identify the location of only humans in the scene.

The image on the left is the test image provided by the Detectron2 dataset. Just to be certain that the image segmentation method works on partial bodies and difficult images, I also tested it on my own image. It appears that Detectron2 works very well. Thereafter, I wrapped the network in a modular interface so that a function outputting the location of each speaker can be easily called. The coordinates will then be combined with our angle estimation pipeline (which Larry is currently implementing).
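A minimal sketch of a Detectron2 setup restricted to people; the confidence threshold is a placeholder, and COCO class index 0 corresponds to "person":

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor = DefaultPredictor(cfg)

def person_boxes(image_path: str):
    """Return bounding boxes (x1, y1, x2, y2) for every detected person."""
    outputs = predictor(cv2.imread(image_path))
    instances = outputs["instances"].to("cpu")
    people = instances[instances.pred_classes == 0]   # class 0 = person in COCO
    return people.pred_boxes.tensor.numpy()
```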

 Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

I would say we are slightly behind, because I was surprised at how difficult it was to separate the speech of two speakers even with the STFT method. I had previously implemented this method in a class and verified that it worked. However, it did not work on our audio recordings, and I am currently debugging why. In the meantime, Stella is working on delay-and-sum beamforming, which is our second attempt at enhancing speech.

 What deliverables do you hope to complete in the next week?

In the coming week, I hope to get the equations for delay-and-sum beamforming from Stella. Once I receive the equations, I will be able to implement the generalised sidelobe canceller (GSC) to determine whether our speech enhancement method works (in a non-real-time setting). In the event that GSC does not work, my group has identified a deep learning approach to separate speech; we do not plan to jump straight to that, as it cannot be applied in real time.
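For reference, the core of delay-and-sum beamforming is just shifting each channel by its steering delay and averaging. A simplified integer-sample sketch (real implementations need fractional delays and the actual array geometry):

```python
import numpy as np

def delay_and_sum(channels: np.ndarray, delays_s: np.ndarray, fs: int) -> np.ndarray:
    """Steer a multi-channel recording toward one direction.

    channels: array of shape (n_mics, n_samples)
    delays_s: per-microphone steering delays in seconds
    """
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for ch, delay in zip(channels, delays_s):
        shift = int(round(delay * fs))   # integer-sample approximation of the delay
        out += np.roll(ch, shift)
    return out / n_mics
```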

Charlie’s status report for 19 February 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I sent the purchase orders for the components that my team will need for our project. Most of our parts have arrived. I tested our webcam and microphone array by plugging them into my computer and making sure they worked.

To start working on our speech separation techniques, I recorded audio with two speakers located on opposite sides of the circular microphone array. I also measured the spacing between the microphones, which is a crucial parameter for delay-and-sum beamforming.

My team and I have also finalized our design. More specifically, we decided that in order to make our system real-time, we will broadcast our video output with captions on a computer and then do screensharing. We came up with this solution after reviewing the challenges that previous capstone groups faced with real-time streaming of video and audio data.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule, since the focus of this week was the design presentation that Larry is giving. In the coming weekend, I will work closely with Larry to help him prepare for his presentation.

What deliverables do you hope to complete in the next week?

In the next week, Stella and I will begin collecting more audio files. The audio files we have collected so far were recorded in a noisy room, which made it very difficult for us to separate speech properly. I am thinking of collecting our audio files in a soundproof room in Fifth Commons.

I will also start to come up with some designs for microphone beamforming and Short-Time Fourier Transform (STFT) speech separation techniques. In the following week, I will then test my speech separation techniques on the audio files that I have collected.
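As a starting point, the STFT-based techniques operate on a time-frequency representation of each channel, to which separation masks can later be applied. A minimal sketch with scipy (the window and hop sizes are placeholders):

```python
import numpy as np
from scipy.signal import stft, istft

def to_time_frequency(x: np.ndarray, fs: int):
    """Compute the STFT of one channel; masks would be applied to Zxx before inversion."""
    f, t, Zxx = stft(x, fs=fs, nperseg=1024, noverlap=768)
    return f, t, Zxx

def back_to_waveform(Zxx: np.ndarray, fs: int) -> np.ndarray:
    """Invert a (possibly masked) STFT back to a waveform."""
    _, x_rec = istft(Zxx, fs=fs, nperseg=1024, noverlap=768)
    return x_rec
```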

Charlie’s Status Report for 12 February 2022

This week, I gave the proposal presentation on behalf of my team. In the presentation, I discussed the use case of our product and our proposed solution to the problem. I received some very interesting questions from the other teams. One of my favorites was about the possibility of using facial-recognition deep learning models to do lip reading. While I doubt that my team will adopt such techniques due to their complexity, the topic does pique my interest as a future research direction.

I also designed a simple illustration of our solution, as shown below.

 

I think my illustration really helped my audience to understand my solution. It is extremely intuitive yet representative of my idea.

After the presentation on Wednesday, my team and I met up in person to work on our design. We decided that the most logical next step was to start working on the design presentation, since it will help us figure out what components we need to obtain. We also booked a Jetson, because we wanted to make sure that Larry could figure out how to use it, given that he is the embedded-systems person in our group.

I was also concerned that we might need to upsample our 8 kHz audio to 16 kHz or 44.1 kHz in order to feed it into our speech-to-text (STT) model.

I tested this and confirmed that even an 8 kHz sampling rate is sufficient for the STT model to work.
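If upsampling had turned out to be necessary, resampling from 8 kHz to 16 kHz is a one-liner with scipy; a sketch for reference, not our actual test script:

```python
import numpy as np
from scipy.signal import resample_poly

def upsample_8k_to_16k(audio_8k: np.ndarray) -> np.ndarray:
    """Upsample an 8 kHz waveform to 16 kHz (factor of 2)."""
    return resample_poly(audio_8k, up=2, down=1)
```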

We are currently on track with our progress, with reference to our Gantt chart.

By the end of this week, we should come up with the list of items we need so that we can submit it. Stella and I will start designing the beamforming algorithm this weekend, and Larry will start working on his presentation.