Larry’s Status Report for 30 April 2022

Last week, I spent some time adjusting the caption generation. I wrote a simple recursive algorithm that wraps text onto a new line instead of letting it run off-screen, though right now it can split words in half. I still need to adjust it to break only at word boundaries, which should not be a difficult problem.
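For reference, here is a minimal sketch of what the word-aware wrapping might look like. This is illustrative rather than our exact implementation, and max_chars is a stand-in for whatever character budget fits on one caption line at our font size:

    def wrap_caption(text, max_chars):
        """Recursively wrap a caption at word boundaries so no word is split."""
        if len(text) <= max_chars:
            return [text]
        # Break at the last space that fits on the line; fall back to a hard
        # split only if a single word is longer than the whole line.
        cut = text.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars
        head, tail = text[:cut].rstrip(), text[cut:].lstrip()
        return [head] + wrap_caption(tail, max_chars)

    print(wrap_caption("this caption is far too long to fit on a single line", 20))
    # ['this caption is far', 'too long to fit on a', 'single line']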

I also spent time looking over Stella’s final presentation and providing a small amount of feedback.

Another thing I did this week was to integrate the new microphones that we purchased with the scripts that we were running on the Jetson TX2. Surprisingly, we only had to change the sample rates in a few areas. Everything else worked pretty much out of the box. We recorded a lot more data with the new microphones. Here is a good example of a full overlap recording, showcasing both the newer captions and the higher quality microphones:

https://drive.google.com/file/d/1MFlt5AUgrVL5hiOT9XV_zveZVAu-saj3/view?usp=sharing

For comparison, this is a recording from our previous status reports that we made using the microphone array:

https://drive.google.com/file/d/1MmEE7Yh0Kxe5wChuq5n5rHsnGKMOyZYr/view?usp=sharing

The difference is pretty stark.
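For context, the sample-rate change amounts to a one-line edit in the recording call. A rough sketch of what such a recording looks like, assuming the sounddevice and soundfile Python libraries and a 48 kHz rate (our actual scripts and rate may differ):

    import sounddevice as sd
    import soundfile as sf

    SAMPLE_RATE = 48000   # assumed rate for the new mics; the old array used a different value
    DURATION_S = 20       # we keep recordings to roughly 20 seconds
    CHANNELS = 2          # left/right unidirectional mics

    # Record from the default input device and block until finished.
    audio = sd.rec(int(DURATION_S * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=CHANNELS)
    sd.wait()
    sf.write("recording.wav", audio, SAMPLE_RATE)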

Currently, we are on schedule for producing an integrated final product. We definitely do not have time for a real-time implementation, and we are currently discussing whether we have enough time to create some sort of enclosure for our product. Given how busy the next week will be, I doubt we will be able to do anything substantive on that front.

By next week, we will have completed the final demo. The main items we have to finish are the integration with the website, the data analysis, the final poster, and the final video.

Team Status Report for 30 April 2022

What are the most significant risks that could jeopardize the success of the project? How are these risks being managed? What contingency plans are ready?

The most significant risk that could jeopardise our project is linking the front end (website) to the back end (processing). Larry and I will be working on this integration before our final demo day. The contingency plan is to demo the front end and the back end separately. We already have a pipeline set up to record, split the audio, and overlay captions on the video, which is our MVP.

Were any changes made to the existing design of the system (requirements, block diagram, system spec, etc)? Why was this change necessary, what costs does the change incur, and how will these costs be mitigated going forward?

Yes. We will most likely use a deep learning approach for separating speech, given that the results we are getting are spectacular. This change was necessary because a conventional signal processing method, with our microphone array, is not sufficient to separate non-linear combinations of speech. Deep learning approaches are known to be able to learn and adapt to the non-linear structure of the training data, which is why our current deep learning approach is so successful at separating overlapping speech.

At the moment, we have decided not to pursue real-time operation, due to the limitations of the deep learning approach. Instead, we replaced it with a website interface that allows users to upload their own recordings for text overlay.

Provide an updated schedule if changes have occurred.

No change to schedule.

This is also the place to put some photos of your progress or to brag about a component you got working.

We did our final data collection on Friday with the new microphones.

The microphones are unidirectional and placed 10 cm apart. Speakers are seated 30 degrees off the principal axis for two reasons: first, that is the edge of the Jetson camera's field of view; second, that angle is the geometry our pure signal processing (SSF + PDCW) algorithm assumes.
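To make the geometry concrete, here is a quick back-of-the-envelope calculation of the inter-microphone delay this placement produces, which is the cue the phase-difference (PDCW) processing relies on (the 16 kHz figure is just an assumed processing sample rate for illustration):

    import math

    d = 0.10          # microphone spacing in meters
    theta = 30.0      # speaker angle from the principal axis, in degrees
    c = 343.0         # speed of sound in m/s

    # Far-field time difference of arrival between the two microphones
    tdoa = d * math.sin(math.radians(theta)) / c
    print(f"TDOA: {tdoa * 1e6:.1f} microseconds")    # ~145.8 us
    print(f"Samples at 16 kHz: {tdoa * 16000:.2f}")  # ~2.33 samples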

We split the audio and overlay the text on the video. This is a casual conversation between Larry and Charlie. As we expected, the captioning for Charlie is poor given his Singaporean accent, while Larry's captioning is much better.

https://drive.google.com/file/d/1zGE7MbR_yAVq2upUYq0c2LfmLz1AmquT/view?usp=sharing
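As a rough illustration of the overlay step, timestamped captions can be burned into a video with ffmpeg called from Python. This is a simplified stand-in, since our actual script positions each speaker's captions over that speaker; the file names below are placeholders:

    import subprocess

    # Placeholder file names; the real pipeline generates one caption track
    # per separated speaker.
    video_in = "conversation.mp4"
    captions = "captions.srt"     # timestamped captions
    video_out = "conversation_captioned.mp4"

    # Burn the subtitles into the video frames with ffmpeg's subtitles filter,
    # copying the audio stream unchanged.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_in,
         "-vf", f"subtitles={captions}",
         "-c:a", "copy", video_out],
        check=True,
    )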

Charlie’s Status Report for 30 April 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

I summarise the work that I did this week in three main points:

  1. Writing a Python script that runs the video and audio recordings, splits the audio, and overlays the captions.
  2. Helping Stella with her Final Presentation
  3. Data recordings

The last component of our project is to create a website interface that connects to our Jetson to begin the video and audio recording. Last week, I completed the UI for the website (front end). This week, I am writing the back end to run all the commands needed to execute the processing.
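A rough outline of what that back end might look like, assuming Flask and a single shell script that wraps the recording and processing steps (the route names and script path below are placeholders, not my actual code):

    import subprocess
    from flask import Flask, send_file

    app = Flask(__name__)

    # Placeholder path: one shell script that records, separates the audio,
    # and overlays the captions.
    PIPELINE = "./run_pipeline.sh"
    OUTPUT_VIDEO = "output/captioned.mp4"

    @app.route("/record", methods=["POST"])
    def record():
        # Kick off the full pipeline; this blocks until processing finishes,
        # so recordings are kept short (~20 seconds).
        subprocess.run([PIPELINE], check=True)
        return "done"

    @app.route("/result")
    def result():
        # Let the browser fetch the captioned video once processing is done.
        return send_file(OUTPUT_VIDEO, mimetype="video/mp4")

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)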

Stella will also be giving our final presentation, so I provided some feedback on her slides.

On Friday night (8.30pm–11.30pm), my team and I went to a CIC fourth-floor conference room to do our final data collection with our newly purchased unidirectional microphones. We noticed a drastic improvement in the sound quality, and we are now processing this data. This final data collection will be the main component of what we present at the final demo and in our final report.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule.

What deliverables do you hope to complete in the next week?

Next week, I hope to accomplish two items:

  1. Compute the WER (word error rate) for Friday's data collection, to be included in our poster (see the sketch after this list)
  2. Complete the backend of our project for the final demo
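For reference, the WER is just a word-level edit distance between the reference script and the generated captions, normalised by the reference length. A minimal sketch (in practice a package such as jiwer could do this for us):

    def word_error_rate(reference, hypothesis):
        """Word error rate via word-level edit distance."""
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance over words.
        dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dist[i][0] = i
        for j in range(len(hyp) + 1):
            dist[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + sub)  # substitution
        return dist[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("the quick brown fox", "the quick brown dog"))  # 0.25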

Stella’s Status Report for 23 April 2022

Early this week, I finished fixing the SSF + PDCW code in MATLAB so that it runs correctly. I then found the WERs for the speech separated by SSF + PDCW and compared them to the WERs from the deep learning speech separation.

Based on the WERs and on the sound of the recordings, we decided to proceed with the deep learning approach.

This week, we collected a new set of data from two different environments: 1) an indoor conference room, and 2) the outdoor CUC loggia. We collected 5 recordings in each location. I came up with ideas for what we should test before our recording session so we could probe different limits of our system. Our five tests were as follows:

  1. Solo 1 (clean reference signal): Larry speaks script 1 alone
  2. Solo 2 (clean reference signal): Stella speaks script 2 alone
  3. Momentary interruptions (as might happen in a real conversation): Stella speaks script 2 while Larry counts – “one”, “two”, “three”, etc… – every 2 seconds (a regular interval for the sake of repeatability)
  4. Partial overlap: Stella begins script 2, midway through Larry begins script 1, Stella finishes script 2, and then, later, Larry finishes script 1
  5. Full overlap: Stella and Larry begin their respective scripts at the same time

We got the following results:

I ran each of these same recordings through the SSF and PDCW algorithms (to try them out as a pre-processing step), then fed those outputs into the deep learning speech separation model, and finally into the speech-to-text model.

In our meeting this week, Dr. Sullivan pointed out that, since we no longer need our full circular mic array (we now need only two mics), we could spend the rest of our budget on purchasing better mics. Our hope is that the better audio quality will improve our system's performance, and maybe even give us usable results from the SSF + PDCW algorithms. So, on Wednesday, I spent time searching for stereo mic setups. Eventually, I found a stereo USB audio interface and, separately, a pair of good-quality mics and submitted an order request.

This week I also worked on the final project presentation, which I will be giving on Monday.

We are on schedule. We are currently finishing our integration.

This weekend, I will finish creating the slides and prepping my script for my presentation next week. After the presentation, I’ll start on the poster and final report and determine what additional data we want to collect.


Team Status Report for 23 April 2022

Our greatest current risk is that we will encounter problems in the process of integrating our user interface with the rest of our project. Currently, the video capture, audio capture, and various processing steps are able to work together as we want. We’ve been able to test the performance of our system without the UI, but for demo day we aim to finish a website that allows the users to record their video and view the captioned output video all on the website. The website is largely finished, however, and just needs to be connected to the processing steps of our system. As a contingency plan, we can always ask the user to connect their own laptop (or one of our laptops, for demo day) to the Jetson in order to view the captioned video.

We have made several changes to our design in the past week. For one, we have finalized our decision to use the deep learning approach for speech separation in our final design, rather than using the signal processing techniques SSF and PDCW. While SSF and PDCW do noticeably enhance our speaker of interest, they don’t work well enough to give us a decent WER. We will, however, try using SSF and PDCW to pre-process the audio before passing it to the deep learning algorithm to see if that helps our system’s performance.

While the deep learning algorithm takes in only one channel of input, we still need two channels to distinguish our left from our right speaker. This means that we no longer need our full mic array and could instead use stereo mics. Because we had spent less than half of our budget before this week, we decided to use the rest to buy components for a better stereo recording. We submitted purchase request forms for a stereo audio interface, two microphones of much better quality than the ones in the circular mic array we’ve been working with, and the necessary cords to connect these parts. We hope that a better quality audio recording will help reduce our WER.
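As a small illustration, the left/right distinction comes straight from the two channels of the stereo recording, which can be split before separation (a sketch assuming the soundfile Python library; the file names are placeholders):

    import soundfile as sf

    # Placeholder file name: one stereo recording from the two microphones.
    audio, sample_rate = sf.read("stereo_recording.wav")

    # Column 0 is the left mic, column 1 is the right mic; in our setup each
    # channel is intended to be dominated by the speaker seated on that side.
    left_channel = audio[:, 0]
    right_channel = audio[:, 1]

    sf.write("left.wav", left_channel, sample_rate)
    sf.write("right.wav", right_channel, sample_rate)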

We have made no changes to our schedule.

Our project is now very near completion. The website allows file uploads, can display a video, and displays a timer for the user to see how long the recording will go. The captions display nicely over their respective speakers. (See Charlie and Larry’s status reports for more details.)

For the audio processing side, we collected a new set of recordings this past week in two separate locations: indoors in a conference room and outdoors in the CUC loggia (the open archway space along the side of the UC). In both locations, we collected the same set of 5 recordings: 1) Larry speaking alone, 2) Stella speaking alone, 3) Stella speaking with brief interruptions from Larry, 4) partial overlap of speakers (just Stella then both then just Larry), 5) full overlap of speakers. Using the data we collected, we were able to assess the performance of our system under various conditions (see Stella’s status report for further details). Once we get our new microphones, we can perform all or some of these tests again to see the change in performance.

Larry’s Status Report for 23 April 2022

This week, I worked on finishing the integration of all the various subsystems that we developed. One challenge I encountered was that when I tried to combine everything into one Python script, the system ran out of memory. For now, I think we will split the system into several Python scripts that are called from one shell script, and avoid recording for more than ~20 seconds. So far, this strategy seems to be working, and I am almost surprised at how well everything functions. The caption accuracy leaves something to be desired when two people are overlapping their speech, but otherwise the whole system is usable.

Here is a video sample with fully overlapping speech:
https://drive.google.com/file/d/1MmEE7Yh0Kxe5wChuq5n5rHsnGKMOyZYr/view?usp=sharing

One thing that still needs work is keeping the captions within the video frame boundaries and away from each other. I will clean up this issue next week, and do not anticipate much work involved. The other main deliverable yet to be finished is the integration of Charlie's website onto the Jetson TX2. Additionally, we ordered new microphones that we still need to test with our setup. Finally, the system currently only uses the deep learning approach to separate speech. It would be interesting to try to overlay captions that are generated using the signal processing approach. Once all that is done, we will have a finished product.

So far, the project is on schedule. I believe we left enough slack time and planned enough contingencies to produce something usable for the final demo. It is possible that we will struggle with the website, in which case we could rapidly develop something that works locally. Charlie seems confident in what he has developed, however, so we probably won't need to change our plans.

By next week, I hope to have the deliverables I mentioned above done. I will also be helping Stella with the final presentation.

 

Charlie’s Status Report for 23 April 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

The following was my agenda from last week, after our discussion with Professor Sullivan:

  1. Collecting data from different environments. This is because our deep learning system may work better in some environments than others (All)
  2. Fully integrating our audio and visual interface, as we are currently facing some Python dependency issues (Larry)
  3. Setting up a UI for our full system, such as a website (Me)

On Monday at 3pm, our team went to the ECE staff lounge in Hamerschlag and the sheltered walkway outside the CUC to do sound tests.

We managed to fully fix our dependency issues related to Python for our audio separation system.

I worked on the features for the website. As discussed on Wednesday, I am more interested in the usability of our website than in its design.

To create such an interactive website, I had to set up a local Flask server on the Jetson that can accept file uploads and downloads. I also set up a timer that starts once the recording button is hit, to give the user an idea of how long they can record their audio (as suggested by Professor Sullivan). As this is my first experience with JavaScript and Flask, it took me over 10 hours to get it working.
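A trimmed-down sketch of what the upload and download routes look like in Flask (the route, field, and directory names here are placeholders rather than my exact code):

    import os
    from flask import Flask, request, send_from_directory

    app = Flask(__name__)
    UPLOAD_DIR = "uploads"   # placeholder directory on the Jetson
    os.makedirs(UPLOAD_DIR, exist_ok=True)

    @app.route("/upload", methods=["POST"])
    def upload():
        # The browser posts the recorded file as multipart form data.
        f = request.files["recording"]
        f.save(os.path.join(UPLOAD_DIR, f.filename))
        return "uploaded"

    @app.route("/download/<name>")
    def download(name):
        # Serve the processed, captioned video back to the user.
        return send_from_directory(UPLOAD_DIR, name)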

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are on schedule, as we are currently in the integration phase of our project.

What deliverables do you hope to complete in the next week?

This coming weekend, I will be helping Stella with her preparation for the final presentation. After the presentation, I want to fully connect the website to the audio separation system, so I will be working closely with Larry.

Stella’s Status Report for 16 April 2022

This week I met with Dr. Stern to ask some further questions about the Phase Difference Channel Weighting (PDCW) source separation algorithm and the Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF) reverberation-reducing algorithm. In this meeting, I found a mistake in my previous implementation of the two methods.

I spent time this week fixing my implementation in MATLAB. Instead of running and testing PDCW and SSF separately, as I had previously been doing, I switched to first running SSF on the two channels I intended to input to PDCW, then sending the two channels (now with less reverberation noise) through PDCW to separate out our source of interest.

My goal before next week is to finalize a decision as to whether or not we can use only signal processing for source separation. If the SSF-PDCW combination works well enough, we will proceed with that, but if it doesn’t we will use the deep learning algorithm instead. If we use the deep learning algorithm, we may still be able to get better results by doing some pre-processing with SSF or PDCW – we will have to test this.

This week I also wrote up a list of the remaining data we have to collect for our testing. We want to record the same audio in multiple locations, so having a written-out plan for testing will help us get these recordings done more efficiently.

I started planning out the final presentation this week and will finish that in the coming week.

We are on schedule now and are working on integration and testing.

Larry’s Status Report for 16 April 2022

This week, I worked on installing all relevant libraries in a Python 3.8 virtual environment. I was able to successfully test all of the relevant Python components, and they seem to be working reasonably. One concern that Charlie had was that the deep learning speech separation was running extremely slowly, even after he had enabled GPU acceleration. This is something that we will look at in the coming weeks.

I also worked more on integrating everything together, and in particular fixed some small mistakes I made while doing the caption generation. The IBM Watson Speech-to-Text API returns the generated text in segments, each with its own confidence level. I noticed, however, that the highest-confidence text usually captions only a portion of the audio. For now, I will use all the text that the API returns, since otherwise there are obvious gaps in the captions.
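To illustrate, with the Watson Python SDK the transcript comes back in segments, and keeping all of them looks roughly like this (the credentials, service URL, and file name are placeholders, and the exact fields we keep may differ):

    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    # Placeholder credentials and service URL for illustration.
    authenticator = IAMAuthenticator("YOUR_API_KEY")
    stt = SpeechToTextV1(authenticator=authenticator)
    stt.set_service_url("https://api.us-east.speech-to-text.watson.cloud.ibm.com")

    with open("speaker1.wav", "rb") as audio_file:
        response = stt.recognize(
            audio=audio_file,
            content_type="audio/wav",
            timestamps=True,
        ).get_result()

    # Keep every returned segment, not just the highest-confidence one,
    # so there are no gaps in the captions.
    caption_text = " ".join(
        segment["alternatives"][0]["transcript"].strip()
        for segment in response["results"]
    )
    print(caption_text)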

There aren’t any more interesting results for me to share here, since all I have been working on is fixing the Python libraries and integrating scripts. Overall, I would say that I am still slightly behind schedule. I hoped to have every component working together with the push of a button by now, but we are not quite there yet. If we give up on the real-time aspect, however, there is not that much left to do. Charlie and I will work together on setting up the web server, after which we will have a final product to present.

By next week, I definitely should have the entire system working together. We should be able to run a single script to record video/audio, generate and overlay captions, and add stereo audio to the video output. We also should have a basic website running through the Jetson TX2.

Team Status Report for 16 April 2022

The most significant risk that could jeopardize our project is that we are not able to put together a strong user experience. We believe we have all of the parts working and relatively integrated, but we have very little experience creating a good UI. Our current plan is to create a website that lets the user record and download video. If we are unable to create a decent website, our contingency plan is to fall back on the HDMI to USB converter that we purchased. The user would then have to connect their personal laptop to the Jetson to see the output.

We are still having difficulty using signal processing techniques such as PDCW and SSF to separate speech, and are most likely moving forward with the deep learning approach. We are quite confident in the deep learning approach at this time. One risk associated with it is that it may take a very long time to run, potentially degrading the user experience. We are currently trying to work on speeding up the processing, but may have to settle for a much less responsive final product.

The only major change we have made to our system is that we are attempting to run a web server on the Jetson TX2 to provide a nice user interface. We wanted an interface that was both easy to implement and easy to use, and settled on a web server. There are no extra costs incurred by this change, and we have the time to work on it.

In our schedule, Charlie will now be working on the website for most of the remaining time. We have no other changes to our schedule.

Early progress on the website: