Charlie’s Status Report for 26 February 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I implemented image segmentation using Detectron2 to identify the locations of the humans (and only the humans) in a scene.

The image on the left is the test image provided by the Detectron2 dataset. To be certain that the segmentation method also works on partial bodies and difficult images, I tested it on my own image. Detectron2 appears to work very well. Thereafter, I wrapped the network in a modular function so that the location of each speaker can be easily queried (a sketch of that function appears below). The coordinates will then be combined with our angle estimation pipeline (which Larry is currently implementing).
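
For illustration, here is a minimal sketch of what that modular function looks like, assuming a COCO-pretrained Mask R-CNN from the Detectron2 model zoo; the function name speaker_locations and the 0.5 score threshold are placeholders rather than our final settings.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # placeholder confidence cutoff
predictor = DefaultPredictor(cfg)

def speaker_locations(image_bgr):
    """Return the (x, y) box centers of every detected person."""
    instances = predictor(image_bgr)["instances"].to("cpu")
    people = instances[instances.pred_classes == 0]  # COCO class 0 = person
    return people.pred_boxes.get_centers().numpy()
```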

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

I would say we are slightly behind, because I was surprised at how difficult it was to separate the speech of two speakers, even with the STFT method. I had previously implemented this method in a class and verified that it worked. However, it did not work on our audio recordings, and I am currently debugging why. In the meantime, Stella is working on delay-and-sum beamforming, which is our second approach to speech enhancement.
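
For context, the sketch below shows roughly the kind of STFT-based separation I had working in class: a crude DUET-style mask that assigns each time-frequency bin to one speaker based on the inter-channel phase difference. The function name, the nperseg value, and the sign-based split are illustrative, not our exact code.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_two_speakers(left, right, fs):
    """Assign each time-frequency bin to one of two speakers by the
    sign of the inter-channel phase difference (crude DUET-style mask)."""
    _, _, L = stft(left, fs, nperseg=1024)
    _, _, R = stft(right, fs, nperseg=1024)
    phase = np.angle(L * np.conj(R))  # phase lag encodes direction of arrival
    mask = phase > 0                  # positive lag -> speaker 1, else speaker 2
    _, s1 = istft(np.where(mask, L, 0), fs)
    _, s2 = istft(np.where(mask, 0, L), fs)
    return s1, s2
```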

What deliverables do you hope to complete in the next week?

In the coming week, I hope to get the equations for delay-and-sum beamforming from Stella. Once I receive them, I will be able to implement the generalised sidelobe canceller (GSC) and determine whether our speech enhancement method works (in a non-real-time case); a rough sketch of the structure appears below. In the event that GSC does not work, my group has identified a deep learning approach to separating speech. We do not plan to jump straight to that, as it cannot be applied in real time.
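
To make the plan concrete, here is a rough two-microphone GSC sketch, assuming the inputs have already been time-aligned on the target speaker (the delay-and-sum step Stella is deriving). The step size, tap count, and function name are placeholders.

```python
import numpy as np

def gsc_two_mic(x1, x2, mu=0.1, taps=16, eps=1e-8):
    """Generalised sidelobe canceller for two pre-steered channels:
    a fixed delay-and-sum beamformer plus an adaptive (NLMS) branch
    that subtracts noise leaking through the sidelobes."""
    fixed = 0.5 * (x1 + x2)  # fixed beamformer: sum passes the aligned target
    blocked = x1 - x2        # blocking matrix: difference cancels the target
    w = np.zeros(taps)       # adaptive filter weights
    out = np.zeros(len(fixed))
    for n in range(taps, len(fixed)):
        u = blocked[n - taps:n][::-1]         # noise-only reference window
        out[n] = fixed[n] - w @ u             # subtract estimated noise
        w += mu * out[n] * u / (u @ u + eps)  # NLMS weight update
    return out
```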

Charlie’s Status Report for 19 February 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I sent out the purchase orders for the components that my team will need for our project. Most of our parts have arrived. I verified that our webcam and microphone array work by plugging them into my computer.

To start working on our speech separation techniques, I recorded audio with two speakers located on opposite sides of the circular microphone array. I also measured the spacing between the microphones, which is a crucial parameter for delay-and-sum beamforming: the spacing fixes the array geometry, and the geometry determines the per-microphone delays, as sketched below.
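
As a reminder of why the spacing matters, this is the standard far-field delay geometry for a uniform circular array (the radius can be derived from the adjacent-microphone spacing). It is a textbook formula rather than our final code, and the names are mine.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def circular_array_delays(radius_m, n_mics, source_angle_rad):
    """Relative far-field steering delays for a uniform circular array.
    Microphone i sits at angle 2*pi*i/N on the circle; a plane wave
    arriving from source_angle_rad reaches each microphone with a delay
    proportional to its projection onto the arrival direction."""
    mic_angles = 2 * np.pi * np.arange(n_mics) / n_mics
    return -(radius_m / SPEED_OF_SOUND) * np.cos(source_angle_rad - mic_angles)

# Radius from the adjacent-microphone spacing d of an N-mic circle:
# r = d / (2 * sin(pi / N))
```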

My team and I have also finalized our design. More specifically, we decided that, in order to make our system real-time, we will broadcast our video output with captions on a computer and then screen-share it. We came up with this solution after an intensive literature review of the challenges that previous capstone groups faced with real-time streaming of video and audio data.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule, since the focus of this week was the design presentation that Larry is giving. Over the coming weekend, I will work closely with Larry to help him prepare.

What deliverables do you hope to complete in the next week?

In the next week, Stella and I will begin collecting more audio files. The audio files we have collected so far were recorded in a noisy room, which made it very difficult to separate speech properly. I am thinking of collecting our audio files in a soundproof room in Fifth Commons.

I will also start to come up with designs for microphone beamforming and short-time Fourier transform (STFT) speech separation techniques. The following week, I will test these speech separation techniques on the audio files that I have collected.

Team Status Report for 12 February 2022

The most significant risk that could jeopardize our project is receiving more than two channels of audio input. We are currently considering two solutions. The first is to purchase a prebuilt microphone array and design more complex speech separation algorithms (since we cannot adjust the distance between the microphones). The second is to purchase a USB hub and wire individual USB microphones to it, which allows us to design less complex beamforming algorithms, since we can vary the positions of the microphones. The current plan is to first use the prebuilt microphone array with the more complex algorithms; if that does not work, we can fall back to the USB-hub setup, where the same algorithms still apply and we gain the flexibility of repositioning the microphones.

One of the changes we made is to purchase a prebuilt microphone array instead of building one ourselves, because we do not know how to create a multichannel input by soldering individual microphones; this was also a suggestion from Professor Sullivan. Just in case, we purchased several such boards to test, which increased our spending slightly.

No changes have been made to the schedule we discussed in the proposal presentation.

Photos of our current progress can be found in Charlie’s report. This week we primarily worked on our proposal presentation, so we do not have many design graphics to show.

Charlie’s Status Report for 12 February 2022

This week, I gave the proposal presentation on behalf of my team. In the presentation, I discussed the use case of our product and our proposed solution to the problem. I received some very interesting questions from the other teams. One of my favorites was about the possibility of using facial-recognition deep learning models to do lip reading. While I doubt my team will adopt such techniques due to their complexity, the topic does pique my interest as a future research direction.

I also designed a simple illustration of our solution, as shown below.


I think my illustration really helped the audience understand our solution: it is extremely intuitive yet representative of the idea.

After the presentation on Wednesday, my team and I met in person to work on our design. We decided that the most logical next step was to start on the design presentation, since it will help us figure out which components we need to obtain. Even so, we also booked a Jetson, because we wanted to make sure Larry could figure out how to use it, given that he is the embedded-systems person in our group.

I was also concerned that we might need to upsample our 8 kHz audio to 16 kHz or 44.1 kHz in order to feed it into our speech-to-text (STT) model.

I tested this and confirmed that even an 8 kHz sampling rate is sufficient for the STT model to work.
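
Should upsampling become necessary after all, a polyphase resampler is nearly a one-liner in SciPy; this is a minimal sketch, and the file names are placeholders for our own recordings.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

# File names are placeholders for our own recordings.
fs, audio = wavfile.read("two_speakers_8khz.wav")
assert fs == 8000, "expected an 8 kHz recording"
up = resample_poly(audio.astype(np.float64), up=2, down=1)  # 8 kHz -> 16 kHz
wavfile.write("two_speakers_16khz.wav", 16000,
              np.clip(up, -32768, 32767).astype(np.int16))
```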

We are currently on track with our progress, with reference to our Gantt chart.

By the end of this week, we should have a list of the items we need to submit for purchase. Stella and I will start designing the beamforming algorithm this weekend, and Larry will start working on his presentation.