Stella’s Status Report for 26 February 2022

This week I taught myself about the delay-and-sum algorithm for beamforming with a linear microphone array so as to understand how delay-and-sum beamforming can be implemented for a circular microphone array (specifically, how we can calculate the correct delays for each mic in a circular array). I have written some preliminary MATLAB code to implement this algorithm, though I haven’t yet run data through it.
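For reference, below is a minimal sketch of the delay calculation, written in Python/numpy for illustration (my actual prototype is in MATLAB, and the function names here are hypothetical). It assumes a far-field source in the plane of the array and mics evenly spaced on a circle of radius r: the arrival delay for mic m at angle phi_m, relative to the array center, is -(r/c)*cos(theta - phi_m) for a source at azimuth theta.

```python
import numpy as np

def circular_array_delays(n_mics, radius, azimuth, c=343.0):
    """Far-field arrival times (s) for mics evenly spaced on a circle.

    azimuth: source direction in radians, in the plane of the array.
    A negative value means the wavefront reaches that mic before it
    reaches the array center.
    """
    mic_angles = 2 * np.pi * np.arange(n_mics) / n_mics
    # Each delay is the (negated) projection of the mic position onto
    # the unit vector pointing toward the source, divided by c.
    return -(radius / c) * np.cos(azimuth - mic_angles)

def delay_and_sum(signals, delays, fs):
    """Compensate each channel's arrival time and average the channels.

    signals: (n_mics, n_samples) array sampled at fs Hz.
    Fractional delays are applied as phase shifts in the frequency
    domain, which avoids rounding delays to whole samples.
    """
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    shifts = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = np.fft.irfft(spectra * shifts, n=n, axis=1)
    return aligned.mean(axis=0)
```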

This week I also finished writing a testing plan to guide our audio data collection. This plan covers all of the parameters we’ve discussed varying during testing, as well as setup plans for data that we can collect to assess our system’s performance across these various parameters. I wrote up a set of steps for data collection that we can use for all of our data collection so as to have repeatable tests. Additionally, I started working on the Design Proposal document this week.

Currently, I think we are slightly behind schedule, as speech separation has been a more challenging problem than we were expecting. I had hoped to have delay-and-sum beamforming figured out by the end of this week, but I still have some work left to adapt the algorithm to the circular mic array. I hope to have notes and equations to share with my team tomorrow so that Charlie and I can start implementing the speech separation pipeline this coming week. I also didn't achieve my goal from last week of completing a draft of the Design Proposal, but this is because we decided at the start of the week that the beamforming algorithm was a more urgent priority.

Early this week (goal: Sunday) I aim to finish figuring out the delay-and-sum beamforming equations and send them to Charlie. Once I send those equations, Charlie and I will be able to combine delay-and-sum beamforming with Charlie's generalised sidelobe canceller algorithm and test how well that pipeline separates speech (not in real time). By mid-week, we will also finish putting together our Design Proposal. I plan to take on a large part of this proposal, since I've written similar documents before.

Team Status Report for 26 February 2022

The most significant risk that could jeopardize our project right now is whether we will be able to separate speech from two sources (i.e., two people speaking) well enough for our speech-to-text model to generate separate and accurate captions for the two sources. There are two elements here that we have to work with to get well-separated speech: 1) our microphone array and 2) our speech separation algorithm. Currently, our only microphone array is still the UMA-8 circular array. Over the past week, we searched for linear mic arrays that could connect directly to our Jetson TX2 but didn't find any. We did find two other options: 1) a 4-mic linear array that we can use with a Raspberry Pi, and 2) a USBStreamer Kit to which we can connect multiple I2S MEMS mics, and which in turn connects to the Jetson TX2. The challenge with these two new options is that we would need to take in data from multiple separate mics or mic arrays and synchronize the incoming audio streams properly. Our current plan remains to try to get speech separation working with our UMA-8 array, but to look for and buy backup parts in case we cannot separate speech well enough with the UMA-8 alone.

We have made no changes to our design since last week, but rather have been working on implementing aspects of it. We have, however, developed new ideas for how to set up our mic array, and we now have candidate solutions that would allow us to use more mics in a linear array, should we need to pivot from our current design.

We have made two changes to our schedule this week. First, we are scheduling in more time for ourselves to implement speech separation, since this part of the project is proving more challenging than we’d initially thought. Second, we are scheduling in time to work on our design proposal (which we neglected to include in our original schedule).

This week we made progress on our image processing pipeline. We implemented webcam calibration (to correct for any distance distortion our wide-angle camera lens causes) and implemented Detectron2 image segmentation to identify different humans in an image. This coming week we will implement more parts of both our image processing and audio processing pipelines.

Charlie’s Status Report for 26 February 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I implemented image segmentation using Detectron2 to identify the location of only humans in the scene.

The image on the left is the test image provided by the Detectron2 dataset. Just to be certain that the image segmentation method works on non-full bodies and difficult images, I also tested it on my own image. Detectron2 appears to work very well. I then wrapped the network in a modular interface so that a function returning the location of each speaker can be easily called. These coordinates will then be combined with our angle estimation pipeline (which Larry is currently implementing).
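As a sketch of how this modular interface can look (the model config and function names below are illustrative assumptions, not necessarily what I implemented), Detectron2's predictor output can be filtered to the COCO "person" class and reduced to one pixel coordinate per speaker:

```python
import numpy as np
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

def build_person_predictor(score_thresh=0.5):
    # A standard COCO instance-segmentation model from the model zoo.
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
        "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = score_thresh
    return DefaultPredictor(cfg)

def speaker_locations(predictor, image_bgr):
    """Return the (x, y) pixel center of each detected person."""
    instances = predictor(image_bgr)["instances"].to("cpu")
    people = instances[instances.pred_classes == 0]  # COCO class 0 = person
    boxes = people.pred_boxes.tensor.numpy()  # (N, 4) as x1, y1, x2, y2
    return np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                     (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
```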

 Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

I would say we are slightly behind, because I was surprised at how difficult it was to separate the speech of two speakers even with the STFT method. I had previously implemented this method in a class and verified that it worked. However, it did not work on our audio recordings, and I am currently debugging why. In the meantime, Stella is working on delay-and-sum beamforming, which is our second approach to speech enhancement.
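For context, this report doesn't spell out the STFT method itself; one simple variant, sketched here under the assumption that each speaker is louder in the mic channel nearest to them, is binary time-frequency masking (the function name is hypothetical):

```python
import numpy as np
from scipy.signal import stft, istft

def separate_by_level_mask(left, right, fs, nperseg=1024):
    """Toy two-speaker separation via binary time-frequency masking.

    Assigns each time-frequency bin of the mixture to whichever of the
    two channels is stronger in that bin, which roughly isolates the
    speaker nearer to each mic.
    """
    _, _, L = stft(left, fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    mask = np.abs(L) > np.abs(R)  # True where the left channel dominates
    mix = (L + R) / 2
    _, speaker1 = istft(mix * mask, fs=fs, nperseg=nperseg)
    _, speaker2 = istft(mix * ~mask, fs=fs, nperseg=nperseg)
    return speaker1, speaker2
```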

 What deliverables do you hope to complete in the next week?

In the coming week, I hope to receive the equations for delay-and-sum beamforming from Stella. Once I receive the equations, I will be able to implement the generalised sidelobe canceller (GSC) to determine whether our speech enhancement method works (in a non-real-time case). In the event that GSC does not work, my group has identified a deep learning approach to separating speech. We do not plan to jump straight to that, as it cannot be applied in real time.
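As a rough illustration of the GSC structure (a hypothetical two-channel sketch assuming the channels have already been steered toward the target, not the planned implementation), the fixed beamformer, blocking matrix, and adaptive canceller fit together like this:

```python
import numpy as np

def gsc_two_mic(aligned, mu=0.1, n_taps=32, eps=1e-8):
    """Minimal generalised sidelobe canceller for two pre-aligned channels.

    aligned: (2, n_samples) signals already delay-compensated toward the
    target (e.g., by the delay-and-sum steering delays). The fixed
    beamformer is the channel mean; the blocking matrix is the channel
    difference, which cancels the target and leaves a noise reference.
    An NLMS filter then subtracts whatever part of that reference leaks
    into the beam output.
    """
    fixed = aligned.mean(axis=0)           # fixed beamformer output
    noise_ref = aligned[0] - aligned[1]    # blocking-matrix output
    w = np.zeros(n_taps)
    out = np.zeros_like(fixed)
    for n in range(n_taps, len(fixed)):
        u = noise_ref[n - n_taps:n][::-1]   # most recent noise samples
        y = fixed[n] - w @ u                # beam minus noise estimate
        w += mu * y * u / (u @ u + eps)     # NLMS weight update
        out[n] = y
    return out
```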

Larry’s Status Report for 26 February 2022

I presented the design review presentation this week, which is what I spent a majority of my working time on again. Overall, I was fairly satisfied with how the presentation went. There weren’t many questions after the presentation and I have not yet received any feedback, so we have not made any adjustments in response to the design review. I hope to receive constructive feedback once we get comments back, however.

Last week, I stated that I hoped to produce an accurate angle for a single person in webcam view. I did not meet that goal this week, though Charlie and I have most of what we think we need for producing a good result. I calibrated the webcam and Charlie worked on converting Detectron2’s image segmentation into a usable format for our project. Below is a picture of the camera’s calibration matrix.
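For reference, the calibration itself follows the standard OpenCV checkerboard procedure; the sketch below assumes a checkerboard with 9x6 inner corners and a hypothetical folder of calibration photos:

```python
import glob
import cv2
import numpy as np

PATTERN = (9, 6)  # inner corners per checkerboard row and column
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.jpg"):  # hypothetical image folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K is the 3x3 intrinsic (calibration) matrix; dist holds the lens
# distortion coefficients used to undistort the wide-angle images.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```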

Our project is still on schedule. According to our Gantt chart, I am scheduled to complete angle estimation by next week, and I should be able to produce a reasonable angle estimate soon given our current progress.
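Once the intrinsic matrix K is known, converting a detected person's pixel column into a horizontal angle reduces to a one-line pinhole-model calculation (a sketch assuming an undistorted image; the function name is hypothetical):

```python
import numpy as np

def pixel_to_azimuth(u, K):
    """Horizontal angle (radians) of pixel column u under a pinhole model.

    K[0, 0] is the focal length fx in pixels and K[0, 2] the principal
    point cx, so theta = atan((u - cx) / fx).
    """
    return np.arctan2(u - K[0, 2], K[0, 0])
```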

Coming up soon is the design report, which will likely also take a good amount of time to put together. I notice now that it isn’t in the current version of the Gantt chart, which is a bit of an oversight. I believe our schedule has enough slack for it, however. The main deliverables I hope to complete by next week are the angle estimation and the design report.

Stella’s Status Report for 19 February 2022

Now that we have our mic array and webcam, we can start collecting data. So, this week I worked on creating a testing plan so that we can make a list of what data we need to collect and what parts of the system we need working to collect it. This testing document includes a list of repeatable steps for measuring the accuracy of our speech-to-text output, as well as a list of parameters we can vary between tests (e.g., script content, speakers speaking separately vs. simultaneously, speaker positioning relative to each other and to our device, frequency range of the speech). Many of these parameters I took from my notes on the feedback Dr. Sullivan gave us on our system design when we met with him on Wednesday.
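The plan doesn't fix a specific accuracy metric in this write-up; the standard choice for speech-to-text output is word error rate (WER), which can be computed with a word-level edit distance, as in this sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```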

For the sake of collecting test data, I also found some YouTube videos of people speaking basic English phrases (mostly videos made for English learners). We'll have to test whether our speech-to-text pipeline performs any differently on live vs. recorded speech; if it doesn't, we could use these videos to let one person collect two-speaker test data on their own (which could be easier for us, logistically).

I also helped edit Larry’s Design Review slides. Specifically, I changed the formatting of parts of our block diagrams to make them easier to understand, and for our testing plan I combined our capture-to-display delay tests for video and for captions into a single test, since we decided this week to add captions to our video before showing it to the user.

I think we are currently mostly on schedule. Our main goal this week was to hammer out the details of our design for the design presentation. As a team, we went through all of the components of our system and decided how we want to transfer data between them, which was a big gap in our design before this week. We had initially planned to have an initial beamforming algorithm completed by this week; however, in our meeting on Wednesday we decided, based on Dr. Sullivan’s feedback, to try a linear rather than circular mic array (which affects the beamforming algorithm). Charlie and I will work this week on finishing a beamforming algorithm that we can start running test data through and improving.

In the next week, I plan to finish writing a testing plan for our audio data collection, so that we can log what types of data we need and what we’ve collected, and so we can collect data in a repeatable manner. Charlie and I will collect more audio data and work on the beamforming algorithm so that we can test an iteration of the speech-to-text pipeline this or next week. I also plan to work on our written design report and hopefully have a draft of that done by the end of the week.

In the next week I’ll also try to find a linear mic array that can connect to our Jetson TX2, since we haven’t found one yet.

Team Status Report for 19 February 2022

Currently, the most significant risk that can jeopardize our project is that we may not be able to separate the speakers well enough for the Speech to Text model to produce usable captions. We spoke with Professor Sullivan about our circular microphone array, and he strongly recommended the use of a linear array for our application. There don’t seem to be any great options for prebuilt linear arrays online, as we could only find one specifically for the Raspberry Pi. The estimated shipping time for that array is a month, so for now we plan to continue working with the UMA-8. If the UMA-8 is too small for both beamforming and STFT, we will have to try building our own array out of separate microphones. This approach will add cost and potentially take a lot more time. None of us are familiar with the steps involved in recording from multiple microphones, so we hope to avoid that complication.

One of the main changes we made from the proposal presentation is the use of a Jetson TX2 for all of the processing. We wanted to limit the amount of data movement that we would have to deal with, and the Jetson TX2 also provides consistent processing and I/O capability compared to the variability of the user’s laptop. Another key design choice was to use an HDMI-to-USB video capture card to transfer our final output to the user’s laptop. We based this on the iContact project from Fall 2020. Both of these changes should greatly simplify our design and allow us to focus on the sound processing.

Our schedule remains pretty much the same as the one presented in the proposal presentation. Instead of having to worry about circuit wiring, however, we now just have to deal with the video capture card.

We were able to successfully use the TX2 to interface with the webcam and UMA-8 through a USB hub. We have now started to work with the video and audio data of what we hope to be our final components.

Larry’s Status Report for 19 February 2022

This week, I worked on both the design presentation and the initial testing of the components that we received. I got the demo of Detectron2 running on the TX2 and was able to record 7 channels using the UMA-8 microphone array. Installing all the required software packages took some time, but was fairly simple all things considered. I installed Detectron2 from source and had to use Nvidia’s instructions for installing PyTorch on Jetsons with CUDA enabled. Below is a picture of the image segmentation demo from Detectron2.
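As a sketch of the capture step (the device name and sample rate below are assumptions to check against the actual setup, e.g., with python -m sounddevice), all 7 channels can be recorded at once with the sounddevice library:

```python
import sounddevice as sd
from scipy.io import wavfile

FS = 48000        # assumed sample rate; verify what the UMA-8 exposes
SECONDS = 10
DEVICE = "UMA8"   # hypothetical device name; list devices to confirm

# Record all 7 channels (6 mics in a ring plus 1 center mic) at once.
audio = sd.rec(int(SECONDS * FS), samplerate=FS, channels=7,
               device=DEVICE, dtype="int16")
sd.wait()  # block until the recording finishes
wavfile.write("uma8_capture.wav", FS, audio)
```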

Since I will be presenting for the design presentation, I put the majority of my time this week towards that.

I believe that we are definitely on schedule so far. We have the high level design figured out and have confirmed that our purchased components work together. Of course, we have yet to tackle the hardest parts of the project.

Next week, I will have completed the design presentation and will begin working on using the image segmentation data to do angle estimation. As a deliverable, I hope to produce an accurate angle for a single person in the webcam view.

Charlie’s Status Report for 19 February 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I sent the purchase orders for the components that my team will need for our project. Most of our parts have arrived. I tested our webcam and microphone array by plugging them into my computer and confirming that they worked.

To start working on our speech separation techniques, I recorded audio with two speakers located on opposite sides of the circular microphone array. I also measured the spacing between the microphones, which is a crucial parameter for delay-and-sum beamforming.

My team and I have also finalized our design. More specifically, we decided that in order to make our system real-time, we will broadcast our video output with captions on a computer and then screenshare it. We came up with this solution after reviewing the challenges that previous capstone groups faced with real-time streaming of video and audio data.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule, since the focus of this week was the design presentation that Larry is giving. Over the coming weekend, I will work closely with Larry to help him prepare for his presentation.

What deliverables do you hope to complete in the next week?

In the next week, Stella and I will begin collecting more audio files. The audio files we have collected so far were recorded in a noisy room, which made it very difficult for us to separate speech properly. I am thinking of collecting our new audio files in a soundproof room in Fifth Commons.

I will also start to come up with designs for microphone beamforming and short-time Fourier transform (STFT) speech separation techniques. In the following week, I will test my speech separation techniques on the audio files that I have collected.

Stella’s Status Report for 12 February 2022

In the first half of this week, I helped Charlie and Larry put together our proposal presentation and helped Charlie edit his script for the presentation. In particular, I worked on defining our stakeholder (use case) requirements and on our solution approach diagram and testing plans. I took notes during Charlie’s presentation on questions that came up for me and from other students, and I used these notes to figure out what to ask Dr. Sullivan after our presentation on Wednesday. After the presentation, our team met up to discuss next steps. We agreed that Charlie and I would choose which microphones to order and would start on our beamforming algorithm.

Using the advice we got from Dr. Sullivan, I searched for possible microphone array options (including pre-built arrays and parts we could use to build our own). The most important criteria I was looking for were: 1) that we could access the data from each audio channel (one channel per mic) to process them on a computer, and 2) for pre-built arrays, that the mics were spaced far enough apart to distinguish two speakers. Charlie and I met up to discuss mic options and decided on a pre-built circular mic array that can connect to the Jetson we’re using.

In order to start working on the beamforming algorithm, I reviewed the beamforming material from the end of 18-792 and started to look into existing MATLAB code for beamforming. Charlie and I also met to discuss the beamforming algorithm, and we decided to go with a circular mic array rather than a linear one, so in the coming week we will be figuring out the math for circular-array beamforming (we’ve only learned the linear case so far). By the end of the week I aim to have at least an outline of our entire algorithm so that we can start feeding in mic data as soon as we get the mic array. I will also be looking for individual mics we could use to build our own array, as a backup plan in case we have issues using a pre-built one.

We have a list of all parts we want to order now, and we have clear next steps for developing our beamforming algorithm, so I think we’re on track according to our schedule.

Team Status Report for 12 February 2022

The most significant risk that could jeopardize our project is whether we can receive more than two channels of audio input. We are currently considering two solutions. The first is to purchase a prebuilt microphone array and then design more complex speech separation algorithms (since we cannot adjust the spacing of the microphones). The second is to purchase a USB hub and wire USB microphones to it. This would let us design less complex beamforming algorithms, since we could vary the positions of the microphones. The current plan is to first use the prebuilt microphone array with the more complex algorithms; if that does not work, we can still use those algorithms on top of the flexibility of repositioning the microphones.

One of the changes we made is to purchase a prebuilt microphone array instead of building one ourselves, because we do not know how to create multichannel inputs by soldering microphones together. This change was also suggested by Professor Sullivan. Just in case, we purchased multiple such boards to test, which increased our spending slightly.

No changes have been made to our current schedule that we discussed in the proposal presentation.

Photos of our current progress can be found in Charlie’s report. This week we primarily worked on our proposal presentation, so we do not have many design graphics to show.