Stella’s Status Report for 19 February 2022

Now that we have our mic array and webcam, we can start collecting data. This week I worked on creating a testing plan, listing what data we need to collect and what parts of the system need to be working to collect it. The testing document includes a list of repeatable steps for measuring the accuracy of our speech-to-text output, as well as a list of parameters we can vary between tests (e.g., script content, speakers speaking separately vs. simultaneously, speaker positioning relative to each other and to our device, frequency range of the speech). Many of these parameters came from my notes on the feedback Dr. Sullivan gave us on our system design when we met with him on Wednesday.
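One standard way to score these accuracy tests would be word error rate (WER): substitutions, insertions, and deletions between our caption output and the script, divided by the script length. Here's a rough sketch of how we might compute it (illustrative only, not our final test code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

A perfect transcript scores 0.0; dropping two words from a six-word script scores 2/6.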

For the sake of collecting test data, I also found some videos on YouTube of people speaking basic English phrases (mostly videos made for people learning English). We'll have to test whether our speech-to-text pipeline performs differently on live vs. recorded speech, but if it doesn't, these videos would let one person collect 2-speaker test data on their own (which could be easier for us, logistically).

I also helped edit Larry’s Design Review slides. Specifically I changed the formatting of parts of our block diagrams to make them easier to understand, and for our testing plan I combined our capture-to-display delay tests for video and for captions into a single test, since we decided this week to add captions to our video before showing it to the user.

I think we are currently mostly on schedule. Our main goal this week was to hammer out the details of our design for the design presentation. As a team, we went through all of the components of our system and decided how we want to transfer data between them, which was a big gap in our design before this week. We had planned to have an initial beamforming algorithm completed by this week; however, in our meeting on Wednesday we decided, based on Dr. Sullivan's feedback, to try a linear rather than circular mic array (which affects the beamforming algorithm). Charlie and I will work this week on finishing a beamforming algorithm that we can start running test data through and improving.

In the next week, I plan to finish writing a testing plan for our audio data collection, so that we can log what types of data we need and what we’ve collected, and so we can collect data in a repeatable manner. Charlie and I will collect more audio data and work on the beamforming algorithm so that we can test an iteration of the speech-to-text pipeline this or next week. I also plan to work on our written design report and hopefully have a draft of that done by the end of the week.

In the next week I’ll also try to find a linear mic array that can connect to our Jetson TX2, since we haven’t found one yet.

Team Status Report for 19 February 2022

Currently, the most significant risk that could jeopardize our project is that we may not be able to separate the speakers well enough for the speech-to-text model to produce usable captions. We spoke with Professor Sullivan about our circular microphone array, and he strongly recommended a linear array for our application. There don't seem to be any great options for prebuilt linear arrays online; the only one we could find is made specifically for the Raspberry Pi, and its estimated shipping time is a month, so for now we plan to continue working with the UMA-8. If the UMA-8 turns out to be too small for both beamforming and STFT-based separation, we will have to try building our own array out of separate microphones, which would add cost and potentially take much more time. None of us are familiar with the steps involved in recording from multiple separate microphones, so we hope to avoid that complication.

One of the main changes we made from the proposal presentation is the use of a Jetson TX2 for all of the processing. We wanted to limit the amount of data movement we would have to deal with, and the TX2 also provides consistent processing and I/O capability compared to the variability of the user's laptop. Another key design choice was to use an HDMI-to-USB video capture card to transfer our final output to the user's laptop, an approach we based on the iContact project from Fall 2020. Both of these changes should greatly simplify our design and allow us to focus on the sound processing.

Our schedule remains pretty much the same as the one presented in the proposal presentation. Instead of having to worry about circuit wiring, however, we now just have to deal with the video capture card.

We were able to successfully use the TX2 to interface with the webcam and the UMA-8 through a USB hub, and we have now started working with video and audio data from what we hope will be our final components.

Larry’s Status Report for 19 February 2022

This week, I worked on both the design presentation and initial testing of the components we received. I got the Detectron2 demo running on the TX2 and was able to record 7 channels with the UMA-8 microphone array. Installing all the required software packages took some time but was fairly simple, all things considered: I installed Detectron2 from source and had to follow Nvidia's instructions for installing PyTorch on Jetsons with CUDA enabled. Below is a picture of the image segmentation demo from Detectron2.

Since I will be presenting for the design presentation, I put the majority of my time this week towards that.

I believe that we are definitely on schedule so far. We have the high level design figured out and have confirmed that our purchased components work together. Of course, we have yet to tackle the hardest parts of the project.

Next week, I will have completed the design presentation and will begin working on using the image segmentation data to do angle estimation. As a deliverable, I hope to produce an accurate angle for a single person in the webcam view.
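For reference, here is roughly the math I have in mind for the angle estimate: with a pinhole camera model, the horizontal pixel position of a detected person maps to an azimuth angle given the webcam's horizontal field of view. This is just a sketch; the 78° FOV below is a placeholder, not a measured value for our webcam.

```python
import math

def pixel_to_azimuth(x_center: float, frame_width: int, hfov_deg: float = 78.0) -> float:
    """Map a horizontal pixel coordinate to an azimuth angle in degrees.

    0 degrees is straight ahead; negative is left of center, positive is right.
    Assumes a pinhole camera centered on the optical axis.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    f_px = (frame_width / 2) / math.tan(math.radians(hfov_deg) / 2)
    return math.degrees(math.atan((x_center - frame_width / 2) / f_px))
```

The x-center would come from the bounding box of the person detection; a box centered in a 1280-pixel-wide frame maps to 0°, and the frame edges map to ±(hfov/2).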

Charlie’s status report for 19 February 2022

What did you personally accomplish this week on the project? Give files or photos that demonstrate your progress. Prove to the reader that you put sufficient effort into the project over the course of the week (12+ hours).

This week, I sent the purchase orders for the components that my team needs for our project. Most of our parts have arrived. I tested that our webcam and microphone array work by plugging them into my computer.

To start working on our speech separation techniques, I recorded audio with two speakers located on opposite sides of the circular microphone array. I also measured the spacing between the microphones, which is a crucial parameter for delay-and-sum beamforming.
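The core delay-and-sum idea is to delay each microphone's signal so a wavefront from the target direction lines up across channels, then average. A toy numpy sketch with integer-sample delays (real steering delays would come from the measured geometry and would generally be fractional):

```python
import numpy as np

def delay_and_sum(channels: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """Align and average multichannel audio.

    channels: (num_mics, num_samples) array of recorded signals.
    delays: per-mic steering delays in whole samples; each channel is
    advanced by its delay so a source in the steered direction adds
    coherently while off-axis sources partially cancel.
    """
    num_mics, n = channels.shape
    out = np.zeros(n)
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -int(d))  # advance by the steering delay
    return out / num_mics
```

With the measured mic spacing, the delays for a given look direction are just (spacing projected onto the direction) divided by the speed of sound, converted to samples.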

My team and I have also finalized our design. Specifically, we decided that in order to make our system real-time, we will broadcast our video output with captions on a computer and then use screensharing. We came up with this solution after reviewing the challenges that previous capstone groups faced with real-time streaming of video and audio data.

Is your progress on schedule or behind? If you are behind, what actions will be taken to catch up to the project schedule?

We are currently on schedule, since the focus of this week was the design presentation that Larry is giving. Over the coming weekend, I will work closely with Larry to help him prepare for his presentation.

What deliverables do you hope to complete in the next week?

In the next week, Stella and I will begin collecting more audio files. The audio files we have collected so far were recorded in a noisy room, which made it very difficult for us to separate speech properly. I am thinking of collecting our audio files in a soundproof room in Fifth Commons.

I will also start to come up with some designs for microphone beamforming and Short-Time Fourier Transform (STFT) speech separation techniques. In the following week, I will test these techniques on the audio files I have collected.
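One simple STFT-based technique I may try is binary masking: steer one beam at each speaker, then keep each time-frequency bin in whichever beam is louder. A rough numpy sketch (the frame and hop sizes here are placeholders, not tuned values):

```python
import numpy as np

def stft(x: np.ndarray, frame: int = 256, hop: int = 128) -> np.ndarray:
    """Short-Time Fourier Transform with a Hann window; rows are frames."""
    win = np.hanning(frame)
    frames = [x[i:i + frame] * win for i in range(0, len(x) - frame + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def binary_mask(beam_a: np.ndarray, beam_b: np.ndarray) -> np.ndarray:
    """Return the STFT of beam A with bins dominated by beam B zeroed out.

    beam_a, beam_b: time-domain outputs of two beams, one steered at each
    speaker. Each time-frequency bin is kept only where speaker A is louder.
    """
    A, B = stft(beam_a), stft(beam_b)
    mask = np.abs(A) >= np.abs(B)  # True where speaker A dominates
    return A * mask
```

The masked STFT would then be inverted back to audio before feeding it to the speech-to-text model; the inverse transform is omitted here for brevity.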

Stella’s Status Report for 12 Feb 2022

In the first half of this week, I helped Charlie and Larry put together our proposal presentation and helped Charlie edit his script for the presentation. In particular, I worked on defining our stakeholder (use case) requirements and on our solution-approach diagram and testing plans. During Charlie's presentation I took notes on questions that occurred to me and questions raised by other students, and I used these notes to figure out what to ask Dr. Sullivan after our presentation on Wednesday. Afterward, our team met up to discuss our next steps. We decided that Charlie and I would choose which microphones to order and would start on our beamforming algorithm.

Using the advice we got from Dr. Sullivan, I searched for possible microphone array options (including pre-built arrays and parts we could use to build our own). The most important criteria I was looking for were: 1) that we could access the data from each audio channel (one channel per mic) to process them on a computer, and 2) for pre-built arrays, that the mics were spaced far enough apart to distinguish two speakers. Charlie and I met up to discuss mic options and decided on a pre-built circular mic array that can connect to the Jetson we’re using.

In order to start working on the beamforming algorithm, I reviewed the beamforming material from the end of 18-792 and started to look into existing MATLAB code for beamforming. Charlie and I met up to discuss the algorithm, and there we decided to go with a circular mic array rather than a linear one, so in the coming week we will be figuring out the math for circular-array beamforming (we've only learned the linear case so far). By the end of the week I aim to have at least an outline of our entire algorithm so that we can start feeding in mic data as soon as the array arrives. I will also be looking for individual mics we could use to set up our own array, as a backup plan in case we have issues using a pre-built array.
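As a starting point on the circular-array math: for a far-field plane wave arriving from azimuth θ, each mic on a circle of radius r sees the wavefront earlier or later depending on how its position projects onto the arrival direction. This is just my working sketch of the steering-delay geometry (far-field assumption, speed of sound 343 m/s), not yet tested against the real array:

```python
import numpy as np

def circular_steering_delays(num_mics: int, radius_m: float, azimuth_rad: float,
                             c: float = 343.0) -> np.ndarray:
    """Per-mic delays (seconds) for a far-field plane wave on a circular array.

    Mic k sits at angle 2*pi*k/num_mics on a circle of the given radius.
    A negative delay means that mic hears the wavefront before the array
    center; mics facing the source hear it first.
    """
    mic_angles = 2 * np.pi * np.arange(num_mics) / num_mics
    # Projection of each mic position onto the arrival direction.
    proj = radius_m * np.cos(mic_angles - azimuth_rad)
    return -proj / c
```

Applying the negated delays to the recorded channels (delay-and-sum) would then align the wavefront from the chosen azimuth.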

We have a list of all parts we want to order now, and we have clear next steps for developing our beamforming algorithm, so I think we’re on track according to our schedule.

Team Status Report for 12 February 2022

The most significant risk that could jeopardize our project is how to receive more than two channels of audio input. We are currently considering two solutions. The first is to purchase a prebuilt microphone array and then design more complex speech separation algorithms (since we cannot adjust the distance between the microphones). The second is to purchase a USB hub and wire USB microphones to it; this would allow less complex beamforming algorithms, since we could vary the positions of the microphones. The current plan is to first try the prebuilt array with the more complex algorithms, and if that does not work, we can still use those algorithms on top of the flexibility of repositioning individual microphones.

One of the changes we made is to purchase a prebuilt microphone array instead of building one ourselves, since we do not know how to create multichannel inputs by soldering individual microphones together. This was also Professor Sullivan's suggestion. Just in case, we purchased multiple such boards to test, which increased our spending slightly.

No changes have been made to our current schedule that we discussed in the proposal presentation.

Photos of our current progress can be found in Charlie's report. This week we primarily worked on our proposal presentation, so we do not have many design graphics to show.

Larry’s Status Report for 12 February 2022

For the first part of this week, I helped Charlie construct his presentation. I also requested and received a Jetson TX2, which we are considering using. The biggest unknown for me is how we plan on capturing and moving data around. We proposed capturing the data stream on the Pi and sending it to a laptop for processing, but doing both on the TX2 would remove the extra step.

I attempted to reflash the TX2 using my personal computer, but I do not have the correct Ubuntu version. Fortunately, the previous group left the default password on, so we could just remove their files and add ours. Space is extremely limited right now without an SD card, so I poked around and preliminarily deleted about a gigabyte in files.

Since the current plan is for me to do the design presentation, I also began looking through the guidance document and structuring the presentation. I should have all of it done by next week.

Overall, I think that I am on schedule or maybe slightly behind. There are still many aspects of the design that need to be hashed out.

Charlie’s Status Report for 12 February 2022

This week, I gave the proposal presentation on behalf of my team. In the presentation, I discussed the use case of our product and our proposed solution to the problem. I received some very interesting questions from the other teams. One of my favorites was the possibility of using facial-recognition deep learning models to do lip reading. While I doubt my team will adopt such techniques due to their complexity, the topic does pique my interest as a future research direction.

I also designed a simple illustration of our solution, as shown below.

 

I think my illustration really helped my audience to understand my solution. It is extremely intuitive yet representative of my idea.

After the presentation on Wednesday, my team and I met up in person to work on our design. We decided that the most logical next step was to start working on the design presentation, since it will help us figure out what components we need to obtain. We also booked a Jetson because we wanted to make sure Larry could figure out how to use it, given that he is the embedded person in our group.

I was also concerned that we might need to upsample our 8 kHz audio to 16 kHz or 44.1 kHz in order to feed it into our speech-to-text (STT) model. I tested this and confirmed that even an 8 kHz sampling rate was sufficient for the STT to work.
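In case we do end up needing to upsample later, a quick-and-dirty approach is linear interpolation between sample times. A real pipeline would use a proper polyphase resampler, but this shows the idea:

```python
import numpy as np

def upsample(audio: np.ndarray, sr_in: int, sr_out: int) -> np.ndarray:
    """Naive upsampling by linear interpolation between sample times."""
    duration = len(audio) / sr_in
    t_in = np.arange(len(audio)) / sr_in        # original sample times
    t_out = np.arange(int(duration * sr_out)) / sr_out  # target sample times
    return np.interp(t_out, t_in, audio)
```

Going from 8 kHz to 16 kHz, every other output sample lands exactly on an original sample, and the rest are midpoints of their neighbors.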

We are currently on track with our progress, with reference to our Gantt chart.

We should have a list of items to order ready to submit by the end of this week. Stella and I will start designing the beamforming algorithm this weekend, and Larry will start working on his presentation.