The most significant risk to our project right now is whether we can separate speech from two sources (i.e., two people speaking) well enough for our speech-to-text model to generate separate, accurate captions for each speaker. Two elements determine how well separated the speech is: 1) our microphone array and 2) our speech separation algorithm. Currently, our only microphone array is still the UMA-8 circular array. Over the past week we searched for linear mic arrays that could connect directly to our Jetson TX2 but didn't find any. We did find two other options: 1) a 4-mic linear array that we can use with a Raspberry Pi, and 2) a USBStreamer Kit to which we can connect multiple I2S MEMS mics, and which in turn connects to the Jetson TX2. The challenge with both of these options is that we would be taking in data from multiple separate mics or mic arrays, so we would need to synchronize the incoming audio streams properly. Our current plan remains to get speech separation working with the UMA-8, while looking for and buying backup parts in case the UMA-8 alone cannot separate speech well enough.
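To make the synchronization challenge concrete: if two capture devices start recording at slightly different times, the streams must be aligned before any separation algorithm can use them together. A minimal sketch (a hypothetical helper, assuming NumPy; real capture would also have to handle clock drift, not just a fixed start offset) that estimates the sample offset between two mono streams by cross-correlation and trims them to a common start:

```python
import numpy as np

def align_streams(a, b):
    """Trim two mono float arrays so their shared content starts at the
    same sample index, estimating the offset by full cross-correlation."""
    corr = np.correlate(a, b, mode="full")
    # Index 0 of 'full' output corresponds to lag -(len(b) - 1).
    lag = int(np.argmax(corr)) - (len(b) - 1)
    if lag > 0:        # a's content is delayed: drop its leading samples
        a = a[lag:]
    elif lag < 0:      # b's content is delayed: drop its leading samples
        b = b[-lag:]
    n = min(len(a), len(b))
    return a[:n], b[:n]
```

For example, if stream `b` is the same audio as `a` but captured 150 samples late, `align_streams` would drop `b`'s 150 leading samples so both arrays line up.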
We have made no changes to our design since last week; instead, we have been implementing aspects of it. We have, however, developed new ideas for how to set up our mic array, and we now have candidate solutions that would let us use more mics in a linear array should we need to pivot from our current design.
We have made two changes to our schedule this week. First, we have scheduled more time to implement speech separation, since this part of the project is proving more challenging than we had initially thought. Second, we have scheduled time to work on our design proposal (which we neglected to include in our original schedule).
This week we made progress on our image processing pipeline. We implemented webcam calibration (to correct the distortion that our wide-angle camera lens introduces, which would otherwise skew distance estimates) and Detectron2 image segmentation to identify the different people in an image. This coming week we will implement more of both the image processing and audio processing pipelines.