Larry’s Status Report for 30 April 2022

Last week, I spent some time adjusting the caption generation. I wrote a simple recursive algorithm that moves text onto a new line instead of letting it run off-screen, though it currently splits words in half at the break point. I still have to adjust it so that captions break at word boundaries, but that is not a difficult problem.
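
For reference, this is a minimal sketch of the word-aware wrapping I have in mind, not the current implementation; it assumes cv2.getTextSize is used to measure the rendered width, and the parameter values are placeholders:

    import cv2

    def wrap_caption(text, max_width, font=cv2.FONT_HERSHEY_SIMPLEX,
                     scale=1.0, thickness=2):
        """Recursively split a caption into lines that fit within max_width
        pixels, breaking at spaces so words are never split in half."""
        width, _ = cv2.getTextSize(text, font, scale, thickness)[0]
        if width <= max_width or " " not in text:
            return [text]
        # walk backward through the spaces until the prefix fits on one line
        cut = len(text)
        while cut > 0:
            cut = text.rfind(" ", 0, cut)
            if cut <= 0:
                break
            prefix_width, _ = cv2.getTextSize(text[:cut], font, scale, thickness)[0]
            if prefix_width <= max_width:
                return [text[:cut]] + wrap_caption(text[cut + 1:], max_width,
                                                   font, scale, thickness)
        # no break point fits: keep the remainder on one line rather than split a word
        return [text]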

I also spent time looking over Stella’s final presentation and providing a small amount of feedback.

Another thing I did this week was to integrate the new microphones that we purchased with the scripts that we were running on the Jetson TX2. Surprisingly, we only had to change the sample rates in a few areas. Everything else worked pretty much out of the box. We recorded a lot more data with the new microphones. Here is a good example of a full overlap recording, showcasing both the newer captions and the higher quality microphones:

https://drive.google.com/file/d/1MFlt5AUgrVL5hiOT9XV_zveZVAu-saj3/view?usp=sharing

For comparison, this is a recording from our previous status reports that we made using the microphone array:

https://drive.google.com/file/d/1MmEE7Yh0Kxe5wChuq5n5rHsnGKMOyZYr/view?usp=sharing

The difference is pretty stark.
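
On the software side, the only change the new microphones required was updating the sample rate wherever the audio caps are set. A rough sketch of where that shows up, assuming a GStreamer-style recording pipeline; the device name and rate below are placeholders rather than our actual values:

    MIC_DEVICE = "hw:2,0"   # placeholder ALSA device for the new microphones
    SAMPLE_RATE = 48000     # placeholder rate; the new mics differ from the array

    audio_branch = (
        f"alsasrc device={MIC_DEVICE} ! "
        f"audio/x-raw,rate={SAMPLE_RATE},channels=2 ! "
        "audioconvert ! audioresample"
    )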

Currently, we are on schedule for producing an integrated final product. We definitely do not have time for a real-time implementation, and we are discussing whether we have enough time to create some sort of enclosure for our product. Given how busy the next week will be, I doubt we will be able to do anything substantive on an enclosure.

By next week, we will have completed the final demo. The main items we have to finish are the integration with the website, the data analysis, the final poster, and the final video.

Larry’s Status Report for 23 April 2022

This week, I worked on finishing the integration of all the various subsystems that we developed. One challenge I encountered was that when I tried to combine everything into one Python script, the system ran out of memory. For now, I think we will split the system into several Python scripts called from one shell script, and avoid recording for more than ~20 seconds. So far, this strategy seems to be working, and I am almost surprised at how well everything functions. The caption accuracy leaves something to be desired when two people overlap their speech, but otherwise the whole system is usable.
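
The actual glue is a short shell script, but the idea is simply to run each stage as its own process so that memory is released between stages. A rough Python equivalent of that flow, with the script names being placeholders:

    import subprocess

    RECORD_SECONDS = 20  # keep recordings short to stay within memory limits

    # each stage runs in its own interpreter, so its memory is freed on exit
    stages = [
        ["python3", "record_av.py", "--seconds", str(RECORD_SECONDS)],
        ["python3", "separate_speech.py"],    # deep learning separation
        ["python3", "generate_captions.py"],  # speech-to-text
        ["python3", "overlay_captions.py"],   # burn captions onto the video
    ]

    for cmd in stages:
        subprocess.run(cmd, check=True)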

Here is a video sample with fully overlapping speech:
https://drive.google.com/file/d/1MmEE7Yh0Kxe5wChuq5n5rHsnGKMOyZYr/view?usp=sharing

One thing that still needs work is keeping the captions within the video frame boundaries and away from each other. I will clean up this issue next week, and do not anticipate much work involved. The other main deliverable yet to be finished is the integration of Charlie's website onto the Jetson TX2. Additionally, we ordered extra microphones that we still need to test with our setup. Finally, the system currently only uses the deep learning approach to separate speech; it would be interesting to try overlaying captions generated using the signal processing approach as well. Once all that is done, we will have a finished product.

So far, the project is on schedule. I believe we left enough slack time and planned enough contingencies to produce something usable for the final demo. It is possible that we will struggle with the website, in which case we could rapidly develop something that works locally. Charlie seems confident in what he has developed, however, so we probably won't need to change our plans.

By next week, I hope to have the deliverables I mentioned above done. I will also be helping Stella with the final presentation.

 

Larry’s Status Report for 16 April 2022

This week, I worked on installing all relevant libraries in a Python 3.8 virtual environment. I was able to successfully test all of the relevant Python components, and they seem to be working reasonably well. One concern that Charlie had was that the deep learning speech separation was running extremely slowly, even after he had enabled GPU acceleration. This is something that we will look at in the coming weeks.

I also worked more on integrating everything together, and in particular fixed some small mistakes I made while doing the caption generation. The IBM Watson Speech-to-Text API splits the generated text into chunks, each with its own confidence level. I noticed, however, that the highest-confidence chunk usually captions only a portion of the audio. For now, I will use all the text that the API returns, since keeping only the highest-confidence chunk leaves obvious gaps in the captions.
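
Concretely, that means walking every result in the JSON response instead of keeping only the most confident one. A minimal sketch, assuming the request is made with timestamps=True (the file name and credentials are placeholders):

    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    stt = SpeechToTextV1(authenticator=IAMAuthenticator("API_KEY"))  # placeholder
    stt.set_service_url("SERVICE_URL")  # placeholder

    with open("speaker0.wav", "rb") as audio:
        response = stt.recognize(audio=audio, content_type="audio/wav",
                                 timestamps=True).get_result()

    # collect word-level timestamps from every result, not just the
    # highest-confidence one, so there are no gaps in the captions
    words = []
    for result in response.get("results", []):
        for alt in result.get("alternatives", []):  # one alternative by default
            words.extend(alt.get("timestamps", []))  # [word, start_sec, end_sec]
    words.sort(key=lambda w: w[1])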

There aren’t any more interesting results for me to share here, since all I have been working on is fixing the Python libraries and integrating scripts. Overall, I would say that I am still slightly behind schedule. I hoped to have every component working together with the push of a button by now, but we are not quite there yet. If we give up on the real-time aspect, however, there is not that much left to do. Charlie and I will work together on setting up the web server, after which we will have a final product to present.

By next week, I definitely should have the entire system working together. We should be able to run a single script to record video/audio, generate and overlay captions, and add stereo audio to the video output. We also should have a basic website running through the Jetson TX2.

Team Status Report for 16 April 2022

The most significant risk that could jeopardize our project is that we are not able to put together a strong user experience. We believe we have all of the parts working and relatively integrated, but we have very little experience creating a good UI. Our current plan is to create a website that lets the user record and download video. If we are unable to create a decent website, our contingency plan is to fall back on the HDMI to USB converter that we purchased. The user would then have to connect their personal laptop to the Jetson to see the output.

We are still having difficulty using signal processing techniques such as PDCW and SSF to separate speech, and are most likely moving forward with the deep learning approach. We are quite confident in the deep learning approach at this time. One risk associated with it is that it may take a very long time to run, potentially degrading the user experience. We are currently trying to work on speeding up the processing, but may have to settle for a much less responsive final product.

The only major change we have made to our system is that we are attempting to run a web server on the Jetson TX2 to provide a nice user interface. We wanted an interface that was both easy to implement and easy to use, and settled on a web server. There are no extra costs incurred by this change, and we have the time to work on it.

In our schedule, Charlie will now be working on the website for most of the remaining time. We have no other changes to our schedule.

Early progress on the website:

Larry’s Status Report for 10 April 2022

This week, I worked on writing and integrating the last few parts of the deep learning and non-real-time approach to our project. Instead of my previous approach of combining OpenCV and PyAudio into one Python script, I simply used GStreamer to record audio and video at the same time. I also extracted the timestamped captions and overlaid them onto the video without too much trouble. Example below:

I am currently trying to install all of the relevant libraries into a Python 3.8 virtual environment. Running the deep learning solution requires Python 3.8, and I figured that moving everything to a Python 3.8 virtual environment would make later testing and integration a lot easier. As I mentioned in my last report, some components required Python 3.7 while others were installed on the default Python 3.6.

Using GStreamer directly on the command line instead of through OpenCV means that we do not have to compile OpenCV with GStreamer support, which is convenient for our switch to Python 3.8. I have not finished building PyTorch and Detectron2 for 3.8 yet, but so far I do not anticipate any major issues with this change.
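
For reference, this is roughly the shape of the command-line pipeline, launched from Python; it is a simplified sketch rather than our exact pipeline, and the device, resolution, and encoder choices here are just placeholder values:

    import subprocess

    # CSI camera video and ALSA microphone audio muxed into one file;
    # -e makes GStreamer finalize the file cleanly when interrupted
    pipeline = (
        "gst-launch-1.0 -e "
        "nvarguscamerasrc ! "
        "'video/x-raw(memory:NVMM),width=1280,height=720,framerate=30/1' ! "
        "nvvidconv ! x264enc ! h264parse ! matroskamux name=mux ! "
        "filesink location=recording.mkv "
        "alsasrc device=hw:2,0 ! audioconvert ! audioresample ! vorbisenc ! mux."
    )
    subprocess.run(pipeline, shell=True, check=True)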

I am currently slightly behind schedule, since I wanted to have some thoughts on how we would build a real-time implementation by now. Given the amount of time left, a real-time implementation may not be feasible. This is something we envisioned from the beginning of the project, so it does not significantly change our plans. In the context of working only with a non-real-time implementation, I am on schedule.

By next week, I hope to have every component working together in a streamlined fashion. We should easily be able to record and produce a captioned video with a command. Instead of focusing on real-time, we may pivot toward working on a nice GUI for our project. I hope to have worked on either one or the other by the end of this week.

Larry’s Status Report for 2 April 2022

This week, I worked on using IBM Watson's Speech-To-Text API in Python. The only trouble I had was that using the API required Python 3.7, while the default installed on the system was Python 3.6. Since I built OpenCV for Python 3.6 and do not want to go through the trouble of making everything work under a single version, I will try to use Python 3.7 for the Speech-To-Text and Python 3.6 for everything else. These are some of the timestamped captions that I generated:

Since we have the Interim Demo in the upcoming week, I focused on trying to put together a non-real-time demonstration. I have not managed to completely figure out how to record video and audio at the same time, but I was able to do both well enough to produce new recordings for Charlie and Stella to work with. I also worked out a lot of details with Charlie about how data should be passed around. As we are not targeting real-time, we will just be generating and using files. We currently have the means to produce separated audio, timestamped captions, and video with angle estimation, so we believe we can put together a good demo.

I am currently behind schedule, since I expected to have already placed prerecorded captions onto video. I have all the tools available for doing so, however, so I expect to be able to catch up in the next week. At this point, it is just a matter of writing a couple of Python scripts.
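
The overlay script itself should mostly be bookkeeping: for each frame, pick the words whose Watson timestamps cover the current time and draw them with putText. A minimal sketch of what I have in mind, with the file names and layout assumed:

    import cv2

    def overlay_captions(video_in, video_out, words):
        """words: list of (word, start_sec, end_sec) from the Watson timestamps."""
        cap = cv2.VideoCapture(video_in)
        fps = cap.get(cv2.CAP_PROP_FPS)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        out = cv2.VideoWriter(video_out, cv2.VideoWriter_fourcc(*"mp4v"),
                              fps, (w, h))
        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            t = frame_idx / fps
            # words whose timestamp interval covers the current frame time
            line = " ".join(word for word, start, end in words if start <= t <= end)
            if line:
                cv2.putText(frame, line, (50, h - 50), cv2.FONT_HERSHEY_SIMPLEX,
                            1.0, (255, 255, 255), 2)
            out.write(frame)
            frame_idx += 1
        cap.release()
        out.release()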

One aspect of our project I am worried about is that the Jetson TX2 may not be fast enough for real-time applications. While not an issue for the demo, I noticed a lot of slowdown when doing processing on real-time video. Next week, I will spend some time investigating how to speed up some of the processing. Other than a working demo as a deliverable, I want to have a more concrete understanding of where the bottlenecks are and how they may be addressed.

Team Status Report for 26 March 2022

The most significant risk that could jeopardize our project remains whether or not we are able to separate speech well enough to generate usable captions.

To manage this risk, we are simultaneously working on multiple approaches for speech separation. The deep learning approach is good enough for our MVP, though it may require a calibration step to match voices in the audio to people in the video. We want to avoid having a calibration step and are therefore continuing to develop a beamforming solution. Combining multiple approaches may end up being the best solution for our project. Our contingency plan, however, is to use deep learning with a calibration step. This solution is likely to work, but is also the least novel.

We have not made any significant changes to the existing design of the system. One thing we are considering now is how to determine the number of people that are currently speaking in the video. Information about the number of people currently speaking would help us avoid generating extraneous and incorrect captions. With the current prevalence of masks hiding mouth movement, we have to rely on an audio-based solution. Charlie is developing a clever solution that scales to two people, which is all we need given our project scope. We will likely integrate his solution with the audio processing code.

We have not made any further adjustments to our schedule.

One component that is now mostly working is angle estimation using the Jetson TX2 CSI camera, which we are using due to its slightly wider FOV and lower distortion.

In the above picture, the angle estimation is overlaid over the center position of the person.

Larry’s Status Report for 26 March 2022

This week, I spent most of my time trying to build OpenCV with GStreamer correctly. I made a lot of small and avoidable mistakes, such as not properly uninstalling other versions, but everything ended up working out.

Once I got OpenCV installed and working with Python 3, I updated the previous code I was using to work with the Jetson TX2’s included CSI camera. I calibrated the camera and also used putText to overlay angle text over the detected people in the camera’s video output. It’s a little hard to see, but the image below shows the red text on the right of the video.

There were a few oddities that I also spent time working out. Installing OpenCV with GStreamer support changed the default backend that VideoCapture used for the webcam. Once I learned what was going on, the fix was simply to change the backend to V4L when using the webcam. I also struggled with making the CSI camera output feel as responsive as the webcam. A tip online suggested setting the GStreamer appsink to drop buffers, which seemed to help.
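
For reference, a sketch of the two capture setups; the CSI pipeline caps and the webcam device index are placeholder assumptions:

    import cv2

    # USB webcam: force the V4L2 backend, since building OpenCV with GStreamer
    # changed the default backend VideoCapture picked for the webcam
    webcam = cv2.VideoCapture(0, cv2.CAP_V4L2)

    # CSI camera: GStreamer pipeline; drop=true on the appsink discards stale
    # frames so the output feels as responsive as the webcam
    csi_pipeline = (
        "nvarguscamerasrc ! "
        "video/x-raw(memory:NVMM),width=1280,height=720,framerate=30/1 ! "
        "nvvidconv ! video/x-raw,format=BGRx ! videoconvert ! "
        "video/x-raw,format=BGR ! appsink drop=true"
    )
    csi_cam = cv2.VideoCapture(csi_pipeline, cv2.CAP_GSTREAMER)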

Once I got the CSI camera working reasonably, I began investigating how to use the IBM Watson Speech-to-Text API. I did not get very far this week beyond making an account and copying some sample code.

I believe that I am still on schedule for my personal progress. Our group progress, however, may be behind schedule since we have not yet fully figured out any sort of beamforming. I will try to continue making steady progress on my part of the project, and I will also try to support my team members with ideas for the audio processing portion.

Next week, I hope to have speech-to-text working and somewhat integrated with the camera system. I want to be able to overlay un-separated real-time captions onto people in the video. If I cannot get the real-time captions working, I want to at least overlay captions from pre-recorded audio onto a real-time video.

Larry’s Status Report for 19 March 2022

This week, I worked on writing code for angle estimation by interfacing with the camera, identifying people in the scene, and providing angle estimates for each person. My last status report stated that I hoped to complete angle estimation by the end of this week, and while I am substantially closer, I am still not quite done.

With some help from Charlie, I have been able to use the video from the webcam to identify the pixel locations of each person. With an estimate of the camera calibration matrix, I have also produced angle estimates for the pixel locations. My main issue so far is that the angle estimates are not entirely accurate, primarily due to the strong fisheye effect of the webcam.

As seen in the image above, the webcam produces a greatly distorted image of a square whiteboard. While the camera calibration matrix can produce a good result for any pixel along the horizontal center of the image, it does not compensate for the distortion at the edges.
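
Setting the distortion aside, the pixel-to-angle conversion itself is just the pinhole model: a pixel column u maps to an angle of atan((u - cx) / fx) off the optical axis, where fx and cx come from the calibration matrix. A minimal sketch with placeholder values:

    import numpy as np

    # intrinsics from the estimated calibration matrix -- placeholder values
    fx, cx = 600.0, 320.0

    def pixel_to_angle(u):
        """Horizontal angle (degrees) of pixel column u from the optical axis,
        under an ideal pinhole model that ignores lens distortion."""
        return np.degrees(np.arctan2(u - cx, fx))

    # e.g. the horizontal center of a detected person's bounding box
    print(pixel_to_angle(480.0))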

Another thing I noticed was that while our webcam claimed to have a 140 degree FOV, I measured the horizontal FOV to be at best 60 degrees. The fisheye effect gives the impression of a wide-angle camera, but in reality the FOV does not meet our design requirements. I have decided to try to use the included camera on the TX2, which I initially deemed to have too narrow a field of view for our project.
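
One quick cross-check is the horizontal FOV implied by the pinhole model, 2 * atan(w / (2 * fx)) for an image w pixels wide. This is not how I measured it, and the numbers below are placeholders rather than the webcam's actual intrinsics:

    import math

    fx = 600.0          # focal length in pixels -- placeholder
    image_width = 640   # horizontal resolution in pixels -- placeholder

    hfov = 2 * math.degrees(math.atan(image_width / (2 * fx)))
    print(f"{hfov:.1f} degrees")  # about 56 degrees for these placeholder numbers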

The above image shows that the included TX2 camera (top left) has a horizontal FOV that is slightly better than the webcam (bottom right). What I am currently working on is trying to integrate the included camera with my existing code. The issue I struggled with at the end of this week was installing OpenCV with GStreamer support to use the CSI camera, which took many hours.

I believe that we are still generally on schedule, though further behind than we were last week. To ensure that we stay on schedule, I will try to focus on integrating more of the components together to allow for faster and more applicable testing. My main concern so far is how we will actually handle the speech separation, so finishing up all the aspects around speech separation should allow us to focus on it.

By next week, I hope to have the camera and angle estimation code completely finished. I also want to be able to overlay text onto people in a scene, and have some work done toward generating captions from audio input.

 

Larry’s Status Report for 26 February 2022

I presented the design review this week, which is again where I spent the majority of my working time. Overall, I was fairly satisfied with how the presentation went. There weren't many questions after the presentation, and I have not yet received any feedback, so we have not made any adjustments in response to the design review. I hope to receive constructive feedback once the comments come back, however.

Last week, I stated that I hoped to produce an accurate angle for a single person in webcam view. I did not meet that goal this week, though Charlie and I have most of what we think we need for producing a good result. I calibrated the webcam and Charlie worked on converting Detectron2’s image segmentation into a usable format for our project. Below is a picture of the camera’s calibration matrix.
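
For reference, a minimal sketch of the usual OpenCV checkerboard flow for producing such a matrix; the board size and image paths are assumptions, and this may differ in detail from what I actually ran:

    import glob
    import cv2
    import numpy as np

    BOARD = (9, 6)  # inner corners of the checkerboard -- assumed size

    # 3D board corner positions in the board's own plane (unit spacing)
    objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2)

    obj_points, img_points = [], []
    for path in glob.glob("calib/*.jpg"):  # placeholder image directory
        gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
        found, corners = cv2.findChessboardCorners(gray, BOARD)
        if found:
            obj_points.append(objp)
            img_points.append(corners)

    ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, gray.shape[::-1], None, None)
    print(K)  # the 3x3 camera calibration matrix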

Our project is still on schedule. Looking at our Gantt chart, I was scheduled to complete angle estimation by next week. I should be able to produce a reasonable angle estimation soon given our current progress.

Coming up soon is the design report, which will likely also take a good amount of time to put together. I notice now that it isn’t in the current version of the Gantt chart, which is a bit of an oversight. I believe our schedule has enough slack for it, however. The main deliverables I hope to complete by next week are the angle estimation and the design report.