This week, I worked on installing all relevant libraries in a Python 3.8 virtual environment. I was able to successfully test all of the relevant Python components, and they seem to be working reasonably well. One concern Charlie raised was that the deep learning speech separation ran extremely slowly, even after he had enabled GPU acceleration. We will look into this in the coming weeks.
I also continued integrating the components, and in particular fixed some small mistakes I had made in the caption generation. The IBM Watson Speech-to-Text API separates generated text based on its confidence level. I noticed, however, that the highest-confidence text usually captions only a portion of the audio. For now, I am using all of the text that the API returns, since otherwise there are obvious gaps in the captions.
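As a rough sketch of that choice, assume the response has already been parsed into a dict shaped like the JSON the service returns: a `results` list whose entries each carry `alternatives` with a `transcript` and a `confidence`. The function name `collect_transcripts`, the `min_confidence` parameter, and the sample data below are made up for illustration and are not the project's actual code.

```python
def collect_transcripts(response, min_confidence=0.0):
    """Join the top alternative of every result chunk into one caption string.

    With min_confidence=0.0 every chunk is kept, which avoids gaps in the
    captions. Raising the threshold drops low-confidence chunks.
    """
    pieces = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        best = alternatives[0]  # the first alternative is the most likely one
        if best.get("confidence", 1.0) >= min_confidence:
            pieces.append(best["transcript"].strip())
    return " ".join(pieces)


# Hypothetical sample data in the assumed response shape.
sample = {
    "results": [
        {"alternatives": [{"transcript": "hello world ", "confidence": 0.94}]},
        {"alternatives": [{"transcript": "how are you ", "confidence": 0.61}]},
    ]
}

all_text = collect_transcripts(sample)                 # keeps every chunk
confident_only = collect_transcripts(sample, 0.8)      # drops the second chunk
```

Keeping only high-confidence chunks (the second call) reproduces the gaps described above, which is why the default keeps everything.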
There aren’t any more interesting results for me to share here, since all I have been working on is fixing the Python libraries and integrating scripts. Overall, I would say that I am still slightly behind schedule. I hoped to have every component working together with the push of a button by now, but we are not quite there yet. If we give up on the real-time aspect, however, there is not that much left to do. Charlie and I will work together on setting up the web server, after which we will have a final product to present.
By next week, I should definitely have the entire system working together. We should be able to run a single script to record video and audio, generate and overlay captions, and add stereo audio to the video output. We should also have a basic website running on the Jetson TX2.
