This week, I worked on using IBM Watson’s Speech-To-Text API in Python. The only trouble I had was that using the API required Python 3.7, while the default installed on the system was Python 3.6. Since I built OpenCV for Python 3.6 and do not want to go through the trouble of rebuilding it for 3.7, I will just use Python 3.7 for the Speech-To-Text work and Python 3.6 for everything else. These are some of the timestamped captions that I generated:
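For reference, the call that produces those word-level timestamps looks roughly like this. This is a minimal sketch assuming the ibm-watson Python SDK; the API key, service URL, and audio file name are placeholders, and the package layout may differ depending on the SDK version:

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and service URL.
stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")

with open("recording.wav", "rb") as audio:
    result = stt.recognize(
        audio=audio,
        content_type="audio/wav",
        timestamps=True,  # ask for per-word start/end times
    ).get_result()

# Each timestamp entry is [word, start_seconds, end_seconds].
for chunk in result["results"]:
    for word, start, end in chunk["alternatives"][0]["timestamps"]:
        print(f"{start:6.2f} {end:6.2f}  {word}")
```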
Since we have the Interim Demo in the upcoming week, I focused on putting together a non-real-time demonstration. I have not completely figured out how to record video and audio at the same time, but I was able to do both well enough to produce new recordings for Charlie and Stella to work with. I also worked out a lot of details with Charlie about how data should be passed around. Since we are not targeting real-time, we will just be generating and consuming files. We currently have the means to produce separated audio, timestamped captions, and video with angle estimation, so we believe we can put together a good demo.
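As one concrete way the file hand-off could look, here is a hedged sketch that flattens Watson’s word timestamps into a simple JSON caption file. The grouping heuristic, the [start, end, text] layout, and the file name are placeholders of mine rather than the exact format Charlie and I settled on:

```python
import json

# Hypothetical hand-off format: a list of [start, end, text] entries
# in seconds, grouped into short phrases from Watson's word timestamps.
def words_to_captions(timestamps, max_words=7):
    """timestamps: list of [word, start, end] from the recognize call."""
    captions = []
    for i in range(0, len(timestamps), max_words):
        group = timestamps[i:i + max_words]
        text = " ".join(word for word, _, _ in group)
        captions.append([group[0][1], group[-1][2], text])
    return captions

if __name__ == "__main__":
    stamps = [["hello", 0.0, 0.4], ["world", 0.5, 0.9],
              ["this", 1.1, 1.3], ["is", 1.3, 1.4],
              ["a", 1.4, 1.5], ["demo", 1.5, 2.0]]
    with open("captions.json", "w") as f:
        json.dump(words_to_captions(stamps), f, indent=2)
```

Keeping the hand-off as a flat file means each stage can be developed and tested independently, which fits the non-real-time demo.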
I am currently behind schedule, since I expected to have placed prerecorded captions onto video by now. I have all the tools available for doing so, however, so I expect to catch up in the next week. At this point, it is just a matter of writing a couple of Python scripts.
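The overlay script is the main one left to write. Here is a rough sketch of what I have in mind, assuming captions stored in the [start, end, text] JSON format above and OpenCV for the video I/O; the file names and text placement are placeholders:

```python
import json

import cv2

# Hypothetical inputs: a recorded video and a JSON caption file.
with open("captions.json") as f:
    captions = json.load(f)  # e.g. [[0.0, 0.9, "hello world"], ...]

cap = cv2.VideoCapture("demo.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("demo_captioned.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    t = frame_idx / fps  # timestamp of this frame in seconds
    # Draw every caption whose interval covers this frame.
    for start, end, text in captions:
        if start <= t <= end:
            cv2.putText(frame, text, (40, h - 40),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.0,
                        (255, 255, 255), 2, cv2.LINE_AA)
    out.write(frame)
    frame_idx += 1

cap.release()
out.release()
```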
One aspect of our project I am worried about is that the Jetson TX2 may not be fast enough for real-time applications. While this is not an issue for the demo, I noticed a lot of slowdown when processing real-time video. Next week, I will spend some time investigating how to speed up the processing. Besides a working demo as a deliverable, I want to have a more concrete understanding of where the bottlenecks are and how they might be addressed.
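As a starting point for that investigation, a simple per-stage timer around the frame loop should show where the milliseconds go. This is only a sketch with a stand-in processing stage, not our actual pipeline:

```python
import time
from collections import defaultdict

import cv2

def process(frame):
    # Stand-in for the real per-frame work (e.g. angle estimation).
    return cv2.GaussianBlur(frame, (9, 9), 0)

totals = defaultdict(float)
frames = 0
cap = cv2.VideoCapture(0)  # hypothetical camera index
while frames < 300:  # sample a few hundred frames
    t0 = time.perf_counter()
    ok, frame = cap.read()
    t1 = time.perf_counter()
    if not ok:
        break
    process(frame)
    t2 = time.perf_counter()
    totals["capture"] += t1 - t0
    totals["process"] += t2 - t1
    frames += 1
cap.release()

for stage, secs in totals.items():
    print(f"{stage}: {1000 * secs / frames:.2f} ms/frame average")
```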