This week I mainly spent finally getting the audio to work on the Raspberry Pi. After spending a considerable amount of time trying to get our old separate mic/speaker setup to work, we eventually decided to transition to a 2-in-1 speakerphone. Even though we were initially led to believe this would not work, I was able to spend some time configuring the Raspberry Pi to recognize the speakerphone and allow for programmatic audio input/output. With that in place, I was finally able to start testing the capabilities of our program. However, I first had to spend quite a lot of time getting the TTS models working on the Raspberry Pi, which required tirelessly searching for a version of PyTorch that would run on it.
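For anyone attempting something similar, here is a minimal sketch of the kind of sanity check that confirms the speakerphone is visible for programmatic input/output. It assumes the `sounddevice` library (Python bindings for PortAudio) is installed; the 16 kHz sample rate is chosen because VOSK models typically expect 16 kHz mono audio.

```python
# Minimal sketch: confirm the speakerphone is visible to Python, then do a
# quick record-and-playback loop. Assumes the `sounddevice` library.
import sounddevice as sd

# List every audio device ALSA/PortAudio can see on the Pi;
# the speakerphone should appear with both input and output channels.
print(sd.query_devices())

SAMPLE_RATE = 16000  # Hz; VOSK models typically expect 16 kHz mono

# Record three seconds from the default input device
clip = sd.rec(int(3 * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
sd.wait()  # block until the recording finishes

# Play the clip straight back through the default output device
sd.play(clip, SAMPLE_RATE)
sd.wait()  # block until playback finishes
```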
Since the device was finally inputting and outputting audio, I decided to start benchmarking the TTS (Text-To-Speech) and ASR (Automatic Speech Recognition) models we were using. As mentioned in our previous post, we switched from PocketSphinx/espeak for ASR/TTS to VOSK/Coqui TTS. VOSK performed in line with what we wanted, allowing for almost real-time speech recognition. However, Coqui TTS was very slow. I tested a few different TTS models such as piper-tts, silero-tts, espeak, nanotts, and others. espeak was the fastest but also the worst sounding, while piper-tts combined speed and quality. However, it is still a bit too slow for our use case.
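To make "fast enough" concrete, the number we care about is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where anything under 1.0 means the engine is faster than real time. Below is a rough sketch of the timing harness idea; the espeak invocation at the bottom is just an example command, and any engine with a CLI could be dropped in its place.

```python
# Rough sketch: time one synthesis command and compute its real-time factor.
# The espeak invocation below is an example; swap in whichever engine's CLI
# is under test (piper, nanotts, etc.).
import subprocess
import time
import wave

def wav_duration(path: str) -> float:
    """Length of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def benchmark(cmd: list[str], wav_path: str) -> float:
    """Run one synthesis command and return its real-time factor.
    RTF < 1.0 means the engine synthesizes faster than real time."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    return elapsed / wav_duration(wav_path)

# Example: espeak writing its output straight to a WAV file via -w
rtf = benchmark(["espeak", "-w", "out.wav", "Testing one two three"], "out.wav")
print(f"real-time factor: {rtf:.2f}")
```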
To combat this issue, we are looking to transition back to using a Raspberry Pi 5, after our last Raspberry Pi 5 was stolen and we were forced to use a Raspberry Pi 4. I think we are definitely on track, and I will spend next week working on integrating LLM querying with the device.