Kemdi Emegwa’s Status Report for 4/19

I spent this week primarily improving the quality of our system by implementing streaming queries from the cloud VM to the device. As mentioned in my last status report, I implemented streaming the query response from the text-to-speech engine to the speaker. This week I built on top of that logic so the entire query process is streamed end to end.

Using the Python httpx library, I changed the server logic so that rather than sending the entire response at once, it first sends the response type, so the device can prepare to handle the response, and then sends the rest of the response in chunks. This massively improved the time to first token, making our program feel near real-time.
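As a rough illustration of the device-side half of this flow (the endpoint URL and payload shape here are hypothetical, not our exact API), httpx's streaming interface lets the device read the type from the first chunk and then process the rest as it arrives:

import httpx

# Minimal sketch of consuming the streamed response: the server sends the
# response type first, then the remaining text in chunks.
def stream_query(prompt, url="http://cloud-vm:8000/query"):
    with httpx.Client(timeout=None) as client:
        with client.stream("POST", url, json={"prompt": prompt}) as response:
            response.raise_for_status()
            chunks = response.iter_text()
            response_type = next(chunks)       # first chunk: how to handle what follows
            for chunk in chunks:
                yield response_type, chunk     # hand chunks off (e.g., to TTS) immediately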

Additionally, as a team, we worked with some test subjects to validate and verify the quality of our system. The overall sentiment was highly positive.

I think we are definitely on track to finish our project and don’t foresee any blockers. This next week, I will mainly spend my time on error handling and improving the robustness of our system.

Kemdi Emegwa’s Status Report for 4/12

This week was mainly spent hardening our system and ironing out kinks and last-minute problems. As mentioned in previous reports, we were facing a dilemma where the text-to-speech model we were using was subpar, but the better one was a bit too slow for our use case. To combat this, we introduced streaming into our architecture.

There were two main areas where streaming needed to be introduced. The first was the text-to-speech model itself. This improves the time to first audio output because rather than synthesizing the whole text, we can synthesize it in chunks and output them as they are ready. This alone dramatically improved performance and allowed us to use the higher-quality model. However, it did not address the fact that if the model hosted on the cloud returned a very long response, we would still have to wait for the entire thing before running the text-to-speech model.
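As a simplified illustration of that first change (eSpeak stands in for the real engine here, and the sentence splitting is cruder than what we actually do), the idea is to synthesize and play sentence-sized pieces as soon as they are ready:

import re
import subprocess

# Synthesize sentence-sized chunks one at a time instead of the whole response,
# so playback can begin almost immediately.
def speak_incrementally(text):
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            subprocess.run(["espeak", sentence])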

In order to address this, we decided to stream across the entire query architecture. This involved work on both the server side and the device side. We also had to change how the model returned its response to accommodate this. However, immediately sending each chunk to the TTS model to be output resulted in choppy, unnatural audio. To rectify this, I made the device buffer chunks until it sees a “.” or “,”, and only then send the buffered text to the TTS model. This made the speech sound significantly more natural.
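A simplified sketch of that buffering logic (the exact punctuation handling in our code may differ slightly):

# Accumulate streamed text chunks and only release a phrase to the TTS engine
# once it ends in "." or ",", so synthesis follows natural phrase boundaries.
def buffered_phrases(chunks):
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if buffer.rstrip().endswith((".", ",")):
            yield buffer
            buffer = ""
    if buffer.strip():        # flush whatever remains when the stream ends
        yield buffer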

For this next week, I will mainly spend my time cleaning up code and error handling, and also working with Justin to introduce TLS so we can query over HTTPS rather than HTTP. I think we are definitely on track and I don’t foresee us encountering any problems.

Kemdi Emegwa’s Status Report for 3/29

I spent this week doing a lot of different things, mostly pertaining to the device; in addition, we did an end-to-end test.

Firstly, as I mentioned last week, we decided to upgrade back to a Raspberry Pi 5 from the Raspberry Pi 4 we had been using, because the Raspberry Pi 4 was not delivering the performance we wanted. I spent a bit of time configuring the new Raspberry Pi 5 to work with our code and with the speakerphone we bought last week. Once I got it working, I tested the text-to-speech models, which were the reason we made the switch in the first place. Piper TTS, which is what I had wanted to move forward with previously, was a lot faster, but still had a noticeable delay even when streaming the output. I plan on doing more research into faster text-to-speech models, but for right now we are using eSpeak, which provides real-time TTS, albeit at worse quality.

In addition, I started to think about how a user would approach setting up the device when they first get it. Operating under the assumption that the device is not already connected to Wi-Fi, the user needs a way to access the frontend. This posed a challenge. I did some research and found a solution: access point mode.

Wi-Fi devices can operate in one of two modes at any given time: client mode or access point mode. Typically our devices use client mode and routers use access point mode, but by leveraging access point mode we can let the user access the device’s frontend without an existing Wi-Fi connection.

How this works is that when the device starts up and detects it is not connected to Wi-Fi, it activates access point mode. This broadcasts a network that the user can connect to by going into their Wi-Fi settings and entering a password. They then just have to go to their browser and enter the IP address/port where the Flask server is hosted. I have written scripts to automate starting up access point mode and turning it off, but more needs to be done to allow the device to detect that it is not connected to Wi-Fi and use those scripts.
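For reference, the detection piece could look roughly like the sketch below; the script names are placeholders for the scripts mentioned above, and the real logic may differ:

import subprocess

# iwgetid -r prints the SSID of the associated network and exits non-zero
# when the interface is not associated with any access point.
def connected_to_wifi():
    result = subprocess.run(["iwgetid", "-r"], capture_output=True, text=True)
    return result.returncode == 0 and result.stdout.strip() != ""

def ensure_network():
    if connected_to_wifi():
        subprocess.run(["./stop_ap_mode.sh"])    # placeholder for the teardown script
    else:
        subprocess.run(["./start_ap_mode.sh"])   # placeholder for the AP-mode script

if __name__ == "__main__":
    ensure_network()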

In the same vein of user experience, I configured the Raspberry Pi so that our scripts run on startup/reboot. This eliminates the need for a monitor to run the code, which is the way we envision users interacting with our project anyway.

Lastly, as a group we did a full end-to-end test with a cloud-hosted open-source model. We were able to test all our core functionalities, including regular LLM queries, music playback, and alarms.

I don’t foresee any upcoming challenges and I believe we are on track. This upcoming week will be spent researching more TTS models, allowing the device to detect that it is not connected, and error handling.

Team Status Report for 3/22

This week was largely spent making the Raspberry Pi work with our code. We were able to mitigate the largest problem we were facing, which was that we were not able to programmatically input or output audio through the Raspberry Pi. This meant that even though we could test the code on our laptops, we couldn’t verify it on the device. However, after a lot of time spent configuring, we were able to achieve just that.

A few changes were made to our design. We initially planned to use a Raspberry Pi 5, but that was stolen and we were left with a Raspberry Pi 4; we have now decided to go back to using a Raspberry Pi 5. The TTS models that allow for reasonable quality and clarity simply do not run fast enough on a Raspberry Pi 4. Other than this, there are unlikely to be any major changes upcoming.

We are in a good position and we don’t foresee any major challenges heading our way. The biggest risk right now is integrating the entire system.

Kemdi Emegwa’s Status Report for 3/22

This week I mainly spent finally getting audio to work on the Raspberry Pi. After spending a considerable amount of time trying to get our old separate mic/speaker setup to work, we eventually decided to just transition to a 2-in-1 speakerphone. Even though we were initially led to believe this would not work, I was able to spend some time configuring the Raspberry Pi to recognize the speakerphone and allow for programmatic audio input/output. I was finally able to start testing the capabilities of our program. However, I had to spend quite a lot of time getting the TTS models working on the Raspberry Pi. This required tirelessly searching for a version of PyTorch that would work on the Raspberry Pi.

Since I was finally able to get the device inputting and outputting audio, I decided to start benchmarking the TTS (text-to-speech) and ASR (automatic speech recognition) models we were using. As mentioned in our previous post, we switched from PocketSphinx/eSpeak for ASR/TTS to Vosk/Coqui TTS. Vosk performed in line with what we wanted, allowing for almost real-time speech recognition. However, Coqui TTS was very slow. I tested a few different TTS models such as piper-tts, silero-tts, espeak, nanotts, and others. eSpeak was the fastest but also the worst sounding, while piper-tts combined speed and quality. However, it is still a bit too slow for our use case.
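The comparisons were done with a simple timing harness along these lines (the commands, flags, and voices shown are illustrative rather than the exact ones used):

import subprocess
import time

SAMPLE = "The quick brown fox jumps over the lazy dog."

# Illustrative commands: piper reads text on stdin, espeak takes it as an argument.
ENGINES = {
    "espeak": ["espeak", "-w", "/tmp/out.wav", SAMPLE],
    "piper": ["piper", "--model", "en_US-lessac-medium", "--output_file", "/tmp/out.wav"],
}

for name, cmd in ENGINES.items():
    start = time.time()
    subprocess.run(cmd, input=SAMPLE if name == "piper" else None, text=True)
    print(f"{name}: {time.time() - start:.2f} s to synthesize")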

To combat this issue, we are looking to transition back to using a Raspberry Pi 5, after our last Raspberry Pi 5 was stolen and we were forced to use a Raspberry Pi 4. I think we are definitely on track, and I will spend next week working on integrating LLM querying with the device.

Kemdi Emegwa’s Status Report for 3/15

This week I spent a lot of time testing and making changes. After extensive testing, I determined that the current solutions we were using for speech-to-text and text-to-speech were not going to be sufficient for what we want to do. CMU PocketSphinx and eSpeak simply did not allow for the minimum performance necessary for our system. Thus, I made the transition to Vosk for speech-to-text and Coqui TTS for text-to-speech.

I spent a lot of time configuring the environment for these two new additions, as well as determining which models would be suitable for the Raspberry Pi. I was able to get both working and tested, which yielded significantly better performance for similar power usage.
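For context, a minimal Vosk transcription looks roughly like the following (this assumes a downloaded model directory and a 16 kHz mono WAV file; our actual pipeline reads from the microphone instead):

import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("model")               # path to the unpacked Vosk model directory
wf = wave.open("test.wav", "rb")     # 16 kHz mono PCM audio for this sketch
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])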

In addition, I added the ability to add/delete songs on our frontend for our music capabilities. I also added a database to store these songs.
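The endpoints are along these lines (the route names, fields, and the use of SQLite below are illustrative assumptions, not necessarily what the real frontend uses):

import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)
DB = "songs.db"   # SQLite is assumed here purely for illustration

def init_db():
    with sqlite3.connect(DB) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS songs (id INTEGER PRIMARY KEY, title TEXT, path TEXT)")

@app.route("/songs", methods=["POST"])
def add_song():
    data = request.get_json()
    with sqlite3.connect(DB) as conn:
        conn.execute("INSERT INTO songs (title, path) VALUES (?, ?)", (data["title"], data["path"]))
    return jsonify({"status": "added"})

@app.route("/songs/<int:song_id>", methods=["DELETE"])
def delete_song(song_id):
    with sqlite3.connect(DB) as conn:
        conn.execute("DELETE FROM songs WHERE id = ?", (song_id,))
    return jsonify({"status": "deleted"})

if __name__ == "__main__":
    init_db()
    app.run()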

I am on track, and going forward I plan on testing the frontend on the Raspberry Pi.

Kemdi Emegwa’s Status Report for 3/8

This week, my primary focus was on testing and debugging the microphone integration within our Raspberry Pi setup. I dedicated significant time to troubleshooting the microphone functionality issues encountered when executing our code. This involved meticulously reviewing error logs, verifying hardware connections, and running numerous diagnostic tests to pinpoint the problem. Additionally, I explored different software configurations and settings to identify compatibility challenges with the microphone.

In parallel to the debugging process, I worked extensively on refining our existing codebase. The goal was to enhance compatibility and ensure greater stability when running directly on the Raspberry Pi. This refinement process included optimizing performance, addressing potential memory usage concerns, and ensuring that our code efficiently interfaces with the hardware. The improvements made this week will set a solid foundation for the upcoming integration work.

Despite the encountered difficulties, our project remains on schedule. Through careful evaluation, we concluded that pivoting to a hardware setup utilizing a single sound card for both the speaker and microphone would be beneficial. This decision should simplify integration significantly and resolve the compatibility issues previously faced. Next week, I will specifically focus on testing and integrating this revised speaker/microphone configuration, which should maintain our progress and help us stay aligned with our overall project timeline.

Team Status Report for 2/22

This week we spent a lot of time developing on the device, now that we have the Raspberry Pi on hand. David did research into how we are going to begin integrating our other hardware components, like the speaker/microphone, into our solution. Kemdi worked on allowing the device to begin communicating with first-party hosted models. Meanwhile, Justin continued to lead the effort on developing the website alongside the strategic pivot we made a couple weeks ago. After that pivot, we don’t foresee any more changes to our plan. The main risk right now is integrating our hardware with our software properly. This will likely be the most time-consuming task.

Kemdi Emegwa’s Status Report for 2/22

This past week, I spent some time testing the code I wrote last week, which allows the device to send queries to a model hosted by a third party. Additionally, I started writing code that allows the device to target a first-party model hosted in a Docker container. We are currently ahead of schedule, and this next week I will spend finalizing the above-mentioned code.
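In outline, the device-side call is just an HTTP request to whatever port the container exposes (the URL, path, and JSON shape below are placeholders, not our actual interface):

import requests

MODEL_URL = "http://localhost:8080/generate"   # hypothetical endpoint for the local container

def query_local_model(prompt):
    response = requests.post(MODEL_URL, json={"prompt": prompt}, timeout=60)
    response.raise_for_status()
    return response.json().get("response", "")

if __name__ == "__main__":
    print(query_local_model("What's the weather like today?"))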