This week I mainly spent finally getting the audio to work on the Raspberry Pi. After spending a considerable amount of time trying to get our old separate mic/speaker setup to work, we eventually decided to transition to a 2-in-1 speakerphone. Even though we were initially led to believe this would not work, I was able to spend some time configuring the Raspberry Pi to recognize the speakerphone and allow for programmatic audio input/output. With that in place, I was finally able to start testing the capabilities of our program. However, I first had to spend quite a lot of time getting the TTS models working on the Raspberry Pi, which required tirelessly searching for a version of PyTorch that would run on it.
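For anyone attempting something similar, here is a minimal sketch of the kind of sanity check that confirms the speakerphone is visible for programmatic input/output. It assumes the `sounddevice` library (Python bindings for PortAudio) is installed; the 16 kHz sample rate is chosen because VOSK models typically expect 16 kHz mono audio.

```python
# Minimal sketch: confirm the speakerphone is visible to Python, then do a
# quick record-and-playback loop. Assumes the `sounddevice` library.
import sounddevice as sd

# List every audio device ALSA/PortAudio can see on the Pi;
# the speakerphone should appear with both input and output channels.
print(sd.query_devices())

SAMPLE_RATE = 16000  # Hz; VOSK models typically expect 16 kHz mono

# Record three seconds from the default input device
clip = sd.rec(int(3 * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
sd.wait()  # block until the recording finishes

# Play the clip straight back through the default output device
sd.play(clip, SAMPLE_RATE)
sd.wait()  # block until playback finishes
```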
Since the device was finally inputting and outputting audio, I decided to start benchmarking the TTS (Text-To-Speech) and ASR (Automatic Speech Recognition) models we were using. As mentioned in our previous post, we switched from PocketSphinx/espeak for ASR/TTS to VOSK/Coqui TTS. VOSK performed in line with what we wanted, allowing for almost real-time speech recognition. However, Coqui TTS was very slow. I tested a few different TTS models such as piper-tts, silero-tts, espeak, nanotts, and others. espeak was the fastest but also the worst sounding, while piper-tts combined speed and quality. However, it is still a bit too slow for our use case.
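To make "fast enough" concrete, the number we care about is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where anything under 1.0 means the engine is faster than real time. Below is a rough sketch of the timing harness idea; the espeak invocation at the bottom is just an example command, and any engine with a CLI could be dropped in its place.

```python
# Rough sketch: time one synthesis command and compute its real-time factor.
# The espeak invocation below is an example; swap in whichever engine's CLI
# is under test (piper, nanotts, etc.).
import subprocess
import time
import wave

def wav_duration(path: str) -> float:
    """Length of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def benchmark(cmd: list[str], wav_path: str) -> float:
    """Run one synthesis command and return its real-time factor.
    RTF < 1.0 means the engine synthesizes faster than real time."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    return elapsed / wav_duration(wav_path)

# Example: espeak writing its output straight to a WAV file via -w
rtf = benchmark(["espeak", "-w", "out.wav", "Testing one two three"], "out.wav")
print(f"real-time factor: {rtf:.2f}")
```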
To combat this issue, we are looking to transition back to using a Raspberry Pi 5, after our last Raspberry Pi 5 was stolen and we were forced to use a Raspberry Pi 4. I think we are definitely on track, and I will spend next week working on integrating LLM querying with the device.