Kemdi Emegwa’s Status Report for 4/12

This week was mainly spent hardening our system and ironing out kinks and last-minute problems. As mentioned in previous reports, we were facing a dilemma where the text-to-speech model we were using was subpar, but the better one was a bit too slow for our use case. To address this, we introduced streaming into our architecture.

There were two main areas where streaming needed to be introduced. The first was the text-to-speech model itself. Streaming improves the time to first audio output because, rather than synthesizing the whole text, we can synthesize it in chunks and output each chunk as soon as it is ready. This alone dramatically improved performance and allowed us to use the higher-quality model. However, it did not address the fact that if the model hosted in the cloud returned a very long response, we would still have to wait for the entire thing before running the text-to-speech model.
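To give a rough idea of what this looks like, here is a minimal Python sketch of chunked synthesis. The `synthesize` and `play_audio` names are placeholders for whatever TTS and audio-playback calls a system actually uses, not our real function names, and the chunking is simplified for illustration.

```python
def synthesize(text: str) -> bytes:
    """Placeholder for the actual text-to-speech call."""
    ...

def play_audio(audio: bytes) -> None:
    """Placeholder for the device's audio playback."""
    ...

def speak_in_chunks(text: str, chunk_chars: int = 80) -> None:
    """Synthesize and play a long response piece by piece.

    Splitting on word boundaries keeps each TTS call small, so playback of
    the first chunk can start while later chunks are still being synthesized,
    instead of waiting for the whole response to be converted to audio.
    """
    chunk: list[str] = []
    length = 0
    for word in text.split():
        chunk.append(word)
        length += len(word) + 1
        if length >= chunk_chars:
            play_audio(synthesize(" ".join(chunk)))
            chunk, length = [], 0
    if chunk:
        # Flush whatever is left over at the end of the text.
        play_audio(synthesize(" ".join(chunk)))
```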

To address this, we decided to stream through the entire query pipeline. This involved work on both the server side and the device side, and we also had to change how the model returned its response to accommodate it. However, immediately sending each chunk to the TTS model to be spoken resulted in strange, choppy output. To fix this, I made the device buffer incoming chunks until it sees a “.” or “,”, and only then send the buffered text to the TTS model, as sketched below. This made the output sound significantly more natural.
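Concretely, the buffering behaves roughly like the generator below. This is a simplified sketch rather than the exact device code; `chunks` stands in for whatever iterator the streaming response provides.

```python
def clause_buffer(chunks):
    """Group streamed text chunks into clause-sized pieces for the TTS model.

    Incoming chunks are accumulated until a '.' or ',' appears; everything up
    to and including that punctuation is emitted as one piece, which keeps
    the synthesized speech from sounding choppy.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        cut = max(buffer.rfind("."), buffer.rfind(","))
        if cut != -1:
            yield buffer[: cut + 1]
            buffer = buffer[cut + 1:]
    if buffer.strip():
        yield buffer  # flush whatever remains when the stream ends


# Example: prints 'Sure,' then ' here is the forecast.'
for piece in clause_buffer(["Su", "re, he", "re is the fo", "recast."]):
    print(repr(piece))
```

Each piece coming out of the buffer would then be handed to the TTS model, so it synthesizes natural clause-length text instead of tiny fragments.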

For this next week, I will mainly spend my time cleaning up code and error handling, and also working with Justin to introduce TLS so we can query over HTTPS rather than HTTP. I think we are definitely on track, and I don't foresee us encountering any problems.
