This week I mainly focused on tuning our system’s hyperparameters to better balance robustness, accuracy, and real-time responsiveness.
I first tested robustness across voice input devices by finding the optimal silence threshold for two of them: my own laptop’s default microphone and the microphone of a headset in the Wean computer lab. The results showed that our silence-based chunking is sensitive to the user’s input device, since the optimal threshold differed between the two. Therefore, I set our system’s silence threshold to the minimum of the optimal values observed. This minimizes false-positive silence detections, which could otherwise cut a chunk down to a single word and produce an incorrect transcription.
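A minimal sketch of the threshold choice described above, not the project's actual code. It assumes the threshold is expressed in dBFS and that audio arrives as float samples in [-1, 1]; the function names and the per-device values in the usage note are hypothetical.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples in [-1, 1], in dBFS."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    # 10 * log10(mean square) equals 20 * log10(RMS).
    return 10 * math.log10(mean_square)


def pick_silence_threshold(optimal_by_device):
    """Pick the lowest per-device optimum so fewer frames count as silence."""
    return min(optimal_by_device.values())


def is_silent(samples, threshold_dbfs):
    """A frame is treated as silence when its level is below the threshold."""
    return rms_dbfs(samples) < threshold_dbfs
```

For example, with made-up optima of -40 dBFS for one device and -48 dBFS for another, `pick_silence_threshold` returns -48, so a quiet syllable at -45 dBFS is still treated as speech on both devices.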
Next, I tested for the optimal minimum silence gap that triggers a chunk. Based on testing, I set it to 200 ms, which is long enough to avoid splitting a word but short enough to capture a complete chunk promptly and trigger a transcription request for it. With a minimum gap longer than 200 ms, a transcription request would sometimes be delayed by several seconds when the user speaks with few pauses, which violates our real-time transcription requirement.
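The effect of the minimum silence gap can be illustrated with a small sketch. This is an assumption-laden illustration, not the system's implementation: it assumes audio has already been split into fixed-length frames (20 ms here, a hypothetical value) and that each frame has been labeled silent or not.

```python
def split_on_silence(frame_is_silent, frame_ms=20, min_gap_ms=200):
    """Return (start, end) frame-index pairs (end exclusive) for speech chunks.

    A chunk is closed only after at least min_gap_ms of consecutive silence,
    so shorter pauses inside a word or phrase do not split it.
    """
    min_gap_frames = -(-min_gap_ms // frame_ms)  # ceiling division
    chunks = []
    start = None        # first frame of the open chunk, or None if no chunk is open
    last_voiced = None  # most recent non-silent frame in the open chunk
    silent_run = 0      # consecutive silent frames since last_voiced

    for i, silent in enumerate(frame_is_silent):
        if not silent:
            if start is None:
                start = i
            last_voiced = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_gap_frames:
                # Gap is long enough: close the chunk at the last voiced frame.
                chunks.append((start, last_voiced + 1))
                start = None
                silent_run = 0

    if start is not None:
        # Audio ended while a chunk was still open.
        chunks.append((start, last_voiced + 1))
    return chunks
```

With a 200 ms gap, a 60 ms pause (three silent frames) stays inside one chunk. A larger `min_gap_ms` keeps chunks open longer, which is the source of the delay when the speaker rarely pauses.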
Finally, I modified the frontend logic that combines the transcriptions of multiple chunks and fixed a bug where consecutive chunks’ transcriptions were concatenated without a separating space (for example, “big breakfast today” would be displayed as “big breakfasttoday”).
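One way to avoid this kind of concatenation bug is sketched below. It is written in Python for illustration only; the actual frontend may use a different language, and the function name is hypothetical.

```python
def join_transcripts(chunk_texts):
    """Join per-chunk transcriptions with exactly one space between them."""
    # Strip stray whitespace and drop empty results before joining.
    parts = [text.strip() for text in chunk_texts]
    return " ".join(part for part in parts if part)
```

Stripping each piece first prevents double spaces when a transcription already ends or begins with whitespace, and skipping empty pieces avoids gaps from chunks that produced no text.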
Next week, I will focus on finalizing the parameters and getting the input microphone ready for the final demo. We expect our system to perform better in the demo if the input device provides some noise cancellation.