Nick’s Status Report for 4/30

This week I accomplished much of what I set out to do in one final attempt to implement and train a complete, jointly trained LID/ASR model. I overcame the problems I was having previously with creating logically correct English and Mandarin labels for audio segments, which are needed to train the LID model. I did this by taking my previously top-performing combined model (which relied on a shaky heuristic for creating labels) and using it during preprocessing to segment the audio and label each segment English, Mandarin, or Blank, essentially distilling the LID information out of that model. These new labels were then used to train a fresh model, with the hope that the ASR half of the model would no longer need to hold any LID information at all.

Training has been successful so far, though I have not yet surpassed my previous best performance. The training procedure takes several stages to complete because I decay certain hyperparameters over time, and I anticipate that within the next 10-20 epochs I should be able to reach my goal. I am on track with our planned schedule and will also be working heavily on our final report this week while this model trains in parallel.
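
As a rough illustration of the label-distillation step described above, the sketch below runs a frozen "teacher" (the previous best combined model) over each utterance during preprocessing and saves per-frame English/Mandarin/Blank tags for later LID supervision. Everything here (the teacher object, its predict_frame_languages method, the 20 ms frame length) is an assumption for illustration, not the actual project code.

```python
# Hedged sketch only: `teacher` and predict_frame_languages() are
# hypothetical stand-ins for the previous best combined model and
# whatever interface it actually exposes.
import json
from pathlib import Path

LABELS = ("ENG", "MAN", "BLANK")  # English / Mandarin / Blank tags

def distill_lid_labels(audio_paths, teacher, frame_ms=20, out_dir="lid_labels"):
    """Run the frozen teacher over each utterance and dump per-frame
    language tags to JSON, to be used later as LID training targets."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for path in audio_paths:
        # Assumed to return one tag from LABELS per frame_ms window.
        frame_tags = teacher.predict_frame_languages(path, frame_ms=frame_ms)
        (out / (Path(path).stem + ".json")).write_text(json.dumps(frame_tags))
```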

 

Team’s Status Report for 4/30

This week, we focused on tuning the model parameters and the silence chunking system to further increase the accuracy of our system, and we prepared for the final demo next week.

First, we finished the final presentation this week and the final poster for next week's demo. In addition, we fine-tuned the LID model and solved the problem of our model switching languages so quickly that a segment ends up shorter than the minimum input length for the ASR model. These short segments are often transcribed inaccurately because they are shorter than a typical utterance of about 0.3 seconds. To address this, we now require each segment to be at least 0.3 seconds long and merge shorter segments into nearby longer ones. This approach improved CER by about 1%.
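
A minimal sketch of this merging rule, assuming segments are (start_s, end_s, lang) tuples and using one simple policy (fold a too-short segment into the segment before it); the actual implementation may choose the neighbor differently.

```python
# Illustrative only: merge any segment shorter than MIN_SEG_S into the
# preceding segment so the ASR model never receives an input shorter
# than a typical 0.3 s utterance.
MIN_SEG_S = 0.3

def merge_short_segments(segments, min_len=MIN_SEG_S):
    merged = []
    for start, end, lang in segments:
        if merged and (end - start) < min_len:
            # Fold the short segment into the previous one, keeping the
            # previous segment's language label.
            p_start, _p_end, p_lang = merged[-1]
            merged[-1] = (p_start, end, p_lang)
        else:
            merged.append((start, end, lang))
    return merged
```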

On the web app end, this week we mainly focused on tuning the hyperparameters of our silence chunking logic to better balance the robustness, accuracy, and real-time responsiveness of our system.

By testing our app with different audio input devices, we observed that each device required a different amplitude threshold to work optimally. We therefore set the system's silence threshold to the minimum of the optimal values we observed, which minimizes false positive silence detections that could otherwise split a single word by mistake and produce an incorrect transcription.

Next, we tested for the optimal minimum silence gap length that triggers a chunk. Through testing, we found a minimum gap of 200 ms to be optimal: it avoids breaking a word apart while still promptly capturing a complete chunk and triggering a transcription request for it. A longer minimum silence gap would sometimes delay a transcription request by several seconds when the user speaks with little pause, which violates our real-time transcription requirement.
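
The sketch below illustrates this chunking rule in Python (the app itself presumably implements it in browser-side JavaScript on audio frames): a cut is emitted only after at least 200 ms of consecutive frames fall below the silence threshold. The threshold and frame length shown are placeholders, not the tuned values.

```python
import math

SILENCE_THRESHOLD_DB = -45.0   # placeholder amplitude threshold (dBFS)
MIN_GAP_MS = 200               # minimum silence gap that triggers a chunk
FRAME_MS = 20                  # assumed analysis frame length

def frame_db(samples):
    """Return the frame's RMS level in dBFS (samples in [-1, 1])."""
    rms = math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))
    return 20 * math.log10(max(rms, 1e-9))

def chunk_boundaries(frames):
    """Yield the frame indices at which a chunk should be cut."""
    silent_ms = 0
    cut_emitted = False
    for i, frame in enumerate(frames):
        if frame_db(frame) < SILENCE_THRESHOLD_DB:
            silent_ms += FRAME_MS
            if silent_ms >= MIN_GAP_MS and not cut_emitted:
                cut_emitted = True
                yield i  # enough consecutive silence: cut the chunk here
        else:
            silent_ms = 0
            cut_emitted = False
```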

Finally, we modified the frontend logic that combines the transcriptions of multiple chunks, fixing the problem of transcriptions being concatenated without spaces (for example, “big breakfast today” was displayed as “big breakfasttoday”).
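
A tiny sketch of this joining fix, for illustration only (the real logic lives in the web app's frontend): chunk transcriptions are joined with a single space so adjacent chunks no longer run together.

```python
def join_chunks(transcripts):
    """Concatenate chunk transcriptions with a separating space."""
    joined = ""
    for text in transcripts:
        text = text.strip()
        if not text:
            continue
        if joined:
            joined += " "
        joined += text
    return joined

# The example from the report: no longer rendered as "big breakfasttoday".
assert join_chunks(["big breakfast", "today"]) == "big breakfast today"
```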

For next week, we plan to fix some remaining minor issues with the system, especially in the silence detection. Currently, silence detection uses a constant decibel value as the threshold, which could be problematic in a noisy environment where the average level is higher. We will also finalize the hardware needed for the demo, including a noise-cancelling microphone.
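
One possible direction for the noisy-environment issue, purely as a hedged sketch rather than a committed design: track the quietest recent frame as a rolling noise-floor estimate and keep the silence threshold a fixed margin above it. The window size and margin below are illustrative values only.

```python
from collections import deque

class AdaptiveSilenceThreshold:
    """Sketch of an adaptive silence threshold: noise floor + margin."""

    def __init__(self, window=200, margin_db=10.0, floor_db=-60.0):
        self.levels = deque(maxlen=window)  # recent frame levels (dBFS)
        self.margin_db = margin_db
        self.floor_db = floor_db            # fallback before any frames arrive

    def update(self, frame_level_db):
        self.levels.append(frame_level_db)

    def threshold(self):
        if not self.levels:
            return self.floor_db
        noise_floor = min(self.levels)      # quietest recent frame
        return noise_floor + self.margin_db
```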