Nick’s Status Report for 4/30

This week I accomplished much of what I set out to do in one last attempt to implement and train a complete, jointly trained LID/ASR model. I was able to overcome my earlier problems with creating logically correct English and Mandarin labels for audio segments, which are needed to train the LID model. I did this by taking my previously top-performing combined model (which used a shaky heuristic for creating labels) and using it during preprocessing to segment the audio and label each segment English, Mandarin, or blank, essentially distilling the LID information out of that model. These new labels were then used to train a fresh model, with the hope that the ASR half of the model would no longer need to hold any LID information at all. Training has so far been successful, though I have not yet surpassed my previous best performance. The training procedure takes several stages to complete as I decay certain hyperparameters over time, and I anticipate that over the next 10-20 epochs I should be able to reach my goal. A sketch of the label-distillation step appears below. I’m on track with our planned schedule and will also be working heavily on our final report this week while this model trains.
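
A minimal sketch of how segment predictions from the previous model could be turned into frame-level LID labels for the new training run. The (start, end, language) segment format, the label frame rate, and the function name are illustrative assumptions, not the actual preprocessing code.

```python
# Sketch: convert segments predicted by the previous combined model into
# frame-level LID labels (English / Mandarin / blank) for a fresh training run.
# The segment format and 50 Hz label frame rate are assumptions.

FRAME_RATE = 50  # label frames per second (assumed)
LABELS = {"blank": 0, "eng": 1, "man": 2}

def segments_to_frame_labels(segments, audio_seconds):
    """segments: list of (start_sec, end_sec, lang) from the prior model."""
    n_frames = int(audio_seconds * FRAME_RATE)
    labels = [LABELS["blank"]] * n_frames        # default everything to blank
    for start, end, lang in segments:
        lo = int(start * FRAME_RATE)
        hi = min(int(end * FRAME_RATE), n_frames)
        for i in range(lo, hi):
            labels[i] = LABELS[lang]
    return labels

# Example: 2 s clip, English from 0.1-0.8 s, Mandarin from 1.0-1.9 s
print(segments_to_frame_labels([(0.1, 0.8, "eng"), (1.0, 1.9, "man")], 2.0)[:10])
```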


Team’s Status Report for 4/30

This week, we focused on tuning the model parameters and the silence chunking system to further increase the accuracy of our system, and we prepared for the final demo next week.

First, we finished the final presentation this week and the final poster for next week’s demo. In addition, we did some fine-tuning on the LID model and solved the problem of our model switching languages so quickly that a segment becomes shorter than the minimum input for the ASR model. These short segments are often transcribed inaccurately since they are shorter than a typical utterance of about 0.3 seconds. To address this issue, we require each segment to be at least 0.3 seconds long and merge shorter segments with nearby longer segments, as sketched below. This approach improved CER by about 1%.
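
A minimal sketch of this merging rule, assuming segments are represented as dictionaries with start/end times and a language tag (the representation and function name are assumptions, not our exact code):

```python
# Sketch of the minimum-segment-length rule: segments shorter than 0.3 s are
# absorbed into an adjacent longer segment instead of being sent to the ASR
# model on their own.

def merge_short_segments(segments, min_sec=0.3):
    """segments: list of dicts like {"start": 1.2, "end": 1.35, "lang": "man"}."""
    merged = []
    for seg in segments:
        if merged and (seg["end"] - seg["start"]) < min_sec:
            merged[-1]["end"] = seg["end"]   # absorb the short segment into its neighbor
        else:
            merged.append(dict(seg))
    return merged

print(merge_short_segments([
    {"start": 0.0, "end": 1.2, "lang": "eng"},
    {"start": 1.2, "end": 1.35, "lang": "man"},   # 0.15 s: merged into neighbor
    {"start": 1.35, "end": 2.4, "lang": "man"},
]))
```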

On the web app end, we mainly focused this week on tuning the hyperparameters used in our silence chunking logic to better balance the robustness, accuracy, and real-time responsiveness of our system.

By testing our app with different audio input devices, we observed that each device required a different amplitude threshold to work optimally. We therefore set our system’s silence threshold to the minimum of the optimal values we observed, which minimizes false positive silence detections that could otherwise cut a chunk in the middle of a word and produce an incorrect transcription.

Next, we tested the minimum silence gap length that triggers a chunk. Through testing, we found a minimum gap of 200 ms to be optimal: it avoids breaking a word while still promptly capturing a complete chunk and triggering a transcription request for it. A minimum silence gap longer than 200 ms would sometimes delay a transcription request by several seconds when the user speaks with little pause, which violates our real-time transcription requirement. A sketch of this chunking logic follows.
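
A minimal sketch of the chunking parameters described above. The frame length, RMS amplitude measure, and threshold value are illustrative assumptions; the real logic runs in the JavaScript frontend.

```python
# Sketch of silence chunking: a frame counts as silent when its RMS amplitude
# falls below a fixed threshold, and a chunk boundary is emitted only after at
# least 200 ms of consecutive silence.

import numpy as np

FRAME_MS = 20             # analysis frame length (assumed)
SILENCE_THRESHOLD = 0.01  # RMS amplitude threshold; the real value is device-tuned
MIN_GAP_MS = 200          # minimum silence gap that triggers a chunk

def find_chunk_boundaries(samples, sample_rate):
    frame_len = int(sample_rate * FRAME_MS / 1000)
    silent_run, boundaries = 0, []
    for i in range(0, len(samples) - frame_len, frame_len):
        rms = np.sqrt(np.mean(samples[i:i + frame_len] ** 2))
        silent_run = silent_run + FRAME_MS if rms < SILENCE_THRESHOLD else 0
        if silent_run == MIN_GAP_MS:           # 200 ms of silence reached
            boundaries.append(i + frame_len)   # cut the chunk here
    return boundaries

# Example: 1 s of noise, 0.3 s of near-silence, 1 s of noise at 16 kHz
rng = np.random.default_rng(0)
audio = np.concatenate([rng.normal(0, 0.1, 16000),
                        np.zeros(4800),
                        rng.normal(0, 0.1, 16000)])
print(find_chunk_boundaries(audio, 16000))
```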

Finally, we modified the frontend logic that combines the transcriptions of multiple chunks and fixed the problem of transcriptions being concatenated without spacing (for example, “big breakfast today” was displayed as “big breakfasttoday”).
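
The actual fix lives in the JavaScript frontend; the sketch below only illustrates the joining rule we adopted (the function name is hypothetical).

```python
# Chunk transcripts are joined with a separating space instead of being
# concatenated directly.

def combine_chunk_transcripts(chunks):
    return " ".join(t.strip() for t in chunks if t.strip())

print(combine_chunk_transcripts(["big breakfast", "today"]))  # -> "big breakfast today"
```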

For next week, we plan on fixing some minor issues with the system, especially the silence detection. Currently, silence detection uses a constant decibel level as the threshold, which could be problematic in a noisy environment where the average level is higher. We will also finalize the hardware needed for the demo, including a noise-cancelling microphone.

Honghao’s Status Report for 4/30

This week I mainly focused on tuning the hyperparameters used in our system to better balance its robustness, accuracy, and real-time responsiveness.

I first tested the robustness of our system across different voice input devices by finding the optimal silence threshold for each one, using my laptop’s built-in microphone and the microphone of a headset in the Wean computer lab. The results showed that our silence chunking was sensitive to the variety of users’ input devices. I therefore changed our system’s silence threshold to the minimum of the optimal values observed, which minimizes false positive silence detections that could otherwise cut a chunk in the middle of a word and produce an incorrect transcription.

Next, I tested the minimum silence gap length that triggers a chunk. Through testing, I set it to 200 ms, which avoids breaking a word while still promptly capturing a complete chunk and triggering a transcription request for it. A longer minimum silence gap would sometimes delay a transcription request by several seconds when the user speaks with little pause, which violates our real-time transcription requirement.

Finally, I modified the frontend logic that combines the transcriptions of multiple chunks and fixed the problem of transcriptions being concatenated without spacing (for example, “big breakfast today” was displayed as “big breakfasttoday”).

Next week, I will focus on finalizing the parameters and getting the input microphone ready for the final demo. We expect our system to perform better in the demo if the input device provides some noise cancellation.

Marco’s Status Report for 4/30

This week I primarily focused on the final presentation and on making the final poster for next week’s demonstration. In addition, I did some fine-tuning on the LID model. One problem I saw was that the system sometimes switches languages so quickly that a segment becomes shorter than the minimum input for the ASR model. These short segments are often transcribed inaccurately since they are shorter than a typical utterance of about 0.3 seconds. To address this issue, I require each segment to be at least 0.3 seconds long and merge shorter segments with nearby longer segments. This approach improved CER by about 1%.

For next week, I plan on fixing some minor issues with the system, especially the silence detection. Currently, silence detection uses a constant decibel level as the threshold, which could be problematic in a noisy environment where the average level is higher. Finally, I will prepare the system for the final demonstration next week.

Honghao’s Status Report for 4/23

This week I focused on implementing an audio silence detector on the JavaScript frontend and a new backend transcription mechanism. The frontend silence detector watches for a silence gap of a certain length while the user is recording and only sends a transcription request when such a gap is detected; the backend model then analyzes only the new audio chunk recorded since the most recent silence gap.

The new backend transcription mechanism takes in a piece of audio, tags each frame with a language tag (<eng> for English, <man> for Mandarin, and <UNK> for silence), breaks the input audio into smaller single-language chunks, and feeds each chunk to either an English or a Mandarin ASR model. In this way we can integrate advanced pretrained single-language ASR models into our system and harness their capability to improve our accuracy. The single-language ASR models we are using are jonatasgrosman/wav2vec2-large-xlsr-53-english and jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn. A sketch of the routing step appears below.
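
A minimal sketch of the routing step, assuming the LID stage already yields language-tagged 16 kHz mono audio segments and that the two checkpoints are loaded through the Hugging Face transformers ASR pipeline (the wrapper choice and chunk format are assumptions, not necessarily how our backend loads them):

```python
# Each single-language audio chunk produced by the LID tagging stage is sent to
# the matching pretrained wav2vec2 model; silence chunks are skipped.

from transformers import pipeline

asr_models = {
    "eng": pipeline("automatic-speech-recognition",
                    model="jonatasgrosman/wav2vec2-large-xlsr-53-english"),
    "man": pipeline("automatic-speech-recognition",
                    model="jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"),
}

def transcribe_segments(segments):
    """segments: list of (lang, samples); lang is "eng", "man", or "UNK"."""
    pieces = []
    for lang, samples in segments:
        if lang == "UNK":                      # silence: nothing to transcribe
            continue
        result = asr_models[lang](samples)     # returns {"text": "..."}
        pieces.append(result["text"])
    return " ".join(pieces)
```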

I finished both implementations and tested the performance and accuracy of our system. Below is a video demonstration of the silence chunking mechanism combined with our own mixed-language transcription model.

https://drive.google.com/file/d/1YY_M3g54S8zmgkDc2RyQqt1IncXd5Wxo/view?usp=sharing

And below is a video demonstration of the silence chunking mechanism combined with LID sequence chunking and the single-language ASR models.

https://drive.google.com/file/d/1bcAi5p9H7i9nuqY2ZtgE7zb4wOuB0QsL/view?usp=sharing

With the silence chunking mechanism, the problem we had with a single spoken word being cut into two audio pieces is resolved. We can also see that the pipeline that integrates LID sequence chunking with the single-language ASR models achieves higher transcription accuracy.

Next week, I will focus on evaluating our system on more diverse audio. We are slightly behind on our evaluation schedule because we were enhancing the system with the mechanisms above, but we expect to finish the evaluation and further parameter tuning (the silence gap length threshold and the silence amplitude threshold) before the final demo.

Marco’s Status Report for 4/23

This week we constructed a new framework for our speech recognition system. The input audio is first fed into an LID model that we trained to split the audio into English and Mandarin segments. Each segment is then fed into either an English or a Mandarin speech recognition model. This approach drastically reduces the complexity of the problem, and consequently we achieved much higher accuracy.

In addition, I wrote a Python script that can run the program locally on any computer. The script detects silence, chunks the audio, and processes it as the speaker is speaking.

For next week, I plan on fixing some of the minor problems we saw this week. First, I will smooth out the segments produced by the LID module by combining segments that are too short (which usually means they are inaccurate) with nearby longer segments. Second, an autocorrect module could be added as a post-processing step to further improve accuracy.

Team Status Report for 4/23

We’ve continued to push the model performance by updating the architecture and training techniques to align more closely with our guiding paper. This week we achieved a 20% reduction in WER on our toughest evaluation set by using a combined architecture that creates LID labels with a novel segmentation technique. The technique creates “soft” labels that roughly infer the current language from audio segmentation and are then fused with the golden label text. Training and hyperparameter tuning will continue until the final demo; we anticipate there is likely room for another 15% reduction in the error rate.


On the web app end, we finished implementing an audio silence detector on the JavaScript frontend and a new backend transcription mechanism. The frontend silence detector watches for a silence gap of a certain length while the user is recording and only sends a transcription request when such a gap is detected; the backend model then analyzes only the new audio chunk recorded since the most recent silence gap. This resolves the problem we had before with single words being cut off when we split the audio into arbitrary 3-second chunks.

The new backend transcription mechanism takes in a piece of audio, tags each frame with a language tag (<eng> for English, <man> for Mandarin, and <UNK> for silence), breaks the input audio into smaller single-language chunks, and feeds each chunk to either an English or a Mandarin ASR model. In this way we can integrate advanced pretrained single-language ASR models into our system and harness their capability to improve our accuracy.

Below is a video demonstration of the silence chunking mechanism combined with our own mixed-language transcription model.

https://drive.google.com/file/d/1YY_M3g54S8zmgkDc2RyQqt1IncXd5Wxo/view?usp=sharing

And below is a video demonstration of the silence chunking mechanism combined with LID sequence chunking and the single-language ASR models.

https://drive.google.com/file/d/1bcAi5p9H7i9nuqY2ZtgE7zb4wOuB0QsL/view?usp=sharing

With the silence chunking mechanism, the problem we had with a single spoken word being cut into two audio pieces is resolved. We can also see that the pipeline that integrates LID sequence chunking with the single-language ASR models achieves higher transcription accuracy.

Next week, we will focus on evaluating our system on more diverse audio. We will also try adding Nick’s newly trained model into our system to compare it with the current accuracy. We are slightly behind on our evaluation schedule because we were enhancing the system with the mechanisms above, but we expect to finish the evaluation and further parameter tuning (the silence gap length threshold and the silence amplitude threshold) before the final demo.

Nick’s Status Report for 4/23

This week I pushed as far as I could on improving model accuracy by using new techniques for training and combining the LID and ASR modules, almost exactly replicating the techniques used in the primary paper we have been following for guidance. The model completed over 40 epochs of training this week across 2-3 different configurations. The biggest and most successful change was a segmentation process I devised to create soft labels for the current language. The audio is first segmented by volume to create candidate segments for words or groups of words. I then perform a two-way reduction between each list of segments and the language labels derived from the label string, which results in a one-to-one match between a language label and a segment and allows cross-entropy loss to be applied to the LID model. A sketch of the soft-label construction follows. Using this technique, I achieved a 20% improvement in WER on our hardest evaluation set, lowering our current best to 69%. Training will continue until the final demo, since we can easily substitute in a new version of the model from the cloud each time a new best is achieved. I will also focus heavily on finishing our final presentation for Monday.
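
A simplified sketch of the soft-label construction, assuming segments come from the volume-based segmentation and the language sequence is derived from the gold transcript by checking for CJK characters; the crude alignment here is a stand-in for the actual two-way reduction, and the function names are illustrative.

```python
# Volume-based audio segments are paired with a language sequence derived from
# the gold transcript, so each segment gets a language label for the LID
# cross-entropy loss.

def is_mandarin(ch):
    return "\u4e00" <= ch <= "\u9fff"          # CJK Unified Ideographs block

def language_sequence(transcript):
    """Collapse the transcript into a run-length sequence of language tags."""
    langs = ["man" if is_mandarin(ch) else "eng" for ch in transcript if ch.strip()]
    collapsed = []
    for lang in langs:
        if not collapsed or collapsed[-1] != lang:
            collapsed.append(lang)
    return collapsed

def label_segments(segments, transcript):
    """segments: list of (start_sec, end_sec) from volume-based segmentation."""
    langs = language_sequence(transcript) or ["eng"]
    # Crude one-to-one match: reuse the last language run if there are more
    # segments than language runs (the real pipeline reduces both sides jointly).
    return [(seg, langs[min(i, len(langs) - 1)]) for i, seg in enumerate(segments)]

print(label_segments([(0.0, 0.9), (0.9, 1.8)], "we are 学生 today"))
```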

Marco’s Status Report for 4/16

Following the discussion with the professor earlier this week, I explored some of the options he mentioned. In particular, I started looking into a speaker-dependent speech recognition system instead of a speaker-independent one. While this drastically reduces the scope and difficulty of the task, it is still interesting to experiment with the capabilities of a system that performs recognition on code-switching speech.

According to a paper published in 1993, speaker-dependent ASR can reach very low error rates in English with 600-3,000 sentences of training speech. However, there have been no similar attempts on code-switching speech between Mandarin and English.

So far, I have collected around 600 sentences of myself speaking purely English, purely Mandarin, and mixed speech. For next week, I plan on training a model using the data I have gathered together with existing English and Mandarin corpora.

Nick’s Status Report for 4/16

This week we worked diligently to improve our model’s performance and to improve our project’s overall insights and user experience. Within a day of our meeting earlier this week, I retrained a fresh model with slight modifications to the pretrained feature extractor we use as well as to the training hyperparameters. This led to a model with much better performance than the one we demoed at the interim, in both character and word error rate, across all evaluation subsets and overall.

Progress is also continuing on training a model with a slightly more nuanced architecture that fuses CTC and classification losses as well as the corresponding logits. Overall, the larger model should have greater capacity for knowledge retention and a finer ability to distinguish between Mandarin and English by more explicitly separating the LID and ASR tasks. We will continue tomorrow to strategize about what we would like the final outcome of our project to be and what to highlight during our final demo and report. A sketch of the fused loss appears below.
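
A minimal sketch of what the fused objective might look like in PyTorch; the loss weighting, tensor shapes, and blank index are illustrative assumptions rather than the actual training code.

```python
# Fused training objective: a CTC loss on the transcription logits combined
# with a cross-entropy classification loss on the per-frame LID logits.

import torch.nn.functional as F

def fused_loss(ctc_log_probs, targets, input_lens, target_lens,
               lid_logits, lid_labels, lid_weight=0.3):
    # ctc_log_probs: (time, batch, vocab) log-probabilities for CTC
    # lid_logits:    (batch, time, 3) logits over {blank, eng, man}
    asr_loss = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens,
                          blank=0, zero_infinity=True)
    lid_loss = F.cross_entropy(lid_logits.reshape(-1, lid_logits.size(-1)),
                               lid_labels.reshape(-1))
    return asr_loss + lid_weight * lid_loss
```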