Honghao’s Status Report for 4/30

This week I mainly focused on tuning our system's hyperparameters to improve the balance between robustness, accuracy, and real-time responsiveness.

I first tested the robustness of our system across different voice input devices by finding the optimal silence threshold for each one, using my own laptop's default microphone and the microphone of a headset in the Wean computer lab. The results showed that our silent chunking was sensitive to the variety of users' input devices. I therefore set the system's silence threshold to the minimum of the optimal values observed across devices, which gives the fewest false positive silence detections, since a false positive could cut a chunk in the middle of a word and produce an incorrect transcription.

Next, I tuned the minimum silence gap length that triggers a chunk. Through testing, I set it to 200 ms, which avoids breaking a word while still promptly capturing a complete chunk and triggering a transcription request for it. A longer minimum silence gap would sometimes delay a transcription request by several seconds when the user speaks with few pauses, which violates our real-time transcription requirement.
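For reference, here is a minimal Python sketch of the chunking rule described above (the real implementation lives in the JavaScript frontend; the threshold value, frame size, and function names are illustrative):

    import numpy as np

    SILENCE_THRESHOLD = 0.01   # illustrative RMS threshold; we use the minimum of the per-device optima
    MIN_SILENCE_GAP_MS = 200   # silence gap length that triggers a chunk
    FRAME_MS = 20              # assumed frame size in milliseconds

    def is_silent(frame):
        # a frame counts as silence when its RMS amplitude falls below the threshold
        return np.sqrt(np.mean(frame ** 2)) < SILENCE_THRESHOLD

    def chunk_by_silence(frames):
        # collect frames into a chunk and cut only after >= 200 ms of continuous silence
        chunk, silent_ms = [], 0
        for frame in frames:
            chunk.append(frame)
            silent_ms = silent_ms + FRAME_MS if is_silent(frame) else 0
            if silent_ms >= MIN_SILENCE_GAP_MS and len(chunk) * FRAME_MS > silent_ms:
                yield np.concatenate(chunk)   # complete chunk -> send a transcription request
                chunk, silent_ms = [], 0
        if chunk:
            yield np.concatenate(chunk)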

Finally, I modified the frontend logic that combines multiple chunks' transcriptions and fixed the problem of chunk transcriptions being concatenated without spaces (for example, "big breakfast today" would be displayed as "big breakfasttoday").
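The gist of the fix, as a Python sketch of the equivalent logic (the actual code is JavaScript): a separating space is inserted only where two Latin-script chunks meet, since Chinese text needs no space between chunks.

    def join_chunk_transcripts(chunks):
        # concatenate per-chunk transcriptions, adding a space only between two
        # Latin-script chunks, e.g. "big breakfast" + "today" -> "big breakfast today"
        result = ""
        for text in chunks:
            text = text.strip()
            if not text:
                continue
            if result and result[-1].isascii() and text[0].isascii():
                result += " "
            result += text
        return result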

Next week, I will focus on finalizing the parameters and getting the input microphone ready for the final demo. We expect the system to perform better in the demo if the input device provides some noise cancellation.

Marco’s Status Report for 4/30

This week I primarily focused on the final presentation and on making the final poster for the demonstration next week. In addition, I did some fine-tuning on the LID model. One problem I saw was that the system sometimes switches languages so quickly that a segment becomes shorter than the minimum input length for the ASR model. These short segments are often transcribed inaccurately since they are shorter than a typical utterance of about 0.3 seconds. To address this, I required each segment to be at least 0.3 seconds long and merged shorter segments into nearby longer ones. This approach improved CER by about 1%.
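A minimal sketch of the merge step, assuming segments are (start, end, language) tuples in seconds; the names are illustrative rather than the exact code:

    MIN_SEG_SEC = 0.3   # minimum segment length fed to the ASR models

    def merge_short_segments(segments, min_len=MIN_SEG_SEC):
        # fold segments shorter than min_len into a neighboring longer segment
        merged = []
        for start, end, lang in segments:
            if merged and (end - start) < min_len:
                prev_start, _, prev_lang = merged[-1]     # absorb into the previous segment
                merged[-1] = (prev_start, end, prev_lang)
            else:
                merged.append((start, end, lang))
        # a leading short segment has no previous neighbor, so merge it forward instead
        if len(merged) > 1 and (merged[0][1] - merged[0][0]) < min_len:
            merged[1] = (merged[0][0], merged[1][1], merged[1][2])
            merged.pop(0)
        return merged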

For next week I plan on fixing some minor issues with the system, especially the silence detection. Currently, the silence detection uses a constant decibel threshold, which could be problematic in a noisy environment where the ambient level is higher. Finally, I will prepare the system for the final demonstration next week.
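One direction we may try is making the threshold adaptive by placing it a fixed offset above an estimated noise floor instead of using a constant value; a rough sketch (all constants and names are illustrative, not committed code):

    import numpy as np

    def adaptive_silence_threshold(frames, offset_db=10.0, noise_percentile=20):
        # estimate the noise floor as a low percentile of per-frame dB levels and
        # put the silence threshold a fixed offset above it, so the detector adapts
        # to noisy rooms instead of relying on a constant decibel value
        eps = 1e-10
        levels_db = [20 * np.log10(np.sqrt(np.mean(f ** 2)) + eps) for f in frames]
        noise_floor_db = np.percentile(levels_db, noise_percentile)
        return noise_floor_db + offset_db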

Honghao’s Status Report for 4/23

This week I focused on implementing an audio silence detector on the JavaScript frontend and a new backend transcription mechanism. The frontend silence detector watches for a sufficiently long silence gap while the user is recording and only sends a transcription request when such a gap is detected; the backend model then analyzes only the new audio chunk recorded since the most recent silence gap.

The new backend transcription mechanism takes in a piece of audio, tags each frame with a language tag (<eng> for English, <man> for Mandarin, and <UNK> for silence), breaks the input audio sequence into smaller single-language chunks, and feeds each chunk to either an English or a Mandarin ASR model. In this way we can integrate advanced pretrained single-language ASR models into our system and harness their capability to improve our system's accuracy. The single-language ASR models we are using are jonatasgrosman/wav2vec2-large-xlsr-53-english and jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn.
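A simplified sketch of that dispatch step, assuming the LID model emits one tag per frame and using the Hugging Face ASR pipeline to load the two models named above (frame size and helper names are illustrative):

    from itertools import groupby
    from transformers import pipeline

    # the two single-language ASR models the chunks are dispatched to
    asr_en = pipeline("automatic-speech-recognition",
                      model="jonatasgrosman/wav2vec2-large-xlsr-53-english")
    asr_zh = pipeline("automatic-speech-recognition",
                      model="jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")

    FRAME_SAMPLES = 320   # assumed 20 ms frames at 16 kHz

    def transcribe_tagged_audio(audio, frame_tags):
        # group consecutive frames with the same language tag into single-language
        # chunks and send each chunk to the matching ASR model; <UNK> (silence) is skipped
        pieces, frame_idx = [], 0
        for tag, run in groupby(frame_tags):
            n = len(list(run))
            chunk = audio[frame_idx * FRAME_SAMPLES:(frame_idx + n) * FRAME_SAMPLES]
            frame_idx += n
            if tag == "<eng>":
                pieces.append(asr_en(chunk)["text"])
            elif tag == "<man>":
                pieces.append(asr_zh(chunk)["text"])
        return " ".join(pieces)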

I finished both implementations and tested the performance and accuracy of our system. Below is a video demonstration of the silent chunking mechanism combined with our own mixed-language transcription model.

https://drive.google.com/file/d/1YY_M3g54S8zmgkDc2RyQqt1IncXd5Wxo/view?usp=sharing

And below is a video demonstration of the silent chunking mechanism combined with LID sequence chunking and single-language ASR models.

https://drive.google.com/file/d/1bcAi5p9H7i9nuqY2ZtgE7zb4wOuB0QsL/view?usp=sharing

With the silent chunking mechanism, the problem we had with a single spoken word being cut into two audio pieces was resolved. We can also see that the pipeline integrating LID sequence chunking with single-language ASR models achieves higher transcription accuracy.

Next week, I will focus on evaluating our system on more diverse audio samples. We are a little behind on our evaluation schedule because we were enhancing the system with the mechanisms above, but we expect to finish the evaluation and further parameter tuning (the silence gap length threshold and the silence amplitude threshold) before the final demo.

Marco’s Status Report for 4/23

This week we constructed a new framework for our speech recognition system. The input audio is first fed into an LID model that we have trained, which splits the audio into segments of either English or Mandarin. The segments are then fed into either an English or a Mandarin speech recognition model. This approach drastically reduces the complexity of the problem and consequently yields much higher accuracy.

In addition, I wrote a Python script that runs the program locally on any computer. The script detects silence, chunks the audio, and processes the audio as the speaker is speaking.
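A stripped-down sketch of how such a script can stream from the microphone, assuming the sounddevice package and the same silence-gap idea used in the frontend (constants and the print placeholder are illustrative, not the actual script):

    import queue
    import numpy as np
    import sounddevice as sd

    SAMPLE_RATE = 16000
    BLOCK = 1024              # samples per callback block (~64 ms at 16 kHz)
    SILENCE_RMS = 0.01        # illustrative silence threshold
    MIN_SILENT_BLOCKS = 4     # ~256 ms of silence triggers a chunk

    blocks = queue.Queue()

    def callback(indata, frames, time, status):
        blocks.put(indata[:, 0].copy())   # keep the mono channel

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        blocksize=BLOCK, callback=callback):
        chunk, silent = [], 0
        while True:
            block = blocks.get()
            chunk.append(block)
            silent = silent + 1 if np.sqrt(np.mean(block ** 2)) < SILENCE_RMS else 0
            if silent >= MIN_SILENT_BLOCKS and len(chunk) > silent:
                audio = np.concatenate(chunk)
                # hand the chunk to the transcription pipeline here (omitted in this sketch)
                print(f"chunk ready: {len(audio) / SAMPLE_RATE:.2f}s")
                chunk, silent = [], 0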

For next week, I plan on fixing some of the minor problems we saw this week. First, I will try to smooth out the segments produced by the LID module by combining segments that are too short (which usually indicates they are inaccurate) with nearby longer segments. Second, an autocorrect module could be added as a post-processing step to further improve accuracy.

Team Status Report for 4/23

We've continued to push the model's performance by updating the architecture and training techniques to more closely align with our guiding paper. This week we achieved a 20% reduction in WER on our toughest evaluation set by using a combined architecture that creates LID labels with a novel segmentation technique. The technique creates "soft" labels that roughly infer the current language from audio segmentation and are then fused with the golden label text. Training and hyper-parameter tuning will continue until the final demo; we anticipate there is likely room for another 15% reduction in the error rate.


On the web app end, we finished implementing an audio silence detector on the JavaScript frontend and a new backend transcription mechanism. The frontend silence detector watches for a sufficiently long silence gap while the user is recording and only sends a transcription request when such a gap is detected; the backend model then analyzes only the new audio chunk recorded since the most recent silence gap. This resolves the problem we had before with single words being cut off because we were arbitrarily cutting the audio into 3-second chunks.

The new backend transcription mechanism takes in a piece of audio, tags each frame with a language tag (<eng> for English, <man> for Mandarin, and <UNK> for silence), breaks the input audio sequence into smaller single-language chunks, and feeds each chunk to either an English or a Mandarin ASR model. In this way we can integrate advanced pretrained single-language ASR models into our system and harness their capability to improve our system's accuracy.

Below is a video demonstration of the silent chunking mechanism combined with our own mixed-language transcription model.

https://drive.google.com/file/d/1YY_M3g54S8zmgkDc2RyQqt1IncXd5Wxo/view?usp=sharing

And below is a video demonstration of the silent chunking mechanism combined with LID sequence chunking and single-language ASR models.

https://drive.google.com/file/d/1bcAi5p9H7i9nuqY2ZtgE7zb4wOuB0QsL/view?usp=sharing

With the silent chunking mechanism, the problem we had with a single spoken word being cut into two audio pieces was resolved. We can also see that the pipeline integrating LID sequence chunking with single-language ASR models achieves higher transcription accuracy.

Next week, we will focus on evaluating our system on more diverse audio samples. We will also try adding Nick's newly trained model into our system to compare against the current accuracy. We are a little behind on our evaluation schedule because we were enhancing the system with the mechanisms above, but we expect to finish the evaluation and further parameter tuning (the silence gap length threshold and the silence amplitude threshold) before the final demo.

Nick’s Status Report for 4/23

This week I pushed forward as far as I could on improving model accuracy, using new training techniques and combining the LID and ASR modules to almost exactly replicate the techniques in the primary paper we've been using for guidance. The model completed over 40 epochs of training this week across 2-3 different configurations. The biggest and most successful change was a segmentation process I devised to create soft labels for the current language. The audio is first segmented by volume to create candidate segments corresponding to words or groups of words. I then perform a two-way reduction between each list of segments and the language labels computed from the label string, which yields a one-to-one match between language labels and segments and allows cross-entropy loss to be computed for the LID model. Using this technique, I achieved a 20% improvement in WER on the hardest evaluation set we have, lowering the current best to 69%. This training process will continue until the final demo, since we can easily substitute in a new version of the model from the cloud each time a new best is achieved. Significant focus will also go toward finishing our final presentation for Monday.
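The shape of that labeling step, as a heavily simplified sketch (the volume segmentation threshold, the CJK-based label rule, and the omission of the two-way reduction are all simplifications of the actual procedure):

    import numpy as np

    def language_label_runs(transcript):
        # per-character language labels from the golden transcript: CJK -> "man",
        # everything else -> "eng"; consecutive repeats are collapsed into runs
        labels = ["man" if "\u4e00" <= ch <= "\u9fff" else "eng"
                  for ch in transcript if not ch.isspace()]
        return [lab for i, lab in enumerate(labels) if i == 0 or lab != labels[i - 1]]

    def volume_segments(audio, frame=320, threshold=0.01):
        # candidate segments: maximal runs of frames whose RMS exceeds the threshold
        frames = [audio[i:i + frame] for i in range(0, len(audio) - frame, frame)]
        voiced = [np.sqrt(np.mean(f ** 2)) > threshold for f in frames]
        segs, start = [], None
        for i, v in enumerate(voiced + [False]):
            if v and start is None:
                start = i
            elif not v and start is not None:
                segs.append((start * frame, i * frame))
                start = None
        return segs

    # the two-way reduction (not shown) then shrinks whichever list is longer until the
    # segments and the label runs match one-to-one, giving a soft LID label per segment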

Marco’s Status Report for 4/16

Following the discussion with the professor earlier this week, I explored some of the options he mentioned. In particular, I started looking into a speaker-dependent speech recognition system instead of a speaker-independent one. While this drastically reduces the scope and difficulty of the task, it is still interesting to experiment with the capabilities of a system that can recognize code-switching speech.

According to a paper published in 1993, speaker-dependent ASR can reach very low error rates with 600-3000 sentences of English training speech. However, there have been no similar attempts for code-switching speech between Mandarin and English.

So far, I have collected around 600 sentences of myself speaking purely English, purely Mandarin, and a mix of both. For next week, I plan on training a model using the data I've gathered together with existing English and Mandarin corpora.

Nick’s Status Report for 4/16

This week we worked diligently on improving our model's performance as well as our project's overall insights and user experience. Within a day of our meeting earlier this week, I retrained a fresh model with slight modifications to the pretrained feature extractor we use and to the training hyper-parameters. This led to a model with much better performance than what we demoed at the interim, in both character and word error rate, across all evaluation subsets and overall.

Continued progress is being made on training a model with a slightly more nuanced architecture that fuses CTC and classification losses, as well as fused logits. Overall, the larger model should have greater capacity for knowledge retention and a finer ability to distinguish between Mandarin and English by more explicitly separating the LID and ASR tasks. We will continue work tomorrow to strategize about what we want the final outcome of our project to be and what we would like to highlight during our final demo and report.
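As a rough illustration of what fusing the two losses can look like (a hypothetical sketch with an arbitrary weight, not the exact formulation used in training):

    import torch.nn.functional as F

    def fused_loss(ctc_log_probs, targets, input_lens, target_lens,
                   lid_logits, lid_labels, alpha=0.3):
        # ctc_log_probs: (T, N, C) log-probabilities; lid_logits: (N, T, num_languages)
        # weighted sum of the CTC transcription loss and a frame-level LID
        # classification (cross-entropy) loss; alpha is an illustrative weight
        asr_loss = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens)
        lid_loss = F.cross_entropy(lid_logits.transpose(1, 2), lid_labels)
        return asr_loss + alpha * lid_loss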


Team’s Status Report for 4/16

On the web app side, we focused on improving the accuracy of output transcriptions with autocorrect libraries and on building web app features that demonstrate the effects of multiple approaches: periodically resending the last x seconds of audio for re-transcription, resending the entire audio for re-transcription at the end of a recording session, and chunking the audio by silence gaps with different silence chunking parameters.

We switched the app's transcription model to Nick's newly trained model, which shows significantly higher English transcription accuracy; however, transcription in both languages is not perfect yet, with some misspelled English words and nonsensical Chinese characters. So aside from continuing model training, we are looking for autocorrect libraries that can correct the model's output text. The main challenge with existing autocorrect packages is that most of them (e.g. the autocorrect library in Python) only work well when the input is in a single language, so we are experimenting with segmenting the text into purely English substrings and purely Chinese substrings and running autocorrect on each separately.
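A sketch of the split-then-correct experiment, assuming the Python autocorrect package's English Speller (Chinese runs are passed through untouched for now, and the regex covers only the basic CJK block):

    import re
    from autocorrect import Speller

    spell_en = Speller(lang="en")                  # English-only corrector
    CJK_RUN = re.compile(r"([\u4e00-\u9fff]+)")    # split on runs of Chinese characters

    def autocorrect_mixed(text):
        # split the transcription into Chinese and non-Chinese runs and apply the
        # English autocorrector only to the non-Chinese (Latin-script) runs
        pieces = CJK_RUN.split(text)
        return "".join(p if (not p or CJK_RUN.fullmatch(p)) else spell_en(p)
                       for p in pieces)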

We also integrated all three re-transcription approaches we tried before into our web app, so that each approach has its own entry point and we can show its effect to our audience during the final demo.

Next week we will finish integrating an autocorrection library and also look for ways to map our transcriptions onto a limited vocabulary space. With these two steps, we hope to eliminate the non-word outputs our app currently generates. If time allows, we will also add a link on our web app that jumps straight to Google Translate to translate the code-switching transcription, so that audience members who do not understand Chinese can follow the transcription during our demo.
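One simple way to try the vocabulary mapping, sketched with Python's difflib and a placeholder word list (purely an idea we may experiment with, not committed code):

    import difflib

    ENGLISH_VOCAB = ["breakfast", "today", "morning", "coffee"]   # placeholder vocabulary

    def snap_to_vocab(word, vocab=ENGLISH_VOCAB, cutoff=0.8):
        # replace a transcribed English word with its closest in-vocabulary match,
        # or keep it unchanged when nothing is close enough
        matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
        return matches[0] if matches else word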

On the modeling side, we are currently training a final iteration of the fused LID-ASR model, which shows promise of delivering the best performance seen so far. Earlier this week we also trained a model that improved our CER and WER metrics across all evaluation subsets.

Honghao’s Status Report for 4/16

This week, I focused on researching ways to improve the accuracy of output transcriptions and on building web app features that demonstrate the effects of multiple approaches: periodically resending the last x seconds of audio for re-transcription, resending the entire audio for re-transcription at the end of a recording session, and chunking the audio by silence gaps with different silence chunking parameters.

I switched the transcription model to Nick's newly trained model, which shows significantly higher English transcription accuracy; however, transcription in both languages is not perfect yet, with some misspelled English words and nonsensical Chinese characters, so I am researching approaches to autocorrect the text. The main challenge with existing autocorrect packages is that most of them (e.g. the autocorrect library in Python) only work well when the input is in a single language, so I am experimenting with segmenting the text into purely English substrings and purely Chinese substrings and running autocorrect on each separately.

I also integrated all three re-transcription approaches we tried before into our web app, so that each approach has its own entry point and we can show its effect to our audience during the final demo.

Next week I will continue my experiments with autocorrection libraries and also look for ways to map our transcriptions onto a limited vocabulary space. I am a little pressed for time getting the silent chunking page ready because I am still seeing a problem with duplicate transcription chunks, but I should be able to fix it before next Monday. If time allows, I will also add a link on our web app that jumps straight to Google Translate to translate the code-switching transcription, so that audience members who do not understand Chinese can follow the transcription during our demo.