Team’s Status Report for 4/30

This week, we focused on tuning the model parameters and the silence chunking system to further increase the accuracy of our system, and we prepared for the final demo next week.

First, we finished the final presentation this week as well as the final poster for the demo next week. In addition, we did some fine-tuning on the LID model and addressed a failure mode where the model switches languages so quickly that a segment becomes shorter than the minimum input length for the ASR model. These short segments are often transcribed inaccurately since they are shorter than a typical utterance of about 0.3 seconds. To address this, we enforce a minimum length of 0.3 seconds per segment and merge any shorter segment into a nearby longer segment. This approach improved CER by about 1%.
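A minimal sketch of the segment-merging step is below, assuming segments are represented as (start, end, language) tuples in seconds; the actual data structure in our pipeline may differ.

```python
MIN_SEGMENT_SEC = 0.3  # segments shorter than a typical utterance are unreliable

def merge_short_segments(segments, min_len=MIN_SEGMENT_SEC):
    """Merge LID segments shorter than min_len into a neighboring longer segment."""
    merged = []
    for start, end, lang in segments:
        if end - start >= min_len or not merged:
            merged.append([start, end, lang])
        else:
            # Absorb the short segment into the previous segment.
            merged[-1][1] = end
    # If the very first segment is short, absorb it into the following one.
    if len(merged) > 1 and merged[0][1] - merged[0][0] < min_len:
        merged[1][0] = merged[0][0]
        merged.pop(0)
    return [tuple(s) for s in merged]

print(merge_short_segments([(0.0, 1.2, "eng"), (1.2, 1.35, "man"), (1.35, 2.5, "eng")]))
# -> [(0.0, 1.35, 'eng'), (1.35, 2.5, 'eng')]
```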

On the web app end, this week we mainly focused on tuning the hyperparameters used in our silence chunking logic to better balance the robustness, accuracy, and real-time responsiveness of our system.

By testing our app with different audio input devices, we observed that our system required different amplitude threshold values to work optimally on different devices. Thus, we set the system’s silence threshold to the minimum of the optimal values observed, which gives us the fewest false positive silence detections that could otherwise split a single word into separate chunks and produce an incorrect transcription.

Next, we tested the optimal minimum silence gap length that triggers a chunk. Through testing, we found a minimum gap of 200 ms to be optimal: it avoids splitting a word while still promptly capturing a complete chunk and triggering a transcription request for it. A minimum silence gap longer than 200 ms would sometimes delay a transcription request by several seconds when the user speaks with few pauses, which violates our real-time transcription requirement.
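Below is a rough Python sketch of the chunking logic (the production version lives in our JavaScript frontend); the frame length and threshold value here are illustrative placeholders.

```python
import numpy as np

SILENCE_THRESHOLD = 0.01  # placeholder; we use the minimum of the per-device optima
MIN_GAP_SEC = 0.2         # a chunk is emitted only after >= 200 ms of continuous silence
FRAME_SEC = 0.02          # analysis frame length (illustrative)

def chunk_by_silence(samples, sample_rate):
    """Return (start, end) sample indices of chunks separated by >= MIN_GAP_SEC of silence."""
    frame_len = int(FRAME_SEC * sample_rate)
    min_gap_frames = int(MIN_GAP_SEC / FRAME_SEC)
    chunks, chunk_start, silent_frames = [], None, 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        is_silent = np.max(np.abs(frame)) < SILENCE_THRESHOLD
        if not is_silent:
            if chunk_start is None:
                chunk_start = i
            silent_frames = 0
        elif chunk_start is not None:
            silent_frames += 1
            if silent_frames >= min_gap_frames:
                # Close the chunk; the ~200 ms of trailing silence is harmless to the ASR model.
                chunks.append((chunk_start, i + frame_len))
                chunk_start, silent_frames = None, 0
    if chunk_start is not None:
        chunks.append((chunk_start, len(samples)))
    return chunks
```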

Finally, we modified the frontend logic that combines the transcriptions of multiple chunks and fixed a bug where adjacent chunks’ transcriptions were concatenated without a separating space (for example, “big breakfast today” would be displayed as “big breakfasttoday”).
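A minimal sketch of the fix, shown in Python for brevity (the actual logic is in the JavaScript frontend): insert a space only between Latin-script text, since Chinese is written without spaces.

```python
import re

def join_chunks(chunk_texts):
    """Join per-chunk transcriptions, adding a space only between Latin-script text."""
    out = ""
    for text in chunk_texts:
        text = text.strip()
        if not text:
            continue
        # Add a space only when the running output ends with a Latin letter
        # and the new chunk starts with one.
        if out and re.search(r"[A-Za-z]$", out) and re.match(r"[A-Za-z]", text):
            out += " "
        out += text
    return out

print(join_chunks(["big breakfast", "today"]))        # -> "big breakfast today"
print(join_chunks(["我今天吃了一个", "big breakfast"]))  # -> "我今天吃了一个big breakfast"
```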

For next week, we plan on fixing some minor issues with the system, especially with silence detection. Currently, silence detection uses a constant decibel level as the threshold, which could be problematic in a noisy environment where the average level is higher. We will also finalize the hardware needed for the demo, including a noise-cancelling microphone.

Honghao’s Status Report for 4/30

This week I mainly focused on tuning the hyperparameters used in our system to better balance its robustness, accuracy, and real-time responsiveness.

I first tested the robustness of our system across different voice input devices by finding the optimal silence threshold for each one, using my own laptop’s default microphone and the microphone of a headset in the Wean computer lab. The results showed that our silence chunking is sensitive to the variety of users’ input devices. Therefore, I set the system’s silence threshold to the minimum of the optimal values observed, which gives us the fewest false positive silence detections that could otherwise split a single word into separate chunks and produce an incorrect transcription.

Next, I tested the optimal minimum silence gap length that triggers a chunk. Through testing, I set the minimum gap to 200 ms, which avoids splitting a word while still promptly capturing a complete chunk and triggering a transcription request for it. A minimum silence gap longer than 200 ms would sometimes delay a transcription request by several seconds when the user speaks with few pauses, which violates our real-time transcription requirement.

Finally, I modified the frontend logic that combines the transcriptions of multiple chunks and fixed a bug where adjacent chunks’ transcriptions were concatenated without a separating space (for example, “big breakfast today” would be displayed as “big breakfasttoday”).

Next week, I will focus on finalizing the parameters and getting the input microphone ready for the final demo. We expect our system to perform better in the demo if the input device provides some noise cancellation.

Honghao’s Status Report for 4/23

This week I focused on implementing an audio silence detector in the JavaScript frontend and a new backend transcription mechanism. The frontend silence detector watches for a sufficiently long silence gap while the user is recording and sends a transcription request only when such a gap is detected; the backend model then analyzes only the new audio chunk recorded since the most recent silence gap.

The new backend transcription mechanism takes a piece of audio, tags each frame with a language tag (<eng> for English, <man> for Mandarin, and <UNK> for silence), breaks the input audio into smaller single-language chunks, and feeds each chunk to either an English or a Mandarin ASR model. In this way, we can integrate strong pretrained single-language ASR models into our system and harness their capability to improve our accuracy. The single-language ASR models we are using are jonatasgrosman/wav2vec2-large-xlsr-53-english and jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn.
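A simplified sketch of this pipeline is shown below, assuming the LID model outputs one tag per 20 ms frame; the actual frame rate and tag format in our system may differ.

```python
import torch
from itertools import groupby
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

ASR_MODELS = {
    "eng": "jonatasgrosman/wav2vec2-large-xlsr-53-english",
    "man": "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn",
}
processors = {k: Wav2Vec2Processor.from_pretrained(v) for k, v in ASR_MODELS.items()}
models = {k: Wav2Vec2ForCTC.from_pretrained(v).eval() for k, v in ASR_MODELS.items()}

SAMPLE_RATE = 16000
FRAME_SEC = 0.02  # assumed LID frame length

def transcribe_codeswitched(audio, frame_tags):
    """audio: 1-D float waveform at 16 kHz; frame_tags: per-frame '<eng>'/'<man>'/'<UNK>'."""
    pieces, pos = [], 0
    for tag, group in groupby(frame_tags):
        n_frames = len(list(group))
        start, end = pos, pos + int(n_frames * FRAME_SEC * SAMPLE_RATE)
        pos = end
        lang = tag.strip("<>").lower()
        if lang not in models:  # skip '<UNK>' (silence) regions
            continue
        inputs = processors[lang](audio[start:end], sampling_rate=SAMPLE_RATE,
                                  return_tensors="pt")
        with torch.no_grad():
            logits = models[lang](inputs.input_values).logits
        predicted_ids = torch.argmax(logits, dim=-1)
        pieces.append(processors[lang].batch_decode(predicted_ids)[0])
    return " ".join(pieces)
```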

I finished both implementations and tested the performance and accuracy of our system. Below is a video demonstration of the silence chunking mechanism combined with our own mixed-language transcription model.

https://drive.google.com/file/d/1YY_M3g54S8zmgkDc2RyQqt1IncXd5Wxo/view?usp=sharing

And below is a video demonstration of the silence chunking mechanism combined with LID sequence chunking and the single-language ASR models.

https://drive.google.com/file/d/1bcAi5p9H7i9nuqY2ZtgE7zb4wOuB0QsL/view?usp=sharing

With the silence chunking mechanism, the problem we had with a single spoken word being cut into two audio pieces was resolved. We can also see that the mechanism that integrates LID sequence chunking with single-language ASR models achieves higher transcription accuracy.

Next week, I will focus on evaluating our system on more diverse audio samples. We are a little behind our evaluation schedule because we were trying to further enhance the system by adding the mechanisms above, but we expect to finish the evaluation and further parameter tuning (the silence gap length threshold and silence amplitude threshold) before the final demo.

Team’s Status Report for 4/16

On the web app side, we focused on enhancing the accuracy of the output transcriptions through autocorrect libraries and on building features of our web app that demonstrate the effects of multiple approaches, including periodically resending the last x seconds of audio for re-transcription, resending the entire audio for re-transcription at the end of a recording session, and chunking the audio by silence gaps with different silence chunking parameters.

We switched the app’s transcription model to Nick’s newly trained model, which shows significantly higher English transcription accuracy; however, transcription in both languages is still imperfect, with some misspelled English words and nonsensical Chinese characters. So aside from continuing our model training, we are looking for autocorrect libraries that can correct the model’s output text. The main challenge with existing autocorrect packages is that most of them (e.g., the autocorrect library in Python) only work well when the input is in a single language, so we are experimenting with segmenting the text into purely English substrings and purely Chinese substrings and running autocorrect on these substrings separately.
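A rough sketch of this experiment is shown below, using the Python autocorrect package for the English runs and leaving the Chinese runs untouched for now.

```python
import re
from autocorrect import Speller  # pip install autocorrect

spell_en = Speller(lang="en")

def autocorrect_mixed(text):
    """Autocorrect only the English (Latin-script) runs of a mixed English/Chinese string."""
    # Split into alternating runs of CJK characters and everything else.
    runs = re.findall(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+", text)
    corrected = []
    for run in runs:
        if re.match(r"[\u4e00-\u9fff]", run):
            corrected.append(run)            # Chinese: pass through unchanged
        else:
            corrected.append(spell_en(run))  # English: word-level autocorrect
    return "".join(corrected)

print(autocorrect_mixed("我今天吃了一个 big brekfast"))
# -> "我今天吃了一个 big breakfast" (assuming the speller fixes "brekfast")
```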

We also integrated all three re-transcription approaches we had tried into our web app, each with its own entry point, so that during the final demo we can show the effect of each approach to our audience.

Next week, we will finish integrating an autocorrection library and also look for ways to map our transcription onto a limited vocabulary space. With these two steps, we hope to eliminate the non-English words generated by our app. If time allows, we will also add a redirection link on our web app that jumps to Google Translate to translate our codeswitching transcription, so that audience members who do not understand Chinese can follow the transcription during our demo.

On the modeling side, we are currently training a final iteration of the fused LID-ASR model, which shows promise to give the best performance seen so far. Earlier this week, we also trained a model that improved our CER and WER metrics across all evaluation subsets.

Honghao’s Status Report for 4/16

This week, I focused on researching solutions to enhance the accuracy of the output transcriptions and on building features of our web app that demonstrate the effects of multiple approaches, including periodically resending the last x seconds of audio for re-transcription, resending the entire audio for re-transcription at the end of a recording session, and chunking the audio by silence gaps with different silence chunking parameters.

I switched the transcription model to Nick’s newly trained model, which shows significantly higher English transcription accuracy; however, transcription in both languages is still imperfect, with some misspelled English words and nonsensical Chinese characters, so I am researching approaches to autocorrect the text. The main challenge with existing autocorrect packages is that most of them (e.g., the autocorrect library in Python) only work well when the input is in a single language, so I am experimenting with segmenting the text into purely English substrings and purely Chinese substrings and running autocorrect on these substrings separately.

I also integrated all three re-transcription approaches we had tried into our web app, each with its own entry point, so that during the final demo we can show the effect of each approach to our audience.

Next week, I will continue my experiments with autocorrection libraries and also look for ways to map our transcription onto a limited vocabulary space. I am a little pressed for time in getting the silence chunking page ready because I am still seeing a problem with duplicated chunk transcriptions, but I should be able to fix it before next Monday. If time allows, I will also add a redirection link on our web app that jumps to Google Translate to translate our codeswitching transcription, so that audience members who do not understand Chinese can follow the transcription during our demo.

Team Status Report for 04/10

On the web app development end, we created two new features that could increase transcription accuracy while preserving a relatively good user experience. We compared the two features on speed and accuracy, and each wins in one aspect.

The first feature resends the entire audio to the model for re-evaluation after the user stops recording. We then present the re-evaluated transcription below the original one and let the user choose whether to use or ignore it, thereby providing a whole-transcription-level autocorrection suggestion.

The second feature resends the last 3 seconds of audio to be re-evaluated by the model every 3 seconds while the recording is still in progress. When the re-evaluation output comes back for that 3-second audio, we replace the original transcription for that piece of audio, achieving an autocorrecting effect similar to Siri’s.

We see that the transcription accuracy of the whole-audio re-evaluation from the first feature is significantly better than before and better than the 3-second-chunk “autocorrection” from the second feature; however, if a recording is very long, the “autocorrection” from the first feature can take a long time (approximately as long as the audio itself) to be returned by our model. In comparison, the second feature improves transcription accuracy only slightly but sacrifices almost nothing in throughput. With re-transcription and autocorrection every 3 seconds, the user experience remains smooth and close to real-time.

Our next step is to further improve transcription accuracy by incorporating Nick’s newly trained ASR model (trained on a larger dataset, which should yield higher accuracy) and evaluating our model on more diverse audio samples collected via Marco’s audio collection site. Our project is on schedule so far in terms of integration and evaluation.

Honghao’s Status Report for 4/10

This week, I focused on experimenting with two different approaches to improve transcription accuracy while maintaining as good a user experience as possible.

The first approach continues providing real-time transcription by feeding the model 1-second audio chunks. What’s new is that at the end of each recording session, I rerun the model on the entire audio, which can take a significant amount of time (roughly as long as the audio itself) but provides a significantly more accurate transcription that acts as an “autocorrection” for the whole transcript. I present the output of the model run on the entire audio below the real-time generated text and allow the user to decide whether to use or ignore the “autocorrection”. The experience is shown below:

The second approach also continues providing real-time transcription from 1-second audio chunks, but while the user is recording, every 3 seconds I resend the last 3 seconds of audio to be re-evaluated by the model and replace the last 3 seconds’ real-time generated text with the “autocorrected” text. This experience is closer to the autocorrecting transcription experience provided by Siri. The accuracy of the “autocorrected” 3-second-chunk transcription did improve over the original 1-second-chunk transcription, but, as expected, it is worse than the output of running the model over the entire audio (approach 1).
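A minimal sketch of the second approach is below; transcribe(audio) and audio_between(start_s, end_s) are hypothetical stand-ins for a request to our model and for slicing the recorded audio.

```python
WINDOW_SEC = 3  # re-transcribe the last 3 seconds every 3 seconds

class RollingTranscript:
    def __init__(self, transcribe, audio_between):
        self.transcribe = transcribe        # stand-in for a model request
        self.audio_between = audio_between  # stand-in for slicing the recording
        self.per_second = []                # index i holds the text for second [i, i+1)

    def on_second(self, t):
        """Called once per elapsed second of recording (t = 1, 2, 3, ...)."""
        # Real-time pass: transcribe the newest 1-second chunk.
        self.per_second.append(self.transcribe(self.audio_between(t - 1, t)))
        # Every WINDOW_SEC seconds, re-transcribe the last window and replace the
        # corresponding 1-second texts with the more accurate output.
        if t % WINDOW_SEC == 0:
            better = self.transcribe(self.audio_between(t - WINDOW_SEC, t))
            self.per_second[t - WINDOW_SEC:t] = [better] + [""] * (WINDOW_SEC - 1)

    def text(self):
        return " ".join(s for s in self.per_second if s)
```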

The next step is to incorporate Nick’s newly trained transcription model (trained on the large SEAME dataset), which should improve transcription accuracy, especially for English; the model also runs relatively smoothly. In addition, once Marco’s site has collected sufficient audio samples, I will run batch tests on our model for a more rigorous evaluation with more diverse input audio.

So far, the integration and evaluation schedule has no delay.

Honghao’s Status Report for 4/2

This week I focused on integrating a codeswitching ASR model we trained into the web app and deploying the web app to a public IP address. The integration with our model was fairly smooth, and our model does not suffer significant performance issues compared to the ASR models I tested on our app before. When I choose to send and process the ongoing recording in 1-second chunks, the user gets a real-time transcription back with close to a 1-second delay. The model accurately captures the instances of language switching within a spoken sentence; however, it suffers from the same problem as the previous models: since we feed the model only one 1-second audio chunk at a time, the transcription accuracy is poor because the context of the spoken sentence is lost. The model can only give its best transcription based on the audio features within that 1-second chunk.

I tried cutting the audio into chunks at 300 ms silence gaps, hoping that each silence-separated chunk would encapsulate a more complete local context; however, each such chunk is 3-4 seconds long, so analyzing these longer chunks each time would significantly increase the transcription delay and break the real-time experience of our app.

I have finished deploying the app to a public HTTP IP address, but I found that most browsers only allow access to user media (including the microphone) on localhost or in a secure HTTPS environment, so my next step is to purchase a domain name and finish deploying the app on a public domain over HTTPS.

We are also training another codeswitching model, which I will test integrating with our app next week. So far, development is on schedule. We finished integrating the app with our model, and we are ready to start model tuning and to research a better chunking mechanism that captures more context per audio chunk while keeping each chunk as short as possible. We will also collect mixed-language audio samples from YouTube to evaluate our model’s performance across diverse speech characteristics.

Honghao’s Status Report for 3/26

This week I focused on fixing the empty-audio issue that occurs when sending the ongoing recording to the backend in chunks, and on further testing our app’s performance with larger models.

After researching the cause of the audio chunk transfer issue, I found that the MediaRecorder API automatically inserts header information into the audio bytes. This header information is crucial for the backend audio processing tool ffmpeg to work properly, which means I cannot naively slice the frontend audio bytes, send them in chunks, and expect the backend to handle each chunk successfully, because not every chunk would contain the necessary header information.

As an alternative strategy to achieve the audio chunking effect (so that our ML model only runs on each audio chunk once instead of redundantly re-running on the audio from the very beginning of the recording session), I ran experiments to measure the byte-size conversion ratio between the same audio in webm format and in wav format. With this ratio, I can approximate on the backend where to cut the new wav file (which contains all audio data from the beginning of the recording session) to extract the newest audio chunk. I can then feed this approximate new chunk into our model and achieve the same effect as sending chunks between the client and server.
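A minimal sketch of this approximation is below; the ratio value is a placeholder (the actual ratio was measured empirically), and the 44-byte WAV header and 16-bit PCM alignment are assumptions.

```python
WAV_HEADER_BYTES = 44      # standard PCM WAV header size (assumed)
BYTES_PER_SAMPLE = 2       # 16-bit PCM (assumed)
WEBM_TO_WAV_RATIO = 10.0   # placeholder for the empirically measured ratio

def align(offset):
    """Snap a byte offset to a 16-bit sample boundary past the WAV header."""
    return WAV_HEADER_BYTES + ((offset - WAV_HEADER_BYTES) // BYTES_PER_SAMPLE) * BYTES_PER_SAMPLE

def extract_new_chunk(wav_path, prev_webm_bytes, total_webm_bytes):
    """Read only the (approximate) newest audio chunk from the full WAV file,
    based on how many webm bytes had been received before and after this request."""
    start = align(WAV_HEADER_BYTES + int(prev_webm_bytes * WEBM_TO_WAV_RATIO))
    end = align(WAV_HEADER_BYTES + int(total_webm_bytes * WEBM_TO_WAV_RATIO))
    with open(wav_path, "rb") as f:
        f.seek(start)
        return f.read(end - start)  # raw PCM bytes for the newest chunk
```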

After the fix, the system runs fairly efficiently, achieving an almost real-time transcription experience. I then further tested the system’s speed with larger models.

For large-model testing, I continued using jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn as the Mandarin ASR model and tried the newer, larger facebook/wav2vec2-large-960h-lv60-self English ASR model. This verifies that our current deployment and chunking strategy is feasible for deploying our own model.

Next week, I expect to get the first trained version of the language detection model from my teammate Nicolas, and I will begin integrating language detection into the deployed app.

Currently, there has not been any major timeline delay on the web application side. Potential risks include our own trained models running much slower than the models I have tested so far. It will also require further research into how to join multiple models’ logits together.

Another concerning observation, once the audio chunking effect was in place, is that the model predicts each chunk only suboptimally and does not achieve high transcription accuracy at the sentence level. This is a problem we need to discuss as a team.

Team Status Report for 3/19

This week on the web app development side, we verified our deployment strategy by deploying and test-running an English ASR model and a Mandarin ASR model on an AWS server. We were able to make a recording on our frontend app, which sends the audio data to the deployment server and receives the transcription output from the server, as shown below.

Since these are only basic pretrained ASR models with no additional training, the transcription accuracy is not very high, but the test verifies that our current deployment strategy is feasible for deploying our own model.

During the test run, we noticed an issue with the audio chunks received on the server: every chunk after the first arrives empty. While we research a proper fix, we are applying a workaround in which the frontend sends the entire audio from the beginning of the recording session in every request, which avoids the issue; however, this workaround sacrifices performance because, as we decrease the audio sending period, the model has to run on more and more audio data. For example, with a 1-second sending period, the model must process a 1-second audio from the beginning of the session, then a 2-second audio, ..., then an n-second audio. The total length of audio processed for an n-second recording session is 1 + 2 + ... + n = n(n+1)/2 seconds, i.e., O(n²), whereas if we could correctly send n separate 1-second chunks, the total would only be n seconds, i.e., O(n).

Next week, we will work on fixing the empty audio file issue and deploy a pretrained language detection model on our server so that it can decide, for each audio frame, which language model should be used to transcribe that audio, enabling codeswitching.

On the web app side, we do not expect the empty audio chunk issue to majorly delay our schedule. Potential risks to the deployment schedule mainly lie in differences between the required environments and model complexity of the two tested ASR models and those of our own model.

On the model training side, we have finished the data loader for the SEAME dataset (the English-Mandarin codeswitching dataset), and we are still training the language detection model, the English CTC model, and the Mandarin CTC model.