Team Status Report for 04/10

On the webapp development end, we created two new features that could increase transcription accuracy while preserving a relatively good user experience. We compared the two features on speed and accuracy, and each feature wins in one of the two aspects.

The first feature resends the entire audio to the model for re-evaluation after the user stops recording. We then present the re-evaluated transcription below the original transcription and let the user choose whether to use or ignore the new transcription, thereby providing a whole-transcription-level autocorrection suggestion.

The second feature resends the last 3 seconds of audio to the model for re-evaluation every 3 seconds while the recording is still in progress. When the re-evaluated output for that 3-second audio comes back, we replace the original transcription for that piece of audio, achieving an autocorrecting effect similar to Siri’s.

We see that the whole-audio transcription from the first feature is significantly more accurate than both the original real-time transcription and the 3-second-chunk “autocorrection” from the second feature; however, if a recording is very long, the first feature’s “autocorrection” can take a long time (approximately as long as the audio itself) to be returned by our model. In comparison, the second feature improves transcription accuracy only slightly but sacrifices almost no throughput. With re-transcription and autocorrection every 3 seconds, the user experience remains smooth and close to real time.

Our next step is to further improve transcription accuracy by incorporating Nick’s newly trained ASR model (trained on a larger dataset, so it should transcribe more accurately) and by evaluating our model on more diverse audio samples collected through Marco’s audio collection site. In terms of integration and evaluation, our project is on schedule so far.

Honghao’s Status Report for 4/10

This week, I focused on experimenting with two different approaches to improve transcription accuracy while maintaining as good a user experience as possible.

The first approach continues to provide real-time transcription by feeding the model 1-second audio chunks. What’s new is that at the end of each recording session, I rerun the model on the entire audio. This rerun can take a significant amount of time (roughly as long as the audio itself), but it produces a significantly more accurate transcription that serves as an “autocorrection” for the whole transcript. I present this whole-audio output below the real-time generated text and let the user decide whether to use or ignore the “autocorrection”. The experience is shown below.
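To make this flow concrete, here is a minimal sketch of approach 1, assuming a Hugging Face wav2vec2 CTC checkpoint like the one Marco fine-tuned; the checkpoint name, helper functions, and session bookkeeping are illustrative placeholders rather than our actual web app code.

```python
# Minimal sketch of approach 1 (assumptions noted above): stream 1-second
# chunks through the model for the live transcript, then re-run the model on
# the whole recording at the end of the session and surface that output as an
# "autocorrection" suggestion.
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "our-finetuned-wav2vec2-codeswitch"  # placeholder checkpoint name
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

def transcribe(audio: np.ndarray, sample_rate: int = 16_000) -> str:
    """One forward pass over a mono 16 kHz float32 waveform."""
    inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    return processor.batch_decode(torch.argmax(logits, dim=-1))[0]

session_chunks: list[np.ndarray] = []  # 1-second chunks, appended as they arrive

def on_chunk(chunk: np.ndarray) -> str:
    """Real-time path: transcribe each 1-second chunk on its own."""
    session_chunks.append(chunk)
    return transcribe(chunk)

def on_stop() -> str:
    """End of session: re-transcribe the entire audio; the result is shown
    below the live transcript as a suggestion the user can accept or ignore."""
    return transcribe(np.concatenate(session_chunks))
```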

The second approach also continues to provide real-time transcription from 1-second audio chunks, but while the user is recording, every 3 seconds I resend the last 3 seconds of audio to the model for re-evaluation and replace the real-time generated text for those 3 seconds with the “autocorrected” text. This experience is closer to the autocorrecting transcription experience provided by Siri. The “autocorrected” transcription of the 3-second chunks is indeed more accurate than the original transcription from 1-second chunks, but, as expected, it is still less accurate than the output of running the model over the entire audio (approach 1).
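A sketch of the bookkeeping for approach 2 is below; `transcribe()` stands in for the single forward pass shown in the approach-1 sketch, and the chunk counting is an illustrative assumption rather than the real implementation.

```python
# Minimal sketch of approach 2: keep one text segment per 1-second chunk, and
# every 3 chunks (~3 seconds) re-transcribe the last 3 seconds together and
# replace the corresponding tail of the live transcript.
import numpy as np

def transcribe(audio: np.ndarray) -> str:
    """Placeholder for the single forward pass from the approach-1 sketch."""
    raise NotImplementedError

audio_buffer: list[np.ndarray] = []  # all 1-second chunks so far
segments: list[str] = []             # text shown to the user, in order

def on_chunk(chunk: np.ndarray) -> str:
    audio_buffer.append(chunk)
    segments.append(transcribe(chunk))          # fast, low-context transcript

    if len(audio_buffer) % 3 == 0:              # every ~3 seconds of audio
        last_3s = np.concatenate(audio_buffer[-3:])
        segments[-3:] = [transcribe(last_3s)]   # overwrite the tail in place

    return "".join(segments)                    # current "autocorrected" view
```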

The next step is to incorporate Nick’s newly trained transcription model (trained on the large SEAME dataset), which could raise transcription accuracy, especially for English. That model runs relatively smoothly. Also, once Marco’s site has collected enough audio samples, I will run batch tests on our model for a more rigorous evaluation on more diverse input audio.

So far, the integration and evaluation work is on schedule with no delay.

Marco’s Status Report for 4/9

This week I continued the effort to improve the model’s performance. I attempted to train on a dataset that included more English recordings; however, this approach was not very effective. The model’s overall performance actually regressed, with no particular improvement in the transcription of English sentences. The second approach, training a model on SEAME, is still underway.

The biggest obstacle in this task, I believe, is not the model itself but rather the lack of resources. While SEAME is by far the most comprehensive code-switching dataset, I have concerns about how well it will suit our task. In particular, the speakers are all from Malaysia and Singapore, and their accents in both English and Chinese differ drastically from the code-switching speech of Mainland Chinese and Hong Kong speakers. This mismatch may result in lackluster performance during our final demo. Another point supporting my suspicion comes from a paper published by Tencent AI Lab. Their experiment used a model similar to the one we are training; the difference is that they trained on over 1,000 hours of code-switching speech, and their result reached an impressive 7.6% CER.

To combat the resource problem, I wrote a website aimed at gathering code-switching speech from classmates and peers. For next week, I plan to finalize and deploy the website by Monday and use the speech gathered from it to continue training the model.
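As an illustration of what the collection site needs to do on the backend, here is a minimal upload-endpoint sketch, written in Flask purely as an example; the route, form fields, and storage layout are assumptions, not Marco’s actual implementation.

```python
# Hypothetical sketch of an audio-collection upload endpoint (Flask chosen
# only for illustration): store each submitted clip alongside the transcript
# the contributor typed in.
import uuid
from pathlib import Path

from flask import Flask, request

app = Flask(__name__)
UPLOAD_DIR = Path("collected_audio")
UPLOAD_DIR.mkdir(exist_ok=True)

@app.route("/upload", methods=["POST"])
def upload():
    clip = request.files["audio"]                    # recorded blob from the browser
    transcript = request.form.get("transcript", "")  # what the speaker said
    clip_id = uuid.uuid4().hex
    clip.save(str(UPLOAD_DIR / f"{clip_id}.wav"))
    (UPLOAD_DIR / f"{clip_id}.txt").write_text(transcript, encoding="utf-8")
    return {"id": clip_id}

if __name__ == "__main__":
    app.run(debug=True)
```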

Marco’s Status Report for 4/2

This week I mainly focused on improving the performance of the model. Following the approach outlined in a research paper, I started with a wav2vec2-large-xlsr-53 model that was pretrained on several Mandarin datasets and fine-tuned it on a Mandarin-English code-switching dataset called ASCEND. The model achieved a CER of 24%, which is very close to the 23% CER reported in the paper. Upon closer inspection, I noticed that the model is good at recognizing when the speaker switches language. It also performed extremely well on Mandarin input but lags in accuracy on English input, most likely because the model was initially pretrained on Mandarin.
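For reference, the reported CER can be computed from (reference, hypothesis) pairs with a character-level edit-distance metric; the snippet below uses the jiwer package as one possible tool, and the example pairs are made up for illustration rather than taken from ASCEND.

```python
# Sketch of how CER is computed from model outputs; jiwer is one possible
# tool, and the example pairs below are illustrative, not real ASCEND data.
from jiwer import cer

# (reference, hypothesis) pairs, e.g. collected by running the fine-tuned
# wav2vec2-large-xlsr-53 checkpoint over a held-out ASCEND split.
pairs = [
    ("我今天有一个 meeting", "我今天有一个 meeting"),
    ("这个 project 很 interesting", "这个 project 很 interesting 啊"),
]

references = [ref for ref, _ in pairs]
hypotheses = [hyp for _, hyp in pairs]

# CER is character-level, which is the usual choice for Mandarin and
# Mandarin-English mixed text where word-level WER is ill-defined.
print(f"CER: {cer(references, hypotheses):.1%}")
```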

For next week, I plan to improve the model’s performance on English input through two approaches. The first is to add more purely English data points to the existing dataset. The second is to train the model on SEAME, a much larger and more comprehensive code-switching dataset.

Honghao’s Status Report for 4/2

This week I focused on integrating a code-switching ASR model we trained into the web app and deploying the web app to a public IP address. The integration with our model was fairly smooth. Our own model does not suffer significant performance issues compared to the ASR models I had tested on our app before. When I choose to send and process the ongoing recording in 1-second chunks, the user gets a real-time transcription back with close to a 1-second delay. The model accurately captures instances of language switching within a spoken sentence; however, it suffers the same problem as the earlier models: since we feed the model only one 1-second audio chunk at a time, the transcription accuracy is poor because the context of the spoken sentence is lost. The model can only give its best transcription based on the audio features within that single 1-second chunk.

I also tried cutting the audio into chunks at 300 ms silences, hoping that each silence-separated chunk would encapsulate a more complete local context; however, each such chunk turned out to be 3-4 seconds long, so analyzing these longer chunks would significantly increase the transcription delay and break the real-time experience of our app.
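For reference, this kind of silence-based chunking can be expressed with pydub’s split_on_silence; pydub and the threshold values here are assumptions made for illustration, with only the 300 ms pause length taken from the experiment above.

```python
# Sketch of splitting a recording at >=300 ms pauses (pydub assumed for
# illustration); in the experiment each resulting chunk was 3-4 seconds long,
# so waiting for a chunk to close adds a 3-4 second transcription delay.
from pydub import AudioSegment
from pydub.silence import split_on_silence

recording = AudioSegment.from_wav("session.wav")

chunks = split_on_silence(
    recording,
    min_silence_len=300,                 # a pause of at least 300 ms ends a chunk
    silence_thresh=recording.dBFS - 16,  # "silence" relative to the clip's loudness
    keep_silence=100,                    # keep a little padding at chunk edges
)

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk) / 1000:.1f} s")  # pydub lengths are in ms
```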

I have finished deploying the app to a public IP over HTTP, but I found that most browsers only allow access to user media (including the microphone) on localhost or in a secure HTTPS context, so my next step is to purchase a domain name and finish deploying the app on a public domain over HTTPS.

We are also training another code-switching model, which I will test integrating with our app next week. So far, the development is on schedule. We finished integrating the app with our model, and we are ready to start tuning the model and researching a better chunking mechanism that captures more context in each audio chunk while keeping the chunk as short as possible. We will also collect mixed-language audio samples from YouTube to evaluate our model’s performance on diverse speech features.

Team Status Report for 04/02

LID work this week focused on continued training and on fleshing out interfaces for interacting with the model. Pre-loading and forward-pass methods were introduced to expose the model’s functionality through an importable class available from the LID GitHub repository. The model itself is loaded from a separate online repository (also hosted on GitHub), to which improved versions of the model have been uploaded automatically as training has progressed. Integration and development of the first demo will take up most of the work for the next couple of days, along with beginning to build out the software suite for performing the various tests we prescribed in our design and architecture documents. The model work could be about half a week further ahead than it is, so Nick plans to spend most of the next week focusing solely on these deliverables.

On the web app end, we have integrated a code-switching model trained by Marco and gotten some promising results. The model runs efficiently: when we split the ongoing recording stream into 1-second chunks to feed to the model, it outputs the transcription in close to 1 second, which gives our app a real-time experience. The model accurately captures instances of language switching within a sentence, but since we feed it only one 1-second audio chunk at a time, it can only give its best transcription based on the audio features within that chunk. So far the integration is on schedule. We are ready to start evaluating our models on diverse audio samples from YouTube and tuning them accordingly. We will also incorporate Nick’s LID model to enhance our model accuracy and experiment with other chunking mechanisms that encapsulate more context in an audio chunk while keeping the chunk short.

Nick’s Status Report for 04/02

This week’s work focused on further fleshing out the LID model and how it will interact with the rest of the components of our system. Currently, a version of the model is available to be pulled from a cloud repository, loaded, and run on raw speech utterances to produce a sequence of classifications. I’ve added methods that allow my partners to preemptively load the model into system and CUDA memory (so that we minimize loading time when actually serving a transcription request, since only the data then needs to be moved into and out of memory). I also exposed a method for making the forward call through the network. I anticipate that the interface to the backend language model will continue to be a simple class that can be called from the software API level Tom has been working on. Integration and testing will continue to be our focus for the next couple of weeks. There is work to be done to set up testing frameworks for both accuracy and noise tolerance. In this respect I feel a little behind, but I plan to spend much of the next 3 or 4 days working on it. Delivering our first demo is the next major milestone for the team, so we will need to continue meeting in person or over live Zoom calls to flesh out that integration.
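To show the shape of the interface described above, here is a hypothetical sketch of such a wrapper class; the class name, method names, and checkpoint format (TorchScript) are placeholders rather than the code in the LID repository.

```python
# Hypothetical sketch of an importable LID wrapper: pre-load the checkpoint
# into system/CUDA memory ahead of time, then expose a single forward call
# that turns a raw waveform into a sequence of language classifications.
import torch


class LIDModel:
    def __init__(self, checkpoint_path: str = "lid_checkpoint.pt") -> None:
        self.checkpoint_path = checkpoint_path
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model: torch.nn.Module | None = None

    def preload(self) -> None:
        """Load the model before any request so that at transcription time
        only the audio tensors have to move into and out of GPU memory."""
        self.model = torch.jit.load(self.checkpoint_path, map_location=self.device)
        self.model.eval()

    @torch.no_grad()
    def forward(self, utterance: torch.Tensor) -> torch.Tensor:
        """Run a raw 16 kHz waveform through the network and return the
        per-frame language classification sequence."""
        if self.model is None:
            self.preload()
        logits = self.model(utterance.to(self.device))
        return logits.argmax(dim=-1).cpu()
```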