Team’s Status Report for 4/16

On the web app side, we focused on enhancing the accuracy of output transcriptions through autocorrect libraries and on building web app features that demonstrate the effects of multiple approaches: periodically resending the last x-second audio for re-transcription, resending the entire audio for re-transcription at the end of a recording session, and chunking the audio by silence gaps with different silence-chunking parameters.

We switched the app’s transcription model to Nick’s newly trained model, which shows significantly higher English transcription accuracy; however, transcription in both languages is still imperfect, with some misspelled English words and nonsensical Chinese characters. So, aside from continuing our model training, we are looking for autocorrect libraries that can correct the model’s output text. The main challenge with existing autocorrect packages is that most of them (e.g., the autocorrect library in Python) only handle input in a single language well, so we are experimenting with segmenting the text into purely English substrings and purely Chinese substrings and running autocorrect on each substring separately.
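A minimal sketch of this segmentation idea is shown below. It assumes the Python autocorrect package's Speller for the English runs and simply passes the Chinese runs through unchanged, since we have not yet chosen a Chinese-side corrector; the function name is illustrative, not code from our app.

```python
# Sketch of segmenting mixed Chinese/English text and spell-correcting
# only the English runs; the Chinese runs are passed through for now.
import re
from autocorrect import Speller

speller_en = Speller(lang="en")

# Split the text into alternating runs of CJK characters and everything else.
SEGMENT_RE = re.compile(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+")

def autocorrect_mixed(text: str) -> str:
    corrected = []
    for segment in SEGMENT_RE.findall(text):
        if re.match(r"[\u4e00-\u9fff]", segment):
            corrected.append(segment)               # Chinese run: leave unchanged
        else:
            corrected.append(speller_en(segment))   # English run: spell-correct
    return "".join(corrected)

print(autocorrect_mixed("我今天 haev a meetng 在下午"))
```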

We also integrated all three re-transcription approaches we tried earlier into our web app, with an entry point for each, so that during the final demo we can show their effects to our audience.

Next week we will finish integrating an autocorrection library and also look for ways to map our transcription onto a limited vocabulary space. Together, these two steps should eliminate the non-English words our app currently generates. If time allows, we will also add a link on our web app that jumps to Google Translate with our code-switching transcription, so that audience members who do not understand Chinese can follow the transcription during our demo.

On the modeling side, we are currently training a final iteration of the fused LID-ASR model, which shows promise of giving the best performance we have seen so far. Earlier this week we also trained a model that improved our CER and WER metrics across all evaluation subsets.

Honghao’s Status Report for 4/16

This week, I focused on researching ways to enhance the accuracy of output transcriptions and on building web app features that demonstrate the effects of multiple approaches: periodically resending the last x-second audio for re-transcription, resending the entire audio for re-transcription at the end of a recording session, and chunking the audio by silence gaps with different silence-chunking parameters.

I switched the transcription model to Nick’s newly trained model, which shows significantly higher English transcription accuracy; however, transcription in both languages is still imperfect, with some misspelled English words and nonsensical Chinese characters, so I am researching approaches to autocorrect the text. The main challenge with existing autocorrect packages is that most of them (e.g., the autocorrect library in Python) only handle input in a single language well, so I am experimenting with segmenting the text into purely English substrings and purely Chinese substrings and running autocorrect on each substring separately.

I also integrated all three re-transcription approaches we tried earlier into our web app, with an entry point for each, so that during the final demo we can show their effects to our audience.

Next week I will continue experimenting with autocorrection libraries and also look for ways to map our transcription onto a limited vocabulary space. I am a little pressed for time on getting the silence-chunking page ready because I am still seeing duplicated transcription chunks, but I should be able to fix this before next Monday. If time allows, I will also add a link on our web app that jumps to Google Translate with our code-switching transcription, so that audience members who do not understand Chinese can follow the transcription during our demo.

Team Status Report for 04/10

On the web app development end, we created two new features that could increase transcription accuracy while preserving a relatively good user experience. We compared the two features on speed and accuracy, and each wins in one aspect.

The first feature resends the entire audio to the model for re-evaluation after the user stops recording. We then present the re-evaluated transcription below the original transcription and let the user choose whether to use or ignore it, thereby providing a whole-transcription-level autocorrection suggestion.

The second feature resends the last 3 seconds of audio to the model for re-evaluation every 3 seconds while the recording is still happening. When the re-evaluated output for that 3-second window comes back, we replace the original transcription for that piece of audio, achieving an autocorrecting effect similar to Siri's.

We see that the whole-audio transcription from the first feature is significantly more accurate than both the original real-time output and the 3-second-chunk “autocorrection” from the second feature; however, if a recording is very long, the first feature's “autocorrection” can take a long time (approximately as long as the audio itself) to be returned by our model. In comparison, the second feature improves transcription accuracy only slightly but sacrifices almost no throughput: with re-transcription and autocorrection every 3 seconds, the user experience remains smooth and close to real-time.

Our next step is to further improve transcription accuracy by incorporating Nick’s newly trained ASR model (trained on a larger dataset, which should yield higher transcription accuracy) and by evaluating our model with more diverse audio samples collected through Marco’s audio collection site. Integration and evaluation are on schedule so far.

Honghao’s Status Report for 4/10

This week, I focused on experimenting with two different approaches to improve transcription accuracy while maintaining as good a user experience as possible.

The first approach continues to provide real-time transcription by feeding the model 1-second audio chunks. What’s new is that at the end of each recording session I rerun the model on the entire audio, which can take a significant amount of time (roughly as long as the audio itself) but produces a significantly more accurate transcription that acts as an “autocorrection” for the whole session. I present the whole-audio output below the real-time generated text and let the user decide whether to use or ignore the “autocorrection”. The experience is shown below:

The second approach also continues to provide real-time transcription from 1-second audio chunks, but while the user is recording, every 3 seconds I resend the last 3 seconds of audio to the model for re-evaluation and replace the last 3 seconds’ real-time generated text with the “autocorrected” text. This experience is closer to the autocorrecting transcription experience provided by Siri. The “autocorrected” transcription for 3-second chunks is more accurate than the original transcription from 1-second chunks, but, as expected, it is less accurate than running the model over the entire audio (approach 1).
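Below is a minimal server-side sketch of how the two approaches fit together, assuming 16 kHz mono float32 audio and a transcribe(samples) helper that wraps our wav2vec2 model; the class and method names are illustrative, not our actual code.

```python
# Sketch: keep a rolling buffer of samples, transcribe each 1-second chunk
# for the real-time text, re-transcribe the last 3 seconds periodically
# (approach 2), and re-transcribe everything at session end (approach 1).
import numpy as np

SAMPLE_RATE = 16_000
WINDOW_SEC = 3

class RollingTranscriber:
    def __init__(self, transcribe):
        self.transcribe = transcribe
        self.samples = np.zeros(0, dtype=np.float32)

    def add_chunk(self, chunk: np.ndarray) -> str:
        """Append a ~1-second chunk and return its quick transcription."""
        self.samples = np.concatenate([self.samples, chunk])
        return self.transcribe(chunk)

    def recorrect_last_window(self) -> str:
        """Re-transcribe the last 3 seconds; the caller replaces that span of text."""
        window = self.samples[-WINDOW_SEC * SAMPLE_RATE:]
        return self.transcribe(window)

    def finalize(self) -> str:
        """Re-transcribe the whole recording at session end (approach 1)."""
        return self.transcribe(self.samples)
```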

The next step is to incorporate Nick’s newly trained transcription model (trained on the large SEAME dataset), which should improve transcription accuracy, especially for English. The model runs relatively smoothly. Also, once Marco’s site has collected sufficient audio samples, I will run batch tests on our model for a more rigorous evaluation with more diverse input audio.

So far, the integration and evaluation work is on schedule.

Marco’s Status Report for 4/9

This week I continued my efforts to improve the performance of the model. I attempted to train with a dataset that included more English recordings, but this approach was not very effective: the model's overall performance actually regressed, with no particular improvement in the transcription of English sentences. The second approach, training a model on SEAME, is still underway.

The biggest obstacle in this task, I believe, is not the model but the lack of resources. While SEAME is by far the most comprehensive code-switching dataset, I have concerns about how well it will suit our task. In particular, the speakers are all from Malaysia and Singapore, and their accents in both English and Chinese differ drastically from the code-switching speech of Mainland Chinese and Hong Kong speakers. This difference may result in lackluster performance during our final demo. Another argument supporting my suspicion lies in a paper published by Tencent AI Lab: their experiment used a model similar to the one we are training, but with over 1000 hours of code-switching speech for training, and reached an impressive 7.6% CER.

To combat the resource problem, I wrote a website that aims to gather code-switching speech from classmates and peers. For next week, I plan to finalize and deploy the website by Monday and use the speech gathered from it to continue training the model.

Marco’s Status Report for 4/2

This week I mainly focused on improving the performance of the model. Following the approach outlined in a research paper, I started with a wav2vec2-large-xlsr-53 model that was pretrained on several Mandarin datasets and fine-tuned it on a Mandarin-English code-switching dataset called ASCEND. The model achieved a CER of 24%, which is very close to the 23% CER reported in the research paper. Upon closer inspection, I noticed that the model is good at recognizing when the speaker switches language. Also, the model performed extremely well on Mandarin inputs but is lacking in accuracy on English inputs, most likely because the model was initially pretrained on Mandarin.
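For reference, the sketch below shows how CER can be computed for mixed Mandarin/English output: character-level Levenshtein distance divided by the reference length. This is a generic reimplementation for illustration, not the exact evaluation script behind the 24% figure.

```python
# Character error rate: edit distance over characters / reference length.
def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(cer("我今天有一个meeting", "我今天有一个meting"))  # one deleted character
```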

For next week, I plan to improve the model’s performance on English inputs through two approaches. The first is to add more purely English data points to the existing dataset. The second is to train the model on SEAME, a much larger and more comprehensive code-switching dataset.

Honghao’s Status Report for 4/2

This week I focused on integrating a code-switching ASR model we trained into the web app and deploying the web app to a public IP address. The integration with our model was fairly smooth, and our own model does not suffer significant performance issues compared to the ASR models I previously tested on our app. When I choose to send and process the ongoing recording in 1-second chunks, the user gets a real-time transcription back with close to a 1-second delay. The model accurately captures instances of language switching within a spoken sentence; however, it suffers the same problem as the earlier models: since we feed the model only one 1-second audio chunk at a time, transcription accuracy is poor because the context of the spoken sentence is lost. The model can only give its best transcription based on the audio features within that 1-second chunk.
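For context, a minimal sketch of running a wav2vec2 CTC model on a single 1-second chunk is shown below. It assumes 16 kHz mono float32 audio and uses one of the public checkpoints mentioned elsewhere in these reports as a stand-in for our own model.

```python
# Sketch: transcribe one short audio chunk with a wav2vec2 CTC checkpoint.
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL_NAME = "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME).eval()

def transcribe_chunk(samples: np.ndarray, sample_rate: int = 16_000) -> str:
    # Convert raw samples to model inputs, run the forward pass, and decode.
    inputs = processor(samples, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]
```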

I also tried cutting the audio into chunks at 300 ms silence gaps, hoping that each silence-separated chunk would encapsulate a more complete local context; however, each silence-separated chunk is 3-4 seconds long, so analyzing these longer chunks each time would significantly increase transcription delay and break the real-time experience of our app.
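A small sketch of this silence-gap experiment is shown below, assuming the audio has already been converted to WAV and that pydub is available; the 300 ms gap and the threshold relative to average loudness are the kinds of parameters I tuned.

```python
# Sketch: split a recording into speech chunks separated by >=300 ms of silence.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("session.wav")
chunks = split_on_silence(
    audio,
    min_silence_len=300,              # treat >=300 ms of quiet as a gap
    silence_thresh=audio.dBFS - 16,   # "quiet" relative to average loudness
    keep_silence=100,                 # keep a little padding around each chunk
)
for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i}.wav", format="wav")  # each chunk is sent to the model
```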

I have finished deploying the app to a public HTTP IP address, but I found that most browsers only allow access to user media (including the microphone) on localhost or in a secure HTTPS context, so my next step is to purchase a domain name and finish deploying the app on a public domain over HTTPS.

We are also training another code-switching model, which I will test integrating with our app next week. So far, development is on schedule. We have finished integrating the app with our model, and we are ready to start model tuning and to research a better chunking mechanism that captures more context per audio chunk while keeping each chunk as short as possible. We will also collect mixed-language audio samples from YouTube to evaluate our model’s performance on diverse speech features.

Team Status Report for 04/02

LID work this week focused on continued training and on fleshing out interfaces for interacting with the model. Pre-loading and forward-pass methods were introduced to expose the model’s functionality through an importable class available from the LID GitHub repository. The model itself is loaded from a separate online repository (also hosted on GitHub), to which improved versions of the model have been automatically uploaded as training has progressed. Integration and development of the first demo will take up most of the work for the next couple of days, along with beginning to build out the software suite for performing the various tests we prescribed in our design and architecture documents. The model could be about half a week further ahead, so Nick plans to spend most of next week focusing solely on these deliverables.

On the web app end, we have integrated a code-switching model trained by Marco and obtained some promising results. The model runs efficiently: when we split the ongoing recording stream into 1-second chunks and feed them to the model, it outputs the transcription in close to 1 second, which gives our app a real-time experience. The model accurately captures instances of language switching within a sentence, but since we feed it only one 1-second audio chunk at a time, it can only give its best transcription based on the audio features within that chunk. So far the integration is on schedule. We are ready to start evaluating our models on diverse audio samples from YouTube and tuning them accordingly. We will also incorporate Nick’s LID model to enhance accuracy and experiment with other chunking mechanisms that encapsulate more context per audio chunk while keeping each chunk short.

Nick’s Status Report for 04/02

This week’s work focused on further fleshing out the LID model and how it will interact with the rest of the components of our system. Currently a version of the model is available to be pulled from a cloud repository, loaded, and run on raw speech utterances to produce a sequence of classifications. I’ve added methods that allow my partners to preemptively load the model into system and CUDA memory (so that we can minimize loading times when actually serving a transcription request, since only the data then needs to be moved into and out of memory). I also exposed a method for making the forward call through the network. I anticipate the interface to the backend language model will continue to be just a simple class which can be called from the software API level that Tom’s been working on. Integration and testing will continue to be our focus for the next couple of weeks. There is work to be done to set up testing frameworks for both accuracy and noise tolerance. In this respect I feel a little behind, but I plan to spend much of the next 3 or 4 days working on this. Delivering our first demo is the next major milestone for the team, so we will need to continue meeting in person or over live Zoom calls to flesh out that integration.
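An illustrative sketch of what such an importable wrapper might look like is shown below; the class and method names are hypothetical and do not reflect the actual interface in the LID repository.

```python
# Hypothetical sketch of a pre-loadable LID wrapper with an exposed forward call.
import torch

class LIDModel:
    def __init__(self, checkpoint_path: str, device: str = "cuda"):
        self.device = torch.device(device if torch.cuda.is_available() else "cpu")
        self.checkpoint_path = checkpoint_path
        self.model = None

    def preload(self):
        """Load weights into system and GPU memory ahead of the first request."""
        self.model = torch.load(self.checkpoint_path, map_location=self.device)
        self.model.eval()

    @torch.no_grad()
    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        """Run the forward pass on a raw utterance; returns per-frame language logits."""
        if self.model is None:
            self.preload()
        return self.model(waveform.to(self.device))
```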

Honghao’s Status Report for 3/26

This week I focused on fixing the empty-audio issues that occurred when sending the ongoing recording to the backend in chunks, and on further testing our app’s performance with larger models.

After researching the cause of the audio chunk transfer issue, I found that the MediaRecorder API automatically inserts header information into the audio bytes. This header information is crucial for the backend audio processing tool ffmpeg to work properly, which means I cannot naively slice the frontend audio bytes, send them in chunks, and expect the backend to handle each chunk successfully, because not every chunk would contain the necessary header information.

As an alternative strategy for achieving the chunking effect (so that our ML model only runs on each audio chunk once, instead of redundantly re-running on the audio from the very beginning of the recording session), I experimented with the ratio between the byte sizes of the same audio in WebM format and in WAV format. With this conversion ratio, I can approximate on the backend where to cut the new WAV file (which contains all audio data since the start of the recording session) to obtain only the newest audio chunk. Feeding this approximate new chunk into our model achieves the same effect as sending chunks between the client and server.
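A rough sketch of this approximation is shown below; the ratio value and WAV header size are placeholders, and the real code estimates the ratio empirically from previously converted recordings.

```python
# Sketch: approximate where the previously processed audio ends inside the
# full WAV file, using a measured WebM-to-WAV byte-size ratio (placeholder value).
WAV_HEADER_BYTES = 44            # standard PCM WAV header size
WEBM_TO_WAV_RATIO = 10.0         # placeholder: measured bytes(wav) / bytes(webm)

def newest_chunk(wav_bytes: bytes, prev_webm_bytes: int) -> bytes:
    """Return the approximate slice of the full WAV that corresponds to the
    newly received WebM bytes, so the model only processes the new audio."""
    # Estimate how far into the WAV data the previously sent chunks reached.
    approx_offset = WAV_HEADER_BYTES + int(prev_webm_bytes * WEBM_TO_WAV_RATIO)
    # Align to 16-bit samples so we never cut a sample in half.
    approx_offset -= approx_offset % 2
    return wav_bytes[approx_offset:]
```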

After the fix, the system runs fairly efficiently and achieves an almost real-time transcription experience. I then further tested the system’s speed with larger models.

For large-model testing, I continued using “jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn” as the Mandarin ASR model and tried the larger “facebook/wav2vec2-large-960h-lv60-self” English ASR model. These tests verify that our current deployment and chunking strategy is feasible for deploying our own model.

Next week, I expect to get the first trained version of the language detection model from my teammate Nicolas, and I will begin integrating language detection into the deployed app.

On the web application side, there has not been any major timeline delay so far. One potential risk is that our own trained models might run much slower than the models I have tested so far. We also need further research into how to join multiple models’ logits together.

Another concerning observation from the chunking work is that the model predicts each chunk only suboptimally and does not achieve high transcription accuracy at the sentence level. This is a problem we need to discuss as a team to find a solution.