Honghao’s Status Report for 3/26

This week I focused on fixing the empty-audio issue that occurred when sending an ongoing recording to the backend in chunks, and I further tested our app's performance with larger models.

After researching the cause of the audio chunk transfer issue, I found that the MediaRecorder API automatically inserts header information into the audio byte stream. This header is required by the backend audio processing tool ffmpeg, which means I cannot naively slice the frontend audio bytes, send them in chunks, and expect the backend to handle each chunk successfully, because only the first chunk carries the necessary header information.
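The sketch below illustrates the symptom, assuming the frontend records webm/opus with MediaRecorder and each timesliced chunk has been dumped to a hypothetical chunks/ directory; it only checks for the EBML magic bytes that open every WebM container, which is why ffmpeg can decode the first chunk but not the later ones in isolation.

```python
# Minimal sketch: check which MediaRecorder chunks actually begin with the WebM
# container header. The chunks/chunk_*.webm paths are hypothetical.
from pathlib import Path

EBML_MAGIC = b"\x1a\x45\xdf\xa3"  # magic bytes that open every WebM/Matroska file


def has_webm_header(chunk_path: Path) -> bool:
    """Return True if the chunk starts with the WebM/EBML container header."""
    with open(chunk_path, "rb") as f:
        return f.read(4) == EBML_MAGIC


if __name__ == "__main__":
    for chunk in sorted(Path("chunks").glob("chunk_*.webm")):
        status = "has header" if has_webm_header(chunk) else "no header -> cannot be decoded alone"
        print(chunk.name, status)
```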

As an alternative way to achieve the chunking effect (so that our ML model runs on each audio chunk only once rather than redundantly re-processing the audio from the very beginning of the recording session), I ran experiments to measure the byte-size conversion ratio between the same audio stored in webm format and in wav format. With this ratio, the backend can approximate where to cut the newly converted wav file (which contains all audio since the start of the session) to extract only the newest chunk. Feeding this approximate chunk into the model achieves the same effect as sending true chunks between the client and server.
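A minimal sketch of this approximation, assuming ffmpeg converts the full webm recording to 16 kHz mono 16-bit wav; the names WEBM_TO_WAV_RATIO and extract_new_chunk, and the ratio value itself, are illustrative placeholders rather than our final code.

```python
import subprocess
import wave

WEBM_TO_WAV_RATIO = 10.0  # hypothetical: wav bytes per webm byte, measured empirically


def webm_to_wav(webm_path: str, wav_path: str) -> None:
    """Convert the full webm recording (which still has its header) to 16 kHz mono wav."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", webm_path, "-ar", "16000", "-ac", "1", wav_path],
        check=True,
    )


def extract_new_chunk(wav_path: str, prev_webm_bytes: int) -> bytes:
    """Cut the wav file near the point where the previously processed audio ended."""
    with wave.open(wav_path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
        bytes_per_frame = wav.getsampwidth() * wav.getnchannels()
    # Approximate how many wav bytes correspond to the webm bytes already processed.
    offset = int(prev_webm_bytes * WEBM_TO_WAV_RATIO)
    offset -= offset % bytes_per_frame  # align the cut to a whole sample
    return frames[offset:]  # only the newest, not-yet-transcribed audio
```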

After the fix, the system runs fairly efficiently and provides an almost real-time transcription experience. I further tested the system's run speed with larger models.

For large-model testing, I continued using “jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn” as the Mandarin ASR model and tried the new, larger “facebook/wav2vec2-large-960h-lv60-self” English ASR model. This verifies that our current deployment and chunking strategy is feasible for deploying our own model.
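For reference, a rough timing harness along these lines (not our exact test script) can compare the two larger Hugging Face models on a single clip; the sample_16k.wav test file is a hypothetical 16 kHz recording.

```python
import time
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODELS = [
    "facebook/wav2vec2-large-960h-lv60-self",              # larger English ASR model
    "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn",  # Mandarin ASR model
]

speech, sample_rate = sf.read("sample_16k.wav")  # hypothetical 16 kHz mono test clip

for name in MODELS:
    processor = Wav2Vec2Processor.from_pretrained(name)
    model = Wav2Vec2ForCTC.from_pretrained(name)
    inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")
    start = time.time()
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (1, T, vocab)
    ids = torch.argmax(logits, dim=-1)
    text = processor.batch_decode(ids)[0]
    print(f"{name}: {time.time() - start:.2f}s -> {text}")
```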

Next week, I expect to receive the first trained version of the language detection model from my teammate Nicolas, and I will begin integrating language detection into the deployed app.

Currently, on the web application side, there has not been any major timeline delay. Potential risks include that our own trained models might run much slower than the models I have tested so far. We also need further research into how to join multiple models' logits together.
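One possible direction, sketched below under assumptions we still need to verify: use the LID model's frame-level decision to choose, per frame, which ASR model's output token to keep. This assumes all three models produce frame-aligned outputs of the same length; the function name and shapes are illustrative, not our final design.

```python
import torch


def route_tokens(lid_logits: torch.Tensor,
                 en_logits: torch.Tensor,
                 zh_logits: torch.Tensor) -> list[tuple[str, int]]:
    """Return, per frame, ('en' | 'zh', token_id) chosen by the LID argmax.

    lid_logits: (T, 2) frame scores, column 0 = English, column 1 = Mandarin
    en_logits:  (T, V_en) English CTC logits
    zh_logits:  (T, V_zh) Mandarin CTC logits
    """
    lang = torch.argmax(lid_logits, dim=-1)   # (T,) 0 = English, 1 = Mandarin
    en_ids = torch.argmax(en_logits, dim=-1)  # (T,)
    zh_ids = torch.argmax(zh_logits, dim=-1)  # (T,)
    return [("en", int(en_ids[t])) if lang[t] == 0 else ("zh", int(zh_ids[t]))
            for t in range(lang.shape[0])]
```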

Another concerning observation from the chunking work is that the model predicts each chunk in isolation and therefore does not achieve high transcription accuracy at the sentence level. This is a problem we need to discuss as a team.

Team Status Report for 3/26

This week, training began successfully on the LID model using the data we intended. It completed its first epoch in roughly 12 hours with promising initial predictions. The next step will be integrating the model with Marco’s ASR module. The current model is about 1 GB uncompressed, so we anticipate that meeting our size requirements will begin to be a challenge. Over the next week we may explore ways to quantize or otherwise compress the model, as well as testing and training on noisy data using techniques like SpecAugment. Actual inference times appear to be around one second, which is a promising result with respect to our timing targets.
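One compression option we may try is dynamic quantization, sketched below under the assumption that the LID model is a wav2vec2-style CTC model loaded from a local checkpoint (the "our-lid-checkpoint" path is hypothetical); whether the accuracy loss is acceptable still needs to be tested.

```python
import torch
from transformers import Wav2Vec2ForCTC

# Hypothetical local checkpoint directory for the trained LID model.
model = Wav2Vec2ForCTC.from_pretrained("our-lid-checkpoint")

# Quantize Linear layers to int8 weights; activations stay in float.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

torch.save(quantized.state_dict(), "lid_quantized.pt")
```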

On the web app side, we are now able to analyze audio in chunks, which greatly improves the system's run speed. At the user level, this creates a near real-time voice-to-text experience. We also tested the system's speed on larger ASR models and still achieved near real-time transcription.

Next week, we expect to have a trained language detection model, which we can begin integrating into the deployed web app. So far, development progress is on schedule. Potential risks include the training of the two languages' CTC models taking longer than expected to reach workable results, which could delay our integration timeline.

Nick’s Status Report for 3/26

This week training began in earnest. The model completed its first epoch of training in about 12 hours. Preliminary values of its accuracy metric show it sitting around 80% WER. This is not the ideal metric for this use case, however, and after updating the metric I anticipate a much better classification error rate. The model's outputs so far have looked exceedingly reasonable, so I feel good about its ability to be integrated with Marco's model soon. Based on visual comparisons between golden transcriptions and predicted transcriptions, the model exclusively emits M or E tokens for Mandarin or English respectively, never printing an ‘UNK’ character indicating confusion, which I find promising. Next week, work will focus on continuing to train the model, integrating it with Marco's work, and exposing it for Tom's work on web development. I feel I am currently on schedule with the model's progress. I may request additional limit allowances to speed up training, but so far the available data appears at least sufficient for basic language detection. I do not anticipate major blockages from here.
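A small sketch of the classification-style metric I plan to switch to, assuming golden and predicted transcriptions are strings of per-token language labels ('M' for Mandarin, 'E' for English) and using a naive position-based alignment; this is illustrative, not the final metric code.

```python
def language_classification_error(gold: str, pred: str) -> float:
    """Fraction of aligned positions where the predicted language label is wrong."""
    n = min(len(gold), len(pred))  # naive alignment by position
    if n == 0:
        return 0.0
    wrong = sum(1 for g, p in zip(gold[:n], pred[:n]) if g != p)
    return wrong / n


# Example: gold "MMMEE" vs. pred "MMEEE" -> one mismatch out of five -> 0.2 error rate
print(language_classification_error("MMMEE", "MMEEE"))
```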

Nick’s Status Report for 3/19

Since the last update, I’ve finished writing all the infrastructure needed for complete training of our LID and ASR models on the SEAME dataset. Due to a unique directory configuration, multiple sets of labels, and un-split test and training datasets, this all needed to be accomplished through an indexing process. I completed a script that indexes and stores all of the labels associated with each file and each possible utterance within each file. The resulting index is about 10 MB and can be loaded extremely quickly. Using this index, we can now create separate data loaders for the separate tasks, which label our data as needed for each application. The index also separates the data into training and test sets: one test set is biased towards Mandarin and one is biased towards English. I also completed an implementation of a data collator, which is used to make training as efficient as possible during forward passes of batches. Training now continues on the LID model end-to-end.
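A simplified sketch of the collator idea follows, padding audio inputs and label sequences separately so the CTC loss ignores padded label positions; it assumes a wav2vec2-style processor and omits the SEAME-index handling that our actual collator also performs.

```python
from dataclasses import dataclass

import torch
from transformers import Wav2Vec2Processor


@dataclass
class SimpleCTCCollator:
    processor: Wav2Vec2Processor

    def __call__(self, features):
        # features: list of dicts with "input_values" (raw audio) and "labels" (token ids)
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        # Pad the raw audio and the label sequences to the longest item in the batch.
        batch = self.processor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(label_features, padding=True, return_tensors="pt")

        # Replace padding token ids with -100 so they are ignored by the CTC loss.
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch
```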

I’m on schedule for delivering the working LID model. This week will mostly be about supporting and continuing training. There will also be a small amount of coding needed to set up running both models jointly (combining their outputs, etc.).

Team Status Report for 3/19

This week on the web app development side, we verified our deployment strategy by deploying and test-running an English ASR model and a Mandarin ASR model on an AWS server. We were able to make recordings in our frontend app, which sends the audio data to the deployment server and receives the transcription outputs from the server as shown below.

Since these are only pretrained basic ASR models with no additional training, the transcription accuracy is not very high, but this verifies that our current deployment strategy is feasible for deploying our own model.

In the process of the test run, we noticed an issue with the audio chunks received on the server: every chunk after the first arrives empty. While we research a proper fix, we are applying a workaround in which the frontend sends the entire audio from the beginning of the recording session in every request, which removes the issue. However, this workaround sacrifices app performance, because as we decrease the audio sending period, the model has to run on more and more audio data. For example, with a 1-second sending period, the model must process a 1 s audio, then a 2 s audio, and so on up to an n-second audio for an n-second session, for a total of 1 + 2 + ... + n = n(n+1)/2 seconds of audio, i.e. O(n^2); if we could correctly send n 1-second chunks instead, the total length of audio through the model would only be n seconds, i.e. O(n).

Next week, we will work on fixing the empty audio file issue and deploying a pretrained language detection model on our server so that it can decide, for each audio frame, which language model should be used to transcribe that frame and thereby achieve code-switching.

On the web app side, we do not expect the empty audio chunk issue to majorly delay our schedule. Potential risks to the deployment schedule mainly lie in possible differences in required environment and model complexity between the two ASR models tested and our own model.

On the model training side, we have finished the data loader for the SEAME dataset (the English-Mandarin code-switching dataset), and we are still training the language detection model, the English CTC model, and the Mandarin CTC model.

Honghao’s Status Report for 3/19

This week I focused on researching and deploying pretrained ASR models to verify that our current deployment strategy works. After successfully deploying a pretrained English ASR model and a pretrained Mandarin ASR model, I noticed while testing them a critical issue that slows down our app's performance, which I am still trying to resolve.

The ASR models I tested were from the Hugging Face repositories “facebook/wav2vec2-base-960h” (an English ASR model) and “jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn” (a Mandarin ASR model). The deployment server was able to run our application, load an ASR model, take audio data sent from the frontend through HTTP requests, write the audio data into .wav format, feed the .wav file into the model, and send the predicted text back to the frontend, as demonstrated in the screenshots below. Since this is only a pretrained basic ASR model with no additional training, the transcription accuracy is not very high, but it verifies that our current deployment strategy is feasible for deploying our own model.
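A condensed sketch of this endpoint flow is shown below, assuming the frontend POSTs the recorded audio bytes to a transcription URL; the view name, URL, and the transcribe() helper (which stands in for the Hugging Face model call) are hypothetical, not our exact implementation.

```python
import subprocess
import tempfile

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt


@csrf_exempt
def transcribe_view(request):
    """Accept posted audio bytes, convert them to wav, run ASR, return the text."""
    if request.method != "POST":
        return JsonResponse({"error": "POST expected"}, status=405)

    # Save the uploaded webm bytes, then let ffmpeg rewrite them as 16 kHz mono wav.
    with tempfile.NamedTemporaryFile(suffix=".webm") as webm_file, \
         tempfile.NamedTemporaryFile(suffix=".wav") as wav_file:
        webm_file.write(request.body)
        webm_file.flush()
        subprocess.run(
            ["ffmpeg", "-y", "-i", webm_file.name, "-ar", "16000", "-ac", "1", wav_file.name],
            check=True,
        )
        text = transcribe(wav_file.name)  # hypothetical helper wrapping the wav2vec2 model

    return JsonResponse({"transcription": text})
```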

Similarly, I successfully deployed a Mandarin ASR model.

Along the way, I encountered two issues, one of which has been resolved while the other still requires further research. The first issue was that the running Django server would automatically stop itself after a few start/stop recording sessions. Through research, I found that the issue was caused by a type of automatic request made by a Django module, so I wrote a custom Django middleware that checks for and drops such requests.
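A skeleton of that middleware approach is sketched below; the exact request being filtered is project-specific, so the path predicate here is a hypothetical placeholder. The class is registered in settings.MIDDLEWARE like any other Django middleware.

```python
from django.http import HttpResponse


class DropUnwantedRequestsMiddleware:
    """Short-circuits the automatic requests that were destabilizing the dev server."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if self._should_drop(request):
            return HttpResponse(status=204)  # swallow the request with an empty response
        return self.get_response(request)

    def _should_drop(self, request) -> bool:
        # Placeholder predicate: match the problematic request by its path prefix.
        return request.path.startswith("/unwanted-prefix/")  # hypothetical path
```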

The second issue was that, as the frontend streamed audio chunks to the server, every chunk after the first resulted in an empty .wav file. I am still researching the cause of and solution to this issue. For now, I changed the frontend logic to send the entire audio track from the beginning of a start/stop session up to the current time, instead of sending chunks, which removes the issue. The drawback of this workaround is that the model needs to run on the entire audio from the beginning of the recording session instead of only on the newest chunk. Moreover, as we decrease the audio sending period, the model has to run on more and more audio data. For example, with a 1-second sending period, the model must process a 1 s audio, then a 2 s audio, and so on up to an n-second audio, so the total length of audio through the model for an n-second recording session is O(n^2), whereas if we could correctly send n 1-second chunks, the total length of audio through the model would only be O(n).

Next week, I will work on fixing the empty audio file issue and deploying a pretrained language detection model on our server so that it can decide, for each audio frame, which language model should be used to transcribe that frame and thereby achieve code-switching.

Currently, on the web application side, there has not been any major timeline delay. I expect the empty audio file issue to be resolved within the next week. Potential risks include our own models having environment or performance requirements that differ from the models I have tested so far, in which case the development effort for deployment and integration would be larger.