Team Status Report for 3/26

This week, training began successfully on the LID model using the data we intended. The model completed its first epoch in roughly 12 hours and showed promising initial predictions. The next step will be integrating the model with Marco’s ASR module. The current model is about 1 GB uncompressed, so we anticipate that meeting our size requirements will be a challenge. Over the next week we may explore ways to quantize or otherwise compress the model, as well as test and train on noisy data using techniques like SpecAugment. Inference times appear to be around one second, which is promising with respect to our timing targets.
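One low-effort compression option to explore first is PyTorch’s dynamic quantization; below is a minimal sketch of the idea (the checkpoint path and loading call are placeholders, not our actual training code):

```python
import torch

# Placeholder: load the trained LID model (the real checkpoint and
# loading path may differ).
model = torch.load("lid_model.pt", map_location="cpu")
model.eval()

# Dynamically quantize Linear layers to int8 weights. Activations stay
# in float, so no calibration pass is required.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Compare on-disk sizes to estimate savings against the ~1 GB model.
torch.save(quantized.state_dict(), "lid_model_int8.pt")
```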

On the web app side, we implemented chunked analysis of the audio, which greatly improved the system’s run speed. At the user level, this creates a real-time voice-to-text experience. We also tested our system’s speed with larger ASR models and still achieved near-real-time transcription.

Next week, we expect to have a trained language detection model, which we can start integrating into our deployed web app. So far, development progress is on schedule. One potential risk is that training the two languages’ CTC models to workable results may take longer than expected, which could delay our integration timeline.

Nick’s Status Report for 3/26

This week training began in earnest. The model completed its first epoch of training in roughly 12 hours. Preliminary values of its accuracy metric show it sitting at around 80% WER. WER is not the ideal metric for this use case, however, and after updating the metric I expect it to show a much better classification error rate. The outputs of the model so far have looked exceedingly reasonable, so I feel good about its ability to be integrated with Marco’s model soon. Based on visual comparisons between golden transcriptions and predicted transcriptions, the model exclusively emits M or E tokens (for Mandarin and English respectively), never printing an ‘UNK’ token indicating confusion. I find this promising. Next week, work will focus on continuing to train the model, integrating it with Marco’s work, and exposing it for Tom’s work on web development. I feel that I am currently on schedule with the model’s progress. I may request additional limit allowances to speed up training, but so far it appears the data available is at least sufficient for basic language detection. I do not anticipate major blockers from here.
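As a sketch of the classification-style metric I have in mind (a toy version that assumes the golden and predicted token sequences are already aligned one-to-one; real evaluation would need an alignment step for length mismatches):

```python
def lid_error_rate(golden: str, predicted: str) -> float:
    """Token-level error rate over aligned LID sequences, where each
    position is an M (Mandarin) or E (English) token."""
    assert len(golden) == len(predicted), "sequences must be aligned"
    mismatches = sum(g != p for g, p in zip(golden, predicted))
    return mismatches / len(golden)

# Example: one wrong token out of four -> 0.25 error rate.
assert lid_error_rate("MMEE", "MEEE") == 0.25
```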


Nick’s Status Report for 3/19

Since the last update, I’ve finished writing all the infrastructure needed for complete training of our LID and ASR models on the SEAME dataset. Due to a unique directory configuration, multiple sets of labels, and un-split test and training data, this all needed to be accomplished through an indexing process. I completed a script that indexes and stores all of the labels associated with each file and each possible utterance within each file. The resulting index is only about 10 MB and loads extremely quickly. Using this index, we can now create separate data loaders for the separate tasks, each capable of labelling our data as needed for its application. The index also separates the data into training and test sets: one test set is biased towards Mandarin while the other is biased towards English. I also completed an implementation of a data collator, which is used to make forward passes over batches as efficient as possible during training. Training now continues on the LID model end-to-end.
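For reference, the collator’s role can be sketched in a few lines (a simplified version; our real implementation works with the SEAME index fields, and the -100 padding value follows the common CTC convention of ignoring padded label positions):

```python
import torch

def collate_batch(samples):
    """Pad variable-length waveforms and label sequences so a batch
    can be stacked into single tensors for one forward pass."""
    waves = [s["input_values"] for s in samples]
    labels = [s["labels"] for s in samples]

    batch_waves = torch.zeros(len(samples), max(w.size(0) for w in waves))
    batch_labels = torch.full(
        (len(samples), max(l.size(0) for l in labels)), -100, dtype=torch.long
    )
    for i, (w, l) in enumerate(zip(waves, labels)):
        batch_waves[i, : w.size(0)] = w
        batch_labels[i, : l.size(0)] = l
    return {"input_values": batch_waves, "labels": batch_labels}
```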

I’m on schedule for delivering the working LID model. This week will mostly be about supporting and continuing training. There will also be a small amount of coding needed to set up running both models jointly (combining their outputs, etc.).

Team Status Report for 3/19

This week on the web app development side, we verified our deployment strategy by deploying and test-running an English ASR model and a Mandarin ASR model on an AWS server. We were able to make recordings in our frontend app, which sends the audio data to the deployment server and receives the transcription outputs from the server, as shown below.

Since these are only basic pretrained ASR models with no additional training, the transcription accuracy is not very high, but this verifies that our current deployment strategy is feasible for deploying our own model.

During the test run, we noticed an issue with the audio chunks received on the server: every chunk after the first arrives empty. While we research a proper fix, we are applying a workaround in which the frontend sends the entire audio recorded since the beginning of the session in every request, which removes the issue. However, this workaround sacrifices app performance: as we decrease the audio sending period, the model has to run on more and more audio data. For example, if the sending period is 1 second, the model must run on a 1-second clip from the beginning of the session, then a 2-second clip, ..., then an n-second clip. The total length of audio through the model for an n-second recording session is therefore O(n²), whereas if we could correctly send n 1-second chunks, it would be only O(n).
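To spell out the arithmetic: an n-second session sends clips of lengths 1, 2, ..., n seconds, so the total audio pushed through the model is

$$\sum_{k=1}^{n} k = \frac{n(n+1)}{2} \in O(n^2).$$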

Next week, we will work on fixing the empty audio file issue and on deploying a pretrained language detection model on our server, so that the language detection model can decide, for each audio frame, which language model should be used to transcribe the audio, achieving code-switching.

On the web app side, we do not expect the empty audio chunk issue to significantly delay our schedule. The main risks to the deployment schedule lie in differences in required environments and model complexity between the two tested ASR models and our own model.

On the model training side, we have finished the data loader for the SEAME dataset (the English-Mandarin code-switching dataset), and we are still training the language detection model and the English and Mandarin CTC models.


Honghao’s Status Report for 3/19

This week I focused on researching and deploying pretrained ASR models to verify that our current deployment strategy works. After successfully deploying a pretrained English ASR model and a pretrained Mandarin ASR model, I noticed while testing them a critical issue that slows down our app’s performance, which I am still trying to resolve.

The ASR models I tested were from the Hugging Face repositories “facebook/wav2vec2-base-960h” (an English ASR model) and “jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn” (a Mandarin ASR model). The deployment server was able to run our application, load an ASR model, take audio data sent from the frontend through HTTP requests, write the audio into .wav format, feed the .wav file into the model, and send the predicted text back to the frontend, as demonstrated in the screenshots below. Since these are only basic pretrained ASR models with no additional training, transcription accuracy is not very high, but this verifies that our current deployment strategy is feasible for deploying our own model.
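The core of the inference path looks roughly like the sketch below (assuming the transformers, torch, and soundfile packages, with the audio already resampled to the 16 kHz these checkpoints expect; the Django request handling around it is omitted):

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "facebook/wav2vec2-base-960h"  # the English checkpoint tested

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.eval()

def transcribe(wav_path: str) -> str:
    """Read a mono 16 kHz .wav file and return a greedy CTC decode."""
    audio, sample_rate = sf.read(wav_path)
    inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]
```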

Similarly, I successfully deployed a Mandarin ASR model.

Along the way, I encountered two issues, one of which has been resolved while the other still requires further research. The first issue was that the running Django server would stop itself after a few start/stop recording sessions. Through research, I found that the issue was caused by a type of automatic request made by a Django module, so I wrote a custom Django middleware that checks for and drops such requests.
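The middleware follows Django’s standard two-method shape; a sketch of its structure is below (the matching condition is hypothetical, since the exact offending request isn’t reproduced in this report):

```python
from django.http import HttpResponse

class DropAutomaticRequestsMiddleware:
    """Short-circuits the automatic requests that were stopping the
    running server, before they reach the view layer."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # Hypothetical path check; the real filter matches the specific
        # automatic request the Django module generates.
        if request.path.startswith("/hypothetical-automatic-endpoint/"):
            return HttpResponse(status=204)  # drop it quietly
        return self.get_response(request)
```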

The second issue was that as the frontend sends audio chunks to the server, every chunk after the first results in an empty .wav file. I am still researching the cause of and solution to this issue. For now, I changed the frontend logic to send the entire audio track from the beginning of a start/stop session up to the current time, instead of sending individual chunks, which removes the issue. The drawback of this workaround is that the model must run on the entire audio from the beginning of a recording session instead of only on the newest chunk. Moreover, as we decrease the audio sending period, the model has to run on more and more audio data: with a 1-second sending period, the model runs on a 1-second clip from the beginning of the session, then a 2-second clip, ..., then an n-second clip. As noted in the team report, the total length of audio through the model for an n-second recording session is O(n²), whereas correctly sending n 1-second chunks would make it only O(n).


Next week, I will work on fixing the empty audio file issue and on deploying a pretrained language detection model on our server, so that the language detection model can decide, for each audio frame, which language model should be used to transcribe the audio, achieving code-switching.


Currently, on the web application side, there has been no major timeline delay. I expect the empty audio file issue to be resolved within the next week. One potential risk is that our own models might have environment or performance requirements different from those of the models I have tested so far, in which case the development effort for deployment and integration would be larger.

Marco’s Status Report for 2/26

For this week, I primarily focused on exploring different tokenized output spaces for Chinese characters. As discussed in last week’s report, there are many different ways to define the output space, and each may result in different performance in the final model. The simplest and most naive way is to use the actual characters; some existing models I found using this approach achieved a character error rate of around 20%. This approach works, but there is definitely room for improvement. The second approach is to use pinyin as the output space. This drastically reduces the size of the output space, but it does require a language model to convert pinyin into actual characters. A paper published this year by Jiaotong University reported that their model achieved a character error rate of 9.85% with this approach.
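To make the size difference concrete, here is a small sketch using the pypinyin package (one of several pinyin converters; not necessarily the tooling used in the cited paper):

```python
from pypinyin import lazy_pinyin

# Character output space: one class per Chinese character, which means
# thousands of output classes.
chars = list("我想喝咖啡")            # ['我', '想', '喝', '咖', '啡']

# Pinyin output space: one class per toneless syllable, drawn from an
# inventory of only a few hundred.
syllables = lazy_pinyin("我想喝咖啡")  # ['wo', 'xiang', 'he', 'ka', 'fei']
```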

We also discussed implementation ideas for a real-time transcription system. One idea is to use a sliding window: for instance, the audio segment from 0 s to 4 s is sent to the server, then one second later the segment from 1 s to 5 s is sent. The overlapping regions are then compared and aggregated by their relative output probabilities. In theory, this approach maintains context information, which leads to higher transcription accuracy, while still processing each update quickly.
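A toy sketch of that aggregation step (assuming per-frame class probabilities as numpy arrays, with the 4-second window and 1-second hop from the example above expressed in frames):

```python
import numpy as np

def aggregate_windows(window_probs, hop_frames):
    """Average frame-level output probabilities across overlapping
    windows. window_probs[i] has shape (frames_per_window, n_classes),
    and window i starts hop_frames after window i - 1."""
    frames_per_window, n_classes = window_probs[0].shape
    total = hop_frames * (len(window_probs) - 1) + frames_per_window
    summed = np.zeros((total, n_classes))
    counts = np.zeros((total, 1))
    for i, probs in enumerate(window_probs):
        start = i * hop_frames
        summed[start : start + frames_per_window] += probs
        counts[start : start + frames_per_window] += 1
    return summed / counts  # each frame averaged over its windows
```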

Overall, I am on schedule for this week. For next week, I plan on training some models that transcribe Chinese using the output spaces mentioned above on Amazon Web Services. I also plan on doing some research on Amazon SageMaker, a fully managed machine learning service that could be easier and cheaper to use than an EC2 instance.

Team Status Report for 2/26

This week was primarily focused on finalizing our design and system specifications for the design document due next week. The final document will still need some work in terms of formatting, but the majority of our metrics, testing, and explanations are currently in a shared document which will be used to populate the final template. On the language model side, we’ve obtained our AWS credits and have uploaded our training dataset. The LID model has been initialized, and Nick is conducting his first runs with the dataset and a smaller version of our intended final architecture using only cross-entropy loss. The architecture itself remains stable.

On the web app side, we conducted research on deployment strategies that support our requirements (mainly continuous push of audio chunks from client to server, and model service instantiation at server boot time). We have decided to deploy our app following the logic of deploying a Flask API, but since our app is built using Django, we still need some further research and code modifications before finishing deployment.


Nothing has changed in terms of our schedule. On the modeling side, this week will be about ramping up our training. One of our risks has been limited training time, so it will be essential to make sure that the language model is fully ready and trainable so that it can run for hours at a time overnight over the course of the next two weeks. It will be important to understand and characterize initial results within the next week so that we can make training adjustments as needed to achieve our best performance. On the web app side, this week and next week will focus on app deployment, with the goal of having our deployed app capable of running a pretrained ASR model and returning the predicted text to the client frontend.

Honghao’s Status Report for 2/26

This week I focused on researching and devising concrete plans for our web app deployment. The key points that influence our deployment strategy are: (1) the architecture needs to support continuous push of the ongoing audio recording, in chunks, from client to server; (2) the server needs to support a single-instance service for our code-switching model, as we only need to load the model once at server boot time and create one instance of the code-switching model to handle future incoming requests; (3) we need to temporarily store audio files (.webm and .wav) on the server so that they can be fed into our model as inputs.

According to my research, there are two ways of deploying an ASR model on an AWS server: deploying the model as a microservice, or deploying the model as a Python class which gets instantiated and is ready to use (by calling class methods such as .predict()). After reading through tutorials and documentation on the two deployment strategies (https://www.youtube.com/watch?v=7vWuoci8nUk and https://medium.com/google-cloud/building-a-client-side-web-app-which-streams-audio-from-a-browser-microphone-to-a-server-part-ii-df20ddb47d4e), I decided to take the simpler strategy of deploying the model as a Python class. The greatest advantage of this strategy is its simplicity: deploying a microservice involves building the model into a Docker container and using AWS Lambda, etc., in addition to deploying our app on an AWS EC2 server. This strategy also allows great flexibility in the technology used on the client end. Therefore, to accomplish continuous push of new audio chunks from client to server, I can use either RecordRTC with socket.io in Node.js or websockets in Django to emit new audio to the server efficiently. From the tutorial, I have already familiarized myself with how to achieve a single-instance service under this deployment strategy. Lastly, the temporary storage of audio files can simply be done through I/O calls wrapped in class methods of the model instance.
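If we take the Django websocket route, the receiving end would look roughly like a Channels consumer (a sketch only; the buffering and the hand-off to the model service are placeholders):

```python
from channels.generic.websocket import AsyncWebsocketConsumer

class AudioConsumer(AsyncWebsocketConsumer):
    """Accepts a websocket connection and receives the binary audio
    chunks the client pushes during a recording session."""

    async def connect(self):
        await self.accept()

    async def receive(self, text_data=None, bytes_data=None):
        if bytes_data:
            await self.handle_chunk(bytes_data)

    async def handle_chunk(self, chunk: bytes):
        # Placeholder: buffer the chunk, write it to a temporary audio
        # file, and pass it to the single-instance model service.
        ...
```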

Having devised a detailed deployment strategy, I began deploying our app. I finished writing the model service class, including the necessary functions such as predict() and preprocess(), and a constructor that guarantees only a single instance of the service is launched.
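The single-instance guarantee is a standard Python singleton; a sketch of the class’s shape is below (method bodies elided, names matching those above):

```python
class ModelService:
    """Loads the ASR model once at server boot; every request is
    served by the same instance."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._load_model()
        return cls._instance

    def _load_model(self):
        ...  # load the pretrained checkpoint exactly once

    def preprocess(self, audio_path):
        ...  # convert the stored .webm/.wav into model-ready input

    def predict(self, audio_path):
        ...  # run inference and return the predicted text
```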

Next week, I will continue researching and deploying our app on the AWS server. The goal is to have the deployed app capable of loading a pretrained ASR model and sending prediction outputs back to the client frontend.

Nick’s Status Report for 2/26

This week I was heavily focused on finalizing and documenting the specifics of our system’s language identification model, as well as other parts of our system’s overall language model design. I uploaded the SEAME dataset to my storage instance on AWS and have begun setting up my training schema (data loading, validation and evaluation set partitioning, etc.). Here I find myself a day or two behind our schedule. I plan on spending all of Sunday working exclusively on getting the system fully orchestrated so that all I have left is to make model architecture or hyper-parameter adjustments. I’ll be ramping up this week, starting with a very small model and scaling it up as I validate each successive round of results. Given that we have a break coming up, it is crucial that the full system be able to train overnight by the end of the next school week. This will be my primary focus in terms of implementation.

Otherwise, most of my time this week was spent on our design document and sub-module specifications, including unit tests, metrics, and validation. Though our final design document still needs work in terms of formatting, I wrote much of our specification and explanation in a separate shared document, which will allow easy transfer into the final document for the coming deadline.


Team Status Report for 2/19

This week our team focused on breaking our code-switching speech recognition system into small modules, distributing ownership of these modules, and detailing design requirements, metric goals, and test plans for each module. We began by making a matrix of user requirements, each with its corresponding numerical design requirements and testing strategies. We then identified a total of 7 modules needed to meet these requirements: Language Detection, English ASR, Mandarin ASR, Audio Data Transfer, Web Frontend, Backend API, and Language Model.

After dividing ownership of these modules, each of us drafted the design documents for our modules. We then assembled our modular design documents to check the compatibility of our design metrics and avoid potential critical paths during module integration.

For the Audio Data Transfer, Web Frontend, and Backend API modules, we have finished a proof of concept in a localhost environment and measured the time for audio data processing (.wav generation) on the server end, with a promising result of 200-300 ms for each 2-second audio chunk. We may need to shorten this time after our upcoming experiments with network latency between the client and the deployment server and with ASR model running time on an AWS server. Next week, we will deploy an AWS server to test those latencies and confirm the practicality of our current metric goals for these modules.

On the language modeling side, we have overcome one of the most concerning aspects of training an end-to-end ASR model by securing access to the SEAME multilingual (Mandarin & English) audio/text dataset. This is a large dataset previously used to achieve SOTA results, specifically curated for code-switching use cases. Should we lose access to this dataset, we had already developed contingency plans (before we knew we had access) under which a dataset could be constructed by augmenting and mixing existing dedicated Mandarin and English datasets. Nick also successfully finished setting up his remote development environment and now has a dedicated AWS block store and GPU instance ready for training. He was also able to download Facebook’s wav2vec feature extraction model directly to the GPU instance and successfully initialize it.

Other than acquiring our dataset, most of our work focused on finalizing the specifics of our model architecture and further detailing the training, validation, and testing processes for each submodule of the language model. A more detailed version of our previously “high-level” system architecture will be ready for the design review presentation.

Overall, we remain on track with our schedule. Thanks to the unexpected access to the SEAME dataset, we were able to shorten our “dataset compilation” period and slightly extend the model training, system testing, and integration periods. Official development will begin in the middle of next week.