Honghao’s Status Report for 3/19

This week I focused on researching and deploying pretrained ASR models to verify that our current deployment strategy works. After successfully deploying a pretrained English ASR model and a pretrained Mandarin model, I noticed while testing them a critical issue that slows down our app's performance, which I am still trying to resolve.

The ASR models I tested came from the Hugging Face repositories “facebook/wav2vec2-base-960h” (an English ASR model) and “jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn” (a Mandarin ASR model). The deployment server was able to run our application, load an ASR model, accept audio data sent from the frontend through HTTP requests, write the audio data into .wav format, feed the .wav file into the model, and send the predicted text back to the frontend, as demonstrated in the screenshots below. Since these are basic pretrained ASR models with no additional training, the transcription accuracy is not very high, but the result verifies that our current deployment strategy is feasible for deploying our own model.
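For reference, the sketch below shows roughly how such a pretrained checkpoint can be loaded once and applied to a .wav file with the Hugging Face transformers library. The checkpoint name is the English one listed above; the file path and sampling-rate handling are illustrative and not copied from our actual server code.

```python
# Minimal sketch: load a pretrained wav2vec2 checkpoint and transcribe a .wav file.
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL_NAME = "facebook/wav2vec2-base-960h"  # swap in the Mandarin checkpoint as needed

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)
model.eval()

def transcribe(wav_path):
    # wav2vec2-base-960h expects 16 kHz mono audio
    speech, sample_rate = sf.read(wav_path)
    inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]

print(transcribe("recording.wav"))  # hypothetical file written by the backend
```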

Similarly, I successfully deployed a Mandarin ASR model.

Along the way, I encountered two issues: one has been resolved, and the other still requires further research. The first issue was that the running Django server would automatically stop itself after a few start/stop recording sessions. Through research, I found that the issue was caused by a type of automatic request made by a Django module, so I wrote a custom Django middleware that checks for and drops such requests.
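As a rough illustration, a request-filtering middleware in Django looks like the sketch below. The path check is a hypothetical stand-in for the actual condition used to recognize the problematic automatic requests, and the class would be registered in MIDDLEWARE in settings.py.

```python
# Minimal sketch of a request-filtering Django middleware (illustrative condition).
from django.http import HttpResponse

class DropUnwantedRequestsMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response  # next middleware / view in the chain

    def __call__(self, request):
        # Hypothetical filter: short-circuit requests we do not want to reach the views.
        if request.path.startswith("/some-automatic-endpoint/"):
            return HttpResponse(status=204)  # swallow the request
        return self.get_response(request)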

The second issue was that, as the frontend streamed audio chunks to the server, every chunk after the first resulted in an empty .wav file. I am still researching the cause of and solution to this issue. For now, I changed the frontend logic to send the entire audio track from the beginning of a start/stop session up to the current time, instead of sending individual chunks, which removes the symptom. The drawback of this workaround is that the model has to run on all of the audio recorded since the beginning of a session instead of only on the newest chunk. Moreover, as we decrease the audio sending period, the model has to run on more and more audio data. For example, with a sending period of 1 second, the model would have to run on a 1-second audio from the beginning of the session, then a 2-second audio, and so on up to an n-second audio. The total length of audio passed through the model for an n-second recording session is therefore O(n²), whereas if we could correctly send n 1-second chunks, the total length would only be O(n).
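A quick back-of-the-envelope check makes the gap concrete, assuming a 1-second sending period and a one-minute session:

```python
# Total audio length the model must process under the two schemes,
# assuming a 1-second sending period and an n-second recording session.
n = 60  # length of the recording session in seconds

cumulative = sum(range(1, n + 1))  # workaround: 1 s + 2 s + ... + n s of audio
chunked = n                        # intended design: n separate 1-second chunks

print(cumulative, chunked)  # 1830 vs. 60 seconds of audio for a one-minute session
```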

 

Next week, I will work on fixing the empty audio file issue and on deploying a pretrained language detection model on our server, so that the language detection model can decide, for each audio frame, which language's ASR model should be used to transcribe the audio and thereby achieve code-switching.

 

Currently, on the web application side, there has been no major timeline delay. I expect the empty audio file issue to be resolved within the next week. One potential risk is that our own models may have environment or performance requirements that differ from those of the models I have tested so far; in that case, the development effort for deployment and integration would be larger.

Team Status Report for 2/26

This week was primarily focused on finalizing our design and system specifications for the design document due next week. The final document will still need some work in terms of formatting, but a majority of our metrics, testing plans, and explanations are already in a shared document that will be used to populate the final template. On the language model side, we have obtained our AWS credits and uploaded our training dataset. The LID model has been initialized, and Nick is conducting his first runs with the dataset and a smaller version of our intended final architecture, using only cross-entropy loss. The architecture itself remains stable.

On the web app side, we researched deployment strategies that support our requirements (mainly continuous push of audio chunks from client to server and model service instantiation at server boot time). We have decided to deploy our app following the same logic as deploying a Flask API, but since our app is built with Django, some further research and code modifications are still needed before deployment is finished.

 

Nothing has changed in terms of our schedule. On the modeling side, this week will be spent ramping up our training. One of our risks has been limited training time, so it will be essential to make sure the language model is fully ready and trainable so that it can run for hours at a time overnight over the course of the next two weeks. Initial results will be important to understand and characterize within the next week so that we can make training adjustments as needed to achieve our best performance. On the web app side, this week and next week will focus on app deployment, with the goal of having our deployed app capable of running a pretrained ASR model and returning the predicted text to the client frontend.

Honghao’s Status Report for 2/26

This week I focused on researching and devising concrete plans for our web app deployment. The key points that influence our deployment strategy are: (1) there needs to be an architecture that supports continuous push of the ongoing audio recording, in chunks, from client to server; (2) the server needs to support a single-instance service for our code-switching model, since we only need to load the model once at server boot time and create a single instance of the code-switching model to serve all future incoming requests; (3) we need to temporarily store audio files (.webm and .wav) on the server so that they can be fed into our model as inputs.

According to my research, there are two ways of deploying an ASR model on an AWS server: deploying the model as a microservice, or deploying the model as a Python class that gets instantiated and is ready to use (by calling class methods such as “.predict()”). After reading through tutorials and documentation on the two deployment strategies (https://www.youtube.com/watch?v=7vWuoci8nUk and https://medium.com/google-cloud/building-a-client-side-web-app-which-streams-audio-from-a-browser-microphone-to-a-server-part-ii-df20ddb47d4e), I decided to take the simpler strategy of deploying the model as a Python class. The greatest advantage of this strategy is its simplicity: deploying a microservice involves packaging the model into a Docker container and using AWS Lambda, etc., in addition to deploying our app on an AWS EC2 server. This strategy also keeps great flexibility in the technology used on the client end, so to accomplish continuous push of new audio chunks from client to server, I can use either RecordRTC with socket.io in Node.js or WebSockets in Django to emit new audio to the server efficiently. From the tutorial, I have already familiarized myself with how to achieve a single-instance service under this deployment strategy. Lastly, the temporary storage of audio files can simply be handled by I/O calls wrapped in class methods of the model instance.

Having devised a detailed deployment strategy, I began deploying our app. I finished writing the model service class, including necessary functions such as predict() and preprocess(), as well as the service's instance constructor, which guarantees that only a single instance is launched.
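A minimal sketch of that single-instance pattern is below, assuming the service wraps one of the pretrained wav2vec2 checkpoints. The class and method names follow this report, but the internals are illustrative rather than our exact implementation.

```python
# Minimal sketch of a single-instance ASR model service (illustrative internals).
import threading
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

class ASRModelService:
    """Loads the pretrained ASR model once and serves every request from the same instance."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                inst = super().__new__(cls)
                # Model loading happens exactly once, at server boot time.
                inst.processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
                inst.model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()
                cls._instance = inst
        return cls._instance

    def preprocess(self, wav_path):
        # Read the temporary .wav file written by the backend API.
        speech, sample_rate = sf.read(wav_path)
        return self.processor(speech, sampling_rate=sample_rate, return_tensors="pt")

    def predict(self, wav_path):
        inputs = self.preprocess(wav_path)
        with torch.no_grad():
            logits = self.model(inputs.input_values).logits
        predicted_ids = torch.argmax(logits, dim=-1)
        return self.processor.batch_decode(predicted_ids)[0]

# Repeated "constructions" return the same object, so the model is never reloaded:
assert ASRModelService() is ASRModelService()
```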

Next week, I will continue researching and deploying our app on the AWS server. The goal is to have our app deployed and capable of loading a pretrained ASR model and sending prediction outputs back to the client frontend.

Team Status Report for 2/19

This week our team focused on breaking our code-switching speech recognition system into small modules, distributing ownership of these modules, and detailing design requirements, metric goals, and test plans for each module. We began by making a matrix of user requirements along with their corresponding numerical design requirements and testing strategies. We then identified a total of seven modules needed to satisfy these requirements: Language Detection, English ASR, Mandarin ASR, Audio Data Transfer, Web Frontend, Backend API, and Language Model.

After dividing ownership of these modules, each of us drafted the design documentation for our own modules. We then assembled the modular design documents to check that our design metrics are compatible, in order to avoid potential critical paths during module integration.

For the Audio Data Transfer, Web Frontend, and Backend API modules, we have finished the proof of concept in a localhost environment and measured the audio processing time (.wav generation) on the server end, receiving a promising result of 200 ms to 300 ms for each 2-second audio chunk. We may need to shorten this time after our later experiments with the network latency between client and deployment server and the ASR model's running time on an AWS server. Next week, we will deploy an AWS server and test those latencies to confirm the practicality of our current metric goals for these modules.

On the language modeling side, we have overcome one of the most concerning aspects of training an end-to-end ASR model by securing access to the SEAME multilingual (Mandarin and English) audio/text dataset. This is a large dataset previously used to achieve state-of-the-art results and specifically curated for code-switching use cases. Should we lose access to this dataset, we had already developed contingency plans, before we knew we had access, in which a dataset could be constructed by augmenting and mixing existing dedicated Mandarin and English datasets. Nick also successfully finished setting up his remote development environment and now has a dedicated AWS block store and GPU instance ready for training. He was also able to download Facebook's wav2vec feature extraction model directly to the GPU instance and successfully initialize it.
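For context, initializing the wav2vec 2.0 feature extraction model on a GPU instance looks roughly like the sketch below; the exact checkpoint and the verification step Nick used may differ.

```python
# Minimal sketch: pull the wav2vec 2.0 base model onto the GPU instance and
# confirm that it initializes and runs a forward pass (checkpoint name assumed).
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model = model.to("cuda" if torch.cuda.is_available() else "cpu").eval()

# One second of dummy 16 kHz audio, just to confirm the model produces features.
dummy = torch.zeros(16000)
inputs = extractor(dummy.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(inputs.input_values.to(model.device)).last_hidden_state
print(features.shape)  # (1, ~49 frames, 768) for the base model
```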

Other than acquiring our dataset, most of our work focused on finalizing the specifics of our model architecture and further detailing the training, validation, and testing processes for each submodule of the language model. A more detailed version of our previously “high-level” system architecture will be ready for the design review presentation.

Overall, we remain on track with our schedule. Thanks to the surprise of having access to the SEAME dataset, we were able to shorten our “dataset compilation” period and slightly extend the model training period as well as the system testing and integration periods. Official development will begin in the middle of next week.

Honghao’s Status Report for 2/19

This week I focused on refining the design requirements, numeric goals, and testing plans for the audio transfer module, the web frontend text display module, and the backend .wav generation API module. After drafting detailed documentation of the design metrics, I continued researching audio transfer tools whose performance can meet our design, and I finished implementing an audio transfer module that periodically sends requests carrying audio data to the server.

To test whether the audio transfer module meets our design requirement for .wav generation latency, I logged the server time at receiving the audio data from the web frontend and the time at finishing generating the corresponding .wav file. The .wav file generation time exceeds our intended 50 ms but stays under 300 ms. We will tolerate this result until we have further tested the client-to-server transmission latency (after our server is deployed) and the ML model's average run time on a 2-second audio clip.
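The measurement itself is simple; a sketch of this kind of logging is below, assuming the backend receives raw PCM bytes and writes them with Python's wave module. The function name and parameter values are placeholders, and in our actual pipeline the incoming data first has to be converted from the browser's recording format.

```python
# Minimal sketch: log the .wav generation latency inside the backend handler.
import time
import wave
import logging

logger = logging.getLogger(__name__)

def handle_audio_chunk(raw_audio_bytes, wav_path="chunk.wav",
                       n_channels=1, sample_width=2, frame_rate=16000):
    received_at = time.monotonic()

    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(n_channels)
        wav_file.setsampwidth(sample_width)   # bytes per sample (2 -> 16-bit PCM)
        wav_file.setframerate(frame_rate)
        wav_file.writeframes(raw_audio_bytes)

    finished_at = time.monotonic()
    logger.info(".wav generation took %.1f ms", (finished_at - received_at) * 1000)
```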

Next week, I will move on to deploying an AWS server with a GPU so that we can load a pretrained ASR model and experiment with network latency and model run time, which will let us better estimate the practicality of our current design metrics.

Team Status Report for 2/12

This week our team focused on researching and experimenting with ML models and web packages suitable for the individual modules of our project, including the web app, the backend speech recognition model, and the language detection model. After our research, we will include the models that show promising results, along with a proof-of-concept web app demo, in our design document due 2/20/2022.

We started developing a web app that will serve as a proof of concept. This web app lets users record their voice and submit the audio data to the server. The server will feed the audio into a speech recognition model and return the model output to the frontend. This week we finished implementing voice recording on the frontend and audio data transfer to the server. The foreseeable risks in web app development so far include the loss of quality of the audio transferred from frontend to server and the speed at which our backend model processes the audio. Although we have finished implementing an audio transfer module that can successfully transfer audio data recorded on a web page to our server, the audio file generated on the server is noisy, which would hurt our speech recognition accuracy. This issue should be fixed once we find a way to retrieve the original audio's frame rate and number of channels, so that these parameters can be used when writing the audio data into a .wav file on the server. So far, we are confident that this issue can be fixed by 2/20/2022.

We will also try running a speech recognition model on our development server to estimate the speed of our system. If the speed of processing is too slow, we will try switching to a more powerful server instance type.

We also made the necessary resource requests for increasing our GPU instance limits and for AWS credits. We set up our respective development environments (Colab or JupyterLab) for remotely developing our software on the AWS GPU instances. Next steps include uploading our development datasets onto our respective storage servers and curating our first, smaller development sets for training the ASR and LID models. By next week, we hope to have subtasks specifically divided in terms of model development and to have early models running for each sub-model of the whole DL system. We will also aim to have the detailed architecture of several iterations of the system finalized for the upcoming design presentation and document. The major risks in this area are our model's performance and training time. We cannot be sure how quickly to expect progress, or what level of performance to expect by integration and deployment time, without beginning to characterize these metrics through development. Taking as many early development steps as possible now will help address these risks and also help us understand exactly how large our vocabulary and dataset need to be to achieve the user experience we are seeking.

Honghao’s Status Report for 2/12

This week I focused on setting up the Django environment for our web app. I finished researching the JavaScript library for constructing and sending audio streams to the backend. On localhost, we can start a web frontend page with a record button for users to record their voice. Once they stop recording, we display an audio playback on the frontend page and at the same time send the audio stream to the backend, which in turn stores the audio data in a .wav file on the server side.

This audio stream transfer is crucial to our project because we need to send audio streams from frontend to backend and store them as .wav files to feed into our speech recognition model.

However, I am still having trouble with noise in the .wav file created on the server side. Based on my research so far, the noise is probably caused by arbitrarily setting the number of channels and the frame rate when writing the audio data to the .wav file on the server. Solving this issue will require further research into how to retrieve the frame rate and channel count of the original audio data.
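One possible direction, not yet implemented, is to let a library that can parse the uploaded recording report those parameters instead of hard-coding them. For example, assuming the browser uploads a .webm blob, pydub (backed by ffmpeg) can read it, expose its channel count and frame rate, and re-export it as a .wav that keeps those original values; the function and file names here are illustrative.

```python
# Minimal sketch: convert an uploaded .webm recording to .wav while preserving
# the original channel count and frame rate (assumes pydub + ffmpeg are installed).
from pydub import AudioSegment

def webm_to_wav(webm_path, wav_path):
    audio = AudioSegment.from_file(webm_path, format="webm")
    # These parameters come from the original recording, not guessed values.
    print(audio.channels, audio.frame_rate, audio.sample_width)
    audio.export(wav_path, format="wav")
    return wav_path
```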

So far, progress on the proof of concept for our audio stream transfer module is on schedule. I still have a week before the design document deadline (2/20/2022) to finish the proof of concept and visualize the module with a detailed sketch. Next week I will resolve the audio noise issue and draw a detailed sketch of the audio transfer module, including flow charts and the packages used along the way.

Our Idea

Most speech recognition apps perform well today. From Siri to voice texting on most smartphones, the accuracy and processing speed of speech recognition in these apps are great, with one annoying limit: most apps support only one language mode. Siri, for example, only supports single-language recognition. If you set Siri's language to English and speak to it in a mix of two languages, say Mandarin and English, you will find that Siri treats everything you said as English and transcribes the parts you spoke in Mandarin as gibberish.

So we want to build an app that provides accurate, real-time recognition for speech that mixes Mandarin and English. It will be very useful for (1) voice texting when a bilingual speaker mixes English and Mandarin, and (2) transcription at international conferences when attendees from different countries start a mixed-language dialogue. Our goal is to reach a word error rate (~10%) that matches existing single-language speech recognition apps and an end-to-end latency of less than 1 second, so that the recognition can keep up with a normal human speaking rate (~100 words per minute).