Marco’s Status Report for 2/26

For this week, I primarily focused on getting different tokenized outputs for Chinese characters. As discussed in last week’s report, there are many ways to define the output space, and each choice may result in different performance in the final model. The simplest and most naive way is to use the actual characters. Some existing models I found using this approach achieved a character error rate of around 20%. This approach works, but there is definitely room for improvement. The second approach is to use pinyin as the output space. This drastically reduces the size of the output space, but does require a language model to convert pinyin into actual characters. A paper published this year by Jiaotong University reported that their model achieved a character error rate of 9.85%.
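As an illustration of how the two output spaces differ, here is a minimal sketch; the tiny pinyin mapping is hand-written for this example and is not a real lexicon:

```python
# Minimal sketch contrasting character-level and pinyin-level output
# spaces for Mandarin ASR targets. The pinyin mapping below is a tiny
# hand-written example, not a real lexicon.

def char_vocab(transcripts):
    """One output token per distinct Chinese character."""
    return sorted({ch for text in transcripts for ch in text})

# "是" (shi4, "yes") and "市" (shi4, "city") are homophones, so they
# collapse onto one pinyin token even though they are distinct characters.
toy_pinyin = {"你": "ni3", "好": "hao3", "是": "shi4", "的": "de5", "市": "shi4"}

def pinyin_vocab(transcripts):
    """One output token per distinct pinyin syllable (a smaller space,
    but a language model is needed to recover characters afterwards)."""
    return sorted({toy_pinyin[ch] for text in transcripts for ch in text})

transcripts = ["你好", "是的", "市"]
print(len(char_vocab(transcripts)), len(pinyin_vocab(transcripts)))  # 5 4
```

At full scale the effect is much larger: several thousand common characters collapse onto roughly 1,300 toned syllables, which is why the pinyin output space is so much smaller.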

We also discussed some implementation ideas for a real-time transcription system. One idea is to use a sliding window. For instance, the audio segment from 0s to 4s is sent to the server; one second later, the segment from 1s to 5s is sent. The overlapping regions are then compared and aggregated by their relative output probabilities. In theory, this approach would maintain context information, leading to higher transcription accuracy, while still processing each window in a short span of time.
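To make the windowing concrete, here is a small sketch assuming 4-second windows with a 1-second hop; the per-second probability averaging is illustrative, not our final aggregation rule:

```python
# Sketch of the sliding-window scheme described above: 4-second windows
# emitted every 1 second, with overlapping regions averaged together.

def window_bounds(total_sec, win_sec=4, hop_sec=1):
    """Return (start, end) times of each window sent to the server."""
    starts = range(0, max(total_sec - win_sec, 0) + 1, hop_sec)
    return [(s, s + win_sec) for s in starts]

def merge_overlaps(window_probs, hop_sec=1):
    """Average per-second output probabilities where windows overlap.

    window_probs[i][j] is the model's probability for second (i*hop + j)
    as seen by window i."""
    total = {}
    count = {}
    for i, probs in enumerate(window_probs):
        for j, p in enumerate(probs):
            t = i * hop_sec + j
            total[t] = total.get(t, 0.0) + p
            count[t] = count.get(t, 0) + 1
    return [total[t] / count[t] for t in sorted(total)]

print(window_bounds(6))  # [(0, 4), (1, 5), (2, 6)]
```

Because each second of audio appears in up to four windows, the averaged probabilities smooth out errors made by any single window.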

Overall, I am on schedule for this week. For next week, I plan on training some models to transcribe Chinese using the output spaces mentioned above on Amazon Web Services (AWS). I also plan on doing some research on Amazon SageMaker, a fully managed machine learning service that could be easier and cheaper to use than an EC2 instance.

Team Status Report for 2/26

This week was primarily focused on finalizing our design and system specifications for the design document due next week. The final document will still need some work in terms of formatting, but the majority of our metrics, testing, and explanations are currently in a shared document which will be used to populate the final template. On the language model side, we’ve obtained our AWS credits and have uploaded our training dataset. The LID model has been initialized, and Nick is conducting his first runs with the dataset and a smaller version of our intended final architecture using only cross-entropy loss. The architecture itself remains stable.

On the web app side, we researched deployment strategies that support our requirements (mainly continuous pushing of audio chunks from client to server, and model service instantiation at server boot time). We have decided to deploy our app following the pattern used for Flask APIs, but since our app is built using Django, we still need some further research and code modifications before finishing deployment.


Nothing has changed in terms of our schedule. On the modeling side, this week will be focused on ramping up our training. One of our risks has been limited training time, so it will be essential to make sure that the language model is fully ready and trainable so that it can run for hours at a time overnight over the course of the next two weeks. Initial results will be important to understand and characterize within the next week so that we can make training adjustments as needed to achieve our best performance. On the web app side, this week and next week will focus on app deployment, with the goal of having our deployed app capable of running a pretrained ASR model and returning the predicted text to the client frontend.

Honghao’s Status Report for 2/26

This week I focused on researching and devising concrete plans for our web app deployment. The key points that influence our deployment strategy are: (1) the architecture must support continuous pushing of the ongoing audio recording, in chunks, from client to server; (2) the server must provide single-instance service for our code-switching model, since we only need to load the model once at server boot time and create a single model instance to serve all future incoming requests; (3) we need to temporarily store audio files (.webm and .wav) on the server so that they can be fed into our model as inputs.

According to my research, there are two ways of deploying an ASR model on an AWS server: deploying the model as a microservice, or deploying the model as a Python class which gets instantiated and is ready to use (by calling class methods such as .predict()). After reading through tutorials and documentation (https://www.youtube.com/watch?v=7vWuoci8nUk and https://medium.com/google-cloud/building-a-client-side-web-app-which-streams-audio-from-a-browser-microphone-to-a-server-part-ii-df20ddb47d4e) for the two deployment strategies, I decided to take the simpler strategy of deploying the model as a Python class. The greatest advantage of this strategy is its simplicity: deploying a microservice involves building the model into a Docker container and using AWS Lambda, etc., in addition to deploying our app on an AWS EC2 server. This strategy also allows great flexibility in the technology used on the client end. To accomplish continuous pushing of new audio chunks from client to server, I can use either RecordRTC with socket.io in Node.js or WebSockets in Django to emit new audio chunks to the server efficiently. From the tutorial, I have already familiarized myself with how to achieve single-instance service under this deployment strategy. Lastly, the temporary storage of audio files can simply be done through I/O calls wrapped in class methods of the model instance.

Having devised a detailed deployment strategy, I began deploying our app. I finished writing the model service class, including necessary functions like predict() and preprocess(), as well as the constructor of the service itself, which guarantees that only a single instance is launched.
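As a hedged sketch of how that single-instance guarantee can work (class and method names mirror the ones above, but the model loading itself is stubbed out with a placeholder):

```python
# Sketch of a single-instance model service. The expensive model load
# happens exactly once, at the first instantiation (i.e., server boot).

import threading

class ModelService:
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Guarantee one shared instance across all incoming requests.
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._load_model()
            return cls._instance

    def _load_model(self):
        # Placeholder: the real app would load the pretrained
        # code-switching ASR model weights here.
        self.model = "loaded-asr-model"

    def preprocess(self, audio_bytes):
        # Placeholder for .webm -> .wav conversion and feature extraction.
        return audio_bytes

    def predict(self, audio_bytes):
        features = self.preprocess(audio_bytes)
        # Placeholder inference; the real method would run the model.
        return f"transcript for {len(features)} bytes"

assert ModelService() is ModelService()  # one instance, no matter how many calls
```

Every request handler can then call `ModelService().predict(...)` without paying the model-loading cost again.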

Next week, I will continue researching and deploying our app on the AWS server. The goal is to have our app deployed and capable of loading a pretrained ASR model and sending prediction outputs back to the client frontend.

Nick’s Status Report for 2/26

This week I was heavily focused on finalizing and documenting the specifics of our system’s language identification model, as well as other parts of our system’s overall language model design. I was able to upload the SEAME dataset to my storage instance on AWS and have begun setting up my training schema (data loading, validation and evaluation set partitioning, etc.). Here I find myself a day or two behind our schedule. I plan on spending all of Sunday working exclusively on getting the system entirely orchestrated so that all I have left is to make model architecture or hyperparameter adjustments. I’ll be ramping up this week, starting with a very small model and scaling it up as I validate each successive round of results. Given that we have a break coming up, it is crucial that the full system be able to train overnight by the end of the next school week. This will be my primary focus in terms of implementation.

Otherwise, most of my time this week was spent on our design document and sub-module specifications, including unit tests, metrics, and validation. Though our final design document still needs work in terms of formatting, I did a lot of our specification and explanation in a separate shared document, which will allow for easy transfer into the document for the coming deadline.


Team Status Report for 2/19

This week our team focused on breaking our code-switching speech recognition system into small modules, distributing ownership of these modules, and detailing design requirements, metric goals, and test plans for each module. We began by making a matrix of user requirements, each with its corresponding numerical design requirements and testing strategies. We then identified a total of 7 modules needed to satisfy these requirements: Language Detection, English ASR, Mandarin ASR, Audio Data Transfer, Web Frontend, Backend API, and Language Model.

After dividing ownership of these modules, each of us drafted the design documents for the modules we are responsible for. We then assembled our modular design documents to check the compatibility of our design metrics, in order to avoid potential critical paths during module integration.

For the Audio Data Transfer, Web Frontend, and Backend API modules, we have finished the proof of concept in a localhost environment and measured the time for audio data processing (.wav generation) on the server end, receiving a promising result of 200ms to 300ms for each 2-second audio chunk. We may need to shorten this time after later experiments measuring the network latency between client and deployment server and the ASR model's running time on an AWS server. So next week, we will deploy an AWS server to test those latencies and confirm the practicality of our current metric goals for these modules.

On the language modeling side, we have overcome one of the most concerning aspects of training an end-to-end ASR model by securing access to the SEAME multilingual (Mandarin & English) audio/text dataset. This is a large dataset previously used to achieve SOTA results, specifically curated for code-switching use cases. Should we lose access to this dataset, we had already developed contingency plans (before we knew we had access) in which a dataset could be constructed by augmenting and mixing existing dedicated Mandarin and English datasets. Nick was also able to finish setting up his remote development environment and now has a dedicated AWS block store and GPU instance ready for training. He was also able to download Facebook’s wav2vec feature extraction model directly to the GPU instance and successfully initialize it.

Other than acquiring our dataset, most of our work focused on finalizing the specifics of our model architecture and further detailing what our training, validation, and testing processes will be for each submodule of the language model. A more detailed version of our previously “high-level” system architecture will be ready for the design review presentation.

Overall, we remain on track with our schedule. Thanks to the unexpected access to the SEAME dataset, we were able to shorten our “dataset compilation” period and slightly extend the model training period as well as the system testing and integration periods. Official development will begin mid next week.

Nick’s Status Report for 2/19

This week I was focused heavily on two things: aggregating all of the datasets and material I will need for development and uploading them to my remote development environment (AWS), and drilling down into the metrics, requirements, and detailed design of our whole system for the design review. On the development side, I was able to retrieve and store the latest wav2vec 2.0 model for feature extraction, specifically targeting mixed-language inputs. Marco and I will be able to further fine-tune this model once we reach system-level training of the ASR model. I also registered and am waiting for final approval to gain full access to the SEAME code-switching dataset for Mandarin and English. Marco was granted access within a day or two, so I expect to have full access to that data by Monday. My remote Jupyter notebook setup is also fully configured. Using dedicated GPU instances, we’ll be able to train continuously overnight without having to worry about uptime interruptions.

On the design side, I completed detailed design documentation for each of the modules I will be either entirely or partially responsible for (Modules 1, 3, and 7), with traceability matrices for requirements, unit testing, and validation. Each requirement can be traced upward to design-level requirements and downward to a specific test, for easy tracking of how lower-level decisions have been informed by high-level use-case targeting. I added all of these matrices, along with system-level versions of them, to our ongoing design review document, which also includes module-level architecture descriptions and interface details between modules.

I’m currently on track with my planned work. Since Marco was able to gain access to the SEAME database, both of us have gained an extra two days in the schedule for either training work or system integration work at the end of the semester. This week I plan to finish our design review presentation and our design review document, and to have a first version of the LID model initialized and ready for initial training by next weekend.

Marco’s Status Report for 2/19

This week I mainly focused on getting the dataset that we need for training our model. The scarcity of Mandarin-English code-switching speech corpora has always been one of our main concerns. We previously gathered a dataset from an English-teaching language institution in China, but we were worried about its quality. In particular, the recording setting is confined, which may not be representative of the natural, fluent speech that we are looking for. Another concern is the variability of the speech: if most of the recordings feature only a few teachers, then the model will be heavily skewed and will overfit to their voices.

Another idea was to piece together existing speech from both languages to create data. However, one particular concern with this approach is the fluency of the speech. If the segments are abrupt, the data does not represent the fluent, ongoing speech that we are looking for. A potential consequence is that the deep neural network would learn to pick up on these segment boundaries to identify language changes, rather than on the speech itself.

Fortunately, after a lot of digging online, I found access to two field-proven datasets: ASCEND and SEAME. ASCEND, released in 2021, contains code-switching speech from both Hongkongers and Mainland Chinese speakers and totals 10 hours. SEAME is a widely used dataset for code-switching speech recognition, with recordings of Malaysian and Singaporean speakers totaling 90 hours. These two datasets will sufficiently satisfy our needs for training data.

Honghao’s Status Report for 2/19

This week I focused on refining the design requirements, numeric goals, and testing plans for the audio transfer module, the web frontend text display module, and the backend .wav generation API module. After drafting detailed documentation of the design metrics, I continued researching tools for audio transfer whose performance can meet our design, and finished implementing an audio transfer module that periodically sends requests embedded with audio data to the server.

To test whether the audio transfer module meets our design requirement for .wav generation latency, I logged the server time upon receiving the audio data and again upon finishing generation of the corresponding .wav file. The .wav file generation time exceeds our intended 50ms but stays under 300ms. We will tolerate this result until we have further tested the client-to-server transmission latency (after deploying our server) and the ML model's average run time on a 2-second audio clip.
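A minimal sketch of this kind of measurement, assuming raw 16-bit PCM input and Python's standard wave module (the file is written in memory here for illustration):

```python
# Sketch of the latency measurement described above: record the time on
# receiving the audio data and again after the .wav has been written.

import io
import time
import wave

def write_wav_and_time(raw_pcm, sample_rate=16000, channels=1):
    """Write raw 16-bit PCM to an in-memory .wav and report elapsed ms."""
    start = time.perf_counter()          # server time at receiving audio
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)               # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(raw_pcm)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return buf.getvalue(), elapsed_ms

# 2 seconds of silence at 16 kHz mono, 16-bit.
pcm = b"\x00\x00" * 16000 * 2
wav_bytes, ms = write_wav_and_time(pcm)
print(f".wav generated in {ms:.1f} ms")
```

In the real system the same two timestamps would bracket the full request handler, so that transmission and disk I/O are included in the measured latency.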

Next week, I will move on to deploying an AWS server with a GPU so that we can load a pre-trained ASR model and experiment with network latency and model run time, which will let us better estimate the practicality of our current design metrics.

Marco’s Status Report for 2/12

This week I primarily worked on getting a naive speech recognition model working. Based on our research, we will be using connectionist temporal classification (CTC), a machine learning algorithm that can predict an output sequence (in our case, words) from an input with no constraint on length. In addition, the deep neural network does not make any assumptions about the language, which allows it to potentially learn syntax and other structures of the language.

I chose Google Colab for prototyping due to its portability and convenience of setup. I decided to use Hugging Face for training the model because of its portability across both PyTorch and TensorFlow. I tried setting up two models. The first did speech recognition purely in English. This was relatively easy to set up since the model only needs to classify the 26 letters of the alphabet. It achieved a relatively low word error rate (WER) within a few hours of training.

For the second model, I tried training on Mandarin. For the sake of simplicity, I used a naive approach and treated Chinese characters as tokens. This drastically increases the output space, since the number of common Chinese characters exceeds 5,000. However, to my surprise, the model worked better than I expected. Its mistakes were mostly confusions between characters with the same or similar sounds. This could mean that the deep neural network is learning to classify characters by sound without explicit pinyin labeling.

For the next approach, I will attempt to create a model that tokenizes both English letters and Chinese characters as output and test the accuracy of the model.

Team Status Report for 2/12

This week our team focused on researching and experimenting with ML models and web packages suitable for the individual modules of our project, including the web app, the backend speech recognition model, and the language detection model. After our research, we will include the models that show promising results, along with a proof-of-concept web app demo, in our design document due 2/20/2022.

We started developing a web app that will serve as a proof of concept. This web app will let users record their voice and submit the audio data to the server. The server will feed the audio into a speech recognition model and return the model output to the frontend. This week we finished implementing voice recording on the frontend and audio data transfer to the server. The foreseeable risks in web app development so far include the loss of audio quality during transfer from frontend to server and the speed of our backend model in processing the audio. Although we have implemented an audio transfer module that can successfully transfer audio data recorded on a web page to our server, the audio file generated on the server is noisy, which would impede our speech recognition accuracy. This issue should be fixed once we find a way to retrieve the original audio's frame rate and number of channels so that these parameters can be used when writing the audio data into a .wav file on the server. So far, we are confident that this issue can be fixed by 2/20/2022.
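A sketch of the intended fix, under the assumption that we have a reference recording whose parameters Python's standard wave module can read (a real .webm upload would first need conversion, e.g. via ffmpeg):

```python
# Sketch: read the original recording's frame rate and channel count,
# then reuse them when writing the server-side .wav, so the audio is
# not distorted by mismatched parameters. In-memory files for brevity.

import io
import wave

def copy_with_original_params(src_wav_bytes, new_frames):
    """Write new_frames using the frame rate/channels of the source .wav."""
    with wave.open(io.BytesIO(src_wav_bytes), "rb") as src:
        channels = src.getnchannels()
        sampwidth = src.getsampwidth()
        framerate = src.getframerate()
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(channels)    # mismatched channels garble the audio
        dst.setsampwidth(sampwidth)
        dst.setframerate(framerate)   # a wrong rate plays too fast or slow
        dst.writeframes(new_frames)
    return out.getvalue()
```

Writing with a guessed frame rate is exactly what produces the noisy playback we observed, so carrying these parameters through explicitly should resolve it.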

We will also try running a speech recognition model on our development server to estimate the speed of our system. If the speed of processing is too slow, we will try switching to a more powerful server instance type.

We also made the necessary resource requests for increasing our GPU instance limits as well as for AWS credits. We set up our respective development environments (Colab or JupyterLab) for remotely developing our software on the AWS GPU instances. Next steps include uploading our desired development datasets onto our respective storage servers and curating our first, smaller development sets for training the ASR and LID models. By next week we hope to have subtasks specifically divided in terms of model development, and early models running for each sub-model of the whole DL system. We’ll also aim to have the detailed architecture of several iterations of the system finalized for the upcoming design presentation and document. The major risks in this area concern our model’s performance and training time: we can’t be sure exactly how quickly to expect progress, or what level of performance to expect by integration and deployment time, without beginning to characterize these metrics through development. Taking as many early steps as possible now will help address these risks and also help us understand exactly how large our vocabulary and dataset need to be to achieve the user experience we are seeking.