Team Status Report for 2/19

This week our team focused on breaking our code-switching speech recognition system into small modules, assigning ownership of each module, and detailing the design requirements, metric goals, and test plans for each one. We began by building a matrix that maps each user requirement to its corresponding numerical design requirements and testing strategies. From these requirements we identified seven modules: Language Detection, English ASR, Mandarin ASR, Audio Data Transfer, Web Frontend, Backend API, and Language Model.

After dividing ownership of these modules, each of us drafted the design documents for the modules we own. We then assembled the modular design documents and checked that our design metrics are mutually compatible, to avoid potential critical paths during module integration.

For the Audio Data Transfer, Web Frontend, and Backend API modules, we have finished a proof of concept in a localhost environment and measured the server-side audio processing time (.wav generation), with a promising result of 200 ms to 300 ms per 2-second audio chunk. We may need to shorten this time once we account for network latency between the client and the deployment server, and for ASR model running time on an AWS server. Next week we will deploy an AWS server and measure those latencies to confirm that our current metric goals for these modules are practical.
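As a rough illustration of this proof of concept, here is a minimal sketch of how the server-side timing can be taken. The Flask framework, the /upload_chunk route, and the 16 kHz 16-bit mono PCM format are assumptions for illustration rather than confirmed details of our implementation, and the real pipeline does more work per chunk than this simplified version.

    import time
    import wave

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    SAMPLE_RATE = 16000  # assumed capture rate; each upload carries ~2 s of audio


    @app.route("/upload_chunk", methods=["POST"])
    def upload_chunk():
        pcm_bytes = request.get_data()  # raw 16-bit mono PCM from the client

        start = time.perf_counter()
        # Server-side .wav generation: wrap the raw PCM in a WAV container.
        with wave.open("chunk.wav", "wb") as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)  # 16-bit samples
            wav_file.setframerate(SAMPLE_RATE)
            wav_file.writeframes(pcm_bytes)
        elapsed_ms = (time.perf_counter() - start) * 1000

        # Timing this stage per chunk is how the 200-300 ms figure was obtained
        # in our PoC (which includes more processing than shown here).
        return jsonify({"processing_ms": elapsed_ms})


    if __name__ == "__main__":
        app.run(host="127.0.0.1", port=5000)  # localhost proof of concept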

On the language modeling side, we have resolved one of our biggest concerns about training an end-to-end ASR model by securing access to the SEAME multilingual (Mandarin & English) audio/text dataset. This large dataset was curated specifically for code-switching use cases and has previously been used to achieve SOTA results. We had already developed a contingency plan before we knew we had access: should we lose the dataset, we could construct one by augmenting and mixing existing Mandarin-only and English-only datasets (see the sketch below). Nick also finished setting up his remote development environment and now has a dedicated AWS block store and GPU instance ready for training. He downloaded Facebook's wav2vec feature extraction model directly to the GPU instance and successfully initialized it.
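To make the contingency plan concrete, here is a minimal sketch of the mixing idea: splicing monolingual utterances into synthetic code-switched clips. The pydub dependency and the file lists are hypothetical choices for illustration; with SEAME in hand, this fallback is currently unnecessary.

    import random

    from pydub import AudioSegment  # assumed dependency


    def make_code_switched_clip(en_paths, zh_paths, n_segments=4):
        """Alternate randomly chosen English and Mandarin utterances."""
        clip = AudioSegment.empty()
        for i in range(n_segments):
            pool = en_paths if i % 2 == 0 else zh_paths
            clip += AudioSegment.from_wav(random.choice(pool))
        return clip


    # Usage (hypothetical file lists):
    # clip = make_code_switched_clip(english_wavs, mandarin_wavs)
    # clip.export("synthetic_codeswitch.wav", format="wav")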
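For the wav2vec setup, the following is a minimal sketch of initializing the feature extraction model on the GPU instance, assuming the Hugging Face transformers distribution; this report does not specify the exact checkpoint Nick downloaded, so "facebook/wav2vec2-base" is a placeholder.

    import numpy as np
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # from_pretrained downloads the checkpoint directly to the instance.
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").to(device)
    model.eval()

    # Sanity check: run one second of silence through the model.
    dummy = np.zeros(16000, dtype=np.float32)  # 1 s at 16 kHz
    inputs = extractor(dummy, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model(inputs["input_values"].to(device))
    print(out.last_hidden_state.shape)  # (1, frames, hidden_dim)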

Beyond acquiring our dataset, most of our work focused on finalizing the specifics of our model architecture and further detailing the training, validation, and testing processes for each submodule of the language model. A more detailed version of our previously "high-level" system architecture will be ready for the design review presentation.

Overall, we remain on track with our schedule. Because we gained access to the SEAME dataset earlier than expected, we were able to shorten our dataset compilation period and slightly extend the model training, system testing, and integration periods. Official development will begin in the middle of next week.
