Nick’s Status Report for 4/30

This week I accomplished much of what I set out to do in one last attempt to implement and train a complete, jointly trained LID/ASR model. I overcame the problems I previously had with creating logically correct English and Mandarin labels for audio segments, which are needed to train the LID model. I did this by taking my previously best-performing combined model (which used a shaky heuristic to create labels) and using it during preprocessing to segment the audio and label each segment English, Mandarin, or Blank, essentially distilling the LID information out of that model. These new labels were then used to train a fresh model, with the hope that the ASR half of the model would no longer need to hold any LID information at all. Training has been successful so far, though I have not yet surpassed my previous best performance. The training procedure runs in stages because I decay certain hyperparameters over time, and I anticipate that over the next 10-20 epochs I should be able to reach my goal. I’m on track with our planned schedule and will also be working heavily on our final report this week while this model trains in tandem.
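As a rough illustration of this preprocessing step (the file paths and function names here are hypothetical, and `segment_and_classify` is only a stand-in for the previous top-performing combined model, not its actual API), the label distillation can be sketched as:

```python
import json
from pathlib import Path

LABELS = ("ENGLISH", "MANDARIN", "BLANK")

def segment_and_classify(wav_path):
    """Placeholder for the previous best combined model: should return a list
    of (start_seconds, end_seconds, label) tuples for one audio file."""
    raise NotImplementedError("plug in the previous top-performing model here")

def distill_lid_labels(wav_dir, out_path):
    """Run the old model over every utterance and cache its LID decisions so
    a fresh model can train against them directly."""
    records = []
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        segments = segment_and_classify(wav_path)
        records.append({
            "audio": str(wav_path),
            "segments": [
                {"start": start, "end": end, "label": label}
                for (start, end, label) in segments
                if label in LABELS
            ],
        })
    Path(out_path).write_text(json.dumps(records, indent=2))

# Hypothetical usage:
# distill_lid_labels("data/seame/train", "distilled_lid_labels.json")
```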

 

Team Status Report for 4/23

We’ve continued to push model performance by updating the architecture and training techniques to align more closely with our guiding paper. This week we achieved a 20% reduction in WER on our toughest evaluation set by using a combined architecture that creates LID labels with a novel segmentation technique. The technique creates “soft” labels that roughly infer the current language from audio segmentation, which is then fused with the golden label text. Training and hyperparameter tuning will continue until the final demo; we anticipate there is likely room for another 15% reduction in the error rate.
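For reference, the WER figures above follow the standard word-level edit-distance definition. A minimal sketch of the computation (not our actual evaluation script) is:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# word_error_rate("我 want to 吃 lunch", "我 want 吃 lunch")  # -> 0.2
```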

 

On the web app end, we finished implementing an audio silence detector on the JavaScript frontend and a new backend transcription mechanism. The frontend silence detector watches for a certain length of silence while the user is recording and only sends a transcription request once a silence gap is detected; the backend model then analyzes only the new audio since the most recent silence gap. This resolves the earlier problem of single words being cut off when we split the audio into arbitrary 3-second chunks.
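The detector itself lives in the JavaScript frontend; the sketch below is only a Python approximation of the same gap-detection logic, and the frame size and thresholds are illustrative rather than our tuned values:

```python
import numpy as np

def last_silence_gap_end(samples: np.ndarray, sample_rate: int,
                         amp_threshold: float = 0.01,
                         min_gap_seconds: float = 0.5):
    """Return the sample index just past the most recent silence gap of at
    least `min_gap_seconds`, or None if no such gap exists yet."""
    frame = int(0.02 * sample_rate)                    # 20 ms analysis frames
    need = int(min_gap_seconds / 0.02)                 # quiet frames for a gap
    run, gap_end = 0, None
    for start in range(0, len(samples) - frame + 1, frame):
        rms = np.sqrt(np.mean(samples[start:start + frame] ** 2))
        run = run + 1 if rms < amp_threshold else 0
        if run >= need:
            gap_end = start + frame                    # index just past the gap
    return gap_end

# The frontend only sends audio once a gap is found, and the backend only
# analyzes the portion recorded after the previous gap.
```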

The new backend transcription mechanism takes in a piece of audio, tags each frame with a language tag (<eng> for English, <man> for Mandarin, and <UNK> for silence), breaks the input into smaller single-language chunks, and feeds each chunk to either an English or a Mandarin ASR model. This lets us integrate advanced pretrained single-language ASR models into our system and harness their capability to improve accuracy.
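A simplified sketch of this routing step follows; the ASR callables are hypothetical placeholders for the pretrained single-language models, not their real interfaces:

```python
from itertools import groupby

def chunk_by_language(frames, tags):
    """Group consecutive frames whose LID tag matches. `tags` holds one tag
    per frame, drawn from {"<eng>", "<man>", "<UNK>"}."""
    chunks, start = [], 0
    for tag, group in groupby(tags):
        length = len(list(group))
        chunks.append((tag, frames[start:start + length]))
        start += length
    return chunks

def transcribe_mixed(frames, tags, english_asr, mandarin_asr):
    """Route each single-language chunk to the matching ASR model; silence
    (<UNK>) chunks are dropped."""
    pieces = []
    for tag, chunk in chunk_by_language(frames, tags):
        if tag == "<eng>":
            pieces.append(english_asr(chunk))
        elif tag == "<man>":
            pieces.append(mandarin_asr(chunk))
    return " ".join(pieces)
```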

Below is a video demonstration of the silence-chunking mechanism + our own mixed-language transcription model.

https://drive.google.com/file/d/1YY_M3g54S8zmgkDc2RyQqt1IncXd5Wxo/view?usp=sharing

And below is a video demonstration of the silence-chunking mechanism + LID sequence chunking + single-language ASR models.

https://drive.google.com/file/d/1bcAi5p9H7i9nuqY2ZtgE7zb4wOuB0QsL/view?usp=sharing

With the silence-chunking mechanism, the earlier problem of a single spoken word being cut into two audio pieces is resolved. We can also see that the mechanism integrating LID sequence chunking with single-language ASR models demonstrates higher transcription accuracy.

Next week, we will focus on evaluating our system on more diverse audio samples. We will also try adding Nick’s newly trained model into our system to compare against the current accuracy. We are a little behind on our evaluation schedule because we were enhancing the system with the mechanisms above, but we expect to finish the evaluation and further parameter tuning (silence-gap length threshold and silence amplitude threshold) before the final demo.

Nick’s Status Report for 4/23

This week I pushed as far as I could on improving model accuracy, using new techniques for training and combining the LID and ASR modules to almost exactly replicate those used in the primary paper we’ve been following. The model completed over 40 epochs of training this week across 2-3 different configurations. The biggest and most successful change was a segmentation process I devised to create soft labels for the current language. The audio is first segmented by volume to produce candidate segments corresponding to words or groups of words. I then perform a two-way reduction between the list of segments and the language labels computed from the label string, which yields a one-to-one match between each language label and a segment and allows a cross-entropy loss to be applied to the LID model. Using this technique, I achieved a 20% improvement in WER on our hardest evaluation set, lowering the current best to 69%. This training process will continue until the final demo, since we can easily substitute in a new version of the model from the cloud each time a new best is achieved. Significant focus will also be placed on finishing our final presentation for Monday.
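Roughly, the idea can be sketched as follows. The ASCII-based language heuristic and the shortest-segment merge here are simplified stand-ins for the actual reduction and label extraction, included only to make the shape of the pipeline concrete:

```python
import torch
import torch.nn.functional as F

LANG_TO_ID = {"ENG": 0, "MAN": 1}

def language_runs(transcript: str):
    """Collapse a code-switched transcript into one label per language run,
    treating ASCII words as English and everything else as Mandarin."""
    runs = []
    for word in transcript.split():
        lang = "ENG" if word.isascii() else "MAN"
        if not runs or runs[-1] != lang:
            runs.append(lang)
    return runs

def merge_shortest(segments):
    """One reduction step: merge the shortest (start, end) segment into its
    neighbour so the segment count drops by one."""
    i = min(range(len(segments)), key=lambda k: segments[k][1] - segments[k][0])
    j = i - 1 if i > 0 else i + 1
    lo, hi = min(i, j), max(i, j)
    merged = (segments[lo][0], segments[hi][1])
    return segments[:lo] + [merged] + segments[hi + 1:]

def align(segments, runs):
    """Reduce the volume-based segments until they pair 1-1 with the runs."""
    while len(segments) > len(runs) and len(segments) > 1:
        segments = merge_shortest(segments)
    return segments

def lid_loss(segment_logits: torch.Tensor, runs):
    """Cross-entropy between per-segment LID logits (num_segments x 2) and
    the aligned per-segment language labels."""
    targets = torch.tensor([LANG_TO_ID[r] for r in runs])
    return F.cross_entropy(segment_logits, targets)
```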

Nick’s Status Report for 4/16

This week we worked diligently to improve our model’s performance and our project’s overall insights and user experience. Within a day of our meeting earlier this week, I retrained a fresh model with slight modifications to the pre-trained feature extractor we use as well as the training hyperparameters. This led to a model with much better performance than what we demoed at our interim, in both character and word error rate, across all evaluation subsets and overall.

Progress is also continuing on training a model with a slightly more nuanced architecture that fuses the CTC and classification losses, as well as the logits. Overall, the larger model should have greater capacity for knowledge retention and a finer ability to distinguish between Mandarin and English, since it more explicitly separates the LID and ASR tasks. We will continue work tomorrow to strategize about what we would like the final outcome of our project to be and what to highlight during our final demo and report.
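In sketch form, the fused objective is simply a weighted sum of the CTC term (ASR head) and a cross-entropy term (LID head); the weight shown is an illustrative value, not the tuned setting:

```python
import torch.nn.functional as F

def joint_loss(log_probs, targets, input_lengths, target_lengths,
               lid_logits, lid_targets, lid_weight: float = 0.3):
    """Fuse the two training objectives: CTC loss for the ASR head plus a
    weighted cross-entropy term for the LID head."""
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    lid = F.cross_entropy(lid_logits, lid_targets)
    return ctc + lid_weight * lid
```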

 

Team Status Report for 04/02

LID work this week focused on continued training and on fleshing out interfaces for interacting with the model. Pre-loading and forward-pass methods were introduced to expose the model’s functionality through an importable class available from the LID GitHub repository. The model itself is loaded from a separate online repository (also hosted on GitHub), where improved versions of the model have been automatically uploaded as training progresses. Integration and development for the first demo will take up most of the work for the next couple of days, along with beginning to build out the software suite for running the tests we prescribed in our design and architecture documents. The model could be about half a week further along, so Nick plans to spend most of the next week focusing solely on these deliverables.

On the web app end, we have integrated a code-switching model trained by Marco and obtained promising results. The model runs efficiently when we split the ongoing recording stream into 1-second chunks: it outputs a transcription in close to 1 second, which gives the app a real-time feel. The model accurately captures instances of language switching within a sentence, but since we only feed it one 1-second audio chunk at a time, it can only produce the best transcription based on the audio features within that chunk. So far the integration is on schedule. We are ready to start evaluating our models on diverse audio samples from YouTube and tuning them accordingly. We will also incorporate Nick’s LID model to improve accuracy and experiment with other chunking mechanisms that encapsulate more context per chunk while keeping the chunks short.

Nick’s Status Report for 04/02

This week’s work focused on further fleshing out the LID model and how it will interact with the rest of the components of our system. Currently, a version of the model can be pulled from a cloud repository, loaded, and run on raw speech utterances to produce a sequence of classifications. I’ve added methods that allow my partners to preemptively load the model into system and CUDA memory, so that we minimize loading time when actually serving a transcription request, since only the data then needs to be moved into and out of memory. I also exposed a method for making the forward call through the network. I anticipate the interface to the backend language model will continue to be a simple class callable from the software API level that Tom has been working on. Integration and testing will remain our focus for the next couple of weeks. There is work to be done to set up testing frameworks for both accuracy and noise tolerance; in this respect I feel a little behind, but I plan to spend much of the next 3 or 4 days working on it. Delivering our first demo is the team’s next major milestone, so we will need to keep meeting in person or over live Zoom calls to flesh out that integration.
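Roughly, the wrapper has the following shape; the class name, checkpoint handling, and output format below are illustrative rather than the exact interface exposed from the repository:

```python
import torch

class LIDWrapper:
    """Illustrative shape of the importable class described above; the real
    class name, checkpoint format, and repository layout differ."""

    def __init__(self, checkpoint_path: str, device: str = "cuda"):
        self.checkpoint_path = checkpoint_path
        self.device = torch.device(device if torch.cuda.is_available() else "cpu")
        self.model = None

    def preload(self):
        """Load the network into system and GPU memory ahead of time so a
        transcription request only moves audio data in and out."""
        # Assumes the checkpoint stores a full nn.Module via torch.save(model).
        self.model = torch.load(self.checkpoint_path, map_location=self.device)
        self.model.eval()

    @torch.no_grad()
    def classify(self, waveform: torch.Tensor) -> torch.Tensor:
        """Forward pass over a raw speech utterance; returns the per-frame
        language classification sequence."""
        if self.model is None:
            self.preload()
        logits = self.model(waveform.to(self.device))
        return logits.argmax(dim=-1).cpu()
```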

Team Status Report for 3/26

This week, training began successfully on the LID model using the data we intended. It completed its first epoch in roughly 12 hours with promising initial prediction ability. The next step is integrating the model with Marco’s ASR module. The current model size is about 1 GB uncompressed, so I anticipate that meeting our size requirements will start to be a challenge. Ways to quantize or otherwise compress the model may be investigated over the next week, as will testing and training on noisy data using techniques like SpecAugment. Actual inference times appear to be around a second, which is promising with respect to our timing targets.
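One compression avenue to explore is post-training dynamic quantization, which stores int8 weights for the linear layers. A minimal sketch (not yet part of the project, and only effective for the linear-heavy parts of the network) would be:

```python
import torch

def quantize_for_size(model: torch.nn.Module) -> torch.nn.Module:
    """Replace fp32 Linear weights with dynamically quantized int8 weights."""
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

# Hypothetical usage:
# small_model = quantize_for_size(lid_model)
# torch.save(small_model.state_dict(), "lid_model_int8.pt")
```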

On the web app side, we now analyze audio in chunks, which greatly improves system speed. At the user level, this creates a real-time voice-to-text experience. We also tested system speed with larger ASR models and still achieved near-real-time transcription.

Next week, we expect to have a trained language-detection model that we can start integrating into our deployed web app. So far, development is on schedule. One potential risk is that training the two languages’ CTC models may take longer than expected to yield workable results, which could delay our integration timeline.

Nick’s Status Report for 3/26

This week training began in earnest. The model completed its first epoch in about 12 hours. Preliminary values of its accuracy metric show it sitting around 80% WER. WER is not the ideal metric for this use case, however, and after updating the metric I expect a much better classification error rate. The model’s outputs so far have looked exceedingly reasonable, so I feel good about its ability to be integrated with Marco’s model soon. Based on visual comparisons between golden and predicted transcriptions, the model exclusively emits M or E tokens for Mandarin and English respectively, never printing an ‘UNK’ token indicating confusion, which I find promising. Next week, work will focus on continuing to train the model, integrating it with Marco’s work, and exposing it for Tom’s web development work. I feel I am currently on schedule with the model’s progress. I may request additional limit allowances to speed up training, but so far the available data appears sufficient for basic language detection. I do not anticipate major blockages from here.
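One simple way to score a classification error rate on these token sequences (a sketch of a possible metric, not the final implementation) would be a position-wise comparison of the golden and predicted language tokens:

```python
def classification_error_rate(golden: str, predicted: str) -> float:
    """Fraction of positions where the predicted language token differs from
    the golden one. Sequences are strings over {M, E, U} (U = 'UNK');
    extra or missing tokens at the end are counted as errors."""
    length = max(len(golden), len(predicted))
    if length == 0:
        return 0.0
    errors = sum(g != p for g, p in zip(golden, predicted))
    errors += abs(len(golden) - len(predicted))
    return errors / length

# classification_error_rate("MMMEEE", "MMEEEE")  # -> 1/6
```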

 

Nick’s Status Report for 3/19

Since the last update, I’ve finished writing all the infrastructure needed for complete training of our LID and ASR models on the SEAME dataset. Because of the dataset’s unusual directory layout, multiple sets of labels, and unsplit test and training data, this had to be accomplished through an indexing process. I completed a script that indexes and stores all of the labels associated with each file and with each possible utterance within each file; the resulting index is about 10 MB and loads extremely quickly. Using this index, we can now create separate data loaders for the separate tasks, each capable of labelling our data as needed for its application. The index also separates the data into training and test sets, with one test set biased toward Mandarin and the other toward English. I also completed a data collator, which is used to make forward passes over batches as efficient as possible during training. Training now continues on the LID model end-to-end.
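In sketch form, the collator just pads variable-length waveforms and label sequences so a whole batch can be run as single tensors; the field names below are illustrative, not the exact ones produced by the index:

```python
import torch

class PaddedBatchCollator:
    """Illustrative collator: pads variable-length inputs and labels in a
    batch so a single forward pass can cover the whole batch."""

    def __init__(self, pad_label_id: int = -100):
        self.pad_label_id = pad_label_id

    def __call__(self, batch):
        # batch: list of dicts with "input_values" (1-D float tensor)
        # and "labels" (1-D long tensor), as produced by the dataset index.
        inputs = [item["input_values"] for item in batch]
        labels = [item["labels"] for item in batch]
        input_lengths = torch.tensor([len(x) for x in inputs])
        label_lengths = torch.tensor([len(y) for y in labels])
        padded_inputs = torch.nn.utils.rnn.pad_sequence(inputs, batch_first=True)
        padded_labels = torch.nn.utils.rnn.pad_sequence(
            labels, batch_first=True, padding_value=self.pad_label_id
        )
        return {
            "input_values": padded_inputs,
            "labels": padded_labels,
            "input_lengths": input_lengths,
            "label_lengths": label_lengths,
        }
```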

I’m on schedule for delivering the working LID model. This week will mostly be about supporting and continuing training. A small amount of coding is also needed to set up running both models jointly (combining their outputs, etc.).

Nick’s Status Report for 2/26

This week I was heavily focused on finalizing and documenting the specifics of our system’s language identification model, as well as other parts of the system’s overall language model design. I uploaded the SEAME dataset to my storage instance on AWS and have begun setting up my training schema (data loading, validation and evaluation set partitioning, etc.). In this I find myself a day or two behind schedule. I plan to spend all of Sunday working exclusively on getting the system fully orchestrated so that all that remains is making model-architecture or hyperparameter adjustments. I’ll start this week with a very small model and scale it up as I validate each successive round of results. Given that we have a break coming up, it is crucial that the full system be able to train overnight by the end of the next school week; this will be my primary implementation focus.

Otherwise, most of my time this week was spent on our design document and sub-module specifications, including unit tests, metrics, and validation. Though the final design document still needs formatting work, I wrote much of the specification and explanation in a separate shared document, which will allow easy transfer into the final document for the coming deadline.