Nick’s Status Report for 2/19

This week I focused heavily on two things: aggregating all of the datasets and material I will need for development and uploading them to my remote development environment (AWS), and drilling down into the metrics, requirements, and detailed design of our whole system for the design review. On the development side, I was able to retrieve and store the latest wav2vec 2.0 model for feature extraction, specifically targeting mixed-language inputs. Marco and I will be able to further fine-tune this model once we reach system-level training of the ASR model. I also registered for the SEAME Mandarin-English code-switching dataset and am waiting for final approval for full access. Marco was granted access within a day or two, so I plan to have full access to that data by Monday. My remote Jupyter notebook setup is also fully configured. Using dedicated GPU instances, we'll be able to train continuously overnight without having to worry about uptime interruptions.
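
For reference, pulling a checkpoint like this into the remote environment only takes a few lines with Hugging Face transformers; the sketch below is illustrative, and the checkpoint name and cache path are assumptions rather than our exact configuration.

```python
# Minimal sketch (assumed checkpoint name and cache path, not our exact setup):
# pull a multilingual wav2vec 2.0 checkpoint onto the instance for later feature extraction.
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_ID = "facebook/wav2vec2-xls-r-300m"  # one publicly available multilingual checkpoint
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID, cache_dir="/data/models")
model = Wav2Vec2Model.from_pretrained(MODEL_ID, cache_dir="/data/models")

print(model.config.hidden_size)  # dimensionality of the extracted speech representations
```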

On the design side, I completed detailed design documentation for each of the modules I will be either entirely or partially responsible for (Modules 1, 3, 7), with traceability matrices for requirements, unit testing, and validation. Each requirement traces upward to design-level requirements and downward to a specific test, making it easy to track how lower-level decisions have been informed by high-level use-case targets. I added all of these matrices, along with system-level versions of them, to our ongoing design review document, which also includes module-level architecture descriptions and interface details between modules.

I’m currently on track with my planned work. Since Marco was able to gain access to the SEAME database, both of us have an extra two days in the schedule for either training work or system integration work at the end of the semester. This week I plan to finish our design review presentation and our design review document, and I'm targeting having a first version of the LID model initialized and ready for initial training by next weekend.

Marco’s Status Report for 2/19

This week I mainly focused on getting the dataset that we need for training our model. The scarcity of Mandarin-English code-switching speech corpora has always been one of our main concerns. We previously gathered a dataset from an English-teaching language institution in China, but we were worried about its quality. In particular, the setting is constrained, so the recordings may not be representative of the natural, fluent speech that we are looking for. Another concern is the variability of the speech: if most of the recordings feature only a few teachers, then the model will be heavily skewed and will overfit to their voices.

Another idea was to piece together existing speech from both languages to create data. However, one particular concern with this approach is the fluency of the speech. If the segment boundaries are abrupt, the result doesn't represent the fluent, continuous speech that we are looking for. A potential consequence is that the deep neural network would learn to pick up on these splice points to identify language changes.

Fortunately, after a lot of digging online, I found access to two field-proven datasets: ASCEND and SEAME. ASCEND is a dataset released in 2021 containing code-switching speech from both Hongkongers and Mainland Chinese speakers, totaling about 10 hours. SEAME is a widely used code-switching speech recognition dataset with recordings of Malaysian and Singaporean speakers, totaling about 90 hours. These two datasets should satisfy our needs for training data.
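
ASCEND is distributed through the Hugging Face Hub, so loading it should look roughly like the sketch below; the repository id and column name are taken from my reading of the dataset card and should be treated as assumptions.

```python
# Hedged sketch: load the ASCEND corpus via the datasets library.
# The repo id "CAiRE/ASCEND" and the "transcription" column are assumptions from the dataset card.
from datasets import load_dataset

ascend = load_dataset("CAiRE/ASCEND", split="train")
print(len(ascend))                  # number of utterances in the train split
print(ascend[0]["transcription"])   # mixed Mandarin-English transcript for one clip
```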

Honghao’s Status Report for 2/19

This week I focused on refining the design requirements, numeric goals, and testing plans for the audio transfer module, the web frontend text display module, and the backend .wav generation API module. After drafting detailed documentation of the design metrics, I continued researching audio transfer tools whose performance can meet our design targets and finished implementing an audio transfer module that periodically sends requests carrying audio data to the server.

To test whether the audio transfer module meets our design requirement for .wav generation latency, I logged the time the server received the audio data from the web frontend and the time it finished generating the corresponding .wav file. The .wav generation time exceeds our intended 50 ms but stays under 300 ms. We will tolerate this result until we have further tested the client-to-server transmission latency (after our server is deployed) and the ML model's average run time on 2-second audio clips.
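
The measurement itself is just a pair of timestamps around the .wav generation step on the server; a rough sketch of the idea is below, with placeholder function names rather than our actual handler.

```python
# Sketch of the latency measurement; handle_audio_request / write_wav_file are placeholders.
import logging
import time

def handle_audio_request(audio_bytes: bytes) -> None:
    t_received = time.monotonic()          # server time at receiving the audio data
    write_wav_file(audio_bytes)            # placeholder for the .wav generation step
    elapsed_ms = (time.monotonic() - t_received) * 1000
    logging.info(".wav generation took %.1f ms", elapsed_ms)
```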

Next week, I will move on to deploying an AWS server with a GPU so that we can load a pre-trained ASR model, experiment with network latency and model run time, and better estimate the practicality of our current design metrics.

Marco’s Status Report for 2/12

This week I primarily worked on getting a naive speech recognition model working. Based on our research, we will be using connectionist temporal classification (CTC), a training objective that can predict an output sequence, in our case words, from an input of unconstrained length without requiring a frame-level alignment. In addition, the deep neural network does not make any assumptions about the language, which allows it to potentially learn syntax and other structures of the language.
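
To make the length-mismatch point concrete, here is a toy sketch using PyTorch's built-in CTC loss; all of the shapes and the vocabulary size are made up for illustration.

```python
# Toy CTC example: 120 input frames are trained against shorter label sequences of
# varying length, with no frame-level alignment provided. All shapes are illustrative.
import torch
import torch.nn as nn

T, N, C = 120, 4, 30                                    # frames, batch size, vocab size (incl. blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)    # stand-in for acoustic model outputs
targets = torch.randint(1, C, (N, 25), dtype=torch.long)        # label ids (0 is reserved for blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, 26, (N,), dtype=torch.long)  # each utterance has its own length

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```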

I chose Google Colab for prototyping due to its portability and convenience of setup. I decided to use Hugging Face for training the model because it works with both PyTorch and TensorFlow. I tried setting up two models. The first one performed speech recognition purely in English. This is relatively easy to set up since the model only needs to classify the 26 letters of the alphabet. It achieved a reasonable WER within a few hours of training.
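
For reference, the English-only pipeline looks roughly like the hedged sketch below, which runs one of the publicly available English wav2vec 2.0 CTC checkpoints on a single clip; the checkpoint name is an example, not necessarily the one I fine-tuned from.

```python
# Hedged sketch: transcribe a 16 kHz waveform with a public English CTC checkpoint.
# "facebook/wav2vec2-base-960h" is an example checkpoint, not necessarily the one I trained from.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(waveform, sample_rate=16000):
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits      # per-frame letter probabilities
    ids = torch.argmax(logits, dim=-1)                  # greedy CTC decoding
    return processor.batch_decode(ids)[0]
```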

For the second model, I tried training on Mandarin. For the sake of simplicity, I decided to use a naive approach and treat individual Chinese characters as tokens. This drastically increases the output space, since there are over 5,000 commonly used Chinese characters. However, to my surprise, the model worked better than I expected. The mistakes it made were mostly characters with the same or similar pronunciations. This could mean that the deep neural network is learning to classify characters by sound without any explicit pinyin labeling.
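
The character-as-token vocabulary itself is simple to build; a small sketch with placeholder transcripts is below.

```python
# Sketch of the naive character-level vocabulary; the transcripts are placeholders.
transcripts = ["今天天气很好", "我们一起去吃饭"]

vocab = {"<pad>": 0, "<unk>": 1}
for text in transcripts:
    for ch in text:
        if ch not in vocab:
            vocab[ch] = len(vocab)

print(len(vocab))   # grows toward several thousand entries on a full Mandarin corpus
```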

Next, I will attempt to create a model that uses both English letters and Chinese characters as output tokens and test its accuracy.

Team Status Reports for 2/12

This week our team focused on researching and experimenting with ML models and web packages that are suitable for the individual modules of our project, including the web app, the backend speech recognition model, and the language detection model. Based on this research, we will include the models that show promising results, along with a proof-of-concept web app demo, in our design document, which is due 2/20/2022.

We started developing a web app that will serve as a proof of concept. This web app will let users record their voice and submit the audio data to the server. The server will feed the audio into a speech recognition model and return the model's output to the frontend. This week we finished implementing voice recording on the frontend and audio data transfer to the server. The foreseeable risks in the web app development so far include loss of audio quality during transfer from the frontend to the server and the speed at which our backend model processes the audio. Although we have finished implementing an audio transfer module that can successfully transfer audio data recorded on a web page to our server, the audio file generated on the server is noisy, which will impede our speech recognition accuracy. This issue should be fixed once we find a way to retrieve the original audio's frame rate and number of channels so that these parameters can be used when writing the audio data into a .wav file on the server. So far, we are confident that this issue can be fixed by 2/20/2022.
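
One way to recover those parameters, assuming the uploaded blob is in a container ffmpeg can decode (e.g. webm or ogg from the browser), is to probe it with pydub before re-writing the .wav; the sketch below illustrates the idea and is not our current implementation.

```python
# Hedged sketch: probe the uploaded recording's frame rate / channel count with pydub
# (requires ffmpeg), then export a .wav that keeps those parameters. Paths are placeholders.
from pydub import AudioSegment

def save_upload_as_wav(upload_path: str, wav_path: str) -> None:
    segment = AudioSegment.from_file(upload_path)        # ffmpeg detects the container/codec
    print(segment.frame_rate, segment.channels, segment.sample_width)
    segment.export(wav_path, format="wav")               # preserves the original rate/channels
```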

We will also try running a speech recognition model on our development server to estimate the speed of our system. If the speed of processing is too slow, we will try switching to a more powerful server instance type.

We also made the necessary resource requests for increasing our GPU instance limits as well as for AWS credits. We set up our respective development environments (Colab or JupyterLab) for remotely developing our software on the AWS GPU instances. Next steps include uploading our desired development datasets onto our respective storage servers and curating our first, smaller development sets for training the ASR and LID models. By next week we hope to have subtasks specifically divided in terms of model development, and early models running for each sub-model of the whole DL system. We'll also aim to have the detailed architecture of several iterations of the system finalized for the upcoming design presentation and document. The major risks in this area concern our models' performance and training time. We can't be sure exactly how quickly to expect progress, or what level of performance to expect by integration and deployment time, until we start development and begin characterizing these metrics. Taking as many early development steps as we can now will help address these risks and also help us understand exactly how large our vocabulary and datasets need to be to achieve the user experience we are seeking.

Nick’s Status Report for 2/12

This week I worked on getting AWS configured for the DL language model we intend to deploy. I made resource requests for AWS credits and a limit increase on GPU instance types. We plan to use G-type instances for most development, though we may deploy some P-type instances for especially heavy system-wide training in later stages. I downloaded and set up the latest version of JupyterLab for remote development and was able to SSH properly into my first instances configured with an AWS Deep Learning AMI. I initially experienced some issues SSHing in, so I spent significant time reconfiguring my AWS security groups and VPC with the correct permissions to allow me to access the servers now and on any future instances we may launch.

Progress is currently on track. We are ahead of schedule on our implementation, as the official implementation period does not begin for at least another week and a half and we've already had success with several early steps of development. This week I will also focus heavily on making and documenting key design decisions in detail. These will be presented next week at the design presentation, which I will be conducting.

There are several major things I plan to complete by next week. I'd like to have detailed architectures finalized for several versions of the LID and ASR models; there are a couple of different formulations I'd like to experiment with. Marco and I will also need to finalize the actual task division we'd like to use for developing the sub-models of the overall system. That way, he and I will also be able to document and finalize the different datasets we may need to compile or augment for module-level training. By next weekend we should have small development versions of both the LID and ASR models running on remote instances and completely ready for further training and development.

Honghao’s Status Report for 2/12

This week I focused on setting up the Django environment for our web app. I finished researching the JavaScript library for constructing and sending audio streams to the backend. On localhost, we can start a web frontend page that has a record button for users to record their voice. Once they stop recording, we display an audio playback on the frontend page and at the same time send the audio stream to the backend, which in turn stores the audio data in a .wav file on the server side.
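
A minimal Django view for the receiving end might look like the sketch below; the view name, form field name, and output path are placeholders rather than our actual code.

```python
# Hypothetical minimal Django view for receiving the recorded audio blob.
# The form field name ("audio") and output path are placeholders.
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def upload_audio(request):
    if request.method == "POST" and "audio" in request.FILES:
        blob = request.FILES["audio"]
        with open("recordings/latest_upload.raw", "wb") as out:
            for chunk in blob.chunks():      # stream the upload to disk in chunks
                out.write(chunk)
        return JsonResponse({"status": "received"})
    return JsonResponse({"status": "bad request"}, status=400)
```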

This audio stream transfer is crucial to our project because we need to send audio streams from frontend to backend and store them as .wav files to feed into our speech recognition model.

However, I am still having some trouble with noise in the .wav file created on the server side. Based on my research so far, the noise is probably caused by arbitrarily setting the number of channels and the frame rate when writing the audio data to the .wav file on the server. Solving this issue will require further research on how to retrieve the frame rate and channel count of the original audio data.
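
Once the frame rate and channel count are known, writing the .wav is straightforward with the standard-library wave module; the sketch below assumes the frontend delivers 16-bit PCM, which is an assumption about the encoding rather than something I have verified.

```python
# Sketch: write raw PCM bytes to a .wav with explicit parameters instead of guessed ones.
# Assumes the frontend delivers 16-bit PCM and reports its sample rate and channel count.
import wave

def write_wav(path: str, pcm_bytes: bytes, sample_rate: int, n_channels: int) -> None:
    with wave.open(path, "wb") as wf:
        wf.setnchannels(n_channels)      # must match the recording, not an arbitrary value
        wf.setsampwidth(2)               # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)     # e.g. 44100 or 48000 from the browser's AudioContext
        wf.writeframes(pcm_bytes)
```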

So far, progress toward a proof of concept for our audio stream transfer module is on schedule. I still have a week before the design document deadline (2/20/2022) to finish the proof of concept and visualize the module with a detailed sketch. Next week I will resolve the audio noise issue and draw a detailed diagram of the audio transfer module that includes flow charts and the packages used along the way.

Our Idea

Most speech recognition apps perform well today. From Siri to voice texting on most smartphones, the accuracy and processing speed of speech recognition in these apps are great, with one annoying limit: most apps support only one language at a time. Siri, for example, only supports single-language recognition. If you set Siri's language to English and speak to it in a mix of two languages, Mandarin and English for example, you will find that Siri treats everything you said as English and transcribes the parts you spoke in Mandarin as gibberish.

So we want to build an app that provides accurate, real-time recognition of speech that mixes Mandarin and English. It will be very useful for (1) voice texting when a bilingual speaker mixes English and Mandarin, and (2) transcription for international conferences when attendees from different countries start a mixed-language dialogue. Our goal is to reach a word error rate (~10%) that matches existing single-language speech recognition apps, and an end-to-end latency of less than 1 second so that the recognition can keep up with a normal human speaking rate (~100 words per minute).
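
As a concrete illustration of how we will measure the WER target, here is a tiny example using the jiwer package; the package choice is an assumption, and any edit-distance-based WER implementation behaves the same way.

```python
# Illustrative WER computation with jiwer (package choice is an assumption).
import jiwer

reference = "please send the report to 王老师 tomorrow morning"   # 8 reference words
hypothesis = "please send the report to 王老师 tomorrow"           # one word dropped
print(jiwer.wer(reference, hypothesis))                            # 1 deletion / 8 words = 0.125
```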