Nick’s Status Report for 2/12

This week I worked on getting AWS configured for the DL language model we intend to deploy. I submitted resource requests for AWS credits and a limit increase for GPU instance types. We plan to use G-type instances for most development, though we may deploy some P-type instances for especially heavy system-wide training in later stages. I downloaded and set up the latest version of JupyterLab for remote development and was able to SSH properly into my first instances configured with an AWS Deep Learning AMI. I initially had some trouble SSHing in, so I spent significant time reconfiguring my AWS security groups and VPC with the correct permissions so that I can now access these servers, as well as any future instances we launch.
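For future reference, the fix essentially came down to opening port 22 in the instance's security group. A minimal boto3 sketch of that rule is below; the region, group ID, and CIDR range are placeholders rather than our actual values.

```python
# Sketch (boto3) of the SSH ingress rule added to the security group.
# GroupId, region, and CIDR below are placeholders, not our real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # hypothetical security group ID
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            # In practice this is restricted to our own IP range, not opened broadly.
            "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "dev machines"}],
        }
    ],
)
```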

Progress is on track. We are ahead of schedule on our implementation, as the official implementation period does not begin for at least another week and a half and we have already had success with several early steps of development. This week I will also focus heavily on making and documenting key design decisions in detail. These will be presented at next week's design presentation, which I will be conducting.

There are several major things I plan to complete by next week. I'd like to have detailed architectures finalized for several versions of the LID and ASR models; there are a couple of different formulations I'd like to experiment with. Marco and I will also need to finalize the task division we'd like to use for developing the sub-models of the overall system. That will let us document and finalize the different datasets we may need to compile or augment for module-level training. By next weekend we should have small development versions of both the LID and ASR models running on remote instances and ready for further training and development.
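As a placeholder for what such a "small development version" could look like, the sketch below loads a pretrained wav2vec2 ASR bundle from torchaudio and greedily decodes a short clip. This is only a stand-in baseline to verify the remote environment, not one of the architectures we will actually finalize; the clip path is hypothetical and a mono recording is assumed.

```python
# Stand-in ASR baseline: pretrained wav2vec2 from torchaudio with greedy CTC decoding.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

waveform, sample_rate = torchaudio.load("sample.wav")  # placeholder mono clip
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)           # frame-level label scores
tokens = emissions[0].argmax(dim=-1)         # greedy best label per frame

labels = bundle.get_labels()                 # '-' is the CTC blank, '|' the word separator
transcript = "".join(
    labels[i] for i in torch.unique_consecutive(tokens).tolist() if labels[i] != "-"
)
print(transcript.replace("|", " "))
```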

Honghao’s Status Report for 2/12

This week I focused on setting up the Django environment for our web app. I finished researching the JavaScript library for constructing and sending audio streams to the backend. On localhost, we can now serve a frontend page with a record button that lets users record their voice. Once they stop recording, we display an audio playback widget on the frontend and at the same time send the audio stream to the backend, which stores the audio data in a .wav file on the server side.

This audio stream transfer is crucial to our project because we need to send audio streams from the frontend to the backend and store them as .wav files to feed into our speech recognition model.
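The backend half of this transfer currently looks roughly like the Django view sketched below; the view name, form field name, and output path are placeholders rather than our final API, and the hard-coded audio parameters are the likely cause of the noise issue described next.

```python
# Rough sketch of the Django endpoint that receives the recorded audio blob
# and writes it out as a .wav file on the server side.
import wave

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt


@csrf_exempt
def upload_audio(request):
    if request.method != "POST":
        return JsonResponse({"error": "POST required"}, status=405)

    pcm_bytes = request.FILES["audio"].read()  # hypothetical form field name

    # Channel count and frame rate are currently hard-coded, which is probably
    # where the noise in the saved file comes from.
    with wave.open("uploaded.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)       # 16-bit samples
        wf.setframerate(44100)
        wf.writeframes(pcm_bytes)

    return JsonResponse({"status": "ok"})
```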

However, I am still having some trouble with noise in the .wav file created on the server side. Based on my research so far, the noise is probably caused by the fact that I arbitrarily set the number of channels and the frame rate when writing the audio data to the .wav file on the server side. Solving this issue will require further research on how to retrieve the frame rate and channel count of the original audio data.
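One fix I am considering (not yet decided) is to have the frontend report its AudioContext sample rate and channel count along with the PCM data, and use those values when writing the file instead of hard-coded ones. A sketch of that idea, with hypothetical field names, is below.

```python
# Sketch: write the .wav using the parameters reported by the client
# rather than arbitrary hard-coded values.
import wave

def write_wav(path, pcm_bytes, sample_rate, n_channels, sample_width=2):
    """Write raw PCM bytes using the client-reported audio parameters."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(n_channels)
        wf.setsampwidth(sample_width)   # bytes per sample (2 -> 16-bit PCM)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)

# e.g. inside the Django view (field names are assumptions):
#   sample_rate = int(request.POST["sampleRate"])
#   n_channels  = int(request.POST["channels"])
#   write_wav("uploaded.wav", request.FILES["audio"].read(), sample_rate, n_channels)
```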

So far, progress on finishing a proof of concept for our audio stream transfer module is on schedule. I still have a week before the design document deadline (2/20/2022) to finish the proof of concept and document the module with a detailed sketch. Next week I will resolve the audio noise issue and draw a detailed sketch of the audio transfer module that includes flow charts and the packages used along the way.

Our Idea

Most speech recognition apps perform well today. From Siri to voice texting on most smartphones, the accuracy and processing speed of speech recognition in these apps are great, with one annoying limitation though: most apps only support a single language mode. Siri, for example, only supports single-language recognition. If you set Siri's language to English and speak to it in a mix of two languages, say Mandarin and English, you will find that Siri treats everything you said as English and transcribes the parts you spoke in Mandarin as gibberish.

So we want to build an app that provides accurate, real-time recognition for speech that mixes Mandarin and English. It will be very useful for (1) voice texting when a bilingual speaker mixes English and Mandarin, and (2) transcription for international conferences when attendees from different countries hold a mixed-language dialogue. Our goal for our app is to reach a word error rate (~10%) that matches existing single-language speech recognition apps and an end-to-end latency of less than 1 second, so that the recognition can keep up with a normal human speaking rate (~100 words per minute).
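For concreteness, we plan to score the ~10% target with standard word error rate: the word-level edit distance between the reference and the hypothesis divided by the number of reference words. The helper below is a small sketch of that computation; the example transcripts are made up.

```python
# Sketch of the word error rate (WER) metric we will use to evaluate the system.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# Made-up mixed-language example: one substitution plus one insertion over 5 reference words.
print(wer("turn on the 空调 please", "turn on the air conditioner please"))  # 0.4
```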