This week I focused on integrating a codeswitching ASR model we trained into the web app and deploying the web app to a public IP address. The integration with our model was fairly smooth. Our own model does not show significant performance issues compared with the ASR models I previously tested in our app. When I send and process the ongoing recording in 1-second chunks, the user gets a real-time transcription back with a delay of close to 1 second. The model accurately captured instances of language switching within a spoken sentence. However, it suffers from the same problem as the earlier models: because we feed the model only a 1-second audio chunk at a time, the context of the spoken sentence is lost and transcription accuracy is poor. The model can only give its best transcription based on the audio features within that 1-second chunk.
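To make the fixed-length chunking concrete, here is a minimal sketch in Python. It is not the app's actual code: the function name `fixed_chunks`, the list-of-floats sample format, and the 16 kHz sample rate in the usage note are assumptions for illustration only.

```python
from typing import Iterator, List


def fixed_chunks(
    samples: List[float],
    sample_rate: int,
    chunk_seconds: float = 1.0,
) -> Iterator[List[float]]:
    """Yield consecutive fixed-length chunks of audio samples.

    Hypothetical helper: the last chunk may be shorter than the others
    if the recording length is not a multiple of the chunk length.
    """
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk length must be at least one sample")
    for start in range(0, len(samples), chunk_len):
        # Each chunk is transcribed on its own, so the model sees no
        # audio from before or after this slice.
        yield samples[start:start + chunk_len]
```

For example, 2.5 seconds of audio at an assumed 16 kHz yields two full 16000-sample chunks and one 8000-sample remainder. Since every chunk is sent to the model in isolation, a word that straddles a chunk boundary is split across two requests, which is one way the sentence context gets lost.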
I tried cutting the audio into chunks at 300 ms silences, hoping that each chunk delimited by silence would encapsulate a more complete local context. However, each of these chunks is 3-4 seconds long, so analyzing the longer chunks would significantly increase the transcription delay and break the real-time experience of our app.
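One way such silence-based chunking could look is sketched below. This is an illustrative energy-threshold approach, not necessarily the method used in the app: the function name `split_on_silence`, the 10 ms frame size, and the RMS threshold of 0.01 are all assumed values, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
from typing import List, Tuple


def split_on_silence(
    samples: List[float],
    sample_rate: int,
    min_silence_ms: int = 300,   # silence length that ends a chunk
    frame_ms: int = 10,          # assumed analysis frame size
    threshold: float = 0.01,     # assumed RMS level below which a frame is silent
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of chunks separated by silence."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    min_silent_frames = max(1, min_silence_ms // frame_ms)

    segments: List[Tuple[int, int]] = []
    seg_start = None      # start index of the chunk being built, if any
    last_voiced_end = 0   # end index of the most recent non-silent frame
    silent_run = 0        # consecutive silent frames inside the current chunk

    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms >= threshold:
            if seg_start is None:
                seg_start = i
            last_voiced_end = i + len(frame)
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # Enough silence: close the chunk at the last voiced frame.
                segments.append((seg_start, last_voiced_end))
                seg_start = None
                silent_run = 0

    if seg_start is not None:
        segments.append((seg_start, last_voiced_end))
    return segments
```

In this sketch a chunk can only be emitted once the full 300 ms of silence has been observed, so the transcription delay grows with the length of the utterance. That matches the trade-off described above: pauses shorter than the threshold are absorbed into the chunk, which keeps more context but makes the chunks several seconds long.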
I have finished deploying the app to a public IP address over HTTP, but I found that most browsers only allow access to user media (including the microphone) when the page runs on localhost or in a secure HTTPS context. My next step is therefore to purchase a domain name and finish deploying the app on a public domain.
We are also training another codeswitching model, which I will test integrating with our app next week. So far, development is on schedule. We have finished integrating the app with our model, and we are ready to start tuning the model and researching a better chunking mechanism that captures more context in each audio chunk while keeping the chunk as short as possible. We will also collect mixed-language audio samples from YouTube to evaluate our model's performance on diverse speech features.