This week on the web app development side, we verified our deployment strategy by deploying and test-running an English ASR model and a Mandarin ASR model on an AWS server. We were able to make a recording in our frontend app, which sends the audio data to the deployment server and receives the transcription output from the server, as shown below.
Because these are basic pretrained ASR models with no additional training, transcription accuracy is not very high, but the test run verifies that our current deployment strategy is feasible for deploying our own model.
During the test run, we noticed an issue with the audio chunks received on the server: every chunk after the first one is empty. While we research a fix, we have applied a workaround in which the frontend sends the entire audio recorded since the beginning of the session in every request. This avoids the empty-chunk issue, but it sacrifices app performance: as we decrease the audio sending period, the model has to run on more and more audio data. For example, with a sending period of 1 second, the model has to run on 1 s of audio from the beginning of the session, then 2 s of audio from the beginning of the session, and so on, up to the full n seconds of the session. The total length of audio passed through the model for an n-second recording session is therefore 1 + 2 + ... + n = n(n+1)/2, which is O(n^2), whereas if we could correctly send n separate 1-second chunks, the total would be only n, which is O(n).
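To make the cost difference concrete, the two totals can be computed directly. This is only a minimal sketch of the arithmetic above, not our app code; the function names and the `period_seconds` parameter are hypothetical, and it assumes the session length is a whole multiple of the sending period.

```python
def total_audio_cumulative(session_seconds: int, period_seconds: int = 1) -> int:
    """Seconds of audio the model processes when every request resends
    the whole recording from the start of the session (the workaround)."""
    requests = session_seconds // period_seconds
    # Request k carries k * period_seconds of audio.
    return sum(k * period_seconds for k in range(1, requests + 1))


def total_audio_chunked(session_seconds: int, period_seconds: int = 1) -> int:
    """Seconds of audio the model processes when each request carries
    only the newest chunk (the behavior we want once the bug is fixed)."""
    requests = session_seconds // period_seconds
    return requests * period_seconds


if __name__ == "__main__":
    n = 60  # a one-minute recording session
    print(total_audio_cumulative(n), total_audio_chunked(n))
```

For a 60-second session with a 1-second sending period, the workaround passes 1830 seconds of audio through the model, compared with 60 seconds for correctly chunked audio.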
Next week, we will work on fixing the empty audio chunk issue and deploy a pretrained language detection model on our server, so that the language detection model can decide, for each audio frame, which language model should transcribe it. This is how we plan to support code-switching.
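The per-frame routing we have in mind could look roughly like the sketch below. It is an illustration under assumptions rather than our deployed code: `detect_language`, the `asr_models` mapping, and the language tags `"en"` and `"zh"` are hypothetical placeholders for the language detection model and the two ASR models.

```python
from typing import Callable, Dict, Iterable, List, TypeVar

Frame = TypeVar("Frame")


def transcribe_code_switched(
    frames: Iterable[Frame],
    detect_language: Callable[[Frame], str],
    asr_models: Dict[str, Callable[[Frame], str]],
) -> List[str]:
    """Route each audio frame to the ASR model chosen by the language detector."""
    pieces = []
    for frame in frames:
        lang = detect_language(frame)   # hypothetical: returns e.g. "en" or "zh"
        transcribe = asr_models[lang]   # pick the matching ASR model
        pieces.append(transcribe(frame))
    return pieces


if __name__ == "__main__":
    # Stand-in models so the sketch runs without any real audio or ML code.
    fake_frames = ["en:hello", "zh:你好", "en:world"]
    detect = lambda f: f.split(":", 1)[0]
    models = {
        "en": lambda f: f.split(":", 1)[1],
        "zh": lambda f: f.split(":", 1)[1],
    }
    print(transcribe_code_switched(fake_frames, detect, models))
```

Passing the detector and the models in as plain callables keeps the routing logic independent of whichever pretrained models end up on the server.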
On the web app side, we do not expect the empty audio chunk issue to significantly delay our schedule. Potential risks to the deployment schedule lie mainly in differences in required environment and model complexity between the two tested ASR models and our own model.
On the model training side, we have finished the data loader for the SEAME dataset (the English-Mandarin code-switching dataset). We are still training the language detection model, the English CTC model, and the Mandarin CTC model.