This week, I experimented with two approaches to improving transcription accuracy while keeping the user experience as smooth as possible.
The first approach keeps the real-time transcription, feeding the model 1-second audio chunks. What is new is that at the end of each recording session, I rerun the model on the entire audio. This pass can take about as long as the audio itself, but it produces a significantly more accurate transcription, which acts as an “autocorrection” for the whole transcript. I then present this full-audio output below the real-time text and let the user decide whether to use or ignore the “autocorrection”. The experience is shown below:
The second approach also keeps the real-time transcription from 1-second audio chunks, but while the user is recording, every 3 seconds I resend the last 3 seconds of audio to the model and replace the real-time text generated for those 3 seconds with the “autocorrected” text. This experience is closer to the autocorrecting transcription that Siri provides. The “autocorrected” transcription from 3-second chunks was more accurate than the original transcription from 1-second chunks but, as expected, less accurate than the output of running the model over the entire audio (approach 1).
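To make the difference between the two approaches concrete, here is a minimal Python sketch of the control flow. It is not the app's actual code: the `Session` class, the `transcribe` callable, and the `fake_transcribe` stand-in are hypothetical names used only for illustration, and audio chunks are represented as plain strings.

```python
from typing import Callable, List

# Hypothetical stand-in for the model: takes a list of 1-second chunks, returns text.
Transcriber = Callable[[List[str]], str]


class Session:
    """Real-time transcription with optional rolling re-evaluation (approach 2)
    and a full-audio pass at the end of the recording (approach 1)."""

    def __init__(self, transcribe: Transcriber, window: int = 3, rolling: bool = False):
        self.transcribe = transcribe
        self.window = window            # re-evaluation window, in chunks (seconds)
        self.rolling = rolling          # True enables approach 2
        self.chunks: List[str] = []     # every 1-second chunk received so far
        self.segments: List[str] = []   # text currently shown to the user

    def on_chunk(self, chunk: str) -> None:
        # Real-time path: transcribe the newest 1-second chunk on its own.
        self.chunks.append(chunk)
        self.segments.append(self.transcribe([chunk]))
        # Approach 2: every `window` chunks, re-transcribe the last `window`
        # chunks together and replace the text generated for them.
        if self.rolling and len(self.chunks) % self.window == 0:
            corrected = self.transcribe(self.chunks[-self.window:])
            self.segments[-self.window:] = [corrected]

    def text(self) -> str:
        return " ".join(self.segments)

    def finish(self) -> str:
        # Approach 1: rerun the model on the entire audio and return the result
        # as a suggested "autocorrection" the user can accept or ignore.
        return self.transcribe(self.chunks)


def fake_transcribe(chunks: List[str]) -> str:
    # Toy model: a "?" marks low-confidence output produced from a single chunk.
    text = " ".join(chunks)
    return text if len(chunks) > 1 else text + "?"


session = Session(fake_transcribe, window=3, rolling=True)
for chunk in ["a", "b", "c", "d"]:
    session.on_chunk(chunk)
print(session.text())    # real-time text with the first 3 seconds corrected
print(session.finish())  # full-audio suggestion
```

With `rolling=False` the class models approach 1 only: the real-time text is left untouched and the correction arrives once, from `finish()`.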
The next step is to incorporate Nick’s newly trained transcription model (trained on the large SEAME dataset), which could improve transcription accuracy, especially for English. The model has relatively smooth performance. Also, once Marco’s site has collected enough audio samples, I will run batch tests on our model to get a more rigorous evaluation on more diverse input audio.
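As a rough idea of what such a batch test could compute, the sketch below scores a set of (reference, hypothesis) transcript pairs with word error rate. The choice of metric is my assumption, not something already decided, and `edit_distance` and `batch_wer` are illustrative names. Splitting on whitespace also assumes the text has word boundaries; text without them would need a different tokenizer.

```python
from typing import List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def batch_wer(pairs: List[Tuple[str, str]]) -> float:
    """Word error rate over a batch: total word edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words if words else 0.0


# Made-up sample pairs, for illustration only.
samples = [
    ("turn on the light", "turn on light"),
    ("hello world", "hello word"),
]
print(f"WER: {batch_wer(samples):.2f}")
```

Pooling errors over the whole batch, rather than averaging per-sample rates, keeps very short recordings from dominating the score.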
So far, integration and evaluation are on schedule.