Marco’s Status Report for 4/23

This week we built a new framework for our speech recognition system. The input audio is first fed into a language identification (LID) model that we trained to split the audio into segments of either English or Mandarin. Each segment is then passed to either an English or a Mandarin speech recognition model. This approach drastically reduces the complexity of the problem, and as a result we achieved much higher accuracy.
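As a rough sketch of how this routing could look in code (the `Segment` fields, the `english_asr`/`mandarin_asr` callables, and the sample-index slicing are illustrative assumptions, not our actual interfaces):

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np

@dataclass
class Segment:
    start: int       # start sample index
    end: int         # end sample index (exclusive)
    language: str    # "en" or "zh", as predicted by the LID model

def transcribe_segments(audio: np.ndarray,
                        segments: List[Segment],
                        english_asr: Callable[[np.ndarray], str],
                        mandarin_asr: Callable[[np.ndarray], str]) -> str:
    """Route each LID-labeled segment to the matching ASR model
    and stitch the transcripts back together in order."""
    pieces = []
    for seg in segments:
        clip = audio[seg.start:seg.end]          # slice out this segment
        if seg.language == "en":
            pieces.append(english_asr(clip))     # English recognizer
        else:
            pieces.append(mandarin_asr(clip))    # Mandarin recognizer
    return " ".join(pieces)
```

Because each recognizer only ever sees monolingual audio, neither model has to handle code-switching on its own.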

In addition, I wrote a Python script that can run the program locally on any computer. The script detects silence, chunks the audio at the pauses, and processes each chunk while the speaker is still talking.
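A minimal sketch of the silence-based chunking idea, assuming the audio arrives as an iterable of short mono float32 frames and using a simple energy threshold in place of whatever silence detector the script actually uses:

```python
import numpy as np

def chunk_on_silence(frames, energy_threshold=0.01, min_silence_frames=15):
    """Yield audio chunks split at stretches of silence.

    `frames` is any iterable of short mono float32 arrays (e.g. 30 ms each,
    coming from a microphone callback). A chunk is emitted as soon as we see
    `min_silence_frames` consecutive low-energy frames, so recognition can
    start on earlier speech while the speaker is still talking.
    """
    buffer, silent = [], 0
    for frame in frames:
        buffer.append(frame)
        rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))  # frame energy
        silent = silent + 1 if rms < energy_threshold else 0
        if silent >= min_silence_frames and len(buffer) > silent:
            yield np.concatenate(buffer[:-silent])   # speech before the pause
            buffer, silent = [], 0                   # start a fresh chunk
    if buffer:                                       # flush whatever remains
        yield np.concatenate(buffer)
```

Emitting chunks at pauses keeps latency low, since the recognizer can work on each chunk while the next one is still being recorded.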

For next week, I plan to address some of the minor problems we saw this week. First, I will smooth the segments produced by the LID module by merging segments that are too short (which usually indicates an inaccurate language prediction) into nearby longer segments, as in the sketch below. Second, an autocorrect module could be added as a post-processing step to further improve accuracy.
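One possible shape for that segment-smoothing step, assuming the LID output is a time-ordered list of `(start, end, language)` tuples in sample indices (the real module's output format may differ):

```python
def smooth_segments(segments, min_len):
    """Merge segments shorter than `min_len` samples into a longer neighbor.

    `segments` is a time-ordered list of (start, end, language) tuples.
    A too-short segment usually carries an unreliable language prediction,
    so it is absorbed into whichever adjacent segment is longer, keeping
    that neighbor's language label.
    """
    segments = list(segments)
    i = 0
    while i < len(segments):
        start, end, _ = segments[i]
        if end - start >= min_len or len(segments) == 1:
            i += 1
            continue
        prev_len = segments[i - 1][1] - segments[i - 1][0] if i > 0 else -1
        next_len = segments[i + 1][1] - segments[i + 1][0] if i + 1 < len(segments) else -1
        if prev_len >= next_len:
            # Extend the previous segment forward to cover the short one.
            p_start, _, p_lang = segments[i - 1]
            segments[i - 1] = (p_start, end, p_lang)
        else:
            # Extend the next segment backward to cover the short one.
            _, n_end, n_lang = segments[i + 1]
            segments[i + 1] = (start, n_end, n_lang)
        del segments[i]
    return segments
```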
