Progress
This week I worked on finalizing the speech recognition pipeline and ran basic tests on the subsystem. To simplify the script and save memory, I decided not to save the audio input temporarily as .wav files. Instead, I chose to feed the byte frames directly into the noisereduce and speech recognition methods.
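As a rough illustration of the in-memory approach, the sketch below joins buffered byte frames, decodes them into samples, and passes them on without touching disk. It is not the project script: the 16-bit little-endian mono format, the 16 kHz sample rate, and all function names are assumptions made for the example. The last function relies on noisereduce's reduce_noise and the SpeechRecognition library's AudioData class, and imports them locally so the two helper functions run on their own.

```python
import struct


def frames_to_samples(frames):
    """Join raw PCM chunks and decode them as 16-bit little-endian samples."""
    raw = b"".join(frames)
    count = len(raw) // 2  # two bytes per sample; a trailing odd byte is dropped
    return list(struct.unpack("<%dh" % count, raw[:count * 2]))


def samples_to_bytes(samples):
    """Encode samples back to 16-bit PCM, clamping to the valid range."""
    clamped = [max(-32768, min(32767, int(round(s)))) for s in samples]
    return struct.pack("<%dh" % len(clamped), *clamped)


def denoise_in_memory(frames, sample_rate=16000):
    """Denoise buffered frames and wrap them for a recognizer, with no file I/O.

    The sample rate is an assumed value. Third-party imports stay local so
    the helpers above work without these packages installed.
    """
    import numpy as np
    import noisereduce as nr
    import speech_recognition as sr

    samples = np.array(frames_to_samples(frames), dtype=np.float32)
    cleaned = nr.reduce_noise(y=samples, sr=sample_rate)
    pcm = samples_to_bytes(cleaned.tolist())
    return sr.AudioData(pcm, sample_rate, 2)  # 2 = bytes per sample
```

The returned object could then be handed to whichever recognizer method the pipeline uses, so no temporary .wav file is needed at any step.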
After some research and preliminary testing on the initial version of the script, I realized that separate “start” and “end” events are not needed to control audio recording manually. In the modified version of the audio recording and speech recognition pipeline, recording terminates automatically after a fixed time, and the current session ends if nothing is heard from the speaker for another set period.
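The two automatic stop conditions can be sketched as a loop over incoming audio chunks. This is a simplified illustration, not the actual script: the energy threshold, the time limits, the chunk format (16-bit little-endian PCM), and the function names are all assumed values chosen for the example.

```python
import math
import struct


def rms(chunk):
    """Root-mean-square level of a chunk of 16-bit little-endian PCM."""
    count = len(chunk) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, chunk[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def capture(chunks, chunk_seconds, max_seconds=5.0, idle_seconds=2.0,
            threshold=500.0):
    """Collect chunks until the time cap or a run of silence is reached.

    Returns (frames, reason) where reason is "time_limit", "idle",
    or "stream_end" if the input runs out first.
    """
    frames = []
    elapsed = 0.0  # total recorded time
    quiet = 0.0    # length of the current run of silence
    for chunk in chunks:
        frames.append(chunk)
        elapsed += chunk_seconds
        quiet = 0.0 if rms(chunk) >= threshold else quiet + chunk_seconds
        if quiet >= idle_seconds:
            return frames, "idle"
        if elapsed >= max_seconds:
            return frames, "time_limit"
    return frames, "stream_end"
```

With this structure the caller never sends explicit start or end events. It only reads the returned reason to decide whether to run recognition on the frames or close the session.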
The current script can recognize the standard commands with acceptable accuracy.
A major focus of the speech recognition process is the price. Currently, the price is recognized accurately if the “dollar” keyword is included. However, if the speaker gives an ambiguous spoken command such as “four-sixty”, the recognizer converts it directly to “460”, which differs from the expected value. We may need further discussion on how to handle this case.
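One option that could feed into that discussion is a post-processing rule on the recognizer output. The sketch below is only a hypothetical heuristic, not something the team has agreed on: it assumes that a bare number of three or more digits without a “dollar” keyword means dollars followed by two digits of cents. The function name and the rule itself are illustrative.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return a Decimal price from recognizer output, or None if no number is found."""
    lowered = text.lower()
    match = re.search(r"\d+(?:\.\d+)?", lowered)
    if match is None:
        return None
    number = match.group()
    # An explicit currency cue or decimal point is taken at face value.
    if "dollar" in lowered or "$" in lowered or "." in number:
        return Decimal(number)
    # Assumed rule: a bare "460" is read as 4 dollars and 60 cents.
    if len(number) >= 3:
        return Decimal(number[:-2] + "." + number[-2:])
    return Decimal(number)
```

The trade-off is that a price that really is 460 dollars, spoken without the keyword, would be misread, so this rule would probably need a confirmation step before the value is accepted.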
Schedule
I am a little behind schedule on testing the scripts on the RPi, but I will catch up next week.
Next Step
I will test the pipeline on the RPi in both quiet and crowded environments. I will also work with Yuxuan to implement the web app and connect the front-end buttons to the speech recognition pipeline.