Introduction and Project Summary

Most speech recognition apps perform well today. From Siri to voice texting on most smartphones, the accuracy and processing speed of speech recognition are great, with one annoying limitation: most apps support only one language at a time. Siri, for example, supports only single-language recognition. If you set Siri's language to English and speak to it in a mix of two languages, Mandarin and English for example, you will find that Siri treats everything you said as English and transcribes the Mandarin parts as gibberish.

So we want to build an app that provides accurate, real-time recognition of speech that mixes Mandarin and English. It will be very useful for (1) voice texting when a speaker is bilingual and speaks in a mix of English and Mandarin; (2) transcription for international conferences when attendees from different countries start a mixed-language dialogue. Our goal is to reach a word error rate (~10%) that matches existing single-language speech recognition apps and an end-to-end latency of less than 1 second, so that the recognition can keep up with a normal human speaking rate (~100 words per minute).
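
To make the word error rate target concrete, here is a minimal sketch of how the metric is usually computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference tokens. The function name word_error_rate and the example sentence are hypothetical. How the Mandarin parts are split into tokens is also an assumption here (the sketch simply takes pre-tokenized lists), and that choice affects the reported number.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Both arguments are lists of tokens. Tokenization is left to the
    caller (assumption: e.g. English words and Mandarin words or characters).
    """
    if not reference:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference token missing
                curr[j - 1] + 1,     # insertion: extra hypothesis token
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical mixed-language utterance, already tokenized (10 tokens).
reference = ["send", "the", "文件", "to", "me", "before",
             "the", "meeting", "starts", "today"]
# A recognizer that mishears the single Mandarin token as an English word.
hypothesis = ["send", "the", "when", "to", "me", "before",
              "the", "meeting", "starts", "today"]

print(word_error_rate(reference, hypothesis))
```

With one substituted token out of ten, this example sits exactly at the ~10% target. Note that the result can exceed 1.0 when the hypothesis contains many inserted tokens.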