This week I primarily worked on getting a naive speech recognition model working. Based on our research, we will be using connectionist temporal classification (CTC), a training objective and decoding scheme for neural networks that maps an input sequence to an output sequence, in our case audio to text, without requiring a frame-by-frame alignment or a fixed output length. In addition, the deep neural network does not make any built-in assumptions about the language, which potentially allows it to learn syntax and other structure of the language from the data.
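To make the "no constraint on length" point concrete, here is a minimal sketch of the collapsing rule that CTC decoding relies on: the network emits one label per audio frame, repeated labels are merged, and a special blank label is then dropped. The `_` blank symbol, the function name, and the frame sequence are assumptions chosen for illustration, not taken from the actual model.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Turn per-frame labels into an output string using the CTC collapsing rule."""
    # Step 1: merge runs of identical consecutive labels into a single label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop the blank label, which only serves as a separator.
    return "".join(label for label in merged if label != blank)


# Thirteen made-up frames collapse to a five-letter word. The blank between
# the two "ll" runs is what lets the double letter survive the merge step.
frames = list("hh_e_ll_ll_oo")
print(ctc_collapse(frames))
```

Because any number of frames can collapse to any shorter output, the same rule handles fast and slow speech without changing the model.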
I chose Google Colab for prototyping due to its portability and ease of setup. I decided to use Hugging Face for training the model because it works with both PyTorch and TensorFlow. I tried setting up two models. The first one did speech recognition purely in English. This is relatively easy to set up since the model only needs to classify the 26 letters of the alphabet. It achieved a relatively low word error rate (WER) within a few hours of training.
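For reference, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words, so lower is better. The sketch below is a plain-Python version written for illustration; the function names and example sentences are made up, and it is not the evaluation code used in the notebook.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Since `edit_distance` accepts any sequence, passing lists of characters instead of words gives a character error rate, which is often the more natural measure for Mandarin.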
For the second model, I tried training on Mandarin. For the sake of simplicity, I decided to use a naive approach and treat each Chinese character as a token. This drastically increases the output space, since the set of common Chinese characters can exceed 5,000. However, to my surprise, the model worked better than I expected. The mistakes this model made were mostly substitutions of characters with the same or similar sounds. This could mean that the deep neural network is learning to classify characters by sound without any explicit pinyin labels.
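Treating each character as a token amounts to building the output vocabulary directly from the training transcripts. The following is a minimal sketch of that step; the special token names, the function name, and the two sample transcripts are assumptions made up for illustration.

```python
from collections import Counter


def build_char_vocab(transcripts, specials=("[PAD]", "[UNK]")):
    """Map every non-space character seen in the transcripts to an integer id."""
    counts = Counter(ch for text in transcripts for ch in text if not ch.isspace())
    # Special tokens take the first ids (names here are hypothetical).
    vocab = {token: idx for idx, token in enumerate(specials)}
    # Most frequent characters first; ties are broken by code point for determinism.
    for ch, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[ch] = len(vocab)
    return vocab


# Two made-up transcripts: seven distinct characters plus two special tokens.
transcripts = ["我喜欢学习", "我喜欢音乐"]
vocab = build_char_vocab(transcripts)
print(len(vocab))
```

On a real corpus the vocabulary grows to thousands of entries, which is the increase in output space described above.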
For the next approach, I will attempt to create a model whose output vocabulary includes both English letters and Chinese characters, and test the accuracy of that model.
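One possible way to tokenize mixed transcripts for such a model is sketched below. It is only an assumption about how the combined scheme might look: English letters are lowercased, each Chinese character is kept as its own token, runs of whitespace become a single `|` separator, and everything else is dropped. The function names, the `|` separator, and the restriction to the basic CJK Unified Ideographs block are all illustrative choices.

```python
def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def tokenize_mixed(text, word_sep="|"):
    """Split mixed English/Chinese text into character-level tokens."""
    tokens = []
    for ch in text.strip():
        if is_cjk(ch):
            tokens.append(ch)              # one token per Chinese character
        elif ch.isascii() and ch.isalpha():
            tokens.append(ch.lower())      # one token per English letter
        elif ch.isspace():
            # Collapse runs of whitespace into a single separator token.
            if tokens and tokens[-1] != word_sep:
                tokens.append(word_sep)
        # Punctuation, digits, and other symbols are dropped in this sketch.
    return tokens


print(tokenize_mixed("我想吃 pizza"))
```

The resulting tokens could then be fed into the same kind of vocabulary construction used for the Mandarin-only model.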