This week I continued working to improve the model's performance. I tried training on a dataset that included more English recordings, but this approach was not effective: overall performance regressed, and the transcription of English sentences showed no particular improvement. The second approach, which involves training a model on SEAME, is still underway.
The biggest obstacle in this task, I believe, is not the model but the lack of resources. While SEAME is by far the most comprehensive code-switching dataset, I have concerns about how well it suits our task. In particular, its speakers are all from Malaysia and Singapore, and their accents in both English and Chinese differ drastically from those of code-switching speakers from Mainland China and Hong Kong. This mismatch may result in lackluster performance during our final demo. A paper published by Tencent AI Lab supports my view that data, rather than the model, is the bottleneck. Their experiment used a model similar to the one we are training; the difference is that they trained it on over 1,000 hours of code-switching speech, and it reached an impressive character error rate (CER) of 7.6%.
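For reference, CER is the character-level edit distance between the reference transcript and the model's output, divided by the number of reference characters. The sketch below is a minimal illustration of that definition, not the scoring script used in the Tencent paper or in our project. The function names `edit_distance` and `cer` are my own, and I am assuming that spaces are dropped and that every remaining character (Chinese or English) counts as one unit; other setups score English at the word level instead.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j items of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance / number of reference characters.

    Assumption: spaces are removed and each remaining character is one unit.
    """
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)
```

Under these assumptions, a 7.6% CER means roughly 7 or 8 character errors per 100 reference characters.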
To combat the resource problem, I built a website to collect code-switching speech recordings from classmates and peers. Next week, I plan to finalize and deploy the website by Monday and then use the recordings it collects to continue training the model.