This week I mainly focused on obtaining the dataset we need to train our model. The scarcity of Mandarin-English code-switching speech corpora has always been one of our main concerns. We previously gathered a dataset from an English-teaching language institution in China, but we were worried about its quality. In particular, the setting is constrained, so the recordings may not be representative of the natural, fluent speech that we are looking for. Another concern is speaker variability: if most of the speech comes from only a few teachers, the model will be heavily skewed toward their voices and may overfit to them.
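One way to put a number on the speaker-variability concern is to measure how much of the total audio each speaker contributes. The snippet below is only a minimal sketch: it assumes we have per-utterance metadata as (speaker ID, duration in seconds) pairs, and the function names and the toy values are hypothetical, not taken from the institution's data.

```python
from collections import defaultdict


def speaker_shares(utterances):
    """Return each speaker's fraction of the total audio duration.

    `utterances` is an iterable of (speaker_id, duration_seconds) pairs.
    """
    totals = defaultdict(float)
    for speaker_id, duration in utterances:
        totals[speaker_id] += duration
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {spk: dur / grand_total for spk, dur in totals.items()}


def top_k_share(shares, k=3):
    """Fraction of all audio contributed by the k most frequent speakers."""
    return sum(sorted(shares.values(), reverse=True)[:k])


# Toy, made-up metadata purely for illustration.
example = [
    ("teacher_a", 1200.0),
    ("teacher_a", 900.0),
    ("teacher_b", 600.0),
    ("student_1", 300.0),
]
shares = speaker_shares(example)
print(f"top-1 speaker share: {top_k_share(shares, k=1):.2f}")
```

If a handful of speakers account for most of the audio, that would support the worry about skew; where exactly to draw the line is a judgment call.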
Another idea was to splice together existing monolingual speech from the two languages to create code-switching data. However, our main concern with this approach is the fluency of the resulting speech. If the transitions between segments are abrupt, the result does not represent the fluent, continuous speech that we are looking for. A potential consequence is that the deep neural model learns to pick up on the splice points, rather than on the languages themselves, to identify language changes.
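To make the splicing idea concrete, here is a minimal sketch of joining two clips with a short linear crossfade, one common way to soften a hard cut. The crossfade is an assumption added for illustration, and the function name and parameters are hypothetical. It treats audio as plain lists of float samples at the same sample rate.

```python
def splice_with_crossfade(first, second, fade_len):
    """Join two sample lists, blending the end of `first` into the start of `second`.

    The last `fade_len` samples of `first` overlap with the first `fade_len`
    samples of `second`, so the result is `fade_len` samples shorter than a
    plain concatenation.
    """
    fade_len = min(fade_len, len(first), len(second))
    if fade_len == 0:
        # No overlap requested (or possible): fall back to a hard cut.
        return list(first) + list(second)

    head = list(first[:-fade_len])
    tail = list(second[fade_len:])
    blended = []
    for i in range(fade_len):
        # Weight of the second clip rises linearly, staying strictly between 0 and 1.
        w = (i + 1) / (fade_len + 1)
        blended.append((1 - w) * first[len(first) - fade_len + i] + w * second[i])
    return head + blended + tail


# Toy signals: a constant 1.0 clip followed by a constant 0.0 clip.
spliced = splice_with_crossfade([1.0] * 4, [0.0] * 4, fade_len=2)
print(spliced)
```

A crossfade only smooths the waveform at the join. Differences in speaker, recording conditions, and prosody across the boundary remain, so the concern about the model keying on splice points still applies.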
Fortunately, after a lot of digging online, I got access to two established datasets: ASCEND and SEAME. ASCEND, released in 2021, contains code-switching speech from both Hongkongers and Mainland Chinese speakers and totals about 10 hours. SEAME is a widely used dataset in code-switching speech recognition, with recordings of Malaysian and Singaporean speakers, and totals about 90 hours. Together, these two datasets should be sufficient for our training-data needs.