Marco’s Status Report for 4/30

This week I primarily focused on the final presentation and on making the final poster for next week's demonstration. In addition, I did some fine-tuning on the LID model. One problem I saw was that the system sometimes switches languages so quickly that a segment ends up shorter than the minimum input length for the ASR model. These short segments are often transcribed inaccurately, since they are shorter than a typical utterance of about 0.3 seconds. To address this, I enforced a minimum segment length of 0.3 seconds and merged any shorter segments with nearby longer ones. This approach improved CER by about 1%.
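The merging rule itself is simple; here is a minimal sketch (the tuple format and function name are illustrative, not the exact implementation):

```python
MIN_DUR = 0.3  # seconds; LID segments shorter than this are usually unreliable

def merge_short_segments(segments, min_dur=MIN_DUR):
    """segments: list of (start_sec, end_sec, lang) tuples sorted by time.
    A segment shorter than min_dur is folded into the preceding segment
    (or into the following one if it happens to come first)."""
    merged = []
    for start, end, lang in segments:
        if merged and (end - start) < min_dur:
            prev_start, _, prev_lang = merged[-1]
            merged[-1] = (prev_start, end, prev_lang)  # absorb the short segment
        else:
            merged.append((start, end, lang))
    if len(merged) > 1 and (merged[0][1] - merged[0][0]) < min_dur:
        first, second = merged[0], merged[1]
        merged[1] = (first[0], second[1], second[2])   # short leading segment
        merged.pop(0)
    return merged
```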

For next week I plan on fixing some minor issues with the system, especially the silence detection. Currently, the silence detection uses a constant decibel level as its threshold, which could be problematic in a noisy environment where the average level is higher. Finally, I will prepare the system for the final demonstration next week.
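One possible direction for the fix is to set the threshold relative to the recording's own level instead of a fixed value. A rough numpy sketch of that idea (the frame size and offset are placeholders, not the final parameters):

```python
import numpy as np

def adaptive_silence_mask(samples, frame_len=1024, offset_db=10.0):
    """Mark a frame as silent when it falls more than offset_db below the
    median frame level of the recording, rather than below a fixed decibel
    value. samples: 1-D float array in [-1, 1]."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-10)
    level_db = 20 * np.log10(rms)
    reference = np.median(level_db)            # adapts to the ambient noise level
    return level_db < (reference - offset_db)
```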

Marco’s Status Report for 4/23

This week we constructed a new framework for our speech recognition system. The input audio is first fed into an LID model that we trained to split the audio into English and Mandarin segments. Each segment is then fed into either an English or a Mandarin speech recognition model. This approach drastically reduces the complexity of the problem, and as a result we achieved much higher accuracy.
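At a high level the routing looks like this (run_lid, asr_en, and asr_zh are placeholders for our trained models, and the sample rate is assumed):

```python
SR = 16000  # assumed sample rate of the input waveform

def transcribe(audio):
    """Two-stage pipeline: language identification, then per-language ASR."""
    pieces = []
    for start, end, lang in run_lid(audio):          # segments as (sec, sec, "en"/"zh")
        segment = audio[int(start * SR):int(end * SR)]
        model = asr_en if lang == "en" else asr_zh   # route to the matching recognizer
        pieces.append(model(segment))
    return " ".join(pieces)
```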

In addition, I wrote a Python script that can run the program locally on any computer. The script detects silence, chunks the audio, and processes each chunk while the speaker is still talking.
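The chunking step can be approximated with pydub's silence utilities; this is an illustrative sketch rather than the script itself (the file name and thresholds are placeholders):

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("input.wav")   # placeholder input file
chunks = split_on_silence(
    audio,
    min_silence_len=500,   # ms of silence that ends a chunk
    silence_thresh=-40,    # fixed dBFS threshold
)
for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i}.wav", format="wav")  # each chunk then goes through the pipeline
```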

For next week, I plan on fixing some of the minor problems we saw this week. First, I will try to smooth out the segments produced by the LID module by combining segments that are too short (and therefore usually inaccurate) with nearby longer segments. Second, an autocorrect module could be added as a post-processing step to further improve accuracy.

Marco’s Status Report for 4/16

Following the discussion with the professor earlier this week, I went on to explore some of the options he had mentioned. In particular, I started looking into a speaker-dependent speech recognition system instead of a speaker-independent one. While this drastically reduces the scope and difficulty of the task, it is still interesting to experiment with how well such a system can recognize code-switching speech.

According to a paper published in 1993, speaker-dependent ASR in English can reach very low error rates with only 600-3000 training sentences. However, there have been no similar attempts for Mandarin-English code-switching speech.

So far, I have collected around 600 sentences of myself speaking purely in English, purely in Mandarin, and in mixed speech. For next week, I plan on training a model using the data I've gathered together with existing English and Mandarin corpora.

Marco’s Status Report for 4/9

This week I continued working on improving the performance of the model. I attempted to train with a dataset that included more English recordings, but this approach was not very effective: the overall performance of the model actually regressed, with no particular improvement on English sentences. The second approach, training a model on SEAME, is still underway.

The biggest obstacle in this task, I believe, is not the model but the lack of resources. While SEAME is by far the most comprehensive code-switching dataset, I have concerns about how well it will suit our task. In particular, the speakers are all from Malaysia and Singapore, and their accents in both English and Mandarin differ drastically from the code-switching speech of Mainland Chinese and Hongkongers. This difference may result in lackluster performance during our final demo. A paper published by Tencent AI Lab also supports this suspicion: their experiment used a model similar to the one we are training, but with over 1000 hours of code-switching speech for training, and it reached an impressive 7.6% CER.

To combat the resource problem, I built a website to gather code-switching speech from classmates and peers. For next week, I plan on finalizing and deploying the website by Monday and using the recordings it collects to continue training the model.
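The core of the site is an upload endpoint that stores each recording alongside the prompt the contributor read. A minimal Flask sketch of that idea (illustrative only; the actual site may be structured differently):

```python
from pathlib import Path
from flask import Flask, request

app = Flask(__name__)
UPLOAD_DIR = Path("recordings")      # illustrative storage location
UPLOAD_DIR.mkdir(exist_ok=True)

@app.route("/upload", methods=["POST"])
def upload():
    """Receive one recording plus the sentence the contributor read."""
    audio = request.files["audio"]
    prompt = request.form.get("prompt", "")
    out = UPLOAD_DIR / audio.filename
    audio.save(out)
    out.with_suffix(".txt").write_text(prompt, encoding="utf-8")
    return {"status": "ok"}
```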

Marco’s Status Report for 4/2

This week I mainly focused on improving the performance of the model. Following the approach outlined in a research paper, I started with a wav2vec2-large-xlsr-53 model pretrained on several Mandarin datasets and fine-tuned it on ASCEND, a Mandarin-English code-switching dataset. The model achieved a CER of 24%, very close to the 23% CER reported in the paper. Upon closer inspection, I noticed that the model is good at recognizing when the speaker switches language. It also performed extremely well on Mandarin inputs, but its accuracy on English inputs is lacking, most likely because the model was initially pretrained on Mandarin.
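Inference with the fine-tuned checkpoint follows the standard Hugging Face CTC pattern; a sketch is below ("our-finetuned-checkpoint" is a placeholder name for the model produced by fine-tuning on ASCEND):

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("our-finetuned-checkpoint")
model = Wav2Vec2ForCTC.from_pretrained("our-finetuned-checkpoint").eval()

def transcribe(waveform_16k):
    """waveform_16k: 1-D float array sampled at 16 kHz."""
    inputs = processor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits   # per-frame class scores
    ids = torch.argmax(logits, dim=-1)               # greedy CTC decoding
    return processor.batch_decode(ids)[0]
```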

For next week, I plan on improving the model's performance on English inputs through two approaches. The first is to add more purely English data points to the existing dataset. The second is to train the model on SEAME, a much larger and more comprehensive code-switching dataset.

Marco’s Status Report for 2/26

For this week, I primarily focused on getting different tokenized outputs for Chinese characters. As discussed in last week's report, there are many different ways to define the output space, and each may result in different performance in the final model. The simplest and most naive way is to use the characters themselves. Some existing models I found using this approach achieved a character error rate of around 20%. This approach works, but there is definitely room for improvement. The second approach is to use pinyin as the output space. This drastically reduces the size of the output space, but it requires a language model to convert pinyin back into characters. A paper published this year by Jiaotong University reported that their model achieved a character error rate of 9.85%.
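For the pinyin output space, the character transcripts have to be converted to pinyin labels before training. One convenient way to do this is with the pypinyin library (an assumption for illustration; the final pipeline may use a different tool):

```python
from pypinyin import lazy_pinyin, Style

# Map a character transcript to pinyin labels with the tone as a trailing digit.
text = "中文语音识别"
labels = lazy_pinyin(text, style=Style.TONE3)
print(labels)   # e.g. ['zhong1', 'wen2', 'yu3', 'yin1', ...]
```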

We also discussed some implementation ideas for a real-time transcription system. One idea is to use a sliding window. For instance, the audio segment from 0s to 4s is sent to the server, then one second later the segment from 1s to 5s is sent. The overlapping regions are then compared and aggregated by their relative output probabilities. In theory, this approach maintains context information, which leads to higher transcription accuracy, while keeping processing time short.
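A rough sketch of the aggregation step, assuming the model returns per-frame class probabilities at a fixed frame rate (model_probs, the frame rate, and the window/hop sizes are all placeholders):

```python
import numpy as np

def sliding_window_probs(audio, model_probs, sr=16000,
                         win_s=4.0, hop_s=1.0, frames_per_s=50):
    """Run the model on overlapping windows and average the per-frame output
    probabilities wherever the windows overlap."""
    total_frames = int(len(audio) / sr * frames_per_s)
    acc, counts = None, np.zeros(total_frames)
    last_start = max(len(audio) / sr - win_s, 0.0)
    for start in np.arange(0.0, last_start + 1e-9, hop_s):
        lo, hi = int(start * sr), int((start + win_s) * sr)
        probs = model_probs(audio[lo:hi])            # shape: (frames, classes)
        if acc is None:
            acc = np.zeros((total_frames, probs.shape[1]))
        f0 = int(start * frames_per_s)
        f1 = min(f0 + probs.shape[0], total_frames)
        acc[f0:f1] += probs[: f1 - f0]
        counts[f0:f1] += 1
    return acc / np.maximum(counts[:, None], 1)      # averaged frame posteriors
```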

Overall, I am on schedule for this week. For next week, I plan on training models that transcribe Chinese using the output spaces mentioned above on Amazon Web Services. I also plan on doing some research on Amazon SageMaker, a fully managed machine learning service that could be easier and cheaper to use than an EC2 instance.

Marco’s Status Report for 2/19

This week I mainly focused on getting the dataset we need for training our model. The scarcity of Mandarin-English code-switching speech corpora has always been one of our main concerns. We previously gathered a dataset from an English-teaching language institution in China, but we were worried about its quality. In particular, the recording setting is constrained, so it may not be representative of the natural, fluent speech we are looking for. Another concern is the variability of the speech: if most of the recordings feature only a few teachers, the model will be heavily skewed and will overfit to their voices.

Another idea was to piece together existing speech from both languages to create data. However, one particular concern with this approach is the fluency of the speech. If the transitions between segments are abrupt, the data does not represent the fluent, continuous speech we are looking for. A potential consequence is that the deep neural network will learn to pick up on these splice points to identify language changes.

Fortunately, after a lot of digging online, I found access to two field-proven datasets: ASCEND and SEAME. ASCEND is a dataset released in 2021 containing code-switching speech from both Hongkongers and Mainland Chinese, totaling about 10 hours. SEAME is a widely used code-switching speech recognition dataset with recordings of Malaysians and Singaporeans, totaling about 90 hours. These two datasets should sufficiently satisfy our needs for training data.

Marco’s Status Report for 2/12

This week I primarily worked on getting a naive speech recognition model working. Based on our research, we will be using connectionist temporal classification (CTC), a training objective that lets the network predict an output sequence, in our case words, from an input of arbitrary length without requiring a frame-level alignment. In addition, the deep neural network makes no assumptions about the language, which allows it to potentially learn syntax and other structure of the language directly from data.
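A minimal PyTorch illustration of the CTC objective (toy sizes, not our training code): the network emits a per-frame distribution over the vocabulary plus a blank symbol, and the loss sums over all alignments that collapse to the target text.

```python
import torch
import torch.nn as nn

T, N, C = 50, 1, 28   # frames, batch size, classes (26 letters + space + blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)        # stand-in for network output
targets = torch.randint(1, C, (N, 10), dtype=torch.long)    # label ids; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss.item())    # scalar the network is trained to minimize
```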

I chose Google Colab for prototyping due to its portability and convenient setup. I decided to use Hugging Face for training the model because it works with both PyTorch and TensorFlow. I tried setting up two models. The first did speech recognition purely in English. This is relatively easy to set up since the model only needs to classify the 26 letters of the alphabet. It achieved a relatively low WER within a few hours of training.

For the second model I tried training on Mandarin. For the sake of simplicity, I used a naive approach and treated Chinese characters as tokens. This drastically increases the output space, since there are over 5000 commonly used Chinese characters. However, to my surprise, the model worked better than I expected. The mistakes it made were mostly characters with identical or similar pronunciations, which could mean that the deep neural network is learning to classify characters by sound without explicit pinyin labels.
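Building that character-level output space amounts to collecting every distinct character in the transcripts; a toy sketch (the actual training setup may handle this differently):

```python
# Toy transcripts; index 0 is reserved for the CTC blank symbol.
transcripts = ["今天天气不错", "我们去吃饭吧"]
vocab = {"<blank>": 0}
for line in transcripts:
    for ch in line:
        if ch not in vocab:
            vocab[ch] = len(vocab)
print(len(vocab))   # each distinct character becomes one output class
```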

Next, I will attempt to create a model whose output space includes both English letters and Chinese characters and test its accuracy.