This week, while my group and I completed the design report, I continued training a model on the DSL-10 dataset. I extracted the entire dataset, which consists of 75 videos for each of the 10 phrases. After a few rounds of training and of adjusting the hyperparameters and architecture from my initial attempt, I settled on a model with 3 LSTM layers and 3 dense layers, trained for 100 epochs. This produced a training accuracy of around 96% and a validation accuracy of around 94%. I visualized the confusion matrix, which appeared to show predictions spread evenly across the phrases, and the training and validation accuracy plots, which showed a steady increase. After this, I combined Ran’s computer vision processing code, which uses MediaPipe, with my trained model to display a prediction. However, the prediction was not very accurate: it displayed one phrase most of the time, regardless of the gesture being signed.
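To make the confusion-matrix check concrete, here is a minimal sketch in plain Python. It is not the code I ran for the plots; the function names and the toy labels are assumptions for illustration, and it assumes each phrase is encoded as an integer class index.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    recalls = []
    for class_index, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[class_index] / total if total else 0.0)
    return recalls


# Toy labels for 3 classes (illustrative only, not project data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
recall = per_class_recall(cm)
```

A matrix with most of its counts on the diagonal and similar recall for every class is what "a balance of phrases" would look like; one column that collects most of the counts would point to the model favoring a single phrase.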
My progress is on schedule: I am working on model training for word translation and currently have a model that reaches roughly 95% accuracy during training.
Next week I hope to continue working on displaying an accurate prediction by debugging where the issue in the displayed predictions lies. I also hope to expand the data to cover more phrases, and to go through a few more rounds of training and optimization.
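One possible first step for the debugging is to log the predicted class for each displayed prediction and measure how strongly one class dominates. The sketch below is only an assumption about how that check could look: the function names, the 0.8 threshold, and the sample class indices are hypothetical and do not come from the project code.

```python
from collections import Counter


def prediction_share(predicted_labels):
    """Return the fraction of predictions that went to each label."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return {label: count / total for label, count in counts.items()}


def dominant_label(predicted_labels, threshold=0.8):
    """Return the label whose share meets the threshold, or None if there is none."""
    if not predicted_labels:
        return None
    shares = prediction_share(predicted_labels)
    label, share = max(shares.items(), key=lambda item: item[1])
    return label if share >= threshold else None


# Hypothetical logs of predicted class indices (not real project output).
skewed_log = [3] * 9 + [7]      # class 3 predicted 9 times out of 10
balanced_log = [0, 1, 2, 3]     # each class predicted once

skewed_result = dominant_label(skewed_log)
balanced_result = dominant_label(balanced_log)
```

If a dominant label shows up even when several different gestures are signed, a likely place to look is the input to the model, for example whether the features passed to it at prediction time have the same shape, ordering, and scaling as the features used in training.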