For this week, aside from preparing for proposals, my main goal was dataset and model exploration. I spent a good amount of time downloading and going through the dataset to determine what exactly we are working with. Our video inputs are roughly 2-to-4-second clips of a person signing a single word. Alongside these clips, we have several JSON files that each map videos to words; the files differ in how many video-word pairs they cover. As a result, we not only have a few pre-defined subsets that we can use, but we can also make our own splits if we want to train on fewer words, for instance if we need to make our model smaller.
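To make the custom-split idea concrete, here is a minimal sketch of building a reduced vocabulary from one of the JSON files. The file name and schema (a list of entries with "gloss" and "instances" fields) are assumptions for illustration; the actual files may use different field names.

```python
import json

# Load one of the provided word-to-video mappings.
# File name and structure are assumed for this sketch.
with open("dataset_index.json") as f:
    entries = json.load(f)  # assumed: [{"gloss": word, "instances": [videos...]}, ...]

# Custom split: keep only the K words with the most video examples,
# so a smaller model has fewer classes to learn.
K = 100
entries.sort(key=lambda e: len(e["instances"]), reverse=True)
subset = entries[:K]

with open(f"custom_subset_{K}.json", "w") as f:
    json.dump(subset, f)
```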
I have also been looking into ideas and libraries for pre-processing. Generic video classification has been used in many similar projects, but those models are typically trained to capture the entirety of the image. For ASL detection, we do not necessarily want that, since the model could end up recognizing signs based on extraneous data, such as the color of the background or the person signing. To counter this with human pose estimation, there are two main libraries we have been considering: MediaPipe and OpenPose. The main thing I was looking at was what we want to use for landmarking. MediaPipe has pre-trained models for both full-body landmarks and hand landmarks; the main issue is that the full-body model does not give us enough hand detail, while the hand model does not give us any data outside of the hand. OpenPose has been used for human pose estimation in numerous past projects, so I am looking into whether there are any repositories that would fit our needs. If there aren't, we might have to create our own model, but I think it would be best to avoid this so that we do not have to create our own dataset. I have also been talking with Kavish about the feasibility of these various options on an FPGA.
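For reference, this is roughly what extracting hand landmarks with MediaPipe's pre-trained Hands model looks like for one of our clips. It is a sketch of the standard MediaPipe/OpenCV usage, not our finalized pre-processing pipeline; the function name and the two-hand limit are my own choices.

```python
import cv2
import mediapipe as mp

def hand_landmarks_per_frame(video_path):
    """Return a per-frame list of landmark vectors
    (21 points x 3 coords per detected hand)."""
    hands = mp.solutions.hands.Hands(
        static_image_mode=False,  # track across frames instead of re-detecting
        max_num_hands=2,
        min_detection_confidence=0.5)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV decodes frames as BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        coords = []
        for hand in results.multi_hand_landmarks or []:
            for lm in hand.landmark:
                coords.extend([lm.x, lm.y, lm.z])
        frames.append(coords)
    cap.release()
    hands.close()
    return frames
```

This illustrates the trade-off mentioned above: the output is rich in hand detail but contains nothing about the arms or body, which is why the full-body versus hand-only question is still open.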
For classification itself, I have been looking into what type of RNN model to use, as human pose estimation might be able to give us landmark vectors that we can pass into the RNN instead of raw images (I still need to look into this more before finalizing this possible design change). Otherwise, there are a lot of possibilities with CNN-RNN models built from Keras, OpenCV, and SciPy that we could use.
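As a rough illustration of the vectors-instead-of-images idea, here is a minimal Keras sketch of an LSTM that classifies a sequence of per-frame landmark vectors as one word. The class count, layer sizes, and the feature size (two hands of 21 MediaPipe landmarks with x, y, z each) are placeholder assumptions, not settled design choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 100      # assumed vocabulary size for a reduced split
FEATURES = 2 * 21 * 3  # two hands x 21 landmarks x (x, y, z), zero-padded

# Sequences of landmark vectors in, one word out. Masking lets
# clips of different lengths share one zero-padded batch.
model = tf.keras.Sequential([
    layers.Masking(mask_value=0.0, input_shape=(None, FEATURES)),
    layers.LSTM(128),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The appeal of this route is that each frame collapses to a ~126-value vector, so the recurrent model stays far smaller than a CNN-RNN over full frames, which also matters for the FPGA feasibility question.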
As of right now, I believe that I am slightly ahead of schedule. There are a few decisions, mentioned above, that I need to discuss with the team and revise. After that, my main deliverable would be to find or develop a good model for our human pose estimation pre-processing and determine whether it would be reasonable to use its output as our RNN input.