aishwarj – Team E5: ASL Learning Platform

May 1, 2022

Aishwarya’s Status Report for 4/30/22

This week, I worked on the final presentation with my team, taking pictures and documenting the current state of our project. I worked on adding details to our demo poster (such as system block diagrams and discussion regarding overall system structure). I also have been working on experimenting with further data collection concerning our neural networks (e.g. observing the effect of learning rate on model accuracy), to see if these metrics could provide more content for our discussion of quantitative results and tradeoffs.

My progress is on schedule. Next week I hope to complete my portion of the video demonstration (explaining the ML models and testing accuracy metrics), as well as my portions of the final design report.

April 24, 2022April 24, 2022

Aishwarya’s Status Report for 4/23/22

This week, I completed integrating model execution with the randomized testing feature that Valeria created for the web app. The user proceeds through a set of mixed questions and the models execute on their inputs, so that scores are accrued in the background, and then presented to the user at the end in a score board format. Further, I resolved the bug from last week where the stop action triggered by the user or the timer was executing repeatedly, preventing the user from making further inputs. Now, the user can create multiple attempts at a sign without this bug hindering them.

I also gathered metrics for model training and testing accuracy vs number of epochs for training. This data will be included in our final presentation next week, and it also revealed that some of our models need additional data (created by us) to retrain and improve testing accuracy. Additionally, I conducted user tests with Valeria in order to obtain feedback about our platform, so that we may improve it further before final demo.

My progress in on schedule. The web app and the models are fully integrated. This next week I will focus on tuning the models and gathering more data (concerning model testing accuracy and execution time to generate a prediction) for our documentation of results.

April 16, 2022April 16, 2022

Aishwarya’s Status Report for 4/16/22

This week I parsed the video data for dynamic signs (movement required) and trained two models for interpreting them. There was a bug with the video data being unsuccessfully feature extracted, and I realized this was because the beginning and end of many of the videos had no hands in them (due to the user starting and stopping the camera). I changed it such that these frames are padded as zeros when formatting the landmark data in arrays. I also debugged why the models were not being correctly loaded up in the back end of the web app (which turned out to be an issue with the way the model was being saved as a file after training). I further looked into how we could conveniently record accuracy data about models at various epochs, and changed the code such that multiple trained models are saved at intervals until the final epoch is reached.

My progress is on schedule. The only concern I have is for a bug with the timer in our code that stops the user’s input feed after 5 seconds. It executes multiple times such that sometimes the user cannot record a second or third attempt to their answer for a given sign. During the next week, I hope to resolve this issue and tune the dynamic model such that it may be at a high enough accuracy to add to the app. I also hope to continue to collect data on model accuracy while varying parameters for our final design report/documentation. Lastly, I hope to add a bit more logic to the back end of the web app where sign grading is completed (e.g. say the sign is incorrect if the user has the wrong number of hands present in frame).

April 11, 2022April 11, 2022

Aishwarya’s Status Report for 4/10/22

This week, I refined the integration of the web app and our neural networks. Previously, for static signs, we have been downloading a video and sending that to our python code to extract features and use one of the models to execute with this input data and generate a prediction for the sign the user completed. I changed it such that feature extraction is done directly in the javascript backend portion of the web app for each frame of the video camera input. An array of this data is sent as part of POST request to a python server to generate and send back a prediction response. I brought 5 separate models into the backend that are loaded upon webapp start-up. This removed the additional latency I observed the week before due to having to load a model with its structure and weights every time a prediction needed to be made. This integration appears to work smoothly, though we still need to refine an implemention taking a video input from the user in order to support dynamic sign prediction. In addition to this work with the web app, I continued tuning the models and created some additional video data to be used as training samples for our dynamic signs (conversational sign language that requires hand movements).

My progress is on schedule. This coming week, I hope to tune a model to support predictions for dynamic sign language. If the dynamic models have minimal issues, I also plan to help Valeria work on the web app support for user video inputs during the later half of the week.

April 2, 2022

Aishwarya’s Status Report for 4/2/22

I integrated the model execution with the web app (such that the user’s input is parsed and passed to the model for generating a prediction). I also parsed all of the new data (that we collected in order to replace incorrect signs in the original training dataset we were using), by extracting image frames from a series of videos our group made, and then extracting landmarks from each image frame. I retrained all the models with this newly formatted data.

My progress is mildly hindered due to having covid this past week, so I haven’t been able to tune the models as much as I would like to. The models in general have slight trouble identifying unknown signs. The fist sign category model in particular seems to have the most difficulty identifying letters such as A and S. I hope that after recovering this next week, I can tune the models further in order to deal with these issues. I will have to experiment with the number of training epochs, and the model structure itself (increasing/decreasing the number of layers and nodes within each layer).

Next week, I hope to fix some of these prediction issues currently observed with the models. I also want to work on making the web app more smoothly integrated with the model execution service. Currently it requires downloading the video input from a user locally, but it would be better to cut out this middle step to improve latency.

March 26, 2022March 26, 2022

Aishwarya’s Status Report for 3/26/22

I trained models for 4 of our model groups (1-finger, 2-finger, 3-finger, and fist-shaped). With testing these, we noticed some unexpected behavior, particularly with the 3-finger model, and realized that the training dataset had incorrect samples for letters such as M and N. I, along with my other group members, recorded videos to create new data that would replace these samples. I wrote a script to extract frames from these videos to store as jpg images, allowing us to generate a few thousand images for the labels that needed to have their samples replaced. Due to these issues we discovered with the datasets, I will need to reformat the training data and retrain some of the models with these newly created samples.

Our progress is on schedule. During this next week, I hope to integrate the web app video input with the model execution code in preparation for our interim demo. I will also complete re-parsing the data with our new samples for training and retrain the models.

The video linked is a mini-demonstration of one of my models performing real-time predictions.

March 20, 2022March 20, 2022

Aishwarya’s Status Report for 3/19/22

I completed the code to parse the data for images and videos, passing them through MediaPipe and extracting and formatted the landmark coordinate data. The rough table below shows my initial findings for training and testing accuracy using a dataset for letters D,I,L, and X with 30 images per letter class. Over varying parameters to see how this affected the testing accuracy, the best test accuracy I could achieve was 80.56%. Overall, this seems to be an issue with overfitting (expecially since this initial data set is small).

Another dataset was found with 3000 images per letter class (though many of these fail to have landmark data extracted by MediaPipe). With using this dataset, overfitting still seemed to be an issue, though the model seems to perform well when testing in real time (I made signs in front of my web camera and found it to identify them pretty accurately). During this realtime evaluation, I found that it worked for my left hand. This means I will need to mirror the images the correct way to train the models for each right-handed and left-handed signs.

My progress is on schedule. To combat issues with overfitting during the next week, I will continue trying to train with a larger dataset, varying parameters, and modifying the model structure. By the end of next week, I hope to have the models trained for each ASL grouping.

February 26, 2022

Aishwarya’s Status Report for 2/26/22

This week, I presented our intended design during the in-class design presentations. We received feedback on the feasibility for our ML models with respect to feature extraction data from MediaPipe being compatible with our models as inputs to LSTM nodes. I reviewed this component of our design during this past week in order to provide adequate justification for the necessity of LSTM cells in our network (in order to support temporal information as part of our model learning), as well as its feasibility (outlining what the input and output data formatting/dimensionality is expected to be at each layer, as well as researching examples of MediaPipe data being used with LSTMs). I also worked more on our code for data formatting (converting landmark data from MediaPipe into numpy arrays that can be fed to our models). I now just need to add resampling from video data to grab the necessary number of frames. We received AWS credits towards the end of this past week, so we have not been able to work much on feature extraction and model training within an EC2 instance. Although, our schedule still indicates we have time for model training, I am a little concerned that we are slightly behind schedule on this front.

In order to catch up, we will hopefully be able to spend more time on implementation once the design report is completed. So far, a good amount of our time has gone towards presentations and documenting vs implementation. Once these deliverables are met, I hope to be able to shift my attention more towards building up our solution.

February 19, 2022February 19, 2022

Team Status Report for 2/19/22

This week, our team finalized the type of neural network we want to use for generating ASL predictions. We gathered more research about tools to help us with model training (e.g. training in an EC2 instance) and planned out the website UI more. We worked on creating our database of ASL test data, and worked on the design report.

The most significant risks right now are if our RNN does not meet requirements for prediction accuracy and execution time. In addition, the RNN will require a large amount of time and data for training. If we increase the number of layers or neurons in an effort to improve prediction accuracy, this could increase training time. Another risk is doing feature exraction without enough efficiency. This is critical because we have a large amount of data to format so that it can be fed into the neural network.

To manage these risks, we have come up with a contigency plan to use a CNN (which can be fed frames directly). For now, we are not using a CNN because it’s performance may be much slower than that of an RNN. For feature exraction, we’re considering doing this in an EC2, so that our personal computer resources are not overwhelmed.

A design change we made was the groupings for our signs (to have separate RNNs for each). Before, we grouped simply by category (number, letter, etc.), but now, we are grouping by similarity. This will allow us to more effectively distinguish if the user is doing a sign correctly, and detect minute details that may affect this correctness.

There have been no changes to our schedule thus far.

February 19, 2022February 19, 2022

Aishwarya’s Status Report for 2/19/22

This week, I worked with Tensorflow to gain familiarity with how it allows us to instantiate a model, add layers to it, and train the model. I also experimented with how we would need to format the data using numpy so that it can be fed into the model. Feeding dummy data in the form of numpy arrays to the model, I generated a timing report to see how long the model would take to generate a prediction for every 10 frames processed in mediapipe (during real-time video processing), so that we could get an idea of how the model’s structure impacted execution time.

Our team is also working on creating a test data set of ASL video/image data, so I recorded 5 videos for each of the signs for numbers 0-4 and a-m and uploaded them to a git repo we are storing them in.

The exact network structure that would optimize accuracy and execution time still needs to be determined, but this must be done through some trial and error. We will be using at least one LSTM layer, followed by a dense layer, but knowing the exact number of hidden layers and their number of neurons will be more clear after we have the chance to measure performance of the initial model structure and optimize from there.

Our progress is on schedule. Next week, I hope to complete the feature extraction code with my partners (both for real-time video feed and for processing our training data acquired from external sources).