Sejal’s Status Report for 4/27/24

Accomplishments

This week, I worked with my teammates on improving the existing web app, deploying it, and integrating with Bluetooth. Continuing from last week's progress, we displayed the translated sentence on the HTML page, integrating the end-of-sentence logic and the LLM that structures the sentence. After this, I worked on the frontend UI by adding a home page, an instructions page, and our main functionality on a separate detect page, as shown in the images below. I positioned the elements similarly to what we originally decided in the wireframes from our design presentation and also included some explanatory text for a more seamless user experience. After having some issues deploying our web page with AWS, I tried another method of deploying. However, I am still running into issues with building all of the libraries and packages in the new deployment.

 

My progress is on schedule as deployment is the last step we have to complete in terms of the software before the final demo. 

Next week, we will complete the deployment, connect with the hardware, and work on the rest of the final deliverables.

Sejal’s Status Report for 4/20/24

Accomplishments

This week, my team and I worked on integrating our parts together. Since we had some issues with the mobile app and CoreML, we decided to switch to a web app after evaluating the trade-offs. To do this, Ran and I developed the frontend and backend functionality. MediaPipe provides examples and support for JavaScript, so we extracted the MediaPipe landmarks in the frontend and sent them to the backend. I structured these landmarks in a way that was readable by the Python code and used the existing model to output a prediction. I then sent this prediction back to the frontend, as depicted in the image below.
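
For reference, the backend half of this flow amounts to a small Django view that accepts the posted landmarks and returns the model's prediction. Below is a minimal sketch of the idea; the endpoint, JSON shape, model file, and label list are placeholders rather than our exact code.

```python
# Hypothetical sketch of the landmark -> prediction endpoint (views.py).
# Assumes the frontend POSTs JSON like {"frames": [[x, y, z, ...], ...]}
# and that a Keras model trained on the same feature layout is on disk.
import json

import numpy as np
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from tensorflow import keras

model = keras.models.load_model("gesture_model.h5")  # placeholder path
LABELS = ["hello", "thank you", "please"]             # placeholder classes

@csrf_exempt
def predict(request):
    data = json.loads(request.body)
    # Shape the landmark frames into (1, timesteps, features) for the model.
    sequence = np.array(data["frames"], dtype=np.float32)[np.newaxis, ...]
    probs = model.predict(sequence, verbose=0)[0]
    return JsonResponse({"prediction": LABELS[int(np.argmax(probs))]})
```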

My progress is on schedule as we are working on integrating our parts, but we will have to do some more testing as a team tomorrow before the final presentation to ensure our working solution satisfies our verification and validation metrics.

Next week, I will complete our integration, add to the UI to ensure a seamless user experience, and perform more testing as my team prepares for the final presentation, final report and final demo.

Additional question for this week:

To accomplish our tasks, I needed to refer to the OpenAI API documentation to integrate the LLM processing with our sentence prediction functionality. After switching to a web app, we needed to research the best way to send information between the frontend and backend using Django, the framework we chose for this. We also found it necessary to use demos and existing implementations of MediaPipe with JavaScript. One learning strategy I picked up was to start from existing video tutorials or demos and iterate on them in order to accomplish our specific task. I also used GitHub and GitHub issue forums to help debug when others had run into errors similar to mine.

Sejal’s Status Report for 4/06/24

Accomplishments

This week, my team did our interim demo and began to integrate our respective parts. To prepare for the demo, I made some small modifications to the prompt sent in the API request to the LLM to ensure that it was outputting sentences that made sense given the predicted gestures.

Since we are using CoreML to run machine learning within our iOS app, my teammate Ran and I worked on the Swift code to integrate machine learning into the app. I converted my model to CoreML and wrote a function that behaves the same way as the large language model step in our machine learning processing.
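
For context, the Keras-to-CoreML conversion itself can be done with coremltools. The sketch below shows the general shape of that step; the file names and input dimensions are placeholders, not our exact values.

```python
# Hypothetical sketch of converting the Keras gesture model to Core ML;
# file names and the input shape are placeholders, not our exact values.
import coremltools as ct
from tensorflow import keras

keras_model = keras.models.load_model("gesture_model.h5")  # placeholder path

# Assumes (batch, timesteps, features) input for the LSTM-based model.
mlmodel = ct.convert(
    keras_model,
    inputs=[ct.TensorType(shape=(1, 30, 258))],  # placeholder dimensions
    convert_to="mlprogram",
)
mlmodel.save("GestureClassifier.mlpackage")
```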

I also continued trying to debug why the expanded dataset from last week wasn't working. I re-recorded some videos, ensuring that MediaPipe recognized all the landmarks in the frame, with good lighting and the same format as the rest of the dataset. Again, while the training and validation accuracies were high, the model recognized very few of the gestures when I tested gesture prediction with it. This might suggest that the model is not complex enough to handle the expanded amount of data. So I continued adding layers to make the model more complex, but there didn't seem to be any improvement.

My progress is on schedule as we are working on integrating our parts.

Next week, I am going to continue to work with Ran to integrate the ML into the iOS app. I will also try to fine-tune the model structure some more to attempt to improve the existing one, and perform the testing described below.

Verification and Validation

Tests run so far: I have done some informal testing by signing in front of the webcam and checking whether it displayed the signs accurately.

Tests I plan to run: 

  • Quantitatively measure latency of gesture prediction (see the timing sketch below)
    • Since one of our use case requirements was to have a 1-3 second latency for gesture recognition, I will measure how long it takes after a gesture is signed for a prediction to appear.
  • Quantitatively measure latency of the LLM
    • Similar to measuring the latency of gesture prediction, it is important to also measure how long the LLM takes to process the prediction and output a sentence, so I will measure this as well.
  • Quantitatively measure accuracy of gesture prediction
    • Since one of our use case requirements was to have a gesture prediction accuracy of > 95%, I will measure the accuracy of signed gestures against their predictions.
  • Qualitatively determine accuracy of the LLM
    • Since there is no right/wrong output from the LLM, this testing is done qualitatively, to determine whether the output sentence makes sense based on the direct predictions and reads as a conversational sentence.

I will do the above tests under various lighting conditions and backgrounds, and with distractions present, to ensure the system meets our use case requirements in the different settings where the device might be used.
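
To collect the two latency numbers above, I expect to simply wrap each pipeline stage with a timer. The sketch below illustrates the idea; predict_gesture and restructure_with_llm are stand-ins for our actual pipeline functions.

```python
# Rough latency-measurement sketch; predict_gesture() and
# restructure_with_llm() are stand-ins for our actual pipeline functions.
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example usage inside the main loop:
# prediction, gesture_latency = timed(predict_gesture, landmark_sequence)
# sentence, llm_latency = timed(restructure_with_llm, predicted_words)
# print(f"gesture: {gesture_latency:.2f}s, LLM: {llm_latency:.2f}s")
```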

Sejal’s Status Report for 3/30/24

This week, I trained and tested the broader dataset I incorporated last week with 10 additional classes. I originally recorded 13 videos, and then performed data augmentation to create more data; for example, I increased and decreased the brightness a bit and rotated the video by a few degrees. After performing data augmentation, I had the same amount of data for each class, since class balance is important in model training. When I tested this new model, it did not predict the new signs correctly, and it even decreased the accuracy on the original signs. After trying to diagnose the issue for a bit, I went back to my original model and added only 1 new sign, ensuring that each video was consistent in terms of the number of features MediaPipe could extract from it. However, this one additional sign was not being predicted accurately either. After this, I fine-tuned some of the model parameters, such as adding more LSTM and dense layers, to see if model complexity was the issue.
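
For reference, the augmentation itself was straightforward per-frame image processing; the sketch below shows roughly how brightness shifts and small rotations can be applied with OpenCV (the exact values I used varied).

```python
# Simplified per-frame augmentation sketch; the brightness delta and
# rotation angle shown in the usage comment are illustrative values.
import cv2

def adjust_brightness(frame, delta):
    """Shift brightness up or down by `delta` (e.g. +30 or -30)."""
    return cv2.convertScaleAbs(frame, alpha=1.0, beta=delta)

def rotate(frame, degrees):
    """Rotate the frame by a few degrees around its center."""
    h, w = frame.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), degrees, 1.0)
    return cv2.warpAffine(frame, matrix, (w, h))

# Applied frame by frame to a recorded video to create a new sample, e.g.:
# augmented = rotate(adjust_brightness(frame, 30), 5)
```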

While training this, I added support for sentence display and structuring. I signaled the end of a sentence by detecting when no hands had been in the frame for 5 seconds, which resets the displayed words on the screen. Since sign language uses a different word order than written English, I worked on the LLM step that detects and corrects this structure. To do this, I used the OpenAI API to send a request after words have been predicted. This request asks the gpt-3.5 model to rewrite the predicted words as readable English, which is then displayed on the webcam screen. After iterating on the prompt for a while, the LLM eventually turned the words into accurate sentences displayed to the user. In the images below, the green text is what is being directly translated and the white text is the output from the LLM.
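
The structuring request itself is a single chat-completion call. Below is a minimal sketch of that step; the prompt wording is a paraphrase of ours, and the snippet assumes an OPENAI_API_KEY environment variable is set.

```python
# Minimal sketch of the LLM restructuring step; the prompt text is a
# paraphrase of ours, and the client assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

def restructure(predicted_words):
    """Ask gpt-3.5 to turn the raw gesture predictions into readable English."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Rewrite these ASL gloss words as one natural English "
                       f"sentence: {' '.join(predicted_words)}",
        }],
    )
    return response.choices[0].message.content

# e.g. restructure(["where", "book"]) might return "Where is the book?"
```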

My progress is mostly on schedule since I have added the LLM for sentence structuring.

Next week, I will continue trying to optimize the machine learning model to incorporate more phrases successfully. I will also work with my teammates to integrate the model into the iOS app using CoreML.

Sejal’s Status Report for 3/23/24

This week, I worked on continuing to broaden the dataset and train the model. Unfortunately, it was difficult to find other dynamic ASL datasets readily available. I tried to download the How2Sign dataset, but there was an incompatibility issue with the script used to download it; I tried to debug this for a bit and even reached out to the creator of the script, but haven't reached a solution yet. I tried the MS-ASL dataset from Microsoft, but the data links to YouTube videos that are all set to private. I requested permission to access the Purdue RVL-SLLL dataset, but I haven't gotten a response yet. I also looked at ASL-LEX, but it is a network with 1 video corresponding to each sign, which is not very helpful. Since it's difficult to find datasets, I've been continuing to create my own videos, following the format of the DSL-10 dataset videos I have already trained on, such as the same number of frames and the same number of videos per class. I have added 32 classes of the most common phrases used in conversation for our use case: “good”, “morning”, “afternoon”, “evening”, “bye”, “what”, “when”, “where”, “why”, “who”, “how”, “eat”, “drink”, “sleep”, “run”, “walk”, “sit”, “stand”, “book”, “pen”, “table”, “chair”, “phone”, “computer”, “happy”, “sad”, “angry”, “excited”, “confused”, “I’m hungry”, “I’m tired”, “I’m thirsty”. Because there are a lot of videos and there will be more, I am running into storage issues on my device. I am wondering if there is a method or separate server that allows quicker processing of large datasets like this.

My progress is still slightly behind schedule because I am still working on word translation. I plan to catch up this week as we prepare for the interim demo.

Next week, I will continue to train on my custom dataset to allow for more variety in translated gestures. I will also work on continuous translation, since right now I am handling word-level translation but we eventually need to support continuous sentences. I will also be working with my teammates to integrate our parts into the iOS app so we can deploy the current state of our product.

Sejal’s Status Report for 3/16/24

This week, I worked on fixing the issue with predicting dynamic signs using the trained model. Previously, it would not produce an accurate prediction and instead predicted the same gesture regardless of the sign. I spent time debugging and iterating through the steps. I attempted to predict a gesture from a video instead of the webcam and found that it was ~99% accurate, so the issue was related to the difference in frame rate when using the webcam. After fixing this, I tested the model again and found that it was successfully predicting gestures, but only accurately about 70% of the time. Using the script I made for predicting gestures from videos, I found that the accuracy went down when I inserted my own videos, meaning the model needs further training to recognize diverse signing conditions and environments. After this, I created some of my own videos for each phrase, added them to the dataset, and trained the model further.
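
Conceptually, the fix amounts to feeding the model a fixed-length window of landmark frames instead of depending on the webcam's frame rate. The sketch below illustrates the idea; the window length and landmark layout are placeholders.

```python
# Conceptual sketch of a fixed-length sliding window over webcam frames;
# SEQUENCE_LENGTH and the landmark layout are placeholders.
from collections import deque

import numpy as np

SEQUENCE_LENGTH = 30  # frames per prediction, matching the training videos

window = deque(maxlen=SEQUENCE_LENGTH)

def on_new_frame(landmarks, model, labels):
    """Append this frame's landmark vector; predict once the window is full."""
    window.append(landmarks)
    if len(window) < SEQUENCE_LENGTH:
        return None
    sequence = np.expand_dims(np.array(window), axis=0)
    probs = model.predict(sequence, verbose=0)[0]
    return labels[int(np.argmax(probs))]
```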

My progress is slightly behind schedule as the schedule said that milestone 3, word translation, should have been completed by this week, but I am still working on improving accuracy for word translation.

Next week, I hope to continue adding to the dataset and improve the accuracy of detecting signs. I will do this by continuing to create my own videos and trying to integrate online datasets. The challenge is that the videos need to have a consistent number of frames, so I might need to do additional preprocessing when adding data. Additionally, as we approach the interim demo, I will also be working with Ran and Leia to integrate our machine learning model into our Swift application.

Sejal’s Status Report for 3/09/24

This week, while my group and I completed the design report, I continued to train a model with data from the DSL-10 dataset. I extracted the entire dataset, consisting of 75 videos for each of the 10 phrases. After a few rounds of training and adjusting the hyperparameters and structure from the initial training, I ended up with a model containing 3 LSTM layers and 3 dense layers, trained for 100 epochs. This resulted in a training accuracy of around 96% and a validation accuracy of around 94%. I visualized the confusion matrix, which showed a reasonable balance of predictions across phrases, and the training and validation accuracy plots, which showed a steady increase. After this, I used Ran’s computer vision processing code with MediaPipe and my trained model to display a prediction. However, the prediction was not very accurate, as it heavily favored one phrase regardless of the gesture being signed.
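
For reference, the resulting architecture looks roughly like the Keras sketch below; only the 3-LSTM / 3-dense structure and the 100 epochs reflect the actual model, while the layer widths and input shape are placeholders.

```python
# Rough Keras sketch of the 3-LSTM / 3-dense model trained for 100 epochs;
# the layer widths and input shape are placeholders, not the exact values.
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

NUM_CLASSES = 10  # DSL-10 phrases

model = Sequential([
    LSTM(64, return_sequences=True, activation="relu",
         input_shape=(30, 258)),  # (timesteps, landmark features per frame)
    LSTM(128, return_sequences=True, activation="relu"),
    LSTM(64, return_sequences=False, activation="relu"),
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])
# model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val))
```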

My progress is on schedule, as I am working on the model training for word translation, and currently have a model that shows accuracies ~95% during training.

Next week, I hope to continue working on displaying an accurate prediction and debugging where the issue might lie. I also hope to expand the data to incorporate more phrases to be detected, and go through a few more rounds of training and optimization.

Sejal’s Status Report for 2/24/24

This week, my team and I finished creating the slides for the design review presentation, and my teammate presented it in class. Along with that, I've been working on training the ML model beyond the basic alphabet training. I gathered two datasets, How2Sign and DSL-10, which both contain a multitude of videos of users signing common words and phrases. For now, I am working only with the DSL-10 dataset, which contains signs for ten common daily vocabulary phrases. From this dataset, I took 12 input samples from 3 different dynamic phrases and trained a small model on this data. I did this by first loading and preprocessing the data and randomly splitting it into testing and training sets. Then, I extracted features from MediaPipe's hand and pose objects. I created an array of these landmarks and padded it where the number of landmarks from the pose, right hand, or left hand differed. Next, I created a simple sequential model with two LSTM layers followed by a dense layer. With 10 epochs of training, this produced a model that I will further expand.
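
The feature-extraction step works by flattening MediaPipe's pose and hand landmarks for each frame and zero-padding whichever set is missing so every frame vector has the same length. Below is a simplified sketch assuming MediaPipe Holistic-style results; the landmark counts follow MediaPipe's documented sizes.

```python
# Simplified sketch of building one fixed-length feature vector per frame,
# assuming MediaPipe Holistic-style results. Pose has 33 landmarks
# (x, y, z, visibility); each hand has 21 landmarks (x, y, z).
import numpy as np

def extract_keypoints(results):
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    left = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.left_hand_landmarks.landmark]).flatten()
            if results.left_hand_landmarks else np.zeros(21 * 3))
    right = (np.array([[lm.x, lm.y, lm.z]
                       for lm in results.right_hand_landmarks.landmark]).flatten()
             if results.right_hand_landmarks else np.zeros(21 * 3))
    # Zero-padding the missing pieces keeps every frame the same length (258).
    return np.concatenate([pose, left, right])
```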

My progress is on schedule, but I am a little worried about the next steps taking longer than I expect. Since this dataset only contains 10 phrases, it will be necessary to train on additional datasets to recognize more phrases. Since those datasets might come in different formats, one other consideration I will have to make is how to ensure compatibility in the training process.

Next week, I hope to continue adding input samples from the DSL-10 dataset until I have significant data for all 10 phrases. I would also like to test this model iteratively to see if any additional steps are necessary, such as extra feature extraction or preprocessing of frames. To do this, I will determine the accuracy of this trained model on my teammate's computer vision processing by testing out signs, which will move the model training forward. Additionally, my team and I will be working on the Design Review Report that is due on Friday.

Sejal’s Status Report for 2/17/24

This week I got started on a simple ML model and combined it with Ran's computer vision algorithm for hand detection. I trained a CNN on Kaggle's Sign Language MNIST dataset. Using the trained model, I took the video processing from the OpenCV and MediaPipe code, computed a prediction of which character was being signed, and displayed this prediction on the webcam screen, as shown below.

(Code on GitHub: https://github.com/LunaFang1016/GiveMeASign/tree/feature-cv-ml)
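
For context, the model here is a standard small CNN over 28x28 grayscale Sign Language MNIST images, and the on-screen display uses OpenCV's putText. The condensed sketch below illustrates both pieces; the layer sizes and crop handling are illustrative rather than the exact code linked above.

```python
# Condensed sketch: a small CNN for 28x28 Sign Language MNIST images plus an
# OpenCV overlay of the predicted letter; layer sizes are illustrative.
import cv2
import numpy as np
from tensorflow.keras import layers, models

NUM_LETTERS = 24  # Sign Language MNIST omits J and Z (they require motion)

cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_LETTERS, activation="softmax"),
])

def overlay_prediction(frame, hand_crop, labels):
    """Predict the letter from a cropped hand image and draw it on the frame."""
    gray = cv2.cvtColor(hand_crop, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (28, 28)).astype("float32") / 255.0
    probs = cnn.predict(resized.reshape(1, 28, 28, 1), verbose=0)[0]
    cv2.putText(frame, labels[int(np.argmax(probs))], (30, 60),
                cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 255, 0), 3)
    return frame
```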

Training this simple model allowed me to think about the complexities required beyond it, such as incorporating both static and dynamic signs, and combining letters into words to form readable sentences. After doing further research on the neural network structure to use, I decided to go with a combination of a CNN for static signs and an LSTM for dynamic signs. I also gathered datasets that cover both static and dynamic signs from a variety of sources (How2Sign, MS-ASL, DSL-10, RWTH-PHOENIX Weather 2014, Sign Language MNIST, ASLLRP).

My progress is on track with our schedule, as I've been working on model testing and gathering data from existing datasets.

Next week, I hope to accomplish more training of the model using the gathered datasets and hopefully be able to display a more accurate prediction of not just letters, but words and phrases. We will also be working on the Design Review presentation and report.

Sejal’s Status Report for 2/10/24

After presenting the project proposal Monday, my group and I reflected on the questions and feedback we received and prepared to start our individual parts of the project. I started doing further research into the machine learning algorithm that will recognize ASL gestures. Since my teammate will be processing the datasets using OpenCV, I will begin by using publicly available datasets that provide preprocessed images for sign language recognition tasks, for example the ASL Alphabet dataset and Sign Language MNIST, both on Kaggle. Since we decided to use TensorFlow and Keras, I looked into how existing projects utilized these technologies. In regard to training the neural network, I learned that convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are commonly used; 3D CNNs are also used for classification, especially with spatiotemporal data. Hybrid models combining CNNs and RNNs might also be a good approach.

Our progress is on track relative to our schedule. During the next week, Ran and I will begin preparing a dataset. We will also allocate some time to learn ASL so we can use some of our own data. I also hope to do more research into neural network structures and consider which would best fit our task.