Team Status Reports – Team E6: TransLingualVisionary

April 27, 2024 (updated April 28, 2024)

Team Status Report 4/27/24

Coming into the final week, our project design consists of running MediaPipe’s HPE and classification model directly on the Jetson and connecting to OpenAI’s API for the LLM portion. A web server will be hosted locally through Node.js and display the video feed alongside any detected text. This week, we’ve focused on retraining our word classification algorithm and finalizing the integration of project components.

The main risk in the coming week is the unexpected failure of a subsystem before demo day. To combat this, we will document our system as soon as the desired behavior is achieved, so that we have proof of success for the final proof of concept. If retraining the word classification model lowers accuracy or causes failures due to the complexity of 2000 classification labels, we will demo using the smaller word classification models that were trained earlier in the semester.

We are on track with our last schedule update and will spend the majority of the next week finishing project details.

When measuring project performance for final documentation, we used the following unit tests and overall system tests:

- Overall and Unit Latencies were calculated by running a timer around specific project components.
  - The speed of our overall latency encouraged us to focus on accuracy and range of word classification rather than reducing latency. As such, we’re retraining our word classification model to identify 2000 words as opposed to our current 100.
- Recognition Rate was tested by measuring output/no output of the classification model in tandem with short video clips.
- Word Classification Accuracy was measured by running video clips of signs through the classification model and checking the output value against the video’s correct label.
- Inference Accuracy has been primarily gauged through human feedback when testing various LLM models’ need for reinforcement or fine-tuning. A more complete dataset, predominantly informed by SigningSavvy, will soon be used to collect model-specific accuracy metrics.
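As a rough illustration of the measurements above, here is a minimal Python sketch of a timer wrapped around pipeline stages, plus the two classification metrics. The names `timed`, `recognition_rate`, `classification_accuracy`, `estimate_pose`, and `classify` are hypothetical stand-ins, not our actual code, and the metric definitions are our assumed reading of "output/no output" and "output versus correct label".

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(name, log):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log.setdefault(name, []).append(time.perf_counter() - start)


def recognition_rate(outputs):
    """Fraction of clips for which the classifier produced any output."""
    return sum(o is not None for o in outputs) / len(outputs)


def classification_accuracy(outputs, labels):
    """Fraction of clips whose predicted word matches the clip's label."""
    correct = sum(o == l for o, l in zip(outputs, labels))
    return correct / len(labels)


# Hypothetical stand-ins for the real pipeline stages.
def estimate_pose(frame):
    return [0.0] * 63


def classify(landmarks):
    return "hello"


latencies = {}
with timed("overall", latencies):
    with timed("hpe", latencies):
        landmarks = estimate_pose(None)
    with timed("classification", latencies):
        word = classify(landmarks)
```

Nesting the unit timers inside the overall timer means the unit and end-to-end numbers come from the same run, so they can be compared directly.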

April 21, 2024

Team’s Report for 4/20/24

Our project design has changed a lot over the last two weeks. Since the pose estimation model was not quantizable for the DPU, it could not be efficiently accelerated on the FPGA. As a result, even after improving the inference timings of the HPE model, it makes more sense latency-wise to run both the HPE and the classification model directly on the Jetson. This was one of our backup plans when we first decided on the project, and there are no additional costs to this change. We are currently finishing the integration of the project and will then measure the performance of our final product. One risk is that if the Llama LLM API does not end up working, we will have to quickly switch to another LLM API such as GPT-4. There is no updated schedule unless we cannot finish the final demo in the next couple of days.

April 6, 2024 (updated April 7, 2024)

Team’s Status Report for 4/6/24

After our demo this week, the two main risks we face are accelerating the HPE on the FPGA with a low enough latency and ensuring that our classification model has a high enough accuracy. Our contingency plan for HPE remains the same: if the FPGA acceleration does not make it fast enough, we will switch to accelerating it on the GPU of our Jetson Nano. As for our classification model, our contingency is retraining it on an altered dataset with better parameter tuning. Our subsystems and main idea have not changed yet. No updated schedule is necessary; the one submitted on the Slack channel for the interim demo (seen below) is correct.

Validation

We need to check for two main things when doing end-to-end testing: latency and accuracy. For latency, we plan to run pre-recorded videos through the entire system and benchmark the time of each subsystem’s output alongside end-to-end latency. This will help us ensure that each part is within the latency limits listed in our design report and that the entire project’s latency is within our set goals. For accuracy, we plan on running a live stream of us signing words, phrases, and sentences and recording the given output. We will then compare our intended sentence with the output to calculate a BLEU score and ensure it is greater than or equal to 40%.
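To make the accuracy check concrete, the following is a minimal single-reference, sentence-level BLEU sketch in plain Python: clipped n-gram precision, geometric mean, and brevity penalty. It is an illustration only, not the scoring code we will necessarily use. The function names, whitespace tokenization, and lack of smoothing are assumptions. Note that without smoothing, any sentence shorter than `max_n` words scores 0, so short signed phrases may need a smaller `max_n`.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, candidate, max_n=4):
    """Single-reference BLEU in [0, 1] with uniform n-gram weights."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: a zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


def meets_target(reference, candidate, threshold=0.40):
    """Check the intended sentence against the system output."""
    return bleu(reference, candidate) >= threshold
```

Here the 40% goal is expressed as `threshold=0.40` on a 0 to 1 scale.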

March 30, 2024 (updated March 31, 2024)

Team’s Status Report for 3/30/24

Our main risks echo those of our previous reports, namely models not working or lacking accuracy, inability to use the FPGA for HPE, etc., for which we have provided mitigation strategies in previous reports. However, the main risk that we are considering right now, especially with the interim demo and our final product in mind, is integration between our components. As we are working to get our components up and running, we will soon begin integration, which could raise some issues in getting everything working together. Our main mitigation strategy for this is focusing on integration early, before improving the individual components. That way, we can have a working pipeline first before we focus on developing, debugging, and improving the individual components of our model further.

At this point, we have not made any large design changes to the system, as we have not found a need to. Similarly, we have not changed our schedule from the update we made last week.

March 24, 2024

Team’s Status Report for 3/23/24

Overall, apart from the risks and mitigations discussed in the previous reports about the new spoter classification model and the FPGA integration, the new risk we found is the ability to determine the end of sentences. As of right now, our classification model simply outputs words or phrases based on the input video stream; however, it is unable to analyze the stream to determine the end of a sentence. To mitigate this problem, our current solution is to run a double query to the LLM while maintaining a log of previous responses to determine the end of a sentence. If this idea does not work, we will introduce timeouts and ask users to pause for a second between sentences. No big design changes have been made so far, since we are still building and modifying both the pose estimation and classification models. We are a bit behind schedule due to the various problems with the model and FPGA, and have thus updated the schedule below. We hope to have at least a partially integrated product by next week, before our interim demo.
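The timeout fallback could look roughly like the sketch below, which closes a sentence once no word has been recognized for a set pause. This is an illustration under assumptions: the class name `PauseSegmenter`, its `push` method, and the convention that the classifier reports `None` when it sees no sign are all hypothetical. The one-second default mirrors the pause we would ask users to take.

```python
class PauseSegmenter:
    """Group classifier outputs into sentences using a pause timeout."""

    def __init__(self, pause_s=1.0):
        self.pause_s = pause_s
        self.words = []
        self.last_word_time = None

    def push(self, word, t):
        """Feed one classifier output at time t (seconds).

        `word` is a recognized word, or None if nothing was recognized.
        Returns the finished sentence as a list of words once the pause
        has elapsed, otherwise None.
        """
        finished = None
        if self.words and t - self.last_word_time >= self.pause_s:
            finished = self.words
            self.words = []
        if word is not None:
            self.words.append(word)
            self.last_word_time = t
        return finished


seg = PauseSegmenter(pause_s=1.0)
results = [
    seg.push("I", 0.0),
    seg.push("EAT", 0.4),
    seg.push(None, 0.9),   # only 0.5 s since the last word: keep waiting
    seg.push(None, 1.5),   # 1.1 s since the last word: sentence ends
    seg.push("YOU", 1.6),  # starts the next sentence
]
```

Each finished word list would then be handed to the LLM as one sentence to translate.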

March 16, 2024

Team’s Report for 3/16/24

Apart from the risks and contingencies mentioned in our previous status reports, the latest risk is with our Muse-RNN model. The GitHub model we found was developed in MATLAB, and we planned on translating and adapting that code to Python for our use. However, a good part of the code is p-code (encrypted). Thus, we have two options: go forward with our current RNN and plans, or try a new model we found called spoter. If there are problems with our current plan, our contingency is to explore this new spoter model. The other risks of FPGA and LLM remain the same (since development is ongoing), and our contingencies also remain the same.

There have been no major updates to the design just yet, and we are still mostly on schedule. We might change the schedule next week, though, because Neeraj might finish development of the LLM before we finalize our RNN.

March 9, 2024 (updated March 10, 2024)

Team’s Status Report for 3/9/24

As of right now, our largest risk remains getting HPE working on the FPGA, but as mentioned before, we are currently in the process of developing that, and we have a backup plan of running HPE on the Jetson if necessary. The other major risk as of right now is our RNN model and whether a standard GRU architecture is enough. While we do believe it should be, we have planned for this by being able to switch out GRU cells for LSTM cells, use new architectures such as MUSE-RNN, and consider the possibility of moving to transformers if necessary.

As of right now, there has been no significant change in the design plans of our project. We have options to pivot in the case of a component failure or lack of performance, as mentioned above, but we have not yet had to commit to one of these pivots.

With consideration of global factors, our biggest impact will be on the deaf and hard of hearing community, particularly those who use American Sign Language (ASL) and are reliant on digital video communication platforms like Zoom. For these individuals, it is quite a challenge to participate on online platforms like Zoom, since they have to rely on either chat features or a translator if other people in the meeting are not familiar with ASL. Our project helps those individuals globally by removing that reliance and bringing accessibility in communication. This will bring a greater, more effective form of communication to hundreds of thousands of ASL users across the globe.

Our product solution has substantial implications for the way the hearing world can engage with ASL users in virtual spaces. Culturally, hearing impaired communities tend to be forgotten about and separate from the hearing world in America. Influence in public spaces, like concerts, and legislation has been hard to come by and accommodations tend to be fought for rather than freely given. Part of this issue stems from hearing communities not having personal relationships with or awareness of the hearing impaired. Every step towards cultivating spaces where deaf individuals can comfortably participate in any and all discussions is a step towards hearing impaired communities having a louder voice in the cultural factors of the communities they identify with. An ASL to text feature in video communication is a small step towards meeting the needs of hearing impaired communities, but opens the doorway to substantial progress. By creating avenues of direct interpersonal communication (without the need for a dedicated translator in every spoken space), the potential for an array of meaningful relationships in professional, academic, and recreational spaces opens up.

While an ASL translation system for digital environments aims to improve accessibility and inclusion for the deaf and hard of hearing community, it does not directly address core environmental needs or sustainability factors. The computational requirements for running machine learning models can be energy-intensive, so our system design will prioritize efficient neural architectures and lean cloud/edge deployment to minimize environmental impact from an energy usage perspective. Additionally, by increasing accessibility to online information, this technology could potentially help promote environmental awareness and education within the Deaf community as an indirect bonus. However, as a communication tool operating in the digital realm, this ASL translator does not inherently solve specific environmental issues and its impact is confined to the human-centric domains of accessibility. While environmental sustainability is not a primary focus, we still aim for an efficient system that has a minimal environmental footprint.

Section breakdown:

A was written by Kavish
B was written by Sandra
C was written by Neeraj

February 18, 2024 (updated March 10, 2024)

Team’s Status Report for 2/17/24

Our largest risk as of right now is determining how much we can do on an FPGA. More specifically, we are experimenting to see whether we can do human pose estimation on an FPGA, as we believe that it could make a strong impact in boosting our performance metrics. We do have a contingency plan of doing HPE and our RNN on an Nvidia Jetson if we are unable to get this working. That said, to best manage this risk, we are trying to frontload as much of that experimentation as possible so that if we do have to pivot, we can pivot early and stay relatively on schedule. Our other risks are echoes of those from last week. We have looked into the latency issues, mainly concerning the bottleneck of human pose estimation. Based on research into past projects using AMD’s Kria KV260, we do believe that we should be able to manage our latency so it isn’t a bottleneck.

As per last week, our major change is related to our risk above, in that we want to try to split the HPE model and our RNN so that the HPE model is on the FPGA and the RNN is on the Jetson. After experimentation, we found that we can just pull the vector data of each landmark from Mediapipe’s hand pose estimation model, allowing us to pass that into an RNN rather than an entire photo. The main reason we made this pivot is that we believe that we can have a smaller model on the Jetson in comparison to the CNN-RNN design we originally planned. This gives a bit more room to scale this model to adjust for per-word accuracy and inference speed. Luckily, as we have not begun extensive work on developing the RNN model, there is not a large cost of change here either. In the case that we are not able to have the HPE model on the FPGA, we would have to be a bit more restrictive on the RNN model size to manage space, but we would still most likely keep the HPE-to-RNN pipeline.
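As a sketch of what passing landmark vectors instead of a photo might look like, the Python below flattens one hand's landmarks into a fixed-length feature vector for a single frame. Mediapipe's hand model reports 21 landmarks per hand, each with x, y, and z coordinates, with the wrist at index 0. The `Landmark` dataclass and `landmarks_to_vector` are hypothetical stand-ins, and subtracting the wrist position is our own assumed normalization, not something the pipeline is confirmed to do.

```python
from dataclasses import dataclass
from typing import List, Sequence

NUM_HAND_LANDMARKS = 21  # landmarks per hand in Mediapipe's hand model


@dataclass
class Landmark:
    """Stand-in for one pose-estimation landmark (hypothetical type)."""
    x: float
    y: float
    z: float


def landmarks_to_vector(landmarks: Sequence[Landmark]) -> List[float]:
    """Flatten one hand's landmarks into a 63-value feature vector.

    Coordinates are made relative to the wrist (landmark 0) so the
    vector does not depend on where the hand sits in the frame. This
    normalization is an assumption made for illustration.
    """
    if len(landmarks) != NUM_HAND_LANDMARKS:
        raise ValueError(f"expected {NUM_HAND_LANDMARKS} landmarks")
    wrist = landmarks[0]
    vector: List[float] = []
    for lm in landmarks:
        vector.extend((lm.x - wrist.x, lm.y - wrist.y, lm.z - wrist.z))
    return vector
```

A sequence of these 63-value vectors, one per frame, is far smaller than the raw frames, which is why the RNN on the Jetson can be smaller than the CNN-RNN design.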

There has not been any update to our schedule as of yet, as most of our pivots do not have a large impact on our scheduling.

Below, we have some images that we used for testing the hand pose estimation model that Mediapipe offers:

Impact Statement

While our real-time ASL translation technology is not focused specifically on physical health applications, it has the potential to profoundly impact mental health and accessibility for the deaf and hard of hearing community. By enabling effective communication in popular digital environments like video calls, live-streams, and conferences, our system promotes fuller social inclusion and participation for signing individuals. Lack of accessibility in these everyday virtual spaces can lead to marginalization and isolation. Our project aims to break down these barriers. Widespread adoption of our project could reduce deaf social isolation and loneliness stemming from communication obstacles. Enhanced access to information and public events also leads to stronger community ties, a sense of empowerment, and overall improved wellbeing and quality of life. Mainstreaming ASL translation signifies a societal commitment to equal opportunity – that all individuals can engage fully in the digital world regardless of ability or disability. The mental health benefits of digital inclusion and accessibility are multifaceted, from reducing anxiety and depression to fostering a sense of identity and self-worth. [Part A written by Neeraj]

While many Americans use ASL as their primary language, there are still many social barriers preventing deaf and hard of hearing (HoH) individuals from being able to fully engage in a variety of environments without a hearing translator. Our product seeks to break down this communication barrier in virtual spaces, allowing deaf/HoH individuals to fully participate in virtual spaces with non-ASL speakers. We’ve seen a drastic increase in accessibility to physical meeting spaces due to the pandemic making platforms like Zoom and Google Meet nearly ubiquitous in many social and professional environments. As a result, those who are immunocompromised or have difficulties getting to a physical location now have access to collaborative meetings in ways that weren’t previously available. However, this accessibility hasn’t extended to allow ASL speakers to connect with those who don’t know ASL. Since these platforms already offer audio and visual data streams, integration of automatic closed captioning has already been done on some of them, like Zoom. The development of models like ours would allow these platforms to further their range of users.

This project’s scope is limited to ASL to prevent unnecessary complexity, since there are over 100 different sign languages globally. As a result, the social impact of our product would only extend to places where ASL is commonly used. ASL is commonly used in the United States, Philippines, Puerto Rico, Dominican Republic, Canada, Mexico, much of West Africa and parts of Southeast Asia. [Part B written by Sandra]

Although our product does not have direct implications for economic factors, it can affect them in many indirect ways. The first is by improving the working efficiency of a large subgroup of the population. Currently, the deaf and hard-of-hearing community faces many hurdles when working on any type of project because the modern work environment is highly reliant on video conferencing. By adopting our product, they will become more independent and thus will be able to work more effectively, which will in turn improve the output of their projects. Overall, by improving the efficiency of a large subgroup of the population, we would effectively be improving the production of many projects and thus boosting the economy. Additionally, our entire project is built using open source work and anything we build ourselves will also be open source. By adding to the open source community, we are indirectly boosting the economy because we are making further developments in this area much easier. We are only targeting ASL, and there are many other sign languages which will eventually have to be added to the model. Thus, further expansion on this idea becomes economically friendly (due to reduction in production time and cost savings) and leads to more innovation. [Part C written by Kavish]

February 10, 2024 (updated March 10, 2024)

Team’s Status Report for 2/10/24

As we analyze the project right now, the most significant risks that could jeopardize the success of the project are not being able to port our HPE model onto the FPGA, not having enough data to train a proper RNN, getting inconsistent responses when working in real time, and having communication failures. If there are any major problems with our FPGA, our contingency is to reduce our MVP and port all parts of our pipeline directly onto a Jetson. Although we seem to have enough data right now, we have found a couple of datasets which we could try to combine with some additional pre-processing if they are needed for training. Finally, if there are any failures due to the latency requirement, we have prepared enough slack to increase latency if necessary. Compared to our original design, we are now considering an augmented version of the pipeline. We plan to run Human Pose Estimation and use the outputs from that model to train our ASL classification RNN. We are still exploring this concept, but we hope it will reduce our model size. We are also considering moving the RNN from the computer and running it on the Jetson. This will allow us to develop a more compact final product. Of course, this adds another part to our parts list; however, we will discuss this idea more with faculty before implementing this solution. We are still on schedule and no updates are necessary as of right now.