Neeraj’s Status Report for 3/9/24

For the past couple of weeks, my main focus has been completing the design report. This mainly involved writing up the design trade-offs and testing/verification sections, as well as wrapping up other sections that needed to be finished. The main point of focus for me throughout this was the justification of our choices and making sure that every choice and metric we used had a valid reason behind it. From a more technical standpoint, I have spent more time developing the RNN and looking into the possibility of using MUSE-RNN, which would be interesting to develop and effective for improving accuracy. As of right now, I have the majority of a basic GRU-based RNN developed, compiling code I could find from outside resources before working on implementing the MUSE-RNN architecture, as I want to check our current metrics before diving into further complicating the architecture. I have also been experimenting with various word-to-sentence combinations to test different LLM prompts to not fall behind on that front either. I started with a basic “translate these words [_,…,_] from ASL into an English sentence” prompt, which has a lot of variation in response, not only in the sentence itself but also in the format of the response we get from the model. Since this is a very simple prompt, I am planning on utilizing various papers and studies, such as the Japanese-to-English paper that I referenced beforehand, to further develop a more complex prompt. I have not been able to spend as much time on this as I would have liked to, so to not fall behind, I am planning to focus on finishing this relatively quickly or get help from my groupmates, as I also want to focus on the RNN development.

My plan as of right now is to try developing our RNN training cases by running our dataset through our HPE model and then training the base RNN model that we are using. I am also planning on starting the MUSE-RNN architecture separately. I also want to meet up with my groupmates at some point to focus in on developing our prompt for the LLM and see if we can finish as much of that component as fast as possible.

I believe that I am mostly on schedule, if not slightly behind due to experimenting with our model architecture. As aforementioned, I plan on compensating for this by getting help from my groupmates on the LLM portion of things, as I believe that doing some intensive research on prompt engineering should give us substantial progress, especially in terms of achieving desired formatting and answer structure. This would give me more time to focus on developing MUSE-RNN if we want to pursue that architecture.

Kavish’s Status Report for 3/9/24

My last two weeks were spent working mainly on the design report. This process made me look a little bit deeper into the system implementation component of our project compared to the design presentation. Since there were numerous ways to achieve the same goal of implementation, we analyzed the different methods to understand the advantages in terms of functional efficiency and developmental efficiencies. For the FPGA development, we will be highly reliant on the Vitis AI tool and its model zoo and libraries. This decision was made for fastest development time in mind instead of focusing on optimizations from the get-go. Similarly for the web application, we decided that we should initially just use the OpenCV library for viewing and debugging purposes and then transition to a more mature display platform later in the project. As for my progress on the HPE development on FPGA, I am still running into a couple of roadblocks to fully compile the project on the board. I am referring to other works online who have attempted similar projects and will reach out for additional help as required. As for the progress, I am on the border of on-track or a little behind schedule due to the time spent on the design report. I do not need to alter the schedule as of right now, but will update it next week if required.

Team’s Status Report for 3/9/24

As of right now, our largest risk remains to be working HPE on the FPGA, but as mentioned before, we are currently in the process of developing that and we have a backup plan of running HPE on the Jetson if necessary. The other major risk as of right now is our RNN model and whether a standard GRU architecture is enough. While we do believe it should be, we have planned for this in being able to switch out GRU cells with LSTM cells, using new architectures such as MUSE-RNN, and considering the possibility of moving to transformers if necessary.

As of right now, there has been no significant change in the design plans of our project. We have options to pivot in the case of a competent failure or lack of performance, as mentioned above, but we have not yet had to commit to one of these pivots.

With consideration of global factors, our biggest impact will be on the deaf and hard of hearing community – particularly the ones who use the American Sign Language (ASL) and are reliant on the digital video communication platforms like zoom. For all those individuals, it is quite a challenge to participate in online platforms like zoom and have to rely on either chat features or a translator if other people in the meeting are not familiar with ASL. Our project helps those individuals globally by removing that reliance and bringing accessibility in communication. This will bring a greater, more effective form of communication to hundreds of thousands of ASL users across the globe.

Our product solution has substantial implications for the way the hearing world can engage with ASL users in virtual spaces. Culturally, hearing impaired communities tend to be forgotten about and separate from the hearing world in America. Influence in public spaces, like concerts,  and legislation has been hard to come by and accommodations tend to be fought for rather than freely given. Part of this issue stems from hearing communities not having personal relationships with or awareness of the hearing impaired. Every step towards cultivating spaces where deaf individuals can comfortably participate in any and all discussions is a step towards hearing impaired communities having a louder voice in the cultural factors of the communities they identify with. An ASL to text feature in video communication is a small step towards meeting the needs of hearing impaired communities, but opens the doorway to substantial progress. By creating avenues of direct interpersonal communication (without the need for a dedicated translator in every spoken space), the potential for an array of meaningful relationships in professional, academic, and recitational spaces opens up. 

While an ASL translation system for digital environments aims to improve accessibility and inclusion for the deaf and hard of hearing community, it does not directly address core environmental needs or sustainability factors. The computational requirements for running machine learning models can be energy-intensive, so our system design will prioritize efficient neural architectures and lean cloud/edge deployment to minimize environmental impact from an energy usage perspective. Additionally, by increasing accessibility to online information, this technology could potentially help promote environmental awareness and education within the Deaf community as an indirect bonus. However, as a communication tool operating in the digital realm, this ASL translator does not inherently solve specific environmental issues and its impact is confined to the human-centric domains of accessibility. While environmental sustainability is not a primary focus, we still aim for an efficient system that has a minimal environmental footprint.

Section breakdown:
  • A was written by Kavish
  • B was written by Sandra
  • C was written by Neeraj

Team’s Status Report for 2/24/24

Our risks still remain the same as last week. The primary ones being the porting of MediaPipe onto FPGA and the performance of our self-trained RNN. If we are unable to port the MediaPipe, we will simply implement it on the Jetson. Although it is undesirable, it should still be possible to make our MVP. If we are not able to get the required accuracy from the GRU RNN model, we will look into either LSTM or a transformer as a plan B.

There were no major design changes made this week. After more benchmarking and testing, we might have to make more changes.

Updated Gantt chart is attached below.

Kavish’s Status Report 2/24/24

Most of this week was spent on preparing the design presentation and then translating that presentation into report. Me and Neeraj discussed our design quite extensively for the presentation and report. We looked at different libraries for RNN and whether to use GRU or LSTM. I also spent time learning about the MediaPipe model as I am quite weak in the Machine Learning aspects. Unfortunately I was unable to work more on the FPGA development and still have to port a pre-trained model onto it for testing purposes. Although this does not put me behind the new schedule that we developed for design presentation, I will need to put in some extra hours next week to be ahead of schedule. I plan on implementing a basic pose estimation model from the Vitis model zoo and then benchmarking its performance to fully validate our idea.

Neeraj’s Status Report for 2/24/24

The main focus of this week was revolving around our design presentation. My main work for this week was spent revolving around preparing the slideshow at the beginning of the week, helping Kavish get ready to present, and then beginning the transition of information from the slideshow onto the final design report.

I also spent more time this week looking into the RNN and beginning development. I am looking into what libraries would work best for a real-time inference model. I decided to begin development using Tensorflow simply because it is the most predominant library used in the industry, considering it has the largest amount of optimization and thus would run the fastest, thus lowering our latency as much as possible.  As of right now, I am just developing the model, so my main goal over the next week is to spend more time focusing on finishing that so we can get to testing as much as possible. We also need to develop a training set for the RNN, so I am currently setting up the code to take our videos and transform them into HPE vector outputs so that we can use those to train our RNN. I need to figure out where I should do this process, as it might take a good amount of computing power, so I need to determine whether we can do this on the ECE machines or if it might be worth it to invest in AWS services to run these.

I believe I am slightly behind schedule, as we expect the RNN model to be finished soon. It might be worth it to push the RNN scheduling a bit further back, as the RNN model might take more time than we anticipated, partially due to workload from other classes. As such, it might be worth it to merge the time in which we are testing the RNN and developing it, as that is something that can be combined, since part of the RNN development time will include developing the LSTM model for comparison. As such, we can test the GRU model in the mean time.

For next week, I ideally want to finish the base GRU RNN model, as that would keep us as on track as possible, and in a good spot heading into spring break, in which I can work on testing and model comparison.

Sandra’s Status Report for 2/24/24

My progress is a bit behind schedule and I plan to take steps over the upcoming break to catch up to the project schedule.

In the next week, I hope to develop a meshed hand/facial expression estimation model and compare it to MediaPipe’s holistic feature recognition model.

Neeraj’s Status Report for 2/17/24

My main goal for this week was to experiment with various human pose estimation libraries. Primarily, I was focusing on determining whether to use OpenPose or Mediapipe and which of the two would better fit our design pipeline. These libraries have had a history of running on smaller IoT devices, meaning either of which has the potential to possibly work on an FPGA like we intend to.

When installing both of these models, I was having issues with installing the OpenPose models, meaning I might need more time experimenting with them. However, I have been able to test Mediapipe’s hand detection model. It can create 20 landmarks to detect hand position and pose, as well as distinguish left and right hands. It also outputs vectors that hold locations with the position of each landmark in the image. This means that we do not necessarily have to develop a CNN-RNN fusion model to account for spatial information and instead use these vectors as inputs into an RNN to classify words. I have tested this with a few still photos, as per Mediapipe’s test documentation code, which is in our team report. Combining this with the OpenCV library, I have developed a script that takes in a live video from a camera and return output vectors containing the positions of the vectors. This script would be representative of the beginning end of our design pipeline, which we can use for testing and verifying the hardware side of our design.

On another note, I have been looking about how prompt engineering works with LLMs. More specifically, I am looking at the following paper by Masaru Yamada:

https://arxiv.org/ftp/arxiv/papers/2308/2308.01391.pdf

The paper explored the influence of integrating the purpose of the translation and the target audience into prompts on the quality of translations produced by ChatGPT. The findings suggest that including suitable prompts related to the translation purpose and target audience can yield more flexible and higher quality translations that better meet industry standards. Specifically, the prompts allowed ChatGPT to generate translations that were more culturally adapted and persuasive for marketing content as well as more intelligible translations of culture-dependent idioms. The paper also demonstrated the practical application of translation concepts like dynamic equivalence by using prompts to guide creative translations. This paper could provide good insight into translating word fragments into full sentences, as well as what prompts we could begin experimenting with so that we can capture the nuances within ASL.

I am basically on schedule. This human pose estimation code would function as a good foundation for our pre-processing model, which we can also use for hardware testing. We are also on pace for starting prompt engineering for our LLM.

Kavish’s Status Report for 2/17/24

Over the last week, I got access to the canvas and piazza for 18643 class. Using that, I have followed along and finished the first few labs to understand Vitis tool flow and software emulation steps for testing. I also researched about the current confusion we are having – whether doing pose estimation on FPGA is worth it or not. Looking online, I found evidence that many pose estimation models have been ported onto our FPGA (KV260) and have achieved comparable throughput to many Jetsons. Although it is a positive response, it is not conclusive evidence that our current plan is fully feasible. Thus, I am currently working on trying to port pre-trained models from Vitis Model Zoo to self-measure the feasibility of the project. I am running into some roadblocks on some development steps on the tool side (trying to find or build the correct configuration files and booting the kernel on the SD card), which I am trying to resolve. Although I am a little behind schedule, it is not too far off just yet to change anything on our project timeline. I should be on schedule if I can make significant progress in trying to get the FPGA running with a pre-trained pose estimation model from the model zoo. Not only will this help to confirm if our current design is feasible, but it will also put us on track to get MediaPipe or OpenPose running later down the line.

Team’s Status Report for 2/17/24

Our largest major risk as of right now is determining how much we can do on an FPGA. More specifically, we are trying to experiment on whether we can do human pose estimation using an FPGA, as we believe that it could make a strong impact in boosting our performance metrics.  We do have a contingency plan of doing HPE and our RNN on an Nvidia Jetson if we are unable to get this working though. That said, to best manage this risk, we are trying to frontload as much of that experimentation as possible so that if we do have to pivot, we can pivot early and stay relatively on schedule. Our other risks our echoes of the ones of last week. We have looked into the latency issues, mainly concerning the bottleneck of human pose estimation. Based on research into past projects using AMD’s Kris KV260, we do believe that we should be able to manage our latency so it isn’t a bottleneck.

As per last week, our major change is related to our risk above, in that, we want to try and split the HPE model and our RNN so that the HPE model is on the FPGA and the RNN is on the Jetson. After experimentation, we can find that we can just pull the vector data of each landmark from Mediapipe’s hand pose estimation model, allowing us to pass that into an RNN rather than an entire photo. The main reason we made this pivot is that we believe that we can have a smaller model on the Jetson in comparison to the CNN-RNN design we originally planned. This gives a bit more room to scale this model to adjust for per-word accuracy and inference speed. Luckily, as we have not begun extensive work on developing the RNN model, there is not a large cost of change here either. In the case that we are not able to have the HPE model on the FPGA, we would have to be a bit more restrictive on the RNN model size to manage space, but we would still most likely keep the HPE-to-RNN pipeline.

There has not yet been any update to our schedule as of yet, as a lot of our pivots does not have a large impact on our scheduling.

Below, we have some images that we used for testing the hand pose estimation model that Mediapipe offers:

 

Impact Statement

While our real-time ASL translation technology is not focused specifically on physical health applications, it has the potential to profoundly impact mental health and accessibility for the deaf and hard of hearing community. By enabling effective communication in popular digital environments like video calls, live-streams, and conferences, our system promotes fuller social inclusion and participation for signing individuals. Lack of accessibility in these everyday virtual spaces can lead to marginalization and isolation. Our project aims to break down these barriers. Widespread adoption of our project could reduce deaf social isolation and loneliness stemming from communication obstacles. Enhanced access to information and public events also leads to stronger community ties, a sense of empowerment, and overall improved wellbeing and quality of life. Mainstreaming ASL translation signifies a societal commitment to equal opportunity – that all individuals can engage fully in the digital world regardless of ability or disability. The mental health benefits of digital inclusion and accessibility are multifaceted, from reducing anxiety and depression to fostering a sense of identity and self-worth. [Part A written by Neeraj]

While many Americans use ASL as their primary language,  there are still many social barriers preventing deaf and hard of hearing (HoH) individuals from being able to fully engage in a variety of environments without a hearing translator. Our product seeks to break down this communication barrier in virtual spaces, allowing deaf/HoH individuals to fully participate in virtual spaces with non-ASL speakers. We’ve seen a drastic increase in accessibility to physical meeting spaces due to the pandemic making platforms like Zoom and Google Meet nearly ubiquitous in many social and professional environments. As a result, those who are immunocomprimized or have difficulties getting to a physical location  now have access to collaborative meetings in ways that weren’t previously available. However, this accessibility hasn’t extended to allow ASL speakers to connect with those who dont know ASL. Since these platforms already offer audio and visual data streams, integration of automatic closed captioning has already been done on some of them, like Zoom. The development of models like ours would allow these platforms to further their range of users.

This project’s scope is limited to ASL to prevent unecessary complexity since there are over 100 different sign languages globally. As a result, the social imapact of our  product would only extend to places where ASL is commonly used. ASL is commonly used in the United States, Philippines, Puerto Rico, Dominican Republic, Canada, Mexico, much of West Africa and parts of Southeast Asia. [Part B written by Sandra]

 

Although our product does not have direct implications on economic factors, it can affect the economic factors in many indirect ways. The first method is by improving the working efficiency of a large subgroup of the population. Currently, the deaf and hard-of-hearing community have a lot of hurdles when working on any type of project because the modern work environment is highly reliant on video conferencing. By adopting our product, they will become more independent and thus will be able to work more effectively, which will in-turn improve the output received by their projects. Overall, by improving the efficiency of a large subgroup of the population, we would effectively be improving the production of many projects and thus boosting the economy. Additionally, our entire project is built using open source work and anything we build ourselves will also be open source. By adding to the open source community, we are indirectly boosting the economy because we are making further developments in this area much easier. We are only targeting ASL and there are many other sign languages which will eventually have to be added to the model. Thus, further expansion on this idea becomes economically friendly (due to reduction in production time and cost savings) and leads to more innovation. [Part C written by Kavish]

Kavish’s Status Report for 2/10/24

This week we finished and presented our proposals to the class and analyzed the feedback from the TAs and Professors. Apart from that I downloaded Vivado and Vitis tools on my computer to start ramping up the FPGA development. I started exploring the basic steps of understanding the block diagrams and stepping through the synthesis tool flow. I have also been working with Neeraj to understand different Human Pose Estimation models (OpenPose and MediaPipe) and the feasibility of porting them onto the FPGA. I found a detailed resource that has implemented ICAIPose network onto the DPU, and will try to replicate their steps this week to understand feasibility and performance. As long as I can get the board and tools set up by next week, I should be on schedule and maintain progress according to Gantt Chart.

Neeraj’s Status Report for 2/10/24

For this week, aside from preparing for proposals, my main goal was to do dataset and model exploration. I spent a good amount of time downloading and going through the dataset to determine what exactly we are working with. Our video inputs are about 2-to-4-second video clips of a person signing a single word. Along with these 2-second hand gestures, we also have numerous JSON files that encode a word to each video. The difference between the JSON files is the number of video-word combinations that said JSON file accesses. As a result, we not only have a few pre-defined subsets that we can access, but we also can make our own splits in case we want to train on fewer words in the case that we need to make our model smaller.

I have been looking into a couple of ideas and libraries for pre-processing. There are a lot of cases where people have used general video classification, but in a lot of those cases, those models are trained so that they capture the entirety of the image. In the case of ASL detection, we do not necessarily want to do that, as there can be an issue of recognizing signs based on extraneous data, such as the color of the background or the person signing. When looking into human pose estimation to counter this, there are two main libraries that we have been considering, being Mediapipe and OpenPose. The main thing that I was looking at was what we wanted  to use for landmarking. Mediapipe has pre-trained models for both full-body landmarks and hand orientation landmarks. The amin issue is that full-body does not give us enough hand estimation details, but the hand estimation does not give use any data outside of the hand. There have been numerous of past cases in which OpenPose has been used for human pose estimation, so I am looking into whether there is any repositories that would fit our needs. If there isn’t, we might have to create our own model, but I think it would be best to avoid this so that we do not have to create our own dataset. I have also been talking with Kavish about the feasibility of these various options on an FPGA.

For classification itself, I have been looking into what type of RNN model to use, as human pose estimation might be able to give us vectors that we can pass into the RNN model instead of images (I still need to look more into this before finalizing this possible design change). Otherwise, there are a lot of possibilities with CNN-RNN models from Keras, OpenCV, and Scipy that we can use.

As of right now, I believe that I am slightly ahead of schedule. There are a few decisions that I need to ask the team and revise, as mentioned above. After this, my main deliverable would be to find/develop a good model for our human-pose estimation pre-processing and determining whether it would be reasonable to use that for our RNN input.

Team’s Status Report for 2/10/24

As we analyze the project right now, the most significant risks that could jeopardize the success of the project are not being able to port our HPE model onto FPGA, not having enough data to train a proper RNN, getting inconsistent response when working in Real-Time, and having communication failures. If there are any major problems with our FPGA, our contingency is to reduce our MVP and port all parts of our pipeline directly onto a Jetson. Although we seem to have enough data right now, we have found a couple of datasets which we could try to combine with some additional pre-processing if they are needed for training. Finally, if there are any failures due to latency requirement, we have prepared enough slack to increase latency if necessary. Compared to our original design, we are now considering an augmented version of the pipeline. We plan to run Human Pose Estimation and use the outputs from that model to train our ASL classification RNN. We are still exploring this concept but we hope that will reduce our model size. We are also considering moving the RNN from the computer and running on the Jetson. This will allow use to develop a more compact final product.  Of course, this adds another part to our parts list; however, we will discuss this idea more with faculty before implementing this solution. We are still on schedule and no updates are necessary as of right now.

Sandra’s Status Report for 2/10/24

This week was focused on clearly establishing the boundaries of our project and presenting our proposal during 18-500 class time.  As such, this week was primarily focused on research and preparing for implementation. Seeing our peers’ project proposals was a useful way to reflect on the strong and weak points of our own project planning. I’m most exicted to focus on the integration and aplication of our various project pieces.  Ultimately, we need to connect out hardware and ML algorithms to a functional display that can be easily navigated by a user. To broden the scope of what our project can be applied to, I’ve spent significant time looking into the possible input and output connection points we can easily integrate early on to ensure we’ll be able to smoothly connect our various pieces together when finishing our project.

Moreover, my teammates have a deeper understanding of machine learning in practice than I do so I’ve also spent time time this week continuing to familliarize myself with the process of training and optimizing an LLM and CV neural network. Since training any ML algorithm is a substantial task, I’ve focused on learning more of the the grammar rules ASL uses to begin the process of selecting hyperparameters or otherwise tuning our algorithms to best meet our needs.

Currently, my progress is on schedule as we’re laying the groundwork to ramp up training and processing our datasets, as outlined in our Gantt Chart below.

In the next week, I hope to continue researching the aformentioned topics and collaborate with my teammates to establish the division of labor for the week. Developing a human-pose estimation pre-processing model will be the software end’s next main step and establishing a clear roadmap for achieving our project objectives will be necessary.