Team’s Status Report for 2/24/24

Our risks remain the same as last week. The primary ones are porting MediaPipe onto the FPGA and the performance of our self-trained RNN. If we are unable to port MediaPipe, we will implement it on the Jetson instead. Although that is undesirable, it should still be possible to reach our MVP. If we cannot get the required accuracy from the GRU RNN model, we will look into either an LSTM or a transformer as a plan B.

There were no major design changes made this week. After more benchmarking and testing, we might have to make more changes.

Updated Gantt chart is attached below.

Kavish’s Status Report for 2/24/24

Most of this week was spent preparing the design presentation and then translating that presentation into the report. Neeraj and I discussed our design quite extensively for both. We looked at different libraries for the RNN and at whether to use a GRU or an LSTM. I also spent time learning about the MediaPipe model, as I am quite weak in the machine learning aspects. Unfortunately, I was unable to work more on the FPGA development and still have to port a pre-trained model onto it for testing purposes. Although this does not put me behind the new schedule that we developed for the design presentation, I will need to put in some extra hours next week to get ahead of schedule. I plan on implementing a basic pose estimation model from the Vitis model zoo and then benchmarking its performance to fully validate our idea.

Neeraj’s Status Report for 2/24/24

The main focus of this week was our design presentation. I spent the beginning of the week preparing the slideshow and helping Kavish get ready to present, and then began moving information from the slideshow into the final design report.

I also spent more time this week looking into the RNN and beginning development. I am looking into which libraries would work best for a real-time inference model. I decided to begin development using TensorFlow because it is one of the most widely used libraries in industry and is heavily optimized, so we expect it to run fast and keep our latency as low as possible. As of right now, I am just developing the model, so my main goal over the next week is to finish it so that we can get to testing as soon as possible. We also need to develop a training set for the RNN, so I am currently setting up the code to take our videos and transform them into HPE vector outputs that we can use to train our RNN. I need to figure out where to run this process, as it might take a good amount of computing power: either on the ECE machines or, if it is worth the investment, on AWS services.
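A minimal sketch of the batching step that would follow the video-to-vector conversion, assuming each clip has already been turned into a list of per-frame HPE vectors. The function name `pad_or_truncate`, the zero padding, and the choice to drop trailing frames are illustrative assumptions, not the project's actual preprocessing code.

```python
from typing import List


def pad_or_truncate(frames: List[List[float]], target_len: int,
                    feature_dim: int) -> List[List[float]]:
    """Return exactly target_len frames, each with feature_dim values.

    Shorter clips are padded with all-zero frames at the end; longer clips
    lose their trailing frames. Both choices are assumptions for illustration.
    """
    for frame in frames:
        if len(frame) != feature_dim:
            raise ValueError("every frame must have feature_dim values")
    clipped = [list(frame) for frame in frames[:target_len]]
    padding = [[0.0] * feature_dim for _ in range(target_len - len(clipped))]
    return clipped + padding


# Hypothetical clip: 3 frames, each a 4-value feature vector.
clip = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
fixed = pad_or_truncate(clip, target_len=5, feature_dim=4)
print(len(fixed), len(fixed[0]))
```

Fixing the sequence length this way lets every clip be stacked into one (clips, frames, features) array for training; a masking layer or uniform frame sampling could replace the zero padding and truncation.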

I believe I am slightly behind schedule, as we expected the RNN model to be finished soon. It might be worth pushing the RNN schedule back a bit, since the model might take more time than we anticipated, partially due to workload from other classes. One option is to merge the RNN development and testing periods: part of the development time will go toward building the LSTM model for comparison, so we can test the GRU model in the meantime.
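For a rough sense of what the GRU versus LSTM comparison means for model size, the sketch below counts trainable parameters in a single recurrent layer using the textbook formulas (three gates for a GRU, four for an LSTM). The input size of 126 and hidden size of 64 are assumed values for illustration only, and some framework implementations add an extra bias vector per gate, so real counts can differ slightly.

```python
def recurrent_params(num_gates: int, input_dim: int, hidden_dim: int) -> int:
    """Parameters in one recurrent layer: each gate has an input weight matrix,
    a recurrent weight matrix, and one bias vector."""
    per_gate = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return num_gates * per_gate


INPUT_DIM = 126   # assumption: 2 hands x 21 landmarks x 3 coordinates
HIDDEN_DIM = 64   # assumption: hypothetical hidden size

gru_params = recurrent_params(3, INPUT_DIM, HIDDEN_DIM)
lstm_params = recurrent_params(4, INPUT_DIM, HIDDEN_DIM)
print(gru_params, lstm_params, gru_params / lstm_params)
```

Under these assumptions a GRU layer carries three quarters of the parameters of an equally sized LSTM layer, which is one reason it is a natural first candidate when latency and model size matter.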

For next week, I ideally want to finish the base GRU RNN model, as that would keep us as on track as possible, and in a good spot heading into spring break, in which I can work on testing and model comparison.

Sandra’s Status Report for 2/24/24

My progress is a bit behind schedule and I plan to take steps over the upcoming break to catch up to the project schedule.

In the next week, I hope to develop a meshed hand/facial expression estimation model and compare it to MediaPipe’s holistic feature recognition model.

Neeraj’s Status Report for 2/17/24

My main goal for this week was to experiment with various human pose estimation libraries. Primarily, I focused on determining whether OpenPose or MediaPipe would better fit our design pipeline. Both libraries have a history of running on smaller IoT devices, meaning either one could potentially work on an FPGA as we intend.

When installing both of these models, I had issues with the OpenPose models, meaning I might need more time to experiment with them. However, I have been able to test MediaPipe’s hand detection model. It produces 21 landmarks per hand to capture hand position and pose, and it distinguishes left and right hands. It also outputs vectors that hold the position of each landmark in the image. This means that we do not necessarily have to develop a CNN-RNN fusion model to account for spatial information and can instead use these vectors as inputs into an RNN to classify words. I have tested this with a few still photos, following MediaPipe’s documentation test code; the results are in our team report. Combining this with the OpenCV library, I have developed a script that takes in live video from a camera and returns output vectors containing the positions of the landmarks. This script represents the front end of our design pipeline, which we can use for testing and verifying the hardware side of our design.
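To make the "landmarks as RNN input" idea concrete, here is a small sketch of how one frame's hand landmarks could be flattened into a single feature vector. It works on plain (x, y, z) tuples rather than calling MediaPipe directly; the helper names and the choice to fill a missing hand with zeros are assumptions for illustration, not the script described above.

```python
from typing import List, Optional, Sequence, Tuple

NUM_HAND_LANDMARKS = 21   # MediaPipe's hand model reports 21 landmarks per hand
COORDS_PER_LANDMARK = 3   # x, y, and a relative depth z

Landmark = Tuple[float, float, float]


def flatten_hand(landmarks: Sequence[Landmark]) -> List[float]:
    """Turn one hand's landmarks into a flat [x0, y0, z0, x1, ...] vector."""
    if len(landmarks) != NUM_HAND_LANDMARKS:
        raise ValueError(
            f"expected {NUM_HAND_LANDMARKS} landmarks, got {len(landmarks)}")
    return [coord for point in landmarks for coord in point]


def frame_vector(left: Optional[Sequence[Landmark]],
                 right: Optional[Sequence[Landmark]]) -> List[float]:
    """Concatenate left and right hands; an undetected hand becomes zeros
    (an assumption made here so every frame has the same length)."""
    empty = [0.0] * (NUM_HAND_LANDMARKS * COORDS_PER_LANDMARK)
    left_vec = flatten_hand(left) if left is not None else list(empty)
    right_vec = flatten_hand(right) if right is not None else list(empty)
    return left_vec + right_vec


# Hypothetical landmarks for a detected left hand and no right hand.
hand = [(i / 100, i / 50, 0.0) for i in range(NUM_HAND_LANDMARKS)]
vector = frame_vector(hand, None)
print(len(vector))
```

In a live pipeline the tuples would be read from the pose estimator's per-frame output, and one such vector per frame would form the sequence fed to the RNN.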

On another note, I have been looking into how prompt engineering works with LLMs. More specifically, I am looking at the following paper by Masaru Yamada:

https://arxiv.org/ftp/arxiv/papers/2308/2308.01391.pdf

The paper explored the influence of integrating the purpose of the translation and the target audience into prompts on the quality of translations produced by ChatGPT. The findings suggest that including suitable prompts related to the translation purpose and target audience can yield more flexible and higher quality translations that better meet industry standards. Specifically, the prompts allowed ChatGPT to generate translations that were more culturally adapted and persuasive for marketing content as well as more intelligible translations of culture-dependent idioms. The paper also demonstrated the practical application of translation concepts like dynamic equivalence by using prompts to guide creative translations. This paper could provide good insight into translating word fragments into full sentences, as well as what prompts we could begin experimenting with so that we can capture the nuances within ASL.
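As a starting point for experimentation, a prompt that states the translation purpose and target audience could be assembled along the lines below. This is only a sketch: the function name, the wording of the prompt, and the example glosses are hypothetical and have not been evaluated against any model.

```python
from typing import Sequence


def build_translation_prompt(glosses: Sequence[str], purpose: str,
                             audience: str) -> str:
    """Assemble a prompt that gives the LLM the purpose and audience
    alongside the recognized word fragments (glosses)."""
    gloss_line = " ".join(g.upper() for g in glosses)
    return (
        "You are translating American Sign Language (ASL) glosses into English.\n"
        f"Purpose of the translation: {purpose}\n"
        f"Target audience: {audience}\n"
        "Rewrite the glosses below as one natural, grammatical English sentence. "
        "Do not add information that the glosses do not imply.\n"
        f"Glosses: {gloss_line}"
    )


# Hypothetical classifier output for a short signed phrase.
prompt = build_translation_prompt(
    ["store", "I", "go"],
    purpose="live captions for a video call",
    audience="hearing participants who do not know ASL",
)
print(prompt)
```

Keeping the purpose and audience as parameters makes it easy to compare prompt variants on the same gloss sequences.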

I am basically on schedule. This human pose estimation code would function as a good foundation for our pre-processing model, which we can also use for hardware testing. We are also on pace for starting prompt engineering for our LLM.

Kavish’s Status Report for 2/17/24

Over the last week, I got access to the Canvas and Piazza pages for the 18643 class. Using them, I followed along and finished the first few labs to understand the Vitis tool flow and the software emulation steps for testing. I also researched our current open question: whether doing pose estimation on the FPGA is worth it. Looking online, I found evidence that many pose estimation models have been ported onto our FPGA (KV260) and have achieved throughput comparable to many Jetsons. Although this is encouraging, it is not conclusive evidence that our current plan is fully feasible. Thus, I am currently trying to port pre-trained models from the Vitis Model Zoo so that we can measure the feasibility of the project ourselves. I am running into some roadblocks on the tool side (trying to find or build the correct configuration files and booting the kernel from the SD card), which I am trying to resolve. Although I am a little behind schedule, I am not yet far enough behind to change anything on our project timeline. I should be back on schedule if I can make significant progress in getting the FPGA running with a pre-trained pose estimation model from the model zoo. Not only will this help confirm whether our current design is feasible, but it will also put us on track to get MediaPipe or OpenPose running later down the line.
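Once a model runs on the board, measuring its feasibility comes down to timing repeated inferences. The harness below is a generic sketch of that measurement: `infer` stands in for whatever call runs one frame through the model (on the FPGA, a Jetson, or a laptop), and the warm-up and run counts are arbitrary assumptions.

```python
import statistics
import time
from typing import Any, Callable, Dict


def benchmark(infer: Callable[[Any], Any], sample: Any,
              warmup: int = 5, runs: int = 50) -> Dict[str, float]:
    """Time repeated calls to infer(sample) and summarize latency."""
    for _ in range(warmup):          # discard start-up effects
        infer(sample)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        latencies.append(time.perf_counter() - start)
    mean = statistics.mean(latencies)
    p95 = sorted(latencies)[int(0.95 * (runs - 1))]   # approximate 95th percentile
    return {
        "mean_ms": mean * 1000.0,
        "p95_ms": p95 * 1000.0,
        "fps": 1.0 / mean if mean > 0 else float("inf"),
    }


# Placeholder workload standing in for a real pose estimation call.
stats = benchmark(lambda frame: sum(frame), list(range(1000)))
print(stats)
```

Running the same harness against each candidate platform gives directly comparable mean latency, tail latency, and frames per second.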

Team’s Status Report for 2/17/24

Our largest risk right now is determining how much we can do on an FPGA. More specifically, we are experimenting with whether we can do human pose estimation (HPE) on an FPGA, as we believe it could make a strong impact on our performance metrics. We do have a contingency plan of running HPE and our RNN on an Nvidia Jetson if we are unable to get this working. That said, to best manage this risk, we are trying to frontload as much of that experimentation as possible so that if we do have to pivot, we can pivot early and stay relatively on schedule. Our other risks are echoes of last week’s. We have looked into the latency issues, mainly concerning the bottleneck of human pose estimation. Based on research into past projects using AMD’s Kria KV260, we believe we should be able to manage our latency so that it isn’t a bottleneck.

As with last week, our major change is related to the risk above: we want to split the HPE model and our RNN so that the HPE model is on the FPGA and the RNN is on the Jetson. After experimentation, we found that we can pull the vector data of each landmark from MediaPipe’s hand pose estimation model, allowing us to pass that into an RNN rather than an entire photo. The main reason we made this pivot is that we believe we can have a smaller model on the Jetson compared to the CNN-RNN design we originally planned. This gives us a bit more room to scale the model to adjust for per-word accuracy and inference speed. Luckily, as we have not begun extensive work on developing the RNN model, the cost of this change is small. If we are not able to put the HPE model on the FPGA, we would have to be a bit more restrictive on the RNN model size to manage space, but we would still most likely keep the HPE-to-RNN pipeline.
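A quick back-of-the-envelope calculation shows why passing landmark vectors instead of photos shrinks the RNN input. The 224 x 224 RGB image size is an assumed, typical CNN input resolution, not a figure from our design, and the landmark count assumes two hands with 21 three-coordinate landmarks each.

```python
HANDS = 2
LANDMARKS_PER_HAND = 21
COORDS_PER_LANDMARK = 3
landmark_values = HANDS * LANDMARKS_PER_HAND * COORDS_PER_LANDMARK

IMAGE_SIDE = 224      # assumption: a common CNN input resolution
CHANNELS = 3          # RGB
image_values = IMAGE_SIDE * IMAGE_SIDE * CHANNELS

reduction = image_values / landmark_values
print(f"{landmark_values} values per frame vs {image_values} "
      f"(about {reduction:.0f}x fewer)")
```

Under these assumptions each frame drops from roughly 150 thousand values to 126, which is what leaves room to scale the RNN itself.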

There has not been any update to our schedule yet, as most of our pivots do not have a large impact on our scheduling.

Below, we have some images that we used for testing the hand pose estimation model that Mediapipe offers:

 

Impact Statement

While our real-time ASL translation technology is not focused specifically on physical health applications, it has the potential to profoundly impact mental health and accessibility for the deaf and hard of hearing community. By enabling effective communication in popular digital environments like video calls, live-streams, and conferences, our system promotes fuller social inclusion and participation for signing individuals. Lack of accessibility in these everyday virtual spaces can lead to marginalization and isolation. Our project aims to break down these barriers. Widespread adoption of our project could reduce deaf social isolation and loneliness stemming from communication obstacles. Enhanced access to information and public events also leads to stronger community ties, a sense of empowerment, and overall improved wellbeing and quality of life. Mainstreaming ASL translation signifies a societal commitment to equal opportunity – that all individuals can engage fully in the digital world regardless of ability or disability. The mental health benefits of digital inclusion and accessibility are multifaceted, from reducing anxiety and depression to fostering a sense of identity and self-worth. [Part A written by Neeraj]

While many Americans use ASL as their primary language, there are still many social barriers preventing deaf and hard of hearing (HoH) individuals from fully engaging in a variety of environments without a hearing translator. Our product seeks to break down this communication barrier in virtual spaces, allowing deaf/HoH individuals to participate fully alongside people who do not know ASL. The pandemic made platforms like Zoom and Google Meet nearly ubiquitous in many social and professional environments, which drastically increased access to meetings that once required physical attendance. As a result, those who are immunocompromised or have difficulty getting to a physical location now have access to collaborative meetings in ways that weren’t previously available. However, this accessibility hasn’t extended to allow ASL speakers to connect with those who don’t know ASL. Since these platforms already offer audio and visual data streams, some of them, like Zoom, have already integrated automatic closed captioning. The development of models like ours would allow these platforms to broaden their range of users.

This project’s scope is limited to ASL to prevent unnecessary complexity, since there are over 100 different sign languages globally. As a result, the social impact of our product would only extend to places where ASL is commonly used. ASL is commonly used in the United States, Philippines, Puerto Rico, Dominican Republic, Canada, Mexico, much of West Africa and parts of Southeast Asia. [Part B written by Sandra]

 

Although our product does not have direct economic implications, it can affect economic factors in many indirect ways. The first is by improving the working efficiency of a large subgroup of the population. Currently, the deaf and hard-of-hearing community faces many hurdles when working on any type of project because the modern work environment is highly reliant on video conferencing. By adopting our product, they will become more independent and thus able to work more effectively, which will in turn improve the output of their projects. Overall, by improving the efficiency of a large subgroup of the population, we would effectively be improving the production of many projects and thus boosting the economy. Additionally, our entire project is built using open source work, and anything we build ourselves will also be open source. By adding to the open source community, we are indirectly boosting the economy because we are making further developments in this area much easier. We are only targeting ASL, and there are many other sign languages that will eventually have to be added to the model. Thus, further expansion on this idea becomes economically friendly (due to reduced production time and cost savings) and leads to more innovation. [Part C written by Kavish]

Kavish’s Status Report for 2/10/24

This week we finished and presented our proposal to the class and analyzed the feedback from the TAs and professors. Apart from that, I downloaded the Vivado and Vitis tools on my computer to start ramping up the FPGA development. I started exploring the basic steps of understanding the block diagrams and stepping through the synthesis tool flow. I have also been working with Neeraj to understand different human pose estimation models (OpenPose and MediaPipe) and the feasibility of porting them onto the FPGA. I found a detailed resource that has implemented the ICAIPose network on the DPU, and I will try to replicate their steps this week to understand feasibility and performance. As long as I can get the board and tools set up by next week, I should be on schedule and maintain progress according to the Gantt chart.

Neeraj’s Status Report for 2/10/24

For this week, aside from preparing for proposals, my main goal was to do dataset and model exploration. I spent a good amount of time downloading and going through the dataset to determine what exactly we are working with. Our video inputs are clips of about 2 to 4 seconds, each showing a person signing a single word. Along with these clips, we also have numerous JSON files that map a word to each video. The difference between the JSON files is the number of video-word combinations that each file covers. As a result, we not only have a few pre-defined subsets that we can access, but we can also make our own splits in case we want to train on fewer words to make our model smaller.
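Making a custom split from such a JSON index could look like the sketch below. The schema used here (a list of entries with a `word` and a list of `videos`) is hypothetical and chosen only for illustration; the real files may be organized differently.

```python
import json
from typing import Dict, Iterable, List

# Hypothetical index contents; the real JSON schema may differ.
INDEX_JSON = """
[
  {"word": "book",  "videos": ["0001.mp4", "0002.mp4", "0003.mp4"]},
  {"word": "drink", "videos": ["0101.mp4", "0102.mp4"]},
  {"word": "chair", "videos": ["0201.mp4"]}
]
"""


def subset_by_words(index: List[Dict], keep_words: Iterable[str]) -> List[Dict]:
    """Keep only the entries whose word is in keep_words."""
    keep = set(keep_words)
    return [entry for entry in index if entry["word"] in keep]


def top_n_words(index: List[Dict], n: int) -> List[Dict]:
    """Keep the n words with the most video examples."""
    ranked = sorted(index, key=lambda entry: len(entry["videos"]), reverse=True)
    return ranked[:n]


index = json.loads(INDEX_JSON)
small_split = subset_by_words(index, ["book", "chair"])
largest_two = top_n_words(index, 2)
print([e["word"] for e in small_split], [e["word"] for e in largest_two])
```

Selecting the words with the most examples is one simple way to shrink the vocabulary while keeping as much training data per class as possible.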

I have been looking into a couple of ideas and libraries for pre-processing. There are a lot of cases where people have used general video classification, but in many of those cases the models are trained to capture the entirety of the image. For ASL detection, we do not necessarily want to do that, as signs could end up being recognized based on extraneous data, such as the color of the background or the person signing. When looking into human pose estimation to counter this, there are two main libraries that we have been considering: MediaPipe and OpenPose. The main thing I was looking at was what we want to use for landmarking. MediaPipe has pre-trained models for both full-body landmarks and hand orientation landmarks. The main issue is that the full-body model does not give us enough hand estimation detail, while the hand model does not give us any data outside of the hand. There have been numerous past cases in which OpenPose has been used for human pose estimation, so I am looking into whether there are any repositories that would fit our needs. If there aren’t, we might have to create our own model, but I think it would be best to avoid this so that we do not have to create our own dataset. I have also been talking with Kavish about the feasibility of these various options on an FPGA.

For classification itself, I have been looking into what type of RNN model to use, as human pose estimation might be able to give us vectors that we can pass into the RNN model instead of images (I still need to look into this more before finalizing this possible design change). Otherwise, there are a lot of possibilities with CNN-RNN models from Keras, OpenCV, and SciPy that we can use.

As of right now, I believe that I am slightly ahead of schedule. There are a few decisions that I need to raise with the team and revise, as mentioned above. After this, my main deliverable would be to find or develop a good model for our human pose estimation pre-processing and determine whether it would be reasonable to use that for our RNN input.

Team’s Status Report for 2/10/24

As we analyze the project right now, the most significant risks that could jeopardize its success are not being able to port our HPE model onto the FPGA, not having enough data to train a proper RNN, getting inconsistent responses when working in real time, and having communication failures. If there are any major problems with our FPGA, our contingency is to reduce our MVP and port all parts of our pipeline directly onto a Jetson. Although we seem to have enough data right now, we have found a couple of datasets that we could combine, with some additional pre-processing, if they are needed for training. Finally, if there are any failures due to the latency requirement, we have prepared enough slack to increase latency if necessary.

Compared to our original design, we are now considering an augmented version of the pipeline. We plan to run human pose estimation and use the outputs from that model to train our ASL classification RNN. We are still exploring this concept, but we hope that it will reduce our model size. We are also considering moving the RNN from the computer and running it on the Jetson. This will allow us to develop a more compact final product. Of course, this adds another part to our parts list; however, we will discuss this idea more with faculty before implementing this solution.

We are still on schedule and no updates are necessary as of right now.

Sandra’s Status Report for 2/10/24

This week was focused on clearly establishing the boundaries of our project and presenting our proposal during 18-500 class time. As such, it was primarily spent on research and preparing for implementation. Seeing our peers’ project proposals was a useful way to reflect on the strong and weak points of our own project planning. I’m most excited to focus on the integration and application of our various project pieces. Ultimately, we need to connect our hardware and ML algorithms to a functional display that can be easily navigated by a user. To broaden the scope of what our project can be applied to, I’ve spent significant time looking into the possible input and output connection points we can integrate early on, to ensure we’ll be able to smoothly connect our various pieces together when finishing our project.

Moreover, my teammates have a deeper understanding of machine learning in practice than I do, so I’ve also spent time this week continuing to familiarize myself with the process of training and optimizing an LLM and a CV neural network. Since training any ML algorithm is a substantial task, I’ve focused on learning more of the grammar rules ASL uses to begin the process of selecting hyperparameters or otherwise tuning our algorithms to best meet our needs.

Currently, my progress is on schedule as we’re laying the groundwork to ramp up training and processing our datasets, as outlined in our Gantt Chart below.

In the next week, I hope to continue researching the aforementioned topics and collaborate with my teammates to establish the division of labor for the week. Developing a human pose estimation pre-processing model will be the software side’s next main step, and establishing a clear roadmap for achieving our project objectives will be necessary.