Kavish’s Status Report for 3/30/24

I have been working this week on finishing subsystems for the interim demo. I have mostly finalized the viewer, and it should be ready for the demo. Once Sandra gets the LLM component working, I will work with her to integrate the viewer for a better demo. I have finished booting the Kria board and am currently working on integrating the camera with it. For the demo, I aim to show a processed image stream through the FPGA, since we are still finalizing the HPE model. I have been having some trouble working through the kinks of the board, so I am a little behind schedule. I will have to alter the timeline if I am unable to integrate the camera stream by the interim demo.

Sandra Serbu’s Status Report for 3/30/24

This week was focused on building text generation with the OpenAI API and integrating it with the other project components. OpenAI's text generation capabilities are extensive and offer a range of input modalities and model parameterizations. In their tutorials and documentation, OpenAI "recommend[s] first attempting to get good results with prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling." To that end, I started by asking ChatGPT to "translate" sentences already written in ASL syntax. ChatGPT had a lot of success doing this correctly given isolated phrases without much context or prompting. I then went on to describe my prompt, reflecting how GPT will be used in the context of our project, as follows:

I will give you sequential single word inputs. Your objective is to correctly interpret a sequence of signed ASL words in accordance with ASL grammar and syntax rules and construct an appropriate English sentence translation. You can expect Subject, Verb, Object word order and topic-comment sentence structure. Upon receiving a new word input, you should respond with your best approximation of a complete English sentence using the words you've been given so far. Can you do that?

It worked semi-successfully, but ChatGPT had substantial difficulty moving away from the "conversational" context it is expected to function within.

So far, I've used GPT-4 through ChatGPT and some personal credits to work in the OpenAI API playground, but I would like to limit personal spending on LLM building and fine-tuning going forward. I would like to put credits toward gpt-3.5-turbo, since fine-tuning for GPT-4 is still in an experimental access program, and for the scope of our project I expect gpt-3.5-turbo to be a robust model in terms of accurate translation results.

from openai import OpenAI

client = OpenAI()

# `behavior` is passed as the system message; the user messages are the
# word-by-word sign inputs, with one assistant turn included as a worked example.
behavior = (
    "You are an assistant that receives word by word inputs and interprets them "
    "in accordance with ASL grammatical syntax. Once a complete sentence can be "
    "formed, that sentence should be sent to the user as a response. If more than "
    "10 seconds have passed since the last word, you should send a response to the "
    "user with the current sentence."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": behavior},
        {"role": "user", "content": "last"},
        {"role": "user", "content": "year"},
        {"role": "user", "content": "me"},
        {"role": "user", "content": "went"},
        {"role": "user", "content": "Spain"},
        {"role": "assistant", "content": "I went to Spain a year ago."},
        {"role": "user", "content": "where"},
    ],
)

The system message helps set the behavior of the assistant.

Eventually, we would like to use an OpenAI text generation model such that we provide word inputs as a stream and receive only complete English sentences in return.
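As a rough sketch of that eventual flow, the idea is to keep a running log of signed words and re-query the model after each new word. The wrapper class, prompt text, and names below are illustrative placeholders, not a finalized design:

from openai import OpenAI

client = OpenAI()

# Illustrative only: SentenceBuilder and STREAM_PROMPT are placeholders.
STREAM_PROMPT = (
    "You receive signed ASL words one at a time. After each word, respond only "
    "with your best complete English sentence using the words given so far."
)

class SentenceBuilder:
    def __init__(self, model="gpt-3.5-turbo"):
        self.model = model
        self.words = []  # running log of classified signs

    def add_word(self, word):
        """Append a new sign and ask the model for the best sentence so far."""
        self.words.append(word)
        messages = [{"role": "system", "content": STREAM_PROMPT}]
        messages += [{"role": "user", "content": w} for w in self.words]
        response = client.chat.completions.create(model=self.model, messages=messages)
        return response.choices[0].message.content

# Example: feed "last year me went Spain" one word at a time.
builder = SentenceBuilder()
for w in ["last", "year", "me", "went", "Spain"]:
    print(builder.add_word(w))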

SVO (Subject, Verb, Object) is the most common sentence word order in ASL, and Object, Subject, Verb (OSV) order necessarily uses non-manual features (facial expressions) to introduce the object of a sentence as the topic. As a result, we will restrict the scope of our current LLM work to successfully interpreting SVO sign order.

Neeraj’s Status Report for 3/30/24

My main progress over the past week has been preparing for the interim demo. I have put a temporary pause on any significant innovation, mainly modifications to the architecture of either the HPE model or the spoter transformer. Instead, I have been putting more time into generating the necessary data after HPE processing and training the spoter model on the data we want to train on. The only new code I have written is the eval function for the spoter model, which holds public functions that can run inference on a trained model. Since spoter was built for architecture research, its original goal was to evaluate the accuracy and loss metrics of the model; as a result, I needed to save the model post-training and use the given architecture to create our own eval functions for inference. I am currently training the model, so I plan to have the working model and eval function ready for the interim demo if all goes well.
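For reference, a minimal sketch of what that eval flow looks like is below, assuming the PyTorch-based spoter code; the function names, checkpoint path, and input shape are illustrative rather than our final interface:

import torch

def save_checkpoint(model, path="spoter_checkpoint.pth"):
    # Persist only the trained weights; the architecture is rebuilt from code.
    torch.save(model.state_dict(), path)

def load_for_eval(model, path="spoter_checkpoint.pth", device="cpu"):
    model.load_state_dict(torch.load(path, map_location=device))
    model.to(device)
    model.eval()  # disable dropout/batch-norm updates for inference
    return model

@torch.no_grad()
def predict(model, pose_sequence):
    """pose_sequence: per-frame landmark tensor shaped however the model expects."""
    logits = model(pose_sequence.unsqueeze(0))  # add a batch dimension
    return int(torch.argmax(logits))  # index of the predicted sign class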

My main plan for next week is to continue ironing out the interim demo material until we present, and then to focus on debugging, retraining, and architectural changes to improve the model's performance. This mainly involves collecting validation data, tuning hyperparameters to maximize validation accuracy, and looking into further research to see what improvements I can make to the model that would not introduce a large latency burden.

As of right now, I am just slightly behind schedule. If I can finish what we need for the interim demo in time, I should be roughly back on schedule, as I will have the time to test, debug, and configure our pipeline. If I do not finish in time, I will be behind; in that case, I would most likely ask my teammates for help debugging the remaining code, or for help with migration and integration later on.

Team’s Status Report for 3/30/24

Our main risks echo those of our previous reports, namely model accuracy shortfalls, the possibility that we cannot run HPE on the FPGA, etc., for which we have provided mitigation strategies in previous reports. However, the main risk we are considering right now, especially with the interim demo and our final product in mind, is integration between our components. As we get our components up and running, we will soon begin integration, which could surface issues in getting the pieces working together. Our main mitigation strategy is to focus on integration early, before improving the individual components, so that we have a working pipeline first and can then develop, debug, and improve the individual components of our system further.

At this point, we have not made any large design changes to the system, as we have not found a need to. Similarly, we have not changed our schedule from the update we made last week.

Kavish’s Status Report for 3/23/24

I finished up the code for the viewer and synchronizer that I described last week (corrected a few errors and made it product-ready). The only part left for the viewer is determining the exact JSON and message format, which I can change and test when we begin integration. I had hoped to get back to the FPGA development and make more progress on that end like I mentioned last week, but I had to spend time coming up with a solution for the end-of-sentence issue. The final solution we decided on was to run a double query to the LLM while maintaining a log of the previous words given by the classification model. The first query aims to create as many sentences as possible from the logged words, while the second query determines whether the current log forms a complete sentence. The output of the first query is sent as a JSON message to my viewer, which continuously presents the words, and the log is cleared based on the response of the second query; this flow is sketched below. Although I am behind on the hardware tasks, Sandra has now taken over testing and determining the correct prompts to make this idea work, so I will have time next week to get the FPGA working for the interim demo. My goal is to first have a basic HPE model working for the demo (even if it does not meet the specifications) and then spend the rest of the time finalizing it. I have updated the Gantt chart accordingly.
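A minimal sketch of the double-query flow follows, assuming the OpenAI chat API we have been using elsewhere; the prompts, the JSON field name, and send_to_viewer are placeholders since the exact message format is not yet finalized:

import json
from openai import OpenAI

client = OpenAI()
word_log = []  # words received so far from the classification model

def handle_new_word(word, send_to_viewer):
    word_log.append(word)
    joined = " ".join(word_log)

    # Query 1: best-effort sentence(s) from the current log, forwarded to the viewer.
    sentence = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Form the best English sentence(s) from these ASL-ordered words."},
            {"role": "user", "content": joined},
        ],
    ).choices[0].message.content
    send_to_viewer(json.dumps({"sentence": sentence}))  # field name is a placeholder

    # Query 2: does the log form a complete sentence? If so, clear it.
    verdict = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer only 'yes' or 'no': do these words form a complete sentence?"},
            {"role": "user", "content": joined},
        ],
    ).choices[0].message.content
    if verdict.strip().lower().startswith("yes"):
        word_log.clear()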

Neeraj’s Status Report for 3/23/24

The main progress that I have made is looking into the spoter algorithm I found at the end of last week and understanding how it works and how to apply it. It uses pose estimation as input, so it fits in well with our current pipeline. However, we would most likely have to retrain the model on our own pose estimation inputs, which I plan to do later this week. There is also the question of integration, as I am not entirely sure how to use this model for inference after training, which I am also looking to figure out. I have also been looking into our human pose estimation algorithm. As mentioned in earlier posts, we were looking at more holistic landmarking by including facial features in our HPE algorithm, so I have been looking into MediaPipe's holistic landmarking model. While MediaPipe has said they plan to update their holistic model relatively soon, I decided to look at their older GitHub repo and determine what I can do from there. It looks like we should be able to implement it similarly to our hand pose estimation code; a rough sketch is below. We can also reduce the number of landmarks that our classification model works from, as we do not need any information about the lower half of the body.
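The sketch below shows roughly how the older MediaPipe holistic solution could slot in, assuming the legacy mp.solutions.holistic API; the landmark subset and helper name are illustrative:

import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic
UPPER_BODY = range(23)  # pose landmarks 23-32 are hips/legs/feet, which we can drop

def extract_landmarks(frame_bgr, holistic):
    """Return upper-body pose landmarks plus both hand landmark sets for one frame."""
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    pose = []
    if results.pose_landmarks:
        pose = [(lm.x, lm.y, lm.z)
                for i, lm in enumerate(results.pose_landmarks.landmark)
                if i in UPPER_BODY]
    return pose, results.left_hand_landmarks, results.right_hand_landmarks

cap = cv2.VideoCapture(0)
with mp_holistic.Holistic(static_image_mode=False) as holistic:
    ok, frame = cap.read()
    if ok:
        pose, left_hand, right_hand = extract_landmarks(frame, holistic)
cap.release()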

As such, my plan for the next week is to focus on getting a rough pipeline from HPE to classification. There will be a need to debug, as there always is, but given that we have an HPE model and a classification model in spoter, I am going to focus on integrating the two so that they can later be assimilated into the other parts of the pipeline.

I am still slightly behind on my work, as bouncing between various models has resulted in a bit of a setback, even though finding spoter does put me relatively back on track. There has been a slight change in the division of work: after discussing with the team, Sandra is now working on the LLM side of things so that I can focus more on the classification model. Getting a rough pipeline working should put me essentially back on track with the slightly altered schedule, as I should have the time to debug and workshop the spoter architecture.

Team’s Status Report for 3/23/24

Overall, apart from the risks and mitigations discussed in the previous reports about the new spoter classification model and the FPGA integration, the new risk we identified is determining the end of sentences. As of right now, our classification model simply outputs words or phrases based on the input video stream; however, it is unable to analyze the stream to determine where a sentence ends. To mitigate this problem, our current solution is to run a double query to the LLM while maintaining a log of previous responses to determine the end of a sentence. If this idea does not work, we will introduce timeouts and ask users to pause for a second between sentences. No big design changes have been made so far, since we are still building and modifying both the pose estimation and classification models. We are a bit behind schedule due to the various problems with the model and the FPGA, and have thus updated the schedule below. We hope to have at least a partially integrated product by next week, before our interim demo.

Team’s Report for 3/16/24

Apart from the risks and contingencies mentioned in our previous status reports, the latest risk is with our MUSE-RNN model. The GitHub implementation we found was developed in MATLAB, and we had planned on translating and re-adapting that code to Python for our use. However, the core of the code is p-code (encrypted). Thus, we have two options: go forward with our current RNN and plans, or try a new model we found called spoter. If there are problems with our current plan, our contingency is to explore spoter. The other risks around the FPGA and LLM remain the same (since development is ongoing), and our contingencies also remain the same.

There have been no major updates to the design just yet, and we are still mostly on schedule, although we might change the schedule next week because Neeraj might finish LLM development before we finalize our RNN.

Kavish’s Status Report for 3/16/24

This week I split my time between working on the FPGA and developing a C++ project that will be useful for our testing and demo purposes. For the FPGA application, I am currently in the process of booting PetaLinux 2022. Once that process is finished, I will connect our USB camera and try to use my C++ project for a very basic demo. This demo is mainly to finalize the setup of the FPGA (before I develop the HPE model) and test the communication bandwidth across all the ports. The C++ project is an application that receives two streams of data from an Ethernet connection as UDP packets: one thread receives images and another thread receives JSON data. The images and corresponding data must be pre-tagged with an ID on the transmitting end so that I can match them in my receiving application. A third thread pops an image from the input buffer queue and matches its ID with the corresponding JSON data (which is stored in a dictionary); it writes the data onto the image (words, bounding boxes, instance segmentation, etc.) and saves the updated image to an output queue. A fourth thread then gets the data from the output queue and displays it; the matching logic is sketched below. This project mainly relies on OpenCV for image processing and a GitHub library (nlohmann/json) for JSON parsing in C++. I do not think I am behind schedule just yet and should be able to stick to the development cycle as planned. Over the next week I plan to finish booting the FPGA and then test running an image stream across to the Jetson.
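The matching logic is sketched below in Python for readability only; the actual implementation is C++ with OpenCV and nlohmann/json, and draw_annotation and the queue names are placeholders:

import queue
import threading

image_queue = queue.Queue()   # (frame_id, image) pairs from the image-receiver thread
json_by_id = {}               # frame_id -> annotation data from the JSON-receiver thread
output_queue = queue.Queue()  # annotated images ready for the display thread

def matcher_loop():
    """Third thread: pair each image with the JSON entry sharing its ID,
    draw the annotation onto it, and hand the result to the display thread.
    (In the C++ version, access to the shared dictionary is synchronized.)"""
    while True:
        frame_id, image = image_queue.get()
        annotation = json_by_id.pop(frame_id, None)
        if annotation is not None:
            image = draw_annotation(image, annotation)  # placeholder: words, boxes, etc.
        output_queue.put(image)

threading.Thread(target=matcher_loop, daemon=True).start()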

Neeraj’s Status Report for 3/16/24

My main update is regarding the RNN architecture. As mentioned last week, a lot of this week has been spent exploring MUSE-RNN, its capabilities, and whether it can be applied to our current architecture. I found an existing implementation in MATLAB at https://github.com/MUSE-RNN/Share. However, a majority of this code is p-code, meaning that it is encrypted and we can only interact with the model itself. From testing it out, it does seem to work when given an appropriately structured .mat file as a dataset. However, I believe that creating a script to convert our dataset into that format is not a viable use of our time, especially when considering that MATLAB might not be the best language for our code. As such, I have arrived at a few options: move forward with the basic RNN that we have, use the current implementation of MUSE-RNN and disregard the drawbacks of MATLAB as a language, or try developing/finding a new model that could also work. As of right now, I believe the best option is the first one, but I have also found another model to explore called spoter, a transformer that has been used in a very similar way to our use case, as we can see here: https://github.com/matyasbohacek/spoter. I am also interested in looking into this and possibly building a transformer with a similar structure, since this code also works under the presumption of pose estimation inputs, meaning that it would translate cleanly into our current pipeline. On the LLM side, there has been a good amount of progress, as I have experimented more with different ideas from various papers and articles. In particular, from this site (https://humanloop.com/blog/prompt-engineering-101), I found that example-driven and concise, structured prompts are more effective, which I have been working with. I am planning on bringing this up during our group's working session later, as I want to solidify our prompt so that I can dedicate more time to everything else.

I want to finish up the model decision and the classification model as soon as possible, so that is my first priority. The LLM prompt is also a priority, but I want to finish that with the rest of the team, as I believe it is something that can be finished relatively quickly with the three of us working together, especially since we can test it quickly once we gather a few example sentences.

I am currently a bit behind schedule, as the RNN work is taking more time than I anticipated, especially considering the variety of different models I am looking at using. However, there is a counterbalance here, because the LLM prompt generation should take far less time than we had originally anticipated. As a result, we can adjust the schedule to dedicate more time to the RNN rather than to LLM prompting.

Neeraj’s Status Report for 3/9/24

For the past couple of weeks, my main focus has been completing the design report. This mainly involved writing up the design trade-offs and testing/verification sections, as well as wrapping up other sections that needed to be finished. My main point of focus throughout was justifying our choices and making sure that every choice and metric we used had a valid reason behind it. From a more technical standpoint, I have spent more time developing the RNN and looking into the possibility of using MUSE-RNN, which would be interesting to develop and effective for improving accuracy. As of right now, I have the majority of a basic GRU-based RNN developed, compiled from code I could find in outside resources, before working on implementing the MUSE-RNN architecture, as I want to check our current metrics before further complicating the architecture. I have also been experimenting with various word-to-sentence combinations to test different LLM prompts so as not to fall behind on that front either. I started with a basic "translate these words [_,…,_] from ASL into an English sentence" prompt, which produces a lot of variation in responses, not only in the sentence itself but also in the format of the response we get from the model. Since this is a very simple prompt, I am planning on using various papers and studies, such as the Japanese-to-English paper that I referenced beforehand, to develop a more complex prompt. I have not been able to spend as much time on this as I would have liked, so to avoid falling behind, I plan to focus on finishing it relatively quickly or to get help from my groupmates, as I also want to focus on the RNN development.
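For reference, a minimal sketch of the kind of GRU-based classifier described here is below; the layer sizes, class count, and input dimensions are illustrative, not our final configuration:

import torch
import torch.nn as nn

class SignGRUClassifier(nn.Module):
    """Minimal GRU classifier over per-frame pose-landmark features."""

    def __init__(self, input_dim, hidden_dim=128, num_layers=2, num_classes=100):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, num_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, frames, input_dim) flattened landmarks per frame
        _, hidden = self.gru(x)
        return self.classifier(hidden[-1])  # classify from the final hidden state

# Example forward pass: one 60-frame clip with 84 landmark values per frame.
model = SignGRUClassifier(input_dim=84)
logits = model(torch.randn(1, 60, 84))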

My plan as of right now is to develop our RNN training cases by running our dataset through our HPE model and then training the base RNN model that we are using. I am also planning on starting the MUSE-RNN architecture separately. I also want to meet up with my groupmates at some point to focus on developing our prompt for the LLM and see if we can finish that component as quickly as possible.

I believe that I am mostly on schedule, if not slightly behind due to experimenting with our model architecture. As aforementioned, I plan on compensating for this by getting help from my groupmates on the LLM portion of things, as I believe that doing some intensive research on prompt engineering should give us substantial progress, especially in terms of achieving desired formatting and answer structure. This would give me more time to focus on developing MUSE-RNN if we want to pursue that architecture.

Kavish’s Status Report for 3/9/24

My last two weeks were spent working mainly on the design report. This process made me look a bit deeper into the system implementation component of our project than the design presentation did. Since there were numerous ways to achieve the same implementation goal, we analyzed the different methods to understand their advantages in terms of functional and developmental efficiency. For the FPGA development, we will be highly reliant on the Vitis AI tool and its model zoo and libraries. This decision was made with the fastest development time in mind, instead of focusing on optimizations from the get-go. Similarly, for the web application, we decided that we should initially just use the OpenCV library for viewing and debugging purposes and then transition to a more mature display platform later in the project. As for my progress on the HPE development on the FPGA, I am still running into a couple of roadblocks in fully compiling the project on the board. I am referring to others' work online on similar projects and will reach out for additional help as required. As for my progress, I am on the border between on track and a little behind schedule due to the time spent on the design report. I do not need to alter the schedule as of right now, but will update it next week if required.
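As an example of the interim OpenCV-only approach, a minimal debug viewing loop could look like the sketch below; the capture source is a placeholder for whatever stream we end up feeding it:

import cv2

cap = cv2.VideoCapture(0)  # placeholder source; later this will be the incoming stream
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("debug viewer", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
        break
cap.release()
cv2.destroyAllWindows()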

Team’s Status Report for 3/9/24

As of right now, our largest risk remains getting HPE working on the FPGA, but as mentioned before, we are currently in the process of developing that, and we have a backup plan of running HPE on the Jetson if necessary. The other major risk is our RNN model and whether a standard GRU architecture is enough. While we believe it should be, we have planned for this by being able to swap GRU cells for LSTM cells, use new architectures such as MUSE-RNN, or move to transformers if necessary.

As of right now, there has been no significant change in the design plans of our project. We have options to pivot in the case of a component failure or lack of performance, as mentioned above, but we have not yet had to commit to any of these pivots.

With consideration of global factors, our biggest impact will be on the deaf and hard-of-hearing community, particularly those who use American Sign Language (ASL) and rely on digital video communication platforms like Zoom. For these individuals, it is quite a challenge to participate on platforms like Zoom, as they have to rely on either chat features or an interpreter if other people in the meeting are not familiar with ASL. Our project helps those individuals globally by removing that reliance and improving accessibility in communication. This will bring a greater, more effective form of communication to hundreds of thousands of ASL users across the globe.

Our product solution has substantial implications for the way the hearing world can engage with ASL users in virtual spaces. Culturally, hearing-impaired communities in America tend to be overlooked and kept separate from the hearing world. Influence in public spaces, like concerts, and in legislation has been hard to come by, and accommodations tend to be fought for rather than freely given. Part of this issue stems from hearing communities not having personal relationships with or awareness of the hearing impaired. Every step towards cultivating spaces where deaf individuals can comfortably participate in any and all discussions is a step towards hearing-impaired communities having a louder voice in the cultural life of the communities they identify with. An ASL-to-text feature in video communication is a small step towards meeting the needs of hearing-impaired communities, but it opens the doorway to substantial progress. By creating avenues of direct interpersonal communication (without the need for a dedicated translator in every spoken space), the potential for an array of meaningful relationships in professional, academic, and recreational spaces opens up.

While an ASL translation system for digital environments aims to improve accessibility and inclusion for the deaf and hard of hearing community, it does not directly address core environmental needs or sustainability factors. The computational requirements for running machine learning models can be energy-intensive, so our system design will prioritize efficient neural architectures and lean cloud/edge deployment to minimize environmental impact from an energy usage perspective. Additionally, by increasing accessibility to online information, this technology could potentially help promote environmental awareness and education within the Deaf community as an indirect bonus. However, as a communication tool operating in the digital realm, this ASL translator does not inherently solve specific environmental issues and its impact is confined to the human-centric domains of accessibility. While environmental sustainability is not a primary focus, we still aim for an efficient system that has a minimal environmental footprint.

Section breakdown:
  • A was written by Kavish
  • B was written by Sandra
  • C was written by Neeraj