Neeraj’s Status Report for 4/20/24

Over the past week, my main progress has been retraining the classification model and integrating it into the pipeline, namely with the HPE model. Since the last classification model was inaccurate, I decided to reprocess all of our WLASL training data and use that to retrain the model. I also did some small-scale testing on a few hyperparameters, namely the learning rate and epoch count. With these changes, I got a model with about 74-75% validation accuracy. From there, I focused on developing an inference script based on the spoter evaluation code, which we can use to feed data into the model, receive a softmax output, and find the associated label. Since then, I have mainly been working with Kavish on integrating the HPE model and the classification model, determining the data transformations needed to translate the output of the HPE model into the input of the classification model. We have also been running latency tests for our components, looking at inference times and frame rates to make sure that we are staying within our design specification’s latencies.
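As a rough illustration of that inference path, here is a minimal sketch assuming a trained spoter-style PyTorch model; the `model`, `labels`, and `pose_sequence` names and the tensor shape are placeholders rather than our actual code.

```python
import time
import torch
import torch.nn.functional as F

def classify(model, labels, pose_sequence):
    """Run one pose sequence through the trained classifier and time it.

    `model` is the trained spoter-style network, `labels` is the ordered list
    of WLASL glosses, and `pose_sequence` is a tensor built from the HPE
    output (exact shape depends on the model; placeholder here).
    """
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        logits = model(pose_sequence)          # raw class scores
        probs = F.softmax(logits, dim=-1)      # softmax output described above
        latency_ms = (time.perf_counter() - start) * 1000
    top_prob, top_idx = probs.reshape(-1).max(dim=0)
    return labels[top_idx.item()], top_prob.item(), latency_ms
```

Timing the forward pass in this way is also roughly how we have been spot-checking per-component inference latency against our design-spec budget.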

As of right now, I feel like we are a little behind schedule, simply because we ran into a couple of unexpected issues during integration. Our main plan for combating this is to spend more time working on integration as a team so that we can figure out the remaining issues together more efficiently.

For the next week, my main focus is going to be any further testing and integration we might need, alongside continuing work on the final presentation.

Throughout this project, I feel like one of the greatest sources of knowledge I found was previous research papers. Especially when looking into video classification and how various models and architectures worked, reading through papers and understanding the purpose and details of the research was a valuable way to learn. For example, reading the MUSE-RNN paper, the spoter paper, the spoter-embeddings paper, and other papers related to WLASL was really useful for figuring out how to tackle this problem in a real-time setting. More specifically, they helped me learn about the intricacies of and relationships between nodes within neural network architectures, whether in RNNs or transformers. This even extends to online articles and videos discussing more foundational topics that I could learn from to better understand these papers; for example, Medium articles describing how various architectures work were useful for understanding important concepts.

Neeraj’s Status Report for 3/30/24

My main progress over the past week has been preparing for interim demos. As such, I have put a temporary pause on any significant architectural changes to either the HPE model or the spoter transformer, and have instead been putting more time into getting the necessary data after HPE processing and training the spoter model on the data we want to train on. The only new code I have written is the eval function for the spoter model, which should expose public functions that can run inference on a trained model. Since the spoter model was built for architecture research, its original goal was to evaluate the accuracy and loss metrics of the model; as such, I needed to save the model post-training and utilize the given architecture to create our eval functions for inference. I am currently training the model, so I am planning to have the working model and eval function ready for the interim demo if all goes well.
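As a rough sketch of the kind of eval helpers described above (assuming PyTorch, which the spoter codebase uses; the checkpoint path and the data-loader format are placeholders):

```python
import torch

def save_checkpoint(model, path="checkpoints/spoter_wlasl.pth"):
    # Hypothetical path; persist the trained weights so the eval script can reload them.
    torch.save(model.state_dict(), path)

def evaluate(model, data_loader, device="cpu"):
    # Accuracy over a validation loader assumed to yield (pose_sequence, label) batches.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for poses, labels in data_loader:
            logits = model(poses.to(device))
            preds = logits.argmax(dim=-1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.size(0)
    return correct / max(total, 1)
```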

My main plan for next week is to continue ironing out the interim demo material until we present, and then, after that, to focus on debugging, retraining, and architectural changes to improve the model’s performance. This mainly involves collecting validation data, tuning hyperparameters to maximize validation accuracy, and looking into further research to see what improvements I can make to the model without incurring a large latency burden.

As of right now, I am just slightly behind schedule. If I can finish what we need for the interim demo in time, then I should be relatively back on schedule, as I will have time to test, debug, and configure our pipeline. If I do not finish in time, I will be behind; in that case, I would most likely ask my teammates for help debugging the remaining code or for help with migration and integration later on.

Neeraj’s Status Report for 3/23/24

The main progress I have made is looking into the spoter model I found at the end of last week and understanding how it works and how to apply it. It utilizes pose estimation as input, so it fits well with our current pipeline. However, we would most likely have to retrain the model on our own pose estimation inputs, which I plan to do later this week. There is also the question of integration, as I am not entirely sure how this model is used post-training at inference time, which I am also looking to figure out. I have also been looking into our human pose estimation algorithm. As mentioned in earlier posts, we were looking at adding more holistic landmarking by including facial features in our HPE algorithm. As such, I have been looking into Mediapipe’s holistic landmarking model. While Mediapipe has said that they are planning on updating their holistic model relatively soon, I decided to look at their older GitHub repo and determine what I can do from there. It looks like we should be able to implement it similarly to our hand pose estimation code, and we can reduce the number of landmarks that our classification model works off of, as we don’t need any information on the lower half of the body; a rough sketch of that idea is shown below.
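Here is a minimal sketch of that idea using Mediapipe’s legacy holistic Python solution; the upper-body index cutoff and the test image path are assumptions for illustration, not final choices.

```python
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic
UPPER_BODY_POSE_IDX = range(0, 23)  # assumption: pose indices 23+ cover hips/legs, which we drop

def extract_upper_body_landmarks(frame_bgr, holistic):
    # MediaPipe expects RGB frames.
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    coords = []
    if results.pose_landmarks:
        for i in UPPER_BODY_POSE_IDX:
            lm = results.pose_landmarks.landmark[i]
            coords.extend([lm.x, lm.y])
    for hand in (results.left_hand_landmarks, results.right_hand_landmarks):
        if hand:  # hand landmarks are None when that hand is not visible
            coords.extend(c for lm in hand.landmark for c in (lm.x, lm.y))
    return coords

# Hypothetical usage on a single test frame.
with mp_holistic.Holistic(static_image_mode=True) as holistic:
    frame = cv2.imread("sample_frame.jpg")  # placeholder image path
    if frame is not None:
        print(len(extract_upper_body_landmarks(frame, holistic)))
```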

As such, my plan for the next week is to focus on getting a rough pipeline from HPE to classification. There will be a need to debug, as there always is, but given that we have an HPE model and a classification model with spoter, I am going to focus on integrating the two so that they can later be integrated with the other parts of the pipeline.

I am still slightly behind on my work, as bouncing between various models has resulted in a bit of a setback, even though finding spoter does put me relatively back on track. There has been a slight change in the division of work: after discussing with the team, Sandra is now working on the LLM side of things so that I can focus more on the classification model. As such, getting a rough pipeline working should put me essentially back on track with the slightly altered schedule, as I should have the time to debug and workshop the spoter architecture.

Neeraj’s Status Report for 3/16/24

My main update is regarding the RNN architecture. As mentioned last week, much of this week has been spent exploring MUSE-RNN, its capabilities, and whether it can be applied to our current architecture. I found an existing implementation in MatLab at https://github.com/MUSE-RNN/Share. However, the majority of this code is p-code, meaning that it is encrypted and we can only interact with the model itself. From testing it out, it does seem to work when given an appropriately structured .mat file as a dataset. However, I believe that creating a script to convert our dataset into that format is not a viable use of our time, especially considering that MatLab might not be the best language for our code. As such, I have reached a few options: move forward with the basic RNN that we have, use the current implementation of MUSE-RNN and accept the possible drawbacks of MatLab as a language, or try developing/finding a new model that could also work. As of right now, I believe the best option is the first one, but I have also found another model to explore called spoter, a transformer that has been used in a very similar way to our use case, as we can see here: https://github.com/matyasbohacek/spoter. I am interested in looking into this and possibly building a transformer with a similar structure, since this code also works under the presumption of pose estimation inputs, meaning that it would translate cleanly into our current pipeline.

On the LLM side, there has been a good amount of progress, as I have experimented with different ideas from various papers and articles. In particular, from this site (https://humanloop.com/blog/prompt-engineering-101), I found that concise, structured prompts that include examples are more effective, which I have been working with; a rough sketch of what such a prompt might look like is shown below. I am planning on bringing this up during our group’s working session later, as I want to solidify our prompt so that I can dedicate more time to everything else.
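As a purely illustrative example of the concise, example-anchored structure that article recommends (the glosses and wording here are placeholders, not our actual prompt):

```python
# Illustrative prompt only; the gloss list and phrasing are made up for this sketch.
glosses = ["ME", "STORE", "GO", "YESTERDAY"]

prompt = (
    "You translate ASL glosses into a single natural English sentence.\n"
    "Respond with only the sentence, no explanation.\n\n"
    "Example:\n"
    "Glosses: [I, HUNGRY, EAT, WANT]\n"
    "Sentence: I am hungry and want to eat.\n\n"
    f"Glosses: [{', '.join(glosses)}]\n"
    "Sentence:"
)
print(prompt)
```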

I want to finish up the model decision and the classification model as soon as possible, so that is my first priority. The LLM prompt is also a priority, but I want to finish that with the rest of the team, as I believe it is something that can be finished relatively quickly with the three of us working together, especially since we can test it quickly once we gather a few example sentences to test with.

I am currently a bit behind schedule, as the RNN work is taking more time than I anticipated, especially considering the variety of different models that I am looking at using. However, there is a counterbalance here, because the LLM prompt generation should take far less time than we had originally anticipated. As a result, we can adjust the schedule to dedicate more time to the RNN rather than to the LLM prompting.

Neeraj’s Status Report for 3/9/24

For the past couple of weeks, my main focus has been completing the design report. This mainly involved writing up the design trade-offs and testing/verification sections, as well as wrapping up other sections that needed to be finished. My main point of focus throughout this was the justification of our choices, making sure that every choice and metric we used had a valid reason behind it. From a more technical standpoint, I have spent more time developing the RNN and looking into the possibility of using MUSE-RNN, which would be interesting to develop and effective for improving accuracy. As of right now, I have the majority of a basic GRU-based RNN developed, compiling code I could find from outside resources before working on implementing the MUSE-RNN architecture, as I want to check our current metrics before further complicating the architecture; a rough sketch of this baseline is shown below. I have also been experimenting with various word-to-sentence combinations to test different LLM prompts so as not to fall behind on that front either. I started with a basic “translate these words [_,…,_] from ASL into an English sentence” prompt, which produces a lot of variation in the responses, not only in the sentence itself but also in the format of the response we get from the model. Since this is a very simple prompt, I am planning on utilizing various papers and studies, such as the Japanese-to-English paper that I referenced beforehand, to develop a more complex prompt. I have not been able to spend as much time on this as I would have liked, so to avoid falling behind, I am planning to finish this relatively quickly or get help from my groupmates, as I also want to focus on the RNN development.
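A baseline along these lines might look something like the following Keras sketch; the layer sizes, feature count, and class count are placeholders rather than our actual configuration.

```python
import tensorflow as tf

# Placeholder sizes, not our final configuration.
NUM_LANDMARK_FEATURES = 42   # e.g. 21 hand landmarks * (x, y) per frame
NUM_CLASSES = 100            # size of the WLASL subset we train on

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, NUM_LANDMARK_FEATURES)),  # variable-length landmark sequences
    tf.keras.layers.Masking(mask_value=0.0),              # ignore zero-padded frames
    tf.keras.layers.GRU(128),                              # single GRU layer as the baseline
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```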

My plan as of right now is to develop our RNN training cases by running our dataset through our HPE model and then training the base RNN model that we are using. I am also planning on starting on the MUSE-RNN architecture separately. I also want to meet up with my groupmates at some point to focus on developing our prompt for the LLM and see if we can finish as much of that component as quickly as possible.

I believe that I am mostly on schedule, if not slightly behind due to experimenting with our model architecture. As mentioned above, I plan on compensating for this by getting help from my groupmates on the LLM portion, as I believe that some intensive research on prompt engineering should give us substantial progress, especially in terms of achieving the desired formatting and answer structure. This would give me more time to focus on developing MUSE-RNN if we want to pursue that architecture.

Neeraj’s Status Report for 2/24/24

The main focus of this week was our design presentation. My work was spent preparing the slideshow at the beginning of the week, helping Kavish get ready to present, and then beginning to transfer the information from the slideshow into the final design report.

I also spent more time this week looking into the RNN and beginning development. I am looking into which libraries would work best for a real-time inference model. I decided to begin development using Tensorflow, simply because it is one of the most widely used libraries in industry, with a large amount of optimization behind it, which should help it run fast and keep our latency as low as possible. As of right now, I am just developing the model, so my main goal over the next week is to spend more time finishing that so we can get to testing as soon as possible. We also need to develop a training set for the RNN, so I am currently setting up the code to take our videos and transform them into HPE vector outputs that we can use to train our RNN; a rough sketch of that preprocessing step is shown below. I still need to figure out where to run this process, as it might take a good amount of computing power, so I need to determine whether we can do this on the ECE machines or whether it might be worth investing in AWS services to run it.
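A rough sketch of that preprocessing step, assuming we use Mediapipe’s hand landmarker as the HPE stage (the padding scheme, feature count, and file names are placeholders):

```python
import cv2
import numpy as np
import mediapipe as mp

FEATURES_PER_FRAME = 84  # up to 2 hands * 21 landmarks * (x, y); zero-padded if a hand is missing

def video_to_landmark_sequence(video_path):
    # Per-frame hand-landmark vectors for one training clip.
    sequence = []
    with mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2) as hands:
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frame_vec = []
            if results.multi_hand_landmarks:
                for hand in results.multi_hand_landmarks:
                    frame_vec.extend(c for lm in hand.landmark for c in (lm.x, lm.y))
            frame_vec += [0.0] * (FEATURES_PER_FRAME - len(frame_vec))  # pad missing hands
            sequence.append(frame_vec)
        cap.release()
    return np.array(sequence, dtype=np.float32)

# Hypothetical usage: preprocess one clip into a (num_frames, 84) array for training.
np.save("clip_00339.npy", video_to_landmark_sequence("clip_00339.mp4"))
```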

I believe I am slightly behind schedule, as we expect the RNN model to be finished soon. It might be worth pushing the RNN work a bit further back in the schedule, as the model might take more time than we anticipated, partially due to workload from other classes. As such, it might be worth merging the time in which we are testing the RNN with the time in which we are developing it, since part of the RNN development time will include developing the LSTM model for comparison; we can test the GRU model in the meantime.

For next week, I ideally want to finish the base GRU RNN model, as that would keep us as on track as possible and in a good spot heading into spring break, during which I can work on testing and model comparison.

Neeraj’s Status Report for 2/17/24

My main goal for this week was to experiment with various human pose estimation libraries. Primarily, I focused on determining whether to use OpenPose or Mediapipe and which of the two would better fit our design pipeline. Both libraries have a history of running on smaller IoT devices, meaning either of them could potentially work on an FPGA as we intend.

When installing both of these models, I had issues installing the OpenPose models, meaning I might need more time to experiment with them. However, I have been able to test Mediapipe’s hand detection model. It creates 21 landmarks per hand to capture hand position and pose, and it can distinguish left and right hands. It also outputs vectors holding the position of each landmark in the image. This means that we do not necessarily have to develop a CNN-RNN fusion model to account for spatial information and can instead use these vectors as inputs to an RNN to classify words. I have tested this with a few still photos, as per Mediapipe’s test documentation code, which is in our team report. Combining this with the OpenCV library, I have developed a script that takes in live video from a camera and returns output vectors containing the positions of the landmarks; a rough reconstruction is shown below. This script represents the first stage of our design pipeline, which we can use for testing and verifying the hardware side of our design.
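A rough reconstruction of what that script looks like (the exact parameters and the display window are illustrative, not the final version):

```python
import cv2
import mediapipe as mp

cap = cv2.VideoCapture(0)  # default webcam
with mp.solutions.hands.Hands(max_num_hands=2,
                              min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for handedness, hand in zip(results.multi_handedness,
                                        results.multi_hand_landmarks):
                label = handedness.classification[0].label  # "Left" or "Right"
                vector = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
                print(label, vector[0])  # e.g. the wrist landmark of each detected hand
        cv2.imshow("hands", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```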

On another note, I have been looking into how prompt engineering works with LLMs. More specifically, I am looking at the following paper by Masaru Yamada:

https://arxiv.org/ftp/arxiv/papers/2308/2308.01391.pdf

The paper explored the influence of integrating the purpose of the translation and the target audience into prompts on the quality of translations produced by ChatGPT. The findings suggest that including suitable prompts related to the translation purpose and target audience can yield more flexible and higher quality translations that better meet industry standards. Specifically, the prompts allowed ChatGPT to generate translations that were more culturally adapted and persuasive for marketing content as well as more intelligible translations of culture-dependent idioms. The paper also demonstrated the practical application of translation concepts like dynamic equivalence by using prompts to guide creative translations. This paper could provide good insight into translating word fragments into full sentences, as well as what prompts we could begin experimenting with so that we can capture the nuances within ASL.

I am basically on schedule. This human pose estimation code will serve as a good foundation for our pre-processing model, which we can also use for hardware testing. We are also on pace to start prompt engineering for our LLM.

Neeraj’s Status Report for 2/10/24

For this week, aside from preparing for proposals, my main goal was to do dataset and model exploration. I spent a good amount of time downloading and going through the dataset to determine what exactly we are working with. Our video inputs are roughly 2-to-4-second clips of a person signing a single word. Along with these hand gesture clips, we also have numerous JSON files that map a word to each video. The difference between the JSON files is the number of video-word combinations that each one covers. As a result, we not only have a few pre-defined subsets that we can access, but we can also make our own splits in case we want to train on fewer words to keep our model smaller; a rough sketch of how we might build such a subset is shown below.
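As a sketch of how we could build our own smaller split from the dataset’s JSON; the field names ("gloss", "instances", "video_id") and file name are assumptions based on the word-to-video mappings described above, not confirmed against the dataset.

```python
import json

def build_subset(json_path, max_words=100):
    # Assumed structure: a list of entries, each with a word ("gloss") and its video instances.
    with open(json_path) as f:
        entries = json.load(f)
    subset = {}
    for entry in entries[:max_words]:  # keep only the first N words for a smaller model
        subset[entry["gloss"]] = [inst["video_id"] for inst in entry["instances"]]
    return subset

subset = build_subset("WLASL_v0.3.json", max_words=100)  # placeholder file name
print(len(subset), "words in custom subset")
```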

I have also been looking into a couple of ideas and libraries for pre-processing. There are many cases where people have used general video classification, but in a lot of those cases, the models are trained to capture the entirety of the image. For ASL detection, we do not necessarily want that, as the model could end up recognizing signs based on extraneous data, such as the color of the background or the person signing. Looking into human pose estimation to counter this, there are two main libraries that we have been considering: Mediapipe and OpenPose. The main thing I was looking at was what we want to use for landmarking. Mediapipe has pre-trained models for both full-body landmarks and hand orientation landmarks. The main issue is that the full-body model does not give us enough hand estimation detail, while the hand estimation model does not give us any data outside of the hand. There have been numerous past cases in which OpenPose has been used for human pose estimation, so I am looking into whether there are any repositories that would fit our needs. If there aren’t, we might have to create our own model, but I think it would be best to avoid this so that we do not have to create our own dataset. I have also been talking with Kavish about the feasibility of these various options on an FPGA.

For classification itself, I have been looking into what type of RNN model to use, as human pose estimation might be able to give us vectors that we can pass into the RNN model instead of images (I still need to look into this more before finalizing this possible design change). Otherwise, there are a lot of possibilities with CNN-RNN models from Keras, OpenCV, and Scipy that we can use.

As of right now, I believe that I am slightly ahead of schedule. There are a few decisions that I need to discuss with the team and revise, as mentioned above. After this, my main deliverable is to find or develop a good model for our human pose estimation pre-processing and to determine whether it would be reasonable to use its output as our RNN input.