Team Status Report 4/27/24

Coming into the final week, our project design consists of running MediaPipe’s human pose estimation (HPE) and classification models directly on the Jetson and connecting to OpenAI’s API for the LLM portion. A web server hosted locally through Node.js will display the video feed alongside any detected text. This week, we’ve focused on retraining our word classification algorithm and finalizing the integration of project components.

The main risk in the coming week is an unexpected failure of a subsystem before demo day. To combat this, we will document our system as soon as the desired behavior is achieved so that there is recorded evidence of success for the final proof of concept. If retraining the word classification model lowers accuracy or causes failures due to the complexity of 2000 classification labels, we will demo using the smaller word classification models that were trained earlier in the semester.

We are on track with our last schedule update and will spend the majority of the next week finishing project details.

When measuring project performance for final documentation, we used the following unit tests and overall system tests:

  • Overall and Unit Latencies were calculated by running a timer around specific project components (see the sketch after this list).
    • Our overall latency was low enough that we chose to focus on the accuracy and range of word classification rather than on further latency reduction. As such, we’re retraining our word classification model to identify 2000 words as opposed to our current 100.
  • Recognition Rate was tested by measuring whether the classification model produced output (or no output) when run on short video clips.
  • Word Classification Accuracy was measured by running video clips of signs through the classification model and checking the output value against the video’s correct label.
  • Inference Accuracy has been primarily gauged through human feedback when testing various LLM models’ need for reinforcement or fine-tuning. A more complete dataset, predominantly informed by SigningSavvy, will soon be used to collect model-specific accuracy metrics.
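
As a rough illustration of how these measurements were taken, the sketch below shows a generic timer wrapper and a label-checking accuracy loop; classify, labeled_clips, and any component names are placeholders rather than our actual module or dataset names.

import time

def timed(fn, *args, **kwargs):
    """Run one pipeline component and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def classification_accuracy(classify, labeled_clips):
    """labeled_clips is a list of (clip, correct_label) pairs."""
    correct = sum(1 for clip, label in labeled_clips if classify(clip) == label)
    return correct / len(labeled_clips)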

Sandra’s Status Report 4/27/2024

This week, I worked on setting up an IDE environment on the Jetson Nano for Llama 2 and Node.js. Unfortunately, in the process I found that the Jetson uses an aarch64 (ARM) CPU architecture rather than x86_64 (Intel). As a result, the local conda environment needed to enable Llama 2 wasn’t a feasible option. In the initial debugging phase, multiple developers who’d run into the same issue recommended I follow this article as a workaround for a conda-like environment (Archiconda). However, upon trial the Jetson experienced multiple dependency issues that didn’t have simple resolutions after attempting the first 3-4 viable suggestions (like running sudo apt update/upgrade or sudo apt-get autoremove to update/reinstall available libraries). Due to time constraints, I thought it best to follow the previously stated backup plan of working with OpenAI’s API. This allows me to fine-tune the GPT LLM without relying on device- or OS-specific environment variability and setup.
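
For reference, the architecture mismatch can be confirmed directly on the device before choosing an install path; a quick check along these lines:

import platform

# Prints "aarch64" on the Jetson Nano (ARM) and "x86_64" on a typical
# Intel/AMD machine, which determines whether prebuilt x86_64 conda
# binaries will run at all.
print(platform.machine())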

Here, you can see a few syntactically ASL sentences correctly translated into natural language English sentences.

I was able to successfully install Node.js on the Jetson and create a simple local web server that displays webcam input in a browser using the livecam library. As seen to the left, the monitor displays a live stream of the webcam connected to the Jetson. I have limited experience with front-end development but will shortly be adding text and interactive features to the display.

In the final stretch of our schedule, my progress is slightly behind where I’d like it to be due to the extensive time spent on debugging and testing multiple avenues for setting up an adequate LLM without needing to pay out of pocket.

By next week, the whole project will need to be fully integrated from end to end on the Jetson and stress tested. For me, that will mean setting up the proper connection with OpenAI on the Jetson and aiding in any integration issues or debugging.
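
A minimal connectivity check for that OpenAI setup step might look like the sketch below, assuming the API key is exported as OPENAI_API_KEY on the Jetson and gpt-3.5-turbo remains our model of choice.

import os
from openai import OpenAI

# The client can read OPENAI_API_KEY from the environment on its own;
# passing it explicitly makes the dependency obvious on a shared device.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# One-shot request to confirm the Jetson can reach the API end to end.
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
)
print(reply.choices[0].message.content)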

Sandra’s Status Report 4/20/2024

This week I focused on fine-tuning the Llama 2 LLM to modify its stylistic I/O so it better fits our needs. Words are classified one at a time and output with no memory, so a separate program must be responsible for recalling past words and continually interpreting possible sentences. As such, I’ve modified the generation.py file to define the “system” role as an ASL language translator. The system is told: “Your objective is to act as an ASL sign language interpreter. You will be given a sequence of words directly transcribed from signed ASL speech and correctly interpret the words in their given order to create full English sentences. You will receive a list of comma separated words as an input from the user and interpret them to your best approximation of a natural language English sentence in accordance with ASL grammar and syntax rules. Words that haven’t been given and exist in ASL should not appear in the sentence. Past word inputs should continue to be interpreted into a sentence with new inputs. Parts of speech like copulas, articles, adverbs, pluralization, and tense markers should be inserted into full English sentences when appropriate.” This behavior synopsis was developed earlier in the semester while gauging GPT’s ability to achieve our LLM needs without substantial model modification.

Additionally, behavior is further defined in generation.py using examples from a small dataset I collected while researching ASL grammar and syntax rules. Some examples include:

last, year, me, went, Spain → I went to Spain a year ago.
house, I, quiet, enter → I enter the house quietly.
yesterday, I, go, date → Yesterday I went on a date.
tomorrow, vacation, go, I → I am going on vacation tomorrow.
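
These instructions and example pairs could be laid out for Llama 2’s chat interface roughly as sketched below; build_dialog and FEW_SHOT are illustrative names, and the role/content dictionary format follows the chat-completion examples shipped with the Llama 2 repository rather than our exact generation.py changes.

SYSTEM_PROMPT = (
    "Your objective is to act as an ASL sign language interpreter. "
    "You will receive a list of comma separated words and interpret them into "
    "your best approximation of a natural language English sentence."
)

# (words as signed, target English sentence) pairs from the small grammar dataset.
FEW_SHOT = [
    ("last, year, me, went, Spain", "I went to Spain a year ago."),
    ("house, I, quiet, enter", "I enter the house quietly."),
    ("yesterday, I, go, date", "Yesterday I went on a date."),
    ("tomorrow, vacation, go, I", "I am going on vacation tomorrow."),
]

def build_dialog(new_words):
    """Assemble one dialog: system prompt, example turns, then the new input."""
    dialog = [{"role": "system", "content": SYSTEM_PROMPT}]
    for words, sentence in FEW_SHOT:
        dialog.append({"role": "user", "content": words})
        dialog.append({"role": "assistant", "content": sentence})
    dialog.append({"role": "user", "content": ", ".join(new_words)})
    return dialog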

I would also like to use RLHF (reinforcement learning from human feedback) to reinforce correct sentences over incorrect ones and to test for edge cases that may challenge the common SVO sentence structure.

Over the next two weeks I’ll be integrating the final iteration of our LLM translation module with the classification model’s output and feeding its text output to our viewing application.

I am a bit behind schedule due to difficulties with Llama 2 environment setup on Windows. The Llama 2 Git repository is only officially supported on Linux, and multiple debugging efforts have led to numerous issues that seem to be related to operating system configuration. It may be relevant to note that my computer has been sporadically entering BSOD and occasionally corrupting files despite a factory reset, so there are likely device-specific issues I’m running into that don’t have simple or timely solutions. As a result, I felt my time would be better spent looking for alternate implementations. I have tried setting up Llama directly on my Windows computer, on WSL, and through GCP (which ended up being essentially a cloud-based host that required pairing and payment with Vertex AI rather than OpenAI, despite offering $300 in free credits for new users). I am currently trying to set up Llama on the Jetson Nano. If issues persist, I will transfer the knowledge I’ve gained through troubleshooting and working with Llama 2 onto the OpenAI API LLM.

In the next week I hope to have successfully set up and tested Llama 2 on the Jetson Nano and have finalized a Node.js local web server.

Sandra Serbu’s Status Report for 3/30/24

This week was focused on building an OpenAI API integration for text generation and its connection with other project components. OpenAI’s text generation capabilities are extensive and offer a range of input modalities and model parameterizations. In their tutorials and documentation, OpenAI “recommend[s] first attempting to get good results with prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling.” To do so, I started by asking ChatGPT to “translate” ASL when given syntactically ASL sentences. ChatGPT had a lot of success doing this correctly on isolated phrases without much context or prompting. I then went on to describe my prompt, as GPT will be used in the context of our project, as follows:

I will give you sequential single word inputs. Your objective is to correctly interpret a sequence of signed ASL words in accordance with ASL grammar and syntax rules and construct an appropriate English sentence translation. You can expect Subject, Verb, Object word order and topic-comment sentence structure. Upon receiving a new word input, you should respond with your best approximation of a complete English sentence using the words you’ve been given so far. Can you do that?

It worked semi-successfully, but ChatGPT had substantial difficulty moving away from the “conversational” context it’s expected to function within.

So far, I’ve used ChatGPT-4 and some personal credits to work in the OpenAI API playground, but I would like to limit personal spending on LLM building and fine-tuning going forward. I would like to put credits towards gpt-3.5-turbo, since fine-tuning for GPT-4 is in an experimental access program and, for the scope of our project, I expect gpt-3.5-turbo to be a robust model in terms of accurate translation results.
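
If we do go the fine-tuning route, submitting a job for gpt-3.5-turbo would look roughly like the sketch below; asl_examples.jsonl is a hypothetical file name, and each JSONL line would hold one system/user/assistant example conversation in OpenAI’s chat fine-tuning format.

from openai import OpenAI

client = OpenAI()

# Each line of the JSONL file is one training conversation, e.g.:
# {"messages": [{"role": "system", "content": "...translator behavior..."},
#               {"role": "user", "content": "last, year, me, went, Spain"},
#               {"role": "assistant", "content": "I went to Spain a year ago."}]}
training_file = client.files.create(
    file=open("asl_examples.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)  # used to poll job status later

In the meantime, the prompt-engineered call below is the current starting point.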

from openai import OpenAI

client = OpenAI()

# System message describing the translator behavior the assistant should follow.
behavior = "You are an assistant that receives word by word inputs and interprets them in accordance with ASL grammatical syntax. Once a complete sentence can be formed, that sentence should be sent to the user as a response. If more than 10 seconds have passed since the last word, you should send a response to the user with the current sentence."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": behavior},
        {"role": "user", "content": "last"},
        {"role": "user", "content": "year"},
        {"role": "user", "content": "me"},
        {"role": "user", "content": "went"},
        {"role": "user", "content": "Spain"},
        {"role": "assistant", "content": "I went to Spain a year ago."},
        {"role": "user", "content": "where"},
    ],
)
print(response.choices[0].message.content)

The system message helps set the behavior of the assistant.

Eventually, we would like to use an OpenAI text generation model such that we provide word inputs as a stream and only receive text back as complete English sentences.
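
One way that incremental behavior could be approximated with the current chat completions interface is sketched below; add_word is an illustrative helper, and the word list stands in for classifier output rather than a real stream.

from openai import OpenAI

client = OpenAI()

# behavior is the same system prompt shown in the snippet above (abridged here).
behavior = ("You are an assistant that receives word by word inputs and "
            "interprets them in accordance with ASL grammatical syntax.")
messages = [{"role": "system", "content": behavior}]

def add_word(word):
    """Append one newly classified sign and ask for the current best sentence."""
    messages.append({"role": "user", "content": word})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    sentence = response.choices[0].message.content
    messages.append({"role": "assistant", "content": sentence})
    return sentence

for word in ["tomorrow", "vacation", "go", "I"]:
    print(add_word(word))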

SVO (Subject, Verb, Object) is the most common sentence word order in ASL, while Object, Subject, Verb (OSV) order necessarily uses non-manual features (facial expressions) to introduce the object of a sentence as the topic. As a result, we will restrict the scope of our current LLM to successfully interpreting SVO sign order.

Sandra’s Status Report for 2/24/24

My progress is a bit behind schedule and I plan to take steps over the upcoming break to catch up to the project schedule.

In the next week, I hope to develop a meshed hand/facial expression estimation model and compare it to MediaPipe’s holistic feature recognition model.
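
For the comparison baseline, the MediaPipe Holistic model can be queried per frame roughly as below; this is a sketch against the legacy mediapipe Solutions API, with the confidence thresholds chosen arbitrarily.

import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic

cap = cv2.VideoCapture(0)
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV captures frames as BGR.
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # results.face_landmarks, results.pose_landmarks,
        # results.left_hand_landmarks, and results.right_hand_landmarks
        # hold the per-frame landmark sets we would compare against.
cap.release()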

Sandra’s Status Report for 2/10/24

This week was focused on clearly establishing the boundaries of our project and presenting our proposal during 18-500 class time. As such, this week was primarily focused on research and preparing for implementation. Seeing our peers’ project proposals was a useful way to reflect on the strong and weak points of our own project planning. I’m most excited to focus on the integration and application of our various project pieces. Ultimately, we need to connect our hardware and ML algorithms to a functional display that can be easily navigated by a user. To broaden the scope of what our project can be applied to, I’ve spent significant time looking into the possible input and output connection points we can integrate early on to ensure we’ll be able to smoothly connect our various pieces together when finishing our project.

Moreover, my teammates have a deeper understanding of machine learning in practice than I do, so I’ve also spent time this week continuing to familiarize myself with the process of training and optimizing an LLM and a CV neural network. Since training any ML algorithm is a substantial task, I’ve focused on learning more of the grammar rules ASL uses to begin the process of selecting hyperparameters or otherwise tuning our algorithms to best meet our needs.

Currently, my progress is on schedule as we’re laying the groundwork to ramp up training and processing our datasets, as outlined in our Gantt Chart below.

In the next week, I hope to continue researching the aforementioned topics and collaborate with my teammates to establish the division of labor for the week. Developing a human-pose estimation pre-processing model will be the software end’s next main step, and establishing a clear roadmap for achieving our project objectives will be necessary.