Neeraj’s Status Report for 4/27/24

Over this last week, I have mainly been focusing on changing the model’s preprocessing and retraining it with Kavish, with the goal of avoiding unnecessary details in our input images. As of right now, we are both tackling different fronts of the classification model and improving its overall metrics before we finish up integration by porting it onto the Jetson. Beyond that, I have also been messing with the idea of integrating a quick expression classifier that could aid the LLM. Although it would be a separate network, our current metrics show that we have the latency room to possibly add this in. It is also an idea that we have been toying around with for a while as a group, since it could help in better translating ASL semantics, such as identifying whether a sentence was a question or not. I intend on either finding a pretrained model or a quick dataset to train on so that it is not a large investment, but it is also just an extra jump that we are considering making while our classification model is training, and only if we have the time.

As such, my main priority right now is to finish retraining the classification model, as that is one of the final things we need to finish up for demo day. If there is a lot of time remaining and nothing else for us to do, then we can look into the expression algorithm, but as of right now, I am not really considering it due to our time constraints.

I would say I am around on schedule. I have finished integrating my part of the code with Kavish’s work, so the only thing left to do on my end is retraining the model. I also intend on helping out with further integration with the rest of the group. If anything, I am slightly behind because I had hoped to be done with the model by now, but testing and debugging during integration took longer than expected. Getting the model retrained and helping the rest of the group with integration should get us back on schedule.

Neeraj’s Status Report for 4/20/24

Over the past week, my main progress has been retraining the classification model and integrating it into the pipeline, namely with the HPE model. Since the last classification model was inaccurate, I decided to reprocess all of our WLASL training data and use that to retrain the model. I also did some small testing on a few hyperparameters, namely the learning rate and epoch count. With that, I got a model with about 74%-75% validation accuracy. From there, I focused on developing an inference script based on the spoter evaluation code, which we can use to feed data into the model, receive a softmax output, and find the label associated with it. Since then, I have mainly been working with Kavish on integrating the HPE model and the classification model, determining the data transformations needed to translate the output of the HPE model into the input of the classification model. We have also been running latency tests for our components, looking at inference times and frame rates to make sure that we are staying within our design specification’s latencies.
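As a rough sketch of what the inference script does, here is the softmax-and-lookup step in plain Python; the four-word vocabulary and logit values are made-up stand-ins for the spoter model’s real outputs:

```python
import math

def softmax(logits):
    """Convert raw model logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, labels):
    """Return the label with the highest softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]

# Hypothetical 4-word vocabulary and logits, for illustration only.
labels = ["book", "drink", "computer", "before"]
logits = [0.2, 2.9, 0.4, 1.1]
word, confidence = predict_label(logits, labels)
# "drink" wins here, since it has the largest logit.
```

The real script works the same way, just with the spoter model producing the logits from a skeletal-data input.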

As of right now, I feel like we are a little bit behind schedule, simply because we ran into a couple of unexpected issues during integration. Our main idea for combating this is simply spending more time working on it as a team, so we can figure it out together more efficiently.

For the next week, my main focus is going to be any further testing and integration we might need alongside continuing final presentation work.

Throughout this project, I feel like one of the greatest sources of knowledge I found was previous research papers. Especially when looking into video classification and how various models and architectures worked, reading through papers and understanding the purpose and details of the research was a valuable exercise. For example, reading the MUSE-RNN paper, the spoter paper, the spoter-embeddings paper, and other papers related to WLASL was really useful for figuring out how to tackle this problem in a real-time setting. More specifically, they helped me learn about the intricacies of and relationships between nodes within neural network architectures, whether in RNNs or transformers. This even extends to online articles and videos discussing more base-level topics that I could learn from to better understand these papers, such as Medium articles describing how various architectures work.

Neeraj’s Status Report for 4/6/2024

As of right now, we have a lot of individual components that are working, namely the HPE model and the classification transformer. We can take in a video stream and transform that into skeletal data, then take that skeletal data and process it for the model, and then run the model to train and predict words. However, each of these components is separate and needs to be joined together. As such, my current work is pretty set: I need to focus on developing code that can integrate these. While most processes can be explicitly done, some components, like model prediction, are buried within the depths of the spoter model, so while I can make predictions based on skeletal data after training, I need to pull out and formulate the prediction aspect into its own code that can be used for live/real-time translation. Therefore, my main goal for the next week or so is to develop this integration code. Afterward, I intend to focus on optimization and improving the classification model further. This would involve retraining with new parameters, new architecture, or different dataset sizes, as that would allow us to better approach our benchmark for translation.

For the components that I have been working on, I have a few verification tests to make sure that the HPE-to-spoter pipeline works as intended. The first test is simply validation accuracy on the classification model, which I have already been running, to make sure that the overall accuracy we get on the validation set is high enough, as this would signify that our model generalizes well enough to translate words at a usable accuracy. I will also be latency testing the inference time of the classification model to make sure that it is within the unit latency we set out in our design requirements. As for when the HPE and classification models are integrated, I plan to run another accuracy test on unseen data, whether from a video dataset or a dataset of our own creation, to test generalized accuracy.
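The latency test itself is simple to sketch; the dummy function below is just a stand-in for the real classification model’s forward pass:

```python
import time

def measure_latency(infer, inputs, warmup=3, runs=20):
    """Average wall-clock inference time in milliseconds."""
    for _ in range(warmup):      # warm-up runs are excluded from timing
        infer(inputs)
    start = time.perf_counter()
    for _ in range(runs):
        infer(inputs)
    elapsed = time.perf_counter() - start
    return (elapsed / runs) * 1000.0

# Stand-in for the real model's forward pass.
def dummy_infer(x):
    return sum(v * v for v in x)

avg_ms = measure_latency(dummy_infer, [float(i) for i in range(1000)])
```

Comparing `avg_ms` against the unit-latency budget from our design requirements is then a one-line check.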

As of right now, I am mostly on schedule. I anticipate that integration will be difficult, but that is also not a task I will be working on alone, since the entire team should be working on integration together. As such, I hope that working with the team will speed up progress. Further debugging and optimization should be something I run in the background, analyzing the results later to determine whether we should further optimize our model.

Neeraj’s Status Report for 3/30/24

My main progress over the past week has just been preparing for the interim demo. I have put a temporary pause on any significant innovation, mainly anything concerning modifying the architecture of either the HPE model or the spoter transformer. Instead, I have been putting more time into getting the necessary data after HPE processing and training the spoter model on the data we want to train on. The only thing I have been coding up is the eval function for the spoter model, which should expose public functions that can run inference on a trained model. Since the spoter model was generated for architecture research, its main goal was to evaluate the accuracy and loss metrics of the model. As such, I needed to save the model post-training and utilize the given architecture to create our eval functions for inference. I am currently training the model, so if all goes well, I plan to have the working model and eval function ready for the interim demo.

My main plan for next week is to continue ironing out the interim demo material until we present, and then focus on debugging, retraining, and architectural changes to the model to improve performance. This mainly involves collecting validation data, modifying hyperparameters to maximize validation accuracy, and looking into further research to see what improvements I can make to the model without incurring a large latency burden.

As of right now, I am just slightly behind schedule. If I can finish what we need for the interim demo in time, then I should be relatively back on schedule, as I will have the time to test, debug, and configure our pipeline. In the case that I do not have my work finished in time, I would be behind; I would then most likely ask my teammates for help debugging the code or get help with migration and integration later on.

Team’s Status Report for 3/30/24

Our main risks echo those of our previous reports, namely models not working/accuracy, inability to use the FPGA for HPE, etc., for which we have provided mitigation strategies in previous reports. However, the main risk we are considering right now, especially with the interim demo and our final product in mind, is integration between our components. As we work to get our components up and running, we will soon begin integration, which could raise some issues in getting everything working together. Our main mitigation strategy is focusing on integration early, before improving the individual components. That way, we can have a working pipeline first before we focus on developing, debugging, and improving the individual components of our model further.

At this point, we have not made any large design changes to the system, as we have not found a need to. Similarly, we have not changed our schedule from the update we made last week.

Neeraj’s Status Report for 3/23/24

The main progress I have made is looking into the spoter algorithm I found at the end of last week and understanding how it works and how to apply it. It utilizes pose estimation as input, so it fits well into our current pipeline. However, we would most likely have to retrain the model on our pose estimation inputs, which I plan to do later this week. There is also the question of integration, as I am not entirely sure how this model is used for inference after training, which I am also looking to figure out. I have also been looking into our human pose estimation algorithm. As mentioned in earlier posts, we were looking at adding more holistic landmarking by including facial features in our HPE algorithm, so I have been looking into Mediapipe’s holistic landmarking model. While Mediapipe says it is planning to update its holistic model relatively soon, I decided to look at the older GitHub repo and determine what I can do from there. It looks like we should be able to implement it similarly to our hand pose estimation code. We can also reduce the number of landmarks that our classification model works off of, as we don’t need any info on the lower half of the body.
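Cutting the lower-body landmarks can be sketched like this, assuming MediaPipe’s 33-point pose layout in which indices 25 and up cover the knees, ankles, and feet (worth double-checking against their docs before we commit to it):

```python
# MediaPipe's pose model outputs 33 (x, y, z) landmarks per frame.
# In its documented layout, indices 25-32 cover knees, ankles, and feet,
# so we keep only indices 0-24 (face, arms, torso, hips).
UPPER_BODY = list(range(25))

def filter_landmarks(pose_landmarks):
    """Keep only the upper-body landmarks from one frame."""
    return [pose_landmarks[i] for i in UPPER_BODY]

# One dummy frame of 33 landmarks, for illustration.
frame = [(i * 0.01, i * 0.02, 0.0) for i in range(33)]
upper = filter_landmarks(frame)  # 25 landmarks remain
```

Shrinking the input this way also shrinks the first layer of the classification model, which should help with both training time and inference latency.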

As such, my plan for the next week is to focus on getting a rough pipeline from HPE to classification. There will be a need to debug, as there always is, but given that we have an HPE model and a classification model with spoter, I am going to be focusing on integrating the two so that they can be later assimilated into the other parts of the pipeline.

I am still slightly behind on my work, as bouncing between various models has resulted in a bit of a setback, even though finding spoter puts me relatively back on track. There has been a slight change in the division of work: after discussing with the team, Sandra is now working on the LLM side of things so that I can focus more on the classification model. As such, getting a rough pipeline working should put me basically back on track with the slightly altered schedule, as I should have the time to debug and workshop the spoter architecture.

Neeraj’s Status Report for 3/16/24

My main update is regarding the RNN architecture. As mentioned last week, a lot of this week was spent exploring MUSE-RNN, its capabilities, and whether it can be applied to our current architecture. I found some existing code implementing it in MATLAB at https://github.com/MUSE-RNN/Share. However, the majority of this code is p-code, meaning that it is encrypted and we can only interact with the model itself. From testing it out, it does seem to work when given an appropriately structured .mat file as a dataset. However, I also believe that creating a script to convert our dataset into that format is not a viable use of our time, especially considering that MATLAB might not be the best language for our code. As such, I have arrived at a few options: move forward with the basic RNN that we have, use the current implementation of MUSE-RNN and disregard the possible drawbacks of MATLAB as a language, or try developing/finding a new model that could also work. As of right now, I believe the best option is the first one, but I have also found another model to explore called spoter, a transformer that has been used in a very similar way to our use case, as we can see here: https://github.com/matyasbohacek/spoter. I am interested in looking into this and possibly building a transformer with a similar structure, since this code also works under the presumption of pose estimation inputs, meaning that it would translate cleanly into our current pipeline.

On the LLM side, there has been a good amount of progress, as I have experimented more with different ideas from various papers and articles. In particular, from this site (https://humanloop.com/blog/prompt-engineering-101), I found that providing examples and using concise, structured prompts is more effective, which I have been working with.
I am planning on bringing this up during our group’s working session later, as I want to solidify our prompt to dedicate more time to everything else.
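As a sketch of the “concise, structured” direction, here is a hypothetical prompt builder; the exact wording and the few-shot examples are placeholders, not our final prompt:

```python
def build_prompt(words, examples):
    """Assemble a concise, structured prompt with few-shot examples.

    The wording here is a hypothetical draft for experimentation.
    """
    lines = [
        "You translate ASL gloss sequences into natural English.",
        "Respond with exactly one sentence and nothing else.",
        "",
    ]
    for gloss, sentence in examples:
        lines.append(f"Glosses: {' '.join(gloss)}")
        lines.append(f"Sentence: {sentence}")
        lines.append("")
    lines.append(f"Glosses: {' '.join(words)}")
    lines.append("Sentence:")
    return "\n".join(lines)

# Placeholder example pair; we would swap in pairs we have verified.
examples = [(["ME", "STORE", "GO"], "I am going to the store.")]
prompt = build_prompt(["YOU", "HUNGRY"], examples)
```

Pinning the output to “exactly one sentence” is the part aimed at the response-format variation we saw with the naive prompt.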

I want to finish up the model decision and the classification model as soon as possible, so that is my first priority. The LLM prompt is also a priority, but I want to finish that with the rest of the team, as I believe it is something that can be finished relatively quickly with the three of us working together, especially since we can test it quickly once we gather a few example sentences.

I am currently a bit behind schedule, as the RNN work is taking more time than I anticipated, especially considering the variety of different models that I am looking at. However, there is a counterbalance here: the LLM prompt generation should take far less time than we had originally anticipated. As a result, we can adjust the schedule to dedicate more time to the RNN rather than the LLM prompting.

Neeraj’s Status Report for 3/9/24

For the past couple of weeks, my main focus has been completing the design report. This mainly involved writing up the design trade-offs and testing/verification sections, as well as wrapping up other sections that needed to be finished. My main point of focus throughout was the justification of our choices, making sure that every choice and metric we used had a valid reason behind it. From a more technical standpoint, I have spent more time developing the RNN and looking into the possibility of using MUSE-RNN, which would be interesting to develop and effective for improving accuracy. As of right now, I have the majority of a basic GRU-based RNN developed, compiling code I could find from outside resources before working on implementing the MUSE-RNN architecture, as I want to check our current metrics before further complicating the architecture.

I have also been experimenting with various word-to-sentence combinations to test different LLM prompts, so as not to fall behind on that front either. I started with a basic “translate these words [_,…,_] from ASL into an English sentence” prompt, which produces a lot of variation in response, not only in the sentence itself but also in the format of the response we get from the model. Since this is a very simple prompt, I am planning on utilizing various papers and studies, such as the Japanese-to-English paper that I referenced beforehand, to develop a more complex prompt. I have not been able to spend as much time on this as I would have liked, so to avoid falling behind, I am planning to finish this relatively quickly or get help from my groupmates, as I also want to focus on the RNN development.

My plan as of right now is to try developing our RNN training cases by running our dataset through our HPE model and then training the base RNN model that we are using. I am also planning on starting the MUSE-RNN architecture separately. I also want to meet up with my groupmates at some point to focus on developing our prompt for the LLM and see if we can finish as much of that component as fast as possible.

I believe that I am mostly on schedule, if not slightly behind due to experimenting with our model architecture. As aforementioned, I plan on compensating for this by getting help from my groupmates on the LLM portion of things, as I believe that doing some intensive research on prompt engineering should give us substantial progress, especially in terms of achieving desired formatting and answer structure. This would give me more time to focus on developing MUSE-RNN if we want to pursue that architecture.

Team’s Status Report for 3/9/24

As of right now, our largest risk remains getting HPE working on the FPGA, but as mentioned before, we are currently in the process of developing that and have a backup plan of running HPE on the Jetson if necessary. The other major risk is our RNN model and whether a standard GRU architecture is enough. While we believe it should be, we have planned for this by being able to switch out GRU cells for LSTM cells, use new architectures such as MUSE-RNN, and consider the possibility of moving to transformers if necessary.

As of right now, there has been no significant change in the design plans of our project. We have options to pivot in the case of a component failure or lack of performance, as mentioned above, but we have not yet had to commit to one of these pivots.

Considering global factors, our biggest impact will be on the deaf and hard of hearing community, particularly those who use American Sign Language (ASL) and rely on digital video communication platforms like Zoom. For those individuals, it is quite a challenge to participate on platforms like Zoom, as they have to rely on either chat features or a translator if other people in the meeting are not familiar with ASL. Our project helps those individuals globally by removing that reliance and bringing accessibility to communication. This will bring a greater, more effective form of communication to hundreds of thousands of ASL users across the globe.

Our product solution has substantial implications for the way the hearing world can engage with ASL users in virtual spaces. Culturally, hearing impaired communities tend to be forgotten about and kept separate from the hearing world in America. Influence in public spaces, like concerts, and in legislation has been hard to come by, and accommodations tend to be fought for rather than freely given. Part of this issue stems from hearing communities not having personal relationships with or awareness of the hearing impaired. Every step towards cultivating spaces where deaf individuals can comfortably participate in any and all discussions is a step towards hearing impaired communities having a louder voice in the cultural factors of the communities they identify with. An ASL to text feature in video communication is a small step towards meeting the needs of hearing impaired communities, but it opens the doorway to substantial progress. By creating avenues of direct interpersonal communication (without the need for a dedicated translator in every spoken space), the potential for an array of meaningful relationships in professional, academic, and recreational spaces opens up.

While an ASL translation system for digital environments aims to improve accessibility and inclusion for the deaf and hard of hearing community, it does not directly address core environmental needs or sustainability factors. The computational requirements for running machine learning models can be energy-intensive, so our system design will prioritize efficient neural architectures and lean cloud/edge deployment to minimize environmental impact from an energy usage perspective. Additionally, by increasing accessibility to online information, this technology could potentially help promote environmental awareness and education within the Deaf community as an indirect bonus. However, as a communication tool operating in the digital realm, this ASL translator does not inherently solve specific environmental issues and its impact is confined to the human-centric domains of accessibility. While environmental sustainability is not a primary focus, we still aim for an efficient system that has a minimal environmental footprint.

Section breakdown:
  • A was written by Kavish
  • B was written by Sandra
  • C was written by Neeraj

Neeraj’s Status Report for 2/24/24

The main focus of this week revolved around our design presentation. My work was spent preparing the slideshow at the beginning of the week, helping Kavish get ready to present, and then beginning the transfer of information from the slideshow into the final design report.

I also spent more time this week looking into the RNN and beginning development. I am looking into what libraries would work best for a real-time inference model. I decided to begin development using TensorFlow simply because it is the most predominant library in industry, and it has the largest amount of optimization, so it should run the fastest, lowering our latency as much as possible. As of right now, I am just developing the model, so my main goal over the next week is to finish that so we can get to testing as quickly as possible. We also need to develop a training set for the RNN, so I am currently setting up the code to take our videos and transform them into HPE vector outputs, which we can use to train our RNN. I need to figure out where I should do this processing, as it might take a good amount of computing power, so I need to determine whether we can do this on the ECE machines or if it might be worth investing in AWS services.
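The dataset-generation step can be sketched as follows; the `fake_pose` stub below stands in for whatever per-frame HPE call we settle on:

```python
def video_to_sequence(frames, extract_pose):
    """Turn one video's frames into a sequence of HPE feature vectors.

    extract_pose is passed in as a parameter so this sketch stays
    independent of the library choice (Mediapipe, OpenPose, etc.).
    """
    sequence = []
    for frame in frames:
        vec = extract_pose(frame)
        if vec is not None:  # skip frames where no pose was detected
            sequence.append(vec)
    return sequence

# Stub extractor standing in for the real HPE model.
def fake_pose(frame):
    return [float(frame)] * 4

# Toy dataset: gloss -> list of videos, each video a list of frames.
dataset = {"book": [[0, 1, 2], [3, 4]]}
training = {
    gloss: [video_to_sequence(video, fake_pose) for video in videos]
    for gloss, videos in dataset.items()
}
```

The output per video is one vector sequence, which is exactly the shape of input an RNN expects.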

I believe I am slightly behind schedule, as we expected the RNN model to be finished soon. It might be worth pushing the RNN scheduling a bit further back, as the model might take more time than we anticipated, partially due to workload from other classes. It might also be worth merging the time in which we are testing the RNN and developing it, since part of the RNN development time will include developing the LSTM model for comparison; we can test the GRU model in the meantime.

For next week, I ideally want to finish the base GRU RNN model, as that would keep us as on track as possible and in a good spot heading into spring break, during which I can work on testing and model comparison.

Neeraj’s Status Report for 2/17/24

My main goal for this week was to experiment with various human pose estimation libraries. Primarily, I was focusing on determining whether to use OpenPose or Mediapipe and which of the two would better fit our design pipeline. These libraries have a history of running on smaller IoT devices, meaning either has the potential to work on an FPGA like we intend.

While installing both of these models, I had issues installing the OpenPose models, meaning I might need more time to experiment with them. However, I have been able to test Mediapipe’s hand detection model. It creates 21 landmarks to detect hand position and pose, as well as distinguishing left and right hands. It also outputs vectors that hold the position of each landmark in the image. This means that we do not necessarily have to develop a CNN-RNN fusion model to account for spatial information; instead, we can use these vectors as inputs to an RNN to classify words. I have tested this with a few still photos, as per Mediapipe’s test documentation code, which is in our team report. Combining this with the OpenCV library, I have developed a script that takes in live video from a camera and returns output vectors containing the positions of the landmarks. This script represents the front end of our design pipeline, which we can use for testing and verifying the hardware side of our design.
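A rough sketch of that script, using Mediapipe’s hands solution API; the capture loop is illustrative, and the flattening helper is kept separate so it works without a camera or the libraries installed:

```python
def flatten_landmarks(landmarks):
    """Flatten a list of (x, y, z) landmark tuples into one vector."""
    return [coord for point in landmarks for coord in point]

def run_capture():
    """Stream webcam frames through Mediapipe Hands and print vectors.

    Imports are local so the pure helper above stays usable without
    OpenCV/Mediapipe installed. Based on Mediapipe's solutions API.
    """
    import cv2
    import mediapipe as mp

    cap = cv2.VideoCapture(0)
    with mp.solutions.hands.Hands(max_num_hands=2) as hands:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # Mediapipe expects RGB; OpenCV captures BGR.
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.multi_hand_landmarks:
                for hand in results.multi_hand_landmarks:
                    points = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
                    # 21 landmarks -> one 63-value vector per hand
                    print(flatten_landmarks(points))
    cap.release()
```

In the real pipeline, the flattened vectors would be buffered into per-word sequences rather than printed.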

On another note, I have been looking into how prompt engineering works with LLMs. More specifically, I am looking at the following paper by Masaru Yamada:

https://arxiv.org/ftp/arxiv/papers/2308/2308.01391.pdf

The paper explored the influence of integrating the purpose of the translation and the target audience into prompts on the quality of translations produced by ChatGPT. The findings suggest that including suitable prompts related to the translation purpose and target audience can yield more flexible and higher quality translations that better meet industry standards. Specifically, the prompts allowed ChatGPT to generate translations that were more culturally adapted and persuasive for marketing content as well as more intelligible translations of culture-dependent idioms. The paper also demonstrated the practical application of translation concepts like dynamic equivalence by using prompts to guide creative translations. This paper could provide good insight into translating word fragments into full sentences, as well as what prompts we could begin experimenting with so that we can capture the nuances within ASL.

I am basically on schedule. This human pose estimation code would function as a good foundation for our pre-processing model, which we can also use for hardware testing. We are also on pace for starting prompt engineering for our LLM.

Team’s Status Report for 2/17/24

Our largest major risk as of right now is determining how much we can do on an FPGA. More specifically, we are experimenting with whether we can do human pose estimation on an FPGA, as we believe it could make a strong impact in boosting our performance metrics. We do have a contingency plan of doing HPE and our RNN on an Nvidia Jetson if we are unable to get this working. That said, to best manage this risk, we are trying to frontload as much of that experimentation as possible so that if we do have to pivot, we can pivot early and stay relatively on schedule. Our other risks are echoes of last week’s. We have looked into the latency issues, mainly concerning the bottleneck of human pose estimation. Based on research into past projects using AMD’s Kria KV260, we believe that we should be able to manage our latency so it isn’t a bottleneck.

As per last week, our major change is related to the risk above: we want to try to split the HPE model and our RNN so that the HPE model is on the FPGA and the RNN is on the Jetson. After experimentation, we found that we can just pull the vector data of each landmark from Mediapipe’s hand pose estimation model, allowing us to pass that into an RNN rather than an entire photo. The main reason we made this pivot is that we believe we can have a smaller model on the Jetson compared to the CNN-RNN design we originally planned. This gives a bit more room to scale the model to adjust for per-word accuracy and inference speed. Luckily, as we have not begun extensive work on the RNN model, there is not a large cost of change here either. In the case that we are not able to have the HPE model on the FPGA, we would have to be a bit more restrictive with the RNN model size to manage space, but we would still most likely keep the HPE-to-RNN pipeline.

There has not been any update to our schedule yet, as our pivots do not have a large impact on our scheduling.

Below, we have some images that we used for testing the hand pose estimation model that Mediapipe offers:

 

Impact Statement

While our real-time ASL translation technology is not focused specifically on physical health applications, it has the potential to profoundly impact mental health and accessibility for the deaf and hard of hearing community. By enabling effective communication in popular digital environments like video calls, live-streams, and conferences, our system promotes fuller social inclusion and participation for signing individuals. Lack of accessibility in these everyday virtual spaces can lead to marginalization and isolation. Our project aims to break down these barriers. Widespread adoption of our project could reduce deaf social isolation and loneliness stemming from communication obstacles. Enhanced access to information and public events also leads to stronger community ties, a sense of empowerment, and overall improved wellbeing and quality of life. Mainstreaming ASL translation signifies a societal commitment to equal opportunity – that all individuals can engage fully in the digital world regardless of ability or disability. The mental health benefits of digital inclusion and accessibility are multifaceted, from reducing anxiety and depression to fostering a sense of identity and self-worth. [Part A written by Neeraj]

While many Americans use ASL as their primary language, there are still many social barriers preventing deaf and hard of hearing (HoH) individuals from being able to fully engage in a variety of environments without a hearing translator. Our product seeks to break down this communication barrier in virtual spaces, allowing deaf/HoH individuals to fully participate in virtual spaces with non-ASL speakers. We’ve seen a drastic increase in accessibility to physical meeting spaces due to the pandemic making platforms like Zoom and Google Meet nearly ubiquitous in many social and professional environments. As a result, those who are immunocompromised or have difficulties getting to a physical location now have access to collaborative meetings in ways that weren’t previously available. However, this accessibility hasn’t extended to allowing ASL speakers to connect with those who don’t know ASL. Since these platforms already offer audio and visual data streams, integration of automatic closed captioning has already been done on some of them, like Zoom. The development of models like ours would allow these platforms to broaden their range of users.

This project’s scope is limited to ASL to prevent unnecessary complexity, since there are over 100 different sign languages globally. As a result, the social impact of our product would only extend to places where ASL is commonly used: the United States, the Philippines, Puerto Rico, the Dominican Republic, Canada, Mexico, much of West Africa, and parts of Southeast Asia. [Part B written by Sandra]

 

Although our product does not have direct implications for economic factors, it can affect them in many indirect ways. The first is by improving the working efficiency of a large subgroup of the population. Currently, the deaf and hard-of-hearing community faces a lot of hurdles when working on any type of project because the modern work environment is highly reliant on video conferencing. By adopting our product, they will become more independent and thus able to work more effectively, which will in turn improve the output of their projects. Overall, by improving the efficiency of a large subgroup of the population, we would effectively be improving the production of many projects and thus boosting the economy. Additionally, our entire project is built using open source work, and anything we build ourselves will also be open source. By adding to the open source community, we are indirectly boosting the economy because we are making further developments in this area much easier. We are only targeting ASL, and there are many other sign languages that will eventually have to be added to the model. Thus, further expansion on this idea becomes economically friendly (due to reduction in production time and cost savings) and leads to more innovation. [Part C written by Kavish]

Neeraj’s Status Report for 2/10/24

For this week, aside from preparing for proposals, my main goal was dataset and model exploration. I spent a good amount of time downloading and going through the dataset to determine what exactly we are working with. Our video inputs are 2-to-4-second clips of a person signing a single word. Along with these clips, we also have numerous JSON files that map a word to each video. The difference between the JSON files is the number of video-word combinations each one covers. As a result, we not only have a few pre-defined subsets that we can access, but we can also make our own splits in case we want to train on fewer words to make our model smaller.
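Building our own split can be sketched like this; the JSON schema here is a simplified stand-in, since the real WLASL files carry more fields per entry:

```python
import json

# Simplified stand-in for the dataset's word-to-video mapping; the
# actual WLASL JSON entries include more fields per video.
raw = json.dumps([
    {"gloss": "book", "videos": ["v001.mp4", "v002.mp4"]},
    {"gloss": "drink", "videos": ["v003.mp4"]},
    {"gloss": "computer", "videos": ["v004.mp4", "v005.mp4"]},
])

def make_subset(json_text, num_words):
    """Build a custom split keeping only the first num_words glosses."""
    entries = json.loads(json_text)
    return {e["gloss"]: e["videos"] for e in entries[:num_words]}

split = make_subset(raw, 2)  # smaller vocabulary -> smaller model
```

Writing a subset back out as its own JSON file would then give us a drop-in replacement for the pre-defined splits.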

I have been looking into a couple of ideas and libraries for pre-processing. There are a lot of cases where people have used general video classification, but in a lot of those cases, the models are trained to capture the entirety of the image. For ASL detection, we do not necessarily want that, as signs could end up being recognized based on extraneous data, such as the color of the background or the person signing. Looking into human pose estimation to counter this, there are two main libraries we have been considering: Mediapipe and OpenPose. The main thing I was looking at was what we want to use for landmarking. Mediapipe has pre-trained models for both full-body landmarks and hand orientation landmarks. The main issue is that full-body does not give us enough hand estimation detail, but the hand estimation does not give us any data outside of the hand. There have been numerous past cases in which OpenPose has been used for human pose estimation, so I am looking into whether there are any repositories that would fit our needs. If there aren’t, we might have to create our own model, but I think it would be best to avoid this so that we do not have to create our own dataset. I have also been talking with Kavish about the feasibility of these various options on an FPGA.

For classification itself, I have been looking into what type of RNN model to use, as human pose estimation might be able to give us vectors that we can pass into the RNN model instead of images (I still need to look into this more before finalizing this possible design change). Otherwise, there are a lot of possibilities with CNN-RNN models from Keras, OpenCV, and SciPy that we can use.

As of right now, I believe I am slightly ahead of schedule. There are a few decisions I need to discuss with the team and revise, as mentioned above. After this, my main deliverable is to find/develop a good model for our human pose estimation pre-processing and to determine whether it would be reasonable to use its output as our RNN input.