Brian’s Status Report for 12/4

For the final two weeks of the semester I scheduled most of my time for refining the feel of using our product as well as preparing for our final presentation and public demo.

For the refining portion, I added minor smoothing to the movement of the mouse by averaging the locations of the 6 landmarks surrounding the palm of the user’s hand.

Following that, the location of the hand is averaged across 3 frames of calculation to act as a low-pass filter on the noise generated by our CV system.
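A minimal sketch of this smoothing, assuming normalized (x, y) landmark coordinates and placeholder palm-adjacent landmark indices (the exact indices and window handling in our code may differ):

```python
import numpy as np
from collections import deque

PALM_IDS = [0, 1, 5, 9, 13, 17]   # assumed indices of the 6 palm-adjacent landmarks
history = deque(maxlen=3)         # last 3 frames of palm centers

def smoothed_cursor(landmarks):
    """landmarks: (21, 2) array of hand landmark coordinates for one frame."""
    palm_center = landmarks[PALM_IDS].mean(axis=0)  # average the 6 palm landmarks
    history.append(palm_center)
    return np.mean(history, axis=0)                 # 3-frame moving average
```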

Further, in order to work around an issue where the product could not differentiate between a click and a click-and-hold, I segmented each action into its own gesture. Click-and-drag (holding) is now the closing of the hand into a fist, while a simple click is touching your index finger to your thumb, like the ASL number 9.
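For illustration, a rough sketch of how the separated gestures could map onto OS actions using pyautogui; the label names ("fist", "pinch") and the lack of debouncing are simplifications, not our exact implementation:

```python
import pyautogui

def handle_gesture(label, dragging):
    """Map a predicted gesture label to a mouse action; returns the new drag state."""
    if label == "fist" and not dragging:
        pyautogui.mouseDown()      # closing the hand starts click-and-drag
        return True
    if label != "fist" and dragging:
        pyautogui.mouseUp()        # opening the hand releases the drag
        return False
    if label == "pinch":           # index finger touching thumb (ASL 9)
        pyautogui.click()          # simple click
    return dragging
```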


Finally, in preparing for our final demo and presentation, I spent a few hours running experiments involving different model architectures and data augmentations to provide reasonable insight into the tradeoffs and outcomes of our design decisions.

Brian Lane’s Status Update 11/20

This week was spent with further improvements to the model. Specifically, I spent this week performing some data augmentation in order to improve overall validation accuracy.

The model as it currently exists outputs high confidence scores for a couple of gestures, namely an open hand and a fist, when they directly face the camera. For gestures involving a specific number of fingers, the model is less certain and only predicts correctly when the hand is directly facing the camera.

I am still uncertain how to go about improving the accuracy in classifying more complex gestures, but the problem of hand orientation relative to the camera has a straightforward solution in data augmentation.

Using math similar to that in my earlier post where I explained a 2D rotation, I spent the week creating a script to perform random rotations of the training data about the X, Y, and Z axes instead of just the initial Z-axis rotations.

This data augmentation involved projecting the training data points into 3D space at Z=0 and selecting an angle from 0 to 2pi for the Z-axis rotation and from -pi/4 to pi/4 for the X and Y rotations. The rotations were then applied sequentially, and the result was projected back onto the XY plane by simply dropping the Z component.
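A rough sketch of this augmentation; the rotation order and array shapes are assumptions for illustration, while the angle ranges follow the description above:

```python
import numpy as np

def rotate_3d(landmarks, rng=np.random.default_rng()):
    """landmarks: (2, 21) array of x, y coordinates; returns the same shape."""
    pts = np.vstack([landmarks, np.zeros(landmarks.shape[1])])  # lift to Z=0

    ax = rng.uniform(-np.pi / 4, np.pi / 4)   # X-axis rotation
    ay = rng.uniform(-np.pi / 4, np.pi / 4)   # Y-axis rotation
    az = rng.uniform(0, 2 * np.pi)            # Z-axis rotation

    Rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    Ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az),  np.cos(az), 0],
                   [0, 0, 1]])

    rotated = Rz @ Ry @ Rx @ pts              # apply the rotations sequentially
    return rotated[:2]                        # project back: drop the Z component
```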

Next week is shortened for Thanksgiving break, so my ability to work on this project will be reduced. Even so, next week I plan to add two more gesture classes to the OS interface allowing right clicking and scrolling. Time permitting I will also begin collecting tradeoff data for our final report.

Links:

Wikipedia article on 3D rotation matrices
https://en.wikipedia.org/wiki/Rotation_matrix#In_three_dimensions

Brian Lane’s Status Update for 11/13

We had interim demos this week. My group demoed the functional portions of our project to the course TAs on Monday and to professors on Wednesday, and received positive feedback as well as suggestions for UI elements to add and for potential training and dataset augmentations.

Personally, I spent this week improving the gesture recognition model, adding robustness to its predictions by introducing random rotations to the training data. This is done by initializing a random angle theta and constructing a 2×2 rotation matrix

[ cos(theta)  -sin(theta) ]
[ sin(theta)   cos(theta) ]

This matrix is then multiplied by the 2×21 matrix containing the landmark coordinates. These transforms resulted in much better accuracy recognizing the click gesture when the user’s hand was not directly vertical.
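A small sketch of this augmentation step, assuming the landmarks are stored as a (2, 21) NumPy array:

```python
import numpy as np

def rotate_2d(landmarks, rng=np.random.default_rng()):
    """landmarks: (2, 21) array of x, y coordinates; returns a rotated copy."""
    theta = rng.uniform(0, 2 * np.pi)                  # random angle theta
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ landmarks                               # (2, 2) @ (2, 21) -> (2, 21)
```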

I will spend next week further refining the model. Our current implementation is very accurate at classifying open hands and closed fists, but is still struggling somewhat with the handshapes for numbers and the other signs.


Brian Lane’s Status Report 11/6

I spent this week experimenting with various model architectures and training them with our formatted data.

Each model has 42 input features, representing an x and y coordinate for each of the 21 landmarks, and assigns one of 26 possible labels, each corresponding to one class or variant of gesture.

In a paper titled “Gesture Recognition Based on 3D Human Pose Estimation and Body Part Segmentation for RGB Data Input,” various architectures for a problem similar to ours were tested, with much of the success coming from architectures structured as stacks of ‘hourglasses.’ An ‘hourglass’ is a series of fully connected linear layers that decrease in width and then widen back out, the heuristic being that the reduction in nodes, and thus the compression of the information describing a gesture, would reveal salient features of the gesture, which are then expanded again before the process repeats in the next hourglass.

Using this stacked hourglass architecture, I experimented with variations, including ‘skip’ connections that add the input of each compression layer (where the output width is smaller than the input width) to the corresponding decompression layer, so that information lost in compression remains available if needed.

I also experimented with the number of hourglasses that were stacked, the depth of each hourglass, and the compression factor between layers. Along with this, multiple loss and optimization functions were considered, with the most accurate combination being cross-entropy loss and Adagrad.
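A minimal sketch of this kind of stacked-hourglass classifier in PyTorch; the layer widths, compression factor, and learning rate here are illustrative assumptions, not the exact configuration we trained:

```python
import torch
import torch.nn as nn

class Hourglass(nn.Module):
    """Fully connected layers that narrow then widen, with skip connections
    adding each compression layer's input back to the matching output."""
    def __init__(self, width=42, factor=2, depth=2):
        super().__init__()
        widths = [width // factor ** i for i in range(depth + 1)]
        self.down = nn.ModuleList(
            [nn.Linear(widths[i], widths[i + 1]) for i in range(depth)])
        self.up = nn.ModuleList(
            [nn.Linear(widths[i + 1], widths[i]) for i in reversed(range(depth))])
        self.act = nn.ReLU()

    def forward(self, x):
        skips = []
        for layer in self.down:
            skips.append(x)                       # save each compression input
            x = self.act(layer(x))
        for layer in self.up:
            x = self.act(layer(x)) + skips.pop()  # skip connection
        return x

class GestureNet(nn.Module):
    def __init__(self, n_landmarks=21, n_classes=26, n_hourglasses=2):
        super().__init__()
        self.body = nn.Sequential(
            *[Hourglass(2 * n_landmarks) for _ in range(n_hourglasses)])
        self.head = nn.Linear(2 * n_landmarks, n_classes)

    def forward(self, x):                         # x: (batch, 42) landmark features
        return self.head(self.body(x))

model = GestureNet()
criterion = nn.CrossEntropyLoss()                             # cross-entropy loss
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)  # Adagrad optimizer
```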

The most effective model found thus far uses only two hourglasses of the structure described above, achieving a validation accuracy of 86% after being trained for 1000 epochs. The training loss over 350 epochs is pictured below.

I will spend the rest of this weekend preparing for next week’s interim demo, and next week will be spent presenting said demo. Alongside this, further model experimentation and training will be performed.

Brian Lane’s Status Report 10/30

I spent this week preparing to begin training our gesture recognition model next week. For these preparations I needed to apply some data transformations to the HANDS dataset that we are using, which contains a couple hundred images of various hand gestures.

The dataset supplies images of 5 subjects, 4 male and 1 female, in various positions and lighting conditions, as well as annotations of these images containing bounding boxes for each gesture being performed in the image. These annotations are stored in massive text files in CSV format, with a default value of [0 0 0 0] for a bounding box if the gesture does not appear in the image. For example:

image_name,left_fist,right_fist,left_one,right_one,…
./001_color.png,[0 0 0 0],[0 0 0 0],[143 76 50 50],[259 76 50 50],

The above lines would represent a subject holding up the number one on both hands within the specified areas of the image.

This format cannot be used to train our model directly, as we are now using the hand landmark coordinates as the features for training and inference. I spent some time writing a script that takes in the annotations, finds the hands within the image, and applies the hand pose estimation implemented by Andrew.

For example, we would start with the image below.

Then the bounding boxes would be used to crop out the two gestures so they can be handled separately, and then the pose estimation would be applied to find the coordinates of the hand landmarks.

The palm landmark would then be considered the origin (coordinates 0, 0) and all other points in the image would have their location expressed relative to the origin and saved in a new csv file with their corresponding label.
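A rough sketch of this conversion script; estimate_landmarks() stands in for Andrew’s pose-estimation code, and the file names, [x y w h] box layout, and column handling are assumptions for illustration:

```python
import csv
import cv2

def parse_box(cell):
    """Parse a bounding-box cell such as '[143 76 50 50]' into integers."""
    return [int(v) for v in cell.strip("[]").split()]

def to_palm_relative(landmarks):
    """Express all 21 (x, y) landmarks relative to the palm landmark."""
    px, py = landmarks[0]                          # palm point becomes the origin
    return [(x - px, y - py) for x, y in landmarks]

with open("annotations.csv") as f, open("landmarks.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in csv.DictReader(f):
        image = cv2.imread(row["image_name"])
        for label, cell in row.items():
            if label == "image_name":
                continue
            box = parse_box(cell)
            if box == [0, 0, 0, 0]:
                continue                           # gesture absent from this image
            x, y, w, h = box
            crop = image[y:y + h, x:x + w]         # crop to the annotated gesture
            points = estimate_landmarks(crop)      # hypothetical: Andrew's pose estimator
            features = [c for p in to_palm_relative(points) for c in p]
            writer.writerow(features + [label])
```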

Brian Lane’s Weekly Status Report for 10/23

I spent this week setting up some preliminary PyTorch scripts for training our model and adapting the pre-built model we have slated to use. Because of a shift in our planning, I now need to assist Andrew in the creation of our pose estimation software, since the hand pose data, rather than raw image data, will now be used when training our model.

Following this quick pivot, I will spend next week creating and running a script to apply the hand pose estimation to our data set while training our pre-built model in preparation for our upcoming interim demo.

Brian Lane’s Status Update 10/11

I spent the last week drafting and refining our final design presentation, as well as preparing to present it before the class.

Further, I did some more research into gesture recognition and found that many studies use a pose estimation algorithm to identify hand landmarks and then run the estimated pose through their model for gesture detection. This was in contrast to my initial approach, which was to feed in camera data directly. Upon further consideration, this new paradigm makes sense, as it eliminates much of the noise from background colors and imagery and reduces the number of features the model needs to learn.

This week the adaptation of the pre-trained model will begin, as well as work on the final design documents/report. We are still waiting on AWS access to GPUs, which will aid in the speedy training of our model.

Brian Lane’s Status Report for 10/2

I spent this week looking into potential model architectures and machine learning paradigms for use in our project. In this research I found that one of the best pre-built models available through PyTorch is Microsoft’s ResNet, a deep residual convolutional neural network. I’ve decided to train this model as an intermediary to be used in the development of our project while our custom model is trained, tested, and tweaked.

I looked into transfer learning as suggested by Professor Tamal and have found the concept interesting enough to warrant further investigation as to how it would aid in the training of our model.
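A minimal sketch of what adapting a pre-trained ResNet could look like, assuming torchvision and our 26 gesture labels; freezing the backbone is one common transfer-learning setup, not necessarily the one we will end up using:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)        # pre-trained residual CNN
for param in model.parameters():
    param.requires_grad = False                 # freeze the backbone (transfer learning)
model.fc = nn.Linear(model.fc.in_features, 26)  # new head for our 26 gesture labels
```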

Further, I spent some time creating slides for the design presentation and practicing my delivery.

Brian Lane’s Status Report for 9/25

I spent this week researching model designs for gesture recognition as well as further research into datasets, though none were found that would improve upon the data provided by the dataset referenced in our project proposal: https://data.mendeley.com/datasets/ndrczc35bt/1

For the model architecture, it seems the most common and effective approach would be a deep convolutional neural network (deep CNN), as convolutional networks are extremely effective at image classification tasks.

This places me well on schedule, as this week and next are slated for model design and I have set myself up for some experimentation with model hyperparameters this week.

Further, I spent time watching other teams’ design proposals and providing criticism and feedback.


Brian Lane’s Status Report for 9/20

Last week I further refined requirements and use cases for the project through meetings with team members and course staff.

My initial idea of augmented reality document handling was pivoted to be more in line with my groupmates’ interests, becoming a system for interacting with a Windows desktop environment through the user’s hand movements and gestures.

Further, I did cursory research into potential sensors to accomplish this goal including IMUs, infrared or ultrasonic sensors, and computer vision. When computer vision became the most promising option I began research into hand gesture datasets. This research was then added to my team’s project proposal.

This week I will begin setting up a development environment for initial experiments, as well as start designing the gesture recognition model and adapting the aforementioned gesture dataset for our project.