Alan’s Status Report for 11/6

This week, I continued to develop the mouse movement module and worked on a calibration module to prepare for the Interim Demo. The mouse movement module was updated to support different tracking sensitivities for users at different distances from our input camera. Additionally, if the computer vision fails to detect the hand at any point, the cursor now stays in place and resumes movement from that location once the hand is detected again, rather than jumping to wherever the hand reappears. This update video shows the new mouse movement at a larger distance. As seen in the video, even from across my room, roughly 8-10 feet from the camera, the mouse movement can still precisely navigate over the minimize, resize, and exit buttons in VSCode. The cursor also stays in place whenever the hand is not detected and then continues moving relative to the hand's new position.
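As a rough sketch of the hold-in-place behavior described above (the detect_hand helper and the use of pyautogui here are illustrative placeholders, not our exact module):

# Sketch of the relative-movement logic: the cursor only moves when we have a
# previous hand position to measure against, so a missed detection never causes
# a jump. Assumes detect_hand() returns the hand's pixel position in the camera
# frame, or None when detection fails (hypothetical helper).
import pyautogui

SENSITIVITY = 2.5        # pixels of cursor motion per pixel of hand motion (placeholder)
last_hand_pos = None

def update_cursor(frame):
    global last_hand_pos
    hand_pos = detect_hand(frame)          # hypothetical detector; None if no hand found
    if hand_pos is None:
        last_hand_pos = None               # hold the cursor; re-anchor on the next detection
        return
    if last_hand_pos is not None:
        dx = (hand_pos[0] - last_hand_pos[0]) * SENSITIVITY
        dy = (hand_pos[1] - last_hand_pos[1]) * SENSITIVITY
        pyautogui.moveRel(dx, dy)          # move relative to the current cursor position
    last_hand_pos = hand_pos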

Even with the poor webcam quality, which causes some missed hand detections, the cursor motion remains reasonably smooth and does not jump around unexpectedly.

For the demo, I will also have a calibration module ready that will automatically adjust the sensitivity for users based on their maximum range of motion within the camera’s field of view. Currently, I am on schedule and should be ready to show everything that we have planned for the Interim Demo.
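As a rough sketch of how the calibration could work: observe the user's hand positions for a short window, then scale sensitivity so their comfortable range of motion spans the screen. The function below is only an illustration, and the scaling formula is an assumption rather than the final module.

# Sketch of calibration: hand_positions are camera-frame pixel positions
# collected during a short calibration window.
def calibrate(hand_positions, screen_width=1920, screen_height=1080):
    xs = [p[0] for p in hand_positions]
    ys = [p[1] for p in hand_positions]
    range_x = max(xs) - min(xs)
    range_y = max(ys) - min(ys)
    # A smaller observed range (e.g., a user farther from the camera) yields
    # a higher sensitivity, so the same physical motion still covers the screen.
    sens_x = screen_width / max(range_x, 1e-6)
    sens_y = screen_height / max(range_y, 1e-6)
    return sens_x, sens_y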

Brian Lane’s Status Report 11/6

I spent this week experimenting with various model architectures and training them with our formatted data.

Each model takes 42 input features, an x and y coordinate for each of the 21 hand landmarks, and assigns the input one of 26 possible labels, each corresponding to one class or variant of gesture.

In a paper titled “Gesture Recognition Based on 3D Human Pose Estimation and Body Part Segmentation for RGB Data Input,” various architectures for a problem similar to ours were tested, with much of the success coming from architectures structured as stacks of ‘hourglasses.’ An ‘hourglass’ is a series of fully connected linear layers whose widths decrease and then increase back out. The heuristic behind this is that the reduction in nodes, and thus the compression of the information describing a gesture, forces the network to extract the important features of the gesture, which are then expanded back out before the pattern repeats in the next hourglass.
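As a rough PyTorch sketch of this idea (the layer widths and depths here are placeholders, not the exact architecture from the paper or from my experiments):

# One 'hourglass' of fully connected layers: widths shrink toward a bottleneck
# and then expand back out. Widths are placeholders.
import torch.nn as nn

def make_hourglass(width=42, bottleneck=16):
    return nn.Sequential(
        nn.Linear(width, 32), nn.ReLU(),        # compression layers
        nn.Linear(32, bottleneck), nn.ReLU(),   # bottleneck
        nn.Linear(bottleneck, 32), nn.ReLU(),   # decompression layers
        nn.Linear(32, width), nn.ReLU(),
    )

# Two stacked hourglasses followed by a classification head over the 26 gesture labels.
model = nn.Sequential(
    make_hourglass(),
    make_hourglass(),
    nn.Linear(42, 26),
)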

Using this stacked hourglass architecture, I experimented with several variants, including ‘skip’ connections that add the input of a compression layer (where the output width is smaller than the input width) to the corresponding decompression layer, so that information lost during compression is still available if needed.
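A minimal sketch of one such skip connection, again with placeholder widths:

# An hourglass with a 'skip' connection: the input to the compression layers is
# added back to the output of the decompression layers, so information lost in
# the bottleneck is still available downstream.
import torch.nn as nn

class SkipHourglass(nn.Module):
    def __init__(self, width=42, bottleneck=16):
        super().__init__()
        self.compress = nn.Sequential(nn.Linear(width, bottleneck), nn.ReLU())
        self.decompress = nn.Sequential(nn.Linear(bottleneck, width), nn.ReLU())

    def forward(self, x):
        return self.decompress(self.compress(x)) + x   # skip connection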

I also experimented with the number of hourglasses that were stacked, the length of each hourglass, and the compression factor between layers. Along with this, multiple loss functions and optimizers were considered, with the most accurate combination being cross-entropy loss and Adagrad.
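A sketch of the corresponding training setup with cross-entropy loss and Adagrad; the learning rate is a placeholder, the model is the stacked-hourglass sketch from above, and the random tensors stand in for our formatted landmark data:

# Training setup sketch: cross-entropy loss with the Adagrad optimizer.
import torch
import torch.nn as nn

features = torch.randn(512, 42)           # stand-in for landmark features
labels = torch.randint(0, 26, (512,))     # stand-in for gesture labels

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)

for epoch in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()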

The most effective model found thus far uses only two hourglasses of the structure pictured above, achieving a validation accuracy of 86% after being trained over 1000 epochs. The training loss over 350 epochs is pictured below.

I will spend the rest of this weekend preparing for next week's interim demo, and next week will be spent presenting it. Alongside this, further model experimentation and training will be performed.

Brian Lane’s Status Report 10/30

I spent this week preparing to begin training our gesture recognition model next week. For these preparations, I needed to apply some data transformations to the HANDS dataset that we are using, which contains a couple hundred images of various hand gestures.

The dataset supplies images of 5 subjects, 4 male and 1 female, in various positions and lighting conditions, as well as annotations containing bounding boxes for each gesture performed in each image. These annotations were stored in massive text files in CSV format, with a default value of [0 0 0 0] for a bounding box if the gesture did not exist in the image. For example:

image_name,left_fist,right_fist,left_one,right_one,…
./001_color.png,[0 0 0 0],[0 0 0 0],[143 76 50 50],[259 76 50 50],…

The above lines would represent a subject holding out the number one on both hands within the specified areas of the image.
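A sketch of how one of these rows could be parsed, treating the [0 0 0 0] default as “gesture not present”; the function name and output format are illustrative, and it assumes the row has already been split into its fields:

# Parse one annotation row into a dict of gesture label -> bounding box values,
# skipping the [0 0 0 0] default that marks a gesture as absent.
def parse_row(header, row):
    boxes = {}
    for label, field in zip(header[1:], row[1:]):
        values = [int(v) for v in field.strip('[]').replace(',', ' ').split()]
        if any(values):              # [0 0 0 0] means the gesture is not in the image
            boxes[label] = values    # bounding box values as given in the annotations
    return row[0], boxes             # image path and its gesture boxes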

This format cannot be used to train our model as-is, since we are now using the hand landmark coordinates as the features for training and inference. I spent some time writing a script that takes in the annotations, finds the hand within each image, and applies the hand pose estimation implemented by Andrew.

For example, we would start with the image below.

Then the bounding boxes would be used to crop out each gesture so it can be handled separately, and the pose estimation would be applied to find the coordinates of the hand landmarks.

The palm landmark would then be treated as the origin (coordinates 0, 0), all other landmarks would have their locations expressed relative to that origin, and the result would be saved in a new CSV file with its corresponding label.
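A rough sketch of this conversion pipeline, using MediaPipe Hands as a stand-in for the pose estimation mentioned above, and assuming the bounding boxes are stored as [x y width height] and that the writer argument is an open csv.writer:

# Crop each gesture with its bounding box, run hand pose estimation, shift the
# coordinates so the palm (wrist) landmark is the origin, and append one CSV row
# per gesture: the label followed by 42 relative coordinates.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

def landmarks_for_box(image, box, label, writer):
    x, y, w, h = box
    crop = image[y:y + h, x:x + w]
    result = hands.process(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return                                  # pose estimation found no hand in the crop
    points = result.multi_hand_landmarks[0].landmark
    origin = points[0]                          # wrist/palm landmark used as the origin
    row = [label]
    for p in points:
        row += [p.x - origin.x, p.y - origin.y]
    writer.writerow(row)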