Team Status Reports

SEP 21

One significant risk that could jeopardize the success of the project is that it may not be possible to integrate control of a game with the computer vision. To mitigate this risk, we will attempt a simple integration of the two early on to see whether we need to change direction to achieve our goal.

Initially, we planned to use an Apple Watch for haptic feedback. However, we pivoted to designing our own bracelet, which involves custom circuit design, selecting a suitable microcontroller or chip, and developing a PCB layout. The primary reason for this change was to reduce latency in the system, as the Apple Watch introduced delays that impacted real-time feedback. By designing our own hardware, we can optimize the communication protocol, ensuring faster response times. Additionally, the custom bracelet offers greater flexibility in integrating specific actuators (e.g., ERM motors) tailored to our use case, which wasn’t possible with off-the-shelf solutions like the Apple Watch. The costs of this change are increased development time for circuit design and PCB fabrication, but this shouldn’t impact our overall schedule. By taking control of the hardware design, we will also have more freedom to customize the haptic feedback experience and enhance user interaction, with lower power consumption and potentially better ergonomics.

SEP 28

The most significant risk to our project is whether our machine learning pipeline can accurately capture the movements of the player. To mitigate this, we plan to filter and average the data to ensure that the player’s in-game movements are smooth and accurately represented. Should that fail, we could pivot to a different machine learning model or change the kind of input data we expect to receive from the camera; this would require a redesign of our game idea, but we would still be able to make use of our existing hardware and code infrastructure.
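As a first pass at the filtering and averaging mentioned above, a simple sliding-window average over each landmark coordinate is one option. A minimal sketch, assuming one filter instance per coordinate of each tracked landmark (the window size of 5 is an untuned placeholder):

```python
from collections import deque

class MovingAverage:
    """Average the most recent `window` samples of one landmark coordinate."""
    def __init__(self, window=5):
        self.buf = deque(maxlen=window)  # old samples fall off automatically

    def update(self, value):
        self.buf.append(value)
        return sum(self.buf) / len(self.buf)
```

A larger window gives smoother but laggier motion, which is the trade-off we would be evaluating.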

For our system, we decided to shift from pose classification to direct mapping of the MediaPipe-detected landmarks due to the limitations we encountered in collecting and building a comprehensive dataset for boxing pose classification. Training a model to accurately classify boxing poses requires a large and diverse set of labeled data, which we found difficult to acquire. As a result, the classification accuracy was suboptimal for real-time interaction in the game.

By transitioning to direct mapping, we could use the body landmark data more efficiently. This change also enhanced the system’s responsiveness and accuracy, as it allowed for real-time tracking of body movements free of classification errors. Direct mapping provided a more intuitive and immediate translation of real-world gestures into the game, making the overall user experience more immersive.
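For reference, the direct-mapping approach amounts to reading the landmark coordinates MediaPipe already provides and forwarding them to the game, with no classifier in between. A minimal sketch using MediaPipe's Python Pose API (the webcam index and the decision to forward all 33 landmarks are illustrative):

```python
import cv2
import mediapipe as mp

pose = mp.solutions.pose.Pose()
cap = cv2.VideoCapture(0)  # default laptop webcam

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB input; OpenCV captures BGR
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        # 33 normalized (x, y, z) landmarks, forwarded to the game as-is
        coords = [(lm.x, lm.y, lm.z) for lm in results.pose_landmarks.landmark]
```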

Part A (Eric)

One of the goals we had in mind when designing our game was for it to provide health benefits to our users in the form of exercise. Boxing is an extremely physically intensive aerobic exercise with benefits such as improved energy, weight loss, better sleep, and countless others, and we aim for our game to reflect this accurately by capturing the movements of the original exercise, ultimately providing a novel and exciting outlet for physical exercise.

Part B (Taiming)

The social impact of our project extends to various social groups with distinctive cultural, social, and political structures. For instance, in lower-income or underserved communities where access to gyms, fitness programs, or high-cost gaming equipment is limited, this project democratizes access to interactive physical activity by using more affordable and accessible technology, such as a laptop and open-source software. Politically, the project aligns with broader goals of public health initiatives, potentially supporting global (government or NGO) efforts to combat sedentary lifestyles and increase physical activity across various demographics.

Part C (Shithe)

This product meets the specified needs of our consumers with regard to economics in a few ways. The biggest is a much lower initial investment compared to typical competitors, which require consumers to purchase additional hardware such as a console and controllers to play their games. Our product is designed to run on existing hardware, such as a consumer’s laptop, using only the built-in camera and no controllers.

Another way is that our method of distribution is digital, which spares consumers the cost of purchasing a physical copy or device and also allows faster updates. Our digital distribution can take advantage of large platforms like Steam or the Microsoft Store for third-party distribution to consumers around the world.

OCT 5

This week, one significant risk we identified is jitter in the MediaPipe body tracking data. Sudden changes in keypoint positions due to noisy data can reduce the system’s effectiveness and degrade the user experience. To manage this risk, we have started to research and test filters, such as Kalman and One-Euro filters, to smooth the movement of the keypoints and reduce abrupt jumps. If jitter persists, the contingency plan involves testing more advanced filtering methods or implementing Z-axis scaling to improve stability.
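For the One-Euro filter in particular, a minimal single-channel sketch of the algorithm we are evaluating follows; the cutoff and beta values are untuned placeholders, and one filter instance would run per landmark coordinate.

```python
import math

class OneEuroFilter:
    """One-Euro filter: an adaptive low-pass filter for noisy real-time signals."""
    def __init__(self, freq, min_cutoff=1.0, beta=0.05, d_cutoff=1.0):
        self.freq = freq              # sampling rate in Hz (e.g. camera FPS)
        self.min_cutoff = min_cutoff  # smoothing at rest (placeholder value)
        self.beta = beta              # how fast the cutoff rises with speed
        self.d_cutoff = d_cutoff
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        # alpha = 1 / (1 + tau/Te), with tau = 1/(2*pi*cutoff) and Te = 1/freq
        tau = 1.0 / (2 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * self.freq)

    def __call__(self, x):
        if self.x_prev is None:
            self.x_prev = x
            return x
        # filtered estimate of the signal's velocity
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1 - a_d) * self.dx_prev
        # cutoff rises with speed: heavy smoothing when slow, responsive when fast
        a = self._alpha(self.min_cutoff + self.beta * abs(dx_hat))
        x_hat = a * x + (1 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```

The appeal of this filter for our case is that it smooths heavily while a keypoint is nearly still (suppressing jitter) but stays responsive during fast punches.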

On the hardware side, one significant risk is that either the battery or the motor will be unable to meet our requirements. To increase modularity and allow components to be swapped out in a worst-case scenario, we decided to add header pins to the PCB rather than design the board around specific components.

On the video game side, one significant risk is that converting MediaPipe poses into the video game environment from scratch may not be feasible within a short time frame. This risk is being managed by making the integration of computer vision pose data the current main focus of video game development. A contingency plan is to use prebuilt animations, triggered based on analysis of the user’s poses, to simulate real-time imitation of their movements, since many in-depth prebuilt animations already exist.

OCT 19

One risk for the project is failing to integrate computer vision into our game properly. This will be managed by developing an early prototype connecting the video game to the computer vision controls this week. If direct mapping with smoothing between points does not pan out, the contingency plan is to use prebuilt animations, which are already in the game, triggered by detected movement in the users’ arms.

Another risk is that developing very sophisticated AI behavior could prove too difficult within our time constraints. This will be managed by starting with basic AI behavior and adding more complex actions over time. A contingency plan is to use the simplest AI behavior but increase the game’s difficulty in other ways that keep it fun and challenging.

Part A (Eric)

One goal for our project is to improve global health by making exercise more accessible and enjoyable. Accessibility is maximized by designing our software to run using only a standard laptop camera, so anyone with a laptop can use our system (minus the optional hardware wristband). For enjoyability, we aim to create an innovative and refreshing game that will be fun for users.

Part B (Taiming)

Our project considers cultural factors by focusing on inclusivity and accessibility to create an immersive gaming experience. We designed the system to promote physical exercise and engagement without the need for expensive or bulky equipment, which aligns with diverse socio-economic conditions (as discussed earlier). Additionally, the game’s intuitive computer vision control scheme eliminates language or literacy barriers, enabling broader participation across different communities. The choice of a boxing-themed game taps into a globally recognized sport, ensuring that the gameplay resonates with various cultural groups, thus enhancing its appeal and relevance across regions.

Part C (Shithe)

This product has been designed with environmental factors in mind, meeting needs for energy conservation and electronic waste reduction. Energy is conserved because the product is designed to run on existing hardware like laptops rather than requiring a dedicated gaming console, which lowers the user’s overall energy usage. In addition, electronic waste is reduced because no external controllers or new gaming systems need to be developed or purchased by our users to run the game. The haptic-feedback watch is optional, so a user can play the game after a simple download if they already have a laptop or desktop with a camera.

OCT 26

Our most significant risk remains failing to integrate computer vision into our game properly. This is being, and will continue to be, managed by developing an early prototype connecting the video game to the computer vision controls this week. The contingency plan is unchanged: if direct mapping with smoothing between points does not pan out, we will use the prebuilt animations already in the game, triggered by detected movement in the users’ arms. We began integrating the computer vision with the game controls this week and have run into issues, but we are continuing to tackle them.

There were no changes to the existing design of the system.

NOV 2

Through a review of the VNect model’s logic, we identified why our initial version struggled to maintain accurate coordinate-to-avatar mappings. The original implementation lacked the necessary handling of each joint’s parent-child hierarchy and inverse rotations, resulting in misaligned bones and distorted movements. By setting up each JointPoint with a clear parent-child relationship and applying inverse rotations, we can maintain the natural orientation of each bone in the skeletal structure. However, implementing this solution presents its own challenges. While we now understand the corrective adjustments needed for accurate joint mapping, effectively coding and integrating these changes into the existing framework requires careful calibration.
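To make the fix concrete: the core of the correction is expressing each bone's world rotation in its parent's frame by composing it with the inverse of the parent's world rotation. A rough sketch of that math, written in Python with SciPy for readability (the actual implementation is in our Unity C# scripts; `look_rotation` mimics Unity's LookRotation):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def look_rotation(forward, up=np.array([0.0, 1.0, 0.0])):
    """Rotation whose Z axis points along `forward` (Unity-style LookRotation).
    Assumes `forward` is not parallel to `up`."""
    f = forward / np.linalg.norm(forward)
    r = np.cross(up, f)
    r /= np.linalg.norm(r)
    u = np.cross(f, r)
    return R.from_matrix(np.column_stack([r, u, f]))

def local_bone_rotation(parent_world, bone_direction):
    """Express a bone's world rotation in its parent's frame: composing with
    the inverse of the parent's world rotation undoes inherited rotation."""
    world = look_rotation(bone_direction)
    return parent_world.inv() * world
```

Without the `parent_world.inv()` term, each bone inherits its ancestors' rotations on top of its own, which is consistent with the misaligned bones we observed.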

NOV 9

This week, we implemented the MediaPipe landmark-to-Unity game avatar algorithm, achieving a milestone in our project. Our approach involves mapping key joint landmarks captured by MediaPipe to the corresponding parts of the Unity avatar. We used a dictionary that aligns MediaPipe joints with their Unity counterparts. By employing quaternion inversions and transformations, the algorithm dynamically adjusts joint rotations, enabling realistic rotations based on each joint’s calculated 3D orientation. Additionally, we used vector lerping to create a smoothing effect, which minimizes jitter and enhances the avatar’s fluidity in response to the player’s motions.
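Two of the pieces described above are easy to show in isolation: the joint dictionary and the lerp-based smoothing. A sketch in Python for brevity (the real mapping lives in our Unity C# code; the avatar bone names and smoothing factor are illustrative, while the indices are MediaPipe's standard pose landmark indices):

```python
# MediaPipe pose landmark index -> avatar bone (bone names are illustrative)
MP_TO_AVATAR = {
    11: "LeftUpperArm",   # left shoulder
    13: "LeftLowerArm",   # left elbow
    15: "LeftHand",       # left wrist
    12: "RightUpperArm",  # right shoulder
    14: "RightLowerArm",  # right elbow
    16: "RightHand",      # right wrist
}

def lerp(prev, new, t=0.3):
    """Blend the previous position toward the new one each frame; smaller t
    means heavier smoothing (less jitter, more lag)."""
    return prev + (new - prev) * t
```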

Check out the link below for a short demo.

https://drive.google.com/file/d/12WlJTbA2-HPqiBN8q2q5dFiIJFpqSk_l/view?usp=drive_link

We also discovered that the PCB connects to the wrong pin on the Arduino for power, but this can be fixed on the board by adding a single jumper wire, without needing to redesign and print a new board.

NOV 16

For validation, we need to ensure that our entire pipeline is smooth and responsive. All user gestures should be detected by our camera and properly translated to our game environment, and our game environment must be able to transmit a signal to our physical device that triggers a detectable vibration. All of this must happen with little delay to keep our game as responsive as possible. To accomplish this, we will first test our in-game detection: a user will throw a large number of punches, and we will measure how many are registered by our system, aiming for a detection accuracy of >= 95% and adjusting our computer vision code if necessary. We will also have a test user throw punches with the watch on their wrist and report whether there is a noticeable delay between the punch and the haptic feedback, adjusting our Python/Arduino code if necessary.
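A sketch of how the punch-detection test could be scored (`punch_events()` is hypothetical; in practice we would read the game's detection log while a tester throws a known number of punches):

```python
def detection_accuracy(events, punches_thrown):
    """Fraction of thrown punches the system registered; target >= 0.95."""
    detected = sum(1 for e in events if e == "PUNCH_REGISTERED")
    return detected / punches_thrown

# e.g. detection_accuracy(punch_events(), punches_thrown=100)
```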

The first tests run for the video game portion of the project involved the colliders and collision detection of the player character. First, tests were done in a simple environment with the debug console to check whether the colliders on the player’s fists registered collisions with a simple collider on a wall. Beyond that, we tested whether collisions could be detected with an enemy player model, checking that separate collisions were detected for the player’s hands hitting the enemy’s body, head, and arms. Later, the damage calculation scripts were tested as well, dynamically applying damage to the enemy and player depending on where each was hit, and we verified that the HP sliders connected to the player and the enemy moved down correspondingly.
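The damage logic under test is essentially a lookup from hit location to damage amount. A simplified sketch of the idea (the region names and damage values are placeholders, not the game's actual tuning):

```python
# Placeholder per-region damage values; the real tuning lives in the Unity scripts
DAMAGE_BY_REGION = {"head": 10, "body": 6, "arms": 3}

def apply_hit(hp, region):
    """Subtract location-dependent damage and clamp HP at zero,
    which is what drives the HP sliders down."""
    return max(0, hp - DAMAGE_BY_REGION.get(region, 0))
```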

Next, tests were done in the boxing ring environment to make sure there were no bugs that allowed a player to get stuck in the environment or move through the boxing ring. The latest testing involved the integration of the computer vision and the game, verifying that our character model could be set to a humanoid rig, which our computer vision movement code requires.

Some later tests will address our use case requirements regarding game FPS and ping, which will need to at least meet our targets. More testing will also be done on the robustness and speed of the in-game character driven by the computer vision movement.

NOV 30

This week, we met up a few times to fully integrate and test the full pipeline. We first individually made sure each portion, along with any added capabilities such as the automatic enemy attacking script for the game, was working properly. Then we integrated and tested the computer vision controls with the video game. The player needs to stand a few feet back for the computer vision to see the entire body and for the movement translation to be smoother. We saw success with the in-game movement and experimented with different camera angles to help the player better tell what is going on. Next, we tested the integration between the vibrating watch and the video game and found success there as well: both vibration types, which depend on whether the enemy or the player lands a hit, were functional. Lastly, we verified that the entire pipeline works with UDP communication from the Python computer vision script to Unity and, at the same time, the other line of UDP communication from Unity to the watch’s Python script.
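For reference, both legs of the pipeline are plain UDP datagrams. A minimal sketch of the two hops (the ports, payload format, and message strings are illustrative, not the exact values in our scripts):

```python
import json
import socket

# Hop 1: computer vision script -> Unity
coords = [(0.0, 0.0, 0.0)] * 33  # placeholder for the MediaPipe landmark list
cv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cv_sock.sendto(json.dumps(coords).encode(), ("127.0.0.1", 5052))

# Hop 2: Unity -> watch script, which relays the event to the Arduino over Bluetooth
watch_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
watch_sock.bind(("127.0.0.1", 5053))
data, _ = watch_sock.recvfrom(1024)  # e.g. b"PLAYER_HIT" or b"ENEMY_HIT"
```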

Throughout this process, we tested against the use case requirement metrics set previously. The game met the FPS requirement of greater than or equal to 40, as well as the ping requirement of less than or equal to 50 ms between the computer vision Python script and Unity. Excluding mechanical delays, such as the time it takes the motor to spin up for the vibration on the watch, the entire pipeline also took under 100 ms. We are well on track and will be adding more detail to our testing for the upcoming presentation.

DEC 7

Unit Tests and Overall System Tests Conducted:

Watch Testing

1. Weight Test

• Purpose: Measure the weight of the watch, with and without the battery.

• Target: 100 g (comparable to an average watch).

• Result: The watch weighed 53 g, significantly lighter than the target.

2. Vibration Test

• Purpose: Test the vibration capability to ensure distinct levels can be felt.

• Target: Two distinct levels of vibration.

• Result: Achieved two distinct levels successfully.

3. Power Test

• Purpose: Measure the watch’s power consumption during idle and vibration states.

• Target: 1.2 A maximum current, ensuring ~20 minutes of operation with a 400 mAh battery.

• Result: Measured between 50 mA (idle) and 77 mA (vibrating), well below the target, enabling longer operation time.

4. Distance Test

• Purpose: Measure the maximum Bluetooth connectivity distance.

• Target: 3 meters (aligned with expected gameplay range).

• Result: Achieved 20+ meters, far exceeding the target range.

5. Ping Test

• Purpose: Measure the delay in Bluetooth signal transmission between Python and Arduino.

• Target: ≤ 30 ms.

• Result: Achieved a delay of 22 ms, within the target range.
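The ping measurement itself can be done with a simple echo round trip, as in the sketch below (it assumes the Arduino firmware echoes each byte it receives; the port name and baud rate are placeholders):

```python
import time
import serial  # pyserial

port = serial.Serial("/dev/rfcomm0", 9600, timeout=1)  # placeholder port/baud
t0 = time.perf_counter()
port.write(b"p")
port.read(1)  # block until the Arduino echoes the byte back
rtt_ms = (time.perf_counter() - t0) * 1000
print(f"approx. one-way delay: {rtt_ms / 2:.1f} ms (target <= 30 ms)")
```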

Computer Vision Testing

1. Accuracy Test

• Purpose: Measure the accuracy of avatar movements corresponding to real human gestures.

• Target: At least 80% accuracy, with < 10-degree differences in joint angles in 80% of timestamps over five frames.

• Result:

  • Front View: 92.5% accuracy.

  • Side View: 84.5% accuracy.

  • Top View: 70.25% accuracy (underperformed due to MediaPipe’s depth estimation limitations).

• Observations: Top-view accuracy was limited by hardware and depth sensing constraints.
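The accuracy metric above depends on comparing joint angles between the reference motion and the avatar. A sketch of the angle computation, assuming joint positions are 3D NumPy vectors (the 10-degree tolerance comes from the target above):

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by segments b->a and b->c,
    e.g. the elbow angle from shoulder, elbow, and wrist positions."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def frame_matches(ref_angle, avatar_angle, tol_deg=10.0):
    """A timestamp counts toward accuracy if the two angles agree within tol_deg."""
    return abs(ref_angle - avatar_angle) <= tol_deg
```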

Video Game Testing

1. Frame Rate Test

• Purpose: Ensure smooth gameplay with no frame drops below 40 FPS.

• Result: Achieved 50 FPS on a lower-end laptop and 250 FPS on a desktop.

2. Latency Test

• Purpose: Measure the delay between human movements and avatar response.

• Target: ≤ 50 ms latency.

• Result: Achieved a latency of 35 ms, well within the acceptable range.

3. Hit Detection Test

• Purpose: Verify hit detection and dynamic damage calculations based on hit location.

• Result: Passed, with the system accurately detecting hits and calculating damage properly.

Findings and Design Changes:

1. Watch Design

• The watch’s weight (53 g) came in under the target, improving comfort for the user.

• Power consumption was significantly lower than anticipated, enabling prolonged gameplay.

• No changes required, as results exceeded expectations.

2. Computer Vision Accuracy

• Top-view mapping accuracy (70.25%) highlighted limitations in MediaPipe’s depth estimation.

• Recommendation: Implement a depth camera in future iterations to improve accuracy for 3D tracking.

3. Video Game Performance

• Both latency (35 ms) and frame rate (50 FPS on low-end devices) met targets, ensuring smooth and responsive gameplay.

• No additional changes needed for the current setup.