This week I implemented head pose estimation. My implementation solves the perspective-n-point (PnP) pose computation problem: finding the rotation that minimizes the reprojection error over a set of 3D-2D point correspondences. I am using 5 points on the face for these correspondences: the 3D points come from a generic face model looking straight ahead without any rotation, and the corresponding 2D points are obtained from MediaPipe’s facial landmarks. I then solve for the Euler angles given the rotation matrix. This gives the roll, pitch, and yaw of the head, which tells us whether the user’s head is pointed away from the screen or looking around the room. The head pose estimator module is on GitHub here.
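For reference, here is a minimal sketch of that pipeline using OpenCV's solvePnP, assuming the five 2D landmark locations have already been extracted from MediaPipe. The 3D model points, camera intrinsics approximation, and Euler-angle convention below are illustrative placeholders, not the exact values used in my module.

```python
import cv2
import numpy as np

# 3D points of a generic forward-facing face (arbitrary units), e.g. nose tip,
# chin, eye corners, mouth center -- placeholder values for illustration.
MODEL_POINTS_3D = np.array([
    [0.0,    0.0,    0.0],   # nose tip
    [0.0,  -63.6,  -12.5],   # chin
    [-43.3,  32.7,  -26.0],  # left eye outer corner
    [43.3,   32.7,  -26.0],  # right eye outer corner
    [0.0,   -28.9,  -24.1],  # mouth center
], dtype=np.float64)

def estimate_head_pose(image_points_2d, frame_width, frame_height):
    """Return (roll, pitch, yaw) in degrees from five 2D landmark locations."""
    # Approximate camera intrinsics: focal length ~ frame width, principal
    # point at the frame center, and no lens distortion.
    focal = frame_width
    cx, cy = frame_width / 2.0, frame_height / 2.0
    camera_matrix = np.array([
        [focal, 0.0,   cx],
        [0.0,   focal, cy],
        [0.0,   0.0,   1.0],
    ], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))

    # Solve PnP: find the rotation/translation minimizing reprojection error
    # between the 3D model points and the detected 2D landmarks.
    ok, rvec, tvec = cv2.solvePnP(
        MODEL_POINTS_3D,
        np.asarray(image_points_2d, dtype=np.float64),
        camera_matrix,
        dist_coeffs,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    if not ok:
        return None

    # Rotation vector -> rotation matrix -> Euler angles (roll, pitch, yaw).
    rot_mat, _ = cv2.Rodrigues(rvec)
    sy = np.sqrt(rot_mat[0, 0] ** 2 + rot_mat[1, 0] ** 2)
    pitch = np.degrees(np.arctan2(rot_mat[2, 1], rot_mat[2, 2]))
    yaw = np.degrees(np.arctan2(-rot_mat[2, 0], sy))
    roll = np.degrees(np.arctan2(rot_mat[1, 0], rot_mat[0, 0]))
    return roll, pitch, yaw
```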
I also began experimenting with phone pick-up detection. My idea is to combine hand detection with phone object detection to detect the user picking up and using their phone. For the hand, I am using MediaPipe’s hand landmark detection, which locates the user’s hand in the frame. For object detection, I looked into various algorithms, including SSD (Single Shot Detector) and YOLO (You Only Look Once). After reviewing some papers [1, 2] on these algorithms, I decided to go with YOLO for its higher performance.
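The hand side looks roughly like the sketch below, assuming MediaPipe's legacy `mp.solutions.hands` Python API. The bounding-box helper is my own illustrative addition for later intersecting hand regions with phone detections, not part of MediaPipe.

```python
import cv2
import mediapipe as mp

# One tracker instance reused across video frames.
_hands = mp.solutions.hands.Hands(static_image_mode=False,
                                  max_num_hands=2,
                                  min_detection_confidence=0.5)

def detect_hand_boxes(bgr_frame):
    """Return a list of (xmin, ymin, xmax, ymax) pixel boxes, one per detected hand."""
    h, w = bgr_frame.shape[:2]
    # MediaPipe expects RGB input.
    results = _hands.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    boxes = []
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            xs = [lm.x * w for lm in hand_landmarks.landmark]
            ys = [lm.y * h for lm in hand_landmarks.landmark]
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```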
I was able to find some pre-trained YOLOv5 models for mobile phone detection on Roboflow. Roboflow is a platform that streamlines the process of building and deploying computer vision models and allows for the sharing of models and datasets. One of the models and datasets is linked here. Using Roboflow’s Python inference API, I can load this model and run inference on images. Two other models [1, 2] performed pretty similarly. They all had trouble recognizing the phone when it was tilted in the hand. I think I will need a better dataset with images of people holding the phone in their hand rather than just the phone by itself. I was able to find this dataset on Kaggle.
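A rough sketch of running one of these hosted models is below, assuming the `roboflow` pip package is the interface used. The API key, project name, version number, and image path are placeholders, not the actual model linked above.

```python
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")                # placeholder key
project = rf.workspace().project("phone-detection")  # placeholder project name
model = project.version(1).model                     # placeholder version

# Run inference on a local image; predictions come back as JSON with
# class, confidence, and bounding-box fields.
result = model.predict("frame.jpg", confidence=40, overlap=30).json()
for pred in result["predictions"]:
    print(pred["class"], pred["confidence"],
          pred["x"], pred["y"], pred["width"], pred["height"])
```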
Overall, my progress is on schedule. In the following week, I hope to train and test a smartphone object detection model that performs better than the pre-trained models I found online. I will then try to integrate it with the hand landmark detector to detect phone pick-ups.
In the screenshots below, the yaw is negative when the user looks left and positive when they look right.
Below are screenshots of the pre-trained mobile phone object detector and MediaPipe’s hand landmark detector.