Karen’s Status Report for 4/27

This week I continued working on integration and UI improvements, focusing on the user flow for the calibration page and display. Previously, clicking the Get Started button would launch a Python script, which took a long time as modules were imported and also caused a new terminal window to open. This was a hindrance to the user experience, so I changed the design so that a Python script now runs in the background at all times. I also reworked some of the session ID creation logic in the backend to support this new user flow.

I am also working with Rohan to integrate focus and flow together in the web app. I am adding toggle buttons that let the user switch between displaying the focus and flow predictions. I will also be working with him on setting up a calibration phase for the EEG headset so that its readings can be calibrated from the web app.

Overall I’ve been making great progress, am on schedule, and am just adding some final touches to the UI.

Focus and flow toggle buttons:

Karen’s Status Report for 4/20

Now that I have finished implementing all of the distraction and behavior detections, I spent time testing and verifying the integrated system. I tweaked thresholds so that the detections work reliably across different users.

My testing plan is outlined below, and the results of testing can be found here.

  • Test among 5 different users
  • Engage in each behavior 10 times over 5 minutes, leaving opportunities for the models to make true positive, false positive, and false negative predictions
  • Record the number of true positives, false positives, and false negatives (a short sketch of turning these counts into precision and recall follows this list)
  • Behaviors
    • Yawning: user yawns
    • Sleep: user closes eyes for at least 5 seconds
    • Gaze: user looks to left of screen for at least 3 seconds
    • Phone: user picks up and uses phone for 10 seconds
    • Other people: another person enters frame for 10 seconds
    • User is away: user leaves for 30 seconds, replaced by another user
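
As a quick sketch of how the recorded counts translate into the metrics reported in the results (the per-behavior numbers here are hypothetical placeholders, not actual results):

# Hypothetical example: turn recorded TP/FP/FN counts into precision and recall.
counts = {
    "yawning": {"tp": 9, "fp": 1, "fn": 1},   # placeholder numbers
    "sleep":   {"tp": 10, "fp": 0, "fn": 0},  # placeholder numbers
}

for behavior, c in counts.items():
    precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
    print(f"{behavior}: precision={precision:.2f}, recall={recall:.2f}")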

After completing testing, I focused my effort on integration, getting the flow state and distractions displayed on the current session page. I also made significant improvements to the UI to reach our usability and usefulness requirements. This required a lot of experimentation with different UI libraries and visualization options. I’ve attached some screenshots of the iterative process of designing the current session page.

My next focus is improving the session summary page and fully integrating the EEG headset with our web app. Currently, the EEG data visualizations use mock data that I am generating. I would also like to improve the facial recognition component, as the “user is away” event currently produces many false positives. With these next steps in mind and the progress I have made so far, I am on track and on schedule.

One of the new skills I gained by working on this project is how to organize a large codebase. In the past, I worked on smaller projects individually, which gave me leeway to be sloppier with version control, documentation, and code organization without major consequences. Working in a team of three on a significantly larger project with many moving parts, I was able to improve my code management, collaboration, and version control skills with git. This is also my first experience fully developing a web app, implementing everything from the backend to the frontend. From this experience, I’ve gained a lot of knowledge about full-stack development, React, and Django. In terms of learning strategies and resources, Google, Reddit, Stack Overflow, YouTube, and Medium are where I found much of the information on how others use and debug Python libraries and React for frontend development. It was helpful to read Medium tutorials and watch YouTube walkthroughs of a “beginner” project using a new library or tool I wanted to adopt. From there, I could get familiar with the basics and apply that knowledge to our capstone project with its bigger scope.


Karen’s Status Report for 4/6

This week, I completed integration of the phone pick-up and other people distraction detection into the backend and frontend of our web application. Now the phone and other people distraction types are displayed on the current session page.

I have also finished the facial recognition implementation. I decided on the Fast MT-CNN model for face detection and the SFace model for facial embeddings, which gave the best balance between accuracy and speed. These models form the core of the facial recognition module, with the rest of the logic in the run.py and utils.py scripts. The program now detects when the user is not recognized or not in frame and reports how long the user was missing.

User not recognized:  08:54:02
User recognized:  08:54:20
User was away for 23.920616388320923 seconds

I also found that adding facial recognition significantly slowed down the program, since facial recognition requires a large amount of processing time. Because of this, I implemented asynchronous distraction detection using threading so that consecutive frames can be processed concurrently. I am using the concurrent.futures package to achieve this.

from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=8)
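
As an illustration of the idea (a minimal sketch rather than the actual run.py code; detect_distractions here is a hypothetical placeholder for the per-frame pipeline):

from concurrent.futures import ThreadPoolExecutor
import cv2

def detect_distractions(frame):
    # placeholder for the per-frame work: facial analysis, phone detection,
    # facial recognition, etc.
    return None

executor = ThreadPoolExecutor(max_workers=8)
pending = []

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # hand the frame to a worker thread so the capture loop is not blocked
    # while earlier frames are still being analyzed
    pending.append(executor.submit(detect_distractions, frame))
    # collect results from frames that have finished processing
    finished = [f for f in pending if f.done()]
    for f in finished:
        result = f.result()
        if result is not None:
            print(result)  # e.g. report the detection to the backend
    pending = [f for f in pending if not f.done()]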

A next step would be distinguishing when the user is simply not in frame vs. when an impostor has taken the user’s place. After that, I will integrate the facial recognition data into the frontend and backend of the web app. In the following week, I will focus on facial recognition integration and on properly testing and verifying my individual components.

I have done some initial testing of my distraction detection components. Arnav, Rohan, and I have all used yawning, sleeping, and gaze detection with success. From initial testing, these modules work well across different users and faces. Initial testing of other people detection has also shown success and robustness for a variety of users. Phone pick-up detection needs more testing with different users and different colored phones, but initial testing shows success on my phone. I also need to begin verifying that face recognition works for different users, but it has worked well for me so far.

I have already performed some verification of individual components, such as the accuracy of the YOLOv8 phone object detector and the accuracy of MT-CNN and SFace. More thorough validation methods for the components integrated in the project as a whole are listed in our team progress report.

In the coming week I will work on the validation and verification methods. Now that all of the video processing distraction detections are implemented, I will work with Arnav on making the web application cleaner and more user friendly.

Karen’s Status Report for 3/30

This week I focused on integration and facial recognition. For integration, I worked with Arnav to understand the frontend and backend code. I now have a strong understanding of how the distraction data is sent to our custom API so that it can be displayed on the webpage. Now, sleep, yawning, gaze, phone, and other people detection are integrated into the frontend and backend.

I also worked on splitting calibration and distraction detection into separate scripts. This way, calibration data is saved to a file so that it can be retrieved when the user actually begins the work session and reused in future sessions. I updated the backend so that the calibration script is triggered when the user navigates to the calibration page when starting a new session. After calibration is complete, the user clicks the Finished Calibration button, which triggers the distraction detection script.
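
A minimal sketch of the idea of persisting calibration data between scripts (the file name and threshold keys here are hypothetical placeholders, not the exact format we use):

import json
from pathlib import Path

CALIBRATION_FILE = Path("calibration.json")  # hypothetical location

def save_calibration(thresholds: dict) -> None:
    # called at the end of the calibration script
    CALIBRATION_FILE.write_text(json.dumps(thresholds))

def load_calibration() -> dict:
    # called when the distraction detection script starts
    if CALIBRATION_FILE.exists():
        return json.loads(CALIBRATION_FILE.read_text())
    return {}  # fall back to defaults if the user skipped calibration

# e.g. save_calibration({"eye_ratio_threshold": 0.2, "mouth_ratio_threshold": 0.6})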

After the initial testing of different facial recognition models, I have begun implementing the facial recognition module for our app. So far, the script runs facial recognition on the detected faces and prints to the terminal whether the user was recognized, along with the timestamp. The recognition runs about once per second, but this interval may need to be adjusted to improve performance.

I also began testing that the distraction detection works on users other than myself. Sleep, yawn, and gaze detection have performed very well on Arnav and Rohan, but we are running into some issues getting phone detection to work on Arnav and Rohan’s computers. I will investigate this issue in the following week.

Overall, my progress is on track. Although I did not finish implementing facial recognition, I have gotten off to a good start and was able to focus on integration in preparation for the interim demo.

Facial recognition output and screenshot:

User recognized:  21:55:37
User recognized:  21:55:38
User recognized:  21:55:39
User not recognized:  21:55:40
User not recognized:  21:55:41
User not recognized:  21:55:49

Karen’s Status Report for 3/23

This week I wrapped up the phone pick-up detection implementation. I completed another round of training of the YOLOv8 phone object detector, using over 1000 annotated images that I collected myself. This round of data contained more phone colors and orientations, making the detector more robust. I also integrated MediaPipe’s hand landmarker into the phone pick-up detector. By comparing the locations of the detected phone and the hand over a series of frames, we can ensure that the detected phone is actually in the user’s hand. This further increases the robustness of the phone detection.
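
As a rough sketch of the phone-in-hand check (a simplified version of the idea; the box format, margin, and frame counts are assumptions rather than the detector’s exact code):

class PhonePickupDetector:
    def __init__(self, margin=40, pickup_frames=15):
        self.margin = margin                # pixels to expand the phone box by
        self.pickup_frames = pickup_frames  # hypothetical debounce length
        self.consecutive = 0

    def phone_in_hand(self, phone_box, hand_points):
        # phone_box: (x1, y1, x2, y2) in pixels from the YOLOv8 detector
        # hand_points: list of (x, y) pixel coordinates of MediaPipe hand landmarks
        x1, y1, x2, y2 = phone_box
        x1, y1 = x1 - self.margin, y1 - self.margin
        x2, y2 = x2 + self.margin, y2 + self.margin
        return any(x1 <= x <= x2 and y1 <= y <= y2 for x, y in hand_points)

    def update(self, phone_box, hand_points):
        # only count a pick-up when the phone overlaps the hand for several
        # consecutive frames
        if phone_box is not None and hand_points and self.phone_in_hand(phone_box, hand_points):
            self.consecutive += 1
        else:
            self.consecutive = 0
        return self.consecutive >= self.pickup_frames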

After this, I began working on facial recognition more. This is to ensure that the program is actually analyzing the user’s facial features and not someone else’s face in the frame. It will also ensure that it is actually the user working and that they did not replace themselves with another person to complete the work session for them.

I first found a simple Python face recognition library and did some initial testing of it. Although it has a very simple and usable interface, I realized its performance was not sufficient, as it produced too many false positives. Here you can see it identifies two people as “Karen” when only one of them is actually Karen.

I then looked into another Python face recognition library called DeepFace. It has a more complex interface, but provides much more customizability, as it contains various models that can be used for face detection and recognition. I did extensive experimentation and research on the different model options for performance and speed, and have landed on using Fast-MTCNN for face detection and SFace for face recognition.
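
For reference, a minimal sketch of how DeepFace can be used with this combination (assuming a recent DeepFace version that exposes the “fastmtcnn” detector backend; the image paths are placeholders):

from deepface import DeepFace

# compare a saved template photo of the user against a face from the webcam
result = DeepFace.verify(
    img1_path="template.jpg",       # placeholder: captured during calibration
    img2_path="current_frame.jpg",  # placeholder: cropped face from the session
    model_name="SFace",
    detector_backend="fastmtcnn",
)
print(result["verified"], result["distance"], result["threshold"])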

Here you can see the results of my tests for speed for each model:

❯ python3 evaluate_models.py
24-03-21 13:19:19 - Time taken for predictions with VGG-Face: 0.7759 seconds
24-03-21 13:19:20 - Time taken for predictions with Facenet: 0.5508 seconds
24-03-21 13:19:22 - Time taken for predictions with Facenet512: 0.5161 seconds
24-03-21 13:19:22 - Time taken for predictions with OpenFace: 0.3438 seconds
24-03-21 13:19:24 - Time taken for predictions with ArcFace: 0.5124 seconds
24-03-21 13:19:24 - Time taken for predictions with Dlib: 0.2902 seconds
24-03-21 13:19:24 - Time taken for predictions with SFace: 0.2892 seconds
24-03-21 13:19:26 - Time taken for predictions with GhostFaceNet: 0.4941 seconds

Here are some screenshots of tests I ran for performance and speed on different face detectors.

OpenCV face detector (poor performance):

Fast-MTCNN face detector (better performance):

Here is an outline of the overall implementation I would like to follow (a rough code sketch of the matching logic follows the outline):

  • Use MediaPipe’s facial landmarking to roughly crop out the face
  • During calibration
    • Roughly crop out the face using MediaPipe
    • Extract the face using the face detector
    • Get the template embedding
  • During the work session
    • Roughly crop out face 0 using MediaPipe
    • Extract the face using the face detector
    • Get its embedding and compare it with the template embedding
    • If the distance is below the threshold, face 0 is a match
    • If face 0 is a match, everything is good, so continue
    • If face 0 isn’t a match, run the same process on face 1 to see if it matches
    • If neither face 0 nor face 1 matches, fall back to using face 0
    • If there haven’t been any face matches in x minutes, the user is considered no longer there
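
A minimal sketch of the matching step under this outline (using DeepFace.represent for the embeddings and cosine distance; the threshold value and helper names are hypothetical, not the module’s actual API):

import numpy as np
from deepface import DeepFace

MATCH_THRESHOLD = 0.6  # hypothetical value; tuned during testing

def get_embedding(face_img):
    # face_img: an already-cropped face image (numpy array or file path)
    rep = DeepFace.represent(face_img, model_name="SFace",
                             detector_backend="skip", enforce_detection=False)
    return np.array(rep[0]["embedding"])

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def is_match(face_img, template_embedding):
    # a face is a match if its embedding is close enough to the calibration template
    return cosine_distance(get_embedding(face_img), template_embedding) < MATCH_THRESHOLD

During a work session, is_match would be called on face 0 first and then, if needed, on face 1, as described in the outline above.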

Although I hoped to have a full implementation of facial recognition completed, I spent more time this week exploring and testing the different facial recognition options available to find the best fit for our application, and outlining an implementation that works with that choice. Overall, my progress is still on schedule, taking into account the slack time we added.

Karen’s Status Report for 3/16

This week I focused on improving phone detection. I familiarized myself with the Roboflow platform and how to train my own object detection model on it. Following this, I began the process of training the object detector by collecting a diverse dataset. I recorded videos of several different people holding several different phones in their hands. On the Roboflow platform, I annotated and labeled the phone in each frame. I also applied some augmentations (changes in shear, saturation, and brightness) and ended up with over 1000 images for the dataset. The results of the training are in the images below. Overall, this process went much more smoothly than training locally using the Ultralytics Python package. The training time was much shorter and I also obtained much better results using my own custom dataset.

After using the phone detector live, it performs much more robustly than my previous iteration. However, I noticed that it struggles to detect phones in certain orientations, especially when only the thin edge of the phone is visible in frame, where it appears as a very thin rectangle or even a line. I collected more videos of people holding phones in this orientation. I also noticed poor performance on colored phones, so I will need to collect more data for these situations. I will have to label each frame and will then use the model I have already trained as a starting point for further training on this new data in the coming week.

I have integrated all of the individual detectors into a single module that prints when a behavior or distraction is detected along with the timestamp. It keeps track of behavior “states” as well, so that a distraction is not recorded for every individual frame. I am collaborating with Arnav to translate these print statements into calls to the API he has created to communicate with the backend.
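
As an illustration of the state-tracking idea (a simplified sketch; the class and method names are placeholders rather than the actual module’s API):

from datetime import datetime

class BehaviorState:
    """Tracks one behavior so an ongoing distraction is reported only once."""

    def __init__(self, name):
        self.name = name
        self.active = False

    def update(self, detected_this_frame):
        # report only on the transition from "not happening" to "happening"
        if detected_this_frame and not self.active:
            self.active = True
            print(f"{self.name} detected at {datetime.now().strftime('%H:%M:%S')}")
        elif not detected_this_frame:
            self.active = False

# e.g. one state object per behavior, updated every frame:
yawn_state = BehaviorState("Yawning")
# yawn_state.update(mouth_ratio_exceeds_threshold)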

This coming week, I will also integrate MediaPipe’s hand pose landmarker so that I can track the hand in frame as well. We only want to consider a phone pick-up when the phone is detected in the hand, so I will need to check that the location of the phone is in the vicinity of the user’s hand. Another feature I will be working on in the next week is facial recognition. If there are multiple people in frame, facial recognition will be used to distinguish between the user and any other people in frame. This will ensure that we run facial analysis (sleeping, yawning, and gaze detection) on the right face.

With these updates to the phone detector, my progress is on schedule.

Karen’s Status Report for 3/9

This week I spent the majority of my time working on the design report. Outside of that, I experimented with object detection for phone pick-up detection. One component of phone pick-up detection is phone object recognition, so I trained a YOLOv8 model to detect phones using the MUID-IITR dataset. This was the closest dataset I could find online to match scenarios for the Focus Tracker App. The dataset includes images of people using a phone while performing day-to-day activities, as well as annotations of the coordinates of the phones in each image. The dataset required some conversion to match the YOLOv8 format, and then I used the Ultralytics Python package to train the model. Below are the results of training for 100 epochs. The recall and mAP never exceed 0.8, which does not satisfy the design requirements we specified. Testing the model, I noticed that it sometimes predicted just a hand as a phone. The frame rate is also fairly low, around 10 FPS.
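
For reference, the Ultralytics training call looks roughly like this (a minimal sketch; the dataset YAML path and hyperparameters here are placeholders, not the exact configuration I used):

from ultralytics import YOLO

# start from the small pretrained YOLOv8 checkpoint
model = YOLO("yolov8n.pt")

# data.yaml points at the converted MUID-IITR images/labels and class names
results = model.train(data="data.yaml", epochs=100, imgsz=640)

# run inference on a test image to spot-check the trained detector
predictions = model("test_image.jpg")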

There are some other datasets (like this one) that I can continue training the model on that contain just the phone itself, which could reduce the false positives from a hand being classified as a phone. My risk mitigation plan, if my custom YOLOv8 model does not achieve sufficient performance, is to use a model that has already been trained and is available on Roboflow. This is a YOLOv5 model trained on 3000+ images of phones and people using phones. This model is linked here. This option may be better because the training time is very costly (>12 hours for 100 epochs). The FPS for the Roboflow model is also higher (~20 FPS).

I also have a plan to collect and annotate my own data. The MUID-IITR dataset puts a fairly large bounding box around the hand which may be the reason for so many false positives too. Roboflow has a very usable interface for collecting data, annotating images, and training a YOLO model.

Here is the directory with the code for manipulating the data and training my custom YOLOv8 model. And here is the directory with the code for facial recognition.

My progress is overall on schedule, but the custom YOLOv8 model not performing as well as desired is a bit of a setback. In the coming week, I plan to further train this custom model or fall back onto the Roboflow model if it is not successful. I will also integrate the hand landmarker to make the phone pick-up detection more robust by also taking into account the hand that is picking up the phone. I will also further experiment with the face recognition library that I will use for detecting interruptions from others.

Karen’s Status Report for 2/24

This week I implemented head pose estimation. My implementation involves solving the perspective-n-point (PnP) pose computation problem: finding the rotation that minimizes the reprojection error from 3D-2D point correspondences. I use five points on the face for these correspondences, pairing the 3D points of a reference face looking straight ahead with the corresponding 2D points obtained from MediaPipe’s facial landmarks. I then solve for the Euler angles given the resulting rotation matrix. This gives the roll, pitch, and yaw of the head, which tells us if the user’s head is pointed away from the screen or looking around the room. The head pose estimator module is on GitHub here.
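
A minimal sketch of this approach with OpenCV (the 3D reference points, camera intrinsics, and angle ordering below are illustrative placeholders, not the exact values in my module):

import cv2
import numpy as np

# 3D reference points of a forward-facing face (arbitrary model units).
# Placeholder values for nose tip, chin, eye corners, and mouth center.
model_points = np.array([
    [0.0,    0.0,    0.0],    # nose tip
    [0.0,  -63.6,  -12.5],    # chin
    [-43.3,  32.7, -26.0],    # left eye outer corner
    [43.3,   32.7, -26.0],    # right eye outer corner
    [0.0,   -28.9, -24.1],    # center of the mouth
], dtype=np.float64)

def head_pose(image_points, frame_width, frame_height):
    # image_points: 5x2 float64 array of the matching 2D landmark pixel coordinates
    # approximate camera intrinsics from the frame size
    focal_length = frame_width
    camera_matrix = np.array([
        [focal_length, 0, frame_width / 2],
        [0, focal_length, frame_height / 2],
        [0, 0, 1],
    ], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion

    ok, rvec, tvec = cv2.solvePnP(model_points, image_points,
                                  camera_matrix, dist_coeffs)
    rotation_matrix, _ = cv2.Rodrigues(rvec)
    # decompose the rotation matrix into Euler angles (degrees)
    angles, *_ = cv2.RQDecomp3x3(rotation_matrix)
    pitch, yaw, roll = angles
    return pitch, yaw, roll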

I also began experimenting with phone pick-up detection. My idea is to use a combination of hand detection and phone object detection to detect the user picking up and using their phone. I am using MediaPipe’s hand landmark detection, which can locate the user’s hand in the frame. For object detection, I looked into various algorithms, including SSD (Single Shot Detector) and YOLO (You Only Look Once). After reviewing some papers [1, 2] on these algorithms, I decided to go with YOLO for its higher performance.

I was able to find some pre-trained YOLOv5 models for mobile phone detection on Roboflow. Roboflow is a platform that streamlines the process of building and deploying computer vision models and allows for the sharing of models and datasets. One of the models and datasets is linked here. Using Roboflow’s inference Python API, I can load this model and use it to perform inference on images. Two other models [1, 2] performed pretty similarly. They all had trouble recognizing the phone when it was tilted in the hand. I think I will need a better dataset with images of people holding the phone in hand rather than just the phone by itself. I was able to find this dataset on Kaggle.
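
A minimal sketch of how a hosted Roboflow model can be queried from Python (the workspace, project, and version identifiers here are placeholders; the exact names depend on the model linked above):

from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")  # placeholder key
project = rf.workspace("example-workspace").project("phone-detection")  # placeholder IDs
model = project.version(1).model

# run inference on a single frame saved to disk
prediction = model.predict("frame.jpg", confidence=40, overlap=30)
print(prediction.json())  # bounding boxes, classes, and confidences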

Overall, my progress is on schedule. In the following week, I hope to train and test a smartphone object detection model that performs better than the pre-trained models I found online. I will then try to integrate it with the hand landmark detector to detect phone pick-ups.

In the screenshots below, the yaw is negative when looking left and the yaw is positive when looking right.

Below are screenshots of the pre-trained mobile phone object detector and MediaPipe’s hand landmark detector.

Karen’s Status Report for 2/17

This week I finished implementing yawning and microsleep detection. These behaviors will help us understand a user’s productivity during a work session. I used this paper as inspiration for how to detect yawning and microsleeps. I calculate the mouth and eye aspect ratios, which tell us how open or closed the mouth and eyes are. If a ratio passes a certain threshold for a set amount of time, a yawn or microsleep detection is triggered. I implemented this using MediaPipe’s face landmark detection rather than Dlib as used in the paper, because MediaPipe is reported to have higher accuracy and also provides more facial landmarks to work with.

Calibration and determining an appropriate threshold to trigger a yawn or microsleep detection proved to be more difficult than expected. For the detector to work on all users with different eye and mouth shapes, I added a calibration step at the start of the program. It first measures the ratios on a neutral face. It then measures the ratios for when the user is yawning, and then the ratios for when the user’s eyes are closed. This is used to determine the corresponding thresholds. I normalize the ratios by calculating a Z-score for each measurement. My implementation also ensures that the detectors are triggered once for each yawn and each instance of a microsleep regardless of their duration. After finishing the implementation, I spent some time organizing the detectors into individual modules so that the code could be refactored and understood more easily. The code with my most recent commit with yawning and microsleep detection can be accessed here.
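
A simplified sketch of the calibration-normalized thresholding idea (the landmark math is omitted, and the z-score threshold and frame counts are illustrative, not the tuned values in my module):

import numpy as np

class RatioDetector:
    """Triggers once per event when a normalized ratio stays past a z-score
    threshold for enough consecutive frames."""

    def __init__(self, neutral_ratios, z_threshold=3.0, hold_frames=10):
        # neutral_ratios: ratios measured on the user's neutral face during calibration
        self.mean = np.mean(neutral_ratios)
        self.std = np.std(neutral_ratios) + 1e-6
        self.z_threshold = z_threshold  # illustrative value
        self.hold_frames = hold_frames  # illustrative value
        self.count = 0
        self.triggered = False

    def update(self, ratio):
        z = (ratio - self.mean) / self.std  # z-score relative to the neutral face
        if abs(z) > self.z_threshold:
            self.count += 1
        else:
            self.count = 0
            self.triggered = False  # event ended; allow a future trigger
        if self.count >= self.hold_frames and not self.triggered:
            self.triggered = True   # report the yawn/microsleep once
            return True
        return False

# e.g. yawn_detector = RatioDetector(neutral_mouth_ratios)
#      if yawn_detector.update(current_mouth_ratio): report a yawn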

I began exploring options for head pose detection and will follow a similar approach to that proposed in this paper.

Overall, I am on schedule and making good progress. In the coming week, I will finish implementing head pose estimation to track where the user’s head is facing. This will help us track how long the user is looking at/away from their computer screen, which can be correlated to their focus and productivity levels. If this is complete, I will look into and begin implementing object detection to detect phone pick-ups.

Below is a screenshot of the yawning and microsleep detection program with some debugging messages to show the ratios and their thresholds.

Karen’s Status Report for 2/10

I spent this week more thoroughly researching and exploring the CV and ML libraries I can use to implement camera-based distraction and behavior detection. I found MediaPipe and Dlib, both of which are compatible with Python and can be used for facial landmark detection. I plan to use these libraries to help detect drowsiness, yawning, and off-screen gazing. MediaPipe can also be used for object recognition, which I plan to experiment with for phone pick-up detection. Here is a document summarizing my research and brainstorming for camera-based distraction and behavior detection.

I also looked into and experimented with a few existing implementations of drowsiness detection. From this research and experimentation, I plan to use facial landmark detection to calculate the eye aspect ratio and mouth aspect ratio, and potentially a trained neural network to predict the drowsiness of the user.

Lastly, I submitted an order for a 1080p web camera that I will use to produce consistent camera results.

Overall, my progress is on schedule.

In the coming week, I hope to have a preliminary implementation of drowsiness detection. I would like to have successful yawning and closed eye detection via eye aspect ratio and mouth aspect ratio. I will also collect data and train a preliminary neural network to classify images as drowsy vs. not. If time permits, I will also begin experimentation with head tracking and off-screen gaze detection.

Below is a screenshot of me experimenting with the MediaPipe face landmark detection.

Below is a screenshot of me experimenting with an existing drowsiness detector.