Karen’s Status Report for 4/27

This week I continued working on integration and improvements to the UI, focusing on the user flow for the calibration page and display. Previously, clicking the Get Started button triggered a Python script, which took a long time to start as modules were imported and also opened a new terminal window. This was a hindrance to the user experience, so I made some changes so that a Python script is running at all times in the background. I also changed some of the session ID creation logic in the backend to allow for this new user flow.
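
As a rough illustration of the new flow (the endpoint path, poll interval, and function names below are placeholders, not our actual code), the always-running script can simply poll the backend for the active session ID and start detection when one appears, with the heavy imports already loaded at startup:

import time
import requests

BACKEND_URL = "http://localhost:8000/api/active-session/"  # hypothetical endpoint
POLL_INTERVAL = 1.0  # seconds

def run_detection(session_id):
    """Placeholder for the real detection loop (heavy modules already imported)."""
    pass

def main():
    current_session = None
    while True:
        try:
            session_id = requests.get(BACKEND_URL, timeout=2).json().get("session_id")
        except (requests.RequestException, ValueError):
            session_id = None
        if session_id and session_id != current_session:
            current_session = session_id
            run_detection(session_id)
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    main()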

I am also working with Rohan to integrate focus and flow into the web app. I am adding toggle buttons that let you switch between displaying the focus and flow predictions. I will also be working with him on setting up a calibration phase for the EEG headset to calibrate its readings in the web app.

Overall I’ve been making great progress, am on schedule, and am just adding some final touches to the UI.

Focus and flow toggle buttons:

Karen’s Status Report for 4/20

Now that I have finished implementation of all the distraction and behavior detections, I spent time testing and verifying the integrated system. I tweaked thresholds so that the detections would work across different users. 

My testing plan is outlined below, and the results of testing can be found here.

  • Test among 5 different users
  • Engage in each behavior 10 times over 5 minutes, leaving opportunities for the models to make true positive, false positive, and false negative predictions
  • Record the number of true positives, false positives, and false negatives (see the sketch after this list for how these counts give per-behavior precision and recall)
  • Behaviors
    • Yawning: user yawns
    • Sleep: user closes eyes for at least 5 seconds
    • Gaze: user looks to left of screen for at least 3 seconds
    • Phone: user picks up and uses phone for 10 seconds
    • Other people: another person enters frame for 10 seconds
    • User is away: user leaves for 30 seconds, replaced by another user
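
As a small illustration of how these counts translate into metrics (the numbers below are placeholders, not our actual results), precision and recall can be computed per behavior:

counts = {
    "yawning": {"tp": 9, "fp": 1, "fn": 1},
    "sleep":   {"tp": 10, "fp": 0, "fn": 0},
}

for behavior, c in counts.items():
    precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
    print(f"{behavior}: precision={precision:.2f}, recall={recall:.2f}")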

After completing testing, I focused my effort on integration, getting flow state and distractions displayed on the current session page. I also made significant improvements to the UI to reach our usability and usefulness requirements. This required a lot of experimentation with different UI libraries and visualization options. I’ve attached some screenshots of the iterative process of designing the current session page.

My next focus is improving the session summary page and fully integrating the EEG headset with our web app. Currently, the EEG data visualizations use mock data that I am generating. I would also like to improve the facial recognition component, as the “user is away” event currently results in many false positives. With these next steps in mind and the progress I have made so far, I am on track and on schedule.

One of the new skills I gained from working on this project is how to organize a large codebase. In the past, I’ve worked on smaller projects individually, which gave me leeway to be sloppier with version control, documentation, and code organization without major consequences. But working in a team of three on a significantly larger project with many moving parts, I was able to improve my code management, collaboration, and version control skills with Git. This is also my first experience fully developing a web app, implementing everything from the backend to the frontend. From this experience, I’ve gained a lot of knowledge about full-stack development, React, and Django. In terms of learning strategies and resources, Google, Reddit, Stack Overflow, YouTube, and Medium are where I found much of my information on how others use and debug Python libraries and React for frontend development. It was helpful to read Medium tutorials and watch YouTube walkthroughs of a “beginner” project using a new library or tool I wanted to use. From there, I was able to get familiar with the basics and apply that knowledge to our capstone project with its bigger scope.


Karen’s Status Report for 4/6

This week, I completed integration of phone pick-up and other-people distraction detection into the backend and frontend of our web application. Now the phone and other-people distraction types are displayed on the current session page.

I have also finished the facial recognition implementation. I decided on the Fast-MTCNN model for face detection and the SFace model for facial embeddings, which gave the best balance between accuracy and speed. These form the core of the facial recognition module, with the rest of the logic in the run.py and utils.py scripts. The program now detects when the user is not recognized or not in frame and reports how long the user was missing.

User not recognized:  08:54:02
User recognized:  08:54:20
User was away for 23.920616388320923 seconds
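
For reference, the core recognition step looks roughly like the sketch below. It assumes DeepFace’s represent() API with the fastmtcnn detector backend and SFace embeddings; the threshold value and helper names are illustrative rather than our exact implementation in run.py and utils.py.

import numpy as np
from deepface import DeepFace

def get_embedding(frame_bgr):
    reps = DeepFace.represent(
        img_path=frame_bgr,               # accepts a numpy (BGR) image
        model_name="SFace",
        detector_backend="fastmtcnn",
        enforce_detection=False,
    )
    return np.array(reps[0]["embedding"]) if reps else None

def is_user(frame_bgr, template_embedding, threshold=0.6):
    """Compare the live embedding against the calibration template by cosine distance."""
    emb = get_embedding(frame_bgr)
    if emb is None:
        return False
    cosine_sim = np.dot(emb, template_embedding) / (
        np.linalg.norm(emb) * np.linalg.norm(template_embedding)
    )
    return (1.0 - cosine_sim) < threshold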

I also realized that adding facial recognition significantly slowed down the program, since facial recognition requires a large amount of processing time. Because of this, I implemented asynchronous distraction detection using threading so that consecutive frames can be processed simultaneously. I am using the concurrent.futures package to achieve this.

from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=8)
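
For context, a rough sketch of how frames can be handed to this pool so that the slow recognition work does not block the capture loop is below; process_frame and handle_result are placeholders for our per-frame detection pipeline and result handling, not actual functions in the codebase.

import cv2
from concurrent.futures import ThreadPoolExecutor

def process_frame(frame):
    """Placeholder for the real per-frame detection pipeline."""
    return None

def handle_result(result):
    """Placeholder: forward a detection result to the backend/UI."""
    pass

executor = ThreadPoolExecutor(max_workers=8)
cap = cv2.VideoCapture(0)
pending = []

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    pending.append(executor.submit(process_frame, frame))

    # harvest finished frames without waiting on the ones still running
    still_running = []
    for future in pending:
        if future.done():
            handle_result(future.result())
        else:
            still_running.append(future)
    pending = still_running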

A next step would be distinguishing when the user is simply not in frame vs. when an impostor has taken the place of the user. After that would come integrating the facial recognition data into the frontend and backend of the web app. In the following week, I will focus on facial recognition integration and on properly testing to verify my individual components.

I have done some initial testing of my distraction detection components. Arnav, Rohan, and I have all used yawning, sleeping, and gaze detection with success; from this initial testing, these modules work well across different users and faces. Initial testing of other-people detection has also shown success and robustness for a variety of users. Phone pick-up detection needs more testing with different users and different colored phones, but initial testing shows success on my phone. I also need to begin verifying that face recognition works for different users, though it has worked well for myself so far.

I have already performed some verification of individual components, such as the accuracy of the YOLOv8 phone object detector and the accuracy of MT-CNN and SFace. More thorough validation methods for the components integrated in the project as a whole are listed in our team progress report.

In the coming week I will work on the validation and verification methods. Now that all of the video processing distraction detections are implemented, I will work with Arnav on making the web application cleaner and more user friendly.

Karen’s Status Report for 3/30

This week I focused on integration and facial recognition. For integration, I worked with Arnav to understand the frontend and backend code. I now have a strong understanding of how the distraction data is sent to our custom API so that it can be displayed on the webpage. Now, sleep, yawning, gaze, phone, and other people detection are integrated into the frontend and backend.

I also worked on splitting calibration and distraction detection into separate scripts. This way, calibration data is saved to a file so that it can be retrieved when the user actually begins the work session and so it can be used in future sessions. I updated the backend so that the calibration script is triggered when the user navigates to the calibration page when starting a new session. After calibration is complete, the user will then click the finished calibration button which will trigger the distraction detection script.
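
A minimal sketch of this hand-off is below; the file name and threshold fields are placeholders rather than the actual format I use.

import json

CALIBRATION_FILE = "calibration.json"  # hypothetical path

def save_calibration(thresholds):
    """Called by the calibration script once calibration finishes."""
    with open(CALIBRATION_FILE, "w") as f:
        json.dump(thresholds, f, indent=2)

def load_calibration():
    """Called by the detection script when the work session starts."""
    with open(CALIBRATION_FILE) as f:
        return json.load(f)

# Example usage:
# save_calibration({"eye_ratio": 0.21, "mouth_ratio": 0.65})
# thresholds = load_calibration()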

After the initial testing of different facial recognition models, I have begun implementation of the facial recognition module for our app. So far, the script runs facial recognition on the detected faces and prints to the terminal whether the user was recognized, along with the timestamp. Recognition runs roughly once per second, but this may need to be modified to improve performance.
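
The once-per-second pacing is essentially a throttle inside the frame loop; a simplified sketch is below, with a placeholder recognition call and an illustrative interval rather than my actual code.

import time
from datetime import datetime

RECOGNITION_INTERVAL = 1.0  # seconds; may be tuned for performance

def recognize_user(frame):
    """Placeholder for the actual facial recognition call."""
    return True

last_run = 0.0

def maybe_recognize(frame):
    global last_run
    now = time.monotonic()
    if now - last_run < RECOGNITION_INTERVAL:
        return None  # skip this frame; not enough time has passed
    last_run = now
    recognized = recognize_user(frame)
    stamp = datetime.now().strftime("%H:%M:%S")
    print(("User recognized:  " if recognized else "User not recognized:  ") + stamp)
    return recognized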

I also began testing that the distraction detection works on users other than myself. Sleep, yawn, and gaze detection have performed very well on Arnav and Rohan, but we are running into some issues getting phone detection to work on Arnav and Rohan’s computers. I will investigate this issue in the following week.

Overall, my progress is on track. Although I did not finish implementing facial recognition, I got off to a good start and was able to focus on integration in preparation for the interim demo.

Facial recognition output and screenshot:

User recognized:  21:55:37
User recognized:  21:55:38
User recognized:  21:55:39
User not recognized:  21:55:40
User not recognized:  21:55:41
User not recognized:  21:55:49

Karen’s Status Report for 3/23

This week I wrapped up the phone pick-up detection implementation. I completed another round of training of the YOLOv8 phone object detector, using over 1000 annotated images that I collected myself. This round of data contained more colors of phones and orientations of phones, making the detector more robust. I also integrated MediaPipe’s hand landmarker into the phone pick-up detector. By comparing the location of the phone detected and the hand over a series of frames, we can ensure that the phone detected is actually in the user’s hand. This further increases the robustness of the phone detection.
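
Conceptually, the phone-in-hand check looks something like the sketch below, assuming a phone bounding box from the YOLOv8 detector and hand landmark points from MediaPipe converted to pixel coordinates; the margin and streak length are illustrative values, not my tuned settings.

def hand_near_phone(phone_box, hand_points, margin=40):
    """phone_box: (x1, y1, x2, y2) in pixels; hand_points: list of (x, y) in pixels."""
    x1, y1, x2, y2 = phone_box
    return any(
        x1 - margin <= x <= x2 + margin and y1 - margin <= y <= y2 + margin
        for x, y in hand_points
    )

class PhonePickupDetector:
    """Require the phone to be near a hand for several consecutive frames."""

    def __init__(self, required_frames=10):
        self.required_frames = required_frames
        self.streak = 0

    def update(self, phone_box, hand_points):
        if phone_box is not None and hand_points and hand_near_phone(phone_box, hand_points):
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.required_frames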

After this, I began working on facial recognition more. This is to ensure that the program is actually analyzing the user’s facial features and not someone else’s face in the frame. It will also ensure that it is actually the user working and that they did not replace themselves with another person to complete the work session for them.

I first found a simple Python face recognition library and did some initial testing of it. Although it has a very simple and usable interface, I realized its performance was not sufficient, as it produced too many false positives. Here you can see it identifies two people as “Karen” when only one of them is actually Karen.

I then looked into another Python face recognition library called DeepFace. It has a more complex interface but provides much more customizability, as it offers several different models for face detection and recognition. I did extensive experimentation and research into the different model options for performance and speed, and have landed on using Fast-MTCNN for face detection and SFace for facial recognition.

Here you can see the results of my tests for speed for each model:

❯ python3 evaluate_models.py
24-03-21 13:19:19 - Time taken for predictions with VGG-Face: 0.7759 seconds
24-03-21 13:19:20 - Time taken for predictions with Facenet: 0.5508 seconds
24-03-21 13:19:22 - Time taken for predictions with Facenet512: 0.5161 seconds
24-03-21 13:19:22 - Time taken for predictions with OpenFace: 0.3438 seconds
24-03-21 13:19:24 - Time taken for predictions with ArcFace: 0.5124 seconds
24-03-21 13:19:24 - Time taken for predictions with Dlib: 0.2902 seconds
24-03-21 13:19:24 - Time taken for predictions with SFace: 0.2892 seconds
24-03-21 13:19:26 - Time taken for predictions with GhostFaceNet: 0.4941 seconds

Here are some screenshots of tests I ran for performance and speed on different face detectors.

OpenCV face detector (poor performance):

Fast-MTCNN face detector (better performance):

Here is an outline of the overall implementation I would like to follow (a code sketch of the matching logic follows the outline):

  • Use MediaPipe’s facial landmarking to roughly crop out the face
  • During calibration
    • Do a rough crop of the face using MediaPipe
    • Extract the face using the face detector
    • Get the template embedding
  • During the work session
    • Do a rough crop of face 0 using MediaPipe
    • Extract the face using the face detector
    • Get its embedding and compare it with the template embedding
    • If the embedding distance is below the threshold, face 0 is a match
    • If face 0 is a match, everything is good, so continue
    • If face 0 isn’t a match, run the same process on face 1 to check for a match
    • If neither face 0 nor face 1 is a match, fall back to using face 0
    • If there haven’t been any face matches in x minutes, the user is no longer there
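
Below is a rough code sketch of the matching and away logic described in the outline; embed() and distance() stand in for the SFace embedding and comparison steps, and the threshold and timeout values are placeholders rather than tuned settings.

import time

MATCH_THRESHOLD = 0.6   # embedding distance below this counts as a match (placeholder)
AWAY_TIMEOUT = 60.0     # seconds without a match before the user is "away" (placeholder)

def match_user(faces, template_embedding, embed, distance):
    """faces: cropped face images ordered face 0, face 1, ...
    Returns (index_to_analyze, matched). Falls back to face 0 on no match."""
    for i, face in enumerate(faces[:2]):          # try face 0, then face 1
        if distance(embed(face), template_embedding) < MATCH_THRESHOLD:
            return i, True
    return (0, False) if faces else (None, False)

class AwayTracker:
    def __init__(self):
        self.last_match_time = time.monotonic()

    def update(self, matched):
        """Returns True if the user has been away longer than AWAY_TIMEOUT."""
        if matched:
            self.last_match_time = time.monotonic()
        return time.monotonic() - self.last_match_time > AWAY_TIMEOUT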

Although I hoped to have a full implementation of facial recognition completed, I spent more of this week exploring and testing the different facial recognition options available to find the best one for our application, and outlining an implementation that works with it. Overall, my progress is still on schedule, taking into account the slack time we added.

Karen’s Status Report for 3/16

This week I focused on improving phone detection. I familiarized myself with the Roboflow platform and how to train my own object detection model on it. Following this, I began the process of training the object detector by collecting a diverse dataset. I recorded videos of several different people holding several different phones in their hands. On the Roboflow platform, I annotated and labeled the phone in each frame. I also applied some augmentations (changes in shear, saturation, and brightness) and ended up with over 1000 images in the dataset. The results of the training are in the images below. Overall, this process went much more smoothly than training locally using the Ultralytics Python package: the training time was much shorter, and I obtained much better results using my own custom dataset.

After using the phone detector live, I found it performs much more robustly than my previous iteration. However, it struggled to detect phones in certain orientations, especially when only the thin edge of the phone is visible in frame. In frame, this looks like a very thin rectangle or even a line, so I collected more videos of people holding phones in this orientation. I also noticed poor performance on colored phones, so I will need to collect more data for these situations. I will have to label each frame and will then use the model I have already trained as a starting point to further train on this new data in the coming week.

I have integrated all of the individual detectors into a single module that prints when a behavior or distraction is detected along with the timestamp. It keeps track of behavior “states” as well, so that a distraction is not recorded for every individual frame. I am collaborating with Arnav to translate these print statements into calls to the API he has created to communicate with the backend.
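
The state tracking boils down to reporting a behavior only on the transition into it, not on every frame where it is detected; a simplified sketch (with illustrative behavior names and print format) is below.

from datetime import datetime

class BehaviorState:
    """Report a behavior once on the transition into it, not on every frame."""

    def __init__(self, name):
        self.name = name
        self.active = False

    def update(self, detected_this_frame):
        if detected_this_frame and not self.active:
            self.active = True
            print(f"{self.name} detected:  {datetime.now().strftime('%H:%M:%S')}")
        elif not detected_this_frame:
            self.active = False

# One state object per behavior, updated every frame:
# states = {name: BehaviorState(name) for name in ("yawn", "sleep", "gaze", "phone")}
# states["yawn"].update(yawn_detected_this_frame)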

This coming week, I will also integrate MediaPipe’s hand pose landmarker so that I can track the hand in frame as well. We only want to count a phone pick-up when the phone is detected in the hand, so I will need to check that the location of the phone is in the vicinity of the user’s hand. Another feature I will be working on in the next week is facial recognition. If there are multiple people in frame, facial recognition will be used to distinguish between the user and any other people in frame. This will ensure that we run facial analysis (sleeping, yawning, and gaze detection) on the right face.

With these updates to the phone detector, my progress is on schedule.

Karen’s Status Report for 3/9

This week I spent the majority of my time working on the design report. Outside of that, I experimented with object detection for phone pick-up detection. One component of the phone pick-up detection is phone object recognition, so I trained the YOLOv8 model to detect phones using the MUID-IITR dataset. This was the closest dataset I could find online to match scenarios for the Focus Tracker App. The dataset includes images of people using a phone while performing day-to-day activities, as well as annotations of the coordinates of the phones in each image. The dataset required some conversion to match the YOLOv8 format, and then I used the Ultralytics Python package to train the model. Below are the results of training for 100 epochs. The recall and mAP never exceed 0.8, which does not satisfy the design requirements we specified. Testing the model, I noticed that it sometimes predicted just a hand as a phone. The FPS is also fairly low, at ~10 FPS.
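
For reference, the Ultralytics training call looks roughly like the sketch below; the dataset YAML path and starting weights are placeholders for my actual setup.

from ultralytics import YOLO

model = YOLO("yolov8n.pt")           # start from pretrained weights
model.train(
    data="muid_iitr_phone.yaml",     # hypothetical dataset config in YOLO format
    epochs=100,
    imgsz=640,
)
metrics = model.val()                # reports precision, recall, and mAP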

There are some other datasets (like this one) containing just the phone itself that I can continue training the model on, which could reduce the false positives where a hand is classified as a phone. My risk mitigation plan, in case my custom YOLOv8 model does not achieve sufficient performance, is to use a model that has already been trained and is available on Roboflow. This is a YOLOv5 model trained on 3000+ images of phones and people using phones, linked here. This option may be better because the training time is very costly (>12 hours for 100 epochs). The FPS for the Roboflow model is also higher (~20 FPS).

I also have a plan to collect and annotate my own data. The MUID-IITR dataset puts a fairly large bounding box around the hand which may be the reason for so many false positives too. Roboflow has a very usable interface for collecting data, annotating images, and training a YOLO model.

Here is the directory with the code for manipulating the data and training my custom YOLOv8 model. And here is the directory with the code for facial recognition.

My progress is overall on schedule, but the custom YOLOv8 model not performing as well as desired is a bit of a setback. In the coming week, I plan to further train this custom model or fall back onto the Roboflow model if it is not successful. I will also integrate the hand landmarker to make the phone pick-up detection more robust by also taking into account the hand that is picking up the phone. I will also further experiment with the face recognition library that I will use for detecting interruptions from others.

Karen’s Status Report for 2/24

This week I implemented head pose estimation. My implementation involves solving the perspective-n-point (PnP) pose computation problem: finding the rotation that minimizes the reprojection error from 3D-2D point correspondences. I use five points on the face for these correspondences: the 3D points of a reference face looking straight forward, and the corresponding 2D points obtained from MediaPipe’s facial landmarks. I then solve for the Euler angles given the rotation matrix. This gives the roll, pitch, and yaw of the head, which tells us if the user’s head is pointed away from the screen or looking around the room. The head pose estimator module is on GitHub here.
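
A simplified sketch of the PnP step is below; it assumes 2D landmark pixel coordinates from MediaPipe and a small set of 3D reference points for a forward-facing face, with an approximate camera matrix derived from the frame size. The exact reference points and flags in my module may differ.

import cv2
import numpy as np

def head_pose(image_points_2d, model_points_3d, frame_width, frame_height):
    """image_points_2d: (N, 2) pixel coords from MediaPipe landmarks.
    model_points_3d: (N, 3) points of a reference face looking straight ahead."""
    image_points = np.asarray(image_points_2d, dtype=np.float64)
    model_points = np.asarray(model_points_3d, dtype=np.float64)

    # Approximate camera intrinsics from the frame size (no lens distortion).
    focal_length = frame_width
    camera_matrix = np.array([
        [focal_length, 0, frame_width / 2],
        [0, focal_length, frame_height / 2],
        [0, 0, 1],
    ], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))

    ok, rvec, tvec = cv2.solvePnP(
        model_points, image_points, camera_matrix, dist_coeffs,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    if not ok:
        return None

    # Rotation vector -> rotation matrix -> Euler angles (degrees).
    rotation_matrix, _ = cv2.Rodrigues(rvec)
    angles, *_ = cv2.RQDecomp3x3(rotation_matrix)
    pitch, yaw, roll = angles
    return roll, pitch, yaw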

I also began experimenting with phone pick-up detection. My idea is to use a combination of hand detection and phone object detection to detect the user picking up and using their phone. I am using MediaPipe’s hand landmark detection, which can detect where the user’s hand is in the frame. For object detection, I looked into various algorithms, including SSD (Single-Shot object Detection) and YOLO (You Only Look Once). After reviewing some papers [1, 2] on these algorithms, I decided to go with YOLO for its higher performance.

I was able to find some pre-trained YOLOv5 models for mobile phone detection on Roboflow. Roboflow is a platform that streamlines the process of building and deploying computer vision models and allows for the sharing of models and datasets. One of the models and datasets is linked here. Using Roboflow’s inference Python API, I can load this model and use it to perform inference on images. Two other models [1, 2] performed pretty similarly. They all had trouble recognizing the phone when it was tilted in the hand. I think I will need a better dataset with images of people holding the phone in hand rather than just the phone by itself. I was able to find this dataset on Kaggle.
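
Loading one of these hosted models through Roboflow’s Python SDK looks roughly like this; the API key, workspace, project, and version below are placeholders, not the actual model linked above.

from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("example-workspace").project("phone-detection")  # hypothetical names
model = project.version(1).model

prediction = model.predict("frame.jpg", confidence=40, overlap=30)
print(prediction.json())  # bounding boxes, classes, and confidence scores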

Overall, my progress is on schedule. In the following week, I hope to train and test a smartphone object detection model that performs better than the pre-trained models I found online. I will then try to integrate it with the hand landmark detector to detect phone pick-ups.

In the screenshots below, the yaw is negative when looking left and the yaw is positive when looking right.

Below are screenshots of the pre-trained mobile phone object detector and MediaPipe’s hand landmark detector.

Team Status Report for 2/17

Public Health, Social, and Economic Impacts

Concerning public health, our product will address the growing concern with digital distractions and their impact on mental well-being. By helping users monitor their focus and productivity levels during work sessions and their correlation with various environmental distractions such as digital devices, our product will give users insights into their work and phone usage, and potentially help improve their mental well-being in work environments and relationship with digital devices.

For social factors, our product addresses an issue that affects almost everyone today. Social media bridges people across various social groups but is also a significant distraction designed to efficiently draw and maintain users’ attention. Our product aims to empower users to track their focus and understand what factors play into their ability to enter focus states for extended periods of time.

The development and implementation of the Focus Tracker App can have significant economic implications. Firstly, by helping individuals improve their focus and productivity, our product can contribute to overall efficiency in the workforce. Increased productivity often translates to higher output per unit of labor, which can lead to economic growth. Businesses will benefit from a more focused and productive workforce, resulting in improved profitability and competitiveness in the market. Additionally, our app’s ability to help users identify distractions can lead to a better understanding of time management and resource allocation, which are crucial economic factors in optimizing production. In summary, our product will have a strong impact on economic factors by enhancing workforce efficiency, improving productivity, and aiding businesses in better managing distractions and resources.

Progress Update

The Emotiv headset outputs metrics for various performance states via the EmotivPRO API, including attention, relaxation, frustration, interest, cognitive stress, and more. We plan to compute metrics to understand correlations (perhaps inverse) between the various performance metrics. A further understanding of how these performance metrics interact with one another, for example the effect of interest in a subject or of cognitive stress on attention, could prove extremely useful to users in evaluating what factors affect their ability to maintain focus on the task at hand. We also plan to look at this data in conjunction with Professor Dueck’s focus vs. distracted labeling to understand what thresholds of the performance metric values denote each state of mind.

On Monday, we met with Professor Dueck and her students to get some more background on how she works with her students and understands their flow states/focus levels. We discussed the best way for us to collaborate and collect data that would be useful for us. We plan to create a simple Python script that will record the start and end of focus and distracted states with timestamps using the laptop keyboard. This will give us a ground truth of focus states to compare with the EEG brainwave data provided by the Emotiv headset.
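
A minimal sketch of such a labeling script is below; the key bindings and output file name are placeholders, since we have not finalized the format.

import csv
from datetime import datetime

LABEL_FILE = "focus_labels.csv"  # hypothetical output file

def main():
    with open(LABEL_FILE, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            key = input("Label [f=focused, d=distracted, q=quit]: ").strip().lower()
            if key == "q":
                break
            if key in ("f", "d"):
                label = "focused" if key == "f" else "distracted"
                writer.writerow([datetime.now().isoformat(), label])
                print(f"Recorded {label} at {datetime.now().strftime('%H:%M:%S')}")

if __name__ == "__main__":
    main()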

This week we also developed a concrete risk mitigation plan in case the EEG Headset does not produce accurate results. This plan integrates microphone data, PyAudioAnalysis/MediaPipe for audio analysis, and Meta’s LLaMA LLM for personalized feedback into the Focus Tracker App.

We will use the microphone on the user’s device to capture audio data during work sessions and implement real-time audio processing to analyze background sounds and detect potential distractions. The library PyAudioAnalysis will help us extract features from the audio data, such as speech, music, and background noise levels. MediaPipe will help us with real-time audio visualization, gesture recognition, and emotion detection from speech. PyAudioAnalysis/MediaPipe will help us categorize distractions based on audio cues and provide more insight into the user’s work environment. Next, we will integrate Meta’s LLaMA LLM to analyze the user’s focus patterns and distractions over time. We will train the LLM on a dataset of focus-related features, including audio data, task duration, and other relevant metrics. The LLM will generate personalized feedback and suggestions based on the user’s focus data.

In addition, we will provide actionable insights such as identifying common distractions, suggesting productivity techniques, or recommending changes to the work environment that will further help the user improve their productivity. Lastly, we will display the real-time focus metrics and detect distractions on multiple dashboards similar to the camera and EEG headset metrics we have planned. 

To test the integration of microphone data, we will conduct controlled experiments where users perform focused tasks while the app records audio data. We will analyze the audio recordings to detect distractions such as background noise, speech, and device notifications. Specifically, we will measure the accuracy of distraction detection by comparing it against manually annotated data, aiming for a detection accuracy of at least 90%. Additionally, we will assess the app’s real-time performance by evaluating the latency between detecting a distraction and providing feedback, aiming for a latency of less than 3 seconds. 

Lastly, we prepared for our design review presentation and considered our product’s public health, social, and economic impacts. Overall, we made great progress this week and are on schedule.

Karen’s Status Report for 2/17

This week I finished implementing yawning and microsleep detection. These behaviors will help us understand a user’s productivity during a work session. I used this paper as inspiration for how to detect yawning and microsleeps. I calculate the mouth and eye aspect ratios, which tell us how open or closed the mouth and eyes are. If a ratio exceeds a certain threshold for a set amount of time, it triggers a yawn or microsleep detection. I implemented this using MediaPipe’s face landmark detection rather than Dlib as used in the paper, because MediaPipe is reported to have higher accuracy and also provides more facial landmarks to work with.
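
A simplified sketch of the aspect-ratio computation is below; the landmark ordering follows the usual EAR formulation, and the specific points and thresholds are illustrative rather than my calibrated values.

import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def eye_aspect_ratio(eye):
    """eye: six (x, y) landmarks ordered around the eye, as in the usual EAR formulation."""
    vertical = dist(eye[1], eye[5]) + dist(eye[2], eye[4])
    horizontal = dist(eye[0], eye[3])
    return vertical / (2.0 * horizontal)

def mouth_aspect_ratio(top, bottom, left, right):
    """Openness of the mouth: vertical gap divided by mouth width."""
    return dist(top, bottom) / dist(left, right)

# A yawn is triggered when the mouth ratio stays above its threshold for a set
# time; a microsleep when the eye ratio stays below its threshold for ~5 seconds.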

Calibration and determining an appropriate threshold to trigger a yawn or microsleep detection proved to be more difficult than expected. For the detector to work on all users with different eye and mouth shapes, I added a calibration step at the start of the program. It first measures the ratios on a neutral face. It then measures the ratios for when the user is yawning, and then the ratios for when the user’s eyes are closed. This is used to determine the corresponding thresholds. I normalize the ratios by calculating a Z-score for each measurement. My implementation also ensures that the detectors are triggered once for each yawn and each instance of a microsleep regardless of their duration. After finishing the implementation, I spent some time organizing the detectors into individual modules so that the code could be refactored and understood more easily. The code with my most recent commit with yawning and microsleep detection can be accessed here.
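
The normalization is essentially a Z-score against the neutral-face baseline collected during calibration; a simplified sketch with placeholder values is below.

import statistics

class RatioCalibration:
    """Normalize a ratio against the neutral-face baseline collected at calibration."""

    def __init__(self, neutral_samples):
        self.mean = statistics.mean(neutral_samples)
        self.std = statistics.stdev(neutral_samples) or 1e-6  # avoid division by zero

    def zscore(self, ratio):
        return (ratio - self.mean) / self.std

# During calibration we also record ratios while yawning and with eyes closed,
# and place each threshold between the neutral and "active" Z-score ranges, e.g.:
# calib = RatioCalibration(neutral_mouth_ratios)
# yawn_active = calib.zscore(current_mouth_ratio) > yawn_threshold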

I began exploring options for head pose detection and will follow a similar approach to that proposed in this paper.

Overall, I am on schedule and making good progress. In the coming week, I will finish implementing head pose estimation to track where the user’s head is facing. This will help us track how long the user is looking at/away from their computer screen, which can be correlated to their focus and productivity levels. If this is complete, I will look into and begin implementing object detection to detect phone pick-ups.

Below is a screenshot of the yawning and microsleep detection program with some debugging messages to show the ratios and their thresholds.