Team Status Report for 3/23

This week we realized that while focus and flow state are closely related, they are distinct states of mind. People have a shared understanding of focus, but flow state is a more elusive term, meaning people carry their own internal mental models of what flow state is and looks like. Given that our ground truth data is based on Prof. Dueck’s labeling of flow states in her piano students, we are shifting from developing a model to measure focus to instead identifying flow states. To stay on track with our initial use case of identifying focused vs. distracted states in work settings, we plan to use Emotiv’s Focus Performance Metric to monitor users’ focus levels and develop our own model to detect flow states. By implementing flow state detection, our project will apply to many fields beyond traditional work settings, including music, sports, and research.

Rohan also discussed our project with his information theory professor, Pulkit Grover, who is extremely knowledgeable about neuroscience, to get feedback on the flow state detection portion of our project. He told us that achieving model test accuracy better than random chance would be a strong result, which we have achieved in our first iteration of the flow detection model.

We also began integration steps this week. Arnav and Karen collaborated on getting the yawn, gaze, and sleep detections to be sent to the backend, so now these distractions are displayed in the UI in a table format along with snapshots in real-time of when the distraction occurs. Our team also met together to try to get our code running locally on each of our machines. This led us to write a README with information about libraries that need to be installed and the steps to get the program running. This document will help us stay organized and make it easier for other users to use our application.

Regarding challenges and risks for the project this week, we were able to clear up the ambiguity between the focused and flow states, and we are still prepared to add microphone detection if needed. Based on our progress this week, all three stages of the project (Camera, EEG, and Web App) are developing well, and we look forward to continuing to integrate all the features.

Arnav’s Status Report for 3/23

This week, I successfully integrated the yawning, gazing, and sleep detection data from the camera and also enabled a way to store a snapshot of the user when the distraction occurs. The yawning, gazing, and sleep detection data is now stored in a table whose columns are Time, Distraction Type, and Image. The table updates in near real time: the frontend polls the API endpoints every second, so a new event appears with at most about a one-second delay. The polling interval can be shortened if the data needs to show up on the React page even faster, but that is most likely unnecessary since the user ideally will not be monitoring this page while they are in a work session. The table appears on the Current Session Page, under the Real-Time Updates table.

I was able to get the snapshot of the user by using the following steps: 

I first utilized the run.py Python script to capture images from the webcam, which are stored in current_frame (a NumPy array). Once a distraction state is identified, I encode the associated image into a base64 string directly in the script. This conversion to a text-based format allows me to send the image over HTTP by making a POST request to my Django backend through the requests library, along with other data like the session ID and user ID.
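
As a rough sketch of this step (assuming current_frame is the BGR frame from OpenCV; the payload fields beyond session ID, user ID, and distraction type are illustrative):

import base64
import cv2
import requests

def send_distraction_snapshot(current_frame, session_id, user_id, distraction_type):
    # Encode the frame as JPEG, then as a base64 string so it can travel inside JSON
    ok, jpeg = cv2.imencode(".jpg", current_frame)
    if not ok:
        return
    payload = {
        "session_id": session_id,
        "user_id": user_id,
        "distraction_type": distraction_type,
        "image": base64.b64encode(jpeg.tobytes()).decode("utf-8"),
    }
    # POST to the detection endpoint exposed by the Django backend
    requests.post("http://127.0.0.1:8000/api/detections/", json=payload, timeout=5)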

The Django backend, designed with the DetectionEventView class, handles these requests by decoding the base64 string back into a binary image format. Using the DetectionEventSerializer, the incoming data is serialized, and the image is saved in the server’s media path. I then generated a URL that points to the saved image, which can be accessed from the updated data payload. To make the images accessible in my React frontend, I configured Django with a MEDIA_URL, which allows the server to deliver media files. 
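
A minimal sketch of what the backend side of this could look like, assuming Django REST Framework and that DetectionEventSerializer exposes an image field (field names here are illustrative):

import base64
import uuid

from django.core.files.base import ContentFile
from rest_framework import status
from rest_framework.response import Response
from rest_framework.views import APIView

from .serializers import DetectionEventSerializer

class DetectionEventView(APIView):
    def post(self, request):
        data = request.data.copy()
        image_b64 = data.pop("image", None)
        if image_b64:
            # Decode the base64 string back into a binary file for the serializer's image field
            data["image"] = ContentFile(base64.b64decode(image_b64), name=f"{uuid.uuid4()}.jpg")
        serializer = DetectionEventSerializer(data=data)
        if serializer.is_valid():
            event = serializer.save()  # the image lands under MEDIA_ROOT; its URL is returned in the response
            return Response(DetectionEventSerializer(event).data, status=status.HTTP_201_CREATED)
        return Response(serializer.errors, status=status.HTTP_400_BAD_REQUEST)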

Within the React frontend, I implemented a useEffect hook to periodically fetch the latest detection data from the Django backend. This data now includes URLs for the images linked to each detection event. When the React component’s state is updated with this new data, it triggers a re-render, displaying the images using the <img> tag in a dynamically created table. I ensured the correct display of images by concatenating the base URL of my Django server with the relative URLs received from the backend. I then applied CSS to style the table, adjusting image sizing and the overall layout to provide a smooth and user-friendly interface.

 The Current Session Page looks like the following:

I made a lot of progress this week and I am definitely on schedule. I will add in data from phone detection and distractions from surroundings next week. I will also work on creating some sample graphs with the current data we have. If I have some additional time, I will connect with Rohan and start to look into the process of integrating the EEG data into the backend and frontend in real-time.

 

Karen’s Status Report for 3/23

This week I wrapped up the phone pick-up detection implementation. I completed another round of training of the YOLOv8 phone object detector, using over 1000 annotated images that I collected myself. This round of data contained more colors of phones and orientations of phones, making the detector more robust. I also integrated MediaPipe’s hand landmarker into the phone pick-up detector. By comparing the location of the phone detected and the hand over a series of frames, we can ensure that the phone detected is actually in the user’s hand. This further increases the robustness of the phone detection.
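
Conceptually, the phone-in-hand check is something like the following sketch (the margin value is an assumption to tune; the phone box comes from the YOLOv8 detector and the landmarks from MediaPipe’s hand landmarker):

def phone_in_hand(phone_box, hand_landmarks, frame_w, frame_h, margin=0.1):
    # phone_box is (x1, y1, x2, y2) in pixels from the YOLOv8 detection;
    # hand_landmarks are MediaPipe's normalized landmarks in [0, 1].
    x1, y1, x2, y2 = phone_box
    dx, dy = (x2 - x1) * margin, (y2 - y1) * margin
    for lm in hand_landmarks:
        px, py = lm.x * frame_w, lm.y * frame_h
        # Count the detection only if part of the hand overlaps the (slightly expanded) phone box
        if (x1 - dx) <= px <= (x2 + dx) and (y1 - dy) <= py <= (y2 + dy):
            return True
    return False

Running this check over a short window of consecutive frames, rather than on a single frame, is what lets us treat a detection as an actual pick-up.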

After this, I began working on facial recognition more. This is to ensure that the program is actually analyzing the user’s facial features and not someone else’s face in the frame. It will also ensure that it is actually the user working and that they did not replace themselves with another person to complete the work session for them.

I first found a simple Python face recognition library and did some initial testing of it. Although it has a very simple and usable interface, I realized its performance was not sufficient: it produced too many false positives. Here you can see it identifies two people as “Karen” when only one of them is actually Karen.

I then looked into another Python face recognition library called DeepFace. It has a more complex interface but provides much more customizability, as it contains various models that can be used for face detection and recognition. I did some extensive experimentation and research into the different model options for performance and speed, and have landed on using Fast-MTCNN for face detection and SFace for face recognition.

Here you can see the results of my tests for speed for each model:

❯ python3 evaluate_models.py
24-03-21 13:19:19 - Time taken for predictions with VGG-Face: 0.7759 seconds
24-03-21 13:19:20 - Time taken for predictions with Facenet: 0.5508 seconds
24-03-21 13:19:22 - Time taken for predictions with Facenet512: 0.5161 seconds
24-03-21 13:19:22 - Time taken for predictions with OpenFace: 0.3438 seconds
24-03-21 13:19:24 - Time taken for predictions with ArcFace: 0.5124 seconds
24-03-21 13:19:24 - Time taken for predictions with Dlib: 0.2902 seconds
24-03-21 13:19:24 - Time taken for predictions with SFace: 0.2892 seconds
24-03-21 13:19:26 - Time taken for predictions with GhostFaceNet: 0.4941 seconds
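
For reference, the timing script is roughly of this shape (a sketch: the test image path is a placeholder, and I am assuming DeepFace’s represent call for generating embeddings):

import time
from deepface import DeepFace

MODELS = ["VGG-Face", "Facenet", "Facenet512", "OpenFace", "ArcFace", "Dlib", "SFace", "GhostFaceNet"]

for model_name in MODELS:
    start = time.time()
    # Generate an embedding for the same test image with each model and time it
    DeepFace.represent(img_path="test_face.jpg", model_name=model_name, enforce_detection=False)
    print(f"Time taken for predictions with {model_name}: {time.time() - start:.4f} seconds")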

Here are some screenshots of tests I ran for performance and speed on different face detectors.

OpenCV face detector (poor performance):

Fast-MTCNN face detector (better performance):

Here is an outline of the overall implementation I would like to follow (a rough code sketch follows the outline):

  • Use MediaPipe’s facial landmarking to rough crop out the face
  • During calibration
    • Do a rough crop out of face using MediaPipe
    • Extract face using face detector
    • Get template embedding
  • During work session
    • Do a rough crop out of face 0 using MediaPipe
    • Extract face using face detector
    • Get embedding and compare with template embedding
    • If below threshold, face 0 is a match
    • If face 0 is a match, everything is good so continue
    • If face 0 isn’t a match, do same process with face 1 to see if there is match
    • If face 0 and face 1 aren’t matches, fallback to using face 0
  • During work session
    • If there haven’t been any face matches in x minutes, then the user is no longer there
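
Below is a rough sketch of the matching steps in this outline using DeepFace (the distance threshold is an assumption that still needs tuning, and the face crops are assumed to come from the MediaPipe rough crop):

import numpy as np
from deepface import DeepFace

DETECTOR = "fastmtcnn"
MODEL = "SFace"
MATCH_THRESHOLD = 0.6  # assumed cosine-distance cutoff; to be tuned experimentally

def get_embedding(face_crop):
    # face_crop: BGR NumPy array produced by the MediaPipe rough crop
    rep = DeepFace.represent(face_crop, model_name=MODEL,
                             detector_backend=DETECTOR, enforce_detection=False)
    return np.array(rep[0]["embedding"])

def is_match(template_embedding, face_crop):
    # Compare the work-session face against the calibration template via cosine distance
    emb = get_embedding(face_crop)
    distance = 1 - np.dot(template_embedding, emb) / (
        np.linalg.norm(template_embedding) * np.linalg.norm(emb))
    return distance < MATCH_THRESHOLD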

Although I hoped to have a full implementation of facial recognition completed, I spent more time this week just exploring and testing the different facial recognition options available to find the best option for our application, and outlining an implementation that would work with this option. Overall, my progress is still on schedule taking into account the slack time added.

Karen’s Status Report for 3/16

This week I focused on improving phone detection. I familiarized myself with the Roboflow platform and how to train my own object detection model on it. Following this, I began the process of training the object detector by collecting a diverse dataset. I recorded videos of several different people holding several different phones in their hands. On the Roboflow platform, I was able to annotate and label the phone in each frame. I also applied some augmentations (changes in shear, saturation, and brightness) and ended up with over 1000 images for the dataset. The results of the training are in the images below. Overall, this process went much more smoothly than training locally using the Ultralytics Python package: the training time was much shorter, and I also obtained much better results using my own custom dataset.

After using the phone detector live, it performs much more robustly than my previous iteration. However, I noticed that it struggled to detect phones in certain orientations, especially when only the thin edge of the phone is visible in frame. In frame, this looks like a very thin rectangle or even a line, so I collected more videos of people holding phones in this orientation. I also noticed poor performance on colored phones, so I will need to collect more data in these situations. I will have to label each frame and will then use the model I have already trained as a starting point to further train on this new data in the coming week.

I have integrated all of the individual detectors into a single module that prints when a behavior or distraction is detected along with the timestamp. It keeps track of behavior “states” as well, so that a distraction is not recorded for every individual frame. I am collaborating with Arnav to translate these print statements into calls to the API he has created to communicate with the backend.
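
The state tracking is conceptually something like the following sketch (the 2-second hold-off is an illustrative value, not our final threshold):

import time

class DistractionState:
    def __init__(self, name, min_duration=2.0):
        self.name = name
        self.min_duration = min_duration  # how long the behavior must persist before it counts
        self.active_since = None
        self.reported = False

    def update(self, detected_this_frame):
        # Called once per frame; prints a single line per distraction episode
        now = time.time()
        if detected_this_frame:
            if self.active_since is None:
                self.active_since = now
            elif not self.reported and now - self.active_since >= self.min_duration:
                self.reported = True
                print(f"{self.name} detected at {time.strftime('%H:%M:%S')}")
        else:
            self.active_since = None
            self.reported = False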

This coming week, I will also integrate MediaPipe’s hand pose landmarker so that I can track the hand in frame as well. We only want to count a phone pick-up when the phone is detected in the hand, so I will need to check that the location of the phone is in the vicinity of the user’s hand. Another feature I will be working on in the next week is facial recognition. If there are multiple people in frame, facial recognition will be used to distinguish between the user and any other people in frame. This will ensure that we run facial analysis (sleeping, yawning, and gaze detection) on the right face.

With these updates to the phone detector, my progress is on schedule.

Arnav’s Status Report for 3/16

This week I focused on integrating the camera data with the Django backend and React frontend in real time. I worked mainly on getting the yawning feature to work, and the other features should be easy to integrate now that I have the template in place. The current flow looks like the following: the run.py file, which is used for detecting all distractions (gaze, yawn, phone pickups, microsleep), now sends a POST request with the detection data to http://127.0.0.1:8000/api/detections/ and another POST request for the current session to http://127.0.0.1:8000/api/current_session. The current_session is used to ensure that data from previous sessions is not shown for the session the user is currently working on. The data packet that is currently sent includes the session_id, user_id, distraction_type, timestamp, and aspect_ratio. For the backend, I created a DetectionEventView, CurrentSessionView, and YawningDataView that handle the POST and GET requests and order the data accordingly. Finally, the frontend fetches the data from these endpoints using fetch('http://127.0.0.1:8000/api/current_session') and fetch(`http://127.0.0.1:8000/api/yawning-data/?session_id=${sessionId}`) and polls every second to ensure that it catches any distraction event in real time. Below is a picture of the data that is shown on the React page every time a user yawns during a work session:
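
On the run.py side, the two POST requests are roughly of this shape (a sketch: the detection fields mirror the data packet listed above, while the exact session payload fields are an assumption):

import time
import requests

API = "http://127.0.0.1:8000/api"

def register_current_session(session_id, user_id):
    # Tells the backend which session is active so older data is not displayed
    requests.post(f"{API}/current_session", json={"session_id": session_id, "user_id": user_id})

def report_distraction(session_id, user_id, distraction_type, aspect_ratio):
    payload = {
        "session_id": session_id,
        "user_id": user_id,
        "distraction_type": distraction_type,
        "timestamp": time.time(),
        "aspect_ratio": aspect_ratio,
    }
    requests.post(f"{API}/detections/", json=payload)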

The data is ordered so that the latest timestamps are shown first. Once I have all the distractions displayed, I will work on making the data look more presentable.

My progress is on schedule, and during the next week, I will continue to work on the backend to ensure that all the data is displayed, and I will put the real-time data in a tabular format. I will also try to add a button to the frontend that automatically triggers the run.py file so it does not need to be run manually.

Arnav’s Status Report for 3/9

This week I worked with Rohan on building the data labeling platform for Professor Dueck and designing the system for how to collect and filter the data. The Python program is specifically designed for Professor Dueck to annotate students’ focus states as ‘Focused,’ ‘Distracted,’ or ‘Neutral’ during music practice sessions. The platform efficiently records these labels alongside precise timestamps in both Epoch and conventional formats, ensuring compatibility with EEG headset data and ease of analysis across sessions. We also outlined the framework for integrating this labeled data with our machine learning model, focusing on how EEG inputs will be processed to predict focus states. This preparation is crucial for our next steps: refining the model to accurately interpret EEG signals and provide meaningful insights into enhancing focus and productivity.
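
A minimal sketch of the labeling loop (assuming labels are typed at a terminal prompt; the CSV name and columns are illustrative):

import csv
import time
from datetime import datetime

LABELS = {"F": "Focused", "D": "Distracted", "N": "Neutral"}

with open("focus_labels.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["epoch_time", "readable_time", "label"])
    while True:
        key = input("Enter F/D/N (or Q to quit): ").strip().upper()
        if key == "Q":
            break
        if key in LABELS:
            # Record both Epoch and human-readable timestamps so the labels line up with the EEG log
            epoch = time.time()
            writer.writerow([epoch, datetime.fromtimestamp(epoch).isoformat(), LABELS[key]])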

Additionally, I worked on integrating a webcam feed into our application. I developed a component named WebcamStream.js. This script prioritizes connecting with an external camera device, if available, before defaulting to the computer’s built-in camera. Users can now view a real-time video feed of themselves directly within the app’s interface. Below is an image of the user when on the application. I will move this to the Calibration page this week.

My progress is on schedule and during the next week, I plan to integrate the webcam feed using MediaPipe instead so that we can directly extract the data on the application itself. I will also continue to work with Rohan on developing the machine learning model for the EEG headset and hopefully have one ready by the end of the week. In addition, I will continue to write code for all the pages in the application.

Team Status Report for 3/9

Part A was written by Rohan, Part B was written by Karen, and Part C was written by Arnav. 

Global Factors

People in the workforce, across a wide variety of disciplines and geographic regions, spend significant amounts of time working at a desk with a laptop or monitor setup. While the average work day lasts 8 hours, most people are only actually productive for 2-3 hours. Improved focus and longer-lasting productivity have many benefits for individuals including personal fulfillment, pride in one’s performance, and improved standing in the workplace. At a larger scale, improving individuals’ productivity also leads to a more rapidly advancing society where the workforce as a whole can innovate and execute more efficiently. Overall, our product will improve individuals’ quality of life and self-satisfaction while simultaneously improving the rate of global societal advancement.

Cultural Factors

In today’s digital age, there’s a growing trend among people, particularly the younger generation and students, to embrace technology as a tool to improve their daily lives. This demographic is highly interested in leveraging technology to improve productivity, efficiency, and overall well-being. Also within a culture that values innovation and efficiency, there is a strong desire to optimize workflows and streamline tasks to achieve better outcomes in less time. Moreover, there’s an increasing awareness of the importance of mindfulness and focus in achieving work satisfaction and personal fulfillment. As a result, individuals seek tools and solutions that help them cultivate mindfulness, enhance focus, and maintain a healthy work-life balance amidst the distractions of the digital world. Our product aligns with these cultural trends by providing users with a user-friendly platform to monitor their focus levels, identify distractions, and ultimately enhance their productivity and overall satisfaction with their work.

Environmental Factors

The Focus Tracker App takes into account the surrounding environment, like background motion/light, interruptions, and conversations, to help users stay focused. It uses sensors and machine learning to understand and react to these conditions. By optimizing work conditions, such as informing the user that the phone is being used too often or the light is too bright, it encourages a reduction in unnecessary energy consumption. Additionally, the app’s emphasis on creating a focused environment helps minimize disruptions that could affect both the user and their surroundings.

Team Progress

The majority of our time this week was spent working on the design report.

This week, we sorted out the issues we were experiencing last week with putting together the data collection system. In the end, we settled on a two-pronged design: we will utilize the EmotivPRO application’s built-in EEG data recording system to record power readings within each of the frequency bands from the AF3 and AF4 sensors (the two sensors corresponding to the prefrontal cortex) while simultaneously running a simple Python program that takes in Professor Dueck’s keyboard input: ‘F’ for focused, ‘D’ for distracted, and ‘N’ for neutral. While this system felt natural to us, we were not sure whether this type of stateful labeling system would match Professor Dueck’s mental model when observing her students. Furthermore, given that Professor Dueck would be deeply focused on observing her students, we were hoping the system would be easy enough for her to use without having to apply much thought to it.

On Monday of this week, we met with Professor Dueck after our weekly progress update with Professor Savvides and Jean for our first round of raw data collection and ground truth labeling. To our great relief, everything ran extremely smoothly, with the EEG quality coming through with minimal noise and Professor Dueck finding our data labeling system to be intuitive and natural to use. One of the significant risk factors for our project has been EEG-based focus detection: as with all types of signal processing and analysis, the quality of the raw data and ground truth labels is critical to training a highly performant model. This was a significant milestone because, while we had tested the data labeling system that Arnav and Rohan designed, it was the first time Professor Dueck was using it. We continued to collect data on Wednesday with a different student of Professor Dueck’s, and this session went equally smoothly. Having secured some initial high-fidelity data with high-granularity ground truth labels, we feel that the EEG aspect of our project has been significantly de-risked.

Going forward, we have to map the logged timestamps from the EEG readings to the timestamps from Professor Dueck’s ground truth labels so we can begin feeding our labeled data into a model for training. This coming week, we hope to have this linking of the raw data with the labels complete, as well as an initial CNN trained on the resulting dataset. From there, we can assess the performance of the model, verify that the data has a high signal-to-noise ratio, and begin to fine-tune the model to improve upon our base model’s performance.
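
The timestamp linking step could be as simple as the following sketch (assuming both the EmotivPRO export and the label log are CSVs with epoch-time columns; the file and column names are placeholders):

import pandas as pd

eeg = pd.read_csv("emotiv_band_power.csv")   # band-power readings with an epoch "timestamp" column
labels = pd.read_csv("focus_labels.csv")     # epoch_time + label rows from the labeling program

eeg = eeg.sort_values("timestamp")
labels = labels.sort_values("epoch_time")

# Attach the most recent label at or before each EEG sample (the labels are stateful)
merged = pd.merge_asof(eeg, labels, left_on="timestamp", right_on="epoch_time", direction="backward")
merged.to_csv("labeled_eeg.csv", index=False)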

A new risk that could jeopardize the progress of our project is the performance of the phone object detection model. The custom YOLOv8 model that has been trained does not currently meet the design requirements of mAP ≥95%. We may need to lower this threshold, improve the model with further training, or use a pre-trained object detection model. We have already found other datasets that we can further train the model on (like this one) and have also found a pre-trained model on Roboflow that has higher performance than the custom model that we trained. This Roboflow model can be something we fall back on if we cannot get our custom model to perform sufficiently well.

The schedule for camera-based detections was updated to be broken down into the implementation of each type of distraction to be detected. Unit testing and then combining each of the detectors into one module will begin on March 18.

To mitigate the risks associated with EEG data reliability in predicting focus states, we have developed 3 different plans:

Plan A involves leveraging EEG data collected from musicians, with Professor Dueck using her expertise and visual cues to label states of focus and distraction during music practice sessions. This method relies heavily on her understanding of individual focus patterns within a specific, skill-based activity.

Plan B broadens the data collection to include ourselves and other participants engaged in completing multiplication worksheets under time constraints. Here, focus states are identified in environments controlled for auditory distractions using noise-canceling headphones, while distracted states are simulated by introducing conversations during tasks. This strategy aims to diversify the conditions under which EEG data is collected. 

Plan C shifts towards using predefined performance metrics from the Emotiv EEG system, such as Attention and Engagement, setting thresholds to classify focus states. Recognizing the potential oversimplification in this method, we plan to correlate specific distractions or behaviors, such as phone pick-ups, with these metrics to draw more detailed insights into their impact on user focus and engagement. By using language model-generated suggestions, we can create personalized advice for improving focus and productivity based on observed patterns, such as recommending strategies for minimizing phone-induced distractions. This approach not only enhances the precision of focus state prediction through EEG data but also integrates behavioral insights to provide users with actionable feedback for optimizing their work environments and habits.

Additionally, we established a formula for the productivity score we will assign to users throughout the work session. The productivity score calculation in the Focus Tracker App quantifies an individual’s work efficiency by evaluating both focus duration and distraction frequency. It establishes a distraction score (D) by comparing the actual number of distractions (A) encountered during a work session against an expected number (E), calculated from the session’s length with an assumption of one distraction every 5 minutes. If A ≤ E, then D = 1 − 0.5 · A/E, so the score sits at a baseline of 0.5 when A = E. If A > E, D continues to decrease below 0.5 but never turns negative. The productivity score (P) is then determined by averaging the focus fraction and the distraction score. This method ensures a comprehensive assessment, with half of the productivity score derived from focus duration and the other half reflecting the impact of distractions.
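
In code, the score could look like the following sketch; the exact form of the A > E branch is an assumption (chosen so D keeps decreasing from 0.5 without going negative), since that part of the formula is not reproduced here:

def productivity_score(focused_seconds, session_seconds, actual_distractions):
    expected = max(session_seconds / 300.0, 1.0)  # one expected distraction per 5 minutes
    A, E = actual_distractions, expected
    if A <= E:
        distraction_score = 1 - 0.5 * A / E
    else:
        distraction_score = 0.5 * E / A  # assumed decay: 0.5 when A == E, approaches 0, never negative
    focus_fraction = focused_seconds / session_seconds
    # Half the score comes from focus duration, half from the distraction score
    return (focus_fraction + distraction_score) / 2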

Overall, our progress is on schedule.

 

Karen’s Status Report for 3/9

This week I spent the majority of my time working on the design report. Outside of that, I experimented with object detection for phone pick-up detection. One component of phone pick-up detection is phone object recognition, so I trained a YOLOv8 model to detect phones using the MUID-IITR dataset. This was the closest dataset I could find online to match scenarios for the Focus Tracker App: it includes images of people using a phone while performing day-to-day activities, along with annotations of the coordinates of the phones in each image. The dataset required some converting to match the YOLOv8 format, and then I used the Python package Ultralytics to train the model. Below are the results of the training with 100 epochs. The recall and mAP never exceed 0.8, which does not satisfy the design requirements we specified. Testing the model, I noticed that it sometimes predicted just a hand as a phone. The FPS is also fairly low, at around 10 FPS.
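
For context, the Ultralytics training run is roughly this (a sketch, assuming the converted annotations are described by a data.yaml file):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")                       # start from a pretrained checkpoint
model.train(data="data.yaml", epochs=100, imgsz=640)
metrics = model.val()                            # reports precision, recall, and mAP on the validation split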

There are some other datasets (like this one) that I can try to continue training the model on that contain just the phone itself, which could reduce the false positives of a hand being classified as a phone. My risk mitigation plan, in case my custom YOLOv8 model does not achieve sufficient performance, is to use a model that has already been trained and is available on Roboflow. This is a YOLOv5 model trained on 3000+ images of phones and people using phones; the model is linked here. This option may be better because the training time for my custom model is very costly (>12 hours for 100 epochs). The FPS for the Roboflow model is also higher (~20 FPS).

I also have a plan to collect and annotate my own data. The MUID-IITR dataset puts a fairly large bounding box around the hand which may be the reason for so many false positives too. Roboflow has a very usable interface for collecting data, annotating images, and training a YOLO model.

Here is the directory with the code for manipulating the data and training my custom YOLOv8 model. And here is the directory with the code for facial recognition.

My progress is overall on schedule, but the custom YOLOv8 model not performing as well as desired is a bit of a setback. In the coming week, I plan to further train this custom model or fall back onto the Roboflow model if it is not successful. I will also integrate the hand landmarker to make the phone pick-up detection more robust by also taking into account the hand that is picking up the phone. I will also further experiment with the face recognition library that I will use for detecting interruptions from others.

Team Status Report for 2/24

This week we finalized our slides for the design presentation last Monday and incorporated the feedback received from the students and professors into our design report. We split up the work for the design report and plan to have it finalized by Wednesday so that we can get the appropriate feedback before the due date on Friday. We are also working on building a data labeling platform for Professor Dueck and plan to meet with her this week so that we can begin the data-gathering process. No changes have been made to our schedule, and we are planning for risk mitigation by doing additional research on microphone/LLM options in case the EEG headset does not provide the accurate results we are looking for. Overall, we are all on schedule and have completed our individual tasks. We are looking forward to implementing more features of our design this week.

Arnav’s Status Report for 2/24

This week I worked on setting up the React frontend, Django backend, and the database for our web application and made sure that all necessary packages/libraries are installed. The Home Page looks very similar to the UI planned in Figma last week. I utilized React functional components for the layout of the page and was able to manage state and side effects efficiently. I integrated a bar graph, line graph, and scatter plot into the home page using Recharts (a React library for creating interactive charts). I made sure that the application’s structure is modular, with reusable components, so that it will be easy to add future pages that are part of the UI design. Regarding the backend, I did some experimentation and research with Axios for API calls to see what would be the best way for the frontend and backend to interact, especially for real-time updates. Django’s default database is SQLite, and once we have our data ready to store, the migration to a PostgreSQL database will be very easy. All of the code written for the features mentioned above has been pushed on a separate branch to our team’s shared GitHub repository: https://github.com/karenjennyli/focus-tracker.
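
The eventual switch from SQLite to PostgreSQL should only require a settings.py change along the lines of this sketch (database name and credentials are placeholders):

# settings.py (sketch)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "focus_tracker",
        "USER": "focus_user",
        "PASSWORD": "change-me",
        "HOST": "localhost",
        "PORT": "5432",
    }
}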

Lastly, I also did some more research on how we can use MediaPipe along with React/Django to show the live camera feed of the user. The live camera feed can be embedded directly into the React application, utilizing the webcam through Web APIs like navigator.mediaDevices.getUserMedia. The processed data from MediaPipe, which might include landmarks or other analytical metrics, will be sent to the Django backend via RESTful APIs. This data will then be serialized using Django’s REST framework and stored in the database.

My progress is currently on schedule and during the next week, I plan to write code for the layout of the Calibration and Current Session Pages and also get the web camera feed to show up on the application using MediaPipe. Additionally, I will do more research on how to integrate the data received from the Camera and EEG headset into the backend and try to write some basic code for that.