Aditi’s Status Report for 12.02.23

This week, I completed most of the hardware/software/ML integration. I was able to get the slide description model fully integrated, as well as the “start” button. I also helped gather and label data for the slide matching and graph description models, and worked on the geometric preprocessing for the slide matching model. I was able to begin the integration for the graph description pipeline, but I cannot finish it until the graph description model itself is complete.

I am personally ahead of schedule, since my subsystem is finished, and I have taken on additional tasks over the past few weeks. Our team as a whole is behind schedule, since we were supposed to begin testing this week. However, we will be able to test once everything is running on the Jetson. Next week, I plan to finish integration and begin testing.

Aditi’s Status Report for 11.18.23

This week, since I had already finished my portion of the project, I gathered data for the slide detection model. I was able to gather about 6,000 pictures of slides across five locations and from multiple angles.

I’m still ahead of schedule. During Thanksgiving break, I plan on emailing the Disability Resources office to see if there are any visually-impaired students or faculty who can test our product after we get back from break. The week after will likely just be testing and working on final deliverables, as well as helping my teammates with their subsystems.

Aditi’s Status Report for 11.11.23

This week, I was able to integrate the hardware and the software by creating a server on the app side. This server waits for a STOP signal from the stop button and then immediately stops the audio. I was having problems earlier with the audio not playing, but updating my Xcode version fixed it. Now, every part of my app is fully working except the API key functionality. That will not take much time, but I don’t want to add it unless I’m sure it’s absolutely necessary: I am not certain whether every professor needs to send their own API key, or whether just one will suffice. Next, I plan to have my teammates create test Canvas classes with their own accounts to see whether that matters, and then either add or remove that functionality.

I am on track with my progress: I am almost completely done and am just waiting on integration from the ML side. Next week, I plan to help Nithya collect data for the ML model and do anything else my teammates need, since my subsystem is basically done.

These are the tests I’m planning on running:
1. Have a visually-impaired person test the app for compatibility with VoiceOver. I will give them my phone, have them navigate the app, and ask them for feedback about where it can be improved to make it more accessible.
2. Have a visually-impaired person rate the usefulness of the device and of the ML model’s outputs by having them use the app alongside a pre-prepared slideshow, and compare these ratings against our use-case requirement (>90% rated “useful”).
3. Have sighted volunteers rate the usefulness of the device and of the graph description model, and compare against our use-case requirement (>95% rated useful). If they feel we are excluding too much of the graph’s information, we will refine the ML model and gather more training data.
4. Test the latency from button press to the start/stop of audio, and compare against our use-case requirement (<100 ms); a rough measurement sketch follows this list.
5. Test the accuracy of our ML model against its held-out test set, per our use-case requirement (>95% accurate).
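
For test 4, here is a rough sketch of how the latency could be measured, assuming the app-side server acknowledges over HTTP once it has actually started or stopped the audio. The address and endpoint below are placeholders, not the real ones.

import statistics
import time
import requests

APP_URL = "http://192.168.1.42:8080/stop"  # placeholder address of the app-side server

def measure_once():
    start = time.perf_counter()
    # The app replies only after the audio has been stopped/started.
    requests.post(APP_URL, data="STOP", timeout=1.0)
    return (time.perf_counter() - start) * 1000  # milliseconds

samples = [measure_once() for _ in range(20)]
print(f"median latency: {statistics.median(samples):.1f} ms (requirement: < 100 ms)")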

Aditi’s Status Report for 11.04.23

This week, I worked mostly on creating the settings functionality, as well as allowing users to add as many classes as they want in a semester. When a new semester starts, the user must update these fields again. It was difficult to figure out how to make data persist across sessions, and the Swift tutorials I followed weren’t working. My solution was to save a JSON file to the user’s phone file system and have all of the settings information written to that JSON.

This is the settings screen:

When the user clicks on New Semester, they can enter the corresponding info:

Here is the UI, which uses a mix of high-contrast, accessible colors:

When “class 2” is clicked, this is the retrieved result, which is the first slide of the most recently uploaded PDF under the “lectures” module. Currently, it’s printed to the Xcode console:

Here is the Canvas account I created, with the three test courses:

Here is the userData struct, which is saved to the user’s phone files:

name is the name of the class, and classID is the ID of the Canvas course, which is viewable in the Canvas link for the course. The URL for Class 1 is https://canvas.instructure.com/courses/7935642, and the corresponding classID is 7935642. The functionality for the user setting an API key will be added later, and it must persist across app sessions as well.

When the user saves the settings and clicks on a class button, the first slide is output and read out loud. Once we finish the ML model and can receive communication from the Jetson, we’ll change this so that the output is whatever slide the professor is currently on.

Next week, I plan on helping Nithya and Jaspreet integrate with my Flask server, and on reaching out to start finding visually-impaired volunteers to test our product. I am still ahead of schedule and am almost completely finished with my subsystem.

Aditi’s Status Report for 10.21.23

This week I mocked up a finalized design for the iOS app. Since we added the functionality for students to automatically scrape the lectures from Canvas (instead of having to download the PDF themselves, email it to their phone, and then upload it to the app), I had to change the design. Here is what it might look like for a student taking four lecture classes:

I also had to modify the Swift and Flask server code to allow for the following functionality in the test app: when I click one of the buttons in the iPhone simulator, the Flask server should send the app a PDF corresponding to that lecture. Previously, I had only one button corresponding to one lecture, so I modified the code to serve multiple lectures. I also looked into the Canvas scraping code, but I plan on implementing that next week.
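
As a rough illustration of the server-side change (the route, file names, and lecture IDs below are simplified stand-ins for my actual code):

from flask import Flask, send_file, abort

app = Flask(__name__)

# Maps the lecture ID sent by the app (one per button) to a local PDF file.
LECTURES = {
    "lecture1": "pdfs/lecture1.pdf",
    "lecture2": "pdfs/lecture2.pdf",
    "lecture3": "pdfs/lecture3.pdf",
}

@app.route("/lectures/<lecture_id>")
def get_lecture(lecture_id):
    path = LECTURES.get(lecture_id)
    if path is None:
        abort(404)
    return send_file(path, mimetype="application/pdf")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)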

I also spoke with Catherine from Disability Resources, who said that I should look into the Web Content Accessibility Guidelines (WCAG) and into preexisting work on data accessibility. She also thought that we might want to add a feature to let the user know when the slide has changed. I will look into all of this next week.

The other main task I completed before break was working on the design report, which took a lot more time than all of us expected. However, we were able to turn out a good finished product.

I was ahead of schedule for the past couple of weeks, but I am now exactly on schedule. Because I had been ahead, I decided to take on the task of building the Canvas scraper, which I plan to finish next week, along with its integration with my Flask server.

The individual question states: What new tools are you looking into learning so you are able to accomplish your planned tasks? I plan on learning more about how to use the Canvas API, since I have never used one before. I need to figure out what headers to send to authenticate the user, what it will send back, and how to parse that information. I also need to look at some prebuilt Canvas scraping code, which I found here: https://github.com/Gigahawk/canvas-file-scraper/blob/master/README.md, and figure out how to adapt it to my use case. I will also likely end up helping Jaspreet with integrating the Raspberry Pi with the Flask server, and I have never worked with sending image data from the RPi before, so I need to find out how to do that.
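
From what I have read so far, Canvas authentication is just a bearer-token header. Here is a sketch of the kind of request I expect to make; the token and course ID are placeholders, and I still need to verify the exact endpoints and handle pagination:

import requests

CANVAS_BASE = "https://canvas.instructure.com/api/v1"
ACCESS_TOKEN = "REPLACE_ME"   # personal access token generated in Canvas account settings
COURSE_ID = 7935642           # placeholder: taken from the course URL

# List the files in one course; the Authorization header carries the token.
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
resp = requests.get(f"{CANVAS_BASE}/courses/{COURSE_ID}/files", headers=headers)
resp.raise_for_status()

for f in resp.json():
    print(f["display_name"], f["url"])   # "url" is a download link for each file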

Aditi’s Status Report for 10.07.23

This week, I created an app in Xcode that can perform the following functions: allow users to upload a PDF file, send this PDF to a Flask server, receive the extracted text from the server, and speak this text out loud. I also built the associated Flask server in Python. On the server, there is some Python code to extract only the information on a single slide, since only one slide needs to be spoken aloud at a time. For the final app, the actual screen description won’t be printed (this is just for testing).
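
The single-slide extraction on the server looks roughly like the sketch below; it uses PyPDF2, and the route and field names are simplified stand-ins for my actual code:

import io
from flask import Flask, request, jsonify
from PyPDF2 import PdfReader

app = Flask(__name__)

@app.route("/extract", methods=["POST"])
def extract():
    pdf_bytes = request.files["pdf"].read()       # PDF uploaded by the app
    page_num = int(request.form.get("page", 0))   # which slide to read aloud
    reader = PdfReader(io.BytesIO(pdf_bytes))
    text = reader.pages[page_num].extract_text() or ""
    return jsonify({"slide_text": text})

if __name__ == "__main__":
    app.run(port=5000)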

Here are images of the working code on an iPhone 14 Pro simulator:

One issue I ran into was not being able to test the haptics and accessibility measures I implemented, as well as the text-to-speech. These cannot be tested on a simulator, and to feel the physical vibrations from the haptics, I need to test on my personal phone. However, I need an Apple Developer account in order to test anything on my personal phone (rather than a built-in simulator). So, I emailed faculty who know about iOS development (like Prof. Larry Heimann, who teaches an iOS dev course at CMU), and am waiting on a response. For the most part, though, the application seems to be finished and working for the time being. The major addition I will have to make is implementing logic in the Python code running on the Flask server, but this will be done after the ML model is completed.

I am still ahead of schedule, although I had expected to get some work done on gathering training data for the ML model. I was unable to do that because the bugs in my Swift code took a long time to fix.

Next week, I will primarily focus on gathering a large portion of the data we need to train the slide and graph recognition models, as well as spend a lot of time working on my design presentation.

Aditi’s Status Report for 09.30.23

This week I worked mainly on the pre-processing of potential slide image data, as well as the post-processing of possible text outputs.

First, I took an image on my phone of my computer screen with a sample slide on it:

For the pre-processing, I first figured out how to de-skew the image and crop it so that the keys of my computer keyboard wouldn’t affect the output:

However, I found that no amount of purely geometric de-skewing is going to perfectly capture the four corners of the slide, simply because there could be obstructions, and the viewing angle can differ drastically depending on where the user is sitting in the classroom. This led me to believe that we can use this de-skewing as a preprocessing step, but we’ll likely need to build an ML model to detect the exact coordinates of the slide’s four corners.
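
For reference, the geometric de-skew itself is just a perspective warp once the four corner points are known. Here is a minimal OpenCV sketch, with hard-coded corners standing in for whatever eventually detects them (hand-tagging or the future model):

import cv2
import numpy as np

img = cv2.imread("slide_photo.jpg")

# Placeholder corners (top-left, top-right, bottom-right, bottom-left);
# in practice these come from hand-tagging or the corner-detection model.
corners = np.float32([[48, 195], [745, 166], [722, 498], [95, 510]])

# Warp the slide region to a flat 1280x720 rectangle.
out_w, out_h = 1280, 720
target = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
M = cv2.getPerspectiveTransform(corners, target)
deskewed = cv2.warpPerspective(img, M, (out_w, out_h))
cv2.imwrite("slide_deskewed.jpg", deskewed)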

I tried using contour mapping to find a bounding box for the slide, but this didn’t work well when there were big obstructions (which definitely can be present in a classroom setting):

Here the contour is drawn as the red line. It looks okay, but the bottom-left corner is obstructed and has therefore shifted upward a little. It would simply be a better solution to train a model to identify the slide coordinates.
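
The contour-based attempt looked roughly like this sketch; the thresholds are representative values I would tune rather than my exact ones:

import cv2

img = cv2.imread("slide_photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
edges = cv2.Canny(blurred, 50, 150)

# Look for the largest roughly four-sided contour and treat it as the slide.
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
best = None
for c in sorted(contours, key=cv2.contourArea, reverse=True):
    approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
    if len(approx) == 4:
        best = approx
        break

if best is not None:
    cv2.drawContours(img, [best], -1, (0, 0, 255), 3)   # red outline, as in the image above
    cv2.imwrite("slide_contour.jpg", img)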

I tried using combinations and variations of some functions used in image processing that I found online: blur, threshold, opening, grayscale, Canny edge detection, etc. Eventually, I found that the best output was produced by applying a combination of grayscale, blur, threshold, and Canny edge processing:
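
Concretely, that combination is something like the sketch below; the parameter values are representative rather than my exact ones, and the OCR step that consumes the processed image is not shown here:

import cv2

img = cv2.imread("slide_deskewed.jpg")

# Grayscale -> blur -> threshold -> Canny, the combination that worked best.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (3, 3), 0)
thresh = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
edges = cv2.Canny(thresh, 100, 200)

cv2.imwrite("slide_preprocessed.jpg", edges)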

All of this produced the following output:

Use Case & Application & Problem: visually impaired paople cannet easily read text on whiteboards and Slides in the classroom, as a professer Is presenting.
® Scope: our solution addresses reading text during a lecture/presentation.= ‘The device will be a universal camera attachment which clips ento glasses, uses an ML medel to extract text, and reads the text aloud to the user through an OS app upon a button press.

So, I looked into post-processing methods using NLTK, SymSpell, and TextBlob. After testing all three, NLTK’s autocorrect methods seemed to work best and produced this output:

Use Case & Application & Problem : visually impaired people cannot easily read text on whiteboards and Slides in the classroom , as a professor Is presenting . ® Scope : our solution addresses reading text during a lecture/presentation . = ‘ The device will be a universal camera attachment which clips into glasses , uses an ML model to extract text , and reads the text aloud to the user through an Of app upon a button press .

There were errors with respect to the bullet points and capitalization, but those can be easily filtered out. Other than that, the only two misspelled words are “Of” which should be “iOS” and “into” which should be “onto.”
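
For reference, this kind of NLTK-based correction boils down to a dictionary lookup plus edit distance over NLTK’s word list. The sketch below is a simplified stand-in for my actual post-processing:

import nltk
from nltk.corpus import words
from nltk.metrics.distance import edit_distance

nltk.download("words", quiet=True)
VOCAB = set(w.lower() for w in words.words())

def correct(token):
    # Keep tokens that are already words (or are not alphabetic, e.g. punctuation).
    if token.lower() in VOCAB or not token.isalpha():
        return token
    # Otherwise pick the closest dictionary word by edit distance
    # (slow but simple; ties are broken arbitrarily).
    candidates = (w for w in VOCAB if abs(len(w) - len(token)) <= 1)
    return min(candidates, key=lambda w: edit_distance(token.lower(), w), default=token)

ocr_text = "visually impaired paople cannet easily read text on whiteboards"
print(" ".join(correct(t) for t in ocr_text.split()))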

After all of this processing, I used some ChatGPT help to write up a quick program (with Python’s Tkinter) that can help us tag the four corners of a slide by clicking on them, with a button that lets the tagger indicate whether an image contains a slide at all. The following JSON is output, which we can then use as validation data for our future model (a rough sketch of the tool follows the JSON below):

[
  {
    "image_file": "filename",
    "slide_exists": false,
    "bounding_box": null
  },
  {
    "image_file": "filename",
    "slide_exists": true,
    "bounding_box": {
      "top_left": [
        48,
        195
      ],
      "top_right": [
        745,
        166
      ],
      "bottom_right": [
        722,
        498
      ],
      "bottom_left": [
        95,
        510
      ]
    }
  }
]
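
Here is a rough sketch of that tagging tool. The file names, window size, and widget layout are placeholders; the actual script (written with ChatGPT’s help) differs in the details:

import json
import os
import tkinter as tk
from PIL import Image, ImageTk   # assumes Pillow is installed

IMAGE_DIR = "slide_images"       # placeholder input folder
OUTPUT_JSON = "labels.json"
CORNER_KEYS = ["top_left", "top_right", "bottom_right", "bottom_left"]

class Tagger:
    def __init__(self, root, image_paths):
        self.root = root
        self.image_paths = image_paths
        self.index = 0
        self.clicks = []
        self.records = []
        self.canvas = tk.Canvas(root, width=800, height=600)
        self.canvas.pack()
        self.canvas.bind("<Button-1>", self.on_click)
        tk.Button(root, text="No slide in image", command=self.no_slide).pack()
        self.show_image()

    def show_image(self):
        path = self.image_paths[self.index]
        img = Image.open(path).resize((800, 600))
        self.photo = ImageTk.PhotoImage(img)   # keep a reference so it isn't garbage-collected
        self.canvas.create_image(0, 0, image=self.photo, anchor="nw")
        self.clicks = []

    def on_click(self, event):
        # Corners are clicked in the order top-left, top-right, bottom-right, bottom-left;
        # coordinates are recorded in the resized-image space.
        self.clicks.append([event.x, event.y])
        if len(self.clicks) == 4:
            self.save_record(slide_exists=True, bounding_box=dict(zip(CORNER_KEYS, self.clicks)))

    def no_slide(self):
        self.save_record(slide_exists=False, bounding_box=None)

    def save_record(self, slide_exists, bounding_box):
        self.records.append({
            "image_file": os.path.basename(self.image_paths[self.index]),
            "slide_exists": slide_exists,
            "bounding_box": bounding_box,
        })
        self.index += 1
        if self.index == len(self.image_paths):
            with open(OUTPUT_JSON, "w") as f:
                json.dump(self.records, f, indent=2)
            self.root.destroy()
        else:
            self.show_image()

if __name__ == "__main__":
    paths = [os.path.join(IMAGE_DIR, f) for f in sorted(os.listdir(IMAGE_DIR))]
    root = tk.Tk()
    Tagger(root, paths)
    root.mainloop()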

Outside of this, I worked on the design presentation with the rest of my team. I am ahead of schedule, since I did not expect to get so much of the pre/post-processing work done this week. I had actually meant to build out the app this week, but the processing seemed the most valuable thing to do, so I will update the schedule to reflect this change.

Next week, I plan to write a simple Swift app that can communicate with a local Flask server running on my laptop. The app should be able to take in an image, send the image to the Flask server, receive some dummy text data, and use text-to-speech to speak the text out loud. Since I don’t have experience developing apps, most of this code (at least for this test app) will be adapted from the internet and from ChatGPT. However, I will make sure to understand everything I implement. I also plan to take lots of images of slides in classrooms so I can start tagging the data and training an ML model.
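
The dummy server for that test can stay very small; something along these lines, where the route name and response text are placeholders:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/describe", methods=["POST"])
def describe():
    image = request.files.get("image")   # image uploaded by the Swift test app
    name = image.filename if image else "nothing"
    # Dummy text for the app to speak aloud; real OCR/ML output comes later.
    return jsonify({"text": f"Received {name}. This is placeholder text."})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)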

For my aspect of the design, I learned the NLP post-processing methods when I worked on a research project in my sophomore year to recognize the tones of students’ responses to survey questions. I learned the image processing methods by working on planar homography for a lunar rover in the course 16-861 Space Robotics this semester. I haven’t specifically learned iOS development before, but I learned Flutter for a project in high school, familiarized myself with the basics of Xcode and SwiftUI through GeeksForGeeks last week, and will use the principles I learned in 17-437 Web App Development to develop the app. I am planning on using the HTTP protocol, which I learned about in 18-441 Computer Networks.

Aditi’s Status Report for 09.23.23

This week, I started to familiarize myself with Swift and Xcode. I decided to watch the following tutorials first:
1. Swift Essentials
2. XCode Tutorial

I also continued to read through some of the Swift documentation, as well as the entire Swift tutorial on GeeksForGeeks. I wanted to be as thorough as possible so that in the following weeks I would be able to catch any errors and focus on optimizing the application if that becomes an issue.

The rest of the work done this week was group work: we discussed which parts to order and how we could improve our project based on the presentation feedback we received.