Nithya’s Status Report for 12.09.23

This week, I finalized, retrained, and saved the graph detection and slide matching models, continuing to tune hyperparameters and pre-processing values. The main task this week was dealing with issues in the graph description model. I was able to train the model last week, get results on input scatterplots and line graphs, and visualize those results during the same run. However, when I tried to save the model and reload it (so that we wouldn’t have to retrain every time we wanted results), I ran into several issues. The root of these issues is that in Keras, I could only get the most basic type of model (a sequential model with fully connected or convolutional layers) to save and reload in this way. Since my architecture is significantly more complex and consists of a CNN followed by an encoder/decoder network, Keras would not let me save the trained weights and reload them for inference.

I spent several days trying to work around this issue using the Keras documentation and Stack Overflow, but in the end decided it would be better to use a different framework for this model. I translated all of my code into PyTorch, where saving and loading models is much more straightforward, and then retrained the model.
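The standard PyTorch save/load pattern I switched to looks roughly like the sketch below; the model class here is a toy stand-in for illustration, not our actual CNN + encoder/decoder network.

import torch
import torch.nn as nn

# Toy stand-in for the real CNN + encoder/decoder network.
class GraphDescriptionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1))
        self.decoder = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

    def forward(self, images):
        feats = self.cnn(images).flatten(1).unsqueeze(1)  # (batch, 1, 16)
        out, _ = self.decoder(feats)
        return out

model = GraphDescriptionModel()
# ... training would happen here ...

# Save only the learned weights (the state dict), not the Python object.
torch.save(model.state_dict(), "graph_description_model.pt")

# Later, rebuild the architecture and reload the weights for inference.
reloaded = GraphDescriptionModel()
reloaded.load_state_dict(torch.load("graph_description_model.pt", map_location="cpu"))
reloaded.eval()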

Once I did this, I also worked on the integration – I completed the slide matching integration with Aditi earlier in the week, and I worked on the graph detection and description pipeline (I had to combine them into one full system since we are using the extracted graphs from detection for the graph description). This pipeline is now complete.

I finished on schedule. All that is left for the coming week is to make minor tweaks that may further improve the slide matching accuracy and to run some user testing so that we can include those results in our final report. We also need to finish up the final report and test the full system tomorrow before the demo.

Nithya’s Status Report for 12.2.23

There has been a lot of progress on the ML models in the last two weeks.
1. I gathered and annotated 1000 real slides from lectures with graphs to augment the dataset.
2. I trained and evaluated the graph detection model (YOLO). This was fairly successful, though there are some small issues with the bounding boxes cutting off axis labels. Here are some example results.
 
3. I labeled 850 images of captured lecture slides, modified the Siamese network, and trained it twice (once with class imbalance and once with the classes balanced). This didn’t work, so I switched the approach to a combined problem of detecting the slide number and then classifying it.
4. I wrote a script to go through all lecture PDFs and extract the slides as images, and another script to place the slide number in a red box in the bottom-right corner of each image (see the sketch after this list). This required a lot of iteration to produce the best possible image for number detection.
5. I gathered and annotated bounding boxes for 1000 lecture slides with the slide number in the format described above, and trained the YOLO object detection model to detect these slide numbers, which was successful. Here are some example results.
6. We tried using image processing methods such as denoising and thresholding, followed by Tesseract (OCR), to extract the number from the cropped detected bounding box. This was not successful, so I switched to an MNIST-style multi-digit detection approach and implemented a CNN from scratch to classify each of the detected digits. I gathered and labeled about 600 images to augment the MNIST dataset for slide matching, and I wrote a script to extract graphs from the 1000 gathered lecture slides. However, the results of the multi-digit approach were very poor, so we decided to switch back to the detection-and-processing approach, but using a simple MNIST classifier instead of OCR. This was fairly successful, with an accuracy of 73% on a test set of 110 images from TechSpark. Here are some examples.
7. I wrote a script to generate 5000 line plots and 5000 scatter plots, along with reference descriptions.
8. For graph description, I modified the CNN-LSTM code and helped write sample descriptions for the approximately 780 graphs we captured and extracted in TechSpark. Here are the results for a scatterplot:
9. I helped Aditi with the slide matching integration by compiling the detection and ML models into the proper format.
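The red-box overlay from item 4 can be done with Pillow; here is a minimal sketch (the font file, box size, and file names are placeholders – the real script iterated on these choices):

from PIL import Image, ImageDraw, ImageFont

def stamp_slide_number(slide_path, number, out_path, box_size=120):
    """Draw the slide number inside a filled red box in the bottom-right corner."""
    slide = Image.open(slide_path).convert("RGB")
    w, h = slide.size
    draw = ImageDraw.Draw(slide)
    x0, y0 = w - box_size, h - box_size
    draw.rectangle([x0, y0, w, h], fill=(255, 0, 0))
    font = ImageFont.truetype("DejaVuSans-Bold.ttf", box_size // 2)  # placeholder font
    draw.text((x0 + box_size // 6, y0 + box_size // 5), str(number),
              fill="white", font=font)
    slide.save(out_path)

stamp_slide_number("slide_017.png", 17, "slide_017_numbered.png")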
We are on schedule – what still needs to be done is improving the slide matching model and simply training and evaluating the graph description model, which will be done tonight and tomorrow. We will continue to make improvements and finish the integration tomorrow.

Nithya’s Status Report for 11.18.23

This week, our entire team focused primarily on data collection for the image-slide matching model. We set up 3 different sessions for about 2 hours each in various rooms across campus and compiled a master presentation of almost 5000 unique slides to take images of using our camera. We did this in Wean, Gates, and Hamerschlag classrooms, and the images we took were also varied in terms of lighting, content of the lecture slide in the image, distance to the lecture slide, and angle. See some of these images in the team status report.

 

I also did some more research into the Siamese network we will need for the image-slide matching task, particularly the loss function. One option is simple binary cross-entropy loss, where a training pair is labelled 1 (if the image shows that lecture slide) or 0 (if the image is of a different lecture slide). Alternatively, we can use triplet loss. Here, one training example consists of 3 images: an anchor a (a randomly selected image from the training set – a pure lecture slide), a positive example p (an image from the training set in the same class as the anchor, i.e. an image of the same slide), and a negative example n (an image from the training set of a different lecture slide). We then get embeddings for each of these 3 images, and the loss function encourages the distance between the anchor and the positive example to be smaller than the distance between the anchor and the negative example. To prevent trivial zero embeddings or identical embeddings for all images, we require that ||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 <= -alpha, where alpha is a positive constant that we set as a hyperparameter and f is the function the CNN applies to an input image. The margin alpha acts as a threshold: the anchor-positive distance must be at least alpha smaller than the anchor-negative distance.
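A minimal sketch of this triplet loss (written in PyTorch purely for illustration; the margin and embedding size here are arbitrary):

import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Hinge on squared embedding distances: the anchor-positive distance must be
    at least alpha smaller than the anchor-negative distance, else we pay a penalty."""
    d_pos = (f_a - f_p).pow(2).sum(dim=1)   # ||f(a) - f(p)||^2
    d_neg = (f_a - f_n).pow(2).sum(dim=1)   # ||f(a) - f(n)||^2
    return F.relu(d_pos - d_neg + alpha).mean()

# Toy usage with random 128-dimensional embeddings for a batch of 8 triplets.
f_a, f_p, f_n = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
print(triplet_loss(f_a, f_p, f_n))

PyTorch also ships a built-in nn.TripletMarginLoss that behaves similarly (it uses non-squared distances by default).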

 

Finally, I had to go back and fix a bug in the random slide generation which was causing text to appear as rectangles on the slide. I made a Stack Overflow post to try to get help (https://stackoverflow.com/questions/77467424/getting-checkerboard-output-when-drawing-text-on-pil-image), but I ended up figuring it out on my own: the issue was with a specific font that I was randomly selecting, so once I removed that font from the list to choose from, the problem was solved. See the team status report for some images of this.

 

In terms of schedule, I am a little bit behind but I think with the extra time we have next week, I will definitely be able to catch up; I will be using the extra time Wednesday – Friday to continue working on the image-slide matching model. In terms of deliverables, I hope to complete the image-slide matching model, train it using our gathered image data, and show some results here next week. 

Nithya’s Status Report for 11.11.23

This week, I worked on generating data for and training the graph detection algorithm. At first, I tried gathering data from online sources and from my own lecture slides, but I was not able to find a sufficient number of images (only about 50 positive examples, whereas a similar detection algorithm for faces required about 120,000 images). I used a tool called Roboflow to label the graph bounding box coordinates and use them for training, but with so few images, the model produced very poor results. For example, here was one image where several small features were identified as graphs with moderately high confidence:

 

 

Due to the lack of data and the tediousness of having to manually label the bounding box coordinates of each graph on a particular slide, I decided to generate my own data: slides with graphs on them. To ensure diversity in the data set, my code for generating the data probabilistically selects elements of a presentation such as slide title, slide color, number and position of images, number and position of graphs, text position, amount of text, font style, types of graphs, data on the graphs, category names, graph title and axis labels, gridlines, and scale. This is important because if we used the same template to create each slide, the model might learn to only detect graphs if certain conditions are met. Fortunately, part of the code I wrote to randomly generate these lecture slides – specifically, the part that randomly generates graphs – will be useful for the graph description algorithm as well. Here are some examples of randomly generated graphs:

Of course, the graph generation code will have to be slightly modified for the graph description algorithm (specifically, the graph and axis titles will need to be meaningful rather than random strings of text); however, this should be sufficient for graph detection.

 

Here are some examples of the generated full lecture slides with graphs, images, and text:

 

Since my code randomly chooses the locations of any graphs within a slide, I easily modified it to store these locations in a text file to be used as the labeled bounding box coordinates for the graphs in each image.
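A sketch of what writing those labels might look like, assuming the YOLO convention of one text file per image with normalized "class x_center y_center width height" lines (the actual script may store things differently):

def write_yolo_label(label_path, boxes, img_w, img_h):
    """boxes: list of (x_min, y_min, x_max, y_max) pixel coordinates of graphs on one slide.
    Writes one line per box: class x_center y_center width height, all normalized to [0, 1]."""
    with open(label_path, "w") as f:
        for x_min, y_min, x_max, y_max in boxes:
            xc = (x_min + x_max) / 2 / img_w
            yc = (y_min + y_max) / 2 / img_h
            bw = (x_max - x_min) / img_w
            bh = (y_max - y_min) / img_h
            f.write(f"0 {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}\n")  # class 0 = graph

# e.g. a 1280x720 slide with one graph placed at (640, 120)-(1180, 560)
write_yolo_label("slide_0001.txt", [(640, 120, 1180, 560)], 1280, 720)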

With this, I am able to randomly generate over 100,000 images of slides with graphs, and I will train and be able to show my results on Monday. 

 

In addition to this, I looked into how Siamese Networks are trained, which will be useful for the slide-to-image matching. I watched the following video series to learn more about few-shot learning and Siamese Networks:

Few-Shot Learning (1/3)

Few-Shot Learning (2/3)

Few-Shot Learning (3/3)

 

I have not run many tests on the ML side yet, but I am planning to run ablation tests by varying parameters in the graph detection model and measuring the accuracy of bounding box detection (using a loss based on the intersection-over-union metric) on a generated validation data set. We will also be testing graph detection on a much smaller set of slides from our own classes, and we can provide accuracy metrics in terms of IoU for those as well. We can then compare these accuracies to those laid out in our design and use-case requirements. Similarly, for the matching algorithm, measuring accuracy is just a matter of counting how many images were correctly vs. incorrectly matched; we will just have to separate the data we gather into training, validation, and testing sets.
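For concreteness, the IoU metric itself is only a few lines; a sketch, with boxes given as (x_min, y_min, x_max, y_max):

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two boxes offset by half their width and height score ~0.14,
# which would count as a miss at a typical 0.5 threshold.
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))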

 

My progress is almost on schedule again, but we still need to collect a lot of images for the image-slide matching, and I plan to work on this with my group next week. By next week (actually earlier, since I have completed all but the actual training), I will have the graph detection model fully completed and hopefully some results from the graph description model as well, since we are aiming to gather as many images as we can this week for that model and generate reference descriptions. 

 

 

Nithya’s Status Report for 11.04.23

This week, I finished up the tutorial of the image-captioning model using the Flickr8k dataset. I decided not to train it so as to save my GPU quota for training our actual graph-description and graph identification models. 

I also looked more in depth into how image labels should be formatted. If we use a tokenizer (such as the one from keras.preprocessing.text), we can simply store all of the descriptions in the following format:

 

1000268201_693b08cb0e.jpg#0 A child in a pink dress is climbing up a set of stairs in an entry way .

1000268201_693b08cb0e.jpg#1 A girl going into a wooden building .

1000268201_693b08cb0e.jpg#2 A little girl climbing into a wooden playhouse .

1000268201_693b08cb0e.jpg#3 A little girl climbing the stairs to her playhouse .
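If we follow this format, loading the descriptions and fitting the Keras tokenizer could look roughly like this sketch (the file name is a placeholder):

from tensorflow.keras.preprocessing.text import Tokenizer

# Parse lines of the form "<image_id>#<n> <caption text>" into {image_id: [captions]}.
descriptions = {}
with open("descriptions.txt") as f:          # placeholder file name
    for line in f:
        image_tag, caption = line.strip().split(" ", 1)
        image_id = image_tag.split("#")[0]
        descriptions.setdefault(image_id, []).append(caption.lower())

# Fit the tokenizer on every reference caption so each word maps to an integer index.
tokenizer = Tokenizer()
tokenizer.fit_on_texts([c for caps in descriptions.values() for c in caps])
vocab_size = len(tokenizer.word_index) + 1   # +1 because index 0 is reserved for padding
print(vocab_size)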

 

Another thought I had about reference descriptions was whether it would be better to have all reference descriptions follow the same format, or to vary the sentence structure. Since the standard image captioning problem typically requires captioning a more diverse set of images (compared to our graph description problem), there is not much guidance on tasks like this one. Inferring from image captioning, I think a simpler/less complex dataset would benefit from a consistent format, which also addresses the need for precise, unambiguous descriptions (as opposed to descriptions that are more “creative”). This is why I think we should do an ablation study where we have maybe 3 differing captions per image and then either 1) follow the same format across all images, or 2) follow a different format for each image. Here is an example of the difference on one of the images:

Ablation Case 1:

This is a bar graph. The x-axis displays the type of graph, and the y-axis displays the count for training and test data. The highest count is just_image, and the lowest count is growth_chart. 

 

Ablation Case 2:

This bar graph illustrates the counts of training and test graphs over a variety of different graphs, including just_image, bar_chart, and diagram. 

 

As you can see, the second case has more variation. We might want to have a balanced approach since we have multiple reference descriptions per image, so some of them can always have the same format, and others can have variation. 

 

Unfortunately, it looks like the previous Kaggle dataset we found includes graph and axis titles in a non-English language, so we will have to find some new data for that.

 

Additionally, I looked into the algorithm which will detect bounding boxes around the specific types of graphs we are interested in on a slide. I wrote some code which should be able to do this – the main architecture includes a backbone, a neck, and a head. The backbone consists of the primary CNN, which will be something like ResNet-50 and will give us feature maps. Then we can build an FPN (feature pyramid network) for the neck, which will essentially sample layers from the backbone as well as perform upsampling. Finally, we will have 2 heads – a classification head and a regression head.

 

The first head will perform classification (is this a graph or not a graph?) and the second head will perform regression to identify the 4 coordinates of the bounding box. There is also code for anchor boxes and non-max suppression, which discards lower-confidence boxes that have a high enough IoU with a higher-confidence box, so that we do not end up with 2 bounding boxes for the same object.
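As an illustration of what the suppression step does (not necessarily how my own code implements it), torchvision has a built-in op:

import torch
from torchvision.ops import nms

# Three candidate boxes (x_min, y_min, x_max, y_max); the first two overlap heavily.
boxes = torch.tensor([[100., 100., 300., 300.],
                      [110., 110., 310., 310.],
                      [400., 150., 550., 380.]])
scores = torch.tensor([0.92, 0.85, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)  # suppresses the lower-scoring duplicate
print(keep)  # tensor([0, 2])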

 

Here is some of the code I worked on.

 

My progress is almost caught up as I got through the tutorial, but I do still need to find new graph data, generate reference descriptions with my teammates, and train the CNN-LSTM. However, I also started working on the graph detection algorithm above ahead of schedule. My plan for the next week is to work on data collection as much as possible (ideally every day with my teammates) so that I can begin training the CNN-LSTM as soon as possible. 

Nithya’s Status Report for 10.28.23

This week, I continued to work through the CNN-LSTM tutorial and specifically tried to address the training issue I faced last week. Training such large models requires a GPU, and resources like Google Colab provide only a limited number of free GPU hours. Since another one of my classes has assigned us many AWS credits and I am likely to have some extra, I looked into training this tutorial model (and eventually our CNN-LSTM model) on AWS. Since I was totally new to AWS, I worked through this presentation to get set up:

AWS

I learned about different Amazon Machine Images (AMIs) as well as EC2 instance types. I faced some issues launching these instances and I discovered this was because of my vCPU limits. It took several days for my limit increase request to be approved, and then I was able to launch the instance.

 

I also learned about persistent storage, EFS in particular. This is important because when an instance is stopped, the file storage for that instance will be deleted (and unfortunately I experienced that when I had to stop my instance and restart it). I am still having some issues working out how to set up EFS and this is something that will definitely need to be fixed before we start training our model, in order to minimize the time we need to spend paying for active instances.

 

My progress is still behind, as I wanted to work on the graph description model this week but was still caught up with the tutorial. I did work on the tutorial several days this week, but due to the delay in getting my vCPU limit increased, I did not have much time to work through the subsequent issues with EFS. For next week, I will again work on the tutorial and graph description model on several days, and I am continuing to look into ways to solve the EFS issue to ensure smooth training when our model is ready.

 

For next week, I would like to have the results of the tutorial and a draft architecture for the CNN-LSTM. I also hope to work out how to do all of the labeling/figure out how the reference captions and images should be formatted so that they can be easily fed to the model.

Nithya’s Status Report for 10.21.23

This week, I dove deeper into understanding the CNN-LSTM model and got started working with some code for one implementation of this kind of network for image captioning. I worked through the tutorial on this site:

 

How to Develop a Deep Learning Caption Generation Model

This tutorial uses images from the Flickr8k dataset. The CNN-LSTM model has 2 parts: the first is a feature extractor, and the second is the sequence model which generates the output description of the image. One important thing I learned this week while working through this tutorial is how the sequence model works: it generates the output sequence one word at a time using a softmax over all words in the vocabulary, and this requires an “input sequence”, which is actually the sequence of previously generated words. Of course, we can upper-bound the length of the total output sequence (the example used 34 words as the bound).

 

Since the sequence model requires a “previously generated” sequence as input, we also make use of <startseq> and <endseq> tokens. Here is the example given in the article of how the CNN-LSTM would generate the description for an image of a girl:

Here is a portion of the full code, which encodes the architecture of the CNN-LSTM model, as well as a diagram which is provided in the article which helped me to better understand the layers in the model. I need to do a little more research into how the feature extractor and the sequence processor are combined (the last few layers where the two branches join) using a decoder, as I didn’t fully understand the purpose of this when I read this article. This is one of the things I will be looking into next week.
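The inference side of the tutorial is easier to summarize in code than the training side; here is a sketch of the greedy decoding loop, assuming a trained two-input Keras model and a fitted tokenizer (the names and the 34-word bound follow the tutorial, but details may differ):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_desc(model, tokenizer, photo_features, max_length=34):
    """Greedy decoding: feed the photo features plus the words generated so far,
    take the most likely next word from the softmax, and stop at 'endseq'."""
    text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo_features, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(yhat)))
        if word is None or word == "endseq":
            break
        text += " " + word
    return text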

 

 

I wasn’t able to train the model and test any sample images this week (the model has to train for a while and given the number of images, it would take a lot of GPU power, and my quota for this week has been exhausted) – however, this is something else I hope to do next week.

 

I am a little behind schedule as I was hoping to have a more comprehensive grasp of the CNN-LSTM as well as have tried out the standard version of the model (without the necessary tweaks for our use case) by this time; however, I am confident that I will be able to catch up next week, as next week is also devoted to working on the graph description model. In terms of actions I will take, I will make sure to work on the graph description model both during class-time as well as on Tuesday and Thursday to catch up.

 

For next week, I hope to have the test version of CNN-LSTM for image captioning fully working with some results on sample images from the Flickr8k dataset, as well as determine how to do the labelling for our gathered/generated graph data.

 

Team Status Report for 10.07.23

Our team used several principles of engineering, science, and math to develop the design solution for our project.

  1. Engineering Principle: one engineering principle we considered while developing our design was sustainability. We chose components and emphasized through our use case requirements that the device’s power consumption should not be too high and the app’s power consumption on the user’s phone should also be limited. 
  2. Scientific Principle: one scientific principle we used was peer review – one of the major changes we made to our design was based on the feedback we got from our proposal presentation, regarding having the slides pre-uploaded to the app. Peer review is an extremely important part of the design process since it allows for improvement in iterations. 
  3. Math Principle: one math principle we considered was precision, and this informed our decision for which models to select, as well as the change we made to our original idea of using Tesseract directly during the lecture, to now perform text extraction from the pre-uploaded slides. 

One major risk is the ML model not being finished in time. However, this risk is being managed by simply working on gathering data as soon as possible, which we also mentioned in last week’s status reports. Our contingency plans include using more data that is readily available on the internet, rather than relying on tagging and collecting our own data.

No major changes were made to the design. However, based on the feedback from the design presentation, we plan on thinking more about the design requirements and fleshing these out for the upcoming design report deliverable. We also had questions about why the glasses were necessary, and whether the audio from the device would be too distracting during class. We thought about this feedback, and we decided not to move forward with any changes — we plan on addressing why these concerns were not extremely pressing, and why we think our product will not fall victim to these issues in the actual design report.

See Aditi’s status report for some progress photos on the app simulation!

Nithya’s Status Report for 10.07.23

This week, I mainly worked on looking into data sources for training the graph description model. Of course, we will need a significant number of images of bar graphs, line graphs, scatterplots, and pie charts for training the graph description algorithm, even though we are planning to use pre-trained weights for the CNN-LSTM model. 

 

Here is one Kaggle dataset that contains 16,000 images of graphs, separated into 8 classes – this includes bar charts and pie charts, both of which we need for our use case. It was difficult finding a source which contained scatter plots and line graphs. One alternative to finding a single source containing a collection of graphs is web scraping, using Beautiful Soup. Essentially, if we have a list of websites, we can write a Python script as described here (https://www.geeksforgeeks.org/image-scraping-with-python/) to extract any images from those sites. We would have to manually get these URLs and also filter out any non-graph images found this way, but it is another way we can gather data.
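A sketch of that kind of scraping script, assuming the requests and beautifulsoup4 packages (the URL is hypothetical, and graph vs. non-graph filtering would still be manual):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_image_urls(page_url):
    """Return absolute URLs for every <img> tag on a page."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(page_url, img["src"]) for img in soup.find_all("img") if img.get("src")]

# Hypothetical usage:
for url in scrape_image_urls("https://example.com/statistics-lecture"):
    print(url)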

 

I also looked into some methods for increasing the number of images in our training set – some of these augmentation techniques are randomly cropping the image, flipping the image horizontally, changing the colors in the image, etc. Many of these, like the random crop and the horizontal flip, unfortunately may not be well-suited for graphs even though they are common augmentation techniques for CV tasks in general (a crop can remove the axes entirely, and a flip mirrors the axis and label text).

 

Another idea I had was to have some Python program generate graphs for us. This would be particularly useful for line and scatterplots, where a lot of the data on the x and y axes could be randomly generated using numpy. For this idea, we could also come up with a large list of axis labels so that what the graph is plotting wouldn’t be the same for all of the computer-generated graphs. Overall, I think this idea would allow us to generate a lot of varied graphs in a small amount of time, and would be a good addition to the data that we are able to gather online. 

 

Here’s an example of what the random graph generation code would look like, as well as the resulting graph (these would be supplemented with axis labels and a title of course):
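A minimal sketch of what that generation could look like, using numpy and matplotlib (the details here are illustrative, and titles/axis labels are stubbed out):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

def random_scatter(out_path):
    """Generate a scatter plot with a random number of points, random scale, and random color."""
    n = int(rng.integers(20, 200))
    x = rng.uniform(0, rng.uniform(10, 100), size=n)
    y = rng.uniform(0.5, 2.0) * x + rng.normal(scale=rng.uniform(1, 10), size=n)
    plt.figure(figsize=(4, 3))
    plt.scatter(x, y, s=rng.uniform(5, 40), c=[rng.random(3)])
    plt.title("Random scatter")   # a real title and axis labels would be drawn from a word list
    plt.xlabel("x")
    plt.ylabel("y")
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()

random_scatter("scatter_0001.png")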

 

My progress is on schedule, as my goal for this week was to look into some graph data sources. For next week, my goal is to get started on understanding the code for the CNN-LSTM model, making changes to the architecture, and potentially begin training on the images of the Kaggle dataset.

Team Status Report for 09.30.23

One risk to take into account with the changed system design is the method of matching images taken from the camera with slides from the deck, particularly in edge cases such as animations on the slides where only part of the content is shown when the image is captured, or annotations on the slides (highlight, circles, underlines, drawing, etc.). Our contingency plan to mitigate this risk is having multiple different matching methods ready to try – we have the text extraction method which we will try first, as well as another CNN structure similar to how facial recognition works. In facial recognition, there is a “database” of known faces and an image of a new face that we want to recognize is first put through a CNN to get a face embedding. This embedding can then be compared to those of other people in the database and the most similar one is returned as the best match. 

 

Another risk/challenge may be generating the graph descriptions so that they are descriptive yet concise; the success of the graph descriptions will depend on the reference descriptions that we provide, so our contingency plan is to each provide a description for each image we use during the training process so that we get less biased results (since each reference description will be weighted equally when calculating loss). 

 

We made two main modifications to our project this week.

1.  We added a graph description feature on top of our text extraction feature. This was mostly meant to add complexity, as well as help provide users with more information. This requires another separate ML model that we have to collect data for, train, and implement. This incurs the cost of increasing the amount of time it will take for us to collect data, since we were only previously expecting to collect data for slide recognition. However, we plan on mitigating these costs by finding a large chunk of the data online, as well as starting to create our own data as soon as possible, so we can get started early. This also feeds into our next modification, which is meant to help ease the burden of manufacturing our own data.

2.  We modified our product to include a feature where the user should upload the slides to the app which their professor has made available before the lecture. Then, when the user takes a picture with our camera during class, our app will just compare the text on the picture the user has taken with the text on the pre-uploaded slides, and recognize which slide the professor is currently on. This is important because our product will extract both text and graph data with much better accuracy since we would be working directly off of the original formatted slides, rather than a potentially blurry and skewed image that the user would take with our fairly low resolution camera. This also decreases the latency, because now all of the text and graph descriptions can be extracted before the lecture even begins, so the only latency during the lecture will be related to determining which slide the user is on, and the text to speech — both of which should be fairly quick operations. This modification also allows us to scrape lots of the training data for our ML model from online: previously we would have to collect data of blurry or angled photographs of slides, but now that we’re working directly off of the original slides and graphs, we would only need pictures of the slides and graphs themselves, which should be easy to find ourselves (and create if needed). This does incur the extra cost of the user having to upload the slides beforehand, which raises accessibility concerns. However, we plan on including haptics and accessibility functions to the app that would make this very easy. It also incurs the cost of the professor having to upload the slides beforehand, but this is usually done regardless of our product’s existence, so we don’t think this is too big of a concern.

 

Here is our updated schedule, which is also included in the Design Presentation:

 

Here are some progress photos from this week:

Slide bounding box recognition:

Slide pre-processing:

ML Data Tagging Tool:

Nithya’s Status Report for 09.30.23

 

This week, I set up PyTesseract on my computer and was able to test it with a few images. Getting the tesseract module on my M1 Mac was a little challenging since certain packages are not compatible with the M1 chip (as opposed to the standard Intel chip). After getting that working, I tried running Tesseract on a somewhat low-quality image of my student ID card, and the results were fairly good, although certain parts were misidentified or left out. For example, my student ID number was correctly identified, but “Carnegie Mellon University” was read as “patnecie Mellon University”. Given that we will mainly be performing text recognition on very high quality images with our new change of pre-uploading slides, I don’t think the accuracy of PyTesseract will pose a large problem for us.
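The test itself was just a few lines of PyTesseract; a minimal sketch along those lines (the image path is a placeholder, and the exact snippet may have differed):

from PIL import Image
import pytesseract

image = Image.open("id_card.jpg")            # placeholder path to the ID card photo

# Converting to grayscale is a simple, common pre-processing step before OCR.
text = pytesseract.image_to_string(image.convert("L"))
print(text)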

 

I learned from visiting this site (https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) that tesseract already has support for identifying math symbols and equations. I hope to test this out more thoroughly next week, but preliminary results from this week seemed promising (identifying numbers and symbols like = correctly). 

 

Since we decided to extend our use case by identifying and generating descriptions for certain types of graphs, I did a lot of research on how this would work. Initially, I looked at a few papers on the topic of “scene understanding” and “scene graph generation”, which is a similar problem to the one we are trying to solve; this involves building relationships between objects identified in images in the form of a graph. 

 

I went on to look at papers on the topic of image description/captioning, which I feel is the most relevant to our graph description problem. From this paper (https://aclanthology.org/P19-1654.pdf), which actually proposes a new metric for measuring how accurate descriptions of images are, I learned that there are 2 standard methods of evaluating image descriptions/captions. The first is human judgment – ask a person to rate the overall quality of a description and also rate it on certain criteria (relevance, fluency, etc.). The second is automatic metrics, which compare the candidate description to a human-authored reference description. This is typically done with Word Mover’s Distance, which measures the distance between two documents in a word embedding space. We will use the same metric for our graph descriptions.
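As a quick illustration of the metric, here is a sketch using gensim's implementation (it needs a set of pre-trained word vectors and the POT package; the sentences and the choice of embeddings are made up for this example):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # small pre-trained embeddings, for illustration

reference = "this bar graph shows the counts of training and test graphs".split()
candidate = "a bar chart comparing training and test counts for each graph type".split()

# Lower Word Mover's Distance means the candidate is closer to the reference description.
print(vectors.wmdistance(reference, candidate))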

 

Courses which covered the engineering, science, and math principles our team used to develop our design include 18-240, 16-385 (Computer Vision), 18-290, 18-491 (Digital Signal Processing), 11-411 (Natural Language Processing), and 18-794 (Intro to Deep Learning & Pattern Recognition for Computer Vision). The last course in particular has already helped me a lot in understanding the different types of computer vision tasks we need to solve and the appropriate algorithms for them.

 

My progress is on schedule as this week, my main goal was to get PyTesseract working locally and start testing it out. Since we also made some changes to the use case, I updated my tasks in the schedule to focus more on the graph description model. 

 

For the next week, I want to do more research into and maybe begin implementing the first version of the matching algorithm (for matching images of slides to pre-uploaded slides). I will also begin collecting and labeling graph data with my teammates.

Nithya’s Status Report for 09.23.23

This week, I did some more research on the models for Optical Character Recognition. Here are some of the sources I looked at:

Optical Character Recognition Wiki

Tesseract GitHub 

I learned more about which algorithms specifically allow OCR to work; OCR uses a combination of image correlation and feature extraction (both of which are computer vision methods that utilize filters) to recognize characters.

I also learned that certain OCR systems, such as Tesseract (which we mentioned in our proposal, and is open-source), use a two-pass approach. On the first pass-through, the system performs character recognition as usual, and on the second pass-through, the system actually uses the characters that it recognized with high confidence on the first pass to help predict characters that it was not able to recognize on the first pass. 

I looked into post-processing techniques for OCR, which is something we might decide to try to improve accuracy. This involves getting the extracted text and then comparing it to some kind of dictionary, acting as a sort of ‘spell check’ to correct inaccurately-recognized words. This may be harder to do if there are proper nouns which don’t appear in a dictionary, so I’d like to try an implementation with and without post-processing, and compare the accuracy. 
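A toy sketch of that kind of dictionary-based post-processing, using Python's difflib (a real version would use a full word list plus course-specific vocabulary):

import difflib

DICTIONARY = ["recognition", "gradient", "descent", "lecture", "university"]  # toy word list

def correct_word(word, cutoff=0.8):
    """Snap a recognized word to its closest dictionary entry if one is close enough;
    words with no close match (e.g. proper nouns) are left untouched."""
    matches = difflib.get_close_matches(word.lower(), DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_word("recognitlon"))  # -> "recognition"
print(correct_word("Tesseract"))    # no close match, left as-is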

My progress is on schedule, as this week was meant for researching OCR models. 

For next week, I will create a small test project using Tesseract and play around with the hyperparameters and training set, in order to ascertain the current level of accuracy that this system can achieve. 

Team Status Report for 09.23.23

We have identified a few risks that could jeopardize the success of the project. Some of these challenges include optimizing the speed of data transfer and the ML model (because we need the user to receive feedback in real-time), dealing with transcribing/reading math symbols and equations, performing extraction on poor-quality images, and making sure that the power consumption of both the device and the phone is appropriate for the device to be useful to the user throughout the school day.

We have come up with some risk management plans, which include potentially switching to a smaller-size NN to reduce latency, including a wide variety of training images (which have mathematical symbols, low-contrast, blurry, etc.), making sure we have a high-quality camera to capture images, and switching out components for more power-friendly alternatives if needed.

The major change we made was narrowing the scope of the project to only perform text extraction and reading on presentations in classroom settings. This change was necessary since gathering enough training data for the system to perform recognition on any setting would have been difficult, and we wanted to prioritize the accuracy of our system. The change did not incur any costs in particular, but if time allows, we may expand our system to perform scene description on images included in a presentation.

Here is our most recent schedule:

Here is our most recent block diagram:

 

For our project, we had to consider multiple ethical concerns in order to make sure the product was as accessible to the public as possible. One specific concern we addressed was welfare: we wanted our product to be comfortable for the user, and we didn’t want it to detract from their overall learning experience, so we are planning to develop our device to be as lightweight and convenient as possible. Welfare concerns are extremely important to consider whenever developing a product, because we want to make sure that the user is as comfortable as possible, and we must make sure that it is accessible to everyone: a product might be convenient for some people, while not so much for others. Both of these kinds of people are valid users, and a product should be developed with both kinds in mind so that as many people as possible can benefit from the device.