This week, our entire team focused primarily on data collection for the image-slide matching model. We held three sessions of about two hours each in various rooms across campus and compiled a master presentation of almost 5000 unique slides to photograph with our camera. We worked in classrooms in Wean, Gates, and Hamerschlag, and we varied the images in lighting, slide content, distance to the slide, and angle. Some of these images are in the team status report.
I also did some more research into the Siamese Network we will need for the image-slide matching task, particularly the loss function. One option is simple binary cross-entropy loss, where each training pair is labelled 1 if the image is of the lecture slide, or 0 if the image is of a different slide. Alternatively, we can consider triplet loss. Here, one training example consists of 3 images: an anchor a (a randomly selected image from the training set, which will be a pure lecture slide), a positive example p (an image from the training set in the same class as the anchor, so an image of the same slide), and a negative example n (an image from the training set that is of a different lecture slide). We then compute embeddings for each of these 3 images, and the loss function pushes the distance between the anchor and the positive example to be less than the distance between the anchor and the negative example. To prevent the network from satisfying this trivially, with all-zero embeddings or the same embedding for every image, we require that ||f(a) – f(p)||^2 – ||f(a) – f(n)||^2 <= -alpha, where f is the function the CNN applies to an input image and alpha is a positive constant that we will set as a hyperparameter. Alpha acts as a margin: the squared distance between the anchor and the positive example must be at least alpha smaller than the squared distance between the anchor and the negative example.
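To make the triplet constraint concrete, here is a minimal sketch in plain Python. It assumes the usual hinge form of the loss, max(||f(a) – f(p)||^2 – ||f(a) – f(n)||^2 + alpha, 0), which is zero exactly when the inequality above holds. The function names, the alpha value of 0.2, and the toy 2-D embeddings are made up for illustration; the real model would compute embeddings with the CNN and would most likely use a deep learning framework rather than Python lists.

```python
def squared_distance(u, v):
    """Squared Euclidean distance ||u - v||^2 between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Hinge form of the triplet constraint (alpha=0.2 is an arbitrary example value).

    Returns 0 when ||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 <= -alpha,
    and the size of the violation otherwise.
    """
    gap = squared_distance(f_a, f_p) - squared_distance(f_a, f_n)
    return max(gap + alpha, 0.0)


# Toy 2-D embeddings (made-up numbers, not outputs of a real CNN).
anchor = [0.0, 0.0]
positive = [0.1, 0.0]  # close to the anchor
negative = [1.0, 0.0]  # far from the anchor

# Constraint satisfied: the positive is much closer than the negative.
satisfied = triplet_loss(anchor, positive, negative, alpha=0.2)

# Roles swapped on purpose: the "positive" is now farther than the "negative".
violated = triplet_loss(anchor, negative, positive, alpha=0.2)

# Collapsed embeddings: every image mapped to the same point.
collapsed = triplet_loss(anchor, anchor, anchor, alpha=0.2)
```

When all three embeddings are identical, both distances are zero and the loss equals alpha, so a network that maps every image to the same point is still penalized. This is the sense in which the margin rules out the trivial solutions.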
Finally, I had to go back and fix a bug in the random slide generation that was causing text to appear as rectangles on the slide. I made a Stack Overflow post to try to get help (https://stackoverflow.com/questions/77467424/getting-checkerboard-output-when-drawing-text-on-pil-image), but I ended up figuring it out on my own: the cause was one specific font among those I was randomly selecting from, so once I removed it from the list of fonts to choose from, the problem was solved. See the team status report for some images of this.
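The fix amounts to filtering the font list before the random choice. A minimal sketch, assuming the generator picks a font file name at random for each slide; the font names below are placeholders (the post does not name the font that was at fault), and `pick_font` is a hypothetical helper, not the actual generation code.

```python
import random

# Placeholder font file names, for illustration only.
CANDIDATE_FONTS = ["FontA.ttf", "FontB.ttf", "FontC.ttf"]

# Fonts known to render text as rectangles; "FontB.ttf" is a stand-in here.
EXCLUDED_FONTS = {"FontB.ttf"}


def pick_font(rng, candidates=CANDIDATE_FONTS, excluded=EXCLUDED_FONTS):
    """Return a random font name, skipping any font on the exclusion list."""
    usable = [name for name in candidates if name not in excluded]
    if not usable:
        raise ValueError("no usable fonts left after exclusions")
    return rng.choice(usable)


# A seeded generator keeps slide generation reproducible between runs.
rng = random.Random(0)
chosen = [pick_font(rng) for _ in range(100)]
```

Keeping the exclusions in a separate set, rather than deleting the entry from the list, leaves a record of which fonts were found to be bad.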
In terms of schedule, I am a little behind, but with the extra time we have next week I expect to catch up; I will use Wednesday through Friday to continue working on the image-slide matching model. In terms of deliverables, I hope to complete the image-slide matching model, train it on our gathered image data, and show some results here next week.