James’s Status for 10/23

Last week, I mainly focused on the design review report. I didn’t get it as far along or as polished as I had hoped, which means more work for us when we update it for the final document. I ended up writing sections 1, 2, 3, 3.2, 4.1, 4.2, 5.1, 6.1, 9, 9.1, the acronym glossary, the BoM, and the references. I was unable to write more, or to coordinate with my partners to write more in their sections, but we know we handled this badly, and we will aim to rectify it before the final report comes around.

This week, I focused more on optimising CNN operations on the FPGA. This is a little out of order, but it works much more synergistically with where we are in reconfig right now. So far I have increased throughput (on a basic fast-CNN implementation) by 25% to 20 MOp/s, and I expect to settle roughly two orders of magnitude higher than where I am now. I also helped Kunal ramp up on Vitis, as he was slipping behind on getting up to speed with the platform: I shared excerpts from the Vitis tutorials we were given for reconfig and pointed him more directly towards online Vitis resources. I need to circle back with him to check his progress on I/O, and plan to do so this coming Monday. This may affect the Gantt chart/schedule, but we have the slack to allow for it for now. I will be keeping tabs on how much slack I use in my tasks, because I know I have begun cutting things close with the amount of remaining slack I am allotting myself.

Kunal’s Status Update (10/16)

I looked into the HLS directives and methods needed to write a pipelined matrix multiplier in Vivado HLS on the Ultra96 FPGA. The process seems relatively straightforward, apart from the different components involved in designing and maintaining the buffers that implement the various pipeline stages. These parameters will need to be benchmarked and tuned for optimal feed-forward latency. Since the model is fully trained, the only latency that matters is the forward (inference) path through the pipeline. I’m going to continue this work through the week and keep everything under proper source control.
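As a sanity check while tuning the kernel, a software golden reference is handy for comparing against C simulation and co-simulation output. Below is a minimal NumPy sketch of the blocked matrix multiply the HLS kernel would implement; the tile size is a placeholder and would need to match the kernel’s on-chip buffer dimensions.

```python
import numpy as np

TILE = 16  # placeholder; must match the HLS kernel's on-chip buffer size

def matmul_reference(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Blocked matrix multiply mirroring the tiling the HLS kernel would use."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)
    for i0 in range(0, M, TILE):
        for j0 in range(0, N, TILE):
            for k0 in range(0, K, TILE):
                C[i0:i0 + TILE, j0:j0 + TILE] += (
                    A[i0:i0 + TILE, k0:k0 + TILE] @ B[k0:k0 + TILE, j0:j0 + TILE]
                )
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 64)).astype(np.float32)
    B = rng.standard_normal((64, 64)).astype(np.float32)
    # The same comparison would be run against the kernel's output dump.
    assert np.allclose(matmul_reference(A, B), A @ B, atol=1e-3)
```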

Joshua’s Status Report for 10/9/21

This week, I spent a lot of time incorporating our new design change of using SSIM instead of VMAF as our metric, rewriting some of the code I had locally and doing more benchmarking with SSIM. The results were very satisfactory: SSIM was much faster than VMAF and suited the training portion of our project much better. I also experimented with the idea of using the previous/next frames as side inputs, but decided against it due to the complexity and, more importantly, the little added value compared to a purely frame-by-frame upscaling method. I also worked on the design review with my teammates to refine the details of our implementation and of the project overall.
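For reference, here is a minimal sketch of how SSIM can act as the training loss. We have not finalised the training framework, so the TensorFlow/Keras usage is illustrative only; frames are assumed to be normalised to [0, 1].

```python
import tensorflow as tf

def ssim_loss(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """1 - mean SSIM over the batch (inputs assumed in [0, 1], shape NHWC)."""
    return 1.0 - tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))

# Hypothetical usage once the upscaling model exists:
# model.compile(optimizer="adam", loss=ssim_loss)
```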

For the next week, I will spend the majority of my time writing code on AWS, finishing up my model and starting to train it on our chosen dataset. Since I have several midterms next week, I will have to balance my time well and coordinate with my team to make sure we stay on schedule.

Note: This individual status report was published late due to a technical issue. The initial draft shows it was created last Saturday, before the due date, but was not published and not fully saved.

Kunal’s Status Update (10/9)

This week, I looked at the specifications of our platform and thought about ways to implement I/O that preserve the real-time aspect of our deep learning system. One approach involves copying blocks of data directly from the USB bus into main memory, so that the ARM core can start processing frames immediately and forwarding them to the FPGA’s core logic. It is essentially DMA-based, with some parameter tuning required to optimize the system for low latency. I’m planning to do this work this coming week, after a couple of midterms.
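To make the idea concrete, here is a rough sketch of the frame-streaming path, assuming a PYNQ-style flow with an AXI DMA engine in the overlay. The bitstream name, IP name, and buffer sizes are placeholders, not the final design.

```python
import numpy as np
from pynq import Overlay, allocate

ol = Overlay("upscaler.bit")   # placeholder bitstream name
dma = ol.axi_dma_0             # placeholder AXI DMA instance name

FRAME_BYTES = 1920 * 1080 * 3  # one 1080p RGB frame

# Physically contiguous buffers the DMA engine can reach without extra copies.
in_buf = allocate(shape=(FRAME_BYTES,), dtype=np.uint8)
out_buf = allocate(shape=(FRAME_BYTES,), dtype=np.uint8)  # real output would be the upscaled size

def process_frame(frame: np.ndarray) -> np.ndarray:
    """Stream one decoded frame through the fabric and return the result."""
    in_buf[:] = frame.reshape(-1)
    dma.sendchannel.transfer(in_buf)
    dma.recvchannel.transfer(out_buf)
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    return np.array(out_buf).reshape(1080, 1920, 3)
```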

James’s Status for 10/9

This week was very busy for me in other courses, so I did not hit my targets. I didn’t get a chance to sync up with Kunal to see what I/O routines he has written, or to begin testing and validating them. Research on I/O, however, is mostly done; it wrapped up early in the week, on Monday, so that is one task I can check off for this week.

This coming week, I plan to grind on I/O and start on the CNN math. We also have the Design Review Report to flesh out further. Hopefully that doesn’t eat too much into my other tasks, but I don’t think it will, since we have ‘mid-semester break’, which I can use as a day to get things done.

Kunal’s Status Update (10/2)

This week we worked on identifying and fine-tuning metrics to define video resolution quality. We have identified VMAF as too slow for the latency-constrained environment our solution will operate in: as a loss function for the convolutional neural network we are building, its latency hit is too large for a real-time application like this one. With that in mind, we have identified SSIM as a viable heuristic for quantifying image quality. SSIM compares local luminance, contrast, and structure between frames, and pairs naturally with a mean-squared-error term for quantifying discrepancies between video frames.

We benchmarked the VMAF algorithm on two identical 1080p videos, and it took roughly one minute, which is far too long for a loss function in a real-time ML pipeline for video upscaling. Hence, we are going with SSIM and a mean-squared-error approach for the loss function in this system. We have also benchmarked SSIM, and it fits within the maximum-latency threshold we defined for a loss function, so we will use this heuristic as our measure of upscaling quality for our deep-learning-based approach to real-time video upscaling.
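A per-frame timing comparison along these lines can back up the claim. The sketch below uses scikit-image on synthetic 1080p frames, so the exact scores and timings are illustrative only.

```python
import time
import numpy as np
from skimage.metrics import structural_similarity, mean_squared_error

# Two synthetic 1080p greyscale frames stand in for decoded video frames.
rng = np.random.default_rng(0)
ref = rng.random((1080, 1920)).astype(np.float32)
test = np.clip(ref + 0.05 * rng.standard_normal(ref.shape).astype(np.float32), 0, 1)

start = time.perf_counter()
ssim_score = structural_similarity(ref, test, data_range=1.0)
ssim_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
mse_score = mean_squared_error(ref, test)
mse_ms = (time.perf_counter() - start) * 1000

print(f"SSIM={ssim_score:.4f} ({ssim_ms:.1f} ms/frame), MSE={mse_score:.5f} ({mse_ms:.2f} ms/frame)")
```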

Joshua’s Status Update for 10/2/21

This week, our team met with Joel and Byron and discussed our progress on the project. We went into detail about the specific use of VMAF in the training portion of the CNN, discussed various problems that may arise with its use, and came up with several solutions. Byron reiterated the importance of traditional DSP methods, specifically wanting us to justify and confirm why using a CNN would be superior to those methods, and I incorporated that into our design presentation.

Since we didn’t receive AWS credits until late in the week, I downloaded our intended dataset and attempted to benchmark VMAF, as well as Anime4K (a GitHub project with similarities to ours), locally to see how they would perform. With so many different videos available on the website supplying our dataset (CDVL), I ran into a slight issue with the difference between 1080i and 1080p, as well as with the FPS of the videos. After discussing it with James, I compiled a list of videos that were 1080p @ 30 FPS and worked with my team members to successfully benchmark VMAF and Anime4K. Our development of the Python training code was delayed, since we couldn’t start until much later in the week, and we intend to catch up on that as soon as our design presentation is concluded.

I also met with TAs Joel and Edward outside of class hours to further discuss our project and refine the details in preparation for our presentation. In addition, I wrote the team status report for this week and worked on the design presentation with my team members.

Team Status Report for 10/2/21

This week, we met with Byron and Joel again to discuss the project, specifically to address comments from the feedback on our proposal presentation and to follow up on our first meeting. During the meeting, we addressed the concerns about using VMAF as a metric for our training, as well as our dataset and some other points that weren’t fully justified during the presentation. Byron commented that we have to make sure implementing a CNN really is better than traditional DSP methods, so that implementing something much harder is still the best choice. To that end, we benchmarked both VMAF and Anime4K (a GitHub project that does something similar, specifically for animation) and obtained concrete, quantitative measurements which we can elaborate on in our design presentation to fully justify our design choice.

Joel also raised a good point: comparing upscaled, lower-resolution videos against original, native-resolution videos will always result in a lower score. We addressed that by limiting our training to comparisons between videos that have been upscaled to the native resolution, e.g. 1080p to 1080p. We also talked about the importance of benchmarking as soon as possible, which we successfully did this week.

Although our team members were slightly overwhelmed by work from other classes throughout the week, we caught up sufficiently by meeting after class hours and communicating to make sure our tasks were still completed on time. James and Kunal continued their research on I/O and calculated specific quantitative measurements for our design presentation, while I continued my research into VMAF and into the model being used for training our upscaler. Referring back to our Gantt chart/schedule, we were slightly behind on developing the Python code for training our own CNN, as we only received AWS credits on Friday morning, but we used that surplus time efficiently by benchmarking locally and researching Anime4K in more detail. Per the feedback from Tamal, we are taking more seriously the risk of our CNN not working, or not being developed in time, on the software side; our backup plan is to use the CNN implemented in Anime4K and start implementing that in hardware if we cannot get ours working on the software side after Week 7. We’ve changed our schedule/Gantt chart to reflect that.

Looking further into the peer/instructor feedback, we see that there were a lot of comments about the absence of justification for our FPGA selection in the proposal presentation. We’ve focused on elaborating on that choice much more for our design presentation, and we are similarly going into much more detail in our software section, as well as our quantitative requirements.

Overall, despite some things not going as planned this week, we believe our team was very successful in overcoming the problems we encountered, and our initial planning, which allowed slack time for small delays, proved useful. We look forward to delivering a well-prepared presentation on Monday that addresses all the feedback from our previous one, and to continued progress on our project.

James’s Status for 10/2

This week I continued research on I/O for the Ultra96. I was able to find example code for video in and video out, which I will need to modify to work with video file input; the example used a video stream in and 1080p@60FPS video out. Looking at the specs of the board and the available training data that Joshua found, we decided to change our spec framerate to 30FPS. There are more video datasets at 30FPS than at our original choice of 24FPS, largely because of the difference between “p” and “i” formats, which we initially overlooked: progressive (“p”) formats are standard on modern pixel-based displays, while interlaced (“i”) formats are a holdover from when video was broadcast interlaced.
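On the software side, pulling frames from a file rather than a live stream is straightforward. Below is a minimal sketch using OpenCV (an assumption on our part); the file name and hand-off function are placeholders.

```python
import cv2

def send_to_fpga(frame):
    """Placeholder for the hand-off into the upscaling pipeline."""
    pass

cap = cv2.VideoCapture("input_1080p30.mp4")  # placeholder 1080p@30FPS source
print(f"Input reports {cap.get(cv2.CAP_PROP_FPS):.1f} FPS")

while True:
    ok, frame = cap.read()   # frame is an HxWx3 BGR array
    if not ok:
        break
    send_to_fpga(frame)

cap.release()
```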

I was also able to get communications set up between the ARM core and the FPGA fabric of the Ultra96. This ended up being a prerequisite for some of the setup I had to do while looking at I/O, so in this case I was able to check off something we had planned for further down the line in our Gantt chart, which is always a good thing.
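For illustration, the register-level handshake looks something like the sketch below, assuming a PYNQ-style flow; the base address and register offsets are placeholders that would come from the Vivado address editor for our block design.

```python
from pynq import MMIO

CTRL_BASE = 0xA0000000   # placeholder AXI-Lite base address
CTRL_RANGE = 0x1000
REG_START = 0x00         # placeholder register offsets
REG_STATUS = 0x04

ctrl = MMIO(CTRL_BASE, CTRL_RANGE)

ctrl.write(REG_START, 1)                 # tell the fabric to start
while ctrl.read(REG_STATUS) & 0x1 == 0:  # poll until it reports done
    pass
```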

I started our slideshow for the design presentation and began the draft of the report as well. I have been working on both, and running metrics to have hard data to present, specifically runs comparing MSE, SSIM, and VMAF to motivate the choice of VMAF as a metric over the two more well-known ones.

Kunal’s Status Update (9/25)

This week, we looked at the various compute infrastructures we could use to train our CNN-based video upscaling algorithm. We have narrowed the compute down to either an AWS SageMaker instance or the private GPU cloud instances that AWS also offers. This will let model training happen much more efficiently, so we can then take the trained model and write it directly onto the FPGA for cycle-level optimization. Without a hardware-based FPGA implementation, the convolution and gradient-descent operations would take a significant number of cycles on a Jetson or other embedded platform. We believe that writing them directly in hardware will significantly improve inference latency for this task. It’s more of an exercise in ASIC engineering and hardware design coupled with machine learning.