Kunal’s Status Update (11/20)

This week I finished the implementation of the iterative, frame-based processing of the video file that is fed into the hardware implementation of the algorithm. The code iterates over the frames of a video file and builds the buffers and necessary metadata to pass each frame along to the FPGA with as little latency as possible. The pipeline is designed to be real-time in the sense that, as frames are received back from the CNN, they are forwarded directly to the video port of the Ultra96 board. The video port is a peripheral that I exposed via a Xilinx library for host-to-device communication, with a video device as the output. This part needs extensive testing, along with the latency improvements for the CNN model itself. I’ve been working with James to coordinate these changes in our overall system. We are planning on squashing the layers of the CNN into a single-layer super-resolution network so that the model fits optimally on the FPGA. This is an ongoing effort, and I will be looking at it throughout Sunday and the following week. A sketch of the per-frame loop is below.
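
As a reference for testing, here is a minimal sketch of that loop. It assumes OpenCV for decode; upscale_on_fpga and write_to_video_port are hypothetical placeholders for the Xilinx host-to-device round trip and the video-port write, which live in our host program.

```cpp
#include <opencv2/opencv.hpp>

// Placeholder for the host-to-device round trip: in the real system this
// hands the packed frame to the FPGA and reads the upscaled result back.
static cv::Mat upscale_on_fpga(const cv::Mat& frame) {
    cv::Mat out;
    cv::resize(frame, out, cv::Size(), 2.0, 2.0);  // stand-in for the CNN
    return out;
}

// Placeholder for the Xilinx video-port write; here we just display.
static void write_to_video_port(const cv::Mat& frame) {
    cv::imshow("out", frame);
    cv::waitKey(1);
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    cv::VideoCapture cap(argv[1]);   // open the input video file
    if (!cap.isOpened()) return 1;

    cv::Mat frame;
    while (cap.read(frame)) {        // iterate frame by frame
        write_to_video_port(upscale_on_fpga(frame));
    }
    return 0;
}
```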


Kunal’s Status Update (11/13)

This week I worked on pinning down the frame-by-frame analysis of the AVI file. The code for this is written and tracked in our source control. I’ve also looked into how to do this iteratively through OpenCV and its Mat processing. This coming week I will test that implementation and move on to benchmarking I/O speed on real video. Because the computation is iterative, it requires additional buffering mechanisms to sustain the streaming rate, so verifying that rate is the priority. From there, porting these streaming mechanisms to the FPGA is another goal for the coming week. A sketch of the decode benchmark is below.
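
A minimal sketch of the I/O benchmark I have in mind: time how fast OpenCV can iterate an AVI’s frames into Mats. The input path is a placeholder.

```cpp
#include <chrono>
#include <cstdio>
#include <opencv2/opencv.hpp>

int main(int argc, char** argv) {
    cv::VideoCapture cap(argc > 1 ? argv[1] : "input.avi");
    if (!cap.isOpened()) return 1;

    cv::Mat frame;
    int frames = 0;
    auto t0 = std::chrono::steady_clock::now();
    while (cap.read(frame)) ++frames;   // decode every frame into a Mat
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%d frames in %.2f s (%.1f fps)\n", frames, secs, frames / secs);
    return 0;
}
```

Comparing the measured decode fps against the source video’s native fps tells us how much headroom the streaming side has before the FPGA even enters the picture.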

Kunal’s Status Update (11/6)

I wrote the I/O portion that takes an AVI file and deconstructs it into its individual frames for running through the neural network placed on the FPGA. Most of the source code is up to date in our git repository, but I still need to test dependencies for the Boost and OpenCV libraries; there may be versioning issues across distributions. In any case, I will be taking a look at that tomorrow and throughout the week. We are close to being able to send frames to the FPGA, which will run the machine learning algorithm to super-resolve each image. The speed at which frames transfer to the FPGA is also crucial, so I’m working on a circular buffer implementation that can absorb the rate difference between the FPGA unit and the speed at which the video arrives over the USB channel. I am currently testing these mechanisms out; a sketch of the buffer is below.
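
A minimal sketch of that circular buffer, assuming one producer thread (USB video decode) and one consumer thread (the FPGA feed); the class name and capacity are illustrative.

```cpp
#include <condition_variable>
#include <mutex>
#include <opencv2/core.hpp>
#include <vector>

// Fixed-capacity ring of frames shared between the decode and FPGA threads.
class FrameRing {
public:
    explicit FrameRing(size_t capacity) : slots_(capacity) {}

    // Producer: blocks while the ring is full (the FPGA side is behind).
    void push(const cv::Mat& frame) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return count_ < slots_.size(); });
        slots_[head_] = frame.clone();           // own a copy of the frame
        head_ = (head_ + 1) % slots_.size();
        ++count_;
        not_empty_.notify_one();
    }

    // Consumer: blocks while the ring is empty (decode is behind).
    cv::Mat pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return count_ > 0; });
        cv::Mat frame = slots_[tail_];
        tail_ = (tail_ + 1) % slots_.size();
        --count_;
        not_full_.notify_one();
        return frame;
    }

private:
    std::vector<cv::Mat> slots_;
    size_t head_ = 0, tail_ = 0, count_ = 0;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};
```

Blocking on both ends keeps the decode side from overwriting frames the FPGA has not consumed yet, which is exactly the rate-difference synchronization described above.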

Kunal’s Status Update (10/30)

This week I (Kunal), along with James, worked on the I/O portion of the project. I’m currently building a mechanism to take MP4 data and break it down into frames for the upscaling algorithm implemented and placed on the FPGA. Fundamentally, our algorithm depends on streaming and deconstructing video, from which the upscaling work is then ported onto the FPGA unit. I’m implementing this in C++ through OpenCV and its libraries. I’m also writing the host program that communicates with the FPGA in C++; it will be the core of how the real-time streaming aspect of this system works. The implementation involves pinning devices to a registry; in our case we have only one external device, our CNN implementation on the hardware. This is an iterative process that checks whether a peripheral device is programmable; if it is, I break out of the loop and use that context to send kernels of data to the FPGA unit. The rate at which the MP4 is decoded into frames, and then the rate at which those frames hit the FPGA, are the key metrics I’m optimizing for real-time streaming. Ideally we want to maintain the fps of the video as it is streamed into the FPGA unit. For that to happen, MP4 decoding latency is going to be crucial, and we will be benchmarking this part of the real-time streaming pipeline extensively. A sketch of the device-discovery loop is below.
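
A hedged sketch of that discovery loop, written against the standard OpenCL C API that the Xilinx runtime exposes. Treating "programmable peripheral" as an OpenCL accelerator device is my assumption here, and error handling is trimmed.

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    // Walk the registry of platforms looking for our one external device.
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    cl_device_id device = nullptr;
    for (cl_platform_id p : platforms) {
        cl_uint n = 0;
        // FPGA devices report as accelerators under the Xilinx runtime.
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, &n)
                == CL_SUCCESS && n > 0) {
            break;  // found the CNN device: stop iterating
        }
        device = nullptr;
    }
    if (!device) return 1;

    // Use this context to build the program and send kernels of frame data.
    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    if (err != CL_SUCCESS) return 1;
    std::printf("accelerator context created\n");
    clReleaseContext(ctx);
    return 0;
}
```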

Kunal’s Status Update (10/16)

I looked into the HLS language directives and methods for writing a pipelined matrix multiplier in Vivado HLS for the Ultra96 FPGA. The process seems relatively straightforward, aside from the design and maintenance of the buffers implementing the various pipeline stages. These parameters will need to be benchmarked and tuned to achieve optimal feed-forward latency. Since this is a fully trained model, the only latencies that matter are those of the forward pass through the inference pipeline. I’m going to continue this work through the week and keep everything under proper source control. A sketch of the pipelined multiplier is below.
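
A minimal Vivado HLS sketch of the pipelined multiplier; the 16x16 tile size, function name, and partitioning scheme are assumptions, to be replaced by the benchmarked parameters.

```cpp
const int N = 16;  // assumed tile size; real dims come from the CNN layers

void matmul(const float A[N][N], const float B[N][N], float C[N][N]) {
// Partition the input buffers so all N multiply operands are readable at once.
#pragma HLS ARRAY_PARTITION variable=A complete dim=2
#pragma HLS ARRAY_PARTITION variable=B complete dim=1
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
// Pipeline the output loop; the inner reduction unrolls automatically.
#pragma HLS PIPELINE II=1
            float acc = 0;
            for (int k = 0; k < N; k++) {
                acc += A[i][k] * B[k][j];
            }
            C[i][j] = acc;
        }
    }
}
```

With II=1 the output loop accepts a new iteration every cycle once the pipeline fills, which is the feed-forward latency behavior we want to measure and tune.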

Kunal’s Status Update (10/9)

This week, I looked at the specifications of our platform and thought through ways to implement I/O that preserve the real-time aspect of our deep learning system. One approach involves copying blocks of data directly from the USB bus into main memory so that the ARM core can start processing frames immediately and forwarding them to the FPGA’s core logic. It is loosely based on DMA, with some parameter tuning to optimize the system for low latency. I’m planning on doing this work this coming week, after a couple of midterms. A sketch of the block-copy idea is below.
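
A hedged sketch of the block-copy idea using a plain POSIX read loop; the device node and block size are placeholders, not final parameters.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main() {
    const size_t kBlock = 1 << 16;           // 64 KiB per copy, to be tuned
    int fd = open("/dev/video0", O_RDONLY);  // assumed USB-backed device node
    if (fd < 0) return 1;

    // Preallocated main-memory buffer the ARM core works out of.
    std::vector<unsigned char> buf(kBlock);
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0) {
        // ... hand the block to the frame-assembly / FPGA-forwarding path ...
    }
    close(fd);
    return 0;
}
```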

James’s Status Update (10/9)

This week was very busy for me in other courses, so I did not hit my targets. I didn’t get a chance to sync up with Kunal to see what I/O routines he has written or to begin testing and validating them. The research on I/O, however, is mostly done; it wrapped up early in the week, on Monday, so that is one task I can check off for this week.

This coming week, I plan to grind on I/O and start on the CNN math. We also have the Design Review Report to further flesh out. Hopefully that doesn’t take too much time away from my other tasks, but I don’t think it will, since we have ‘mid-semester break’, which I can use as a day to get things done.

Kunal’s Status Update (10/2)

This week we worked on identifying and fine-tuning metrics to define video resolution quality. We have found VMAF to be too slow for the latency-constrained environment our solution will operate in: as a loss function for the convolutional neural network we are building, its latency hit is too large for a real-time application like this one. With that in mind, we have identified SSIM as a viable metric for quantifying image quality. SSIM compares local pixel statistics (means, variances, and covariance) between two frames, which makes it cheap to compute alongside a mean-squared-error term while still quantifying structural discrepancies between video frames.

We benchmarked the VMAF algorithm on two identical videos at 1080p, and it took roughly one minute; that is far too long for a loss function in a real-time ML pipeline for video upscaling. Hence, we are going with SSIM plus a mean-squared-error approach for the loss function in this system. We have benchmarked SSIM, and it fits within the maximum latency threshold we defined for a loss function. We are going to use this metric as our measure of upscaling quality for our deep-learning-based approach to real-time video upscaling. A sketch of the SSIM computation is below.
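
For reference, this is the SSIM computation we benchmarked, in the standard OpenCV formulation with Gaussian-windowed local statistics; the 11x11 window and the C1/C2 constants are the conventional values for 8-bit frames.

```cpp
#include <cstdio>
#include <opencv2/opencv.hpp>

// Mean SSIM between two same-sized frames, one value per channel.
cv::Scalar getMSSIM(const cv::Mat& i1, const cv::Mat& i2) {
    const double C1 = 6.5025, C2 = 58.5225;  // (0.01*255)^2, (0.03*255)^2
    cv::Mat I1, I2;
    i1.convertTo(I1, CV_32F);
    i2.convertTo(I2, CV_32F);

    cv::Mat I1_2 = I1.mul(I1), I2_2 = I2.mul(I2), I1_I2 = I1.mul(I2);

    // Local means via an 11x11 Gaussian window.
    cv::Mat mu1, mu2;
    cv::GaussianBlur(I1, mu1, cv::Size(11, 11), 1.5);
    cv::GaussianBlur(I2, mu2, cv::Size(11, 11), 1.5);
    cv::Mat mu1_2 = mu1.mul(mu1), mu2_2 = mu2.mul(mu2), mu1_mu2 = mu1.mul(mu2);

    // Local variances and covariance.
    cv::Mat sigma1_2, sigma2_2, sigma12;
    cv::GaussianBlur(I1_2, sigma1_2, cv::Size(11, 11), 1.5);
    sigma1_2 -= mu1_2;
    cv::GaussianBlur(I2_2, sigma2_2, cv::Size(11, 11), 1.5);
    sigma2_2 -= mu2_2;
    cv::GaussianBlur(I1_I2, sigma12, cv::Size(11, 11), 1.5);
    sigma12 -= mu1_mu2;

    // SSIM map = ((2*mu1*mu2 + C1)*(2*sigma12 + C2)) /
    //            ((mu1^2 + mu2^2 + C1)*(sigma1^2 + sigma2^2 + C2))
    cv::Mat t1 = 2 * mu1_mu2 + C1;
    cv::Mat t2 = 2 * sigma12 + C2;
    cv::Mat t3 = t1.mul(t2);
    t1 = mu1_2 + mu2_2 + C1;
    t2 = sigma1_2 + sigma2_2 + C2;
    t1 = t1.mul(t2);

    cv::Mat ssim_map;
    cv::divide(t3, t1, ssim_map);
    return cv::mean(ssim_map);
}

int main(int argc, char** argv) {
    if (argc < 3) return 1;
    cv::Mat a = cv::imread(argv[1]), b = cv::imread(argv[2]);
    if (a.empty() || b.empty() || a.size() != b.size()) return 1;
    cv::Scalar s = getMSSIM(a, b);
    std::printf("SSIM per channel: %.4f %.4f %.4f\n", s[0], s[1], s[2]);
    return 0;
}
```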

Kunal’s Status Update (9/25)

This week, we looked at the various compute infrastructures we could use to train our CNN-based video upscaling algorithm. We have narrowed the compute down to either an AWS SageMaker instance or a private GPU cloud that AWS also offers. This will make model training much more efficient, and we can then take the trained model and implement it directly on an FPGA for cycle-level optimization. Without a hardware-based FPGA implementation, the convolution and gradient descent operations would take a significant number of cycles on a Jetson or other embedded platform. We believe that writing the model directly in hardware will significantly improve inference latency for this task. It’s more of an exercise in ASIC engineering and hardware design coupled with machine learning.