We finished benchmarking the latencies of our FSRCNN model, and I wrote a profiler that takes upscaled frames and builds metrics quantifying image quality. We will use this in the final demo to compare the latencies of the software model against the implementation running on the hardware device. The computation involves a scan over the pixel set and a buffering mechanism for the previous pixels fed into the pipeline. We are actively tuning the model and analyzing our results; we expect to reach a significantly lower latency and to profile the results much more efficiently.
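As a rough illustration, here is a minimal sketch of the kind of per-frame metric the profiler computes, assuming OpenCV on the host. The function name and the choice of PSNR as the example metric are illustrative rather than the profiler's exact code:

```cpp
#include <opencv2/core.hpp>
#include <cmath>

// Per-frame quality metric (PSNR): higher is better, INF means identical.
double framePSNR(const cv::Mat& reference, const cv::Mat& upscaled) {
    cv::Mat diff;
    cv::absdiff(reference, upscaled, diff);      // |ref - up| per pixel
    diff.convertTo(diff, CV_32F);
    diff = diff.mul(diff);                       // squared error
    cv::Scalar s = cv::sum(diff);                // per-channel sums
    double sse = s[0] + s[1] + s[2];
    double mse = sse / static_cast<double>(diff.total() * diff.channels());
    if (mse < 1e-10) return INFINITY;            // frames are identical
    return 10.0 * std::log10((255.0 * 255.0) / mse);
}
```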
Kunal’s Status Update (11/20)
This week I finished the implementation of the iterative, frame-based processing of the video file that is fed into the hardware implementation of the algorithm. It iterates over the frames in a given video file and builds the buffers and necessary metadata to pass each frame along to the FPGA at as low a latency as possible. The computation is intended to be real-time in the sense that, as frames come back from the CNN, they are forwarded directly to the video port of the Ultra96 board. The video port is a peripheral that I exposed via a Xilinx library that handles host-to-device communication, with a video device as the output. This part needs extensive testing, along with the latency improvements for the CNN model itself. I've been working with James to coordinate these changes in our overall system. We are planning to squash the layers of the CNN into a 1-layer super-resolution network so the model fits optimally on the FPGA. This is an ongoing effort, and I will be looking at it throughout Sunday and the following week.
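At a high level, the frame-forwarding loop looks something like the sketch below; runCnnOnFpga() and video_port_write() are hypothetical stand-ins for the accelerator call and the Xilinx video-port write, since the exact API depends on the library version:

```cpp
#include <opencv2/opencv.hpp>
#include <string>

cv::Mat runCnnOnFpga(const cv::Mat& frame);           // hypothetical: FPGA inference call
void video_port_write(const void* buf, size_t bytes); // hypothetical: Xilinx video-port write

void streamFrames(const std::string& path) {
    cv::VideoCapture cap(path);
    cv::Mat frame;
    while (cap.read(frame)) {                         // iterate frame by frame
        cv::Mat up = runCnnOnFpga(frame);             // super-resolved frame from the CNN
        video_port_write(up.data,                     // forward raw pixels straight to
                         up.total() * up.elemSize()); // the Ultra96 video output
    }
}
```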
Kunal’s Status Report (11/13)
This week I worked on pinning down the frame analysis of the AVI file. I have code written for this and have it tracked in our source control. I've also looked into how to do the analysis iteratively through OpenCV and its Mat processing. This week I will test that implementation and then move on to benchmarking I/O speed on real video. The computation is iterative and requires additional buffering to sustain the streaming rate without dropping frames. From there, porting these streaming mechanisms to the FPGA is the goal for the coming week.
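For the I/O benchmarking, a harness along these lines should give a baseline decode rate; it assumes OpenCV's VideoCapture and measures host-side decode only, not the FPGA transfer:

```cpp
#include <opencv2/opencv.hpp>
#include <chrono>
#include <iostream>

int main(int argc, char** argv) {
    if (argc < 2) return 1;                    // usage: ./bench input.avi
    cv::VideoCapture cap(argv[1]);
    cv::Mat frame;
    int frames = 0;
    auto start = std::chrono::steady_clock::now();
    while (cap.read(frame)) ++frames;          // decode every frame in the file
    auto end = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(end - start).count();
    std::cout << frames / secs << " decoded frames/sec\n";
}
```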
Kunal’s Status Update (11/6)
I wrote the I/O portion that takes an AVI file and deconstructs it into its native frames for running through the neural network placed on the FPGA. Most of the source code is up to date in our Git repository, but I still need to test the dependencies for the Boost & OpenCV libraries; there may be versioning issues across distributions. In any case, I will be taking a look at that tomorrow and throughout the week. We are close to being able to send frames to the FPGA, which will run the machine learning algorithm to super-resolve each image. The speed at which frames transfer to the FPGA is also crucial, so I'm working on a circular buffer implementation that can bridge the rate difference between the FPGA's clock and the rate at which video arrives over the USB channel. I am currently testing these mechanisms.
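The circular buffer is a single-producer/single-consumer ring along the lines sketched below; the capacity, and the choice between dropping and stalling when the ring is full, are tuning assumptions rather than settled decisions:

```cpp
#include <array>
#include <atomic>
#include <opencv2/core.hpp>

// SPSC ring decoupling USB ingest rate (push) from FPGA consumption rate (pop).
template <size_t Capacity>
class FrameRing {
    std::array<cv::Mat, Capacity> slots_;
    std::atomic<size_t> head_{0};   // next slot to write (producer)
    std::atomic<size_t> tail_{0};   // next slot to read (consumer)
public:
    bool push(const cv::Mat& frame) {           // called at USB arrival rate
        size_t h = head_.load(std::memory_order_relaxed);
        if ((h + 1) % Capacity == tail_.load(std::memory_order_acquire))
            return false;                       // full: drop or stall upstream
        slots_[h] = frame.clone();
        head_.store((h + 1) % Capacity, std::memory_order_release);
        return true;
    }
    bool pop(cv::Mat& out) {                    // called at FPGA consumption rate
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire))
            return false;                       // empty: consumer waits
        out = slots_[t];
        tail_.store((t + 1) % Capacity, std::memory_order_release);
        return true;
    }
};
```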
Kunal’s Status Report (10/30)
This week I (Kunal), along with James, worked on the I/O portion of the project. I'm currently building a mechanism to take MP4 data and break it down into frames for the upscaling algorithm implemented & placed on the FPGA. Fundamentally, our algorithm depends on streaming & deconstructing the video, after which the frames are ported onto the FPGA for upscaling. I'm implementing this with OpenCV and its libraries in C++. I'm also writing the host program that communicates with the FPGA in C++; it will be the core of the real-time streaming aspect of this system. The implementation involves enumerating devices into a registry; in our case there is only one external device, our CNN implementation on the hardware. This is an iterative process that checks whether a peripheral device is programmable; if it is, I break out of the loop and use that context to send kernels of data to the FPGA. The rate at which the MP4 is decoded into frames, and the rate at which those frames reach the FPGA, are key metrics I'm optimizing for real-time streaming. Ideally we want to maintain the video's native FPS as it streams into the FPGA. For that to happen, MP4 decoding latency will be crucial, and we will benchmark this part of the real-time streaming pipeline extensively.
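The device-pinning loop follows the standard Xilinx Vitis OpenCL host pattern, roughly as sketched below; the xclbin path is a placeholder and error handling is elided:

```cpp
#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#include <CL/cl2.hpp>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Programs the first Xilinx device that accepts our CNN binary and returns
// the program handle; if programming succeeds we break out and use it.
cl::Program programFirstDevice(const std::string& xclbinPath) {
    std::ifstream f(xclbinPath, std::ios::binary);
    std::vector<unsigned char> bin((std::istreambuf_iterator<char>(f)),
                                   std::istreambuf_iterator<char>());
    cl::Program::Binaries bins{bin};

    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    for (auto& p : platforms) {
        if (p.getInfo<CL_PLATFORM_NAME>() != "Xilinx") continue;
        std::vector<cl::Device> devices;
        p.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);
        for (auto& d : devices) {
            cl::Context ctx(d);
            cl_int err;
            cl::Program prog(ctx, {d}, bins, nullptr, &err);
            if (err == CL_SUCCESS) return prog;   // device is programmable: done
        }
    }
    throw std::runtime_error("no programmable Xilinx device found");
}
```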
Team Status Report (10/30)
This week I (Kunal), along with James, worked on the I/O portion of the project. James helped me with host-side programming for the U96 board and provided various resources to look at so that I could get going with the implementation.
James attempted to make further performance gains on the CNN kernel, but was not as successful this week. However, he has worked out several strategies that should speed up our implementation and has been attempting to implement them. He also worked with Joshua to research pre-trained models, and set up a Git repository to help Kunal with his portion of the project.
Joshua researched pre-trained models along with James and also attempted to get AWS up and running. Our AWS request came back only partially fulfilled: we weren't granted enough vCPU quota to use our preferred instance type. After testing Google Colab, he purchased Google Colab Pro and decided to use that instead to speed up the training process.
In terms of the whole project, we almost have a working CNN model. Training should be done by around Wednesday/Thursday this week; James and I will be working extensively on CNN acceleration before then, and we will take the weights/hyperparameters from our software model and implement them this coming Friday/weekend to ensure everything is going smoothly. Overall, our project is around a week behind, but we are confident it will go smoothly, as we have built enough slack into our schedule and addressed the issues that were preventing the project from moving forward.
Kunal’s Status Update (10/23)
This past week I focused on building the core neural network architecture on the FPGA. I got ramped up with the Vivado HLS & Vitis platforms being used to implement the neural network on the FPGA. This week I'm planning to make more substantial progress with James on the bicubic interpolation algorithm and its implementation directly in hardware. I'm getting acquainted with pragma-based directives in HLS and will be exploring them in depth in the hardware implementation this week. We have been working on perfecting the model so we can see a noticeable increase in resolution; then we can look into how to implement the buffers across the various pipeline stages of the neural network design, which depends heavily on the number of layers and the architecture of the network itself. Once that is set in stone this week, I will get into the hardware details of the Ultra96 and its onboard FPGA, and will also set up and benchmark the frame relay rates from the USB bus. This week will mostly be focused on setting up the hardware-related infrastructure: getting bytes from the USB interface to the FPGA core and relaying acks back.
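For reference, the core of bicubic interpolation is a separable cubic convolution kernel; a software sketch of the weight function is below, using the common a = -0.5 (Catmull-Rom) choice. A hardware version would likely use fixed-point ap_fixed types and precomputed weights per phase offset rather than float:

```cpp
// Cubic convolution weight for a sample at distance x from the output pixel.
// Each output pixel blends a 4x4 neighborhood using these weights per axis.
float cubicWeight(float x) {
    const float a = -0.5f;                      // assumption: Catmull-Rom variant
    x = (x < 0) ? -x : x;                       // |x|
    if (x < 1.0f)
        return (a + 2.0f) * x * x * x - (a + 3.0f) * x * x + 1.0f;
    if (x < 2.0f)
        return a * x * x * x - 5.0f * a * x * x + 8.0f * a * x - 4.0f * a;
    return 0.0f;                                // taps beyond 2 pixels contribute nothing
}
```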
Kunal’s Status Update (10/16)
I looked into the HLS language directives & methods for writing a pipelined matrix multiplier in Vivado HLS for the Ultra96's FPGA. The process seems relatively straightforward, apart from the design & maintenance of the buffers implementing the various pipeline stages. These parameters will need to be benchmarked and tuned for optimal feed-forward latencies. Since the model arrives fully trained, the only latencies that matter are those of the feed-forward inference path. I'm going to continue this work through the week and keep everything under proper source control.
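A minimal Vivado HLS sketch of such a pipelined multiplier is below; the dimensions, partitioning, and II=1 target are illustrative assumptions rather than our final design:

```cpp
#define N 16

// Pipelined N x N matrix multiply: the inner dot product is fully unrolled,
// and the arrays are partitioned so all N multiplies can issue each cycle.
void matmul(const float A[N][N], const float B[N][N], float C[N][N]) {
#pragma HLS ARRAY_PARTITION variable=A complete dim=2
#pragma HLS ARRAY_PARTITION variable=B complete dim=1
Row:
    for (int i = 0; i < N; i++) {
Col:
        for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE II=1
            float acc = 0;
Dot:
            for (int k = 0; k < N; k++) {
#pragma HLS UNROLL
                acc += A[i][k] * B[k][j];   // fully unrolled inner product
            }
            C[i][j] = acc;                  // one result per cycle once the pipe fills
        }
    }
}
```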
Kunal’s Status Update (10/9)
This week, I looked at the specifications of our platform and thought through ways to implement I/O that preserve the real-time aspect of our deep learning system. One approach is to copy blocks of data directly from the USB bus into main memory so that the ARM core can immediately start processing frames and forwarding them to the FPGA's core logic. This is loosely based on DMA, with some parameter tuning to optimize the system for low latency. I plan to do this work in the coming week, after a couple of midterms.
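Conceptually, this is a ping-pong (double) buffer: while one block streams in, the ARM core works on the previous one. The dma_start/dma_wait/process_block names below are hypothetical placeholders for whatever driver interface we end up with:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

constexpr size_t kBlockBytes = 1 << 20;           // block size: a tuning parameter

void dma_start(uint8_t* dst, size_t bytes);       // hypothetical: kick off USB->DDR copy
size_t dma_wait();                                // hypothetical: block until copy completes
void process_block(const uint8_t* src, size_t n); // hypothetical: frame assembly on the ARM core

void ingestLoop() {
    static uint8_t bufA[kBlockBytes], bufB[kBlockBytes];
    uint8_t* fill = bufA;                         // being filled by the transfer engine
    uint8_t* work = bufB;                         // being processed by the CPU
    dma_start(fill, kBlockBytes);                 // prime the first transfer
    for (;;) {
        size_t n = dma_wait();                    // current block has landed
        if (n == 0) break;                        // end of stream
        std::swap(fill, work);
        dma_start(fill, kBlockBytes);             // overlap the next copy...
        process_block(work, n);                   // ...with processing this one
    }
}
```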
Kunal’s Status Update (10/2)
This week we worked on identifying & fine-tuning metrics to quantify video resolution quality. We have found VMAF to be too slow for the latency-constrained environment our solution will operate in: as a loss function for the convolutional neural network we are building, its latency hit is too large for a real-time application like this one. With that in mind, we have identified SSIM as a viable heuristic for quantifying image quality. SSIM is computed from local image statistics (means, variances, and covariance) and quantifies structural discrepancies between video frames; we pair it with a mean-squared-error term in the loss.
We benchmarked the VMAF algorithm on two identical 1080p videos, and it took roughly one minute, far too long for a loss function in a real-time ML pipeline for video upscaling. Hence, we are going with SSIM plus a mean-squared-error term for the loss function in this system. We have benchmarked SSIM, and it fits within the maximum latency threshold we defined for a loss function. We will use this heuristic as our measure of upscaling quality for our deep-learning approach to real-time video upscaling.
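For reference, our SSIM measurement follows the standard Gaussian-windowed formulation, roughly as sketched below. This assumes single-channel 8-bit frames of equal size; C1 and C2 are the usual constants for 8-bit depth:

```cpp
#include <opencv2/imgproc.hpp>

// Mean SSIM between two same-size 8-bit grayscale frames (1.0 = identical).
double frameSSIM(const cv::Mat& a8u, const cv::Mat& b8u) {
    const double C1 = 6.5025, C2 = 58.5225;     // (0.01*255)^2, (0.03*255)^2
    cv::Mat a, b;
    a8u.convertTo(a, CV_32F);
    b8u.convertTo(b, CV_32F);

    cv::Mat muA, muB;                            // local means (11x11 Gaussian window)
    cv::GaussianBlur(a, muA, cv::Size(11, 11), 1.5);
    cv::GaussianBlur(b, muB, cv::Size(11, 11), 1.5);

    cv::Mat sigA, sigB, sigAB;                   // local variances and covariance
    cv::GaussianBlur(a.mul(a), sigA, cv::Size(11, 11), 1.5);
    cv::GaussianBlur(b.mul(b), sigB, cv::Size(11, 11), 1.5);
    cv::GaussianBlur(a.mul(b), sigAB, cv::Size(11, 11), 1.5);
    sigA -= muA.mul(muA);
    sigB -= muB.mul(muB);
    sigAB -= muA.mul(muB);

    cv::Mat t1 = 2 * muA.mul(muB) + C1;
    cv::Mat t2 = 2 * sigAB + C2;
    cv::Mat num = t1.mul(t2);
    cv::Mat t3 = muA.mul(muA) + muB.mul(muB) + C1;
    cv::Mat t4 = sigA + sigB + C2;
    cv::Mat den = t3.mul(t4);

    cv::Mat ssimMap;
    cv::divide(num, den, ssimMap);               // per-pixel SSIM
    return cv::mean(ssimMap)[0];                 // average over the frame
}
```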