Kunal’s Status Report (10/30)

This week I (Kunal) worked with James on the I/O portion of the project. I'm currently building a mechanism to take mp4 data and break it down into frames for the upscaling algorithm implemented on the FPGA. Fundamentally, our algorithm depends on streaming and deconstructing video; the resulting frames are what get forwarded to the upscaling hardware on the FPGA unit. I'm implementing this with OpenCV and its libraries in C++. I'm also writing the host program that will communicate with the FPGA in C++, and it will be the core of how the real-time streaming aspect of this system works. The implementation involves enumerating the devices in a registry; in our case we only have one external device, our CNN implementation on the hardware. This is an iterative process that checks each peripheral device for programmability: once a programmable device is found, I break out of the loop and use that context to send kernels of data to the FPGA. The rate at which the mp4 is decoded into frames, and the rate at which those frames reach the FPGA, are important metrics that I'm optimizing for real-time streaming. Ideally we want to maintain the video's frame rate as it is streamed into the FPGA; at 30 fps, for example, that leaves an end-to-end budget of roughly 33 ms per frame. The latency of decoding the mp4 will therefore be crucial, and we will benchmark this part of the real-time streaming pipeline extensively.
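A minimal sketch of this host-side flow is below, assuming the Xilinx OpenCL platform on the U96; the bitstream file name (upscale.xclbin) and kernel name (srcnn_upscale) are placeholders rather than our final ones. It probes for a programmable accelerator, programs it, and then decodes frames with OpenCV:

#define CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY 1
#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#include <CL/cl2.hpp>
#include <opencv2/opencv.hpp>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    // Enumerate platforms and keep the first programmable accelerator
    // exposed by the Xilinx runtime (our only external device).
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    cl::Device fpga;
    bool found = false;
    for (auto &p : platforms) {
        if (p.getInfo<CL_PLATFORM_NAME>().find("Xilinx") == std::string::npos)
            continue;
        std::vector<cl::Device> devs;
        p.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devs);
        if (!devs.empty()) { fpga = devs[0]; found = true; break; }
    }
    if (!found) { std::cerr << "no FPGA device found\n"; return 1; }

    // Program the device with the compiled bitstream; if this succeeds,
    // we keep this context and stop probing devices.
    cl::Context ctx(fpga);
    cl::CommandQueue queue(ctx, fpga);
    std::ifstream f("upscale.xclbin", std::ios::binary); // placeholder name
    std::vector<unsigned char> bits(std::istreambuf_iterator<char>(f), {});
    cl::Program::Binaries bins{{bits.data(), bits.size()}};
    cl_int err;
    cl::Program prog(ctx, {fpga}, bins, nullptr, &err);
    if (err != CL_SUCCESS) { std::cerr << "programming failed\n"; return 1; }
    cl::Kernel upscale(prog, "srcnn_upscale"); // placeholder kernel name

    // Decode the mp4 into frames; each iteration would enqueue buffer
    // writes and a kernel launch, which we must sustain at the source fps.
    cv::VideoCapture cap("input.mp4");
    double fps = cap.get(cv::CAP_PROP_FPS);
    std::cout << "must sustain " << fps << " frames/s\n";
    cv::Mat frame;
    while (cap.read(frame)) {
        // TODO: copy frame.data to a device buffer and launch the kernel
    }
    return 0;
}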

Team Status Report (10/30)

This week I (Kunal) worked with James on the I/O portion of the project. James helped me with host-side programming for the U96 board and provided various resources to look at, so that I could get going with the implementation.

James attempted to make further gains on the CNN kernel, but was not as successful this week. However, he has worked out several strategies that should speed up our implementation and has been attempting to implement them. He also worked with Joshua to research pre-trained models, and he set up a Git repository to help Kunal with his portion of the project.

Joshua researched pre-trained models along with James and also attempted to get AWS up and running. Our limit-increase request came back only partially fulfilled: we weren't granted enough vCPUs to use our preferred instance type. After testing Google Colab, he purchased Colab Pro and decided to use that instead to speed up the training process.

In terms of the whole project, we almost have a working CNN model. Training should be done by around Wednesday/Thursday this week; James and I will be working extensively on CNN acceleration before then, and will then take the weights/hyperparameters from our software model and port them to the hardware implementation this coming Friday/weekend to ensure everything is going smoothly. Overall, our project is around a week behind, but we are confident it will go smoothly, as we have built in enough slack time and have addressed the issues that were preventing the project from moving forward.

James’s Status for 10/30

Since the problems we were having with AWS were reaching the critical path for the completion of this week's work, I helped Josh look for alternative pre-trained models in case AWS/training fell through. Pre-trained models do exist, but many are not exactly what we need for our use case. The models we found were 'rated' for up to 4x upscaling, meaning their performance would degrade at the 4.5x scaling factor we will be using. Additionally, many models had extra layers of DSP preprocessing which we do not plan to use. If our hand were forced, we have settled on an open-source SRCNN implementation found on GitHub that omits the extra preprocessing, knowing that this means we may not attain the picture-reconstruction accuracy we originally set out to achieve (since the model will only have been trained to restore well up to 4x).

This week I also further helped Kunal ramp up on host-side programming for the U96 board and pointed him toward various resources so he could get started on the implementation.

I also set up a git repository for the U96 Vitis project. As of now it only contains the vector-vector addition template example, as an aid to get Kunal started on programming the host. I tried making further incremental gains on the CNN kernel, but was unable to realise any more this week. On the bright side, I was able to rule out a good few strategies for speedup, so the design space is, at the very least, still converging. I think Kunal should be pretty much fully ramped by now, so I should have more time this coming week to further explore the design space for CNN acceleration.
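For reference, the vector-vector addition template kernel looks roughly like the following; this is paraphrased from memory of the standard Vitis example, so the exact pragmas and argument names in the repo may differ:

extern "C" void vadd(const int *a, const int *b, int *out, int n) {
    // AXI master ports for the data arrays, control via AXI-Lite.
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=n bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    // One addition per clock cycle once the pipeline fills.
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}

The same host-side handshake Kunal is writing (program the device, create the kernel, enqueue buffers) drives this kernel, which is what makes it a good first target.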

Joshua’s Status Report for 10/23/21

Last week, a.k.a. Week 7 (October 11th-17th), I worked on the design review report with my team: I wrote the Abstract and sections 5.2, 6.2, 7.2, 7.4, and 8, and I formatted and revised our Gantt chart and schedule. As my teammates have pointed out, the design report was clearly a poor reflection of our team as a whole. There were many parts of the project that we had explored in detail and should have added to the document, but we simply were not organized and could not meet the deadline, due to other classes, midterms, and a general lack of communication. My team and I hope this does not reflect too poorly on our group overall; we will address every concern raised in the design report evaluations, and we will have a much more refined, detailed, and accurate final report and project deliverable by the end of the semester.

To add to the work I accomplished last week, I have been working on the model locally and ironing out bugs. One very superficial but hard-to-find problem with conflicting Python library versions took almost three days to discover and solve. I also ran into a problem running the code on AWS. Although I had initially made an instance using the free tier before we received credits, I later refined our design and pinpointed the exact instance I wanted to use: a P3 instance with NVIDIA V100 GPUs, which should speed up our training process considerably. Unfortunately, we hadn't increased our vCPU limit, so AWS did not allow us to create the instance; we had to apply for an increase, and after a full week we still haven't gotten an answer.

Fortunately, we had prepared for setbacks like this and deliberately added two weeks of slack to our schedule. So although the software portion is now about a full week behind due to the complications above, I am still confident I can finish it in time for the hardware portion of the project to start and complete on time.

Looking ahead to the coming week, I am aiming to finish the software portion by the end of the week, since James and Kunal's progress depends heavily on it. To address the concerns about my lack of work after this week: I will be helping Kunal and James with the integration portion of our project as much as possible, while actively participating in the other remaining aspects of the project to ensure it goes smoothly.

Kunal’s Status Update (10/23)

This past week I focused on building the core neural network architecture on the FPGA. I ramped up on the Vivado HLS and Vitis platforms being used to implement the neural network. This week I'm planning to make more substantial progress with James on the bicubic interpolation algorithm and its implementation directly in hardware. I'm getting acquainted with pragma-based directives in HLS and will be exploring these in depth in the hardware implementation this week. We have been working on refining the model so that we see a noticeable increase in resolution; then we can look into how to implement the buffers across the various pipeline stages of the neural network design, which depends heavily on the number of layers and the architecture of the network itself. Once we have this set in stone this week, I will get into the hardware details of the Ultra96 and its onboard FPGA, and will also set up and benchmark the frame relay rates from the USB bus. This week will mostly be focused on setting up the hardware-related infrastructure: getting bytes from the USB interface to the FPGA core and relaying acks back.
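As a reference point for the hardware version, the standard bicubic (Keys) convolution weight that an interpolator evaluates per tap is sketched below in plain C++; moving it to fixed point and adding the HLS pragmas is the part still to be worked out:

// Keys cubic convolution weight, with a = -0.5 (the common default).
float bicubic_w(float x, float a = -0.5f) {
    x = x < 0 ? -x : x; // distance |x| from the sample point
    if (x < 1.0f)
        return (a + 2.0f) * x * x * x - (a + 3.0f) * x * x + 1.0f;
    if (x < 2.0f)
        return a * x * x * x - 5.0f * a * x * x + 8.0f * a * x - 4.0f * a;
    return 0.0f; // taps beyond 2 pixels contribute nothing
}

Each upscaled output pixel is then a 4x4 tap sum, i.e. the sum over i,j of bicubic_w(dx_i) * bicubic_w(dy_j) * p_ij, which is why line buffers across the pipeline stages are the natural hardware structure.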

Team Status Report for 10/23/21

Last week we mainly focused on writing the design review report. To address the elephant in the room: we know that our submission was nowhere near as polished as it could have been, and frankly some parts were not fully done. This means we will have to pick up more work leading into the final report to make sure it is done well, and we fully intend to do so. We do not want to repeat submitting something of such sub-par quality for the final report.

As for this week:

James focused mainly on improving CNN performance with marginal gains so far. More details are included in his status report.

Joshua focused on refining the software implementation of the project and ironing out bugs, as well as sorting out issues with training due to problems with AWS.

Kunal helped with improving CNN performance, as well as acquainting himself more with some of the content from reconfig, which James is currently taking but Kunal is not.

Overall, the project is about one week behind according to the Gantt chart, but this is not a concern since we left two extra weeks to address unexpected issues in our project's development. A lot of the delay came down to our other courses ramping up in time commitment and effort, with all members having to focus on other things, but in the end we made steady progress, and we are still on track to finish the project on time.

James’s Status for 10/23

Last week, I mainly focused on the design review report. I didn't get it as far along or as polished as I had hoped, which means more work for us when updating it for the final document. I ended up writing sections 1, 2, 3, 3.2, 4.1, 4.2, 5.1, 6.1, 9, and 9.1, plus the acronym glossary, BoM, and references. I was unable to write more, or to organise with my partners to write more in their sections; we know we messed this up badly, and we will aim to rectify it before the final report comes around.

This week, I focused more on optimising CNN operations on the FPGA. This is a bit out of order, but I decided to do it now because it works much more synergistically with where we are in reconfig right now. So far I have increased throughput (on a basic fast-CNN implementation) by 25% to 20 MOp/s, but I am expecting to settle about two orders of magnitude higher than where I am right now, i.e. on the order of 2 GOp/s. I also helped Kunal on-ramp with some Vitis material, as he was slipping behind on ramping up with the platform: I shared excerpts from the Vitis tutorials we were given in reconfig, and pointed him more directly towards online resources for Vitis. I need to circle back with him and check where he is with progress on I/O, and plan to do so this coming Monday. This may affect the Gantt chart / schedule, but we have the slack to allow for it for now. I will be keeping tabs on how much slack I use in my tasks, because I know that I have begun cutting things close with the amount of remaining slack I am allotting myself.

Kunal’s Status Update (10/16)

I looked into HLS language directives and methods for writing a pipelined matrix multiplier in Vivado HLS on the Ultra96's FPGA fabric. The process seems relatively straightforward, apart from the design and maintenance of the buffers implementing the various pipeline stages. These parameters will need to be benchmarked and tuned for optimal feed-forward latencies. Since this is a fully trained model, the only latencies that matter are those of the forward (inference) path through the pipeline. I'm going to continue this work through the week and keep everything under proper source control.
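A minimal sketch of such a pipelined multiplier in the Vivado HLS style is below, with placeholder size and precision (N = 32, 16-bit data); the partition pragmas are what let each pipeline stage fetch a full row and column of operands per cycle:

#define N 32
typedef short data_t; // placeholder precision, to be tuned

void matmul(const data_t A[N][N], const data_t B[N][N], data_t C[N][N]) {
    // Partition so a whole row of A and column of B are readable at once.
#pragma HLS ARRAY_PARTITION variable=A complete dim=2
#pragma HLS ARRAY_PARTITION variable=B complete dim=1
row: for (int i = 0; i < N; i++) {
col:   for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE II=1
         data_t acc = 0;
prod:    for (int k = 0; k < N; k++) {
#pragma HLS UNROLL
           acc += A[i][k] * B[k][j];
         }
         C[i][j] = acc;
       }
     }
}

Pipelining the j loop auto-unrolls the k loop, so one output element completes per cycle once the pipeline fills; the inter-stage buffers then just need to keep this fed.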

Joshua’s Status Report for 10/9/21

This week, I spent a lot of time incorporating our new design change of using SSIM instead of VMAF as our metric, rewriting some of my local code and doing more benchmarking with SSIM. The end results were very satisfactory: SSIM was a lot faster than VMAF and suited the training portion of our project much better. I also experimented with the idea of using the previous/next frames as side inputs, but decided against it due to the complexity and, more importantly, the little added value compared to a purely frame-by-frame upscaling method. I also worked on the design review with my teammates to refine the details of our implementation and the project overall.
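For reference, the SSIM we benchmark with is the standard closed-form comparison of two image windows x and y:

SSIM(x, y) = ((2*mu_x*mu_y + C1) * (2*sigma_xy + C2)) / ((mu_x^2 + mu_y^2 + C1) * (sigma_x^2 + sigma_y^2 + C2))

where mu_x and mu_y are the window means, sigma_x^2 and sigma_y^2 the variances, sigma_xy the covariance, and C1, C2 small stabilizing constants. Being a simple per-window statistic (rather than a learned ensemble metric like VMAF), it is cheap enough to evaluate inside the training loop.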

For the next week, I will spend the majority of my time writing code on AWS, finishing up my model and starting to train it on our chosen dataset. Since I have several midterms next week, I will have to balance my time well and coordinate with my team to make sure we stay on schedule.

Note: This individual status report was published late due to a technical issue. The initial draft shows it was created last Saturday, before the due date, but was not published and not fully saved.

Kunal’s Status Update (10/9)

This week, I looked at the specifications of our platform and thought about ways to implement I/O that preserve the real-time character of our deep learning system. One approach involves copying blocks of data directly from the USB bus into main memory, so that the ARM core can immediately start processing frames and forwarding them to the FPGA's core logic. It is loosely based on DMA, with some parameter tuning to optimize the system for low latency. I'm planning on doing this work this coming week, after a couple of midterms.
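A toy sketch of the ping-pong buffering this implies is below; the USB read and FPGA submit calls are hypothetical stand-ins for whatever driver interface we end up with. The point is that the copy of frame N+1 overlaps the FPGA's work on frame N:

#include <cstdint>
#include <cstdio>
#include <vector>

constexpr size_t FRAME_BYTES = 1280 * 720 * 3; // assumed frame size

// Hypothetical stand-ins for the real USB read / async FPGA submit.
static int frames_left = 3; // pretend the stream has three frames
bool read_frame_from_usb(uint8_t *dst, size_t n) {
    if (frames_left-- <= 0) return false;
    for (size_t i = 0; i < n; i++) dst[i] = 0; // fake the bus copy
    return true;
}
void submit_to_fpga_async(const uint8_t *src, size_t n) {
    std::printf("submitted %zu bytes\n", n); // would enqueue, not block
}

int main() {
    std::vector<uint8_t> buf[2] = {std::vector<uint8_t>(FRAME_BYTES),
                                   std::vector<uint8_t>(FRAME_BYTES)};
    int fill = 0;
    // Ping-pong: while the FPGA consumes buf[1 - fill], the next USB
    // frame lands in buf[fill], hiding copy latency behind compute.
    while (read_frame_from_usb(buf[fill].data(), FRAME_BYTES)) {
        submit_to_fpga_async(buf[fill].data(), FRAME_BYTES);
        fill = 1 - fill;
    }
    return 0;
}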