Joshua’s Status Report for 12/4

This week, I made several adjustments to our priorities before the end of the semester. Firstly, James was making good progress on the implementation of FSRCNN-s, which meant that getting weights for the model was integral to our final demo and deliverable. To cover the worst-case scenario, I researched publicly available weights for our CNN model and compared the datasets they were trained on against ours to see whether they would be suitable, in the event that the training for our new model could not be finished in time. After talking with the team, I reevaluated our priorities and realized that finishing and printing our CAD model would have to be put on the back burner: figuring out hyperparameters and working through possible solutions with James was much more pressing if we were to meet our deliverables by the end of the semester.

In terms of the final demo, I proposed several ideas for how we could structure the presentation, including a multi-monitor setup and interaction with the audience, such as asking them to rate which video they thought was better and prompting them to give a subjective quality score.

I put together the final presentation and prepared to deliver it. In parallel, I worked on the final report, going into detail on the testing and metrics for our initial model, even though we have since moved on from it, so as to fully document our development process.

In terms of the final video, I discussed with my team what approaches we could take, and since none of us really had any video-editing experience, we decided to prepare in advance. I looked at videos from previous Capstone courses, specifically last semester's winners, as I thought their video was well paced and well edited.

(This report was made late, and was added on December 12th.)

Kunal’s Status Report (12/4)

We finished benchmarking the latencies of our FSRCNN model, and I wrote a profiler that takes upscaled frames and builds metrics based on the quality of these images. We will use this in the final demo to compare the output of the software model with that of the implementation running on the hardware device, alongside their latencies. The computation involves a scan over the pixel set and a buffering mechanism for the previous pixels fed into the pipeline. We are actively tuning the model and analyzing our results; we expect to reach a significantly lower latency and to be able to profile the results much more efficiently.
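
As a rough illustration of what such a per-frame quality metric could look like, here is a minimal PSNR sketch using OpenCV on the host side; the function name and the assumption that we compare against a ground-truth reference frame are illustrative, not the exact profiler code.

    // psnr.cpp - minimal sketch of a per-frame quality metric (illustrative only).
    #include <opencv2/opencv.hpp>
    #include <cmath>

    // PSNR between an upscaled frame and a ground-truth reference of the same size.
    double framePSNR(const cv::Mat& upscaled, const cv::Mat& reference) {
        cv::Mat diff;
        cv::absdiff(upscaled, reference, diff);      // per-pixel absolute error
        diff.convertTo(diff, CV_32F);
        diff = diff.mul(diff);                       // squared error
        cv::Scalar channelSums = cv::sum(diff);
        double sse = channelSums[0] + channelSums[1] + channelSums[2];
        double mse = sse / static_cast<double>(diff.total() * diff.channels());
        if (mse < 1e-10) return 99.0;                // frames are (nearly) identical
        return 10.0 * std::log10((255.0 * 255.0) / mse);
    }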

Team Status Report for 12/04

James continued to work on squeezing performance out of the FSRCNN model, but ran into diminishing returns. Since our weights are fixed, he was able to take advantage of that for some additional improvements in memory accessing. Integration with the host side led to further slowdowns, so a multi-kernel approach was decided on, and James began writing it. He expects to finish implementing it by the end of the week of 11/29.



James’s Status for 12/04

From last week I have an implementation of FSRCNN which runs faster than SRCNN, though it is still slow. One optimisation that I tested was using fixed weights as opposed to weights stored in host-side memory that is mapped to the kernel. This led to a decent improvement in latency, but not enough to meet our initial specifications. Porting and integrating with the host code has produced further slowdowns. I am trying to remedy this with a multi-kernel approach which should be finished by tonight. For the coming week I will be focusing on writing the paper, the video, and making a narrative to sell what we have, as we aren't in the position schedule-wise to try for more optimisations, even if that's what I would like to do.
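
To illustrate the fixed-weights idea (a sketch only; the identifiers and sizes below are made up, not the actual kernel interface): baking the trained weights into the kernel as a constant array removes the per-invocation weight transfer, whereas host-mapped weights have to cross the memory interface on every call.

    // Illustrative contrast between host-mapped and fixed (baked-in) weights in an
    // HLS-style kernel. Names and sizes are hypothetical.
    constexpr int N_WEIGHTS = 1024;   // assumed total weight count

    // Host-mapped: weights live in host-visible DDR and are read by the kernel
    // on every invocation.
    void fsrcnn_kernel_mapped(const float* frame, float* out, const float* weights);

    // Fixed: trained values are compiled into the bitstream as a constant array,
    // which the tools can partition or replicate for parallel access.
    static const float FIXED_WEIGHTS[N_WEIGHTS] = { /* trained values here */ };
    void fsrcnn_kernel_fixed(const float* frame, float* out);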

On the project-management side, I also helped Josh practice on Tuesday for the Wednesday presentation.

Team Status Report for 11/20

This week, our group spent a lot of time getting a working version of our upscaling model onto the FPGA. Since we had several delays in the previous week, we had to work hard to make up lost time, and we are making progress each day in order to meet our deadlines.

James worked on optimizing latency on the U96 board, as well as improving the FPGA architecture of the CNN. He ran into several problems near the beginning of the week, but quickly caught up and is in the middle of addressing the limiting factors that are preventing our upscaling model from working.

Joshua addressed problems with the end-to-end latency on the U96 board, which turned out to be unexpectedly high, by training a smaller model from scratch. He was also in charge of starting the final presentation, as he will be presenting, and of continuing work on the final report. He also did further research on a case for the U96 board; James had an initial design for one last week, but it was for an older generation of the board.

Kunal is working more on the I/O portion of the board, looking at the different ways frames can be passed in to increase the speed of the implementation.

Next week, we are looking to have a final product working, even if it doesn't fully meet our initial requirements. From the work we've put in these last few months, it seems that some trade-off is inevitable, so this week we will pinpoint exactly which trade-off we are willing to make, and aim to have at least a semi-working demo ready for our final presentation on 11/29.

James’s Status for 11/20

***** Apologies for lateness *****

Again I will be structuring my status this week in daily updates alongside an end-of-week review.

Daily Update 11/14:

I was unable to get the benchmark data that I wanted. I am running into a massive bug where the code compiles and synths, but when run on the U96 it bricks the board until a manual reboot is issued, either by unplugging it or using the power button; a soft reset isn't even an option. I want to get this into vitis_hls so I can see whether the kernel is actually running but taking ages and overloading the board to the point where it no longer has the capacity to run the heartbeat (which would be very bad, since it would mean that our model takes far too much computation for our board), or whether there is an error in my code. In all honesty, I wouldn't be surprised, and almost expect, that it is an error in my code.

Daily Update 11/15:

Today was productive. I was able to get the code into vitis_hls and get it properly building, albeit with some vast codebase restructuring. Running in HLS let me see that, in simulation, we should be making timing, or at the very least that the FPGA is not going to be the limiting factor: I was able to reach 10ns latency, and could probably push it even further, as 10ns is just the default target that HLS builds to. There could still be considerable delays from data movement on the host side, or other memory issues into which the test I ran would not give insight. From further testing in HLS, I was also able to pin down the cause of the U96 hanging that I was running into: it was memory overruns that I didn't catch when porting over the full-sized system's true sizes. I've gone ahead and fully parameterised my code now, so there is no room for this error to happen again. With that issue fixed, I am now running into an XRT error regarding memory usage that causes the host to segfault, the particular error being "bad_alloc". Doing some preliminary digging into the docs, this seems to point to allocating too much memory. I'm going to look a bit further into this tomorrow and also look into using lower-precision floating-point types so that the memory footprint may be lower. If these don't pan out tomorrow, I will also fork a branch on our Git for a different FPGA architecture for the CNN. The two options I have in mind are: 1) using a fixed full-feature-map layer-kernel, as opposed to the model-kernel I have implemented currently; in this case I would apply the layer-kernel three times from the host side, loading in the relevant weights as it goes along; 2) using a single-feature-map layer-kernel, which would be very data-lightweight but would put more responsibility on the host in coordinating memory movement, and that movement might end up being the dominating factor for latency and throughput.
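
As an illustration of the parameterisation (a sketch only; the dimension names and values below are made up, not the real ones), the idea is that every buffer in the kernel is sized from one set of compile-time constants, so the toy test sizes and the full-sized frame can never silently disagree:

    // Hypothetical sketch of deriving every kernel buffer from one set of
    // compile-time constants. Names and numbers are illustrative.
    constexpr int TILE_H = 32;               // tile height processed per call
    constexpr int TILE_W = 32;               // tile width
    constexpr int N_FEAT = 12;               // hidden-layer feature maps
    constexpr int K      = 3;                // convolution kernel size

    // Padded input window and weight storage, both derived from the constants
    // above, so resizing the system means changing exactly one place.
    static float window[N_FEAT][TILE_H + K - 1][TILE_W + K - 1];
    static float weights[N_FEAT][K][K];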

Daily Update 11/16:

Doing some hand calculations on my current implementation, just as a sanity check, it looks like the issue is a memory-related one: I am trying to request more memory from the system than it should be able to provide. The dominant factor is the hidden-layer buffers, which I am currently storing as full intermediate buffers. Since I can see this now, I'm going to couple the layers of the network more tightly so that I can remove these inter-layer memory requirements.
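
For a sense of how quickly these buffers blow up (with made-up numbers, not the real model's), a full-resolution hidden layer stored as 32-bit floats costs width x height x feature-maps x 4 bytes:

    // Back-of-the-envelope hidden-layer buffer size (illustrative numbers only).
    constexpr long W = 1920, H = 1080;            // assumed output frame size
    constexpr long F = 32;                        // assumed hidden-layer feature maps
    constexpr long bytesPerLayer = W * H * F * 4; // 32-bit float
    // ~265 MB per hidden layer, before counting the other layers or any copies.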

Daily Update 11/17:

Thinking further about the inter-layer optimisations, there is no way to keep the overall structure I currently have and still implement them. Hence, I am trying a new strategy whereby the calculation is done not in a plain tiled fashion but in a sub-tiled fashion, with each tile split further into sub-tiles. I will spend today finishing getting this up and running, and then will sweep a few values.
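
A rough sketch of the loop structure I have in mind (names, sizes, and the helper below are placeholders, not the actual kernel): the frame is walked tile by tile, and each tile is walked sub-tile by sub-tile, so that a sub-tile's intermediate feature maps are small enough to stay on chip.

    // Illustrative sub-tiled loop nest; all identifiers are hypothetical.
    constexpr int H = 270, W = 480;         // assumed low-res input frame size
    constexpr int TILE_H = 54, TILE_W = 96;
    constexpr int SUB_H  = 9,  SUB_W  = 16;

    void process_subtile(int y, int x);     // runs all layers on one sub-tile

    void upscale_frame() {
        for (int ty = 0; ty < H; ty += TILE_H)                     // walk tiles
            for (int tx = 0; tx < W; tx += TILE_W)
                for (int sy = ty; sy < ty + TILE_H; sy += SUB_H)   // walk sub-tiles
                    for (int sx = tx; sx < tx + TILE_W; sx += SUB_W)
                        process_subtile(sy, sx);
    }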

Daily Update 11/18:

This new architecture looks promising; I have been able to get lower numbers than before, though they are still too high to be useful. I also did a fuller calculation of what the maximum bandwidth should allow, and it was extremely concerning: the ideal per-frame time came out at around 6s, roughly two orders of magnitude above where we need it to be (~60ms), and that is still assuming I can achieve the ideal, given the memory structure constraints imposed by how the frame is laid out by OpenCV when we read it in.
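
To show the shape of that estimate (every number below is an assumption for illustration, not the figures from my actual calculation): the bound is simply the bytes that must cross the memory interface per frame divided by the bandwidth the access pattern effectively achieves.

    // Illustrative bound: per-frame traffic divided by effective bandwidth.
    // All numbers are assumptions, not the project's measured values.
    constexpr double kBytesPerFrame    = 1920.0 * 1080.0 * 3 * 4;            // RGB, 32-bit float
    constexpr double kEffectiveBwBps   = 4.0e6;                              // pessimistic, strided access
    constexpr double kIdealSecPerFrame = kBytesPerFrame / kEffectiveBwBps;   // ~6.2 s
    // Real-time at ~60 ms/frame would need this to drop by ~2 orders of magnitude.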

Daily Update 11/19:

I tried to restructure the network, but I misunderstood the architecture I was going after, so it ended up being a waste. I did some value sweeping with vitis_hls and have found what seems to be a minimum, though I am unsure whether it is a global or a local one.

Daily Update 11/20:

I didn't have much bandwidth to work on Capstone today; I was just able to sweep a few values, which didn't amount to much more performance. Ending the week at an E2E latency of 116011 ms.

End-of-Week Report 11/14-11/20:

I am making incremental improvements, but they aren't coming fast enough, and we still run into the ideal-bandwidth ceiling. I'm not sure what can budge anymore, and it is likely we will not achieve one of our set benchmarks. This is not good.

Kunal’s Status Update (11/20)

This week I finished the implementation of the iterative, frame-based processing of the video file that is fed into the hardware implementation of the algorithm. It works by iterating over the frames in a video file and building the buffers and necessary metadata to pass each frame along to the FPGA with as low a latency as possible. The pipeline is meant to be real-time in the sense that, as frames are received back from the CNN, they are forwarded directly to the video port of the Ultra96 board. The video port is a peripheral that I exposed via a Xilinx library that can do host-to-device communication with a video device as the output. This part needs extensive testing, along with the latency improvements for the CNN model itself. I've been working with James to coordinate these changes in our overall system. We are planning on squashing the layers of the CNN to build a 1-layered super-resolution system in order to fit the model optimally on the FPGA. This is an ongoing effort, and I will be looking at it through Sunday and the following week.
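
For context, the host-side loop is shaped roughly like the sketch below (OpenCV C++; run_fsrcnn and send_to_display are hypothetical stand-ins for the kernel invocation and the Xilinx video-port call, not the real function names):

    // Illustrative host-side streaming loop; the two callouts are placeholders.
    #include <opencv2/opencv.hpp>
    #include <string>

    cv::Mat run_fsrcnn(const cv::Mat& frame);    // placeholder: FPGA kernel call
    void send_to_display(const cv::Mat& frame);  // placeholder: video-port output

    void stream_video(const std::string& path) {
        cv::VideoCapture cap(path);
        cv::Mat frame;
        while (cap.read(frame)) {                // iterate frame by frame
            cv::Mat upscaled = run_fsrcnn(frame);
            send_to_display(upscaled);           // forward as soon as it comes back
        }
    }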

 

Joshua’s Status Report for 11/20

As the semester wraps up, this week was very busy in terms of workload and addressing issues that came up. At the beginning of the week, I was researching CAD designs for our FPGA board: a kind of holder that wraps around the board to prevent static discharges, something my team and I had looked into last week. Since none of us on the team had much CAD experience, the process was more time-consuming than expected. I ended up asking a friend with more CAD experience than me to teach me, and I finalized a design for the board.

However, our group quickly ran into a bigger problem: our end-to-end latency was much higher than we had initially calculated. Although our throughput was close to meeting our requirements, the end-to-end latency was significantly higher than what we needed for the system to be 'real-time', i.e. a delay of less than 30ms. To address this, a smaller model was needed, but that would come at the cost of quality in our final upscaled videos. We had to decide whether the sacrifice in quality was worth it, or whether to take the hit in latency and justify why that was the better choice overall.

I talked through the problem with James and Kunal, and we considered ideas such as training a single-layered CNN from scratch, which was still possible with the time we had left but would cut it very close, as well as other solutions. In the end, I prepared a smaller model that still had 3 layers but drastically fewer filters, and started training it in parallel with our initial model, training the new model on my GPU while the training of the initial model continued on Google Colab. I aim to fully address this issue in the coming week and to coordinate with James to ensure that we'll have at least a final working product, even if it doesn't meet our initial requirements.

In terms of the final presentation and the final report, I've started working on the former, since I will be the one presenting next week, and continued last week's work on the latter.

Kunal’s Status Report (11/13)

This week I worked on pinning down the frame analysis of the AVI file. I have code written for this and have it tracked in our source control. I've also looked into how to do the processing iteratively through OpenCV and the related Mat operations. In the coming week I will test that implementation and move on to benchmarking the speed of I/O across real video. The computation is iterative and requires additional buffering mechanisms in order to verify that the streaming speed holds up. From there, taking these streaming mechanisms and porting them to the FPGA is another goal for the coming week.
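
A minimal sketch of what that I/O benchmark could look like (the clip name and structure are assumed; this just times a raw read loop, not the full pipeline):

    // Illustrative raw frame-read benchmark with OpenCV; clip name is assumed.
    #include <opencv2/opencv.hpp>
    #include <chrono>
    #include <iostream>

    int main() {
        cv::VideoCapture cap("test.avi");          // assumed test clip
        cv::Mat frame;
        int frames = 0;
        auto t0 = std::chrono::steady_clock::now();
        while (cap.read(frame)) ++frames;          // decode-and-read only
        auto t1 = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(t1 - t0).count();
        std::cout << frames / seconds << " frames/s read from disk\n";
        return 0;
    }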