Team Status Report for 12/04

James continued to work on squeezing performance out of the FSRCNN model, but ran into diminishing returns. Hard-coding the weights into the kernel, which is possible since our trained weights are fixed, allowed for some additional improvements in memory access. Integration with the host side led to additional slowdowns; to address this, we decided on a multikernel approach, which James has begun writing. He expects to finish implementing it by the end of the week of 11/29.

<Josh Edits>

<Kunal Edits>

James’s Status for 12/04

Carrying over from last week, I have an implementation of FSRCNN which runs faster than SRCNN, though it is still slow. One optimisation I tested was using fixed weights baked into the kernel, as opposed to weights stored in host-side memory that is mapped to the kernel (a sketch of the difference follows below). This led to a decent improvement in latency, but not enough to meet our initial specifications. Porting and integrating with the host code has produced further slowdowns. I am trying to remedy this with a multikernel approach, which should be finished by tonight. For the coming week I will be focusing on writing the paper, the video, and building a narrative to sell what we have, as we aren't in a position, schedule-wise, to try for more optimisations, even if that's what I would like to do.
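
To make the fixed-weights optimisation concrete, here is a minimal sketch of the two interface styles on a toy 1-D filter. The names, sizes, and weight values are illustrative assumptions, not our actual FSRCNN kernel code.

```cpp
constexpr int TAPS = 9; // assumed filter size

// Option A (what I tested): weights known at synthesis time, so HLS can
// place them in registers/LUTs; no DRAM traffic is spent on them.
static const float FIXED_W[TAPS] = {0.1f, 0.2f, 0.3f, 0.2f, 0.1f,
                                    0.2f, 0.3f, 0.2f, 0.1f};

extern "C" void filt_fixed(const float *in, float *out, int n) {
#pragma HLS INTERFACE m_axi port=in  bundle=gmem0
#pragma HLS INTERFACE m_axi port=out bundle=gmem1
    for (int i = 0; i + TAPS <= n; ++i) {
#pragma HLS PIPELINE II=1
        float acc = 0;
        for (int t = 0; t < TAPS; ++t)
            acc += FIXED_W[t] * in[i + t];
        out[i] = acc;
    }
}

// Option B (the baseline): weights live in host-allocated DDR and come in
// over an extra AXI master port on every invocation.
extern "C" void filt_mapped(const float *in, const float *w,
                            float *out, int n) {
#pragma HLS INTERFACE m_axi port=in  bundle=gmem0
#pragma HLS INTERFACE m_axi port=w   bundle=gmem1
#pragma HLS INTERFACE m_axi port=out bundle=gmem2
    float w_local[TAPS];
    for (int t = 0; t < TAPS; ++t) // burst the weights in once per call
        w_local[t] = w[t];
    for (int i = 0; i + TAPS <= n; ++i) {
#pragma HLS PIPELINE II=1
        float acc = 0;
        for (int t = 0; t < TAPS; ++t)
            acc += w_local[t] * in[i + t];
        out[i] = acc;
    }
}
```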

Project-management-wise, I also helped Josh practice for the Wednesday presentation on Tuesday.

James’s Status for 11/20

***** Apologies for lateness *****

Again I will be structuring my status this week in daily updates alongside an end-of-week review.

Daily Update 11/14:

I was unable to get the benchmark data that I wanted. I am running into a massive bug where the code compiles and synthesises but, when run on the U96, bricks the board until a manual reboot is issued, either via unplugging or the power button – a soft reset isn't even an option. I want to get this into vitis_hls so I can see whether the kernel is actually running but taking ages, overloading the board to the point of not having the capacity to run the heartbeat anymore (which would be very bad, since it would mean our model takes up far too much computation for our board), or whether there is an error in my code. In all honesty, I wouldn't be surprised, and almost expect, that it is an error in my code.

Daily Update 11/15:

Today was productive. I was able to get the code into vitis_hls and get it properly building, albeit with some vast codebase restructuring. Running in HLS let me see that, in simulation, we should be making timing, or at the very least the FPGA is not going to be the limiting factor: I was able to reach 10ns latency, and could probably push it even further, as 10ns is just the default target that HLS builds to. There could still be considerable delays from data movement on the host side, or other memory issues, into which the test I ran gives no insight.

Additionally, from further testing in HLS, I was able to pin down the cause of the U96 hanging that I was running into: memory overruns that I didn't catch when porting over the full-sized system's true sizes. I've gone ahead and fully parameterised my code, so there is no room for this error to happen again. With that fixed, I am now running into an XRT error regarding memory usage causing the host to segfault, the particular error being a "bad_alloc". Some preliminary digging into the docs suggests this points to allocating too much memory. I'm going to look further into this tomorrow, and also look into using lower-precision FP types to reduce the memory footprint. If these don't pan out tomorrow, I will fork a branch on our Git for a different FPGA architecture for the CNN. The two options I have in mind are:

1) A fixed full-feature-map layer-kernel, as opposed to the model-kernel I have implemented currently. The host would apply the layer-kernel three times, loading in the relevant weights as it goes along (see the host-loop sketch after this list).

2) A single-feature-map layer-kernel. This would be very data-lightweight, but would put more responsibility on the host in coordinating memory movement, and that movement might end up being the dominating factor for latency and throughput.
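
For option 1, the host loop would look roughly like the following sketch, written against the XRT native C++ API. The xclbin name, kernel name, buffer sizes, and the weight-loading step are all placeholders.

```cpp
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>
#include <utility>
#include <vector>

int main() {
    auto dev  = xrt::device(0);
    auto uuid = dev.load_xclbin("fsrcnn_layer.xclbin"); // placeholder name
    auto krnl = xrt::kernel(dev, uuid, "layer_kernel"); // placeholder name

    const size_t map_bytes = 1u << 22; // placeholder feature-map size
    const size_t w_bytes   = 1u << 16; // placeholder for the largest layer

    auto bo_in  = xrt::bo(dev, map_bytes, krnl.group_id(0));
    auto bo_w   = xrt::bo(dev, w_bytes,   krnl.group_id(1));
    auto bo_out = xrt::bo(dev, map_bytes, krnl.group_id(2));

    std::vector<std::vector<float>> layer_w(3,
        std::vector<float>(w_bytes / sizeof(float), 0.0f)); // from file

    // ... fill bo_in with the input frame and layer_w with real weights ...

    for (int layer = 0; layer < 3; ++layer) {
        bo_w.write(layer_w[layer].data()); // reload this layer's weights
        bo_w.sync(XCL_BO_SYNC_BO_TO_DEVICE);
        auto run = krnl(bo_in, bo_w, bo_out, layer);
        run.wait();
        std::swap(bo_in, bo_out); // layer N's output feeds layer N+1
    }
    return 0;
}
```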

Daily Update 11/16:

Doing some hand calculations on my current implementation, just as a sanity check, it looks like the issue is a memory-related one: I am requesting more memory from the system than it should be able to provide. The dominant factors are the hidden-layer buffers, which I am currently storing as full memory buffers. Since I can see this now, I'm going to more tightly couple the layers of the network so that I can remove these inter-layer memory requirements.
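
For a sense of scale, a back-of-the-envelope version of that hand calculation is below. The dimensions are illustrative assumptions, not our exact configuration.

```cpp
#include <cstdio>

int main() {
    const long W = 1920, H = 1080; // assumed output-resolution maps
    const long maps = 32;          // assumed feature maps per hidden layer
    const long bytes = W * H * maps * static_cast<long>(sizeof(float));
    // ~265 MB per hidden layer at these numbers; a couple of buffers this
    // size is far more than the board will grant in one go, which is
    // consistent with the bad_alloc I'm seeing.
    std::printf("%.1f MB per hidden layer\n", bytes / 1e6);
    return 0;
}
```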

Daily Update 11/17:

Thinking further about the inter-layer optimisations, there is no way to keep the overall structure I currently have and still implement them. Hence, I am trying a new strategy whereby the calculation is done not in a plain tiled fashion but in a sub-tiled one. I will spend today finishing getting this up and running, and then will sweep a few values.
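
The intent of the tighter coupling is that layers hand sub-tiles to each other through on-chip FIFOs instead of full feature maps in DRAM. A minimal sketch of that shape under HLS DATAFLOW is below; the stage bodies are toy stand-ins for the real convolutions, and the names and depths are placeholders.

```cpp
#include <hls_stream.h>

// Toy per-stage bodies standing in for the real convolution layers.
static void layer0(const float *in, hls::stream<float> &out, int n) {
    for (int i = 0; i < n; ++i) out.write(in[i] * 0.5f);
}
static void layer1(hls::stream<float> &in, hls::stream<float> &out, int n) {
    for (int i = 0; i < n; ++i) out.write(in.read() + 1.0f);
}
static void layer2(hls::stream<float> &in, float *out, int n) {
    for (int i = 0; i < n; ++i) out[i] = in.read() * 2.0f;
}

extern "C" void fused_model(const float *in, float *out, int n) {
#pragma HLS INTERFACE m_axi port=in  bundle=gmem0
#pragma HLS INTERFACE m_axi port=out bundle=gmem1
#pragma HLS DATAFLOW
    hls::stream<float> s01("l0_to_l1"), s12("l1_to_l2");
#pragma HLS STREAM variable=s01 depth=64
#pragma HLS STREAM variable=s12 depth=64
    // Each stage starts as soon as its predecessor has produced enough
    // data, so no full inter-layer feature map ever sits in DRAM.
    layer0(in, s01, n);
    layer1(s01, s12, n);
    layer2(s12, out, n);
}
```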

Daily Update 11/18:

This new architecture looks promising; I have been able to get lower numbers than before, though they are still too high to be useful. I did a fuller calculation of what the maximum bandwidth allows, and it was extremely concerning: the bandwidth-limited ideal latency was around 6s, roughly two orders of magnitude above where we need it to be (~60ms), and that's still assuming I can achieve the ideal, given the memory-structure constraints imposed by how the frame is laid out by OpenCV when we read it in.
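
A rough reconstruction of that bound, with every number an assumption for illustration rather than a measured figure:

```cpp
#include <cstdio>

int main() {
    const double bytes_moved = 3 * 265e6 * 2; // e.g. 3 layers of ~265 MB
                                              // maps, each read + written
    const double eff_bw = 0.3e9;              // assumed achievable DDR B/s
                                              // under our access pattern
    // ~5.3 s per frame at these numbers vs the ~60 ms budget: roughly two
    // orders of magnitude short, matching the calculation above.
    std::printf("lower bound: %.1f s per frame\n", bytes_moved / eff_bw);
    return 0;
}
```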

Daily Update 11/19:

Tried to restructure the network, but I misunderstood the architecture I was going after, so it ended up being a waste. Did some value sweeping with vitis_hls and have found what seems to be a minimum, though I'm unsure whether it is a global or a local one.

Daily Update 11/20:

Didn't have much bandwidth to work on capstone today; I was just able to sweep a few values, which didn't amount to much more performance. Ending the week at an E2E latency of 116011ms.

End-of-Week Report 11/14-11/20:

I am making incremental improvements, but they aren't coming fast enough, and there is still the ideal cap that we run into. I'm not sure what can budge anymore; it is likely we will not achieve one of our set benchmarks. This is not good.

James’s Status for 11/13

First off, I'd like to apologise for the lack of a status update the previous week (nothing posted on 11/6). I was extremely busy getting our hardware working for the interim demo. For the sake of coverage and good documentation, I will include what I would have had in that update here, clearly backdating and marking entries where applicable. I've included an end-of-week overview of last week, daily reports for each day this week, and an end-of-week overview of this week. I've decided to add daily reports for myself now for two reasons: 1) to keep myself accountable for making regular progress on the project, and 2) because, having reached this stage of the project, I have a lot more to do than previously and hence want a better way to organise it.

———-

End-of-Week Update: (10/31-11/6)

This week I got the hyperparameters back from Josh, so I was able to get the CNN built on the Ultra96. Unfortunately, because he was still recovering from illness, I didn't get the hyperparameters back as early as I'd have liked, and so was not able to run all the experiments I wanted to this week.

One big takeaway from building the model at full size was that I hadn't fully appreciated its scale before this point, and so didn't realise that each hidden layer (at least as implemented now) has to make calls back to DRAM. This may cause slowdowns, but I haven't had the chance to benchmark it yet; that is a big TODO for the upcoming week. The size of the model also makes builds take a very long time to synthesise and route: around 30 minutes for an incremental build, and far longer for a clean build. Development in an environment of this size will be far slower than I anticipated just due to this turnaround.

As of now, I just have the model hyperparameters, no weights, but the model I have implemented on the FPGA is agnostic to the weights: they will simply be loaded from a file by the host (a sketch of this is below). There could be improvements from precomputation, but I'm not sure if that is actually the case; I would have to profile the cost/benefit of how much computation and memory access it would actually save. At the same time, having the model agnostic to the weights gives our system more modularity, which is very good for a short-turnaround testing environment like ours. In the coming days we will need to get the system partially integrated for the demo, and then keep moving forward with progress over the rest of the coming week.
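
The host-side weight loading implied above could be as simple as the following sketch, assuming the weights sit in a flat binary file of known length; the path and layout are assumptions.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Read `count` floats from a flat binary file into a vector; the caller
// copies the result into the kernel's weight buffer.
std::vector<float> load_weights(const char *path, std::size_t count) {
    std::vector<float> w(count);
    std::ifstream f(path, std::ios::binary);
    f.read(reinterpret_cast<char *>(w.data()),
           static_cast<std::streamsize>(count * sizeof(float)));
    return w;
}
```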

Daily Update 11/8: (Interim Demo)

I did integration this weekend and ran into a great number of immediate issues, especially with the timeline of the interim demo being so soon. The first issue was finding decent data sources. For expediency, and as a proof of concept of getting video from the host to the fabric, I wanted to store a video in the home directory of the board's file system, but couldn't get the files to play nicely (issues with file formats, dimensions, file size, and so on). In the interest of time, I reverted to using an mp4. After our first demo I will ask Josh to share the data set so we have better/more applicable files to use. The size of the files will also be less of a problem, since they will live on a USB drive as opposed to the same microSD on which the PetaLinux image lives.

The second main issue was that the code Kunal gave me was riddled with bugs and errors. The clearest and most effective path forward was to rewrite the host code in its entirety. Linking the correct OpenCV libraries with Vitis was a bit painful, as the project file does not store the build config in an obvious way, but in all it was not as painful as it could have been. The host code (for the demo) took a few hours to write; debugging was minimal, as I made sure to code carefully, since builds/compilations are quite expensive. Another thing to note is that, for the demo and only for the demo, I reduced the sizes of the filter maps to get shorter builds, and hence a faster iteration cycle, to make sure there was a live demo available as a deliverable.

I ended up achieving this with a much-reduced spec (as expected for the interim demo) where the host reads a video file with a known path and name, launches the kernel on the fabric, reads back the data, and serialises it to a file (sketched below). Moving forward, we will want to send data to video output on the miniDisplayPort as opposed to serialising. We will also still need benchmarking added, for both accuracy and time. Lastly, just from wall-clock time, it seems that serialisation takes an untenable amount of time (a few seconds). We will need to investigate whether this is also the case for streaming video and make sure this does not become a bottleneck for us.
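
The demo host flow is roughly the following OpenCV sketch, with the kernel launch stubbed out and the paths and codec as placeholders.

```cpp
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("/home/root/input.mp4"); // placeholder path
    if (!cap.isOpened()) return 1;

    const double fps = cap.get(cv::CAP_PROP_FPS);
    cv::Mat frame, upscaled;
    cv::VideoWriter out;

    while (cap.read(frame)) {
        // Stand-in for the kernel launch: on the real system the frame
        // buffer is synced to the fabric and the result read back here.
        cv::resize(frame, upscaled, cv::Size(), 4.5, 4.5);

        if (!out.isOpened())
            out.open("/home/root/output.avi", // placeholder path
                     cv::VideoWriter::fourcc('M', 'J', 'P', 'G'),
                     fps, upscaled.size());
        out.write(upscaled); // the serialisation step timed above
    }
    return 0;
}
```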

Daily Update 11/10:

I re-integrated the correct input/output map sizes to the FPGA. The builds still take ~30 minutes. I want to find a better way to iterate on the full design that doesn’t take as long for a build, but at the same time I don’t want to devote too much time to something that might not amortise out. If I’m being honest, with the runway we have left, I don’t think that it will be worth it, and so will not devote that much time to optimising builds. I plan to block out three hours tomorrow to try and improve the iteration cycle, if nothing comes of it, so be it, I’ll just need to be careful with every build I do.

Daily Update 11/11:

Because of what Tamal told us yesterday in the interim demo regarding static discharge on the U96, I began looking into cases for the U96 that we could use to mitigate the risk of discharge from touching the components of the device. I didn't find many existing options, just one 3D-printable model on Thingiverse, linked here. The main drawback with this model is that it includes space for the JTAG/UART extension, which we aren't using, and so it would be bulkier than what we want/need. I might look into modifying the model so that we can have a case with a better form factor. At the same time, however, I'm not sure I have the bandwidth to add this to all the other tasks I need to complete as per our schedule. I plan to leave this as lower priority – it wouldn't be the worst thing in the world if we had the extra space for the pod – but I'm also planning to ask my group if any of them have more bandwidth or more experience with CAD/3D printing.

Daily Update 11/12:

I didn't get much work done for this class on Friday; I mostly focused on deadlines I had in other courses.

Daily Update 11/13:

Again, I had other coursework to attend to during the day. This evening, I'm running some benchmarks on the CNN kernel so I can get a sense of how much further I need to push it. I won't have numbers in time for this update's due date, but will have them later tonight, past midnight.
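
The timing harness for these benchmarks is roughly the sketch below; the kernel call is stubbed and the frame count is a placeholder.

```cpp
#include <chrono>
#include <cstdio>

int main() {
    const int frames = 100; // placeholder workload size
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < frames; ++i) {
        // run_kernel_on_frame(i); // hypothetical launch + readback
    }
    auto t1 = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("E2E: %.2f ms/frame\n", ms / frames);
    return 0;
}
```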

End-of-Week Update/Overview: (11/7-11/13)

This week was fairly productive – we have a full(-ish) system, we just need to flesh it out and iron out some kinks. The build times, in retrospect, should not be a huge issue; I'll just need to be smart about what I run, and it's good practice for industry codebases and for learning the lesson that compiles are not always free. The case has been put on the back burner for now; it would be a nice convenience, but not something we need for MVP. With tonight's profiling and readings done, I should be ready to start iterating in earnest, with a more solid goal to reach. At this point, I am fairly confident that I can get my part done on time or ahead of schedule. I may attach an update to this after the due date to include results from benchmarking that finishes late in the night, so that course staff can review it before Monday.

James’s Status for 10/30

Since the problems we were having with AWS were reaching the critical path for the completion of our project this week, I helped Josh look for alternative pre-trained models in case AWS/training fell through. While pre-trained models do exist, many of them would not be exactly what we need for our use case. The pre-trained models we found were 'rated' for up to 4x upscaling, meaning that their performance would degrade at the 4.5x scaling factor we will be using. Additionally, many models had extra layers of DSP preprocessing which we did/do not plan to use. If our hand were forced, we have settled on an open-source version, found on GitHub, that implements SRCNN without the extra preprocessing, knowing that this means we may not be able to attain the picture-reconstruction accuracy we originally set out for (since the model will only have been trained to support good restoration up to 4x).

This week I also further helped Kunal ramp on host-side programming for the U96 board, and pointed him in the direction of various resources so he could get started on its implementation.

I also set up a Git repository for us to use for the U96 Vitis project. As of now it only has the vector-vector addition template example, as an aid to get Kunal started on programming the host (a minimal version of that template is below). I tried making further incremental gains on the CNN kernel, but was unable to realise any more this week. On the bright side, I was able to rule out a good few strategies for speedup, so the design space is, at the very least, still converging. I think Kunal should be pretty much fully ramped by now, so I should have more time this coming week to further explore the design space for CNN acceleration.
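
For reference, the vector-vector addition template is the standard Vitis starter kernel; a minimal version looks roughly like this (bundle names are illustrative).

```cpp
// Minimal vector-vector addition kernel in the style of the Vitis
// starter template.
extern "C" void vadd(const int *a, const int *b, int *c, int n) {
#pragma HLS INTERFACE m_axi port=a bundle=gmem0
#pragma HLS INTERFACE m_axi port=b bundle=gmem1
#pragma HLS INTERFACE m_axi port=c bundle=gmem2
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        c[i] = a[i] + b[i];
    }
}
```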

Team Status Report for 10/23/21

Last week we mainly focused on writing the design review report. To address the elephant in the room: we know that our submission was nowhere near as well done or as polished as it could have been, and frankly was not even fully finished in some parts. We know this means we will have to pick up more work leading into the final report to make sure it is done well, and we fully intend to do so. We do not want to submit something of such sub-par quality again for the final report.

As for this week:

James focused mainly on improving CNN performance with marginal gains so far. More details are included in his status report.

Joshua focused on refining the software implementation of the project and ironing out bugs, as well as sorting out issues with training due to problems with AWS.

Kunal helped with improving CNN performance, as well as acquainting himself more with some of the content from reconfig, which James is currently taking but Kunal is not.

Overall, the project is about one whole week behind according to the Gantt chart, but this is not a concern, since we left two extra weeks to address unexpected issues with our project's development. A lot of the delay came down to our other courses ramping up in time commitment and effort, with all members having to focus on other things, but in the end we made steady progress, and we are still on track to finish the project on time.

James’s Status for 10/23

Last week, I mainly focused on the design review report. I didn't get it as far along or as polished as I would have hoped, which means more work for us when updating it for the final document. I ended up writing sections 1, 2, 3, 3.2, 4.1, 4.2, 5.1, 6.1, 9, 9.1, the acronym glossary, the BoM, and the references. I was unable to write more, or to organise with my partners to write more in their sections; I know we messed this up badly, and we will aim to rectify it before the final report comes around.

This week, I focused more on optimising CNN operations on the FPGA. This is a little out of order, but I decided to do it now because it works much more synergistically with where we are in reconfig right now. So far I have increased throughput (on a basic fast-CNN implementation) by 25% to 20MOp/s, but I expect to settle at two orders of magnitude higher than where I am right now. I also helped Kunal on-ramp with some Vitis material, as he was slipping behind on ramping up with the platform. I shared excerpts from the Vitis tutorials we were given for reconfig, and pointed him more directly at online resources for Vitis. I need to circle back with him and check where he is with progress on I/O, and plan to do so this coming Monday. This may affect the Gantt chart/schedule, but we have the slack to allow for it for now. I will be keeping tabs on how much slack I use in my tasks, because I know I have begun cutting things close with the amount of remaining slack I am allotting myself.

James’s Status for 10/9

This week was very busy for me in other courses, and so I did not hit my targets. I didn't get a chance to sync up with Kunal to see what I/O routines he has written or to begin testing/validating them. Research on I/O, however, seems to be mostly done; it wrapped up early in the week, on Monday, so that was one task I was able to check off.

This coming week, I plan to grind on I/O and start on the CNN math. We also have the design review report to further flesh out. Hopefully this doesn't take too much time away from my other tasks, but I don't think it will, since we have 'mid-semester break', which I can use as a day to get things done.

James’s Status for 10/2

This week I continued research on I/O for the Ultra96. I was able to find example code for video in and video out, which I will need to modify to work with video-file input; the example used a video stream in to 1080p@60FPS video out. Looking at the specs of the board and the available training data that Joshua found, we decided to change our spec framerate to 30FPS. There are more video datasets at 30FPS than at our original choice of 24FPS; this comes down to the difference between 'p' (progressive) and 'i' (interlaced) formats, which we initially overlooked. Progressive formats are the norm on modern pixel displays, while interlaced formats date from when broadcast video was interlaced.

I was also able to get communication set up between the ARM core and the FPGA of the Ultra96. This turned out to be a prerequisite for some of the setup I had to do for looking at I/O, so I was able to check off something we had planned for further down the line in our Gantt chart, which is always a good thing.

I started our slideshow for the design presentation and began the draft of the report as well. I have been working on both, and running metrics so we have hard data to present, specifically runs comparing MSE, SSIM, and VMAF to motivate the need for VMAF as a metric over the two more well-known ones.
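
For the simplest of those metrics, the per-frame MSE computation looks roughly like the OpenCV sketch below (SSIM and VMAF come from their own tools, so only MSE is shown); file names are placeholders.

```cpp
#include <opencv2/opencv.hpp>
#include <cstdio>

// Mean squared error between two same-sized single-channel images.
static double mse(const cv::Mat &a, const cv::Mat &b) {
    cv::Mat diff;
    cv::absdiff(a, b, diff);
    diff.convertTo(diff, CV_32F);
    diff = diff.mul(diff);    // per-pixel squared error
    return cv::mean(diff)[0]; // average over all pixels
}

int main() {
    cv::Mat ref = cv::imread("ref.png", cv::IMREAD_GRAYSCALE);  // placeholder
    cv::Mat up  = cv::imread("test.png", cv::IMREAD_GRAYSCALE); // placeholder
    if (ref.empty() || up.empty() || ref.size() != up.size()) return 1;
    std::printf("MSE: %f\n", mse(ref, up));
    return 0;
}
```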

Team Status Report 9/25

For this week, we followed our schedule/Gantt chart and attempted to do everything on our list. Almost every task was accomplished successfully, with details listed in our personal status reports; the only hiccup was our AWS setup – the credits will be acquired by next week, and we have already looked into which AWS instance type would be most suitable for our group.

During class time, our entire group also performed peer review on all our classmates. We decided that Kunal will present the second presentation, and we aim to address all of the concerns raised during the Q/A session of our first presentation, as well as all feedback we received from TAs/professors through Slack.

As a team, we further discussed the model for our project, as it is a core part of the upscaling process. Reflecting on the feedback from Slack, and following our schedule, we also decided on a specific dataset and acquired the videos online from a database that provides them.

Looking towards next week, we are on track according to our schedule and optimistic about continuing our positive trajectory. Next week, we begin trying to implement I/O with our acquired hardware, as well as preparing for our design presentation. We will also begin writing the Python code for the training part of our project.