Joshua’s Status Report for 12/4

This week, I made several adjustments to our priorities before the end of the semester. First, James was making good progress on the implementation of FSRCNN-s, which meant that getting weights for the model was integral to our final demo and deliverable. Taking the worst-case scenario into account, I researched publicly available weights for our CNN model and compared the datasets they were trained on against ours to see whether they would be suitable, in case the training for our new model could not be finished in time. After talking with the team, I reevaluated our priorities and realized that finishing and printing our CAD model would have to be put on the back burner: figuring out hyperparameters and working through possible solutions with James was much more pressing if we were to meet our deliverables by the end of the semester.

In terms of the final demo, I proposed several ideas for how we could structure the presentation, including a multi-monitor setup and interaction with the audience, such as letting them vote on which video they thought was better and prompting them to give a subjective quality score.

I put together the final presentation and prepared to deliver it. Alongside that, I worked on the final report, detailing our testing and metrics for our initial model, even though we have since moved on from it, so as to fully document our development process.

In terms of the final video, I discussed with my team what approaches we could take, and since none of us really had any video-editing experience, we decided to prepare in advance. I looked at videos from previous Capstone courses, specifically last semester's winners, as I thought their video was well paced and well edited.

(This report was made late, and was added on December 12th.)

Team Status Report for 12/04

James continued working on squeezing performance out of the FSRCNN model, but ran into diminishing returns. Because our trained weights are now fixed, he was able to use fixed weights in the kernel instead of weights mapped in from host-side memory, which allowed for some additional improvements in memory access. Integration with the host side introduced further slowdowns, so after thinking through ways to improve this, he decided on a multi-kernel approach and began writing it. He expects to finish implementing it by the end of the week of 11/29.

James’s Status for 12/04

From last week I have an implementation of FSRCNN which runs faster than SRCNN, though it is still slow. One optimisation I tested was using fixed weights as opposed to weights stored in host-side memory and mapped into the kernel. This led to a decent improvement in latency, but not enough to meet our initial specifications. Porting and integrating with the host code has produced further slowdowns. I am trying to remedy this with a multi-kernel approach, which should be finished by tonight. For the coming week I will be focusing on writing the paper, the video, and building a narrative to sell what we have, as we aren't in a position, schedule-wise, to try for more optimisations, even if that's what I would like to do.

Project-management-wise, I also helped Josh practice for the Wednesday presentation on Tuesday.

Team Status Report for 11/20

This week, our group spent a lot of time getting a working version of our upscaling model onto the FPGA. Since we had several delays the previous week, we had to work hard to make up for lost time, and we are making steady progress each day to meet our deadlines.

James worked on optimizing the latency on the U96 board, as well as improving the FPGA architecture of the CNN. He ran into several problems near the beginning of the week, but quickly caught up and is now in the middle of addressing the various limiting factors that are preventing our upscaling model from working as intended.

Joshua worked on addressing the unexpectedly high end-to-end latency on the U96 board by training a smaller model from scratch. He was also in charge of starting the final presentation, since he will be presenting, and of continuing work on the final report. He also did further research on a case for the U96 board, something James had an initial design for last week, though it was for an older generation of the board.

Kunal is continuing to work on the I/O portion of the board, looking at the different ways frames can be passed in to speed up the implementation.

Next week, we are looking to have a final product working, even if it doesn't fully meet our initial requirements. From the work we've put in over these last few months, it seems some trade-off is inevitable, and this week we will pinpoint exactly which trade-offs we are willing to make, aiming to have at least a semi-working demo in preparation for our final presentation on 11/29.

Joshua’s Status Report for 11/20

As the semester wraps up, this week was very busy in terms of workload and addressing issues that came up. At the beginning of the week, I researched CAD designs for our FPGA board: a kind of holder that wraps around the board to prevent static discharge, something my team and I had looked into last week. Since none of us had much CAD experience, the process was more time-consuming than expected. I ended up asking a friend with more CAD experience than me to teach me, and I now have a finalized design for the board.

However, our group quickly ran into a bigger problem: our end-to-end latency was much higher than we had initially calculated. Although our throughput was close to meeting our requirements, our end-to-end latency was significantly higher than the threshold we had set for 'real-time' operation, i.e. a delay of less than 30 ms. To address this, a smaller model was needed, but that would come at the cost of quality in our final upscaled videos. We had to decide whether the sacrifice in quality was worth it, or whether to take the hit in latency and justify why that was the better choice overall.
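
To illustrate why the two metrics diverge (with made-up stage times, not our measured numbers): in a pipelined design, throughput is limited by the slowest stage, while end-to-end latency is the sum of every stage a frame passes through, so a system can keep up with 30 fps and still miss a sub-30 ms latency target.

```python
# Hypothetical stage latencies (ms) for a pipelined upscaler; these are
# illustrative numbers, not our measured values.
stages_ms = {"read/decode": 8, "CNN kernel": 25, "colour merge": 5, "display": 6}

frame_period_ms = 1000 / 30  # ~33.3 ms between frames at 30 fps

throughput_bound_ms = max(stages_ms.values())    # pipelined: limited by slowest stage
end_to_end_latency_ms = sum(stages_ms.values())  # one frame traverses every stage

print(f"Slowest stage: {throughput_bound_ms} ms -> keeps up with 30 fps: "
      f"{throughput_bound_ms <= frame_period_ms}")
print(f"End-to-end latency: {end_to_end_latency_ms} ms -> meets <30 ms target: "
      f"{end_to_end_latency_ms < 30}")
```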

I talked through the problem with James and Kunal, and we considered ideas such as training a single-layer CNN from scratch, which was still possible in the time we had left but would cut it very close, as well as other solutions. In the end, I prepared a smaller model which still had three layers but with drastically fewer filters, and started training it in parallel with our initial model: the new model on my GPU, and the initial model continuing on Google Colab. I aim to fully address this issue in the coming week and coordinate with James to ensure that we'll have at least a final working product, even if it doesn't meet our initial requirements.
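
For reference, here is a minimal sketch of what such a three-layer, reduced-filter SRCNN-style model looks like, written in PyTorch; the framework, filter counts (32/16), and kernel sizes are placeholders for illustration, not the actual hyperparameters I trained with.

```python
import torch
import torch.nn as nn

class SmallSRCNN(nn.Module):
    """Three-layer SRCNN-style network with reduced filter counts.

    Assumes the SRCNN-style setup where the network refines the Y channel of a
    bicubic-upscaled frame, so input and output have the same spatial size.
    Filter counts (32/16) are placeholders, not the final hyperparameters.
    """
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=9, padding=4),  # feature extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 16, kernel_size=1),            # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, y):
        return self.body(y)

# Quick shape check on a dummy single-channel frame.
model = SmallSRCNN()
out = model(torch.randn(1, 1, 270, 480))
print(out.shape)  # torch.Size([1, 1, 270, 480])
```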

In terms of the final presentation and the final report, I've started working on the former and continued last week's work on the latter, since I will be the one presenting next week.

Joshua’s Status Report for 11/13/21

This status report is split into two parts: the first covers what happened last week, i.e. 10/31-11/6, since my status report was missing for that week, and the second covers the week of 11/7-11/13.

For the week beginning on 10/31, I was extremely sick from 10/31-11/3 and could not get much work done or communicate with my teammates. When I got going again on Thursday, my group and I got together to reevaluate our situation and worked on presenting for our interim demo. I was responsible for the software demonstration, which involved showing our working software model upscaling videos from our dataset. I had several demonstrations in mind. One would output a real-time, side-by-side comparison of videos from bicubic upscaling, CNN upscaling, and the original native-resolution video. Another involved a live webcam feed, showing an upscaled version of the webcam in real time using a model with fewer filters. The point was to demonstrate that although real-time upscaling is possible in software on a GPU, it still falls short of the CNN model with more filters, which is not feasible in software but is feasible with our hardware/FPGA implementation. The webcam demo did not end up working, so I resorted to upscaling still images and showing the difference in SSIM between CNN models trained with different numbers of filters.
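
For documentation, here is a rough sketch of the side-by-side demo flow, assuming an OpenCV/Python setup; upscale_cnn() is a placeholder standing in for our trained model, and the real demo code may differ.

```python
import cv2
import numpy as np

SCALE = 4  # placeholder upscaling factor, not our final spec

def upscale_bicubic(img):
    h, w = img.shape[:2]
    return cv2.resize(img, (w * SCALE, h * SCALE), interpolation=cv2.INTER_CUBIC)

def upscale_cnn(img):
    # Placeholder for the CNN upscaler so the sketch runs on its own;
    # the real demo calls the trained model here instead.
    return upscale_bicubic(img)

cap = cv2.VideoCapture(0)  # webcam; pass a file path for dataset videos
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Crop so the dimensions divide evenly, then downscale to simulate a low-res source.
    h, w = (frame.shape[0] // SCALE) * SCALE, (frame.shape[1] // SCALE) * SCALE
    frame = frame[:h, :w]
    small = cv2.resize(frame, (w // SCALE, h // SCALE), interpolation=cv2.INTER_AREA)
    side_by_side = np.hstack([upscale_bicubic(small), upscale_cnn(small), frame])
    cv2.imshow("bicubic | CNN | native", side_by_side)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```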

For the week beginning on 11/7, we started the week off with our interim demo. I demonstrated a clear difference in SSIM between a standard bicubic interpolation upscaling method and our software-based implementation of a CNN upscaling method on a video from our dataset. Due to some problems with throttling, I had to resort to showing videos I had previously upscaled instead of demonstrating my model working live.
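
For reference, a per-frame SSIM comparison of this kind can be computed roughly as follows (a sketch using OpenCV and scikit-image, which may not be exactly the tooling I used; the file names are hypothetical and both videos are assumed to have matching dimensions):

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def ssim_against_native(native_path, upscaled_path):
    """Average per-frame SSIM of an upscaled video against the native one,
    computed on the luma (Y) channel."""
    native, upscaled = cv2.VideoCapture(native_path), cv2.VideoCapture(upscaled_path)
    scores = []
    while True:
        ok1, f1 = native.read()
        ok2, f2 = upscaled.read()
        if not (ok1 and ok2):
            break
        y1 = cv2.cvtColor(f1, cv2.COLOR_BGR2YCrCb)[:, :, 0]
        y2 = cv2.cvtColor(f2, cv2.COLOR_BGR2YCrCb)[:, :, 0]
        scores.append(ssim(y1, y2, data_range=255))
    return sum(scores) / len(scores)  # assumes at least one frame was read

# Hypothetical file names, for illustration only.
print("bicubic:", ssim_against_native("native.mp4", "bicubic_upscaled.mp4"))
print("CNN:    ", ssim_against_native("native.mp4", "cnn_upscaled.mp4"))
```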

I did some research to see whether a dataset of still images would work better than our initial dataset of frames taken from videos. The reason is that our upscaling models tend to output blurrier frames and videos, which is subjectively better than the artifact-heavy videos produced by bicubic upscaling and objectively better in terms of SSIM, but still unusual. I hypothesized that since many of the images in our dataset are blurry, the CNN may have learned to output blurry regions of pixels even where the native frames are sharp. After discussing with the team and examining our timeline, I decided to explore other datasets, ones whose still images weren't natively blurry. We were able to fit this in because we already had our training infrastructure set up with Google Colab and my local machine. I settled on a dataset used by another open-source upscaling implementation, though one aimed at single-image super-resolution rather than video super-resolution. Using the same hyperparameters and 1-2 days of training, the initial results yielded higher SSIMs on still images, but when tested on our video datasets, artifacting was much more apparent and the SSIM fell short of my initial model. So I dropped that idea and continued training our previous implementation.

I also discussed with James the current limitations/bottlenecks of our hardware approach, specifically whether there was anything that could be done and tested on the software side to help with the runtime of the hardware model. He suggested minor tweaks to the hyperparameters, particularly the number of filters in our second CNN layer, which could potentially cut the implementation's runtime massively with only a very minor decrease in the quality of the upscaled video. We also discussed potential issues with how each pixel is processed, specifically the encoding involved, since the paper only processes one plane of data: the luminance (Y) portion of the YCbCr encoding. There are other encoding choices that could potentially reduce the amount of data being processed on our FPGA, which is significant because James mentioned it was a potential bottleneck for our system.
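
As a reference for the Y-only flow the paper describes, here is a sketch assuming OpenCV (which orders the channels as Y, Cr, Cb); upscale_cnn_y() is a placeholder for the network, and the scale factor is illustrative:

```python
import cv2
import numpy as np

SCALE = 4  # placeholder factor

def upscale_cnn_y(y_channel):
    # Placeholder: the CNN would refine/upscale the Y channel. Here we just
    # return a bicubic result so the sketch runs on its own.
    h, w = y_channel.shape
    return cv2.resize(y_channel, (w * SCALE, h * SCALE), interpolation=cv2.INTER_CUBIC)

def upscale_frame(frame_bgr):
    """Run the model on the luma (Y) plane only; chroma is upscaled with bicubic."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    y_up = upscale_cnn_y(y)
    h, w = y_up.shape
    cr_up = cv2.resize(cr, (w, h), interpolation=cv2.INTER_CUBIC)
    cb_up = cv2.resize(cb, (w, h), interpolation=cv2.INTER_CUBIC)
    merged = cv2.merge([y_up, cr_up, cb_up])
    return cv2.cvtColor(merged, cv2.COLOR_YCrCb2BGR)

# Shape check on a synthetic frame.
demo = (np.random.rand(120, 160, 3) * 255).astype("uint8")
print(upscale_frame(demo).shape)  # (480, 640, 3)
```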

Minor additions: I added content to our website and got a start on our final report, so as to avoid the situation we had with our previous design report. It was also helpful to write down exactly what changes we have put in place since the design report, to double-check that our project has stayed on track and hasn't deviated from our initial requirements.

James’s Status for 11/13

First off, I'd like to apologise for the lack of a status update last week (nothing was posted on 11/6). I was extremely busy getting our hardware working for the interim demo. For the sake of coverage and good documentation, I will include what I would have had in that update here, clearly backdating and marking entries where applicable. I've included an end-of-week overview of last week, daily reports for each day this week, and an end-of-week overview of this week. I've decided to add daily reports for myself for two reasons: 1) to keep myself accountable for making regular progress on the project, and 2) because, having reached this stage of the project, I have a lot more to do than previously and want a better way to organise it.

———-

End-of-Week Update: (10/31-11/6)

This week I got hyperparameters back from Josh, so I was able to get the CNN built on the Ultra96. Unfortunately, because he was still recovering from illness, I didn't get the hyperparameters as early as I'd have liked, and so was not able to run all the experiments I wanted to this week. One big takeaway from building the model at full size was that I hadn't fully appreciated its size before this point, and so hadn't realised that each hidden layer (at least as implemented now) has to make calls back to DRAM. This may cause slowdowns, but I haven't had the chance to benchmark it yet; that is a big TODO for the upcoming week. The size of the model also causes builds to take a very long time to finish synthesising and routing: around 30 minutes for an incremental build, and far longer for a clean build. Development at this scale will be far slower than I anticipated just due to this turnaround.

As of now I just have the model hyperparameters, no weights, but the model I have implemented on the FPGA is agnostic to the weights; they will just be loaded from a file by the host. There could be improvements from precomputation, but I'm not sure that's actually the case; I would have to do cost-benefit profiling of how much computation and memory accessing it would actually save. At the same time, having the model agnostic to the weights gives our system more modularity, which is very good for a short-turnaround testing environment like ours. In the coming days we will need to get the system partially integrated for the demo, and then keep making progress over the rest of the coming week.
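
Since the kernel is weight-agnostic and the host will load the weights from a file, here is a hedged sketch of how the trained weights might be flattened and exported on the software side (assuming PyTorch; the actual export format is still to be decided):

```python
import numpy as np
import torch

def export_weights(model, path="weights.bin"):
    """Flatten each layer's weights and biases, in a fixed and documented order,
    into a raw little-endian float32 stream the host can read back with fread().

    'model' is assumed to be the trained network from the software side; the
    file name and layout here are placeholders.
    """
    chunks = []
    for name, tensor in model.state_dict().items():
        arr = tensor.detach().cpu().numpy().astype(np.float32)
        print(f"{name}: {arr.shape}")  # record the layout for the host side
        chunks.append(arr.ravel())
    np.concatenate(chunks).tofile(path)
```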

Daily Update 11/8: (Interim Demo)

I did integration this weekend and ran into a great number of immediate issues, especially with the interim demo deadline being so soon. The first issue was finding decent data sources. For expediency, and as a proof of concept of getting video from the host to the fabric, I wanted to store a video in the home directory of the board's file system, but couldn't get the files to play nicely (issues with file formats, dimensions, file size, and so on). In the interest of time, I reverted to using an mp4. After our first demo I will ask Josh to share the dataset so we have better, more applicable files to use. The size of the files will also be less of a problem, since they will live on a USB drive rather than on the same microSD card that holds the PetaLinux image. The second main issue was that the code Kunal gave me was riddled with bugs and errors. The clearest and most effective path forward was to rewrite the host code entirely. Linking the correct OpenCV libraries with Vitis was a bit painful, as the project file does not store the build configuration in an obvious way, but all in all it was not as bad as it could have been.

The host code (for the demo) took a few hours to write; debugging was minimal because I made sure to code carefully, since builds/compilations are quite expensive. Another thing to note is that, for the demo and only for the demo, I reduced the sizes of the filter maps to get shorter builds and hence a faster iteration cycle, to make sure a live demo was available as a deliverable. I ended up achieving a much-reduced spec (as expected for the interim demo) where the host reads a video file with a known path and name, launches the kernel on the fabric, reads back the data, and serialises it to a file. Moving forward, we will want to send data to video output over the mini DisplayPort instead of serialising. We will also need to add benchmarking, both for accuracy and for time. Lastly, judging just by wall-clock time, serialisation seems to take an untenable amount of time (a few seconds). We will need to investigate whether this is also the case for streaming video and make sure it does not become a bottleneck for us.
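
For documentation, here is a rough sketch of the demo host flow just described (read a known video file, run the kernel per frame, serialise the output). The real host code is C++ built with Vitis and OpenCV; this Python version uses a placeholder run_kernel() in place of the FPGA call and exists only to record the intended flow.

```python
import cv2

def run_kernel(frame):
    # Placeholder for launching the CNN kernel on the fabric and reading back
    # the upscaled frame; it is a no-op here so the sketch is self-contained.
    return frame

IN_PATH, OUT_PATH = "input.mp4", "upscaled.avi"  # hypothetical file names

cap = cv2.VideoCapture(IN_PATH)
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
writer = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    out = run_kernel(frame)
    if writer is None:
        h, w = out.shape[:2]
        writer = cv2.VideoWriter(OUT_PATH, cv2.VideoWriter_fourcc(*"MJPG"), fps, (w, h))
    writer.write(out)  # serialise the result to a file
cap.release()
if writer is not None:
    writer.release()
```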

Daily Update 11/10:

I re-integrated the correct input/output map sizes into the FPGA design. The builds still take ~30 minutes. I want to find a better way to iterate on the full design that doesn't require such a long build, but at the same time I don't want to devote too much time to something that might not amortise out. Honestly, with the runway we have left, I don't think it will be worth it, so I will not devote much time to optimising builds. I plan to block out three hours tomorrow to try to improve the iteration cycle; if nothing comes of it, so be it, and I'll just need to be careful with every build I do.

Daily Update 11/11:

Because of what Tamal told us yesterday in the interim demo about static discharge on the U96, I began looking into cases for the U96 that we could use to mitigate the risk of discharge from touching the components of the device. I didn't find many existing options, just one 3D-printable model on Thingiverse, linked here. The main drawback of this model is that it includes space for the JTAG/UART extension, which we aren't using, and so it would be bulkier than what we want or need. I might look into modifying the model so that we can have a case with a better form factor. At the same time, I'm not sure I have the bandwidth to add this to all the other tasks I need to complete as per our schedule. I plan to leave this as lower priority (it wouldn't be the worst thing in the world if the case had extra space for the pod), and I'm planning to ask my group whether any of them have more bandwidth or more experience with CAD and 3D printing.

Daily Update 11/12:

I didn't get much work done for this class on Friday; I mostly focused on deadlines I had in other courses.

Daily Update 11/13:

I again had other coursework to attend to during the day. This evening I'm running some benchmarks on the CNN kernel so I can get a sense of how much further I need to push it. I won't have numbers in time for this update's due date, but I will have them later tonight, past midnight.

End-of-Week Update/Overview: (11/7-11/13)

This week was fairly productive: we have a full(-ish) system, and we just need to flesh it out and iron out some kinks. The build times, in retrospect, should not be a huge issue; I'll just need to be smart about what I run, and it's good practice for industry codebases and for learning the lesson that compiles are not always free. The case has taken a place on the back burner for now; it would be a nice convenience, but not something we need for MVP. With tonight's profiling and some readings in hand, I should be ready to start iterating in earnest, with a more solid goal to reach. At this point, I am fairly confident that I can get my part done on time or ahead of schedule. I may attach an update after the due date to include results from benchmarking that finishes late in the night, so that course staff can review it before Monday.

Joshua’s Status Report for 10/30/21

This week, I set out to accomplish two main tasks: addressing our problem with AWS, which would allow the training process for our CNN to begin on time, and expanding on our previous research into pre-trained models. The latter is to make sure we have a model ready to use if our software model either isn't ready on time or fails unexpectedly, so that we can still submit a final, working product.

In terms of the research, I worked with James to look into various pre-trained models online. As we had found initially, a lot of pre-trained models based on the paper we are using don't follow the method stated in the paper exactly, and rely heavily on filtering rather than a CNN as their main upscaling method. A surprising number of them also throw in line-sharpening and anti-blurring filters, which greatly increase computation time and hence cannot realistically be run in real time. We eventually found an open-source Python implementation of SRCNN on GitHub which does use a CNN, but is only rated for up to 4x upscaling. This detracts slightly from our initial goal of 4.5x upscaling, which we had determined to be achievable, but it would still be viable to put on hardware to show the acceleration possible with our FPGA implementation. The dataset it uses is different from ours, since it consists mainly of still images rather than key frames from videos, but it is still relevant because it has the characteristics we chose for our own dataset: a variety of shots, close-up vs. zoomed out, nature vs. still objects, etc.
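
For context on why a model 'rated' for 4x can still be attempted at 4.5x: SRCNN refines a frame that has already been bicubic-upscaled to the target size, so changing the scale factor only changes the pre-upscale step, though reconstruction quality is expected to degrade beyond what the model was trained for. A sketch, with pretrained_srcnn() standing in for the GitHub implementation:

```python
import cv2

def pretrained_srcnn(img):
    # Stand-in for the open-source SRCNN's forward pass; the real call depends
    # on the GitHub implementation we found.
    return img

def srcnn_upscale(frame, scale=4.5):
    """SRCNN refines a frame already bicubic-upscaled to the target size, so
    the scale factor only affects this pre-upscale step."""
    h, w = frame.shape[:2]
    pre = cv2.resize(frame, (round(w * scale), round(h * scale)),
                     interpolation=cv2.INTER_CUBIC)
    return pretrained_srcnn(pre)
```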

To address the concerns with AWS, immediately after our meeting Kunal double-checked his AWS account and found that the request had actually already been approved; he had just missed it. Even so, the approval was insufficient, as we were not granted enough vCPUs to use our chosen instance: a P3 instance requires 8 vCPUs, whereas Amazon only provided us with 1. After we followed up on our initial request, they replied within 2 days stating that they did not currently have enough resources to provide the vCPUs needed for a P3 instance, and recommended that we go with the G4 instances instead, which we had looked at previously and which were our second-best choice.

Concurrently, I also attempted to use Google Colab on Joel's advice, and there were two main problems. First, as Joel had mentioned, the free version shuts off after some period of inactivity. Second, the storage was very limited and couldn't fit the dataset we had chosen, which was close to 100 GB. As we were on a tight schedule, I bought the paid version for $10 without putting in a request, which addressed both concerns, upping the storage to around 180 GB, which is more than sufficient. The code runs fast enough that, after ironing out some bugs, I estimate the model will be fully trained by around Wednesday/Thursday of this coming week. Since the code runs well enough on Google Colab, we are no longer using AWS, as Colab is also significantly more convenient.

For the coming week, since my work on the software section will be complete, I will be helping James and Kunal where necessary with the integration process.

Team Status Report (10/30)

This week I (Kunal), along with James, worked on the I/O portion of the project. James helped me with host-side programming for the U96 board and provided various resources to look at so that I could get going with the implementation.

James attempted to make further gains on the CNN kernel, but was not as successful this week. However, he has worked out various strategies that should help speed up our implementation and has been attempting to implement them. He also worked with Joshua to research pre-trained models, and set up a Git repository to help Kunal with his portion of the project.

Joshua worked on researching pre-trained models along with James, and also attempted to get AWS up and running, but after testing Google Colab decided to go with that instead. Our AWS request came back but was not fully fulfilled, as we weren't granted a high enough vCPU limit to use our preferred instance, so after purchasing Google Colab Pro he decided to use that to speed up the training process.

In terms of the whole project, we almost have a working CNN model. Training should be done by around Wednesday/Thursday this week; James and I will be working extensively on CNN acceleration before then, and will then take the weights/hyperparameters from our software model and implement them this coming Friday/weekend to make sure everything is going smoothly. Overall, our project is around a week behind, but we are confident it will go smoothly, as we have built in enough slack time and have addressed the issues that were preventing the project from moving forward.

James’s Status for 10/30

Since the problems we were having with AWS were reaching the critical path for the completion of our project this week, I helped Josh look for alternative pre-trained models in case AWS/training fell through. While pre-trained models do exist, many of them would not be exactly what we need for our use case. The pre-trained models we found were 'rated' for up to 4x upscaling, meaning their performance would degrade at the 4.5x scaling factor we will be using. Additionally, we found many models had extra layers of DSP preprocessing which we did not, and do not, plan to use. If our hand is forced and we must use a pre-trained model, we have settled on an open-source version found on GitHub that implements SRCNN without the extra preprocessing, knowing this means we may not attain the picture reconstruction accuracy we originally set out for (since the model will only have been trained to give good restoration up to 4x).

This week I also further helped Kunal ramp up on host-side programming for the U96 board, and pointed him toward various resources so he could get started on the implementation.

I also set up a Git repository for us to use for the U96 Vitis project. As of now it only contains the vector-vector addition template example, as an aid to get Kunal started on programming the host. I tried making further incremental gains on the CNN kernel, but was unable to realise any more this week. On the bright side, I was able to rule out a good few strategies for speedup, so the design space is, at the very least, still converging. I think Kunal should be pretty much fully ramped by now, so I should have more time this coming week to further explore the design space for CNN acceleration.