***** Apologies for lateness *****
Again, I will be structuring my status this week as daily updates alongside an end-of-week review.
Daily Update 11/14:
I was unable to get the benchmark data that I wanted. I am running into a massive bug where the code compiles and synthesizes but, when run on the U96, hangs the board until a manual reboot is issued, either by unplugging it or via the power button; a soft reset isn't even an option. I want to get the design into vitis_hls so I can determine whether the kernel is actually running but taking so long that it overloads the board past the point of keeping the heartbeat alive (which would be very bad, since it would mean our model demands far more computation than our board can provide), or whether there is an error in my code. In all honesty, I wouldn't be surprised, and almost expect, that it's an error in my code. A sketch of the testbench I have in mind follows below.
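For reference, a minimal C-simulation testbench sketch for vitis_hls; cnn_top and the array sizes are placeholders for the real kernel's interface, and the kernel source itself gets added to the HLS project separately. Under csim, address errors and infinite loops surface on the host, where the tool can catch them, instead of hanging the board.

    #include <cstdio>

    const int N_IN = 1024, N_OUT = 10;  // assumed interface sizes

    // Kernel under test; the definition lives in the HLS design sources.
    void cnn_top(const float in[N_IN], float out[N_OUT]);

    int main() {
        static float in[N_IN], out[N_OUT];
        for (int i = 0; i < N_IN; i++) in[i] = 0.01f * i;  // dummy frame data
        cnn_top(in, out);
        for (int i = 0; i < N_OUT; i++) printf("out[%d] = %f\n", i, out[i]);
        return 0;  // csim treats a nonzero return from main as a test failure
    }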
Daily Update 11/15:
Today was productive. I was able to get the code into vitis_hls and building properly, albeit after some vast codebase restructuring. Running in HLS showed that, in simulation, we should be making timing, or at the very least that the FPGA is not going to be the limiting factor: the design meets the 10 ns timing target, and I could probably push it further, since 10 ns is just the default clock period that HLS builds to. There could still be considerable delays from data movement on the host side, or other memory issues into which this test gives no insight.

Further testing in HLS also let me pin down the cause of the U96 hangs: memory overruns that I didn't catch when porting over the full-sized system's true sizes. I've gone ahead and fully parameterised my code now, so there is no room for this error to happen again (a sketch of the idea follows after the list below). With that fixed, I am now running into an XRT error where memory usage causes the host to segfault, the particular error being "bad_alloc". Some preliminary digging into the docs suggests this points to allocating too much memory. I'm going to look further into this tomorrow, and also look into using lower-precision FP types to shrink the memory footprint. If these don't pan out tomorrow, I will also fork a branch on our Git for a different CNN architecture on the FPGA. The two options I have in mind are:
1) A fixed full-feature-map layer-kernel, as opposed to my current implementation as a model-kernel. The host would apply the layer-kernel three times, loading in the relevant weights as it goes along.
2) A single-feature-map layer-kernel. This would be very data-lightweight, but would put more responsibility on the host for coordinating memory movement, and that movement might end up being the dominating factor for latency and throughput.
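Below is a minimal sketch of what the parameterisation looks like; every name and number here (IMG_H, CONV1_CH, the 64 MiB budget, dtype_t) is an illustrative assumption, not the project's actual values. The same pattern also covers the lower-precision idea, since the element type becomes a single swappable alias.

    #include <cstddef>
    #include "ap_fixed.h"  // Vitis HLS fixed-point types

    // Single source of truth for every dimension: buffer sizes follow
    // from these, so a ported full-size value can no longer silently
    // overrun a buffer still sized for the small test model.
    constexpr int IMG_H    = 224;  // assumed input height
    constexpr int IMG_W    = 224;  // assumed input width
    constexpr int CONV1_CH = 16;   // assumed layer-1 output channels

    // Swappable precision: float today, a narrower fixed-point type
    // later to cut the memory footprint roughly in half.
    using dtype_t = float;         // e.g. ap_fixed<16, 6> later

    // Buffers are sized from the parameters, never from literals.
    static dtype_t conv1_buf[CONV1_CH][IMG_H][IMG_W];

    // Compile-time guard: fail the build rather than hang the board if
    // the footprint exceeds an assumed 64 MiB budget.
    constexpr std::size_t BUDGET_BYTES = 64UL * 1024 * 1024;
    static_assert(sizeof(conv1_buf) <= BUDGET_BYTES,
                  "conv1_buf exceeds the memory budget");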
Daily Update 11/16:
Doing some hand calculations on my current implementation, just as a sanity check, it looks like the issue is indeed a memory-related one: I am requesting more memory from the system than it should be able to provide. The dominant factors are the hidden-layer buffers, which I am currently storing as full in-memory buffers between layers. Since I can see this now, I'm going to more tightly couple the layers of the network so that I can remove these inter-layer memory requirements; a sketch of the coupling idea follows below.
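For reference, here is a minimal sketch of the standard Vitis HLS pattern I have in mind for that coupling: layers connected by small on-chip FIFOs under a DATAFLOW pragma instead of full inter-layer buffers. The stage function, sizes, and FIFO depths are placeholders, not our real layers.

    #include "hls_stream.h"

    typedef float dtype_t;
    const int N = 1024;  // assumed number of activations per stage

    // Illustrative layer stage: consumes a stream, applies a placeholder
    // op, produces a stream. A real conv layer would keep only its own
    // line buffers here, never a whole feature map.
    static void stage(hls::stream<dtype_t>& in, hls::stream<dtype_t>& out) {
        for (int i = 0; i < N; i++) {
    #pragma HLS PIPELINE II=1
            out.write(in.read() * 0.5f);  // placeholder compute
        }
    }

    void cnn_top(hls::stream<dtype_t>& in, hls::stream<dtype_t>& out) {
    #pragma HLS DATAFLOW
        // FIFOs only need to cover the window a consumer reads at once,
        // not a whole feature map, removing the inter-layer buffers.
        hls::stream<dtype_t> s1("s1"), s2("s2");
    #pragma HLS STREAM variable=s1 depth=64
    #pragma HLS STREAM variable=s2 depth=64
        stage(in, s1);
        stage(s1, s2);
        stage(s2, out);
    }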
Daily Update 11/17:
Thinking further about the inter-layer optimisations, there is no way to implement them while keeping the overall structure I currently have. Hence, I am trying a new strategy in which the calculation is done not tile-by-tile but sub-tile-by-sub-tile within each tile (see the sketch below). I will spend today finishing getting this up and running, and then will sweep a few values.
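As a sketch of the loop structure (with assumed sizes; the real kernel's dimensions and compute are different):

    const int TILE = 32, SUB = 8;  // assumed tile / sub-tile edge lengths
    const int H = 128, W = 128;    // assumed feature-map size (TILE divides both)

    // Instead of pushing whole TILE x TILE blocks through the datapath,
    // each tile is walked as a grid of SUB x SUB sub-tiles, shrinking the
    // working set that must be resident on-chip at any one time.
    void process(const float img[H][W], float out[H][W]) {
        for (int th = 0; th < H; th += TILE)
            for (int tw = 0; tw < W; tw += TILE)
                for (int sh = th; sh < th + TILE; sh += SUB)
                    for (int sw = tw; sw < tw + TILE; sw += SUB)
                        for (int i = sh; i < sh + SUB; i++)
                            for (int j = sw; j < sw + SUB; j++)
                                out[i][j] = img[i][j] * 0.5f;  // placeholder op
    }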
Daily Update 11/18:
This new architecture looks promising: I have been able to get lower numbers than before, though they are still too high to be useful. I also did a fuller calculation of the bandwidth-limited best case, and it was extremely concerning: the ideal latency works out to around 6 s, roughly two orders of magnitude above where we need it to be (~60 ms), and that still assumes I can actually achieve the ideal, given the memory-layout constraints imposed by how OpenCV structures the frame when we read it in. A back-of-envelope version of this calculation is sketched below.
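The shape of that calculation, for reference; the traffic and bandwidth numbers below are made-up placeholders chosen only to illustrate a ~6 s floor, not our measured values.

    #include <cstdio>

    int main() {
        const double bytes_per_frame = 1.2e10;  // assumed total DDR traffic per frame
        const double bandwidth_Bps   = 2.0e9;   // assumed sustained DDR bandwidth
        // No matter how fast the compute is, end-to-end latency cannot
        // beat the time it takes just to move the data.
        const double floor_ms = 1e3 * bytes_per_frame / bandwidth_Bps;
        printf("memory-bound latency floor: %.0f ms\n", floor_ms);  // ~6000 ms
        return 0;
    }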
Daily Update 11/19:
Tried to restructure the network, but I misunderstood the architecture I was going after, so the effort ended up being a waste. Did some value sweeping with vitis_hls and have found what seems to be a minimum, though I am unsure whether it is global or local.
Daily Update 11/20:
Didn't have much bandwidth to work on capstone today; I was only able to sweep a few values, which didn't amount to much more performance. Ending the week at an E2E latency of 116011 ms.
End-of-Week Report 11/14-11/20:
I am making incremental improvements, but they aren't coming fast enough, and we still run into the ideal-bandwidth ceiling. I'm not sure what can budge anymore, and we will likely miss one of our set benchmarks. This is not good.