Jeremy’s Status Report – 04/30/2022

This week I primarily worked on integration and the poster for our final presentation. I also worked a bit with Ziyi on debugging the AXI ports. We integrated the different optimizations and also worked on gathering more performance data for them.

I am happy with our progress in this project.

Team’s Status Report – 04/30/2022

With most of the baseline operations implemented and verified, most of the time this last week was spent on finalizing integration and adding some extra touches to the algorithm and the different scenes we want to demo.

On the hardware side, we finalized how we would handle the interfacing between the host CPU on the board and the FPGA fabric itself. After some testing, we realized that our port-widening scheme resulted in some faulty values being translated. We believe this is because our datatype is only 24 bits, while the ports have to be a multiple of 32 bits wide; we think the resulting offset might be messing up our pointer-casting data-packing scheme. However, this is not that big of an issue, as we can just use a longer burst transaction length. Furthermore, testing the different kernels on the FPGA yielded similar timings as well: it seems that as long as the transaction was bursted, the specifics did not really matter. (Amdahl’s Law suggests that we should turn our attention elsewhere :^) ). Other than that, we decided to unroll a couple more things, and we mostly locked in our final code.
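To make the fallback concrete, here is a hedged sketch (not our actual code: the fixed-point format of particle_pos_t and every name here are assumptions) where each 24-bit value rides in its own 32-bit word, so alignment can never drift and the only cost is the longer burst:

```cpp
#include <ap_fixed.h>
#include <ap_int.h>

// Only the 24-bit width of particle_pos_t is known; this format is assumed.
typedef ap_fixed<24, 8> particle_pos_t;

// Fallback scheme: one position per 32-bit word, upper 8 bits are padding.
void write_positions(ap_uint<32> *out, particle_pos_t *pos, int n) {
WRITE_LOOP:
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        ap_uint<32> word = 0;
        word.range(23, 0) = pos[i].range(23, 0); // raw bits of the position
        out[i] = word;                           // 32-bit aligned, no drift
    }
}
```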

On the software side, we focused on adding support for displaying new scenes. There was some exploration into supporting other types of primitives, such as quads and even meshes, but with a week or two left we decided to just add support for more spheres (the alternative would have been looking into compiling OpenGL on the Ultra96 and/or a major refactor to support a general Shapes class, which would mean changing a lot of std::move operations). We also figured out a new way of visualizing results so that we can compare the fluid simulation traces directly as point clouds. We still need to time the data transfer between the Ultra96 and the host computer, but that should be trivial since it’s just an scp operation timed with a built-in terminal tool.

Overall, we’re pretty excited that our project actually works and we have cool things to show off now. We just need to nail down a lot of specific timing numbers to address our requirements, but we’re confident that we can get that done in the next couple of days, in time for the presentations.

Ziyi’s Status Report – 4/30/2022

During this final week, we finished up the integration of the entire project. We merged Jeremy’s changes to the loop logic with my logic for the burst transfers and verified that the results made sense. After this, we investigated unrolling and pipelining a few more loops and managed to squeeze out a bit more performance. As shared in the presentation, here is a summary of some of the effects of the different optimizations.


As a note, these results are only estimates of the kernel operation itself; they do not reflect the combined cost of the kernel and its associated data transfer.


Other than integration, the rest of this week was pretty much spent on preparing presentation materials, including the final presentation, the poster, and the final video.

Jeremy’s Status Report for 04/23/2022

This week I worked primarily on optimizing the fluid simulation algorithm on the fabric. This involved iterating on the algorithm, exploring different ways to restructure the hardware, and taking advantage of the HLS pragmas to make the algorithm run faster.

[Figure: Initial data points]

This also required figuring out which optimizations would break the implementation, which made for a fairly iterative process. I also worked on developing the slides for our final presentation this week.

I think that we are on track in our schedule. Next week I will continue work on optimizing, and also work on our poster and video.

Team’s Status Report – 04/23/2022

This week, we got a lot of integration work done.


On the software side, Alice finished implementing the algorithm in a form that supports hardware-friendly compilation. After verifying the results, we determined that it would be sufficient for rendering a product with appreciable simulation quality. We did have a small hiccup where synthesis resulted in using 1500+ BRAMs (far more than the 460 we have on the board); however, after rebalancing some constants, we were able to fit everything within the device footprint.
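As a toy illustration of what rebalancing those constants means (the numbers and array names below are made up, not our real buffers), BRAM usage is driven almost entirely by compile-time array bounds, so shrinking a bound directly shrinks the footprint:

```cpp
// Toy example: on-chip buffers are what consume BRAM, so bounds like these
// set the footprint. Halving MAX_PARTICLES halves the cost of every array
// sized by it.
const int MAX_PARTICLES = 4096;
const int MAX_NEIGHBORS = 64;

int   neighbors[MAX_PARTICLES][MAX_NEIGHBORS]; // 4096 x 64 ints ~ 1 MiB on-chip
float positions[MAX_PARTICLES][3];
```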


On the hardware side, Ziyi finished accelerating the data transfer interface between the FPGA fabric and the host CPU and began investigating potential improvements for step5 of the kernel. Meanwhile, Jeremy began investigating optimizations for unrolling and pipelining step2 and step3 of the kernel, as well as inlining different function calls in order to reduce the latency of certain instructions.
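A minimal sketch of the inlining idea (the helper name and body here are placeholders, assuming a 24-bit particle_pos_t as in the earlier sketch):

```cpp
#include <ap_fixed.h>
typedef ap_fixed<24, 8> particle_pos_t; // format assumed

particle_pos_t kernel_weight(particle_pos_t r) {
#pragma HLS INLINE  // fold the body into each caller, removing call overhead
    return r * r;   // stand-in for the real smoothing-kernel math
}
```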

[Figure: Initial data points]

As for our goals for next week, we want to finish accelerating and benchmarking our different improvements to the kernel. Once we have some appreciable results, we will begin assembling all of our presentation materials.


Ziyi’s Status Report – 04/23/22

This was another week of good progress in hardware-land. The first major contribution of the week was expanding the AXI port so that we could transfer a whole Vec3 per transaction, rather than just a single position primitive (our 24-bit particle_pos_t datatype). The simple effect of this optimization is that if we can move more data per transaction, we need fewer transactions to move all the data, and thus spend much less time. In the simple case of grouping the three primitives together, we end up with three times fewer transactions overall, which roughly corresponds to a three-times speedup. If we wanted to send multiple Vec3s per transaction, we could save even more time; however, this could also lead to us hitting the upper bound of a 4kB page per burst transfer.

In order to implement the port widening, we needed to create an arbitrarily-sized datatype that is three times the width of a single primitive. Then, we cast our writes to the output port to this 3-wide packed datatype. This seemed to make Vitis happy enough to pack the data together.
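Something like the following sketch captures the idea (the fixed-point format of particle_pos_t and the names are assumptions):

```cpp
#include <ap_fixed.h>
#include <ap_int.h>

typedef ap_fixed<24, 8> particle_pos_t; // format assumed
typedef ap_uint<72>     packed_vec3_t;  // 3 x 24 bits: one whole Vec3 per write

packed_vec3_t pack_vec3(particle_pos_t x, particle_pos_t y, particle_pos_t z) {
#pragma HLS INLINE
    packed_vec3_t p;
    p.range(23, 0)  = x.range(23, 0);   // copy the raw bits of each component
    p.range(47, 24) = y.range(23, 0);
    p.range(71, 48) = z.range(23, 0);
    return p;
}
```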

Related to port widening, the next major contribution was implementing pipelined burst AXI4 transfers. The point of pipelining the AXI transfers is that you amortize away the setup cost of each isolated transfer and gain a significant throughput boost.
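A minimal sketch of such a transfer loop, under the same assumptions as above (the interface details and buffer bound are placeholders):

```cpp
#include <ap_int.h>

typedef ap_uint<72> packed_vec3_t; // as in the previous sketch
const int MAX_PARTICLES = 1024;    // placeholder bound

void load_positions(const packed_vec3_t *in, packed_vec3_t buf[MAX_PARTICLES], int n) {
#pragma HLS INTERFACE m_axi port=in bundle=gmem max_read_burst_length=256
READ_BURST:
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        buf[i] = in[i]; // consecutive pipelined reads let HLS infer one burst
    }
}
```

Because the accesses are consecutive and the loop is pipelined at II=1, Vitis HLS can infer one long burst instead of n isolated transactions.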

However, it should be noted that in order to widen the ports, we needed to preprocess the particle position array by transferring every data value into a contiguous BRAM. This constitutes a pretty clear design tradeoff for our project: we expend more resources (and time!) up front in the effort of saving even more time overall.


As for next week, my next task is to accelerate the step5 loop and finish verifying the data transfer interface.

Ziyi’s Status Report – 04/16/22

This week was another week of great progress! The first thing I did was resolve the interfacing issues with the FPGA. After debugging the segfault with a dummy kernel, I was able to get a successful transmission from the FPGA fabric to the host CPU. From there, I just swapped the dummy kernel out for the most up-to-date kernel (more on this later), uploaded the full project onto the board, and presto! Results!

[Figure: Results from the board (SCP’d from board)]
[Figure: Build resources and performance]

The next thing I helped with was synthesizing the most up-to-date kernel (as mentioned above). This kernel was a major milestone in that it was the first implementation to include everything (including the nearest-neighbor algorithms). While Alice and Jeremy mostly handled the algorithmic part of the implementation, I handled some of the Vitis build errors. One example was a dependency flagged between different iterations of the writeback loop. After analyzing the loop body, I was able to fix this by introducing the dependence pragma, which allowed Vitis HLS to optimize the loop correctly.
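An illustrative reconstruction of that fix (the loop and array names here are invented, not the actual kernel code):

```cpp
// Writeback loop with a false inter-iteration dependence declared away.
void writeback(float *out, const int *index, const float *result, int n) {
WRITEBACK:
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        // Promise HLS that no two iterations write the same element, so it
        // can pipeline the loop without serializing the writes.
#pragma HLS DEPENDENCE variable=out inter false
        out[index[i]] = result[i];
    }
}
```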

As an aside, resolving the different HLS build warnings is incredibly important. As programmers, the traditional “wisdom” is that warnings are more or less ignorable. The issue with HLS is that in order to adapt to whatever a warning describes, Vitis HLS may expend a lot of extra, unnecessary hardware resources, potentially an order of magnitude more than what the board supports!

My primary task next week is to investigate and document different optimizations and pragmas we can use to accelerate the performance of the kernel. Another task is to investigate refactoring the code so that creating the particles also happens off-chip. This would free up some extra space for further unrolling and optimizations.

Alice’s Status Report for 04/16/2022

Last week my goal was to work on some optimizations for the build. However, I had to shift my attention to what Jeremy was working on with the nearest-neighbor search. Since hardware needs known memory sizes for everything, it is challenging to discretize the 3D world space and partition the fluid particles accordingly, so we are currently trying to bound the possible fluid simulation positions. Right now we still have a bit of work to do in reworking how we read particles in (in case they get dropped) and making sure the output is reasonable. My goal for next week is to finish getting the algorithm to work correctly in hardware and then work on optimizations.
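A hypothetical sketch of the bounding idea (every constant and name below is a placeholder): clamping each coordinate into a fixed-size voxel grid gives the hardware a compile-time memory size:

```cpp
#include <ap_fixed.h>
typedef ap_fixed<24, 8> particle_pos_t; // 24-bit position; format assumed

const int GRID_DIM = 16;                // voxels per axis (placeholder)
const particle_pos_t SCENE_MIN  = -8;   // assumed lower bound of the scene
const particle_pos_t VOXEL_SIZE = 1;

int voxel_index(particle_pos_t x) {
#pragma HLS INLINE
    int i = (int)((x - SCENE_MIN) / VOXEL_SIZE);
    if (i < 0)         i = 0;             // clamp strays instead of dropping them
    if (i >= GRID_DIM) i = GRID_DIM - 1;
    return i;
}
```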

I definitely feel slightly behind since our team was not able to meet in person at all this week and communication has been slow. However, it seems like Jeremy will be able to meet in person again soon and we’ll be able to get back on track.

Jeremy’s Status Report for 04/16/2022

Unfortunately, I was sick with COVID this week and wasn’t able to make as much progress as I originally wanted. For the majority of the week I tried to get rest and recover quickly so that I’ll be productive next week. Later in the week, I was able to put in some work on the correctness of our algorithm, as we were facing issues where particles did not behave as we expected them to. The issues turned out to be in how we calculate the bounds of the scene and the voxels that particles are in, but they are mostly solved now.

I feel slightly behind since I did not expect to get sick this week, but I think that we are still in a good spot and will be able to produce good results from our project. Next week I aim to help with collecting data and continuing to optimize our implementation.

Alice’s Status Report – 04/10/22

Last week we were rushing to finish the build for the interim demo, and I was unable to complete the evaluation script. This week I was able to do so. I also read up a lot on unrolling, pipelining, and other optimizations outlined here: https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/HLS-Pragmas
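As a toy example of how the unrolling and pipelining pragmas compose (this is purely illustrative and not our kernel code; the 24-bit particle_pos_t format is assumed as in the other sketches):

```cpp
#include <ap_fixed.h>
typedef ap_fixed<24, 8> particle_pos_t; // format assumed

void scale(particle_pos_t out[256], particle_pos_t in[256]) {
#pragma HLS ARRAY_PARTITION variable=in  cyclic factor=4
#pragma HLS ARRAY_PARTITION variable=out cyclic factor=4
SCALE:
    for (int i = 0; i < 256; ++i) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=4 // with partitioned arrays, four elements per cycle
        out[i] = in[i] * 2;
    }
}
```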

This upcoming week I’ll be working on adding said pragmas for optimizations. I’ll also be helping to verify that the Vitis HLS project provides reasonable fluid simulation output once Ziyi finishes the interface work to get an output text file from the FPGA.

We had a really great push at the beginning of this week. I’m optimistic that we can achieve our goals.