Team’s Status Report – 04/30/2022

With most of the baseline operations implemented and verified, most of this last week was spent finalizing integration, adding some extra touches to the algorithm, and polishing the different scenes we want to demo.

On the hardware side, we finalized how we would handle the interfacing between the host CPU on the FPGA and the FPGA fabric itself. After some testing, we realized that our port-widening scheme resulted in some values being corrupted in transit. We believe this is because our datatype is only 24 bits wide, while the ports have to be multiples of 32 bits wide; the resulting offset might be breaking our pointer-casting data-packing scheme. However, this is not that big of an issue, as we can compensate with a longer burst transaction length. Furthermore, testing the different kernels on the FPGA yielded similar timings as well: it seems that as long as the transaction was bursted, the specifics did not really matter. (Amdahl’s Law suggests that we should turn our attention elsewhere :^) ). Other than that, we decided to unroll a couple more loops, and we have mostly locked in our final code.
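For the curious, here is a minimal sketch of one way to sidestep the suspected misalignment (the type names are illustrative, not our exact code): three 24-bit values occupy 72 bits, which does not divide evenly into 32-bit port words, so pad bits end up somewhere inside the packed stream. Padding each component to a full 32-bit word up front keeps everything word-aligned, at the cost of 8 wasted bits per value.

```cpp
#include <ap_int.h>

typedef ap_int<24> particle_pos_t;  // our 24-bit position primitive
typedef ap_int<32> padded_pos_t;    // word-aligned variant

// Sign-extend each 24-bit component to a full 32-bit word so a port
// that is a multiple of 32 bits wide never splits a component.
static inline padded_pos_t pad_component(particle_pos_t v) {
    return (padded_pos_t)v;
}
```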

On the software side, we focused on adding support for displaying new scenes. There was some exploration into supporting other types of primitives, such as quads and even meshes, but with only a week or two left, we decided to just add support for more spheres (the alternative would have been looking into compiling OpenGL on the Ultra96 and/or a major refactor to support a general Shapes class, including changing a lot of std::move operations). We also figured out a new way of visualizing the output so that we can compare the fluid simulation traces directly as point clouds. We still need to time the data transfer between the Ultra96 and the host computer, but that should be trivial, since it is just an scp operation wrapped in the shell’s built-in time utility.

Overall, we’re pretty excited that our project actually works and that we have cool things to show off. We just need to nail down a lot of specific timing numbers to address our requirements, but we’re confident that we can get that done in the next couple of days, in time for the presentations.

Ziyi’s Status Report – 04/30/2022

During this final week, we finished up the integration of the entire project. We merged Jeremy’s changes to the loop logic with my logic for the burst transfers and verified that the results made sense. After this, we investigated unrolling and pipelining a few more loops and managed to squeeze out a bit more performance. As shared in the presentation, here is a summary of the effects of the different optimizations.


As a note, these results are only estimates of the kernel operation itself; they do not reflect the combined cost of the kernel plus its associated data transfer.


Other than integration, the rest of this week was pretty much spent on preparing presentation materials, including the final presentation, the poster, and the final video.

Team’s Status Report – 04/23/2022

This week, we got a lot of integration work done.


On the software side, Alice finished implementing the algorithm in a hardware-friendly, compilable form. After verifying the results, we determined that it would be sufficient in terms of rendering a result with an appreciable simulation quality. We did have a small hiccup where the synthesis resulted in using 1500+ BRAMs (far more than the 460 we have on the board); however, after rebalancing some constants, we were able to fit everything in the device footprint.
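For flavor, here is a hypothetical sketch of the kind of rebalancing involved (the names and values are illustrative, not our actual constants): HLS sizes on-chip memories from the worst-case array bounds, so shrinking those bounds directly shrinks the BRAM count.

```cpp
#include <ap_int.h>

typedef ap_int<24> particle_pos_t;

// Illustrative bounds only: since BRAM usage scales with the product of
// the dimensions, halving both bounds quarters this buffer's footprint.
#define MAX_PARTICLES 1024
#define MAX_NEIGHBORS 32

static particle_pos_t neighbor_buf[MAX_PARTICLES][MAX_NEIGHBORS];
```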


On the hardware side, Ziyi finished accelerating the data transfer interface between the FPGA fabric and the host CPU and began investigating potential improvements for step5 of the kernel. Meanwhile, Jeremy began investigating optimizations for unrolling and pipelining step2 and step3 of the kernel, as well as inlining different function calls in order to reduce the latency of certain instructions.

Initial data points

As for our goals for next week, we want to finish accelerating and benchmarking our different improvements to the kernel. Once we have some appreciable results, we will begin assembling all of our presentation materials.


Ziyi’s Status Report – 04/23/22

This was another good week for progress in hardware-land. The first major contribution of the week was expanding the AXI port so that we could transfer a whole Vec3 per transaction, rather than just a single position primitive (our 24-bit particle_pos_t datatype). The simple effect of this optimization is that if we can move more data per transaction, we need fewer transactions to move all the data, and thus spend much less time. In the simple case of grouping the three primitives together, this means we have three times fewer transactions overall, which roughly corresponds to a three-times speedup. If we wanted to send multiple Vec3s per transaction, we could save even more time; however, this could also run into the AXI protocol’s 4 kB boundary limit on a single burst transfer.

In order to implement the port widening, we needed to create an arbitrarily-sized datatype that is three times the width of a single primitive. Then, we cast our writes to the output port to this 3-wide packed datatype. This seemed to make Vitis happy enough to pack the data together.
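Here is a minimal sketch of the resulting writeback loop (with illustrative names and bounds; it also packs explicitly with bit slicing, which is equivalent to the pointer cast described above):

```cpp
#include <ap_int.h>

#define MAX_PARTICLES 1024            // illustrative bound

typedef ap_int<24>  particle_pos_t;   // one position component
typedef ap_uint<72> packed_vec3_t;    // three 24-bit components per word

// Write n particle positions through a 72-bit-wide port so that each
// transaction carries a full Vec3 instead of a single component.
void write_back(packed_vec3_t *out,
                const particle_pos_t x[MAX_PARTICLES],
                const particle_pos_t y[MAX_PARTICLES],
                const particle_pos_t z[MAX_PARTICLES],
                int n) {
#pragma HLS INTERFACE m_axi port=out bundle=gmem
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        packed_vec3_t word;
        word.range(23, 0)  = x[i];
        word.range(47, 24) = y[i];
        word.range(71, 48) = z[i];
        out[i] = word;                // one wide write per particle
    }
}
```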

Related to port widening, the next major contribution was implementing pipelined burst AXI4 transfers. The point of a pipelined burst transfer is that you amortize away the setup cost of each isolated transfer and gain a significant throughput boost from keeping the bus busy on every cycle.

However, it should be noted that in order to widen the ports, we needed to preprocess the particle position array by transferring every data value into a contiguous BRAM. This constitutes a pretty clear design tradeoff for our project: we expend more resources (and time!) in the effort of saving even more time overall.
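The staging pattern itself is simple; here is a sketch under the same illustrative names as above (the sequential accesses from a pipelined loop are what let Vitis HLS infer a burst):

```cpp
#include <ap_int.h>

#define MAX_PARTICLES 1024           // illustrative bound

typedef ap_uint<72> packed_vec3_t;

// Stage the position data into a contiguous local BRAM. The sequential,
// pipelined access pattern is what Vitis HLS infers as one AXI burst.
void load_positions(const packed_vec3_t *in,
                    packed_vec3_t local[MAX_PARTICLES],
                    int n) {
#pragma HLS INTERFACE m_axi port=in bundle=gmem
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        local[i] = in[i];
    }
}
```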


As for next week, my tasks are to accelerate the step5 loop and to finish verifying the data transfer interface.

Ziyi’s Status Report – 04/16/22

This week was another week of great progress! The first thing I did this week was resolve the interfacing issues with the FPGA. After debugging the segfault with a dummy kernel, I was able to get a successful transmission from the FPGA fabric to the host CPU. From here, I just swapped the dummy kernel out for the most up-to-date kernel (more on this later), uploaded the full project onto the board, and presto! Results!

Results from the board (SCP’d from board)
Build Resources and Performance

The next thing that I helped with was synthesizing the most up-to-date kernel (as mentioned above). This kernel was a major milestone in that it was the first implementation to include everything (including the nearest-neighbor algorithms). While Alice and Jeremy mostly handled the algorithmic part of the implementation, I handled some of the Vitis build errors. One example was a false dependency between different iterations of the writeback loop. After analyzing the loop body, I was able to fix this by introducing the dependence pragma, which allowed Vitis HLS to optimize the loop correctly.
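For reference, the fix looks something like this (a hypothetical reconstruction of the loop, using the classic spelling of the pragma; newer tool versions also accept a type=inter … dependent=false form):

```cpp
#include <ap_int.h>

typedef ap_uint<72> packed_vec3_t;

// Each writeback iteration touches a distinct address, so we assert that
// there is no inter-iteration dependence on `out`, letting Vitis HLS
// pipeline the loop instead of conservatively serializing the writes.
void writeback(packed_vec3_t *out, const packed_vec3_t buf[1024], int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
#pragma HLS DEPENDENCE variable=out inter false
        out[i] = buf[i];
    }
}
```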

As an aside, resolving the different HLS build warnings is incredibly important. The traditional “wisdom” among programmers is that warnings are more or less ignorable. The issue with HLS is that, in order to accommodate whatever condition a warning flags, Vitis HLS will expend a lot of extra, unnecessary hardware resources, potentially an order of magnitude more than what the board supports!

My primary task next week is to investigate and document the different optimizations and pragmas we can use to accelerate the kernel. Another task is to investigate refactoring the code so that creating the particles also happens off-chip. This would free up some extra space for further unrolling and optimizations.

Ziyi’s Status Report – 04/10/22

This week we got a ton done on the hardware side of things. First of all, after finally squashing the last few compilation errors, we managed to get a trimmed-down version of the kernel (everything sans the nearest-neighbor loops) built using Vitis HLS synthesis. Looking at the synthesis analysis, we were able to see that the correct loops were getting synthesized, which meant that our code was being inferred correctly.

After passing synthesis, we were able to export the RTL implementation (the .xo file) and import it into regular Vitis itself without any additional trouble. Then, after implementing some basic interfacing code between the host program and the kernel, we were able to build a binary to run on the FPGA. To get this binary onto the board, we simply scp’d the files over.

Running the program resulted in a segmentation fault, so we still have a bit to go. My goal for next week is to debug the interface and work on some optimizations with Alice and Jeremy.

Ziyi’s Status Report – 04/02/22

This week we got a lot of work done in terms of synthesizing the kernel. At the start of the week, we managed to get a synthesis running using floats as the main datatype. The problem with this is that floats are so computationally intensive that it was pretty much impossible to get the critical path to converge on our target, and so we couldn’t meet timing. Nevertheless, this was a significant step in the right direction.

So my next task was to begin the migration from floats to our custom fixed-point datatype: any calculation we perform on the particle positions or velocities gets replaced with the fixed-point equivalent. For now, we decided to use a fixed-point representation with 6 bits for the integral part of the number and 6 bits for the fractional part.
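With the Vitis HLS arbitrary-precision types, the representation itself is a one-liner (sketched here with a hypothetical type name; ap_fixed<W, I> gives I integer bits, sign included, out of W total):

```cpp
#include <ap_fixed.h>

// 12 bits total: 6 for the integral part (sign included), 6 fractional.
typedef ap_fixed<12, 6> sim_fixed_t;
```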

Unfortunately, this proved to be quite an involved task, as we also needed to make significant changes to all of the math libraries so that they use the new datatype. So a majority of this week was spent incrementally compiling and changing the math libraries to use the arbitrary-precision datatype instead of floats. This was complicated by some odd C++ issues that led to the datatype silently aliasing to int. Another specific issue we had to investigate was how to handle certain math functions (such as square root); thankfully, this was easily addressed using the equivalents in the HLS math library. Other than that, we seem to have mostly figured out those issues, and now we have a compile running!
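For instance, a call to std::sqrt on a squared distance becomes something like the following (a hypothetical sketch; it assumes the hls::sqrt overload accepts the fixed-point type, which is what the HLS math library equivalents are for):

```cpp
#include <ap_fixed.h>
#include <hls_math.h>

typedef ap_fixed<12, 6> sim_fixed_t;

// Distance between two 1-D positions, with hls::sqrt standing in for
// std::sqrt from <cmath>.
sim_fixed_t distance(sim_fixed_t a, sim_fixed_t b) {
    sim_fixed_t d = a - b;
    return hls::sqrt(d * d);
}
```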

Build almost there! Still some kinks in our kernel to figure out

Our next step is to finalize the changes to get a complete synthesis of the kernel, then upload the project to normal Vitis so we can generate a binary file and put it on the board in time for the in-class demo. All in all, I am much more optimistic about the effectiveness of our project after this week.

Team’s Status Report for 03/26/2022

This week, we made progress on two fronts. On the hardware front, Jeremy and Ziyi worked on getting the Vitis HLS build to work on the stripped-down version of Scotty3D; on the software front, Alice worked on having Scotty3D read particle positions from a preloaded file for stepping through a simulation.

In terms of hardware, we managed to get the C simulation build option to work and run a small testbench. While this was a good start toward getting the binary built, we realized that we still had a long way to go before we could get C synthesis to work. The biggest hurdle is that C synthesis seems to target an older version of C++, with many more restrictions and caveats on what can actually be compiled into hardware. Many features that the C simulator readily accepted, the C synthesis outright refused to cooperate with. Still, despite these setbacks, we believe that we are quite close to a breakthrough on the C synthesis. Once we finally get a working build, we will collect the area estimates and also get some timing reports by putting the binary on the Ultra96 development board and running it. It is also worth noting that, in our endeavors to get a working build, we also refactored the par_for, seq_for, and for_n (for neighbors) lambda functions into separate, discrete for loops (see the sketch below). Though this reduces code reuse, it allows us to reconfigure each instance of these loops individually, so that we can deliberately assign hardware resources instead of accelerating all of the loops with the same parameters (which would otherwise be hard-coded in the lambda functions themselves).
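A minimal before-and-after sketch of that refactor (the types and loop body are invented for illustration; the real loops live in the fluid simulation kernel):

```cpp
struct Vec3 { float x, y, z; };           // simplified stand-in types
struct Particle { Vec3 pos, vel; };

// Before: par_for(particles, n, [&](Particle &p) { /* body */ });
// with one unroll factor hard-coded inside the helper.

// After: a discrete loop whose pragmas are tuned at this call site only.
void apply_gravity(Particle particles[1024], int n, float dt) {
    for (int i = 0; i < n; ++i) {
#pragma HLS UNROLL factor=4               // chosen per loop, not globally
        particles[i].vel.y -= 9.8f * dt;
    }
}
```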

In terms of the Scotty3D interface, we were successful in getting Scotty3D to play back a precomputed set of points. Of course, this resulted in a lightning-fast visual simulation within the Scotty3D engine, which provides a compelling example of the benefit of hardware-accelerated precomputation. Though this is not as interesting as a real-time demo, it is a significant step in the right direction. The next major goal on the interface side is to enable UART communication between the FPGA and a host computer to transfer live simulation data, an essential step toward the real-time simulation. Of course, we will also have to investigate the UART bandwidth to see if we can keep up with the frames-per-second demands of the simulator.
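As a rough, purely illustrative sanity check on that last point (the particle count and frame rate are assumptions, not our measured targets): a particle position of three 32-bit floats is 12 bytes, so streaming 1,000 particles at 30 frames per second needs roughly 360 kB/s, while a typical 115,200-baud UART delivers only about 11.5 kB/s, a shortfall of more than an order of magnitude. This is exactly the kind of gap the bandwidth investigation will need to quantify.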

Ziyi’s Status Report for 03/26/2022

At the start of this week, Jeremy and I managed to get the C simulation build to pass in Vitis HLS. While we were quite happy with this initial success, we quickly realized that we still needed a lot more work to get the C synthesis working. One thing that majorly complicated this task was that the C synthesis build script would fail pre-synthesis without actually printing any errors. However, after poking around on the Vitis forums and ticking some specific compilation boxes, we could finally read the error logs. As it turns out, most of our problems were just version and build conflicts between different C++ features. For instance, while we were able to get away with the auto type for our lambda functions in simulation, the synthesis compiler got a bit angry at us. Right now, we are still trying to resolve these issues, and hopefully we can synthesize some version of the kernel soon.
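A tiny illustration of the kind of construct involved (the names are invented for the example):

```cpp
// Accepted by C simulation, but rejected by our C synthesis flow:
// auto scale = [&](float v) { return v * factor; };

// Synthesizable alternative: a plain, statically-typed function.
static float scale(float v, float factor) {
    return v * factor;
}
```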

In the process of fixing these compilation errors, we also got started on our optimization work. One major task this week was getting rid of all instances of par_for, seq_for, and for_n, which, respectively, iterate over the particles in parallel, over the particles sequentially, and over the neighbors of each particle. This is a major step, as it allows us to manually control the unroll parameters of the kernel on a per-loop basis, rather than applying a single pragma en masse.

Other than slowly grinding through the myriad waves of compilation errors, there is not much to schedule or report. Still, once we get a working build of the binary, we’ll throw it onto the FPGA and get some timing values. We’ll also need to survey the resource utilization of the base implementation. We envision that this will be done by next week.

Ziyi’s Status Report for 03/19/2022

Most of my effort this week was spent on getting a lightened version of the fluid simulation kernel to build in Vitis HLS. This comes as a shift in our priorities to better target an MVP for our project. Specifically, we are interested in implementing a “headless” version of the simulation kernel, where the only thing we specifically need to run is the step2 function of the fluid simulation kernel. We decided that the rest of the rendering tasks are not as pressing and should thus be handled at a later point.

Much of the work this week was setting up the Vitis HLS project on the ECE number machines. We were running into some significant build issues on Alice’s computer, and the documentation online was frustratingly sparse. As such, in order to make some breakthroughs, we decided to transition to the existing, working setup that 643 uses. Setting up the project took a bit of work, but we were eventually able to resolve a lot of the build issues and get the C simulation task to build. A large part of this task was refactoring parts of the code to remove C++17 artefacts and using different versions of the libraries that Scotty3D depends on.

Right now, we are still trying to figure out how to get the C synthesis to fully build. It is a bit difficult at the moment, as the synthesis log does not actually highlight the error. However, I think it might have something to do with the tool inferring multiple TOP functions.

Obviously, the goal for next week is to get a working version of C synthesis running.