Ziyi’s Status Report for 02/26/2022

This week was a bit scattered in terms of workflow for me. As of now, we are still trying to fully compile the project in Vitis. We had to deal with some troubleshooting around connectivity to the FPGA, but I would say that the build system is mostly stable (nothing that can’t be solved with the fixes we had from last week).

Other than this, most of my effort this week went into figuring out the best way to implement the nearest-neighbor lookup for the fluid simulation algorithm. We realized that this issue was of primary concern: how we choose to store these values determines the distribution of BRAM usage between the pipeline stages, which in turn determines the overall organization of pipelining and unrolling for our kernel. At the beginning of the week, we thought we would use a three-level hashmap to implement the lookup from an (X, Y, Z) address to the neighbor list.
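For context, the software version of that idea would look roughly like the sketch below; the type and function names are ours for illustration, not pulled from the actual code.

#include <unordered_map>
#include <vector>

// Hypothetical three-level lookup: X -> Y -> Z -> list of neighbor indices.
using NeighborList = std::vector<int>;
using ZMap   = std::unordered_map<int, NeighborList>;
using YMap   = std::unordered_map<int, ZMap>;
using XYZMap = std::unordered_map<int, YMap>;

// Returns the neighbor list for quantized cell (x, y, z), or nullptr if empty.
const NeighborList* find_neighbors(const XYZMap& m, int x, int y, int z) {
    auto xi = m.find(x);
    if (xi == m.end()) return nullptr;
    auto yi = xi->second.find(y);
    if (yi == xi->second.end()) return nullptr;
    auto zi = yi->second.find(z);
    if (zi == yi->second.end()) return nullptr;
    return &zi->second;
}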

The problem with this design is that, while it would theoretically be great for a sparse tree (which is what our hashmap would likely represent), it is not realistically realizable in hardware. Since you can’t “allocate” a BRAM at runtime, every pointer in each of the BRAMs must have hardware backing, so every pointer must point to a dedicated BRAM. A back-of-the-envelope calculation tells us that if we used integers to index, we would need 2^64 BRAMs to implement this design. Note: this would require roughly 8.5 × 10^16 Ultra96s to implement.
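For reference, the rough arithmetic behind that figure, assuming the Ultra96’s ZU3EG has on the order of 216 BRAM blocks (our assumption for this estimate):

2^64 required BRAMs ÷ 216 BRAMs per Ultra96 ≈ 1.8 × 10^19 ÷ 216 ≈ 8.5 × 10^16 Ultra96s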

Instead, we realized that we would need to impose some restrictions in order to feasibly implement the kernel. Our first idea was to figure out the realistic range of both the (X, Y, Z) points and the number of neighbors within a hash bucket. The first part matters because we can restrict the simulated fluid to a “simulation window,” where every point that falls outside the window is removed from the simulation task. This also has the effect of reducing the address space we need to support for (X, Y, Z); if the range is small enough, we can create a unique hash that fits within 32 bits just by bit-shifting the quantized (X, Y, Z) coordinates together. For the second part, we want to figure out a reasonable upper limit on the number of neighbors so that we can determine how wide the BRAM needs to be to store the full list of neighbor pointers. As a side note, with a simulation size of 512 particles, we can uniquely address each particle with a 9-bit index. We will still need to investigate how to pack this efficiently as a struct in the neighbor array in a way that works with the C++ for Vitis HLS.
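A minimal sketch of what that packing might look like in Vitis HLS C++ is below; the field widths, MAX_NEIGHBORS, and the names themselves are placeholder assumptions we still need to validate against the actual resource reports.

#include "ap_int.h"

// Assumed quantization: after windowing, each coordinate fits in 10 bits,
// so the three coordinates pack into a 30-bit (< 32-bit) cell key.
typedef ap_uint<32> cell_key_t;
typedef ap_uint<9>  particle_idx_t;   // 512 particles -> 9-bit index

#define MAX_NEIGHBORS 32              // placeholder upper bound, to be measured

// Fixed-capacity bucket that maps cleanly onto BRAM, in place of a
// dynamically sized neighbor list.
struct NeighborBucket {
    ap_uint<6>     count;                     // 0 .. MAX_NEIGHBORS
    particle_idx_t neighbors[MAX_NEIGHBORS];  // indices into the particle BRAM
};

// Build the unique hash by bit-shifting the quantized coordinates together.
cell_key_t make_cell_key(ap_uint<10> x, ap_uint<10> y, ap_uint<10> z) {
    return (cell_key_t(x) << 20) | (cell_key_t(y) << 10) | cell_key_t(z);
}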

As for next week, we will continue rewriting the C++ to allow for Vitis HLS compilation of the fluid simulation kernel.

Team’s Status Report 02/19/2022

If you’ve already looked at our individual reports, you’ll know that much of our week was spent tracking down the Y2K22 issue with Vitis, so we won’t go on too much about it here. Instead, we’ll focus on the positives. 😊

To start, we now have a local instance of Vitis running on Alice’s laptop! The point of this is that we will no longer be restricted to the build tools present on the ECE machines. Since the current Scotty3D code requires a newer version of CMake than the ECE machines provide, they were unable to compile the program. Things should be more straightforward on a local machine, where we can easily install whatever packages we need. In any case, it seems like our build-platform woes are hopefully coming to an end…

In terms of algorithmic improvements, we met a couple of times throughout the week to analyze the code in the fluids.cpp file itself, rather than just identifying the dependencies in the overall algorithm. Of primary importance for these meetings was determining how to represent each of the datatypes, what data we would be able to store on chip in the BRAMs, and how many copies of that data we could actually afford to keep. One structure of particular importance was the neighbor map, and how we would actually implement it on the FPGA. In the regular implementation, the neighbor map is simply an unordered hash map that uses the nearest quantized point to index into a list of neighboring particles. While this implementation is fine in software, if we want a performant hardware implementation, we’ll likely need to manually implement the hashmap as a BRAM array of pointers into another BRAM array of particles (see the sketch below). Aside from that, we also took a look at the looping structures in the code and assessed their ability to benefit from pipelining; we found that a lot of the steps in the code could indeed be pipelined. Steps 2 and 3 appear to have no interdependencies, so we can probably unroll and pipeline those two steps. We also took note of the instances of multithreading code that we would need to strip from the fluids.cpp file due to deprecation.
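To make the contrast concrete, here is a rough sketch of both sides; the names and sizes are placeholders of ours, not identifiers from fluids.cpp.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Software side (roughly what the current code does): hash the quantized
// cell and look up a dynamically sized list of neighboring particle indices.
std::unordered_map<uint64_t, std::vector<int>> neighbor_map;

// Hardware-leaning side (what we are proposing): a statically sized table,
// indexed by the hashed cell, where each row is a fixed-capacity bucket of
// "pointers" (indices) into the particle storage. Sizes to be set by profiling.
constexpr int NUM_CELLS     = 1 << 12;
constexpr int MAX_NEIGHBORS = 32;

struct Bucket {
    int      count;
    uint16_t neighbors[MAX_NEIGHBORS];   // indices into the particle array/BRAM
};

Bucket neighbor_table[NUM_CELLS];        // would map onto BRAM(s) in hardware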

For the next week, our primary goal is to get a build of the fluids.cpp kernel working in Vitis HLS, as this will give us the baseline results we need.

Ziyi’s Status Report for 02/19/2022

Much of this week was spent realizing that Vitis has been broken on the ECE machines since the start of the new year. Essentially, the very same Y2K22 bug that plagued Microsoft Exchange was affecting the build process for compiling to hardware in Vitis. Unfortunately, this took a bit of dredging through the Vitis forums to find, so we lost a couple of days of progress thinking it was an issue with our personal configuration rather than with the system itself. Thankfully, after we pointed out the issue to Professor James Hoe, he was able to quickly apply the patch that fixed the build tools. Finally, we were able to compile a project and generate a PetaLinux image to flash onto the FPGA. But then we needed to actually interface with the FPGA. Unfortunately, due to some odd configuration defaults, the FPGA’s internal WiFi was not automatically set up, so we needed to interface with it through a mouse and keyboard (we were also missing the mini DisplayPort cable, so we had to overnight one).

After finally gaining access to the FPGA interface, we were able to connect the Ultra96 to our local router. Now we can remote into the FPGA whenever we are connected to that router. We’ll still need to do some poking around to gain access to the FPGA while on campus, as our apartment network does not play nicely with port forwarding, but I’m sure we can figure something out. Either way, this is a good start toward a more streamlined development platform. We might decide to set up the board in 1307 so we can just VPN in, but it’s flexible.

Other than this, we did some specific code analysis on the fluids.cpp file, but we’ll talk more about that in the team report.

Next week, my main goal is to compile a baseline kernel of the step2 function (which consists of the main body of the fluid simulation computational kernel) using Vitis HLS. This will involve significant tinkering with the code and perhaps refactoring into more HLS-friendly datatypes. The best-case scenario is that everything just compiles, but that’s likely quite the pipe dream.

Ziyi’s Status Report for 02/05/2022

This first week of the project was spent analyzing the feasibility of accelerating the fluid simulation workload. My first step was to analyze the fluid simulation algorithm to assess where we stand to benefit from increased parallelization.

From an initial read, the “for all particles i do” loops introduce an obvious avenue of parallelization. For each request to the fluid simulator, we expect to process 512 particles at a time; we could attempt to fully unroll into 512 separate threads, but this could take up a large amount of hardware. Instead, we’d probably want a batched pipeline, where we dispatch some N-sized batch of particles into the pipeline at a time. The exact parameters of the pipeline (such as batch width and pipeline depth) will be handled at the low level by the HLS tool itself and at the high level by the relative importance of different sections of the code. For instance, we might expect the loop on lines 20-23 of the algorithm to occupy much more of the runtime than lines 1-4. Amdahl’s law then tells us that we should first focus on deriving speedup for lines 20-23. To first order, performance is directly correlated with the amount of hardware we assign to a task (since we are just instantiating more parallel units); so, concretely, we might want a 16-wide pipeline for lines 1-4 and perhaps a 64-wide pipeline for lines 20-23 (see the sketch below). Of course, we will arrive at more exact figures once we fully crack open the code, perform some mappings to hardware resources, and determine how much we have to work with.
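To illustrate the kind of structure we have in mind, here is a minimal Vitis HLS-style sketch. The loop bodies are stand-ins (just gravity and integration), and the 16/64 widths are the placeholder figures from above rather than measured values.

#define NUM_PARTICLES 512

void fluid_step(float pos[NUM_PARTICLES][3], float vel[NUM_PARTICLES][3],
                float dt) {
setup_batches:                                    // stands in for lines 1-4
    for (int b = 0; b < NUM_PARTICLES; b += 16) {
#pragma HLS PIPELINE II=1
    setup_lanes:
        for (int l = 0; l < 16; ++l) {
#pragma HLS UNROLL
            vel[b + l][1] -= 9.8f * dt;           // placeholder per-particle work
        }
    }

main_batches:                                     // stands in for lines 20-23
    for (int b = 0; b < NUM_PARTICLES; b += 64) {
#pragma HLS PIPELINE II=1
    main_lanes:
        for (int l = 0; l < 64; ++l) {
#pragma HLS UNROLL
            for (int d = 0; d < 3; ++d)
                pos[b + l][d] += vel[b + l][d] * dt;  // placeholder update
        }
    }
}

In practice, the pos and vel arrays would also need ARRAY_PARTITION pragmas (or an explicit BRAM layout) to actually feed that many lanes per cycle, which ties back into the BRAM budgeting question.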

In terms of progress, I would say that we are certainly a bit behind due to the pivot from the UNISURF project to this one. However, a lot of the investigation we did for UNISURF regarding the Ultra96’s hardware resources and the organization of data between the CPU and the FPGA fabric maps nicely onto this new project.

In terms of deliverables for next week, I would personally like to have the entire Vivado/Vitis project set up. This means first ensuring that the base program works nicely on the Ultra96’s ARM core, and then designating the different parts of the Vivado project, such as the specific compute kernel. Since the fluid simulation library is only a portion of the Scotty3D program, I’ll have to investigate whether there is anything special I need to set up to ensure the different compute tasks run correctly on the FPGA. As for whether Vitis can port in the code, since everything is written locally in the Scotty3D library (no reliance on external libraries), I don’t think we’ll run into any trouble on that front. Nevertheless, I doubt getting this up and running will be as simple as tossing the code into Vitis and hitting build. I will be reviewing the 643 documentation to see if I can set up a more streamlined compilation and testing platform.
