Yuhe’s status report

09/23/2023

To start off the project, all three of us on the team worked on the proposal presentation to think through our user requirements, technical challenges, and solutions. I analyzed the data transmission requirements and the solution platform we proposed to use. I also looked into the Vitis development flow to identify testing and statistics collection methods.

During the past week, I worked on running a sample project on the Ultra96 FPGA. This process was delayed because my USB-C adapter was broken, so I could not program the disk image that configures the FPGA. After I acquired working tools, I worked through software emulation, hardware emulation, and a hardware build from simple matrix-add C++ code. I loaded the disk image onto the FPGA, then connected my laptop to the board over Wi-Fi and SSH. I was able to run the computation on the board, and the result came out correct. The next step is to develop a sequential N-body algorithm to analyze its computation and data movement requirements.
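For reference, a minimal matrix-add kernel of the kind used for this bring-up test might look like the sketch below. This is an illustrative example only, not our exact code; the function name, matrix size, and interface pragmas are assumptions.

// Illustrative Vitis HLS kernel: element-wise add of two matrices read from DRAM.
// DIM and the argument names are placeholders for this sketch.
#define DIM 64
extern "C" void mat_add(const int *a, const int *b, int *out) {
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem
    for (int i = 0; i < DIM * DIM; i++) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i]; // each element is independent, so the loop pipelines cleanly
    }
}

The same kernel source runs through software emulation, hardware emulation, and the hardware build, which is what made it a convenient bring-up test.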

 

09/30/2023

Summary of progress: 

This week, I investigated the memory bandwidth and latency of the DRAM AXI interface on the Ultra96 FPGA board to analyze the data movement requirements for the N-body simulation. I was able to measure the maximum data throughput and the latency of DRAM accesses. My progress is on schedule so far.

Results measured from reading a 4096 x 4096 matrix from DRAM:

AXI latency: 537 ns

AXI bandwidth: 436 MB/s

To analyze the AXI interface, I developed two programmable kernels on the Ultra96 FPGA to assess the throughput and latency of sequential and parallel computations. The kernel performing sequential data access contains a dependency chain, where each read from DRAM depends on the result of the previous read. This is implemented by using the previously read data as the address of the next read request. I also ensured that the addresses of consecutive reads are far apart, so that each read goes to DRAM instead of hitting cached data. By building this data dependency chain, I was able to perform a large number of reads and then calculate the amortized latency.

The kernel performing parallel data access executes vector additions on the rows of a 4096 x 4096 matrix. The vector addition is needed so that the data read from DRAM is consumed in parallel. I instantiated a buffer inside the kernel to hold the accumulated vector result, so that future read requests do not have to wait for the current request. This way, I was able to extract a throughput close to the maximum bandwidth provided by the AXI interface. With the memory throughput measured, we can plan out the buffer structure for storing the N-body simulation data and the pipeline stages for executing multiple steps of the calculation concurrently.
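As an illustration of the dependency-chain idea, a minimal latency-measurement kernel could be sketched as below. This is not our exact kernel; NUM_READS, the argument names, and the index layout are assumptions, and the host is assumed to pre-fill the memory so that consecutive indices land far apart.

// Illustrative pointer-chasing kernel for amortized DRAM read latency (a sketch, not our code).
#define NUM_READS 100000
extern "C" void chase_latency(const unsigned int *mem, unsigned int *result) {
#pragma HLS INTERFACE m_axi port=mem offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=result offset=slave bundle=gmem
    unsigned int idx = 0;
    for (int i = 0; i < NUM_READS; i++) {
        // Each address depends on the previously read data, so reads cannot overlap
        // and every iteration pays the full DRAM access latency.
        idx = mem[idx];
    }
    *result = idx; // keep the chain live so the reads are not optimized away
}

Dividing the measured run time by NUM_READS then gives the amortized per-read latency.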

One of the major challenges I encountered was that I had no knowledge of how the high-level synthesis process parallelized my C++ code, so I needed to carefully build data dependencies or buffer structures in each kernel to achieve the desired level of parallelism. Identifying and managing these dependencies is crucial for optimizing a hardware kernel’s performance on FPGA. 

In the next week, I will set up and run a naive All-Pairs simulation algorithm, one that comes directly from a software implementation, on the FPGA. I will measure its latency and throughput and compare them with the performance expected from its number of fixed-point operations and data accesses. From there, we will develop an idea of which parts of the algorithm are parallelized by the high-level synthesis compiler and which parts of the calculation we need to restructure to achieve higher parallelism.
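For reference, the naive All-Pairs algorithm has the O(N^2) form sketched below. This is an illustrative 2D version with assumed names, constants, and a softening term, not our actual implementation.

// Illustrative naive 2D all-pairs gravity step (a sketch; names and constants are assumptions).
#include <cmath>
struct Particle { float x, y, vx, vy, m; };

void all_pairs_step(Particle *p, int n, float dt) {
    const float G = 6.674e-11f; // gravitational constant
    const float EPS = 1e-3f;    // softening term to avoid division by zero
    for (int i = 0; i < n; i++) {      // O(n^2): every particle interacts with every other
        float fx = 0.0f, fy = 0.0f;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            float dx = p[j].x - p[i].x;
            float dy = p[j].y - p[i].y;
            float r2 = dx * dx + dy * dy + EPS;
            float inv_r = 1.0f / std::sqrt(r2);
            float f = G * p[i].m * p[j].m * inv_r * inv_r;
            fx += f * dx * inv_r;      // force component along x
            fy += f * dy * inv_r;      // force component along y
        }
        p[i].vx += dt * fx / p[i].m;   // integrate velocity
        p[i].vy += dt * fy / p[i].m;
    }
    for (int i = 0; i < n; i++) {      // integrate position after all forces are known
        p[i].x += dt * p[i].vx;
        p[i].y += dt * p[i].vy;
    }
}

The accumulation into fx and fy is an example of the kind of dependency between data access and computation that needs to be analyzed before deciding what the HLS compiler can parallelize on its own.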

 

Coursework relevant to the project:

My knowledge of pipelining and concurrency comes from the course 18-447 Introduction to Computer Architecture. My understanding of throughput and latency was gained during my summer research on FPGA compute acceleration. I am currently learning high-level synthesis techniques in the course 18-643 Reconfigurable Logic: Technology, Architecture and Applications. This course focuses on improving the parallelism and data access order of a computation to achieve high performance on FPGAs.

 

10/07/2023

Summary of progress:

This week I did not accomplish many tasks because I caught the flu while attending the Grace Hopper Conference last week. The flu delayed coursework in my other classes as well as the capstone project. My original plan for this week was to analyze the naive All-Pairs algorithm in terms of its data accesses and fixed-point calculation count. Then I was supposed to run the naive algorithm on the Ultra96 FPGA without any optimization techniques and compare its measured performance against my expectation. I started analyzing the algorithm, but I have not finished reasoning through the dependencies between data access and computation. This step is not straightforward because there is already a lot of data reuse within the algorithm. I will continue working on my analysis as I catch up with my coursework.

Schedule Update:

I am not exactly behind schedule at this point, since we allocated a longer period of time to work on the high-level synthesis algorithm. However, I will now need to speed up my analysis by developing the naive algorithm on the FPGA and doing the conceptual analysis concurrently. I will also ask Rene and Abhinav for help, since they are familiar with analyzing parallelism.

Next Week Plan:

Next week, I will finish the baseline implementation of the naive All-Pairs algorithm, and then start optimizing the high-level synthesis algorithm with our proposed ideas, such as loop unrolling and adding BRAM buffers.

Weekly Answer:

An important model I use for analyzing a computation's potential for optimization is the Roofline model. The Roofline model describes the hardware and algorithmic limits on parallelizing a problem across multiple processing cores.
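For reference, the standard Roofline bound on attainable performance is:

Attainable performance = min(Peak compute throughput, Memory bandwidth x Arithmetic intensity)

where arithmetic intensity is the number of operations performed per byte moved from memory. The DRAM bandwidth measured earlier gives us a starting point for the memory-bound side of this roof for our kernel.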

 

10/21/2023

Summary of progress: In the week before fall break and the week of fall break, I successfully set up a Vitis project and a Vitis_HLS project that run a naive All-Pairs algorithm solution for the N-body simulation. By naive, I mean the solution employs no optimization techniques and takes a constant number of particles as input. This solution will be our starting point for analyzing hardware resource usage. To complete our final solution, we will use both Vitis_HLS and Vitis to generate the hardware computation kernel and the software interfaces.

The Vitis project passed software emulation and hardware emulation. After running the hardware emulation, we were able to obtain an FPGA resource usage report, summarized below:

  • Flip Flop: 3300
  • LUT: 3024
  • DSP: 0

The Vitis_HLS project is able to compile and produce a utilization report as well, but my code currently fails my validation check. I will look into this bug further. The report is shown below:

  • Flip Flop: 1922
  • LUT: 2780
  • DSP: 5

These resource reports match our expectations because, in our test setup, we are not parallelizing the computation and only simulate 15 particles. Given that we have around 141k flip-flops, 70k LUTs, and 360 DSPs available, the naive solution uses very few resources, so we have a lot of headroom to parallelize our computation.

Schedule Update: I am on schedule as of now because I have set up the project and worked through the software and hardware interfaces of our MVP. However, I need to fix the bug identified in the Vitis_HLS project so we can begin the optimization process.

Next Week Plan: Next week, I will modify the starting code in our project to accommodate a variable number of input particles. This will also give us insight into how the latency of the naive solution scales with the number of particles. I will also start adding optimization structures such as local buffers.

Weekly Answer: What I need to learn for the project is parallel computing techniques. In particular, I need to learn about synchronization, because the N-body algorithm involves both parallelizable portions and data-dependent portions. I need these parallel computing concepts to effectively accelerate our computation.

10/28/2023

Summary of progress: This week, I first focused on setting up our Vitis project environment. I was able to get it running and ran simulations for a small number of particles declared locally on the stack. This proved useful for generating reports to verify our hardware usage and for trying out optimisations like fixed-point versus floating-point numbers. Once this was done, I focused on refactoring the initialization of our project to read particle data from memory instead of declaring it directly on the stack. This was important because our use case requires simulating several thousand particles, which would overflow the ARM core's stack on the FPGA. Lastly, I also explored the effect pipelining has on the hardware generated from our code. I saw that by pipelining the force computation loops, we issued more BRAM accesses, which consequently led to an increase in BRAM usage. A simplified sketch of this kind of pipelining change is shown below.
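The sketch below is illustrative only; the names, sizes, and the placeholder arithmetic are assumptions, not our actual kernel.

// Illustrative use of the PIPELINE pragma on a force-accumulation style loop (a sketch).
#define N_PARTICLES 1024
void accumulate(const float px[N_PARTICLES], const float py[N_PARTICLES],
                float fx[N_PARTICLES], float fy[N_PARTICLES]) {
    for (int i = 0; i < N_PARTICLES; i++) {
        float ax = 0.0f, ay = 0.0f;
        for (int j = 0; j < N_PARTICLES; j++) {
#pragma HLS PIPELINE II=1
            // Pipelining overlaps iterations of this inner loop, raising the rate of
            // accesses to the local arrays each cycle.
            ax += px[j] - px[i]; // placeholder arithmetic standing in for the force terms
            ay += py[j] - py[i];
        }
        fx[i] = ax;
        fy[i] = ay;
    }
}

In our reports, this higher access rate showed up as more BRAM accesses and correspondingly higher BRAM usage.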

Schedule Update: I think we are reasonably on track with our schedule, as this week was set aside for our integration tests.

Next Stage: Next week I plan to run different versions of our code (e.g., pipelining on vs. off, fixed vs. floating point, etc.) on the FPGA itself and measure their performance to display for our interim demo, given that our pipeline is set up. We will start small with shorter simulations and fewer particles and gradually scale up.

11/04/2023

Summary of progress: 

After the ethics discussion this week (where we learned about several insightful considerations for our project), I began finishing the pipeline setup that was started last week. In our Vitis project, I added a file IO feature where the user can scp data onto the FPGA and copy the results back to their PC after the host program finishes executing. I built the project with this added feature and tested its functionality on the board; the file IO works. At this point, we have implemented the entire skeleton of our MVP. I then created a build that uses some of the optimisations we tried (mainly fixed-point types). This is the build we plan to use for the interim demo. I also helped Rene refactor sections of our code to reduce memory dependencies and redundant DRAM bandwidth usage.
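A minimal sketch of this host-side file IO is shown below. The file format (one particle per line with five values) and the names are assumptions for illustration; the actual host program may differ.

// Illustrative host-side file IO for particle data (a sketch, not our exact host code).
#include <fstream>
#include <vector>
struct Particle { float x, y, vx, vy, m; };

std::vector<Particle> load_particles(const char *path) {
    std::vector<Particle> ps;
    std::ifstream in(path);
    Particle p;
    while (in >> p.x >> p.y >> p.vx >> p.vy >> p.m)
        ps.push_back(p); // input file is scp'd onto the board before the run
    return ps;
}

void save_particles(const char *path, const std::vector<Particle> &ps) {
    std::ofstream out(path);
    for (const Particle &p : ps) // result file is copied back to the user's PC with scp
        out << p.x << ' ' << p.y << ' ' << p.vx << ' ' << p.vy << ' ' << p.m << '\n';
}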

 

Schedule Update: 

We are on schedule: we have our base pipeline set up and can now shift our focus completely to the optimisations. We have a Vitis project set up that can run emulations and builds for our code iterations. We plan to present this for our interim demo next week.

 

Next Week Plan: 

I plan to try some of the hardware optimisations next week, such as setting up local buffers in BRAM to improve our read/write latency. We will also look into tiling the computation to increase data locality.

 

11/11/2023

Summary of progress:

This week I successfully integrated the IO with the user's personal computer and the N-body simulation computation kernel on the FPGA board. As shown in our interim demo, we have now implemented the system for our MVP workflow. I tested our current N-body simulation kernel with the DRAM bandwidth optimization on the Ultra96 FPGA board and achieved around a 4x speedup compared to a CPU implementation. In addition, I implemented a feature to write out the particle states every parametrized number of time steps instead of only outputting the final state after all the iterations. These intermediate results are useful for our graphical display to show the gradual movement of the particles. As of now, we do not plan to include the intermediate output feature in our speedup measurement, because the IO is not part of the N-body simulation algorithm we aim to optimize.
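A sketch of the periodic-output idea is below; the interval, names, and the stand-in kernel call are assumptions, and save_particles refers to the illustrative IO sketch from the previous entry.

// Illustrative host loop that snapshots particle state every OUTPUT_INTERVAL time steps (a sketch).
#include <string>
#include <vector>
const int OUTPUT_INTERVAL = 10; // parametrized number of time steps between outputs

void run_nbody_step(std::vector<Particle> &ps);                         // stands in for the kernel invocation
void save_particles(const char *path, const std::vector<Particle> &ps); // see the earlier IO sketch

void simulate(std::vector<Particle> &ps, int num_steps) {
    for (int step = 0; step < num_steps; step++) {
        run_nbody_step(ps); // advance one time step on the FPGA
        if ((step + 1) % OUTPUT_INTERVAL == 0) { // periodic snapshot for the graphics display
            std::string path = "out_" + std::to_string(step + 1) + ".txt";
            save_particles(path.c_str(), ps);
        }
    }
}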

 

Schedule Update:

I am on schedule because my group and I have a working foundation for our project on which we can experiment with other high-level synthesis optimization techniques. The working techniques I have tested in separate workspaces include BRAM local buffers that allow parallel computation on multiple elements of an array. To maximally parallelize our computation, we need to tune the loop unrolling and loop pipelining pragmas together with the fixed-point data representation to achieve high parallelization within the hardware resource limit.

 

Next Week Plan:

I will start designing local buffer structures to process our particles in batches. I have already optimized our DRAM access pattern, so I believe I cannot extract any more DRAM bandwidth. What we need to focus on now is performing more computation on the data already fetched from DRAM.

 

Weekly Answer:

There are two metrics we need to focus on for this project: speedup and accuracy. The tradeoff between them comes mainly from the fixed-point data representation, which can increase our parallelization but reduces accuracy. The part of the validation I am responsible for is the speedup measurement. I use the built-in system time function to measure the end-to-end execution time of our N-body simulation kernel. This measurement is important because it is the actual delay our users care about when performing computationally intensive tasks. The end-to-end execution time will also directly reflect the hardware optimizations I implement.
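As an example, one way to do this end-to-end timing is sketched below with std::chrono; the actual host code may use a different system time call, and run_simulation here stands in for launching the kernel and waiting for it to finish.

// Illustrative end-to-end timing of the simulation run (a sketch).
#include <chrono>
#include <iostream>

void run_simulation(); // stands in for launching the N-body kernel and waiting for completion

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    run_simulation();   // the region being measured
    auto end = std::chrono::high_resolution_clock::now();
    double seconds = std::chrono::duration<double>(end - start).count();
    std::cout << "End-to-end execution time: " << seconds << " s\n";
    return 0;
}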

 

11/18/2023

Summary of progress:

This week I implemented a local memory structure that buffers multiple particles read from DRAM so they can be processed simultaneously. By buffering multiple particles together in local arrays, I can partition and reshape those arrays to provide concurrent access and enable concurrent processing of independent data streams. The current local buffers are implemented as:

float BufP[BATCH_SIZE][5]; // batch several particles to calculate in parallel
#pragma HLS array_reshape dim=2 type=complete variable=BufP
#pragma HLS array_partition dim=1 type=complete variable=BufP

float BufF[BATCH_SIZE][2]; // batch the forces on the particles
#pragma HLS array_reshape dim=2 type=complete variable=BufF
#pragma HLS array_partition dim=1 type=complete variable=BufF

In the code above, BATCH_SIZE is the number of particles we process in parallel. We can currently observe in the Vitis_HLS report that as BATCH_SIZE increases, loop latency decreases by the same factor. This is promising because it makes sense that as more data is computed in parallel, our total latency decreases. One challenge we currently face is that we cannot generate accurate resource utilization reports from Vitis_HLS for our local buffers because we are not using them as function arguments, which is what Vitis_HLS reports on. Without a resource utilization report, it becomes challenging to build hardware targets in Vitis because we cannot judge whether the build will succeed. The solution we are currently working on is to rewrite part of our code to pass the local arrays as function arguments.

Schedule Update:

I am on schedule, as we are implementing and tuning our optimization strategies, such as BRAM storage, array reshaping, and array partitioning. Although I am seeing promising statistics from Vitis_HLS, I need to fix the missing BRAM utilization report in order to further tune our parallelization parameters.

Next Week’s Plan:

I will fix the problem with the BRAM report. Then I will keep tuning BATCH_SIZE, the array structures, and the fixed-point representations to achieve optimal parallelization while maintaining high simulation accuracy.

 

 

12/03/2023

Summary of progress:

This week I made significant progress on speedup. In short, we have recorded a maximum 25x speedup when simulating 10k particles, exceeding our minimum goal of a 10x speedup over a parallelized CPU implementation. This was achieved with two major optimizations: 1. memory throughput optimization, and 2. loop reordering to remove data dependencies between iterations. After Rene implemented the strategy of burst-reading all of our data and storing it on chip, I was able to reorder the loops to increase the throughput of the inner loop of our algorithm. The idea behind the loop reordering came from mapping our computation pattern to a matrix-vector multiplication. This way, I was able to find a computation order that resolves the data dependency but demands more memory accesses. To handle the increased memory accesses, we store our data on chip, and I used 3D arrays together with array partition pragmas to allow parallel data accesses. Currently, with increased memory throughput, resolved data dependencies, and batch execution, we have achieved around a 25x speedup with 95%+ accuracy. A simplified sketch of the loop reordering idea is shown below.
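The sketch below uses a plain accumulation instead of the actual force terms; the names, sizes, and pragmas are assumptions for illustration, not our kernel.

// Illustrative loop reordering on a simplified accumulation (a sketch, not our kernel).
#define N_PARTICLES 1024
#define BATCH_SIZE 8

// Before: each inner iteration adds into the same accumulator, so the loop carries a
// dependency and cannot be fully pipelined.
void accumulate_before(const float contrib[BATCH_SIZE][N_PARTICLES], float out[BATCH_SIZE]) {
    for (int b = 0; b < BATCH_SIZE; b++) {
        float acc = 0.0f;
        for (int j = 0; j < N_PARTICLES; j++)
            acc += contrib[b][j]; // loop-carried dependency on acc
        out[b] = acc;
    }
}

// After: the loops are swapped, like walking a matrix-vector product column by column.
// The unrolled inner loop now updates BATCH_SIZE independent accumulators in parallel,
// at the cost of keeping the accumulators and more of the data live on chip.
void accumulate_after(const float contrib[BATCH_SIZE][N_PARTICLES], float out[BATCH_SIZE]) {
#pragma HLS array_partition dim=1 type=complete variable=contrib
#pragma HLS array_partition dim=1 type=complete variable=out
    for (int b = 0; b < BATCH_SIZE; b++)
        out[b] = 0.0f;
    for (int j = 0; j < N_PARTICLES; j++) {
        for (int b = 0; b < BATCH_SIZE; b++) {
#pragma HLS UNROLL
            out[b] += contrib[b][j]; // independent accumulators, no dependency within the inner loop
        }
    }
}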

 

Schedule Update:

I am on schedule, as I have explored the hardware optimizations needed to achieve the desired speedup. I have also made the number of simulated iterations parametrizable, so we can run longer simulations and analyze the results for various iteration counts.

 

Next Week Plan:

We will keep working on integrating the simulation kernel with the graphical display. I will also explore optimization methods to push for more speedup, including increasing the hardware clock frequency, reducing routing congestion, and mapping the computation to different hardware resources. In addition, we will prepare for the final demonstration.

12/09/2023

Summary of progress: 

In the past week, I tried different batch sizes to push for more speedup. Batch size is the number of particles we compute at the same time. We were eventually able to reach a 40x speedup over the CPU implementation of the N-body simulation with batch size 8, before running out of hardware resources. We collected data from implementations with different batch sizes and numbers of iterations. We found that speedup increases with batch size, which is intuitive: the more computations we do in parallel, the faster the whole task completes. We also found that speedup increases with the number of simulated iterations, because more iterations increase the portion of the run that is parallelized, so the speedup improves according to Amdahl's Law (see the formula below).

In addition, during data collection I found that the optimal batch size for a single-iteration (single time step) simulation is 5. Single-iteration speedup is useful for driving our graphics display because we would output a simulation result every time step. Batch size 5 being optimal could be the result of a more efficient BRAM partition, as seen in the resource vs. batch size chart in our presentation. A more efficient BRAM structure can lead to better DRAM burst read scheduling, giving better memory performance.
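For reference, Amdahl's Law bounds the overall speedup as:

Speedup = 1 / ((1 - p) + p / s)

where p is the fraction of the run that is parallelized (here, the per-time-step force computation on the FPGA) and s is the speedup of that fraction. Running more iterations shrinks the non-parallelized setup and IO relative to the whole run, so p grows toward 1 and the overall speedup climbs toward s, which matches the trend we measured.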

Schedule Update: 

I am right on schedule, as I have completed my portion of the project. Updates from Abhinav and Rene will go into more detail on how the graphics display for our simulation is integrated with the simulation hardware kernel.

Next Week Plan: 

Next week we will be working on our final demo, video, and report to wrap up the project. Thank you very much for all the support during this journey!