During this final week, we finished integrating the entire project. We merged Jeremy’s changes to the loop logic with my burst-transfer logic and verified that the results made sense. After this, we investigated unrolling and pipelining a few more loops and managed to squeeze out a bit more performance. As shared in the presentation, here is a summary of the effects of some of the different optimizations.

Note that these results are estimates for the kernel operation alone and do not fully reflect the combined cost of the kernel and its associated data transfer.
Other than integration, the rest of this week was spent preparing presentation materials: the final presentation, the poster, and the final video.
