I looked into the HLS directives (pragmas) and methods for writing a pipelined matrix multiplier in Vivado HLS targeting the Ultra96 board. The process seems relatively straightforward, aside from the design and maintenance of the buffers that implement the various pipeline stages. Those parameters will need to be benchmarked and tuned to achieve optimal feed-forward latency. Since this is a fully trained model, the only latencies that matter are the forward-pass (inference) latencies through the pipeline. I'm going to continue this work through the week and keep everything under proper source control.