Team Status Report for April 25

Accomplishments

Theodor has been calculating timings for our FPU operations.

There are a few things to point out here (a rough sketch for converting these cycle counts into wall-clock time follows the list):

  • Assignment-time cycle counts. This is the number of cycles between the time the pkt_avail signal is asserted to the Data Pipeline Router and the time the dpr_done signal is asserted, i.e. the time it takes to assign a model to a Model Manager once the packet is on the board.
  • Single-Model train cycles. This is the number of cycles needed to run one forward/backward/update pass on a single input for the given model, assuming that only one model is training on the board.
  • Double- and Quadruple-Model train cycles. This is the number of cycles needed to run one forward/backward/update pass on a single input when two or four models are training simultaneously on the board. These columns are empty because Theodor is still filling them out while making small corrections and optimizations to the implementation.
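
To put cycle counts like these in wall-clock terms, the sketch below converts a cycle count into latency and throughput. The 50 MHz clock frequency is an assumed figure for illustration only, not a number from our design.

```python
# Minimal sketch: convert FPGA cycle counts into latency and throughput.
# ASSUMPTION: a 50 MHz fabric clock; substitute the real clock frequency.

CLOCK_HZ = 50_000_000  # assumed clock frequency, in Hz

def latency_seconds(cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Wall-clock time taken by a given number of cycles."""
    return cycles / clock_hz

def samples_per_second(cycles_per_sample: int, clock_hz: float = CLOCK_HZ) -> float:
    """Training samples processed per second for a single model."""
    return clock_hz / cycles_per_sample

# Example with a made-up cycle count (the real numbers belong in the table above):
cycles = 12_000
print(f"{latency_seconds(cycles) * 1e6:.1f} us per pass")
print(f"{samples_per_second(cycles):.0f} samples/s")
```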

Jared has been supporting integration with the host machine and the Data Pipeline Router. The memory interface in the hardware supports the common memory protocol, and the Pi hosts the servers that communicate with the host machine.

Schedule and Accomplishments for Next Week

In preparation for the final report, Theodor will be filling out the above table and getting numbers for various counts of models being trained simultaneously on one board. Our model throughput for the single-model implementation was not impressive, so we will try training some of our smaller models (of which we now have more, since we aren’t running convolutional layers) on M9k memory, which has single-cycle dual-port reads.

Jared will be finalizing integration so that we can gather more accurate metrics and possibly meet our integration goal.

Mark will be helping Jared with integration as well as finalizing some GPU benchmarks and working on the final report.

Theodor’s Status Report for April 25

Accomplishments

I’ve been calculating timings for our FPU operations.

There are a few things to point out here:

  • Assignment-time cycle counts. This is the number of cycles between the time the pkt_avail signal is asserted to the Data Pipeline Router and the time the dpr_done signal is asserted, i.e. the time it takes to assign a model to a Model Manager once the packet is on the board.
  • Single-Model train cycles. This is the number of cycles needed to run one forward/backward/update pass on a single input for the given model, assuming that only one model is training on the board.
  • Double- and Quadruple-Model train cycles. This is the number of cycles needed to run one forward/backward/update pass on a single input when two or four models are training simultaneously on the board. These columns are empty because I am still filling them out while making small corrections and optimizations to the implementation.

Schedule and Accomplishments for Next Week

In preparation for the final report, I will be filling out the above table and getting numbers for various counts of models being trained simultaneously on one board. Our model throughput for the single-model implementation was not impressive, so we will try training some of our smaller models (of which we now have more, since we aren’t running convolutional layers) on M9k memory, which has single-cycle dual-port reads.

Team Status Report for April 18

On Monday, Professor Low told us that we needed to act on a contingency plan in case Theodor could not get all of the convolutional operations done before the demo on Monday, April 20. He was absolutely right, and we have rescoped the hardware portion to include only the operations needed for Linear Layers, which we already had implemented on Monday. We seriously underestimated the amount of time it would take to implement the necessary layers for convolutional neural networks, and implementing those layers is not essential to the core goal of the project, which is to implement a fast hardware architecture for training neural networks.

At this point, we have the hardware architecture (MMU, Model Manager, and FPU Bank) working with feedforward neural networks built from Linear and ReLU layers. By “working”, we mean we are performing a forward, backward, and update pass over an input sample and label. This accomplishes everything we needed from the hardware architecture, and we are currently working on getting the Data Pipeline Router to do the same from raw packets rather than testbenches.

On the transport end, the SPI bus is functional. Since we could not complete integration in time, the current SPI instance functions as a simple echo server.
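
For a sense of how an echo like this can be exercised from the Pi side, here is a minimal sketch using the spidev Python library; the bus/device numbers, clock rate, SPI mode, and the guess that the echo lags by one transfer are all illustrative assumptions, not details of our setup.

```python
import spidev  # Raspberry Pi SPI userspace driver (python3-spidev)

# Illustrative only: push a known pattern over SPI and check that it comes
# back. Assumes SPI bus 0 / chip select 0 and that the echoed bytes are
# returned on the following transfer.

spi = spidev.SpiDev()
spi.open(0, 0)                    # assumed bus and chip select
spi.max_speed_hz = 1_000_000      # assumed clock rate
spi.mode = 0                      # assumed SPI mode

pattern = list(range(16))         # 0x00 .. 0x0F
spi.xfer2(list(pattern))          # first transfer fills the echo buffer
reply = spi.xfer2(list(pattern))  # second transfer clocks the echo back out
print("echo ok" if reply == pattern else f"mismatch: {reply}")

spi.close()
```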

Theodor’s Status Report for April 18

On Monday, Professor Low told us that we needed to act on a contingency plan in case I could not get all of the convolutional operations done before the demo on Monday, April 20. He was absolutely right, and I’ve rescoped the hardware portion to include only the operations needed for Linear Layers, which we already had implemented. I’m disappointed that I wasn’t able to implement my specifications for the convolutional layers (which include the Max Pooling and Flatten operations), but I seriously underestimated the amount of time it would take, and implementing them is not essential to the core goal of the project, which is to implement a fast hardware architecture for training neural networks.

Accomplishments

The hardware architecture is complete up to the Data Pipeline Router, which interfaces with the SPI bus that Jared is working on. At this point, we have a top-level module that drives signals to the Model Manager, which exposes memory handles to the FPU, which drives signals to the memory port managers in the MMU, which multiplexes a single-cycle on-chip memory and a simulated off-chip SDRAM (which stalls for a number of cycles before servicing a request). We’re currently working on moving these signals into the Data Pipeline Router, which will read packets and drive the proper signals without needing a testbench.

Schedule & Accomplishments for Next Week

Now that we’re not implementing convolutional layers, we need a benchmark suite of models to train. We will be building this over the next week so we can get some numbers for how fast our hardware implementation can train them.

Team Status Report for April 11

Accomplishments

TJ has made the following progress on the FPU operations:

Operation                     Described?   Implemented?   Testbench?
Linear Forward                Yes          Yes            Yes
Linear Backward               Yes          Yes
Linear Weight Gradient        Yes          Yes
Linear Bias Gradient          Yes          Yes
Convolution Forward           Yes
Convolution Backward          Yes          Yes
Convolution Weight Gradient   Yes
Convolution Bias Gradient     Yes
MaxPool Forward               Yes
MaxPool Backward
ReLU Forward                  Yes
ReLU Backward                 Yes
Softmax Forward
Softmax Backward
Cross-Entropy Backward
Flatten Forward               Yes
Flatten Backward              Yes          Yes
Parameter Update              Yes

He will be finishing the rest of these operations to get an end-to-end test working.

Mark has finished the helper function that walks through a model and lists every layer in the order it is called. This will be used to serialize each model. Mark also made small changes to the Transport Layer Protocol.
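
One common way to recover a model's layers in call order, assuming the models are defined in PyTorch (an assumption, not something stated in this report), is to register forward hooks and run one dummy input through the network. A rough sketch, not Mark's actual helper:

```python
import torch
import torch.nn as nn

# Rough illustration: list a model's leaf layers in the order they execute
# during a forward pass by attaching forward hooks. Assumes PyTorch models.

def layers_in_call_order(model: nn.Module, example_input: torch.Tensor):
    order, handles = [], []

    def record(module, _inputs, _output):
        order.append(module)

    for m in model.modules():
        if len(list(m.children())) == 0:      # hook only leaf modules
            handles.append(m.register_forward_hook(record))

    with torch.no_grad():
        model(example_input)                  # one dummy pass to trigger hooks

    for h in handles:
        h.remove()
    return order

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
for layer in layers_in_call_order(model, torch.zeros(1, 784)):
    print(type(layer).__name__)               # Linear, ReLU, Linear
```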

Jared has fixed bugs in the SPI protocol and verified that it works on the RPi.

Schedule

TJ will spend the next week finishing up the FPU Job Manager and implementing the rest of the Model Manager in preparation for the Demo on Monday.

Mark will spend the next week making sure that models are being serialized over correctly.

Jared will complete the SPI bus implementation, along with additional processing for data receiving.

Theodor’s Status Report for April 11

Accomplishments

This week I’ve been working on the FPU Job Manager Operations. I’ve been following my previous process of describing the FSM control signals state-by-state, then simply copying them over into SystemVerilog. Here’s what I have so far:

Operation                     Described?   Implemented?   Testbench?
Linear Forward                Yes          Yes            Yes
Linear Backward               Yes          Yes
Linear Weight Gradient        Yes          Yes
Linear Bias Gradient          Yes          Yes
Convolution Forward           Yes
Convolution Backward          Yes          Yes
Convolution Weight Gradient   Yes
Convolution Bias Gradient     Yes
MaxPool Forward               Yes
MaxPool Backward
ReLU Forward                  Yes
ReLU Backward                 Yes
Softmax Forward
Softmax Backward
Cross-Entropy Backward
Flatten Forward               Yes
Flatten Backward              Yes          Yes
Parameter Update              Yes

Last week, I had the Convolutional Forward and Linear Forward operations described, and only the Linear Forward operation implemented.

I’ve consolidated all of the weight and bias update operations into a single “Parameter Update” operation, since they all do exactly the same thing and the shape of each tensor can be read from memory.

Another workaround I’m implementing is for the Softmax Backward operation. I haven’t been able to find a working floating-point exponential (exp) calculator in Verilog, so if I can’t find one, I will simply subtract the label from the output, which gives the same gradient as backpropagating through a softmax followed by cross-entropy loss.
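
This shortcut leans on a standard identity: for cross-entropy loss applied to a softmax output, the gradient with respect to the pre-softmax values is simply (output − label), so no exponential is needed in the backward pass as long as the forward softmax outputs are available. The numpy snippet below is only a quick numeric check of that identity with made-up values; it is not hardware code.

```python
import numpy as np

# Numeric check: d(cross_entropy(softmax(z)))/dz == softmax(z) - one_hot_label.

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])    # made-up pre-softmax values
label = np.array([1.0, 0.0, 0.0])      # one-hot target

grad = softmax(logits) - label         # the "output minus label" shortcut

# Central-difference approximation of d(loss)/d(logits) for comparison.
eps = 1e-6
fd = np.zeros_like(logits)
for i in range(len(logits)):
    zp, zm = logits.copy(), logits.copy()
    zp[i] += eps
    zm[i] -= eps
    fd[i] = (-np.log(softmax(zp) @ label) + np.log(softmax(zm) @ label)) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-5))  # True
```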

Schedule & Accomplishments for Next Week

I’ll be finishing up the FPU Job Manager operations over the next couple of days, then preparing the Model Manager for the Demo.

Theodor’s Status Report for April 4

Accomplishments

This week, I started building the FPU Job Manager and implemented the Linear Forward operation. For the interim demo, I constructed a testbench that computes a matrix-vector multiplication using the FPU Job Manager.
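
For context, the Linear Forward operation amounts to a matrix-vector product (plus a bias term, if one is used), so a tiny host-side reference like the one below is what a testbench result can be checked against. The shapes and values are illustrative assumptions, not taken from the actual testbench.

```python
import numpy as np

# Host-side reference ("golden model") sketch for Linear Forward:
# y = W @ x + b, computed in software for comparison with the FPU result.

def linear_forward(W: np.ndarray, b: np.ndarray, x: np.ndarray) -> np.ndarray:
    return W @ x + b

W = np.arange(12, dtype=np.float32).reshape(3, 4)  # 3 outputs, 4 inputs
b = np.zeros(3, dtype=np.float32)
x = np.ones(4, dtype=np.float32)

print(linear_forward(W, b, x))  # expected: [ 6. 22. 38.]
```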

Schedule

No schedule changes are needed for next week. I will continue implementing FPU Job Manager operations.

Accomplishments for Next Week

Next week I will have Convolution Forward implemented and the FSM control signals defined for the rest of the operations we plan to implement.

Team Status Report for April 4

Accomplishments

Theodor implemented the skeleton of the FPU and the Linear Forward operation. He also implemented the assignment phase of the Model Manager.

Mark implemented the Tensor serializer (Dimensions 1-4) as well as a rough skeleton for serializing each of the various layers of a model.
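
The report doesn't spell out the wire format, so the sketch below only illustrates one plausible way a 1-4 dimensional tensor could be packed into bytes. The field order, integer widths, element type, and endianness here are assumptions, not our actual Transport Layer Protocol.

```python
import struct
import numpy as np

# Illustrative layout only: uint32 dimension count, one uint32 per dimension,
# then the float32 elements, all little-endian.

def serialize_tensor(t: np.ndarray) -> bytes:
    assert 1 <= t.ndim <= 4, "serializer handles 1-4 dimensional tensors"
    header = struct.pack("<I", t.ndim) + struct.pack(f"<{t.ndim}I", *t.shape)
    return header + np.ascontiguousarray(t, dtype="<f4").tobytes()

payload = serialize_tensor(np.ones((2, 3), dtype=np.float32))
print(len(payload))  # 4 + 2*4 + 6*4 = 36 bytes
```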

Jared implemented a UDP client for the Raspberry Pi and is debugging the SPI protocol on the FPGA.
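
For reference, a UDP client needs very little code on the Pi; below is a minimal sketch using Python's standard socket module. The address, port, payload, and timeout are placeholders rather than values from the project.

```python
import socket

# Minimal UDP client sketch: send one datagram and wait briefly for a reply.

DATA_SOURCE_ADDR = ("192.168.1.10", 5005)   # assumed address and port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2.0)                        # don't block forever on a lost reply
sock.sendto(b"hello", DATA_SOURCE_ADDR)
try:
    reply, _ = sock.recvfrom(4096)
    print("received", len(reply), "bytes")
except socket.timeout:
    print("no reply within 2 seconds")
finally:
    sock.close()
```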

Schedule

Accomplishments for next week

Theodor will spend next week implementing the rest of the FPU Job Manager operations. After that, he will finish work on the Model Manager and (if necessary) implement the DPR for end-to-end communication.

Mark will spend next week implementing the serialization of each of the potential layers, as well as cleaning up the communication between the Data Source machine and the Raspberry Pis.

Jared will produce a working SPI implementation and a receiving module for the FPGA. This includes correct interpretation of the transport layer protocol.