Mark’s Status Report for April 26th

This week, I revisited the CPU and GPU benchmarking files and modified them to reflect the changes made to all of the models. We decided to remove the MaxPool, Flatten, and Convolution layers from our models since implementing these layers in hardware turned out to be more work than we had originally planned for. I also made some modifications to the serialization, as we discovered small issues with how models were serialized on the hardware side. Additionally, we spent most of this week preparing for the Final Presentation, which was due on the 27th of April.

This coming week, I plan on revisiting the GPU metrics and seeing if I can get more detailed metrics for intermediate values while the computation is running. I also plan on working on the final report.
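One way I might collect those intermediate metrics (a rough sketch, assuming our benchmarks stay in PyTorch; the model and field names here are placeholders, not our actual benchmark code) is to register forward hooks that record statistics for each layer's output as the computation runs:

```python
import torch
import torch.nn as nn

def attach_stat_hooks(model):
    """Register forward hooks that record shape/mean/std of each layer's output."""
    records = []

    def make_hook(name):
        def hook(module, inputs, output):
            out = output.detach().float()
            records.append({
                "layer": name,
                "shape": tuple(out.shape),
                "mean": out.mean().item(),
                "std": out.std().item(),
            })
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.ReLU)):
            module.register_forward_hook(make_hook(name))
    return records

# Placeholder model roughly matching our Linear/ReLU-only benchmarks.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
records = attach_stat_hooks(model)
model(torch.randn(64, 784, device=device))
print(records)
```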

Jared’s Status Report for Apr 25

Integration is finishing up both with the host and with the data pipeline router.

Host to Bus

With the host using UDP only for the finder, the Pi runs both a UDP server for being discovered and a TCP server for receiving data. The TCP server follows an RPC-style ordering, where every command sent is followed by a response. The host prepends the length of each message so that the Pi and the bus can determine the message length at the start of the command.
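A minimal sketch of that framing on the host side (assuming Python sockets; the 4-byte big-endian header is an illustrative choice, not necessarily our exact wire format):

```python
import socket
import struct

def send_command(sock: socket.socket, payload: bytes) -> None:
    """Prepend a 4-byte length header so the receiver knows where the command ends."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exactly(sock: socket.socket, n: int) -> bytes:
    """TCP is a byte stream, so keep reading until n bytes have arrived."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf

def recv_command(sock: socket.socket) -> bytes:
    (length,) = struct.unpack(">I", recv_exactly(sock, 4))
    return recv_exactly(sock, length)

def send_and_wait(sock: socket.socket, payload: bytes) -> bytes:
    """RPC-style exchange: every command the host sends gets exactly one response."""
    send_command(sock, payload)
    return recv_command(sock)
```

The Pi side would run the same recv_command/send_command pair in the opposite direction.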

Bus to Hardware

TJ gave me the code for the memory model. The bus adapts this model to stream incoming commands: instead of waiting for a read-enable signal, it raises "done" every time a command is received. Since the hardware is in charge of copying commands to their assigned location, the hardware keeps up with the bus.

Integration with the host still needs to be solidified, and I am in contact with TJ to help him complete hardware integration.

Team Status Report for April 25

Accomplishments

Theodor has been calculating timings for our FPU operations.

There are a couple of things to point out here:

  • There are cycle counts for the assignment time. This is the number of cycles between the time the pkt_avail signal is asserted to the Data Pipeline Router and the time the dpr_done signal is asserted; in other words, the time it takes to assign a model to a model manager once the packet is on the board.
  • Single-Model train cycle. This is the number of cycles it takes to run one forward/backward/update pass on a single input for the given model, assuming that only one model is training on the board.
  • Double- and Quadruple-Model train cycles. This is the number of cycles it takes to run one forward/backward/update pass on a single input when two or four models are training simultaneously on the board. These columns are empty because Theodor is still filling them out and making small corrections and optimizations to the implementation.

Jared has been supporting integration with the host machine and the data pipeline router. The memory interface in the hardware supports the common memory protocol and the Pi hosts servers to communicate with the host.

Schedule and Accomplishments for Next Week

In preparation for the final report, Theodor will be filling out the above table and getting numbers for different numbers of models being trained simultaneously on one board. Our model throughput for the single-model implementation was not impressive, so we will try training some of our smaller models (of which there are more, now that we aren't running convolutional layers) on M9K memory, which has single-cycle dual-port reads.

Jared will be finalizing integration to get more accurate metrics and, if possible, to meet our integration goal.

Mark will be helping Jared with integration as well as finalizing some GPU benchmarks and working on the final report.

Theodor’s Status Report for April 25

Accomplishments

I’ve been calculating timings for our FPU operations.

There are a couple of things to point out here:

  • There are cycle counts for the assignment time. This is the number of cycles between the time the pkt_avail signal is asserted to the Data Pipeline Router and the time the dpr_done signal is asserted; in other words, the time it takes to assign a model to a model manager once the packet is on the board.
  • Single-Model train cycle. This is the number of cycles it takes to run one forward/backward/update pass on a single input for the given model, assuming that only one model is training on the board.
  • Double- and Quadruple-Model train cycles. This is the number of cycles it takes to run one forward/backward/update pass on a single input when two or four models are training simultaneously on the board. These columns are empty because I am still filling them out and making small corrections and optimizations to the implementation.

Schedule and Accomplishments for Next Week

In preparation for the final report, I will be filling out the above table and getting numbers for different numbers of models being trained simultaneously on one board. Our model throughput for the single-model implementation was not impressive, so we will try training some of our smaller models (of which there are more, now that we aren't running convolutional layers) on M9K memory, which has single-cycle dual-port reads.
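For the report, the cycle counts can be turned into wall-clock time and throughput with a couple of one-liners (a sketch that assumes the 50 MHz clock domain mentioned in the hardware status; the example cycle count is a placeholder, not a measured number):

```python
CLOCK_HZ = 50_000_000  # assumed 50 MHz FPGA clock domain

def train_pass_time_us(cycles: int, clock_hz: int = CLOCK_HZ) -> float:
    """Wall-clock time in microseconds for one forward/backward/update pass."""
    return cycles / clock_hz * 1e6

def samples_per_second(cycles_per_pass: int, clock_hz: int = CLOCK_HZ) -> float:
    """Throughput if one input sample is processed per training pass."""
    return clock_hz / cycles_per_pass

# Placeholder value for illustration only.
example_cycles = 10_000
print(train_pass_time_us(example_cycles), "us per pass")
print(samples_per_second(example_cycles), "samples/s")
```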

Mark’s Status Report for April 18th

This week, I finished the serialization of the Linear and ReLU layers for any model. We decided to implement only these two layers since we did not have enough time to implement the other layers on the hardware side. This also took longer than expected, as I ran into issues with converting from a tensor to a multidimensional list and then to a single-dimensional list. Additionally, I finished serialization for samples. During this process, we had to change our data transmission protocol from UDP to TCP. The reason for this is that TCP guarantees the data will reach the destination and allows for easier chunked reading, since it is a stream-based connection as opposed to UDP.
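For reference, the flattening step that caused trouble boils down to something like the following (a simplified sketch in PyTorch; the field names are illustrative and not our exact serialization format):

```python
import torch.nn as nn

def serialize_model(model: nn.Sequential):
    """Serialize Linear and ReLU layers into flat lists of floats."""
    layers = []
    for module in model:
        if isinstance(module, nn.Linear):
            layers.append({
                "type": "linear",
                "in_features": module.in_features,
                "out_features": module.out_features,
                # flatten() turns the 2-D weight tensor into a 1-D tensor directly,
                # avoiding the detour through a nested Python list.
                "weights": module.weight.detach().flatten().tolist(),
                "bias": module.bias.detach().tolist(),
            })
        elif isinstance(module, nn.ReLU):
            layers.append({"type": "relu"})
        else:
            raise ValueError(f"unsupported layer: {type(module).__name__}")
    return layers

# Example: a small two-layer feedforward model.
serialized = serialize_model(nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)))
```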

Unfortunately, I did not have time to fully implement the Workload Manager, and the current setup is a hard-coded manager that sends 15 models to one board. Additionally, I wasn't able to implement a mechanism that waits for results from the FPGAs and stores them.

This coming week, I plan on helping Jared with the integration between the Data Source Machine and the Raspberry Pi. As we prepare for our final presentation, we will discuss what additional features must be implemented in order to provide a good demo.

Jared’s Status Report for Apr 18

The SPI bus is done. It operates at 15.6 MHz and successfully transfers data to and from the 50 MHz domain. The current protocol is as follows:

If the Pi receives a command:

  1. Transfer a single byte (value 0x1).
  2. Transfer the message length (4 bytes).
  3. Transfer the message.
  4. Wait for a single byte (value 0x1).
  5. Read in the message length (4 bytes).
  6. Read the message.
  7. Transfer the message to the original sender.

The routine requires that every message has a response. A possible addition to this is a routine that attempts a short SPI read when receiving an empty message.
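A rough sketch of what this routine looks like from the Pi side (assuming the spidev Python library and big-endian lengths, both of which are assumptions rather than confirmed details of our implementation):

```python
import struct
import spidev

spi = spidev.SpiDev()
spi.open(0, 0)                 # bus 0, chip-select 0 (assumed wiring)
spi.max_speed_hz = 15_600_000  # matches the 15.6 MHz bus clock

def send_command(message: bytes) -> bytes:
    """Send one command over SPI and wait for its response."""
    # 1-3. Start-of-command marker, 4-byte length, then the message itself.
    spi.xfer2([0x01])
    spi.xfer2(list(struct.pack(">I", len(message))))
    spi.xfer2(list(message))

    # 4. SPI is master-clocked, so "waiting" means clocking dummy bytes
    #    until the 0x1 marker comes back.
    while spi.xfer2([0x00])[0] != 0x01:
        pass

    # 5-6. Read the 4-byte response length, then the response body.
    (length,) = struct.unpack(">I", bytes(spi.xfer2([0x00] * 4)))
    return bytes(spi.xfer2([0x00] * length))
```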

A recent design change was to use TCP instead of UDP. I did not realize it at the time, but our protocol does not handle message fragmentation correctly over UDP. While we did not integrate in time to reach this state, I believe it would have caused issues in operation.

Team Status Report for April 18

On Monday, Professor Low told us that we needed to act on a contingency plan in case I could not get all of the convolutional operations done before the demo on Monday, April 20. He was absolutely right, and we have rescoped the hardware portion to include only the operations needed for Linear Layers, which we already had implemented on Monday. We seriously underestimated the amount of time it would take to implement the necessary layers for convolutional neural networks, and implementing those layers does not achieve the core goal of the project, which is to implement a fast hardware architecture for training neural networks.

At this point, we have the hardware architecture (MMU, Model Manager, and FPU Bank) working with Feedforward Neural Networks built from Linear and ReLU layers. By “working”, we mean we are performing a forward, backward, and update pass over an input sample and label. This accomplishes everything we needed from the hardware architecture, and we are currently working on getting the Data Pipeline Router to do the same with raw packets rather than testbenches.

On the transport end, the SPI bus is functional. Since we could not integrate in time, the current instance of the SPI bus functions as a simple echo server.

Theodor’s Status Report for April 18

On Monday, Professor Low told us that we needed to act on a contingency plan in case I could not get all of the convolutional operations done before the demo on Monday, April 20. He was absolutely right, and I’ve rescoped the hardware portion to include only the operations needed for Linear Layers, which we already had implemented. I’m disappointed that I wasn’t able to implement my specifications for the convolutional layers (which include the Max Pooling and Flatten operations), but I seriously underestimated the amount of time it would take, and implementing them does not advance the core goal of the project, which is to implement a fast hardware architecture for training neural networks.

Accomplishments

The hardware architecture is complete up to the Data Pipeline Router, which interfaces with the SPI bus that Jared is working on. At this point, we have a top-level module that drives signals to the Model Manager, which exposes memory handles to the FPU, which drives signals to the memory port managers in the MMU, which multiplexes a single-cycle on-chip memory and a simulated off-chip SDRAM (which stalls for a number of cycles before servicing a request). We’re currently working on implementing these signals in the Data Pipeline Router, which will read packets and drive the proper signals without needing a testbench.

Schedule & Accomplishments for Next Week

Now that we’re not implementing convolutional layers, we need a benchmark suite of models to train on. We will be making this throughout the next week so we can get some numbers for how fast our hardware implementation can train them.
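One possible shape for that suite (a sketch in PyTorch; the layer widths are placeholders we would tune to the hardware’s memory limits):

```python
import torch.nn as nn

def feedforward(sizes):
    """Build a Linear/ReLU feedforward model from a list of layer widths."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Placeholder benchmark suite of small Linear/ReLU-only models.
BENCHMARK_MODELS = {
    "small": feedforward([784, 32, 10]),
    "medium": feedforward([784, 128, 64, 10]),
    "large": feedforward([784, 256, 128, 64, 10]),
}
```

The same definitions could then be fed through the existing serializer and the CPU/GPU benchmarks, so the hardware and software numbers come from identical models.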