Jared’s Status Report for Feb. 29

It's time to get working on that Ethernet, and with a couple of shake-ups in the design there's a new board on the radar: the DE10-Nano SoC. The DE10-Nano has an onboard Ethernet PHY chip and is cheaper than the RPi/DE0-Nano setup, and its Cyclone V chip is a generation ahead. This would be a major change to the bus protocol, so I'm reordering the tasks for the week ahead to focus on the aspects related to synthesis.

As for this week, I gave a presentation on machine learning. I don't know anything about machine learning.

The RPi is giving me more trouble than it's worth: just to connect to it, I had to reformat it. So far this week I set up a simple test program for the GPIO and SPI bus via the bcm2835 library. No test against the FPGA yet, but I did get a loopback read/write to work.
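For reference, the loopback test just jumpers MOSI to MISO and checks that whatever goes out comes straight back. The sketch below shows the same idea with the Python spidev module rather than the bcm2835 C library I actually used, so treat it as illustrative only:

    import spidev

    spi = spidev.SpiDev()
    spi.open(0, 0)                # SPI bus 0, chip select 0
    spi.max_speed_hz = 1_000_000
    spi.mode = 0

    # With MOSI wired back to MISO, the bytes we send should be the bytes we read.
    tx = [0xDE, 0xAD, 0xBE, 0xEF]
    rx = spi.xfer2(tx)
    assert rx == tx, f"loopback mismatch: sent {tx}, got {rx}"

    spi.close()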

This next week is dedicated to a few tasks: deciding whether the DE10-Nano SoC is the right board to switch to (I'm favoring it), and writing some Verilog to test the SPI bus and obtain data. If we switch to the DE10, that code will form the basis of the communication protocol implementation.

Team Status Update for Feb. 29

Accomplishments

After we gave our design presentation, it became clear that we had not adequately conveyed the goal of our project. That, combined with the fact that we did not have clearly defined overall metrics, raised the worry that we didn't know why we were doing this project in the first place, which is a very fundamental thing to be missing.

Our team met about it and unanimously agreed that we need to fix this. So, for posterity, we're writing our project goals here.

The goal of our project is to develop a system that can train a large number of machine learning models in a short amount of time at a lower price point than a GPU. This is an important problem to solve because CPU training is slow and GPU training is expensive. By building a solution with FPGAs, we can make a system that exploits data locality by training multiple models on the same input, on the same FPGA, at the same time. This will decrease the time taken to train a batch of machine learning models (because training is a bus-intensive problem), and it will reduce the cost (because FPGAs are cheap).

We will measure performance with a metric that we call “model throughput”: the number of models trained from start to finish, divided by the amount of time taken to train them.

We will verify our system with a written benchmark suite of models to be trained on the CIFAR-10 dataset, where each individual model fits in the 32MB SDRAM of a DE0-Nano. Currently we have 30 such models, which train on two sets of features: one full color, the other grayscale. The largest of these models requires about 300KB (for weights, weight gradients, and intermediate space), which fits comfortably in SDRAM.
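For a rough sense of where a figure like 300KB comes from, the weight-plus-gradient portion can be estimated directly from a PyTorch model. The helper below is a back-of-the-envelope sketch (the name is ours, for illustration); intermediate/activation space would be added on top:

    import torch.nn as nn

    def rough_footprint_bytes(model: nn.Module) -> int:
        # 4 bytes per float32 parameter, doubled to account for weight gradients.
        # Intermediate/activation space is model-specific and not counted here.
        n_params = sum(p.numel() for p in model.parameters())
        return 2 * 4 * n_params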

We also measure model throughput per dollar because we want to make a system that outperforms a GPU system of the same cost. Our budget is not big enough to buy a GPU, let alone a comparable FPGA setup to compare against it. This goes back to why model throughput is a good metric: it is additive. If you add a subsystem that can train 0.1 models per second to a system that can already train 1 model per second, then the total throughput of the system is 1.1 models per second. Because we judge our system by its model throughput, we can estimate the model throughput of a similar FPGA-based system that costs the same as a GPU by simply multiplying by the ratio of the costs.
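Concretely, that estimate is just a linear extrapolation in cost. The sketch below spells it out; the function name and the example numbers are made up for illustration:

    def estimated_throughput_at_cost(measured_throughput, system_cost, target_cost):
        # Throughput is additive across identical worker boards, so a system
        # scaled up to target_cost should have roughly cost-proportional throughput.
        return measured_throughput * (target_cost / system_cost)

    # e.g. if a $400 FPGA setup measures 2.0 models/sec, its projection at an
    # $800 (GPU-priced) build would be 4.0 models/sec (all numbers hypothetical).
    print(estimated_throughput_at_cost(2.0, 400.0, 800.0))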


This week, we've also made strides on concrete work. TJ has put together working code for the SDRAM and M9K controllers, and Mark is close to getting a concrete number for model throughput on a CPU.

Schedule and Accomplishments for Next Week

Next week, TJ will be working on the FPU Job Manager, because that is where most of the remaining unknowns in hardware need to be figured out. SystemVerilog has the 32-bit floating-point shortreal type and several supported operations, so the most complicated part of the Job Manager will be the implementation of matrix operations and convolution.
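As a point of reference for that work, the convolution the Job Manager has to reproduce in hardware can be pinned down with a plain software model first. The NumPy sketch below is illustrative only; it fixes the expected arithmetic, not how the hardware will compute it:

    import numpy as np

    def conv2d_reference(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
        # Valid-mode 2D convolution (cross-correlation, as ML frameworks define it);
        # this is the arithmetic the FPU Job Manager needs to match.
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w), dtype=np.float32)
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out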


Theodor’s Status Update for Feb. 29

Accomplishments

Our team has needed a concrete metric for a very long time now. The most important thing I've done this week is clearly define model throughput and model throughput per dollar:

The underlying goal of this project is to produce a system that can train large batches of distinct machine learning models very fast and at a low price point. Processing more models in the same amount of time, processing the same number of models in less time, and processing the same number of models in the same amount of time with a cheaper setup should all increase the score. Throughput and throughput per dollar together reflect all three of these traits, and thus make good metrics to quantify the improvement that our system will make over existing systems (CPU and GPU).

The throughput metric is also useful because it is additive: if we combine two systems with throughputs T1 and T2, then the total throughput of the combined system will be (T1 + T2). This will allow us to quantify the scalability of our system by calculating the difference between the actual system throughput and the throughput of a single Worker board multiplied by the number of Workers. That difference measures the overhead our system incurs as it scales.
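Spelled out as a quick sketch (names illustrative):

    def scaling_overhead(single_worker_throughput, n_workers, measured_throughput):
        # Ideal scaling assumes throughput adds perfectly across Workers;
        # whatever we fall short by is the overhead introduced by scaling.
        ideal = n_workers * single_worker_throughput
        return ideal - measured_throughput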

This week, I also did research on the M9K blocks. I've had some trouble finding example code for these, but I discovered that Quartus will implicitly synthesize modules that act like RAM onto M9K blocks. Our actual M9K controller will be built on top of that inferred-RAM starter code.

Schedule and Accomplishments for Next Week

I've made some good progress on memory modules for the hardware, but there are still piles of work to do on them. Next week, I want to start working on the FPU Job Manager, which will be the hardest component we need to make.

Mark’s Status Update for February 22nd

This week, I began adding various ML models to the benchmarking suite. The total number and variety of models has yet to be fully discussed, but some basic models have already been implemented. This took longer than expected: due to my unfamiliarity with ML and the PyTorch library, I ran into multiple issues with passing the outputs of one function into the inputs of another.
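Most of those issues came down to making one layer's output shape line up with what the next layer expects. The toy model below shows the kind of basic model involved; the layer sizes are illustrative, not the actual models in our suite:

    import torch
    import torch.nn as nn

    class SmallCifarNet(nn.Module):
        # Toy CIFAR-10 model: 32x32 color input, 10 output classes.
        def __init__(self, in_channels: int = 3, n_classes: int = 10):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, 8, kernel_size=3, padding=1)  # 32x32 -> 32x32
            self.pool = nn.MaxPool2d(2)                                      # 32x32 -> 16x16
            self.fc = nn.Linear(8 * 16 * 16, n_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.pool(torch.relu(self.conv(x)))
            x = x.flatten(1)  # flatten so the shape matches the Linear layer's input
            return self.fc(x)

    # The grayscale-feature variant only changes the input channel count:
    # model = SmallCifarNet(in_channels=1)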

Additionally, TJ and I wrote some example code for what a user would write in order to train a set of models on a certain data set and to be able to read specific metrics. The sample code is attached below:
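Roughly, the user-facing code looks like the sketch below. DataPipelineManager and everything on it is ours to implement; the module and method names used here (data_pipeline, add_models, train_all, get_metrics) are placeholders for illustration, not the final API:

    import torch.nn as nn
    import torchvision
    import torchvision.transforms as transforms

    from data_pipeline import DataPipelineManager  # hypothetical module name; ours to write

    # The user supplies the data set...
    dataset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True,
        transform=transforms.ToTensor(),
    )

    # ...and the set of models they want trained (trivial examples here).
    models = [
        nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)),
        nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 10)),
    ]

    manager = DataPipelineManager(dataset)
    manager.add_models(models)
    manager.train_all(epochs=10)

    # Afterwards, the user reads back specific metrics per model.
    for name, metrics in manager.get_metrics().items():
        print(name, metrics)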

As specified in the code, the DataPipelineManager class and all associated functions will be implemented by us. The user is responsible for providing the specific data set, as well as which models will be used for training.

Overall, I would say progress is on schedule. This week I plan on adding a wider range of models, making sure that this range covers most, if not all, of our scope. Additionally, once a certain number of models are set up, I plan on training them on both the CPU of the data source machine and a GPU (NVIDIA 1080).

Team Status Update for Feb. 22

General Updates:
Software:
  • Wrote sample code for what a user would write in order to train a set of models on a given data set.
  • Anything that is part of the DataPipelineManager class is to be written by us.
  • The user provides the data set as well as the set of models that they would like trained. The example code is included in Mark's status update above.

Significant Risks:

Bus:
Swapping the bus protocol this late in the schedule is risky, and a working test implementation is required to properly tie the protocols together. This work must be done quickly.

Hardware:


Design Changes:
Bus:
The custom bus protocol is being swapped for a common and supported protocol, SPI.

Hardware:

Schedule Update:

Theodor’s Status Update for Feb. 22

Accomplishments

This week, I finished up the documentation that describes the handshakes between different hardware modules. At this point, the hardware modules are defined well enough that the work can be divided among multiple people and the modules can be developed independently.

Importantly, the handshakes I defined were:

  • Data Pipeline Router <-> Model Manager
  • Floating Point Bank <-> Model Manager

This does sound small compared to what I accomplished last week, but defining how these modules interact will allow us to completely separate the tasks of implementing the major hardware modules.

I’ve also done research on the SystemVerilog shortreal data type, which is effectively the same as a float in C. Having this will make it very easy to implement most floating-point operations in hardware, especially since they will be combinational. However, I still need to do more research on the footprint of default synthesized floating-point circuits.

Schedule

Next week, I really need to get started on implementing some of the smaller modules like the Job Manager and M9K/SDRAM interfaces. These are essential components and are defined well enough that they will not be changing for the rest of the project.

Accomplishments for Next Week

Almost all of the questions related to hardware have now been answered, but implementation, for the most part, has not started yet. It is essential that this start soon, especially with the midway demo happening in Week 11.

Jared’s Status Report for Feb. 22

This week was about experimenting with the GPIO bus protocol, understanding its shortcomings, and learning about specialized hardware within the Pi.

Accomplishments

I've started on some final documentation for the bus protocol. Initially there was pressure (at least from myself) to prioritize throughput. That hasn't changed, but after some research on optimum speeds for GPIO, some decisions had to be made to make life simpler down the line.

The custom bus protocol we designed earlier is likely being scrapped in favor of the built-in SPI protocol. There are a few factors involved:
1. The custom protocol was clock-agnostic: data was sent in parallel alongside an activating bit. SPI does not have built-in parallel data, but it may still support transfer modes supplemented by GPIO toggling.
2. The optimum throughput rate was a best guess based on throughput benchmarks from Stack Overflow that may not properly reflect the cost of setting multiple GPIO pins. With the Pi's SPI clock capable of running at up to 125 MHz, a more reliable and debuggable interface will help in the future.

Schedule

This week, I need to write a program for the SPI interface and a test module to run on an FPGA. By Friday, the module needs to be done and tested.

Mark’s Status Update for February 15th

This week, I worked with TJ to figure out the block diagram and specific details of the software side of the project.

Additionally, I set up a basic training routine with the CIFAR-10 dataset using the PyTorch tutorial available online. I also wrote an outline for the Design Review slides, and populated the Software and Benchmarking/Validation slides of that presentation.
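At its core, that routine follows the standard pattern from the PyTorch CIFAR-10 tutorial; the sketch below uses a throwaway placeholder model and arbitrary hyperparameters rather than anything from our benchmark suite:

    import torch
    import torch.nn as nn
    import torchvision
    import torchvision.transforms as transforms

    # Standard CIFAR-10 loading, as in the PyTorch tutorial.
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])
    trainset = torchvision.datasets.CIFAR10(root="./data", train=True,
                                            download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True)

    # Placeholder model; the real benchmark models live in our suite.
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    for epoch in range(2):  # a couple of epochs is enough for a timing baseline
        for inputs, labels in trainloader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()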

Overall, I would say my progress is slightly behind. I was able to benchmark the dataset on my local CPU, but the benchmarking suite is not fully set up. To catch up, I will do more research and reading on setting up my local environment for benchmarking so that I can get a better understanding of the process.

By the end of this week, I plan to have the benchmarking suite fully set up and to have a better understanding of how to use the PyTorch library.

Team Status Update for Feb. 15

Accomplished Tasks:

  • Designed Top-Level Block Diagram
  • Wrote Transport Layer Protocol
  • Made FSM Diagrams for Model Manager, FPU Bank/Job Manager, and Memory Management Unit
  • Set up barebones benchmarking suite for local machine CPU

Risks

The most significant risk is that we don't get connectivity working between the Data Source machine and the synthesized hardware. Our contingency plan is to implement the easiest solution possible, which involves having a Raspberry Pi act as a network stack for the FPGA and transmit data to the board via GPIO. This mitigates the risk because implementing a GPIO protocol in hardware will be much easier than implementing an Ethernet network stack in synthesizable hardware.

Another big risk is that our bus cannot handle the desired throughput and the worker boards spend significant time idling.

Changes Made

We decided that, instead of implementing an Ethernet network stack in Verilog, we would use a Raspberry Pi to handle the internet protocol and communicate information from the Data Source server to the Worker board. This will require far less work to implement functional connectivity between nodes in the network. The change incurs the cost of one Raspberry Pi per Worker, which brings the effective price of a Worker from $89 to $134; that is still within budget for the number of Workers we plan to buy.

Schedule Changes

We spent this week producing design documents to aid our process going forward. We have also realized that the development of hardware components will require more effort than anticipated.

Theodor (TJ) will spend the next few days finishing all of the documentation for the hardware side, excluding the GPIO protocol (still to be implemented by Jared Rodriguez). He will spend the rest of next week putting together memory interfaces for the M9K blocks and off-chip SDRAM, some of which can be recycled from his work in 18-341.


Jared’s Status Report for Feb. 15

Status report is a bit late, isn’t it? Turns out reports aren’t due the day before returning to the lab.

Accomplishments

For the past week, I’ve been detailing the protocol for the physical connection between the FPGA and the accompanying Raspberry Pi. The protocol accounts for available GPIO on both devices and attempts to maximize throughput from the Pi to the FPGA.

The document detailing the physical setup is here:

https://drive.google.com/open?id=12h_0X_CB3D_IyGUgf5_b_YGxSVW09nrvIl3oF7lknj8

The core supplemental diagram illustrates the GPIO on each board:

This design uses every GPIO pin on the Raspberry Pi. For convenience, the data pins are grouped by direction and split across the send/receive notification pins.

Schedule

This week has a lot riding on it: according to our Gantt chart, this week we determine if our Ethernet implementation is viable. To test this, I need to mess with the DE2-115 boards and get an echo server running over GPIO. This server is likely going to form the backbone of our communication with the FPGA, as we're limiting the RPi's role to acting as an alternative Ethernet PHY board. If all goes well, I'll have something to write about. If all doesn't go well, I'll have something to write about.