Team Status Update for March 21

We’ve pasted the information from our Statement of Work here:

Original Goal

From our initial project proposal, our goal for this project is to develop a scalable solution for training ML models using Field Programmable Gate Arrays (FPGAs) that act as a backend for existing machine learning frameworks. The hardware setup for this would include multiple Cyclone IV DE0-Nanos, a Data Source Machine (user laptop), and a bus between the FPGAs and the Data Source Machine, with the bus consisting of an Ethernet switch and the corresponding Ethernet cables. Using the CIFAR-10 dataset and our own set of ML models, we would evaluate our system's performance by measuring model throughput and comparing this value to the model throughputs of existing standards such as CPUs and GPUs.


Roadblock

Due to the recent outbreak and the subsequent social distancing practices, we are no longer allowed to meet in person. We have also lost access to the lab we were working in and to some of our hardware components, specifically the DE0-Nano boards, the Ethernet switch, and the Ethernet cables. Without the required hardware, we cannot build the physical product, and in turn we cannot measure the performance of our system, since the system cannot be physically built.


Solution

Thankfully, the worst-case scenario above was not realized: we either already have or can order all of the parts that we need. The necessary software to program our hardware will also not be an issue, since we are able to install Quartus on our own machines and synthesize designs for a DE0-Nano.

Overall, our implementation plan has not changed much from the plan we presented in the Design Review Document. Likewise, the planned metrics we described in the Design Review Document have not changed: we will still use Model Throughput and Model Throughput Per Dollar as metrics to compare our system to other hardware standards. Success for the system as a whole still means that our system will outperform a CPU in terms of model throughput and outperform a GPU in terms of model throughput per dollar. For the hardware subsystem, we will still measure the clock cycles taken to make memory accesses and perform FPU operations in order to identify bottlenecks, and we will still measure the throughput of the Bus subsystem.
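As a rough illustration of how we plan to compare these metrics, below is a minimal Python sketch of the Model Throughput Per Dollar comparison. The throughput and cost figures in it are hypothetical placeholders, not measurements.

    # Hypothetical sketch of the Model Throughput Per Dollar comparison.
    # The throughput and cost numbers below are placeholders, not real measurements.

    def throughput_per_dollar(models_per_second, hardware_cost_usd):
        # Model Throughput Per Dollar = model throughput / hardware cost
        return models_per_second / hardware_cost_usd

    # Placeholder figures to be replaced with measured values.
    setups = {
        "FPGA cluster (4x DE0-Nano)": (0.020, 4 * 90.0),
        "CPU baseline": (0.015, 300.0),
        "GPU baseline": (0.400, 700.0),
    }

    for name, (throughput, cost) in setups.items():
        print(name, throughput_per_dollar(throughput, cost), "models/s per dollar")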

Since we are unable to meet, our solution is to validate each of the three subsystems independently of one another. We can thus verify that our implementations work and hand off final working code to Jared so that he can validate and acquire metrics on the system as a whole. Our plan is to acquire an FPGA and Raspberry Pi setup for TJ (to verify the hardware implementation) and up to four identical systems for Jared.


Contingencies

Due to the instability of the situation, we need to plan for the contingency that our necessary hardware cannot be shipped to us. We are already planning to measure, in simulation, the clock cycles needed to perform essential hardware operations, and we can use those simulated numbers to estimate the model throughput of the entire system with the following method:

Ta: Time taken to assign all models to a board
Tt1: Time taken to train all models on a board on one input
TN: Time taken to train all models on a board on the entire dataset
ND: Number of inputs in the dataset
LB: Bus latency for sending a model
LD: Data Pipeline Router latency for assigning a model
C1: Clock cycles taken to train the slowest model on a board on one input, given that the weights and input are already in memory
NM: Number of models on a board
M1: Model throughput for one board
MN: Model throughput for all boards

Ta = NM * LB + LD

Tt1 = C1 / (50M clock cycles per second)

TN = ND * Tt1

M1 = NM / (Ta + TN)

MN = sum of M1 for each board in use
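To make the arithmetic above concrete, here is a minimal Python sketch of the estimate. The latencies, cycle counts, dataset size, and board count in it are hypothetical placeholders standing in for the numbers we would obtain from simulation.

    # Sketch of the simulated throughput estimate. All inputs below are
    # hypothetical placeholders for values we would obtain from simulation.

    CLOCK_HZ = 50_000_000  # DE0-Nano 50 MHz clock

    def board_throughput(NM, LB, LD, C1, ND):
        # Ta: time to assign all models to a board
        Ta = NM * LB + LD
        # Tt1: time to train all models on the board on one input
        Tt1 = C1 / CLOCK_HZ
        # TN: time to train all models on the board on the entire dataset
        TN = ND * Tt1
        # M1: model throughput for one board
        return NM / (Ta + TN)

    # Four identical boards with placeholder simulation numbers.
    boards = [dict(NM=4, LB=0.005, LD=0.001, C1=200_000, ND=50_000)] * 4

    # MN: model throughput for all boards
    MN = sum(board_throughput(**b) for b in boards)
    print("Estimated system model throughput:", MN, "models per second")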

Our simulated estimate for model throughput can then be compared to the model throughput of the CPU and GPU hardware setups, in case we are not able to build the system and measure an end-to-end metric.


The rest of our Team Status Update is below:

Accomplishments

Theodor is almost finished writing the Memory Port Controller. The cache (which asserts control signals to the SDRAM and M9K memory controllers) has been implemented, and the Memory Port Controller itself will be finished by the end of next week. He has also started working on the more difficult operations performed by the FPU Job Manager, which will make it easier to develop the many simpler operations quickly.


Schedule and Accomplishments for Next Week

Next week, Theodor will be finished defining the difficult FPU operations and will have started implementing them. He will also be done working on the Memory Port Controller.
