Mark’s Status Report for April 26th

This week, I revisited the CPU and GPU benchmarking files and modified them to reflect the changes made to all of the models. We decided to remove the MaxPool/Flatten/Convolution layers from our models, since implementing these layers in hardware turned out to be more work than we had originally planned for. I also made some modifications to the model serialization, as we discovered small issues with how serialized models were handled on the hardware side. Additionally, we spent most of this week preparing for the Final Presentation, which was due on the 27th of April.

This coming week, I plan on revisiting the GPU metrics to see if I can collect more detailed metrics for intermediate values while the computation is running. I also plan on working on the final report.
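One possible way to capture these intermediate metrics is sketched below, using PyTorch's built-in CUDA counters; this is a rough idea of the instrumentation, not the final benchmarking code.

```python
import time
import torch

def train_epoch_with_gpu_metrics(model, loader, optimizer, loss_fn, device="cuda"):
    """Run one training epoch while recording per-batch time and peak GPU memory."""
    model.to(device)
    torch.cuda.reset_peak_memory_stats(device)
    batch_times = []

    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        start = time.perf_counter()

        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

        torch.cuda.synchronize(device)   # make sure kernels finish before timing
        batch_times.append(time.perf_counter() - start)

    peak_mem_mb = torch.cuda.max_memory_allocated(device) / 1e6
    return sum(batch_times) / len(batch_times), peak_mem_mb
```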

Mark’s Status Report for April 18th

This week, I finished the serialization of the Linear and ReLU layers for any model. We decided to implement only these two layers, since we did not have enough time to implement the others on the hardware side. This also took longer than expected, as I ran into issues converting a tensor to a multidimensional list and then to a single-dimensional list. Additionally, I finished serialization for samples. During this process, we had to change our data transmission protocol from UDP to TCP, since TCP guarantees that data will reach the destination and, being a stream-based connection, allows for easier chunk reading than UDP.
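Since a TCP stream has no message boundaries of its own, chunk reading needs explicit framing. Below is a minimal sketch of the generic length-prefix pattern; it is not necessarily the exact framing our Transport Protocol uses (for example, it ignores the preambles).

```python
import socket
import struct

def send_chunk(sock: socket.socket, payload: bytes) -> None:
    """Prefix the chunk with a 4-byte big-endian length so the receiver
    knows exactly how many bytes to read off the stream."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """recv() may return fewer bytes than asked for, so loop until n arrive."""
    data = b""
    while len(data) < n:
        part = sock.recv(n - len(data))
        if not part:
            raise ConnectionError("socket closed mid-chunk")
        data += part
    return data

def recv_chunk(sock: socket.socket) -> bytes:
    """Read one length-prefixed chunk from the stream."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```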

Unfortunately, I did not have time to fully implement the Workload Manager, and the current setup is a hard-coded manager that sends 15 models to one board. Additionally, I was not able to implement a mechanism that waits for results from the FPGAs and stores them.

This coming week, I plan on helping Jared with the integration between the Data Source Machine and the Raspberry Pi. While preparing for our final presentation, we will discuss what additional features must be implemented in order to provide a good demo.

Mark’s Status Report for April 11th

This week, I wrote a helper function that takes an ML model and returns a list of all of that model's layers in the order they are called. This took longer than expected, as some layers were not showing up properly or in the right order. Additionally, I had to make some small modifications to the model serialization, since we made some changes to the Transport Layer Protocol. The two changes were fixing the size of each packet of information (from any size to 4 bytes) and swapping the order of the preambles sent before each layer.
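One common way to recover call order (rather than definition order) is to register a forward hook on each leaf module and trace a dummy forward pass. The sketch below shows that general technique; it is an illustration of the idea, not necessarily the exact approach used in our helper.

```python
import torch
import torch.nn as nn

def layers_in_call_order(model: nn.Module, sample_input: torch.Tensor):
    """Return the model's leaf layers in the order they are actually invoked,
    by hooking every leaf module and tracing one dummy forward pass."""
    called, hooks = [], []

    for module in model.modules():
        if len(list(module.children())) == 0:        # leaf layers only
            hooks.append(module.register_forward_hook(
                lambda mod, inp, out: called.append(mod)))

    with torch.no_grad():
        model(sample_input)                          # trace call order only

    for hook in hooks:
        hook.remove()
    return called
```

For example, layers_in_call_order(model, torch.zeros(1, 3 * 32 * 32)) would list a fully connected model's Linear and ReLU modules in execution order.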

This coming week, I plan on finishing up the serialization of a specific model, and then checking that every model that was defined in the benchmark.py file serializes correctly.

Mark’s Status Report for April 5

This week, I finished the tensor serializer helper function for tensors with one to four dimensions. This meant converting a tensor of any of those dimensions into a single-dimensional list based on the Transport Protocol we described previously. Additionally, I wrote a basic serialization for each of the six possible layers that a model could have (Linear, 2D Convolution, ReLU, MaxPool2D, Flatten, Softmax). As of this point, all tensors use integer values, since that makes it easier to validate that a tensor is being serialized correctly.
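For illustration, the flattening step looks roughly like the sketch below; the header layout here (dimension count followed by the shape) is a guess at the idea, not the exact format defined by our Transport Protocol.

```python
import torch

def serialize_tensor(t: torch.Tensor) -> list:
    """Flatten a 1-4 dimensional tensor into a single list of the form
    [ndim, dim0, ..., dimN, values...] so the receiver can rebuild the shape."""
    if not 1 <= t.dim() <= 4:
        raise ValueError("only 1-4 dimensional tensors are supported")
    return [t.dim()] + list(t.shape) + t.flatten().tolist()   # values in row-major order
```

For an integer-valued 2x3 tensor, this produces [2, 2, 3] followed by the six values.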

This coming week, I plan on fully implementing the serialization for each of the six possible layers, as well as cleaning up the interaction between the Data Source Machine and the Worker.

Mark’s Status Report for March 28

This week, I set up a basic framework for the Worker Finder using the UDP communication protocol. I tested this feature by setting up a server and sending messages back and forth between the Worker Finder and the server; the server in this case stands in for a Raspberry Pi. There is still a little work left to do on the Worker Finder, as I am unclear about some of the specific details of the implementation. This coming week, I plan on working with Jared to hash out these details, as he is in charge of the Bus Protocol. I also plan on working on the Workload Manager, specifically using the third-party tool I found a couple of weeks back to measure the size of a model given its input parameters.
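The broadcast-and-listen pattern I tested looks roughly like the sketch below; the port number and message bytes are placeholders, not the values from our actual protocol.

```python
import socket

DISCOVERY_PORT = 50000              # placeholder port
DISCOVERY_MSG = b"WORKER_DISCOVER"  # placeholder message

def find_workers(timeout: float = 2.0) -> list:
    """Broadcast a discovery message and collect the addresses of any
    workers (Raspberry Pis) that reply before the timeout expires."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.settimeout(timeout)
    sock.sendto(DISCOVERY_MSG, ("<broadcast>", DISCOVERY_PORT))

    workers = []
    try:
        while True:
            _reply, addr = sock.recvfrom(1024)
            workers.append(addr)
    except socket.timeout:
        pass
    finally:
        sock.close()
    return workers
```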

Team Status Update for March 28

 

Contingencies (Risk Management)

Due to the instability of the situation, we need to plan for the contingency that our necessary hardware cannot be shipped to us. We are already planning to calculate, in simulation, the clock cycles necessary to perform essential hardware operations. Using those simulated numbers, we can estimate the model throughput of the entire system with the following method:

Ta: Time taken to assign all models to a board
Tt1: Time taken to train all models on a board on one input
TN: Time taken to train all models on a board on the entire dataset
N: Number of inputs in the dataset
LB: Bus latency for sending a model
LD: Data Pipeline Router latency for assigning a model
C1: Clock cycles taken to train the slowest model on a board on one input, given that the weights and input are already in memory
NM: Number of models on a board
M1: Model throughput for one board
MN: Model throughput for all boards

Ta = NM * LB + LD

Tt1 = C1 / (50M clock cycles per second)

TN = N * Tt1

M1 = NM / (Ta + TN)

MN = sum of M1 for each board in use

Our simulated estimate for model throughput can then be compared to the model throughput of the CPU and GPU hardware setups, in case we are not able to build the system and measure an end-to-end metric.
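To make the arithmetic concrete, here is a minimal sketch of the estimate in Python. Every input number below is a placeholder rather than a measured value; only the 50 MHz clock comes from the formula above.

```python
CLOCK_HZ = 50_000_000  # 50M clock cycles per second, per the formula above

def board_throughput(n_models, bus_latency, router_latency,
                     slowest_model_cycles, dataset_size):
    """M1: model throughput for one board, using the quantities defined above."""
    t_assign = n_models * bus_latency + router_latency       # Ta
    t_one_input = slowest_model_cycles / CLOCK_HZ             # Tt1
    t_dataset = dataset_size * t_one_input                    # TN
    return n_models / (t_assign + t_dataset)                  # M1

def system_throughput(boards):
    """MN: sum of M1 over every board in use."""
    return sum(board_throughput(**board) for board in boards)

# Placeholder numbers only: 4 boards, 15 models each, 50,000 CIFAR-10 inputs.
boards = 4 * [dict(n_models=15, bus_latency=0.05, router_latency=0.01,
                   slowest_model_cycles=2_000_000, dataset_size=50_000)]
print(system_throughput(boards))
```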

 

The rest of our Team Status Update is below:

Accomplishments

Theodor fully defined a control FSM for the Convolutional Forward FPU operation, which also defines the upper limit of requirements for the FPU Job Manager (i.e. how many registers are needed, whether specific ALU operations are necessary).

Mark finished a framework for the Worker Finder, using a UDP broadcast as discussed and verifying the framework using a local server that emulates the Raspberry Pi.

Schedule and Accomplishments for Next Week

Theodor will begin implementing the FPU Job Manager in SystemVerilog.

Mark will be working on the Workload Manager.

Updated Gantt Chart

Mark’s Status Report for March 21st

This week, I was able to provide preliminary numbers on the training results of our model test set on a GPU, as well as on a different, higher-end CPU. The full statistics can be seen at the bottom of this status report. The main takeaway is that although the GPU and CPU have almost the same average time, wider models trained much faster on the GPU than on the CPU, while deeper models trained faster on the CPU. I also realized that I was not fully utilizing the GPU when training each model sequentially. A possibility that I would like to explore in the future is training multiple models simultaneously on the GPU to get a better average throughput.
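For context, the per-model numbers below come from a timing loop of roughly the shape sketched here; the optimizer settings and the way the loss is aggregated are stand-ins, since the actual benchmarking code may differ.

```python
import time
import torch

def benchmark_models(models, loader, loss_fn, device):
    """Time each model's training run and print per-model stats in the same
    shape as the numbers listed at the bottom of this report."""
    times = []
    for i, model in enumerate(models, start=1):
        model.to(device)
        # Optimizer choice and learning rate here are placeholders.
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        start = time.perf_counter()

        total_loss = 0.0  # how the real benchmark aggregates loss is a guess
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        elapsed = time.perf_counter() - start
        times.append(elapsed)
        print(f"Model {i} stats: Time taken: {elapsed:.3f} seconds, "
              f"Loss value: {total_loss:.3f}")

    print(f"Trained models at an average time of {sum(times) / len(times):.3f} seconds")
```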

I also worked with TJ and Jared to write up the Statement of Work and to figure out our plan for the rest of the project. During this stage, I more concretely defined what I would be doing to validate the software portion of our project. These methods are described in more detail in the Team Status Report.

I would say I am on schedule this week.

This coming week, I plan to start working on setting up a form of validation for the software portion of the Project.

 

Hardware: NVIDIA GeForce RTX 2060 SUPER
Trained models with an average throughput of 24.454581
Trained models at an average time of 40.866 seconds
Model 1 stats: Time taken: 28.239 seconds, Loss value: 966.509
Model 2 stats: Time taken: 24.933 seconds, Loss value: 732.629
Model 3 stats: Time taken: 25.145 seconds, Loss value: 804.955
Model 4 stats: Time taken: 30.565 seconds, Loss value: 727.075
Model 5 stats: Time taken: 38.075 seconds, Loss value: 749.385
Model 6 stats: Time taken: 48.607 seconds, Loss value: 852.541
Model 7 stats: Time taken: 50.306 seconds, Loss value: 852.214
Model 8 stats: Time taken: 55.584 seconds, Loss value: 902.778
Model 9 stats: Time taken: 60.663 seconds, Loss value: 1151.682
Model 10 stats: Time taken: 65.919 seconds, Loss value: 1150.963
Model 11 stats: Time taken: 24.625 seconds, Loss value: 674.633
Model 12 stats: Time taken: 24.847 seconds, Loss value: 617.159
Model 13 stats: Time taken: 25.726 seconds, Loss value: 606.477
Model 14 stats: Time taken: 32.640 seconds, Loss value: 566.784
Model 15 stats: Time taken: 37.984 seconds, Loss value: 590.729
Model 16 stats: Time taken: 29.991 seconds, Loss value: 1055.854
Model 17 stats: Time taken: 30.329 seconds, Loss value: 965.328
Model 18 stats: Time taken: 30.178 seconds, Loss value: 992.295
Model 19 stats: Time taken: 35.499 seconds, Loss value: 956.919
Model 20 stats: Time taken: 43.565 seconds, Loss value: 1139.789
Model 21 stats: Time taken: 54.319 seconds, Loss value: 1137.847
Model 22 stats: Time taken: 55.855 seconds, Loss value: 1151.514
Model 23 stats: Time taken: 61.021 seconds, Loss value: 1151.118
Model 24 stats: Time taken: 66.462 seconds, Loss value: 1151.826
Model 25 stats: Time taken: 71.953 seconds, Loss value: 1151.610
Model 26 stats: Time taken: 30.304 seconds, Loss value: 936.803
Model 27 stats: Time taken: 30.150 seconds, Loss value: 906.943
Model 28 stats: Time taken: 30.375 seconds, Loss value: 904.646
Model 29 stats: Time taken: 38.212 seconds, Loss value: 895.794
Model 30 stats: Time taken: 43.902 seconds, Loss value: 949.094

Hardware: Intel(R) Core(TM) i7-9700F CPU @ 3.00 GHz
Trained models with an average throughput of 24.921083
Trained models at an average time of 40.100 seconds
Model 1 stats: Time taken: 12.247 seconds, Loss value: 994.131
Model 2 stats: Time taken: 14.784 seconds, Loss value: 759.224
Model 3 stats: Time taken: 14.622 seconds, Loss value: 811.628
Model 4 stats: Time taken: 20.136 seconds, Loss value: 715.087
Model 5 stats: Time taken: 22.665 seconds, Loss value: 768.794
Model 6 stats: Time taken: 24.846 seconds, Loss value: 838.152
Model 7 stats: Time taken: 33.105 seconds, Loss value: 889.310
Model 8 stats: Time taken: 36.692 seconds, Loss value: 973.237
Model 9 stats: Time taken: 39.287 seconds, Loss value: 965.914
Model 10 stats: Time taken: 39.266 seconds, Loss value: 1150.552
Model 11 stats: Time taken: 19.270 seconds, Loss value: 697.697
Model 12 stats: Time taken: 24.998 seconds, Loss value: 623.859
Model 13 stats: Time taken: 27.572 seconds, Loss value: 618.720
Model 14 stats: Time taken: 151.185 seconds, Loss value: 606.125
Model 15 stats: Time taken: 83.843 seconds, Loss value: 576.875
Model 16 stats: Time taken: 18.548 seconds, Loss value: 1068.568
Model 17 stats: Time taken: 22.174 seconds, Loss value: 963.580
Model 18 stats: Time taken: 20.550 seconds, Loss value: 971.986
Model 19 stats: Time taken: 26.596 seconds, Loss value: 1005.134
Model 20 stats: Time taken: 28.523 seconds, Loss value: 1049.507
Model 21 stats: Time taken: 31.286 seconds, Loss value: 1139.938
Model 22 stats: Time taken: 36.644 seconds, Loss value: 1152.150
Model 23 stats: Time taken: 37.675 seconds, Loss value: 1150.964
Model 24 stats: Time taken: 38.490 seconds, Loss value: 1152.036
Model 25 stats: Time taken: 40.988 seconds, Loss value: 1151.915
Model 26 stats: Time taken: 26.474 seconds, Loss value: 923.730
Model 27 stats: Time taken: 32.037 seconds, Loss value: 904.503
Model 28 stats: Time taken: 33.611 seconds, Loss value: 905.850
Model 29 stats: Time taken: 156.003 seconds, Loss value: 902.330
Model 30 stats: Time taken: 88.890 seconds, Loss value: 954.624

Team Status Update for March 21

We’ve pasted the information from our Statement of Work here:

Original Goal

From our initial project proposal, our goal for this project is to develop a scalable solution for training ML models using Field Programmable Gate Arrays (FPGAs) acting as a backend for existing machine learning frameworks. The hardware setup includes multiple Cyclone IV DE0-Nanos, a Data Source Machine (the user's laptop), and a bus between the FPGAs and the Data Source Machine, with the bus consisting of an Ethernet switch and the corresponding Ethernet cables. Using the CIFAR-10 dataset and our own set of ML models, we would evaluate our system's performance by measuring model throughput and comparing this value to the model throughputs of existing standards such as CPUs and GPUs.

 

Roadblock

Due to the recent outbreak and the subsequent social distancing practices, we are no longer allowed to meet in person. We have also lost access to the lab we were working in, as well as to some of our hardware components, specifically the DE0-Nano boards, the Ethernet switch, and the Ethernet cables. Without the required hardware, it is impossible for us to build the physical product, and in turn we cannot measure the performance of a system that cannot be physically built.

 

Solution

Thankfully, the worst-case scenario above was not realized, and we either have or can order all of the parts that we need. The software needed to program our hardware will also not be an issue, since we are able to install Quartus on our own machines and synthesize code for a DE0-Nano.

Overall, our implementation plan did not change much from the plan we presented in the Design Review Document. Likewise, the planned metrics we described in the Design Review Document have not changed. We will still use Model Throughput and Model Throughput Per Dollar as metrics to compare our system to other hardware standards. Success for the system as a whole still means that our system outperforms a CPU in terms of model throughput and outperforms a GPU in terms of model throughput per dollar. For the hardware subsystem, we will still measure the clock cycles taken to make memory accesses and perform FPU ops to identify bottlenecks, and we will still measure the throughput of the Bus subsystem.

Since we are unable to meet, our solution is to validate each of the three subsystems independently of one another. We can thus verify that our implementations work and hand off final working code to Jared so that he can validate and acquire metrics on the system as a whole. Our plan is to acquire an FPGA and Raspberry Pi setup for TJ (to verify the hardware implementation) and up to four identical systems for Jared.

 

Contingencies

Due to the instability of the situation, we need to plan for the contingency that our necessary hardware cannot be shipped to us. We are already planning to calculate, in simulation, the clock cycles necessary to perform essential hardware operations. Using those simulated numbers, we can estimate the model throughput of the entire system with the following method:

Ta: Time taken to assign all models to a board
Tt1: Time taken to train all models on a board on one input
TN: Time taken to train all models on a board on the entire dataset
N: Number of inputs in the dataset
LB: Bus latency for sending a model
LD: Data Pipeline Router latency for assigning a model
C1: Clock cycles taken to train the slowest model on a board on one input, given that the weights and input are already in memory
NM: Number of models on a board
M1: Model throughput for one board
MN: Model throughput for all boards

Ta = NM * LB + LD

Tt1 = C1 / (50M clock cycles per second)

TN = N * Tt1

M1 = NM / (Ta + TN)

MN = sum of M1 for each board in use

Our simulated estimate for model throughput can then be compared to the model throughput of the CPU and GPU hardware setups, in case we are not able to build the system and measure an end-to-end metric.

 

The rest of our Team Status Update is below:

Accomplishments

Theodor is almost finished writing the Memory Port Controller. The cache (which asserts control signals to the SDRAM and M9K memory controllers) has been implemented, and the Memory Port Controller itself will be finished by the end of next week. He has also started working on the more difficult operations performed by the FPU Job Manager, which will make it easier to develop the many simpler operations quickly.

 

Schedule and Accomplishments for Next Week

Next week, Theodor will be finished defining the difficult FPU operations and will have started implementing them. He will also be done working on the Memory Port Controller.

Team Status Update for March 14

Accomplishments

Software:

We talked in more detail about the Python API design; the new changes make it easier for the user to write code to train their specific set of models. We also ran into an issue on the software side with figuring out the size, in weights, of a particular model. The solution was to use a third-party Python package called 'torchsummary' to help calculate the size of the model. This PyTorch package replicates the Keras API that prints out model statistics for a given model. GPU benchmarking is almost complete; we are currently fixing a small bug with the metrics.
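As a rough usage sketch (assuming the summary(model, input_size, device=...) interface of the torchsummary package; the stand-in model below is not one of our benchmark models):

```python
import torch.nn as nn
from torchsummary import summary

# A made-up stand-in model; our real models live elsewhere.
model = nn.Sequential(
    nn.Linear(3 * 32 * 32, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Prints a Keras-style table with per-layer output shapes, parameter counts,
# and an estimated parameter size in MB that the Workload Manager could use.
summary(model, input_size=(3 * 32 * 32,), device="cpu")
```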

 

 

 

Schedule Changes

Due to the recent outbreak of the COVID-19 virus, we expect a schedule change in the coming week since we no longer have access to our lab and will not be able to meet in person anymore.

 

 

Mark’s Status Report for March 14th

This week, I started working on a helper function inside our API that calculates the size of a user-provided model. However, while working on it, I realized that this function would be difficult to implement with the APIs available. Because we are fairly limited on hardware storage space (~30 MB), we need to know the size of each model so that the Workload Manager does not overload any particular board with too many models. Initially, I was stuck, since PyTorch does not have a built-in API for calculating the size of a model. Luckily, a third-party package called 'torchsummary' exists, and I will be looking into it this coming week in order to finish the size calculation function.
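Independent of torchsummary, PyTorch tensors do expose their element counts and element sizes, so a minimal fallback for the size check could look like the sketch below; the ~30 MB budget constant and the function names are placeholders of mine, and this counts only parameter storage, not activations or gradients.

```python
import torch.nn as nn

BOARD_BUDGET_BYTES = 30 * 1024 * 1024   # rough per-board storage limit (~30 MB)

def model_size_bytes(model: nn.Module) -> int:
    """Bytes needed to store a model's parameters (weights and biases only)."""
    return sum(p.numel() * p.element_size() for p in model.parameters())

def fits_on_board(models) -> bool:
    """Check whether a batch of models stays within one board's storage budget."""
    return sum(model_size_bytes(m) for m in models) <= BOARD_BUDGET_BYTES
```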

Additionally, I set up a benchmarking suite for the GPU. However, due to the recent outbreak, I was unable to test it on the originally planned GPU (an NVIDIA 1080) and instead used an NVIDIA GeForce RTX 2060 SUPER. There were also some issues with the models having incorrect parameters, resulting in an incomplete run of the benchmarking suite, so the numbers will be updated soon. I also slightly reworked the skeleton of the user code to make it easier to use.

Overall, I would say I am on schedule.

This coming week, I plan on fixing the GPU benchmarking bug, which will allow me to train the full set of defined models and in turn provide another cost-effectiveness value that we can use to validate our final design. Additionally, I plan on finishing the helper function that calculates an ML model's size and writing some additional helper functions for the Python API.