Team Status Report (12th April)

This week, we had the following goals:

  1. Re-evaluate the 700M model and see whether it would still fit on the FPGA after removing PYNQ; we didn’t want any surprises.
  2. Identify further speedups that we could generate and pull the profiling data from the FPGA.
  3. Extend the UI script into a generic system that anyone connected to CMU WiFi can use.

We did manage to achieve goal (1) (see Anirudh’s status report) but had some trouble with points (2) and (3). We are almost done with both and as a result only slightly behind schedule; we would like to finish them early next week.

So over the next week we have the following goals:

  1. Delete PYNQ and sub in the new model.
  2. Run the advanced UI script on the rest of our laptops and get the full system working.
  3. Pull power and profiling graphs on demand (see the sketch below).

This would essentially get our entire system done and ready for the final presentation and demos.
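For goal 3, here is a minimal sketch of what pulling power graphs on demand could look like on the board. It assumes the board exposes its power sensor through the standard Linux hwmon interface; the exact hwmon path, sensor file, and units are assumptions to verify on our hardware before relying on them.

```python
# Hypothetical sketch: sample on-board power via Linux hwmon and plot it.
# The hwmon path and sensor name below are assumptions -- check the board
# with `grep . /sys/class/hwmon/hwmon*/name` first.
import glob
import time
import matplotlib.pyplot as plt

def find_power_sensor():
    # INA-style sensors typically report power in microwatts via power1_input.
    for path in glob.glob("/sys/class/hwmon/hwmon*/power1_input"):
        return path
    raise RuntimeError("no hwmon power sensor found")

sensor = find_power_sensor()
times, samples = [], []
start = time.time()
for _ in range(30):                           # one sample per second
    with open(sensor) as f:
        samples.append(int(f.read()) / 1e6)   # microwatts -> watts
    times.append(time.time() - start)
    time.sleep(1)

plt.plot(times, samples)
plt.xlabel("time (s)")
plt.ylabel("power (W)")
plt.savefig("fpga_power.png")
```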

Team Status Report (March 29th)

This week, we set things up for the interim demo. A couple of changes that we made:

  1. We moved from a 700M-parameter model to a 165M-parameter model.
  2. We swapped quantization techniques because the custom kernel could not accept the smaller model under the original quantization scheme.
  3. We used a client-based access control system, and, as Prof. Theo observed, our resource requests exhibit starvation (see the sketch after this list).
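For illustration only (this is not our actual implementation), here is a minimal sketch of the kind of first-come-first-served queue that would serialize client requests and bound waiting time, avoiding the starvation we observed; `run_inference` is a placeholder for the real FPGA call.

```python
# Sketch: a FIFO queue serves clients strictly in arrival order, so no
# client can be starved by a steady stream of requests from another.
import queue
import threading

def run_inference(prompt):
    # Placeholder for the actual FPGA inference call.
    return prompt.upper()

requests = queue.Queue()          # thread-safe FIFO: strict arrival order

def worker():
    while True:
        client_id, prompt, done = requests.get()   # blocks until a request arrives
        print(f"served {client_id}: {run_inference(prompt)}")
        done.set()
        requests.task_done()

threading.Thread(target=worker, daemon=True).start()

# Two clients submitting back to back; the worker drains the queue in order.
pending = []
for cid in ("client_a", "client_b"):
    ev = threading.Event()
    pending.append(ev)
    requests.put((cid, f"hello from {cid}", ev))
for ev in pending:
    ev.wait()
```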

The details will be analyzed further in individual status reports.

We are well on schedule and have actually hit our basic MVP. The only thing left to do now would be to iterate and try to improve the performance of our system. We are also exploring how to extract power and telemetry signals from the FPGA for simple visualizations.

Team Status Report (March 15th)

This week, we received our FPGA and the plug-in wiring needed to supply power.

In order to move forward according to our plans, we began with the following tasks in parallel:

  1. Booting Linux on the FPGA in order to start running the model on the embedded core.
  2. Extending the UI script to a multi-client, FPGA-based approach.
  3. Working out the UI testing system (developing the form and our exact method of analyzing the responses).

The details of how each task was accomplished are in the individual status reports.

At this time, we have accomplished the following:

  1. Sent a query to the FPGA (see the sketch after this list).
  2. Booted Linux on the board.
  3. Finalized the questions for the UI user study.
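For context, here is a minimal sketch of what sending a query looks like from the laptop side. The TCP socket, board address, port, and newline-delimited protocol are all placeholders for illustration, not our final interface (in practice the transport could just as well be SSH).

```python
# Hypothetical laptop-side sketch: send a prompt to the board over a plain
# TCP socket and read back the completion.
import socket

BOARD_ADDR = ("192.168.2.99", 5000)          # placeholder IP/port for the board

def query(prompt: str) -> str:
    """Send one prompt to the FPGA and read back the completion."""
    with socket.create_connection(BOARD_ADDR, timeout=10) as sock:
        sock.sendall(prompt.encode() + b"\n")    # one request per line
        chunks = []
        while (data := sock.recv(4096)):         # read until the server closes
            chunks.append(data)
    return b"".join(chunks).decode()

print(query("The quick brown fox"))
```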

Currently we are ahead of schedule on the technical fronts, but probably a bit behind on the user study. For now this is not too concerning, given that it is easy to adjust the UI script based on user feedback, and we are already improving it for the FPGA extension.

Team Status Report 03/08/2025

This week we had a couple of targets that were mostly achieved:

  1. We noticed that our setup script had some issues and was a bit unstable to run. Since it was still reasonably fast, we had no problem re-running it until it completed, so we wrapped the script in a loop with a try-except block that keeps running until the full setup succeeds (see the sketch after this list). We would have liked to debug the root cause, but there was little to gain from doing so, and we preferred to focus on the hardware segment.
  2. We analyzed the bitnet paper that Microsoft published, came up with an overall block diagram to accelerate the system, and did some preliminary calculations on the speedup we could attain over the classical form of the core we were using. From the looks of it, we can save a number of cycles and shrink the overall size of the arithmetic blocks to meet the speed specifications that we had.
  3. We analyzed the ethical impacts of our project and completed the design review report.
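The wrapper from item 1 is simple enough to sketch here; `run_full_setup` stands in for our actual setup entry point.

```python
# Re-run the occasionally flaky setup until it completes cleanly.
import time

def run_full_setup():
    ...  # stands in for the existing setup logic

while True:
    try:
        run_full_setup()
        break                          # the whole setup completed cleanly
    except Exception as exc:           # it died partway through: log and retry
        print(f"setup failed ({exc}); retrying...")
        time.sleep(1)
```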


Over the next week, our aims are:

  1. We should get the final FPGA and then synthesize our base core and model onto it. Using this, we want to benchmark the following:
    1. Total size footprint: see if we can fit a bigger model in.
    2. Tokens/sec and latency to first token: this tells us how much of a speedup we need over the existing hardware system, and we will probably need to adjust the block diagram to meet this value (see the sketch after this list).
    3. Power telemetry: this is a new FPGA, so we need to work out how to pull power data from it.
  2. We would also like to extend the UI script to interface with the FPGA and start thinking about the authentication and scheduling systems for multi-access, mainly to see whether it is in fact feasible, not to settle on how we would implement it.
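For the throughput and latency numbers, here is a minimal sketch of the measurement loop, assuming the runtime exposes a token-by-token generator; `stream_tokens` is a placeholder for whatever interface our model actually provides.

```python
# Measure latency to first token and steady-state tokens/sec.
import time

def stream_tokens(prompt):
    # Placeholder: the real runtime yields tokens one at a time like this.
    for tok in ["Four", " score", " and", " seven", " years"]:
        time.sleep(0.05)
        yield tok

start = time.perf_counter()
first_token_latency = None
count = 0
for tok in stream_tokens("benchmark prompt"):
    if first_token_latency is None:
        first_token_latency = time.perf_counter() - start
    count += 1
elapsed = time.perf_counter() - start

print(f"latency to first token: {first_token_latency:.3f} s")
print(f"throughput: {count / elapsed:.1f} tokens/sec")
```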

Team Status Report for Feb 22

This week we focused on a few different things: the design presentation, which was mainly covered by Andrew; UI development, which was covered by Anirudh and me (Amelia); and FPGA drama, which we all dealt with.

We finally got the UI setup scripts to work on other users’ computers, something we’d been working toward for a while. Once the setup process was finalized, we worked on a more usable interface that fits the use-case requirements laid out in our design presentation. We want the UI setup and run scripts completed soon because we plan to do preliminary user testing while there is still plenty of time to make changes based on anything we learn.

The FPGA drama of the week: we found out that the Ultra96-V2 board we got is part of a broken batch with non-functional WiFi. Without WiFi we can’t SSH into the board, so we had two options: pick a new board, or obtain a wired UART extension board. We decided to try the Kria KV260, another board we had initially considered due to its greater memory capacity. Our initial hesitation with this board was that none of us were familiar with Vitis, but after a little reading we felt we could learn those EDA tools well enough to develop a synthesis flow. We may now also be able to work with a larger model, which will give us more accurate text completions.

Group Goals for the next week:

  1. We have to complete the design review report by Friday.
  2. Our new FPGA is arriving soon, so we need to start working on our new synthesis flows.

Team Status Update Feb 15th

This week we focused on wrapping up the auxiliary tasks that lead up to the final stage of the project, where we’ll focus on iterating on the hardware accelerator. Namely:

  1. Setting up our benchmarking and profiling system for the baseline.
  2. Setting up the FPGA connectivity and synthesis flow.
  3. Evaluating a chain-of-thought alternative model for improving model accuracy.

We managed to complete the benchmarking and profiling system and eventually decided against using deepseek-r1’s smaller variant; however, the FPGA system did not end up working as we expected.

We found that the FPGA we used had some flaws in its WiFi connectivity setup, which prevents us from servicing multiple clients at the same time.

Our goals for next week are:

  1. Run our benchmarking and profiling system on a wide spectrum of input tokens, and collect a comprehensive characterization dataset on our Macs.
  2. Swap to a functional FPGA with WiFi capability, then boot Linux and bring up our synthesis flow on the board. In the meantime, while the replacement FPGA is in transit, we can still try to synthesize the model onto our current board and get that working, but this will only be worthwhile if the replacement is the same FPGA type.
  3. Preemptively prepare the interconnect from FPGA to laptop and begin drawing a block diagram for the accelerated system.

For status report 2: A was written by Anirudh, B was written by Amelia, and C was written by Andrew.

Team Status Report for Feb 8th

We started this week focused on completing the proposal presentation, which included narrowing down our use case to users who want text completion models but are unable to use commercial products due to privacy concerns with sending sensitive information to the cloud. After the proposal presentation, we received feedback that changed our approach to benchmarking: instead of synthesizing CPU/GPU cores onto our FPGA to generate timing and power benchmarks, we are now exploring a way to measure those benchmarks on a Mac, which lets us start developing and synthesizing our architecture sooner than anticipated. In terms of the schedule, we now have more room for slack, which will be key since we have moved integration toward the beginning of the project and will likely run into hurdles getting the host computer and FPGA communicating.

We got our FPGA this week, the Ultra96-V2, and are now in the process of booting Linux on it (and finding a power supply). We also got a UI working for all text boxes on a Mac, as well as a Python script that automates the installation of all libraries required to use the autocomplete feature. The next steps for the UI include finalizing a model that is small enough to fit in the DDR memory on our FPGA but still has decent outputs. One risk we have identified is that we haven’t tested the installation process on any computers other than our own, so we may conduct some user testing to ensure the installation is simple for people with and without technical skills.

Our group goals for next week are:

  1. Finalize a model that is small but has potentially higher output quality than what we are currently working with.
  2. Boot Linux on the FPGA.
  3. Figure out how to get timing and power data from macOS (see the sketch after this list).
  4. Conduct preliminary user testing (and develop a quantifiable metric to benchmark its quality).
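For goal 3, one likely starting point is macOS’s built-in `powermetrics` tool. A minimal sketch follows; the exact output lines vary by machine (the "CPU Power: ... mW" format is what Apple silicon reports), so the parsing here is illustrative only.

```python
# Pull CPU power samples on macOS via powermetrics (requires sudo):
# five samples at one-second intervals.
import subprocess

out = subprocess.run(
    ["sudo", "powermetrics", "--samplers", "cpu_power", "-i", "1000", "-n", "5"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    if "Power" in line:               # e.g. "CPU Power: 1234 mW" on Apple silicon
        print(line.strip())
```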

Our Project Idea

We aim to address two current challenges in ML applications:

  1. Models are too heavyweight to run on local machines.
  2. Models consume excessive energy, making them environmentally unsustainable.

To address these problems, we plan to develop an FPGA-based accelerator as a precursor to an ASIC capable of running smaller, lightweight “bitnets” locally.

Bitnets are highly quantized versions of their bulky base models, and recent research by Microsoft, Tsinghua University, and the Chinese Academy of Sciences has shown that such models can be trained with minimal loss in output quality.
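A toy illustration of why this quantization maps well to hardware (our own example, not from the paper): with 1.58-bit ternary weights in {-1, 0, +1}, a dot product needs only adds and subtracts, no multipliers.

```python
# Ternary weights turn a dot product into adds and subtracts.
weights = [1, -1, 0, 1]                # 1.58-bit ("ternary") weights
activations = [0.5, 2.0, 3.0, -1.0]

acc = 0.0
for w, x in zip(weights, activations):
    if w == 1:
        acc += x                       # add instead of multiply
    elif w == -1:
        acc -= x                       # subtract instead of multiply
    # w == 0 contributes nothing

print(acc)                             # 0.5 - 2.0 - 1.0 = -2.5
```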

Our proof of concept will demonstrate architectural improvements, achieving faster text generation compared to FPGA-based CPU/GPU systems of similar size and power class. We will validate our approach using a heavier text completion model.

Currently, we are working on identifying the ideal bitnet model that we aim to accelerate, using the following considerations to evaluate the models:

  1. The models should be small enough to run on the FPGA’s limited hardware resources.
  2. The models should produce outputs good enough to be used for applications like text or code completion, with a future goal of predictive text completion.

    Currently:

    Amelia is investigating potential text-to-text models that we could use (based on Microsoft’s bitnet framework).
    Andrew is looking into the potential of retraining a Flux text-to-image model for smaller size (based on the work of Black Forest Labs).
    Anirudh is trying to create a quantization and training system for the Flux text-to-image models so that they can be compressed to bitnets (based on the work of TikTok Research).