Amelia’s Status Report for Feb 8th

At the beginning of this week I focused on finishing our proposal presentation slides and practicing for the presentation. In addition to practicing, I spent some time refining and narrowing our use case in order to guide benchmark creation for some of the more subjective elements of our project (like user experience). After finishing the proposal presentation I looked into new models for our text completion assistant. Specifically, we found that someone had quantized DeepSeek, and I wanted to see if it would be possible for us to fit that on our FPGA. I started by downloading the quantized model and setting it up to run on my computer, to test its output and confirm that the quantized version was still of decent quality. The first roadblock I ran into was that this quantized model requires around 3 GB of memory, and our Ultra96v2 FPGA only has 2 GB of RAM. Unfortunately, the group that quantized DeepSeek did not provide their quantization code, so I reached out to ask whether they would be able to share it with me. If they do, I plan to look into quantizing the model further to see if we can fit it in the RAM on our FPGA.

For this coming week, my goals are:

  1. Figure out how to load the model onto the FPGA (using the softcore to scp files)
  2. Get the FPGA-to-computer UART framework in place
  3. Develop a metric for usability of our UI and conduct some preliminary user testing
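
As a starting point for the UART goal above, here is a minimal sketch of one possible message framing between the host and the FPGA. Everything in it is an assumption rather than a settled design: the start byte value, the 2-byte length field, the additive checksum, and the names `encode_frame` and `decode_frame` are all hypothetical, and the sketch only builds and checks byte strings without touching a serial port.

```python
START_BYTE = 0x7E  # hypothetical frame marker, not a decided value


def encode_frame(payload: bytes) -> bytes:
    """Wrap payload as: start byte, 2-byte big-endian length, payload, checksum."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length field")
    checksum = sum(payload) & 0xFF  # simple additive checksum (assumed)
    header = bytes([START_BYTE]) + len(payload).to_bytes(2, "big")
    return header + payload + bytes([checksum])


def decode_frame(frame: bytes) -> bytes:
    """Validate a frame produced by encode_frame and return its payload."""
    if len(frame) < 4 or frame[0] != START_BYTE:
        raise ValueError("missing start byte")
    length = int.from_bytes(frame[1:3], "big")
    payload, checksum = frame[3:3 + length], frame[3 + length:]
    if len(payload) != length or len(checksum) != 1:
        raise ValueError("frame length mismatch")
    if checksum[0] != sum(payload) & 0xFF:
        raise ValueError("checksum mismatch")
    return payload


if __name__ == "__main__":
    frame = encode_frame("what did I do today?".encode("utf-8"))
    print(decode_frame(frame).decode("utf-8"))
```

A length prefix lets the receiver know how many bytes to wait for, and the checksum gives a cheap way to detect a corrupted prompt or completion before it reaches the UI.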

Team Status Report for Feb 8th

We started this week focused on completing the proposal presentation, which included narrowing down our use case to users who want to use text completion models but are unable to use commercial products due to privacy concerns with sending sensitive information to the cloud. After the proposal presentation, we received some feedback that caused us to change our approach to benchmarking. Instead of synthesizing CPU/GPU cores onto our FPGA to generate timing and power benchmarks, we are now exploring a way to measure those benchmarks on a Mac, which allows us to start developing and synthesizing our architecture sooner than anticipated. In terms of our schedule, we now have more slack, which will be key because integration has to happen closer to the beginning of our project and we will likely run into hurdles getting the host computer and FPGA communicating.

We got our FPGA this week, the Ultra96v2, and are now in the process of booting Linux on it (and finding a power supply). We also got a UI working for all text boxes on a Mac, as well as a Python script that automates the installation of all libraries required to use the autocomplete feature. The next steps for the UI include finalizing a model that is small enough to fit in the DDR memory on our FPGA but still produces decent outputs. One risk we have identified is that we haven’t tested the installation process on any computers other than our own, so we may conduct some user testing to ensure it’s a simple installation process for people with and without technical skills.

Our group goals for next week are:

  1. Finalize a model that is small but has a potentially higher output quality than what we are currently working with
  2. Boot Linux onto the FPGA
  3. Figure out how to get timing and power data from macOS
  4. Conduct preliminary user testing (and develop a quantifiable metric to benchmark its quality)

Anirudhp_status_Feb8th2025

So this week while Andrew and Amelia were finalizing the model and setting up the FPGA, I dealt with the user interface and hotkey setup.

I used a Lua interface that sits above the macOS kernel to trigger software interrupts, and packaged the entire system into a single Python script that allows the hotkey “CMD + G” to trigger our bitnet LLM of choice.

Currently, our bitnet performs reasonably fast, taking around 4-5 seconds on a manual stopwatch to generate the output. However, it does not stream the output token by token; instead it displays the entire output at once. This will have to be fixed over the next week.
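
The stopwatch figure could be replaced by a scripted measurement. The following is a minimal sketch, assuming the model can be called through a Python function; `generate` and `fake_generate` are hypothetical stand-ins, not our actual bitnet invocation, and the numbers it prints describe only the placeholder workload.

```python
import statistics
import time


def time_generation(generate, prompt, runs=5):
    """Return per-run wall-clock latencies (seconds) for generate(prompt)."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)  # hypothetical: blocks until the full output is ready
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Summarize latencies with the median and the worst case."""
    return {
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
        "runs": len(latencies),
    }


if __name__ == "__main__":
    # Placeholder workload standing in for the real model call.
    def fake_generate(prompt):
        time.sleep(0.01)
        return prompt + " ..."

    print(summarize(time_generation(fake_generate, "what did I do today?")))
```

Reporting the median over several runs rather than a single reading should make the number less sensitive to one slow run, for example the first call after the model is loaded.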

While I have not taken any power measurements yet, I did notice that it turned my laptop’s fan on after I ran it 10-15 times in quick succession.

My goals for the next week are:

  1. Benchmark the model on purely macOS-based infrastructure.
  2. Allow the system to stream tokens rather than displaying them all at once.
  3. Figure out some way to take power measurements and benchmarks for the Mac-based runtime.
  4. Benchmark the model for safety and look into quantizing a DeepSeek-like system in order to reduce hallucinations and improve accuracy (reasoning-based models are inherently better in this regard).
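
For the streaming goal, one possible shape is to read the model output incrementally and hand each piece to the UI as it arrives. This is a minimal sketch under assumptions: the inference process is assumed to expose its output as a text stream (such as a stdout pipe), and `stream_chunks`, `display_streaming`, and the `on_chunk` callback are hypothetical names, not part of our current script.

```python
import io


def stream_chunks(source, chunk_size=1):
    """Yield text from a file-like object as it arrives, instead of waiting
    for the whole output. `source` is assumed to be a text stream such as
    the stdout pipe of the inference process."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty string means end of stream
            break
        yield chunk


def display_streaming(source, on_chunk):
    """Forward each chunk to the UI callback and return the full text."""
    parts = []
    for chunk in stream_chunks(source):
        on_chunk(chunk)  # hypothetical UI hook, e.g. append to the text box
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    # StringIO stands in for the model's output pipe.
    fake_model_output = io.StringIO("I went to class.")
    display_streaming(fake_model_output, lambda c: print(c, end="", flush=True))
    print()
```

Returning the joined text as well as streaming it keeps the existing all-at-once path available as a fallback.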

Amelia’s Status Report for Feb 1st

This week I explored a number of trained BitNets that are supported by Microsoft’s bitnet inference framework. The goal was to find a model that is small enough to fit on an FPGA but works well enough to be repurposed into a viable product.

Initially, we wanted to work with a model that had only 70M parameters, in the hopes that we could fit that model on any FPGA we wanted. However, after trying to chat with it, I found that the low number of parameters contributed to very poor performance as seen in my conversation with it below:

I tested a few more models with more parameters (up to 10B) from this family of models. While they perform significantly better, these models are too large to fit on any FPGA we can afford (the 10B parameter model is around 3 GB after quantization). I ultimately settled on this model, which has around 700 million parameters and is around 300 MB after quantization. This model is for text completion, as you can see below, so that is likely the direction we will take for our final project.

The prompt here was “what did I do today?” and it autocompleted the rest.
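
The model sizes quoted in this report can be sanity-checked with a rough lower bound on the storage needed for the weights alone. This is a sketch, not a measurement: the 2 bits per weight packing is an assumption, and `packed_weight_size_mb` is a hypothetical helper.

```python
def packed_weight_size_mb(num_params, bits_per_weight):
    """Lower-bound size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


if __name__ == "__main__":
    for name, params in [("70M", 70e6), ("700M", 700e6), ("10B", 10e9)]:
        # 2 bits per weight is an assumed packing for ternary values.
        print(name, round(packed_weight_size_mb(params, 2), 1), "MB")
```

Under that assumption the bound is about 175 MB for 700 million parameters and about 2.5 GB for 10B parameters. The file sizes reported above (around 300 MB and 3 GB) sit above this bound, which would be consistent with some tensors being stored at higher precision, although that explanation is itself an assumption.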

Andrew’s Status Report for Feb 1st

I am currently working on selecting proper CPU and GPU soft cores to be synthesized onto the FPGA for performance and power efficiency.

I have looked into multiple open-source RISC-V IPs, including the Rocket-Core (a widely known UC Berkeley project based on HLS (High Level Synthesis) languages), the VexRiscv project (frequently used in 18-525/725 tape-outs and proven to work in multiple real chips), and the Hazard3 core, designed by Luke Wren, principal engineer at Raspberry Pi, which is currently on board multiple RPi products. I worked with all of these projects and decided to go ahead with the VexRiscv core as the benchmark softcore because:

  1. It has a long history of success: the project is designed as an FPGA softcore and has been verified on multiple FPGA fabrics, including ones that we might use later in the project, unlike the Hazard3 core, which is designed to be used in silicon.
  2. The project is simple and has lots of examples to draw from, while the Berkeley Rocket-Core and the Chipyard framework have a huge dependency footprint of more than 30 GB in total and ended up not working out of the box.
  3. VexRiscv, also being an HLS project, offers great flexibility as well and has stock options for multiple bus protocols, which will facilitate communication when synthesized onto the FPGA. It also supports directly booting Linux for even greater ease of use.

Currently my progress is on schedule. The next steps are testing the Vex soft core on the 240 FPGA (we are planning on using Xilinx boards, so the Vivado toolchain would match) and finding and evaluating appropriate GPU soft IPs as well.

Anirudhp_29thJan2025

I am currently working on recreating the 1.58-bit Flux model announced by ByteDance Research.

However, the model that they trained shows a 7.7x size reduction over the existing 23.5 GB Flux model released by Black Forest Labs. The resulting model is still in excess of 3 GB and cannot be accommodated on the FPGAs that we have access to (max size 2 GB).

I have replicated the quantization process for the Flux model. However, even though the model was open sourced by Black Forest Labs, the training code and training data are not provided. As a result, I am currently trying to adapt the quantization system for a fully open-source text-to-image system such as DALL-E Mini or the first Flux.1 Dev model that was released.

That said, the Flux model, when quantized to 1.58 bits, does produce excellent outputs that are almost on par with the original model.

E.g., “A man using a soldering iron to repair a broken electronic device” produces:

My goal for the end of next week is to identify a way of using an FPGA that can accommodate the larger models (using either a DIMM slot or, in an extreme case, networking two FPGAs).

If this is not possible, I will either distill the Flux model or recreate the quantization code for DALL-E Mini.
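
The size constraint described in this report can be checked with quick arithmetic. The sketch below uses only the figures quoted here (23.5 GB original, 7.7x reduction, 2 GB limit); `required_compression` is a hypothetical helper, and the budget ignores anything else that must share the memory, so the real requirement would be stricter.

```python
def required_compression(original_gb, budget_gb):
    """Compression ratio needed for a model to fit a memory budget."""
    return original_gb / budget_gb


if __name__ == "__main__":
    original_gb = 23.5     # Flux model size quoted in this report
    achieved_ratio = 7.7   # reported size reduction
    budget_gb = 2.0        # memory available on our FPGAs

    # Compressed size at the reported ratio: prints 3.05
    print(round(original_gb / achieved_ratio, 2))
    # Ratio needed to fit the budget: prints 11.75
    print(round(required_compression(original_gb, budget_gb), 2))
```

In other words, the model would need roughly another 1.5x reduction beyond the reported 7.7x before it fits, which is the gap that more memory or distillation would have to close.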

Our Project Idea

We aim to address two current challenges in ML applications:

  1. Models are too heavyweight to run on a local machine.
  2. Models consume excessive energy, making them environmentally unsustainable.

To address these problems, we plan to develop an FPGA-based accelerator as a precursor to an ASIC capable of running smaller, lightweight “bitnets” locally.

Bitnets are highly quantized versions of their base bulky models, and recent research by Microsoft, Tsinghua University and the Chinese Academy of Sciences has shown that such models can be trained with a minimal loss in model output quality.
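
To make the idea of a highly quantized model concrete, here is a minimal sketch of ternary weight quantization in the style of the absmean scheme described for BitNet b1.58: each weight is divided by the mean absolute value of the group, rounded, and clipped to -1, 0, or +1. The specific models we evaluate may use a different scheme, and `absmean_ternary` and `dequantize` are hypothetical names written in plain Python for illustration.

```python
import math


def absmean_ternary(weights, eps=1e-8):
    """Quantize a list of floats to {-1, 0, +1} with one shared scale.

    Divides by the mean absolute value, rounds, then clips to [-1, 1].
    Returns (ternary_values, scale).
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Approximate the original weights from ternary values and the scale."""
    return [t * scale for t in ternary]


if __name__ == "__main__":
    w = [0.9, -0.05, 0.4, -1.2]
    q, s = absmean_ternary(w)
    print(q)             # ternary weights
    print(math.log2(3))  # information per ternary weight, about 1.58 bits
```

With only three possible values per weight, multiplications in a matrix product reduce to additions, subtractions, and skips, which is the property an FPGA design can take advantage of.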

Our proof of concept will demonstrate architectural improvements, achieving faster text generation compared to FPGA-based CPU/GPU systems of similar size and power class. We will validate our approach using a heavier text completion model.

Currently, we are working on identifying the ideal bitnet model that we aim to accelerate, using the following considerations to evaluate the models:

  1. The models should be small enough to run on the FPGA’s limited hardware resources.
  2. The models should produce outputs good enough for applications like text or code completion, with a future goal of predictive text completion.

    Currently:

    Amelia is investigating potential text-to-text models that we could use (based on Microsoft’s bitnet framework).
    Andrew is looking into the potential of retraining a Flux text-to-image model for smaller size (based on the work of Black Forest Labs).
    Anirudh is trying to create a quantization and training system for the Flux text-to-image models so that they can be compressed to bitnets (based on the work of TikTok Research).