anirudhp_status_report_03/08/25

This week, I focussed on two primary aspects of the project:

  1. Ethical considerations and how they will adjust the benchmark. For this system, I made some minor changes so that the model simply refuses to autocomplete certain types of text, e.g. medical or urgent-action requests.
  2. Analysing the Microsoft BitNet paper in order to suggest performance improvements that we could target.
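The refusal behaviour in item 1 could be implemented as a simple filter that sits in front of the model; the sketch below shows one way to do it, where the category names and keywords are illustrative assumptions rather than the project's actual lists.

```python
# Minimal sketch of a pre-generation refusal filter.
# The categories and keywords here are illustrative assumptions,
# not the actual lists used in the project.
REFUSAL_CATEGORIES = {
    "medical": ["diagnosis", "dosage", "prescription", "symptoms"],
    "urgent_action": ["emergency", "immediately call", "urgent"],
}

def should_refuse(prompt: str) -> bool:
    """Return True if the prompt falls into a category we refuse to autocomplete."""
    lowered = prompt.lower()
    return any(
        keyword in lowered
        for keywords in REFUSAL_CATEGORIES.values()
        for keyword in keywords
    )

def autocomplete(prompt: str, model_generate) -> str:
    """Wrap the model call: refuse instead of generating for flagged prompts."""
    if should_refuse(prompt):
        return "[refused: this category of text is not autocompleted]"
    return model_generate(prompt)
```

Refusing before generation, rather than filtering the output afterwards, is what trades hallucination rate for refusal rate as described below.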

Overall, what I was able to achieve:

  1. Reduced the hallucination rate by over 6%, though this naturally came at the expense of the model simply refusing to provide an output in some cases.
  2. Identified the look-up-table implementation and the indexing system as the major speedup opportunities, which would provide roughly 40% more throughput in the system.
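The look-up-table idea from the BitNet paper replaces per-element multiplies by ternary weights with precomputed partial sums that are fetched by index. The sketch below illustrates the principle in software; the group size of 2 and the packing scheme are my assumptions for illustration (hardware would use larger groups and bit-packed indices).

```python
import itertools

def build_lut(activations, group=2):
    """Precompute the dot product of each activation group with every
    possible ternary weight pattern (-1, 0, +1). Group size 2 is an
    illustrative assumption; a hardware LUT would use larger groups."""
    groups = [activations[i:i + group] for i in range(0, len(activations), group)]
    lut = []
    for g in groups:
        table = {}
        for pattern in itertools.product((-1, 0, 1), repeat=len(g)):
            table[pattern] = sum(w * a for w, a in zip(pattern, g))
        lut.append(table)
    return lut

def lut_dot(weights, lut, group=2):
    """Dot product of ternary weights against the activations baked into
    the LUT: one table lookup per group instead of per-element multiplies."""
    total = 0
    for idx, i in enumerate(range(0, len(weights), group)):
        total += lut[idx][tuple(weights[i:i + group])]
    return total
```

The cycle savings come from replacing a multiply-accumulate per weight with a single indexed read per group, which is also why the arithmetic blocks can shrink.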

My goals for next week are:

  1. Connect to the FPGA wirelessly and transmit the query onto the board (this is straightforward once Linux is booted on the board's core), so I'd do this before we start working on the synthesis flow.
  2. Prepare more on Vitis to see how I would synthesize a basic block that detects the query and pastes the exact same text back into the output (this can be seen as a preliminary step; we would simply replace this short circuit with our model to complete the system).

I wanted to keep fairly conservative goals for this week given that we are finally going to start interfacing with hardware, which always comes with challenges around setup and use of the system. Even so, I think the goals listed above are reasonable.

We’re currently well ahead of schedule (approximately 2 weeks).

Team Status Report 03/08/2025

This week we had a couple of targets that were mostly achieved:

  1. We noticed that our setup script had some issues and was a bit unstable to run. Since it was still reasonably fast, rerunning it several times until it worked completely was not a problem, so we wrapped the script in a loop with a try-except block that keeps running until the full setup succeeds. We would have liked to debug the code, but there were not many gains to be made by doing so, and we preferred to focus on the hardware segment.
  2. We analyzed the BitNet paper that Microsoft published and came up with an overall block diagram that would accelerate the system, and did some preliminary calculations on how much of a speedup we could attain over the classical form of the core we were using. From the looks of it, we would be able to save a number of cycles and shrink the overall size of the arithmetic blocks to meet the speed specifications that we had.
  3. We analyzed the ethical impacts of our project and completed the design review report.
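The retry wrapper described in item 1 can be sketched as below; the `run_setup` callable and the attempt cap are illustrative assumptions, not the actual script.

```python
def run_until_success(setup_fn, max_attempts=20):
    """Keep rerunning a flaky-but-fast setup function until it succeeds.
    The max_attempts safety cap is an illustrative addition, not part of
    the original script."""
    for attempt in range(1, max_attempts + 1):
        try:
            return setup_fn()
        except Exception as exc:
            print(f"setup attempt {attempt} failed: {exc}; retrying")
    raise RuntimeError("setup never completed")
```

This trades cleanliness for time: it papers over the instability rather than fixing it, which was an acceptable trade given the script's short runtime.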

 

Over the next week, our aims are:

  1. We should get the final FPGA and then synthesize our base core and model onto it. Using this, we want to benchmark the following:
    1. Total size footprint — See if we can fit a bigger model in.
    2. Tokens/Sec and latency to first token — This gives us an idea of how much of a speedup we would need over the existing hardware system. We would probably need to adjust the block diagram to meet this value.
    3. Power telemetry — This is a new FPGA, so we need to work out how to pull power data from it.
  2. We would also like to extend the UI script to interface with the FPGA, and start thinking about the authentication and scheduling systems for multi-user access, mainly to see whether it is in fact feasible rather than how we would implement it.

anirudhp_status_report_Feb22nd

This week, the focus was on packaging and cleaning up our user interface system in order to start getting feedback from people who can trial our system.

This week I managed to get the achieved power and timing parameters printed at the side of the screen, in a location that I thought was unobtrusive. It’s one of the details I would like to verify through our user feedback form.

Additionally, given that we are moving from an Ultra96v2 FPGA to a Kria-based FPGA, we need to learn a different set of EDA tools. So I spent the past week mainly focussing on how to operate Vitis to synthesize our softcores and language models.

Over the next week my goals are:

  1. Work out how the full synthesis flow for loading our models onto the Kria works.
  2. See how moving data onto the FPGA and pulling the results back out works, so that I can extend our previous Python script to use the FPGA for inference.

After this, I would plan to go for the more advanced power and performance data that we want to monitor on the FPGA.

 

We’re currently well ahead of schedule and on track to reach the iteration and architecture phase within another 2 weeks.

Anirudhp status update Feb 15th

This week, I focussed on setting up power and timing infrastructure on my Mac and integrating this into the overall system.

I managed to achieve all of those goals and evaluated the system on a couple of test prompts, which yielded some encouraging results:

  1. Mean power dissipation:
    1. CPU — 600–700 mW
    2. GPU — 24–40 mW
  2. Mean timing:
    1. 1.1–1.3 seconds
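On a Mac, per-component power numbers like these can be read from Apple's `powermetrics` utility (run as root, e.g. `sudo powermetrics --samplers cpu_power -n 1`); whether the project used that tool is my assumption, and its exact line format varies across macOS versions, so the sample text below is illustrative. A sketch of parsing that style of output:

```python
import re

# Illustrative powermetrics-style output; the exact wording varies by
# macOS version, so this sample is an assumption, not captured output.
SAMPLE_OUTPUT = """
CPU Power: 650 mW
GPU Power: 32 mW
"""

def parse_power(text):
    """Extract component power readings (in mW) from powermetrics-style text."""
    readings = {}
    for component, value in re.findall(r"(\w+) Power:\s*(\d+)\s*mW", text):
        readings[component] = int(value)
    return readings
```

Averaging these readings over a prompt dataset would give the mean dissipation figures reported above.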

These numbers seem to indicate that the FPGA system will comfortably beat these specifications, so it looks like we’re on the right track in that regard.

A more important aspect now is to be thorough with this system, so while the FPGA setup is ongoing I plan to find a dataset on which to benchmark the power and timing and establish average performance. I also evaluated the model on TruthfulQA and found a score of 30, which is pretty decent for a model of this size.

For the next week, I aim to complete the above goals and also extend my Python script for WiFi connectivity to the FPGA.

Answering part A:
Whenever people wish to leverage large language models or other AI-based systems, their data gets sent to data centres, which process the queries and compute the results. This leads to vulnerability on two ends:

  1. The data may be intercepted and read while in transit.
  2. Without control over the data, you never know what is being done with it after it has been used, which leads to poorer intellectual property protection and personal data safety.

    Additionally, as people become more and more reliant on these systems, they will start using them for more critical tasks, like urgent healthcare. As a result, a loss of wireless connectivity can cause significant harm.

Our solution aims to provide a fully offline setup for distilled AI systems, delivering reliable and secure inference to people who want to keep control of their data.

Team Status Update Feb 15th

This week we focussed on wrapping up the auxiliary tasks that lead up to the final stage of the project where we’ll focus on iterating on the hardware accelerator. Namely:

  1. Setting up our benchmarking and profiling system for the baseline.
  2. Setting up the FPGA connectivity and synthesis flow.
  3. Evaluating a Chain of thought alternative model for improving model accuracy.

We managed to complete the benchmarking and profiling system, and ultimately decided against using deepseek-r1’s smaller variant; however, the FPGA system did not end up working as we expected.

We found that the FPGA system we used had some flaws in its WiFi connectivity setup, which left us unable to service multiple clients at the same time.

Our goals for next week are:

  1. Run our benchmarking and profiling system on a wide spectrum of input tokens, and collect a comprehensive characterization dataset on our Macs.
  2. Swap to a functional FPGA with WiFi capability and boot Linux, as well as our synthesis flow, on the board. While the new FPGA is in transit, we can still try to synthesize the model onto our current one and get that working, but this is only a viable option if we’re going to use the same FPGA type after changing.
  3. Preemptively prepare the interconnect from FPGA to laptop and begin drawing a block diagram for the accelerated system.

For status report 2: A was written by Anirudh, B was written by Amelia, and C was written by Andrew.

Anirudhp_status_Feb8th2025

This week, while Andrew and Amelia were finalizing the model and setting up the FPGA, I dealt with the user interface and hotkey setup.

I utilized a Lua interface that sits above the macOS kernel to trigger software interrupts, and packaged the entire system into a single Python script that allows the hotkey “CMD + G” to trigger our BitNet LLM of choice.

Currently, our BitNet performs reasonably fast, taking around 4–5 seconds on a manual stopwatch to generate the output. However, it does not stream the output token by token; instead it sends the entire output to the surface at once. That will have to be fixed over the next week.
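Streaming amounts to flushing each token to the screen as it is generated instead of buffering the whole response. A minimal sketch, where the token generator is a stand-in for the real model:

```python
import sys
import time

def fake_model_tokens(prompt):
    """Stand-in for the BitNet model: yields tokens one at a time."""
    for token in ["Hello", " ", "world", "!"]:
        time.sleep(0.01)  # simulate per-token generation latency
        yield token

def stream_response(prompt, token_source=fake_model_tokens):
    """Print tokens as they arrive instead of waiting for the full output."""
    pieces = []
    for token in token_source(prompt):
        sys.stdout.write(token)
        sys.stdout.flush()  # show each token immediately, not only on newline
        pieces.append(token)
    sys.stdout.write("\n")
    return "".join(pieces)
```

The key detail is the explicit `flush()` after every token; without it, stdout buffering produces exactly the all-at-once behaviour described above.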

While I have not taken any power measurements yet, I did notice that it turned my laptop’s fan on after I ran it 10-15 times in quick succession.

My goals for the next week are:

  1. Benchmark the model on a purely macOS-based infrastructure.
  2. Allow the system to stream tokens rather than displaying them all at once.
  3. Figure out a way to take power measurements and benchmarks for the Mac-based runtime.
  4. Benchmark the model for safety, and look into quantizing a DeepSeek-like system in order to reduce hallucinations and improve accuracy (reasoning-based models are inherently better in this regard).

Anirudhp_29thJan2025

I am currently working on recreating a Flux 1.58-bit model as announced by ByteDance Research.

At this time, the model they have trained shows a 7.7x size reduction over the existing 23.5 GB Flux model released by Black Forest Labs. The resulting model will still be in excess of 3 GB, and cannot be accommodated on the FPGAs that we have access to (max size 2 GB).

As a result, I have replicated the quantization process for the Flux model; however, even though the model was open-sourced by Black Forest Labs, the training code and training data are not referenced. I am therefore trying to adapt the quantization system for a fully open-source text-to-image system such as:

DALL-E Mini, or the first Flux.1 Dev model that was released.

That said, the Flux model, when quantized to 1.58 bits, does produce excellent outputs that are almost on par with the original model.

E.g., the prompt “A man using a soldering iron to repair a broken electronic device” produces:

My goal for the end of next week is to identify a way of using an FPGA that can accommodate the larger models (using either a DIMM slot or, in an extreme case, networking two FPGAs).

If this is not possible, I will either distill the Flux model or recreate the quantization code for DALL-E Mini.

Our Project Idea

We aim to address two current challenges in ML applications:

  1. Models are too heavyweight to run on local machines.
  2. Models consume excessive energy, making them environmentally unsustainable.

To address this problem, we plan to develop an FPGA-based accelerator as a precursor to an ASIC capable of running smaller, lightweight “bitnets” locally.

Bitnets are highly quantized versions of their bulky base models, and recent research by Microsoft, Tsinghua University, and the Chinese Academy of Sciences has shown that such models can be trained with minimal loss in output quality.

Our proof of concept will demonstrate architectural improvements, achieving faster text generation compared to FPGA-based CPU/GPU systems of a similar size and power class. We will validate our approach using a heavier text completion model.

Currently, we are working on identifying the ideal bitnet model that we aim to accelerate, using the following considerations to evaluate the models:

  1. The models should be small enough to run within the FPGA’s limited hardware resources.
  2. The models should produce outputs good enough to be used for applications like text or code completion, with a future goal of predictive text completion.

    Currently:

    Amelia is investigating potential text-to-text models that we could use (based on the work of Microsoft’s bitnet framework).
    Andrew is looking into the potential of retraining a Flux text-to-image model for smaller size (based on the work of Black Forest Labs).
    Anirudh is trying to create a quantization and training system for the Flux text-to-image models so that they can be compressed to bitnets (based on the work of TikTok Research).