Anirudhp_March22nd_status_report

This week, we built towards a basic prototype that interfaces with the FPGA for accelerated inference.

My personal goals were to get used to working with the PYNQ-based interface and identify how I could use it to do the following:

  1. Raise an “in use” flag that the client can check to decide whether to send a query.
  2. Directly wire the input to the output (i.e., send the query straight back to the user). A rough sketch of both pieces follows this list.
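
Here is a minimal PYNQ-side sketch of what I have in mind for both pieces, assuming the flag lives in a memory-mapped status register and the loopback goes through an AXI DMA. The overlay name, addresses, and IP names are placeholders until the real bitstream is integrated:

```python
import numpy as np
from pynq import MMIO, Overlay, allocate

# All names and addresses below are placeholders until the real bitstream exists.
ol = Overlay("echo_design.bit")        # hypothetical overlay: status register + loopback DMA
STATUS_BASE, STATUS_RANGE = 0x40000000, 0x1000
IN_USE_OFFSET = 0x0

status_reg = MMIO(STATUS_BASE, STATUS_RANGE)

def set_in_use(busy: bool) -> None:
    # Goal 1: raise/lower the "in use" flag that clients check before sending a query.
    status_reg.write(IN_USE_OFFSET, 1 if busy else 0)

def echo_query(query: bytes) -> bytes:
    # Goal 2: push the raw query through the loopback block and read it straight back.
    set_in_use(True)
    try:
        in_buf = allocate(shape=(len(query),), dtype=np.uint8)
        out_buf = allocate(shape=(len(query),), dtype=np.uint8)
        in_buf[:] = np.frombuffer(query, dtype=np.uint8)
        ol.axi_dma_0.sendchannel.transfer(in_buf)
        ol.axi_dma_0.recvchannel.transfer(out_buf)
        ol.axi_dma_0.sendchannel.wait()
        ol.axi_dma_0.recvchannel.wait()
        return bytes(out_buf)
    finally:
        set_in_use(False)
```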

The objective of this work is to lay the groundwork for the multi-client FPGA response system described in our team and design reports.

For the time being, these goals appear to be accomplished, but we cannot fully verify them until the complete system integration happens over the following week. Given that the blocks are relatively simple and have been tested well individually, we seem to be ahead of schedule on our project.

anirudhp_status_report_March_15th

This week, with the FPGA in place, I aimed to extend the UI and network script to the FPGA interface.

First, I extended our earlier hotkey interface to send the query to the board for inference. The transfer is done over scp so that the data stays encrypted as it moves from the laptop to the FPGA over the local network.
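
As a rough illustration, the transfer itself is just a call out to scp from the hotkey script; the host, user, and paths below are placeholders:

```python
import subprocess
from pathlib import Path

FPGA_HOST = "xilinx@192.168.2.99"                    # placeholder address for the board on the LAN
REMOTE_QUERY_PATH = "/home/xilinx/queries/query.txt"  # placeholder destination on the board

def send_query(query: str, local_path: str = "/tmp/query.txt") -> None:
    """Write the query to a temporary file and push it to the board over scp."""
    Path(local_path).write_text(query)
    subprocess.run(["scp", local_path, f"{FPGA_HOST}:{REMOTE_QUERY_PATH}"], check=True)
```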

At the moment, we have not yet managed to perform inference on the FPGA (accelerated or otherwise), so I have not been able to test returning the result file. However, I did look into a multi-client approach that is handled entirely on the laptop side.

The laptop pulls a flag from the FPGA that indicates whether it is currently servicing a query, and this check is wrapped in a loop so that the laptop does not send a query until the FPGA is free. As for the authentication system, the script simply prepends a passphrase embedded in the script to the query text, and the FPGA uses it to verify the user.
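
A sketch of the laptop-side gating, assuming the busy flag is exposed as a small status file on the board and reusing send_query from the scp sketch above; the paths and passphrase are placeholders:

```python
import subprocess
import time

FPGA_HOST = "xilinx@192.168.2.99"            # placeholder, same board as above
BUSY_FLAG_PATH = "/home/xilinx/in_use.flag"  # hypothetical file the FPGA side keeps updated
PASSPHRASE = "CHANGE-ME"                     # shared secret embedded in the script (placeholder)

def fpga_busy() -> bool:
    """Read the busy flag over ssh; '1' means the board is servicing another query."""
    out = subprocess.run(["ssh", FPGA_HOST, f"cat {BUSY_FLAG_PATH}"],
                         capture_output=True, text=True)
    return out.stdout.strip() == "1"

def submit_query(query: str) -> None:
    # Wait until the board is free, then send the passphrase-prefixed query.
    while fpga_busy():
        time.sleep(0.5)
    send_query(f"{PASSPHRASE}\n{query}")     # send_query() from the scp sketch above
```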

So far, even though we haven’t performed an actual inference on the board, we still seem to be ahead of schedule given that our multi-client approach has made significant progress.

anirudhp_status_report_03/08/25

This week, I focused on two primary aspects of the project:

  1. Ethical considerations and how they will adjust our benchmarks. Here, I made some minor adjustments so that the model simply refuses to autocomplete certain types of text (e.g., medical or urgent-action content); a rough illustration follows this list.
  2. Analysing the Microsoft BitNet paper in order to suggest performance improvements that we could target.
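
To illustrate item 1: the adjustment effectively acts as a refusal gate in front of autocompletion. The change itself was made on the model side, but functionally it behaves roughly like the keyword check sketched below; the categories and keywords are purely illustrative, and run_bitnet is a hypothetical stand-in for the actual model call:

```python
# Purely illustrative: a coarse category gate in front of autocomplete.
# The keyword lists below are placeholders, not the actual tuned criteria.
REFUSAL_KEYWORDS = {
    "medical": ["diagnosis", "dosage", "prescription"],
    "urgent_action": ["evacuate immediately", "call 911"],
}

def run_bitnet(prompt: str) -> str:
    # Stand-in for the actual model call (hypothetical name).
    return "<model output>"

def should_refuse(prompt: str) -> bool:
    text = prompt.lower()
    return any(kw in text for kws in REFUSAL_KEYWORDS.values() for kw in kws)

def complete(prompt: str) -> str:
    if should_refuse(prompt):
        return "[refused: sensitive category]"
    return run_bitnet(prompt)
```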

Overall, the aspects that I was able to achieve are:

  1. Reduced the hallucination rate by over 6%, though naturally at the expense of the model simply refusing to provide an output in those cases.
  2. Identified the look-up-table implementation and its indexing system as the major speedups, which would provide 40% more throughput in the system (a small sketch of the idea follows below).
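
To illustrate item 2, here is how I understand the look-up-table idea: activations are processed in groups, every possible partial sum for a group of ternary weights is precomputed once into a table, and each row's packed weights then index into that table instead of being multiplied out. The group size and layout below are illustrative rather than the paper's exact kernel:

```python
import numpy as np

G = 4  # group size per table lookup (illustrative)

def build_lut(x_group: np.ndarray) -> np.ndarray:
    """Precompute the partial sum for every possible ternary pattern of one activation group."""
    lut = np.zeros(3 ** G, dtype=np.float32)
    for idx in range(3 ** G):
        acc, rem = 0.0, idx
        for g in range(G):
            w = (rem % 3) - 1      # base-3 digit -> weight in {-1, 0, +1}
            rem //= 3
            acc += w * x_group[g]
        lut[idx] = acc
    return lut

def lut_matvec(w_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Equivalent to w_ternary @ x, but each weight group becomes one table lookup."""
    rows, cols = w_ternary.shape               # entries in {-1, 0, +1}, cols divisible by G
    y = np.zeros(rows, dtype=np.float32)
    powers = 3 ** np.arange(G)
    for c0 in range(0, cols, G):
        lut = build_lut(x[c0:c0 + G])          # shared by every row
        digits = w_ternary[:, c0:c0 + G] + 1   # -1,0,+1 -> 0,1,2
        idx = digits @ powers                  # pack each group into a LUT index
        y += lut[idx]
    return y
```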

My goals for next week are:

  1. Connect to the FPGA wirelessly and transmit the query onto the board (this should be straightforward once Linux is booted on the board's ARM core), ideally before we start working on the synthesis flow.
  2. Prepare more on Vitis to see how I would synthesize a basic block that receives the query and copies the exact same text into the output (a preliminary step; we would later replace this short circuit with our model to complete the system).

I wanted to keep fairly conservative goals for this week given that we are finally starting to interface with hardware, which always brings challenges around setup and use of the system. Even so, I think the goals listed above are reasonable.

We’re currently well ahead of schedule (approximately 2 weeks).

Anirudhp_status_Feb8th2025

This week, while Andrew and Amelia were finalizing the model and setting up the FPGA, I worked on the user interface and hotkey setup.

I used a Lua interface that sits above the macOS kernel to trigger software interrupts, and packaged the entire system into a single Python script so that the hotkey “Cmd + G” triggers our BitNet LLM of choice.

Currently, our BitNet runs reasonably fast, taking around 4-5 seconds (timed with a manual stopwatch) to generate the output. However, it does not stream the output token by token; it sends the entire output to the screen at once, which will have to be fixed over the next week.
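
Streaming should mostly come down to reading the runtime's stdout incrementally instead of waiting for the process to exit. A sketch, assuming the model is exposed as a command-line binary (the binary name and flags below are placeholders):

```python
import subprocess
from typing import Iterator

def stream_completion(prompt: str) -> Iterator[str]:
    """Yield output as the model produces it instead of waiting for the full completion."""
    # "bitnet-cli" and its flags are placeholders for however we end up invoking the model.
    proc = subprocess.Popen(
        ["bitnet-cli", "--prompt", prompt],
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,                      # line-buffered on our side
    )
    assert proc.stdout is not None
    for chunk in proc.stdout:           # arrives line by line as the binary flushes
        yield chunk
    proc.wait()

# Usage idea: paste chunks into the active window as they arrive rather than all at once.
# for chunk in stream_completion("Draft a reply to ..."):
#     print(chunk, end="", flush=True)
```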

While I have not taken any power measurements yet, I did notice that it turned my laptop’s fan on after I ran it 10-15 times in quick succession.

My goals for the next week are:

  1. Benchmark the model on a purely macOS-based infrastructure.
  2. Allow the system to stream tokens rather than displaying everything at once.
  3. Figure out a way to take power measurements and benchmarks for the Mac-based runtime (one candidate approach is sketched after this list).
  4. Benchmark the model for safety and look into quantizing a DeepSeek-like system to reduce hallucinations and improve accuracy (reasoning-based models are inherently better in this regard).
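
For goal 3, one option I am considering is wrapping macOS's built-in powermetrics tool around a benchmark run (it requires sudo, and sampler names vary a bit across macOS versions); roughly:

```python
import subprocess

def sample_cpu_power(samples: int = 10, interval_ms: int = 1000) -> str:
    """Capture CPU power samples while the model runs; the text output is parsed afterwards."""
    result = subprocess.run(
        ["sudo", "powermetrics", "--samplers", "cpu_power",
         "-i", str(interval_ms), "-n", str(samples)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout   # contains lines such as "CPU Power: ... mW" on Apple Silicon
```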

Anirudhp_29thJan2025

I am currently working on recreating the 1.58-bit Flux model announced by ByteDance Research.

However, at this time, the model they have trained shows a 7.7x size reduction over the existing 23.5 GB Flux model released by Black Forest Labs. That still leaves a model in excess of 3 GB, which cannot be accommodated on the FPGAs we have access to (max size 2 GB).

I have replicated the quantization process for the Flux model; however, even though the model weights were open-sourced by Black Forest Labs, the training code and training data are not available. As a result, I am currently trying to adapt the quantization system to a fully open-source text-to-image system such as DALL-E Mini or the first Flux.1 Dev model that was released.
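
The core step of the quantization process I replicated is absmean ternary rounding in the style of the BitNet b1.58 paper; a minimal sketch of just that step is below (the full 1.58-bit Flux recipe layers more on top of this):

```python
import torch

def absmean_ternary_quantize(weight: torch.Tensor, eps: float = 1e-5):
    """BitNet b1.58-style quantization: scale a weight matrix by its mean absolute
    value, then round every entry to {-1, 0, +1}. Inference uses w_q * scale."""
    scale = weight.abs().mean().clamp(min=eps)
    w_q = (weight / scale).round().clamp_(-1, 1)
    return w_q, scale
```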

However, the Flux model, when quantized to 1.58 bits, does produce excellent outputs that are almost on par with the original model.

For example, the prompt “A man using a soldering iron to repair a broken electronic device” produces an image very close to the original model’s output.

My goal for the end of next week is to identify a way of using an FPGA that can accommodate the larger model (using either a DIMM slot or, in an extreme case, networking two FPGAs).

If this is not possible, I will either distill the Flux model or recreate the quantization code for DALL-E Mini.