anirudhp_status_report_April19th

This week, I mainly focused on testing and wrapping up the project. In the process, I patched the following issues:

  1. Multi-client was reading the output too early because the flag was set well in advance.
  2. Output quality was improved by switching to a better bitnet model.
  3. An interesting bug was identified during user testing: just because a waiting client doesn’t print the output in the autocomplete doesn’t mean it isn’t reading it, so one client was capable of reading another client’s output.
  4. This was fixed by making clients read only during the time steps when no output is ready; a sketch of this gating follows the list.
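
For illustration, the gating fix looks roughly like the following. This is a minimal sketch, not our exact client code: `read_flags`, the flag names, and the `owner` field are all stand-ins for however the client actually fetches the FPGA’s status bits.

```python
import time

POLL_INTERVAL_S = 0.05  # hypothetical polling period

def wait_for_turn(read_flags, client_id):
    """Poll shared status, but never touch the output buffer while a
    result is pending for someone else.

    `read_flags` is assumed to return a dict like
    {"in_use": bool, "output_ready": bool, "owner": str}.
    """
    while True:
        flags = read_flags()
        # The original bug: waiting clients polled (and therefore read)
        # the output even while another client's result was sitting in
        # it. The fix: back off whenever output is ready for a query
        # that is not ours, and only inspect state otherwise.
        if flags["output_ready"] and flags["owner"] != client_id:
            time.sleep(POLL_INTERVAL_S)
            continue
        if not flags["in_use"]:
            return  # safe to send our query
        time.sleep(POLL_INTERVAL_S)
```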

Currently, my goals for next week are:

  1. Complete the final report
  2. Rehearse for the final presentation, which I will be delivering.

Some things that I specifically learnt over the course of the project are:

  1. Access controls and the like on an FPGA hard core. I had not learnt this before, and user testing allowed me to identify that this would be an issue.
  2. The software-defined interrupt interface on the Mac, which was the key technology I used for the keyword prompting.
  3. Presentation skills, specifically not focusing on text that is already stated on the slides; I picked this up while rehearsing the presentation yesterday.

Team Status report_Apr19th

This week we put the final touches on our overall project and completed the presentation for the coming week.

We accomplished the following:

  1. User testing with experienced, tech-savvy users to find potentially “broken” behaviour; so far we have found a few minor issues, with the fixes covered in the individual status reports.
  2. Testing multi-client and ensuring functionality as well as security.
  3. Completing our presentation and report.

Over the next week, as we complete the presentation, we will also aim to do the following:

  1. Identify some way of bringing the FPGA to the demo (beyond requiring the Ethernet dock that we had requested earlier).
  2. Complete the report and poster for the final demo date.

Currently we are well ahead of schedule with only minimal work required next week.

anirudhp_status_report

This week, I wanted to verify that the 700M parameter model would run on the FPGA (given that our primary focus is getting the output quality up).

To verify this, I did a full analysis of the memory resources available on the FPGA and added a memory profiling system to the CPU-based runtime on my laptop; the profiling approach is sketched below.
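
This is a minimal illustration of that profiling system, assuming the psutil package; `run_inference` is a placeholder for whatever callable drives the model in the CPU runtime.

```python
import threading
import time

import psutil  # assumed dependency; any RSS sampler would do

def profile_peak_rss(run_inference, interval_s=0.01):
    """Run `run_inference` while sampling this process's resident set
    size in a background thread; returns (result, peak_bytes)."""
    proc = psutil.Process()
    peak = proc.memory_info().rss
    done = threading.Event()

    def sampler():
        nonlocal peak
        while not done.is_set():
            peak = max(peak, proc.memory_info().rss)
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        result = run_inference()
    finally:
        done.set()
        thread.join()
    return result, max(peak, proc.memory_info().rss)
```

Comparing the recorded peak against the FPGA’s available memory gives a first-order answer to whether the 700M model will fit.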

For now, I believe that the FPGA should not have trouble running the model once we delete PYNQ from the board. However, the always-on Python script will need to be adjusted.

In terms of milestones, we are still on track, but I ran into a few other things that need to be changed in order to delete PYNQ, and that additional work might put us marginally behind schedule.

Over the next week, I want to work out exactly how the interfacing and inference system will work once I delete PYNQ from the board.

anirudhp_status_report_March29th

This week, my goal was to implement the model inference system on the FPGA.

I ended up running into a large number of issues and was forced to switch models to fix them. For now, the model switch is temporary: I switched because the dynamic memory usage exceeded the board’s resources, which theoretically should not happen and is almost certainly due to a bug in the inference system.

We shifted to afrideva/llama-160m-GGUF (llama-160m.q2_k.gguf), which uses the q2_k quantization scheme.
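
For reference, a GGUF checkpoint like this one can be loaded and streamed from Python. This is a minimal sketch assuming the llama-cpp-python bindings (our actual runtime may differ), and the prompt is just a placeholder.

```python
from llama_cpp import Llama

# Load the q2_k-quantized checkpoint named in this report.
llm = Llama(model_path="llama-160m.q2_k.gguf", n_ctx=512)

# Stream tokens so an autocomplete-style UI can render them as they arrive.
for chunk in llm("def fibonacci(n):", max_tokens=48, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
```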

We are well ahead of schedule, given that this model produces decent output quality (37% hallucination rate). For now, changing from this model back to the original one is a much lower priority.

My goal for next week is to increase performance from its current level (8-10 tokens/sec) all the way to reading speed. It is currently a bit slower than reading speed, given that I notice a slight lag.

Team_March22nd_Status_Report

This week, now that the FPGA has been set up and connected to the campus WiFi network, we could easily complete all parts individually without having to pass the FPGA around.

We architected an alternate approach to multi-client response handling that places the scheduler on the client side rather than the server side:

  1. When the FPGA receives a query, it sets an “in-use” flag and starts operating on that query.
  2. Before a client sends a query to the FPGA, it checks the in-use flag.
  3. The client waits until the in-use flag turns off before actually sending the query.

This system is vulnerable to race conditions, but we have decided to accept that minor flaw; the sketch below marks where the window is.
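
Concretely, the client side of this scheme reduces to a check-then-send loop like the following. The HTTP endpoints and address are hypothetical stand-ins for our actual transport.

```python
import time

import requests  # assumed transport; the flag could equally be a raw socket read

FPGA = "http://fpga.local:8000"  # hypothetical address and endpoints

def send_query(prompt):
    # Steps 2-3 above: poll the in-use flag, then send the query.
    while requests.get(f"{FPGA}/in_use").json()["in_use"]:
        time.sleep(0.1)
    # Race window: another client can pass the same check between our
    # read of the flag and the POST below; we accept this minor flaw.
    return requests.post(f"{FPGA}/query", json={"prompt": prompt}).json()
```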

Andrew worked on running the model on the FPGA.

Anirudh set up a basic answer system and the in-use flag requirement.

Amelia refined the UI script so that it reads the flag and performs the wait before sending the query across.

For the time being, all individual components have been completed, and at this stage we are moving on to the integration step. While we have tested everything individually, full functionality can only be verified during integration. But it looks like we are well ahead of schedule, given how easy integration should be.

Amelia’s Status Report for March 15

This week was a little slow for me, as I had a couple of midterms and other things I needed to focus on. My goals for this week were to look into pulling power data from the FPGA and to get some ideas about how to implement multiuser authentication. I did not have time to look into multiuser authentication; however, I figured out how to pull power data from the FPGA and also developed a framework for obtaining power data for our accelerated hardware rather than the softcore/everything else on the board.

To read power data I plan to use PYNQ, since there is a power management module that will allow us to use built-in libraries to access power data from the PMIC on board. I also looked into monitoring data over time to track heavy use of our system. Most of my time was spent familiarizing myself with these libraries and waiting for our board to arrive so that I could play with the PYNQ tools once the board was booted. Another issue we ran into is that the PMIC reports power data for the entire board, which is higher than for just our accelerator. To get around this, we plan to measure power before synthesizing our design and then use the delta of power before and after as our power rating.
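
As a sketch of that recording loop, assuming PYNQ’s PMBus support is available on our board (rail names vary per board, so "3V3" and the two workload functions are placeholders):

```python
from pynq import get_rails, DataRecorder

def run_idle_baseline():
    pass  # placeholder: board idle, before our design is loaded

def run_accelerated_inference():
    pass  # placeholder: board running our accelerator

rails = get_rails()                          # PMIC rails exposed by PYNQ
recorder = DataRecorder(rails["3V3"].power)  # rail name is board-specific

with recorder.record(0.5):                   # sample every 0.5 s
    run_idle_baseline()
    run_accelerated_inference()

# recorder.frame holds the samples; the accelerator's power rating is
# the delta between the two phases, as described above.
print(recorder.frame.describe())
```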

I’m off track this week, so to get back on track next week I plan to pivot to getting a package ready for user testing (finalizing survey questions and making sure the repo is public). I also plan to implement a multiuser system with user logins. Finally, I will be available to help get tokens streaming on the FPGA.

anirudhp_status_report_Feb22nd

This week, the focus was on packaging and cleaning up our user interface system in order to start getting feedback from people who can trial our system.

This week I managed to get the achieved power and timing parameters printed on the side of the screen, in a location that I thought was unobtrusive. It’s one of the details that I would like to verify through our user feedback form.

Additionally, given that we are moving from an Ultra96v2 FPGA to a Kria-based FPGA, we will need to learn a different set of EDA tools. So I spent the past week mainly focusing on how to operate Vitis to synthesize our softcores and language models.

Over the next week my goals are:

  1. Work out how the full synthesis flow for loading our models onto the Kria works.
  2. Work out how moving data onto the FPGA and pulling results back out works, so that I can extend our previous Python script to use the FPGA for inference (a sketch of what this might look like follows this list).
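
As a sketch of goal 2, assuming the Kria runs PYNQ and the overlay exposes an AXI DMA (the bitstream and IP names here are placeholders until the real flow is pinned down):

```python
import numpy as np
from pynq import Overlay, allocate

ol = Overlay("bitnet_accel.bit")  # hypothetical bitstream name
dma = ol.axi_dma_0                # hypothetical AXI DMA instance name

# Physically contiguous buffers the DMA engine can reach.
in_buf = allocate(shape=(1024,), dtype=np.int8)   # data into the FPGA
out_buf = allocate(shape=(1024,), dtype=np.int8)  # results back out

in_buf[:] = 0  # fill with real model inputs in practice
dma.sendchannel.transfer(in_buf)
dma.recvchannel.transfer(out_buf)
dma.sendchannel.wait()
dma.recvchannel.wait()
```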

After this, I plan to move on to the more advanced power and performance data that we want to monitor on the FPGA.

We’re currently well ahead of schedule and on track to reach the iteration and architecture phase within another 2 weeks.

Anirudhp status update Feb 15th

This week, I focused on setting up power and timing infrastructure on my Mac and integrating it into the overall system.

I managed to achieve all of those goals and evaluated the system on a couple of test prompts, which yielded some encouraging results:

  1. Mean power dissipation:
    1. CPU: 600-700 mW
    2. GPU: 24-40 mW
  2. Mean timing:
    1. 1.1-1.3 seconds

These numbers seem to indicate that the FPGA system will effortlessly beat these specifications, so it looks like we’re on the right track in that regard.

A more important aspect now is to be quite thorough with this system, so while the FPGA setup is ongoing I plan to find a dataset to benchmark the power and timing on, to establish the average performance. I also evaluated the model on TruthfulQA and found a score of 30, which is pretty decent for a model of this size.
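
The timing half of that benchmark reduces to a small harness like the one below; `generate` is a placeholder for whatever callable runs the model (here the Mac CPU/GPU baseline), and power would be sampled separately by the existing infrastructure.

```python
import time

def mean_latency(generate, prompts):
    """Average per-prompt latency (seconds) over a benchmark set."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)
```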

For the next week, I aim to complete the above goals and also extend my Python script for WiFi connectivity to the FPGA.

Answering part A:
Whenever people wish to leverage large language models or other AI-based systems, their data gets sent to data centres, which process the queries and compute the results. This leads to vulnerability on two ends:

  1. The data may be intercepted and read while in transit.
  2. Without control over the data, you never know what is being done with it after it has been used, which leads to poorer intellectual property protection and personal data safety.

Additionally, as people become more and more reliant on these systems, they will start using them for more critical tasks, like urgent healthcare. As a result, in the absence of wireless connectivity, these systems can cause significant harm.

Our solution aims to provide a fully offline setup for distilled AI systems, delivering reliable, secure, offline AI inference to people who want to keep control of their data.

Our Project Idea

We aim to address two current challenges in ML applications:

  1. Models are too heavyweight to run on local machines.
  2. Models consume excessive energy, making them environmentally unsustainable.

To address this problem, we plan to develop an FPGA-based accelerator as a precursor to an ASIC capable of running smaller, lightweight “bitnets” locally.

Bitnets are highly quantized versions of their bulky base models, and recent research by Microsoft, Tsinghua University, and the Chinese Academy of Sciences has shown that such models can be trained with minimal loss in output quality.

Our proof of concept will demonstrate architectural improvements, achieving faster text generation compared to FPGA-based CPU/GPU systems of a similar size and power class. We will validate our approach using a heavier text-completion model.

Currently, we are working on identifying the ideal bitnet model that we aim to accelerate, using the following considerations to evaluate the models:

  1. The models should be small enough to run on the FPGA’s limited hardware resources.
  2. The models should produce outputs good enough to be used for applications like text or code completion, with a future goal of predictive text completion.

Currently:

Amelia is investigating potential text-to-text models that we could use (based on the work of Microsoft’s bitnet framework).
Andrew is looking into the potential of retraining a Flux text-to-image model for smaller size (based on the work of Black Forest Labs).
Anirudh is trying to create a quantization and training system for the Flux text-to-image models so that they can be compressed to bitnets (based on the work of TikTok Research).