Amelia’s Status Report for March 29

This week I worked on finalizing the client communication with the FPGA based on how Anirudh and Andrew configured it (in terms of where data-ready flags are set and where output data is sent). Although we have accepted that there will be some resource starvation in our multi-client implementation, I ensured that when a client accesses the flag signaling that the FPGA is ready to receive data, the access is atomic, so other clients cannot simultaneously try to send their prompts. This also simplifies getting the correct data to the correct user, since the only client looking for data from the FPGA will be the one whose request is being processed. I also updated the Lua script to read the FPGA output from a file that the Python script scps back to the client device. Everything on the client side is now ready to be integrated with the FPGA side. My repo with the code I’ve been working on can be found here.
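
One concrete way to make that check-and-claim atomic is to do it on the FPGA side under flock, invoked over ssh. The sketch below assumes the flag lives as a file on the board; the hostname and paths are placeholders, not necessarily what my script uses.

```python
import subprocess

FPGA_HOST = "fpga.local"                 # placeholder hostname
READY_FLAG = "/home/xilinx/ready.flag"   # placeholder path of the data-ready flag
LOCK_FILE = "/home/xilinx/fpga.lock"     # placeholder lock file on the board

def try_claim_fpga() -> bool:
    """Atomically check the ready flag and claim the FPGA in one remote command.

    flock(1) on the FPGA serializes the check-and-claim, so two clients cannot
    both see the flag set and start sending prompts at the same time.
    """
    remote_cmd = (
        f"flock -n {LOCK_FILE} "
        f"sh -c 'test -f {READY_FLAG} && rm {READY_FLAG}'"
    )
    result = subprocess.run(["ssh", FPGA_HOST, remote_cmd])
    return result.returncode == 0  # 0 means we saw the flag and claimed it
```

If try_claim_fpga() returns True, this client is the one whose request the FPGA will process, so it is also the only one that needs to look for the output file.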

I am on schedule and completed everything I planned to complete this week. My goals for next week are to have successful interim demos and also integrate my scripts with the hardware.

anirudhp_status_report_March29th

This week, my goal was to implement the model inference system on the FPGA.

I ended up running into a large number of issues and had to switch models to work around them. The switch is intended to be temporary: I made it because the model’s dynamic memory usage exceeded the board’s resources. In theory that should not happen, so it is almost certainly caused by a bug in the inference system.

We shifted to afrideva/llama-160m-GGUF (llama-160m.q2_k.gguf), which uses the q2_k quantization scheme.
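
For reference, this is roughly what loading that GGUF file looks like with llama-cpp-python, along with a quick tokens/sec measurement. This is only a sketch under the assumption of llama.cpp-style tooling; the path and parameters are placeholders rather than our actual configuration.

```python
import time
from llama_cpp import Llama  # llama-cpp-python bindings

# Placeholder path to the q2_k quantized checkpoint from afrideva/llama-160m-GGUF
llm = Llama(model_path="models/llama-160m.q2_k.gguf", n_ctx=512)

prompt = "The quick brown fox"
start = time.time()
out = llm(prompt, max_tokens=64)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(out["choices"][0]["text"])
print(f"{n_tokens / elapsed:.1f} tokens/sec")  # reading-speed target is ~8-10 tok/s
```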

We are well ahead of schedule given that this model produces decent output quality (a 37% hallucination rate), so changing from this model back to the original one is now a much lower priority.

My goal for next week is to increase performance up to reading speed (8-10 tokens/sec). The system currently runs slightly below that level; I notice a slight lag.

Team_status_report_March29th

This week, we set things up for the interim demo. A couple of changes that we made:

  1. We moved from a 700M-parameter model to a 160M-parameter model.
  2. We swapped quantization techniques because the custom kernel could not accept the smaller model in its original quantization form.
  3. We used a client-based access control system, and, as Prof. Theo noted, we observe starvation in our resource requests.

The details will be analyzed further in individual status reports.

We are well on schedule and have actually hit our basic MVP. The only thing left to do now would be to iterate and try to improve the performance of our system. We are also exploring how to extract power and telemetry signals from the FPGA for simple visualizations.

Amelia’s Status Report for March 22

This week I focused on getting the UI script to automatically ssh into the FPGA and read an “in use” flag that tells the program when the FPGA is available for a new query. The main challenge in getting this working was that the client script needed to concurrently check the “in use” flag and watch for input from the user when the Lua script is triggered.
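
A minimal sketch of that concurrency pattern, using a background thread for the flag poll and an Event the main thread waits on (the hostname and flag path are placeholders, not the exact ones in my script):

```python
import subprocess
import threading
import time

FPGA_HOST = "fpga.local"             # placeholder hostname
IN_USE_FLAG = "/home/xilinx/in_use"  # placeholder flag path on the board

fpga_free = threading.Event()

def poll_in_use_flag(period: float = 1.0) -> None:
    """Background thread: ssh into the board and set the event once the flag clears."""
    while not fpga_free.is_set():
        check = subprocess.run(["ssh", FPGA_HOST, f"test -f {IN_USE_FLAG}"])
        if check.returncode != 0:    # flag file absent -> FPGA is free
            fpga_free.set()
        else:
            time.sleep(period)

threading.Thread(target=poll_in_use_flag, daemon=True).start()

# Main thread: collect the prompt (triggered by the Lua hotkey in the real flow),
# then block until the poller reports that the FPGA is available.
prompt = input("prompt> ")
fpga_free.wait()
print("FPGA is free, sending:", prompt)
```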

Here is a link to my GitHub repo for the code so far: repo

I am on track in terms of getting my scripts ready to integrate with the FPGA, and I met the goals I set last week. Looking forward, we plan to integrate next week, so I can’t say for certain what I will work on, but my main goal is a functional integration of all of our work so that we have something concrete to show at the interim demo.

Team_March22nd_Status_Report

This week, now that the FPGA has been set up and connected to the campus Wi-Fi network, we could easily complete all parts individually without having to pass the FPGA around.

We architected an alternate approach to multi-client responses that puts the scheduler on the client side rather than the server side:

  1. When the FPGA receives a query, it sets an “in-use” flag and starts operating on that query.
  2. Before the client sends a query to the FPGA, it checks the in-use flag.
  3. The client waits until the in-use flag turns off before actually sending the query.

This system is vulnerable to race conditions, but we have decided to accept that minor flaw.
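
The client-side half of this scheduler is essentially a poll-then-send loop. The sketch below assumes the in-use flag is exposed as a file on the board and the query travels over scp; hostnames and paths are placeholders. The gap between checking the flag and sending the query is exactly the race window we are choosing to accept.

```python
import subprocess
import time

FPGA_HOST = "fpga.local"             # placeholder hostname
IN_USE_FLAG = "/home/xilinx/in_use"  # placeholder flag path on the board
QUERY_DEST = f"{FPGA_HOST}:/home/xilinx/query.txt"  # placeholder drop location

def fpga_in_use() -> bool:
    # test -f exits 0 only if the flag file exists on the board
    return subprocess.run(["ssh", FPGA_HOST, f"test -f {IN_USE_FLAG}"]).returncode == 0

def send_query(local_query_file: str) -> None:
    # Steps 2/3: wait for the in-use flag to clear before sending
    while fpga_in_use():
        time.sleep(1.0)
    # Another client could slip in right here -- the accepted race window
    subprocess.run(["scp", local_query_file, QUERY_DEST], check=True)
```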

Andrew worked on running the model on the FPGA.

Anirudh set up a basic answer system and the in-use flag requirement.

Amelia refined the UI script so that it reads the flag and performs the wait before sending the query across.

For the time being, all individual components have been completed, and we are moving on to the integration step. While we have tested everything individually, the full system can only be verified during integration. But it looks like we are well ahead of schedule, given how easy integration should be.

Anirudhp_March22nd_status_report

This week, we built towards a basic prototype that interfaces with the FPGA for accelerated inference.

My personal goals were to get comfortable with the Pynq-based interface and identify how I could use it to do the following:

  1. Raise an “in use” flag, which the client would use to decide whether to send a query.
  2. Directly wire the input to the output (basically, send the query back to the user).

The objective of this system is to build toward the multi-client FPGA response system described in our team report and design reports.
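
On the board, a first cut of those two behaviors can be plain Python running under the PYNQ image: raise the in-use flag when a query lands, echo the query back as the output, then clear the flag. This is only a sketch of the idea with placeholder paths, not the exact code on the board.

```python
import pathlib
import time

QUERY_FILE = pathlib.Path("/home/xilinx/query.txt")    # placeholder: where clients scp queries
OUTPUT_FILE = pathlib.Path("/home/xilinx/output.txt")  # placeholder: what clients scp back
IN_USE_FLAG = pathlib.Path("/home/xilinx/in_use")      # placeholder flag file

while True:
    if QUERY_FILE.exists():
        IN_USE_FLAG.touch()              # 1. raise the in-use flag
        text = QUERY_FILE.read_text()
        OUTPUT_FILE.write_text(text)     # 2. wire input straight to output (loopback)
        QUERY_FILE.unlink()
        IN_USE_FLAG.unlink()             # query serviced, flag lowered
    time.sleep(0.5)
```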

For the time being, the above goals seem to be accomplished, but we can’t really verify that until the complete system integration is done over the following week. Given that the blocks are relatively simple and have been well tested internally, we seem to be ahead of schedule on our project.

Andrew’s Status Report for March 15

After the FPGA finally arrived this week, I was able to become more productive in the team effort. At the start, I found out that the power supply used for the Ultra-96 isn’t compatible with the Kria, so I had to wait a few more days for the right power supply to arrive before I could begin testing. I then flashed the factory image and booted Linux on the FPGA. The Linux boot process was troublesome: the documentation wasn’t very clear, and I had to do quite a bit of debugging to control the FPGA over the UART port initially, connect the FPGA to the internet and update repos (an invalid-certificate error took some time), and then set up the Ubuntu boot image and build the PYNQ server onto the FPGA. In the end, the FPGA is set up as a server that can be accessed anywhere on campus over CMU Wi-Fi with the help of PYNQ, which also provides an easier way to set up the PS-to-PL connection.

I plan to do more with the FPGA next week: getting demos that read and write between the PS and PL, and setting up ssh connections for computer-to-FPGA transmissions. I will also start looking into the specific RTL modules that need to be designed to accelerate the BitNet text-autocomplete workload.
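
For those PS/PL read/write demos, PYNQ’s Overlay and MMIO classes are the usual route. Below is a sketch with a hypothetical bitstream name and register offsets, since our actual accelerator design isn’t built yet.

```python
from pynq import Overlay, MMIO

# Hypothetical bitstream and address map -- placeholders for our eventual design
overlay = Overlay("autocomplete.bit")   # loads the PL bitstream from the PS

CTRL_BASE = 0xA0000000    # hypothetical base address of an AXI-mapped register block
CTRL_RANGE = 0x1000

mmio = MMIO(CTRL_BASE, CTRL_RANGE)
mmio.write(0x0, 0x1)      # PS -> PL: write a control word at offset 0x0
status = mmio.read(0x4)   # PL -> PS: read a status word at offset 0x4
print(hex(status))
```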

Amelia’s Status Report for March 15

This week was a little slow for me, as I had a couple of midterms and other things I needed to focus on. My goals for this week were to look into pulling power data from the FPGA and to get some ideas about how to implement multiuser authentication. I did not have time to look into the multiuser side; however, I figured out how to pull power data from the FPGA and also developed a framework for obtaining power data for our accelerated hardware rather than for the softcore/everything else on the board.

To read power data I plan to use PYNQ, since its power management module lets us use built-in libraries to access power data from the PMIC on the board. I also looked into monitoring the data over time to track heavy use of our system. Most of my time was spent familiarizing myself with these libraries and waiting for our board to arrive so that I could play with the PYNQ tools once the board was booted. Another issue we ran into is that the PMIC reports power data for the entire board, which is higher than for just our accelerator. To get around this, we plan to measure power before synthesizing our design and then use the delta of power before and after as our power rating.
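
A sketch of what reading the PMIC through PYNQ could look like, using the pynq.pmbus helpers. Rail names vary by board, so the one below is a placeholder, and this isn’t wired into our scripts yet.

```python
import time
from pynq import pmbus

rails = pmbus.get_rails()          # PMBus rails the board exposes
print(list(rails.keys()))          # rail names differ between boards

rail = rails["VCCINT"]             # placeholder rail name
print("instantaneous power reading:", rail.power.value)

# Record power over time so we can watch for heavy use during an inference
recorder = pmbus.DataRecorder(rail.power)
with recorder.record(0.5):         # sample every 0.5 s
    time.sleep(10)                 # run an inference inside this window
print(recorder.frame.describe())   # pandas summary of the recorded samples
```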

I’m off track this week, so to get back on track next week I plan to pivot to getting a package ready for user testing (finalizing survey questions and making sure the repo is public). I also plan to implement a multiuser system with user logins. Finally, I will be available to help get tokens streaming on the FPGA.

team_status_report_March_15th

This week, we received our FPGA and the plug-in wiring needed to supply it with power.

In order to move forward according to our plans, we began with the following tasks in parallel:

  1. Booting Linux on the FPGA in order to start running the model on the embedded core.
  2. Extending the UI script to a multi-client, FPGA-based approach.
  3. Working out the UI testing system (developing the form and our exact method of analyzing the responses).

The details of how each task was accomplished are in the individual status reports.

At this time, we have accomplished the following:

  1. Sending a query to the FPGA.
  2. Booting Linux on the board.
  3. Finalizing the user study questions for the UI script.

Currently we are ahead of schedule on the technical fronts, but we are probably a bit behind schedule on the user study. For now, this is not that concerning, given that it is very easy to adjust the UI script based on user feedback, and we are improving it at this time for the FPGA extension anyway.

anirudhp_status_report_March_15th

This week, with the FPGA in place, I aimed to extend the UI and network script to the FPGA interface.

First, I extended our earlier hotkey interface to send the query to the board for inference completion. This was done via scp to keep the data movement secure as it travels from the laptop to the FPGA over the local network.
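
The transport itself is just an scp invocation from the hotkey handler. A sketch of how the Python side can hand the query to the board, with a placeholder user/host and drop path:

```python
import subprocess
import tempfile

FPGA_HOST = "xilinx@fpga.local"          # placeholder user/host
REMOTE_QUERY = "/home/xilinx/query.txt"  # placeholder drop path on the board

def send_query_to_fpga(query_text: str) -> None:
    """Write the query to a temp file and scp it to the FPGA over the local network."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(query_text)
        local_path = f.name
    subprocess.run(["scp", local_path, f"{FPGA_HOST}:{REMOTE_QUERY}"], check=True)
```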

At the moment, we have not yet managed to perform inference on the FPGA (accelerated or otherwise), so I have not yet been able to test the return of the file. However, I did look into a multi-client approach that runs entirely on the laptop side.

I pull a flag from the FPGA that indicates whether it is servicing a query, and I wrap this check in a loop so that the laptop does not send a query until the FPGA is free. As for the authentication system, we simply start the text of the query with a passphrase embedded in the script, which the FPGA uses to verify the user.
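
For the passphrase piece specifically, the idea is just to prefix the query text before it is sent. Here is a tiny sketch with a placeholder passphrase; the FPGA side would compare and strip the first line before running inference.

```python
PASSPHRASE = "change-me"  # placeholder; the real value is embedded in the client script

def wrap_query(query_text: str) -> str:
    """Client side: prefix the query with the shared passphrase so the FPGA can verify the sender."""
    return f"{PASSPHRASE}\n{query_text}"

def verify_and_strip(payload: str) -> str | None:
    """FPGA side: return the bare query if the passphrase matches, else None."""
    first_line, _, rest = payload.partition("\n")
    return rest if first_line == PASSPHRASE else None
```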

So far, even though we haven’t performed an actual inference on the board, we still seem to be ahead of schedule given that our multi-client approach has made significant progress.