# Texcelerate

Andrew Liao, Amelia Heller, Anirudh Prakash



## Use Case + Requirements

#### Problem: Sending data to the cloud can be risky

- Many users cannot use commercial AI copilots because they work with sensitive data

#### Solution: On-device text and code completion

- FPGA accelerator for faster and more power-efficient text & code completion
- UI that lets the user generate text in any text box on their Mac

#### **Requirements:**

| Throughput                                        | User Interface                                                    | General                                                      |
|---------------------------------------------------|-------------------------------------------------------------------|--------------------------------------------------------------|
| To achieve instantaneous generation, tokens per second > 10 | User should be able to choose whether or not to autocomplete text<br>Setup/installation should take the user less than 15 minutes | The system should support up to three wireless clients simultaneously |

## **Technical Design Requirements**

| Quantified Requirement                              | Justification                                                                                                                              |
|-----------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
| Power consumption < 700 mW                          | Baseline on a Mac: 600-700 mW on the CPU and 24-40 mW on the GPU                                                                           |
| Time to first token < 250 ms<br>Tokens / second > 8 | Mean text generation time for our model running on a Mac is 1.11-1.3 seconds. The target is below what the human eye perceives as instant. |
| Context Window > 100 tokens                         | Anything less than this will truncate details, leading to less coherent completions                                                        |
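
The timing targets in the table can be checked from per-token timestamps. The following is a minimal sketch, not the project's measurement code: the names `GenerationTiming` and `summarize_timing` are hypothetical, and it assumes throughput is defined as total tokens divided by the time from request to last token.

```python
from dataclasses import dataclass
from typing import Sequence

TTFT_LIMIT_S = 0.250      # time to first token < 250 ms (from the table)
MIN_TOKENS_PER_S = 8.0    # tokens / second > 8 (from the table)


@dataclass
class GenerationTiming:
    """Hypothetical container for the two timing metrics."""
    ttft_s: float
    tokens_per_s: float

    def meets_requirements(self) -> bool:
        # Both targets must hold at the same time.
        return self.ttft_s < TTFT_LIMIT_S and self.tokens_per_s > MIN_TOKENS_PER_S


def summarize_timing(request_time_s: float,
                     token_times_s: Sequence[float]) -> GenerationTiming:
    """Compute time to first token and throughput from token arrival times.

    Assumption: throughput = tokens received / (last token time - request time).
    """
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft = token_times_s[0] - request_time_s
    total = token_times_s[-1] - request_time_s
    tokens_per_s = len(token_times_s) / total if total > 0 else float("inf")
    return GenerationTiming(ttft_s=ttft, tokens_per_s=tokens_per_s)


if __name__ == "__main__":
    # Ten tokens, the first after 200 ms and one every 100 ms afterwards.
    timing = summarize_timing(0.0, [0.2 + 0.1 * i for i in range(10)])
    print(timing, timing.meets_requirements())
```

On the host side the timestamps could come from `time.perf_counter()`; on the FPGA they would come from whatever counters are synthesized onto the fabric.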

## **Design Tradeoffs**

#### FPGA

Picked the Ultra96v2 over the Kria KV260 because the tool flow for the Ultra96v2 is easier to use

Also chose it over the ZedBoard because the Ultra96v2 has more RAM and a hard core

#### Platform

Our UI only works on MacBooks because it relies on a macOS-specific tool to handle keyboard interrupts.

Not accessible to all users, but this mitigates the risk of a low-quality UI, since we are not software people.

#### Model

Considered a quantized DeepSeek model instead of our current model, but it is too difficult to prompt for text completion, and its chain-of-thought output doesn't help fulfill the use case requirements.

## **Solution Approach**



#### System Workflow:



### System Specification - Block Diagram



### **Implementation Plan**



## UI Demo



## **Testing: Verification & Validation**

| Requirement                                                          | Test Method                                                                                                                                                                | Success Criteria                                                                                                                                                                                                      |
|----------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| User can choose whether or not to accept generated text | Users: CMU community<br>Metric: frequency with which the system lets them override or accept text completions | Text completions are overridden as requested 100% of the time |
| User can download and install<br>system in < 15 minutes | Users: CMU community<br>Metric: how much time it takes users to download and install our system | 90% of users take < 15 minutes to install the system<br>The 10% who need more time should still be able to set up the system in < 25 minutes (this accounts for people who struggle to use technology) |
| At least three users can<br>connect to the accelerator<br>wirelessly | We will all connect to the FPGA and run queries against it simultaneously. | Output quality of the model is consistent across users, and power and timing requirements hold |
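
The installation success criteria combine two thresholds, which can be stated precisely in a few lines. This is a sketch only: `install_criteria_met` is a hypothetical helper, and the sample times are made up for illustration, not collected data.

```python
from typing import Sequence

TARGET_MINUTES = 15.0   # 90% of users should install in under this time
HARD_CAP_MINUTES = 25.0  # every user should install in under this time


def install_criteria_met(install_minutes: Sequence[float]) -> bool:
    """Return True if at least 90% of users took < 15 min and all took < 25 min."""
    if not install_minutes:
        raise ValueError("need at least one measured install time")
    within_target = sum(1 for m in install_minutes if m < TARGET_MINUTES)
    # Integer comparison avoids floating-point rounding at exactly 90%.
    enough_fast_users = within_target * 10 >= 9 * len(install_minutes)
    nobody_too_slow = all(m < HARD_CAP_MINUTES for m in install_minutes)
    return enough_fast_users and nobody_too_slow


if __name__ == "__main__":
    # Illustrative (made-up) install times in minutes for ten testers.
    times = [5, 8, 10, 12, 14, 9, 7, 11, 13, 20]
    print(install_criteria_met(times))
```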


| Requirement                                            | Test Method                                                                                                | Success Criteria                                    |
|--------------------------------------------------------|------------------------------------------------------------------------------------------------------------|-----------------------------------------------------|
| Latency lower and throughput<br>higher than CPU & GPU on a Mac | On Mac: Power and Timing<br>Profiler<br>On FPGA: counters<br>synthesized onto the fabric of<br>the FPGA | Tokens / second > 8<br>Time to first token < 250 ms |
| Power consumption less than<br>CPU & GPU on a Mac      | We will measure power<br>consumption by interfacing<br>with the PMIC on the<br>Ultra96v2 FPGA              | Power consumption < 700 mW                          |
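
Checking the power criterion amounts to averaging instantaneous power, P = V × I, over a run. The sketch below assumes the PMIC readings have already been collected as (volts, amps) pairs; how they are read from the Ultra96v2 is not shown, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

POWER_BUDGET_MW = 700.0  # power consumption < 700 mW (from the table)


def mean_power_mw(samples: Sequence[Tuple[float, float]]) -> float:
    """Average power in milliwatts over (volts, amps) samples, using P = V * I."""
    if not samples:
        raise ValueError("need at least one (volts, amps) sample")
    total_watts = sum(volts * amps for volts, amps in samples)
    return 1000.0 * total_watts / len(samples)


def meets_power_budget(samples: Sequence[Tuple[float, float]]) -> bool:
    """True if the mean power over the run stays under the 700 mW budget."""
    return mean_power_mw(samples) < POWER_BUDGET_MW


if __name__ == "__main__":
    # Illustrative (made-up) readings, not measurements from the board.
    readings = [(0.85, 0.5), (0.85, 0.7)]
    print(mean_power_mw(readings), meets_power_budget(readings))
```

Averaging over the whole generation run, rather than taking a single reading, keeps a brief spike from deciding the result either way.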

## Testing: Risks & Ethical Concerns

| Challenge                       | Mitigation Strategy                                                                                                       |
|---------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| Broken Wi-Fi on our FPGA        | We will try a wired UART connection to<br>get around this. Worst case, we<br>switch to the Kria KV260 FPGA                |
| Limited FPGA iteration speed    | We need to develop a synthesis flow. This<br>is complicated if we switch to the KV260,<br>which requires us to use Vitis  |
| Multi-user security concerns    | We need to justify allowing multiple users<br>to use the same hardware to run text<br>completion on sensitive information |
| Hallucinations & Biased Outputs | Our model scored a 30 on the TruthfulQA<br>benchmark and a 35.1 on the HellaSwag<br>benchmark                             |

### Schedule

#### **Texcelerate**

Gantt chart of the project schedule. Project start: Wed, 1/29/2025. Weekly columns run from Jan 27, 2025 through Apr 21, 2025, with columns for task, assigned to, progress, start, and end.

Task groups: Project Planning, ML Model, FPGA Acceleration, FPGA Interface and UI/UX.

Tasks listed:

- Decide on FPGA
- Decide UI/UX Structure
- Choose Target Model
- Decide Benchmark Softcores
- Test existing BitNet Model
- Select text model to quantize
- Quantize text model
- Modify inference code for CPU soft core deployment
- Modify inference code for GPU soft core deployment
- Modify inference code for FPGA deployment
- Implement Unified Performance Counter
- Synthesize CPU/GPU soft cores
- Decide FPGA Architecture + RTL
- Verify Texcelerate RTL
- Synthesize Texcelerate on FPGA
- Boot Linux on FPGA hard core
- FPGA to computer UART framework
- FPGA PS to PL communication framework
- Stream PMU metrics through UART
- UI/UX Software Interface
- Whole System Integrated testing