## Edge architectures for extreme efficiency





Brandon Lucia - Carnegie Mellon University - Challenges of Intermittent Computing

### Existing architectures are extremely inefficient

Instruction energy\* breakdown:



### **Extreme Edge Computing Goal**:

increase energy-efficiency and preserve programmability

# Where does all the energy go in existing computer architectures?

#### Something is fundamentally wrong here:

Instruction energy\* breakdown:

| Fetch/Decode (40-50%) | Register file (20%) | Other control | Useful<br>compute<br>(10%) |
|-----------------------|---------------------|---------------|----------------------------|
|-----------------------|---------------------|---------------|----------------------------|

ASICs/Accelerators would improve this, but forfeit programmability





### MANIC: Extreme Edge Vector-dataflow processor

- Reduce instruction supply energy + VRF energy
- Maintain high-degree of programmability to support future kernels



| Energy              |       |          |           |  |  |
|---------------------|-------|----------|-----------|--|--|
| Model               | Insns | RF Reads | RF Writes |  |  |
| Scalar              | Û     | Û        | ①         |  |  |
| Vector              | Ţ     | Û        | ①         |  |  |
| Vector-<br>Dataflow | Û     | Û        | Û         |  |  |





| Energy              |       |             |              |  |  |  |
|---------------------|-------|-------------|--------------|--|--|--|
| Model               | Insns | RF<br>Reads | RF<br>Writes |  |  |  |
| Scalar              | Û     | Û           | Û            |  |  |  |
| Vector              | Û     | Û           | Û            |  |  |  |
| Vector-<br>Dataflow | Û     | Û           | ſ            |  |  |  |







| Energy              |       |             |              |  |  |
|---------------------|-------|-------------|--------------|--|--|
| Model               | Insns | RF<br>Reads | RF<br>Writes |  |  |
| Scalar              | Û     | ①           | Û            |  |  |
| Vector              | Û     | Û           | Û            |  |  |
| Vector-<br>Dataflow | Û     | Û           | Ţ            |  |  |







| Energy              |       |             |              |  |  |
|---------------------|-------|-------------|--------------|--|--|
| Model               | Insns | RF<br>Reads | RF<br>Writes |  |  |
| Scalar              | Û     | Û           | Û            |  |  |
| Vector              | Û     | Û           | Û            |  |  |
| Vector-<br>Dataflow | Û     | Û           | Ţ            |  |  |





| Energy              |       |             |              |  |  |
|---------------------|-------|-------------|--------------|--|--|
| Model               | Insns | RF<br>Reads | RF<br>Writes |  |  |
| Scalar              | Û     | ①           | Û            |  |  |
| Vector              | Û     | Û           | Û            |  |  |
| Vector-<br>Dataflow | Û     | Û           | Û            |  |  |



Dataflow



• Energy wasted on instruction & data supply

| Memory         DCache access         Compute + Control |
|--------------------------------------------------------|
|--------------------------------------------------------|



Example Program
vload v0, &a
vmul v1, v0, v0
vadd v2, v1, v0
vstore &b, v2



| Energy              |       |             |              |  |  |
|---------------------|-------|-------------|--------------|--|--|
| Model               | Insns | RF<br>Reads | RF<br>Writes |  |  |
| Scalar              | Û     | ①           | Û            |  |  |
| Vector              | Û     | Û           | Û            |  |  |
| Vector-<br>Dataflow | Û     | Û           | Û            |  |  |

Dataflow

Control-flow





RF

Reads

俞

Û

RF

Writes

Û

Û



**Example Program** Dataflow vload: v[0] v[1] v[2] vload vload v0, &a Vector Register vmul v1, v0, v0 vmul vmul: v[0] v[1] v[2] vadd v2, v1, v0 vstore &b, v2 vadd Vector Register vadd: v[0] v[2] v[1] vstore Vector Register vstore: v[2] v[1 v[0

| Energy              |       |             |              |  |  |
|---------------------|-------|-------------|--------------|--|--|
| Model               | Insns | RF<br>Reads | RF<br>Writes |  |  |
| Scalar              | Û     | ①           | Û            |  |  |
| Vector              | Û     | ①           | Û            |  |  |
| Vector-<br>Dataflow | Û     | Û           | Ţ            |  |  |

Control-flow

Dataflow







| Memory                                                                                                | nory DCache access ICache access |  |  |  |  | pute + Control |
|-------------------------------------------------------------------------------------------------------|----------------------------------|--|--|--|--|----------------|
| Vector                                                                                                |                                  |  |  |  |  |                |
| Memory         DCache access         ICache         Compute +<br>Control         Vector register file |                                  |  |  |  |  |                |



















vload v0, &a vmul v1, v0, v0 vadd v2, v1, v0 vstore &b, v2









## Vector-dataflow reduces energy without costing programmability

- Vector-dataflow execution
  - Vector execution reduces instructions fetched
  - Dataflow execution eliminates VRF reads
- Software support to eliminate VRF writes



## MANIC is an energy-minimal computer architecture implementing vector-dataflow

#### Block diagram



**Implementation Characteristics** 

- Complete standalone system
- Scalar, Vector & MANIC designs
- Intel 22nm bulk FinFet (HVT)
- Embedded MRAM
- SRAM, logic, MRAM power isolated

### **Evaluating MANIC's efficiency in a silicon prototype**



Intel 22nm FinFET 8 metal layers, MANIC + Vector + Scalar, 256kB MRAM + 64kB SRAM

**Evaluation Goals:** Energy characterization of first ever vector-dataflow chip.

**Key Result:** Power low & efficiency high enough to run on tiny solar panel indoors



Operational Characteristics: Frequency: 4-50MHz Voltage:

Power: 19.1uW

 IHz
 Voltage: 0.4-1.0V

 Efficiency: 256 GOPS/W

**Carnegie Mellon University** Electrical & Computer Engineering





### Can we do even better?

Let's eliminate all instruction control & caching costs!





- Processing elements (PE) connected by Network-on-Chip (NoC)
  - Heterogenous PE capability
  - Connections configured by software compiler





- Collection of processing elements (PE) connected via NoC
  - Configure PE once, use many times: **no instruction fetch/control costs**
  - Data move directly PE to PE: no RF/VRF/Cache costs
  - Stream data through fabric: Reduced memory costs



Nearly all energy for actually useful computation!



### CGRAs provide efficiency & programmability

**Dataflow** compiler support avoids the need for programmer acrobatics



### **Dataflow Architecture**

LABORATORY FOR COMPUTER SCIENCE



MASSACHUSETTS INSTITUTE OF

TECHNOLOGY

The Varieties of Data Flow Computers

Computation Structures Group Memo 183-1 August 1979 Revised December 1979

Jack B. Dennis

### Dataflow Architecture: Dataflow Program Graphs



### Dataflow: "Activity Template" implementation



### Dataflow: Processing Element & Interconnect Arch.



**Processing Element Architecture** 



**Processing Element Interconnection Architecture** 

Question: what do we need to specify in this ISA?

## Dataflow: MIT Dataflow Architecture



Fig. 14. MIT data flow processor.



Fig. 15. Practical form of the MIT architecture.

What is the main difference in this architecture versus the

"basic" architecture on the previous slide

### Dataflow: The Riptide Ordered Dataflow Machine



## Dataflow: The Riptide ISA

| Operator(s)             | Category        | Symbol(s)          | Semantics                                   |
|-------------------------|-----------------|--------------------|---------------------------------------------|
| Basic binary ops        | Arithmetic      | +, -, <<, !=, etc. | a op b                                      |
| Multiply, clip          | Multiplier      | *, clip            | $a  \operatorname{op}  b$                   |
| Load                    | Memory          | ld                 | ld <i>base</i> , <i>idx</i> (, <i>dep</i> ) |
| Store                   | Memory          | st                 | st base, idx, val(, dep)                    |
| Select                  | Control Flow    | sel                | cond ? val0 : val1                          |
| Steer, carry, invariant | Control Flow    | (T   F), C, I      | See Fig. 3                                  |
| Merge, order            | Synchronization | М, О               | See Fig. 3                                  |
| Stream                  | Stream          | STR                | See Fig. 3                                  |

Carry Invariant(carry w/ carry = out) ! True steer ! False steer State machine State machine D: Pop D Pop A, D B if(D) out = Aif(!D) out = APop D, B Pop D if(state == I) out = A if(state == I) out = A else if(D) out = B else if(D) out = A Order Stream Merge bound start step for(idx = start; idx [<,>,==,...] bound; Str idx = idx [+,>>,<<] step) D-0 last = !(idx [<,>,==,...] bound) last if(D) out = A valid(A) && valid(B) idx else if(!D) out = B out = B

What program constructs do the **carry** and **invariant** ISA ops support?

What does the **order** ISA op do?

What program construct(s) does the **stream** ISA op support?

## Background: What is a Dataflow Machine?

(a) data link

A Preliminary Architecture for a Basic Data-Flow Processor

Jack B. Dennis and David P. Misunas Project MAC Massachusetts Institute of Technology

An example of a program in the elementary data-flow language is shown in Figure 1 and represents the following simple computation:



Figure 1. An elementary data - flow program

Figure 6. Links of the basic data-flow language.

(b) control link



Figure 7. Actors of the basic data-flow language.



Figure 2. Organization of the elementary data-flow processor.

# Dataflow ISA matches CGRA Architecture

**Dataflow** compiler efficiently targets reconfigurable dataflow architecture!



42

### Intermediate representation for dataflow compilation



### Steering Sends Values Only Where They Are Needed



### Control-flow Operators Handle Loop-Carried Dependences



### Memory Ordering Gates Maintain Memory Consistency



### Memory Ordering Reduction Analysis



Existing memory ordering analyses rely on *transitive reduction* of ordering graph

Dataflow ordering reduction requires *path sensitivity* or cuts required orders.

> **Carnegie Mellon University** Electrical & Computer Engineering

### Stream Gates Optimize Patterned Address Computation



### End-to-end compilation flow



**Carnegie Mellon University** Electrical & Computer Engineering

# Energy-Minimal Network-on-Chip Tricks: No Buffers & Control-Flow in NoC

✓ Multi-hop, bufferless NoC

V.

Buffer on every link



Duplicates data in multiple buffers

Buffers @ producer



Allows broadcast to multiple consumers w/out duplication

#### Control-flow in the NoC

Control-flow operations are numerous, but simple
 → Wasteful to assign to PEs



# Riptide vs. COTS Extreme Edge MCUs







Speedup









## That was "Ordered Dataflow"



Axiom: Tokens proceed through the graph in the order of their generation

How do we ensure that tokens flow through the dataflow graph in order?

# What about allowing token reordering?

(Memory) Latency: when can I expect my value to come back? Id t=?



"Tagged-token dataflow architectures"

Two issues: Latency & Synchronization

Latency: time between when operation is issued and when completes

Synchronization: need to assure data properly written before read





Network

144 To

<PE:(FP.IP), V>

32 <Opcode, r, s>

Opcode, s

24

Opcode, s

24

72

.......

ALU

(3 Stages)

............

[FP+r] 72

[FP+r]

next

Instruction

Memory

Presence

Bits

Memory

Frame

Store

144

Instruction

Fetch

Effective

Address

Presence

Bits

FP+r

FP+1

Frame Store

Operation

pcode.

Form Token

Multiplexer

144

144

72

IP+1

### **Tagged-token Dataflow Architecture**



I-fetch send rd tok w/ addr+continuation *P: read & run; A/W: queue I-store: send wr tok to populate table A/W are non-blocking (why?) I-allocate: make storage for fetch/stores* 

Fig. 18. A processing element. **Token matching:** synchronization of out-of-order inputs, using i-structs

To Communication Network

 $\mathbf{\Omega}$ 

Output

From Communication Network

## Parallelism is a resource congestion problem



Synchronization: which value should I use? Many options accumulating over time

### Varieties of Dataflow Execution

**Figure 3:** A running example: dense matrix-vector multiplication. We zoom in on the innermost loop body (green box) to demonstrate dataflow execution.







Figure 4: Partial execution trace of dmv.

### Varieties of Dataflow Execution

**Figure 5:** Execution traces for dmv on prior architectures. Trace width indicates execution time, and trace height indicates parallelism. The number of black edges along a vertical cut is the number of live values, which increases in proportion to parallelism. Red edges are additional orderings enforced by the architecture, according to its token synchronization scheme, that limit parallelism.



(d) Ordered dataflow (e.g., RipTide [29]).

(a) Sequential von Neumann (execution continues past the edge of the page).

(e) Unordered tagged dataflow (e.g., TTDA [52]).

sense, vN's sequential ordering restricts execution to a "depthfirst" traversal of a program's full dynamic execution graph (Fig. 1). The result is minimal parallelism, so that program execution takes a long time (the graph is wide) but live state is minimized (the graph is short).

Parallelism can be increased by having multiple vN execution streams, i.e., multithreading. Token synchronization now includes a token's thread id, in addition to its vN ordering. Multithreading punts all of the problems of parallelism.



(f) Data-parallel (e.g., vector, GPU).