





# **Preview**

- Strided access & performance
- Techniques to reduce bank conflicts on interleaved memory
- "Exotic" DRAM technology
  - EDO DRAM
  - SDRAM
  - Cached DRAM
  - Rambus
- Titan memory subsystem example
  - Mini-supercomputer memory subsystem

# **Footnote: C vs. Fortran Array Organization**

- C is row-major, 0-based for arrays
- Memory layout of 4x4 array
  - Access all elements in ROW sequentially (row-by-row storage)
- Hennessy & Patterson are C-based; workstation tradition

[0,0] [0,1] [0,2] [0,3]<br>
[1,0] [1,1] [1,2] [1,3]<br>
[2,0] [2,1] [2,2] [2,3]<br>
[3,0] [3,1] [3,2] [3,3]

- Fortran is column-major, 1-based for arrays
- Memory layout of 4x4 array
  - Access all elements in COLUMN sequentially (column-by-column storage)
- Cragon is Fortran-based; mainframe & supercomputer tradition

(1,1) (2,1) (3,1) (4,1)<br>
(1,2) (2,2) (3,2) (4,2)<br>
(1,3) (2,3) (3,3) (4,3)<br>
(1,4) (2,4) (3,4) (4,4)
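The two layouts differ only in which index varies fastest in memory. A minimal Python sketch of the offset arithmetic follows; the function names are illustrative, and both functions take 0-based indices for simplicity (subtract 1 from Fortran's 1-based indices before calling).

```python
def row_major_offset(i, j, n_rows, n_cols):
    """Element offset of [i, j] in a C-style (row-major) array, 0-based."""
    return i * n_cols + j


def col_major_offset(i, j, n_rows, n_cols):
    """Element offset of (i, j) in a Fortran-style (column-major) array.

    Indices are 0-based here, unlike Fortran's native 1-based indexing.
    """
    return j * n_rows + i


# Walking along one row (fixed i, varying j) of a 4x4 array:
row_in_c = [row_major_offset(1, j, 4, 4) for j in range(4)]        # stride 1
row_in_fortran = [col_major_offset(1, j, 4, 4) for j in range(4)]  # stride 4
```

Walking a row is sequential in the C layout but has a stride equal to the column length in the Fortran layout; walking a column is the reverse.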






# **Strided Access**

- Strided access with stride k means touching every kth memory element
  - Stride = 1 is sequential access (0, 1, 2, 3, 4, 5, 6, ...)
  - Stride = 2 is (0, 2, 4, 6, 8, ...)
  - Stride = k is (0, k, 2k, 3k, 4k, ...)
- Strides > 1 commonly found in multidimensional data
  - Row accesses (stride=N) & diagonal accesses (stride=N+1)
  - Scientific computing (e.g., matrix multiplication)
  - Image processing (image rows and columns)
  - Radar/Sonar processing (angle vs. elevation)
  - In many cases arrays are a power of 2 size, promoting bank conflicts
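To see why power-of-2 array sizes promote bank conflicts, the sketch below lists which banks a strided stream touches. It assumes low-order interleaving (element address a lives in bank a mod n_banks), which this slide does not state explicitly; the function names are illustrative.

```python
from math import gcd


def banks_touched(stride, n_banks, n_accesses):
    """Sorted list of distinct banks hit by addresses 0, stride, 2*stride, ...

    Assumes low-order interleaving: address a maps to bank a % n_banks.
    """
    return sorted({(i * stride) % n_banks for i in range(n_accesses)})


def distinct_bank_count(stride, n_banks):
    """Number of banks a long strided stream can ever reach."""
    return n_banks // gcd(stride, n_banks)
```

With 8 banks, `banks_touched(1, 8, 16)` reaches all eight banks, `banks_touched(2, 8, 16)` reaches only banks 0, 2, 4, 6, and `banks_touched(8, 8, 16)` hits bank 0 every time. A diagonal walk of an 8-wide array (stride 9) again reaches all eight banks.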



# **SW Technique: Array Size Change**

- Software/compiler solution
- Allocate array size relatively prime to number of memory banks
  - Before: foo(SIZE, SIZE)
  - After: foo(SIZE, SIZE+1)
- Row and column accesses have no conflicts
- Diagonal access uses only 2 banks of 4

Array after the size change (original 4x4 elements in bold):

**1,1 1,2 1,3 1,4** 1,5<br>
**2,1 2,2 2,3 2,4** 2,5<br>
**3,1 3,2 3,3 3,4** 3,5<br>
**4,1 4,2 4,3 4,4** 4,5

Resulting layout across four memory banks:

| Bank row | M0  | M1  | M2  | M3  |
|----------|-----|-----|-----|-----|
| 0        | 1,1 | 2,1 | 3,1 | 4,1 |
| 1        | 5,1 | 1,2 | 2,2 | 3,2 |
| 2        | 4,2 | 5,2 | 1,3 | 2,3 |
| 3        | 3,3 | 4,3 | 5,3 | 1,4 |
| 4        | 2,4 | 3,4 | 4,4 | 5,4 |
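The bank assignments on this slide can be checked with a short sketch. It assumes column-major storage, 1-based indices, low-order interleaving over 4 banks, and a leading (fastest-varying) dimension of 5 after padding, which is what the bank layout on the slide implies; the function names are illustrative.

```python
N_BANKS = 4


def bank_of(i, j, leading_dim):
    """Bank holding element (i, j): 1-based, column-major, low-order interleave."""
    offset = (j - 1) * leading_dim + (i - 1)
    return offset % N_BANKS


def banks_for(pattern, leading_dim, size=4):
    """Banks touched, in order, by a row, column, or diagonal walk."""
    if pattern == "row":
        elems = [(1, j) for j in range(1, size + 1)]
    elif pattern == "column":
        elems = [(i, 1) for i in range(1, size + 1)]
    elif pattern == "diagonal":
        elems = [(k, k) for k in range(1, size + 1)]
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    return [bank_of(i, j, leading_dim) for i, j in elems]
```

With a leading dimension of 4, `banks_for("row", 4)` lands on bank 0 four times. With a leading dimension of 5, rows and columns each spread over all four banks, while `banks_for("diagonal", 5)` alternates between banks 0 and 2 because the diagonal stride of 6 shares a factor of 2 with the bank count.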



# **Prime Number Interleaving**

- Conflicts happen when the stride is an exact multiple (or divisor) of the interleave factor; more generally, whenever the two share a common factor
  - A power-of-2 number of banks is easy to build -- uses low order bits for bank number
  - Power of 2 is a common array size
  - 2 divides into all even numbers (and people naturally use even numbers)
- Prime number interleave reduces possibility for conflicts
  - Good tricks available for 2<sup>n</sup>+1 and 2<sup>n</sup>-1 banks
  - Burroughs Scientific Processor (BSP) used interleaving with m=17
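A small sketch of both points follows. The first function counts how many banks a stride can reach, comparing 16 banks with 17. The second shows one well-known way to take an address modulo 2<sup>n</sup>-1 without a divider (fold the address in n-bit chunks, since 2<sup>n</sup> is congruent to 1). The slide does not say which tricks it has in mind, and nothing here is claimed to be the BSP's actual hardware method; the function names are illustrative.

```python
from math import gcd


def distinct_banks(stride, n_banks):
    """Number of banks reachable by a long stream with the given stride."""
    return n_banks // gcd(stride, n_banks)


def mod_mersenne(addr, n):
    """Compute addr % (2**n - 1) using only shifts, masks, and adds.

    Works because 2**n == 1 (mod 2**n - 1), so the n-bit chunks of the
    address can simply be summed, like an end-around-carry adder.
    """
    mask = (1 << n) - 1
    while addr > mask:
        addr = (addr & mask) + (addr >> n)
    return 0 if addr == mask else addr
```

For example, a stride of 16 reaches only 1 of 16 banks but all 17 of 17 banks, and a stride of 8 reaches 2 of 16 banks but again all 17. A prime bank count only conflicts when the stride is a multiple of that prime, such as a stride of 17 on 17 banks.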



# **Superinterleaving**

- Assume that there are *m* memory bus cycles per module cycle time
  - *e.g.*, if memory cycle time is 4 bus clocks, *m*=4
- "Normal" interleaving has *n* banks, with $n \le m$
  - In best case, such as sequential access, all banks can be busy (n=m)

- Superinterleaving has *n* > *m* banks
  - In other words, there are more banks than can possibly be active at once; extra banks don't help with raw bandwidth ability
  - Used to reduce chance of conflict (may be less likely that stride will be a multiple of *n* than a multiple of *m*, since *n* is larger)
  - Example: 8 memory banks on a bus that can only keep 4 banks busy
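The effect of the extra banks can be illustrated with a toy timing model. The model is an assumption made for this sketch, not something given on the slide: the bus issues at most one access per bus cycle, in order; each access occupies its bank for m bus cycles; and banks are selected by address mod n.

```python
def cycles_for_accesses(stride, n_banks, bank_busy, count):
    """Bus cycles until the last of `count` strided accesses completes.

    Toy model: in-order issue, at most one access started per bus cycle,
    each access keeps its bank busy for `bank_busy` cycles, and address a
    maps to bank a % n_banks.
    """
    bank_free = [0] * n_banks   # cycle at which each bank is next available
    next_issue = 0              # earliest cycle the bus can start an access
    finish = 0
    for i in range(count):
        bank = (i * stride) % n_banks
        start = max(next_issue, bank_free[bank])
        bank_free[bank] = start + bank_busy
        next_issue = start + 1
        finish = start + bank_busy
    return finish
```

With m=4 and 16 accesses, stride 1 on 4 banks finishes in 19 cycles (full rate), while stride 4 on 4 banks takes 64 cycles because every access hits the same bank. Doubling to 8 banks does not speed up the stride-1 case, but stride 4 drops to 33 cycles and stride 2 runs at the full 19-cycle rate.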















# **Rambus DRAM**

- Proprietary byte-wide bi-directional bus
  - Reduced voltage swing (600 mV swing centered at 2.2 V) for speed
  - 500-600 MB/sec per channel; high bandwidth with small number of chips
  - Two or four 1KByte or 2KByte sense amplifiers used as high speed caches
  - Interleaving among the banks
- Very high bandwidth
  - But, high latency -- minimum size 8 byte transfer



# **Other Special DRAM architectures**

- Video DRAM
  - Has shift-out register to feed bits to video display
- Cached DRAM
  - Traditional cache structure on DRAM in addition to row-oriented buffering
  - "Smart" prefetching logic on DRAM chip to anticipate accesses (DRAM-based stream buffer mechanism?)
- IRAM -- Intelligent RAM
  - Parallel processing with a processor on each DRAM chip
- Real-world stumbling blocks
  - Process technology differences between DRAM and logic/SRAM fabrication techniques
  - Processor+memory on-chip is limited in memory size, no matter how big DRAM chips get





# **Titan Mini-Supercomputer**

- "Single-user supercomputer" -- significant fraction of supercomputer performance at a high-end workstation price
  - Design started 1986. Company name: Dana -> Ardent -> Stardent
  - Up to 4 processors (integer/vector pairs)
  - 16 MFLOPS (single processor) peak for ~\$100,000 in 1988
- Design based on traditional supercomputer approach
  - Gate arrays used for vector floating point unit
  - MIPS R2000 used for integer control processor
  - Hardware support for 3-D graphics



(Siewiorek & Koopman Plate 1) FIGURE 1. The Stardent 1500 "Titan" series graphics supercomputer: system overview. Copyright © 1988, Stardent Computer Inc.



# **Wide-Word Interleaving via Page Mode**

- Each memory bank 32 bits wide
- 64-bit words read using 2-clock page mode access

1st: ADDRESS CONTROL ACCESS DATA LO DATA HI PRECHARGE<br>
2nd: ADDRESS CONTROL ACCESS DATA LO DATA HI PRECHARGE<br>
3rd: ADDRESS CONTROL ACCESS DATA LO DATA HI PRECHARGE<br>
Time →

- Cuts cost for interleaving in half with small performance hit
  - Data paths through crossbar half as wide
  - Minimum number of chips required in system is halved
  - Still provides 64 bits of data per bus clock (low & high from different interleaves)
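The claim that 32-bit banks can still deliver 64 bits per bus clock can be sketched as a pipeline count. Two things are assumed here because the timing diagram does not survive in enough detail to confirm them: each of the six phases takes one bus clock, and successive accesses (to different interleaves) start one clock apart. The names are illustrative.

```python
PHASES = ["ADDRESS", "CONTROL", "ACCESS", "DATA LO", "DATA HI", "PRECHARGE"]


def bits_per_clock(n_accesses, half_width=32):
    """Data bits delivered on each bus clock for staggered 64-bit reads.

    Assumes one clock per phase and a one-clock stagger between accesses.
    Each access delivers `half_width` bits in DATA LO and again in DATA HI.
    """
    total_clocks = n_accesses + len(PHASES) - 1
    bits = [0] * total_clocks
    for access in range(n_accesses):
        for phase_index, phase in enumerate(PHASES):
            if phase.startswith("DATA"):
                bits[access + phase_index] += half_width
    return bits
```

Under these assumptions, `bits_per_clock(4)` gives 32 bits on the first and last data clocks and 64 bits on every clock in between: the low half of one access overlaps the high half of the previous one, at the cost of one extra clock of latency per access.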



# **Titan Memory Tradeoffs**

- Wide-word interleaving for cost savings
  - Interleave data paths only 32 bits, but supports 64-bit accesses
  - 1 clock access latency penalty
  - Doubles number of interleaves available at comparable cost
- Interleave expansion with memory expansion
  - Adding second memory board supports 16-way interleaving
- Dual bus access to main memory
  - Required for balanced memory bandwidth (discussed later)
  - Interleaved memory used to (usually) support dual access with single-ported DRAM
  - Interleaved memory used to permit streaming results to processors
- 256K x 4 DRAMs used to reduce minimum memory size over 1M x 1 DRAMs
- Provides atomic access primitive in memory subsystem
  - Fetch and increment-if-negative
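The DRAM-organization point is chip-count arithmetic: a bank must be as wide as its data path, so wider chips fill that width with fewer chips and less total capacity. A rough sketch, assuming a 32-bit-wide bank built from a single rank of chips and ignoring any parity or ECC bits (neither detail is given on the slide); the function name is illustrative.

```python
def min_bank(bank_width_bits, chip_depth, chip_width_bits):
    """Return (chip count, capacity in bytes) for the smallest possible bank.

    The smallest bank is one rank: just enough chips side by side to
    cover the bank's data width. Parity/ECC bits are ignored.
    """
    chips = bank_width_bits // chip_width_bits
    capacity_bits = chips * chip_depth * chip_width_bits
    return chips, capacity_bits // 8


one_by_one = min_bank(32, 1024 * 1024, 1)   # 1M x 1 parts
by_four = min_bank(32, 256 * 1024, 4)       # 256K x 4 parts
```

Both parts hold 1 Mbit, but a 32-bit bank needs 32 of the 1M x 1 chips (4 MB minimum) versus 8 of the 256K x 4 chips (1 MB minimum). The difference is multiplied by the number of interleaves on a board.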





# **Review**

- Interleaved memory access
  - Helps with latency by hiding refresh time & reducing access conflicts
  - Multiple banks can provide multiple concurrent accesses
- Strided memory access
  - Strided array accesses might not be evenly distributed among banks
    - Software solution -- rearrange access patterns
    - Hardware solution -- make common strides access different banks
- "Exotic" DRAM technology
  - Application of general architecture techniques within a single DRAM chip
- Titan as an example memory subsystem
  - High bandwidth, but with some cost-cutting tricks