## 18-740/640 Computer Architecture Lecture 14: Memory Resource Management I

Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 10/26/2015

## Required Readings

#### Required Reading Assignment:

- Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.

#### Recommended References:

- Seshadri et al., "The Evicted-Address Filter: A Unified Mechanism to Address both Cache Pollution and Thrashing," PACT 2012.
- Pekhimenko et al., "Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches," PACT 2012.
- Lin et al., "Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems," HPCA 2008.

# Shared Resource Design for Multi-Core Systems

## Memory System: A *Shared Resource* View



## Resource Sharing Concept

- Idea: Instead of dedicating a hardware resource to a hardware context, allow multiple contexts to use it
  - Example resources: functional units, pipeline, caches, buses, memory
- Why?
- + Resource sharing improves utilization/efficiency → throughput
  - When a resource is left idle by one thread, another thread can use it; no need to replicate shared data
- + Reduces communication latency
  - For example, shared data kept in the same cache in SMT processors
- + Compatible with the shared memory model

## Resource Sharing Disadvantages

- Resource sharing results in contention for resources
  - When the resource is not idle, another thread cannot use it
  - If space is occupied by one thread, another thread needs to reoccupy it
- Sometimes reduces each or some thread's performance
  - Thread performance can be worse than when it is run alone
- Eliminates performance isolation → inconsistent performance across runs
  - Thread performance depends on co-executing threads
- Uncontrolled (free-for-all) sharing degrades QoS
  - Causes unfairness, starvation

Need to efficiently and fairly utilize shared resources

## Example: Problem with Shared Caches



Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.

## Need for QoS and Shared Resource Mgmt.

- Why is unpredictable performance (or lack of QoS) bad?
- Makes programmer's life difficult
  - An optimized program can get low performance (and performance varies widely depending on co-runners)
- Causes discomfort to user
  - An important program can starve
  - Examples from shared software resources
- Makes system management difficult
  - How do we enforce a Service Level Agreement when the sharing of hardware resources is uncontrollable?

## Resource Sharing vs. Partitioning

- Sharing improves throughput
  - Better utilization of space
- Partitioning provides performance isolation (predictable performance)
  - Dedicated space
- Can we get the benefits of both?
- Idea: Design shared resources such that they are efficiently utilized, controllable and partitionable
  - No wasted resource + QoS mechanisms for threads

## Shared Hardware Resources

- Memory subsystem (in both Multi-threaded and Multi-core)
  - Non-private caches
  - Interconnects
  - Memory controllers, buses, banks
- I/O subsystem (in both Multi-threaded and Multi-core)
  - I/O, DMA controllers
  - Ethernet controllers
- Processor (in Multi-threaded)
  - Pipeline resources
  - L1 caches

## Multi-core Issues in Caching

- How does the cache hierarchy change in a multi-core system?
- Private cache: Cache belongs to one core (a shared block can be in multiple caches)
- Shared cache: Cache is shared by multiple cores



## Shared Caches Between Cores

#### Advantages:

- High effective capacity
- Dynamic partitioning of available cache space
  - No fragmentation due to static partitioning
- Easier to maintain coherence (a cache block is in a single location)
- Shared data and locks do not ping pong between caches

#### Disadvantages

- Slower access
- Cores incur conflict misses due to other cores' accesses
  - Misses due to inter-core interference
  - Some cores can destroy the hit rate of other cores
- Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)

## Shared Caches: How to Share?

- Free-for-all sharing
  - Placement/replacement policies are the same as a single core system (usually LRU or pseudo-LRU)
  - Not thread/application aware
  - An incoming block evicts a block regardless of which threads the blocks belong to
- Problems
  - Inefficient utilization of cache: LRU is not the best policy
  - A cache-unfriendly application can destroy the performance of a cache friendly application
  - Not all applications benefit equally from the same amount of cache: free-for-all might prioritize those that do not benefit
  - Reduced performance, reduced fairness

## Handling Shared Caches

#### Controlled cache sharing

- Approach 1: Design shared caches but control the amount of cache allocated to different cores
- Approach 2: Design "private" caches but spill/receive data from one cache to another

#### More efficient cache utilization

- Minimize the wasted cache space
  - by keeping out useless blocks
  - by keeping in cache blocks that have maximum benefit
  - by minimizing redundant data

## Controlled Cache Sharing: Examples

#### Utility based cache partitioning

- Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
- Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.

#### Fair cache partitioning

- Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.

#### Shared/private mixed cache mechanisms

- Qureshi, "Adaptive Spill-Receive for Robust High-Performance Caching in CMPs," HPCA 2009.
- Hardavellas et al., "Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches," ISCA 2009.

## Efficient Cache Utilization: Examples

- Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2005.
- Seshadri et al., "The Evicted-Address Filter: A Unified Mechanism to Address both Cache Pollution and Thrashing," PACT 2012.
- Pekhimenko et al., "Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches," PACT 2012.

# Controlled Shared Caching

# Hardware-Based Cache Partitioning

## Utility Based Shared Cache Partitioning

- Goal: Maximize system throughput
- Observation: Not all threads/applications benefit equally from caching → simple LRU replacement not good for system throughput
- Idea: Allocate more cache space to applications that obtain the most benefit from more space
- The high-level idea can be applied to other shared resources as well.
- Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
- Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.

## Marginal Utility of a Cache Way



### Utility Based Shared Cache Partitioning Motivation



## Utility Based Cache Partitioning (III)



Three components:

- Utility Monitors (UMON) per core
- Partitioning Algorithm (PA)
- Replacement support to enforce partitions

## Utility Monitors

- For each core, simulate the LRU policy using an auxiliary tag directory (ATD)
- Hit counters in the ATD count hits per recency position
- LRU is a stack algorithm: hit counts → utility. E.g., hits(2 ways) = H0 + H1







Figure 4. (a) Hit counters for each recency position. (b) Example of how utility information can be tracked with stack property.
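The stack property above can be illustrated with a small sketch of one monitored set. The class name `UtilityMonitor` and its methods are hypothetical, and a true-LRU stack is assumed; a real UMON would do this in hardware over sampled sets.

```python
class UtilityMonitor:
    """One monitored set: an LRU stack plus a hit counter per recency position."""

    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.stack = []                      # index 0 = MRU, last = LRU
        self.hit_counters = [0] * num_ways   # H0 (MRU) ... H(num_ways-1) (LRU)

    def access(self, tag):
        if tag in self.stack:
            position = self.stack.index(tag)
            self.hit_counters[position] += 1  # hit at this recency position
            self.stack.pop(position)
        elif len(self.stack) == self.num_ways:
            self.stack.pop()                  # miss in a full set: evict LRU
        self.stack.insert(0, tag)             # accessed block becomes MRU

    def hits_with(self, ways):
        """Hits this access stream would have had with only `ways` ways."""
        return sum(self.hit_counters[:ways])


umon = UtilityMonitor(num_ways=4)
for tag in ["A", "A", "B", "A", "C", "B"]:
    umon.access(tag)

print(umon.hit_counters)   # [1, 1, 1, 0]
print(umon.hits_with(2))   # H0 + H1 = 2
```

Because LRU is a stack algorithm, one pass over the access stream yields the hit count for every possible allocation at once.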

## Dynamic Set Sampling

- Extra tags incur hardware and power overhead
- Dynamic Set Sampling reduces overhead [Qureshi, ISCA'06]
- 32 sets sufficient (analytical bounds)
- Storage < 2kB/UMON



## Partitioning Algorithm

- Evaluate all possible partitions and select the best
- With $a$ ways to core1 and $(16-a)$ ways to core2:
  - $Hits_{core1} = H_0 + H_1 + \dots + H_{a-1}$ (from UMON1)
  - $Hits_{core2} = H_0 + H_1 + \dots + H_{16-a-1}$ (from UMON2)
- Select $a$ that maximizes $Hits_{core1} + Hits_{core2}$
- Partitioning done once every 5 million cycles
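The exhaustive two-core search can be sketched as follows. The function name and the example counter values are hypothetical, and the sketch assumes each core receives at least one way.

```python
def best_two_core_partition(hits_core1, hits_core2, total_ways=16):
    """Try every split (a, total_ways - a) and keep the one with the most hits.

    hits_coreN[i] is the UMON hit counter H_i for recency position i.
    Assumption of this sketch: each core gets at least one way.
    """
    best_a, best_hits = None, -1
    for a in range(1, total_ways):
        total = sum(hits_core1[:a]) + sum(hits_core2[:total_ways - a])
        if total > best_hits:
            best_a, best_hits = a, total
    return best_a, best_hits


# Hypothetical counters for a 4-way cache, to keep the arithmetic small.
umon1 = [50, 30, 5, 1]
umon2 = [20, 18, 16, 2]
print(best_two_core_partition(umon1, umon2, total_ways=4))  # (2, 118)
```

With two cores the loop runs once per candidate split, so the cost grows only linearly with the number of ways.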

Way partitioning support: [Suh+ HPCA'02, Iyer ICS'04]

1. Each line has core-id bits
2. On a miss, count ways_occupied in the set by the miss-causing app



## Performance Metrics

Three metrics for performance:

1. Weighted Speedup (default metric)
   - → perf = IPC<sub>1</sub>/SingleIPC<sub>1</sub> + IPC<sub>2</sub>/SingleIPC<sub>2</sub>
   - → correlates with reduction in execution time
2. Throughput
   - → perf = IPC<sub>1</sub> + IPC<sub>2</sub>
   - → can be unfair to low-IPC application
3. Hmean-fairness
   - → perf = hmean(IPC<sub>1</sub>/SingleIPC<sub>1</sub>, IPC<sub>2</sub>/SingleIPC<sub>2</sub>)
   - → balances fairness and performance
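The three metrics can be computed directly from per-application IPC values. This is a minimal sketch; the function names and the sample IPC numbers are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-application speedups relative to running alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def throughput(ipc_shared):
    """Sum of raw IPCs; favors applications with naturally high IPC."""
    return sum(ipc_shared)


def hmean_fairness(ipc_shared, ipc_alone):
    """Harmonic mean of per-application speedups."""
    speedups = [s / a for s, a in zip(ipc_shared, ipc_alone)]
    return len(speedups) / sum(1.0 / x for x in speedups)


# Hypothetical two-application workload: both run at half their alone-IPC.
shared = [1.0, 0.2]
alone = [2.0, 0.4]
print(weighted_speedup(shared, alone))  # 1.0
print(throughput(shared))               # 1.2
print(hmean_fairness(shared, alone))    # 0.5
```

In this example the low-IPC application contributes only 0.2 to throughput even though it is slowed down exactly as much as the other one, which is why throughput alone can hide unfairness.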

## Weighted Speedup Results for UCP



## IPC Results for UCP



UCP improves average throughput by 17%

## Any Problems with UCP So Far?

- Scalability to many cores
- Non-convex curves?
- Time complexity of partitioning low for two cores (number of possible partitions ≈ number of ways)
- Possible partitions increase exponentially with cores
- For a 32-way cache, possible partitions:
  - 4 cores → 6545
  - 8 cores → 15.4 million
- Problem NP hard  $\rightarrow$  need scalable partitioning algorithm
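The partition counts quoted above match a stars-and-bars count of the ways to split W cache ways among N cores when a core may receive zero ways. A short check, with a hypothetical function name:

```python
from math import comb


def num_partitions(ways, cores):
    """Number of ways to split `ways` identical cache ways among `cores` cores.

    Stars and bars: C(ways + cores - 1, cores - 1), allowing zero-way shares.
    """
    return comb(ways + cores - 1, cores - 1)


print(num_partitions(32, 4))  # 6545
print(num_partitions(32, 8))  # 15380937, about 15.4 million
print(num_partitions(16, 2))  # 17, roughly the number of ways
```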

## Greedy Algorithm [Stone+ ToC '92]

- Greedy Algorithm (GA) allocates 1 block to the app that has the max utility for one block. Repeat till all blocks allocated
- Optimal partitioning when utility curves are convex
- Pathological behavior for non-convex curves



## Problem with Greedy Algorithm



In each iteration, the utility for 1 block:

U(A) = 10 misses, U(B) = 0 misses

All blocks assigned to A, even if B has same miss reduction with fewer blocks

 Problem: GA considers benefit only from the immediate block. Hence, it fails to exploit large gains from looking ahead
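The failure mode can be reproduced with a small sketch. The utility curves below are hypothetical (expressed as cumulative hits rather than miss reductions): A gains 10 per block, while B gains nothing until it has three blocks.

```python
def greedy_partition(hit_curves, total_blocks):
    """Give one block at a time to the app with the largest gain from one block.

    hit_curves[app][n] = hits the app obtains with n blocks, n = 0..total_blocks.
    """
    alloc = {app: 0 for app in hit_curves}
    for _ in range(total_blocks):
        gains = {app: curve[alloc[app] + 1] - curve[alloc[app]]
                 for app, curve in hit_curves.items()}
        winner = max(gains, key=gains.get)   # ties go to the first app listed
        alloc[winner] += 1
    return alloc


def total_hits(hit_curves, alloc):
    return sum(hit_curves[app][n] for app, n in alloc.items())


curves = {
    "A": [0, 10, 20, 30, 40, 50, 60, 70, 80],   # convex: steady gain
    "B": [0, 0, 0, 45, 45, 45, 45, 45, 45],     # non-convex: gain only at 3 blocks
}
greedy = greedy_partition(curves, 8)
print(greedy, total_hits(curves, greedy))        # {'A': 8, 'B': 0} 80
print(total_hits(curves, {"A": 5, "B": 3}))      # 95
```

Every iteration sees a one-block gain of 10 for A and 0 for B, so B never receives the three blocks it would need to reach its large gain.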

## Lookahead Algorithm

- Marginal Utility (MU) = Utility per cache resource
  - $MU_a^b = U_a^b / (b - a)$, where $U_a^b$ is the utility of growing an allocation from $a$ to $b$ blocks
- GA considers MU for 1 block
- LA (Lookahead Algorithm) considers MU for all possible allocations
- Select the app that has the max value for MU, and allocate it as many blocks as are required to reach that max MU
- Repeat until all blocks are assigned

## Lookahead Algorithm Example



Result: A gets 5 blocks and B gets 3 blocks (Optimal)

Time complexity ≈ ways²/2 (512 ops for 32-ways)
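A minimal sketch of the lookahead idea follows. The utility curves are hypothetical and chosen so that the outcome matches the 5/3 split above; the function name and the tie-breaking rule (first app, fewest blocks) are assumptions of this sketch.

```python
def lookahead_partition(hit_curves, total_blocks):
    """Repeatedly give the app with the highest marginal utility the number
    of blocks it needs to reach that marginal utility.

    hit_curves[app][n] = hits the app obtains with n blocks, n = 0..total_blocks.
    """
    alloc = {app: 0 for app in hit_curves}
    balance = total_blocks
    while balance > 0:
        best = None  # (marginal utility, app, blocks required)
        for app, curve in hit_curves.items():
            a = alloc[app]
            for extra in range(1, balance + 1):
                mu = (curve[a + extra] - curve[a]) / extra   # MU from a to a+extra
                if best is None or mu > best[0]:
                    best = (mu, app, extra)
        _, winner, blocks = best
        alloc[winner] += blocks
        balance -= blocks
    return alloc


curves = {
    "A": [0, 10, 20, 30, 40, 50, 60, 70, 80],   # gains 10 per block
    "B": [0, 0, 0, 45, 45, 45, 45, 45, 45],     # gains 45 once it has 3 blocks
}
print(lookahead_partition(curves, 8))  # {'A': 5, 'B': 3}
```

In the first round B's marginal utility over three blocks (45/3 = 15) beats A's 10 per block, so B receives three blocks at once; the remaining five go to A one at a time.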

#### Four cores sharing a 2MB 32-way L2



## Utility Based Cache Partitioning

- Advantages over LRU
  - + Improves system throughput
  - + Better utilizes the shared cache
- Disadvantages
  - Fairness, QoS?
- Limitations
  - Scalability: Partitioning limited to ways. What if you have numWays < numApps?
  - Scalability: How is utility computed in a distributed cache?
  - What if past behavior is not a good predictor of utility?

## Fair Shared Cache Partitioning

- Goal: Equalize the slowdowns of multiple threads sharing the cache
- Idea: Dynamically estimate slowdowns due to sharing and assign cache blocks to balance slowdowns
  - Approximate slowdown with change in miss rate
- Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.



















## Advantages/Disadvantages of the Approach

#### Advantages

- + Reduced starvation
- + Better average throughput
- + Block granularity partitioning
#### Disadvantages and Limitations

- Alone miss rate estimation can be incorrect
- Scalable to many cores?
- Is this the best (or a good) fairness metric?
- Does this provide performance isolation in cache?

# Software-Based Shared Cache Partitioning

## Software-Based Shared Cache Management

- Assume no hardware support (demand based cache sharing, i.e. LRU replacement)
- How can the OS best utilize the cache?
- Cache sharing aware thread scheduling
  - Schedule workloads that "play nicely" together in the cache
    - E.g., working sets together fit in the cache
    - Requires static/dynamic profiling of application behavior
    - Fedorova et al., "Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler," PACT 2007.
- Cache sharing aware page coloring
  - Dynamically monitor miss rate over an interval and change virtual to physical mapping to minimize miss rate
    - Try out different partitions

## OS Based Cache Partitioning

- Lin et al., "Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems," HPCA 2008.
- Cho and Jin, "Managing Distributed, Shared L2 Caches through OS-Level Page Allocation," MICRO 2006.

#### Static cache partitioning

- Predetermines the number of cache blocks allocated to each program at the beginning of its execution
- Divides the shared cache into multiple regions and partitions cache regions through OS page address mapping

#### Dynamic cache partitioning

- Adjusts cache quota among processes dynamically
- Page re-coloring
- Dynamically changes processes' cache usage through OS page address re-mapping

# Page Coloring

- Physical memory divided into colors
- Colors map to different cache sets
- Cache partitioning
  - Ensure two threads are allocated pages of different colors




## Page Coloring

Physically indexed caches are divided into multiple regions (colors). All cache lines in a physical page are cached in one of those regions (colors).
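The mapping from a physical page to a color can be sketched as below. The function names are hypothetical, and the sketch assumes a physically indexed cache whose set-index bits sit directly above the block-offset bits; an OS may choose to use fewer colors than the maximum computed here.

```python
def max_page_colors(cache_bytes, associativity, page_bytes):
    """Largest number of distinct colors: size of one cache way divided by page size."""
    return cache_bytes // (associativity * page_bytes)


def page_color(physical_address, page_bytes, colors):
    """Color of the page holding this address: low bits of the physical page number."""
    return (physical_address // page_bytes) % colors


# Assumed configuration: 4 MB, 16-way cache with 4 KB pages.
colors = max_page_colors(4 * 1024 * 1024, 16, 4096)
print(colors)                                # 64
print(page_color(0x12345000, 4096, colors))  # 5
print(page_color(0x12346000, 4096, colors))  # 6 (next page, next color)
```

Two pages with different colors can never conflict in the cache, which is what lets the OS partition the cache purely through its choice of physical pages.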



## Static Cache Partitioning using Page Coloring



#### Shared cache is partitioned between two processes through address mapping.



Cost: Main memory space needs to be partitioned, too.



## Dynamic Cache Partitioning via Page Re-Coloring



### Dynamic Partitioning in a Dual-Core System



## Experimental Environment

#### Dell PowerEdge1950

- Two-way SMP, Intel dual-core Xeon 5160
- Shared 4MB L2 cache, 16-way
- 8GB Fully Buffered DIMM
- Red Hat Enterprise Linux 4.0
  - 2.6.20.3 kernel
  - Performance counter tools from HP (Pfmon)
  - Divide L2 cache into 16 colors

## Performance – Static & Dynamic



- Aim to minimize combined miss rate
- For RG-type, and some RY-type:
  - Static partitioning outperforms dynamic partitioning
- For RR- and RY-type, and some RY-type
  - Dynamic partitioning outperforms static partitioning

## Software vs. Hardware Cache Management

- Software advantages
  - + No need to change hardware
  - + Easier to upgrade/change algorithm (not burned into hardware)
- Disadvantages
  - Large granularity of partitioning (page-based versus way/block)
  - Limited page colors  $\rightarrow$  reduced performance per application (limited physical memory space!), reduced flexibility
  - Changing partition size has high overhead → page mapping changes
  - Adaptivity is slow: hardware can adapt every cycle (possibly)
  - Not enough information exposed to software (e.g., number of misses due to inter-thread conflict)

# Private/Shared Caching

## Private/Shared Caching

- Example: Adaptive spill/receive caching
- Goal: Achieve the benefits of private caches (low latency, performance isolation) while sharing cache capacity across cores
- Idea: Start with a private cache design (for performance isolation), but dynamically steal space from other cores that do not need all their private caches
  - Some caches can spill their data to other cores' caches dynamically
- Qureshi, "Adaptive Spill-Receive for Robust High-Performance Caching in CMPs," HPCA 2009.

## Revisiting Private Caches on CMP

Private caches avoid the need for a shared interconnect: fast latency, tiled design, performance isolation



Problem: When one core needs more cache and another core has spare cache, private-cache CMPs cannot share capacity

Spill evicted line from one cache to neighbor cache

- Co-operative caching (CC) [ Chang+ ISCA' 06]



Problem with CC:

1. Performance depends on the parameter (spill probability)
2. All caches spill as well as receive → limited improvement

Goal: Robust High-Performance Capacity Sharing with Negligible Overhead

Chang and Sohi, "Cooperative Caching for Chip Multiprocessors," ISCA 2006.

## Spill-Receive Architecture

Each Cache is either a Spiller or Receiver but not both

- Lines from spiller cache are spilled to one of the receivers
- Evicted lines from receiver cache are discarded



Which N-bit binary string (one spiller/receiver bit per cache) maximizes the performance of the Spill-Receive Architecture? → Dynamic Spill-Receive (DSR)

Qureshi, "Adaptive Spill-Receive for Robust High-Performance Caching in CMPs," HPCA 2009.

# Efficient Cache Utilization

## Efficient Cache Utilization: Examples

- Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2005.
- Seshadri et al., "The Evicted-Address Filter: A Unified Mechanism to Address both Cache Pollution and Thrashing," PACT 2012.
- Pekhimenko et al., "Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches," PACT 2012.



Effective cache utilization is important

# **Reuse Behavior of Cache Blocks**

Different blocks have different reuse behavior

**Access Sequence:** A B C A B C S T U V W X Y Z A B C

A, B, and C are high-reuse blocks; S through Z are low-reuse blocks. An ideal cache keeps A, B, and C.

# **Cache Pollution**

**Problem:** Low-reuse blocks evict high-reuse blocks



**Idea:** Predict reuse behavior of missed blocks. Insert low-reuse blocks at LRU position.
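The effect of inserting low-reuse blocks at the LRU position can be seen on the access sequence from the reuse-behavior slide. This sketch assumes a 4-block fully associative cache and an oracle that already knows which blocks are low-reuse; both are assumptions made for illustration, as is the function name.

```python
def simulate(accesses, num_blocks, low_reuse=frozenset()):
    """Count hits in a small fully associative cache.

    Blocks listed in `low_reuse` are inserted at the LRU position;
    all other blocks are inserted at the MRU position (plain LRU).
    """
    stack = []   # index 0 = MRU, last = LRU
    hits = 0
    for block in accesses:
        if block in stack:
            hits += 1
            stack.remove(block)
            stack.insert(0, block)        # promote to MRU on a hit
            continue
        if len(stack) == num_blocks:
            stack.pop()                   # evict the LRU block
        if block in low_reuse:
            stack.append(block)           # predicted low reuse: insert at LRU
        else:
            stack.insert(0, block)        # default: insert at MRU
    return hits


sequence = list("ABCABCSTUVWXYZABC")
print(simulate(sequence, 4))                                # 3 hits with plain LRU
print(simulate(sequence, 4, low_reuse=set("STUVWXYZ")))     # 6 hits
```

With plain LRU the stream S through Z pushes A, B, and C out before their final reuse; with LRU-position insertion each streaming block only displaces the previous streaming block.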



# Cache Thrashing

**Problem:** High-reuse blocks evict each other



Qureshi+, "Adaptive insertion policies for high performance caching," ISCA 2007.

# Handling Pollution and Thrashing

Need to address both pollution and thrashing concurrently

#### **Cache Pollution**

Need to distinguish high-reuse blocks from low-reuse blocks

#### **Cache Thrashing**

Need to control the number of blocks inserted with high priority into the cache

## **Reuse Prediction**



Keep track of the reuse behavior of every cache block in the system

#### Impractical

1. High storage overhead
2. Look-up latency

# **Approaches to Reuse Prediction**

Use program counter or memory region information.

1. Group Blocks



2. Learn group behavior



3. Predict reuse



1. Same group $\not\rightarrow$ same reuse behavior
2. No control over number of high-reuse blocks

#### **Per-block Reuse Prediction**

Use recency of eviction to predict reuse



# Evicted-Address Filter (EAF)



#### Naïve Implementation: Full Address Tags



1. Large storage overhead
2. Associative lookups → high energy

#### Low-Cost Implementation: Bloom Filter





Implement EAF using a Bloom Filter

Low storage overhead + energy

# **Bloom Filter**



Inserted Elements: X, Y



Bloom-filter EAF: 4x reduction in storage overhead, 1.47% compared to cache size
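A Bloom-filter-based EAF can be sketched as follows. The class and method names, the filter size, the SHA-256-derived hash functions, and the policy of clearing the filter once it has recorded `capacity` evictions are all assumptions of this sketch, not details taken from the slides.

```python
import hashlib


class BloomFilterEAF:
    """Approximate set of recently evicted block addresses."""

    def __init__(self, num_bits=1024, num_hashes=2, capacity=64):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.capacity = capacity
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity
        self.inserted = 0

    def _positions(self, address):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{address}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def record_eviction(self, address):
        if self.inserted == self.capacity:
            self.clear()                  # assumed policy: reset when full
        for position in self._positions(address):
            self.bits[position] = 1
        self.inserted += 1

    def predicts_high_reuse(self, address):
        # True if the address was (probably) evicted recently.
        return all(self.bits[p] for p in self._positions(address))

    def clear(self):
        self.bits = bytearray(self.num_bits)
        self.inserted = 0


eaf = BloomFilterEAF()
print(eaf.predicts_high_reuse(0x1000))   # False: nothing recorded yet
eaf.record_eviction(0x1000)
print(eaf.predicts_high_reuse(0x1000))   # True: no false negatives
```

A Bloom filter can report false positives but never false negatives, so a recently evicted block is always recognized; the cost is that individual addresses cannot be removed, which is why the sketch resets the whole filter.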

# **EAF-Cache: Final Design**





1. Simple to implement
2. Easy to design and verify
3. Works with other techniques (replacement policy)

# EAF Performance – Summary



# Cache Compression

# **Motivation for Cache Compression**

Significant redundancy in data:



How can we exploit this redundancy?

- Cache compression helps
- Provides effect of a larger cache without making it physically larger

# **Background on Cache Compression**



- Key requirements:
  - Fast (low decompression latency)
  - Simple (avoid complex hardware changes)
  - Effective (good compression ratio)

| Compression Mechanism | Decompression Latency | Complexity | Compression Ratio |
|-----------------------|-----------------------|------------|-------------------|
| Zero                  | ✓                     | ✓          | ×                 |
| Frequent Value        | ×                     | ×          | ✓                 |
| Frequent Pattern      | ×                     | ×/✓        | ✓                 |
| BΔI                   | ✓                     | ✓          | ✓                 |

# **Key Data Patterns in Real Applications**

Zero Values: initialization, sparse matrices, NULL pointers

0x00000000 0x00000000 0x00000000 0x00000000 ...

**Repeated Values**: common initial values, adjacent pixels

0x000000FF 0x000000FF 0x000000FF 0x000000FF ...

Narrow Values: small values stored in a big data type

... 0x00000004 ...

**Other Patterns:** pointers to the same memory region

0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8 ...

#### **How Common Are These Patterns?**

SPEC2006, databases, web workloads, 2MB L2 cache. "Other Patterns" include Narrow Values.



43% of the cache lines belong to key patterns


#### **Key Data Patterns in Real Applications**

**Low Dynamic Range:** Differences between values are significantly smaller than the values themselves

#### Key Idea: Base+Delta (B+Δ) Encoding
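A minimal sketch of Base+Delta encoding, applied to the pointer values shown under "Other Patterns". The function names are hypothetical, and the sketch assumes the first value in the line is used as the base and that deltas are stored as signed integers of a fixed width.

```python
def base_delta_compress(values, delta_bytes):
    """Return (base, deltas) if every delta fits in `delta_bytes` signed bytes.

    Returns None when the line cannot be compressed with this delta width.
    Assumption of this sketch: the base is the first value in the line.
    """
    base = values[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = [v - base for v in values]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None


def base_delta_decompress(base, deltas):
    return [base + d for d in deltas]


line = [0xC04039C0, 0xC04039C8, 0xC04039D0, 0xC04039D8]   # four 4-byte values
base, deltas = base_delta_compress(line, delta_bytes=1)
print(hex(base), deltas)                        # 0xc04039c0 [0, 8, 16, 24]
print(base_delta_decompress(base, deltas) == line)  # True
```

Here 16 bytes of data shrink to one 4-byte base plus four 1-byte deltas. Decompression is a single addition per value, and the additions are independent of one another.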



# **Can We Do Better?**

- Uncompressible cache line (with a single base):



- Key idea: use more bases, e.g., two instead of one
- Pro:
  - More cache lines can be compressed
- Cons:
  - Unclear how to find these bases efficiently
  - Higher overhead (due to additional bases)

# B+Δ with Multiple Arbitrary Bases



# How to Find Two Bases Efficiently?

1. First base - first element in the cache line



2. Second base - implicit base of 0 (the Immediate part)

Advantages over 2 arbitrary bases:

- Better compression ratio
- Simpler compression logic
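The two-base scheme can be sketched as below. The names and example values are hypothetical. One assumption is worth stating loudly: the sketch takes the explicit base to be the first value that does not already fit as a small immediate, which is a refinement of "first element in the cache line".

```python
def bdi_compress(values, delta_bytes):
    """Encode each value as a delta from an implicit zero base (flag 0)
    or from one explicit base (flag 1). Returns (base, encoded) or None.
    """
    limit = 1 << (8 * delta_bytes - 1)

    def fits(delta):
        return -limit <= delta < limit

    base = None
    encoded = []
    for v in values:
        if fits(v):
            encoded.append((0, v))          # immediate: delta from base 0
            continue
        if base is None:
            base = v                        # assumed choice of explicit base
        if not fits(v - base):
            return None                     # not compressible at this width
        encoded.append((1, v - base))
    return base, encoded


def bdi_decompress(base, encoded):
    return [delta if flag == 0 else base + delta for flag, delta in encoded]


# Hypothetical line mixing small integers with pointers into one region.
line = [0x00000000, 0x09A40178, 0x0000000B, 0x09A40180]
base, encoded = bdi_compress(line, delta_bytes=1)
print(hex(base), encoded)   # 0x9a40178 [(0, 0), (1, 0), (0, 11), (1, 8)]
print(bdi_decompress(base, encoded) == line)  # True
```

A single arbitrary base cannot cover both the small integers and the pointers in this line, while the implicit zero base handles the small values at the cost of one flag bit per element.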

**Base-Delta-Immediate (BΔI) Compression** 

#### B+Δ (with two arbitrary bases) vs. BΔI



Average compression ratio is close, but BΔI is simpler

#### **BΔI** Cache Compression Implementation

- Decompressor Design
  - Low latency
- Compressor Design
  - Low cost and complexity
- B∆I Cache Organization
  - Modest complexity

# **BΔI Decompressor Design**

**Compressed Cache Line** 



**Uncompressed Cache Line** 

# **BΔI** Compressor Design



#### **BΔI** Compression Unit: 8-byte B<sub>0</sub>, 1-byte Δ



# **BΔI** Cache Organization



#### **BΔI**: 4-way cache with 8-byte segmented data



## **Qualitative Comparison with Prior Work**

- Zero-based designs
  - ZCA [Dusser+, ICS'09]: zero-content augmented cache
  - ZVC [Islam+, PACT'09]: zero-value cancelling
  - Limited applicability (only zero values)
- FVC [Yang+, MICRO'00]: frequent value compression
  - High decompression latency and complexity
- Pattern-based compression designs
  - FPC [Alameldeen+, ISCA'04]: frequent pattern compression
    - High decompression latency (5 cycles) and complexity
  - C-pack [Chen+, T-VLSI Systems'10]: practical implementation of FPC-like algorithm
    - High decompression latency (8 cycles)
# **Cache Compression Ratios**

SPEC2006, databases, web workloads, 2MB L2



**BΔI** achieves the highest compression ratio

# Single-Core: IPC and MPKI



**BΔI** achieves the performance of a 2X-size cache

Performance improves due to the decrease in MPKI

# **Multi-Core Workloads**

- Application classification based on
  - Compressibility: effective cache size increase (Low Compr. (*LC*) < 1.40, High Compr. (*HC*) >= 1.40)
  - Sensitivity: performance gain with more cache (Low Sens. (*LS*) < 1.10, High Sens. (*HS*) >= 1.10; 512kB -> 2MB)
- Three classes of applications:
  - LCLS, HCLS, HCHS; no LCHS applications
- For 2-core **random** mixes of each possible class pair (20 each, 120 total workloads)
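The classification thresholds and the workload count can be checked with a short sketch. The function name and the sample inputs are hypothetical; only the thresholds and the three class names come from the list above.

```python
from itertools import combinations_with_replacement


def classify(compression_ratio, speedup_512kb_to_2mb):
    """Label an application by compressibility (LC/HC) and sensitivity (LS/HS)."""
    compressibility = "HC" if compression_ratio >= 1.40 else "LC"
    sensitivity = "HS" if speedup_512kb_to_2mb >= 1.10 else "LS"
    return compressibility + sensitivity


print(classify(1.50, 1.20))   # HCHS
print(classify(1.40, 1.05))   # HCLS
print(classify(1.00, 1.00))   # LCLS

# Unordered pairs of the three observed classes, allowing a class to pair with itself.
classes = ["LCLS", "HCLS", "HCHS"]
pairs = list(combinations_with_replacement(classes, 2))
print(len(pairs), len(pairs) * 20)   # 6 120
```

Three classes give six unordered pairs, and 20 random mixes per pair yields the 120 workloads.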

# **Multi-Core: Weighted Speedup**



If at least one application is sensitive, the performance improves. BΔI performance improvement is the highest (9.5%).

#### Readings for Lecture 15 (Next Monday)

Required Reading Assignment:

- Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems," ISCA 2008.

Recommended References:

- Muralidhara et al., "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning," MICRO 2011.
- Ebrahimi et al., "Parallel Application Memory Scheduling," MICRO 2011.
- Wang et al., "A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters," VEE 2015.

#### Guest Lecture on Wednesday (10/28)

- Bryan Black, AMD
  - 3D die stacking technology
