This shows you the differences between two versions of the page.
buzzword [2014/02/24 19:17] rachata |
buzzword [2014/03/03 18:16] rachata |
||
---|---|---|---|
Line 595: | Line 595: | ||
* Intel SSE --> Modern version of MMX | * Intel SSE --> Modern version of MMX | ||
+ | ===== Lecture 17 (2/26 Wed.) ===== | ||
+ | |||
+ | * GPU | ||
+ | * Warp/Wavefront | ||
+ | * A bunch of threads sharing the same PC | ||
+ | * SIMT | ||
+ | * Lanes | ||
+ | * FGMT + massively parallel | ||
+ | * Tolerate long latency | ||
+ | * Warp based SIMD vs. traditional SIMD | ||
+ | * SPMD (Programming model) | ||
+ | * Single program operates on multiple data | ||
+ | * Can have synchronization points | ||
+ | * Many scientific applications are programmed in this manner | ||
+ | * Control flow problem (branch divergence) | ||
+ | * Masking (in a branch, mask threads that should not execute that path) | ||
+ | * Lower SIMD efficiency | ||
+ | * What if you have layers of branches? | ||
+ | * Dynamic warp formation | ||
+ | * Combining threads from different warps to increase SIMD utilization | ||
+ | * This can cause memory divergence | ||
+ | * VLIW | ||
+ | * Wide fetch | ||
+ | * IA-64 | ||
+ | * Tradeoffs | ||
+ | * Simple hardware (no dynamic scheduling, no dependency checking within VLIW) | ||
+ | * A heavy load on the compiler | ||
+ | * Decoupled access/execute | ||
+ | * Limited form of OoO | ||
+ | * Tradeoffs | ||
+ | * How to steer the instruction (determine dependency/stalling)? | ||
+ | * Instruction scheduling techniques (static vs. dynamic) | ||
+ | * Systolic arrays | ||
+ | * Processing elements transform data in chains | ||
+ | * Developed for image processing (for example, convolution) | ||
+ | * Stage processing | ||
+ | |||
+ | ===== Lecture 18 (2/28 Fri.) ===== | ||
+ | |||
+ | * Tradeoffs of VLIW | ||
+ | * Why does VLIW require static instruction scheduling? | ||
+ | * Whose job is it? | ||
+ | * Compiler can rearrange basic blocks/instructions | ||
+ | * Basic block | ||
+ | * Benefits of having a large basic block | ||
+ | * Entry/Exit | ||
+ | * Handling entries/exits | ||
+ | * Trace cache | ||
+ | * How to ensure correctness? | ||
+ | * Profiling | ||
+ | * Fixing up the instruction order to ensure correctness | ||
+ | * Dealing with multiple entries into the block | ||
+ | * Dealing with multiple exits out of the block | ||
+ | * Super block | ||
+ | * How to form super blocks? | ||
+ | * Benefit of super block | ||
+ | * Tradeoff between not forming a super block and forming a super block | ||
+ | * Ambiguous branch (after profiling, both taken/not taken are equally likely) | ||
+ | * Cleaning up | ||
+ | * What scenario would make trace cache/superblock/profiling less effective? | ||
+ | * List scheduling | ||
+ | * Helps figure out which instructions VLIW should fetch | ||
+ | * Try to maximize instruction throughput | ||
+ | * How to assign priorities | ||
+ | * What if some instructions take longer than others | ||
+ | * Block structured ISA (BS-ISA) | ||
+ | * Problems with trace scheduling? | ||
+ | * What type of program will benefit from BS-ISA | ||
+ | * How to form blocks in BS-ISA? | ||
+ | * Combining basic blocks | ||
+ | * Multiples of merged basic blocks | ||
+ | * How to deal with entries/exits in BS-ISA? | ||
+ | * Undo the executed instructions from the entry point, then fetch the new block | ||
+ | * Advantages over trace cache | ||
+ | * Benefit of VLIW + Static instruction scheduling | ||
+ | * Intel IA-64 | ||
+ | * Static instruction scheduling and VLIW |
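
The list scheduling items in Lecture 18 (assigning priorities, handling instructions that take longer than others, maximizing throughput) can be illustrated with a small sketch. This is not taken from the lecture: the names `Instr`, `critical_path`, and `list_schedule` are hypothetical, the priority function (longest latency-weighted path to the end of the block) is just one common heuristic, and every issue slot is assumed to accept any instruction.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    name: str
    latency: int = 1                          # cycles until the result is usable
    deps: list = field(default_factory=list)  # names of producer instructions


def critical_path(instrs):
    """Priority = longest latency-weighted path from an instruction to the block end."""
    by_name = {i.name: i for i in instrs}
    users = {i.name: [] for i in instrs}
    for i in instrs:
        for d in i.deps:
            users[d].append(i.name)
    memo = {}

    def height(n):
        if n not in memo:
            tail = max((height(u) for u in users[n]), default=0)
            memo[n] = by_name[n].latency + tail
        return memo[n]

    return {n: height(n) for n in by_name}


def list_schedule(instrs, width):
    """Greedily pack ready instructions into bundles of at most `width` slots.

    Assumes the dependency graph is acyclic and all latencies are >= 1.
    An empty bundle stands for a cycle of NOPs while waiting on a long latency.
    """
    prio = critical_path(instrs)
    by_name = {i.name: i for i in instrs}
    done_at = {}                        # name -> cycle at which its result is ready
    unscheduled = {i.name for i in instrs}
    bundles = []
    cycle = 0
    while unscheduled:
        ready = [
            n for n in unscheduled
            if all(d in done_at and done_at[d] <= cycle for d in by_name[n].deps)
        ]
        # Highest priority first; break ties by name to stay deterministic.
        ready.sort(key=lambda n: (-prio[n], n))
        bundle = ready[:width]
        for n in bundle:
            done_at[n] = cycle + by_name[n].latency
            unscheduled.remove(n)
        bundles.append(bundle)
        cycle += 1
    return bundles


# Hypothetical basic block: two loads feed a multiply and an add, then a store.
demo = [
    Instr("ld1", latency=2),
    Instr("ld2", latency=2),
    Instr("mul", latency=3, deps=["ld1", "ld2"]),
    Instr("add", latency=1, deps=["ld1"]),
    Instr("st", latency=1, deps=["mul", "add"]),
]
bundles = list_schedule(demo, width=2)
for cycle, bundle in enumerate(bundles):
    print(cycle, bundle or ["nop"])
```

In this sketch the empty bundles show the cost of static scheduling when latencies are long: the compiler has to fill those cycles with NOPs unless it can find independent work, which is one reason larger blocks (superblocks, BS-ISA blocks) help.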