Project Ideas

From 2004 (Babak)

  • Program-driven VM replacement

Much like caches, VM replacement today is based primarily on heuristics such as LRU. With little hardware support, VM can benefit significantly from program-extracted information (e.g., instruction control flow) such as that used in a dead-block predictor. Design a dead-page predictor to optimize VM page replacement, and show that with high accuracy such a predictor can substantially reduce the page fault rate relative to LRU. Compare and contrast your results with the Miss-Ratio Curve replacement algorithm proposed in ASPLOS 2004 (this year).
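
As a starting point, here is a minimal sketch (in Python, purely illustrative) of how a dead-page hint could drive replacement, compared against plain LRU. The oracle predictor below peeks at the future of the trace, so it is only an upper bound on what a real history-based predictor could achieve:

```python
from collections import OrderedDict

def simulate(trace, frames, dead_hint=None):
    """Count page faults. dead_hint(i, resident) may return a page
    predicted 'dead'; if it returns None, fall back to LRU eviction."""
    resident = OrderedDict()          # page -> None, ordered by recency
    faults = 0
    for i, page in enumerate(trace):
        if page in resident:
            resident.move_to_end(page)
            continue
        faults += 1
        if len(resident) >= frames:
            victim = dead_hint(i, resident) if dead_hint else None
            if victim is None:
                resident.popitem(last=False)      # evict LRU page
            else:
                del resident[victim]              # evict predicted-dead page
        resident[page] = None
    return faults

def oracle_dead(trace):
    """Perfect dead-page predictor: any resident page never referenced
    again in the trace is dead (an accuracy upper bound)."""
    def hint(i, resident):
        future = set(trace[i:])
        for p in resident:
            if p not in future:
                return p
        return None
    return hint
```

On a trace that mixes a small reused working set with one-touch pages, the oracle evicts the one-touch (dead) pages first and faults less than LRU.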

  • Program-driven L2 replacement

Off-chip cache accesses are detrimental to performance. There are applications with adverse cache conflict behavior (e.g., database systems) for which even highly associative L2 caches do not work well. Design an L2 dead-block predictor to predict candidate blocks for replacement. Show that, given high accuracy, such a predictor can substantially reduce cache miss rates relative to LRU. Compare and contrast such a predictor with a time-out predictor based on either cycle or reference counts.
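
A time-out predictor of the kind suggested for comparison can be sketched as follows (Python; the reference-count threshold is an illustrative choice, and a real design would use small saturating hardware counters per cache line):

```python
class TimeoutDeadBlockPredictor:
    """Reference-count time-out sketch: a cached block is predicted dead
    once it goes more than `threshold` cache accesses without a hit."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.last_touch = {}          # block -> access count at last touch
        self.clock = 0                # global cache-access counter

    def touch(self, block):
        self.clock += 1
        self.last_touch[block] = self.clock

    def is_dead(self, block):
        return self.clock - self.last_touch.get(block, 0) > self.threshold
```

A cycle-count variant would advance `clock` every cycle instead of every access; the project asks you to compare both against a program-driven predictor.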

  • Transactional Processor (and potentially other TM-based projects?)

Speculation is fruitful if the benefit of optimized speculative execution offsets the negative impact of misspeculation and rollback. For instance, current branch predictors improve performance because they achieve high enough accuracy to allow speculative execution over tens of instructions. Today’s superscalar processors support speculation through small reorder buffers and load/store queues. There are a number of candidate scenarios for speculative execution that have orders-of-magnitude larger verification latency than branch prediction but can also yield orders-of-magnitude performance improvements when successful. For example, a synchronization access in a multiprocessor (a read-modify-write operation on a memory address) can take thousands of cycles, while the processor could instead speculate that the lock is available. Propose enhancements to the cache hierarchy and load/store queue to allow checkpointing of execution windows of hundreds or thousands of instructions (checkpointing register state is the easy part). Evaluate the opportunity for performance improvement from skipping locks using your transactional processor. You will need scaffolding for this work.
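
A toy model of the checkpoint/rollback machinery (Python; the class and method names are invented for illustration): speculative stores are buffered instead of written to memory, loads are tracked for conflict detection, and register state is restored on misspeculation:

```python
import copy

class SpeculativeWindow:
    """Sketch of lock-skipping speculation: checkpoint registers, buffer
    stores, then commit on success or roll back on a conflict."""
    def __init__(self, regs, memory):
        self.regs, self.memory = regs, memory
        self.checkpoint = None
        self.store_buffer = {}        # addr -> value, held until commit
        self.read_set = set()         # addresses read speculatively

    def begin(self):
        self.checkpoint = copy.deepcopy(self.regs)
        self.store_buffer.clear()
        self.read_set.clear()

    def load(self, addr):
        self.read_set.add(addr)
        return self.store_buffer.get(addr, self.memory.get(addr, 0))

    def store(self, addr, value):
        self.store_buffer[addr] = value   # not visible to memory yet

    def conflict(self, addr):
        """Would a remote write to addr (seen via coherence) abort us?"""
        return addr in self.read_set or addr in self.store_buffer

    def commit(self):
        self.memory.update(self.store_buffer)
        self.checkpoint = None

    def rollback(self):
        self.regs.clear()
        self.regs.update(self.checkpoint)
        self.store_buffer.clear()
```

The hard part the project asks about is exactly what this sketch glosses over: sizing the store buffer and read-set tracking in the cache hierarchy for windows of thousands of instructions.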

  • Optimal Cache Sharing

Multithreading has been regarded as a technique that fundamentally thrashes caches and should be applied with great care. Gibbons and Blelloch have recently shown in a SPAA 2004 paper that careful scheduling of threads in a multithreaded processor running a parallel program actually requires only a modest increase in cache size over a single-threaded application performing the same work. Evaluate their results in the context of a real commercial application (e.g., a database management system prototype such as Shore or Postgres) by changing the thread scheduling policy to follow their cache sharing model and measure cache miss ratio as a function of cache size. Compare the results to a single-threaded processor that runs the threads sequentially.
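
A small experiment along these lines can be sketched with a fully associative LRU cache model (Python; the working-set sizes are illustrative): interleaving two threads enlarges the combined working set, so the interleaved schedule needs a larger cache to match the miss ratio of the sequential schedule:

```python
from collections import OrderedDict

def miss_ratio(trace, cache_size):
    """Fully associative LRU cache miss ratio for an address trace."""
    cache = OrderedDict()
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)           # refresh recency
        else:
            misses += 1
            if len(cache) >= cache_size:
                cache.popitem(last=False)     # evict LRU block
            cache[addr] = None
    return misses / len(trace)

# Two threads, each repeatedly scanning its own 4-block working set.
t0 = [('A', i % 4) for i in range(40)]
t1 = [('B', i % 4) for i in range(40)]

sequential  = t0 + t1                                 # threads back to back
interleaved = [x for pair in zip(t0, t1) for x in pair]
```

With 4 blocks the sequential schedule sees only compulsory misses while the fine-grained interleaving thrashes; doubling the cache to the combined working set restores the low miss ratio, in the spirit of the "modest increase" result.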

  • Dynamic Program Phase Identification

Recent research (such as the work by the SimPoint group) suggests that programs execute in phases that repeat across execution. By identifying and exploiting such repetition, architects can substantially enhance execution for future high-end reconfigurable processors. Unfortunately, it appears that program phases are microarchitecture dependent and as such must be detected dynamically. Use statistical tools for measuring the distance between distributions (e.g., Chi-squared distance) to identify unique program phases at runtime based on the measured distribution of a performance metric of interest (e.g., IPC). Show that phases are indeed microarchitecture dependent. Compare your phase detector against that proposed by Sherwood et al. in ISCA 2003.
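
The chi-squared test on windowed IPC histograms might be sketched as follows (Python; the window size, bin edges, and threshold are illustrative choices, not prescribed by the project description):

```python
def chi2_distance(p, q):
    """Chi-squared distance between two normalized histograms
    with the same bin edges."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def detect_phase_change(ipc_samples, window, threshold):
    """Flag sample indices where the IPC histogram of the current window
    differs from the previous window by more than `threshold`."""
    def hist(xs, bins=10, lo=0.0, hi=4.0):
        h = [0] * bins
        for x in xs:
            h[min(bins - 1, int((x - lo) / (hi - lo) * bins))] += 1
        n = sum(h)
        return [c / n for c in h]
    changes = []
    for i in range(2 * window, len(ipc_samples) + 1, window):
        prev = hist(ipc_samples[i - 2 * window:i - window])
        cur = hist(ipc_samples[i - window:i])
        if chi2_distance(prev, cur) > threshold:
            changes.append(i - window)
    return changes
```

Running the detector on the same instruction stream simulated under two different microarchitectures (hence two different IPC traces) is one way to exhibit the microarchitecture dependence the project asks about.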

  • Chiller

Modern high-end processors are designed for worst-case performance demands. Unfortunately, such designs have led to high variability in maximum power density and heat on chip. Such variability makes packaging costs prohibitively high. Instead, many have proposed dynamic hotspot detection and cooling. By dynamically detecting which chip area requires thermal management, microarchitectural techniques to reduce power can be applied locally to alleviate the problem with little impact on performance. Build a simple floorplan model of an out-of-order core (e.g., Skadron et al.’s model in their ISCA 2003 paper), and use accurate heat conductivity models (as proposed by Prof. Asheghi in ME) to evaluate the heat distribution across near-neighbor components every cycle. Use the models to identify and alleviate hot spots through resource scaling. A good place to start is lava.cs.virginia.edu.
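
A toy near-neighbor heat-diffusion step over a floorplan grid (Python; the coefficients are illustrative placeholders, not the calibrated conductivity models mentioned above):

```python
def diffuse(temps, power, conductivity=0.2, ambient_loss=0.05):
    """One explicit time step on a 2-D floorplan grid: each cell gains
    its own power, exchanges heat with its four neighbors, and loses
    some heat to the ambient."""
    rows, cols = len(temps), len(temps[0])
    out = [row[:] for row in temps]
    for r in range(rows):
        for c in range(cols):
            t = temps[r][c]
            neighbors = [temps[nr][nc]
                         for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                         if 0 <= nr < rows and 0 <= nc < cols]
            flow = conductivity * sum(n - t for n in neighbors)
            out[r][c] = t + power[r][c] + flow - ambient_loss * t
    return out

def hottest(temps):
    """Locate the hot spot (the candidate for local resource scaling)."""
    return max(((temps[r][c], (r, c))
                for r in range(len(temps))
                for c in range(len(temps[0]))))[1]
```

In the project, the `power` grid would come from per-unit activity counters, and `hottest` would trigger local fetch or issue throttling for the offending unit.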

  • Nasty Leakage

Leakage power increases five-fold every process generation. In a 90nm process, leakage will account for a larger share of power dissipation than dynamic (switching) power. Leakage is exponential in temperature, so the more the chip leaks, the hotter it gets, and therefore it leaks exponentially more! This feedback loop results in thermal runaway. This project is similar to the Chiller project above, except that here you will evaluate the relationship between leakage, temperature, and time, and apply microarchitectural techniques to avoid thermal runaway.
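
The leakage-temperature feedback can be illustrated with a simple iteration (Python; all constants are made up for illustration, not drawn from a real process model): with strong cooling the temperature converges, while with weak cooling it runs away:

```python
import math

def thermal_trajectory(steps, cooling, dynamic_power=1.0,
                       leak0=0.2, k=0.05, t_ambient=45.0, t_limit=500.0):
    """Iterate the feedback loop: leakage grows exponentially with
    temperature, temperature rises with total power and falls with
    cooling. Stops once temperature passes t_limit (runaway)."""
    temps = [t_ambient]
    for _ in range(steps):
        t = temps[-1]
        if t > t_limit:
            break                     # runaway: stop before overflow
        leakage = leak0 * math.exp(k * (t - t_ambient))
        temps.append(t + dynamic_power + leakage - cooling * (t - t_ambient))
    return temps
```

The microarchitectural techniques the project asks for amount to raising the effective `cooling` term (or lowering `dynamic_power`) before the trajectory crosses into the runaway regime.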

  • Kill the Buffer Overflow Problem

The Internet was brought down in the late ’80s partially due to a bug in the “finger daemon” that could be exploited using the buffer overflow trick. If all programs performed bounds checking when reading data from the network, there would be no problem. Unfortunately, because we can’t always write code carefully, we need to guard against malicious attacks on our carelessness. It turns out that if a program reads data from a network packet onto the stack past the end of the buffer, the packet data can overwrite the return address on the stack to point to code inside the packet! Propose architectural/microarchitectural techniques to guard against such an attack. Evaluate your techniques using the simulator.

  • Machine Learning & Branch Prediction

Apply machine learning to improve the accuracy and/or cost of branch prediction. Design and evaluate a “correlating feature selector (CFS)” that accurately selects the specific history bits with which a branch correlates. Design and evaluate a table-based predictor using your CFS. Use architectural features such as register values as input to the feature set. You may refer to Fern et al.’s (www.ece.cmu.edu/~babak/pub.html) technical report on CFS predictors.
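
A toy version of such a feature selector (Python; the scoring rule here is an illustrative stand-in, not Fern et al.'s actual method): score each global-history bit by how strongly it agrees, or disagrees, with the branch outcome over profiled samples, keep the most predictive bits, and predict by majority vote over the selected-bit pattern:

```python
def select_features(samples, k):
    """samples: list of (history_bits, taken) pairs. Return the k bit
    positions whose values correlate most with the outcome. Bits that
    always agree OR always disagree are both informative."""
    n_bits = len(samples[0][0])
    scores = []
    for b in range(n_bits):
        agree = sum(1 for hist, taken in samples if hist[b] == taken)
        scores.append(abs(agree - len(samples) / 2))
    return sorted(range(n_bits), key=lambda b: -scores[b])[:k]

def predict(samples, bits, hist):
    """Majority vote of training outcomes sharing the selected-bit
    pattern (a stand-in for a pattern table indexed by those bits)."""
    key = tuple(hist[b] for b in bits)
    votes = [taken for h, taken in samples
             if tuple(h[b] for b in bits) == key]
    return sum(votes) * 2 >= len(votes) if votes else True
```

Using only the selected bits shrinks the pattern table from 2^history_length entries to 2^k, which is the cost argument the project alludes to.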

  • Accelerating CPU Simulation with Cache-Independent Checkpointing

The SMARTS paper (ISCA 2003) showed that proper application of sampling theory can reduce CPU simulation turnaround by orders of magnitude while achieving highly accurate results. Nevertheless, SMARTS simulation speed remains bottlenecked by fast-forwarding past instructions between sampled measurements. Follow-on work (TurboSMARTS, draft paper available from Tom) shows that fast-forwarding can be eliminated by storing simulated cache, branch predictor, and architectural state in checkpoints. However, the current design ties checkpoints to a particular cache and branch predictor configuration. Develop a new simulation methodology that combines checkpointing and cache/branch predictor warmup (e.g., as in Haskins & Skadron, ISPASS 2003) to maximize simulation speed and minimize checkpoint storage cost without requiring a fixed cache/branch predictor configuration.

  • SyntSim and SMARTS

Researchers at Cornell University have developed a simulation toolset (SyntSim) that generates high-speed functional simulators, up to twenty times faster than sim-fast, by translating a target benchmark’s binary into a custom functional simulator for that benchmark (paper to appear in MICRO ‘04). Integrate this binary-translation approach to functional simulation with sampled microarchitectural simulation as proposed in SMARTS. You must determine the best way to integrate cycle-accurate simulation into the binary-translation framework and how to perform cache and branch predictor warmup during binary-translated functional simulation (see Chen, MS Thesis for ideas).

