18-548/15-548 Fall 1998

Lab 5:
Performance Tuning

Due Friday, November 20, 1998 at 3:00 PM

Draft until Friday 10/30

In this lab you'll do performance analysis and improvement of a matrix multiply that is too big to fit into some or all levels of cache memory.

To get started, create a working directory and make a copy of the working files for this homework. For example:

mkdir lab5 cd lab5 cp /afs/ece.cmu.edu/class/ece548/www/hw/lab5/* .

You will then have the following files to work with:

matmul.c is a starting point for you to improve the performance of matrix multiplication. You are concerned with improving the performance of my_matmul(). Note that there are some "#defines" necessary in the command line (figuring out what they do is part of the fun...)
prob1 can be sourced to exercise the matmul.c example provided. It includes the recipe for compiling and Atom-izing the program. analy2.c and instr2.c are used with Atom to instrument just the data references of just the subroutine my_matmul().
3level demonstrates how to run dinero for multi-level cache simulations, and has values representative of the Alphaservers. It uses multilev.awk to split the intermediate results out to files and pass cache misses on to the next level of cache.
examples.c is a file found on the web which contains ideas that may be useful to you. (But, note that this code is for floating point multiplication, and we are using long ints instead.)
do_exmp can be sourced to exercise examples.c, although you may want to play around with it more than just what is done there.

You are strongly urged to get an early start on this lab. You may want to set up a file using the "at" facility and run your simulations overnight, testing them first to make sure they're likely to "go". Also, think before you run -- it's easy to burn up a lot of CPU hours unnecessarily, depriving your classmates of machine cycles. In particular, it is not necessary (nor desirable) to run multi-level cache simulations for any problem except problem 1.

Problem 1: Collect Three-Level Dinero Data

This problem demonstrates how to use Atom and Dinero to collect multi-level cache data.

1) The "prob1" file provided in the homework directory will produce 3-level cache data for a 100x100 matrix multiply. Both an uninstrumented program and an instrumented program are executed. Note that the instrumented program reports a much longer elapsed time because at least some of the time taken by the instrumentation is also charged against the program. Execute this file and report:

Execution time of the uninstrumented program (in seconds, as measured by the program itself)
Overall data accesses, miss ratio & traffic ratio for L1, L2, and L3 caches.

2) Report and explain the data write miss ratios for both the L1 and L2 caches (i.e., what is going on with the code to cause each of them to be either large or small, depending on what you find?).

Problem 2: Diagnose a Performance Problem

1) Measure execution times for N=250 through N=262 for the uninstrumented matmul.c program compiled with optimization (see the file prob2). Plot the execution times on an appropriate graph. Use the same compiler optimizations as in problem 1, but change the definition of N, e.g. "DN=260" would set N to the value 260 for that compilation. Do not do dinero simulations -- just run the programs uninstrumented.

2) Find out what is going on in the cache at N=256. Run a dinero simulation for L1 cache only at N=256 using the same compilation and simulation parameters as used in previous runs. Explain what the results mean about the program behavior in general terms. Be sure to use the -z option to get a run that isn't stopped early. Important -- this is a single-level cache simulation only -- don't run the 3-level cache script for this problem or any of the remaining problems -- it will take a lot of CPU time for no reason! If you haven't used them before, ask for help using the "at" or "nohup" Unix commands so you don't have to run this simulation interactively.

Problem 3: Use a transpose matrix

Modify the subroutine my_matmul in matmul.c to use a transposed b matrix by adapting code from examples.c (take code from the routine transpose()).

1) Measure the uninstrumented run time for the transposed algorithm at N=256. It should be faster than the original matrix multiply, even though it does the extra work of copying the b array before doing the actual multiply. List the execution times for original matrix and transposed matrix code. Include a listing of your new my_matmul routine with the transpose algorithm.

2) Why is the transpose routine so much faster? Make a hypothesis, and test that hypothesis at N=251 so that the anomaly at N=256 isn't a factor. You don't have to have the correct answer to get full credit -- just give a plausible answer and a reasonable test with results that either prove it or disprove it. It's OK to use cache simulation parameters from the servers even if you actually measured results on an Alpha workstation -- the on-chip caches are not very different. If you're having trouble coming up with a hypothesis, send the course staff e-mail for a detailed hint.

The point Problem 3 is to take one trip through the scientific method cycle:

Observation

Do a timed run of both versions at N=251 and note the speed difference

Perhaps do timed runs at other N values if you like

Look at simulation results from previous parts of the homework

Hypothesis

Take a plausible guess at what might be going on here based on your experience, observations, and a look at the source code

Experiment

Formulate some experiment that you can do to test your hypothesis and run it. (Hint: the chances that something going on in L3 cache cause that big a performance difference are low, so I wouldn't bother with a 3-level cache simulation given how long it takes to run; but an L1-only cache simulation might possibly be in order...)

Conclusion

Did the experiment work? What did you learn?

If your experiment didn't "work" that's OK too. Just say what you were trying to find and what you ruled out. The point is to experience the above process, not to necessarily solve the entire problem if that becomes a time sink.

Problem 4: Diagnose a Performance Problem with matrix multiply

In this problem you'll be characterizing and optimizing the performance of a large matrix multiply which doesn't fit into cache. In particular, you must make a 1024x1024 long-int matrix multiply go fast.

Starting with the original program matmul.c, and using the examples.c program for ideas, create a performance-optimized matrix multiply routine for the Alphaservers. The objective is to create a matrix multiplication program that performs well for a matrix sized too large to fit into L3 cache.

The rules are:

You may use any source of information for ideas (including web sites; including other students as long as out SAY WHAT YOU USED -- no points will be deducted regardless of source as long as you do not literally copy code)
You must verify correct results of your final algorithm by comparing it to the naive algorithm, which is already done in the matmul.c test program. (You might save some time by disabling comparisons while you are hacking the code -- but be careful to occasionally check to make sure you're still computing the correct result.)
You must create a general-purpose routine that operates on 64-bit long integers (so you can't use special cases or precomputation because you know what the data looks like; and your code should work on any array size).
You must present timing results taken on the course Alphastations. BE SURE TO STATE MACHINE LOAD USING THE uptime command for full credit! Of course, determining miss ratios via simulation may help you improve your ultimate timed performance.
Your final answer may only change code within the definition of my_matmul(), and must take inputs from a[] and b[] and put results in c[]. But, you may exploit the compilation variables OFFSET and FILLER.
You should get a final execution time at least as fast as below depending on which type of Alphastation you use:
- 433 MHz "color" alphastations: matmul.c unoptimized=305 seconds ; optimized = 32.0 seconds
- 500 MHz "skeleton" alphastations: matmul.c unoptimized=289 seconds ; optimized=26.3 seconds

Grading will be based on:

1) Keep a log of every optimization you try with a one-line description and execution time (it would be no surprise if there were dozens of entries in this). You will not be graded on how many tries it took; anyone with a believable and readable log will get full credit for this sub-problem.

2) Give a listing of your final optimized my_matmul() routine.

3) Give a listing of how you invoked the compiler (hint: you might want to compute OFFSET and FILLER based on the value of N). This might be a command line or a make file depending on how fancy you get.

4) What was execution time, the speedup you achieved and the machine load (using uptime) when you measured the speedup? If the machine is not completely idle you will probably want to average a few trials. If you get close to the required execution time but can't get faster than that, consult with the course staff to determine if the problem is heavy machine loads from other students or with your code. As a hint, if you're not using a block decomposition strategy of some kind, you aren't there yet.

5) Draw a picture showing the access pattern of the arrays in your optimized code. Label arrows with the names of all your loop variables.

Notes on Problem 4:

We realize this is a rather open-ended assignment. We're willing to enter a dialogue via e-mail and newsgroup if you feel you don't have sufficient direction. The assignment is supposed to be enlightening, not frustrating, so let us know if there are ways to improve the experience. There are probably several ways to do problem 4 with very little work (and, once again, if you didn't literally copy the code, it isn't cheating). In general the benefit of this lab is proportional to the thought (not necessarily hours) you put into it.

You might want to experiment with optimizations on smaller arrays (say, 128x128 arrays) to understand the algorithms. This will reduce the time to run an experiment until you are zeroing in on the final answer.

There is a "feature" in Digital Unix which can cause the unoptimized program to run much faster than what is stated above, but leaves the optimized program unaffected -- if you encounter this "feature" we can either reboot the machine you are using or tell you how to do the reboot. (Isn't dealing with real hardware and software more fun than just getting simulation results?) At any rate, the thing that really counts is getting optimized performance below the targets, and the targets are unaffected by this problem.

A student in a previous semester asked if a getting this speedup was realistic, since it is very close to the theoretical limit of no memory accesses at all. The short answer is, yes it's realistic; that's the point of being clever about exploiting the memory hierarchy. Below are actual speedups measured on the course machines. Note that the faster machine gets a greater speedup by improving cache performance because main memory speed is the same on both machines (thus an constant-duration L3 cache miss penalty at 500 MHz is more clocks than the same penalty at 433 MHz).

Machine Speed	433 MHz	433 MHz	500 MHz	500 MHz
Program	Execution Time	Speedup	Execution Time	Speedup
Original matmul	305.00 seconds	--	289.28 seconds	--
Improved matmul (machine load=1)	31.48 seconds	9.69	25.82 seconds	11.2
Improved matmul (machine load=4)	33.12 seconds (average)	9.21	27.17 seconds (average)	10.65

You will not be penalized heavily (and in many cases will get full credit) if you make a good try and don't quite hit the speedup target as long as you provide all the information for the program you developed as requested in the five problem parts assigned above.

18-548/15-548 home page.

18-548/15-548 Fall 1998

Lab 5: Performance Tuning

Draft until Friday 10/30

Notes on Problem 4:

Lab 5:
Performance Tuning