HOMEWORK 2 FAQ (FRI Sep 17, 0840 HOURS)

ERRATA
------

Problem 1
---------

Problem 2
---------

Problem 3
---------

Q3-1) I have a couple of questions about #3 in HW #2. First, I couldn't
      find a definition of MIP rate. Is that just Millions of
      Instructions Per unit time (seconds in the case of MIPS)? And in
      the solution, how is the equation
          #MIPS = 1,000 x Efficiency = 1,000 x N/(N+M)
      derived?

A3-1) MIPS is "Millions of Instructions Per Sec". After the pipeline is
      filled, the execution looks like:

          Wait M cycles for data;
          Execute for N cycles without waiting;  # N instructions have finished
          Wait M cycles for data;
          Execute for N cycles without waiting;  # N more instructions have finished
          etc.

      where 1 CPU cycle is 1 nanosecond (a 1 GHz CPU ticks at a rate of
      10^9 ticks per second, so one tick takes 10^{-9} sec or 1 nsec).

      So, in (M+N) cycles, the CPU completes N instructions; in 2(M+N)
      cycles it completes 2N instructions; and so on. M+N cycles take
      M+N nsec to finish. So, the instruction completion rate is about:

          rate = kN instructions / (k(M+N) nsec)
               = N/(M+N) instructions per nsec
               = N/(M+N) x 10^9 instructions per second

      To convert to millions of instructions per second, divide by
      1,000,000:

          rate (MIPS) = 1000 x N/(M+N)

Problem 4
---------

Problem 5
---------

Q5-1) On problem 5b, are we to assume that an integer size is 2 bytes
      or 4 bytes? I am not sure which one to use because it would
      change the number of CPU reads from L1 cache to the registers.
      I.e., if the cache can deliver 8 bytes in one access, then the
      processor would make 2x the cache -> register reads in a 64-bit
      architecture.

A5-1) Integers are 4 bytes. It takes 1 nsec to get 8 bytes from L1 to
      the CPU, which throws away 4 bytes before putting the int into
      an integer register.

Q5-2) I am also unclear on the effect of the WB cache in problem 5b/c.
      Because writes to the cache aren't reflected in memory until
      necessary, I have gleaned that there is no cost associated with
      writing the incremented value back, because it will not be
      written to the cache, and because it is already in the cache.
      Is this a correct assumption?

          It is correct that there is no write to memory because of
          the write-back cache. But you do have to update the cache
          value, since the register will have a different value from
          the cache.

      I am confused because in the example solution for 5c his RWT is
      2*(H*1+M*41)/(H+M), which is 2x my RWT value.

A5-2) I think you need to draw a diagram. So, the core of the code we
      are interested in is:

          x[i] = x[i] + 1;

      which translates to:

          1) R <- M    # read x[i] from memory subsystem
          2) Incr R
          3) R -> M    # write x[i] to memory subsystem

      So, that means from the viewpoint of the memory subsystem:

          item:  x[0] x[0] x[1] x[1] x[2] x[2] x[3] x[3] ...
          R/W:    R    W    R    W    R    W    R    W   ...
                  \_______/
                    1 RWT

      where R means read and W means write. Note that the write of
      x[0] will cause a write to the cache but not to main memory.
      Now add in the H(it) and M(iss) labels. Now does the RWT make
      sense?

Q5-3) Is the cache write time the same as the cache read time?

A5-3) Yes.

Q5-4) In Part C, are we supposed to correct the student solution so
      that it matches our answers in Parts A and B?

A5-4) The student solution in Part C is mostly correct. For example,
      the first and last equations are correct. But there is no
      explanation of how they arrived at the equations and tables in
      their solution. I can tell you that there are a few small
      algebraic errors, but the derivation is basically correct. So,
      one approach is to add some explanation for each equation and
      table that is correct, and to make corrections where they are
      not.

Q5-5) Memory is 5-1-1-1, so the cache pulls in consecutive 64 bits
      (limit of cache bus?? size). 64 bits equals 2 ints, so it will
      draw in x[0] and x[1] and encounter a cache miss at x[2], so it
      will have to go into main memory...?

A5-5) The main memory is read in 32-byte (8-int) chunks. That means
      that when x[0] is referenced, x[0], x[1], ... x[7] are read into
      cache ... Why do you think there is a miss when x[2] is
      referenced? It is in the cache. So, it should be a hit.

Q5-6) Why is a 4-unit pattern (5-1-1-1) used to describe memory? Why
      don't we use a multi-unit pattern (5-1-....-1)?

A5-6) It is just an assumption ... which is not unrealistic.

Q5-7) What integer length is assumed in the homework? 2 or 4 bytes?

A5-7) 4 bytes.

Problem 6
---------

Q6-1) For part C, the temporal locality of a given index would always
      be the same? Aren't we writing it immediately after accessing
      it? I'm not sure how you would get a k and i, and why the csize
      and stride have any effect. Each index is only accessed twice,
      and the two accesses are right next to each other?

A6-1) The problem could have been written much more clearly. So, here
      is a restatement. The D() metric measures how many memory
      references into the future are required before a memory
      reference is repeated. Applied to this problem, what we really
      mean is: after x[i] is written, D() indicates how many data
      memory references will occur before it is referenced again.

Q6-2) For part e, are we supposed to calculate this or infer only from
      the graph? I guess I'm just not sure how to go about thinking
      about this problem.

A6-2) Infer from the graph. Problem 5 shows that having a cache line
      size means that for small strides, RWT increases approximately
      linearly with the stride; i.e., double the stride and you will
      double RWT. But since Problem 5 assumed only a single-level
      cache and real systems have 2 or 3 levels, the factor might not
      be 2. But you should see a multiplicative effect peaking at the
      cache line size (with some noise). Note that the x-axis is
      logarithmic and not linear, making the left end of the graph
      appear non-linear.
      There may be other things that make the graphs different from
      Problem 5, but you should still see the multiplicative effect of
      stride until the stride exceeds the cache line size.

Q6-2) I am seeing worst-case RW times of about 5 or 6 nsec in my
      membench tests. That seems really fast. Is that right?

A6-2) The largest array in the original membench.c is only 4 MB. Some
      newer machines (last 2 years) have an L2 cache that is at least
      that big. If you use the newer membench.c that I posted as a
      separate link, you will likely see worst-case times in the tens
      of nsec.

Q6-3) I have another question, this time about number 6: While
      interpreting the graph, how can one determine the size of the
      cache line from looking at the stride? I understand where to
      look (when the shape of the graph first changes), which is
      around stride = 80, but I don't know how to make any statement
      about the cache line size (or what units to use) from the
      stride.

A6-3) Some comments:
      o The x-axis is logarithmic, not linear.
      o If you look at the raw data (generic.out), the stride values
        are 4, 8, ... They start at 4 and double. The code says that
        also. So, "stride = 80" cannot be right. You must mean 64 or
        128.
      o I think the FAQ already says the following but ... from
        Prob 5, you expect the RWT graph to have an elbow at the
        stride that corresponds to the cache line size of some cache.

Problem 7
---------

Q7-1) All of my timings seem to be OK except for one involving fork.
      This one occurs when 10 and shows average fork time that is 200
      times larger than all of the others. Do you think that should be
      true? I ran this on my laptop that isn't running anything else.

A7-1) The fact that all of your other times seem to be stable probably
      means that the one case is not representative ... no guarantee,
      but probably.
      The fact that you aren't running any other processes doesn't
      mean that your timeit process didn't lose the CPU, because there
      are daemons that can be running, and the OS will periodically do
      accounting that can take a long time. Enter: "ps clax" and you
      will likely see 30-50 processes (probably in the Suspended
      state) even when it looks like nothing is happening.

      The first thing to do is a hand calculation of the time
      difference (use the last two columns) for that one weird case,
      to make sure the problem is not a printing problem.

      There is also the small (although unlikely) possibility that
      your timeit process was interrupted by some disk activity, which
      would cause this behavior ... one disk access can take 10-20
      msec.

      You should repeat the experiment a few times to see if the
      weirdness repeats. If it does, the weirdness could still be just
      some OS activity. Note that the fork experiment is the most
      likely to experience an interruption. Calculate how long it
      takes to do one experiment. Suppose that it is X msec. Then the
      probability that your process was interrupted by a hardclock
      interrupt is about Min{ X/10, 1 }.

Q7-2) In the code, tavg is left unassigned and I'm not quite sure that
      I'm assigning it correctly. I'm assigning it to be:

          tavg = 1000000 * (tend.tv_sec - tbegin.tv_sec)
                 + tend.tv_usec - tbegin.tv_usec;   // in usec

      The way I assign tavg as stated above, it is just calculating
      the amount of time it took to run each test, which is not an
      average but just a difference. However, I'm not certain if tavg
      is supposed to be the average of the different runtimes of each
      test for each value of N (since there is the inner loop with j
      running each test 3 times). Yet if it were in fact supposed to
      be an average, displaying it with each N and each j is
      meaningless; it should only be displayed after the three
      iterations of j. Can you help me understand what tavg actually
      is supposed to represent?

A7-2) You have to ask yourself what definition of average would make
      any sense. The loops look like this:

          foreach call type {
              do 3 times {
                  do N times { ... }   // work
              }
          }

      The innermost loop executes the workload to be measured N times.
      So, what we are interested in is something like this (in some
      pseudo programming language that I have not defined):

          foreach call type {
              do 3 times {
                  tbegin <== Get time;
                  do N times { ... }   // work
                  tend <== Get time;
                  tavg = (tend - tbegin)/N;
              }
          }

      We want to find out how long it takes to do the work once. So,
      we measure how long it takes to do the work N times and compute
      the average over the N executions. The reason for doing this 3
      times is to see if we can get a run where the results are stable
      and uninterrupted by some other user. The choice of 3 here was
      arbitrary.

      Your code:

          tavg = 1000000 * (tend.tv_sec - tbegin.tv_sec)
                 + tend.tv_usec - tbegin.tv_usec;   // in usec

      computes the difference and not the average. You need to divide
      by N:

          tavg = 1000000 * (tend.tv_sec - tbegin.tv_sec)
                 + tend.tv_usec - tbegin.tv_usec;   // in usec
          tavg /= N;
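
      The measurement described in A7-2 could be sketched in C roughly
      as follows. This is only an illustration, not the course's
      timeit code: it assumes the timestamps come from gettimeofday(),
      and the names elapsed_usec, time_avg_usec, and dummy_work are
      hypothetical helpers invented for this example. The average is
      kept as a double so that dividing by N does not truncate
      sub-microsecond results.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Elapsed time from *begin to *end, in microseconds (hypothetical helper). */
static double elapsed_usec(const struct timeval *begin,
                           const struct timeval *end)
{
    return 1000000.0 * (double)(end->tv_sec - begin->tv_sec)
         + (double)(end->tv_usec - begin->tv_usec);
}

/*
 * Run work() n times between two timestamps and return the average
 * time per call in microseconds: (tend - tbegin) / n.
 * Assumes gettimeofday() is the clock in use.
 */
static double time_avg_usec(void (*work)(void), long n)
{
    struct timeval tbegin, tend;

    assert(n > 0);                      /* avoid dividing by zero */
    gettimeofday(&tbegin, NULL);
    for (long i = 0; i < n; i++)
        work();                         /* the workload being measured */
    gettimeofday(&tend, NULL);

    return elapsed_usec(&tbegin, &tend) / (double)n;
}

/* Hypothetical stand-in workload; replace with the call being timed. */
static volatile long sink;
static void dummy_work(void) { sink++; }
```

      A caller would repeat time_avg_usec(dummy_work, N) three times
      per call type and print each result, which matches the
      "do 3 times" outer loop: each printed value is already an
      average over N executions.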