Previous | Next --- Slide 24 of 36
Back to Lecture Thumbnails
kayvonf

Question: If you were designing a superscalar processor that only had to run the program on this slide, how many instructions per clock would you design it to support? Why?

yanzhan2

I think 2 is enough, need 3 cycles and good usage.

mchoquet

I agree: two instructions per cycle is the minimum amount of parallelism required to run the program as fast as possible (3 cycles), so supporting any more than that would be needless complexity.

retterermoore

I was interested in how this would scale, so I looked it a little bit (i.e. if your program is the sum of the squares of k numbers, how does the minimum instruction per clock that lets you achieve the fastest time, scale with k? (in this example k = 3, mipc = 2)

The first few are pretty easy to calculate:

1:1

2:2

3:2

4:4

Which suggests a pattern (1, 2, 2, 4, 4, 4, 4, etc.)

If this pattern held for k = 5, the mipc for fastest runtime would be 4. But I don't think that pattern actually holds - the best you can do with 5 numbers is 4 steps (step 1 square all numbers, step 2 add 2 pairs of them, steps 3 and 4 finish the addition), but you can achieve this with only 3 instructions per clock:

Step 1: aa, bb, cc

Step 2: dd, ee, aa + bb

Step 3: dd + ee, aa + bb + cc

Step 4: aa + bb + cc + dd + ee

Can anyone come up with a consistent pattern for the scaling of mipc that does hold? (also, how does math notation work in this comment system?)

gif

@retterermoore This is a good question to think about. It might also be good to think about what other concerns might arise as you increase k. Would other properties of the hardware begin to become more important? What kind of assumptions that are valid when k is 5, versus when k is 10000 (perhaps in terms of how to get hold of the data in x, y, z, ...)?

As for your question about markdown, be sure to check out the markdown tutorial for a brief overview.

TheRandomOne

So one thing that's important for the pattern is that the number of steps will increase by one every time a new power of two is passed. That is, when at most 4, there are at most 3 clock steps needed (probably a bad term for it), where at 5 there are 4 needed, and at 9 5 are needed by properties of logs and trees.

Also, when we have a power of two as our number of values, we need enough instructions in a clock step to calculate all of the squares in a single step. So 16 would need 16 instructions per clockstep available to complete in 5 steps.

I'm not entirely sure how to generalize from here, though. It's clear that for any n, we need at most n instructions per clock step and lg(n) steps in order to complete the problem, but I'm not sure how to reduce the number of instructions per clock step needed further.

ruoyul

I don't quite get the answer to the question. Can someone clarify how the answer of 2 instructions per clock cycle was arrived? What does minimum instructions per cycle exactly mean?

devs

@ruoyul Someone can correct me if I'm wrong, but there are a total of 6 instructions. As we can see, the (xx) is independent from (yy) which is independent from (zz). So we could have 3 instructions per clock cycle. But looking at the next step, we see that in order to add all of these three quantities together, we must first add (xx) then (yy) before we can additionally add (zz). This would mean in the second and third cycles, we would have the capability to read in 3 instructions, but we would actually only be performing one instruction. So going back to the first step, let's have only two instruction reads per clock cycle. Thus we can assign the work in the following manner:

Cycle 1

(x * x)

(y * y)

Cycle 2

(z * z)

(x * x)+(y * y)

Cycle 3

(zz)+(xx)+(y*y)

Comparing vs the 3 instructions per cycle we see

Cycle 1
(x * x)

(y * y)

(z * z)

Cycle 2 (x * x)+(y * y)

Cycle 3 (z * z)+(x * x)+(y * y)

Having more than 2 instructions per cycle would also unnecessarily complicate the design and not be as optimal as we would like as @yanzhan2 and @mchoquet pointed out. And so minimum instructions per cycle means what is the fewest amount of instructions we execute per cycle so that we maximize parallelism.