grose

Question: A further optimization you could make after parallelizing the outer for loop is to note that the sequence of denom values used is the same in every iteration (i.e. the factorials of increasing odd numbers). What if one processor computed those values into an array in memory while the other processors ran the body of the outer for loop in parallel, fetching values of denom from that array as they need them? Would that speed it up? What if the one processor did the entire precomputation beforehand?

Hint: synchronization, to start with

kayvonf

@grose. Nice post. Let's think about a solution where all the denominators are computed beforehand.

Each iteration of the innermost loop burns six math ops to compute the denominator denom = denom * (2*j+2)*(2*j+3), and your proposal replaces those six math ops with a single memory fetch. Let's assume for simplicity that the six ops can be done in six ALU cycles (though that might be optimistic, since the math is dependent).
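
For reference, here's a minimal sketch of the loop being discussed (reconstructed from memory, so the details may not match the slide's code exactly):

    // Taylor series for sin(x): x - x^3/3! + x^5/5! - x^7/7! + ...
    void sinx(int N, int terms, const float* x, float* result) {
      for (int i = 0; i < N; i++) {               // outer loop: one element per iteration
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        float denom = 6.0f;                       // 3!
        float sign  = -1.0f;
        for (int j = 1; j <= terms; j++) {        // inner loop: one term per iteration
          value += sign * numer / denom;
          numer *= x[i] * x[i];
          denom  = denom * (2*j + 2) * (2*j + 3); // the six dependent ops in question
          sign   = -sign;
        }
        result[i] = value;
      }
    }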

The question is, would you trade six ALU ops for a memory request? Well, the precomputed array of denominators that gets reused on every outer-loop iteration would be quite tiny: sizeof(float) * terms bytes, so it would almost certainly be sitting in L1. So really we're trading six arithmetic ops for an L1 hit, not a main memory request.

L1 tends to have an access latency of 4-6 cycles on modern CPUs, and if the CPU stalled waiting for L1 it would be close to a wash. However, that L1 access latency could be hidden by other arithmetic ops on a modern out-of-order CPU. So in short, CPUs are complex, and I'm not sure if your optimization would be a net win or a net loss. Maybe you could code it up and let us know?
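
To make the proposal concrete, here's a rough sketch of the precomputed variant (the names precompute_denoms and sinx_precomputed are mine, and the parallelization of the outer loop is omitted):

    // Fill the table once, up front: denoms[0] = 3!, and
    // denoms[j] = denoms[j-1] * (2j+2) * (2j+3), i.e. the factorials of
    // increasing odd numbers. The table is sizeof(float) * terms bytes,
    // so it should stay resident in L1 for the whole computation.
    void precompute_denoms(int terms, float* denoms) {
      denoms[0] = 6.0f;
      for (int j = 1; j < terms; j++)
        denoms[j] = denoms[j-1] * (2*j + 2) * (2*j + 3);
    }

    void sinx_precomputed(int N, int terms, const float* x,
                          const float* denoms, float* result) {
      for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        float sign  = -1.0f;
        for (int j = 0; j < terms; j++) {
          value += sign * numer / denoms[j];   // one L1 load replaces the six ALU ops
          numer *= x[i] * x[i];
          sign   = -sign;
        }
        result[i] = value;
      }
    }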

grose

Hmm, I was intending the problem to be that if the other processors have to wait on the one doing the computation, then the cost of waiting and of running synchronization routines wouldn't make it worth it. The precomputed case is pretty interesting too; I was betting on it being in L1, but I didn't think it would come so close. Seems pretty neat, so I'll have to code it up.

That being said, it seems like this computation is somewhat redundant. Has anyone ever considered building SIMD hardware such that another set of instructions that just computes the factorial could run in parallel, in fact in lockstep, with the SIMD calculation here, with the result pulled in, perhaps from a shared register? Or anything along those lines?
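
For what it's worth, an out-of-order superscalar core already gets part of the way there: the scalar update of denom is an independent dependency chain, so it can execute alongside the vector math, and a broadcast pulls the scalar result into a vector register, which is roughly the "shared register" idea. A rough AVX-intrinsics sketch of that structure (my own, not from the course materials; handling of remainder elements when N isn't a multiple of 8 is omitted):

    #include <immintrin.h>

    void sinx_avx(int N, int terms, const float* x, float* result) {
      for (int i = 0; i + 8 <= N; i += 8) {
        __m256 xv    = _mm256_loadu_ps(&x[i]);
        __m256 xx    = _mm256_mul_ps(xv, xv);
        __m256 value = xv;
        __m256 numer = _mm256_mul_ps(xx, xv);        // x^3
        float  denom = 6.0f;                         // 3!
        float  sign  = -1.0f;
        for (int j = 1; j <= terms; j++) {
          // The scalar chain (denom, sign) is independent of the vector chain,
          // so the core can overlap them; set1 broadcasts the scalar
          // coefficient into all eight SIMD lanes.
          __m256 coeff = _mm256_set1_ps(sign / denom);
          value = _mm256_add_ps(value, _mm256_mul_ps(numer, coeff));
          numer = _mm256_mul_ps(numer, xx);
          denom = denom * (2*j + 2) * (2*j + 3);
          sign  = -sign;
        }
        _mm256_storeu_ps(&result[i], value);
      }
    }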