In this example for N=130, running 32 processors is actually slower than running on 1 processor. In this case each processor only has to compute 310 grid cells. After the computation there is synchronization and communication that must occur between the processors. Because the problem size is so small, the computation would be quicker to do on one processor and have less overhead.
This reminds me of mechanisms of a big company, where you always feel there is huge communication overhead, and you have to wait for other colleague's work to be done to continue yours (synchronization). While in a small company or a startup, everyone single one is much more efficient, and usually one person will be responsible for a complete and independent component.
In practice, programs will have a cutoff size below which parallelism would not occur. For example, if a program wanted to perform parallel mergesort, it would mergesort in parallel until the input is small enough that sorting with one processor would be more efficient.
The reason why the N are always to the power of two plus two is that the whole picture is always power of two but has one more line around this picture for its edges to look at.
For smaller problems, when the number of processors goes up, the speedup curve goes down. It is because as the number of process increase, the arithmetic intensity will drop; and thus the speedup is more likely to be bounded by the communication overhead. But for bigger problem, more processor is a good thing and the speedup curve tends to approach the ideal speedup curve.
This actually happened in my internship last summer, and caused a major frustration. I was developing a multi-threaded library in a key-value cache, and for safety concerns I was only allowed to use "test instances", which received a tiny fraction of the load of production server. As an example, a production instance would have 20K queries per second, yet a test instance only receives 1K queries per second.
After I benchmarked my performance on the test instances, the results were (unsurprisingly) really disappointing. After digging around for two days (!), my manager hinted me to try it on a production box, and voila, the performance boost was really significant!
I think the reason for no benefits with 32 processors is that N is too small plus there is the communication overhead associated with a lot of processors.