The reason for the dropoff after 16 processors is that execution time starts to be dominated by communication. The ocean simulator work assignment is on the next slide, and as it shows, arithmetic intensity decreases if we keep $N$ fixed and increase $P$. This example demonstrates that adding processors will not always improve performance and may in fact hurt it, especially when the problem size is small enough that it does not need that many processors. At that point we stop being computation-bound and become communication-bound.
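To see why intensity falls with $P$ at fixed $N$, consider a rough model (the constants here are illustrative assumptions, not numbers from the slides): with an $N \times N$ grid split into $\sqrt{P} \times \sqrt{P}$ blocks, each processor computes over $N^2/P$ cells but exchanges ghost cells along its block perimeter, which is proportional to $N/\sqrt{P}$.

```python
import math

def arithmetic_intensity(n, p, flops_per_cell=5, words_per_ghost_cell=1):
    """Rough compute-to-communication ratio for an n x n grid solver
    partitioned into sqrt(p) x sqrt(p) blocks. The per-cell flop count
    and ghost-cell width are made-up illustrative constants."""
    compute = flops_per_cell * n * n / p               # work per processor per iteration
    comm = words_per_ghost_cell * 4 * n / math.sqrt(p)  # perimeter cells exchanged
    return compute / comm

# With n fixed at 268, intensity shrinks as p grows (ratio ~ n / sqrt(p)):
for p in (1, 4, 16, 64):
    print(p, round(arithmetic_intensity(268, p), 1))
```

The model makes the slide's point concrete: compute shrinks like $1/P$ but communication only like $1/\sqrt{P}$, so their ratio degrades as $P$ grows unless $N$ grows too.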
Another reason for a dropoff after 16 processors might be shared resources. It depends on the problem, but as we saw in the demos on the first day of class (four people having to share a single pencil and paper to compute a sum together), each processor may similarly have to contend for some common resource, forcing the program to become essentially sequential at that point. This adds to the fraction of sequential work in the program, which we would eventually like to decrease. A 268x268 grid is thus probably too small a problem for more than 16 processors; using 32 processors makes much more sense for much larger problems (e.g. the 12k x 12k grid on a later slide).
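The effect of that sequential fraction can be quantified with Amdahl's law; the 5% figure below is just an example value, not measured from the ocean simulator.

```python
def amdahl_speedup(p, seq_fraction):
    """Amdahl's law: achievable speedup on p processors when a
    fraction seq_fraction of the work is inherently sequential
    (e.g., serialized on a shared resource)."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / p)

# Even 5% sequential work (an assumed value) caps speedup well below p:
for p in (4, 16, 32):
    print(p, round(amdahl_speedup(p, 0.05), 2))
```

With 5% sequential work, going from 16 to 32 processors buys relatively little, which is consistent with the flattening curve on the slide.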
In addition to communication and shared resources, another reason might be lock contention. If any synchronization primitive is used (highly likely), then as the processor count grows, the overhead of that primitive becomes more significant, and speedup drops. This actually happened in my internship project: when the number of threads accessing our data cache grew past 50, the multithreaded speedup decreased sharply.
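A toy model captures that sharp decrease: suppose the parallel time is work divided by thread count, plus a synchronization term that grows with the number of threads contending for the lock. The work and per-thread sync cost below are invented parameters for illustration only.

```python
def speedup_with_contention(p, work=1000.0, sync_cost=1.0):
    """Toy model: parallel time = work/p + sync_cost * p, where the
    second term models lock contention growing with thread count.
    work and sync_cost are made-up illustrative parameters."""
    return work / (work / p + sync_cost * p)

# Speedup peaks near sqrt(work / sync_cost) threads, then falls off
# as more threads just pile up on the lock:
best = max(range(1, 101), key=speedup_with_contention)
print("peak at", best, "threads, speedup", round(speedup_with_contention(best), 1))
```

In this model the optimum is around $\sqrt{\text{work}/\text{sync\_cost}}$ threads; past that point, adding threads actively hurts, matching the behavior described above.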