What are the benefits of the greedy join approach? Is it just better caching and less communication?
I think it probably reduces the caching required because shared memory can be exploited more.
Additionally, avoiding the communication from thread n to thread 0 means bar() can execute sooner. Communicating the "finished" message is necessarily a sequential operation in this example, so Amdahl's law applies.
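
To make the second point concrete, here is a rough sketch of a greedy join, assuming a simple lock-protected counter. The names GreedyJoin, arrive, run_demo and foo are made up for illustration and are not from any real runtime; only bar() comes from the example being discussed. Whichever thread arrives last runs bar() itself, so no "finished" message has to travel back to thread 0.

```python
import threading


class GreedyJoin:
    """Hypothetical join point: the last thread to arrive runs the continuation."""

    def __init__(self, num_tasks, continuation):
        self._remaining = num_tasks
        self._lock = threading.Lock()
        self._continuation = continuation

    def arrive(self):
        # Decrement the count of outstanding tasks under the lock.
        with self._lock:
            self._remaining -= 1
            is_last = self._remaining == 0
        # Only the final arrival continues past the join; no other thread is notified.
        if is_last:
            self._continuation()


def run_demo(num_tasks=4):
    ran_on = []  # names of the threads that executed the continuation

    def bar():
        ran_on.append(threading.current_thread().name)

    join = GreedyJoin(num_tasks, bar)

    def foo():
        # Stand-in for the forked work; real work would happen before arrive().
        join.arrive()

    workers = [
        threading.Thread(target=foo, name=f"worker-{i}") for i in range(num_tasks)
    ]
    for w in workers:
        w.start()
    # These joins only clean up the demo threads; they are not part of the policy.
    for w in workers:
        w.join()
    return ran_on
```

Calling run_demo() returns a one-element list naming the worker that ran bar(). Which worker that is depends on scheduling, and that is the point: the continuation starts on whichever thread finishes last instead of waiting on a fixed one.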