himmelnetz

The last bullet indicates that your granularity really depends on many factors, such as the task being done and the hardware. How do people who do high-end* parallel programming in industry accomplish this? Does it involve a bunch of experimentation and fine-tuning of parameters? Could compilers (reasonably) accomplish this, or some facet of it? Or do they just get it good enough and then use it on all similar environments?

*I'm thinking more about somewhat everyday applications, such as video games, than about experimental super-parallel supercomputers.

russt17

@himmelnetz To me this doesn't seem like a place for a compiler to get involved, since it has to do with how work is divided into tasks, which is a decision made at the algorithmic level. In this case, for example, the programmer might use knowledge of the distribution of primes to help design the work distribution. The compiler isn't going to know things like that.

I imagine that educated testing and tuning is going to be the best way to figure out how to divide the work into tasks, and that it would be necessary to re-optimize the code for different environments.
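
To make that concrete: in the prime-counting example, testing whether n is prime by trial division costs roughly O(sqrt(n)), so a blocked assignment of the range [1, N] hands the last thread far more work than the first, while an interleaved assignment spreads the expensive numbers evenly. Here's a minimal sketch of the interleaved version (the trial-division cost model and the code are my own illustration, not from the slide):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Trial division: cost grows roughly like sqrt(n), so work per
// element is highly non-uniform across the range [1, N].
static bool is_prime(long n) {
    if (n < 2) return false;
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

// Interleaved (cyclic) static assignment: thread t tests
// t+1, t+1+T, t+1+2T, ... so expensive (large) numbers are
// spread evenly across the T threads.
long count_primes(long N, int num_threads) {
    std::atomic<long> total{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; t++) {
        threads.emplace_back([=, &total] {
            long local = 0;
            for (long n = t + 1; n <= N; n += num_threads)
                if (is_prime(n)) local++;
            total += local;  // one atomic update per thread
        });
    }
    for (auto& th : threads) th.join();
    return total;
}
```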

fgomezfr

@himmelnetz Speaking with a little experience from the game industry: possibly the most important phrases on this slide are "dynamic assignment" and "know your machine." Microsoft and Sony provide detailed performance-capture tools to measure and visualize CPU and GPU execution; this makes fast, low-cost experimentation and tuning (should my hull shader run at one thread per face or one per vertex? measure it!) standard practice for game engineers. You've also got the right idea for unknown targets (e.g. PC games): the compiler can do a limited amount, but the game developer will test on an assortment of machines from min- to recommended-spec and choose a reasonable implementation.

For simple, individual tasks it's not too difficult to write/compile an optimal implementation for a specific architecture (e.g. "test two boxes for intersection on a single AVX thread"). What we haven't seen yet, and what your comment brings up, is the problem of many different tasks all at once. Game engines have to do lots of different work, and the number and duration of tasks of each kind can shift dramatically as the player moves and the game unfolds (e.g. "test these 20,000 pairs of boxes, and meanwhile step vehicle physics and simulate sound"). Dynamically assigning tasks to available cores is an important tool for managing work balance in scenarios like this.
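
A minimal sketch of that dynamic-assignment pattern, using an atomic counter as the simplest possible shared work queue (a real engine would use a proper task scheduler with priorities and dependencies; the `grain` parameter here is exactly the granularity knob this slide is about):

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Dynamic assignment: workers repeatedly claim the next chunk of
// iterations from a shared counter, so cores that finish early
// simply grab more work instead of idling.
void parallel_for_dynamic(long num_tasks, long grain,
                          const std::function<void(long)>& task,
                          int num_threads) {
    std::atomic<long> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++) {
        workers.emplace_back([&] {
            while (true) {
                long begin = next.fetch_add(grain);  // claim `grain` tasks
                if (begin >= num_tasks) break;
                long end = std::min(begin + grain, num_tasks);
                for (long i = begin; i < end; i++)
                    task(i);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```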

sgbowen

Do people ever adjust granularity dynamically? It seems like it would be easy to automate measuring the amount of time spent in the critical section vs. the amount of time working. The program could then adjust the granularity up or down depending on whether that ratio is above or below some threshold indicated by the programmer. For instance, I could say I want no more than 5% of my time spent in the critical section, and could write code to automatically bundle more tasks together if I see, say, 7% of my time spent there.

Is this something that's ever done in practice? Or do programmers rely more on knowing their machine and measuring different results manually?
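
For what it's worth, here is one way that feedback loop could look in code. The 5% threshold and the idea of bundling more tasks are from the comment above; the doubling/halving policy, the per-worker tuner, and the timing plumbing are assumptions of mine, just to show the shape of it:

```cpp
#include <algorithm>
#include <chrono>
#include <mutex>

// Per-worker feedback loop: if too large a fraction of time goes to
// the critical section (grabbing work), bundle more tasks per grab;
// if overhead is comfortably low, shrink the bundle for better balance.
struct GranularityTuner {
    double max_overhead = 0.05;  // programmer's target: <= 5% in critical section
    long   grain        = 1;     // tasks taken per lock acquisition
    double time_locked  = 0.0;   // seconds spent inside the critical section
    double time_working = 0.0;   // seconds spent doing useful work

    void record(double locked_sec, double work_sec) {
        time_locked  += locked_sec;
        time_working += work_sec;
        double frac = time_locked / (time_locked + time_working);
        if (frac > max_overhead)
            grain *= 2;          // e.g. observed 7% > 5%: take bigger bundles
        else if (frac < max_overhead / 2 && grain > 1)
            grain /= 2;          // overhead is low: smaller bundles, better balance
    }
};

// One worker's loop; each worker owns its own tuner, so there are no
// races on it. The shared counter is mutex-protected, making "time to
// get work" the measurable critical section.
void worker(long num_tasks, std::mutex& m, long& next,
            GranularityTuner& tuner, void (*task)(long)) {
    using clock = std::chrono::steady_clock;
    while (true) {
        auto t0 = clock::now();
        long begin, end;
        {
            std::lock_guard<std::mutex> lock(m);  // critical section
            begin = next;
            next  = std::min(next + tuner.grain, num_tasks);
            end   = next;
        }
        auto t1 = clock::now();
        if (begin >= num_tasks) break;
        for (long i = begin; i < end; i++) task(i);  // useful work
        auto t2 = clock::now();
        tuner.record(std::chrono::duration<double>(t1 - t0).count(),
                     std::chrono::duration<double>(t2 - t1).count());
    }
}
```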

kayvonf

@sgbowen. Definitely. Although in general it's good to err on the side of keeping programs as simple as possible, profiling performance and using information from the profile to automatically and dynamically adjust task granularity is done all the time, although a precise real-world example that I could link to is eluding me right now.

Academically, here's a paper by CMU professor Umut Acar on how a runtime might be able to adjust granularity in a fork-join programming model. http://www.mpi-sws.org/~mrainey/oracle_scheduling.pdf
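
For a flavor of what granularity control looks like in fork-join code: the usual manual approach is a sequential cutoff, and (very roughly) the oracle-scheduling idea in that paper is to replace the hand-tuned constant with a runtime estimate of task cost. A toy sketch with a fixed cutoff, where the value 4096 is an arbitrary assumption that would be tuned or predicted in practice:

```cpp
#include <thread>

// Fork-join sum with a sequential cutoff: below CUTOFF elements the
// recursion runs serially, so task granularity never drops beneath
// the point where fork overhead dominates. (A real implementation
// would use a task pool rather than spawning raw threads.)
const long CUTOFF = 4096;  // arbitrary; tuned or cost-predicted in practice

long parallel_sum(const long* a, long n) {
    if (n <= CUTOFF) {               // task too small: don't fork
        long s = 0;
        for (long i = 0; i < n; i++) s += a[i];
        return s;
    }
    long left;
    std::thread t([&] { left = parallel_sum(a, n / 2); });  // fork
    long right = parallel_sum(a + n / 2, n - n / 2);
    t.join();                                               // join
    return left + right;
}
```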