Slide 8 of 64
tarabyte

This reminds me of the bin packing problem, except instead of minimizing the "height" of each bin, bin packing algorithms minimize the number of bins used. It definitely wouldn't be useful for us to minimize the number of processors we are using, but are there ways to modify bin packing algorithms to minimize the height of the bins (and therefore the runtime) instead? What other ways do people figure out how to distribute the work?

jkorn

@tarabyte There's a similar type of NP-hard optimization problem called, conveniently, the multiprocessor scheduling problem. You specify a fixed number of processors and a set of jobs with their corresponding lengths, and want to find the minimum time required to schedule all jobs such that no two jobs overlap on the same processor. This doesn't necessarily account for the real-life issue of scheduling overhead, but it's an interesting problem nonetheless.

From wikipedia:

A simple, often-used algorithm is the LPT algorithm (Longest Processing Time) which sorts the jobs by its processing time and then assigns them to the machine with the earliest end time so far. This algorithm achieves an upper bound of 4/3 - 1/(3m) OPT.

where here m is the number of processors.
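LPT is easy to sketch. Here's a minimal Python version that keeps a min-heap of machine end times (the job list is my own example, not from the lecture):

```python
import heapq

def lpt_schedule(job_lengths, m):
    """Longest Processing Time heuristic: sort jobs in decreasing
    order, then greedily assign each job to the machine with the
    earliest end time so far. Returns the makespan."""
    machines = [0] * m  # current end time of each machine
    heapq.heapify(machines)
    for job in sorted(job_lengths, reverse=True):
        earliest = heapq.heappop(machines)  # machine that frees up first
        heapq.heappush(machines, earliest + job)
    return max(machines)

# 7 jobs on m = 3 machines: LPT gives a makespan of 11, while the
# optimum is 9 (5+4, 5+4, 3+3+3) -- exactly the (4/3 - 1/(3m)) * OPT
# bound, since (4/3 - 1/9) * 9 = 11.
print(lpt_schedule([5, 5, 4, 4, 3, 3, 3], 3))
```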

There are other algorithms that utilize network flows, genetic algorithms, etc, to try to get closer to OPT.

tarabyte

@jkorn No way, that's cool. I'll have to look into it more. Thanks, Josh.

apr

In many real-time systems (tasks have strict timing deadlines) there are pre-determined task-sets which need to be scheduled onto different cores. Like @tarabyte and @jkorn mentioned, there are different dimensions that can be kept in mind when optimizing scheduling such task-sets.

For example, sometimes it is favorable to distribute tasks equally across all cores to save power. Frequency scaling changes the frequency of cores based on the workload, and some multiprocessor systems apply frequency scaling to all cores at once, which means that all cores must run at the same frequency. Since real-time tasks have timing deadlines, the particular task-set on each core implies a minimum frequency needed to meet those deadlines. To meet all deadlines, the maximum of these per-core minimum frequencies is used as the frequency for all cores. If the tasks are not equally distributed in this case, one core works harder than the others, but because all cores must run at that core's higher frequency, the result is unnecessary power loss.
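A toy numeric sketch of that effect (my own simplification, not from the paper): suppose each core must execute some number of cycles per scheduling period, so its minimum viable frequency is cycles/period, and the global frequency must be the max over cores.

```python
# Each core i needs cycles_i cycles of work every `period` seconds,
# so core i's minimum deadline-meeting frequency is cycles_i / period.
# With global frequency scaling, all cores run at the max of these.
def global_frequency(cycles_per_core, period):
    return max(c / period for c in cycles_per_core)

balanced   = [100e6, 100e6, 100e6, 100e6]  # cycles per 100 ms period
unbalanced = [250e6,  50e6,  50e6,  50e6]  # same total work, skewed

f_bal = global_frequency(balanced, 0.1)    # 1.0 GHz
f_unb = global_frequency(unbalanced, 0.1)  # 2.5 GHz
# With a rough dynamic-power model of power ~ f^3, the unbalanced
# split forces every core to burn (2.5)^3 ~ 15.6x more peak power.
print(f_bal, f_unb)
```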

Another method used is sleep states. In this case, however, it is possible to turn off individual cores, resulting in much less power loss from those cores. Here it might actually be better to reduce the number of cores used (depending on how frequency is scaled and the particular task-set).

If you are interested, consider reading this paper! https://www.cs.cmu.edu/~ssaewong/mypaper/dvs-rtas2003.ps

vasua

Another scheduling algorithm we considered in 349 was RMS (rate monotonic scheduling), in which each task is given a priority, a period, and an upper bound on computation time. Assuming that there are no runtime checks for computation bound overruns, I suppose this still counts as a static assignment, as the scheduling is entirely deterministic based on the initial conditions. Even with computation bounds checking, if the program isn't given extra time until its next period, we still have a static assignment.
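For RMS there's also the classic Liu & Layland schedulability check: n periodic tasks are guaranteed schedulable if total utilization is at most n(2^(1/n) - 1). A small Python sketch (the task-sets are made-up examples):

```python
def rms_schedulable(tasks):
    """Liu & Layland test for rate-monotonic scheduling.
    tasks: list of (compute_time, period) pairs.
    Sufficient but not necessary: True guarantees schedulability,
    False means the simple bound can't tell."""
    n = len(tasks)
    utilization = sum(c / t for c, t in tasks)
    bound = n * (2 ** (1 / n) - 1)
    return utilization <= bound

# U = 1/4 + 1/5 + 2/10 = 0.65 <= 3*(2^(1/3)-1) ~ 0.78, so schedulable.
print(rms_schedulable([(1, 4), (1, 5), (2, 10)]))  # True
```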

anonymous

In 18-648 Embedded Real-Time Systems, one of the bin-packing heuristics is worst-fit decreasing (WFD). The heuristic is as follows:
1. Sort objects in descending order of size (the job's cost here).
2. Sort the bins (processors) in ascending order of consumed space.
3. Fit the next object into the first sorted bin it can fit into; if it doesn't fit into the first bin, add a new bin for it.
4. If no objects are left, done; otherwise, go to step 2.
This heuristic balances load across all available bins.
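The steps above can be sketched in a few lines of Python (the `capacity` threshold for opening a new bin, and the example job list, are my additions; picking the least-loaded bin each iteration is equivalent to re-sorting the bins every round):

```python
def wfd_pack(costs, num_bins, capacity):
    """Worst-fit decreasing: place each job (largest first) on the
    bin with the least consumed space, opening a new bin on overflow."""
    bins = [0] * num_bins
    for cost in sorted(costs, reverse=True):
        i = min(range(len(bins)), key=lambda j: bins[j])  # least loaded
        if bins[i] + cost > capacity:
            bins.append(0)        # doesn't fit anywhere: open a new bin
            i = len(bins) - 1
        bins[i] += cost
    return bins

# Two processors end up evenly loaded: [9, 9]
print(wfd_pack([5, 4, 3, 3, 2, 1], 2, 10))
```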