I noticed that when the team of 4 evenly distributed the numbers, the paper/pen became a large constraint. When the numbers were unevenly distributed, some people finished earlier and thus were able to write on the paper before the paper was in demand. It seems that the ideal initial distribution of work is neither completely even nor extremely lopsided.
@jhibshma: nice observation. What you're pointing out is that there was a problem caused by contention for a shared resource, and that contention is not just a matter of how frequently that resource is used, it also depends on when clients need to access the resource.
For example, we might have a perfectly good road near Heinz Field with plenty of capacity to handle 55,000 cars per week, but if 55,000 fans decide to leave the Steelers game at the same time, there's going to be contention for the road. And the result of that contention is going to be some cars not making progress (i.e., not using the road) while they wait for access to the resource.
In both the above cases, only 55,000 cars used the road a week.
It would make a lot of sense to express what the team does in code in terms of threads:
16 numbers in an array
a mutex on the paper/pen
Each thread can grab a portion of the array and compute its sum independently of the other threads. However, each thread has to grab the mutex to record its result, and contention for that mutex reduces the speedup.
I'm interested in the scheduling strategy used when distributing the workload in the experiment. In the second part of the experiment, students distributed the numbers evenly, which caused them to compete for the scarce resource (the pen) at the end. However, distributing work evenly seems like a reasonable heuristic when we have multiple computers. Does that mean that if we were implementing a load balancer, we would also need to consider all the shared resources and reduce contention for them as much as possible? Is this true in the real world, and is there any additional material on this topic?
One algorithm that could make the work in Demo 2 (parallelizing addition of numbers) faster is to put one person in charge of the cumulative sum, perhaps along with a small share of numbers to add. The numbers could be distributed among the other three so that each has a different number of digits to add. When the timer starts, whoever finishes first reports their count, and partial sums arrive as they become available. Since each person has a different number of digits, and assuming they calculate at roughly similar speeds, not everyone will be ready with a sum at the same time. This makes the job very organized for the person in charge of the cumulative sum, who only has to make an incremental change to the total each time someone is ready. There is a good chance that no two people are ready with their sums at the same moment, so communication is much smoother.
The analogy can be thought of as an asynchronous system in which a large core handles the barrier synchronization and the sequential part of the program, while the other cores do their parallel jobs. We may also have to think about whether the other cores should be identical and how the work would be distributed among them.
@steffid: I think what you're describing is in fact a heterogeneous system, with each core having different capabilities. There could be one large core to handle the sequential, dependent part of the program, and many small cores to do independent jobs in parallel. This would also be a good strategy for achieving good speedup with less power consumption.
@fuzzyduck: Yes, precisely. A homogeneous setup, which in the demo meant giving everyone the same number of digits, was inefficient. A heterogeneous system can overcome this limitation, which is one of the reasons heterogeneous multi-core designs are popular.
This also reminds me of the work-stealing mechanism. Perhaps it doesn't fit exactly in this case, but in general, if one student finishes early, why doesn't he/she poke around and grab some "extra" work from the others while they are still computing? Done carefully, the workload gets somewhat "evened out".
@tding Won't that lead to more time wasted? The student would then need to 1) ask the other student for the current sum and continue from there, and 2) take numbers the other student has not yet added, which would not only disturb that student but also waste time (passing numbers around, and even just talking, takes time).
Glad to see someone has brought this up already; I had a similar idea during the class demo. Though giving each person 4 numbers looks like an even distribution of work, it is not even in terms of the number of additions each person has to carry out (plus there's the overhead of acquiring shared resources, which proved to be a large delay in this experiment). So I thought unevenly distributing the numbers might give us more evenly distributed work. The idea is to let some of the processors finish their own part early so they can act as coordinators, combining the other processors' results. For example, we could do 2, 4, 4, 6, so the person with two numbers can take on the task of producing the final result from everyone else's partial sums.
@jerryzh Good question about load balancing! Designing a load balancer requires us to consider many such issues. For example, suppose you want to implement your own MapReduce framework on a distributed system (similar to Hadoop). A load balancer for such a framework should be careful to assign the right amount of work to each processor in the distributed system, avoid excessive communication overhead, and ensure that once a processor completes a job, it can send its results and start on a different job with minimal delay.

To reduce communication overhead, such a framework could use a Master-Worker strategy, where the processors that do the work communicate only with the Master node, and data is stored on the worker nodes themselves to avoid large data transfers. Since the Master node can listen to the workers asynchronously, there is no contention for writing results back to the Master under this form of load balancing.

A common strategy for distributing work is a work queue: when a worker node finishes a computation, the Master simply assigns it the next task from the queue. This way, no processors sit idle due to a bad static assignment of work to processors. It still doesn't guarantee that all workers are busy all the time, because one computation may take longer than the others, or the worker doing it may be slower than the other workers in a heterogeneous distributed system.