Does the "Fat Tree" work like some sort of physical data structure? Does the system that manages the matching of work to processors take into account the relative paths between the different processors and compute graph based algorithms behind the scenes to optimize the transportation of interdependent work?

rokislt10

This seems somewhat analogous to the rocket equation. The more processors/cores you put in a supercomputer, the greater the physical distance between the processors and memory. Is there a limit where making a supercomputer larger will actually decrease the overall performance?

Elias

And as a follow on: what are some examples of problems run on supercomputers that aren't bandwidth-limited?

arjunh

@Elias Think about what may happen in a parallel programming paradigm that uses a shared memory model, such as OpenMP.
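One way to make the bandwidth-limited/compute-limited distinction concrete is a roofline-style estimate: a kernel is bandwidth-limited when its arithmetic intensity (flops per byte moved) falls below the machine's flops-to-bandwidth ratio. A rough sketch, with entirely hypothetical machine numbers:

```python
# Roofline-model sketch. PEAK_FLOPS and PEAK_BANDWIDTH are made-up numbers
# for a hypothetical machine, not any real supercomputer.
PEAK_FLOPS = 1e12       # 1 Tflop/s compute peak (hypothetical)
PEAK_BANDWIDTH = 1e11   # 100 GB/s memory bandwidth (hypothetical)

def attainable_flops(arithmetic_intensity):
    """Roofline model: attainable rate is capped by both compute and bandwidth."""
    return min(PEAK_FLOPS, PEAK_BANDWIDTH * arithmetic_intensity)

# Vector add c[i] = a[i] + b[i]: 1 flop per 24 bytes moved (read a and b,
# write c, 8-byte doubles) -- low intensity, so it is bandwidth-limited.
vec_add_intensity = 1 / 24

# Dense n x n matrix multiply: ~2n^3 flops over ~3n^2 * 8 bytes of traffic,
# so intensity grows with n; for n = 1000 it is roughly 83 flops/byte and
# the kernel becomes compute-limited, not bandwidth-limited.
n = 1000
matmul_intensity = (2 * n ** 3) / (3 * n ** 2 * 8)

print(attainable_flops(vec_add_intensity))   # bandwidth-limited regime
print(attainable_flops(matmul_intensity))    # compute-limited regime
```

Dense linear algebra (e.g., LINPACK-style workloads) is the classic example of a supercomputer problem that is not bandwidth-limited, precisely because its arithmetic intensity scales with problem size.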

caretcaret

@rokislt10: At some point, the speed of light is the limit. From 15-210, we know `map` has $O(1)$ span and $O(n)$ work, and `reduce` has $O(\log n)$ span and $O(n)$ work, but physically, a parallel algorithm doing $\Theta(n)$ work can't run faster than $\Omega(\sqrt[4]{n})$ time in 3-dimensional space.

Proof sketch:
In an algorithmic problem, we expect the final answer to be collected at a single location (otherwise the answer is distributed). Without loss of generality, let this be a core at the origin $(0, 0, 0)$, and normalize the speed of light to 1. Suppose there are many cores floating in 3-D space. After time $t$, any core farther than distance $t$ from the origin cannot communicate any work it has done back to the origin, so after $T$ time, only work done by cores within radius $T$ can contribute to the answer. Since each core does work at a constant rate, the total useful work is bounded by what a ball of processors of radius $T$ centered at the origin can compute, which is proportional to $$\int_0^T \frac{4}{3}\pi t^3 \, dt \in O(T^4).$$ So if an algorithm requires $\Omega(f(n))$ total work, the fastest it can be done is in $\Omega(\sqrt[4]{f(n)})$ time.
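A quick numerical sanity check of the integral in the sketch (plain Python, just to confirm the $O(T^4)$ growth against the closed form $\frac{\pi}{3}T^4$):

```python
# Numerically integrate the ball volume (4/3)*pi*t^3 from 0 to T and compare
# against the closed form (pi/3)*T^4 from the proof sketch above.
import math

def reachable_work(T, steps=100000):
    """Riemann-sum approximation of the integral of (4/3)*pi*t^3 dt on [0, T]."""
    dt = T / steps
    return sum((4 / 3) * math.pi * (i * dt) ** 3 * dt for i in range(steps))

T = 2.0
closed_form = math.pi / 3 * T ** 4
print(reachable_work(T), closed_form)  # the two values should agree closely
```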

I first found this result on Quora; the answer references the paper "Your Favorite Parallel Algorithms Might Not Be as Fast as You Think" (Fisher, 1988). The result generalizes: a computation with $I$ inputs and $K$ outputs that requires $T$ total work in $d$-dimensional space takes time $$\Omega\left(\max\left(I^{\frac{1}{d}}, K^{\frac{1}{d}}, T^{\frac{1}{d+1}}\right)\right).$$
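Plugging a few cases into that generalized bound is instructive (the problem sizes below are just illustrative; $d = 3$ for physical space):

```python
# Evaluate the generalized physical lower bound
#   Omega(max(I^(1/d), K^(1/d), T^(1/(d+1))))
# for a couple of familiar problems. Sizes here are hypothetical examples.

def physical_lower_bound(I, K, T, d=3):
    """Lower bound on time for I inputs, K outputs, T total work in d dims."""
    return max(I ** (1 / d), K ** (1 / d), T ** (1 / (d + 1)))

n = 10 ** 6

# reduce: n inputs, 1 output, Theta(n) work. Note the input term I^(1/3)
# dominates the work term T^(1/4) here: just reading the input in physical
# space already costs more than the computation bound.
print(physical_lower_bound(I=n, K=1, T=n))

# comparison-based sort: n inputs, n outputs, ~n log2(n) work.
import math
print(physical_lower_bound(I=n, K=n, T=n * math.log2(n)))
```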

vrkrishn

@caretcaret

Wow that's an interesting thought

@rokislt10

One example of a problem encountered on a supercomputer but not a normal computer is communication cost due solely to the number of cores. The Blacklight supercomputer has 512 cores while your average PC has about 4. You can imagine that an algorithm that broadcasts data to other cores incurs much more communication cost when running on the supercomputer, to the point that communication becomes a significant bottleneck when running parallel programs on supercomputers.

HLAHat

One of the previous lectures talked about Moore's Law possibly slowing down recently as far as processor speed, size, etc. go. Are we reaching any of these physical bottlenecks with supercomputers currently? Or have we been able to keep scaling these up at a pretty consistent rate? Obviously they're limited by the processor limitations too, but is there any research on how many processors/communication lines we can cram in a huge box? I imagine heat would be a bigger concern here and I don't really want to think about how much power would be needed to run everything...

srw

@HLAHat According to Wikipedia http://en.wikipedia.org/wiki/Supercomputer: "The ability of the cooling systems to remove waste heat is a limiting factor" and "Designs for future supercomputers are power-limited" (that is, the power for the cooling units). I imagine it's not space limited, because one can always build a bigger box.

As for whether or not we're reaching that limit, I doubt it. If we look at supercomputer trends, we can see that they're still improving rapidly. http://en.wikipedia.org/wiki/Supercomputer#mediaviewer/File:Supercomputing-rmax-graph2.svg

Of course, with Amdahl's Law, we may run into more problems utilizing all those cores than we do building them. See lecture 1, slide 10.
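Amdahl's Law is easy to sketch numerically (the 1% serial fraction below is just an illustrative number):

```python
# Amdahl's Law: if a fraction s of a program is inherently serial, the
# speedup on p processors is 1 / (s + (1 - s) / p), which is capped at 1/s
# no matter how many processors you add.

def amdahl_speedup(serial_fraction, p):
    """Speedup on p processors with the given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# Even 1% serial work caps a 512-core machine well below 512x:
print(amdahl_speedup(0.01, 512))     # roughly 84x
print(amdahl_speedup(0.01, 10 ** 9)) # approaches the 1/0.01 = 100x ceiling
```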