Lecture 23 : Domain-Specific Programming Systems
Watch the Lecture

kuity

As Kayvon mentioned in class:

C: Fully compiled

Java, Scala: Just-in-time(JIT) compilation. Program is stored as bytecode but compiled before execution "just-in-time".

JIT Compilation

Here, "Completeness" refers to the ability to write any program you want, not necessarily Turing Completeness, and not being limited by restrictions of the language.

sjoyner

High Performance: Code that runs fast. For example your code is likely to run faster in C than in Perl or Ruby. Productivity: Ability to easily write a program in a language quickly. Programming languages that are highly productivity are Python and JavaScript, since it's quick and simple to write simple programs. Completeness: Ability to write any program you want. Most modern programming languages satisfy this requirement.

Its very difficult to find a programming language that satisfies all requirements.

LeeK

"Successful" languages generally have 2 out of the 3 desirable traits. No programming language has been created that has all 3. DSLs give high performance and productivity but are specific to a certain domain so they lack completeness.

kayvonf

"No programming language has been created that has all 3."

Well, at least that's the claim (and opinion of many). Perhaps someone may disagree.

Xiao

Food for thought: Why is it so hard to achieve all 3? Is it because of our inability to create better programming languages? Or is it because of inherent design contrains created by the instruction-based programming model?
For example, in order to achieve productivity, we want programming languages that can most closely mimic the actual computation model we are working with, e.g. graphs, lists, high-order functions. However, at the machine level, everything is translated to an instruction based model. The machine is, most of the time, unaware of what it is computing: they are just a bunch of memory load/stores and arithmetic operations. This semantic gap has to be filled by the programming language and its compiler. Larger the semantic gap, the better productivity we can achieve, but the complexity of the compiler increases, and we lose performance during the translation process. On the other hand, if we keep the complexity constant, the compiler will generate good code, but we lose completeness, and so on. In other words, this triangle is the result of the discrepancy between our computation model and the machine's computation model.
However, raising the level of abstraction of the machine's computation model has not been very successful. This is the reasoning behind CISC (Complex Instruction Set Computers). However, recent years, CISC architectures has slowly lost the battle against ARM, SPARC and other RISC architectures.
So this leads to the question I asked at the very beginning... Maybe our current computation stack is inherently flawed. Maybe a different computation model and translation model could fulfill all of these three goals. But that is one huge unknown...

acappiello

I think that part of the answer to some of the questions raised above is because the type of parallel hardware we've been discussing is both relatively new and constantly changing.

As far as completeness and productivity goes, this was something that could be reasonably well addressed 10 years ago (and even further in the past) when mainstream computing consisted of single core processors. Complete programming languages have existed "forever" and even languages like Python have been around since 1991 source: Wikipedia.

On the other hand, there hasn't been the same level of interest in high performance computing in the parallel world for as long. For example, CUDA has only been around since 2007 source: Wikipedia. So, perhaps with more time we will be able to produce a language that masters all three categories. Until we can do that, it looks like one of the three needs to be sacrificed.

In my opinion, the biggest conflict comes between productivity and high performance. The less effort that the programmer has to put into making the high performance program means that the rest of the work has to be made up by the compiler/interpreter. DSLs are useful because this means that the compiler/interpreter only needs to focus on optimizing a small subset of tasks.

edwardzh

Domain Specific Languages are also commonly used in video game logic, especially to facilitate game logic implementation through scripting languages (such as StarCraft II's Galaxy scripting language). In this context, DSLs are 'ideal' for the video game developer who only needs to program video game logic.

sjoyner

We want the dependencies to be obvious. The compiler knows all the fields/vectors and does not have to deference anything or do anything special to determine what vector it is accessing in a given iteration.

kfc9001

Something that wasn't obvious to me but is important to point out is that Lizst is typically used with a single, known mesh. The reason Lizst can divide up these meshes is that it compiles for a single mesh at a time, typically preprocessed in advance. It can use the knowledge from properties of the mesh, and the access patterns of different parts of the mesh to figure out how to divide up the meshes in a distributed system, for example.

kuity

Arbitrary recursion can't be handled because:

The dependencies of the program are not well-scoped, i.e. it is not possible to tell statically which processor needs which data. However, if the recursion depth is limited, it can be supported.

sjoyner

Ghost cells: Cells that are duplicated across multiple machines. For example if we're considering an edge where the end vertices reside on different machines, it's possible we will need access to both vertices. Ghost cells require update messages to be exchanged across machines when writes occur, but not reads.

kfc9001

The Lizst compiler will actually generate MPI sends and receives (for a cluster system) for the ghost cells depending on how it divided up the mesh! That's pretty cool.

Incanam

Since locks are bad due to high contention, the system first devotes some time to coloring the graph to find which nodes can be done in parallel. This is especially applicable in the domain that this language is built for (makes sense) because often the meshes are not dynamic (like the wings of a plane for example). Thus doing the calculation once initially can drastically speedup any number of iterations after the fact.

kayvonf

To be more precise:

Rather than attempt a solution that used fine-grained synchronization to ensure atomicity of updates to shared edge or shared vertex values, this solution performs extra computation up front to compute subsets of the graph than can safely be computed in parallel without any fine-gained synchronization. Of course, there is still synchronization: there will be a barrier between the phases of parallel computation (each phases operates on nodes with the same color). If the extra work of the graph coloring computation is less than the cost of fine-grained synchronization (certainly the case on the modern GPU platforms Liszt targeted), then this is a better way to organize the parallel computation.

markwongsk

what is the probable reason for the super-linear speedup observed in the Navier-Stokes program?

lazyplus

Question: Is there a typo in the second and third bullets of 'High-Performance'?

It seems to be "Used locality-aware partitioning in distributed memory implementation" instead of "Used for ...". The slides before introduced some implementation details that applying locality-aware partitioning and graph coloring to parallelize the mesh computation. Did I misunderstand?

kayvonf

@lazyplus, your interpretation is fine. I just meant that the fact that the compiler knows dependencies in the program "is used" in the optimization listed. It might not be the best grammar but the slide is correct.

This is work efficient since it does less math; instead of adding 9 values for each pixel (the center pixel and the 8 surrounding it), we add 3 values twice (center and its left and right pixels, then for each of those the partial sum above and below to the center), so we only do 6 adds per pixel.

kayvonf

Notice that the relative efficiency of the two-pass approach increases as the size of the filter support region increases. For example, a 10 x 10 blur implemented in two passes performs 20 adds per pixel (10 adds in each pass) rather than 100.

We recompute edge values of each block since we don't waste time communicating, but since the block size is so large (256x32), it's a small tradeoff for a huge increase in parallel performance.

bottie

Since there exists overlap between blocks, so we need to recompute the edge values of each block.

kverrier

What new challenges are be presented by using a single function instead of the intermediate tmp function for the Halide language?

blurred(x, y) = (in(x-1, y-1) + in(x, y-1) + in(x+1, y-1) + in(x-1, y) + in(x, y) + in(x+1, y) + in(x-1, y+1) + in(x, y+1) + in(x+1, y+1)) / 9

jrk

Halide uses named functions as the points at which scheduling decisions can be made. In principle, there is little difference between the pure functional expression you've written on the right hand side here, and the composition of the two Funcs shown in the slide. In fact, this is the implementation you'd get if you left tmp scheduled as inline in blurred. In practice, we found it useful to give the programmer more explicit control over:

(1) the granularity at which scheduling decisions should be specified (2) the granularity at which work is potentially memoized for reuse (which is a key thing the schedule determines)

Compared to the fully inlined, single function version you wrote, all schedules for the two-function pipeline in the slide (short of fully inlining the two stages into one) will do substantially less work.

abunch

So maybe I am seeing numbers like 8 and 32 and jumping to conclusions, but I thought the whole point is that you could do this without knowing any of the underlying hardware and you would just say "compile this to GPU!" ? These numbers make me think that you might be assuming some things such as an 8-wide SIMD, or so many cores in the GPU, or having AVX instructions, etc when you decide how big to make the tiles, chunks, and vectors?

nslobody

To add on to @abunch's question, would it be possible for the compiler to detect the machine's SIMD width, number of cores, and other processor characteristics and use those automatically in determining how to schedule the algorithm? If so then they could be easily overriden with annotations if another schedule besides that which the compiler chooses were desired.

jrk

Yes, the ideal parameters are going to be platform specific. The key idea is that the schedule is fully decoupled from the description of the "algorithm." This is just one of an infinite number of possible schedules for this simple pipeline, and which is best will depend on the target architecture, the parameters and inputs (e.g., are the images small enough to fit in cache?), and the composition of the pipeline. The schedule can be specified explicitly by a human (as it is here), inferred by the compiler (e.g., using stochastic search aka "autotuning," as in the follow-on paper: http://people.csail.mit.edu/jrk/halide-pldi13.pdf), or any combination of the two (as @nslobody suggests).

Arnie

In our assignments, optimizing involved fine-tuning our algorithm to the particular piece of hardware we were using. For example, in Assignment 4, the number of threads we created in the workers was based on the number of cores in our worker computers.

Halide and other DSL's take a different approach. Instead of having to rewrite your code every time you switch machines, you identify dependencies and opportunities for parallelization, and Halide creates code which will be efficient for whatever machine it's running on. This is called decoupling of algorithm and schedule, since the algorithm (the code you write) stays the same regardless of the platform, and the schedule is decided by Halide based on the platform.

kfc9001

I think this is important enough to repeat: Halide doesn't by itself generate the ridiculously fast code mentioned above. By separating correctness and performance concerns for an image processing routine, it is much easier to try different schedules, while only writing the (possibly difficult) algorithm once. This is likely the reason pure functions were chosen to represent images, because these are very easy to reason about, and they prevent the programmer from thinking about performance details.

nslobody

Is OpenGL seen as an essentially well-honed tool for its task, merely needing to be iterated on as technology develops, or is there much research into a long-term replacement for it that is perhaps more functional and less stateful / procedural? What would a next-gen general purpose graphics programming language look like?

kayvonf

This is very much an open research question. A number of us have pursued the idea of a more flexible graphics pipeline, but no one has quite figured it out (or made a strong argument for why a generalized pipeline abstraction is actually what developers want and need).

You find find the following two examples interesting:

You can find a lot more references in these papers.