Previous | Next --- Slide 41 of 58
Back to Lecture Thumbnails
sanchuah

What is chunk() for?

kayvonf

@sanchuah: It describes a desired producer-consumer evaluation order. See slide 43.

Kapteyn

I'm a bit confused by the scheduling here. In the previous slide, the parallelism over tiles in the y direction applied to both the creation of tmp and blurred. But here it seems like we are only telling halide to produce tmp in chunks and to parallelize computing tiles of blurred along the y direction.

Will Halide figure out that it needs to compute tiles of tmp in parallel along the y direction as well so that they can be fed into the blurred function immediately like in the c++ code?

This feels weird to me because the scheduling that we've defined on a function (blurred) that occurs at a later stage in the pipeline is defining the scheduling that should occur for an earlier function in the pipeline (tmp).

kayvonf

@Kayteyn: The schedule specifies "produce elements of blurred in tile by tile order, vectorizing over the X direction for elements in a tile and parallelizing across cores at the granularity of different tiles.

Now, for each of those tiles, to compute the value of blurred(x,y), the program must access three elements of tmp. There are a number of ways we can obtain the values of the needed elements of `tmp', and each way corresponds to different pipeline producer-consumer scheduling strategy:

  • We can do all the math on the fly, so that if we require tmp(x,y) we execute the appropriate math operations on values from in() to compute the needed value of tmp. (This would be the 'inline' scheduling policy shown on the slide 43.)
  • We could do all the math up front, storing all elements of tmp in memory in an array. As a result, accessing elements of tmp when computing blurred(x,y) is just an array lookup. (This would be a 'root' scheduling policy on the slide 43.)
  • We could compute and store only the part of tmp needed to compute the elements in the current chunk of blurred. (This is the 'chunked' scheduling policy on slide 43 and what is specified here.) Note that the tile size of blurred is chosen so that a chunked scheduling policy yields pieces of tmp that fit in the cache. And you can see the temporary allocation for these pieces as the variable tmp in the code on slide slide 38.
Kapteyn

@Kayvonf I'd like to reiterate what you said make sure I'm understanding this correctly:

The compiler is aware of which values of tmp each call to blurred(x,y) depends on.

Given these dependencies, it can choose how it wants to generate and store (or not store) the intermediate values in tmp needed by blurred(x,y).

By specifying tmp.chunk(x) we are telling the compiler: "generate values from tmp that blurred(x,y) depends on tile by tile in the x direction. So now the compiler knows:

  1. that it should compute blur on rows of tiles in parallel
  2. for each chunk of blurred in a given row, compute all values of tmp in that chunk up front, store it somewhere (i.e. in cache) and use those values for the computation of blur on the current chunk.

What would happen if we specified the schedule as "tmp.chunk(y).vectorize(x,8)" instead?

I'm guessing it would cause a compilation error because we told the Halide compiler to compute rows of tiles of blur in parallel (which depend on rows of tiles of tmp). But "tmp.chunk(y).vectorize(x,8)" tells Halide to compute chunks of tmp column by column. This means that the chunk of tmp that the next chunk of blur depends on will not match up with the chunk of tmp provided by the scheduler.

kayvonf

The output code generated from the schedule above is the code on slide 38. The rough sketch is:

a tile is a 256x32 region of blurred

for each row of tiles of blurred: // iterations in parallel across cores
   for each tile t in the current row  // sequential, on a single core

      // stores all the data in tmp needed to compute input tile t
      float tmp[][]   

      // vectorize consecutive elements in tmp when processing
      compute elements of tmp (from in)

      // vectorize consecutive elements in blurred when processing
      use elements of tmp to compute the tile of blurred. 
kayvonf

Consider the following. Since blurred is tiled, we have 4 loops, a loop nest over tiles, and a loop next over elements in the tile (examples courtesy of Jonathan Ragan-Kelley, creator of Halide):

For blurred.y:
   For blurred.x:
     For blurred.yi:
       For blurred.xi:
         Compute blurred given tmp

So root scheduling means:

// compute full image of tmp here
For blurred.y:
  For blurred.x:
    For blurred.yi:
      For blurred.xi:
        Compute blurred given tmp

chunk(x) means:

For blurred.y:
   For blurred.x:
      // compute a tile width by tile height +2 tile of tmp here
      For blurred.yi:
         For blurred.xi:
             Compute blurred given tmp

chunk(xi) would mean about the same as inline:

For blurred.y:
   For blurred.x:
     For blurred.yi:
       For blurred.xi:
         // compute a 1x3 piece of tmp here
         Compute blurred given tmp

And chunk(y) would mean:

For blurred.y:
  // compute a tile height + 2 x image width strip of tmp here
  For blurred.x:
     For blurred.yi:
        For blurred.xi:
           Compute blurred given tmp