sanchuah
What is chunk() for?
kayvonf
@sanchuah: It describes a desired producer-consumer evaluation order. See slide 43.
Kapteyn
I'm a bit confused by the scheduling here. In the previous slide, the parallelism over tiles in the y direction applied to both the creation of tmp and blurred. But here it seems we are only telling Halide to produce tmp in chunks and to parallelize computing tiles of blurred along the y direction.
Will Halide figure out that it needs to compute tiles of tmp in parallel along the y direction as well, so that they can be fed into the blurred function immediately, like in the C++ code?
This feels weird to me because the scheduling that we've defined on a function (blurred) that occurs at a later stage in the pipeline is defining the scheduling that should occur for an earlier function in the pipeline (tmp).
kayvonf
@Kapteyn: The schedule specifies "produce elements of blurred in tile-by-tile order, vectorizing over the X direction for elements in a tile and parallelizing across cores at the granularity of different tiles."
Now, for each of those tiles, to compute the value of blurred(x,y), the program must access three elements of tmp. There are a number of ways we can obtain the values of the needed elements of tmp, and each way corresponds to a different producer-consumer scheduling strategy for the pipeline:
We can do all the math on the fly: if we require tmp(x,y), we execute the appropriate math operations on values from in() to compute the needed value of tmp. (This is the 'inline' scheduling policy shown on slide 43.)
We can do all the math up front, storing all elements of tmp in memory in an array. Accessing elements of tmp when computing blurred(x,y) is then just an array lookup. (This is the 'root' scheduling policy on slide 43.)
We can compute and store only the part of tmp needed to compute the elements in the current chunk of blurred. (This is the 'chunked' scheduling policy on slide 43, and what is specified here.) Note that the tile size of blurred is chosen so that the chunked scheduling policy yields pieces of tmp that fit in cache. You can see the temporary allocation for these pieces as the variable tmp in the code on slide 38.
Kapteyn
@kayvonf I'd like to restate what you said to make sure I'm understanding this correctly:
The compiler is aware of which values of tmp each call to blurred(x,y) depends on.
Given these dependencies, it can choose how it wants to generate and store (or not store) the intermediate values in tmp needed by blurred(x,y).
By specifying tmp.chunk(x) we are telling the compiler: "generate the values of tmp that blurred(x,y) depends on, tile by tile in the x direction." So now the compiler knows:
that it should compute blurred on rows of tiles in parallel, and
that for each chunk of blurred in a given row, it should compute all values of tmp needed by that chunk up front, store them somewhere (ideally somewhere that stays in cache), and use those values for the computation of blurred on the current chunk.
What would happen if we specified the schedule as "tmp.chunk(y).vectorize(x,8)" instead?
I'm guessing it would cause a compilation error because we told the Halide compiler to compute rows of tiles of blur in parallel (which depend on rows of tiles of tmp). But "tmp.chunk(y).vectorize(x,8)" tells Halide to compute chunks of tmp column by column. This means that the chunk of tmp that the next chunk of blur depends on will not match up with the chunk of tmp provided by the scheduler.
kayvonf
The output code generated from the schedule above is the code on slide 38. The rough sketch is:

a tile is a 256x32 region of blurred

for each row of tiles of blurred:      // iterations in parallel across cores
  for each tile t in the current row:  // sequential, on a single core
    // stores all the data in tmp needed to compute output tile t
    float tmp[][]
    // vectorize consecutive elements in tmp when processing
    compute elements of tmp (from in)
    // vectorize consecutive elements in blurred when processing
    use elements of tmp to compute the tile of blurred
kayvonf
Consider the following. Since blurred is tiled, we have 4 loops: a loop nest over tiles, and a loop nest over elements within a tile (examples courtesy of Jonathan Ragan-Kelley, creator of Halide):
For blurred.y:
  For blurred.x:
    For blurred.yi:
      For blurred.xi:
        Compute blurred given tmp
So root scheduling means:
// compute full image of tmp here
For blurred.y:
  For blurred.x:
    For blurred.yi:
      For blurred.xi:
        Compute blurred given tmp
chunk(x) means:
For blurred.y:
  For blurred.x:
    // compute a (tile width) x (tile height + 2) piece of tmp here
    For blurred.yi:
      For blurred.xi:
        Compute blurred given tmp
chunk(xi) would mean about the same as inline:
For blurred.y:
  For blurred.x:
    For blurred.yi:
      For blurred.xi:
        // compute a 1x3 piece of tmp here
        Compute blurred given tmp
And chunk(y) would mean:
For blurred.y:
  // compute a (tile height + 2) x (image width) strip of tmp here
  For blurred.x:
    For blurred.yi:
      For blurred.xi:
        Compute blurred given tmp