Question: Describe the "schedule" of this program. (In what order are the output pixels of the two-pass blur computed?)
The program splits the image into tiles of 256 x 32 pixels and computes the pixels within each tile 8 at a time using SIMD, starting at the top left and going across each row before moving down to the next. The tiles themselves are visited in the same order: across each row of tiles first, then down to the next row of tiles.
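To make that order concrete, here is a rough sketch in plain C++ (not Halide) of the loop nest the answer above describes. The function name `visit_order` is hypothetical, and the sketch assumes the image width is a multiple of 256 and the height a multiple of 32; it ignores any parallelism and only records the first pixel of each 8-wide vector in the order it would be computed.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Halide: returns the (x, y) coordinate of
// the first pixel of each 8-wide vector, in the order they are visited.
std::vector<std::pair<int, int>> visit_order(int width, int height) {
    const int tile_w = 256, tile_h = 32, vec = 8;
    // Simplifying assumption: the image divides evenly into 256 x 32 chunks.
    assert(width % tile_w == 0 && height % tile_h == 0);

    std::vector<std::pair<int, int>> order;
    for (int ty = 0; ty < height; ty += tile_h)            // rows of chunks, top to bottom
        for (int tx = 0; tx < width; tx += tile_w)         // chunks in a row, left to right
            for (int yi = 0; yi < tile_h; ++yi)            // rows inside one chunk
                for (int xi = 0; xi < tile_w; xi += vec)   // 8 pixels per SIMD step
                    order.emplace_back(tx + xi, ty + yi);
    return order;
}
```

For a 512 x 64 image this yields four chunks. The first 1024 entries all lie in the top-left chunk, and only then does the order move to the chunk on its right.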
How big are the binaries for these kinds of programs? I know C++ binaries can get fairly large, and that's one of the reasons C is still more popular than C++ in the embedded space. I know cell phones have quite a lot of memory, but has there ever been a problem with Halide code hogging too much instruction memory?