Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2015

Slide 38 of 58

kayvonf

A good challenge is to describe all the optimizations that were performed in this implementation. (Yes, they are enumerated on the slide, but not in much detail.) Treat this as your own implementation of an assignment: how would you describe it in your assignment "handin"?

@kayvonf

In this function, our goal is to blur each pixel in an image by taking the average of the 3x3 neighborhood around the pixel. Note that every pixel in our image is represented by an unsigned 16-bit integer.
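As a point of reference, the operation being optimized can be written as a plain scalar loop. This is only a sketch: the function name blur3x3_reference, the row-major std::vector layout, and the choice to copy border pixels unchanged are assumptions made for illustration, not details taken from the slide.

```cpp
#include <cstdint>
#include <vector>

// Naive 3x3 box blur over a row-major image of 16-bit pixels.
// Border pixels are copied unchanged (an assumption made for this sketch).
std::vector<uint16_t> blur3x3_reference(const std::vector<uint16_t>& in,
                                        int width, int height) {
    std::vector<uint16_t> out(in);
    for (int y = 1; y + 1 < height; ++y) {
        for (int x = 1; x + 1 < width; ++x) {
            uint32_t sum = 0;  // wider accumulator: nine 16-bit pixels can overflow 16 bits
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += in[(y + dy) * width + (x + dx)];
            out[y * width + x] = static_cast<uint16_t>(sum / 9);
        }
    }
    return out;
}
```

Every interior output pixel here reads nine inputs, which is the cost the blocked, vectorized, two-pass version is trying to reduce.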

We break the image into blocks of 32 rows by 256 columns, and use OpenMP to assign each row of blocks to a core. Within a block, we process each row in groups of 8 consecutive pixels. Eight 16-bit pixels occupy exactly 128 bits, so one group fills one 128-bit SIMD register.
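The decomposition can be sketched as a loop nest. The constant names and the helper count_pixel_groups are hypothetical, and the sketch assumes the image width is a multiple of 256 and the height a multiple of 32. It only counts the 8-pixel groups that would be visited; it does no blurring.

```cpp
#include <cstdint>

constexpr int kTileRows = 32;      // rows per block
constexpr int kTileCols = 256;     // columns per block
constexpr int kRegisterBits = 128; // width of one SIMD register
constexpr int kPixelBits = 16;     // one unsigned 16-bit pixel
constexpr int kLanes = kRegisterBits / kPixelBits;      // 8 pixels per register
constexpr int kVectorsPerTileRow = kTileCols / kLanes;  // 32 registers per block row

// Walk the image block by block. The outer loop over rows of blocks is the one
// handed to OpenMP; without OpenMP enabled the loop simply runs serially.
long count_pixel_groups(int width, int height) {
    long groups = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : groups)
#endif
    for (int yTile = 0; yTile < height; yTile += kTileRows) {
        for (int xTile = 0; xTile < width; xTile += kTileCols) {
            groups += static_cast<long>(kTileRows) * kVectorsPerTileRow;
        }
    }
    return groups;
}
```

For a 512x64 image this visits 2 x 2 blocks of 1024 groups each, which matches 512 * 64 / 8 pixels-per-group.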

In the loop for (int y = -1; y < 32 + 1; y++), we iterate through each row in the block, starting from the row immediately above the block and ending at the row immediately below the block. The two extra rows are needed because the later vertical pass reads them when it computes the first and last rows of the block. For each row, we process 8 pixels at a time using SIMD vector intrinsics.

Suppose the current 8 pixels are [p0, p1, p2, p3, p4, p5, p6, p7]. We first load [L, p0, p1, p2, p3, p4, p5, p6] into register a, where L is the pixel immediately to the left of p0. Similarly, we load [p1, p2, p3, p4, p5, p6, p7, R] into register b, where R is the pixel immediately to the right of p7. Finally, we load [p0, p1, p2, p3, p4, p5, p6, p7] into register c. In other words, a holds the left neighbors, b holds the right neighbors, and c holds the pixels themselves. We then perform the vector operation (a + b + c)/3, which averages each pixel with its 2 horizontal neighbors, and store the result in the temporary buffer tmp.
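The horizontal step for a single row might look like the following. This is a sketch under stated assumptions, not the slide's exact code: it assumes an x86 target with SSE2, the function name blur_row_horizontal is hypothetical, all loads are unaligned for simplicity, and the division by 3 is done with a multiply-high by 21846 (about 65536/3), which gives floor(sum / 3) only while each three-pixel sum stays below 32768.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (assumes an x86 target)
#include <cstdint>

// Average each pixel with its left and right neighbor, 8 pixels per step.
// Assumptions: in[-1] and in[count] are readable, count is a multiple of 8,
// and every 3-pixel sum is below 32768 so the multiply-high trick is exact.
void blur_row_horizontal(const uint16_t* in, uint16_t* out, int count) {
    const __m128i one_third = _mm_set1_epi16(21846);  // ceil(65536 / 3)
    for (int x = 0; x < count; x += 8) {
        // Three overlapping loads: shifted left by one, shifted right by one, and centered.
        __m128i left   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + x - 1));
        __m128i right  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + x + 1));
        __m128i center = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + x));
        __m128i sum = _mm_add_epi16(_mm_add_epi16(left, right), center);
        __m128i avg = _mm_mulhi_epi16(sum, one_third);  // (sum * 21846) >> 16
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + x), avg);
    }
}
```

The overlapping loads are what let one instruction fetch all eight left (or right) neighbors at once, instead of shuffling lanes within a register.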

After we've processed all the rows in the block, we enter the loop for (int y = 0; y < 32; y++). In this loop, we again process 8 pixels from tmp at a time to take advantage of SIMD vector intrinsics. We load the bottom neighbors into a, the current pixels into b, and the top neighbors into c. We then perform the vector operation (a + b + c)/3. Because tmp already holds horizontal averages, this is equivalent (up to integer rounding) to computing the average of the 3x3 neighborhood around the pixel in the original image.
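A sketch of this vertical pass over the tmp buffer is shown below. As before, it assumes an x86 target with SSE2, and the function name blur_columns_vertical and its parameters are hypothetical. The buffer is assumed to hold (rows + 2) rows of horizontally averaged pixels, with row 0 being the row above the block.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (assumes an x86 target)
#include <cstdint>

// Output row y averages tmp rows y, y + 1 and y + 2, i.e. the rows above,
// at, and below the output pixel. Assumptions: cols is a multiple of 8 and
// every 3-value sum is below 32768 so the multiply-high trick is exact.
void blur_columns_vertical(const uint16_t* tmp, uint16_t* out, int rows, int cols) {
    const __m128i one_third = _mm_set1_epi16(21846);  // ceil(65536 / 3)
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; x += 8) {
            const uint16_t* p = tmp + y * cols + x;
            __m128i top    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
            __m128i middle = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + cols));
            __m128i bottom = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + 2 * cols));
            __m128i sum = _mm_add_epi16(_mm_add_epi16(top, middle), bottom);
            __m128i avg = _mm_mulhi_epi16(sum, one_third);  // (sum * 21846) >> 16
            _mm_storeu_si128(reinterpret_cast<__m128i*>(out + y * cols + x), avg);
        }
    }
}
```

Splitting the blur into a horizontal and a vertical pass means each output pixel costs about 6 loads instead of 9, and keeping tmp as small as one block helps it stay in cache between the two passes.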

Kapteyn

Another thing the programmer has to consider when implementing blur with vector intrinsics is going out of bounds. I believe that when we reach the edges of the image, some of the 8-wide vector loads will extend past the image boundary. So, just as we did in our CUDA code, we have to make sure that our vectors only access data that is in bounds.
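One common way to avoid per-load bounds checks, which is not necessarily what the slide's code does, is to copy the image into a buffer with a one-pixel border so that the left, right, top, and bottom neighbor reads always land on valid memory. The helper pad_with_border below is a hypothetical sketch, and replicating the edge pixels is just one possible border policy.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Copy a row-major image into a (width + 2) x (height + 2) buffer,
// replicating the nearest edge pixel into the one-pixel border.
std::vector<uint16_t> pad_with_border(const std::vector<uint16_t>& in,
                                      int width, int height) {
    const int pw = width + 2;
    const int ph = height + 2;
    std::vector<uint16_t> padded(pw * ph);
    for (int y = 0; y < ph; ++y) {
        // Clamp the source row into [0, height - 1].
        const int sy = std::max(0, std::min(y - 1, height - 1));
        for (int x = 0; x < pw; ++x) {
            // Clamp the source column into [0, width - 1].
            const int sx = std::max(0, std::min(x - 1, width - 1));
            padded[y * pw + x] = in[sy * width + sx];
        }
    }
    return padded;
}
```

The blur would then run over the interior of the padded buffer. Note that a width that is not a multiple of 8 would still need extra padding or a scalar cleanup loop for the last partial vector.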