The fact that there is no temporal locality but high spatial locality makes prefetching elements into CPU cache become a feasible way to reduce the amount of memory accesses.

rohany

Ideally however, this seems like a computation that we would like to parallelize efficiently. Aside from trying to increase memory bandwidth, what can we do to try and efficiently use parallel hardware to accelerate these kind of computations?

Master

To people who might be confused with the math here:

GTX1080 has 20 cores and 128 ALUs each core: 20 * 128 = 2560 in total.

Frequency is 1.6GHz, we can do 2560 * 1.6 * 10^9 = 4.096e12 multiplications per second. Each multiplication involves 3 ints = 12 bytes, and the total needed bandwidth is 4.096e12 * 12 / (1024^4) = 45 TB/s. Here we only have 320 GB/s, so GPU efficiency is less than 1%.

200

@rohany For this case, the reason memory bandwidth becomes the bottleneck is that we have a low Compute to Global Memory Access (CGMA) ratio. In order to achieve a higher level of efficiency without increasing the memory bandwidth, we need to increase CGMA ratio, i.e. introducing more calculations for each global memory access. Although we cannot have more computations for this task, we can interleave this task with other tasks, which are not IO-intensive, to increase the overall GPU/CPU efficiency.

yes

@200 Given that this task must be interleaved with other non-IO intensive tasks to increase the efficiency, does that mean that if a machine was programmed to solely do element-wise matrix multiplication, there really is no feasible way to more efficiently use the hardware?

lfragago

As far as I understand this task is not efficiently carried out using parallel computing only because there is a lot of data to be fetched from memory, and because of limited bandwidth. If we had an infinite amount of bandwidth then we would be able to use all ALUS and cores in parallel.
Is this correct?

eourcs

Memory bandwidth limitations is one of the more pertinent issues in computer architecture. Solutions to increasing memory bandwidth usually revolve around adding more buses of increasing width (this is definitely true of GPUs). Without increasing memory bandwidth, we are left to rely on smarter caching. One approach to is to schedule our tasks efficiently with respect to locality.

Here is a link to interesting article by HP on scheduling by available memory bandwidth.

http://www.hpcadvisorycouncil.com/pdf/vendor_content/Platform_Scheduling to Overcome the Multi-Core Memory Bandwidth Bottleneck WP.pdf

kayvonf

@lfragago. You are correct. This problem is a question of throughput. The processor can perform X mathematical operations per second, requiring Y GB/sec of data. The memory system is architected to provide Z GB/sec of data. If Y > Z, then the application will be bandwidth bound. The only way to achieve peak utilization of the ALUs (i.e., to achieve a throughput of X mathematical operations per second), would be to change the program so that the ratio of math operations performed to bytes read from memory is higher.

In this simple program: element-wise addition of two large vectors, there's not much to be done to run faster, other than build a higher throughput memory system.

Slide 68 describes a few strategies to reduce the bandwidth requirements of programs.

sherwood

@Master I believe the calculation should be 4.096e12 * 12 / (1024^4) = 45 TB/s instead of 4.096e12 * 12 / (1024^3) = 45 TB/s

The fact that there is no temporal locality but high spatial locality makes prefetching elements into CPU cache become a feasible way to reduce the amount of memory accesses.

Ideally however, this seems like a computation that we would like to parallelize efficiently. Aside from trying to increase memory bandwidth, what can we do to try and efficiently use parallel hardware to accelerate these kind of computations?

To people who might be confused with the math here:

GTX1080 has 20 cores and 128 ALUs each core: 20 * 128 = 2560 in total.

Frequency is 1.6GHz, we can do 2560 * 1.6 * 10^9 = 4.096e12 multiplications per second. Each multiplication involves 3 ints = 12 bytes, and the total needed bandwidth is 4.096e12 * 12 / (1024^4) = 45 TB/s. Here we only have 320 GB/s, so GPU efficiency is less than 1%.

@rohany For this case, the reason memory bandwidth becomes the bottleneck is that we have a low Compute to Global Memory Access (CGMA) ratio. In order to achieve a higher level of efficiency without increasing the memory bandwidth, we need to increase CGMA ratio, i.e. introducing more calculations for each global memory access. Although we cannot have more computations for this task, we can interleave this task with other tasks, which are not IO-intensive, to increase the overall GPU/CPU efficiency.

@200 Given that this task must be interleaved with other non-IO intensive tasks to increase the efficiency, does that mean that if a machine was programmed to solely do element-wise matrix multiplication, there really is no feasible way to more efficiently use the hardware?

As far as I understand this task is not efficiently carried out using parallel computing only because there is a lot of data to be fetched from memory, and because of limited bandwidth. If we had an infinite amount of bandwidth then we would be able to use all ALUS and cores in parallel. Is this correct?

Memory bandwidth limitations is one of the more pertinent issues in computer architecture. Solutions to increasing memory bandwidth usually revolve around adding more buses of increasing width (this is definitely true of GPUs). Without increasing memory bandwidth, we are left to rely on smarter caching. One approach to is to schedule our tasks efficiently with respect to locality.

Here is a link to interesting article by HP on scheduling by available memory bandwidth.

`http://www.hpcadvisorycouncil.com/pdf/vendor_content/Platform_Scheduling to Overcome the Multi-Core Memory Bandwidth Bottleneck WP.pdf`

@lfragago. You are correct. This problem is a question of throughput. The processor can perform X mathematical operations per second, requiring Y GB/sec of data. The memory system is architected to provide Z GB/sec of data. If Y > Z, then the application will be

bandwidth bound. The only way to achieve peak utilization of the ALUs (i.e., to achieve a throughput of X mathematical operations per second), would be to change the program so that the ratio of math operations performed to bytes read from memory is higher.In this simple program: element-wise addition of two large vectors, there's not much to be done to run faster, other than build a higher throughput memory system.

Slide 68 describes a few strategies to reduce the bandwidth requirements of programs.

@Master I believe the calculation should be 4.096e12 * 12 / (1024^4) = 45 TB/s instead of 4.096e12 * 12 / (1024^3) = 45 TB/s