cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

marwen
Journeyman III

Coalescence

i'm trying to evaluate performance differences between global memory access in a 128 bytes aligned boundary manner against an access not aligned.

the fact is that i can't find any difference although there should be an improvement of like an order of magnitude due to coalesced access.

so what is the smallest kernel code which exhibit global memory access coalescence?

Thanks

0 Likes
2 Replies
omkaranathan
Adept I

Coalescing :  memory access pattern  that occurs when that the memory requests generated by each thread are in the same GPU cache line. So when each thread reads a vec4 sequentially from global memory that access pattern is coalesced.

 You can compare the performance between these 2 kernels –

 __kernel void read(__global float *input,__global float *output)

{

float4 val = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

uint index = get_global_id(0);;

val = val + input[index + 0];

val = val + input[index + 1];

val = val + input[index + 2];

val = val + input[index + 3];

val = val + input[index + 4];

val = val + input[index + 5];

val = val + input[index + 6];

val = val + input[index + 7];

val = val + input[index + 8];

val = val + input[index + 9];

val = val + input[index + 10];

val = val + input[index + 11];

val = val + input[index + 12];

val = val + input[index + 13];

val = val + input[index + 14];

val = val + input[index + 15];

output[get_global_id(0);] = val;

 }

 And

 __kernel void read(__global float4 *input,__global float4 *output)

{

float4 val = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

uint index = get_global_id(0);;

val = val + input[index + 0];

val = val + input[index + 1];

val = val + input[index + 2];

val = val + input[index + 3];

val = val + input[index + 4];

val = val + input[index + 5];

val = val + input[index + 6];

val = val + input[index + 7];

val = val + input[index + 8];

val = val + input[index + 9];

val = val + input[index + 10];

val = val + input[index + 11];

val = val + input[index + 12];

val = val + input[index + 13];

val = val + input[index + 14];

val = val + input[index + 15];

output[get_global_id(0);] = val;

 }

 Note the difference between float and float4 in arguments.

The 2nd kernel is using coalesce accesses and you will get a good bandwidth using this pattern.



0 Likes

Using a similar benchmark in CAL/IL in compute shader mode, I actually get better performance using float than float4.

Albeit, the float4 gets 4* as many floats as the float but just looking at it, the float global memory access is better overall, but not "per float".

EDIT: what's oddly funny is that I get a better performance for global reads using pixel shader mode for both the float and float4 types.

0 Likes