
zkhan
Journeyman III

Best practices for accumulation?

How to handle: global_result += kernel_result

The kernel I'm working with takes a 2-d data structure A and some other parameters as input, calculates a result, and then accumulates this value in a single voxel of a large 3-d data structure B at the very end.

i.e.

// calculations using A

.

.

.

B[outIdx] += result;

where A and B are both global.

My problem is that it seems that the sequential read/write caused by the "+=" is a significant bottleneck in the execution. Replacing that statement with only a read or only a write results in a more than 5x speedup. However, I need to accumulate the result - is there a better way to do this, that doesn't incur the severe penalty of the consecutive read/writes?

0 Likes
4 Replies
HarryH
Journeyman III

Try this:

http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study-Simple-Reductions.aspx

0 Likes
dravisher
Journeyman III

Since you've already established that a single read or a single write gives much better performance, can't you read the old value of B into a private variable earlier (say B_local), increment that value locally, and do B[outIdx]=B_local at the end? If the initial read of B is placed somewhere in the code with lots of calculations before and after it, there's a good chance the compiler can hide the memory access latency too. Perhaps what is happening in your case is that the compiler has nothing to schedule in between the read and the write of B. Manually placing the read somewhere else could help in that case.

Also, just to be sure, you're aware that doing B[outIdx]+=result is not safe if any other work-item has the same outIdx?
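If several work-items can hit the same outIdx, the usual workaround for a float accumulator is a compare-and-swap retry loop. Below is a minimal host-side sketch in plain C11, not OpenCL C; the names atomic_add_float and load_float are hypothetical and made up for illustration. In OpenCL 1.x the analogous loop would typically use atomic_cmpxchg on the value's unsigned-integer bit pattern, but that mapping is an assumption here and is not shown. Keep in mind that heavy contention on one voxel can make this slow.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: atomically performs *cell += value, where the float
   is stored as its 32-bit pattern. Retries until no other thread changed
   the cell between our read and our write. */
static void atomic_add_float(_Atomic uint32_t *cell, float value)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t old_bits = atomic_load(cell);
    uint32_t new_bits;
    do {
        float old_val;
        memcpy(&old_val, &old_bits, sizeof old_val);
        float new_val = old_val + value;
        memcpy(&new_bits, &new_val, sizeof new_bits);
        /* On failure, old_bits is refreshed with the current cell contents. */
    } while (!atomic_compare_exchange_weak(cell, &old_bits, new_bits));
}

/* Hypothetical helper: reads the accumulated float back out of the cell. */
static float load_float(_Atomic uint32_t *cell)
{
    uint32_t bits = atomic_load(cell);
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

The loop only writes when the cell still holds the value it read, so concurrent updates are never silently lost.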

0 Likes

The array reduction pattern solves your problem.

The kernel code for the array reduction is attached.

The result actually needs to be stored in an array with one element per work-group; this array is then summed sequentially on the CPU.

I hope it helps.

Himanshu

/* Array reduction using local memory */
#pragma OPENCL EXTENSION cl_amd_printf : enable

/* Add element y of the local scratch array into element x. */
#define Accumulate(x, y) { ResultLocal[x] += ResultLocal[y]; }

__kernel void rrrLocal(int ArrLen,
                       __global float* Input,
                       __global float* Result,
                       int BLOCK_SIZE,
                       __local float* ResultLocal)
{
    int xGlobal = get_global_id(0);
    int xLocal = get_local_id(0);
    int validrange = BLOCK_SIZE / 2;
    int WorkGroup = get_group_id(0);

    /* Each work-item loads one element into local memory. */
    ResultLocal[xLocal] = Input[xGlobal];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction: halve the active range on every step. */
    for (; validrange > 0; validrange /= 2)
    {
        if (xLocal < validrange)
        {
            Accumulate(xLocal, xLocal + validrange);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* Work-item 0 adds this work-group's partial sum to its own slot. */
    if (xLocal == 0)
    {
        Result[WorkGroup] += ResultLocal[0];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
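To make the two stages concrete, here is a rough CPU-side emulation in plain C: one function plays the role of a single work-group doing the tree reduction, and another does the final sequential sum on the host. The names reduce_group and sum_partials are hypothetical, and the scratch array only stands in for __local memory; this is a sketch of the idea, not the attached kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates one work-group: tree-reduces group_size floats held in scratch
   (a stand-in for __local memory). group_size must be a power of two. */
static float reduce_group(float *scratch, size_t group_size)
{
    assert(group_size > 0 && (group_size & (group_size - 1)) == 0);
    for (size_t stride = group_size / 2; stride > 0; stride /= 2) {
        /* On a GPU, each inner iteration is one work-item, and a barrier
           separates consecutive values of stride. */
        for (size_t i = 0; i < stride; ++i)
            scratch[i] += scratch[i + stride];
    }
    return scratch[0];
}

/* Host-side final pass: add up the per-work-group partial results. */
static float sum_partials(const float *partials, size_t num_groups)
{
    float total = 0.0f;
    for (size_t g = 0; g < num_groups; ++g)
        total += partials[g];
    return total;
}
```

Because each work-group writes only its own slot, no two groups ever touch the same output element, which is what avoids the read-modify-write conflict.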

0 Likes
genaganna
Journeyman III

1. Read B[outIdx] into a temporary variable.

2. Calculate result.

3. Write temp + result to B[outIdx].

The steps above should improve your performance.
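The three steps can be sketched in plain C as follows. This only shows the ordering; compute_result is a hypothetical stand-in for the "calculations using A", and accumulate_voxel is a made-up name. Whether the early load actually overlaps with the computation depends on the compiler and hardware, and the pattern is only safe when no other work-item writes to the same outIdx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the "calculations using A". */
static float compute_result(const float *A, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += A[i] * A[i];
    return acc;
}

static void accumulate_voxel(const float *A, size_t n, float *B, size_t outIdx)
{
    assert(A != NULL && B != NULL);
    float temp = B[outIdx];               /* 1. issue the read early */
    float result = compute_result(A, n);  /* 2. work that does not depend on the load */
    B[outIdx] = temp + result;            /* 3. single write at the end */
}
```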

 

0 Likes