Archives Discussions

Raistmer · ‎03-25-2015

Trying to speedup processing of few large arrays I used shared/local memory for splittling arrays to smaller blocks and to increase execution domain of kernel.

It wroks on on my dev host (C-60 Loveland) and also gives correct results on HD6950 GPU. But some testers report wrong computations on some GPUs.

So far tested:

C-60 Loveland with OpenCL 1.2 AMD-APP (1268.1) driver (Windows) - correct results

HD6950 with OpenCL 1.2 AMD-APP (1348.5) driver (Windows) - correct results

HD7970/Tahiti with Catalyst 14.9 (Windows) - invalid results

Tahiti LE with Catalyst 14.12/ OpenCL 1.2 AMD-APP (1642.5) driver (Linux) - correct results

Hawaii Pro with Catalyst 14.9/ OpenCL 1.2 AMD-APP (1526.3) driver (Linux)- invalid results

Not too clear is it driver version related issue or card architecture related or some issue with kernel's code itself.

Here is the kernel under question: http://pastebin.com/c9sX8Xwj

It has debug output enabled and different cards provide quite different outputs.

What is wrong here?

P.S. kernel's local domain is always {x,1,z} hence no local id(1) used inside kernel. Also, kernel produced correct results on HD7970 with workgroups/local domain of (1,1,64) and (4,1,1)(this one means no array splitting at all) but generated wrong results with (1,1,128).

Did not find any allowed WG configs that would fail on C-60 so far...

Raistmer · ‎03-29-2015

Additional tests were made on Tahity, Tahity LE and Hawaii devices under Windows and Linux.

While Tahity LE worked with all possible workgroup geometry, both Tahity and Hawaii work correctly only when workgroup size less or equal to wave size (that is WGsize<=64). And for all possible kernel geometries. That is 2x1x32 works as well as 4x1x16, but 1x1x128 will not go.

All this points to some issues with synchronization between waves. Some required barriers missed? Or some issue on another than source code level?...

ravkum · ‎03-31-2015

Hi,

Would like to check the source code here. The shared path is not accessible here. Could you check that?

Regards,

Ravi

Raistmer · ‎04-02-2015

Actually, very high probability that this issue has same roots as described in this thread: possible OpenCl compiler bug few months ago. Cause we tried latest available drivers it means issue not fixed still.

Please do fix to already known and CONFIRMED by your staff issue first. This would save lot of time both users and support staff not to re-check and re-report already detected bugs over and over.

And full kernel code in case I'm mistaken and this is another issue:

__kernel __attribute__((vec_type_hint(float4)))
void PC_find_triplets_avg_kernel_HD5_cl(int ul_FftLength, int len_power, float triplet_thresh_base, int AdvanceBy, int PoTLen,
__global float4* PoT,__global uint* result_flag,__global float4* PulsePoT_average,
__local float4* tmp) {
//R: this one uses local memory to increase number of separate workitems hence number of workgroups for better load of
// multi-CU devices
//R: difference from original kernel: this one just doing fast precheck and relies on CPU for real triplet analysis
//R: this kernel does 4 PoT chunks at once.
//R: cause workitems can write flag in arbitrary order it's not possible to set error code actually (original CUDA code
// missed this fact and tries to return error code from kernel. That is, different variable should be used for setting
// state flags.
int ul_PoT = get_global_id(0);//R: 4 PoTs at once!
int y = get_global_id(1);//R: index of offset chunk
int tid=get_local_id(2);
int fft_len4=ul_FftLength>>2;
int TOffset = y * AdvanceBy;
//R: each wave of third index works on single PoT array
// local float4 local_sum[64/*can be get_local_size(2) if variable length allowed*/];
if(TOffset + len_power > PoTLen) {
TOffset = PoTLen - len_power;
}
__global float4* fp_PulsePot= PoT + ul_PoT + TOffset * (fft_len4);
// Clear the result array
//int4 numBinsAboveThreshold_private=(int4)0;
float4 tmp_private=(float4)0.f,triplet_thresh=(float4)triplet_thresh_base,pp;
__local float4* tmp_local=tmp+get_local_size(2)*get_local_id(0);
/* Get all the bins that are above the threshold, and find the power array mean value */
for( int i=tid;i<len_power;i+=get_local_size(2)/*can be get_local_size(2) if variable length allowed*/ ) {
tmp_private += fp_PulsePot[i*fft_len4];
}
//R: here can be one of new reduce operations but this will require higher CL version
tmp_local[tid]=tmp_private;
for(int i=(get_local_size(2)>>1); i>0;i>>=1){
barrier(CLK_LOCAL_MEM_FENCE);
if(tid<i){
tmp_local[tid]+=tmp_local[tid+i];
}
}
barrier(CLK_LOCAL_MEM_FENCE);
if(tid==0){
tmp_private=tmp_local[0];
tmp_private/= (float4)len_power;
PulsePoT_average[ul_PoT+y*fft_len4]=tmp_private;//R: this avg will be needed later, at pulse finding
tmp_local[0]=tmp_private;//R: to share with other threads
}
barrier(CLK_LOCAL_MEM_FENCE);
tmp_private=tmp_local[0];//R: broadcast reduced value to all threads for further use
triplet_thresh*=tmp_private;
tmp_private=(float4)0.f;
for( int i=tid;i<len_power;i+=get_local_size(2)) {
pp= fp_PulsePot[i*fft_len4];
if( pp.x>= triplet_thresh.x ) {
tmp_private.x+=1.f;
printf("X BRANCH: global:(%d,%d,%d); local:(%d,%d,%d); tmp_private:(%v4g)\n",
get_global_id(0),get_global_id(1),get_global_id(2),
get_local_id(0),get_local_id(1),get_local_id(2),tmp_private);
printf("pp:(%v4g); triplet_thresh:(%v4g)\n",pp,triplet_thresh);
}
if( pp.y>= triplet_thresh.y ) {
tmp_private.y+=1.f;
printf("Y BRANCH: global:(%d,%d,%d); local:(%d,%d,%d); tmp_private:(%v4g)\n",
get_global_id(0),get_global_id(1),get_global_id(2),
get_local_id(0),get_local_id(1),get_local_id(2),tmp_private);
printf("pp:(%v4g); triplet_thresh:(%v4g)\n",pp,triplet_thresh);
}
if( pp.z>= triplet_thresh.z ) {
tmp_private.z+=1.f;
printf("Z BRANCH: global:(%d,%d,%d); local:(%d,%d,%d); tmp_private:(%v4g)\n",
get_global_id(0),get_global_id(1),get_global_id(2),
get_local_id(0),get_local_id(1),get_local_id(2),tmp_private);
printf("pp:(%v4g); triplet_thresh:(%v4g)\n",pp,triplet_thresh);
}
if( pp.w>= triplet_thresh.w ) {
tmp_private.w+=1.f;
printf("W BRANCH: global:(%d,%d,%d); local:(%d,%d,%d); tmp_private:(%v4g)\n",
get_global_id(0),get_global_id(1),get_global_id(2),
get_local_id(0),get_local_id(1),get_local_id(2),tmp_private);
printf("pp:(%v4g); triplet_thresh:(%v4g)\n",pp,triplet_thresh);
}
}
//R: again need to reduce values
tmp_local[tid]=tmp_private;
for(int i=(get_local_size(2)>>1)/*can be get_local_size(2) if variable length allowed*/; i>0;i>>=1){
barrier(CLK_LOCAL_MEM_FENCE);
if(tid<i){
tmp_local[tid]+=tmp_local[tid+i];
}
}
barrier(CLK_LOCAL_MEM_FENCE);
if(tid==0){
tmp_private=tmp_local[0];
if(tmp_private.x>2.f || tmp_private.y>2.f || tmp_private.z>2.f || tmp_private.w>2.f){
//R: global size is power of 2 so it should be safe to perform integer division here
// printf("Resulting numbers of peaks: (%v4g)\n",tmp_private);
int result_coordinate=(get_global_size(0)>RESULT_SIZE)?
((RESULT_SIZE*get_global_id(0))/get_global_size(0)):get_global_id(0);
result_flag[result_coordinate]=1;
}
}
}

ravkum · ‎04-04-2015

Thanks for the code. Could you please also tell me the size of local memory allocated in the host code?

Also what is the global_work_size for all the different local_work_sizes you have mentioned in your earlier posts?

Regards,

Ravi

Raistmer · ‎04-06-2015

Thanks for looking into this issue.

Requested data:

1) List of global kernel sizes for run with failures (received on Tahity LE host that able to do this kernel properly):

host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (4,15,64); local (4,1,64)

host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (8,15,64); local (4,1,64)

host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)