All OpenCL versions from all vendors have this issue: the computation is wrong. Please take a look at my blog post describing it in detail. Can this be fixed on AMD OpenCL somehow?
https://iblog.isowa.io/2020/01/04/gpu-opencl-fine-tuning-problem
It is part of the GitHub - sowson/darknet: Convolutional Neural Networks on OpenCL on Intel & NVidia & AMD & Mali GPUs... project. Please help!
Thanks!
There is one more thing: when tuning equals 1 the computation is more accurate, and when tuning equals 16 it is faster but not as accurate. Thanks!
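Some difference between tuning 1 and tuning 16 is expected even without any bug, because float addition is not associative: splitting the sum into tuning partial sums and combining them afterwards rounds differently than one serial accumulation. A minimal host-side C sketch (hypothetical standalone demo, not darknet code) illustrating the effect:

/* The same values summed with a different grouping give a different
 * float result, which is why tuning == 1 and tuning == 16 will in
 * general not match exactly. */
#include <stdio.h>

int main(void)
{
    const int n = 1 << 24;          /* number of values to add            */
    const int tuning = 16;          /* mimic the kernel's partial sums    */
    float sums[16] = {0};

    float serial = 0.0f;            /* one accumulator, like tuning == 1  */
    for (int k = 0; k < n; ++k) serial += 0.1f;

    for (int k = 0; k < n; ++k) sums[k % tuning] += 0.1f;
    float chunked = 0.0f;           /* combined partials, like tuning == 16 */
    for (int s = 0; s < tuning; ++s) chunked += sums[s];

    printf("serial    = %f\nchunked   = %f\nreference = %f\n",
           serial, chunked, 0.1 * (double)n);
    return 0;
}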
The question is whether this is a real bug or just a consequence of how the hardware works... I mean,

for (s = 0; s < tuning; ++s) {
    mean[i] += sums[s];
}

is writing to the same memory address from one thread but without any barrier in between, meaning: it is not guaranteed that all results from earlier iterations are completed (or, well... consistent throughout all cache levels) before you access the same value again to add something different.
But I wonder: why don't you use register values for "sum" and "mean" and write only the results to local / global memory when done? It seems both fields are thread-exclusive before / after the barrier, so that would save a lot of time, having only one write of sum to local and only one write of mean to global memory per thread, and you get rid of the caching issues.
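A minimal sketch of that suggestion, assuming the same kernel signature and launch configuration as the code further below (this is an illustration of the private-accumulator idea, not the kernel actually shipped in darknet):

__kernel void fast_mean_kernel(int tuning, __local float *sums, int filters,
                               int batch, int spatial, __global float *x,
                               __global float *mean)
{
    int t = get_global_id(0);
    if (t >= tuning) return;
    int i = get_global_id(1);
    if (i >= filters) return;

    float sum = 0.0f;                      // private (register) accumulator
    for (int j = 0; j < batch; ++j) {
        for (int k = t; k < spatial; k += tuning) {
            sum += x[j * filters * spatial + i * spatial + k];
        }
    }
    sums[t] = sum;                         // single write to __local memory

    barrier(CLK_LOCAL_MEM_FENCE);          // make all partial sums visible to the work-group

    if (t == 0) {                          // one work-item per group combines the partials
        float m = 0.0f;
        for (int s = 0; s < tuning; ++s) m += sums[s];
        mean[i] = m / (spatial * batch);   // single write to __global memory
    }
}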
I am not sure what you suggest, but my only point was that if you put the following code after the barrier:
printf("%i\n", t);
it will print, for example, values from 0 to 15 when tuning is 16, right? Now, why does putting if (t == 0) { /* code to add the sums together */ } not work? It looks as if (t == 0) simply does not happen, even though when you print the values, t is 0... So the question is why? Any help would be nice, since tuning 1 is slow and 16 is wrong.
PROPER SOLUTION ;-).
The thing is that for the "tuning" thread space you have to make a pool of global threads and a pool of local sums space. I think the solution is elegant overall because it is driven only by the tuning parameter, which can be dynamically adjusted. ;-).
CODE OF INVOCATION
void fast_mean_gpu(cl_mem_ext x, int batch, int filters, int spatial, cl_mem_ext mean)
{
    int tuning = filters / 4;

    dim2 dimGridG1;
    dimGridG1 = dim2_create(tuning, filters);   // global work size: tuning x filters

    dim2 dimGridL1;
    dimGridL1 = dim2_create(tuning, 1);         // local work size: tuning x 1, one work-group per filter

    // 14 values = 7 (pointer, size) pairs matching the kernel arguments;
    // the NULL/size pair provides the __local sums buffer of tuning floats
    opencl_kernel_local(opencl_fast_mean_kernel[opencl_device_id_t], dimGridG1, dimGridL1, 14,
                        &tuning, sizeof(cl_int), NULL, tuning*sizeof(cl_float),
                        &filters, sizeof(cl_int), &batch, sizeof(cl_int), &spatial, sizeof(cl_int),
                        &x.mem, sizeof(cl_mem), &mean.mem, sizeof(cl_mem));
}
CODE OF KERNEL GPU
__kernel void fast_mean_kernel(int tuning, __local float *sums, int filters, int batch, int spatial, __global float *x, __global float *mean) {
    int t = get_global_id(0);     // index within the tuning pool
    if (t >= tuning) return;
    int i = get_global_id(1);     // filter index
    if (i >= filters) return;

    sums[t] = 0;

    int j, k, s;
    for (j = 0; j < batch; ++j) {
        for (k = t; k < spatial; k += tuning) {
            int index = j * filters * spatial + i * spatial + k;
            sums[t] += x[index];  // each work-item accumulates its own partial sum
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE); // wait until every partial sum of the work-group is written

    if (t == 0) {                 // a single work-item combines the partials
        mean[i] = 0;
        for (s = 0; s < tuning; ++s) {
            mean[i] += sums[s];
        }
        mean[i] /= (spatial * batch);
    }
}
Thanks! and Enjoy!