Archives Discussions

n_treutner · ‎01-11-2011

Kernel produces wrong results

Hi There,

I've only reecntly switched from 2.2 to 2.3 and now parts of my code, that worked fine in 2.2 don't work anymore. i narrowed the problem (or at least one part of it) down to this kernel (attached), by checking the Buffers with the GDebugger.

What the kernel does, is: it compares to images (inputimage1 and 2) pixelwise. i initialize as many work-units as there are pixels (x, y) and then compare them over a window (X and Y), moving to the left of the image (with ww). the results for each comparison (so, variation of ww) are saved inside a 3d array, that actually is a normal array (dsi).

It appears, that the results, that are saved via

dsi[(x + y * width) * w_range + w_dsi] = sum0.w;

are the same inside each kernel (so the work-unit (x,y) writes 231678 to each field of the dsi, that it is supposed to write to), while they should change, since i change ww (via the for loop).

so i tried to find any changes in the handling of images and/or loops, but i couldn't find any relevant stuff.

i wonder if you guys have any idea, what is wrong.

__kernel void ssdkernel(__read_only image2d_t inputImage1, __read_only image2d_t inputImage2, __global float * dsi, __constant int * params, sampler_t imageSampler) { int x = get_global_id(0); int y = get_global_id(1); int width = get_global_size(0); int height = get_global_size(1); int k = (params[0] - 1)/2; int w = params[1]; int w_min = params[2]; int stepping = params[3]; int w_range = (w - w_min) / stepping; float4 diff0, sum0; //iterate over all disparities for (int ww = w_min, w_dsi = 0; ww < w; ww += stepping, w_dsi++) { diff0 = 0.0f; sum0 = 0.0f; //match the windows for(int Y = -k; Y <= k; Y=Y+1) { for(int X = -k; X <= k; X=X+1) { diff0 = read_imagef(inputImage1, imageSampler, (int2)(x + X, y + Y)) - read_imagef(inputImage2, imageSampler, (int2)(x + X - ww, y + Y)); sum0 += diff0 * diff0; } } sum0.w = sqrt(sum0.x * sum0.x + sum0.y * sum0.y + sum0.z * sum0.z); //compute the euclidian distance dsi[(x + y * width) * w_range + w_dsi] = sum0.w; } }

n_treutner · ‎01-18-2011

Hi.

i still didn't have any success in finding the error.

i wonder, if some unrolling might be the cause of the problem, although i don't set any pragma unroll-flag. do you have any information about changes in this area? is there a detailled list of the changes?

regards,

niklas

Jawed · ‎01-19-2011

I notice you are using the NDRange dimensions to determine the image size.

OpenCL's global domain size must be an integer multiple of the local workgroup size. So if your local workgroup is 8x8 then the global size must be a multiple of that. If these dimensions don't match the dimensions of the image or your dsi buffer (e.g. dsi is too small), then you could have problems.

Perhaps the SDK change has affected the workgroup size?

Another thing that might be worth trying is defining the sampler within the kernel, rather than passing it in as a parameter.

n_treutner · ‎01-19-2011

Jawed, thanks for your answer.

The global size is a multiple of the local workgroup size. also, i've tried other kernels, that work on the same input with the same images and in the same workgroup size, and these kernels work.

i had the sampler defined within the kernel, but as i tested the code on an nvidia-system, it didn't work until we passed the sampler as a parameter.

also, as i said above, other kernels do access the images correctly. and this kernels reads values from the images, too. it's just, that the results, that are written to the dsi-var are all the same inside each kernel (so they vary from kernel to kernel).

Jawed · ‎01-19-2011

Well in this situation I get out the hammer and sprinkle the code with literals instead of computed addresses and use a very small image. Or write the computed indices into the buffer, instead of the data.

I'm about to post a thread where I'm doing something very similar to your code (3 nested loops iterating windows over two images). It won't help you, but you'll at least see that I'm not having grief with loop indices.

n_treutner · ‎01-19-2011

good idea. i did, as you suggested and it turns out, that the computation and/or the adress-computation of the images is the problem. so it's this line:

diff0 = read_imagef(inputImage1, imageSampler, (int2)(x + X, y + Y)) - read_imagef(inputImage2, imageSampler, (int2)(x + X - ww, y + Y));

I found this out by assigning specific values to the dsi in the last line, which turned out to be fine.

What seems to be the problem is the variation indicated by ww. i compare several regions (the region dimensions are indicated by X and Y), apart by ww pixels, to each other. Apparently, for each iteration, when ww is increased, the same set of pixels are compared, although the region should have moved ww pixels to the left.

himanshu_gautam · ‎01-20-2011

Can you specify the typical values you specify in the params array. And you local and global ndrange.

You can send a test case via helpdesk http://developer.amd.com/support/Pages/default.aspx.

Jawed · ‎01-20-2011

The next thing I would try is:

if (x != 20 && y != 20) return;

Just after

int x = get_global_id(0);
int y = get_global_id(1);

So this will run the kernel for a single pixel. See if the result for the single pixel is correct.

I chose 20 on the basis that there will be some pixels to the left. Choose anything you think is reasonable

n_treutner · ‎01-21-2011

himanshu.guatam,

typical values for the params are

1 < k < 10

w = 160

w_min = 40

stepping 1, 2 or 4

that makes w_range usually 20 and so the dsi-array is width * height * dsiSize * dsi_range, so: 640 * 480 * sizeof(cl_float) * 20

the global ndrange is 640 , 480, the local ndrange 16, 16

Jawed, i'll try this as soon, as i get back home. but i don't think it will have different results, since the error is within one single kernel, not between kernels.

also, i start to think, that the sampler might be the problem. it's defined as

imageSampler = clCreateSampler(context, false, CL_ADDRESS_CLAMP,
CL_FILTER_NEAREST, &status);

was anything changed for the image-adressing?

Jawed · ‎01-21-2011

I use

const sampler_t plane = CLK_NORMALIZED_COORDS_FALSE |
CLK_ADDRESS_CLAMP |
CLK_FILTER_NEAREST;

inside my kernel, rather than creating the sampler on the host and passing it as a parameter. You're apparently setting your sampler the same way.

Very strange.

n_treutner · ‎01-21-2011

Jawed, to your suggestion with the if (x != 20 && y != 20) return;

i did as you suggested (with the small change, that i used || instead of && and the result was, as i thought it would be: only one kernel was executed, with the wrong (i.e. the same) results for all instances of ww.

and yes, i pass the sampler to the kernel, rather than creating it inside, beacuse my boss runs it on a nvidia machine, and nvidia needs it created outside, and passed via param. could this be the reason?

i already tested normal reading from images and it works, so the sampler itself isn't broken. maybe it just handles multiple calls to similar areas incorrectly?

Jawed · ‎01-21-2011

LOL at my logic fail.

My code runs on NVidia (have only known this for about 20 hours), so the sampler problem might be with old NVidia drivers. The sampler is defined locally rather than globally, for what it's worth. Though I don't see why changing the sampler should help you.

Try literal values instead of all the loop parameters? e.g.:

    int k = 4;//(params[0] - 1)/2;
    int w = 160;//params[1];
    int w_min = 40;//params[2];

    int stepping = 2;//params[3];

This results in lots of unrolling. See if the result is what you're expecting?

I had mysterious behaviour in my code due to a compiler error. It's pretty tricky to explain. It seems to be a mixture of literals and computed values for address computation - coupled with nested IFs based upon work item IDs. I'm now wondering if you are suffering a similar compiler problem.

One thing I noticed, which is really puzzling, is that I would get different compilations in SKA depending on whether another, similar, kernel was also loaded. e.g. right now NLMSingleFrameFourPixel compiles as 669 ALU instructions if NLMMultiFrameFourPixel and NLMFinalise are in SKA. If I put #if0 ... #endif around them, I get 677 ALU instructions.

That's just nuts in my view.

ibird · ‎01-21-2011

I has also the same problem upgrading from 2.2 to 2.3 ( and driver ) on gpu i has different

results from CPU ( work on CPU Nvidia GPU not on ATI gpu ).

After a lot work trying to isolate the problem i has finded that there is a problem with the constant parameters passed. Im my case it seem that the constant in some situation has strange behaviour.

I order to see if you have the same problem you can change the constant parameters with the runtime value

for example if you have parameter[0] = 256;

change your code from int k = (params[0] - 1)/2; to int k = (256 - 1)/2;

Is not a workaround, but is just to see if you have the same problem

himanshu_gautam · ‎01-23-2011

Jawed and ibird,

I would really appreciate if you can send a test case. I do not find any problem with your kernel in SKA though.

ibird · ‎01-24-2011

himanshu

Is not easy to build a Test Case, i has tried to simplify my kernel to build an easy test case but is extremly difficult to isolate.

The kernel is this attached to this post. ( )

CPU and GPU has different result, GPU work if i multiply with f_aux_dev outside the for cicle. This type of problem has been introduced with the last Catalyst driver needed by the new SDK, downgrading the SDK do not solve the problem. There are not buffer overflow on reading, as one can think; each kernel is alway controlled with a on OpenCL CPU, Nvidia GPU, and simple CPU, no one of them generate errors on result except ATI GPU with last drivers and SDK, (previous driver and SDK do not generate errors).

I hope hat Jawed has a simpler kernel to create a Test Case.

I can create a test case from this kernel to isolate the problem, but not before two or three week.

#define C1RE ( (link0.z*link2.z-link0.w*link2.w) - (link1.x*link2.x-link1.y*link2.y) ) #define C1IM (-(link0.z*link2.w+link0.w*link2.z) + (link1.x*link2.y+link1.y*link2.x) ) #define C2RE ( (link1.x*link1.z-link1.y*link1.w) - (link0.x*link2.z-link0.y*link2.w) ) #define C2IM (-(link1.x*link1.w+link1.y*link1.z) + (link0.x*link2.w+link0.y*link2.z) ) #define C3RE ( (link0.x*link2.x-link0.y*link2.y) - (link0.z*link1.z-link0.w*link1.w) ) #define C3IM (-(link0.x*link2.y+link0.y*link2.x) + (link0.z*link1.w+link0.w*link1.z) ) #define DeclareMatrixRegs float4 link0, link1, link2 #ifndef IMAGE #define LoadLinkRegs(gauge, vol, idx, mu) \ link0 = gauge[idx+vol*(0+3*mu)]; \ link1 = gauge[idx+vol*(1+3*mu)]; \ link2 = gauge[idx+vol*(2+3*mu)] #define LoadLinkRegsBrd(gauge, vol, idx, mu) \ link0 = gauge[idx+vol*(0+3*mu)+brdg]; \ link1 = gauge[idx+vol*(1+3*mu)+brdg]; \ link2 = gauge[idx+vol*(2+3*mu)+brdg] __kernel void DeriveFieldKernel(__global float4 *out_matrix, __global int *table, __global int *phases, __global float4 *gauge_texRef, const unsigned int bsize_dev, const unsigned int size_dev, const float f_aux_dev) { unsigned int brds; unsigned int brdg; unsigned int brde; unsigned int size_dev_t; int2 threadIdx = {get_local_id(0),get_local_id(1)}; int2 blockDim = {get_local_size(0),get_local_size(1)}; int2 blockIdx = {get_group_id(0),get_group_id(1)}; //Linear index int idx = blockIdx.x * blockDim.x + threadIdx.x; int nu, direction; int temp_site; int stag_phase; //New table indexing (index fastest) __local int site_table[NUM_THREADS]; // Staples __local float2 staples[9][NUM_THREADS]; //Temporary row-column store float2 mat0, mat1, mat2; //6 registers DeclareMatrixRegs; //12 registers float2 matrix_00, matrix_01, matrix_02, matrix_10, matrix_11, matrix_12, matrix_20, matrix_21, matrix_22; //18 registers staples[0][threadIdx.x].x = 0.0f; staples[0][threadIdx.x].y = 0.0f; staples[1][threadIdx.x].x = 0.0f; staples[1][threadIdx.x].y = 0.0f; staples[2][threadIdx.x].x = 0.0f; staples[2][threadIdx.x].y = 0.0f; staples[3][threadIdx.x].x = 0.0f; staples[3][threadIdx.x].y = 0.0f; staples[4][threadIdx.x].x = 0.0f; staples[4][threadIdx.x].y = 0.0f; staples[5][threadIdx.x].x = 0.0f; staples[5][threadIdx.x].y = 0.0f; staples[6][threadIdx.x].x = 0.0f; staples[6][threadIdx.x].y = 0.0f; staples[7][threadIdx.x].x = 0.0f; staples[7][threadIdx.x].y = 0.0f; staples[8][threadIdx.x].x = 0.0f; staples[8][threadIdx.x].y = 0.0f; // mu = blockIdx.y for(direction = 1; direction < 4; direction++) { nu = (direction+blockIdx.y) & 3; // nu = (direction+blockIdx.y) % 4; // (1) // +-->--+ // | | // ^mu | V (2) to calculate (1)*(2)*(3) // | | | // +--<--+ // ->nu idx (3) site_table[threadIdx.x] = table[idx+(4+blockIdx.y)*size_dev]; // idx+mu brde = (site_table[threadIdx.x] < size_dev)?0:1; brdg = brde; size_dev_t = (brde)?bsize_dev:size_dev; brde *= 3*size_dev; // brd*(4*size_dev - 1*size_dev) brdg *= 23*size_dev; LoadLinkRegsBrd(gauge_texRef, size_dev_t, site_table[threadIdx.x], nu); // U(idx+mu)_nu stag_phase=phases[site_table[threadIdx.x] + nu*size_dev_t + brde]; matrix_00.x = f_aux_dev*link0.x; matrix_00.y = f_aux_dev*link0.y; matrix_01.x = f_aux_dev*link0.z; matrix_01.y = f_aux_dev*link0.w; matrix_02.x = f_aux_dev*link1.x; matrix_02.y = f_aux_dev*link1.y; matrix_10.x = f_aux_dev*link1.z; matrix_10.y = f_aux_dev*link1.w; matrix_11.x = f_aux_dev*link2.x; matrix_11.y = f_aux_dev*link2.y; matrix_12.x = f_aux_dev*link2.z; matrix_12.y = f_aux_dev*link2.w; matrix_20.x = stag_phase*f_aux_dev*C1RE; matrix_20.y = stag_phase*f_aux_dev*C1IM; matrix_21.x = stag_phase*f_aux_dev*C2RE; matrix_21.y = stag_phase*f_aux_dev*C2IM; matrix_22.x = stag_phase*f_aux_dev*C3RE; matrix_22.y = stag_phase*f_aux_dev*C3IM; // matrix=f_aux_dev*U(idx+mu)_nu ///////////////////////////////////////////////////////////////////////////// site_table[threadIdx.x] = table[idx+(4+nu)*size_dev]; // idx+nu brde = (site_table[threadIdx.x] < size_dev)?0:1; brdg = brde; size_dev_t = (brde)?bsize_dev:size_dev; brde *= 3*size_dev; // brd*(4*size_dev - 1*size_dev) brdg *= 23*size_dev; LoadLinkRegsBrd(gauge_texRef, size_dev_t, site_table[threadIdx.x], blockIdx.y); // U(idx+nu)_mu stag_phase=phases[site_table[threadIdx.x] + blockIdx.y*size_dev_t + brde]; mat0.x = matrix_00.x*link0.x+matrix_00.y*link0.y+ matrix_01.x*link0.z+matrix_01.y*link0.w+ matrix_02.x*link1.x+matrix_02.y*link1.y; mat0.y = -matrix_00.x*link0.y+matrix_00.y*link0.x -matrix_01.x*link0.w+matrix_01.y*link0.z -matrix_02.x*link1.y+matrix_02.y*link1.x; mat1.x = matrix_00.x*link1.z+matrix_00.y*link1.w+ matrix_01.x*link2.x+matrix_01.y*link2.y+ matrix_02.x*link2.z+matrix_02.y*link2.w; mat1.y = -matrix_00.x*link1.w+matrix_00.y*link1.z -matrix_01.x*link2.y+matrix_01.y*link2.x -matrix_02.x*link2.w+matrix_02.y*link2.z; mat2.x = stag_phase*(matrix_00.x*C1RE+matrix_00.y*C1IM+ matrix_01.x*C2RE+matrix_01.y*C2IM+ matrix_02.x*C3RE+matrix_02.y*C3IM); mat2.y = stag_phase*(-matrix_00.x*C1IM+matrix_00.y*C1RE -matrix_01.x*C2IM+matrix_01.y*C2RE -matrix_02.x*C3IM+matrix_02.y*C3RE); matrix_00 = mat0; matrix_01 = mat1; matrix_02 = mat2; mat0.x = matrix_10.x*link0.x+matrix_10.y*link0.y+ matrix_11.x*link0.z+matrix_11.y*link0.w+ matrix_12.x*link1.x+matrix_12.y*link1.y; mat0.y = -matrix_10.x*link0.y+matrix_10.y*link0.x -matrix_11.x*link0.w+matrix_11.y*link0.z -matrix_12.x*link1.y+matrix_12.y*link1.x; mat1.x = matrix_10.x*link1.z+matrix_10.y*link1.w+ matrix_11.x*link2.x+matrix_11.y*link2.y+ matrix_12.x*link2.z+matrix_12.y*link2.w; mat1.y = -matrix_10.x*link1.w+matrix_10.y*link1.z -matrix_11.x*link2.y+matrix_11.y*link2.x -matrix_12.x*link2.w+matrix_12.y*link2.z; mat2.x = stag_phase*(matrix_10.x*C1RE+matrix_10.y*C1IM+ matrix_11.x*C2RE+matrix_11.y*C2IM+ matrix_12.x*C3RE+matrix_12.y*C3IM); mat2.y = stag_phase*(-matrix_10.x*C1IM+matrix_10.y*C1RE -matrix_11.x*C2IM+matrix_11.y*C2RE -matrix_12.x*C3IM+matrix_12.y*C3RE); matrix_10 = mat0; matrix_11 = mat1; matrix_12 = mat2; mat0.x = matrix_20.x*link0.x+matrix_20.y*link0.y+ matrix_21.x*link0.z+matrix_21.y*link0.w+ matrix_22.x*link1.x+matrix_22.y*link1.y; mat0.y = -matrix_20.x*link0.y+matrix_20.y*link0.x -matrix_21.x*link0.w+matrix_21.y*link0.z -matrix_22.x*link1.y+matrix_22.y*link1.x; mat1.x = matrix_20.x*link1.z+matrix_20.y*link1.w+ matrix_21.x*link2.x+matrix_21.y*link2.y+ matrix_22.x*link2.z+matrix_22.y*link2.w; mat1.y = -matrix_20.x*link1.w+matrix_20.y*link1.z -matrix_21.x*link2.y+matrix_21.y*link2.x -matrix_22.x*link2.w+matrix_22.y*link2.z; mat2.x = stag_phase*(matrix_20.x*C1RE+matrix_20.y*C1IM+ matrix_21.x*C2RE+matrix_21.y*C2IM+ matrix_22.x*C3RE+matrix_22.y*C3IM); mat2.y = stag_phase*(-matrix_20.x*C1IM+matrix_20.y*C1RE -matrix_21.x*C2IM+matrix_21.y*C2RE -matrix_22.x*C3IM+matrix_22.y*C3RE); matrix_20 = mat0; matrix_21 = mat1; matrix_22 = mat2; // matrix=f_aux_dev*U(idx+mu)_nu * [U(idx+nu)_mu]^{dag} barrier(CLK_LOCAL_MEM_FENCE); //////////////////////////////////////////////////////// LoadLinkRegs( gauge_texRef, size_dev, idx, nu); // U(x)_nu stag_phase=phases[idx+nu*size_dev]; mat0.x = matrix_00.x*link0.x+matrix_00.y*link0.y+ matrix_01.x*link0.z+matrix_01.y*link0.w+ matrix_02.x*link1.x+matrix_02.y*link1.y; mat0.y = -matrix_00.x*link0.y+matrix_00.y*link0.x -matrix_01.x*link0.w+matrix_01.y*link0.z -matrix_02.x*link1.y+matrix_02.y*link1.x; mat1.x = matrix_00.x*link1.z+matrix_00.y*link1.w+ matrix_01.x*link2.x+matrix_01.y*link2.y+ matrix_02.x*link2.z+matrix_02.y*link2.w; mat1.y = -matrix_00.x*link1.w+matrix_00.y*link1.z -matrix_01.x*link2.y+matrix_01.y*link2.x -matrix_02.x*link2.w+matrix_02.y*link2.z; mat2.x = stag_phase*(matrix_00.x*C1RE+matrix_00.y*C1IM+ matrix_01.x*C2RE+matrix_01.y*C2IM+ matrix_02.x*C3RE+matrix_02.y*C3IM); mat2.y = stag_phase*(-matrix_00.x*C1IM+matrix_00.y*C1RE -matrix_01.x*C2IM+matrix_01.y*C2RE -matrix_02.x*C3IM+matrix_02.y*C3RE); matrix_00 = mat0; matrix_01 = mat1; matrix_02 = mat2; mat0.x = matrix_10.x*link0.x+matrix_10.y*link0.y+ matrix_11.x*link0.z+matrix_11.y*link0.w+ matrix_12.x*link1.x+matrix_12.y*link1.y; mat0.y = -matrix_10.x*link0.y+matrix_10.y*link0.x -matrix_11.x*link0.w+matrix_11.y*link0.z -matrix_12.x*link1.y+matrix_12.y*link1.x; mat1.x = matrix_10.x*link1.z+matrix_10.y*link1.w+ matrix_11.x*link2.x+matrix_11.y*link2.y+ matrix_12.x*link2.z+matrix_12.y*link2.w; mat1.y = -matrix_10.x*link1.w+matrix_10.y*link1.z -matrix_11.x*link2.y+matrix_11.y*link2.x -matrix_12.x*link2.w+matrix_12.y*link2.z; mat2.x = stag_phase*(matrix_10.x*C1RE+matrix_10.y*C1IM+ matrix_11.x*C2RE+matrix_11.y*C2IM+ matrix_12.x*C3RE+matrix_12.y*C3IM); mat2.y = stag_phase*(-matrix_10.x*C1IM+matrix_10.y*C1RE -matrix_11.x*C2IM+matrix_11.y*C2RE -matrix_12.x*C3IM+matrix_12.y*C3RE); matrix_10 = mat0; matrix_11 = mat1; matrix_12 = mat2; mat0.x = matrix_20.x*link0.x+matrix_20.y*link0.y+ matrix_21.x*link0.z+matrix_21.y*link0.w+ matrix_22.x*link1.x+matrix_22.y*link1.y; mat0.y = -matrix_20.x*link0.y+matrix_20.y*link0.x -matrix_21.x*link0.w+matrix_21.y*link0.z -matrix_22.x*link1.y+matrix_22.y*link1.x; mat1.x = matrix_20.x*link1.z+matrix_20.y*link1.w+ matrix_21.x*link2.x+matrix_21.y*link2.y+ matrix_22.x*link2.z+matrix_22.y*link2.w; mat1.y = -matrix_20.x*link1.w+matrix_20.y*link1.z -matrix_21.x*link2.y+matrix_21.y*link2.x -matrix_22.x*link2.w+matrix_22.y*link2.z; mat2.x = stag_phase*(matrix_20.x*C1RE+matrix_20.y*C1IM+ matrix_21.x*C2RE+matrix_21.y*C2IM+ matrix_22.x*C3RE+matrix_22.y*C3IM); mat2.y = stag_phase*(-matrix_20.x*C1IM+matrix_20.y*C1RE -matrix_21.x*C2IM+matrix_21.y*C2RE -matrix_22.x*C3IM+matrix_22.y*C3RE); matrix_20 = mat0; matrix_21 = mat1; matrix_22 = mat2; // matrix=f_aux_dev*U(idx+mu)_nu * [U(idx+nu)_mu]^{dag} * [U(x)_nu]^{dag} barrier(CLK_LOCAL_MEM_FENCE); /////////////////////////// End of forward staples /// Write to global memory staples[0][threadIdx.x].x += matrix_00.x; staples[0][threadIdx.x].y += matrix_00.y; staples[1][threadIdx.x].x += matrix_01.x; staples[1][threadIdx.x].y += matrix_01.y; staples[2][threadIdx.x].x += matrix_02.x; staples[2][threadIdx.x].y += matrix_02.y; staples[3][threadIdx.x].x += matrix_10.x; staples[3][threadIdx.x].y += matrix_10.y; staples[4][threadIdx.x].x += matrix_11.x; staples[4][threadIdx.x].y += matrix_11.y; staples[5][threadIdx.x].x += matrix_12.x; staples[5][threadIdx.x].y += matrix_12.y; staples[6][threadIdx.x].x += matrix_20.x; staples[6][threadIdx.x].y += matrix_20.y; staples[7][threadIdx.x].x += matrix_21.x; staples[7][threadIdx.x].y += matrix_21.y; staples[8][threadIdx.x].x += matrix_22.x; staples[8][threadIdx.x].y += matrix_22.y; /////////////////////////////////////////////// // (1) // +--<--+ // | | // ^mu (2) V | to calculate (1)*(2)*(3) // | | | // +-->--+ // ->nu temp (3) idx temp_site = table[idx+nu*size_dev]; brds = (temp_site < size_dev)?0:1; size_dev_t = (brds)?bsize_dev:size_dev; brds *= 7*size_dev; site_table[threadIdx.x] = table[temp_site+(4+blockIdx.y)*size_dev_t + brds]; brde = (site_table[threadIdx.x] < size_dev)?0:1; brdg = brde; size_dev_t = (brde)?bsize_dev:size_dev; brde *= 3*size_dev; // brd*(4*size_dev - 1*size_dev) brdg *= 23*size_dev; LoadLinkRegsBrd(gauge_texRef, size_dev_t, site_table[threadIdx.x], nu); // U(idx-nu+mu)_{nu} stag_phase=phases[site_table[threadIdx.x] + nu*size_dev_t + brde]; matrix_00.x = f_aux_dev*link0.x; matrix_00.y = -f_aux_dev*link0.y; matrix_01.x = f_aux_dev*link1.z; matrix_01.y = -f_aux_dev*link1.w; matrix_02.x = stag_phase*f_aux_dev*C1RE; matrix_02.y = -stag_phase*f_aux_dev*C1IM; matrix_10.x = f_aux_dev*link0.z; matrix_10.y = -f_aux_dev*link0.w; matrix_11.x = f_aux_dev*link2.x; matrix_11.y = -f_aux_dev*link2.y; matrix_12.x = stag_phase*f_aux_dev*C2RE; matrix_12.y = -stag_phase*f_aux_dev*C2IM; matrix_20.x = f_aux_dev*link1.x; matrix_20.y = -f_aux_dev*link1.y; matrix_21.x = f_aux_dev*link2.z; matrix_21.y = -f_aux_dev*link2.w; matrix_22.x = stag_phase*f_aux_dev*C3RE; matrix_22.y = -stag_phase*f_aux_dev*C3IM; // matrix=f_aux_dev [U(idx-nu+mu)_{nu}]^{dag} /////////////////////////////////////////////// brde = (temp_site < size_dev)?0:1; brdg = brde; size_dev_t = (brde)?bsize_dev:size_dev; brde *= 3*size_dev; // brd*(4*size_dev - 1*size_dev) brdg *= 23*size_dev; LoadLinkRegsBrd(gauge_texRef, size_dev_t, temp_site, blockIdx.y); // U(idx-nu)_mu stag_phase=phases[temp_site + blockIdx.y*size_dev_t + brde]; mat0.x = matrix_00.x*link0.x+matrix_00.y*link0.y+ matrix_01.x*link0.z+matrix_01.y*link0.w+ matrix_02.x*link1.x+matrix_02.y*link1.y; mat0.y = -matrix_00.x*link0.y+matrix_00.y*link0.x -matrix_01.x*link0.w+matrix_01.y*link0.z -matrix_02.x*link1.y+matrix_02.y*link1.x; mat1.x = matrix_00.x*link1.z+matrix_00.y*link1.w+ matrix_01.x*link2.x+matrix_01.y*link2.y+ matrix_02.x*link2.z+matrix_02.y*link2.w; mat1.y = -matrix_00.x*link1.w+matrix_00.y*link1.z -matrix_01.x*link2.y+matrix_01.y*link2.x -matrix_02.x*link2.w+matrix_02.y*link2.z; mat2.x = stag_phase*(matrix_00.x*C1RE+matrix_00.y*C1IM+ matrix_01.x*C2RE+matrix_01.y*C2IM+ matrix_02.x*C3RE+matrix_02.y*C3IM); mat2.y = stag_phase*(-matrix_00.x*C1IM+matrix_00.y*C1RE -matrix_01.x*C2IM+matrix_01.y*C2RE -matrix_02.x*C3IM+matrix_02.y*C3RE); matrix_00 = mat0; matrix_01 = mat1; matrix_02 = mat2; mat0.x = matrix_10.x*link0.x+matrix_10.y*link0.y+ matrix_11.x*link0.z+matrix_11.y*link0.w+ matrix_12.x*link1.x+matrix_12.y*link1.y; mat0.y = -matrix_10.x*link0.y+matrix_10.y*link0.x -matrix_11.x*link0.w+matrix_11.y*link0.z -matrix_12.x*link1.y+matrix_12.y*link1.x; mat1.x = matrix_10.x*link1.z+matrix_10.y*link1.w+ matrix_11.x*link2.x+matrix_11.y*link2.y+ matrix_12.x*link2.z+matrix_12.y*link2.w; mat1.y = -matrix_10.x*link1.w+matrix_10.y*link1.z -matrix_11.x*link2.y+matrix_11.y*link2.x -matrix_12.x*link2.w+matrix_12.y*link2.z; mat2.x = stag_phase*(matrix_10.x*C1RE+matrix_10.y*C1IM+ matrix_11.x*C2RE+matrix_11.y*C2IM+ matrix_12.x*C3RE+matrix_12.y*C3IM); mat2.y = stag_phase*(-matrix_10.x*C1IM+matrix_10.y*C1RE -matrix_11.x*C2IM+matrix_11.y*C2RE -matrix_12.x*C3IM+matrix_12.y*C3RE); matrix_10 = mat0; matrix_11 = mat1; matrix_12 = mat2; mat0.x = matrix_20.x*link0.x+matrix_20.y*link0.y+ matrix_21.x*link0.z+matrix_21.y*link0.w+ matrix_22.x*link1.x+matrix_22.y*link1.y; mat0.y = -matrix_20.x*link0.y+matrix_20.y*link0.x -matrix_21.x*link0.w+matrix_21.y*link0.z -matrix_22.x*link1.y+matrix_22.y*link1.x; mat1.x = matrix_20.x*link1.z+matrix_20.y*link1.w+ matrix_21.x*link2.x+matrix_21.y*link2.y+ matrix_22.x*link2.z+matrix_22.y*link2.w; mat1.y = -matrix_20.x*link1.w+matrix_20.y*link1.z -matrix_21.x*link2.y+matrix_21.y*link2.x -matrix_22.x*link2.w+matrix_22.y*link2.z; mat2.x = stag_phase*(matrix_20.x*C1RE+matrix_20.y*C1IM+ matrix_21.x*C2RE+matrix_21.y*C2IM+ matrix_22.x*C3RE+matrix_22.y*C3IM); mat2.y = stag_phase*(-matrix_20.x*C1IM+matrix_20.y*C1RE -matrix_21.x*C2IM+matrix_21.y*C2RE -matrix_22.x*C3IM+matrix_22.y*C3RE); matrix_20 = mat0; matrix_21 = mat1; matrix_22 = mat2; // matrix=f_aux_dev [U(idx-nu+mu)_{nu}]^{dag} * [U(idx-nu)_mu]^{dag} barrier(CLK_LOCAL_MEM_FENCE); /////////////////////////////////////////////// LoadLinkRegsBrd(gauge_texRef, size_dev_t, temp_site, nu); // U(x-nu)_nu stag_phase=phases[temp_site + nu*size_dev_t + brde]; mat0.x =matrix_00.x*link0.x-matrix_00.y*link0.y+ matrix_01.x*link1.z-matrix_01.y*link1.w+ stag_phase*(matrix_02.x*C1RE -matrix_02.y*C1IM); mat0.y = matrix_00.x*link0.y+matrix_00.y*link0.x+ matrix_01.x*link1.w+matrix_01.y*link1.z+ stag_phase*(matrix_02.x*C1IM +matrix_02.y*C1RE); mat1.x = matrix_00.x*link0.z-matrix_00.y*link0.w+ matrix_01.x*link2.x-matrix_01.y*link2.y+ stag_phase*(matrix_02.x*C2RE -matrix_02.y*C2IM); mat1.y = matrix_00.x*link0.w+matrix_00.y*link0.z+ matrix_01.x*link2.y+matrix_01.y*link2.x+ stag_phase*(matrix_02.x*C2IM +matrix_02.y*C2RE); mat2.x = matrix_00.x*link1.x-matrix_00.y*link1.y+ matrix_01.x*link2.z-matrix_01.y*link2.w+ stag_phase*(matrix_02.x*C3RE -matrix_02.y*C3IM); mat2.y = matrix_00.x*link1.y+matrix_00.y*link1.x+ matrix_01.x*link2.w+matrix_01.y*link2.z+ stag_phase*(matrix_02.x*C3IM +matrix_02.y*C3RE); matrix_00 = mat0; matrix_01 = mat1; matrix_02 = mat2; mat0.x = matrix_10.x*link0.x-matrix_10.y*link0.y+ matrix_11.x*link1.z-matrix_11.y*link1.w+ stag_phase*(matrix_12.x*C1RE -matrix_12.y*C1IM); mat0.y = matrix_10.x*link0.y+matrix_10.y*link0.x+ matrix_11.x*link1.w+matrix_11.y*link1.z+ stag_phase*(matrix_12.x*C1IM +matrix_12.y*C1RE); mat1.x = matrix_10.x*link0.z-matrix_10.y*link0.w+ matrix_11.x*link2.x-matrix_11.y*link2.y+ stag_phase*(matrix_12.x*C2RE -matrix_12.y*C2IM); mat1.y = matrix_10.x*link0.w+matrix_10.y*link0.z+ matrix_11.x*link2.y+matrix_11.y*link2.x+ stag_phase*(matrix_12.x*C2IM +matrix_12.y*C2RE); mat2.x = matrix_10.x*link1.x-matrix_10.y*link1.y+ matrix_11.x*link2.z-matrix_11.y*link2.w+ stag_phase*(matrix_12.x*C3RE -matrix_12.y*C3IM); mat2.y = matrix_10.x*link1.y+matrix_10.y*link1.x+ matrix_11.x*link2.w+matrix_11.y*link2.z+ stag_phase*(matrix_12.x*C3IM +matrix_12.y*C3RE); matrix_10 = mat0; matrix_11 = mat1; matrix_12 = mat2; mat0.x = matrix_20.x*link0.x-matrix_20.y*link0.y+ matrix_21.x*link1.z-matrix_21.y*link1.w+ stag_phase*(matrix_22.x*C1RE -matrix_22.y*C1IM); mat0.y = matrix_20.x*link0.y+matrix_20.y*link0.x+ matrix_21.x*link1.w+matrix_21.y*link1.z+ stag_phase*(matrix_22.x*C1IM +matrix_22.y*C1RE); mat1.x = matrix_20.x*link0.z-matrix_20.y*link0.w+ matrix_21.x*link2.x-matrix_21.y*link2.y+ stag_phase*(matrix_22.x*C2RE -matrix_22.y*C2IM); mat1.y = matrix_20.x*link0.w+matrix_20.y*link0.z+ matrix_21.x*link2.y+matrix_21.y*link2.x+ stag_phase*(matrix_22.x*C2IM +matrix_22.y*C2RE); mat2.x = matrix_20.x*link1.x-matrix_20.y*link1.y+ matrix_21.x*link2.z-matrix_21.y*link2.w+ stag_phase*(matrix_22.x*C3RE -matrix_22.y*C3IM); mat2.y = matrix_20.x*link1.y+matrix_20.y*link1.x+ matrix_21.x*link2.w+matrix_21.y*link2.z+ stag_phase*(matrix_22.x*C3IM +matrix_22.y*C3RE); matrix_20 = mat0; matrix_21 = mat1; matrix_22 = mat2; // matrix=f_aux_dev [U(idx-nu+mu)_{nu}]^{dag} * [U(idx-nu)_mu]^{dag} * U(x-nu)_nu barrier(CLK_LOCAL_MEM_FENCE); ///////////////////////////// /// Write to global memory staples[0][threadIdx.x].x += matrix_00.x; staples[0][threadIdx.x].y += matrix_00.y; staples[1][threadIdx.x].x += matrix_01.x; staples[1][threadIdx.x].y += matrix_01.y; staples[2][threadIdx.x].x += matrix_02.x; staples[2][threadIdx.x].y += matrix_02.y; staples[3][threadIdx.x].x += matrix_10.x; staples[3][threadIdx.x].y += matrix_10.y; staples[4][threadIdx.x].x += matrix_11.x; staples[4][threadIdx.x].y += matrix_11.y; staples[5][threadIdx.x].x += matrix_12.x; staples[5][threadIdx.x].y += matrix_12.y; staples[6][threadIdx.x].x += matrix_20.x; staples[6][threadIdx.x].y += matrix_20.y; staples[7][threadIdx.x].x += matrix_21.x; staples[7][threadIdx.x].y += matrix_21.y; staples[8][threadIdx.x].x += matrix_22.x; staples[8][threadIdx.x].y += matrix_22.y; } /////////////////////////////////////////// // Load out_matrix stag_phase=phases[idx+blockIdx.y*size_dev]; matrix_00.x = staples[0][threadIdx.x].x; matrix_00.y = staples[0][threadIdx.x].y; matrix_01.x = staples[1][threadIdx.x].x; matrix_01.y = staples[1][threadIdx.x].y; matrix_02.x = staples[2][threadIdx.x].x; matrix_02.y = staples[2][threadIdx.x].y; matrix_10.x = staples[3][threadIdx.x].x; matrix_10.y = staples[3][threadIdx.x].y; matrix_11.x = staples[4][threadIdx.x].x; matrix_11.y = staples[4][threadIdx.x].y; matrix_12.x = staples[5][threadIdx.x].x; matrix_12.y = staples[5][threadIdx.x].y; matrix_20.x = staples[6][threadIdx.x].x; matrix_20.y = staples[6][threadIdx.x].y; matrix_21.x = staples[7][threadIdx.x].x; matrix_21.y = staples[7][threadIdx.x].y; matrix_22.x = staples[8][threadIdx.x].x; matrix_22.y = staples[8][threadIdx.x].y; ////////////////// // Multiply u_mu * staple LoadLinkRegs(gauge_texRef, size_dev, idx, blockIdx.y); //Loads U_mu mat0.x = link0.x*matrix_00.x - link0.y*matrix_00.y + link0.z*matrix_10.x - link0.w*matrix_10.y + link1.x*matrix_20.x - link1.y*matrix_20.y; mat0.y = link0.x*matrix_00.y + link0.y*matrix_00.x + link0.z*matrix_10.y + link0.w*matrix_10.x + link1.x*matrix_20.y + link1.y*matrix_20.x; mat1.x = link1.z*matrix_00.x - link1.w*matrix_00.y + link2.x*matrix_10.x - link2.y*matrix_10.y + link2.z*matrix_20.x - link2.w*matrix_20.y; mat1.y = link1.z*matrix_00.y + link1.w*matrix_00.x + link2.x*matrix_10.y + link2.y*matrix_10.x + link2.z*matrix_20.y + link2.w*matrix_20.x; mat2.x = stag_phase*(C1RE*matrix_00.x - C1IM*matrix_00.y + C2RE*matrix_10.x - C2IM*matrix_10.y + C3RE*matrix_20.x - C3IM*matrix_20.y); mat2.y = stag_phase*(C1RE*matrix_00.y + C1IM*matrix_00.x + C2RE*matrix_10.y + C2IM*matrix_10.x + C3RE*matrix_20.y + C3IM*matrix_20.x); matrix_00 = mat0; matrix_10 = mat1; matrix_20 = mat2; mat0.x = link0.x*matrix_01.x - link0.y*matrix_01.y + link0.z*matrix_11.x - link0.w*matrix_11.y + link1.x*matrix_21.x - link1.y*matrix_21.y; mat0.y = link0.x*matrix_01.y + link0.y*matrix_01.x + link0.z*matrix_11.y + link0.w*matrix_11.x + link1.x*matrix_21.y + link1.y*matrix_21.x; mat1.x = link1.z*matrix_01.x - link1.w*matrix_01.y + link2.x*matrix_11.x - link2.y*matrix_11.y + link2.z*matrix_21.x - link2.w*matrix_21.y; mat1.y = link1.z*matrix_01.y + link1.w*matrix_01.x + link2.x*matrix_11.y + link2.y*matrix_11.x + link2.z*matrix_21.y + link2.w*matrix_21.x; mat2.x = stag_phase*(C1RE*matrix_01.x - C1IM*matrix_01.y + C2RE*matrix_11.x - C2IM*matrix_11.y + C3RE*matrix_21.x - C3IM*matrix_21.y); mat2.y = stag_phase*(C1RE*matrix_01.y + C1IM*matrix_01.x + C2RE*matrix_11.y + C2IM*matrix_11.x + C3RE*matrix_21.y + C3IM*matrix_21.x); matrix_01 = mat0; matrix_11 = mat1; matrix_21 = mat2; mat0.x = link0.x*matrix_02.x - link0.y*matrix_02.y + link0.z*matrix_12.x - link0.w*matrix_12.y + link1.x*matrix_22.x - link1.y*matrix_22.y; mat0.y = link0.x*matrix_02.y + link0.y*matrix_02.x + link0.z*matrix_12.y + link0.w*matrix_12.x + link1.x*matrix_22.y + link1.y*matrix_22.x; mat1.x = link1.z*matrix_02.x - link1.w*matrix_02.y + link2.x*matrix_12.x - link2.y*matrix_12.y + link2.z*matrix_22.x - link2.w*matrix_22.y; mat1.y = link1.z*matrix_02.y + link1.w*matrix_02.x + link2.x*matrix_12.y + link2.y*matrix_12.x + link2.z*matrix_22.y + link2.w*matrix_22.x; mat2.x = stag_phase*(C1RE*matrix_02.x - C1IM*matrix_02.y + C2RE*matrix_12.x - C2IM*matrix_12.y + C3RE*matrix_22.x - C3IM*matrix_22.y); mat2.y = stag_phase*(C1RE*matrix_02.y + C1IM*matrix_02.x + C2RE*matrix_12.y + C2IM*matrix_12.x + C3RE*matrix_22.y + C3IM*matrix_22.x); matrix_02 = mat0; matrix_12 = mat1; matrix_22 = mat2; ///////////////////////////// /// Write to global memory the traceless antihermitian part out_matrix[idx + 2*blockIdx.y*size_dev].x+=0.5f*(matrix_01.x-matrix_10.x); out_matrix[idx + 2*blockIdx.y*size_dev].y+=0.5f*(matrix_01.y+matrix_10.y); out_matrix[idx + 2*blockIdx.y*size_dev].z+=0.5f*(matrix_02.x-matrix_20.x); out_matrix[idx + 2*blockIdx.y*size_dev].w+=0.5f*(matrix_02.y+matrix_20.y); out_matrix[idx + size_dev + 2*blockIdx.y*size_dev].x+=0.5f*(matrix_12.x-matrix_21.x); out_matrix[idx + size_dev + 2*blockIdx.y*size_dev].y+=0.5f*(matrix_12.y+matrix_21.y); out_matrix[idx + size_dev + 2*blockIdx.y*size_dev].z+=matrix_00.y-0.3333333f*(matrix_00.y+matrix_11.y+matrix_22.y); out_matrix[idx + size_dev + 2*blockIdx.y*size_dev].w+=matrix_11.y-0.3333333f*(matrix_00.y+matrix_11.y+matrix_22.y); } #else #endif

n_treutner · ‎01-25-2011

damnit, my GDebugger Licence has just run out. I wanted to test, what ibird suggested, but i can't read back the values so easily atm. is there any other easy way, to check the buffers during runtime, like the GDebugger allows to?

EDIT: I'm stupid, they give away free licences. Never mind ...

but what i can say is, that the code still does not work. i think, i'll have to switch back to 2.2.

himanshu.guatam, i'm afraid, i won't be able to build a simple test case in reasonable time. but maybe, i'll try anyway in a couple of days.

Jawed · ‎01-26-2011

Himanshu, if you are interested in a test case you can take a look at the Deathray thread I started recently, as I have posted the entire source code. The kernel source documents the bug, which I've attached.

If you delete the two lines that say this:

int x = (local_id.x == 7) ? 44 : 40;

and then paste one copy of that line just before:

if (top_group) {

the error occurs. This is with SDK 2.3 and Catalyst 10.12.

Please note the #define for TILE_SIDE at the top of that source file. If the value of TILE_SIDE is defined as 48, the error does not occur. But when it is 53 (actually any value > 48 and <=64) the error occurs, if the single line of code computing "x" is used.

So even though it looks silly, the two identical lines of code are required to work around this bug.

} else if ((top_group || bot_group) && local_id.y < 😎 { if (top_group) { int x = (local_id.x == 7) ? 44 : 40; // ATI 2.3: compiler bug requires this here if TILE_SIDE is not 48 mirror = ReadTile4(x, 8 + local_id.y + 1, tile); WriteTile4(mirror, x, 8 - local_id.y - 1, tile); } if (bot_group) { int x = (local_id.x == 7) ? 44 : 40; // ATI 2.3: compiler bug requires this here if TILE_SIDE is not 48 int bottom = ((height - 1) & 31) + 8; mirror = ReadTile4(x, bottom - local_id.y - 1, tile); WriteTile4(mirror, x, bottom + local_id.y + 1, tile); } }

n_treutner · ‎01-27-2011

allright, i think i found the problem (in my special case): it's the imagesampler. i'll try to find out, what exactly went wrong.

himanshu_gautam · ‎01-27-2011

jawed,

I downloaded the deathray. But i am little unfamiliar with what the code would do. When I run it it asks for a executable for debug session. I am not sure what to provide there, although i tried giving it a executable from OpenCL samples(Bitonic Sort) and on execution the deathray executed that .exe. Can you please provide what type of error do i need to expect and what .exe needs to be provided to the app.

Do you get some compilation error in the situation you mentioned i.e when TILE_SIDE=53?If that is the case with you I think the issue has been resolved as it no longer occur with internal dlls.

Jawed · ‎01-27-2011

Hi Himanshu, Deathray is a plug-in for the Avisynth scripting environment.

To create a debug session with Deathray requires a working installation of Avisynth 2.5.8 and an application that enables single-frame stepping through a video. e.g. AvsP or VirtualDub.

To be honest I was merely expecting you to take a look at the compilations that are produced when the code is varied as I specified above. The entire compilation is around 700 ALU bundles and the error is occurring somewhere in the middle of the first 300-odd ALU bundles.

I presume you guys have decent tools for tracing through ISA!

As published, Deathray won't exhibit the bug unless you change the int x = ... lines as I described earlier. I was hoping that you would be able to analyse the compiled output and discern the problem.

I don't get a compilation error at runtime. Instead the kernel malfunctions in the area of those int x = ... lines.

Those lines are copying pixels from one place in local memory to another. With TILE_SIDE set to 48 the pixel copy is perfect. When set to 53 the copy fails.

I'm making heavy use of read_mem_fence and write_mem_fence in the kernel to ensure that ordering is correct.

It seems like you are saying you have discovered and fixed a problem with the way the compiler does optimisation of literals and/or literals used in local memory addressing. Is that the case?

I've actually discovered another problem relating to reads from "a register". This is more troubling (I can't find a work-around) and I will raise a support ticket for it.

Basically I define:

float16 target_window[7]

If I assign to a float4 using:

float4 test = target_window[3].s2345

or

float4 test = target_window[3].s6789

then I get a faulty result. The peculiar thing is the faulty result is in only one channel of the float4 result and the fault is data-dependent (faulty pixels flicker based upon data in the video frame!). The fault seems to disappear if I do lots of assignments, i.e. .s0123 .s1234 .s2345 .s3456 .s4567 .s5678 .s6789. Some smaller combinations from that set also seem to make the fault disappear.

n_treutner · ‎01-30-2011

I played around with various settings and figured out, that the sampler created in the Host Program (i.e. the C++ code), which was then passed to the GPU as a parameter (as i wrote before, this had to be done, to work on my boss' NVidia GPU) doesn't work as it should in 2.3

The Sampler was defined as

imageSampler = clCreateSampler(context, false, CL_ADDRESS_CLAMP, CL_FILTER_NEAREST, &status);

Now, some image-access still works: This means, that i can read values from images with the sampler and, for example, write them to a outputBuffer. What seems to be a problem is accessing the image inside loops with changing parameters/coordinates. In the end i got the same results for all iterations of the loop, which was the major problem I had.

When i create the sampler inside the OCL code (just after the pragmas), it works just like it should. It looks like this:

__constant sampler_t imageSampler2 = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

Maybe the Sampler created outside the OCL-code, passed as a parameter, is some kind of semi-static?

Archives Discussions

Problem after change from 2.2 to 2.3