cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

n_treutner
Journeyman III

Problem after change from 2.2 to 2.3

Jawed, to your suggestion with the if (x != 20 && y != 20) return;

i did as you suggested (with the small change, that i used || instead of && and the result was, as i thought it would be: only one kernel was executed, with the wrong (i.e. the same) results for all instances of ww.

and yes, i pass the sampler to the kernel, rather than creating it inside, beacuse my boss runs it on a nvidia machine, and nvidia needs it created outside, and passed via param. could this be the reason?

i already tested normal reading from images and it works, so the sampler itself isn't broken. maybe it just handles multiple calls to similar areas incorrectly?

0 Likes
Jawed
Adept II

Problem after change from 2.2 to 2.3

LOL at my logic fail.

My code runs on NVidia (have only known this for about 20 hours), so the sampler problem might be with old NVidia drivers. The sampler is defined locally rather than globally, for what it's worth. Though I don't see why changing the sampler should help you.

Try literal values instead of all the loop parameters? e.g.:

    int k = 4;//(params[0] - 1)/2;
    int w = 160;//params[1];
    int w_min = 40;//params[2];
  
    int stepping = 2;//params[3];

This results in lots of unrolling. See if the result is what you're expecting?

I had mysterious behaviour in my code due to a compiler error. It's pretty tricky to explain. It seems to be a mixture of literals and computed values for address computation - coupled with nested IFs based upon work item IDs. I'm now wondering if you are suffering a similar compiler problem.

One thing I noticed, which is really puzzling, is that I would get different compilations in SKA depending on whether another, similar, kernel was also loaded. e.g. right now NLMSingleFrameFourPixel compiles as 669 ALU instructions if NLMMultiFrameFourPixel and NLMFinalise are in SKA. If I put #if0 ... #endif around them, I get 677 ALU instructions.

That's just nuts in my view.

0 Likes
ibird
Adept I

Problem after change from 2.2 to 2.3

I has also the same problem upgrading from 2.2 to 2.3 ( and driver ) on gpu i has different

results from  CPU ( work on CPU Nvidia GPU not on ATI gpu ).

 

After a lot work trying to isolate the problem i has finded that there is a problem with the constant parameters passed. Im my case it seem that the constant in some situation has strange behaviour.

 

I order to see if you have the same problem you can change the constant parameters with the runtime value

for example if you have parameter[0] = 256;

change  your code from int k = (params[0] - 1)/2; to int k = (256 - 1)/2;

Is not a workaround, but is just to see if you have the same problem

 

 

0 Likes
himanshu_gautam
Grandmaster

Problem after change from 2.2 to 2.3

Jawed and ibird,

I would really appreciate if you can send a test case. I do not find any problem with your kernel in SKA though.

 

0 Likes
ibird
Adept I

Problem after change from 2.2 to 2.3

 

himanshu

 

Is not easy to build a Test Case, i has tried to simplify my kernel to build an easy test case but is extremly difficult to isolate.

 

The kernel is this attached to this post. (  )

CPU and GPU has different result, GPU work if i multiply with f_aux_dev outside the for cicle. This type of problem has been introduced with the last Catalyst driver needed by the new SDK, downgrading the SDK do not solve the problem. There are not buffer overflow on reading, as one can think; each kernel is alway controlled with a on OpenCL CPU, Nvidia GPU, and simple CPU, no one of them generate errors on result except ATI GPU with last drivers and SDK, (previous driver and SDK do not generate errors).

I hope hat Jawed has a simpler kernel to create a Test Case.

I can create a test case from this kernel to isolate the problem, but not before two or three week.

 

#define C1RE ( (link0.z*link2.z-link0.w*link2.w) - (link1.x*link2.x-link1.y*link2.y) ) #define C1IM (-(link0.z*link2.w+link0.w*link2.z) + (link1.x*link2.y+link1.y*link2.x) ) #define C2RE ( (link1.x*link1.z-link1.y*link1.w) - (link0.x*link2.z-link0.y*link2.w) ) #define C2IM (-(link1.x*link1.w+link1.y*link1.z) + (link0.x*link2.w+link0.y*link2.z) ) #define C3RE ( (link0.x*link2.x-link0.y*link2.y) - (link0.z*link1.z-link0.w*link1.w) ) #define C3IM (-(link0.x*link2.y+link0.y*link2.x) + (link0.z*link1.w+link0.w*link1.z) ) #define DeclareMatrixRegs float4 link0, link1, link2 #ifndef IMAGE #define LoadLinkRegs(gauge, vol, idx, mu) \ link0 = gauge[idx+vol*(0+3*mu)]; \ link1 = gauge[idx+vol*(1+3*mu)]; \ link2 = gauge[idx+vol*(2+3*mu)] #define LoadLinkRegsBrd(gauge, vol, idx, mu) \ link0 = gauge[idx+vol*(0+3*mu)+brdg]; \ link1 = gauge[idx+vol*(1+3*mu)+brdg]; \ link2 = gauge[idx+vol*(2+3*mu)+brdg] __kernel void DeriveFieldKernel(__global float4 *out_matrix, __global int *table, __global int *phases, __global float4 *gauge_texRef, const unsigned int bsize_dev, const unsigned int size_dev, const float f_aux_dev) { unsigned int brds; unsigned int brdg; unsigned int brde; unsigned int size_dev_t; int2 threadIdx = {get_local_id(0),get_local_id(1)}; int2 blockDim = {get_local_size(0),get_local_size(1)}; int2 blockIdx = {get_group_id(0),get_group_id(1)}; //Linear index int idx = blockIdx.x * blockDim.x + threadIdx.x; int nu, direction; int temp_site; int stag_phase; //New table indexing (index fastest) __local int site_table[NUM_THREADS]; // Staples __local float2 staples[9][NUM_THREADS]; //Temporary row-column store float2 mat0, mat1, mat2; //6 registers DeclareMatrixRegs; //12 registers float2 matrix_00, matrix_01, matrix_02, matrix_10, matrix_11, matrix_12, matrix_20, matrix_21, matrix_22; //18 registers staples[0][threadIdx.x].x = 0.0f; staples[0][threadIdx.x].y = 0.0f; staples[1][threadIdx.x].x = 0.0f; staples[1][threadIdx.x].y = 0.0f; staples[2][threadIdx.x].x = 0.0f; staples[2][threadIdx.x].y = 0.0f; staples[3][threadIdx.x].x = 0.0f; staples[3][threadIdx.x].y = 0.0f; staples[4][threadIdx.x].x = 0.0f; staples[4][threadIdx.x].y = 0.0f; staples[5][threadIdx.x].x = 0.0f; staples[5][threadIdx.x].y = 0.0f; staples[6][threadIdx.x].x = 0.0f; staples[6][threadIdx.x].y = 0.0f; staples[7][threadIdx.x].x = 0.0f; staples[7][threadIdx.x].y = 0.0f; staples[8][threadIdx.x].x = 0.0f; staples[8][threadIdx.x].y = 0.0f; // mu = blockIdx.y for(direction = 1; direction < 4; direction++) { nu = (direction+blockIdx.y) & 3; // nu = (direction+blockIdx.y) % 4; // (1) // +-->--+ // | | // ^mu | V (2) to calculate (1)*(2)*(3) // | | | // +--<--+ // ->nu idx (3) site_table[threadIdx.x] = table[idx+(4+blockIdx.y)*size_dev]; // idx+mu brde = (site_table[threadIdx.x] < size_dev)?0:1; brdg = brde; size_dev_t = (brde)?bsize_dev:size_dev; brde *= 3*size_dev; // brd*(4*size_dev - 1*size_dev) brdg *= 23*size_dev; LoadLinkRegsBrd(gauge_texRef, size_dev_t, site_table[threadIdx.x], nu); // U(idx+mu)_nu stag_phase=phases[site_table[threadIdx.x] + nu*size_dev_t + brde]; matrix_00.x = f_aux_dev*link0.x; matrix_00.y = f_aux_dev*link0.y; matrix_01.x = f_aux_dev*link0.z; matrix_01.y = f_aux_dev*link0.w; matrix_02.x = f_aux_dev*link1.x; matrix_02.y = f_aux_dev*link1.y; matrix_10.x = f_aux_dev*link1.z; matrix_10.y = f_aux_dev*link1.w; matrix_11.x = f_aux_dev*link2.x; matrix_11.y = f_aux_dev*link2.y; matrix_12.x = f_aux_dev*link2.z; matrix_12.y = f_aux_dev*link2.w; matrix_20.x = stag_phase*f_aux_dev*C1RE; matrix_20.y = stag_phase*f_aux_dev*C1IM; matrix_21.x = stag_phase*f_aux_dev*C2RE; matrix_21.y = stag_phase*f_aux_dev*C2IM; matrix_22.x = stag_phase*f_aux_dev*C3RE; matrix_22.y = stag_phase*f_aux_dev*C3IM; // matrix=f_aux_dev*U(idx+mu)_nu ///////////////////////////////////////////////////////////////////////////// site_table[threadIdx.x] = table[idx+(4+nu)*size_dev]; // idx+nu brde = (site_table[threadIdx.x] < size_dev)?0:1; brdg = brde; size_dev_t = (brde)?bsize_dev:size_dev; brde *= 3*size_dev; // brd*(4*size_dev - 1*size_dev) brdg *= 23*size_dev; LoadLinkRegsBrd(gauge_texRef, size_dev_t, site_table[threadIdx.x], blockIdx.y); // U(idx+nu)_mu stag_phase=phases[site_table[threadIdx.x] + blockIdx.y*size_dev_t + brde]; mat0.x = matrix_00.x*link0.x+matrix_00.y*link0.y+ matrix_01.x*link0.z+matrix_01.y*link0.w+ matrix_02.x*link1.x+matrix_02.y*link1.y; mat0.y = -matrix_00.x*link0.y+matrix_00.y*link0.x -matrix_01.x*link0.w+matrix_01.y*link0.z -matrix_02.x*link1.y+matrix_02.y*link1.x; mat1.x = matrix_00.x*link1.z+matrix_00.y*link1.w+ matrix_01.x*link2.x+matrix_01.y*link2.y+ matrix_02.x*link2.z+matrix_02.y*link2.w; mat1.y = -matrix_00.x*link1.w+matrix_00.y*link1.z -matrix_01.x*link2.y+matrix_01.y*link2.x -matrix_02.x*link2.w+matrix_02.y*link2.z; mat2.x = stag_phase*(matrix_00.x*C1RE+matrix_00.y*C1IM+ matrix_01.x*C2RE+matrix_01.y*C2IM+ matrix_02.x*C3RE+matrix_02.y*C3IM); mat2.y = stag_phase*(-matrix_00.x*C1IM+matrix_00.y*C1RE -matrix_01.x*C2IM+matrix_01.y*C2RE -matrix_02.x*C3IM+matrix_02.y*C3RE); matrix_00 = mat0; matrix_01 = mat1; matrix_02 = mat2; mat0.x = matrix_10.x*link0.x+matrix_10.y*link0.y+ matrix_11.x*link0.z+matrix_11.y*link0.w+ matrix_12.x*link1.x+matrix_12.y*link1.y; mat0.y = -matrix_10.x*link0.y+matrix_10.y*link0.x -matrix_11.x*link0.w+matrix_11.y*link0.z -matrix_12.x*link1.y+matrix_12.y*link1.x; mat1.x = matrix_10.x*link1.z+matrix_10.y*link1.w+ matrix_11.x*link2.x+matrix_11.y*link2.y+ matrix_12.x*link2.z+matrix_12.y*link2.w; mat1.y = -matrix_10.x*link1.w+matrix_10.y*link1.z -matrix_11.x*link2.y+matrix_11.y*link2.x -matrix_12.x*link2.w+matrix_12.y*link2.z; mat2.x = stag_phase*(matrix_10.x*C1RE+matrix_10.y*C1IM+ matrix_11.x*C2RE+matrix_11.y*C2IM+ matrix_12.x*C3RE+matrix_12.y*C3IM); mat2.y = stag_phase*(-matrix_10.x*C1IM+matrix_10.y*C1RE -matrix_11.x*C2IM+matrix_11.y*C2RE -matrix_12.x*C3IM+matrix_12.y*C3RE); matrix_10 = mat0; matrix_11 = mat1; matrix_12 = mat2; mat0.x = matrix_20.x*link0.x+matrix_20.y*link0.y+ matrix_21.x*link0.z+matrix_21.y*link0.w+ matrix_22.x*link1.x+matrix_22.y*link1.y; mat0.y = -matrix_20.x*link0.y+matrix_20.y*link0.x -matrix_21.x*link0.w+matrix_21.y*link0.z -matrix_22.x*link1.y+matrix_22.y*link1.x; mat1.x = matrix_20.x*link1.z+matrix_20.y*link1.w+ matrix_21.x*link2.x+matrix_21.y*link2.y+ matrix_22.x*link2.z+matrix_22.y*link2.w; mat1.y = -matrix_20.x*link1.w+matrix_20.y*link1.z -matrix_21.x*link2.y+matrix_21.y*link2.x -matrix_22.x*link2.w+matrix_22.y*link2.z; mat2.x = stag_phase*(matrix_20.x*C1RE+matrix_20.y*C1IM+ matrix_21.x*C2RE+matrix_21.y*C2IM+ matrix_22.x*C3RE+matrix_22.y*C3IM); mat2.y = stag_phase*(-matrix_20.x*C1IM+matrix_20.y*C1RE -matrix_21.x*C2IM+matrix_21.y*C2RE -matrix_22.x*C3IM+matrix_22.y*C3RE); matrix_20 = mat0; matrix_21 = mat1; matrix_22 = mat2; // matrix=f_aux_dev*U(idx+mu)_nu * [U(idx+nu)_mu]^{dag} barrier(CLK_LOCAL_MEM_FENCE); //////////////////////////////////////////////////////// LoadLinkRegs( gauge_texRef, size_dev, idx, nu); // U(x)_nu stag_phase=phases[idx+nu*size_dev]; mat0.x = matrix_00.x*link0.x+matrix_00.y*link0.y+ matrix_01.x*link0.z+matrix_01.y*link0.w+ matrix_02.x*link1.x+matrix_02.y*link1.y; mat0.y = -matrix_00.x*link0.y+matrix_00.y*link0.x -matrix_01.x*link0.w+matrix_01.y*link0.z -matrix_02.x*link1.y+matrix_02.y*link1.x; mat1.x = matrix_00.x*link1.z+matrix_00.y*link1.w+ matrix_01.x*link2.x+matrix_01.y*link2.y+ matrix_02.x*link2.z+matrix_02.y*link2.w; mat1.y = -matrix_00.x*link1.w+matrix_00.y*link1.z -matrix_01.x*link2.y+matrix_01.y*link2.x -matrix_02.x*link2.w+matrix_02.y*link2.z; mat2.x = stag_phase*(matrix_00.x*C1RE+matrix_00.y*C1IM+ matrix_01.x*C2RE+matrix_01.y*C2IM+ matrix_02.x*C3RE+matrix_02.y*C3IM); mat2.y = stag_phase*(-matrix_00.x*C1IM+matrix_00.y*C1RE -matrix_01.x*C2IM+matrix_01.y*C2RE -matrix_02.x*C3IM+matrix_02.y*C3RE); matrix_00 = mat0; matrix_01 = mat1; matrix_02 = mat2; mat0.x = matrix_10.x*link0.x+matrix_10.y*link0.y+ matrix_11.x*link0.z+matrix_11.y*link0.w+ matrix_12.x*link1.x+matrix_12.y*link1.y; mat0.y = -matrix_10.x*link0.y+matrix_10.y*link0.x -matrix_11.x*link0.w+matrix_11.y*link0.z -matrix_12.x*link1.y+matrix_12.y*link1.x; mat1.x = matrix_10.x*link1.z+matrix_10.y*link1.w+ matrix_11.x*link2.x+matrix_11.y*link2.y+ matrix_12.x*link2.z+matrix_12.y*link2.w; mat1.y = -matrix_10.x*link1.w+matrix_10.y*link1.z -matrix_11.x*link2.y+matrix_11.y*link2.x -matrix_12.x*link2.w+matrix_12.y*link2.z; mat2.x = stag_phase*(matrix_10.x*C1RE+matrix_10.y*C1IM+ matrix_11.x*C2RE+matrix_11.y*C2IM+ matrix_12.x*C3RE+matrix_12.y*C3IM); mat2.y = stag_phase*(-matrix_10.x*C1IM+matrix_10.y*C1RE -matrix_11.x*C2IM+matrix_11.y*C2RE -matrix_12.x*C3IM+matrix_12.y*C3RE); matrix_10 = mat0; matrix_11 = mat1; matrix_12 = mat2; mat0.x = matrix_20.x*link0.x+matrix_20.y*link0.y+ matrix_21.x*link0.z+matrix_21.y*link0.w+ matrix_22.x*link1.x+matrix_22.y*link1.y; mat0.y = -matrix_20.x*link0.y+matrix_20.y*link0.x -matrix_21.x*link0.w+matrix_21.y*link0.z -matrix_22.x*link1.y+matrix_22.y*link1.x; mat1.x = matrix_20.x*link1.z+matrix_20.y*link1.w+ matrix_21.x*link2.x+matrix_21.y*link2.y+ matrix_22.x*link2.z+matrix_22.y*link2.w; mat1.y = -matrix_20.x*link1.w+matrix_20.y*link1.z -matrix_21.x*link2.y+matrix_21.y*link2.x -matrix_22.x*link2.w+matrix_22.y*link2.z; mat2.x = stag_phase*(matrix_20.x*C1RE+matrix_20.y*C1IM+ matrix_21.x*C2RE+matrix_21.y*C2IM+ matrix_22.x*C3RE+matrix_22.y*C3IM); mat2.y = stag_phase*(-matrix_20.x*C1IM+matrix_20.y*C1RE -matrix_21.x*C2IM+matrix_21.y*C2RE -matrix_22.x*C3IM+matrix_22.y*C3RE); matrix_20 = mat0; matrix_21 = mat1; matrix_22 = mat2; // matrix=f_aux_dev*U(idx+mu)_nu * [U(idx+nu)_mu]^{dag} * [U(x)_nu]^{dag} barrier(CLK_LOCAL_MEM_FENCE); /////////////////////////// End of forward staples /// Write to global memory staples[0][threadIdx.x].x += matrix_00.x; staples[0][threadIdx.x].y += matrix_00.y; staples[1][threadIdx.x].x += matrix_01.x; staples[1][threadIdx.x].y += matrix_01.y; staples[2][threadIdx.x].x += matrix_02.x; staples[2][threadIdx.x].y += matrix_02.y; staples[3][threadIdx.x].x += matrix_10.x; staples[3][threadIdx.x].y += matrix_10.y; staples[4][threadIdx.x].x += matrix_11.x; staples[4][threadIdx.x].y += matrix_11.y; staples[5][threadIdx.x].x += matrix_12.x; staples[5][threadIdx.x].y += matrix_12.y; staples[6][threadIdx.x].x += matrix_20.x; staples[6][threadIdx.x].y += matrix_20.y; staples[7][threadIdx.x].x += matrix_21.x; staples[7][threadIdx.x].y += matrix_21.y; staples[8][threadIdx.x].x += matrix_22.x; staples[8][threadIdx.x].y += matrix_22.y; /////////////////////////////////////////////// // (1) // +--<--+ // | | // ^mu (2) V | to calculate (1)*(2)*(3) // | | | // +-->--+ // ->nu temp (3) idx temp_site = table[idx+nu*size_dev]; brds = (temp_site < size_dev)?0:1; size_dev_t = (brds)?bsize_dev:size_dev; brds *= 7*size_dev; site_table[threadIdx.x] = table[temp_site+(4+blockIdx.y)*size_dev_t + brds]; brde = (site_table[threadIdx.x] < size_dev)?0:1; brdg = brde; size_dev_t = (brde)?bsize_dev:size_dev; brde *= 3*size_dev; // brd*(4*size_dev - 1*size_dev) brdg *= 23*size_dev; LoadLinkRegsBrd(gauge_texRef, size_dev_t, site_table[threadIdx.x], nu); // U(idx-nu+mu)_{nu} stag_phase=phases[site_table[threadIdx.x] + nu*size_dev_t + brde]; matrix_00.x = f_aux_dev*link0.x; matrix_00.y = -f_aux_dev*link0.y; matrix_01.x = f_aux_dev*link1.z; matrix_01.y = -f_aux_dev*link1.w; matrix_02.x = stag_phase*f_aux_dev*C1RE; matrix_02.y = -stag_phase*f_aux_dev*C1IM; matrix_10.x = f_aux_dev*link0.z; matrix_10.y = -f_aux_dev*link0.w; matrix_11.x = f_aux_dev*link2.x; matrix_11.y = -f_aux_dev*link2.y; matrix_12.x = stag_phase*f_aux_dev*C2RE; matrix_12.y = -stag_phase*f_aux_dev*C2IM; matrix_20.x = f_aux_dev*link1.x; matrix_20.y = -f_aux_dev*link1.y; matrix_21.x = f_aux_dev*link2.z; matrix_21.y = -f_aux_dev*link2.w; matrix_22.x = stag_phase*f_aux_dev*C3RE; matrix_22.y = -stag_phase*f_aux_dev*C3IM; // matrix=f_aux_dev [U(idx-nu+mu)_{nu}]^{dag} /////////////////////////////////////////////// brde = (temp_site < size_dev)?0:1; brdg = brde; size_dev_t = (brde)?bsize_dev:size_dev; brde *= 3*size_dev; // brd*(4*size_dev - 1*size_dev) brdg *= 23*size_dev; LoadLinkRegsBrd(gauge_texRef, size_dev_t, temp_site, blockIdx.y); // U(idx-nu)_mu stag_phase=phases[temp_site + blockIdx.y*size_dev_t + brde]; mat0.x = matrix_00.x*link0.x+matrix_00.y*link0.y+ matrix_01.x*link0.z+matrix_01.y*link0.w+ matrix_02.x*link1.x+matrix_02.y*link1.y; mat0.y = -matrix_00.x*link0.y+matrix_00.y*link0.x -matrix_01.x*link0.w+matrix_01.y*link0.z -matrix_02.x*link1.y+matrix_02.y*link1.x; mat1.x = matrix_00.x*link1.z+matrix_00.y*link1.w+ matrix_01.x*link2.x+matrix_01.y*link2.y+ matrix_02.x*link2.z+matrix_02.y*link2.w; mat1.y = -matrix_00.x*link1.w+matrix_00.y*link1.z -matrix_01.x*link2.y+matrix_01.y*link2.x -matrix_02.x*link2.w+matrix_02.y*link2.z; mat2.x = stag_phase*(matrix_00.x*C1RE+matrix_00.y*C1IM+ matrix_01.x*C2RE+matrix_01.y*C2IM+ matrix_02.x*C3RE+matrix_02.y*C3IM); mat2.y = stag_phase*(-matrix_00.x*C1IM+matrix_00.y*C1RE -matrix_01.x*C2IM+matrix_01.y*C2RE -matrix_02.x*C3IM+matrix_02.y*C3RE); matrix_00 = mat0; matrix_01 = mat1; matrix_02 = mat2; mat0.x = matrix_10.x*link0.x+matrix_10.y*link0.y+ matrix_11.x*link0.z+matrix_11.y*link0.w+ matrix_12.x*link1.x+matrix_12.y*link1.y; mat0.y = -matrix_10.x*link0.y+matrix_10.y*link0.x -matrix_11.x*link0.w+matrix_11.y*link0.z -matrix_12.x*link1.y+matrix_12.y*link1.x; mat1.x = matrix_10.x*link1.z+matrix_10.y*link1.w+ matrix_11.x*link2.x+matrix_11.y*link2.y+ matrix_12.x*link2.z+matrix_12.y*link2.w; mat1.y = -matrix_10.x*link1.w+matrix_10.y*link1.z -matrix_11.x*link2.y+matrix_11.y*link2.x -matrix_12.x*link2.w+matrix_12.y*link2.z; mat2.x = stag_phase*(matrix_10.x*C1RE+matrix_10.y*C1IM+ matrix_11.x*C2RE+matrix_11.y*C2IM+ matrix_12.x*C3RE+matrix_12.y*C3IM); mat2.y = stag_phase*(-matrix_10.x*C1IM+matrix_10.y*C1RE -matrix_11.x*C2IM+matrix_11.y*C2RE -matrix_12.x*C3IM+matrix_12.y*C3RE); matrix_10 = mat0; matrix_11 = mat1; matrix_12 = mat2; mat0.x = matrix_20.x*link0.x+matrix_20.y*link0.y+ matrix_21.x*link0.z+matrix_21.y*link0.w+ matrix_22.x*link1.x+matrix_22.y*link1.y; mat0.y = -matrix_20.x*link0.y+matrix_20.y*link0.x -matrix_21.x*link0.w+matrix_21.y*link0.z -matrix_22.x*link1.y+matrix_22.y*link1.x; mat1.x = matrix_20.x*link1.z+matrix_20.y*link1.w+ matrix_21.x*link2.x+matrix_21.y*link2.y+ matrix_22.x*link2.z+matrix_22.y*link2.w; mat1.y = -matrix_20.x*link1.w+matrix_20.y*link1.z -matrix_21.x*link2.y+matrix_21.y*link2.x -matrix_22.x*link2.w+matrix_22.y*link2.z; mat2.x = stag_phase*(matrix_20.x*C1RE+matrix_20.y*C1IM+ matrix_21.x*C2RE+matrix_21.y*C2IM+ matrix_22.x*C3RE+matrix_22.y*C3IM); mat2.y = stag_phase*(-matrix_20.x*C1IM+matrix_20.y*C1RE -matrix_21.x*C2IM+matrix_21.y*C2RE -matrix_22.x*C3IM+matrix_22.y*C3RE); matrix_20 = mat0; matrix_21 = mat1; matrix_22 = mat2; // matrix=f_aux_dev [U(idx-nu+mu)_{nu}]^{dag} * [U(idx-nu)_mu]^{dag} barrier(CLK_LOCAL_MEM_FENCE); /////////////////////////////////////////////// LoadLinkRegsBrd(gauge_texRef, size_dev_t, temp_site, nu); // U(x-nu)_nu stag_phase=phases[temp_site + nu*size_dev_t + brde]; mat0.x =matrix_00.x*link0.x-matrix_00.y*link0.y+ matrix_01.x*link1.z-matrix_01.y*link1.w+ stag_phase*(matrix_02.x*C1RE -matrix_02.y*C1IM); mat0.y = matrix_00.x*link0.y+matrix_00.y*link0.x+ matrix_01.x*link1.w+matrix_01.y*link1.z+ stag_phase*(matrix_02.x*C1IM +matrix_02.y*C1RE); mat1.x = matrix_00.x*link0.z-matrix_00.y*link0.w+ matrix_01.x*link2.x-matrix_01.y*link2.y+ stag_phase*(matrix_02.x*C2RE -matrix_02.y*C2IM); mat1.y = matrix_00.x*link0.w+matrix_00.y*link0.z+ matrix_01.x*link2.y+matrix_01.y*link2.x+ stag_phase*(matrix_02.x*C2IM +matrix_02.y*C2RE); mat2.x = matrix_00.x*link1.x-matrix_00.y*link1.y+ matrix_01.x*link2.z-matrix_01.y*link2.w+ stag_phase*(matrix_02.x*C3RE -matrix_02.y*C3IM); mat2.y = matrix_00.x*link1.y+matrix_00.y*link1.x+ matrix_01.x*link2.w+matrix_01.y*link2.z+ stag_phase*(matrix_02.x*C3IM +matrix_02.y*C3RE); matrix_00 = mat0; matrix_01 = mat1; matrix_02 = mat2; mat0.x = matrix_10.x*link0.x-matrix_10.y*link0.y+ matrix_11.x*link1.z-matrix_11.y*link1.w+ stag_phase*(matrix_12.x*C1RE -matrix_12.y*C1IM); mat0.y = matrix_10.x*link0.y+matrix_10.y*link0.x+ matrix_11.x*link1.w+matrix_11.y*link1.z+ stag_phase*(matrix_12.x*C1IM +matrix_12.y*C1RE); mat1.x = matrix_10.x*link0.z-matrix_10.y*link0.w+ matrix_11.x*link2.x-matrix_11.y*link2.y+ stag_phase*(matrix_12.x*C2RE -matrix_12.y*C2IM); mat1.y = matrix_10.x*link0.w+matrix_10.y*link0.z+ matrix_11.x*link2.y+matrix_11.y*link2.x+ stag_phase*(matrix_12.x*C2IM +matrix_12.y*C2RE); mat2.x = matrix_10.x*link1.x-matrix_10.y*link1.y+ matrix_11.x*link2.z-matrix_11.y*link2.w+ stag_phase*(matrix_12.x*C3RE -matrix_12.y*C3IM); mat2.y = matrix_10.x*link1.y+matrix_10.y*link1.x+ matrix_11.x*link2.w+matrix_11.y*link2.z+ stag_phase*(matrix_12.x*C3IM +matrix_12.y*C3RE); matrix_10 = mat0; matrix_11 = mat1; matrix_12 = mat2; mat0.x = matrix_20.x*link0.x-matrix_20.y*link0.y+ matrix_21.x*link1.z-matrix_21.y*link1.w+ stag_phase*(matrix_22.x*C1RE -matrix_22.y*C1IM); mat0.y = matrix_20.x*link0.y+matrix_20.y*link0.x+ matrix_21.x*link1.w+matrix_21.y*link1.z+ stag_phase*(matrix_22.x*C1IM +matrix_22.y*C1RE); mat1.x = matrix_20.x*link0.z-matrix_20.y*link0.w+ matrix_21.x*link2.x-matrix_21.y*link2.y+ stag_phase*(matrix_22.x*C2RE -matrix_22.y*C2IM); mat1.y = matrix_20.x*link0.w+matrix_20.y*link0.z+ matrix_21.x*link2.y+matrix_21.y*link2.x+ stag_phase*(matrix_22.x*C2IM +matrix_22.y*C2RE); mat2.x = matrix_20.x*link1.x-matrix_20.y*link1.y+ matrix_21.x*link2.z-matrix_21.y*link2.w+ stag_phase*(matrix_22.x*C3RE -matrix_22.y*C3IM); mat2.y = matrix_20.x*link1.y+matrix_20.y*link1.x+ matrix_21.x*link2.w+matrix_21.y*link2.z+ stag_phase*(matrix_22.x*C3IM +matrix_22.y*C3RE); matrix_20 = mat0; matrix_21 = mat1; matrix_22 = mat2; // matrix=f_aux_dev [U(idx-nu+mu)_{nu}]^{dag} * [U(idx-nu)_mu]^{dag} * U(x-nu)_nu barrier(CLK_LOCAL_MEM_FENCE); ///////////////////////////// /// Write to global memory staples[0][threadIdx.x].x += matrix_00.x; staples[0][threadIdx.x].y += matrix_00.y; staples[1][threadIdx.x].x += matrix_01.x; staples[1][threadIdx.x].y += matrix_01.y; staples[2][threadIdx.x].x += matrix_02.x; staples[2][threadIdx.x].y += matrix_02.y; staples[3][threadIdx.x].x += matrix_10.x; staples[3][threadIdx.x].y += matrix_10.y; staples[4][threadIdx.x].x += matrix_11.x; staples[4][threadIdx.x].y += matrix_11.y; staples[5][threadIdx.x].x += matrix_12.x; staples[5][threadIdx.x].y += matrix_12.y; staples[6][threadIdx.x].x += matrix_20.x; staples[6][threadIdx.x].y += matrix_20.y; staples[7][threadIdx.x].x += matrix_21.x; staples[7][threadIdx.x].y += matrix_21.y; staples[8][threadIdx.x].x += matrix_22.x; staples[8][threadIdx.x].y += matrix_22.y; } /////////////////////////////////////////// // Load out_matrix stag_phase=phases[idx+blockIdx.y*size_dev]; matrix_00.x = staples[0][threadIdx.x].x; matrix_00.y = staples[0][threadIdx.x].y; matrix_01.x = staples[1][threadIdx.x].x; matrix_01.y = staples[1][threadIdx.x].y; matrix_02.x = staples[2][threadIdx.x].x; matrix_02.y = staples[2][threadIdx.x].y; matrix_10.x = staples[3][threadIdx.x].x; matrix_10.y = staples[3][threadIdx.x].y; matrix_11.x = staples[4][threadIdx.x].x; matrix_11.y = staples[4][threadIdx.x].y; matrix_12.x = staples[5][threadIdx.x].x; matrix_12.y = staples[5][threadIdx.x].y; matrix_20.x = staples[6][threadIdx.x].x; matrix_20.y = staples[6][threadIdx.x].y; matrix_21.x = staples[7][threadIdx.x].x; matrix_21.y = staples[7][threadIdx.x].y; matrix_22.x = staples[8][threadIdx.x].x; matrix_22.y = staples[8][threadIdx.x].y; ////////////////// // Multiply u_mu * staple LoadLinkRegs(gauge_texRef, size_dev, idx, blockIdx.y); //Loads U_mu mat0.x = link0.x*matrix_00.x - link0.y*matrix_00.y + link0.z*matrix_10.x - link0.w*matrix_10.y + link1.x*matrix_20.x - link1.y*matrix_20.y; mat0.y = link0.x*matrix_00.y + link0.y*matrix_00.x + link0.z*matrix_10.y + link0.w*matrix_10.x + link1.x*matrix_20.y + link1.y*matrix_20.x; mat1.x = link1.z*matrix_00.x - link1.w*matrix_00.y + link2.x*matrix_10.x - link2.y*matrix_10.y + link2.z*matrix_20.x - link2.w*matrix_20.y; mat1.y = link1.z*matrix_00.y + link1.w*matrix_00.x + link2.x*matrix_10.y + link2.y*matrix_10.x + link2.z*matrix_20.y + link2.w*matrix_20.x; mat2.x = stag_phase*(C1RE*matrix_00.x - C1IM*matrix_00.y + C2RE*matrix_10.x - C2IM*matrix_10.y + C3RE*matrix_20.x - C3IM*matrix_20.y); mat2.y = stag_phase*(C1RE*matrix_00.y + C1IM*matrix_00.x + C2RE*matrix_10.y + C2IM*matrix_10.x + C3RE*matrix_20.y + C3IM*matrix_20.x); matrix_00 = mat0; matrix_10 = mat1; matrix_20 = mat2; mat0.x = link0.x*matrix_01.x - link0.y*matrix_01.y + link0.z*matrix_11.x - link0.w*matrix_11.y + link1.x*matrix_21.x - link1.y*matrix_21.y; mat0.y = link0.x*matrix_01.y + link0.y*matrix_01.x + link0.z*matrix_11.y + link0.w*matrix_11.x + link1.x*matrix_21.y + link1.y*matrix_21.x; mat1.x = link1.z*matrix_01.x - link1.w*matrix_01.y + link2.x*matrix_11.x - link2.y*matrix_11.y + link2.z*matrix_21.x - link2.w*matrix_21.y; mat1.y = link1.z*matrix_01.y + link1.w*matrix_01.x + link2.x*matrix_11.y + link2.y*matrix_11.x + link2.z*matrix_21.y + link2.w*matrix_21.x; mat2.x = stag_phase*(C1RE*matrix_01.x - C1IM*matrix_01.y + C2RE*matrix_11.x - C2IM*matrix_11.y + C3RE*matrix_21.x - C3IM*matrix_21.y); mat2.y = stag_phase*(C1RE*matrix_01.y + C1IM*matrix_01.x + C2RE*matrix_11.y + C2IM*matrix_11.x + C3RE*matrix_21.y + C3IM*matrix_21.x); matrix_01 = mat0; matrix_11 = mat1; matrix_21 = mat2; mat0.x = link0.x*matrix_02.x - link0.y*matrix_02.y + link0.z*matrix_12.x - link0.w*matrix_12.y + link1.x*matrix_22.x - link1.y*matrix_22.y; mat0.y = link0.x*matrix_02.y + link0.y*matrix_02.x + link0.z*matrix_12.y + link0.w*matrix_12.x + link1.x*matrix_22.y + link1.y*matrix_22.x; mat1.x = link1.z*matrix_02.x - link1.w*matrix_02.y + link2.x*matrix_12.x - link2.y*matrix_12.y + link2.z*matrix_22.x - link2.w*matrix_22.y; mat1.y = link1.z*matrix_02.y + link1.w*matrix_02.x + link2.x*matrix_12.y + link2.y*matrix_12.x + link2.z*matrix_22.y + link2.w*matrix_22.x; mat2.x = stag_phase*(C1RE*matrix_02.x - C1IM*matrix_02.y + C2RE*matrix_12.x - C2IM*matrix_12.y + C3RE*matrix_22.x - C3IM*matrix_22.y); mat2.y = stag_phase*(C1RE*matrix_02.y + C1IM*matrix_02.x + C2RE*matrix_12.y + C2IM*matrix_12.x + C3RE*matrix_22.y + C3IM*matrix_22.x); matrix_02 = mat0; matrix_12 = mat1; matrix_22 = mat2; ///////////////////////////// /// Write to global memory the traceless antihermitian part out_matrix[idx + 2*blockIdx.y*size_dev].x+=0.5f*(matrix_01.x-matrix_10.x); out_matrix[idx + 2*blockIdx.y*size_dev].y+=0.5f*(matrix_01.y+matrix_10.y); out_matrix[idx + 2*blockIdx.y*size_dev].z+=0.5f*(matrix_02.x-matrix_20.x); out_matrix[idx + 2*blockIdx.y*size_dev].w+=0.5f*(matrix_02.y+matrix_20.y); out_matrix[idx + size_dev + 2*blockIdx.y*size_dev].x+=0.5f*(matrix_12.x-matrix_21.x); out_matrix[idx + size_dev + 2*blockIdx.y*size_dev].y+=0.5f*(matrix_12.y+matrix_21.y); out_matrix[idx + size_dev + 2*blockIdx.y*size_dev].z+=matrix_00.y-0.3333333f*(matrix_00.y+matrix_11.y+matrix_22.y); out_matrix[idx + size_dev + 2*blockIdx.y*size_dev].w+=matrix_11.y-0.3333333f*(matrix_00.y+matrix_11.y+matrix_22.y); } #else #endif

0 Likes
n_treutner
Journeyman III

Problem after change from 2.2 to 2.3

damnit, my GDebugger Licence has just run out. I wanted to test, what ibird suggested, but i can't read back the values so easily atm. is there any other easy way, to check the buffers during runtime, like the GDebugger allows to?

EDIT: I'm stupid, they give away free licences. Never mind ...

but what i can say is, that the code still does not work. i think, i'll have to switch back to 2.2.

himanshu.guatam, i'm afraid, i won't be able to build a simple test case in reasonable time. but maybe, i'll try anyway in a couple of days.

0 Likes
Jawed
Adept II

Problem after change from 2.2 to 2.3

Himanshu, if you are interested in a test case you can take a look at the Deathray thread I started recently, as I have posted the entire source code. The kernel source documents the bug, which I've attached.

If you delete the two lines that say this:

int x = (local_id.x == 7) ? 44 : 40;

and then paste one copy of that line just before:

if (top_group) {

the error occurs. This is with SDK 2.3 and Catalyst 10.12.

Please note the #define for TILE_SIDE at the top of that source file. If the value of TILE_SIDE is defined as 48, the error does not occur. But when it is 53 (actually any value > 48 and <=64) the error occurs, if the single line of code computing "x" is used.

So even though it looks silly, the two identical lines of code are required to work around this bug.

} else if ((top_group || bot_group) && local_id.y < 😎 { if (top_group) { int x = (local_id.x == 7) ? 44 : 40; // ATI 2.3: compiler bug requires this here if TILE_SIDE is not 48 mirror = ReadTile4(x, 8 + local_id.y + 1, tile); WriteTile4(mirror, x, 8 - local_id.y - 1, tile); } if (bot_group) { int x = (local_id.x == 7) ? 44 : 40; // ATI 2.3: compiler bug requires this here if TILE_SIDE is not 48 int bottom = ((height - 1) & 31) + 8; mirror = ReadTile4(x, bottom - local_id.y - 1, tile); WriteTile4(mirror, x, bottom + local_id.y + 1, tile); } }

0 Likes
n_treutner
Journeyman III

Problem after change from 2.2 to 2.3

allright, i think i found the problem (in my special case): it's the imagesampler. i'll try to find out, what exactly went wrong.

0 Likes
himanshu_gautam
Grandmaster

Problem after change from 2.2 to 2.3

jawed,

I downloaded the deathray. But i am little unfamiliar with what the code would do. When I run it it asks for a executable for debug session. I am not sure what to provide there, although i tried giving it a executable from OpenCL samples(Bitonic Sort) and on execution the deathray executed that .exe. Can you please provide what type of error do i need to expect and what .exe needs to be provided to the app.

Do you get some compilation error in the situation you mentioned i.e when TILE_SIDE=53?If that is the case with you I think the issue has been resolved as it no longer occur with internal dlls.

0 Likes
Jawed
Adept II

Problem after change from 2.2 to 2.3

Hi Himanshu, Deathray is a plug-in for the Avisynth scripting environment.

To create a debug session with Deathray requires a working installation of Avisynth 2.5.8 and an application that enables single-frame stepping through a video. e.g. AvsP or VirtualDub.

To be honest I was merely expecting you to take a look at the compilations that are produced when the code is varied as I specified above. The entire compilation is around 700 ALU bundles and the error is occurring somewhere in the middle of the first 300-odd ALU bundles.

I presume you guys have decent tools for tracing through ISA!

As published, Deathray won't exhibit the bug unless you change the int x = ... lines as I described earlier. I was hoping that you would be able to analyse the compiled output and discern the problem.

I don't get a compilation error at runtime. Instead the kernel malfunctions in the area of those int x = ... lines.

Those lines are copying pixels from one place in local memory to another. With TILE_SIDE set to 48 the pixel copy is perfect. When set to 53 the copy fails.

I'm making heavy use of read_mem_fence and write_mem_fence in the kernel to ensure that ordering is correct.

It seems like you are saying you have discovered and fixed a problem with the way the compiler does optimisation of literals and/or literals used in local memory addressing. Is that the case?

I've actually discovered another problem relating to reads from "a register". This is more troubling (I can't find a work-around) and I will raise a support ticket for it.

Basically I define:

float16 target_window[7]

If I assign to a float4 using:

float4 test = target_window[3].s2345

or

float4 test = target_window[3].s6789

then I get a faulty result. The peculiar thing is the faulty result is in only one channel of the float4 result and the fault is data-dependent (faulty pixels flicker based upon data in the video frame!). The fault seems to disappear if I do lots of assignments, i.e. .s0123 .s1234 .s2345 .s3456 .s4567 .s5678 .s6789. Some smaller combinations from that set also seem to make the fault disappear.

0 Likes