As promised here ( http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=145087&enterthread=y ), this is a simple test case that reproduces the issue.
The code creates some buffers (fermion, gauge, table, eta), does some computation in the kernel, then prints the result in the format "( buffer index ) value".
Running on the GPU I get:
( 0 ) 29.9808543717 ( 1 ) -30.5956300187 ( 2 ) 39.421714042 ( 3 ) -65.1260533737 ( 4 ) 36.0566958781 ( 5 ) -48.636864618 ( 6 )
26.4227317676 ( 7 ) -62.8406544328 ( 8 ) 0 ( 9 ) 0 ( 10 ) 0 ( 11 ) 0 ( 12 ) 0 ( 13 ) 0 ( 14 ) 0 ( 15 ) 0 ( 16 ) 30.6030340145 ( 17 )
-51.5434104653 ( 18 ) 38.6194434131 ( 19 ) -51.0394421127 ( 20 ) 28.8687884796 ( 21 ) -46.1770377079 ( 22 ) 38.3811695387 ( 23 )
-59.4306483653 ( 24 ) 0 ( 25 ) 0 ( 26 ) 0 ( 27 ) 0 ( 28 ) 0 ( 29 ) 0 ( 30 ) 0 ( 31 ) 0 ( 32 ) 37.6226123498 ( 33 ) -33.3488500679 ( 34 )
37.7464948716 ( 35 ) -9.0478474369 ( 36 ) 53.1298898854 ( 37 ) 12.1399464495 ( 38 ) 33.5315784346 ( 39 ) -8.02393247651 ( 40 ) 0 ( 41 )
0 ( 42 ) 0 ( 43 ) 0 ( 44 ) 0 ( 45 ) 0 ( 46 ) 0 ( 47 ) 0
Running on the CPU I get:
( 0 ) 29.9808543717 ( 1 ) -30.5956300187 ( 2 ) 39.421714042 ( 3 ) -65.1260533737 ( 4 ) 36.0566958781 ( 5 ) -48.636864618 ( 6 )
26.4227317676 ( 7 ) -62.8406544328 ( 8 ) 0 ( 9 ) 0 ( 10 ) 0 ( 11 ) 0 ( 12 ) 0 ( 13 ) 0 ( 14 ) 0 ( 15 ) 0 ( 16 ) 30.6030340145 ( 17 )
-51.5434104653 ( 18 ) 38.6194434131 ( 19 ) -51.0394421127 ( 20 ) 29.490630454 ( 21 ) -45.8204361705 ( 22 ) 38.2942618562 ( 23 )
-59.8944927521 ( 24 ) 0 ( 25 ) 0 ( 26 ) 0 ( 27 ) 0 ( 28 ) 0 ( 29 ) 0 ( 30 ) 0 ( 31 ) 0 ( 32 ) 37.6226123498 ( 33 ) -33.3488500679 ( 34 )
37.7464948716 ( 35 ) -9.0478474369 ( 36 ) 51.3163383616 ( 37 ) 11.6219953431 ( 38 ) 33.7140404702 ( 39 ) -8.27195007533 ( 40 ) 0 ( 41 )
0 ( 42 ) 0 ( 43 ) 0 ( 44 ) 0 ( 45 ) 0 ( 46 ) 0 ( 47 ) 0
The values are equal from index 0 to 19 and different from index 20 to 47.
The test case has no buffer overflow; you can check this yourself:
The workgroup size is 2, the total work size is 4, NUM_THREADS = 2.
fermion has dimension 8*6 = 48 doubles
gauge has dimension 8*12*4*2 = 768 floats
table has dimension 8*8 = 64 unsigned ints; the site table stores numbers from 0 to 7, table[i] = (i*123)%NSITE
eta has dimension 8*4 = 32 ints
size_dev = 8, b_size_dev = 0, mass = 0.1
Inside the kernel the maximum fermion buffer index is 2*size_dev_t + site_table[threadIdx], so 2*8 + 7 = 23 in double2 units, i.e. doubles up to 47 < 48 (good, no overflow).
Inside the kernel the maximum gauge buffer index is 12*vol + idx + vol*(2+3*mu); the largest index used is site_table[threadIdx] = 7 in the second block, so 12*8 + 7 + 8*(2+3*3) = 191 in float4 units, i.e. floats up to 767 < 768 (good, no overflow).
Inside the kernel the maximum table buffer index is idx + 7*size_dev, and idx is at most 3, so 3 + 7*8 = 59 < 64 (good, no overflow).
Inside the kernel the maximum eta buffer index is site_table[threadIdx] + 3*size_dev_t, so 7 + 3*8 = 31 < 32 (good, no overflow).
The locals are:
__local double ferm_out[3][2][NUM_THREADS]; with NUM_THREADS = 2
__local int site_table[NUM_THREADS];
Both are addressed with threadIdx; with a workgroup size of 2, threadIdx is 0 or 1.
Given all this, and the fact that all memory is initialized, the CPU and GPU must converge on the same result. But this does not happen: the CPU result is correct, the GPU result is not.
Unfortunately it is difficult to reduce the kernel further; the bug disappears whenever I try to shrink it. The kernel contains some seemingly useless instructions, but they are needed to reproduce the issue.
For example:
brd = (site_table[threadIdx] < size_dev)?0:1;
size_dev_t = (brd)?bsize_dev:size_dev;
size_dev_t is always equal to size_dev, because site_table[threadIdx] < size_dev is always true; but if I delete these two instructions, the issue disappears.
It also disappears if I reduce the number of independent components computed, so this is the smallest reproducer I can make. The problem is limited to ATI GPUs (OpenCL on the CPU and on NVIDIA GPUs works without problems). It appeared with the latest driver, Catalyst 10.12 (stable) + SDK 2.3; downgrading the SDK to 2.2 does not solve the problem. I am on Ubuntu 9.04.
/////////////////// bug.cpp

#define RESULT_OK 1
#define RESULT_FAIL -1

#include <CL/cl.h>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <string.h>

#define ATI_PLATFORM 0
#define ATI_DEVICE_1 1
#define NSITE 8
#define FERMION_SIZE (NSITE*6)
#define GAUGE_SIZE (NSITE*12*4*2)
#define TABLE_SIZE (NSITE*8)
#define ETA_SIZE (NSITE*4)

double * fermion;
float * gauge;
unsigned int * table;
int * eta;

struct test {
  unsigned int x;
  unsigned int y;
};

int main()
{
  int id_dev = 0;
  cl_int status = 0;
  cl_uint nplat = 0;

  /* Get number of platforms */
  status = clGetPlatformIDs(0, NULL, &nplat);
  if (status != CL_SUCCESS) { std::cerr << "Error: clGetPlatformIDs error\n"; return -1; }

  /* Get all platform IDs */
  cl_platform_id * platformst = new cl_platform_id [nplat];
  status = clGetPlatformIDs(nplat, platformst, NULL);
  if (status != CL_SUCCESS) { std::cerr << "Error: clGetPlatformIDs error\n"; return -1; }

  /* Get number of devices */
  unsigned int num_devices;
  status = clGetDeviceIDs(platformst[ATI_PLATFORM], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
  if (status != CL_SUCCESS) { std::cerr << "Error: clGetDeviceIDs error\n"; return -1; }

  /* Get all device IDs in the platform */
  cl_device_id * devicest = new cl_device_id [num_devices];
  status = clGetDeviceIDs(platformst[ATI_PLATFORM], CL_DEVICE_TYPE_ALL, num_devices, devicest, NULL);
  if (status != CL_SUCCESS) { std::cerr << "Error: clGetDeviceIDs error\n"; return -1; }

  /* Create a context */
  cl_context_properties prop[3];
  prop[0] = CL_CONTEXT_PLATFORM;
  prop[1] = (cl_context_properties)platformst[ATI_PLATFORM];
  prop[2] = 0;
  cl_device_id * devid = new cl_device_id [num_devices];
  for (int j = 0 ; j < (int)num_devices ; j++) { devid[j] = devicest[j]; }
  cl_context context = clCreateContext(prop, num_devices, devid, NULL, NULL, &status);
  if (status != CL_SUCCESS) { std::cerr << "Error: clCreateContext error\n"; return -1; }

  /* Load and build the kernel */
  cl_program prgdslash;
  std::ifstream loadf;
  loadf.open("DslashDaggerDDKernelEO.cl", std::ios::binary);
  if (!loadf.is_open()) return RESULT_FAIL;

  /* get file size */
  loadf.seekg(0, std::ios::end);
  int sz = loadf.tellg();
  loadf.seekg(0, std::ios::beg);
  size_t sourceSize1 = sz;
  char * source = new char [sz];

  /* Load the source */
  loadf.read((char *)source, sz);
  loadf.close();

  prgdslash = clCreateProgramWithSource(context, 1, (const char **)&source, &sourceSize1, &status);
  if (status != CL_SUCCESS) { std::cerr << "Error: clCreateProgramWithSource error\n"; return -1; }
  status = clBuildProgram(prgdslash, 1, &devicest[ATI_DEVICE_1], "-D NUM_THREADS=2 -D FP64_AMD", NULL, NULL);
  if (status != CL_SUCCESS) { std::cerr << "Error: clBuildProgram error\n"; return -1; }
  cl_kernel kernelsDslash_t = clCreateKernel(prgdslash, "DslashDaggerDDKernelEO", &status);

  /* for one device in a platform we create 2 command queues */
  cl_command_queue * commandQueue = new cl_command_queue [2];
  commandQueue[0] = clCreateCommandQueue(context, devicest[ATI_DEVICE_1], 0, &status);
  if (status != CL_SUCCESS) { std::cerr << "Error: clCreateCommandQueue error\n"; return -1; }

  /* Fill host buffers */
  fermion = new double [FERMION_SIZE];
  for (unsigned int i = 0 ; i < FERMION_SIZE ; i++) { fermion[i] = 1.0+(0.4343234*(float)i - (int)(0.4343234*i)); }
  gauge = new float [GAUGE_SIZE];
  for (unsigned int i = 0 ; i < GAUGE_SIZE ; i++) { gauge[i] = 1.0+(0.8463654*(float)i - (int)(0.8463654*i)); }
  table = new unsigned int [TABLE_SIZE];
  for (unsigned int i = 0 ; i < TABLE_SIZE ; i++) { table[i] = (i*123)%NSITE; }
  eta = new int [ETA_SIZE];
  for (unsigned int i = 0 ; i < ETA_SIZE ; i++) { eta[i] = 2*(i%2)-1; }

  size_t global[3] = {0,0,0};
  size_t local[3] = {0,0,0};

  cl_mem bufout = clCreateBuffer(context, CL_MEM_READ_WRITE, FERMION_SIZE*sizeof(cl_double), NULL, &status);
  if (status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error\n"; return -1; }
  cl_mem buffermion = clCreateBuffer(context, CL_MEM_READ_WRITE, FERMION_SIZE*sizeof(cl_double), NULL, &status);
  if (status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error\n"; return -1; }
  cl_mem bufgauge = clCreateBuffer(context, CL_MEM_READ_WRITE, GAUGE_SIZE*sizeof(cl_float), NULL, &status);
  if (status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error\n"; return -1; }
  cl_mem buftable = clCreateBuffer(context, CL_MEM_READ_WRITE, TABLE_SIZE*sizeof(cl_int), NULL, &status);
  if (status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error\n"; return -1; }
  cl_mem bufeta = clCreateBuffer(context, CL_MEM_READ_WRITE, ETA_SIZE*sizeof(cl_int), NULL, &status);
  if (status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error\n"; return -1; }

  status = clEnqueueWriteBuffer(commandQueue[0], buffermion, CL_TRUE, 0, FERMION_SIZE*sizeof(cl_double), fermion, 0, NULL, NULL);
  if (status != CL_SUCCESS) { std::cerr << "Error: clEnqueueWriteBuffer error\n"; return -1; }
  status = clEnqueueWriteBuffer(commandQueue[0], bufgauge, CL_TRUE, 0, GAUGE_SIZE*sizeof(cl_float), gauge, 0, NULL, NULL);
  if (status != CL_SUCCESS) { std::cerr << "Error: clEnqueueWriteBuffer error\n"; return -1; }
  status = clEnqueueWriteBuffer(commandQueue[0], buftable, CL_TRUE, 0, TABLE_SIZE*sizeof(cl_uint), table, 0, NULL, NULL);
  if (status != CL_SUCCESS) { std::cerr << "Error: clEnqueueWriteBuffer error\n"; return -1; }
  status = clEnqueueWriteBuffer(commandQueue[0], bufeta, CL_TRUE, 0, ETA_SIZE*sizeof(cl_uint), eta, 0, NULL, NULL);
  if (status != CL_SUCCESS) { std::cerr << "Error: clEnqueueWriteBuffer error\n"; return -1; }

  global[0] = NSITE/2;
  local[0] = 2;
  unsigned int size_dev = NSITE;
  unsigned int size_dev_h = NSITE/2;
  unsigned int b_size_dev = 0;
  double mass = 0.1;

  status = clSetKernelArg(kernelsDslash_t, 0, sizeof(cl_mem), (void *)&bufout);
  if (status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error\n"; return -1; }
  status = clSetKernelArg(kernelsDslash_t, 1, sizeof(cl_mem), (void *)&bufgauge);
  if (status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error\n"; return -1; }
  status = clSetKernelArg(kernelsDslash_t, 2, sizeof(cl_mem), (void *)&buffermion);
  if (status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error\n"; return -1; }
  status = clSetKernelArg(kernelsDslash_t, 3, sizeof(cl_mem), (void *)&buftable);
  if (status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error\n"; return -1; }
  status = clSetKernelArg(kernelsDslash_t, 4, sizeof(cl_mem), (void *)&bufeta);
  if (status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error\n"; return -1; }
  status = clSetKernelArg(kernelsDslash_t, 5, sizeof(cl_uint), (void *)&size_dev);
  if (status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error\n"; return -1; }
  status = clSetKernelArg(kernelsDslash_t, 6, sizeof(cl_uint), (void *)&size_dev_h);
  if (status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error\n"; return -1; }
  status = clSetKernelArg(kernelsDslash_t, 7, sizeof(cl_uint), (void *)&b_size_dev);
  if (status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error\n"; return -1; }
  status = clSetKernelArg(kernelsDslash_t, 8, sizeof(cl_double), (void *)&mass);
  if (status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error\n"; return -1; }

  status = clEnqueueNDRangeKernel(commandQueue[0], kernelsDslash_t, 1, NULL, global, local, 0, NULL, NULL);
  if (status != CL_SUCCESS) { std::cerr << "Error: clEnqueueNDRangeKernel error\n"; return -1; }
  clFinish(commandQueue[0]);

  double bufferout[FERMION_SIZE];
  clEnqueueReadBuffer(commandQueue[0], bufout, CL_TRUE, 0, FERMION_SIZE*sizeof(cl_double), bufferout, 0, NULL, NULL);
  for (unsigned int i = 0 ; i < FERMION_SIZE; i++) {
    std::cerr << " ( " << i << " ) " << std::setprecision(12) << bufferout[i];
  }

  clReleaseCommandQueue(commandQueue[0]);
  clReleaseKernel(kernelsDslash_t);
  clReleaseProgram(prgdslash);
  delete [] devid;
  delete [] devicest;
  delete [] platformst;
  return 0;
}

//////////// DslashDaggerDDKernelEO.cl

#ifdef FP64_AMD
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif
#ifdef FP64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#endif

#define C1RED ( (link0z*link2z-link0w*link2w) - (link1x*link2x-link1y*link2y) )
#define C1IMD (-(link0z*link2w+link0w*link2z) + (link1x*link2y+link1y*link2x) )
#define C2RED ( (link1x*link1z-link1y*link1w) - (link0x*link2z-link0y*link2w) )
#define C2IMD (-(link1x*link1w+link1y*link1z) + (link0x*link2w+link0y*link2z) )
#define C3RED ( (link0x*link2x-link0y*link2y) - (link0z*link1z-link0w*link1w) )
#define C3IMD (-(link0x*link2y+link0y*link2x) + (link0z*link1w+link0w*link1z) )

#define LoadLinkDDRegs(gauge, vol, idx, mu) \
  auxlink = gauge_texRef[ idx + vol*(0+3*mu)];\
  link0x=(double) auxlink.x;\
  link0y=(double) auxlink.y;\
  link0z=(double) auxlink.z;\
  link0w=(double) auxlink.w;\
  auxlink = gauge_texRef[ idx + vol*(1+3*mu)];\
  link1x=(double) auxlink.x;\
  link1y=(double) auxlink.y;\
  link1z=(double) auxlink.z;\
  link1w=(double) auxlink.w;\
  auxlink = gauge_texRef[ idx + vol*(2+3*mu)];\
  link2x=(double) auxlink.x;\
  link2y=(double) auxlink.y;\
  link2z=(double) auxlink.z;\
  link2w=(double) auxlink.w;\
  /* 2nd float */\
  auxlink = gauge_texRef[ 12*vol + idx + vol*(0+3*mu)];\
  link0x+=(double) auxlink.x;\
  link0y+=(double) auxlink.y;\
  link0z+=(double) auxlink.z;\
  link0w+=(double) auxlink.w;\
  auxlink = gauge_texRef[ 12*vol + idx + vol*(1+3*mu)];\
  link1x+=(double) auxlink.x;\
  link1y+=(double) auxlink.y;\
  link1z+=(double) auxlink.z;\
  link1w+=(double) auxlink.w;\
  auxlink = gauge_texRef[ 12*vol + idx + vol*(2+3*mu)];\
  link2x+=(double) auxlink.x;\
  link2y+=(double) auxlink.y;\
  link2z+=(double) auxlink.z;\
  link2w+=(double) auxlink.w;

__kernel void DslashDaggerDDKernelEO(__global double2 *out,
                                     __global float4 *gauge_texRef,
                                     __global double2 *fermion_texRef,
                                     __global int *tables,
                                     __global int *phases,
                                     const unsigned int size_dev,
                                     const unsigned int size_dev_h,
                                     const unsigned int bsize_dev,
                                     const double mass_d_dev)
{
  unsigned int brd;
  int threadIdx = get_local_id(0);
  int blockDim = get_local_size(0);
  int blockIdx = get_group_id(0);
  int idx = blockIdx*blockDim + threadIdx;   // idx < sizeh, EVEN!!
  double stag_phase = 1.0;

  // Store result in shared mem
  __local double ferm_out[3][2][NUM_THREADS];
  unsigned int size_dev_t;

  // New tables indexing (index fastest)
  __local int site_table[NUM_THREADS];

  // Load link matrix U_mu(ix) in registers
  double link0x, link0y, link0z, link0w,
         link1x, link1y, link1z, link1w,
         link2x, link2y, link2z, link2w;
  float4 auxlink;
  double2 ferm_in_0, ferm_in_1, ferm_in_2;

  // DIRECTION 0
  site_table[threadIdx] = tables[idx+4*size_dev];
  brd = (site_table[threadIdx] < size_dev)?0:1;
  size_dev_t = (brd)?bsize_dev:size_dev;
  ferm_in_0 = fermion_texRef[ site_table[threadIdx]];
  ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]];
  ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]];
  LoadLinkDDRegs(gauge_texRef, size_dev, idx, 0);
  ferm_out[0][0][threadIdx] = link0x*ferm_in_0.x-link0y*ferm_in_0.y+
                              link0z*ferm_in_1.x-link0w*ferm_in_1.y+
                              link1x*ferm_in_2.x-link1y*ferm_in_2.y;
  ferm_out[0][1][threadIdx] = link0x*ferm_in_0.y+link0y*ferm_in_0.x+
                              link0z*ferm_in_1.y+link0w*ferm_in_1.x+
                              link1x*ferm_in_2.y+link1y*ferm_in_2.x;
  ferm_out[1][0][threadIdx] = link1z*ferm_in_0.x-link1w*ferm_in_0.y+
                              link2x*ferm_in_1.x-link2y*ferm_in_1.y+
                              link2z*ferm_in_2.x-link2w*ferm_in_2.y;
  ferm_out[1][1][threadIdx] = link1z*ferm_in_0.y+link1w*ferm_in_0.x+
                              link2x*ferm_in_1.y+link2y*ferm_in_1.x+
                              link2z*ferm_in_2.y+link2w*ferm_in_2.x;
  ferm_out[2][0][threadIdx] = C1RED*ferm_in_0.x-C1IMD*ferm_in_0.y+
                              C2RED*ferm_in_1.x-C2IMD*ferm_in_1.y+
                              C3RED*ferm_in_2.x-C3IMD*ferm_in_2.y;
  ferm_out[2][1][threadIdx] = C1RED*ferm_in_0.y+C1IMD*ferm_in_0.x+
                              C2RED*ferm_in_1.y+C2IMD*ferm_in_1.x+
                              C3RED*ferm_in_2.y+C3IMD*ferm_in_2.x;

  // DIRECTION 1
  site_table[threadIdx] = tables[idx+5*size_dev];
  brd = (site_table[threadIdx] < size_dev)?0:1;
  size_dev_t = (brd)?bsize_dev:size_dev;
  stag_phase = (double) phases[idx+size_dev];
  ferm_in_0 = fermion_texRef[ site_table[threadIdx]];
  ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]];
  ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]];
  LoadLinkDDRegs(gauge_texRef, size_dev, idx, 1);
  ferm_out[0][0][threadIdx] += link0x*ferm_in_0.x-link0y*ferm_in_0.y+
                               link0z*ferm_in_1.x-link0w*ferm_in_1.y+
                               link1x*ferm_in_2.x-link1y*ferm_in_2.y;
  ferm_out[0][1][threadIdx] += link0x*ferm_in_0.y+link0y*ferm_in_0.x+
                               link0z*ferm_in_1.y+link0w*ferm_in_1.x+
                               link1x*ferm_in_2.y+link1y*ferm_in_2.x;
  ferm_out[1][0][threadIdx] += link1z*ferm_in_0.x-link1w*ferm_in_0.y+
                               link2x*ferm_in_1.x-link2y*ferm_in_1.y+
                               link2z*ferm_in_2.x-link2w*ferm_in_2.y;
  ferm_out[1][1][threadIdx] += link1z*ferm_in_0.y+link1w*ferm_in_0.x+
                               link2x*ferm_in_1.y+link2y*ferm_in_1.x+
                               link2z*ferm_in_2.y+link2w*ferm_in_2.x;
  ferm_out[2][0][threadIdx] += stag_phase*(C1RED*ferm_in_0.x-C1IMD*ferm_in_0.y+
                                           C2RED*ferm_in_1.x-C2IMD*ferm_in_1.y+
                                           C3RED*ferm_in_2.x-C3IMD*ferm_in_2.y);
  ferm_out[2][1][threadIdx] += stag_phase*(C1RED*ferm_in_0.y+C1IMD*ferm_in_0.x+
                                           C2RED*ferm_in_1.y+C2IMD*ferm_in_1.x+
                                           C3RED*ferm_in_2.y+C3IMD*ferm_in_2.x);

  // DIRECTION 2
  site_table[threadIdx] = tables[idx+6*size_dev];
  brd = (site_table[threadIdx] < size_dev)?0:1;
  size_dev_t = (brd)?bsize_dev:size_dev;
  stag_phase = (double) phases[idx+2*size_dev];
  ferm_in_0 = fermion_texRef[ site_table[threadIdx]];
  ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]];
  ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]];
  LoadLinkDDRegs(gauge_texRef, size_dev, idx, 2);
  ferm_out[0][0][threadIdx] += link0x*ferm_in_0.x-link0y*ferm_in_0.y+
                               link0z*ferm_in_1.x-link0w*ferm_in_1.y+
                               link1x*ferm_in_2.x-link1y*ferm_in_2.y;
  ferm_out[0][1][threadIdx] += link0x*ferm_in_0.y+link0y*ferm_in_0.x+
                               link0z*ferm_in_1.y+link0w*ferm_in_1.x+
                               link1x*ferm_in_2.y+link1y*ferm_in_2.x;
  ferm_out[1][0][threadIdx] += link1z*ferm_in_0.x-link1w*ferm_in_0.y+
                               link2x*ferm_in_1.x-link2y*ferm_in_1.y+
                               link2z*ferm_in_2.x-link2w*ferm_in_2.y;
  ferm_out[1][1][threadIdx] += link1z*ferm_in_0.y+link1w*ferm_in_0.x+
                               link2x*ferm_in_1.y+link2y*ferm_in_1.x+
                               link2z*ferm_in_2.y+link2w*ferm_in_2.x;
  ferm_out[2][0][threadIdx] += stag_phase*(C1RED*ferm_in_0.x-C1IMD*ferm_in_0.y+
                                           C2RED*ferm_in_1.x-C2IMD*ferm_in_1.y+
                                           C3RED*ferm_in_2.x-C3IMD*ferm_in_2.y);
  ferm_out[2][1][threadIdx] += stag_phase*(C1RED*ferm_in_0.y+C1IMD*ferm_in_0.x+
                                           C2RED*ferm_in_1.y+C2IMD*ferm_in_1.x+
                                           C3RED*ferm_in_2.y+C3IMD*ferm_in_2.x);

  // DIRECTION 3
  site_table[threadIdx] = tables[idx+7*size_dev];
  brd = (site_table[threadIdx] < size_dev)?0:1;
  size_dev_t = (brd)?bsize_dev:size_dev;
  stag_phase = (double) phases[idx+3*size_dev];
  ferm_in_0 = fermion_texRef[ site_table[threadIdx]];
  ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]];
  ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]];
  LoadLinkDDRegs(gauge_texRef, size_dev, idx, 3);
  ferm_out[0][0][threadIdx] += link0x*ferm_in_0.x-link0y*ferm_in_0.y+
                               link0z*ferm_in_1.x-link0w*ferm_in_1.y+
                               link1x*ferm_in_2.x-link1y*ferm_in_2.y;
  ferm_out[0][1][threadIdx] += link0x*ferm_in_0.y+link0y*ferm_in_0.x+
                               link0z*ferm_in_1.y+link0w*ferm_in_1.x+
                               link1x*ferm_in_2.y+link1y*ferm_in_2.x;
  ferm_out[1][0][threadIdx] += link1z*ferm_in_0.x-link1w*ferm_in_0.y+
                               link2x*ferm_in_1.x-link2y*ferm_in_1.y+
                               link2z*ferm_in_2.x-link2w*ferm_in_2.y;
  ferm_out[1][1][threadIdx] += link1z*ferm_in_0.y+link1w*ferm_in_0.x+
                               link2x*ferm_in_1.y+link2y*ferm_in_1.x+
                               link2z*ferm_in_2.y+link2w*ferm_in_2.x;
  ferm_out[2][0][threadIdx] += stag_phase*(C1RED*ferm_in_0.x-C1IMD*ferm_in_0.y+
                                           C2RED*ferm_in_1.x-C2IMD*ferm_in_1.y+
                                           C3RED*ferm_in_2.x-C3IMD*ferm_in_2.y);
  ferm_out[2][1][threadIdx] += stag_phase*(C1RED*ferm_in_0.y+C1IMD*ferm_in_0.x+
                                           C2RED*ferm_in_1.y+C2IMD*ferm_in_1.x+
                                           C3RED*ferm_in_2.y+C3IMD*ferm_in_2.x);
  //---------------------------------------------------end of first block

  // DIRECTION 0
  site_table[threadIdx] = tables[idx];
  brd = (site_table[threadIdx] < size_dev)?0:1;
  size_dev_t = (brd)?bsize_dev:size_dev;
  ferm_in_0 = fermion_texRef[ site_table[threadIdx]];
  ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]];
  ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]];
  LoadLinkDDRegs(gauge_texRef, size_dev_t, site_table[threadIdx], 0);
  ferm_out[0][0][threadIdx] -= link0x*ferm_in_0.x+link0y*ferm_in_0.y +
                               link1z*ferm_in_1.x+link1w*ferm_in_1.y +
                               C1RED*ferm_in_2.x +C1IMD*ferm_in_2.y;
  ferm_out[0][1][threadIdx] -= link0x*ferm_in_0.y-link0y*ferm_in_0.x +
                               link1z*ferm_in_1.y-link1w*ferm_in_1.x +
                               C1RED*ferm_in_2.y -C1IMD*ferm_in_2.x;
  ferm_out[1][0][threadIdx] -= link0z*ferm_in_0.x+link0w*ferm_in_0.y +
                               link2x*ferm_in_1.x+link2y*ferm_in_1.y +
                               C2RED*ferm_in_2.x +C2IMD*ferm_in_2.y;
  ferm_out[1][1][threadIdx] -= link0z*ferm_in_0.y-link0w*ferm_in_0.x +
                               link2x*ferm_in_1.y-link2y*ferm_in_1.x +
                               C2RED*ferm_in_2.y -C2IMD*ferm_in_2.x;
  ferm_out[2][0][threadIdx] -= link1x*ferm_in_0.x+link1y*ferm_in_0.y +
                               link2z*ferm_in_1.x+link2w*ferm_in_1.y +
                               C3RED*ferm_in_2.x +C3IMD*ferm_in_2.y;
  ferm_out[2][1][threadIdx] -= link1x*ferm_in_0.y-link1y*ferm_in_0.x +
                               link2z*ferm_in_1.y-link2w*ferm_in_1.x +
                               C3RED*ferm_in_2.y -C3IMD*ferm_in_2.x;

  // DIRECTION 1
  site_table[threadIdx] = tables[idx+size_dev];
  brd = (site_table[threadIdx] < size_dev)?0:1;
  size_dev_t = (brd)?bsize_dev:size_dev;
  stag_phase = (double) phases[site_table[threadIdx]+size_dev_t];
  ferm_in_0 = fermion_texRef[ site_table[threadIdx]];
  ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]];
  ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]];
  LoadLinkDDRegs(gauge_texRef, size_dev_t, site_table[threadIdx], 1);
  ferm_out[0][0][threadIdx] -= link0x*ferm_in_0.x+link0y*ferm_in_0.y +
                               link1z*ferm_in_1.x+link1w*ferm_in_1.y +
                               stag_phase*(C1RED*ferm_in_2.x+C1IMD*ferm_in_2.y);
  ferm_out[0][1][threadIdx] -= link0x*ferm_in_0.y-link0y*ferm_in_0.x +
                               link1z*ferm_in_1.y-link1w*ferm_in_1.x +
                               stag_phase*(C1RED*ferm_in_2.y-C1IMD*ferm_in_2.x);
  ferm_out[1][0][threadIdx] -= link0z*ferm_in_0.x+link0w*ferm_in_0.y +
                               link2x*ferm_in_1.x+link2y*ferm_in_1.y +
                               stag_phase*(C2RED*ferm_in_2.x+C2IMD*ferm_in_2.y);
  ferm_out[1][1][threadIdx] -= link0z*ferm_in_0.y-link0w*ferm_in_0.x +
                               link2x*ferm_in_1.y-link2y*ferm_in_1.x +
                               stag_phase*(C2RED*ferm_in_2.y-C2IMD*ferm_in_2.x);
  ferm_out[2][0][threadIdx] -= link1x*ferm_in_0.x+link1y*ferm_in_0.y +
                               link2z*ferm_in_1.x+link2w*ferm_in_1.y +
                               stag_phase*(C3RED*ferm_in_2.x+C3IMD*ferm_in_2.y);
  ferm_out[2][1][threadIdx] -= link1x*ferm_in_0.y-link1y*ferm_in_0.x +
                               link2z*ferm_in_1.y-link2w*ferm_in_1.x +
                               stag_phase*(C3RED*ferm_in_2.y-C3IMD*ferm_in_2.x);

  // DIRECTION 2
  site_table[threadIdx] = tables[idx+2*size_dev];
  brd = (site_table[threadIdx] < size_dev)?0:1;
  size_dev_t = (brd)?bsize_dev:size_dev;
  stag_phase = (double) phases[site_table[threadIdx]+2*size_dev_t];
  ferm_in_0 = fermion_texRef[ site_table[threadIdx]];
  ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]];
  ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]];
  LoadLinkDDRegs(gauge_texRef, size_dev_t, site_table[threadIdx], 2);
  ferm_out[0][0][threadIdx] -= link0x*ferm_in_0.x+link0y*ferm_in_0.y +
                               link1z*ferm_in_1.x+link1w*ferm_in_1.y +
                               stag_phase*(C1RED*ferm_in_2.x+C1IMD*ferm_in_2.y);
  ferm_out[0][1][threadIdx] -= link0x*ferm_in_0.y-link0y*ferm_in_0.x +
                               link1z*ferm_in_1.y-link1w*ferm_in_1.x +
                               stag_phase*(C1RED*ferm_in_2.y-C1IMD*ferm_in_2.x);
  ferm_out[1][0][threadIdx] -= link0z*ferm_in_0.x+link0w*ferm_in_0.y +
                               link2x*ferm_in_1.x+link2y*ferm_in_1.y +
                               stag_phase*(C2RED*ferm_in_2.x+C2IMD*ferm_in_2.y);
  ferm_out[1][1][threadIdx] -= link0z*ferm_in_0.y-link0w*ferm_in_0.x +
                               link2x*ferm_in_1.y-link2y*ferm_in_1.x +
                               stag_phase*(C2RED*ferm_in_2.y-C2IMD*ferm_in_2.x);
  ferm_out[2][0][threadIdx] -= link1x*ferm_in_0.x+link1y*ferm_in_0.y +
                               link2z*ferm_in_1.x+link2w*ferm_in_1.y +
                               stag_phase*(C3RED*ferm_in_2.x+C3IMD*ferm_in_2.y);
  ferm_out[2][1][threadIdx] -= link1x*ferm_in_0.y-link1y*ferm_in_0.x +
                               link2z*ferm_in_1.y-link2w*ferm_in_1.x +
                               stag_phase*(C3RED*ferm_in_2.y-C3IMD*ferm_in_2.x);

  // DIRECTION 3
  site_table[threadIdx] = tables[idx+3*size_dev];
  brd = (site_table[threadIdx] < size_dev)?0:1;
  size_dev_t = (brd)?bsize_dev:size_dev;
  stag_phase = (double) phases[site_table[threadIdx]+3*size_dev_t];
  ferm_in_0 = fermion_texRef[ site_table[threadIdx]];
  ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]];
  ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]];
  LoadLinkDDRegs(gauge_texRef, size_dev_t, site_table[threadIdx], 3);
  ferm_out[0][0][threadIdx] -= link0x*ferm_in_0.x+link0y*ferm_in_0.y +
                               link1z*ferm_in_1.x+link1w*ferm_in_1.y +
                               stag_phase*(C1RED*ferm_in_2.x+C1IMD*ferm_in_2.y);
  ferm_out[0][1][threadIdx] -= link0x*ferm_in_0.y-link0y*ferm_in_0.x +
                               link1z*ferm_in_1.y-link1w*ferm_in_1.x +
                               stag_phase*(C1RED*ferm_in_2.y-C1IMD*ferm_in_2.x);
  ferm_out[1][0][threadIdx] -= link0z*ferm_in_0.x+link0w*ferm_in_0.y +
                               link2x*ferm_in_1.x+link2y*ferm_in_1.y +
                               stag_phase*(C2RED*ferm_in_2.x+C2IMD*ferm_in_2.y);
  ferm_out[1][1][threadIdx] -= link0z*ferm_in_0.y-link0w*ferm_in_0.x +
                               link2x*ferm_in_1.y-link2y*ferm_in_1.x +
                               stag_phase*(C2RED*ferm_in_2.y-C2IMD*ferm_in_2.x);
  ferm_out[2][0][threadIdx] -= link1x*ferm_in_0.x+link1y*ferm_in_0.y +
                               link2z*ferm_in_1.x+link2w*ferm_in_1.y +
                               stag_phase*(C3RED*ferm_in_2.x+C3IMD*ferm_in_2.y);
  ferm_out[2][1][threadIdx] -= link1x*ferm_in_0.y-link1y*ferm_in_0.x +
                               link2z*ferm_in_1.y-link2w*ferm_in_1.x +
                               stag_phase*(C3RED*ferm_in_2.y-C3IMD*ferm_in_2.x);
  //-------------------------------------------------end of second block

  // even
  ferm_in_0 = fermion_texRef[ idx];
  ferm_in_1 = fermion_texRef[ size_dev + idx];
  ferm_in_2 = fermion_texRef[ 2*size_dev + idx];
  out[idx ].x = mass_d_dev*ferm_in_0.x - ferm_out[0][0][threadIdx]*(double)0.5;
  out[idx ].y = mass_d_dev*ferm_in_0.y - ferm_out[0][1][threadIdx]*(double)0.5;
  out[idx + size_dev ].x = mass_d_dev*ferm_in_1.x - ferm_out[1][0][threadIdx]*(double)0.5;
  out[idx + size_dev ].y = mass_d_dev*ferm_in_1.y - ferm_out[1][1][threadIdx]*(double)0.5;
  out[idx + 2*size_dev ].x = mass_d_dev*ferm_in_2.x - ferm_out[2][0][threadIdx]*(double)0.5;
  out[idx + 2*size_dev ].y = mass_d_dev*ferm_in_2.y - ferm_out[2][1][threadIdx]*(double)0.5;

  // odd
  out[idx + size_dev_h ].x = (double)0.0;
  out[idx + size_dev_h ].y = (double)0.0;
  out[idx + size_dev + size_dev_h ].x = (double)0.0;
  out[idx + size_dev + size_dev_h ].y = (double)0.0;
  out[idx + 2*size_dev + size_dev_h ].x = (double)0.0;
  out[idx + 2*size_dev + size_dev_h ].y = (double)0.0;
  //-------------------------------------------------end of DslashDagger
}
On which GPU are you running?
Can you tell us with which driver and SDK it was working?
Please give us your system information (OS, CPU, GPU, SDK version, and driver version).
ibird,
I tried the code on my Juniper (though I had to change the doubles to floats, as Juniper doesn't support DP). I observed small errors between the CPU and GPU results, which can be accounted for by floating-point inaccuracies (the CPU should be more accurate).
Can you tell us which GPU you were using? I will try to test the code on a DP-capable GPU soon.
Thanks
Sorry for the late answer.
I am running on a 5870, Ubuntu 9.04, SDK 2.3, Catalyst 10.12.
I will test on SDK 2.4 as soon as possible (in about a week).
Test done with ATI SDK 2.4 on an HD5870, Catalyst 11.3, Ubuntu 10.04 LTS: same result discrepancy.
The bug is fixed in AMD APP SDK v2.6 and Catalyst 11.11.