cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

ibird
Adept I

Bad results CPU from GPU updating from v2.2 to 2.3

As promised here ( http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=145087&enterthread=y ) this is a simple test case to reproduce the issue

This code simply create some buffers fermion, gauge, table, eta do some calculation
into the kernel than print out the result with this format " (n buffer index) value"

Now running on GPU i get

 ( 0 ) 29.9808543717 ( 1 ) -30.5956300187 ( 2 ) 39.421714042 ( 3 ) -65.1260533737 ( 4 ) 36.0566958781 ( 5 ) -48.636864618 ( 6 )
26.4227317676 ( 7 ) -62.8406544328 ( 8 ) 0 ( 9 ) 0 ( 10 ) 0 ( 11 ) 0 ( 12 ) 0 ( 13 ) 0 ( 14 ) 0 ( 15 ) 0 ( 16 ) 30.6030340145 ( 17 )
 -51.5434104653 ( 18 ) 38.6194434131 ( 19 ) -51.0394421127 ( 20 ) 28.8687884796 ( 21 ) -46.1770377079 ( 22 ) 38.3811695387 ( 23 )
-59.4306483653 ( 24 ) 0 ( 25 ) 0 ( 26 ) 0 ( 27 ) 0 ( 28 ) 0 ( 29 ) 0 ( 30 ) 0 ( 31 ) 0 ( 32 ) 37.6226123498 ( 33 ) -33.3488500679 ( 34 )
 37.7464948716 ( 35 ) -9.0478474369 ( 36 ) 53.1298898854 ( 37 ) 12.1399464495 ( 38 ) 33.5315784346 ( 39 ) -8.02393247651 ( 40 ) 0 ( 41 )
 0 ( 42 ) 0 ( 43 ) 0 ( 44 ) 0 ( 45 ) 0 ( 46 ) 0 ( 47 ) 0

Now running on CPU i get

 ( 0 ) 29.9808543717 ( 1 ) -30.5956300187 ( 2 ) 39.421714042 ( 3 ) -65.1260533737 ( 4 ) 36.0566958781 ( 5 ) -48.636864618 ( 6 )
26.4227317676 ( 7 ) -62.8406544328 ( 8 ) 0 ( 9 ) 0 ( 10 ) 0 ( 11 ) 0 ( 12 ) 0 ( 13 ) 0 ( 14 ) 0 ( 15 ) 0 ( 16 ) 30.6030340145 ( 17 )
 -51.5434104653 ( 18 ) 38.6194434131 ( 19 ) -51.0394421127 ( 20 ) 29.490630454 ( 21 ) -45.8204361705 ( 22 ) 38.2942618562 ( 23 )
-59.8944927521 ( 24 ) 0 ( 25 ) 0 ( 26 ) 0 ( 27 ) 0 ( 28 ) 0 ( 29 ) 0 ( 30 ) 0 ( 31 ) 0 ( 32 ) 37.6226123498 ( 33 ) -33.3488500679 ( 34 )
 37.7464948716 ( 35 ) -9.0478474369 ( 36 ) 51.3163383616 ( 37 ) 11.6219953431 ( 38 ) 33.7140404702 ( 39 ) -8.27195007533 ( 40 ) 0 ( 41 )
 0 ( 42 ) 0 ( 43 ) 0 ( 44 ) 0 ( 45 ) 0 ( 46 ) 0 ( 47 ) 0


Are the equal from 0 to 19 are different from 20 to 47


The test case has no buffer overflown, you can check this


workgroup is 2, total work is 4          NUM_THREADS=2


fermion has dimension 8*6      = 48 double
gauge has dimension   8*12*4*2 = 768 float
table has dimansion   8*8      = 64 unsigned int        site table store number from 0 to 7  table=(i*123)%NSITE
eta has dimension     8*4      = 32 int

size_dev = 8 b_size_dev = 0 mass = 0.1

inside the kernel the maxium fermion buffer address is  2*size_dev_t + site_table[threadIdx] so 2*8 + 7 = 23 for double2 = 47 < 48            (good no overflown)
inside the kernel the maxium gauge   buffer address is  12*vol + idx + vol*(2+3*mu)         so 12*8+4+8*(2+3*3) = 188 for float4 = 752 < 768  (good no overflown)
inside the kernel the maxium table   buffer address is  idx+7*size_dev                      so 4+8*7 = 60 < 64                                (good no overflown)
inside the kernel the maxium eta     buffer address is  site_table[threadIdx]+3*size_dev_t  so 4+3*7 = 25 < 32                                (good no overflown)

Locals are

  __local double ferm_out[3][2][NUM_THREADS]; NUM_THREADS = 2
  __local int site_table[NUM_THREADS];

both are adressed with threadIdx where workgroup = 2, so ThreadIdx is 0 or 1

Assuming this, and the fact that the memory is all initialized, there is no way CPU and GPU must converge on the same result.
But this does not happen. CPU has correct result, GPU no.
Unfortunately is difficult to reduce the kernel, the bug disappear trying to reduce the kernel. The kernel has some unsusefull instruction,
but are usefull to reproduce the issue.

example

  brd = (site_table[threadIdx] < size_dev)?0:1;
  size_dev_t = (brd)?bsize_dev:size_dev;

size_dev_t is always = size_dev because site_table[threadIdx] < size_dev is always true, but deleting this instruction the issue disappear
it also disappear if i try to reduce the number of indipendent component calculated, so this is the maxium i can do. The problem
is limited to ATI GPU (OpenCL CPU , OpenCL GPU Nvidia work without problems), the problem has appeared with latest driver catalyst 10.12 (stable)
 + SDK 2.3, downgrading the SDK to 2.2 do not solvethe problem. I am on Ubuntu 9.04

/////////////////// bug.cpp #define RESULT_OK 1 #define RESULT_FAIL -1 #include <CL/cl.h> #include <iostream> #include <iomanip> #include <fstream> #include <string.h> #define ATI_PLATFORM 0 #define ATI_DEVICE_1 1 #define NSITE 8 #define FERMION_SIZE NSITE*6 #define GAUGE_SIZE NSITE*12*4*2 #define TABLE_SIZE NSITE*8 #define ETA_SIZE NSITE*4 double * fermion; float * gauge; unsigned int * table; int * eta; struct test { unsigned int x; unsigned int y; }; int main() { int id_dev = 0; cl_int status = 0; cl_uint nplat = 0; /* Get number of platforms */\ status = clGetPlatformIDs(0,NULL,&nplat);\ if(status != CL_SUCCESS)\ { std::cerr << "Error: clGetPlatformIDs error \n"; return -1; } /* Get all platforms ID */ cl_platform_id * platformst = new cl_platform_id [nplat]; status = clGetPlatformIDs(nplat,platformst,NULL); if(status != CL_SUCCESS) { std::cerr << "Error: clGetPlatformIDs error \n"; return -1; } /* Get devices */ unsigned int num_devices; status = clGetDeviceIDs(platformst[ATI_PLATFORM],CL_DEVICE_TYPE_ALL,0,NULL,&num_devices); if (status != CL_SUCCESS) { std::cerr << "Error: clGetDeviceIDs error \n"; return -1; } /* Get all devices ID in a platform */ cl_device_id * devicest = new cl_device_id [num_devices]; status = clGetDeviceIDs(platformst[ATI_PLATFORM],CL_DEVICE_TYPE_ALL,num_devices,devicest,NULL); if (status != CL_SUCCESS) { std::cerr << "Error: clGetDeviceIDs error \n"; return -1; } /* create a context */ cl_context_properties prop[3]; prop[0] = CL_CONTEXT_PLATFORM; prop[1] = (cl_context_properties)platformst[ATI_PLATFORM]; prop[2] = 0; cl_device_id * devid = new cl_device_id [num_devices]; for (int j = 0 ; j < (int)num_devices ; j++) { devid = devicest; } cl_context context = clCreateContext(prop, num_devices, devid, NULL, NULL ,&status); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateContext error \n"; return -1; } /* Load and Build Kernel */ cl_program prgdslash; std::ifstream loadf; loadf.open("DslashDaggerDDKernelEO.cl",std::ios::binary); if (!loadf.is_open()) return RESULT_FAIL; /* get file size */ loadf.seekg (0, std::ios::end); int sz = loadf.tellg(); loadf.seekg (0, std::ios::beg); unsigned int sourceSize1 = sz; char * source = new char [sz]; /* Load the dump */ loadf.read((char *)source,sz); loadf.close(); prgdslash = clCreateProgramWithSource(context, 1, (const char **)&source, &sourceSize1, &status); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateProgramWithSource error \n"; return -1; } status = clBuildProgram(prgdslash, 1, &devicest[ATI_DEVICE_1], "-D NUM_THREADS=2 -D FP64_AMD", NULL, NULL); if(status != CL_SUCCESS) { std::cerr << "Error: clBuildProgram error \n"; return -1; } cl_kernel kernelsDslash_t = clCreateKernel(prgdslash, "DslashDaggerDDKernelEO", &status); /* for one device in a platform we create 2 commandQueue */ cl_command_queue * commandQueue = new cl_command_queue [2]; commandQueue[0] = clCreateCommandQueue(context, devicest[ATI_DEVICE_1], 0, &status); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateCommandQueue error \n"; return -1; } /* Load buffer */ fermion = new double [FERMION_SIZE]; for (unsigned int i = 0 ; i < FERMION_SIZE ; i++) { fermion = 1.0+(0.4343234*(float)i - (int)(0.4343234*i)); } gauge = new float [GAUGE_SIZE]; for (unsigned int i = 0 ; i < GAUGE_SIZE ; i++) { gauge = 1.0+(0.8463654*(float)i - (int)(0.8463654*i)); } table = new unsigned int [TABLE_SIZE]; for (unsigned int i = 0 ; i < TABLE_SIZE ; i++) { table = (i*123)%NSITE; } eta = new int [ETA_SIZE]; for (unsigned int i = 0 ; i < ETA_SIZE ; i++) { eta = 2*(i%2)-1; } size_t global[3] = {0,0,0}; size_t local[3] = {0,0,0}; cl_mem bufout = clCreateBuffer(context, CL_MEM_READ_WRITE, FERMION_SIZE*sizeof(cl_double), NULL, &status); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error \n"; return -1; } cl_mem buffermion = clCreateBuffer(context, CL_MEM_READ_WRITE, FERMION_SIZE*sizeof(cl_double), NULL, &status); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error \n"; return -1; } cl_mem bufgauge = clCreateBuffer(context, CL_MEM_READ_WRITE, GAUGE_SIZE*sizeof(cl_float), NULL, &status); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error \n"; return -1; } cl_mem buftable = clCreateBuffer(context, CL_MEM_READ_WRITE, TABLE_SIZE*sizeof(cl_int), NULL, &status); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error \n"; return -1; } cl_mem bufeta = clCreateBuffer(context, CL_MEM_READ_WRITE, ETA_SIZE*sizeof(cl_int), NULL, &status); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error \n"; return -1; } status = clEnqueueWriteBuffer(commandQueue[0],buffermion,CL_TRUE,0,FERMION_SIZE*sizeof(cl_double),fermion,0,NULL,NULL); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error \n"; return -1; } status = clEnqueueWriteBuffer(commandQueue[0],bufgauge,CL_TRUE,0,GAUGE_SIZE*sizeof(cl_float),gauge,0,NULL,NULL); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error \n"; return -1; } status = clEnqueueWriteBuffer(commandQueue[0],buftable,CL_TRUE,0,TABLE_SIZE*sizeof(cl_uint),table,0,NULL,NULL); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error \n"; return -1; } status = clEnqueueWriteBuffer(commandQueue[0],bufeta,CL_TRUE,0,ETA_SIZE*sizeof(cl_uint),eta,0,NULL,NULL); if(status != CL_SUCCESS) { std::cerr << "Error: clCreateBuffer error \n"; return -1; } global[0] = NSITE/2; local[0] = 2; unsigned int size_dev = NSITE; unsigned int size_dev_h = NSITE/2; unsigned int b_size_dev = 0; double mass = 0.1; status = clSetKernelArg(kernelsDslash_t, 0, sizeof(cl_mem), (void *)&bufout); if(status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error \n"; return -1; } status = clSetKernelArg(kernelsDslash_t, 1, sizeof(cl_mem), (void *)&bufgauge); if(status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error \n"; return -1; } status = clSetKernelArg(kernelsDslash_t, 2, sizeof(cl_mem), (void *)&buffermion); if(status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error \n"; return -1; } status = clSetKernelArg(kernelsDslash_t, 3, sizeof(cl_mem), (void *)&buftable); if(status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error \n"; return -1; } status = clSetKernelArg(kernelsDslash_t, 4, sizeof(cl_mem), (void *)&bufeta); if(status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error \n"; return -1; } status = clSetKernelArg(kernelsDslash_t, 5, sizeof(cl_uint), (void *)&size_dev); if(status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error \n"; return -1; } status = clSetKernelArg(kernelsDslash_t, 6, sizeof(cl_uint), (void *)&size_dev_h); if(status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error \n"; return -1; } status = clSetKernelArg(kernelsDslash_t, 7, sizeof(cl_uint), (void *)&b_size_dev); if(status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error \n"; return -1; } status = clSetKernelArg(kernelsDslash_t, 8, sizeof(cl_double), (void *)&mass); if(status != CL_SUCCESS) { std::cerr << "Error: clSetKernelArg error \n"; return -1; } status = clEnqueueNDRangeKernel(commandQueue[0],kernelsDslash_t,1,NULL,global,local,0,NULL,NULL); if(status != CL_SUCCESS) { std::cerr << "Error: clEnqueueNDRangeKernel error \n"; return -1; } clFinish(commandQueue[0]); double bufferout[FERMION_SIZE]; clEnqueueReadBuffer(commandQueue[0],bufout,CL_TRUE,0,FERMION_SIZE*sizeof(cl_double),bufferout,0,NULL,NULL); for (unsigned int i = 0 ; i < FERMION_SIZE; i++) { std::cerr << " ( " << i << " ) " << std::setprecision(12) << bufferout; } clReleaseCommandQueue(commandQueue[0]); clReleaseKernel(kernelsDslash_t); clReleaseProgram(prgdslash); delete [] devid; delete [] devicest; delete [] platformst; return 0; } //////////// DslashDaggerDDKernelEO.cl #ifdef FP64_AMD #pragma OPENCL EXTENSION cl_amd_fp64 : enable #endif #ifdef FP64 #pragma OPENCL EXTENSION cl_khr_fp64 : enable #endif #define C1RED ( (link0z*link2z-link0w*link2w) - (link1x*link2x-link1y*link2y) ) #define C1IMD (-(link0z*link2w+link0w*link2z) + (link1x*link2y+link1y*link2x) ) #define C2RED ( (link1x*link1z-link1y*link1w) - (link0x*link2z-link0y*link2w) ) #define C2IMD (-(link1x*link1w+link1y*link1z) + (link0x*link2w+link0y*link2z) ) #define C3RED ( (link0x*link2x-link0y*link2y) - (link0z*link1z-link0w*link1w) ) #define C3IMD (-(link0x*link2y+link0y*link2x) + (link0z*link1w+link0w*link1z) ) #define LoadLinkDDRegs(gauge, vol, idx, mu) \ auxlink = gauge_texRef[ idx + vol*(0+3*mu)];\ link0x=(double) auxlink.x;\ link0y=(double) auxlink.y;\ link0z=(double) auxlink.z;\ link0w=(double) auxlink.w;\ auxlink = gauge_texRef[ idx + vol*(1+3*mu)];\ link1x=(double) auxlink.x;\ link1y=(double) auxlink.y;\ link1z=(double) auxlink.z;\ link1w=(double) auxlink.w;\ auxlink = gauge_texRef[ idx + vol*(2+3*mu)];\ link2x=(double) auxlink.x;\ link2y=(double) auxlink.y;\ link2z=(double) auxlink.z;\ link2w=(double) auxlink.w;\ /* 2nd float*/\ auxlink = gauge_texRef[ 12*vol + idx + vol*(0+3*mu)];\ link0x+=(double) auxlink.x;\ link0y+=(double) auxlink.y;\ link0z+=(double) auxlink.z;\ link0w+=(double) auxlink.w;\ auxlink = gauge_texRef[ 12*vol + idx + vol*(1+3*mu)];\ link1x+=(double) auxlink.x;\ link1y+=(double) auxlink.y;\ link1z+=(double) auxlink.z;\ link1w+=(double) auxlink.w;\ auxlink = gauge_texRef[ 12*vol + idx + vol*(2+3*mu)];\ link2x+=(double) auxlink.x;\ link2y+=(double) auxlink.y;\ link2z+=(double) auxlink.z;\ link2w+=(double) auxlink.w; __kernel void DslashDaggerDDKernelEO(__global double2 *out, __global float4 * gauge_texRef, __global double2 *fermion_texRef, __global int *tables, __global int *phases, const unsigned int size_dev, const unsigned int size_dev_h, const unsigned int bsize_dev, const double mass_d_dev) { unsigned int brd; int threadIdx = {get_local_id(0)}; int blockDim = {get_local_size(0)}; int blockIdx = {get_group_id(0)}; int idx = blockIdx*blockDim + threadIdx; // idx< sizeh, EVEN!! double stag_phase = 1.0; //Store result in sharedMem __local double ferm_out[3][2][NUM_THREADS]; unsigned int size_dev_t; //New tables indexing (index fastest) __local int site_table[NUM_THREADS]; //Load link matrix U_mu(ix) in registers double link0x, link0y, link0z, link0w, link1x, link1y, link1z, link1w, link2x, link2y, link2z, link2w; float4 auxlink; double2 ferm_in_0, ferm_in_1, ferm_in_2; // DIRECTION 0 site_table[threadIdx] = tables[idx+4*size_dev]; brd = (site_table[threadIdx] < size_dev)?0:1; size_dev_t = (brd)?bsize_dev:size_dev; ferm_in_0 = fermion_texRef[ site_table[threadIdx]]; ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]]; ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]]; LoadLinkDDRegs(gauge_texRef,size_dev,idx,0); ferm_out[0][0][threadIdx] = link0x*ferm_in_0.x-link0y*ferm_in_0.y+ link0z*ferm_in_1.x-link0w*ferm_in_1.y+ link1x*ferm_in_2.x-link1y*ferm_in_2.y; ferm_out[0][1][threadIdx] = link0x*ferm_in_0.y+link0y*ferm_in_0.x+ link0z*ferm_in_1.y+link0w*ferm_in_1.x+ link1x*ferm_in_2.y+link1y*ferm_in_2.x; ferm_out[1][0][threadIdx] = link1z*ferm_in_0.x-link1w*ferm_in_0.y+ link2x*ferm_in_1.x-link2y*ferm_in_1.y+ link2z*ferm_in_2.x-link2w*ferm_in_2.y; ferm_out[1][1][threadIdx] = link1z*ferm_in_0.y+link1w*ferm_in_0.x+ link2x*ferm_in_1.y+link2y*ferm_in_1.x+ link2z*ferm_in_2.y+link2w*ferm_in_2.x; ferm_out[2][0][threadIdx] = C1RED*ferm_in_0.x-C1IMD*ferm_in_0.y+ C2RED*ferm_in_1.x-C2IMD*ferm_in_1.y+ C3RED*ferm_in_2.x-C3IMD*ferm_in_2.y; ferm_out[2][1][threadIdx] = C1RED*ferm_in_0.y+C1IMD*ferm_in_0.x+ C2RED*ferm_in_1.y+C2IMD*ferm_in_1.x+ C3RED*ferm_in_2.y+C3IMD*ferm_in_2.x; //DIRECTION 1 site_table[threadIdx] = tables[idx+5*size_dev]; brd = (site_table[threadIdx] < size_dev)?0:1; size_dev_t = (brd)?bsize_dev:size_dev; stag_phase = (double) phases[idx+size_dev]; ferm_in_0 = fermion_texRef[ site_table[threadIdx]]; ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]]; ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]]; LoadLinkDDRegs(gauge_texRef,size_dev,idx,1); ferm_out[0][0][threadIdx] += link0x*ferm_in_0.x-link0y*ferm_in_0.y+ link0z*ferm_in_1.x-link0w*ferm_in_1.y+ link1x*ferm_in_2.x-link1y*ferm_in_2.y; ferm_out[0][1][threadIdx] += link0x*ferm_in_0.y+link0y*ferm_in_0.x+ link0z*ferm_in_1.y+link0w*ferm_in_1.x+ link1x*ferm_in_2.y+link1y*ferm_in_2.x; ferm_out[1][0][threadIdx] += link1z*ferm_in_0.x-link1w*ferm_in_0.y+ link2x*ferm_in_1.x-link2y*ferm_in_1.y+ link2z*ferm_in_2.x-link2w*ferm_in_2.y; ferm_out[1][1][threadIdx] += link1z*ferm_in_0.y+link1w*ferm_in_0.x+ link2x*ferm_in_1.y+link2y*ferm_in_1.x+ link2z*ferm_in_2.y+link2w*ferm_in_2.x; ferm_out[2][0][threadIdx] += stag_phase*(C1RED*ferm_in_0.x-C1IMD*ferm_in_0.y+ C2RED*ferm_in_1.x-C2IMD*ferm_in_1.y+ C3RED*ferm_in_2.x-C3IMD*ferm_in_2.y); ferm_out[2][1][threadIdx] += stag_phase*(C1RED*ferm_in_0.y+C1IMD*ferm_in_0.x+ C2RED*ferm_in_1.y+C2IMD*ferm_in_1.x+ C3RED*ferm_in_2.y+C3IMD*ferm_in_2.x); //DIRECTION 2 site_table[threadIdx] = tables[idx+6*size_dev]; brd = (site_table[threadIdx] < size_dev)?0:1; size_dev_t = (brd)?bsize_dev:size_dev; stag_phase = (double) phases[idx+2*size_dev]; ferm_in_0 = fermion_texRef[ site_table[threadIdx]]; ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]]; ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]]; LoadLinkDDRegs(gauge_texRef, size_dev, idx, 2); ferm_out[0][0][threadIdx] += link0x*ferm_in_0.x-link0y*ferm_in_0.y+ link0z*ferm_in_1.x-link0w*ferm_in_1.y+ link1x*ferm_in_2.x-link1y*ferm_in_2.y; ferm_out[0][1][threadIdx] += link0x*ferm_in_0.y+link0y*ferm_in_0.x+ link0z*ferm_in_1.y+link0w*ferm_in_1.x+ link1x*ferm_in_2.y+link1y*ferm_in_2.x; ferm_out[1][0][threadIdx] += link1z*ferm_in_0.x-link1w*ferm_in_0.y+ link2x*ferm_in_1.x-link2y*ferm_in_1.y+ link2z*ferm_in_2.x-link2w*ferm_in_2.y; ferm_out[1][1][threadIdx] += link1z*ferm_in_0.y+link1w*ferm_in_0.x+ link2x*ferm_in_1.y+link2y*ferm_in_1.x+ link2z*ferm_in_2.y+link2w*ferm_in_2.x; ferm_out[2][0][threadIdx] += stag_phase*(C1RED*ferm_in_0.x-C1IMD*ferm_in_0.y+ C2RED*ferm_in_1.x-C2IMD*ferm_in_1.y+ C3RED*ferm_in_2.x-C3IMD*ferm_in_2.y); ferm_out[2][1][threadIdx] += stag_phase*(C1RED*ferm_in_0.y+C1IMD*ferm_in_0.x+ C2RED*ferm_in_1.y+C2IMD*ferm_in_1.x+ C3RED*ferm_in_2.y+C3IMD*ferm_in_2.x); //DIRECTION 3 site_table[threadIdx] = tables[idx+7*size_dev]; brd = (site_table[threadIdx] < size_dev)?0:1; size_dev_t = (brd)?bsize_dev:size_dev; stag_phase = (double) phases[idx+3*size_dev]; ferm_in_0 = fermion_texRef[ site_table[threadIdx]]; ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]]; ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]]; LoadLinkDDRegs(gauge_texRef, size_dev, idx, 3); ferm_out[0][0][threadIdx] += link0x*ferm_in_0.x-link0y*ferm_in_0.y+ link0z*ferm_in_1.x-link0w*ferm_in_1.y+ link1x*ferm_in_2.x-link1y*ferm_in_2.y; ferm_out[0][1][threadIdx] += link0x*ferm_in_0.y+link0y*ferm_in_0.x+ link0z*ferm_in_1.y+link0w*ferm_in_1.x+ link1x*ferm_in_2.y+link1y*ferm_in_2.x; ferm_out[1][0][threadIdx] += link1z*ferm_in_0.x-link1w*ferm_in_0.y+ link2x*ferm_in_1.x-link2y*ferm_in_1.y+ link2z*ferm_in_2.x-link2w*ferm_in_2.y; ferm_out[1][1][threadIdx] += link1z*ferm_in_0.y+link1w*ferm_in_0.x+ link2x*ferm_in_1.y+link2y*ferm_in_1.x+ link2z*ferm_in_2.y+link2w*ferm_in_2.x; ferm_out[2][0][threadIdx] += stag_phase*(C1RED*ferm_in_0.x-C1IMD*ferm_in_0.y+ C2RED*ferm_in_1.x-C2IMD*ferm_in_1.y+ C3RED*ferm_in_2.x-C3IMD*ferm_in_2.y); ferm_out[2][1][threadIdx] += stag_phase*(C1RED*ferm_in_0.y+C1IMD*ferm_in_0.x+ C2RED*ferm_in_1.y+C2IMD*ferm_in_1.x+ C3RED*ferm_in_2.y+C3IMD*ferm_in_2.x); //---------------------------------------------------end of first block //DIRECTION 0 site_table[threadIdx] = tables[idx]; brd = (site_table[threadIdx] < size_dev)?0:1; size_dev_t = (brd)?bsize_dev:size_dev; ferm_in_0 = fermion_texRef[ site_table[threadIdx]]; ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]]; ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]]; LoadLinkDDRegs(gauge_texRef, size_dev_t, site_table[threadIdx], 0); ferm_out[0][0][threadIdx] -= link0x*ferm_in_0.x+link0y*ferm_in_0.y + link1z*ferm_in_1.x+link1w*ferm_in_1.y + C1RED*ferm_in_2.x +C1IMD*ferm_in_2.y; ferm_out[0][1][threadIdx] -= link0x*ferm_in_0.y-link0y*ferm_in_0.x + link1z*ferm_in_1.y-link1w*ferm_in_1.x + C1RED*ferm_in_2.y -C1IMD*ferm_in_2.x; ferm_out[1][0][threadIdx] -= link0z*ferm_in_0.x+link0w*ferm_in_0.y + link2x*ferm_in_1.x+link2y*ferm_in_1.y + C2RED*ferm_in_2.x +C2IMD*ferm_in_2.y; ferm_out[1][1][threadIdx] -= link0z*ferm_in_0.y-link0w*ferm_in_0.x + link2x*ferm_in_1.y-link2y*ferm_in_1.x + C2RED*ferm_in_2.y -C2IMD*ferm_in_2.x; ferm_out[2][0][threadIdx] -= link1x*ferm_in_0.x+link1y*ferm_in_0.y + link2z*ferm_in_1.x+link2w*ferm_in_1.y + C3RED*ferm_in_2.x +C3IMD*ferm_in_2.y; ferm_out[2][1][threadIdx] -= link1x*ferm_in_0.y-link1y*ferm_in_0.x + link2z*ferm_in_1.y-link2w*ferm_in_1.x + C3RED*ferm_in_2.y -C3IMD*ferm_in_2.x; //DIRECTION 1 site_table[threadIdx] = tables[idx+size_dev]; brd = (site_table[threadIdx] < size_dev)?0:1; size_dev_t = (brd)?bsize_dev:size_dev; stag_phase = (double) phases[site_table[threadIdx]+size_dev_t]; ferm_in_0 = fermion_texRef[ site_table[threadIdx]]; ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]]; ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]]; LoadLinkDDRegs(gauge_texRef, size_dev_t, site_table[threadIdx] , 1); ferm_out[0][0][threadIdx] -= link0x*ferm_in_0.x+link0y*ferm_in_0.y + link1z*ferm_in_1.x+link1w*ferm_in_1.y + stag_phase*(C1RED*ferm_in_2.x+C1IMD*ferm_in_2.y); ferm_out[0][1][threadIdx] -= link0x*ferm_in_0.y-link0y*ferm_in_0.x + link1z*ferm_in_1.y-link1w*ferm_in_1.x + stag_phase*(C1RED*ferm_in_2.y-C1IMD*ferm_in_2.x); ferm_out[1][0][threadIdx] -= link0z*ferm_in_0.x+link0w*ferm_in_0.y + link2x*ferm_in_1.x+link2y*ferm_in_1.y + stag_phase*(C2RED*ferm_in_2.x+C2IMD*ferm_in_2.y); ferm_out[1][1][threadIdx] -= link0z*ferm_in_0.y-link0w*ferm_in_0.x + link2x*ferm_in_1.y-link2y*ferm_in_1.x + stag_phase*(C2RED*ferm_in_2.y-C2IMD*ferm_in_2.x); ferm_out[2][0][threadIdx] -= link1x*ferm_in_0.x+link1y*ferm_in_0.y + link2z*ferm_in_1.x+link2w*ferm_in_1.y + stag_phase*(C3RED*ferm_in_2.x+C3IMD*ferm_in_2.y); ferm_out[2][1][threadIdx] -= link1x*ferm_in_0.y-link1y*ferm_in_0.x + link2z*ferm_in_1.y-link2w*ferm_in_1.x + stag_phase*(C3RED*ferm_in_2.y- C3IMD*ferm_in_2.x); //DIRECTION 2 site_table[threadIdx] = tables[idx+2*size_dev]; brd = (site_table[threadIdx] < size_dev)?0:1; size_dev_t = (brd)?bsize_dev:size_dev; stag_phase = (double) phases[site_table[threadIdx]+2*size_dev_t]; ferm_in_0 = fermion_texRef[ site_table[threadIdx]]; ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]]; ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]]; LoadLinkDDRegs(gauge_texRef, size_dev_t, site_table[threadIdx], 2); ferm_out[0][0][threadIdx] -= link0x*ferm_in_0.x+link0y*ferm_in_0.y + link1z*ferm_in_1.x+link1w*ferm_in_1.y + stag_phase*(C1RED*ferm_in_2.x+ C1IMD*ferm_in_2.y); ferm_out[0][1][threadIdx] -= link0x*ferm_in_0.y-link0y*ferm_in_0.x + link1z*ferm_in_1.y-link1w*ferm_in_1.x + stag_phase*(C1RED*ferm_in_2.y- C1IMD*ferm_in_2.x); ferm_out[1][0][threadIdx] -= link0z*ferm_in_0.x+link0w*ferm_in_0.y + link2x*ferm_in_1.x+link2y*ferm_in_1.y + stag_phase*(C2RED*ferm_in_2.x+ C2IMD*ferm_in_2.y); ferm_out[1][1][threadIdx] -= link0z*ferm_in_0.y-link0w*ferm_in_0.x + link2x*ferm_in_1.y-link2y*ferm_in_1.x + stag_phase*(C2RED*ferm_in_2.y- C2IMD*ferm_in_2.x); ferm_out[2][0][threadIdx] -= link1x*ferm_in_0.x+link1y*ferm_in_0.y + link2z*ferm_in_1.x+link2w*ferm_in_1.y + stag_phase*(C3RED*ferm_in_2.x+ C3IMD*ferm_in_2.y); ferm_out[2][1][threadIdx] -= link1x*ferm_in_0.y-link1y*ferm_in_0.x + link2z*ferm_in_1.y-link2w*ferm_in_1.x + stag_phase*(C3RED*ferm_in_2.y- C3IMD*ferm_in_2.x); //DIRECTION 3 site_table[threadIdx] = tables[idx+3*size_dev]; brd = (site_table[threadIdx] < size_dev)?0:1; size_dev_t = (brd)?bsize_dev:size_dev; stag_phase = (double) phases[site_table[threadIdx]+3*size_dev_t]; ferm_in_0 = fermion_texRef[ site_table[threadIdx]]; ferm_in_1 = fermion_texRef[ size_dev_t + site_table[threadIdx]]; ferm_in_2 = fermion_texRef[ 2*size_dev_t + site_table[threadIdx]]; LoadLinkDDRegs(gauge_texRef, size_dev_t, site_table[threadIdx], 3); ferm_out[0][0][threadIdx] -= link0x*ferm_in_0.x+link0y*ferm_in_0.y + link1z*ferm_in_1.x+link1w*ferm_in_1.y + stag_phase*(C1RED*ferm_in_2.x+ C1IMD*ferm_in_2.y); ferm_out[0][1][threadIdx] -= link0x*ferm_in_0.y-link0y*ferm_in_0.x + link1z*ferm_in_1.y-link1w*ferm_in_1.x + stag_phase*(C1RED*ferm_in_2.y- C1IMD*ferm_in_2.x); ferm_out[1][0][threadIdx] -= link0z*ferm_in_0.x+link0w*ferm_in_0.y + link2x*ferm_in_1.x+link2y*ferm_in_1.y + stag_phase*(C2RED*ferm_in_2.x+ C2IMD*ferm_in_2.y); ferm_out[1][1][threadIdx] -= link0z*ferm_in_0.y-link0w*ferm_in_0.x + link2x*ferm_in_1.y-link2y*ferm_in_1.x + stag_phase*(C2RED*ferm_in_2.y- C2IMD*ferm_in_2.x); ferm_out[2][0][threadIdx] -= link1x*ferm_in_0.x+link1y*ferm_in_0.y + link2z*ferm_in_1.x+link2w*ferm_in_1.y + stag_phase*(C3RED*ferm_in_2.x+ C3IMD*ferm_in_2.y); ferm_out[2][1][threadIdx] -= link1x*ferm_in_0.y-link1y*ferm_in_0.x + link2z*ferm_in_1.y-link2w*ferm_in_1.x + stag_phase*(C3RED*ferm_in_2.y- C3IMD*ferm_in_2.x); //-------------------------------------------------end of second block // even ferm_in_0 = fermion_texRef[ idx]; ferm_in_1 = fermion_texRef[ size_dev + idx]; ferm_in_2 = fermion_texRef[ 2*size_dev + idx]; out[idx ].x = mass_d_dev*ferm_in_0.x - ferm_out[0][0][threadIdx]*(double)0.5; out[idx ].y = mass_d_dev*ferm_in_0.y - ferm_out[0][1][threadIdx]*(double)0.5; out[idx + size_dev ].x = mass_d_dev*ferm_in_1.x - ferm_out[1][0][threadIdx]*(double)0.5; out[idx + size_dev ].y = mass_d_dev*ferm_in_1.y - ferm_out[1][1][threadIdx]*(double)0.5; out[idx + 2*size_dev ].x = mass_d_dev*ferm_in_2.x - ferm_out[2][0][threadIdx]*(double)0.5; out[idx + 2*size_dev ].y = mass_d_dev*ferm_in_2.y - ferm_out[2][1][threadIdx]*(double)0.5; // odd out[idx + size_dev_h ].x = (double)0.0; out[idx + size_dev_h ].y = (double)0.0; out[idx + size_dev + size_dev_h ].x = (double)0.0; out[idx + size_dev + size_dev_h ].y = (double)0.0; out[idx + 2*size_dev + size_dev_h ].x = (double)0.0; out[idx + 2*size_dev + size_dev_h ].y = (double)0.0; //-------------------------------------------------end of DslashDagger }

0 Likes
7 Replies
genaganna
Journeyman III

Originally posted by: ibird As promised here ( http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=145087&enterthread=y ) this is a simple test case to reproduce the issue This code simply create some buffers fermion, gauge, table, eta do some calculation into the kernel than print out the result with this format " (n buffer index) value" Now running on GPU i get  ( 0 ) 29.9808543717 ( 1 ) -30.5956300187 ( 2 ) 39.421714042 ( 3 ) -65.1260533737 ( 4 ) 36.0566958781 ( 5 ) -48.636864618 ( 6 ) 26.4227317676 ( 7 ) -62.8406544328 ( 8 ) 0 ( 9 ) 0 ( 10 ) 0 ( 11 ) 0 ( 12 ) 0 ( 13 ) 0 ( 14 ) 0 ( 15 ) 0 ( 16 ) 30.6030340145 ( 17 )  -51.5434104653 ( 18 ) 38.6194434131 ( 19 ) -51.0394421127 ( 20 ) 28.8687884796 ( 21 ) -46.1770377079 ( 22 ) 38.3811695387 ( 23 ) -59.4306483653 ( 24 ) 0 ( 25 ) 0 ( 26 ) 0 ( 27 ) 0 ( 28 ) 0 ( 29 ) 0 ( 30 ) 0 ( 31 ) 0 ( 32 ) 37.6226123498 ( 33 ) -33.3488500679 ( 34 )  37.7464948716 ( 35 ) -9.0478474369 ( 36 ) 53.1298898854 ( 37 ) 12.1399464495 ( 38 ) 33.5315784346 ( 39 ) -8.02393247651 ( 40 ) 0 ( 41 )  0 ( 42 ) 0 ( 43 ) 0 ( 44 ) 0 ( 45 ) 0 ( 46 ) 0 ( 47 ) 0 Now running on CPU i get  ( 0 ) 29.9808543717 ( 1 ) -30.5956300187 ( 2 ) 39.421714042 ( 3 ) -65.1260533737 ( 4 ) 36.0566958781 ( 5 ) -48.636864618 ( 6 ) 26.4227317676 ( 7 ) -62.8406544328 ( 8 ) 0 ( 9 ) 0 ( 10 ) 0 ( 11 ) 0 ( 12 ) 0 ( 13 ) 0 ( 14 ) 0 ( 15 ) 0 ( 16 ) 30.6030340145 ( 17 )  -51.5434104653 ( 18 ) 38.6194434131 ( 19 ) -51.0394421127 ( 20 ) 29.490630454 ( 21 ) -45.8204361705 ( 22 ) 38.2942618562 ( 23 ) -59.8944927521 ( 24 ) 0 ( 25 ) 0 ( 26 ) 0 ( 27 ) 0 ( 28 ) 0 ( 29 ) 0 ( 30 ) 0 ( 31 ) 0 ( 32 ) 37.6226123498 ( 33 ) -33.3488500679 ( 34 )  37.7464948716 ( 35 ) -9.0478474369 ( 36 ) 51.3163383616 ( 37 ) 11.6219953431 ( 38 ) 33.7140404702 ( 39 ) -8.27195007533 ( 40 ) 0 ( 41 )  0 ( 42 ) 0 ( 43 ) 0 ( 44 ) 0 ( 45 ) 0 ( 46 ) 0 ( 47 ) 0 Are the equal from 0 to 19 are different from 20 to 47 The test case has no buffer overflown, you can check this workgroup is 2, total work is 4          NUM_THREADS=2 fermion has dimension 8*6      = 48 double gauge has dimension   8*12*4*2 = 768 float table has dimansion   8*8      = 64 unsigned int        site table store number from 0 to 7  table=(i*123)%NSITE eta has dimension     8*4      = 32 int size_dev = 8 b_size_dev = 0 mass = 0.1 inside the kernel the maxium fermion buffer address is  2*size_dev_t + site_table[threadIdx] so 2*8 + 7 = 23 for double2 = 47 < 48            (good no overflown) inside the kernel the maxium gauge   buffer address is  12*vol + idx + vol*(2+3*mu)         so 12*8+4+8*(2+3*3) = 188 for float4 = 752 < 768  (good no overflown) inside the kernel the maxium table   buffer address is  idx+7*size_dev                      so 4+8*7 = 60 < 64                                (good no overflown) inside the kernel the maxium eta     buffer address is  site_table[threadIdx]+3*size_dev_t  so 4+3*7 = 25 < 32                                (good no overflown) Locals are   __local double ferm_out[3][2][NUM_THREADS]; NUM_THREADS = 2   __local int site_table[NUM_THREADS]; both are adressed with threadIdx where workgroup = 2, so ThreadIdx is 0 or 1 Assuming this, and the fact that the memory is all initialized, there is no way CPU and GPU must converge on the same result. But this does not happen. CPU has correct result, GPU no. Unfortunately is difficult to reduce the kernel, the bug disappear trying to reduce the kernel. The kernel has some unsusefull instruction, but are usefull to reproduce the issue. example   brd = (site_table[threadIdx] < size_dev)?0:1;   size_dev_t = (brd)?bsize_dev:size_dev; size_dev_t is always = size_dev because site_table[threadIdx] < size_dev is always true, but deleting this instruction the issue disappear it also disappear if i try to reduce the number of indipendent component calculated, so this is the maxium i can do. The problem is limited to ATI GPU (OpenCL CPU , OpenCL GPU Nvidia work without problems), the problem has appeared with latest driver catalyst 10.12 (stable)  + SDK 2.3, downgrading the SDK to 2.2 do not solvethe problem. I am on Ubuntu 9.04

 

On which GPU are you running?

0 Likes

Can you tell us with which driver and SDK it was working?

Please give us your system information(OS, CPU, GPU, SDK version and Driver version)

0 Likes

ibird,

I tried the code on my juniper( though i had to change doubles to float as juniper doesn't support DP). I observed small errors between CPU and GPU code, which can be accounted by floating point inaccuracies( CPU should be more accurate).

Can you tell what GPU were you using? I will try to test the code on a DP GPU soon.

Thanks

0 Likes

Sorry for my answer in late

 

i am running on 5870, ubuntu 9.04, SDK 2.3, catalyst 10.12

0 Likes

I will do a test on SDK 2.4 as soon as possible (one week probably)

0 Likes

 

 

Test done on ati SDK 2.4 on HD5870 catalyst 11.3 Ubuntu 10.04 LTS,same result discrepancy.

0 Likes

Bug is solved on AMD APP v2.6 and Catalyst 11.11

 

 

0 Likes