Hi
I'm fairly new to OpenCL and I'm having a great deal of trouble trying to copy data (6 times int8 vectors) from the device to the host. I've set all the important bits that I can identify with, as described below. This is running with the following:
Windows 7 x64
Visual Studio 2008
AMD Catalyst 12.8
SDK 2.7.923.1
Tahiti 7970.
Any help would be very much appreciated.
Host:
cl_mem inDevice, outDevice;
cl_int err;
size_t global;
global = 1;
vecPart = 0;
inDevice = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(cl_int), NULL, NULL);
outDevice = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_int8) * global * 6, NULL, NULL);
.
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &inDevice);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &outDevice);
.
err = clEnqueueWriteBuffer(commands, inDevice, CL_TRUE, 0, sizeof(int), &vecPart, 0, NULL, NULL);
.
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
.
Device:
__kernel void myproc(__global int *position, __global int *outBuffer)
{
int8 tallyL = 0;
int8 tallyLO = 0;
int8 tallyLOS = 0;
int8 tallyLM = 0;
int8 tallyLMO = 0;
int8 tallyLMOS = 0;
.
.
.
id = get_global_id(0);
id *= 6;
vstore8(tallyL, id, outBuffer); // Device stops responding but recovers
vstore8(tallyLO, id + 1, outBuffer);
vstore8(tallyLOS, id + 2, outBuffer);
vstore8(tallyLM, id + 3, outBuffer);
vstore8(tallyLMO, id + 4, outBuffer);
vstore8(tallyLMOS, id + 5, outBuffer);
}
I then commented the vstore commands and the kernel completed without any problems, albeit I didn't get any results.
.
.
.
id = get_global_id(0);
id *= 6;
// vstore8(tallyL, id, outBuffer); // Uncomment and device stops responding but recovers
// vstore8(tallyLO, id + 1, outBuffer);
// vstore8(tallyLOS, id + 2, outBuffer);
// vstore8(tallyLM, id + 3, outBuffer);
// vstore8(tallyLMO, id + 4, outBuffer);
// vstore8(tallyLMOS, id + 5, outBuffer);
}
I then changed the vstore commands to array writes as described below.
.
.
.
id = get_global_id(0);
id *= 6;
id *= sizeof(int8); // Everything below commented and completion is in 6 milli-seconds
// outBuffer[id] = tallyL.s0; // Uncommented increases to 7.5 seconds
// outBuffer[id + 1] = tallyL.s1; // Uncommented with above line increases to 8.5 seconds
// outBuffer[id + 2] = tallyL.s2; // Uncommented with above 2 lines increases to 11.5 seconds
// outBuffer[id + 3] = tallyL.s3; // Uncommented with above 3 lines - device stops responding but recovers
Solved! Go to Solution.
Ok, I've managed to sort this out once and for all.
In HKLM\System\CurrentControlSet\Control\GraphicsDrivers I added the following:
TdrDelay 80h
TdrDdiDelay 80h
TdrLevel 0
Yes, the values are a little extreme but so is the workload that I'm asking of the kernel on each iteration, which is now at 16 seconds - and stable.
Seems like it needed more than the TdrDelay (as well as putting into CurrentControlSet). It would probably be very helpful to put this sort of control into CCC, so that the user can make these changes on the fly without having to restart their PC,
when you comment all writes compiler will optimize out whole kernel so it is empty. thats why it takes 6ms. when device recover you cross watchdog timer which assume that GPU is locked up so it reset it. try lower global size so it won't take that long.
Hi nou, thank you for your response.
My global size for this exercise is currently 1. I plan to increase this when I get the problem sorted out, but for now I just have a single work item within the queue.
For the timing, I'm using:
tic = time();
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
toc = time();
exTime = toc - tic; // etc
When I'm writing to the outBuffer, it is at the very end of the kernel and outside any while loops. While I can accept that for no writes it may optimise out the kernel and that for one write (with the kernel) it may take 7.5 seconds. However, why it would increase by another second for an additional write and then an additional 3 seconds for the subsequent write is beyond me, which would indicate that I have a problem somewhere, as well as the gpu not responding and then recovering (but with a broken execution).
I was wondering (and hoping) that I've missed something out or made a mistake somewhere (being new to this) that is causing the problem.
Ok, I've managed to ascertain that I don't have a problem with my memory allocations or my vstores, per se. I appear to have narrowed the problem down to a popcount statement in the kernel...phew, where I'm trying to determine the number of bits that have changed between the original vector and the manipulated vector.
I had
diff = popcount(stage.s04261537 ^ in);
but replacing it to
diff = (int8) 1;
and I can do the transfer ok. However, if I try tidying it up to
diff = popcount(stage ^ in);
then the problem reoccurs. Both stage and in variables are int8 vectors. I would really like to use the popcount function as it's exactly what I need.
Unfortunately, there isn't much to go on in the OpenCL1.2 pdf for popcount apart from two references:
gentype popcount (gentype x) Returns the number of non-zero bits in x.
and
New built-in functions
Any assistance would be appreciated.
Hi
I've managed to debug everything and ensure that everything is working correctly. However, I'm still experiencing a problem where the I appear to be encountering a "Display driver stopped responding and has recovered" message.
My kernel takes about 6.5 seconds to run each time, which appears to be conflicting with the default TdrDelay of 8 (seconds I presume), especially when a Windows event occurs (restart hdds for example). I've applied the fix and have increased the TdrDelay in HKEY_LOCAL_MACHINE\SYSTEM\ControlSet002\Control\GraphicsDrivers to 20h so that should hopefully fix the problem.
Ok...my simulation is running but it is stopping approximately every 7 hours for some reason. I know this because the GPU activity on the Catalyst Command Center goes to zero and my C program stalls. While I'm thankful that I've been dumping output to screen and file every iteration so I can restart where it failed fairly quickly, it's becomming somewhat of a pita to schedule my time around this for an anticipated 56 hour simulation. I may have to rerun this several times using different techniques too.
Any ideas what could be causing the GPU to be reinitialized approximately every 7 hours or anything else I can try?
Ok, I've managed to sort this out once and for all.
In HKLM\System\CurrentControlSet\Control\GraphicsDrivers I added the following:
TdrDelay 80h
TdrDdiDelay 80h
TdrLevel 0
Yes, the values are a little extreme but so is the workload that I'm asking of the kernel on each iteration, which is now at 16 seconds - and stable.
Seems like it needed more than the TdrDelay (as well as putting into CurrentControlSet). It would probably be very helpful to put this sort of control into CCC, so that the user can make these changes on the fly without having to restart their PC,