cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

ankhster
Adept II

Problem writing to write buffer from within kernel

Hi

I'm fairly new to OpenCL and I'm having a great deal of trouble trying to copy data (6 times int8 vectors) from the device to the host. I've set all the important bits that I can identify with, as described below. This is running with the following:

Windows 7 x64

Visual Studio 2008

AMD Catalyst 12.8

SDK 2.7.923.1

Tahiti 7970.

Any help would be very much appreciated.

Host:

   cl_mem    inDevice, outDevice;

   cl_int    err;

   size_t    global;

   global = 1;

   vecPart = 0;

   inDevice = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(cl_int), NULL, NULL);

   outDevice = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_int8) * global * 6, NULL, NULL);

   .

   err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &inDevice);

   err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &outDevice);

   .

   err = clEnqueueWriteBuffer(commands, inDevice, CL_TRUE, 0, sizeof(int), &vecPart, 0, NULL, NULL);

   .

   err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

   .

Device:

__kernel void myproc(__global int *position, __global int *outBuffer)

{

    int8    tallyL = 0;

    int8    tallyLO = 0;

    int8    tallyLOS = 0;

    int8    tallyLM = 0;

    int8    tallyLMO = 0;

    int8    tallyLMOS = 0;

    .

    .

    .

    id = get_global_id(0);

    id *= 6;

    vstore8(tallyL, id, outBuffer);        // Device stops responding but recovers

    vstore8(tallyLO, id + 1, outBuffer);

    vstore8(tallyLOS, id + 2, outBuffer);

    vstore8(tallyLM, id + 3, outBuffer);

    vstore8(tallyLMO, id + 4, outBuffer);

    vstore8(tallyLMOS, id + 5, outBuffer);

}

I then commented the vstore commands and the kernel completed without any problems, albeit I didn't get any results.

    .

    .

    .

    id = get_global_id(0);

    id *= 6;

//    vstore8(tallyL, id, outBuffer);        // Uncomment and device stops responding but recovers

//    vstore8(tallyLO, id + 1, outBuffer);

//    vstore8(tallyLOS, id + 2, outBuffer);

//    vstore8(tallyLM, id + 3, outBuffer);

//    vstore8(tallyLMO, id + 4, outBuffer);

//    vstore8(tallyLMOS, id + 5, outBuffer);

}

I then changed the vstore commands to array writes as described below.

    .

    .

    .

    id = get_global_id(0);

    id *= 6;

    id *= sizeof(int8);        // Everything below commented and completion is in 6 milli-seconds

//    outBuffer[id] = tallyL.s0;    // Uncommented increases to 7.5 seconds

//    outBuffer[id + 1] = tallyL.s1;    // Uncommented with above line increases to 8.5 seconds

//    outBuffer[id + 2] = tallyL.s2;    // Uncommented with above 2 lines increases to 11.5 seconds

//    outBuffer[id + 3] = tallyL.s3;    // Uncommented with above 3 lines - device stops responding but recovers

0 Likes
1 Solution

Ok, I've managed to sort this out once and for all.

In HKLM\System\CurrentControlSet\Control\GraphicsDrivers I added the following:

TdrDelay 80h

TdrDdiDelay 80h

TdrLevel 0

Yes, the values are a little extreme but so is the workload that I'm asking of the kernel on each iteration, which is now at 16 seconds - and stable.

Seems like it needed more than the TdrDelay (as well as putting into CurrentControlSet). It would probably be very helpful to put this sort of control into CCC, so that the user can make these changes on the fly without having to restart their PC,

View solution in original post

0 Likes
6 Replies
nou
Exemplar

when you comment all writes compiler will optimize out whole kernel so it is empty. thats why it takes 6ms. when device recover you cross watchdog timer which assume that GPU is locked up so it reset it. try lower global size so it won't take that long.

Hi nou, thank you for your response.

My global size for this exercise is currently 1. I plan to increase this when I get the problem sorted out, but for now I just have a single work item within the queue.

For the timing, I'm using:

  tic = time();
  err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
  toc = time();

  exTime = toc - tic; // etc

When I'm writing to the outBuffer, it is at the very end of the kernel and outside any while loops. While I can accept that for no writes it may optimise out the kernel and that for one write (with the kernel) it may take 7.5 seconds. However, why it would increase by another second for an additional write and then an additional 3 seconds for the subsequent write is beyond me, which would indicate that I have a problem somewhere, as well as the gpu not responding and then recovering (but with a broken execution).

I was wondering (and hoping) that I've missed something out or made a mistake somewhere (being new to this) that is causing the problem.

0 Likes
ankhster
Adept II

Ok, I've managed to ascertain that I don't have a problem with my memory allocations or my vstores, per se. I appear to have narrowed the problem down to a popcount statement in the kernel...phew, where I'm trying to determine the number of bits that have changed between the original vector and the manipulated vector.

I had

   diff = popcount(stage.s04261537 ^ in);

but replacing it to

   diff = (int8) 1;

and I can do the transfer ok. However, if I try tidying it up to

   diff = popcount(stage ^ in);

then the problem reoccurs. Both stage and in variables are int8 vectors. I would really like to use the popcount function as it's exactly what I need.

Unfortunately, there isn't much to go on in the OpenCL1.2 pdf for popcount apart from two references:

gentype popcount (gentype x) Returns the number of non-zero bits in x.

and

New built-in functions

  • Functions to read from and write to a 1D image, 1D and 2D image arrays described in sections 6.12.14.2, 6.12.14.3 and 6.12.14.4.
  • Sampler-less image read functions described in section 6.12.14.3.
  • popcount integer function described in section 6.12.3.
  • printf function described in section 6.12.13.

Any assistance would be appreciated.

0 Likes
ankhster
Adept II

Hi

I've managed to debug everything and ensure that everything is working correctly. However, I'm still experiencing a problem where the I appear to be encountering a "Display driver stopped responding and has recovered" message.

My kernel takes about 6.5 seconds to run each time, which appears to be conflicting with the default TdrDelay of 8 (seconds I presume), especially when a Windows event occurs (restart hdds for example). I've applied the fix and have increased the TdrDelay in HKEY_LOCAL_MACHINE\SYSTEM\ControlSet002\Control\GraphicsDrivers to 20h so that should hopefully fix the problem.

0 Likes

Ok...my simulation is running but it is stopping approximately every 7 hours for some reason. I know this because the GPU activity on the Catalyst Command Center goes to zero and my C program stalls. While I'm thankful that I've been dumping output to screen and file every iteration so I can restart where it failed fairly quickly, it's becomming somewhat of a pita to schedule my time around this for an anticipated 56 hour simulation. I may have to rerun this several times using different techniques too.

Any ideas what could be causing the GPU to be reinitialized approximately every 7 hours or anything else I can try?

0 Likes

Ok, I've managed to sort this out once and for all.

In HKLM\System\CurrentControlSet\Control\GraphicsDrivers I added the following:

TdrDelay 80h

TdrDdiDelay 80h

TdrLevel 0

Yes, the values are a little extreme but so is the workload that I'm asking of the kernel on each iteration, which is now at 16 seconds - and stable.

Seems like it needed more than the TdrDelay (as well as putting into CurrentControlSet). It would probably be very helpful to put this sort of control into CCC, so that the user can make these changes on the fly without having to restart their PC,

0 Likes