
timchist
Elite

AMD 79xx GPUs skip kernel execution for certain indices

I'm experiencing a strange problem that occurs on 7950 and 7970 cards, but does not happen on 5850 and 6870.

My application processes images in tiles. For each tile, a series of OpenCL kernels is called. When the tile size becomes relatively small (say, 128x128), some parts of the output image may not be fully processed. I simplified my algorithm so that it only executes the following operations for each tile:

  1. Temp1 = 100
  2. Temp2 = 30
  3. Temp1 = Temp1 + Temp2
  4. Dst = Src + Temp1

(Temp1, Temp2, Src, Dst are all vectors of 128x128).
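For reference, the four steps above can be modeled on the host like this. This is a minimal sketch, not the original application code: the function name, buffer names, and the float element type are my assumptions.

```c
#include <stddef.h>

#define TILE_ELEMS (128 * 128)

/* Host-side reference for the four per-tile steps; the float element
   type and all names here are assumptions, not the original code. */
static void tile_reference(const float *src, float *dst,
                           float *temp1, float *temp2)
{
    for (size_t i = 0; i < TILE_ELEMS; ++i) temp1[i] = 100.0f;          /* step 1 */
    for (size_t i = 0; i < TILE_ELEMS; ++i) temp2[i] = 30.0f;           /* step 2 */
    for (size_t i = 0; i < TILE_ELEMS; ++i) temp1[i] += temp2[i];       /* step 3 */
    for (size_t i = 0; i < TILE_ELEMS; ++i) dst[i] = src[i] + temp1[i]; /* step 4 */
}
```

A correct run must leave Temp1 at 130 everywhere and every Dst component at Src + 130, which is exactly what the host-side check below looks for.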

After that I call clFinish and copy Temp1, Temp2, Src and Dst to host memory for checking. For those tiles that have been calculated incorrectly, I have found out that:

  • Temp1 is equal to 130 for all vector components
  • Temp2 is equal to 30 for all vector components
  • Dst is not equal to Src + Temp1 (Src + 130) for some vector components, but is rather equal to Src + 100

The number of incorrect vector components is often (but not always) divisible by 64, so it seems that under some circumstances whole wavefronts get skipped.
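The host-side check boils down to counting components that still hold the pre-step-3 value of Temp1. A sketch of that counter (the helper name is mine, not the application's):

```c
#include <stddef.h>

/* Count components where dst still reflects Temp1 = 100, i.e. the
   Temp1 += Temp2 update was missed. Illustrative helper, not the
   original application code. */
static size_t count_stale(const float *src, const float *dst, size_t n)
{
    size_t stale = 0;
    for (size_t i = 0; i < n; ++i)
        if (dst[i] == src[i] + 100.0f)
            ++stale;
    return stale;
}
```

On bad tiles the returned count is often (but not always) a multiple of 64, i.e. `count_stale(...) % 64 == 0`, which is what suggests whole wavefronts are being skipped.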

Even though the problem is 100% reproducible in this simplified version of our application, it does not show up when I try to write a standalone test, even when it very accurately models the behaviour of the application. Apparently there are some other factors that trigger the problem that I'm not aware of.

I'm attaching a screenshot showing a fragment of the output from our application. Grid indicates the tile boundary. If the output was correct, all the image would be equally pink, without any stripes.

The larger the tiles become, the lower the likelihood of the problem appearing.

My best uneducated guess is that something wrong is happening when kernels are scheduled to hardware either on driver or on firmware level.

I tried several driver versions, specifically Catalyst 12.4, 12.8, 12.10, 13.1, 13.3 beta and 13.4. I also tried two different 7970s in two computers (one based on an AMD FX 8350, the other on an i7 3770K), as well as a 7950 in a computer based on an i7 3930K. All computers ran Windows 7 x64; we did not check under Linux or Mac OS. The problem occurred in all these configurations.

Does that ring a bell?

0 Likes
1 Solution
timchist
Elite

The problem is no longer reproducible with the Catalyst 13.6 beta driver. Apparently, something has been fixed. Thank you all for your help.


0 Likes
48 Replies
vmiura
Adept II

Are you using any complex control flow?


I ran into 2 bugs that I could reproduce on 13.3 beta:

while (a && b) {}  // loops even when b is false

do
{
    // ... store some debug output ...
    if (a) return;  // <- a return inside a do-while loop caused register clobbering, and weird data was stored to my debug buffer
} while (b);

0 Likes

In the full version we may have some non-trivial flow control operators.

However, as I wrote in the post, the "simplified" version only uses very simple kernels, such as Add or Memset, each containing a single if:

if (x < size)
{
    ...
}

0 Likes
timchist
Elite

So far I have two hypotheses for why this problem may occur:

  • since the task size is small and not all compute units are utilised, the GPU may attempt to schedule the next kernel onto free compute units while the previous kernel has not yet finished; this may be caused by an error in dependency analysis
  • a cache coherency problem: the second Add call (Dst = Src + Temp1) is executed on the compute unit that previously executed the first Fill with 100 (Temp1 = 100), and for some reason the cache of this compute unit was not updated with the subsequent value of 130 (after Temp1 = Temp1 + Temp2 executed)
0 Likes
timchist
Elite

I have just got confirmation that the behaviour I am experiencing is caused by executing two kernels with dependencies in parallel. Please see the two attached screenshots, one showing the timeline for a correct tile, the other for an incorrect tile.

As you can see, for a correct tile the GPU executes Temp1 = 100 and Temp2 = 30 in parallel. That's OK, as there are no dependencies between them. Temp1 = Temp1 + Temp2 and Dst = Src + Temp1 are executed sequentially: the first kernel modifies Temp1, so the second one depends on the results of the first.

For a tile that is calculated incorrectly the timeline is different: Temp1 = 100 and Temp2 = 30 are executed sequentially, but two Add calls are incorrectly executed in parallel.

Is there a workaround?

0 Likes

Is it an asynchronous queue?

Technically you should execute clEnqueueBarrier() or clEnqueueMarker() between kernels if you don't want them to execute in parallel, although I thought the current drivers don't support asynchronous execution.

0 Likes

No, this queue is synchronous. In addition, as far as I know, AMD's OpenCL implementation does not support asynchronous queues.

In synchronous (in-order) queues, kernels MUST execute sequentially even without clEnqueueBarrier, clEnqueueMarker or any other explicit synchronization points. From the spec: "CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE: Determines whether the commands queued in the command-queue are executed in-order or out-of-order. If set, the commands in the command-queue are executed out-of-order. Otherwise, commands are executed in-order."

I'd say it's OK to execute commands in parallel even in a synchronous queue, but only if there are definitely no dependencies between them. That is not the case here.

0 Likes

Yeah, it shouldn't need extra synchronization.

I've seen some unexpected overlaps in CodeXL kernel tracing before though, so I'm not sure you can trust that the kernels are actually overlapping.

Do you get the same bug if you use clFinish() to force sync between the kernel dispatches?

0 Likes

No, inserting clFinish helps to avoid the errors, but with a performance penalty of ~30%.

0 Likes

Can you post a small repro case so that we can take this issue up?

0 Likes

I would be happy to, but as I wrote earlier, I can't isolate this into a small sample despite all my efforts: the problem magically disappears there.

I can provide you with the source of my kernels and with the APP profiler trace files if it helps.

0 Likes

Hi Timchist,

Are you able to create a repro case now, after figuring out the asynchronous-execution issue with the kernels? I can send the test case to AMD engineers to fix the issue.

0 Likes

Please see my reply just above.

0 Likes

I did read that you were not able to reproduce it with a minimal test case, but I cannot help you without a repro case.

I asked for a test case again because you have done quite a bit of work since the first post. Sorry for the confusion.

0 Likes

Dear Himanshu,

Even though I don't have a working example demonstrating the problem, I'm pretty sure the problem exists.

I found a workaround for the problem: if, for kernels such as Add(Src1, Src2, Dst), I never pass the same pointer as both Src and Dst, the problem does not appear. My guess is that the dependency analyzer deduces that the Add kernel only reads from Src1 and Src2 and only writes to Dst, but does not take into account that I can pass the same pointer in both arguments. This does not always happen (otherwise I would have reproduced it in a simple program very quickly), but the bug does seem to be present under some circumstances, which I can't pin down with the tools I have and my level of understanding of the hardware.
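The workaround can be enforced mechanically on the host: before enqueueing, check whether any source argument aliases the destination, and if so, route the result through a scratch buffer. A minimal sketch of the aliasing check, with a hypothetical helper name that is not from the original application:

```c
#include <stddef.h>

/* Return 1 if the destination pointer aliases any of the n source
   pointers. Hypothetical helper for the workaround described above. */
static int kernel_args_alias(const void *dst, const void *const srcs[], size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (srcs[i] == dst)
            return 1;
    return 0;
}
```

For Add(Src1, Src2, Dst), the host code would call this with Dst and {Src1, Src2}; on a hit, it would substitute a temporary output buffer and copy the result back afterwards, so the runtime never sees a kernel whose input aliases its output.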

I'd appreciate it if you passed all the data I posted above (the description of the workflow, the screenshots from APP Profiler and, if required, the APP trace files along with the source of the kernels I was using) to AMD engineers, even though you can't reproduce it, as such incorrect behavior is basically a showstopper.

Regards,

Tim

0 Likes

This reminds me of the "fno-alias" compiler option. Previously we had to pass it to the compiler; at some point it was made the default... Maybe passing a compiler flag that accounts for aliasing would help.

Just a thought... I can't remember what that option is (or whether such an option exists at all).

I will check on Monday to see if I can find something.

0 Likes

Hi Himanshu,

are you talking about cl-strict-aliasing?


This option is documented in the Khronos spec (see clBuildProgram); however, the AMD APP Programming Guide says that the AMD compiler only supports -I and -D (see section 2.1.3), and it does not mention any default options either. It also seems that this option was only present in OpenCL 1.0 and 1.1 (it is not mentioned in OpenCL 1.2), but I can't find any info on whether it is now enabled by default.

If this option is enabled by default, then yes, this could explain the behavior I observe.

Regards,

Tim

0 Likes

Nice find. The OpenCL 1.2 spec specifically says that "cl-strict-aliasing" is no longer supported as of OpenCL 1.1 (Appendix F).

Anyway, all these aliasing options only apply within a kernel (for compilation purposes) and have no bearing on dependencies among kernel launches. I don't know how the runtime handles those.

Coming to your problem:

1) two kernels cannot execute in parallel in current AMD implementations (as of today)

2) out-of-order processing of a command queue does not happen with the AMD runtime (even if you enabled it)

By the way, are the dependent kernels scheduled from two different command queues, or a single command queue?

Please post the smallest quick reproduction case for us to pursue with AMD engineering.

0 Likes

> 2 kernels cannot execute in parallel in current AMD implementations (as on today)

Strange. Here is a quote from the APP SDK Guide (see section 5.9, page 5-34) about Southern Islands GPUs:

"Execution of kernel dispatches can overlap if there are no dependencies between them and if there are resources available in the GPU."

The whole application uses a single command queue. This queue was created with 0 passed as the value of the properties argument of clCreateCommandQueue.

I'm attaching the source code of kernels I have used as well as the trace files produced with AMD APP profiler. When inspecting the timeline, an example of a correct tile starts at 7151.988, while the incorrect tile starts at 7181.370.

Unfortunately, as I have explained in the initial post, I failed to produce a complete working example.

0 Likes

Thanks for attaching a case. Will try it out.

And the APP Programming Guide was most likely talking about the hardware capability.

The hardware has it, but the current OpenCL runtime does not use it.

0 Likes
glimberg
Journeyman III

I've come across a very similar issue in an application I'm working on.

In my case, the images are not processed in tiles, but rather as full image buffers.  Like your case, everything works fine on 5xxx and 6xxx based cards and we're only seeing the issue on 7950 and 7970, though I still need to track down our 7750 and see if it's an issue there, too.  I have also tried the 7950 and 7970 on Mac OS 10.8.3 and the issue does NOT appear there.  I can't say for certain as I haven't tried Linux yet, but it sure seems like a Windows only issue.

Basically, the moment that there are more than 2 kernels queued, I start to get incorrect results, but only on buffers above a certain size.

The only way I've been able to circumvent the issue is to disable the OpenCL optimizer by passing "-cl-opt-disable" to the options field of clBuildProgram().

0 Likes

It will be helpful, if you can post a small repro case.

0 Likes

Much like the original author of this thread, I'm unable to create a small repro case.  I cannot post any of our code to these forums.  The best I can do is begin a dialog with one of your engineers, and once there is an NDA in place between our two companies, send over the code in question privately.

0 Likes

Did you try the workaround I have suggested above? Specifically, avoid passing the same pointers in different parameters to kernels, paying special attention to not passing the same pointer in both input and output parameters.

0 Likes

We don't have that specific case as far as I can tell.  We do have a few kernels that accept only a single buffer for input and output, i.e. the input buffer is modified in place, but none where the same buffer is passed to two different kernel args.  Said buffer is then passed to other kernels for additional operations.

I will try to modify those kernels tomorrow and see if adding a destination buffer argument helps things.

0 Likes

Another thing to note.  There is a difference between your issue and mine.  No amount of calling clFinish() between my kernels helps in this case. The only thing thus far that has helped me is completely disabling optimization on the kernels.  And optimization must be disabled for all kernels in the chain of 8 or so.  If it's enabled for a single one of the kernels, I start getting incorrect results.

0 Likes

Could be a different issue then. Do you see any overlap happening in APP Profiler timeline or kernel executions look sequential (both graphically and when inspecting start and end times)?

0 Likes

Yeah, no overlapping kernels on my end as far as I can tell.  Kernels just aren't running in their entirety before the next one starts up when the optimizer is enabled.  It skips the last 1/3rd or so of the buffer once the buffer size starts to approach 39,321,600 bytes (5120 x 2560 x 3 channels).

0 Likes

Hi glimberg,

I request you to start a new thread for your question; the two topics being discussed here do not seem related.

Also, as I understand it, your question is not related to multiple kernel launches at all, but more likely to some incorrect kernel optimization. Please provide a suitable test case, if possible, so we can reproduce it at our end.

0 Likes

Actually, it's very much related to multiple kernel launches. If we only run the first kernel in the 8-kernel chain, everything works fine; the issue surfaces once a second kernel is added. The commonality with Tim's issue is that the first few kernels in the chain operate in place on the input buffer via a single input/output parameter. This is slightly different in kernel-code form from Tim's issue, but likely compiles down to the same thing: Tim sometimes does an operation in place by passing the same memory pointer to separate input and output parameters of the kernel.

Just to demonstrate the issue (I have reproduced it in a minimal sample here, but have not been authorized to release the sample):

1) Create a memory buffer that can hold a 5120x2160 RGB image with 16 bits per channel, approximately 66 MB.

2) before uploading the buffer to the card, memset the entire buffer to 0.

3) Upload the host buffer to the card.

4) Run a memset() kernel like the following:

__kernel void cl_memset(__global unsigned char *buffer,
                        const unsigned char value)
{
    buffer[get_global_id(0)] = value;
}

and set the value kernel arg to 255, running it over every byte in the buffer.

5) Run any other kernel with an operation that takes a separate input and output buffer.  In my case in my minimal sample, it was a pixel type conversion from RGB-16 to BGRA-8.

6) Download the buffer back to host memory and ensure each byte is equal to 255.

At around buffer index 33,400,000 (not exact), you will start getting char values of 0 in the host buffer downloaded back from the card.

If you just run the cl_memset kernel without any other kernels queued after it, everything works fine. The moment another kernel is enqueued after it is when everything starts going to hell.

No amount of enqueuing barriers, markers, or waiting for events changes anything.
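The steps above can be modeled on the host to show what a correct run must produce. This is a sketch under the sizes stated in the post; the helper name and structure are mine, not the original sample:

```c
#include <stddef.h>

/* Sizes from the post: 5120x2160, 3 channels, 2 bytes per channel. */
enum { WIDTH = 5120, HEIGHT = 2160, CHANNELS = 3, BYTES_PER_CH = 2 };
#define BUF_SIZE ((size_t)WIDTH * HEIGHT * CHANNELS * BYTES_PER_CH) /* 66,355,200 */

/* The verification in step 6: return the index of the first byte that
   differs from `expected`, or n if the whole buffer checks out.
   Hypothetical helper, not the original sample code. */
static size_t first_bad_byte(const unsigned char *buf, size_t n,
                             unsigned char expected)
{
    for (size_t i = 0; i < n; ++i)
        if (buf[i] != expected)
            return i;
    return n;
}
```

A correct cl_memset over every byte should make first_bad_byte(buf, BUF_SIZE, 255) return BUF_SIZE; the failure described above is equivalent to it returning an index around 33,400,000, well inside the 66,355,200-byte buffer.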

0 Likes

This problem might be L1 cache flush related, but could you confirm all the buffer sizes - check that you have no overflows and no overlapping buffers?

5120x2160x16bit isn't ~66MB, it's ~21MB.

5120x2160 has 11,059,200 indices, so 33,400,000 would be out of bounds.

-Vic

0 Likes

It's 16-bits per channel, not an RGB value packed into 16-bits.

5120 x 2160 * 3 (number of channels) * 2 (bytes) = 66,355,200.

The memset kernel I posted works on a per-byte basis, so that means 66,355,200 indices.

0 Likes

An L1 cache flush-related problem is my second hypothesis (after illegal parallel execution) for what could be causing the errors I observe.

0 Likes

You may not be authorized to share your proprietary code, but it does not seem that difficult to create a small repro case based on your steps: use two simple kernels and write the appropriate host code.

0 Likes
vmiura
Adept II

Did you try putting clEnqueueBarrier after all calls to clEnqueueNDRangeKernel?

Since you don't want kernels to overlap, this is what you want anyway. In the worst case it costs nothing, but if it fixes the bug, it would help confirm the idea that kernels are running in parallel.

0 Likes

No, I didn't, but it's worth a try. I'll post my findings here after I have a chance to run this experiment.

UPDATE (May 23rd): events do not seem to help. Neither do barriers.

0 Likes

Kernels can run in parallel, but we have dependency checking to make sure that kernels with dependencies don't overlap, i.e. if the output from one kernel is used as the input or output to another, then they can't run in parallel.

If you find a bug with this, please let us know.

0 Likes

Thanks, Jeff. Apparently I've found a bug; please see the posts above.

0 Likes

If you are using profiling, then concurrent execution is disabled.  If you aren't getting correct results with profiling enabled, then it's not an issue with dependency checking.

Which kernel is generating incorrect data?  There are 80 kernels in your trace file.

0 Likes

I am executing these 4 steps for each of the ~80 tiles:

  1. Temp1 = 100
  2. Temp2 = 30
  3. Temp1 = Temp1 + Temp2
  4. Dst = Src + Temp1

The results for incorrect tiles look as if steps 3 and 4 executed in parallel (some elements in Dst contain Src + 100 and some contain Src + 130). When inspecting the timeline in my trace file, an example of a correct tile starts at 7151.988, while an incorrect tile starts at 7181.370.

0 Likes