
OpenCL

kbala
Adept I

Strange behavior of a kernel, need fresh ideas

I have spent two days debugging (or rather, trying to debug) my kernel. The kernel is part of the dagger-hashimoto initialization and basically looks like this:

1. copy from global to private

2. do private

3. copy from private to local

4. do local

5. copy from local to private

6. do private

7. copy from private to global

After two days of digging, it seems to me that the problem is somewhere between 4. and 6.

If I compile the kernel with "-cl-opt-disable", everything works fine. With optimization enabled, it doesn't.

If I put a barrier or mem_fence between 4. and 5. and/or between 5. and 6., nothing changes. But if I instead put a conditional printf (one that will never fire) between 5. and 6., everything works fine again.

I have tried everything that crossed my mind.

The line that is problematic looks like this:

for (uint word_id = 0; word_id < 16; word_id++)

     state.Words[word_id] = sharedBlocks[groupId][threadId][word_id]; 

// workgroup = 64

// uint sharedBlocks[4][16][16]

// groupId [0..3]

// threadId [0..15]

I looked at the ISA and found (I hope I'm right) that the compiler copied only 10 of the 16 words. I don't know why.

(This is in line with the following experiment: loop works fine with indexes word_id = 0, 1, 2, 3, 6, 7, 10, 11, 14, 15; but not when word_id = 4, 5, 8, 9, 12, 13)
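An experiment like the one above can be automated on the host. The following is only a sketch in plain C: `find_bad_words` is a hypothetical helper (not part of the original project) that compares the words read back from the device with the expected values and lists the indices that differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side helper: compares the words a kernel wrote back
 * against the expected values and stores the indices that differ in
 * bad_indices (which must have room for `count` entries).
 * Returns the number of mismatching words. */
size_t find_bad_words(const uint32_t *expected, const uint32_t *actual,
                      size_t count, size_t *bad_indices)
{
    assert(expected != NULL && actual != NULL && bad_indices != NULL);

    size_t bad = 0;
    for (size_t i = 0; i < count; i++) {
        if (expected[i] != actual[i])
            bad_indices[bad++] = i;   /* remember which word was not copied */
    }
    return bad;
}
```

With the failure pattern described here, such a check would report 6 bad words out of 16, which matches the 10 copies seen in the ISA.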

However, if I change those two lines to something like this:

for (uint word_id = 0; word_id < 8; word_id++)

     ((ulong*)(&state))[word_id] = ((ulong*)(sharedBlocks[groupId][threadId]))[word_id];

then it works again: the compiler copies all 16 words.

However, something like this:

for (uint word_id = 0; word_id < 16; word_id++)

((uint*)(&state))[word_id] = ((uint*)(sharedBlocks[groupId][threadId]))[word_id];

doesn't work.

I'm really puzzled.

Any ideas?

0 Likes
7 Replies
dipak
Big Boss

Thanks for reporting it.

From your description, it looks like a compiler optimization issue. To investigate, we need a minimal test case (host code + kernel) that reproduces the problem. Please provide a repro and mention the setup details (OS, GPU, driver, etc.).

0 Likes

I pulled out a piece of code that I'm not sure is right.

Honestly, I doubt that the problem is in the compiler; more likely I'm just blind and cannot see what is wrong.

I'm sending the project for VS2017. The OS is Win10, the GPU is an RX 550, and the driver is the latest. I had the same problem with an RX 480 and an older driver.

Almost the same code works on NVIDIA CUDA, so I suspect that I violated the OpenCL C++ standard somewhere, but I don't know where.

I hope that you have an answer.

Thanks.

0 Likes

After a quick look at the kernel code, I suspect the declaration below might be causing the problem. Please find my comments to the right of the code.

typedef union

{

ulong words[KECCAK_BYTES / sizeof(uint)];  // <---- causes a conversion between ulong and uint;

                                           // expected declaration: uint words[KECCAK_BYTES / sizeof(uint)];

ulong dwords[KECCAK_BYTES / sizeof(ulong)];

block64_t block;

} state_t;

Thanks.

0 Likes

Thank you Dipak.

I'm sorry, I made a typo when I tried to simplify the original structures.

Your answer helped me to look for an error elsewhere and I found it here:

#define BLOCK64_BYTES 64

#define BLOCK64_WORDS (BLOCK64_BYTES / sizeof (uint))

#define BLOCK64_DWORDS (BLOCK64_BYTES / sizeof (ulong))

#define BLOCK64_QWORDS (BLOCK64_BYTES / sizeof (uint4))

typedef union

{

uint mWords [BLOCK64_WORDS];

ulong mDWords [BLOCK64_DWORDS];

uint4 mQWords [BLOCK64_QWORDS]; <---- without this line everything works fine

} tBlock64;

Could the error be caused by automatic alignment?
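To make the alignment question concrete, here is a small host-side sketch in C11. It is an approximation, not the kernel code: plain C has no `uint4`, so `uint4_like` is a stand-in that assumes a 16-byte aligned group of four 32-bit values (OpenCL C aligns `uint4` to 16 bytes), and all type and function names are made up for the illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK64_BYTES 64

/* Stand-in for OpenCL's uint4 (assumption): four 32-bit lanes, 16-byte aligned. */
typedef struct { alignas(16) uint32_t s[4]; } uint4_like;

/* Plain struct of four uints, in the spirit of a user-defined tUInt4. */
typedef struct { uint32_t x, y, z, w; } uint4_plain;

/* Union with the vector-like member: inherits the 16-byte alignment. */
typedef union {
    uint32_t   mWords[BLOCK64_BYTES / sizeof(uint32_t)];
    uint64_t   mDWords[BLOCK64_BYTES / sizeof(uint64_t)];
    uint4_like mQWords[BLOCK64_BYTES / sizeof(uint4_like)];
} block64_vec;

/* Same union with the plain struct: alignment comes from the uint64_t member. */
typedef union {
    uint32_t    mWords[BLOCK64_BYTES / sizeof(uint32_t)];
    uint64_t    mDWords[BLOCK64_BYTES / sizeof(uint64_t)];
    uint4_plain mQWords[BLOCK64_BYTES / sizeof(uint4_plain)];
} block64_plain;

/* Small accessors so the layout can be inspected or printed by a caller. */
size_t block64_vec_size(void)    { return sizeof(block64_vec); }
size_t block64_vec_align(void)   { return alignof(block64_vec); }
size_t block64_plain_size(void)  { return sizeof(block64_plain); }
size_t block64_plain_align(void) { return alignof(block64_plain); }
```

Under these assumptions both unions are 64 bytes; the only layout difference is the required alignment (16 bytes with the vector-like member, the alignment of a 64-bit integer otherwise). That narrows the question down, but it does not by itself show that the kernel breaks an alignment rule.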

This time I'm sending a slightly extended version. I would be grateful for some more help; I just need to understand what the problem is.

0 Likes

Addition:

The error does not appear if we use a user-defined tUInt4:

typedef struct

{

uint x;

uint y;

uint u;

uint w;

} tUInt4;

typedef union

{

uint mWords[BLOCK64_WORDS];

ulong mDWords[BLOCK64_DWORDS];

tUInt4 mQWords[BLOCK64_QWORDS];

} tBlock64;

0 Likes

Hi Karlo,

I tested the latest code. From what I observed, it looks like a compiler optimization problem specific to OpenCL 2.0. If the same kernel is built for OpenCL 1.2 (with or without optimization), it produces the expected result. I will report this problem to the concerned team.
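For anyone who wants to repeat the comparison, the four configurations (OpenCL 1.2 or 2.0, with or without optimization) differ only in the build options string. The sketch below is plain C; `make_build_options` is a hypothetical helper, while `-cl-std=` and `-cl-opt-disable` are the standard OpenCL compiler options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes an OpenCL build options string such as
 * "-cl-std=CL1.2 -cl-opt-disable" into buf.
 * cl_std is the language version ("CL1.2", "CL2.0", ...).
 * Returns 0 on success, -1 if the buffer is too small. */
int make_build_options(char *buf, size_t size, const char *cl_std,
                       int disable_opt)
{
    assert(buf != NULL && cl_std != NULL);

    int n = snprintf(buf, size, "-cl-std=%s%s", cl_std,
                     disable_opt ? " -cl-opt-disable" : "");
    /* snprintf reports the length it wanted; anything >= size was truncated. */
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

The resulting string would be passed as the options argument of clBuildProgram, once per configuration, and the kernel output compared across the four builds.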

Thanks.

0 Likes

The mystery is then more or less solved.

Thanks.

0 Likes