kbala

Strange behavior of a kernel, need fresh ideas

Discussion created by kbala on Sep 13, 2018
Latest reply on Sep 20, 2018 by kbala

I lost two days debugging, or rather trying to debug, my kernel. Basically, the kernel looks like this (it is part of the Dagger-Hashimoto initialization):

 

1. copy from global to private

2. do private

3. copy from private to local

4. do local

5. copy from local to private

6. do private

7. copy from private to global

 

After two days of digging, it seems to me that the problem is somewhere between steps 4 and 6.

 

If I compile the kernel with "-opt-disable", everything works fine. With optimization enabled, it doesn't.

 

Now, if I put a barrier or mem_fence between steps 4 and 5 and/or between steps 5 and 6, nothing changes. But if I instead put a conditional printf (one that will never fire) between steps 5 and 6, then everything works fine again.

 

So I have tried everything that crossed my mind.

 

The line that is problematic looks like this:

 

for (uint word_id = 0; word_id < 16; word_id++)

     state.Words[word_id] = sharedBlocks[groupId][threadId][word_id]; 

 

// workgroup = 64

// uint sharedBlocks[4][16][16]

// groupId [0..3]

// threadId [0..15]
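For checking the kernel's output, a host-side reference of this copy can help. The following is a minimal sketch in plain C, not the kernel itself. The `State` struct layout, the helper names `split_local_id` and `reference_copy`, and the mapping groupId = lid / 16, threadId = lid % 16 are assumptions for illustration; the post only gives the ranges of the two indices.

```c
#include <assert.h>
#include <stdint.h>

#define GROUPS  4   /* groupId in [0..3] */
#define THREADS 16  /* threadId in [0..15] */
#define WORDS   16  /* 32-bit words per state */

/* Hypothetical stand-in for the kernel's private state. */
typedef struct { uint32_t Words[WORDS]; } State;

/* ASSUMPTION: a local id in [0..63] splits as
   groupId = lid / 16 and threadId = lid % 16. */
static void split_local_id(int lid, int *groupId, int *threadId)
{
    assert(lid >= 0 && lid < GROUPS * THREADS);
    *groupId = lid / THREADS;
    *threadId = lid % THREADS;
}

/* Performs the local-to-private copy for all 64 work-items,
   one after another, so the result can be compared with the GPU. */
static void reference_copy(uint32_t sharedBlocks[GROUPS][THREADS][WORDS],
                           State out[GROUPS * THREADS])
{
    for (int lid = 0; lid < GROUPS * THREADS; lid++) {
        int groupId, threadId;
        split_local_id(lid, &groupId, &threadId);
        for (int word_id = 0; word_id < WORDS; word_id++)
            out[lid].Words[word_id] = sharedBlocks[groupId][threadId][word_id];
    }
}
```

Filling `sharedBlocks` with a value that encodes (groupId, threadId, word_id) makes it easy to see which source word, if any, ended up in each slot.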

 

I looked at the ISA and found (I hope I'm right) that the compiler copied only 10 of the 16 words. I don't know why.

(This is in line with the following experiment: the loop works fine for indices word_id = 0, 1, 2, 3, 6, 7, 10, 11, 14, 15, but not for word_id = 4, 5, 8, 9, 12, 13.)
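One way to record such an experiment is to compare the words read back from the device with the expected ones and report the failing indices as a bitmask. This is only a sketch of a host-side helper; the name `mismatch_mask` is hypothetical and not part of any OpenCL API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns a mask with bit i set when actual[i] differs from expected[i].
   Handles at most 32 words, which is enough for a 16-word state. */
static uint32_t mismatch_mask(const uint32_t *expected,
                              const uint32_t *actual, size_t n)
{
    assert(n <= 32);
    uint32_t mask = 0;
    for (size_t i = 0; i < n; i++)
        if (expected[i] != actual[i])
            mask |= (uint32_t)1 << i;
    return mask;
}
```

For the failing indices listed above (4, 5, 8, 9, 12, 13) the mask would be 0x3330: six bits set, which matches 10 of 16 words being copied.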

 

However, if I change those two lines to something like this:

 

for (uint word_id = 0; word_id < 8; word_id++)

     ((ulong*)(&state))[word_id] = ((ulong*)(sharedBlocks[groupId][threadId]))[word_id];

 

then it works again. The compiler copies all 16 words.
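At the source level, the 8 x 64-bit loop and the 16 x 32-bit loop should move the same 64 bytes, so a difference in results points at code generation rather than at the loop logic. The sketch below checks that equivalence on the host in plain C. The function names are hypothetical, and `memcpy` replaces the pointer casts only to keep the host code free of aliasing problems.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORDS 16  /* 32-bit words per state */

/* Copy word by word, like the original 16-iteration loop. */
static void copy_as_u32(uint32_t *dst, const uint32_t *src)
{
    for (int word_id = 0; word_id < WORDS; word_id++)
        dst[word_id] = src[word_id];
}

/* Copy in 64-bit chunks, like the 8-iteration ulong loop. */
static void copy_as_u64(uint32_t *dst, const uint32_t *src)
{
    for (int word_id = 0; word_id < WORDS / 2; word_id++) {
        uint64_t tmp;
        memcpy(&tmp, &src[2 * word_id], sizeof tmp);
        memcpy(&dst[2 * word_id], &tmp, sizeof tmp);
    }
}

/* Returns 1 when both copies produce byte-identical results. */
static int copies_agree(const uint32_t *src)
{
    uint32_t a[WORDS], b[WORDS];
    assert(src != NULL);
    copy_as_u32(a, src);
    copy_as_u64(b, src);
    return memcmp(a, b, sizeof a) == 0;
}
```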

 

However, something like this:

 

for (uint word_id = 0; word_id < 16; word_id++)

     ((uint*)(&state))[word_id] = ((uint*)(sharedBlocks[groupId][threadId]))[word_id];

 

doesn't work.

 

I'm really puzzled.

 

Any ideas?
