7 Replies Latest reply on Sep 20, 2018 1:45 PM by kbala

    Strange behavior of a kernel, need fresh ideas

    kbala

      I lost two days debugging, or rather trying to debug, my kernel. Basically, the kernel looks like this (it is part of the Dagger-Hashimoto initialization):

       

      1. copy from global to private

      2. do private

      3. copy from private to local

      4. do local

      5. copy from local to private

      6. do private

      7. copy from private to global

       

      After two days of digging, it seems to me that the problem is somewhere between 4. and 6.

       

      If I compile the kernel with "-cl-opt-disable", everything works fine. With optimization enabled, it doesn't.

       

      Now, if I put a barrier or mem_fence between steps 4 and 5 and/or between steps 5 and 6, nothing changes. But if I instead put a conditional printf (one that never fires) between steps 5 and 6, then everything works fine again.

       

      So I have tried everything that crossed my mind.

       

      The line that is problematic looks like this:

       

      for (uint word_id = 0; word_id < 16; word_id++)

           state.Words[word_id] = sharedBlocks[groupId][threadId][word_id]; 

       

      // workgroup = 64

      // uint sharedBlocks[4][16][16]

      // groupId [0..3]

      // threadId [0..15]

       

      I looked at the ISA and found (I hope I'm right) that the compiler copied only 10 of the 16 words. I don't know why.

      (This is in line with the following experiment: the loop works fine for indexes word_id = 0, 1, 2, 3, 6, 7, 10, 11, 14, 15, but not for word_id = 4, 5, 8, 9, 12, 13.)

       

      However, if I change those two lines to something like this:

       

      for (uint word_id = 0; word_id < 8; word_id++)

           ((ulong*)(&state))[word_id] = ((ulong*)(sharedBlocks[groupId][threadId]))[word_id];

       

      then it works again. The compiler copies all 16 words.

       

      However, something like this:

       

      for (uint word_id = 0; word_id < 16; word_id++)

           ((uint*)(&state))[word_id] = ((uint*)(sharedBlocks[groupId][threadId]))[word_id];

       

      doesn't work.
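
      For reference, a minimal host-side C sketch of the two copy variants above. It is not the kernel from the post: `state64_t`, `copy_as_words`, `copy_as_dwords` and `copies_agree` are hypothetical names, and a union is used instead of pointer casts so that the C code stays well-defined. It only shows that, when compiled correctly, copying 64 bytes as 16 32-bit words or as 8 64-bit words should give identical results.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORDS 16 /* 16 x 32-bit words = 64 bytes, as in the post */

/* Hypothetical stand-in for the kernel's private state. */
typedef union {
    uint32_t words[WORDS];
    uint64_t dwords[WORDS / 2];
} state64_t;

/* Copy word by word (the variant reported as failing on the GPU). */
static void copy_as_words(state64_t *dst, const state64_t *src) {
    for (unsigned i = 0; i < WORDS; i++)
        dst->words[i] = src->words[i];
}

/* Copy as 64-bit pairs (the variant reported as working). */
static void copy_as_dwords(state64_t *dst, const state64_t *src) {
    for (unsigned i = 0; i < WORDS / 2; i++)
        dst->dwords[i] = src->dwords[i];
}

/* Returns 1 if both variants reproduce the source exactly. */
static int copies_agree(void) {
    state64_t src, a, b;
    assert(sizeof(state64_t) == 64);
    for (unsigned i = 0; i < WORDS; i++)
        src.words[i] = 0x01010101u * (i + 1); /* distinct pattern per word */
    memset(&a, 0, sizeof a);
    memset(&b, 0, sizeof b);
    copy_as_words(&a, &src);
    copy_as_dwords(&b, &src);
    return memcmp(&a, &src, sizeof src) == 0 &&
           memcmp(&b, &src, sizeof src) == 0;
}
```

      Running the same comparison on the device (word-wise copy against pair-wise copy into two buffers) is one way to narrow a repro down to the single failing loop.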

       

      I'm really puzzled.

       

      Any ideas?

        • Re: Strange behavior of a kernel, need fresh ideas
          dipak

          Thanks for reporting it.

          From your description, it looks like a compiler optimization issue. To investigate, we need a minimal test case (host code + kernel) that reproduces the problem. Please provide a repro and mention the setup details (OS, GPU, driver, etc.).

            • Re: Strange behavior of a kernel, need fresh ideas
              kbala

              I pulled out the piece of code that I'm not sure is right.

              Honestly, I doubt that the problem is in the compiler; more likely I'm just blind and cannot see what is wrong.

               

              I'm sending the project for VS2017. The OS is Win10, the GPU is an RX 550, with the latest driver. I had the same problem with an RX 480 and an older driver.

              Almost the same code works on NVIDIA CUDA, so I suspect that I violated the OpenCL C++ standard somewhere, but I don't know where.

               

              I hope you have an answer.

               

              Thanks.

                • Re: Strange behavior of a kernel, need fresh ideas
                  dipak

                  After a quick look at the kernel code, I suspect the declaration below might be causing the problem. Please find my comments to the right of the code.

                   

                  typedef union

                  {

                  ulong words[KECCAK_BYTES / sizeof(uint)]; // <---- causes conversion between ulong and uint; expected declaration: uint words[KECCAK_BYTES / sizeof(uint)];

                   

                  ulong dwords[KECCAK_BYTES / sizeof(ulong)];

                  block64_t block;

                  } state_t;

                   

                  Thanks.
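
                  To illustrate the comment above, here is a small host-side C sketch of how the element type changes the union. It rests on assumptions: the thread does not give the value of KECCAK_BYTES, so 200 (the size of a Keccak-f[1600] state) is used purely for illustration; the `block64_t` member is left out because its definition is not shown; and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: KECCAK_BYTES is not given in the thread; 200 is illustrative. */
#define KECCAK_BYTES 200

/* As posted: the element count is computed for uint, but the elements are ulong. */
typedef union {
    uint64_t words[KECCAK_BYTES / sizeof(uint32_t)];  /* 50 elements of 8 bytes */
    uint64_t dwords[KECCAK_BYTES / sizeof(uint64_t)]; /* 25 elements of 8 bytes */
} state_as_posted_t;

/* As expected: both views cover the same KECCAK_BYTES bytes. */
typedef union {
    uint32_t words[KECCAK_BYTES / sizeof(uint32_t)];  /* 50 elements of 4 bytes */
    uint64_t dwords[KECCAK_BYTES / sizeof(uint64_t)]; /* 25 elements of 8 bytes */
} state_expected_t;

static size_t posted_size(void)   { return sizeof(state_as_posted_t); }
static size_t expected_size(void) { return sizeof(state_expected_t); }
```

                  With the posted declaration the union is twice as large as intended, and `words[i]` overlays `dwords[i]` instead of one half of `dwords[i / 2]`.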

                    • Re: Strange behavior of a kernel, need fresh ideas
                      kbala

                      Thank you Dipak.

                       

                      I'm sorry, I made a typo when I tried to simplify the original structures.

                       

                      Your answer helped me to look for an error elsewhere and I found it here:

                       

                      #define BLOCK64_BYTES 64

                      #define BLOCK64_WORDS (BLOCK64_BYTES / sizeof (uint))

                      #define BLOCK64_DWORDS (BLOCK64_BYTES / sizeof (ulong))

                      #define BLOCK64_QWORDS (BLOCK64_BYTES / sizeof (uint4))

                       

                      typedef union

                      {

                      uint mWords [BLOCK64_WORDS];

                      ulong mDWords [BLOCK64_DWORDS];

                      uint4 mQWords [BLOCK64_QWORDS]; <---- without this line everything works fine

                      } tBlock64;

                       

                      Can the error be due to automatic alignment?

                       

                      This time I'm sending a slightly extended version. I would be grateful for a bit more help. I just need to understand what the problem is.
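
                      On the alignment question, a host-side C sketch may help frame it. In OpenCL C a uint4 is aligned to 16 bytes, so adding it to the union raises the alignment of the whole union even though its size stays at 64 bytes. The sketch below mimics that in C11; `uint4_like`, the two union names and the helper functions are hypothetical stand-ins, not code from the project. Whether this alignment change is what actually triggers the miscompiled copy is not established in the thread.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for OpenCL's uint4: four 32-bit lanes, 16-byte aligned. */
typedef struct { alignas(16) uint32_t s[4]; } uint4_like;

#define BLOCK64_BYTES  64
#define BLOCK64_WORDS  (BLOCK64_BYTES / sizeof(uint32_t))
#define BLOCK64_DWORDS (BLOCK64_BYTES / sizeof(uint64_t))
#define BLOCK64_QWORDS (BLOCK64_BYTES / sizeof(uint4_like))

/* Union without the vector member (the variant reported as working). */
typedef union {
    uint32_t mWords[BLOCK64_WORDS];
    uint64_t mDWords[BLOCK64_DWORDS];
} block_without_qwords;

/* Union with the vector member (the variant reported as failing). */
typedef union {
    uint32_t   mWords[BLOCK64_WORDS];
    uint64_t   mDWords[BLOCK64_DWORDS];
    uint4_like mQWords[BLOCK64_QWORDS];
} block_with_qwords;

static size_t align_without(void) { return alignof(block_without_qwords); }
static size_t align_with(void)    { return alignof(block_with_qwords); }
static size_t size_without(void)  { return sizeof(block_without_qwords); }
static size_t size_with(void)     { return sizeof(block_with_qwords); }

/* Returns 1 if p satisfies an alignment of a bytes. */
static int is_aligned(const void *p, size_t a) {
    return ((uintptr_t)p % a) == 0;
}
```

                      If the union with the uint4 member is reached through a cast from a plain uint array, the compiler may assume 16-byte alignment that the array does not guarantee; a check like `is_aligned` on the addresses involved is one way to test that hypothesis.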