cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

martinb
Journeyman III

possible OpenCl compiler bug

Hi,

I have come across a possible compiler issue which is present in the attached kernels. The two kernels differ how work is mapped to the threads. Requiring a work group size of 256; the difference in the kernels is

210,211c210,211

<     int divIdx = get_local_id(0) / 8;

<     int modIdx = get_local_id(0) % 8;

---

>     int divIdx = get_local_id(0) % 32;

>     int modIdx = get_local_id(0) / 32;

For the first version i get correct reproducible results from an Radeon 6550 for the PowerBornBBKernel, as well as an r290x. For the latter version i still get correct reproducible results from the 6550. However i get incorrect results from r290x which differ between different runs of the kernel with the same input.

Name:                                      Hawaii
  Vendor:                                    Advanced Micro Devices, Inc.
  Device OpenCL C version:                   OpenCL C 1.2
  Driver version:                            1573.4 (VM)
  Profile:                                   FULL_PROFILE
  Version:                                   OpenCL 1.2 AMD-APP (1573.4)

  Name:                                      BeaverCreek
  Vendor:                                    Advanced Micro Devices, Inc.
  Device OpenCL C version:                   OpenCL C 1.2
  Driver version:                            1214.3
  Profile:                                   FULL_PROFILE
  Version:                                   OpenCL 1.2 AMD-APP (1214.3)
0 Likes
14 Replies
dipak
Big Boss

Hi,

Please provide a test project (with host code and if required, data set) such that we can run these two kernels at our end and verify the results [also suggest how to compare the results (i.e correct or not)]. A simpler test-case with same behavior would be very helpful for us. Please also let us know other setup details such as OS, SDK, Driver version etc..

Regards,

0 Likes

Hi,

Due to licensing i am not able to share the original code and I havent been able to extract a test case with the same behaviour so far. I have a more detailed output from the correct and incorrect running versions, which you may want to have a look at meanwhile, because i dont really understand the behaviour i observe there. The output comes from two additional print statements in the kernel, see the attach modified kernels. From the output it seems that either a part of the work group with modIdx == 6 does not execute, or that the other part does not respect the barriers.

Also note that the code fails to run Test 1, but passes Test 2 using the same kernel file.

The system producing the incorrect results runs:

lsb_release -a

LSB Version:    :core-4.1-amd64:core-4.1-noarch

Distributor ID: CentOS

Description:    CentOS Linux release 7.0.1406 (Core)

Release:        7.0.1406

Codename:       Core

  Name:                                     Hawaii
  Vendor:                                   Advanced Micro Devices, Inc.
  Device OpenCL C version:                  OpenCL C 1.2
  Driver version:                           1573.4 (VM)
  Profile:                                  FULL_PROFILE
  Version:                                  OpenCL 1.2 AMD-APP (1573.4)

fglrxinfo -display

display:   screen: 0

OpenGL vendor string: Advanced Micro Devices, Inc.

OpenGL renderer string: AMD Radeon R9 200 Series

OpenGL version string: 4.4.13084 Compatibility Profile Context 14.301.1001

0 Likes

here is  "bare bones" version of the kernel. I hope It does not have an obvious sync bug... and repros on your end. the nested ifs and this kind of thread assignment are used to fight generous register usage in original kernel.

I expect the resulting array to have values close to 255, and in fact it contains mostly 17 and 19, with some rare cases of other combinations.

run this with offset 0 global size 256, local size 256

I've tried it on a 260x win7 64bit machine with recent codeXL and catalyst packages.

code xl 1.5.6571.0

catalyst: 14.9 - downloaded from amd.com today.

__kernel void

BBKernel(__global float* result)

{

    __local float4 level3[64];

    int threadIdx = get_local_id(0);

    if (threadIdx < 64)

    {

        level3[threadIdx].x = 0;

        level3[threadIdx].y = 0;

        level3[threadIdx].z = 0;

        level3[threadIdx].w = 0;

    }

    barrier(CLK_LOCAL_MEM_FENCE| CLK_GLOBAL_MEM_FENCE);

    const unsigned int atomCnt = 288;

    int divIdx = (threadIdx & 0x1f);

    int modIdx = (threadIdx  >> 5);

    for(unsigned int atomid = 0; atomid < atomCnt; atomid += 32)

    {

        if (modIdx <4)

        {

            if (modIdx <2)

            {

                if (modIdx == 0)

                    level3[divIdx].x = 1;

                else

                    level3[divIdx].y = 2;

            }

            else

            {

                if (modIdx == 2)

                    level3[divIdx].z = 4;

                else

                    level3[divIdx].w = 8;

            }

        }

        else

        {

            if (modIdx < 6)

            {

                if (modIdx == 4)

                    level3[32 + divIdx].x = 16;

                else

                    level3[32 + divIdx].y = 32;

            }

            else

            {

                if (modIdx == 6)

                    level3[32 + divIdx].z = 64;

                else

                    level3[32 + divIdx].w = 128;

            }

        }

        barrier(CLK_LOCAL_MEM_FENCE| CLK_GLOBAL_MEM_FENCE);

        if (threadIdx < 32)

        {

            float br = ( level3[threadIdx].x

                + level3[threadIdx].y

                + level3[threadIdx].z

                + level3[threadIdx].w

                + level3[32 + (threadIdx)].x

                + level3[32 + (threadIdx)].y

                + level3[32 + (threadIdx)].z

                + level3[32 + (threadIdx)].w);

            result[atomid + threadIdx] = br;

        }

        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

    } 

}

0 Likes

Hi,

We finally have a reproducible test case with host code which produces the expected results on the CPU, but produces incorrect results on the GPU. Interestingly, the thread distribution

int divIdx = get_local_id(0) / 8;
int modIdx = get_local_id(0) % 8;

which works for the larger complete kernel, also fails the simple test case.

0 Likes

Hi,

Thanks for posting the reproducible test case. I'm on vacation, so, I'll try to test it next week. Meanwhile, if you have any update, please post.

Regards,

0 Likes

Hi,

any update on this? have you been able to repro the bug on your side?

0 Likes

My apologies for this delay.

I ran your code on following setups- 1) CPU, 2) Devastator (integrated-GPU) and  3) Capeverde (HD 7770). Please find the output of each setups attached herewith.

I guess, outputs of setup-1 and 2 are okay, but not setup-3. Output of both the kernels for setup-3 are different and neither of them matches with setup-1 and 2.

Can I assume setup-3 as the erroneous case that you wanted to point out? If so, I'll file an internal bug report against this test case.

Note:

Machine setup:

AMD A10-6800K APU

HD7770 D-Gpu

Windows 7 (64 bit)

Catalyst Driver: 14.9.2 beta

APP SDK 2.9-1

Regards,

0 Likes

thanks for looking into this.

Yes, 7770 output is wrong. I also have seen a different kind of wrong output on Bonair, and 290X

0 Likes

Thanks for this confirmation. An internal bug report has been filed against this issue. I'll let you know if get any update.

Regards,

0 Likes

Inwhat driver version this bug expected to be fixed?

We have issue described in this thread: Kernel with local memory usage gives different results on some hardware where symptoms very similar to those described here.

Similarly, older GPUs have no issues while some GCN models fail to compute properly on workgroups bigger than single wavefront.

Barriers involved. Please fix this issue ASAP, it was reported few months ago...

0 Likes

Hi Raistmer

As I checked, the bug has not been fixed yet. Our team is working on it. Sorry for this delay. Please keep patience.

Regards,

0 Likes

I have reports that bug affected my app fixed in 15.4 beta for Windows and 15.3 for Linux.

Would be good to check this one against those drivers too.

0 Likes

Thanks for the information. Will check this issue on 15.4 beta.

0 Likes
dipak
Big Boss

Hi,

Here is an update about this issue. The engg. team have identified root cause of the issue. The problem is due to a race condition where multiple work items modify the same vector at the same time. The compiler does optimizations under the assumption that there is no race condition. If a program contains race condition, the result is undefined based on what optimization is done. It may work or may not. In order to avoid such race condition, one should use some kind of prevention such as volatile type, atomic_store, or other atomic methods etc. So, please modify the code accordingly.

Regards,

0 Likes