cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

martinb
Journeyman III

possible OpenCl compiler bug

Hi,

I have come across a possible compiler issue which is present in the attached kernels. The two kernels differ how work is mapped to the threads. Requiring a work group size of 256; the difference in the kernels is

210,211c210,211

<     int divIdx = get_local_id(0) / 8;

<     int modIdx = get_local_id(0) % 8;

---

>     int divIdx = get_local_id(0) % 32;

>     int modIdx = get_local_id(0) / 32;

For the first version i get correct reproducible results from an Radeon 6550 for the PowerBornBBKernel, as well as an r290x. For the latter version i still get correct reproducible results from the 6550. However i get incorrect results from r290x which differ between different runs of the kernel with the same input.

Name:                                      Hawaii
  Vendor:                                    Advanced Micro Devices, Inc.
  Device OpenCL C version:                   OpenCL C 1.2
  Driver version:                            1573.4 (VM)
  Profile:                                   FULL_PROFILE
  Version:                                   OpenCL 1.2 AMD-APP (1573.4)

  Name:                                      BeaverCreek
  Vendor:                                    Advanced Micro Devices, Inc.
  Device OpenCL C version:                   OpenCL C 1.2
  Driver version:                            1214.3
  Profile:                                   FULL_PROFILE
  Version:                                   OpenCL 1.2 AMD-APP (1214.3)
Tags (2)
0 Likes
14 Replies
dipak
Staff
Staff

Re: possible OpenCl compiler bug

Hi,

Please provide a test project (with host code and if required, data set) such that we can run these two kernels at our end and verify the results [also suggest how to compare the results (i.e correct or not)]. A simpler test-case with same behavior would be very helpful for us. Please also let us know other setup details such as OS, SDK, Driver version etc..

Regards,

0 Likes
martinb
Journeyman III

Re: Re: possible OpenCl compiler bug

Hi,

Due to licensing i am not able to share the original code and I havent been able to extract a test case with the same behaviour so far. I have a more detailed output from the correct and incorrect running versions, which you may want to have a look at meanwhile, because i dont really understand the behaviour i observe there. The output comes from two additional print statements in the kernel, see the attach modified kernels. From the output it seems that either a part of the work group with modIdx == 6 does not execute, or that the other part does not respect the barriers.

Also note that the code fails to run Test 1, but passes Test 2 using the same kernel file.

The system producing the incorrect results runs:

lsb_release -a

LSB Version:    :core-4.1-amd64:core-4.1-noarch

Distributor ID: CentOS

Description:    CentOS Linux release 7.0.1406 (Core)

Release:        7.0.1406

Codename:       Core

  Name:                                     Hawaii
  Vendor:                                   Advanced Micro Devices, Inc.
  Device OpenCL C version:                  OpenCL C 1.2
  Driver version:                           1573.4 (VM)
  Profile:                                  FULL_PROFILE
  Version:                                  OpenCL 1.2 AMD-APP (1573.4)

fglrxinfo -display

display:   screen: 0

OpenGL vendor string: Advanced Micro Devices, Inc.

OpenGL renderer string: AMD Radeon R9 200 Series

OpenGL version string: 4.4.13084 Compatibility Profile Context 14.301.1001

0 Likes
mrrvlad
Adept I

Re: possible OpenCl compiler bug

here is  "bare bones" version of the kernel. I hope It does not have an obvious sync bug... and repros on your end. the nested ifs and this kind of thread assignment are used to fight generous register usage in original kernel.

I expect the resulting array to have values close to 255, and in fact it contains mostly 17 and 19, with some rare cases of other combinations.

run this with offset 0 global size 256, local size 256

I've tried it on a 260x win7 64bit machine with recent codeXL and catalyst packages.

code xl 1.5.6571.0

catalyst: 14.9 - downloaded from amd.com today.

__kernel void

BBKernel(__global float* result)

{

    __local float4 level3[64];

    int threadIdx = get_local_id(0);

    if (threadIdx < 64)

    {

        level3[threadIdx].x = 0;

        level3[threadIdx].y = 0;

        level3[threadIdx].z = 0;

        level3[threadIdx].w = 0;

    }

    barrier(CLK_LOCAL_MEM_FENCE| CLK_GLOBAL_MEM_FENCE);

    const unsigned int atomCnt = 288;

    int divIdx = (threadIdx & 0x1f);

    int modIdx = (threadIdx  >> 5);

    for(unsigned int atomid = 0; atomid < atomCnt; atomid += 32)

    {

        if (modIdx <4)

        {

            if (modIdx <2)

            {

                if (modIdx == 0)

                    level3[divIdx].x = 1;

                else

                    level3[divIdx].y = 2;

            }

            else

            {

                if (modIdx == 2)

                    level3[divIdx].z = 4;

                else

                    level3[divIdx].w = 8;

            }

        }

        else

        {

            if (modIdx < 6)

            {

                if (modIdx == 4)

                    level3[32 + divIdx].x = 16;

                else

                    level3[32 + divIdx].y = 32;

            }

            else

            {

                if (modIdx == 6)

                    level3[32 + divIdx].z = 64;

                else

                    level3[32 + divIdx].w = 128;

            }

        }

        barrier(CLK_LOCAL_MEM_FENCE| CLK_GLOBAL_MEM_FENCE);

        if (threadIdx < 32)

        {

            float br = ( level3[threadIdx].x

                + level3[threadIdx].y

                + level3[threadIdx].z

                + level3[threadIdx].w

                + level3[32 + (threadIdx)].x

                + level3[32 + (threadIdx)].y

                + level3[32 + (threadIdx)].z

                + level3[32 + (threadIdx)].w);

            result[atomid + threadIdx] = br;

        }

        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

    } 

}

0 Likes
martinb
Journeyman III

Re: Re: possible OpenCl compiler bug

Hi,

We finally have a reproducible test case with host code which produces the expected results on the CPU, but produces incorrect results on the GPU. Interestingly, the thread distribution

int divIdx = get_local_id(0) / 8;
int modIdx = get_local_id(0) % 8;

which works for the larger complete kernel, also fails the simple test case.

0 Likes
dipak
Staff
Staff

Re: possible OpenCl compiler bug

Hi,

Thanks for posting the reproducible test case. I'm on vacation, so, I'll try to test it next week. Meanwhile, if you have any update, please post.

Regards,

0 Likes
mrrvlad
Adept I

Re: possible OpenCl compiler bug

Hi,

any update on this? have you been able to repro the bug on your side?

0 Likes
dipak
Staff
Staff

Re: Re: possible OpenCl compiler bug

My apologies for this delay.

I ran your code on following setups- 1) CPU, 2) Devastator (integrated-GPU) and  3) Capeverde (HD 7770). Please find the output of each setups attached herewith.

I guess, outputs of setup-1 and 2 are okay, but not setup-3. Output of both the kernels for setup-3 are different and neither of them matches with setup-1 and 2.

Can I assume setup-3 as the erroneous case that you wanted to point out? If so, I'll file an internal bug report against this test case.

Note:

Machine setup:

AMD A10-6800K APU

HD7770 D-Gpu

Windows 7 (64 bit)

Catalyst Driver: 14.9.2 beta

APP SDK 2.9-1

Regards,

0 Likes
mrrvlad
Adept I

Re: Re: possible OpenCl compiler bug

thanks for looking into this.

Yes, 7770 output is wrong. I also have seen a different kind of wrong output on Bonair, and 290X

0 Likes
dipak
Staff
Staff

Re: possible OpenCl compiler bug

Thanks for this confirmation. An internal bug report has been filed against this issue. I'll let you know if get any update.

Regards,

0 Likes