dstarke
Journeyman III

[BUG] local variables overwritten in OpenCL kernel

I have a custom OpenCL kernel with a while loop and various local variables.

These variables are sometimes overwritten (sometimes with 0, sometimes with NaN) when returning to the beginning of the loop.

The issue is reproducible when using the same input values.

The kernel works just fine with other vendors, thus I suspect a compiler bug.

I have tested the issue on the following systems:

- AMD Radeon HD 7800 Series

Driver Version 22.19.162.4

Windows 10 Education (Version 10.0.14393)

- AMD Radeon R9 200 Series

Driver Version 22.19.162.4

Windows 10 Pro 64-Bit (Version 1607)

- AMD Radeon 5800 Series

Driver Version 15.200.1062.1004

Windows 7 Home Premium (Version 6.1.7601 SP1 Build 7601)

I can provide the kernel in source and binary form (AMD Radeon HD 7800 Series) if required, but preferably not publicly.

This would also include a Windows application to reproduce the issue and example outputs from other vendors.

Please also let me know whether this is the right place, or where else I should report this issue.

dipak
Big Boss

Hi Daniel,

Do you observe the same issue with the latest AMD driver as well? If so, please share the repro code and system details so that our team can investigate it here. I doubt that the concerned team will accept an issue arising from a custom kernel.

Btw, you have been whitelisted now.

Regards,

The driver version is the newest version provided by the automatic Windows update.

As for the kernel and test application, please see https://filebin.ca/3Zhov3oL768B.

There you will find the output generated on AMD hardware and on a CPU (non-AMD vendor).

The test application produces 3 outputs:

- debug.txt (generated variable output)

- debug.png (rendered pixels)

- kernel.bin (binary kernel generated by the OpenCL driver)

In debug.txt you will notice that column 15 (counting from 1) is 0 in some places (starting at iteration i7) for AMD but not for the CPU.

This value corresponds to mx in kernel.cl. When execution passes line 509 of the kernel and jumps back to line 428, the value of mx changes even though, as far as we can tell from the code, nothing assigns to it in between. That means the content of the local variable mx (along with others) changes as a result of the jump back to the beginning of the while loop.
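The column comparison described above could be automated roughly as follows. This is only a sketch: it assumes debug.txt holds one whitespace-separated row per iteration, which the thread does not confirm, and the helper names (read_column, is_zero_or_nan, suspicious_rows) are hypothetical. It flags rows where the GPU value is 0 or NaN while the reference value is not.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Returns the 1-based column `col` of every line in `in`, as text.
// Lines with fewer than `col` fields yield an empty string.
// Assumption: fields are separated by whitespace.
std::vector<std::string> read_column(std::istream& in, std::size_t col) {
    assert(col >= 1);
    std::vector<std::string> values;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string field, picked;
        for (std::size_t i = 1; fields >> field; ++i) {
            if (i == col) { picked = field; break; }
        }
        values.push_back(picked);
    }
    return values;
}

// True if `text` starts with a number that is 0 or NaN.
bool is_zero_or_nan(const std::string& text) {
    if (text.empty()) return false;
    char* end = nullptr;
    const double v = std::strtod(text.c_str(), &end);
    if (end == text.c_str()) return false;  // nothing numeric was parsed
    return v == 0.0 || std::isnan(v);
}

// Returns 0-based row indices where the GPU value is 0 or NaN
// but the reference value is neither.
std::vector<std::size_t> suspicious_rows(const std::vector<std::string>& gpu,
                                         const std::vector<std::string>& ref) {
    std::vector<std::size_t> rows;
    const std::size_t n = gpu.size() < ref.size() ? gpu.size() : ref.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (is_zero_or_nan(gpu[i]) && !is_zero_or_nan(ref[i])) rows.push_back(i);
    }
    return rows;
}
```

To use it, open the AMD and CPU debug.txt files with std::ifstream, call read_column(file, 15) on each, and pass both results to suspicious_rows. Comparing only for 0 and NaN, rather than for equality, keeps ordinary rounding differences between vendors from being reported.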

Thanks for sharing the executable. When I ran it on a Carrizo, I didn't observe any erroneous zero values. In my case, the values in "debug.txt" (especially the 15th column, as you mentioned) were closer to those in your debug.txt under the cpu folder. At this moment, I don't know whether the issue is specific to those cards. I'll get hold of one of those cards and check it. Btw, it would be helpful if the executable could select the target device so that it could be run on a CPU too.

Regarding the driver version, please check "AMD Radeon settings->Software" to see more details about the driver version / driver packaging version. Here is the latest one: Desktop.

Regards,

Thank you for testing. To rule out behavioral differences caused by different rounding methods, please also try running the application with lines 543-546 of the kernel removed. This will render the whole scene. You can find a reference scene at https://imagebin.ca/v/3a31XSEkxrik. There should be no real visible differences.

The application itself uses Boost Compute internally, so you can change the device simply by setting the corresponding environment variables.

See boost/compute/system.hpp - 1.63.0.

The software version 17.1.1 was shown in the settings window for the driver.

Thanks. Actually, a more recent driver (17.9.1) is available here: Desktop. Please check with this and share your findings. Meanwhile, I'll try to reproduce it at my end.

Regards

Sorry for the late reply. Sadly, I have no means to update the drivers on the test system due to missing privileges.

How did it turn out on your side?

Running on an HD 7870 with 17.9.1, I observed similar zero values in debug.txt. However, I couldn't get it to run on the CPU by setting the Boost environment variable (e.g. set BOOST_COMPUTE_DEFAULT_DEVICE_TYPE="CPU"). In that case, kernel.bin and debug.txt were always the same.
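One thing worth ruling out with a command of that form: in cmd.exe, set NAME="CPU" stores the quotation marks as part of the value, so the process sees "CPU" (with quotes) rather than CPU. Whether Boost Compute then ignores the value is not confirmed anywhere in this thread. The sketch below is a hypothetical standalone check that uses only the C++ standard library; the names DeviceTypeRequest, classify_device_type and describe are made up for illustration, and the exact-match comparison is an assumption, not Boost Compute's actual logic.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical classification of the raw environment value.
enum class DeviceTypeRequest { Unset, Cpu, Gpu, Unrecognized };

// Classifies the raw value exactly as the process receives it.
// A value such as "\"CPU\"" (quotes included) is reported as Unrecognized.
DeviceTypeRequest classify_device_type(const char* raw) {
    if (raw == nullptr) return DeviceTypeRequest::Unset;
    const std::string value(raw);
    if (value == "CPU") return DeviceTypeRequest::Cpu;
    if (value == "GPU") return DeviceTypeRequest::Gpu;
    return DeviceTypeRequest::Unrecognized;
}

// Reads the variable named in this thread from the current environment.
DeviceTypeRequest requested_device_type() {
    return classify_device_type(std::getenv("BOOST_COMPUTE_DEFAULT_DEVICE_TYPE"));
}

// Human-readable summary for logging at program start.
const char* describe(DeviceTypeRequest request) {
    switch (request) {
        case DeviceTypeRequest::Unset:        return "not set";
        case DeviceTypeRequest::Cpu:          return "CPU";
        case DeviceTypeRequest::Gpu:          return "GPU";
        case DeviceTypeRequest::Unrecognized: return "unrecognized (stray quotes?)";
    }
    assert(false && "unhandled DeviceTypeRequest");
    return "";
}
```

Printing describe(requested_device_type()) at the start of the test application would show whether the value arrived as intended before any OpenCL device is chosen.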

As for removing lines 543-546 from the kernel file, I see the following code segment there. Are these the lines you meant?

const int cy = y - convert_int(self.height / 2);
const int cx = x - convert_int(self.width / 2);
float3 value = (float3)(0.0f);
if (self.light.y >= 0.0f) {

Sorry, it seems that I grabbed the wrong version of the kernel.cl file on my side. I meant to remove lines 539-542, i.e. the following code segment:

if (x != 141 || y != 111) {
    out[(y * self.width) + x] = (uchar4)(0, 0, 0, 0);
    return;
}

This makes debug.txt quite useless, but renders the whole scene in debug.png, which should make it possible to see whether there are visible differences from the reference image https://imagebin.ca/v/3a31XSEkxrik. The code above selects a single pixel of the scene for deeper analysis. If you detect any differences, I can provide the reference values from debug.txt for the pixel in question.
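If comparing the two images by eye is inconclusive, a per-pixel comparison with a small tolerance could help. The following is a minimal sketch that assumes both images have already been decoded into tightly packed 8-bit RGBA buffers of the same dimensions; PNG decoding is left out, and count_differing_pixels is a hypothetical helper, not part of the test application.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Counts pixels in which at least one channel differs by more than `tolerance`.
// Both buffers are expected to hold 8-bit RGBA data, 4 bytes per pixel.
// If the buffers differ in length, only the common prefix is compared.
std::size_t count_differing_pixels(const std::vector<std::uint8_t>& a,
                                   const std::vector<std::uint8_t>& b,
                                   int tolerance) {
    assert(tolerance >= 0);
    const std::size_t bytes = a.size() < b.size() ? a.size() : b.size();
    std::size_t differing = 0;
    for (std::size_t p = 0; p + 4 <= bytes; p += 4) {
        for (std::size_t c = 0; c < 4; ++c) {
            const int delta = std::abs(static_cast<int>(a[p + c]) -
                                       static_cast<int>(b[p + c]));
            if (delta > tolerance) {
                ++differing;
                break;  // count each pixel at most once
            }
        }
    }
    return differing;
}
```

A tolerance of 1 or 2 would absorb rounding differences between devices, while pixels corrupted by a zeroed or NaN variable would be expected to differ by much more.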

Also, for me the following invocations produced different results (even though the differences were quite marginal):

---------------------------------------------------------------------------------------------------------------------------------

set BOOST_COMPUTE_DEFAULT_DEVICE_TYPE=CPU

test.exe

---------------------------------------------------------------------------------------------------------------------------------

set BOOST_COMPUTE_DEFAULT_DEVICE_TYPE=GPU

test.exe

Thanks. Actually, I already tried this command-line option. Whenever the device type was set to CPU, the debug.txt file was empty, though I didn't get any error on the command line.

Regards,

Strange. It worked just fine for me. I also get any build errors from the OpenCL driver printed on the command line.

Were debug.png and kernel.bin generated? If not, I suspect some kind of memory access violation.

However, that would then depend on the OpenCL driver.

Nevertheless, you were able to reproduce the issue seen on my side with the latest driver on an HD7870, right?

Is there anything I can do to help locate the cause of this issue? What would be the next steps to get this fixed?

From my tests, it seems that the zero values are generated on the SI card (HD 7870), but not on Hawaii or Carrizo. I couldn't run it on the CPU to verify the results (I get non-empty debug.png and kernel.bin files but an empty debug.txt file).

The compiler toolchain used for those SI ASICs is almost stagnant, and fixes only come for critical bugs such as segfaults, build errors, etc. I don't know whether your issue qualifies, but I can file a bug report if required. For the bug report, it would be helpful if you removed the unrelated code and provided a minimal test case that reproduces the above observation. Thanks.

Sorry for the late reply. I tried to shorten the example, but the result is still quite large. The issue still occurs in this version: https://filebin.ca/3bOosRN2sKO0. If, for example, I replace the expression in line 506 with const bool hit = false; the issue is gone. Therefore, I cannot really simplify it any further.

Maybe there is some problem with the inlining depth? Or with the register allocation? I cannot tell. So please file a bug report. I hope there will be a patch...

As for the CPU run, it seems that the kernel was not executed. Sadly, I cannot tell you why, because there seems to be no error from the OpenCL API.

Thanks for pointing out the suspected code segment. I'll file a bug report and mention it there.

Regards,

Ok, thank you a lot for your time. How will I know about any progress there?

Once there is an update, I'll get a notification from the bug tracking system and I'll share it with you. Currently, the system is accessible internally only.

Regards,
