Archives Discussions

pswwsp · ‎11-15-2012

Hello,

I've got a working OpenCL kernel that calculates the SHA-256 hash of a VERY long string, and it takes too much time. I decided to split this kernel into several parts and save the intermediate results in a global buffer. This means the same kernel is called several times: if the calculation is not completed, the previous intermediate context is loaded from this buffer, more of the hash is calculated, and the new intermediate result is saved again.
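The split-and-resume scheme described above can be sketched on the host side in plain C. This is only an illustration under stated assumptions: FNV-1a stands in for SHA-256 to keep it short, and the names hash_ctx, hash_step and CHUNK are hypothetical, not taken from the attached sample. Each call loads the saved context, processes at most one chunk, and stores the context back, which is the role the global buffer plays between kernel launches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK 64u  /* bytes processed per call (hypothetical value) */

/* Intermediate state kept between calls, like the global buffer
   that holds the context between kernel launches. */
typedef struct {
    uint32_t state;   /* running hash value */
    size_t   offset;  /* number of input bytes already consumed */
} hash_ctx;

static void hash_init(hash_ctx *ctx)
{
    ctx->state = 2166136261u;  /* FNV-1a 32-bit offset basis */
    ctx->offset = 0;
}

/* Process at most CHUNK more bytes; returns 1 when the input is done. */
static int hash_step(hash_ctx *ctx, const unsigned char *data, size_t len)
{
    assert(ctx->offset <= len);
    size_t end = ctx->offset + CHUNK;
    if (end > len)
        end = len;
    for (size_t i = ctx->offset; i < end; i++) {
        ctx->state ^= data[i];
        ctx->state *= 16777619u;  /* FNV-1a 32-bit prime */
    }
    ctx->offset = end;
    return ctx->offset == len;
}

/* One-pass reference, used to check the split version. */
static uint32_t hash_all(const unsigned char *data, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}
```

Calling hash_step in a loop until it returns 1 should leave the same value in ctx.state as hash_all returns for the same input, which is the property the split kernel relies on.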

Unfortunately, the new split kernel is not working: the global buffer with the intermediate result always contains zeros. But:

It fails only on the GCN architecture (I've tested on Cape Verde). I've tested Catalyst versions 12.3 through 12.11. It works fine on VLIW5 and NVIDIA GPUs.
If I print the intermediate buffer using printf(), the code works OK.
If I comment out some lines in the code, the buffer is also non-zero.

A minimal sample is attached. If there is a proper way to report such a bug, please let me know; I couldn't find one. Thanks.

binying · ‎11-15-2012

And are you using the latest driver?

pswwsp · ‎11-15-2012

Yes, I've tried every version from 12.3 to 12.11.

yurtesen · ‎11-16-2012

What do you mean when you say it doesn't work? Does it crash, or do you get wrong results?

pswwsp · ‎11-16-2012

It prints all zeros instead of real results. If you compile the sample, you will see either zeros (incorrect) or non-zero result (correct). If you use printf, the result is always correct.

yurtesen · ‎11-17-2012

I had a quick look at your code. If you use CL_MEM_ALLOC_HOST_PTR with clCreateBuffer, shouldn't you be using map/unmap? Also, according to the AMD documentation, this memory object will be in host memory and not in device memory (that's perhaps all right, unless this is not what you want). Did you try making cglobal a pointer and then using map/unmap to access it? This way you can avoid allocating the memory twice and the unnecessary copy operations in between.

Also, if the device is doing async operations, the buffer could be read before the kernel finishes execution; try a cl_wait after the kernel enqueue. There is an example of this in the OpenCL PDF (see page 1-20):

http://developer.amd.com.php53-23.ord1-1.websitetestlink.com/wordpress/media/2012/10/AMD_Accelerated...

The printf might be adding enough time for the kernel to complete and correct values to be read.

binying · ‎11-19-2012

Indeed, I can reproduce this problem on 79xx (GCN Architecture) on Win7 & Catalyst 12.11.

I downloaded gcn-error.zip from the link pswwsp provided. The code compiles and runs well without any modification. It prints out non-zero "ctx", which is correct according to pswwsp.

After either of the two "printf"s (Line 200 and Line 220 in zerobuf-gcn.cl) is uncommented, the output shows that the values of "ctx" are all zeros, which is incorrect.

pswwsp · ‎11-19-2012

Thanks for reproducing this. But you probably mistyped - if printf is uncommented, the result should be correct (non-zero ctx).

pswwsp · ‎11-19-2012

Thanks for looking at my code. As for CL_MEM_ALLOC_HOST_PTR, it's really an incorrect flag, because the buffer should be placed on the device. Unfortunately, removing this flag doesn't help.

As for clWaitForEvents (did you mean this?), I guess it's not needed because reading the buffer via clEnqueueReadBuffer is blocking. Anyway, I've tried inserting clWaitForEvents before the buffer read, and it doesn't fix the bug.

I've attached the fixed code, and it behaves exactly the same as the previous version. That's why I think it's not my bug, but a bug in the OpenCL compiler or runtime.

binying · ‎11-19-2012

Removing CL_MEM_ALLOC_HOST_PTR or inserting clWaitForEvents doesn't help.

--Yes, this is what I've seen.

binying · ‎11-19-2012

I narrowed the problem down a little. Please replace the zerobuf-gcn.cl inside gcn-error.zip with the attached zerobuf-gcn.cl. With the modified zerobuf-gcn.cl, you will find that it doesn't matter whether the "printf" inside the kernel is commented out or not.

I would say something inside the kernel triggered the problem. It may or may not be a bug...

pswwsp · ‎11-20-2012

On my GPU (CapeVerde, 7770) your code works as expected in both cases (with or without printf). The buffer is not zero and contains the init values. (There is a small typo in line 179: [tid*8+0]).

yurtesen · ‎11-20-2012

AFAIK you need to wait for kernel execution to finish with clWaitForEvents or clFinish.

You have:

#define THREADS_PER_BLOCK 128

#define MAX_BLOCKS 512

in your .cpp file:

int cglobal [MAX_BLOCKS * THREADS_PER_BLOCK * 8];

Use calloc for cglobal... (I don't trust this allocation)

int grid = 32;

globalWorkSize = THREADS_PER_BLOCK * grid;

In your kernel:

#define SHA_LONG unsigned int

__global SHA_LONG *c_global,

First of all, you allocate int and then use uint (I guess it doesn't matter in this case, but...)

Then you access it inside your kernel with tid * 8 + [0-7], which gives a maximum index of 32767 (for 128 * 32 = 4096 work-items), while you allocated 524288 ints (MAX_BLOCKS * THREADS_PER_BLOCK * 8). Sounds unnecessarily high? (unless I calculated something wrong?)
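For reference, the two sizes being compared here can be worked out from the defines quoted above. This sketch assumes tid is the global work-item id (0 to globalWorkSize - 1), which the quoted fragments do not show; the helper names and WORDS_PER_THREAD are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define THREADS_PER_BLOCK 128
#define MAX_BLOCKS 512
#define WORDS_PER_THREAD 8  /* each work-item uses c_global[tid*8 + 0..7] */

/* Number of ints the host allocates for cglobal. */
static size_t allocated_words(void)
{
    return (size_t)MAX_BLOCKS * THREADS_PER_BLOCK * WORDS_PER_THREAD;
}

/* Highest index touched when `grid` blocks are launched,
   assuming tid runs from 0 to grid * THREADS_PER_BLOCK - 1. */
static size_t highest_index(size_t grid)
{
    assert(grid >= 1 && grid <= MAX_BLOCKS);
    size_t global_work_size = grid * THREADS_PER_BLOCK;
    return (global_work_size - 1) * WORDS_PER_THREAD + (WORDS_PER_THREAD - 1);
}
```

With grid = 32 the highest index is 32767 against 524288 allocated ints (2 MiB for 4-byte ints), so the buffer is sized for the MAX_BLOCKS case rather than for the launch actually used.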

Of course probably none of these are the cause of the problem...

I guess I should compile your code and test it. I am a bit busy, but I will try tomorrow.

pswwsp · ‎11-20-2012

Thanks for trying to find a bug in my code. As for clWaitForEvents, it doesn't help (see above).

As for MAX_BLOCKS, the code is designed to use a variable number of blocks, up to 512. Unfortunately, the problem is somewhere else, probably in the OpenCL compiler/runtime.

yurtesen · ‎11-20-2012

OK, let's recap: I ran your code and I should see zeros? (The latest one?)

~/temp/test$ ./zerobuf
1 platforms detected
Platform 0:
        Vendor: Advanced Micro Devices, Inc.
        Name: AMD Accelerated Parallel Processing
1 devices detected
Device 0:
        Device: Advanced Micro Devices, Inc.
        Name: Tahiti
        Max threads: 256
        Max cores: 32
        Max Threads (by kernel): 256
        Multiply (by kernel): 64
        Compiled threads (by kernel): 0
Device #0, Block size is: 32 x 128 (-m32), step = 2
ctx host   = 2f0f1c 2f0f1c 30101d 2d0d1a
ctx host   = 2f0f1c 2f0f1c 30101d 2d0d1a
If ctx host is all zeros, this is the bug!

NOTE: Ah right, I got zeros the second time I ran it. Is that the problem?

pswwsp · ‎11-20-2012

Not tested under Linux (will try in a few hours). Under Windows I've got zeros in every launch. You've got zeros only once, right?

yurtesen · ‎11-20-2012

Also, am I supposed to be getting random results on each run?

yurtesen · ‎11-20-2012

Not really... I am getting different results at each run

eyurtese@extremum-desktop:~/temp/test$ ./zerobuf |grep 'ctx host '
ctx host   = 0000 0000 0000 01fe
ctx host   = 0000 0000 0000 01fe
eyurtese@extremum-desktop:~/temp/test$ ./zerobuf |grep 'ctx host '
ctx host   = e5382000 ffff8803 0000 01fe
ctx host   = e5382000 ffff8803 0000 01fe
eyurtese@extremum-desktop:~/temp/test$

yurtesen · ‎11-20-2012

	status = clEnqueueReadBuffer(cmdQueue, d_cglobal, CL_TRUE, 0,
	sizeof (int) * MAX_BLOCKS * THREADS_PER_BLOCK * 8, &cglobal,
	0, NULL, NULL);

You are reading into the address of the pointer, not where it points... shouldn't this be cglobal instead of &cglobal? Just saying. I think there are other problems in your code too; I wouldn't blame the SDK just yet.
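Whether the extra & matters depends on how cglobal is declared. A minimal C sketch, under the assumption that the buffer is either a fixed array like the one in the .cpp fragment quoted earlier or a calloc'd pointer as suggested (the helper names and the small N are hypothetical): for an array, cglobal and &cglobal are the same address, but for a pointer, &cglobal is the address of the pointer variable itself, so a read into it would overwrite the pointer and whatever follows it rather than fill the buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define N 16  /* small stand-in for MAX_BLOCKS * THREADS_PER_BLOCK * 8 */

/* Case 1: cglobal declared as an array. Returns 1 if both
   spellings refer to the same address. */
static int array_case_same_address(void)
{
    int cglobal[N] = {0};
    /* &cglobal has type int (*)[N], but it points at the same byte
       as cglobal, which decays to &cglobal[0]. */
    return (uintptr_t)(void *)&cglobal == (uintptr_t)(void *)cglobal;
}

/* Case 2: cglobal declared as a pointer to heap memory. */
static int pointer_case_same_address(void)
{
    int *cglobal = calloc(N, sizeof *cglobal);
    assert(cglobal != NULL);
    /* &cglobal is where the pointer variable lives (on the stack);
       cglobal is where the N ints live (on the heap). */
    int same = (uintptr_t)(void *)&cglobal == (uintptr_t)(void *)cglobal;
    free(cglobal);
    return same;
}
```

So with the array declaration both spellings read into the same memory, while after a switch to calloc only the version without & is correct.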

pswwsp · ‎11-20-2012

Maybe it's a small bug in the code, because this clEnqueueReadBuffer call was inserted just for debugging and printing purposes. As I said, I didn't test it under Linux/gcc. Please remove the '&' before cglobal.

yurtesen · ‎11-20-2012

Well, I give up. Maybe you are right. I tried on a CPU (Bulldozer) and it crashed, yet your program seems to work on an Intel CPU with AMD OpenCL, also with the Intel OCL SDK on an Intel CPU (I did not test on Bulldozer with this), and on some Tesla cards.

I tried to figure it out by putting values into the output array instead of printing:

c_global[0] = HashC_end

c_global[1] = HashRounds

Funnily, although these looked different, it didn't seem to enter

if (HashC_end != HashRounds) {

(where I had assigned something to c_global[2] )

You know, you can try running it with CodeXL to see what it does (at least I hear it is supposed to be able to debug OpenCL code line by line).

Also, perhaps you should use atomic_add or atomic_inc (according to Khronos, these are the 32-bit versions).

Sorry that I couldn't be more help. Maybe you are right, it might be a bug. It is strange that it gets worse on the CPU.

You should definitely report this to AMD...

pswwsp · ‎11-21-2012

Thanks for your help. I've just tested on Linux (Ubuntu 12.04, 64-bit) and the code works!

So it fails only:

1) on AMD GCN

2) on Windows

3) if printf is not used

How do I report this bug to AMD? I can't find the right URL.

yurtesen · ‎11-21-2012

Actually, it doesn't work for me (or maybe it was the & in front of cglobal; I am not sure if I ran it without it). I have exactly the same setup as you on Linux. I also tested on Cypress, and it appears to run on Cypress.

You can start trying from:

http://developer.amd.com/support/

pswwsp · ‎01-20-2013

Unfortunately, the 13.1 drivers have the same bug. I sent a bug report using the http://www.amdsurveys.com site, but it seems this was not successful. Could anyone help submit this bug to the AMD team?

himanshu_gautam · ‎01-20-2013

Will check this out and raise the issue with the engineering team if needed (or if it's not already being tracked).

himanshu_gautam · ‎01-21-2013

I could reproduce it.

Looks like the compiler was optimizing out the code.

1. Making ‘HashRounds’ variable volatile solves the issue temporarily -- file zerobuf-gcn.cl, line 206

2. Alternatively, disabling optimization by passing "-cl-opt-disable" to clBuildProgram() also solves the issue
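The first workaround can be sketched in plain C, since OpenCL C takes the volatile qualifier from C99. The loop body and the names run_rounds and hash_rounds are hypothetical stand-ins, not the code from zerobuf-gcn.cl; the point is only that every access to a volatile object must actually be performed, so the optimizer cannot drop the counter or the comparison on it. For the second workaround, the option string is passed as the options argument (the fourth parameter) of clBuildProgram.

```c
#include <assert.h>

/* Runs rounds start..end-1 of a toy mixing step and returns the state. */
static unsigned int run_rounds(unsigned int start, unsigned int end)
{
    assert(start <= end);
    /* volatile: each read and write of hash_rounds must really be
       performed, so the optimizer may not remove or fold them. */
    volatile unsigned int hash_rounds = start;
    unsigned int state = 0x6a09e667u;  /* arbitrary seed */

    while (hash_rounds != end) {
        unsigned int r = hash_rounds;   /* one volatile read */
        state = ((state << 5) | (state >> 27)) ^ r;
        hash_rounds = r + 1;            /* one volatile write */
    }
    return state;
}
```

The qualifier changes only what the compiler is allowed to optimize, not the result, which is why it can serve as a stopgap until the driver is fixed.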

pswwsp · ‎01-21-2013

Thanks a lot, I'll make HashRounds volatile as a temporary solution.

pswwsp · ‎02-12-2013

Seems to be fixed in 13.02 beta. Thanks all for help!

himanshu_gautam · ‎02-12-2013

Great to know, and thanks for coming back on this.

Bugs and driver releases have their own life cycle, so fixes may not appear immediately, but eventually they do.

Thanks for your time,

Bug in OpenCL (GCN only)