Hello,
I've got a working OpenCL kernel that calculates the SHA-256 hash of a VERY long string, and it takes too much time. I decided to split this kernel into several parts and save the intermediate results in a global buffer. This means the same kernel is called several times: if the calculation is not complete, the previous intermediate context is loaded from this buffer, more of the hash calculation is performed, and the new intermediate result is saved again.
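The multi-pass scheme described above can be sketched in plain C (a toy rolling checksum stands in for SHA-256 here; hash_ctx and the update rule are illustrative, not the thread's actual code — in the OpenCL version each ctx_update call corresponds to one kernel launch, with the context persisted in a global buffer between launches):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a hash context: a real SHA-256 context would hold
 * eight 32-bit state words plus a message-length counter. */
typedef struct {
    uint32_t state;
    size_t   processed;
} hash_ctx;

static void ctx_init(hash_ctx *c) {
    c->state = 0x12345678u;
    c->processed = 0;
}

/* Process one chunk, updating the context in place.  Because the state
 * carries over, hashing the data in several chunks gives the same result
 * as hashing it in one pass. */
static void ctx_update(hash_ctx *c, const uint8_t *data, size_t len) {
    for (size_t i = 0; i < len; i++)
        c->state = c->state * 31u + data[i];
    c->processed += len;
}
```

The key invariant (and the thing that was failing in the driver) is that the context read back at the start of pass N+1 must equal the context written at the end of pass N.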
Unfortunately, the new split kernel is not working – the global buffer with the intermediate result always contains zeros.
A minimal sample is attached. If there is a proper way to report such a bug, please let me know – I couldn't find it. Thanks.
And are you using the latest driver?
Yes, I've tried all versions from 12.3 to 12.11.
What do you mean it doesn't work? Does it crash, or do you get wrong results?
It prints all zeros instead of real results. If you compile the sample, you will see either zeros (incorrect) or non-zero result (correct). If you use printf, the result is always correct.
I had a quick look at your code. If you use CL_MEM_ALLOC_HOST_PTR with clCreateBuffer, shouldn't you be using map/unmap? Also, according to the AMD documentation, this memory object will be in host memory, not device memory (which is perhaps alright, unless that's not what you want). Did you try making cglobal a pointer and then using map/unmap to access it? That way you can avoid allocating the memory twice and the unnecessary copy operations in between.
Also, if the device is doing async operations, the buffer could be read before the kernel finishes execution; try a cl_wait after the kernel enqueue. There is an example of this in the OpenCL PDF (see page 1-20).
The printf might be adding enough time for the kernel to complete and the correct values to be read.
Indeed, I can reproduce this problem on 79xx (GCN Architecture) on Win7 & Catalyst 12.11.
I downloaded gcn-error.zip from the link pswwsp provided. The code compiles and runs well without any modification. It prints out non-zero "ctx", which is correct according to pswwsp.
After either of the two "printf"s (Line 200 and Line 220 in zerobuf-gcn.cl) is uncommented, the output shows that the values of "ctx" are all zeros, which are incorrect.
Thanks for reproducing this. But you probably mistyped - if printf is uncommented, the result should be correct (non-zero ctx).
Thanks for looking at my code. As for CL_MEM_ALLOC_HOST_PTR, you're right that it's the wrong flag, because the buffer should be placed on the device. Unfortunately, removing this flag doesn't help.
As for clWaitForEvents (did you mean this?), I guess it's not needed because reading the buffer via clEnqueueReadBuffer is blocking. Anyway, I've tried inserting clWaitForEvents before the buffer read, and it doesn't fix the bug.
I've attached the fixed code, and it works exactly the same as before. That's why I think it's not my bug, but a bug in the OpenCL compiler or runtime.
Removing CL_MEM_ALLOC_HOST_PTR or inserting clWaitForEvents doesn't help.
--Yes, this is what I've seen.
I narrowed the problem down a little bit. Please replace the zerobuf-gcn.cl inside gcn-error.zip with the attached zerobuf-gcn.cl. You will find that with the modified zerobuf-gcn.cl, it no longer matters whether we comment out the "printf" inside the kernel or not.
I would say something inside the kernel triggered the problem. It may or may not be a bug...
On my GPU (CapeVerde, 7770) your code works as expected in both cases (with or without printf). The buffer is not zero and contains the init values. (There is a small typo in line 179: [tid*8+0]).
AFAIK you need to wait for kernel execution to finish with clWaitForEvents or clFinish.
You have:
#define THREADS_PER_BLOCK 128
#define MAX_BLOCKS 512
in your .cpp file:
int cglobal [MAX_BLOCKS * THREADS_PER_BLOCK * 8];
Use calloc for cglobal... (I don't trust this allocation)
int grid = 32;
globalWorkSize = THREADS_PER_BLOCK * grid;
In your kernel:
#define SHA_LONG unsigned int
__global SHA_LONG *c_global,
First of all, you allocate int and then use uint (I guess it doesn't matter in this case, but...)
Then you access it inside your kernel with tid * 8 + [0-7], which will be a maximum of 32776, while you allocated over half a million ints. Sounds unnecessarily high? (unless I calculated something wrong?)
Of course probably none of these are the cause of the problem...
I guess I should compile your code and test it; I am a bit busy, but I will try tomorrow.
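The calloc suggestion above is easy to sketch (constants copied from the posted #defines; alloc_cglobal is an illustrative helper, not the thread's actual code):

```c
#include <stdlib.h>

#define THREADS_PER_BLOCK 128
#define MAX_BLOCKS 512

/* calloc returns zero-initialized memory, so the host-side copy of the
 * intermediate-context buffer starts out as all zeros instead of
 * whatever happens to be on the stack. */
static int *alloc_cglobal(size_t *count_out) {
    size_t n = (size_t)MAX_BLOCKS * THREADS_PER_BLOCK * 8;
    *count_out = n;
    return calloc(n, sizeof(int));
}
```

A heap allocation also avoids putting a half-megabyte-plus array on the stack.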
Thanks for trying to find a bug in my code. As for clWaitForEvents, it doesn't help (see above).
As for MAX_BLOCKS, the code is designed to use a variable number of blocks, up to 512. Unfortunately, the problem is somewhere else, probably in the OpenCL compiler/runtime.
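For reference, the sizes involved can be checked with a few constants (values copied from the posted code; this assumes tid ranges over the global work size, so the exact maximum index may differ slightly from the figure quoted earlier in the thread):

```c
#define THREADS_PER_BLOCK 128
#define MAX_BLOCKS 512

/* grid = 32 comes from the posted host code; the buffer is sized for
 * the MAX_BLOCKS = 512 upper bound, so most of it goes unused at this
 * grid size -- wasteful, but not incorrect. */
enum {
    GRID           = 32,
    GLOBAL_SIZE    = GRID * THREADS_PER_BLOCK,           /* 4096 work-items  */
    MAX_INDEX_USED = (GLOBAL_SIZE - 1) * 8 + 7,          /* highest tid*8+7  */
    INTS_ALLOCATED = MAX_BLOCKS * THREADS_PER_BLOCK * 8  /* full buffer size */
};
```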
OK, let's recap: I ran your code and I should see zeros? (the latest one?)
~/temp/test$ ./zerobuf
1 platforms detected
Platform 0:
Vendor: Advanced Micro Devices, Inc.
Name: AMD Accelerated Parallel Processing
1 devices detected
Device 0:
Device: Advanced Micro Devices, Inc.
Name: Tahiti
Max threads: 256
Max cores: 32
Max Threads (by kernel): 256
Multiply (by kernel): 64
Compiled threads (by kernel): 0
Device #0, Block size is: 32 x 128 (-m32), step = 2
ctx host = 2f0f1c 2f0f1c 30101d 2d0d1a
ctx host = 2f0f1c 2f0f1c 30101d 2d0d1a
If ctx host is all zeros, this is the bug!
NOTE: ah right, I got zeros the second time I ran it. Is that the problem?
Not tested under Linux yet (will try in a few hours). Under Windows I get zeros on every launch. You got zeros only once, right?
Also, am I supposed to be getting random results on each run?
Not really... I am getting different results at each run
eyurtese@extremum-desktop:~/temp/test$ ./zerobuf |grep 'ctx host '
ctx host = 0000 0000 0000 01fe
ctx host = 0000 0000 0000 01fe
eyurtese@extremum-desktop:~/temp/test$ ./zerobuf |grep 'ctx host '
ctx host = e5382000 ffff8803 0000 01fe
ctx host = e5382000 ffff8803 0000 01fe
eyurtese@extremum-desktop:~/temp/test$
status = clEnqueueReadBuffer(cmdQueue, d_cglobal, CL_TRUE, 0,
    sizeof (int) * MAX_BLOCKS * THREADS_PER_BLOCK * 8, &cglobal,
    0, NULL, NULL);
You are reading into the address of the pointer, not where it points... shouldn't this be cglobal instead of &cglobal? Just saying. I think there are other problems in your code too; I wouldn't blame the SDK just yet.
Maybe it's a small bug in the code, because this clEnqueueReadBuffer call was inserted just for debugging and printing purposes. As I said, I didn't test it under Linux/gcc. Please remove the '&' before cglobal.
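As an aside on the '&cglobal' point: whether the stray '&' actually changes where the data lands depends on how cglobal is declared, which is easy to check in plain C (illustrative declarations, not the thread's actual code):

```c
/* If cglobal is an array, 'cglobal' and '&cglobal' evaluate to the same
 * address (only the pointer type differs), so the read still lands in
 * the right memory despite the type mismatch.  If cglobal were a
 * pointer, '&cglobal' would be the address of the pointer variable
 * itself, and the read would clobber it. */
static int cglobal_arr[8];
static int *cglobal_ptr = cglobal_arr;
```

Since the posted code declares cglobal as an array, the '&' most likely produces a compiler warning rather than wrong data, which fits the author's observation that removing it changes nothing.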
Well, I give up. Maybe you are right. I tried on a CPU (Bulldozer) and it crashed, yet your program seems to work on an Intel CPU with AMD OpenCL, also with the Intel OCL SDK on an Intel CPU (did not test on Bulldozer with this), and on some Tesla cards.
I tried to figure it out by not printing but putting values into the output array:
c_global[0] = HashC_end
c_global[1] = HashRounds
Funnily enough, although these looked different, it didn't seem to enter
if (HashC_end != HashRounds) {
(where I had assigned something to c_global[2])
You know, you could try running it with CodeXL and see what it does (at least I hear it is supposed to be able to debug OpenCL code line by line).
Also, perhaps you should use atomic_add or atomic_inc (according to Khronos, these are the 32-bit versions).
Sorry that I couldn't be more help. Maybe you are right; it might be a bug. It is strange that it gets worse on the CPU.
You should definitely report this to AMD...
Thanks for your help. I've just tested on Linux (Ubuntu 12.04, 64-bit) and the code works!
So, it's not working only:
1) on AMD GCN
2) on Windows
3) if printf is not used
How to report this bug to AMD? I can't find the right URL.
Actually, it doesn't work for me (or maybe it was the & in front of cglobal; I am not sure if I ran it without it). I have exactly the same setup as you have on Linux. I also tested on Cypress, and it appears to run fine there.
You can start trying from:
Unfortunately, the 13.1 drivers have the same bug. I sent the bug report using the http://www.amdsurveys.com site, but it seems this was not successful. Could anyone help submit this bug to the AMD team?
Will check this out and raise the issue with the engineering team, if needed (or if it's not already being tracked).
I could reproduce it.
Looks like the compiler was optimizing out the code.
1. Making the 'HashRounds' variable volatile solves the issue temporarily -- file zerobuf-gcn.cl, line 206
2. Alternatively, disabling optimization by passing "-cl-opt-disable" to clBuildProgram() also solves the issue
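The volatile workaround can be illustrated in plain C (names borrowed from the thread; the loop body is a stand-in, not the real kernel code):

```c
/* Sketch of the workaround: marking the round counter volatile forces
 * every access to go through memory, which stops the compiler from
 * concluding that the loop and the stores depending on it are dead and
 * optimizing them away.  'HashRounds' and 'HashC_end' are names taken
 * from the thread; the arithmetic here is purely illustrative. */
static unsigned int run_rounds(unsigned int HashC_end) {
    volatile unsigned int HashRounds = 0;  /* the one-word workaround */
    unsigned int acc = 0;
    while (HashRounds != HashC_end) {
        acc += HashRounds;
        HashRounds = HashRounds + 1;       /* re-read/re-written each pass */
    }
    return acc;
}
```

The "-cl-opt-disable" alternative goes in the options string of clBuildProgram and turns off optimization for the whole program, so the single volatile is the more surgical workaround.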
Thanks a lot, I'll make HashRounds volatile as a temporary solution.
Seems to be fixed in 13.02 beta. Thanks all for help!
Great to know, and thanks for coming back on this.
Bugs and driver releases have their own life cycle, so fixes may not appear immediately... but eventually they do.
Thanks for your time,