I do not have any idea about the ranlux algorithm, so maybe someone else can guide you better. Anyhow, per the OpenCL spec, this "invalid work group size" error means:
CL_INVALID_WORK_GROUP_SIZE if local_work_size is specified and number of workitems
specified by global_work_size is not evenly divisible by size of work-group given
by local_work_size or does not match the work-group size specified for kernel using the
__attribute__((reqd_work_group_size(X, Y, Z))) qualifier in program
It may be helpful.
Thank you so much for your comment. I will try to find help from others, and if I get it working I will post the solution here.
I ran the code on an NVIDIA GPU machine and it works fine. However, the ATI machine still produces errors. As you indicated, it might be a driver/SDK issue. I have heard that NVIDIA GPUs are more complex than ATI hardware, but on the Mac the ATI machine produces even more errors. I hope this problem will be solved in the future. I really appreciate your comments and help.
Are you trying it on a MacBook? The OpenCL implementation for MacBooks is written and maintained by Apple.
Originally posted by: joohongyee Hi Himanshu,
I ran the code on an NVIDIA GPU machine and it works fine. However, the ATI machine still produces errors. As you indicated, it might be a driver/SDK issue. I have heard that NVIDIA GPUs are more complex than ATI hardware, but on the Mac the ATI machine produces even more errors. I hope this problem will be solved in the future. I really appreciate your comments and help. Joohong
Thank you for your interest and comments joohongyee. I've found out that with the latest SDK (AMD APP SDK 2.3) I get wrong (but still seemingly pseudo-random) sequences on my HD 5870. My CPU still works correctly though. Using the older ATI Stream SDK 2.0.1 with Catalyst 10.1 my GPU is producing correct results, and since you got it to work on an Nvidia machine it sounds like there's a problem with the more recent AMD implementations only.
I'll have to do some more tests to figure out with which SDK/driver the problems started. But it seems like a very difficult bug to find since the algorithm seems to run fine for a while, and then it suddenly diverges from the known good numbers, but it continues generating seemingly pseudorandom numbers.
Did you find out what is causing this issue? Is it a precision issue or an implementation bug?
Is the issue reproducible from the code you posted in the first post of this thread?
What I've found out is that with my HD 5870:
SDK 2.01 with Catalyst 10.1 works
SDK 2.1 with Catalyst 10.4 works
SDK 2.2 with Catalyst 10.8 works
SDK 2.3 with Catalyst 11.1 produces wrong numbers after a while.
I've also tried it on an Nvidia T10 that I have access to at the university, and it too generates the correct numbers.
The problem does show up in the program I posted. For example, running it as "prngtest.exe 0 1 1" or "prngtest 4 1 1" will show that the last numbers checked are incorrect. Running on the CPU with "prngtest.exe 0 0 1" works as expected. The problem can also be seen by printing the PRNs array in the program, or writing it to a file: the CPU (at least on my computer) produces correct results, while the GPU starts generating different values after a while. I've checked this against 10000 values from the Fortran implementation, and my OpenCL implementation generates the sequence correctly on the CPU, and on the GPU with every SDK except the newest.
My guess would be that there is a (small) calculation error at some point which then causes the generated sequence to diverge; even an error in the least significant bit would do it, I think. As I understand it, this would be a bug in the SDK, since I'm only using addition and multiplication, which should be correctly rounded according to the specification.
Since the error seems to happen at the same place each time, I'll try to see if I can isolate the exact operation that's failing.