I will post the details later. Could any AMD staff give me an email address so I can send you the files to reproduce the bug. A exe file (running on Windows 8/10 x86_64, no dll required) and an opencl source file are enough (since I don't wanna send host side code)?
The correct results should be six numbers (returned by queue.enqueueReadBuffer). Hawaii returns zero results when printf (in opencl source) removed. If run again, it returns correct results. If the printf was reserved, both running return correct results. Spectre integrated GPU (A10-7870K) returns correct results whatever whether printf is removed.
printf was removed: (comment out one line in opencl source: if (env->flags) printf("found: GID=%ld\n", GID);)
Hawaii:
build time: 10,372ms.
run time: 187,969 microseconds.
Found 0 results:
run again: 5,587 microseconds.
Found 6 results: 6,590 4,830 6,625 4,785 1,931 1,846
Hawaii:
build time: 10,366ms.
run time: 189,547 microseconds.
Found 0 results:
run again: 5,499 microseconds.
Found 6 results: 6,590 6,625 4,785 4,830 1,931 1,846
Spectre:
build time: 10,356ms.
run time: 202,173 microseconds.
Found 6 results: 1,846 1,931 4,830 4,785 6,590 6,625
run again: 24,345 microseconds.
Found 6 results: 1,846 1,931 4,785 4,830 6,590 6,625
printf was reserved: (env->flags differs in first run and second run, so, only printf once for each GPU)
Hawaii:
build time: 9,831ms.
found: GID=1846
found: GID=1931
found: GID=4785
found: GID=4830
found: GID=6590
found: GID=6625
run time: 807,821 microseconds.
Found 6 results: 1,931 1,846 4,830 4,785 6,590 6,625
run again: 594,972 microseconds.
Found 6 results: 1,846 1,931 4,830 4,785 6,590 6,625
Hawaii:
build time: 10,906ms.
found: GID=1846
found: GID=1931
found: GID=4785
found: GID=4830
found: GID=6590
found: GID=6625
run time: 860,503 microseconds.
Found 6 results: 1,931 1,846 4,830 4,785 6,590 6,625
run again: 634,349 microseconds.
Found 6 results: 1,931 1,846 4,785 4,830 6,590 6,625
Spectre:
build time: 10,818ms.
found: GID=1846
found: GID=1931
found: GID=4785
found: GID=4830
found: GID=6590
found: GID=6625
run time: 759,770 microseconds.
Found 6 results: 1,931 1,846 4,830 4,785 6,590 6,625
run again: 570,127 microseconds.
Found 6 results: 1,931 1,846 4,830 4,785 6,590 6,625
Solved! Go to Solution.
- if (GID == 0) {
- env->null = env->self - env->delta;
- //env->null = env->self;
- dest[0] = 0;
- }
- barrier(CLK_GLOBAL_MEM_FENCE);
- if (GID == 1846) {
- ++dest[0];
- }
barrier() function can be only applicable to work-items within a work-group, not among work-items belonging from different work-groups.
In your case, work-item 0 (i.e GID == 0) and work-item 1846 (i.e GID == 1846) are from two different work-groups, so, there is no effect of barrier statement between these two work-items. However, both the work-items are trying to update the same memory value i.e. dest[0], which is totally undefined and the final value depends on the order of their execution. It also depends on the capability of particular GPU. Say, you may get the expected result from a low-end GPU which can run only 4 WGs at a time. Whereas a more powerful gpu, which can run 20 WGs at a time, may produce completely unpredictable result.
So, in order to produce expected result, both the work-items under the same barrier statement should belong to same work-group.
Regards,
I've sent you a private message and shared my email address. Kindly check your message box.
Regards,
Hi dipak,
Fortunately I find the root cause and reproduce it in a simple VS2013 project. I attched it.
Note:
1, Comment out the line env->null = env->self - env->delta; and enable the line env->null = env->self; the two results are different for Hawaii/Spectre, but both are wrong for Hawaii/Spectre/Cypress. The correct behaviour is both lines should not change the results (value of dest[0]);
2, The wrong results are different for each device, even between two Hawaii devices (R9 295X2), and unstable;
3, Change GID == 1846 to be GID == 0 or comment out both env->null lines, the results are correct now (dest[0]=1);
4, The OpenCL driver version is lastest 1800.5. I compiled in x86_64 and opencl 1.2.
Please run and explain the results. Thank you, dipak~
I found a way to work around the issue.
Change the line if (GID == 0) to be if (get_local_id(0) == 0) . In this way Hawaii always returns correct results in actural code. However this time Spectre may return wrong results in low probability (see attached picture). In the simple project above, Spectre returns wrong results in far smaller probability but wrong results still can be observed.
Since get_global_id(0) was changed to be get_local_id(0), there is no globally synchronization between work items, I have to change the code which allocates memory only once for all items to be allocating memory {work_item_number} times for each item itself. That means atomicity operations change from once to {work_item_number} times: A performace issue is waiting there: Why Hawaii/Spectre (R9 290X/A10-7870K R7) slower five ~ ten times than Intel/NVidia on atomic adding...
Hi, dipak Have you tried the project?
Sorry, I couldn't check it last week as I was busy with some other stuff. I'll check it this week.
Regards,
- if (GID == 0) {
- env->null = env->self - env->delta;
- //env->null = env->self;
- dest[0] = 0;
- }
- barrier(CLK_GLOBAL_MEM_FENCE);
- if (GID == 1846) {
- ++dest[0];
- }
barrier() function can be only applicable to work-items within a work-group, not among work-items belonging from different work-groups.
In your case, work-item 0 (i.e GID == 0) and work-item 1846 (i.e GID == 1846) are from two different work-groups, so, there is no effect of barrier statement between these two work-items. However, both the work-items are trying to update the same memory value i.e. dest[0], which is totally undefined and the final value depends on the order of their execution. It also depends on the capability of particular GPU. Say, you may get the expected result from a low-end GPU which can run only 4 WGs at a time. Whereas a more powerful gpu, which can run 20 WGs at a time, may produce completely unpredictable result.
So, in order to produce expected result, both the work-items under the same barrier statement should belong to same work-group.
Regards,
Thank you, @dipak. It seems I misunderstand the meanings of CLK_GLOBAL_MEM_FENCE. OpenCL cannot synchronize work items beyond work group, right?
Yeah, most of the synchronize operations are limited to work-group only. It makes OpenCL program flexible and scalable. Though there are few atomic functions which work at global level i.e. across multiple work groups.