The second loop should not work. You are correct.
How many workitems are you using in your workgroup? Is it 64? or something greater than that?
I hope "Minvals" and "Maxvals" are in Local Memory. Please confirm.
    for (int offset = local_size / 2; offset > 0; offset /= 2)
    {
        Minvals[gid_local] = fmin(Minvals[gid_local], Minvals[gid_local + offset]);
        Maxvals[gid_local] = fmax(Maxvals[gid_local], Maxvals[gid_local + offset]);
    }
This code looks innocent and should actually work. I don't see any possibility of a race.
Please post a copy of your code (as a ZIP file) so that we can reproduce the issue here, and include the following details as well:
1. Platform - Win32 / Win64 / Lin32 / Lin64 or something else? For Windows: Win7, Vista, or Win8; similarly for Linux, your distribution
2. Version of driver
3. CPU or GPU Target?
4. CPU/GPU details of your hardware
I can give you the kernel, but sharing the remaining source code is complicated. I need to use the ITK Toolkit (http://www.itk.org/), which encapsulates many parts of the OpenCL API in its own functions and methods. Creating a test project is very involved because I rely on the ITK functionality.
I need to parallelise an existing segmentation algorithm (FSL-Fast) with OpenCL. One part of that is finding the minimum and maximum elements of a voxel field.
Here are some parts of the Code:
The global size of the kernel is derived from the image size and the work-group size. A #define statement is then prepended to the kernel source to make the work-group size dynamic and controllable (rather than letting OpenCL pick it). After that the kernel is built and all parameters are set; then the kernel is executed. For example, one of my voxel images is 160 x 256 x 173. With a work-group size of 4 x 4 x 4, I get a global size of 160 x 256 x 176. The voxel data is stored in a 1D array, so the index has to be calculated as gid_x + gid_y*sizex + gid_z*sizex*sizey.
I have three machines.
1. Notebook:
Dell Vostro 3450
Intel Core i5-2410M
2. Desktop PC:
AMD Phenom II X4 945
3. Machine at university:
Intel Xeon X5675
AMD FirePro V5900
Nvidia Tesla C2075
On the notebook, only very simple kernels work. For more information: http://devgurus.amd.com/thread/160282 . It looks like the debugger won't enter any if or for statements. I am using modded drivers (UnifL by Leshcat), because the Mobility drivers don't work (CCC does not start). The original Dell drivers are from 2011 and still have the ATI logo integrated (the card is not recognized as an AMD graphics card).
That kernel (the second one, which should not work...) runs correctly on the Tesla machine, but only if the workgroup size is 64 or below. The same kernel crashes on my GTS 450. Because I find debugging extremely important and Nvidia's approach to OpenCL is very complicated, an AMD GCN card (HD 7750) is on the way. AMD's CodeXL looks very promising.
I hope this is enough information. I just need to know whether my kernel is workable.
Request you to post your kernel (at least the kernel source) as a ZIP here, and just explain the arguments to me. I can then build a sample repro case and take this forward.
PS: At the moment, the URL you have provided is blocked by my company's Websense. Also, a ZIP will be a permanent record.
WORKGROUPSIZE is defined as 64. Is this what you use?
That's actually the wavefront size, which probably (not sure) explains why the second code snippet does not hang.
If you increase it to 128 or so, the second one should hang at the barrier.
Anyway, thanks for posting. Will try it out.
You mean that my workgroup size is the same as the wavefront size? That is definitely true for AMD hardware; for Nvidia it should be 32. I wanted to run some performance tests varying the size of the local memory and arrays, in order to find out which implementation is best.
Edit: I've done some additional testing. The kernel which should work does work on the Tesla system, but neither on my notebook (probably driver issues) nor on my desktop (GTS 450, ???). I will do some additional testing after I get my GCN card, and then I will report back.
    for (int i = 0; i < group_count; i = i + 2)
    {
        mint = fmin(mint, output[i]);
        maxt = fmax(maxt, output[i+1]);
    }
    // write to output -> end
    output[0] = mint;
    output[1] = maxt;
I fear this snippet of code assumes "global synchronization" of all workgroups -- which is not possible in OpenCL.
This will not work and will not give correct results.
Yeah, you are right, which means I need another method to perform the reduction at the global level. That is the reason why the kernel is not working. Thanks a lot.
Is it possible to perform any other global reduction with more than one workgroup? It looks like using only one workgroup is the key to my problem.
So the fact that the Tesla machine returned the right values was just a coincidence. I will rewrite the code.
Thanks a lot again. Now I know what I shouldn't do in the future.
This kernel does not compile, as "group_id" is not declared at all.
Maybe you mean "get_group_id"? But even then the group id has to be mapped from 3D to 1D.
Can you provide the code for that? (Meanwhile, I will fill in my own logic there, but I just want to be sure; that's why I am asking you.)
The logic I am using is - Please Confirm this.
    int group_idx = get_group_id(0);
    int group_idy = get_group_id(1);
    int group_idz = get_group_id(2);
    int group_id  = group_idz * get_num_groups(0) * get_num_groups(1) +
                    group_idy * get_num_groups(0) +
                    group_idx;
I am able to run your code without any problems on a Tahiti (HD 7970) on a Linux 64 box.
The only problem is the bug(?) in your code that assumes global synchronization of workgroups.
As long as I spawn only 1 workgroup, the code returns the correct result.
If I have more than one, the code does not give correct output - this is as expected.
Also, I hope you are using the correct drivers for your notebook.
Please download and install the drivers from AMD directly - http://support.amd.com/us/gpudownload/Pages/index.aspx
Hmm, what do you mean by "spawning only one workgroup"? Do you mean I should spawn only one workgroup, which has the size of the whole array (so that there is only one)? If I do this, my device runs out of memory.
It looks like I need a completely different approach, or I let the CPU do the final reduction.
No... What I meant was this:
I wrote some host code that allocates memory, builds the program, sets the arguments, and calls your kernel.
In this host code, if I spawn only 1 workgroup, your code works fine.
The code snippet that you complained about works fine on the 7970 card here.
However, if I spawn multiple workgroups (i.e. increase the image size), the code does not return the correct output.
This is because of the bug in the code that assumes global synchronization.
So, as you rightly said, you can let the CPU do the final reduction, or spawn another kernel with only 1 workgroup to do the final reduction alone.
Thanks again. I think all problems have been cleared up.
Glad to know! Good luck!
Yeah, this is right!