cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

nou
Exemplar

incorect result on GPU

i have problem that on CPU it gives right results. but if run my code on GPU it return random results.

i pinpoint that problem is in this function. event_vector array did not contain proper values.

you can find whole code here http://orwell.fiit.stuba.sk/git?p=cellula.git;a=blob;f=kernels/asyn.cl;h=95f69b0872a356b035bc6200851cd7da1bdc3713;hb=da36c88bdc666c12c8fe5131a6c59ca8e48eae30

#define NEIGHBOR 8 #define QUEUE_LEN 8 #define COORD(c) (c.y*get_global_size(0) + c.x) typedef float _time; typedef float _state; typedef struct { _time x; _state y; }_event; typedef struct { int start,end; int start2,end2; _event state; _event queue[QUEUE_LEN]; }_cell; void create_event_vector(__global _cell *cells, _event *event_vector, int2 *coord, int *index) { for(int i=0;i<NEIGHBOR;i++) event_vector = cells[COORD(coord)].queue[index]; }

0 Likes
18 Replies
jhabig
Journeyman III

incorect result on GPU

Hi nou!

I just saw your topic by did't have the time to go trough the code. I did have the same behaviour and it all came down to the following.

My Buffer did have the flag CL_MEM_USE_HOST_PTR. For the CPU it turned out that the CL kernl just used that bit of memory my HostBuffer was defined on. When using the GPU the HostBuffer and the kernel memory are not the same,  and I had to query clEnqueueReadBuffer to get the data. That command need's to be called in the blocking way or you need to wait for an event befor using the results.

I hope that helps, but you have possibly already thought of this. As I stated above I haven't had the time to look into your code, yet.

0 Likes
nou
Exemplar

incorect result on GPU

no i do not use any CL_MEM_*_HOST_PTR. i tried move that array into __local. interesting is that now it return another random results. i think this bug in AMD OpenCL.

0 Likes
gaurav_garg
Adept I

incorect result on GPU

What is your group and NDrange size? It seems multiple threads are writing to the same place in event_vector array.

0 Likes
nou
Exemplar

incorect result on GPU

but event_vector is private address space. soany overwrite shold not be possible.

global and local work size is both 16x16.

0 Likes
gaurav_garg
Adept I

incorect result on GPU

How do you allocate event_vector?

0 Likes
nou
Exemplar

incorect result on GPU

i have in kernel

_event event_vectot[NEIGHBOR];

then i call create_event_vector(); i tried even using local space for this array but withou suces.

0 Likes
MicahVillmow
Staff
Staff

incorect result on GPU

Nou,
There are issues with structures in private/local memory in SDK 2.01. This should be fixed in the upcoming release. However, as an optimization, the approach you are taking for passing arguments to a kernel is not optimal.
This:
__kernel void takeStep(__global _cell *cells, _time start_time, _time time_window, __global float16 *debug)
Will perform a lot better if you do this:
__kernel void takeStep(__global _cell *cells, constant _time* start_time, constant _time* time_window, __global float16 *debug)

The reason being that private memory is mapped to the hardware private memory when it is dynamically indexed or a structure is passed by value to a kernel. The hardware private memory is stored in on-device ram and is uncached. If you turn the pass by value into pass by pointer of constant address space, the structures are then loaded into the constant cache, which at peak rate is about 3/5th's the speed of register access.

There also is a feature being introduced in our upcoming release that will allow more efficient access into the constant buffer with the attribute max_constant_size(#Bytes).

So this kernel would be something like:
__kernel void takeStep(__global _cell *cells, __constant _time* start_time __attribute__(max_constant_size(sizeof(_time)),
__constant _time* time_window__attribute__(max_constant_size(sizeof(_time)),
__global float16 *debug)

What this will do is pack both _start_time and time_window into the same constant cache for better resource management as only 2 constant caches can be used at a time in a single ALU CF clause.

I hope this helps. If there is still an issue with this kernel after the upcoming release, please let me know.
0 Likes
nou
Exemplar

incorect result on GPU

thanks for reply Micah. as you suggest i try replace my _event to float2 (in future i need to chose types dynamic in this structures like int,float or int,float,int etc.) but as i replace it now it freeze whole system. i remove that while(1) loop just for sure but it do not work. on CPU it is still working normaly.

so i realy looking forward to next release.

0 Likes
MicahVillmow
Staff
Staff

incorect result on GPU

nou,
A workaround for this until the next release is to use the __attribute__((max_reqd_workgroup_size(64, 1, 1))) to force smaller work-groups at compile time. The freeze is most likely caused by spilling register to scratch memory and the loop counter getting corrupted.
0 Likes