Archives Discussions

nou · ‎04-23-2010

i have problem that on CPU it gives right results. but if run my code on GPU it return random results.

i pinpoint that problem is in this function. event_vector array did not contain proper values.

you can find whole code here http://orwell.fiit.stuba.sk/git?p=cellula.git;a=blob;f=kernels/asyn.cl;h=95f69b0872a356b035bc6200851cd7da1bdc3713;hb=da36c88bdc666c12c8fe5131a6c59ca8e48eae30

#define NEIGHBOR 8 #define QUEUE_LEN 8 #define COORD(c) (c.y*get_global_size(0) + c.x) typedef float _time; typedef float _state; typedef struct { _time x; _state y; }_event; typedef struct { int start,end; int start2,end2; _event state; _event queue[QUEUE_LEN]; }_cell; void create_event_vector(__global _cell *cells, _event *event_vector, int2 *coord, int *index) { for(int i=0;i<NEIGHBOR;i++) event_vector = cells[COORD(coord)].queue[index]; }

jhabig · ‎04-23-2010

Hi nou!

I just saw your topic by did't have the time to go trough the code. I did have the same behaviour and it all came down to the following.

My Buffer did have the flag CL_MEM_USE_HOST_PTR. For the CPU it turned out that the CL kernl just used that bit of memory my HostBuffer was defined on. When using the GPU the HostBuffer and the kernel memory are not the same, and I had to query clEnqueueReadBuffer to get the data. That command need's to be called in the blocking way or you need to wait for an event befor using the results.

I hope that helps, but you have possibly already thought of this. As I stated above I haven't had the time to look into your code, yet.

nou · ‎04-23-2010

no i do not use any CL_MEM_*_HOST_PTR. i tried move that array into __local. interesting is that now it return another random results. i think this bug in AMD OpenCL.

gaurav_garg · ‎04-26-2010

What is your group and NDrange size? It seems multiple threads are writing to the same place in event_vector array.

nou · ‎04-26-2010

but event_vector is private address space. soany overwrite shold not be possible.

global and local work size is both 16x16.

gaurav_garg · ‎04-26-2010

How do you allocate event_vector?

nou · ‎04-26-2010

i have in kernel

_event event_vectot[NEIGHBOR];

then i call create_event_vector(); i tried even using local space for this array but withou suces.

MicahVillmow · ‎04-26-2010

Nou,
There are issues with structures in private/local memory in SDK 2.01. This should be fixed in the upcoming release. However, as an optimization, the approach you are taking for passing arguments to a kernel is not optimal.
This:
__kernel void takeStep(__global _cell *cells, _time start_time, _time time_window, __global float16 *debug)
Will perform a lot better if you do this:
__kernel void takeStep(__global _cell *cells, constant _time* start_time, constant _time* time_window, __global float16 *debug)

The reason being that private memory is mapped to the hardware private memory when it is dynamically indexed or a structure is passed by value to a kernel. The hardware private memory is stored in on-device ram and is uncached. If you turn the pass by value into pass by pointer of constant address space, the structures are then loaded into the constant cache, which at peak rate is about 3/5th's the speed of register access.

There also is a feature being introduced in our upcoming release that will allow more efficient access into the constant buffer with the attribute max_constant_size(#Bytes).

So this kernel would be something like:
__kernel void takeStep(__global _cell *cells, __constant _time* start_time __attribute__(max_constant_size(sizeof(_time)),
__constant _time* time_window__attribute__(max_constant_size(sizeof(_time)),
__global float16 *debug)

What this will do is pack both _start_time and time_window into the same constant cache for better resource management as only 2 constant caches can be used at a time in a single ALU CF clause.

I hope this helps. If there is still an issue with this kernel after the upcoming release, please let me know.

nou · ‎04-26-2010

thanks for reply Micah. as you suggest i try replace my _event to float2 (in future i need to chose types dynamic in this structures like int,float or int,float,int etc.) but as i replace it now it freeze whole system. i remove that while(1) loop just for sure but it do not work. on CPU it is still working normaly.

so i realy looking forward to next release.

MicahVillmow · ‎04-26-2010

nou,
A workaround for this until the next release is to use the __attribute__((max_reqd_workgroup_size(64, 1, 1))) to force smaller work-groups at compile time. The freeze is most likely caused by spilling register to scratch memory and the loop counter getting corrupted.

MicahVillmow · ‎04-26-2010

nou,
Also the environment variable GPU_MAX_WORKGROUP_SIZE=64 should achieve the same behavior.
This should allow you to continue developing until the release comes out.

nou · ‎04-26-2010

reqd_work_group_size() solve freezing. but wrong result remains. i tried it on nVidia. and then i will be angry. on AMD because it have buggy OpenCL or on myself that i can not discover bug in my SW.

MicahVillmow · ‎04-26-2010

nou,
The problem is most likely in the codegen and how it handles private memory and structures. Try passing everything in global memory to get around a bug in our SW stack. There should be no problem in our upcoming release as I've spent a lot of time making sure this path works.

nou · ‎04-28-2010

today i tried my code on nVidia card. and it works after some modifications.

nVidia do not like this. __constant int2 neighbors[] = {-1,1, 1,1, -1,-1, 1,-1, 0,1, -1,0, 1,0, 0,-1};

it refuses compile wirk some strange build error. so i switch from int2 to int. then it work.

MicahVillmow · ‎04-28-2010

nou,
the correct way to do that is the following:
__constant int2 neighbors[] = {(int2)(-1,1), (int2)(1,1), (int2)(-1,-1), (int2)(1,-1), (int2)(0,1), (int2)(-1,0), (int2)(1,0), (int2)(0,-1)};

nou · ‎04-28-2010

that did't work either.so maybe nvidia bug.

Micah can't you send me new SDK on email? i write thesis and i have only two weeks left.

nou · ‎05-03-2010

ok tried my code with SDK 2.1 and it works.

faruk123 · ‎12-31-2012

Hi

3d mark 11 consequences part appear to demonstrate my GPU core trimmest speed wrongly .Do this mean mean that 3d mark 11 is by my GPU at that know clock pace ,or is it just being display wrongly ?My card are x3 xFX 6970 in crossfire and the supply clock speed be supposed clock speed supposed to be 880MHZ and in reality i'have overclocked theme to 997mhz.

heman · ‎01-09-2013

Hi faruk,

Can you please provide the system confiuration you are running. CPU, GPU, SDK, Driver.

also furnish some steps, on what all needs to be installed( in what order) in order to reproduce this issue.

Thanks

workitem7

Archives Discussions

incorect result on GPU