I found that there is now some support for extensions on the GPU:
- double support for the basic operators (+ - * /), though only on the 58xx and 59xx series; I wonder if the CPU is supported now
- ICD support, so NVIDIA and ATI can live in the same system as two platforms
- Stream Analyzer Profiler - I hope there will also be some tool for developers who do not use VS
- new documentation, like the performance optimization guide and the ISA for the 5xxx cards
I thought you were speaking of extensions like cl_khr_byte_addressable_store, but I ran my test program and found nothing new there. I was hoping, because of the new driver and so on... maybe next time.
EDIT: I think I missed something. There is a NEW SDK out!
I found something interesting. There is a line in the Release Notes (page 3, section 6.2, Runtime) that says:
"On 7xx-based GPUs, the maximum work-group size is 64. Specifying a larger size can result
in undefined behavior."
I have an EHA4850, which is an ATI RV770 GPU. I have an application that runs with a work-group size of 230, and I have no problems there. Which cards are meant, then? The 3xxx series?
I see. I tried something now: I used a work-group size of 50 instead of 230, and I did not get as much performance loss as I thought; in fact, I think it is almost nothing. Is it because the implicit barrier is holding back the run of the other solution? Or is the barrier ONLY optimized away at a group size of 64?
Another question: how come the maximum work-group size is 256 if I can only use 64 of it?
OK, it is because I got the 230 group size from the runtime: I set the group size to NullRange, and so on.
With 230 I get ~7.27 ms kernel time, and with 50 ~8.13 ms.
So, is it because of the barrier that there is so little difference between the two, or because the kernel does not do that much work?
I have attached my kernel if you'd like to see it; it's a simple masking algorithm for pictures.
And thank you for the quick and accurate answer!
/* Simple image-masking (N x N convolution) kernel.
   seged = pixels per colour channel (width * height)
   hatar = number of mask elements (N * N)
   cim   = address of the centre pixel in the padded input
   NOTE: the original post had `rMid[cim+seged_t]*maszk`, which does not
   compile (pointer used as index / multiplied as a value); indexing the
   offset table and the mask with i is the presumed intent. */
__kernel void maszkol(__global unsigned int *rOut,
                      __global unsigned int *gOut,
                      __global unsigned int *bOut,
                      __global unsigned int *rMid,
                      __global unsigned int *gMid,
                      __global unsigned int *bMid,
                      __global unsigned int *N,
                      __global unsigned int *width,
                      __global unsigned int *height,
                      __global int *seged_t,
                      __global float *maszk)
{
    unsigned int seged = *width * (*height);
    unsigned int hatar = *N * (*N);
    unsigned int cim = 0;
    unsigned int i = 0;
    float ertek = 0;

    /* The global range covers all three channels back to back;
       the switch picks the channel this work-item belongs to. */
    switch (get_global_id(0) / seged) {
    case 0:
        cim = ((*N / 2) * (*N - 1 + *width))
            + ((get_global_id(0) / (*width)) * (*N - 1))
            + (*N / 2) + get_global_id(0);
        for (i = 0; i < hatar; ++i)
            ertek += rMid[cim + seged_t[i]] * maszk[i];
        rOut[get_global_id(0)] = ertek;
        break;
    case 1:
        cim = ((*N / 2) * (*N - 1 + *width))
            + (((get_global_id(0) - seged) / (*width)) * (*N - 1))
            + (*N / 2) + get_global_id(0) - seged;
        for (i = 0; i < hatar; ++i)
            ertek += gMid[cim + seged_t[i]] * maszk[i];
        gOut[get_global_id(0) - seged] = ertek;
        break;
    case 2:
        cim = ((*N / 2) * (*N - 1 + *width))
            + (((get_global_id(0) - (2 * seged)) / (*width)) * (*N - 1))
            + (*N / 2) + get_global_id(0) - (2 * seged);
        for (i = 0; i < hatar; ++i)
            ertek += bMid[cim + seged_t[i]] * maszk[i];
        bOut[get_global_id(0) - (2 * seged)] = ertek;
        break;
    }
}