Question #1: getting peak FLOPS on the AMD OpenCL SDK: emitting ISA MAD instructions, and maximum kernel length

Discussion created by oscarbarenys1 on Jan 14, 2010
Latest reply on Jan 14, 2010 by moozoo

Hi, I have written some kernels aiming to get near the theoretical peak performance on the 5xxx series (tested on a 5850).

I have written code for single-precision FP, double-precision FP, 32-bit integer, and 24-bit integer arithmetic.

I mainly write the kernels using the OpenCL native mad built-ins where appropriate:

mad: for single-precision floats and for doubles

mad24: uses 24-bit integer multiplies

For 32-bit integers no OpenCL imad built-in exists, so I write a*b+c.
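For reference, this is roughly what each form computes. A plain-C emulation of the arithmetic (my illustration here, not the actual kernel code):

```c
#include <stdint.h>

/* Emulation of what each kernel variant computes
   (illustration only, not the actual OpenCL kernels). */

/* Floating point: mad(a, b, c) computes a * b + c. */
static float mad_f(float a, float b, float c) { return a * b + c; }

/* 32-bit integer: no imad built-in in OpenCL, so written as a*b+c. */
static int32_t imad(int32_t a, int32_t b, int32_t c) { return a * b + c; }

/* mad24(a, b, c): only the low 24 bits of a and b take part in the
   multiply, which is what MULADD_UINT24 does natively on the 5xxx. */
static uint32_t mad24_u(uint32_t a, uint32_t b, uint32_t c) {
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu) + c;
}
```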

The problem: everything compiles, but inspecting the AMD IL v2 and 5xxx assembly reveals that hardware mad instructions are only used for single precision.

For double precision it crashes outright, so I have to use the a*b+c form.

Although double precision is experimental, I hope you can add mad and fma instructions as fast as you can; this would let an n-body example attack the double-precision n-body performance Fermi showed at GTC09 :-)

So, briefly:

Integer mad: no OpenCL instruction exists, and I get this ISA:

9  t: MULLO_INT   ____,  PV8.w,  R0.x     
10  y: ADD_INT     T0.y,  T0.w,  PS9    

Single-precision FP: correct:

MULADD_e x,w,z,y

Double precision: using the native double mad or fma crashes the compiler, and using a*b+c I get (IL):

dmul r177.xy__, r178.xyxy, r177.xyxy
dadd r177.xy__, r177.xyxy, r178.xyxy

Integer mad24: I get

imul + iadd + ishl + ishr (this is the AMD IL, but the assembly shows the same horrible situation)

Note that the 5850 supports a native MULADD_UINT24 ISA instruction.


So note that I can't obtain better than half the theoretical ops/s in double-precision FP, integer, and 24-bit integer.

In fact, the last case is 4x slower (assuming similar timing for each instruction).

One problem I see for mad24 is that AMD IL 2.0 doesn't seem to expose a mul24 instruction, and since OpenCL appears to generate AMD IL first, how is this going to be solved? The MULADD_UINT24 instruction exists in the ISA.

Also, I can't believe AMD is at such an early stage with this particular instruction, which OpenCL and DirectCompute can use to accelerate thread-index calculations for blocks/grids of fewer than 16M elements. CUDA programs use it a lot; I think it's the reason CUDPP limits some functions to 16M elements.
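The 16M figure comes straight from the 24-bit operand width: a mul24-style multiply is only exact while its operands fit in 24 bits. A small C sketch (function names are mine) of the usual index calculation and where it breaks:

```c
#include <stdint.h>

/* mul24-style multiply: only the low 24 bits of each operand take
   part, matching the hardware UINT24 multiply path. */
static uint32_t mul24_u(uint32_t a, uint32_t b) {
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu);
}

/* Typical global-index computation: group_id * group_size + local_id.
   Correct as long as the operands stay below 2^24 (~16M elements);
   past that, the truncated multiply silently gives the wrong index. */
static uint32_t global_index_24(uint32_t group_id, uint32_t group_size,
                                uint32_t local_id) {
    return mul24_u(group_id, group_size) + local_id;
}
```

In the failing case, an operand above 2^24 has its high bits dropped before the multiply, which is why libraries built on this trick cap their problem sizes.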

Also, the problem with integers and general code using a*b+c instead of a dedicated mad instruction could be resolved if the AMD OpenCL compiler understood "-cl-mad-enable",

but it says:

Warning: invalid option: -cl-mad-enable

Note that I have tested the kernels on NVIDIA's OpenCL using a*b+c for all supported data types: they compile to two instructions (mul + add), but if I instead pass -cl-mad-enable, the compiler emits native hardware mad instructions.
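For completeness, the flag goes in the build-options string of clBuildProgram. A fragment (it assumes a `program` and `device` already set up through the usual clCreateProgramWithSource / clGetDeviceIDs calls, so it is not runnable standalone):

```c
/* Pass -cl-mad-enable as a build option. On NVIDIA's OpenCL this makes
   the compiler fuse a*b+c into hardware mad; AMD's currently replies
   "Warning: invalid option: -cl-mad-enable". */
cl_int err = clBuildProgram(program, 1, &device,
                            "-cl-mad-enable", NULL, NULL);
```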

One more note: if I put a lot of mad instructions inside a loop, the AMD OpenCL compiler crashes, and before crashing it spends a very long time compiling. I remember the CUDA compiler handles a similar test of moderate length perfectly.

Is there some argument to instruct the compiler not to optimize at all? A block of mad instructions can't be optimized anyway.

Could it also be a problem of the compiler unrolling the loop? How can I control loop unrolling? I think the NVIDIA OpenCL compiler recognizes #pragma unroll.
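What I mean by controlling unrolling, as an OpenCL C kernel fragment (kernel name and loop bound are mine; whether AMD's compiler honors the pragma is exactly my question):

```c
/* Hint the compiler to keep the mad-heavy loop rolled.
   Recognized (I believe) by NVIDIA's OpenCL compiler; it is
   unclear whether AMD's honors it. */
__kernel void mad_loop(__global float *out, float a, float b) {
    float acc = 0.0f;
    #pragma unroll 1   /* keep the loop rolled */
    for (int i = 0; i < 4096; i++)
        acc = mad(a, acc, b);
    out[get_global_id(0)] = acc;
}
```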


If I publish this code as a benchmark, AMD cards will look bad.

More questions coming..