

debdatta_basu
Journeyman III

Suggest Feature you want in AMD APP

Here is my list:

1. Global Variables. It's a pain to pass parameters down from the kernel function to every function that needs them.

2.  Function Pointers.

3. Better support for complex data structures. Unified Address Space.

4. Templates.

5. Something similar to CUDA warp Vote functions. Useful for doing custom scheduling.

6. Population count instruction. It can be used to implement a blazingly fast warp-wise prefix sum for binary digits. Algorithms like radix sort would benefit greatly. Also useful for doing custom scheduling.

7. Multisampled texture support for OpenCL. Useful for Compute based  deferred rendering. This can already be done in DirectCompute.
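To illustrate item 6, here is a minimal host-side sketch in C of a warp-wise prefix sum over one-bit predicates. It assumes a 32-lane group and simulates the ballot on the CPU; `popcount32`, `ballot32` and `exclusive_prefix` are hypothetical helper names, not part of any SDK. On a GPU the ballot and the population count would each be a single hardware operation.

```c
#include <assert.h>
#include <stdint.h>

/* Portable population count. Hypothetical helper: on a GPU this would
   be a single instruction. */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Simulated ballot: bit i of the result is lane i's predicate. */
static uint32_t ballot32(const int *pred, unsigned lanes)
{
    assert(lanes <= 32);
    uint32_t mask = 0;
    for (unsigned i = 0; i < lanes; i++)
        if (pred[i])
            mask |= (uint32_t)1 << i;
    return mask;
}

/* Exclusive prefix sum for one lane: count the set bits below it. */
static unsigned exclusive_prefix(uint32_t ballot, unsigned lane)
{
    assert(lane < 32);
    uint32_t below = ((uint32_t)1 << lane) - 1;   /* bits 0..lane-1 */
    return popcount32(ballot & below);
}
```

Each lane gets its output slot from one mask and one count, with no shared-memory scan, which is why this helps the binary split step of a radix sort.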

 

Debdatta Basu.

MicahVillmow
Staff

debdatta.basu,
For #6 please see the cl_amd_popcnt extension.
debdatta_basu
Journeyman III

Dear Micah,

I am aware of that extension. However, I wanted it in the core spec, or at least as a KHR extension, as Nvidia doesn't have anything similar for OpenCL yet.

 

Regards,

Debdatta Basu.

corry
Adept III

Can we get a byte order reversal instruction? We have bitalign and bytealign to let us do byte/bit rotations, so it shouldn't be hard to add a byte order reversal. In OpenCL this could be exposed much like it is in MSVC for x86 processors (though I guess it would need an AMD-specific extension attached to it). From the MSDN page:

unsigned short _byteswap_ushort (
   unsigned short val
);
unsigned long _byteswap_ulong (
   unsigned long val
);
unsigned __int64 _byteswap_uint64 (
   unsigned __int64 val
);

Heck with that, could you give us those OpenCL instructions, but implement them on the GPU like the x86 SSSE3 instruction pshufb? I can think of a lot of situations where pshufb has come in handy! I would absolutely love to have it on the GPU!

ufimtsev
Journeyman III

Originally posted by: debdatta.basu Here is my list:

 

4. Templates.

 

5. Something similar to CUDA warp Vote functions. Useful for doing custom scheduling.

 

 

I too vote for templates and warp voting functions. Currently I have to implement things like __all() and __any() via __local arrays and parallel reduction.
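The workaround mentioned above can be sketched on the host in plain C. This is only an illustration, assuming a power-of-two group size; `reduce_vote`, `vote_any` and `vote_all` are hypothetical names. The inner loop stands in for the work-items of one group, and on a GPU a barrier would separate successive strides.

```c
#include <assert.h>
#include <stddef.h>

/* Tree reduction over a scratch array, as a work-group would do in
   __local memory. n must be a power of two. combine_and != 0 selects
   "all" (logical AND); 0 selects "any" (logical OR). */
static int reduce_vote(int *scratch, size_t n, int combine_and)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* Each iteration of this inner loop models one work-item. */
        for (size_t i = 0; i < stride; i++) {
            int a = scratch[i] != 0;
            int b = scratch[i + stride] != 0;
            scratch[i] = combine_and ? (a && b) : (a || b);
        }
    }
    return scratch[0] != 0;
}

static int vote_any(int *scratch, size_t n) { return reduce_vote(scratch, n, 0); }
static int vote_all(int *scratch, size_t n) { return reduce_vote(scratch, n, 1); }
```

The reduction needs log2(n) steps and a scratch array, whereas a hardware vote function would return the same answer in one operation without touching local memory.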

 

k1942t
Journeyman III

New ISA level instructions using bank conflict detection hardware.

p: uint pointer of LDS or GDS.

port_id: 0-31

 

 

bool lock_bankport(uint *p, int port)
{
    uint tid = get_id_in_wavefront();
    __local uint bank_is_used = 0;
    bool ret = false;
    for (int i = 0; i < 4; i++) {
        if (0 + 16*i <= tid && tid <= 15 + 16*i) {
            // is_first_tid_in_port() returns true if tid is the first thread id
            // of the same port, using the bank conflict detection hardware.
            if (((bank_is_used >> port) & 1) == 0 && is_first_tid_in_port(port)) {
                ret = true;
                atomic_or(&bank_is_used, 1 << port);
            }
        }
    }
    __barrier();
    __local uint bank_is_used_old;
    if (tid == 0) {
        // atomic_or returns the previous value: the ports that were already locked
        bank_is_used_old = atomic_or(p, bank_is_used);
    }
    __barrier();
    if ((bank_is_used_old >> port) & 1) {
        ret = false;
    }
    return ret;
}

void unlock_bankport(uint *p, int port)
{
    uint tid = get_id_in_wavefront();
    __local uint bank_is_used = 0;
    atomic_or(&bank_is_used, 1 << port);
    __barrier();
    if (tid == 0) {
        atomic_and(p, ~bank_is_used);
    }
}

usage:

__GDS uint sync_obj = 0;

__kernel void test(double* dest, uint* index, double* value)
{
    uint gid = get_global_id(0);
    uint dest_index = index[gid];
    uint port = dest_index & 31;
    while (1) {
        if (lock_bankport(&sync_obj, port)) {
            dest[dest_index] += value[gid];
            unlock_bankport(&sync_obj, port);
            break;  // update done, leave the retry loop
        }
    }
}

tweenk
Journeyman III

Originally posted by: corry Can we get a byte order reversal instruction?


There is no need to add another function that uses this instruction, just emit it when compiling code like this:

unsigned int swapped = as_uint(as_char4(input).wzyx);

Note that it crashed for me on SDK 2.3 and R700 series card when I tried to use it to byteswap floats. I haven't re-tested since then as I converted this to bitwise operations on uints.
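For reference, the bitwise fallback mentioned above might look like the following sketch, written as plain C for illustration rather than the poster's actual code; the same shifts and masks work unchanged on an OpenCL uint. `bswap32` and `bswap32_buffer` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word using shifts and masks. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24)                    /* byte 3 -> byte 0 */
         | ((v >> 8) & 0x0000FF00u)     /* byte 2 -> byte 1 */
         | ((v << 8) & 0x00FF0000u)     /* byte 1 -> byte 2 */
         | (v << 24);                   /* byte 0 -> byte 3 */
}

/* Byte-swap every word of a buffer in place. */
static void bswap32_buffer(uint32_t *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        buf[i] = bswap32(buf[i]);
}
```

To byte-swap a float this way, reinterpret its bits as an unsigned integer first, swap, and reinterpret back.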

corry
Adept III

Originally posted by: tweenk
Originally posted by: corry Can we get a byte order reversal instruction?


There is no need to add another function that uses this instruction, just emit it when compiling code like this:

unsigned int swapped = as_uint(as_char4(input).wzyx);

Note that it crashed for me on SDK 2.3 and R700 series card when I tried to use it to byteswap floats. I haven't re-tested since then as I converted this to bitwise operations on uints.

 

That might work for single-dword byte reversal, but since the register components are 32 bits wide and the entire register is 128 bits wide, swapping 128 bits at a time would give a massive increase in byte order reversal performance, like how SSE does it with pshufb. With 32-bit components, a move swizzle just reverses dword order, which is useless for byte swapping.

In SSE, this is trivial with this

movdqa xmm1, XMMBSWAPVAL;
movdqa xmm0, [rsp+myBufferOffset]
pshufb xmm0, xmm1
movdqa [rsp+myReversedBufferOffset], xmm0
.DATA
align 16
XMMBSWAPVAL:
    DD 00010203h
    DD 04050607h
    DD 08090a0bh
    DD 0c0d0e0fh

Or with intrinsics I suppose

__m128i Source, Dest;
__m128i bSwapVal = _mm_set_epi32(0x0c0d0e0f, 0x08090a0b, 0x04050607, 0x00010203);
Dest=_mm_shuffle_epi8(Source, bSwapVal);

Of course, that would normally be in a loop over some largish data, and that's for packed 32-bit integers, but you get the idea: 128 bits at a time, and capable of arbitrary sized inputs (2, 4, 8, or heck, even 16-byte integers).

This would be nice when you use the SIMD as a SIMD, that is, when all registers are full of 32-bit data and you want to byte-reverse each individual 32-bit component. Byte swapping a buffer can account for up to 20% of the time spent in some algorithms, just because you happen to be receiving in network byte order (the way you're supposed to do it).

 

laobrasuca
Journeyman III

Another suggestion: a sort function for vector types, e.g. int4 a_sort = sort_int4(a), with a_sort.x <= a_sort.y <= a_sort.z <= a_sort.w. It would be useful for kernels where only a small number of values need to be sorted (up to 16 values). A low-level sort on vector types would be much faster than running a regular C sort algorithm on arrays, especially when the array is not stored in private memory because there is not enough room for it.
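As a rough idea of what such a built-in could compute, here is a minimal C sketch of a fixed five-comparator sorting network for four values. The names `compare_swap` and `sort4` are hypothetical, and a plain array stands in for the int4 components. The comparison sequence does not depend on the data, which is what makes this pattern a good fit for a hardware or compiler-level implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Order a pair so that *a <= *b. */
static void compare_swap(int *a, int *b)
{
    if (*a > *b) {
        int t = *a;
        *a = *b;
        *b = t;
    }
}

/* Five-comparator sorting network for four values, ascending order. */
static void sort4(int v[4])
{
    assert(v != NULL);
    compare_swap(&v[0], &v[1]);
    compare_swap(&v[2], &v[3]);
    compare_swap(&v[0], &v[2]);   /* v[0] now holds the minimum */
    compare_swap(&v[1], &v[3]);   /* v[3] now holds the maximum */
    compare_swap(&v[1], &v[2]);   /* order the middle pair */
}
```

On a GPU each compare_swap maps to a min/max pair, so the whole sort runs without branches.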

 

And a question: what is the main reason why vector type values cannot be accessed with indices, e.g. int16 a; a[12] = 5;? Is this related to performance? Would the new GCN architecture (and consequently the way the compiler and runtime behave, i.e. hardware runtime instruction scheduling vs. software compile-time scheduling) make it viable?

MicahVillmow
Staff

laobrasuca,
A vector is a native data type like an integer; it is not an array of scalar types, so it is not indexable. This is not related to performance but to the nature of the data type. Asking to index into a vector is no different from asking to index into a 32-bit integer. It makes no sense from a hardware perspective or a language perspective. If you want indexable data, use arrays.