Archives Discussions

mihzaha · ‎06-26-2010

All threads access the same memory location

Hi!

I have an HD5850 card and I want to start programming it in OpenCL. I'm a beginner in GPU programming and in OpenCL.

I want to start with a search in a tree structure. The problem is that all the threads access the root, many threads access the children of the root, and finally only a few threads access a leaf. Does the card support broadcasting, and to what extent (broadcast to members of a work-group, or to all work-items that run concurrently)?

If there is no broadcast, is there a better way of doing the traversal? The tree is big and doesn't fit in local memory.

(Each thread is a cube that has a position in 3D space and a dimension, and I want to find the closest voxel that fits inside the cube by dividing the space into 8 cubes until it's small enough.)
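To make the subdivision concrete, here is a minimal sketch in plain C of how one step of such a descent could pick a child. It assumes integer coordinates in a root cube of side 2^depth, and the names `octant_at_level` and `levels_until_fits`, as well as the bit convention for the octant number, are hypothetical and not taken from the post.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed convention (not from the post): bit 0 of the octant selects
 * the upper x half, bit 1 the upper y half, bit 2 the upper z half.
 * Level 0 is the split made at the root. */
static unsigned octant_at_level(uint32_t x, uint32_t y, uint32_t z,
                                unsigned level, unsigned depth)
{
    assert(level < depth);
    unsigned shift = depth - 1u - level;   /* coordinate bit that decides this split */
    return ((x >> shift) & 1u)
         | (((y >> shift) & 1u) << 1)
         | (((z >> shift) & 1u) << 2);
}

/* Number of halvings needed before a cell is no larger than the cube
 * (stops at a cell of side 1). */
static unsigned levels_until_fits(uint32_t root_size, uint32_t cube_size)
{
    unsigned level = 0;
    while (root_size > cube_size && root_size > 1u) {
        root_size >>= 1;
        ++level;
    }
    return level;
}
```

Every thread computes level 0 against the same root node, which is why the root is read by all threads while deeper nodes are read by fewer and fewer.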

Thank you

n0thing · ‎06-27-2010

Broadcasting is supported in the local memory of the 5xxx series: 32 threads reading the same memory location will get their request processed in 1 cycle.

Constant buffers give a bandwidth of 600 GB/s when all threads access the same memory location and the index is dynamic.

Global memory gives a bandwidth of 250 GB/s, as there is a bit of cache reuse: global buffer memory operations translate to a VFETCH instruction, which means the accesses go through the L1 texture cache.

So you will definitely get higher bandwidth when all threads are accessing the same memory location.

mihzaha · ‎06-28-2010

Thank you

Solved:

1) Kernels that compress the first k levels into one long vector of nulls or addresses.

2) Kernels that access the nodes starting from level k directly (from the vector, like a hash map).
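The two steps above can be sketched roughly as follows. This is serial C for illustration only, not the kernels from the post: the `Node` layout, the `NO_NODE` sentinel, the function names, and the choice of an interleaved-bit (Morton-style) slot index are all assumptions. The table has 8^k slots, one per possible path through the first k levels.

```c
#include <assert.h>
#include <stdint.h>

#define NO_NODE UINT32_MAX               /* stands in for a "null" entry */

/* Hypothetical node layout: indices of the 8 children, NO_NODE if absent. */
typedef struct { uint32_t child[8]; } Node;

/* Octant chosen at a given level (bit 0 = x, bit 1 = y, bit 2 = z). */
static unsigned octant_bits(uint32_t x, uint32_t y, uint32_t z, unsigned shift)
{
    return ((x >> shift) & 1u) | (((y >> shift) & 1u) << 1)
         | (((z >> shift) & 1u) << 2);
}

/* Slot in [0, 8^k): the octants of the first k levels, 3 bits each. */
static uint32_t top_slot(uint32_t x, uint32_t y, uint32_t z,
                         unsigned k, unsigned depth)
{
    assert(k <= depth && k <= 10u);      /* 3*k bits must fit in 32 */
    uint32_t slot = 0;
    for (unsigned level = 0; level < k; ++level)
        slot = (slot << 3) | octant_bits(x, y, z, depth - 1u - level);
    return slot;
}

/* Step 1: walk the first k levels once per slot and store the level-k
 * node (or NO_NODE) in a flat table of 8^k entries. */
static void build_top_table(const Node *nodes, uint32_t root,
                            unsigned k, uint32_t *table)
{
    uint32_t slots = 1u << (3u * k);
    for (uint32_t s = 0; s < slots; ++s) {
        uint32_t n = root;
        for (unsigned level = 0; level < k && n != NO_NODE; ++level)
            n = nodes[n].child[(s >> (3u * (k - 1u - level))) & 7u];
        table[s] = n;
    }
}

/* Step 2: start at level k straight from the table, then descend. */
static uint32_t find_from_level_k(const Node *nodes, const uint32_t *table,
                                  uint32_t x, uint32_t y, uint32_t z,
                                  unsigned k, unsigned depth)
{
    uint32_t n = table[top_slot(x, y, z, k, depth)];
    for (unsigned level = k; level < depth && n != NO_NODE; ++level)
        n = nodes[n].child[octant_bits(x, y, z, depth - 1u - level)];
    return n;
}
```

With this layout, lookups no longer touch the root or the other heavily shared top nodes; each one begins with a single indexed read into the table. The cost is 8^k table entries, so k has to stay small.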
