mihzaha
Journeyman III

broadcasting?

All threads access the same memory location

Hi!

I have an HD 5850 card and I want to start programming it in OpenCL. I'm a beginner in GPU programming and in OpenCL.

I want to start with a search in a tree structure. The problem is that all the threads access the root, many threads access the children of the root, and finally only a few threads access any given leaf. Does the card support broadcasting, and to what extent (broadcast to the members of a work-group, or to all work-items that run concurrently)?

If there is no broadcast, is there a better way of doing the traversal? The tree is big and doesn't fit in local memory.

(Each thread handles a cube that has a position in 3D space and a size. I want to find the closest voxel that fits inside the cube by repeatedly dividing the space into 8 cubes until the cell is small enough.)

 

Thank you

2 Replies
n0thing
Journeyman III

Broadcasting is supported in local memory on the HD 5xxx series: when 32 threads read the same memory location, the request is processed in 1 cycle.

Constant buffers give a bandwidth of 600 GB/s when all threads access the same memory location and the index is dynamic.

Global memory gives a bandwidth of 250 GB/s because there is some cache reuse: global buffer reads translate to VFETCH instructions, which means the accesses go through the L1 texture cache.

So you will definitely get higher bandwidth when all threads access the same memory location.


Thank you

Solved:

1) Kernels that compress the first k levels into one long vector of nulls or addresses.

2) Kernels that access the nodes directly, starting from level k, by looking them up in that vector (like a hash map).
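
For anyone finding this later, here is a rough sketch of how the direct lookup in step 2 could work. It is plain C rather than a kernel, and it is not the exact code from this thread: the names `EMPTY_NODE`, `cell_coord`, `level_k_slot` and `lookup_level_k` are made up for illustration, and it assumes coordinates are normalized to [0, 1) and that the table from step 1 holds 8^k entries, each either a node index or a "null" marker.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical marker for "no subtree under this level-k cell". */
#define EMPTY_NODE (-1)

/* Map a coordinate in [0, 1) to its cell number along one axis at depth k. */
static uint32_t cell_coord(float v, unsigned k)
{
    uint32_t cells = 1u << k;                 /* 2^k cells per axis */
    uint32_t c = (uint32_t)(v * (float)cells);
    return c < cells ? c : cells - 1;         /* clamp rounding at the upper edge */
}

/* Flat slot of the level-k cell that contains (x, y, z). */
static uint32_t level_k_slot(float x, float y, float z, unsigned k)
{
    assert(k <= 10);                          /* 3*k bits must fit in 32 bits */
    return cell_coord(x, k)
         | (cell_coord(y, k) << k)
         | (cell_coord(z, k) << (2 * k));
}

/* Fetch the level-k node for a point without touching levels 0..k-1. */
static int32_t lookup_level_k(const int32_t *table,
                              float x, float y, float z, unsigned k)
{
    return table[level_k_slot(x, y, z, k)];
}
```

Each work-item would compute its own slot and read a different table entry, so the shared root and upper levels are never read at all. The same integer arithmetic should carry over to an OpenCL kernel, with `table` passed as a `__global` buffer.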

 
