2 Replies Latest reply on Jun 28, 2010 3:33 PM by mihzaha


      All threads access the same memory location


      I have an HD 5850 card and I want to start programming it in OpenCL. I'm a beginner in GPU programming and in OpenCL.

      I want to start with a search in a tree structure. The problem is that all the threads access the root, many threads access the children of the root, and finally only a few threads access any given leaf. Does the card support broadcasting, and to what extent (broadcast to members of a work-group, or to all work-items that run concurrently)?

      If there is no broadcast, is there a better way of doing the traversal? The tree is big and doesn't fit in local memory.

      (Each thread handles a cube that has a position in 3D space and a dimension, and I want to find the closest voxel that fits inside the cube by dividing the space into 8 cubes until it's small enough.)
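
The descent described above can be sketched in plain C as follows. This is only an illustration under assumptions: the `vec3` type, the function names, the octant bit convention (bit 0 = x, bit 1 = y, bit 2 = z) and the stopping rule "cell edge no larger than the cube's dimension" are hypothetical and not taken from the post. The same loop body should carry over to an OpenCL kernel with one work-item per cube.

```c
#include <assert.h>

/* Hypothetical point type; the post does not give a data layout. */
typedef struct { float x, y, z; } vec3;

/* Index (0..7) of the child octant of a cell centred at c that contains p.
   Assumed convention: bit 0 = x, bit 1 = y, bit 2 = z. */
static int octant_of(vec3 c, vec3 p)
{
    return (p.x >= c.x ? 1 : 0) | (p.y >= c.y ? 2 : 0) | (p.z >= c.z ? 4 : 0);
}

/* Descend from the root cell (centre root_c, edge root_edge) towards p,
   halving the cell until its edge is no larger than target_edge or
   max_depth levels have been taken. The octant chosen at each level is
   written to path; the return value is the number of levels descended. */
static int descend(vec3 root_c, float root_edge, vec3 p,
                   float target_edge, int max_depth, int *path)
{
    assert(root_edge > 0.0f && target_edge > 0.0f);
    vec3 c = root_c;
    float edge = root_edge;
    int depth = 0;
    while (edge > target_edge && depth < max_depth) {
        int o = octant_of(c, p);
        float q = edge * 0.25f;  /* offset from parent centre to child centre */
        c.x += (o & 1) ? q : -q;
        c.y += (o & 2) ? q : -q;
        c.z += (o & 4) ? q : -q;
        edge *= 0.5f;
        path[depth++] = o;
    }
    return depth;
}
```

Every work-item takes the same first step (reading the root), and the paths only diverge further down, which is the access pattern the question is about.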


      Thank you

        • broadcasting?

          Broadcasting is supported in the local memory of the 5xxx series: 32 threads reading the same memory location will get their request processed in 1 cycle.

          Constant buffers give a bandwidth of 600 GB/s when all threads access the same memory location and the index is dynamic.

          Global memory gives a bandwidth of 250 GB/s, as there is a bit of cache reuse: global buffer memory operations translate to VFETCH instructions, which means the accesses go through the L1 texture cache.

          So you will definitely get higher bandwidth when all threads are accessing the same memory location.

            • broadcasting?

              Thank you


              1) kernels that compress the first k levels (one long vector of nulls or addresses)

              2) kernels that access the nodes starting from level k directly (from the vector, like a hash map)
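
One way to read the two steps above, sketched in plain C: the first k levels are replaced by a dense table with 8^k entries, and each lookup computes its level-k cell directly from the position instead of walking down from the root. Everything named here (`cell_coord`, `level_k_index`, `start_node`, the `NO_NODE` sentinel, the root-to-leaf octant ordering of the index) is an assumption made for illustration; the post does not say how the vector is laid out.

```c
#include <assert.h>
#include <stdint.h>

#define NO_NODE (-1)  /* assumed "null" marker for an empty level-k cell */

/* Integer coordinate (0 .. 2^k - 1) of the level-k cell containing p along
   one axis, for a root cell starting at root_min with edge root_edge.
   Positions outside the root are clamped to the nearest cell. */
static uint32_t cell_coord(float p, float root_min, float root_edge, int k)
{
    uint32_t n = 1u << k;
    float t = (p - root_min) / root_edge * (float)n;
    if (t < 0.0f) return 0;
    if (t >= (float)n) return n - 1;
    return (uint32_t)t;
}

/* Concatenate the octant choices from the root down to level k into one
   index (3 bits per level, x in bit 0, y in bit 1, z in bit 2).
   Requires k <= 10 so that the result fits in 32 bits. */
static uint32_t level_k_index(uint32_t ix, uint32_t iy, uint32_t iz, int k)
{
    assert(k >= 0 && k <= 10);
    uint32_t idx = 0;
    for (int b = k - 1; b >= 0; --b) {
        uint32_t o = ((ix >> b) & 1u)
                   | (((iy >> b) & 1u) << 1)
                   | (((iz >> b) & 1u) << 2);
        idx = (idx << 3) | o;
    }
    return idx;
}

/* Step 2: fetch the node at which traversal should start. The table holds
   8^k entries, each a node offset or NO_NODE. */
static int32_t start_node(const int32_t *table, int k,
                          uint32_t ix, uint32_t iy, uint32_t iz)
{
    return table[level_k_index(ix, iy, iz, k)];
}
```

With this layout, neighbouring work-items read different table entries rather than all reading the root, at the cost of 8^k table slots, so k has to stay small (k = 5 already means 32768 entries).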