mjharvey
Adept I

Questions about RV870's execution of work-groups.

coming from an Nvidia perspective...

Hi,

I've been programming Nvidia hardware up until now, and I'm trying to understand exactly how the RV870 differs with respect to some aspects of thread/work-group execution. If anyone could answer these questions, it would be much appreciated!

*) According to the OpenCL device info, the maximum work-group size is 256 threads, so this means that each stream processor cluster/compute unit is capable of executing 4 wavefronts concurrently?

*) Can these wavefronts be allocated to different work-groups, or is there a limitation on the number of work-groups that an spc can execute at a given time? (e.g. only 1 work-group at once, irrespective of thread count)

*) According to the Evergreen docs, each spc has a register file of only 128 128-bit registers. This is much smaller than Nvidia's register file, so how are these distributed amongst threads? Is it in any way like the Nvidia case, where a kernel's register use affects the occupancy / max number of threads per work-group or groups per compute unit?

Ta,

Matt

2 Replies
genaganna
Journeyman III

Originally posted by: mjharvey Hi,

*) According to the OpenCL device info, the maximum work-group size is 256 threads, so this means that each stream processor cluster/compute unit is capable of executing 4 wavefronts concurrently?

As per the OpenCL implementation, the maximum work-group size is 256 threads, so a work-group can contain at most 4 wavefronts of 64 threads each. I am not sure about the capabilities of the device itself.
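To make the arithmetic above concrete, here is a minimal sketch. It assumes a wavefront width of 64 work-items; the function name is only illustrative and is not part of any OpenCL API.

```python
# Assumption: 64 work-items per wavefront on this hardware.
WAVEFRONT_SIZE = 64

def wavefronts_per_work_group(work_group_size, wavefront_size=WAVEFRONT_SIZE):
    """Number of wavefronts needed to hold one work-group (rounded up)."""
    return (work_group_size + wavefront_size - 1) // wavefront_size

# A work-group at the reported maximum size of 256 work-items.
print(wavefronts_per_work_group(256))
```

A work-group whose size is not a multiple of the wavefront width still occupies a whole extra wavefront, which is why the division rounds up.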

*) Can these wavefronts be allocated to different work-groups, or is there a limitation on the number of work-groups that an spc can execute at a given time? (e.g. only 1 work-group at once, irrespective of thread count)

A compute unit can execute more than one work-group at a time if resources such as registers and shared memory are sufficient. I am not sure what the maximum number of concurrently running work-groups is.
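As a rough sketch of that resource argument: the number of work-groups that fit on a compute unit is bounded by whichever resource runs out first. The function and all the budget numbers below are hypothetical placeholders for illustration, not documented RV870 limits.

```python
def max_resident_work_groups(wavefronts_per_group, local_mem_per_group,
                             wavefront_budget, local_mem_budget):
    """Work-groups that fit at once, limited by the scarcest resource.

    All four arguments are inputs you would have to look up or measure;
    nothing here queries real hardware.
    """
    by_wavefronts = wavefront_budget // wavefronts_per_group
    if local_mem_per_group == 0:
        # A kernel that uses no local memory is not limited by it.
        return by_wavefronts
    by_local_mem = local_mem_budget // local_mem_per_group
    return min(by_wavefronts, by_local_mem)

# Hypothetical example: 4 wavefronts and 12 KB of local memory per group,
# with made-up budgets of 16 wavefronts and 32 KB of local memory.
print(max_resident_work_groups(4, 12 * 1024, 16, 32 * 1024))
```

In this made-up case the wavefront budget would allow 4 groups, but local memory only allows 2, so local memory is the limiting resource.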

*) According to the Evergreen docs, each spc has a register file of only 128 128-bit registers. This is much smaller than Nvidia's register file, so how are these distributed amongst threads? Is it in any way like the Nvidia case, where a kernel's register use affects the occupancy / max number of threads per work-group or groups per compute unit?

 

I think it is 256 KB/compute unit

eduardoschardong
Journeyman III

Originally posted by: mjharvey Hi,

*) According to the Evergreen docs, each spc has a register file of only 128 128-bit registers. This is much smaller than Nvidia's register file, so how are these distributed amongst threads? Is it in any way like the Nvidia case, where a kernel's register use affects the occupancy / max number of threads per work-group or groups per compute unit?

IIRC the register file is 256 128-bit registers x 64 (so 256 KB) per SIMD. That is much bigger than Nvidia's and likely big enough not to be a problem for you. You also don't need as many active wavefronts to hide all latencies.
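A quick check of that figure, reading "256 128-bit x 64" as 256 registers of 16 bytes for each of 64 threads in a wavefront. The second function is a simplified model of my own (registers split evenly, no other hardware caps), so treat it as an assumption rather than a documented rule.

```python
REGISTERS_PER_THREAD = 256   # from the post: 256 registers
BYTES_PER_REGISTER = 16      # 128 bits
THREADS_PER_WAVEFRONT = 64   # the "x 64" in the post

# Total register file size per SIMD, in bytes.
register_file_bytes = (REGISTERS_PER_THREAD * BYTES_PER_REGISTER
                       * THREADS_PER_WAVEFRONT)
print(register_file_bytes // 1024, "KB")

def wavefronts_by_registers(regs_per_thread):
    """Simplified model: wavefronts that fit if registers are the only limit."""
    return REGISTERS_PER_THREAD // regs_per_thread

# Hypothetical kernel using 32 registers per thread.
print(wavefronts_by_registers(32))
```

Under this model, a kernel's register use does affect how many wavefronts can be resident, much like occupancy on Nvidia hardware, although the budget per thread is considerably larger.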

 
