If we query an OpenCL device, we can discover the maximum number of compute units. According to the OpenCL spec, these compute units consist of processing elements (PEs). How can we find out the number of PEs in a compute unit?
I was looking at this to do a slightly hacky peak flops calculation for Bullet the other week and failed. I don't think there is a way. It isn't particularly meaningful, anyway, because there's no way to take into account how you might utilise the device. I would just check the device name and from that have a lookup table to do a mapping.
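As a rough sketch of the lookup-table idea, the mapping could look something like the following. The table name, the helper functions, and the exact device-name string are all hypothetical; the single entry uses the figures quoted in this thread for the HD 5650 (400 ALUs across 5 compute units), and the name the runtime actually reports may well differ.

```python
from typing import Optional

# Hypothetical table: reported device name -> processing elements per compute unit.
# The key string is an assumption; check what your OpenCL runtime really reports.
PES_PER_CU = {
    "ATI Mobility Radeon HD 5650": 80,  # 400 ALUs / 5 compute units, per this thread
}


def pes_per_compute_unit(device_name: str) -> Optional[int]:
    """Return the PEs per compute unit for a known device, or None if unknown."""
    return PES_PER_CU.get(device_name.strip())


def total_pes(device_name: str, compute_units: int) -> Optional[int]:
    """Multiply the queried compute-unit count by the table entry, if there is one."""
    per_cu = pes_per_compute_unit(device_name)
    return None if per_cu is None else per_cu * compute_units
```

The compute-unit count would come from the device query; returning None for unknown names keeps the caller from silently using a wrong figure.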
How can you find out about compute units and such at all? Some specification pages do not mention them: http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-4000/hd-4550/Pages/ati-radeon-hd-4550-specifications.aspx
It looks to me like the number of compute units is determined by the chip's physical and logical organization. My HD 5650 reports 5 (it could just be a bug, of course), which would make them quite "fat" compared to those in smaller chips, and I hope they do more work than the smaller compute units in other chips. Now, has AMD described the trade-offs that dictate compute unit organization? I cannot decide at this stage whether I want more or fewer compute units. Would the ideal perhaps be lean CUs with a single SP each? Or can you actually speed up processing with the CU abstraction? I do "feel" that programs will behave differently with different CU sizes and that, for example, moving to bigger or smaller chips may bring some surprises.
Single SPs in a CU would be low throughput. Low-power GPUs work that way (ARM's Mali, say), but it's hard to reach very high compute throughput without SIMD, much as a single-SP CU would be ideal from a programming perspective. Nobody really likes programming for SIMD.
The 5650 has 400 "pipelines" as per the spec, so 400 ALUs, or 5 SIMD units that are each 80 ALUs (16 VLIW lanes) wide. That looks right to me.
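The arithmetic behind those figures can be sketched as below. The function names are made up for illustration. The 5 ALUs per VLIW lane follows from 80 ALUs over 16 lanes, and the peak-FLOPS helper assumes 2 floating-point operations per ALU per clock (one multiply-add), which is an assumption and not something stated in this thread. The clock is left as a parameter because no clock speed is given here.

```python
def alus_per_simd(vliw_lanes: int, alus_per_lane: int) -> int:
    """ALUs in one SIMD unit (one compute unit): lanes times ALUs per VLIW lane."""
    return vliw_lanes * alus_per_lane


def total_alus(compute_units: int, vliw_lanes: int, alus_per_lane: int) -> int:
    """ALUs on the whole device."""
    return compute_units * alus_per_simd(vliw_lanes, alus_per_lane)


def peak_gflops(alus: int, clock_mhz: float, flops_per_alu_per_clock: int = 2) -> float:
    """Theoretical peak in GFLOPS; the default of 2 assumes one multiply-add per clock."""
    return alus * flops_per_alu_per_clock * clock_mhz / 1000.0


# 5 compute units x 16 VLIW lanes x 5 ALUs per lane
print(total_alus(compute_units=5, vliw_lanes=16, alus_per_lane=5))  # prints 400
```

As noted earlier in the thread, a peak figure like this says nothing about how well a given kernel actually utilises the device.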
To allow scaling down without sacrificing parallelism on the very low-end parts (i.e., to still execute a vertex and pixel shader simultaneously without blowing the transistor budget) we also narrow the SIMD units. The cost is that control logic increases relative to ALU logic, but that's a reasonable trade to hit the very low power point.
Programs will behave differently if you start to drop barriers (I have a habit of doing that). Really you should match your workgroup size to the wave size and let the shader compiler drop barriers for you. That way you stay within spec and should get the same behaviour, if not the same performance.
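A minimal sketch of picking launch sizes that way, assuming the wave size is already known: the helper name is hypothetical, and the default of 64 is an assumption that does not hold on every part (narrower low-end chips may use a smaller wave). In OpenCL 1.1 and later, the per-kernel value can be obtained from clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

```python
from typing import Tuple


def launch_sizes(n_items: int, wave_size: int = 64) -> Tuple[int, int]:
    """Return (global_size, local_size) with one wave per workgroup.

    The global size is rounded up to a whole number of workgroups, so the
    kernel must bounds-check its global ID against n_items.
    """
    if n_items <= 0 or wave_size <= 0:
        raise ValueError("n_items and wave_size must be positive")
    local_size = wave_size
    num_groups = (n_items + local_size - 1) // local_size  # ceiling division
    return num_groups * local_size, local_size


print(launch_sizes(1000))  # prints (1024, 64)
```

The barriers stay in the kernel source, so the code remains correct on devices with a different wave size; only the chance for the compiler to elide them changes.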