A CU with a single SP would be low throughput. Low power GPUs work that way (ARM's Mali, say) but it's hard to reach very high compute throughput without SIMD. A single-SP CU would be ideal from a programming perspective, since nobody really likes programming for SIMD, but it doesn't scale.
The 5650 has 400 "pipelines" as per spec, so 400 ALUs, or 5 SIMD units that are each 80 ALUs wide (16 lanes of 5-wide VLIW). That looks right to me.
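As a quick sanity check on that arithmetic, here is a minimal sketch. The 5 SIMDs and 16 lanes come from the numbers above, and the 5-wide VLIW figure is what 80 ALUs over 16 lanes works out to. The function name is only there for illustration.

```python
def total_alus(simd_units: int, lanes_per_simd: int, vliw_width: int) -> int:
    """ALU count = SIMD units x lanes per SIMD x VLIW slots per lane."""
    return simd_units * lanes_per_simd * vliw_width


# Figures from the post: 16 lanes per SIMD, 5 VLIW slots per lane, 5 SIMDs.
alus_per_simd = total_alus(1, 16, 5)
alus_total = total_alus(5, 16, 5)
print(alus_per_simd, alus_total)
```

Narrowing the SIMD on a low end part just means a smaller `lanes_per_simd`, which is why the ALU count drops without losing the number of independent SIMD units.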
To allow scaling down without sacrificing parallelism on the very low end parts (i.e. to still execute a vertex and a pixel shader simultaneously without blowing the transistor budget) we also narrow the SIMD units. The cost is that control logic grows relative to ALU logic, but that's a reasonable trade to hit the very low power point.
Programs will behave differently if you start to drop barriers yourself (I have a habit of doing that). Really you should match your workgroup size to the wave size and let the shader compiler drop barriers for you. That way you stay within spec and should get the same behaviour everywhere, if not the same performance.
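To make the reasoning concrete, here is a rough sketch of the kind of check a compiler could make before dropping a barrier. It is an illustration only, not any real compiler's logic: the function name is made up, and the wave size of 64 used in the example is an assumption (it is smaller on parts with narrower SIMDs), so query the device rather than hard-coding it.

```python
def barrier_can_be_dropped(workgroup_size: int, wave_size: int) -> bool:
    """Hypothetical check: a workgroup that fits inside one wavefront
    executes in lockstep, so a barrier has no other wave to wait for."""
    if workgroup_size <= 0 or wave_size <= 0:
        raise ValueError("sizes must be positive")
    return workgroup_size <= wave_size


ASSUMED_WAVE_SIZE = 64  # assumption for illustration; varies by device

for wg in (32, 64, 128, 256):
    print(wg, barrier_can_be_dropped(wg, ASSUMED_WAVE_SIZE))
```

The point of leaving the barrier in the source is that the same kernel stays correct on a device where the wave is narrower than your workgroup. The compiler can only make this decision if it knows the workgroup size at compile time, for instance through OpenCL's `reqd_work_group_size` kernel attribute.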