There are some limitations on scatter on ATI hardware. The Brook+ runtime tries to virtualize these limitations, at the cost of some performance overhead. If you want to avoid this virtualization, use a 128-bit (float4, int4, double2) 1D stream with size < 8192 as the scatter stream.
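As a rough illustration of that rule, here is a small Python sketch that checks whether a stream layout meets the three conditions (128-bit elements, 1D, fewer than 8192 elements). The function name and its parameters are hypothetical and are not part of the Brook+ API; it only encodes the conditions stated in this post.

```python
# Hypothetical helper, NOT part of Brook+: it only encodes the conditions
# described in this post for a scatter stream that needs no virtualization.

REQUIRED_ELEMENT_BITS = 128   # float4, int4 and double2 are all 128 bits wide
MAX_STREAM_LENGTH = 8192      # exclusive upper bound on the 1D stream size


def avoids_scatter_virtualization(element_bits, dims):
    """Return True if the stream layout matches the conditions in the post.

    element_bits -- width of one stream element in bits (assumed input)
    dims         -- tuple of stream dimensions, e.g. (4096,) for a 1D stream
    """
    is_128_bit = element_bits == REQUIRED_ELEMENT_BITS
    is_1d = len(dims) == 1
    is_small_enough = is_1d and 0 < dims[0] < MAX_STREAM_LENGTH
    return is_128_bit and is_1d and is_small_enough


if __name__ == "__main__":
    # float4 stream with 4096 elements: meets all three conditions
    print(avoids_scatter_virtualization(128, (4096,)))
    # scalar float stream: element is only 32 bits wide
    print(avoids_scatter_virtualization(32, (4096,)))
    # 2D float4 stream: not 1D
    print(avoids_scatter_virtualization(128, (64, 64)))
```

For example, a float4 stream of 4096 elements passes the check, while a scalar float stream or a 2D stream of any element type does not.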
Let me know what change in performance you see if you use the above-mentioned configuration for the scatter stream.