There are some limitations on ATI hardware with scatter. The Brook+ runtime tries to virtualize these limitations at the cost of some performance overhead. If you want to avoid this virtualization, you should use a 128-bit (float4, int4, double2) 1D stream with size < 8192 as the scatter stream.
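As a rough sketch, the three conditions above could be checked on the host side before setting up the scatter stream. This is plain C, not Brook+ API: `stream_desc`, `scatter_avoids_virtualization`, and the two constants are made-up names for illustration, and the limits are simply the ones stated above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Limits taken from the advice above; the names are hypothetical. */
enum { SCATTER_ELEM_BITS = 128, SCATTER_MAX_LEN = 8192 };

/* Hypothetical description of a stream; this is not a Brook+ type. */
typedef struct {
    unsigned rank;       /* number of dimensions */
    size_t   length;     /* element count of a 1D stream */
    size_t   elem_bytes; /* bytes per element, e.g. 16 for float4 */
} stream_desc;

/* Returns true when the stream is 1D, has fewer than 8192 elements,
   and uses a 128-bit element type (float4, int4, double2). */
static bool scatter_avoids_virtualization(const stream_desc *s)
{
    return s->rank == 1
        && s->length < SCATTER_MAX_LEN
        && s->elem_bytes * 8 == SCATTER_ELEM_BITS;
}
```

For example, a 1D stream of 4096 float4 elements passes the check, while a 1D stream of 8192 elements or a stream of plain float elements does not.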
Let me know what change in performance you see if you use the above configuration for the scatter stream.
I've tried it. It seems the performance improves if I use a 1D stream.
Thanks for your help. I'll keep working on it.