Or you can compute four values in one work-item and then write the result out as a single packed word, like this:
out[gid] = (a << 24) | (b << 16) | (c << 8) | d;
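For anyone who wants to check the packing on the host first, here is a minimal sketch in plain C of the same expression. The helper names pack4 and unpack_byte are hypothetical, and it assumes a, b, c and d are 8-bit values with a ending up in the most significant byte. The casts widen each byte before shifting, so the shift by 24 is never done on a signed int.

```c
#include <stdint.h>

/* Hypothetical helper: packs four bytes into one 32-bit word,
   with a in the most significant byte and d in the least. */
static uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    /* Widen to 32-bit unsigned before shifting to avoid
       shifting into the sign bit of a promoted int. */
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8)  |  (uint32_t)d;
}

/* Hypothetical helper: reads one byte back out of a packed word.
   index 0 returns a, index 3 returns d. */
static uint8_t unpack_byte(uint32_t word, unsigned index)
{
    return (uint8_t)(word >> (24u - 8u * index));
}
```

With this layout, one uint store per work-item replaces four separate byte stores, which is why the byte-addressable extension is no longer needed for the output buffer.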
Yes, the byte-addressable extension was the problem. I removed it and converted my buffers to uint, and it is now working fine on the GPU.
Thanks for finding that.