Is there a way to do parallel reduction without using local memory?

I studied the sample provided in AMD APP SDK 2.4. It uses local memory and receives only one vector as its input.

I want to do a parallel reduction without using local memory. For example, the kernel receives three input vectors and outputs three values (each one being the reduction of one vector). Is it possible to write this kind of kernel? Any hints or help would be appreciated.
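
To make the goal concrete, here is a serial C reference for the result such a kernel would be expected to produce. It is only a sketch under assumptions not stated above: the function name `reduce3_reference` is hypothetical, the reduction is assumed to be a sum, and the three vectors are assumed to be `float` arrays of the same length `n`. It does not show how to parallelize anything; it only pins down the expected output.

```c
#include <stddef.h>

/* Hypothetical serial reference: sums three equal-length vectors
 * independently and writes one result per vector into out[0..2].
 * Assumes the reduction operator is addition. */
void reduce3_reference(const float *a, const float *b, const float *c,
                       size_t n, float out[3])
{
    float sum_a = 0.0f;
    float sum_b = 0.0f;
    float sum_c = 0.0f;

    /* One pass over all three inputs; each accumulator is independent. */
    for (size_t i = 0; i < n; ++i) {
        sum_a += a[i];
        sum_b += b[i];
        sum_c += c[i];
    }

    out[0] = sum_a;
    out[1] = sum_b;
    out[2] = sum_c;
}
```

A parallel kernel would need to produce the same three values, allowing for small differences in floating-point rounding because the summation order changes.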

Thank you