The reduction kernel runs very slowly on my system.
I simply ran the sample reduction code with the default size (64x64) and got this result:
Width Height Iterations GPU Total Time
64 64 1 1.781000
I also got a segmentation fault when I ran './reduction -i 2 -t'.
I wonder whether the segmentation fault is related to the second-iteration problem I just posted.