I have always assumed that in a partial reduction the hardware splits the execution domain into pieces and reduces those pieces in parallel, one reduction per piece.
If so, it should be faster than a regular reduction.
Is that correct?
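
To make the model I have in mind concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: `partial_reduce` and `num_parts` are hypothetical names, the operator is assumed to be associative, and a thread pool stands in for whatever the hardware actually does.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def partial_reduce(values, num_parts, op=operator.add):
    """Split `values` into contiguous chunks, reduce each chunk
    independently, then combine the partial results.

    Hypothetical helper for illustration; `op` must be associative.
    """
    if not values:
        raise ValueError("cannot reduce an empty sequence")
    # Never use more parts than there are elements.
    num_parts = max(1, min(num_parts, len(values)))
    chunk = -(-len(values) // num_parts)  # ceiling division
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    # Each chunk is reduced on its own worker (stand-in for a hardware unit).
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = list(pool.map(lambda c: reduce(op, c), chunks))
    # Final step: combine the per-chunk partial results.
    return reduce(op, partials)


data = list(range(1, 101))
total = partial_reduce(data, num_parts=4)
```

In this sketch the per-chunk reductions are independent, but the partial results still have to be combined in a last step, so any speedup depends on how that step and the splitting overhead compare to the work saved. The threads here only show the structure; they are not meant as a performance measurement.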