MatrixMulImage example

Why exactly is the MatrixMulImage example provided with the new SDK so much faster than the standard MatrixMultiplication example? It is roughly 2 to 3 times faster.

It looks like the only difference is that memory is accessed through the new image support. Is it memory pinning, the texture cache, or something else? What is the source of the large performance boost from using an image instead of a normal memory buffer?