0 Replies Latest reply on Sep 7, 2011 6:30 AM by settle

    Interpreting output from AMD APP KernelAnalyzer on saxpy kernel

    settle

      I'm getting the hang of AMD APP KernelAnalyzer, but I can't make sense of the performance metrics it reports for the attached kernel code.  If I check the bounds before performing the saxpy operation (i.e., CHECK_BOUNDS != 0), I would expect lower throughput than without the check (i.e., CHECK_BOUNDS == 0); however, the reported throughput is the opposite of what I expect.  Am I doing something wrong, or is my expectation wrong?  The only explanation I can think of is that the bounds check forces one work-item to be paused and another to be swapped in, which hides some of the memory fetch latency.  But then why wouldn't the same switch happen without the bounds check?

      #define SIZE_T uint
      #define FLOAT float
      #define PTRDIFF_T int
      #define CHECK_BOUNDS
      #define N 65536
      #define ALPHA alpha
      #define INCX 1
      #define INCY 1
      #define GET_GLOBAL_OFFSET get_global_offset(0)
      #define GET_GLOBAL_SIZE get_global_size(0)

      __kernel void cl_blas_saxpy_kernel(
          const SIZE_T n,
          const FLOAT alpha,
          const __global FLOAT * restrict x,
          const PTRDIFF_T incx,
          __global FLOAT * restrict y,
          const PTRDIFF_T incy)
      {
      #if ((INCX) < 0 && (INCY) < 0)
          const size_t i = 2 * (GET_GLOBAL_OFFSET) + (GET_GLOBAL_SIZE) - 1 - get_global_id(0);
      #elif ((INCX) > 0 && (INCY) > 0)
          const size_t i = get_global_id(0);
      #else
      #if !defined(INCX)
      #define INCX incx
      #endif
      #if !defined(INCY)
      #define INCY incy
      #endif
          const size_t i = ((INCX) < 0 && (INCY) < 0)
              ? 2 * (GET_GLOBAL_OFFSET) + (GET_GLOBAL_SIZE) - 1 - get_global_id(0)
              : get_global_id(0);
      #endif

      #if !defined(IX)
          const size_t ix = (INCX) > 0 ? 0 : (1 - (N)) * (INCX);
      #define IX ix
      #endif
      #if !defined(IY)
          const size_t iy = (INCY) > 0 ? 0 : (1 - (N)) * (INCY);
      #define IY iy
      #endif

      #if (CHECK_BOUNDS)
          if (i < (N))
      #endif
          {
              y[(IY) + i * (INCY)] += (ALPHA) * x[(IX) + i * (INCX)];
          }
      }
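
      For checking what the kernel is supposed to compute, independent of any throughput numbers, here is a minimal host-side sketch in plain C of the same indexing. The function name saxpy_ref is hypothetical and not part of the kernel above; the sketch assumes the BLAS-style convention the kernel appears to follow, where a negative stride starts from offset (1 - n) * inc. The loop condition i < n plays the role of the CHECK_BOUNDS test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference (not from the original kernel source).
 * Computes y[iy + i*incy] += alpha * x[ix + i*incx] for i in [0, n). */
void saxpy_ref(size_t n, float alpha,
               const float *x, ptrdiff_t incx,
               float *y, ptrdiff_t incy)
{
    /* A zero stride is not handled by this sketch. */
    assert(incx != 0 && incy != 0);

    /* Start offsets as in the kernel: 0 for a positive stride,
     * (1 - n) * inc for a negative stride. */
    const ptrdiff_t ix = incx > 0 ? 0 : (1 - (ptrdiff_t)n) * incx;
    const ptrdiff_t iy = incy > 0 ? 0 : (1 - (ptrdiff_t)n) * incy;

    /* The loop bound corresponds to the kernel's "if (i < N)" check. */
    for (size_t i = 0; i < n; ++i) {
        y[iy + (ptrdiff_t)i * incy] += alpha * x[ix + (ptrdiff_t)i * incx];
    }
}
```

      Running the kernel with and without the bounds check and comparing the device output against a reference like this would at least confirm that both variants produce the same result when the global size equals N.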