Hi,

I'm working on a framework for classification and scheduling of computations on heterogeneous multi-device platforms.

I recently added a simple computation to the training-samples set that computes the column sums of a matrix (output[i] = Sum over r of input[r, i]).
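For reference, a plain C sketch of the same computation on the host side (assuming row-major storage; the function name and signature are illustrative, not taken from the framework):

```c
#include <stddef.h>

/* Column sums of a rows-by-cols matrix stored row-major:
   out[i] = sum over r of in[r * cols + i]. */
void sum_cols(const float *in, float *out, size_t rows, size_t cols)
{
    for (size_t i = 0; i < cols; i++) {
        float accum = 0.0f;
        for (size_t r = 0; r < rows; r++)
            accum += in[r * cols + i];
        out[i] = accum;
    }
}
```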

The kernel code is the following (it looks a bit strange because it's generated from an F# function):

kernel void SumCols(global float* matA, global float* c,
                    int matA_length_0, int matA_length_1, int c_length_0)
{
    int r = get_global_id(0);
    float accum = 0;
    for (int i = 0; i <= matA_length_1 - 1; i++) {
        accum = accum + matA[(r * matA_length_0) + i];
    }
    c[r] = accum;
}

I'm getting weirdly fluctuating completion times when varying the input matrix size from 64x64 to 2048x2048 (element type is float32) in steps of 64.

The integrated GPU is a 7660D in the A10-5800K APU.

The following graph shows the completion time by varying the input size. A CSV with numbers is available here: featureBasedScheduling/Sum By Cols-Table 1.csv at master · morellid/featureBasedScheduling · GitHub.

Any hint about what may cause this strange behaviour?

Hi,

The execution time depends on various factors besides the global data size (here, the matrix size). Could you please share the complete code (including the host code) so we can check and test it? As the matrix size increases, the stride of the memory accesses may play an important role. Do you see similar behaviour on other GPU cards as well, or only on the A10-5800K APU? (The numbers at the above link look confusing; could you share them in another format, say Excel?)
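To illustrate the stride point: with row-major storage, walking down a column touches memory locations that are `cols` elements apart, while walking along a row is stride-1 and cache/coalescing friendly. A minimal C sketch of the two access patterns (names illustrative; whether the stride actually hurts depends on the GPU's caches and coalescing rules):

```c
#include <stddef.h>

/* Row-major matrix of size rows x cols.
   Summing one column: consecutive accesses are cols floats apart. */
float sum_one_col(const float *m, size_t rows, size_t cols, size_t c)
{
    float s = 0.0f;
    for (size_t r = 0; r < rows; r++)
        s += m[r * cols + c];   /* stride = cols elements */
    return s;
}

/* Summing one row: consecutive accesses are adjacent in memory. */
float sum_one_row(const float *m, size_t cols, size_t r)
{
    float s = 0.0f;
    for (size_t c = 0; c < cols; c++)
        s += m[r * cols + c];   /* stride = 1 element */
    return s;
}
```

Both functions compute correct sums; only the memory-access pattern differs, which is what can shift performance as the matrix grows.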

Regards,