
laobrasuca
Journeyman III

problem with clAmdBlasSgemm

incorrect results

hi all,

I need an SGEMM implementation on the GPU, so I tried AMD's clAmdBlasSgemm (Win7 Pro 64-bit, clAmdBlas-1.2.78, Radeon HD5770, driver 11.5, SDK 2.4). While comparing results I noticed that the output is not correct. The sample "example_sgemm.c" that ships with the library computes C = 10*A*B + 20*C, with:

A[] = {

11, 12, 13, 14, 15,
21, 22, 23, 24, 25,
31, 32, 33, 34, 35,
41, 42, 43, 44, 45}

B[] = {11, 12, 13,
    21, 22, 23,
    31, 32, 33,
    41, 42, 43,
    51, 52, 53}

C[] = {

11, 12, 13,
21, 22, 23,
31, 32, 33,
41, 42, 43}

with result

14420, 15790, 17060,
25120, 27490, 29760,
35820, 39190, 42460,
46520, 50890, 55160

 

while the correct result is:

21370, 22040, 22710, 
37070, 38240, 39410, 
52770, 54440, 56110, 
68470, 70640, 72810

 

I've also tested some other results, here's a simple one:

A[] = {

1, 2, 3,
1, 2, 3,
1, 2, 3}

B[] = {

1, 1, 1,
2, 2, 2,
3, 3, 3}

with transA = transB = clAmdBlasNoTrans, M = N = K = lda = ldb = ldc = 3 and alpha = 1, beta = 0, with result

6, 12, 18,
6, 12, 18,
6, 12, 18

 

while it should be 14 everywhere. Am I using it incorrectly, or is something else going on?
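For checking the numbers above, a naive CPU reference can help. The following is only a sketch, assuming row-major storage and no transposes (as in the sample); the function name sgemm_ref_row_major is made up for illustration and is not part of clAmdBlas.

```c
#include <stddef.h>

/* Naive CPU reference for C = alpha*A*B + beta*C, no transposes.
 * Hypothetical helper, not a clAmdBlas function.
 * All matrices are row-major: element (i, j) of a matrix with leading
 * dimension ld lives at index i * ld + j. */
void sgemm_ref_row_major(size_t M, size_t N, size_t K,
                         float alpha, const float *A, size_t lda,
                         const float *B, size_t ldb,
                         float beta, float *C, size_t ldc)
{
    size_t i, j, k;

    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            float acc = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (k = 0; k < K; k++) {
                acc += A[i * lda + k] * B[k * ldb + j];
            }
            C[i * ldc + j] = alpha * acc + beta * C[i * ldc + j];
        }
    }
}
```

Called with the sample's data (M = 4, N = 3, K = 5, alpha = 10, beta = 20, lda = 5, ldb = ldc = 3), this gives 21370 in the top-left entry, which is the value I listed as correct.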

 

0 Likes
7 Replies
galmok
Journeyman III

First of all, when you write:

A[] = { series of number }

is that how you enter them in C/C++ or how they are output from a math program like Matlab?

Matrices in BLAS are stored in column-major format. A matrix like this:

1 2 3

4 5 6

is stored in linear memory like this:

1 4 2 5 3 6

 

If I try to multiply your matrices, I run into a problem: the dimensions do not match. These need to match:

A [m x k] * B [k x n] = C [m x n]

Your A indicates you think the dimension is [4 x 5]. If you pass that to sgemm as a column-major matrix, your A actually looks like this:

11 15 24 33 42

12 21 25 34 43

13 22 31 35 44

14 23 32 41 45

Also, your last example should not give 14 in every field of the matrix, but it should also not give what you showed.
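To make the layout difference concrete, here is a small sketch of how the same linear buffer is indexed under each convention. The helper names at_row_major and at_col_major are made up for illustration; ld is the leading dimension (row length for row-major, column length for column-major).

```c
#include <stddef.h>

/* Hypothetical helpers, not part of any BLAS library. */

/* Element (i, j) when rows are contiguous; ld = number of columns
 * (or more, if the rows are padded). */
float at_row_major(const float *buf, size_t ld, size_t i, size_t j)
{
    return buf[i * ld + j];
}

/* Element (i, j) when columns are contiguous; ld = number of rows
 * (or more, if the columns are padded). */
float at_col_major(const float *buf, size_t ld, size_t i, size_t j)
{
    return buf[i + j * ld];
}
```

With the 20-element A from the first post, at_row_major(A, 5, 0, 1) is 12, while at_col_major(A, 4, 0, 1) is 15, which is the second entry of the first row in the reinterpreted matrix shown above.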

0 Likes

Originally posted by: galmok

First of all, when you write:

A[] = { series of number }

is that how you enter them in C/C++ or how they are output from a math program like Matlab?

Matrices in BLAS is in column-major format. A matrix like this:

1 2 3
4 5 6

is stored in linear memory like this:

1 4 2 5 3 6

It's C: a 1D constant-size array. The line breaks have no effect (they're just for readability). With clAmdBlasSgemm() you can set the storage order of your matrices (1st parameter); in this case it is set to row-major (clAmdBlasRowMajor).
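As a side note on how the two orders relate: a row-major buffer read as column-major is the transpose, so a row-major product can be obtained from a column-major routine by swapping the operands (C^T = B^T * A^T). The sketch below shows this on the CPU only. Both function names are made up for illustration, and this makes no claim about how clAmdBlas handles the order parameter internally.

```c
#include <stddef.h>

/* Column-major CPU reference: C = alpha*A*B + beta*C, no transposes.
 * Hypothetical helper; element (i, j) with leading dimension ld is at
 * index i + j * ld. */
void sgemm_col_major_ref(size_t M, size_t N, size_t K,
                         float alpha, const float *A, size_t lda,
                         const float *B, size_t ldb,
                         float beta, float *C, size_t ldc)
{
    size_t i, j, k;

    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            float acc = 0.0f;
            for (k = 0; k < K; k++) {
                acc += A[i + k * lda] * B[k + j * ldb];
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}

/* Row-major C (M x N) = alpha * A (M x K) * B (K x N) + beta * C,
 * computed through the column-major routine: the row-major buffers are
 * the column-major transposes, so swap A and B and swap M and N. */
void sgemm_row_major_via_col(size_t M, size_t N, size_t K,
                             float alpha, const float *A, size_t lda,
                             const float *B, size_t ldb,
                             float beta, float *C, size_t ldc)
{
    sgemm_col_major_ref(N, M, K, alpha, B, ldb, A, lda, beta, C, ldc);
}
```

With the sample's row-major inputs (lda = 5, ldb = ldc = 3), the wrapper produces the same 21370 in C[0] as a direct row-major computation.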

 

0 Likes

Before we can debug this, we really need all input to the function...

0 Likes

laobrasuca,

Can you please post a test case showing the issue?

 

0 Likes

it really is from the sample, see code attached.

/***********************************************************************
** Copyright (C) 2010,2011 Advanced Micro Devices, Inc. All Rights Reserved.
***********************************************************************/

#include <sys/types.h>
#include <stdio.h>
#include <string.h>

/* Include CLBLAS header. It automatically includes needed OpenCL header,
 * so we can drop out explicit inclusion of cl.h header.
 */
#include <clAmdBlas.h>

/* This example uses predefined matrices and their characteristics for
 * simplicity purpose.
 */
static const clAmdBlasOrder order = clAmdBlasRowMajor;

static const size_t M = 4;
static const size_t N = 3;
static const size_t K = 5;

static const cl_float alpha = 10;

static const clAmdBlasTranspose transA = clAmdBlasNoTrans;
static const cl_float A[] = {
    11, 12, 13, 14, 15,
    21, 22, 23, 24, 25,
    31, 32, 33, 34, 35,
    41, 42, 43, 44, 45
};
static const size_t lda = 5;        /* i.e. lda = K */

static const clAmdBlasTranspose transB = clAmdBlasNoTrans;
static const cl_float B[] = {
    11, 12, 13,
    21, 22, 23,
    31, 32, 33,
    41, 42, 43,
    51, 52, 53
};
static const size_t ldb = 3;        /* i.e. ldb = N */

static const cl_float beta = 20;

static cl_float C[] = {
    11, 12, 13,
    21, 22, 23,
    31, 32, 33,
    41, 42, 43
};
static const size_t ldc = 3;        /* i.e. ldc = N */

static void
printResult(void)
{
    size_t i, j, nrows;

    printf("Result:\n");

    nrows = (sizeof(C) / sizeof(cl_float)) / ldc;
    for (i = 0; i < nrows; i++) {
        for (j = 0; j < ldc; j++) {
            printf("%d ", (int)C[i * ldc + j]);
        }
        printf("\n");
    }
}

int
main(int argc, const char *argv[])
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx;
    cl_command_queue queue;
    cl_mem bufA, bufB, bufC;
    cl_event event = NULL;
    cl_ulong imgA = 0, imgB = 0;
    int useImages = 0;
    int ret = 0;

    /* parse the command line */
    if ((argc > 1) && !strcmp(argv[1], "--use-images")) {
        useImages = 1;
    }

    /* Setup OpenCL environment. */
    err = clGetPlatformIDs(1, &platform, NULL);
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
    queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* Setup clAmdBlas. */
    err = clAmdBlasSetup();
    if (err != CL_SUCCESS) {
        printf("clAmdBlasSetup() failed with %d\n", err);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        return 1;
    }

    /* Prepare OpenCL memory objects and place matrices inside them. */
    bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY, M * K * sizeof(*A),
                          NULL, &err);
    bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY, K * N * sizeof(*B),
                          NULL, &err);
    bufC = clCreateBuffer(ctx, CL_MEM_READ_WRITE, M * N * sizeof(*C),
                          NULL, &err);

    err = clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0,
        M * K * sizeof(*A), A, 0, NULL, NULL);
    err = clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0,
        K * N * sizeof(*B), A, 0, NULL, NULL);
    err = clEnqueueWriteBuffer(queue, bufC, CL_TRUE, 0,
        M * N * sizeof(*C), C, 0, NULL, NULL);

    if (useImages) {
        /* add scratch images to evaluate with an image based kernel */
        imgA = clAmdBlasAddScratchImage(ctx, 16, 64, NULL);
        imgB = clAmdBlasAddScratchImage(ctx, 16, 64, NULL);
    }

    /* Call clAmdBlas function. */
    err = clAmdBlasSgemm(order, transA, transB, M, N, K, alpha, bufA,
                         lda, bufB, ldb, beta, bufC, ldc, 1, &queue,
                         0, NULL, &event);
    if (err != CL_SUCCESS) {
        printf("clAmdBlasSgemm() failed with %d\n", err);
        ret = 1;
    }
    else {
        /* Wait for calculations to be finished. */
        err = clWaitForEvents(1, &event);

        /* Fetch results of calculations from GPU memory. */
        err = clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0,
                                  M * N * sizeof(*C), C, 0, NULL, NULL);

        /* At this point you will get the result of SGEMM placed in C array. */
        printResult();
    }

    /* Remove scratch OpenCL images */
    if (useImages) {
        clAmdBlasRemoveScratchImage(imgA);
        clAmdBlasRemoveScratchImage(imgB);
    }

    /* Release OpenCL memory objects. */
    clReleaseMemObject(bufC);
    clReleaseMemObject(bufB);
    clReleaseMemObject(bufA);

    /* Finalize work with clAmdBlas. */
    clAmdBlasTeardown();

    /* Release OpenCL working objects. */
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);

    return ret;
}

0 Likes

Thank you for helping to debug our example programs.

At first we thought there was a horrible problem with our blas routines, but we really do have exhaustive testing, and that testing could not have passed with such a problem.

Try changing matrix B to an identity matrix and run the example.  You get the SAME WRONG RESULT!

Look at the clEnqueueWriteBuffer call for matrix B.  See the problem?

 

This example will be fixed in a future release.
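For anyone who wants to confirm this without a GPU: the write to bufB in the posted sample passes A as the host pointer, so the device-side B holds the first K * N floats of A. The corrected call would presumably pass B instead, i.e. clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, K * N * sizeof(*B), B, 0, NULL, NULL). The sketch below emulates both cases on the CPU; the function names are made up for illustration and nothing here calls OpenCL or clAmdBlas.

```c
#include <stddef.h>
#include <string.h>

/* Row-major CPU reference: C = alpha*A*B + beta*C, no transposes.
 * Hypothetical helper, not a clAmdBlas function. */
void sgemm_cpu_check(size_t M, size_t N, size_t K,
                     float alpha, const float *A, size_t lda,
                     const float *B, size_t ldb,
                     float beta, float *C, size_t ldc)
{
    size_t i, j, k;

    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            float acc = 0.0f;
            for (k = 0; k < K; k++) {
                acc += A[i * lda + k] * B[k * ldb + j];
            }
            C[i * ldc + j] = alpha * acc + beta * C[i * ldc + j];
        }
    }
}

/* Runs the sample's computation on the CPU and stores the 4 x 3 result
 * in C_out (12 floats). If reproduce_bug is nonzero, the stand-in for
 * bufB is filled from A, as in the posted sample; otherwise from B. */
void run_sample_on_cpu(int reproduce_bug, float *C_out)
{
    static const float A[] = {
        11, 12, 13, 14, 15,  21, 22, 23, 24, 25,
        31, 32, 33, 34, 35,  41, 42, 43, 44, 45 };
    static const float B[] = {
        11, 12, 13,  21, 22, 23,  31, 32, 33,
        41, 42, 43,  51, 52, 53 };
    static const float C[] = {
        11, 12, 13,  21, 22, 23,  31, 32, 33,  41, 42, 43 };
    float bufB[5 * 3];

    /* Emulated upload of K * N floats into the B buffer. */
    memcpy(bufB, reproduce_bug ? A : B, sizeof(bufB));
    memcpy(C_out, C, sizeof(C));

    sgemm_cpu_check(4, 3, 5, 10.0f, A, 5, bufB, 3, 20.0f, C_out, 3);
}
```

With reproduce_bug set, the first entry comes out as 14420, matching the wrong output reported in the first post; with it cleared, the first entry is 21370.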

0 Likes

xD

copy-paste wins! lol

Indeed, that's the same result I had observed before, but I could not imagine that the problem was something like that! I had wasted so much time trying to see what I was doing wrong before posting here that I just let it go.

thx for posting

0 Likes