
galmok
Journeyman III

Performance of clEnqueueRead/Write/CopyBufferRect

Is there any information on when we can expect the performance of the following functions to be increased (significantly)?

clEnqueueReadBufferRect

clEnqueueWriteBufferRect

(clEnqueueCopyBufferRect) <- didn't really test this one and it may be ok.

A short test comparing a clEnqueueRead/WriteBuffer sequence with clEnqueueRead/WriteBufferRect, transferring the same amount of memory, shows a significant difference in transfer speed.

4096*4096*8 = 128MB uploaded and downloaded with the non-rect path takes about 7.5ms. This includes copying host memory to a linear host buffer, uploading the linear buffer to the device, and running a kernel to copy the linear device memory to the destination (strided) array.

The same array uploaded and downloaded with the rect functions takes about 0.48 seconds. This is about 6.5 times slower and, as it is, takes longer than the kernel operating on the data.
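For readers who haven't used the rect calls: the 4096x4096 test above maps onto clEnqueueWriteBufferRect roughly as sketched below (a hedged illustration, not code from the original post; the queue and buffer names in the comment are hypothetical). The key point is that region[0] and the pitches are in bytes, while region[1] and region[2] count rows and slices.

```c
#include <stddef.h>

/* Total payload of an n_rows x n_cols matrix of doubles moved by one
 * rect call.  region[0] is in BYTES; region[1] and region[2] count
 * rows and slices -- mixing these up is a common source of errors. */
size_t rect_total_bytes(size_t n_rows, size_t n_cols)
{
    size_t region[3] = { n_cols * sizeof(double), n_rows, 1 };
    return region[0] * region[1] * region[2];
}

/* The corresponding OpenCL 1.1 call, with hypothetical queue/buffer
 * names, a device leading dimension ld_dev and a host leading
 * dimension ld_host (both in elements):
 *
 *   size_t origin[3] = { 0, 0, 0 };
 *   size_t region[3] = { n_cols * sizeof(double), n_rows, 1 };
 *   clEnqueueWriteBufferRect(queue, dev_buf, CL_TRUE,
 *                            origin, origin, region,
 *                            ld_dev  * sizeof(double), 0,
 *                            ld_host * sizeof(double), 0,
 *                            host_ptr, 0, NULL, NULL);
 */
```

The row pitches let one call walk a strided sub-matrix on both the host and device side, which is exactly what the manual workaround emulates.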

 

0 Likes
11 Replies
himanshu_gautam
Grandmaster

Thanks for reporting it.

dragonxi4amd
Journeyman III

Hi,

Thank you for your performance request!

AMD has decided on the following approach (according to their answers):

Step 1: create a working OpenCL implementation

- does "too slow" still count as "working" from a developer's/user's point of view?

Step 2: improve performance after Step 1

- is it always possible to improve performance as much as required?

- is there a risk that Step 1 has produced a structure which will require big changes (which create new bugs) to become fast enough?

What's your opinion of AMD's strategy?

About your short tests:

1) OS?

2) GPU?

3) CPU?

4) system load (processes, threads, other apps running during your test)?

5) was the system load the same in all tests?

6) were you able to reproduce the results?

7) have you tried a clEnqueueCopyBufferRect test?

A few people on this forum have expressed concern about the performance of AMD's OpenCL implementation!

Thanks!

Well, to be honest, AMD has said they will focus on being feature-correct before optimising for performance.

I just hope they allocate some resources to this particular area, as my workaround, while much faster, requires much more memory, and judging by the performance on the nVidia platform there should be much room for improvement. On the nVidia platform, using the *Rect functions is always faster than using the workaround I described (host: strided->linear, upload linear, device: linear->strided).
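The device-side linear->strided step of that workaround is essentially a scatter by leading dimension. Below is a plain-C model of the logic (illustrative only, not galmok's actual kernel; in a real OpenCL kernel each work-item would typically handle one column or element instead of the outer loop):

```c
#include <stddef.h>

/* Copy an n_rows x n_cols matrix from a densely packed (column-major)
 * source into a destination with leading dimension ld >= n_rows. */
void linear_to_strided(const double *src, double *dst,
                       size_t n_rows, size_t n_cols, size_t ld)
{
    for (size_t j = 0; j < n_cols; ++j)      /* one "work-item" per column */
        for (size_t i = 0; i < n_rows; ++i)
            dst[j * ld + i] = src[j * n_rows + i];
}
```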

However, I have kernels that freeze the driver (another bug report is on its way here) and I'd rather they focus on that than on optimising performance. 😉


hi all,

Thanks for your views.

galmok,

Thanks for reporting the previous bug. Actually, it was also posted by someone else some time ago, and investigations are ongoing.

Please post the other bug you are talking about.

 


I haven't isolated the bug yet, but the driver freezes and the watchdog timer doesn't kick in. I can remotely log in (via cygwin ssh/sshd) and start programs and such, but can't initiate a shutdown. I only get halfway through a remote desktop connection, and the screen remains frozen. Once I've found it, I'll post it in this forum.

galmok
Journeyman III

I have isolated part of the bug. The difference between lockup and no lockup seems to lie in the kernel function definitions and 2-3 kernel lines. I have implemented much of the DTRSM blocked triangle solver and it works fine using my OpenCL 1.1 kernel definitions. With my OpenCL 1.0 kernel definitions, I eventually get a lockup. The lockup first occurs after some iterations, meaning the kernels complete for roughly the first 18 iterations and then suddenly cause a lockup.

All of it runs fine on nVidia hardware and on the CPU. Also, if the matrix is small enough, there are not enough iterations for the driver to lock up and the correct result is returned. Only when enough iterations are made does the driver lock up.

My OpenCL 1.1 and 1.0 kernel definitions (CLVERSION_1_1 is defined at runtime if OpenCL 1.1 has been compiled in and is enabled at runtime):

#ifdef CLVERSION_1_1
kernel void dtrsmL(int upper, int transa, int nounit, int m, int n,
                   double alpha, global double *A, int lda,
                   global double *B, int ldb)
{
#else
kernel void dtrsmL(int upper, int transa, int nounit, int m, int n,
                   double alpha, global double *A, int aoff, int lda,
                   global double *B, int boff, int ldb)
{
    A += aoff;
    B += boff;
#endif
    double b[64];
    int j = get_global_id(0); // column to calculate
    int i, k;
    double temp;
    ...

and

#ifdef CLVERSION_1_1
__kernel void dgemmNN_AMD(__global double4 *matrixA, int lda,
                          __global double4 *matrixB, int ldb,
                          __global double4 *matrixC, int ldc,
                          int widthA, double alpha, double beta,
                          __local double4 *blockB)
{
#else
__kernel void dgemmNN_AMD(__global double4 *matrixA, int aoff, int lda,
                          __global double4 *matrixB, int boff, int ldb,
                          __global double4 *matrixC, int coff, int ldc,
                          int widthA, double alpha, double beta,
                          __local double4 *blockB)
{
    matrixA += aoff;
    matrixB += boff;
    matrixC += coff;
#endif
    ...
Host code to set up the kernel arguments for dtrsmL:

#ifdef CL_VERSION_1_1
if (OpenCL11) {
    // create pointer to subbuffer (if either row or col offset != 0)
    subA = createSubBuffer(A, CL_MEM_READ_ONLY);
    subB = createSubBuffer(B, CL_MEM_READ_WRITE);
    // set kernel arguments
    ccSetKernelArgs(kernel, 10,
                    "cl_int", upper, "cl_int", trans, "cl_int", nounit,
                    "cl_int", m, "cl_int", n, "cl_double", alpha,
                    "cl_mem", &subA, "cl_uint", A.lda,
                    "cl_mem", &subB, "cl_uint", B.lda);
} else
#endif
{
    ccSetKernelArgs(kernel, 12,
                    "cl_int", upper, "cl_int", trans, "cl_int", nounit,
                    "cl_int", m, "cl_int", n, "cl_double", alpha,
                    "cl_mem", &A.mem, "cl_int", calculateOffset(A), "cl_uint", A.lda,
                    "cl_mem", &B.mem, "cl_int", calculateOffset(B), "cl_uint", B.lda);
}

Host code to set up the kernel arguments for dgemmNN_AMD:

#ifdef CL_VERSION_1_1
if (OpenCL11) {
    // create pointer to subbuffer (if either row or col offset != 0)
    subA = createSubBuffer(A, CL_MEM_READ_ONLY);
    subB = createSubBuffer(B, CL_MEM_READ_ONLY);
    subC = createSubBuffer(C, CL_MEM_READ_WRITE);
    // set kernel arguments
    // kernel void dgemmNN(int m, int n, global double *A, int lda,
    //                     global double *B, int ldb, global double *C,
    //                     int ldc, int k, double alpha, double beta)
    ccSetKernelArgs(kernel, 9,
                    "cl_mem", &subA,    // A (&A.mem)
                    "cl_int", A.lda,    // lda
                    "cl_mem", &subB,    // B (&B.mem)
                    "cl_int", B.lda,    // ldb
                    "cl_mem", &subC,    // C (&C.mem)
                    "cl_int", C.lda,    // ldc
                    "cl_int", k,        // k
                    "cl_double", alpha, // alpha
                    "cl_double", beta); // beta
    clSetKernelArg(kernel.kernel, 9,
                   (blockSize*blockSize)*(4*4)*sizeof(double), NULL);
} else
#endif
{
    ccSetKernelArgs(kernel, 12,
                    "cl_mem", &A.mem,   // A
                    "cl_int", calculateOffset(A),
                    "cl_int", A.lda,    // lda
                    "cl_mem", &B.mem,   // B
                    "cl_int", calculateOffset(B),
                    "cl_int", B.lda,    // ldb
                    "cl_mem", &C.mem,   // C
                    "cl_int", calculateOffset(C),
                    "cl_int", C.lda,    // ldc
                    "cl_int", k,        // k
                    "cl_double", alpha, // alpha
                    "cl_double", beta); // beta
    clSetKernelArg(kernel.kernel, 12,
                   (blockSize*blockSize)*(4*4)*sizeof(double), NULL);
}

The calculateOffset function calculates a valid offset to use (verified with the CPU and nVidia OpenCL platforms).
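One detail worth checking when debugging the OpenCL 1.1 path above: clCreateSubBuffer only accepts region origins aligned to the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN, otherwise it fails with CL_MISALIGNED_SUB_BUFFER_OFFSET. Below is a sketch of the byte-origin arithmetic a wrapper like createSubBuffer might use for a column-major sub-matrix; the helper name is mine, not from the post.

```c
#include <stddef.h>

/* Byte offset of element (row, col) in a column-major matrix of
 * doubles with leading dimension ld (all counts in elements). */
size_t submatrix_origin_bytes(size_t row, size_t col, size_t ld)
{
    return (col * ld + row) * sizeof(double);
}

/* A wrapper could then build the OpenCL 1.1 region from it:
 *
 *   cl_buffer_region region = { origin_bytes, size_bytes };
 *   cl_mem sub = clCreateSubBuffer(parent, CL_MEM_READ_ONLY,
 *                                  CL_BUFFER_CREATE_TYPE_REGION,
 *                                  &region, &err);
 *
 * If origin_bytes is not a multiple of CL_DEVICE_MEM_BASE_ADDR_ALIGN
 * (queried in bits, so divide by 8), err is
 * CL_MISALIGNED_SUB_BUFFER_OFFSET. */
```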


galmok,

Have you been able to isolate when the issue happens? Can you send a test case?


I have not been able to find out why or where this problem happens, but AMD already has my code that will trigger this issue.

For AMD:

Enable the test_gpu_dtrsm() call around line 616 in invMatrixLU.cpp and call the program with -0, which will make it use only OpenCL 1.0 compatible calls. Without -0, the test succeeds without a crash/lockup. The test should be run with a large size, e.g. 4096x4096 (init_test(4096,4096)).

For forum-readers:

I cannot publish my code right now, and I haven't created a smaller test case, as AMD already has the code.


Hi galmok,

Did you file a ticket or mail it somewhere? It would be easier to track down in that case.


It was mailed to gputools.support@amd.com on May 19, with the subject "Reg. project that the profiler will not profile".

My project could trigger 3 AMD errors (1 profiler error and 2 crash/freeze errors).

awatry
Journeyman III

Originally posted by: galmok

 

4096*4096*8 = 128MB uploaded and downloaded with the non-rect path takes about 7.5ms. This includes copying host memory to a linear host buffer, uploading the linear buffer to the device, and running a kernel to copy the linear device memory to the destination (strided) array.

 

The same array uploaded and downloaded with the rect functions takes about 0.48 seconds. This is about 6.5 times slower and, as it is, takes longer than the kernel operating on the data.

 

0.48 seconds = 480ms

480ms / 7.5ms = 64 times slower, which is much worse than 6.5 times.
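To put those timings in bandwidth terms (treating each figure as the total time for the 128 MB transfer, as reported above; the helper name is mine, not from the thread):

```c
/* Effective throughput in MB/s for size_mb megabytes moved in
 * ms milliseconds. */
double bandwidth_mb_s(double size_mb, double ms)
{
    return size_mb / (ms / 1000.0);
}

/* bandwidth_mb_s(128.0, 7.5)   -> roughly 17000 MB/s (workaround path)
 * bandwidth_mb_s(128.0, 480.0) -> roughly 267 MB/s   (*Rect path)     */
```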

I've got an OpenCL program I'm working on that uses a memcpy kernel in OpenCL 1.0 but will transition to clEnqueueCopyBufferRect with 1.1, so hopefully this can be sped up a bit.
