
    Performance of EnqueueRead/Write/CopyBufferRect

    galmok

      Is there any information on when we can expect the performance of the following functions to be increased (significantly)?

      clEnqueueReadBufferRect

      clEnqueueWriteBufferRect

      (clEnqueueCopyBufferRect) <- didn't really test this one and it may be ok.

      A short test comparing a clEnqueueRead/WriteBuffer sequence with clEnqueueRead/WriteBufferRect, transferring the same amount of memory, shows a significant difference in transfer speed.

      4096*4096*8 = 128 MB uploaded and downloaded with the non-rect functions takes about 7.5 ms. This includes copying host memory to a linear host buffer, uploading the linear buffer to the device, and running a kernel to copy the linear device memory to the destination array (strided).

      The same array uploaded and downloaded with rect takes about 0.48 seconds. This is about 6.5 times slower and, as it is, takes longer than the kernel operating on the data.
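      For reference, the rect path in my test is essentially a single call like this (a sketch with illustrative names and pitches, not my exact test code):

        /* Sketch: upload a 4096x4096 block of doubles into a strided device
         * buffer with clEnqueueWriteBufferRect. queue, devBuf, hostPtr and
         * devRowPitch are illustrative; error checks omitted. */
        size_t w = 4096, h = 4096, elem = sizeof(cl_double);
        size_t bufOrigin[3]  = {0, 0, 0};
        size_t hostOrigin[3] = {0, 0, 0};
        size_t region[3]     = {w * elem, h, 1};  /* bytes, rows, slices */

        clEnqueueWriteBufferRect(queue, devBuf, CL_TRUE,
                                 bufOrigin, hostOrigin, region,
                                 devRowPitch, 0,   /* device pitches */
                                 w * elem, 0,      /* host pitches   */
                                 hostPtr, 0, NULL, NULL);

        /* The non-rect equivalent of the same 128 MB, once packed: */
        clEnqueueWriteBuffer(queue, linearBuf, CL_TRUE, 0,
                             w * h * elem, hostPtr, 0, NULL, NULL);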

       

        • Performance of EnqueueRead/Write/CopyBufferRect
          himanshu.gautam

          Thanks for reporting it.

          • Performance of EnqueueRead/Write/CopyBufferRect
            dragonxi4amd

            Hi,

            Thank you for your performance request!

            AMD has decided on the following approach (according to their answer):

            Step 1: to create a working OpenCL implementation

            - does "too slow" count as "working" from the developers'/users' point of view?

            Step 2: to improve performance after Step 1

            - is it always possible to improve performance as much as required?

            - is there a risk that Step 1 has produced a structure that will require big changes (which create new bugs) to become fast enough?

            What's your opinion about AMD's strategy?

            About your short tests:

            1) OS?

            2) GPU?

            3) CPU?

            4) what was the system load (processes, threads, other apps) during your test?

            5) was the system load the same in all tests?

            6) were you able to reproduce the results?

            7) have you tried a clEnqueueCopyBufferRect test?

            A few people on this forum have expressed concern about the performance of AMD's OpenCL implementation!

            Thanks!

              • Performance of EnqueueRead/Write/CopyBufferRect
                galmok

                Well, to be honest, AMD has said they will focus on being feature-correct before optimising for performance.

                I just hope they allocate a bit of resources to this particular area: my workaround, while much faster, requires much more memory, and compared with the nVidia platform there should still be plenty of room for improvement. On the nVidia platform, the *Rect functions are always faster than the workaround I described (host: strided->linear, upload linear, device: linear->strided).
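                For the curious, the device-side half of that workaround is just a small scatter kernel along these lines (a sketch assuming column-major storage; the names are illustrative and it skips the offset/bounds handling my real kernel needs):

                  // Sketch: copy a packed (linear) source into a strided destination.
                  // One work-item per column; ld is the destination leading dimension.
                  kernel void linear_to_strided(global const double *src,
                                                global double *dst,
                                                int rows, int ld)
                  {
                      int col = get_global_id(0);
                      for (int row = 0; row < rows; ++row)
                          dst[col * ld + row] = src[col * rows + row];
                  }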

                However, I have kernels that freeze the driver (another bug report is on its way here) and I'd rather they focus on that than on optimising performance. ;-)

                  • Performance of EnqueueRead/Write/CopyBufferRect
                    himanshu.gautam

                    Hi all,

                    Thanks for your views.

                    galmok,

                    Thanks for reporting the previous bug. Actually, it was also posted by someone else some time ago, and investigations are ongoing.

                    Please post the other bug you are talking about.

                     

                      • Performance of EnqueueRead/Write/CopyBufferRect
                        galmok

                        I haven't isolated the bug yet, but the driver freezes and the watchdog timer doesn't kick in. I can log in remotely (via cygwin ssh/sshd) and start programs and such, but I can't initiate a shutdown, and I only get halfway through a remote desktop connection. The screen remains frozen. Once I've isolated it, I'll post it in this forum.

                          • Performance of EnqueueRead/Write/CopyBufferRect
                            galmok

                            I have isolated part of the bug. The difference between lockup and no lockup seems to lie in the kernel function definitions and 2-3 kernel lines. I have implemented much of the DTRSM blocked triangular solver, and it works fine using my OpenCL 1.1 kernel definitions. With my OpenCL 1.0 kernel definitions, I eventually get a lockup. The lockup first occurs after some iterations, meaning the kernels complete for roughly the first 18 iterations and then suddenly cause a lockup.

                            All of it runs fine on nVidia hardware and on the CPU. Also, if the matrix is small enough, there are not enough iterations for the driver to lock up, and the correct result is returned. Only when enough iterations are made does the driver lock up.

                            My OpenCL 1.1 and 1.0 kernel definitions (CLVERSION_1_1 is defined at runtime if OpenCL 1.1 has been compiled in and is enabled at runtime):

                              #ifdef CLVERSION_1_1
                              kernel void dtrsmL(int upper, int transa, int nounit, int m, int n,
                                                 double alpha, global double *A, int lda,
                                                 global double *B, int ldb)
                              {
                              #else
                              kernel void dtrsmL(int upper, int transa, int nounit, int m, int n,
                                                 double alpha, global double *A, int aoff, int lda,
                                                 global double *B, int boff, int ldb)
                              {
                                A += aoff;
                                B += boff;
                              #endif
                                double b[64];
                                int j = get_global_id(0); // column to calculate
                                int i, k;
                                double temp;
                                ...

                            and

                              #ifdef CLVERSION_1_1
                              __kernel void dgemmNN_AMD(__global double4 *matrixA, int lda,
                                                        __global double4 *matrixB, int ldb,
                                                        __global double4 *matrixC, int ldc,
                                                        int widthA, double alpha, double beta,
                                                        __local double4 *blockB)
                              {
                              #else
                              __kernel void dgemmNN_AMD(__global double4 *matrixA, int aoff, int lda,
                                                        __global double4 *matrixB, int boff, int ldb,
                                                        __global double4 *matrixC, int coff, int ldc,
                                                        int widthA, double alpha, double beta,
                                                        __local double4 *blockB)
                              {
                                matrixA += aoff;
                                matrixB += boff;
                                matrixC += coff;
                              #endif
                                ...

                            Host code to set up the kernel arguments for dtrsmL:

                              #ifdef CL_VERSION_1_1
                              if (OpenCL11) {
                                // create pointer to subbuffer (if either row or col offset != 0)
                                subA = createSubBuffer(A, CL_MEM_READ_ONLY);
                                subB = createSubBuffer(B, CL_MEM_READ_WRITE);
                                // set kernel arguments
                                ccSetKernelArgs(kernel, 10,
                                    "cl_int", upper, "cl_int", trans, "cl_int", nounit,
                                    "cl_int", m, "cl_int", n, "cl_double", alpha,
                                    "cl_mem", &subA, "cl_uint", A.lda,
                                    "cl_mem", &subB, "cl_uint", B.lda);
                              } else
                              #endif
                              {
                                ccSetKernelArgs(kernel, 12,
                                    "cl_int", upper, "cl_int", trans, "cl_int", nounit,
                                    "cl_int", m, "cl_int", n, "cl_double", alpha,
                                    "cl_mem", &A.mem, "cl_int", calculateOffset(A), "cl_uint", A.lda,
                                    "cl_mem", &B.mem, "cl_int", calculateOffset(B), "cl_uint", B.lda);
                              }

                            Host code to set up the kernel arguments for dgemmNN_AMD:

                              #ifdef CL_VERSION_1_1
                              if (OpenCL11) {
                                // create pointer to subbuffer (if either row or col offset != 0)
                                subA = createSubBuffer(A, CL_MEM_READ_ONLY);
                                subB = createSubBuffer(B, CL_MEM_READ_ONLY);
                                subC = createSubBuffer(C, CL_MEM_READ_WRITE);
                                // set kernel arguments
                                // kernel void dgemmNN(int m, int n, global double *A, int lda,
                                //                     global double *B, int ldb, global double *C,
                                //                     int ldc, int k, double alpha, double beta) {
                                ccSetKernelArgs(kernel, 9,
                                    "cl_mem", &subA,    // A (&A.mem)
                                    "cl_int", A.lda,    // lda
                                    "cl_mem", &subB,    // B (&B.mem)
                                    "cl_int", B.lda,    // ldb
                                    "cl_mem", &subC,    // C (&C.mem)
                                    "cl_int", C.lda,    // ldc
                                    "cl_int", k,        // k
                                    "cl_double", alpha, // alpha
                                    "cl_double", beta); // beta
                                clSetKernelArg(kernel.kernel, 9,
                                    (blockSize*blockSize)*(4*4)*sizeof(double), NULL);
                              } else
                              #endif
                              {
                                ccSetKernelArgs(kernel, 12,
                                    "cl_mem", &A.mem,   // A (&A.mem)
                                    "cl_int", calculateOffset(A),
                                    "cl_int", A.lda,    // lda
                                    "cl_mem", &B.mem,   // B (&B.mem)
                                    "cl_int", calculateOffset(B),
                                    "cl_int", B.lda,    // ldb
                                    "cl_mem", &C.mem,   // C (&C.mem)
                                    "cl_int", calculateOffset(C),
                                    "cl_int", C.lda,    // ldc
                                    "cl_int", k,        // k
                                    "cl_double", alpha, // alpha
                                    "cl_double", beta); // beta
                                clSetKernelArg(kernel.kernel, 12,
                                    (blockSize*blockSize)*(4*4)*sizeof(double), NULL);
                              }

                            The calculateOffset function calculates a valid offset to use (verified on the CPU and on the nVidia OpenCL platform).
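                            In case it matters for the investigation: createSubBuffer above is my own thin helper. Roughly, it boils down to a clCreateSubBuffer call like this (a sketch; the Matrix fields and the size calculation are illustrative, not my exact code):

                              // Sketch of the createSubBuffer helper used above (OpenCL 1.1).
                              // The Matrix fields (mem, rows, cols) and the size maths are
                              // illustrative; error handling is omitted.
                              cl_mem createSubBuffer(Matrix m, cl_mem_flags flags)
                              {
                                  cl_buffer_region region;
                                  cl_int err;
                                  // byte offset of the submatrix within the parent buffer;
                                  // clCreateSubBuffer requires this to be aligned to the
                                  // device's CL_DEVICE_MEM_BASE_ADDR_ALIGN
                                  region.origin = calculateOffset(m) * sizeof(cl_double);
                                  region.size   = (size_t)m.rows * m.cols * sizeof(cl_double)
                                                  - region.origin;
                                  return clCreateSubBuffer(m.mem, flags,
                                                           CL_BUFFER_CREATE_TYPE_REGION,
                                                           &region, &err);
                              }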

                    • Performance of EnqueueRead/Write/CopyBufferRect
                      awatry

                       

                      Originally posted by: galmok

                      4096*4096*8 = 128 MB uploaded and downloaded with the non-rect functions takes about 7.5 ms. This includes copying host memory to a linear host buffer, uploading the linear buffer to the device, and running a kernel to copy the linear device memory to the destination array (strided).

                      The same array uploaded and downloaded with rect takes about 0.48 seconds. This is about 6.5 times slower and, as it is, takes longer than the kernel operating on the data.

                      0.48 seconds = 480 ms.

                      480 ms / 7.5 ms = 64 times slower, which is much worse than 6.5 times.

                      I've got an OpenCL program I'm working on that used a memcpy kernel under OpenCL 1.0 but will transition to clEnqueueCopyBufferRect with 1.1... so hopefully this can be sped up a bit.
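                      For reference, the 1.1 call I plan on using looks roughly like this (a sketch; the queue, buffers and pitches are illustrative):

                        /* Sketch: device-to-device strided copy via clEnqueueCopyBufferRect
                         * (OpenCL 1.1). rowBytes, numRows and the pitches are illustrative;
                         * error checks omitted. */
                        size_t srcOrigin[3] = {0, 0, 0};
                        size_t dstOrigin[3] = {0, 0, 0};
                        size_t region[3]    = {rowBytes, numRows, 1};

                        clEnqueueCopyBufferRect(queue, srcBuf, dstBuf,
                                                srcOrigin, dstOrigin, region,
                                                srcRowPitch, 0,   /* source pitches      */
                                                dstRowPitch, 0,   /* destination pitches */
                                                0, NULL, NULL);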