11 Replies Latest reply on Jun 23, 2011 1:49 PM by divij

    Interface between CAL and OpenCL

    divij
      Calling a CAL routine from OpenCL

      I have to call a routine written in CAL from OpenCL.

      Is there a way to link/interface CAL and OpenCL code?

        • Interface between CAL and OpenCL
          jeff_golds

          No, we don't allow interop between CAL and OpenCL.  What do you need from CAL?

          Jeff

            • Interface between CAL and OpenCL
              divij

              Hey Jeff,

              I just want to use a highly efficient SGEMM routine in an OpenCL code.

              The matrix multiplication code bundled as a sample in the AMD APP SDK is able to give ~500 GFlops on HD699 for single precision.

              However, I have found codes developed by others, such as this, which give up to 2 TFlops. But all those codes are written in CAL, with the kernels written in IL assembly.

              So, if I want to use OpenCL, does that mean I won't be able to use highly optimised code?

              If we talk of OpenCL as the standard of the future, it should at least be able to match the optimizations achievable in other languages.

                • Interface between CAL and OpenCL
                  rick.weber

                  If you're a god, you can edit the ELF file spit out by the OpenCL program binary, inject your IL code, and it should run that on the fly. I have no idea what the parameter mapping between OpenCL and its IL backend is, though.

                  • Interface between CAL and OpenCL
                    jeff_golds

                    Originally posted by: divij
                    Hey Jeff,

                    I just want to use a highly efficient SGEMM routine in an OpenCL code.

                    The matrix multiplication code bundled as a sample in the AMD APP SDK is able to give ~500 GFlops on HD699 for single precision.

                    However, I have found codes developed by others, such as this, which give up to 2 TFlops. But all those codes are written in CAL, with the kernels written in IL assembly.

                    So, if I want to use OpenCL, does that mean I won't be able to use highly optimised code?

                    If we talk of OpenCL as the standard of the future, it should at least be able to match the optimizations achievable in other languages.



                    I believe you are encountering a limit of the sample, not the API. The block size in the sample is much smaller than in the optimized CAL version.

                    Jeff

                      • Interface between CAL and OpenCL
                        divij

                        Originally posted by: jeff_golds
                        I believe you are encountering a limit of the sample, not the API. The block size in the sample is much smaller than in the optimized CAL version.
                        Jeff

                        I have timed the sample provided in the SDK after increasing the block size as well; the result remains ~500 GFlops.

                        Yes, I agree that it is a limit of the sample and not the API. But to optimize GEMM beyond the provided sample, we would have to use registers, the texture cache, etc., because these are what the optimized CAL version uses. And direct user control over the VLIW units and other hardware details is not possible in OpenCL.

                        Jeff, can you suggest any optimization, beyond those already used in the sample, that is possible using OpenCL alone?

                        Thanks for your replies.

                        P.S. Do you work at AMD?
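                        As an aside on the numbers: GEMM throughput figures like the ~500 GFlops above are conventionally computed as 2·N³ floating-point operations divided by elapsed time. A minimal sketch of that bookkeeping (the function name and the example timing are illustrative, not taken from this thread):

```python
def gemm_gflops(n, seconds):
    """Throughput in GFlops for an n x n x n SGEMM.

    A dense matrix multiply C = A * B performs n**3 multiplies
    and n**3 adds, i.e. 2 * n**3 floating-point operations.
    """
    return 2.0 * n**3 / seconds / 1e9

# For example, a 4096^3 SGEMM finishing in 0.275 s is ~500 GFlops.
print(round(gemm_gflops(4096, 0.275)))  # → 500
```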

                  • Interface between CAL and OpenCL
                    MicahVillmow
                    divij,
                    Try this change:

                        __kernel void mmmKernel(__global float4 *matrixA,
                                                __global float4 *matrixB,

                    to:

                        __kernel void mmmKernel(const __global float4 * restrict matrixA,
                                                const __global float4 * restrict matrixB,

                    This will give you caching on matrixA and matrixB and should improve performance.

                    Second, the IL code uses an 8x8 outer product, while the SDK sample uses a 4x4. That alone will dramatically reduce what you can get.
                      • Interface between CAL and OpenCL
                        divij

                        Originally posted by: MicahVillmow
                        divij, Try this change:

                            __kernel void mmmKernel(__global float4 *matrixA,
                                                    __global float4 *matrixB,

                        to:

                            __kernel void mmmKernel(const __global float4 * restrict matrixA,
                                                    const __global float4 * restrict matrixB,

                        This will give you caching on matrixA and matrixB and should improve performance. Second, the IL code uses an 8x8 outer product, the SDK sample uses a 4x4. That alone will dramatically reduce what you can get.

                        Thank you for your reply.

                        1) I'll make the modifications you suggest soon and come back with the performance details.

                        2) Please correct me if I am wrong, but the blockSize for the sample can be changed using the -b parameter at execution time. I have tested the code with the maximum block size my card supports, i.e. 8x8, and it gives 465 GFlops.

                          • Interface between CAL and OpenCL
                            himanshu.gautam

                            Have you tried the matrixmulImage sample? It is expected to give a better GFLOPS figure.
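                            For context on why the image-based sample tends to be faster: its kernels read the matrices through image2d_t objects, which routes loads through the texture cache. A minimal OpenCL C fragment of that access pattern (illustrative only, not the sample's actual kernel, and it needs an OpenCL host program to build and run):

```c
/* OpenCL C device code (not standalone C; requires an OpenCL host).
 * Reading through image2d_t goes via the texture cache, unlike a
 * plain __global pointer load. */
const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                      CLK_ADDRESS_CLAMP |
                      CLK_FILTER_NEAREST;

__kernel void read_tiles(__read_only image2d_t matrixA,
                         __read_only image2d_t matrixB,
                         __global float4 *out, int width)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float4 a = read_imagef(matrixA, smp, pos);  /* cached load */
    float4 b = read_imagef(matrixB, smp, pos);  /* cached load */
    out[pos.y * width + pos.x] = a + b;         /* placeholder math */
}
```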

                              • Interface between CAL and OpenCL
                                divij

                                Originally posted by: himanshu.gautam Have you tried the matrixmulImage sample? It is expected to give a better GFLOPS figure.

                                Thank you again, Himanshu.

                                Using a tile size of 4x8 it achieves 1.6 TFlops, and I guess it can be further optimized to use an 8x8 tile size.

                                I am very curious about what happened here. What is the difference between the two implementations, and where did the performance boost come from?

                                Any suggestions for reading material on these aspects?

                          • Interface between CAL and OpenCL
                            MicahVillmow
                            divij,
                            You can find some information here:
                            http://forum.beyond3d.com/showthread.php?t=54842

                            Basically, going from 4x4 -> 4x8 -> 8x8 decreases the amount of bandwidth required to do the calculation, thus increasing the performance of the algorithm.
                            For example, prunedtree showed that with 8x4 on RV770 the peak is 600 GFlops, but with 8x8 the peak is 960 GFlops.
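                            The bandwidth saving can be made concrete: an a×b outer-product tile loads a + b values per inner-loop step but performs a·b multiply-adds (2·a·b FLOPs), so larger tiles do more math per value fetched. A quick sketch of that ratio (pure illustration, not code from any of the samples):

```python
def flops_per_value(a, b):
    """FLOPs per value loaded for an a x b outer-product tile.

    Each inner-loop step loads a values of A and b values of B,
    then performs a * b multiply-adds, i.e. 2 * a * b FLOPs.
    """
    return 2.0 * a * b / (a + b)

for tile in [(4, 4), (4, 8), (8, 8)]:
    print(tile, flops_per_value(*tile))  # 4.0, ~5.33, 8.0
```

The 8x8 tile does 50% more arithmetic per value loaded than 8x4 (8.0 vs ~5.33), which lines up with the 600 -> 960 GFlops figures quoted above.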