
    Calling APPML BLAS functions from the kernel

    bulibuta

      Good morning everyone,

      Is there a way to call the BLAS implementation from my own kernel?

      So, not from the host program, but directly from the OpenCL kernel that I'm developing.

      I would like to try to replace my hand-rolled BLAS kernels with the ones provided by AMD and use them inside my own kernels (not in my host programs).

      Hope my intentions are clear,

      Paul

        • Re: Calling APPML BLAS functions from the kernel
          himanshu.gautam

          APPML APIs are host-based, just like any other BLAS library.

          You cannot access its kernels or call them from your own kernel.

          Nonetheless, the APPML APIs are OpenCL-aware. So you can pass your matrices and vectors as "cl_mem" objects, and you can even pass command-queue and event-related information. This should be helpful in your endeavors.
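
          For illustration, a minimal host-side sketch of that pattern, assuming the clAmdBlas 1.x C API (run_sgemm, the square sizes, and the pre-created queue are placeholders; error handling is abbreviated):

          #include <clAmdBlas.h>  /* APPML BLAS host header */

          /* Compute C = alpha*A*B + beta*C on buffers that already live on the
             device; 'queue' is an existing command queue. Illustrative helper. */
          cl_int run_sgemm(cl_command_queue queue,
                           cl_mem A, cl_mem B, cl_mem C, size_t n)
          {
              cl_event done;
              clAmdBlasSetup();                  /* library init; real code does this once */
              clAmdBlasStatus st = clAmdBlasSgemm(
                  clAmdBlasRowMajor, clAmdBlasNoTrans, clAmdBlasNoTrans,
                  n, n, n,
                  1.0f, A, n,                    /* alpha, A, lda */
                        B, n,                    /*        B, ldb */
                  0.0f, C, n,                    /* beta,  C, ldc */
                  1, &queue,                     /* the caller's command queue */
                  0, NULL, &done);               /* wait list, out-event */
              if (st == clAmdBlasSuccess)
                  clWaitForEvents(1, &done);     /* result stays in C, on the device */
              clAmdBlasTeardown();
              return (cl_int)st;
          }

          Note that the data never leaves the device between calls; several such calls can be chained on the same cl_mem objects.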

            • Re: Calling APPML BLAS functions from the kernel
              bulibuta

              himanshu.gautam wrote:

              Nonetheless, the APPML APIs are OpenCL-aware. So you can pass your matrices and vectors as "cl_mem" objects, and you can even pass command-queue and event-related information. This should be helpful in your endeavors.

              Right, but that would mean exiting the kernel context and re-entering it with updated data for each BLAS operation within my algorithm, which I think would be a major slowdown.

              I might try it on a rainy day.

                • Re: Calling APPML BLAS functions from the kernel
                  himanshu.gautam

                  Not really. You don't need to exit the kernel context. You can create the context once and re-use it as many times as you like.

                  You can also check out CL_MEM_USE_HOST_PTR, which can simplify your code (beware of a possible performance loss; you may need to check that for yourself).
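
                  A quick sketch of the CL_MEM_USE_HOST_PTR idea (wrap_host_array is an illustrative name; whether the runtime really avoids a copy is implementation-defined, hence the advice to measure):

                  #include <CL/cl.h>

                  /* Wrap an existing host array in a cl_mem without an explicit
                     clEnqueueWriteBuffer; zero-copy behavior is implementation-defined. */
                  cl_mem wrap_host_array(cl_context ctx, float *host_data, size_t n)
                  {
                      cl_int err;
                      cl_mem buf = clCreateBuffer(ctx,
                                                  CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                                  n * sizeof(float),
                                                  host_data, &err);
                      /* buf (and the one context/queue) can then be re-used across
                         every BLAS call and kernel launch in the loop */
                      return (err == CL_SUCCESS) ? buf : NULL;
                  }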


                  btw, APPML was written with the philosophy that all BLAS operations happen on the OpenCL device. Transferring data back and forth to host memory is not something that APPML is designed for.

                  Do you have such a requirement?

                  Can you talk a bit about your application?

                  It will help AMD understand developer requirements as well. Thanks,

                  Also, please have a look at http://devgurus.amd.com/message/1286929#1286929

                  Overlapping memory transfers with kernel execution should fit the bill because of the compute complexity involved in BLAS (assuming BLAS level 3). Right now there are no SDK samples that demonstrate how this works, but they should be on the way sometime in the future. For the moment, you can check out the link above and German's detailed reply in that thread.
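
                  Roughly, the overlap pattern looks like the double-buffering sketch below (all names are placeholders, clSetKernelArg is omitted, and a real version would chain clAmdBlas calls instead of a plain kernel):

                  #include <CL/cl.h>

                  /* Hypothetical skeleton: while chunk i computes on one queue,
                     chunk i+1 uploads on a second queue. dev_buf, kern, and the
                     sizes are all supplied by the caller. */
                  void pipeline(cl_context ctx, cl_device_id dev, cl_kernel kern,
                                cl_mem dev_buf[2], const float *host,
                                size_t chunk_floats, int nchunks, size_t global_size)
                  {
                      cl_int err;
                      cl_command_queue q_copy    = clCreateCommandQueue(ctx, dev, 0, &err);
                      cl_command_queue q_compute = clCreateCommandQueue(ctx, dev, 0, &err);
                      cl_event upload[2], compute[2];

                      for (int i = 0; i < nchunks; ++i) {
                          int cur = i % 2;
                          /* don't overwrite a buffer the compute queue still uses */
                          if (i >= 2) clWaitForEvents(1, &compute[cur]);
                          /* non-blocking upload of chunk i on the copy queue */
                          clEnqueueWriteBuffer(q_copy, dev_buf[cur], CL_FALSE, 0,
                                               chunk_floats * sizeof(float),
                                               host + i * chunk_floats,
                                               0, NULL, &upload[cur]);
                          /* the kernel waits only on its own chunk's upload */
                          clEnqueueNDRangeKernel(q_compute, kern, 1, NULL,
                                                 &global_size, NULL,
                                                 1, &upload[cur], &compute[cur]);
                      }
                      clFinish(q_compute);  /* real code would also release the events */
                  }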

                    • Re: Calling APPML BLAS functions from the kernel
                      bulibuta

                      himanshu.gautam wrote:

                      btw, APPML was written with the philosophy that all BLAS operations happen on the OpenCL device. Transferring data back and forth to host memory is not something that APPML is designed for.

                      Do you have such a requirement?

                      Can you talk a bit about your application?

                      It will help AMD understand developer requirements as well. Thanks,

                      Sorry for getting back to this so late.

                      Yes, my application needs to transfer data back and forth.

                      My application might be a bit atypical, in that it deals with signal processing, not game development.

                      It has to do with sparse representations via large dictionary sets. The idea is that, using a greedy pursuit, the algorithm tries to find as few atoms as possible from the dictionary that give a 'good enough' sparse representation of the original signal. The 'good enough' part is controlled through error checking and/or a minimal sparsity level of the new representation.

                      So it picks an atom, does some computation to see how far the current representation is from the error bound and the sparsity level; if it needs more atoms, it fetches a new one, represents the signal with the two atoms, then redoes the computations to check the current error against the accepted error and sparsity level. Repeat until satisfied.
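
                      For concreteness, a tiny plain-C sketch of that loop, matching-pursuit style (toy data, illustrative values; in the real application these dot products and updates are the GPU BLAS calls):

                      #include <math.h>
                      #include <stdio.h>

                      #define SIG_LEN 4   /* toy sizes, just to show the shape */
                      #define N_ATOMS 3

                      int main(void)
                      {
                          /* unit-norm dictionary atoms (rows), illustrative values */
                          const double D[N_ATOMS][SIG_LEN] = {
                              {1, 0, 0, 0},
                              {0, 1, 0, 0},
                              {0, 0, 0.6, 0.8},
                          };
                          double residual[SIG_LEN] = {3.0, 1.0, 0.6, 0.8}; /* toy signal */
                          const double tol = 1e-3;  /* accepted error */
                          const int max_atoms = 2;  /* sparsity level */

                          for (int k = 0; k < max_atoms; ++k) {
                              /* pick the atom most correlated with the residual */
                              int best = 0; double best_c = 0.0;
                              for (int a = 0; a < N_ATOMS; ++a) {
                                  double c = 0.0;
                                  for (int i = 0; i < SIG_LEN; ++i)
                                      c += D[a][i] * residual[i];
                                  if (fabs(c) > fabs(best_c)) { best = a; best_c = c; }
                              }
                              /* remove its contribution from the residual */
                              for (int i = 0; i < SIG_LEN; ++i)
                                  residual[i] -= best_c * D[best][i];

                              /* error check against the accepted error */
                              double err = 0.0;
                              for (int i = 0; i < SIG_LEN; ++i)
                                  err += residual[i] * residual[i];
                              printf("atom %d (coef %.3f), residual norm %.4f\n",
                                     best, best_c, sqrt(err));
                              if (sqrt(err) < tol)
                                  break;
                          }
                          return 0;
                      }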


                      All this picking of atoms and comparing is in fact modifying the data that the algorithms, including the BLAS ones, are working on.

                      So this is why I think I need to do a lot of back and forth with the current API. I don't see how I could do it otherwise.

                      Does this clear up my use-case? If not, feel free to ask for more explanations.

                        • Re: Calling APPML BLAS functions from the kernel
                          himanshu.gautam

                          Thanks for the details. Your work seems pretty interesting.

                          So, each iteration is a kernel launch, and the host transfers the results after each iteration, examines them, and then continues to the next iteration. Is that correct?

                            • Re: Calling APPML BLAS functions from the kernel
                              bulibuta

                              When using the clAmdBlas library, every iteration calls multiple kernels corresponding to the BLAS operations in use. Each iteration depends on the previous iteration's results (e.g. selecting atoms from the dictionary, extending a couple of matrices by another column, etc.).

                              My approach so far, without clAmdBlas, has been to keep the entire loop in a kernel that calls different functions inside the cl context (like my hand-rolled OpenCL BLAS level-2 and level-3 functions). So: one kernel, multiple functions, within the same .cl file, if you will (see the sketch below).

                              That way I only transfer the initial data once. This is pretty bad for big data sets, though, so I might need to optimize a few things there, but that's a different topic.
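
                              A minimal sketch of that single-kernel structure, with placeholder names and a trivial stand-in for the real BLAS work:

                              /* one .cl file: hand-rolled helpers plus the driver kernel */

                              /* BLAS-level-1-ish helper, callable from the kernel */
                              void axpy(float alpha, __global const float *x,
                                        __global float *y, int n)
                              {
                                  for (int i = 0; i < n; ++i)
                                      y[i] += alpha * x[i];
                              }

                              /* driver kernel: the whole loop stays on the device, so the
                                 data is transferred only once (launch with a single
                                 work-item in this simplistic form) */
                              __kernel void pursuit_loop(__global const float *atom,
                                                         __global float *residual,
                                                         int n, int max_iter)
                              {
                                  for (int it = 0; it < max_iter; ++it)
                                      axpy(-1.0f, atom, residual, n); /* stand-in update */
                              }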