10 Replies Latest reply on Jul 19, 2011 12:36 PM by jaidotsh

    Optimum performance

    jaidotsh

      I wanted to know whether my code (CSR matrix multiplication) will give optimum performance if I use both of the optimizations below together:

      1. float to float4 (current implementation)

      2. Blocking (Yet to add. i.e., grouping into warp sized blocks)

       Does the 2nd optimization matter much in terms of performance?
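
       For reference, this is what a scalar CSR sparse matrix-vector product looks like on the host side. A minimal plain-C sketch (the 3×3 example data is made up for illustration):

       ```c
       #include <assert.h>

       /* CSR stores only the non-zeros of a sparse matrix:
          row_ptr[r]..row_ptr[r+1] indexes the non-zeros of row r. */
       static void csr_spmv(const unsigned *row_ptr, const unsigned *col_idx,
                            const float *vals, const float *x, float *y,
                            unsigned rows)
       {
           for (unsigned r = 0; r < rows; ++r) {
               float dot = 0.0f;
               for (unsigned i = row_ptr[r]; i < row_ptr[r + 1]; ++i)
                   dot += vals[i] * x[col_idx[i]];
               y[r] = dot;
           }
       }

       int main(void)
       {
           /* 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]], x = [1,2,3] */
           unsigned row_ptr[] = {0, 2, 3, 5};
           unsigned col_idx[] = {0, 2, 1, 0, 2};
           float    vals[]    = {2, 1, 3, 4, 5};
           float    x[]       = {1, 2, 3};
           float    y[3];

           csr_spmv(row_ptr, col_idx, vals, x, y, 3);
           /* y = [2*1+1*3, 3*2, 4*1+5*3] = [5, 6, 19] */
           assert(y[0] == 5.0f && y[1] == 6.0f && y[2] == 19.0f);
           return 0;
       }
       ```

       The OpenCL kernel below in this thread is essentially this loop with the outer iteration distributed over work-items.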

        • Optimum performance
          jaidotsh

          Also, can you suggest any other optimization techniques that I could use?

          • Optimum performance
            maximmoroz

            Honestly, using float4 is the last thing that would come to my mind when optimizing CSR matrix multiplication. May I ask how you benefited from float4?

            What is your current position and target? I mean, what is the current efficiency of your kernel against theoretical ALU and memory bandwidth, and what would you like to achieve?

              • Optimum performance
                jaidotsh

                Actually, I am improving on the sparse matrix library for OpenCL, ViennaCL. In fact, they have used float4, float8, and float16 in their implementation. I'm developing a custom kernel using warp-based optimization techniques, which I believe would give better performance, but I'm not quite sure about it.

                Using float4 I can process 4 non-zeros together in a single instruction. The current implementation is fairly efficient, but as I said, I want to improve it. So can you suggest some other optimization technique?
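
                The float4 idea can be emulated in plain C: consume four non-zeros per loop iteration (the scalar analogue of loading a float4) and finish the row with a scalar tail loop. A sketch, not the ViennaCL kernel itself:

                ```c
                #include <assert.h>

                /* Dot product over one CSR row, 4 non-zeros per iteration,
                   with a scalar remainder loop for the last end % 4 entries. */
                static float row_dot4(const unsigned *col_idx, const float *vals,
                                      const float *x, unsigned begin, unsigned end)
                {
                    float dot = 0.0f;
                    unsigned i = begin;
                    for (; i + 4 <= end; i += 4)      /* 4-wide "vector" body */
                        dot += vals[i]     * x[col_idx[i]]
                             + vals[i + 1] * x[col_idx[i + 1]]
                             + vals[i + 2] * x[col_idx[i + 2]]
                             + vals[i + 3] * x[col_idx[i + 3]];
                    for (; i < end; ++i)              /* scalar tail */
                        dot += vals[i] * x[col_idx[i]];
                    return dot;
                }

                int main(void)
                {
                    unsigned col_idx[] = {0, 1, 2, 3, 4};
                    float    vals[]    = {1, 2, 3, 4, 5};
                    float    x[]       = {1, 1, 1, 1, 1};
                    assert(row_dot4(col_idx, vals, x, 0, 5) == 15.0f);
                    return 0;
                }
                ```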

                  • Optimum performance
                    maximmoroz

                    I have just downloaded ViennaCL and failed to see any compressed matrix multiplication (C = A×B) OpenCL code. Which file is the kernel located in, please? Or maybe you could post the kernel here? That would be helpful.

                      • Optimum performance
                        jaidotsh

                        It's located in viennacl/linalg/kernels/compressed_matrix_source.h

                         

                          • Optimum performance
                            maximmoroz

                            Thanks. So it is just matrix by vector multiplication. Here are my thoughts:

                            1) What is the reason to have a single work-item calculate a bunch of output elements (get_global_size(0)/size of them)? What is "size" here?

                            2) Try replacing "__global const float * vector" with "const __global float * restrict vector". This will enable cached reads from the "vector" buffer. Actually, you might also specify "restrict" for every buffer in the parameter list.
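
                            For what it's worth, "restrict" is standard C99 as well: it promises the compiler the pointers never alias, which is what permits caching and reordering of the reads. A minimal host-side illustration (the function and its names are hypothetical):

                            ```c
                            #include <assert.h>

                            /* restrict asserts that "in" and "out" never alias,
                               so the compiler may cache loads from "in" freely. */
                            static void scale(const float * restrict in,
                                              float * restrict out,
                                              float s, unsigned n)
                            {
                                for (unsigned i = 0; i < n; ++i)
                                    out[i] = s * in[i];
                            }

                            int main(void)
                            {
                                float in[]  = {1.0f, 2.0f, 3.0f};
                                float out[3];
                                scale(in, out, 2.0f, 3);
                                assert(out[0] == 2.0f && out[1] == 4.0f && out[2] == 6.0f);
                                return 0;
                            }
                            ```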

                            3) Yes, it makes sense to use float4 (or even float8), as one is unable to unroll a loop with a dynamic upper bound. Actually, you get the speed-up here not from vector ALU operations and not from reading 4 subsequent floats, but from reading those 4 floats (or 8, or think about 16) in a single TEX clause.

                            4) For the time being I don't see any way local memory would help here. So I would suggest the following approach:

                            a) Minimize the number of clauses (you are already doing this by using float4 and float8).

                            b) Minimize register usage, so that you will be able to run the maximum number of wavefronts on a single CU.

                            c) You might need to implement some logic for determining the best local work size, one which will not introduce its own limitation on the number of active wavefronts.
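
                            One possible shape of point (c), sketched in plain C: pick a local size that is a multiple of the wavefront width (64 on these GPUs) and round the global size up so every work-group is full. The helper and the concrete numbers are assumptions, not anything from ViennaCL:

                            ```c
                            #include <assert.h>

                            #define WAVEFRONT 64  /* assumed wavefront width on AMD GPUs */

                            /* Round a global work size up to the next multiple of the
                               local size, so the NDRange divides evenly into groups. */
                            static unsigned round_up(unsigned global, unsigned local)
                            {
                                return ((global + local - 1) / local) * local;
                            }

                            int main(void)
                            {
                                unsigned local  = 2 * WAVEFRONT;          /* e.g. 128 work-items */
                                unsigned global = round_up(1000, local);  /* 1000 rows to cover */
                                assert(global == 1024);
                                assert(global % local == 0);
                                return 0;
                            }
                            ```

                            The kernel's strided outer loop then lets the extra work-items simply fall through when their first row index is already >= size.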

                              • Optimum performance
                                jaidotsh

                                "size" is the number of work-items per work-group. But when I am multiplying with a 1D array, I don't need to cache anything, because there is only one column.

                                Can you tell me what 1. vector ALU operations and 2. a TEX clause are?

                                And how do I minimize register usage?

                                  • Optimum performance
                                    himanshu.gautam

                                    Clauses are groups of instructions that can execute in one go. GPUs generally divide kernels into clauses such as ALU clauses, control-flow clauses, and fetch clauses.

                                    Refer to the OpenCL Programming Guide for details about clauses.

                                    • Optimum performance
                                      maximmoroz

                                      Honestly, I don't understand the partitioning of the work in the kernel.

                                      Below is the code for the non-vectorized kernel.

                                      The outer loop limits "row" to be less than "size". Thus the kernel writes data to the "result" buffer for indexes [0, size - 1]. It looks like "size" is the number of elements in the resulting vector. But now you say that "size" is the number of work-items per work-group. These are two quite different entities.
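
                                      That partitioning can be simulated on the host: (simulated) work-item "id" out of "gsize" handles rows id, id + gsize, id + 2*gsize, and so on, which covers every row exactly once. A plain-C sketch of the same grid-stride pattern:

                                      ```c
                                      #include <assert.h>

                                      /* Record which simulated work-item "id" (out of "gsize")
                                         owns each row under the kernel's strided loop. */
                                      static void mark_rows(unsigned id, unsigned gsize,
                                                            unsigned size, int *owner)
                                      {
                                          for (unsigned row = id; row < size; row += gsize)
                                              owner[row] = (int)id;
                                      }

                                      int main(void)
                                      {
                                          enum { SIZE = 10, GSIZE = 4 };
                                          int owner[SIZE];
                                          for (unsigned id = 0; id < GSIZE; ++id)
                                              mark_rows(id, GSIZE, SIZE, owner);

                                          /* Rows are assigned round-robin: 0,1,2,3,0,1,2,3,0,1 */
                                          assert(owner[0] == 0 && owner[4] == 0 && owner[8] == 0);
                                          assert(owner[3] == 3 && owner[7] == 3);
                                          return 0;
                                      }
                                      ```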

                                      "Vector ALU operations" - I used this phrase to indicate that using float4 operations helps fill VLIW slots. Poor word choice, sorry.

                                      "Minimize register usage" - Use the Kernel Analyzer and check the generated ISA code.

                                      Read AMD OpenCL Programming Guide.

                                      __kernel void vec_mul(
                                          __global const unsigned int * row_indices,
                                          __global const unsigned int * column_indices,
                                          __global const float * elements,
                                          __global const float * vector,
                                          __global float * result,
                                          unsigned int size)
                                      {
                                          for (unsigned int row = get_global_id(0); row < size; row += get_global_size(0))
                                          {
                                              float dot_prod = 0.0f;
                                              unsigned int row_end = row_indices[row + 1];
                                              for (unsigned int i = row_indices[row]; i < row_end; ++i)
                                                  dot_prod += elements[i] * vector[column_indices[i]];
                                              result[row] = dot_prod;
                                          }
                                      }