4 Replies Latest reply on Jan 25, 2011 3:53 PM by rick.weber

    brook vs. OpenCL performance enhancers


      There were a couple of things we did in Brook+ that greatly improved performance, and I would like to get some feedback on the OpenCL analogue (if any). Please note that all of the following is based on experience with double precision ("double"):

      1- We used to have the output in the "double out<>" form rather than the scatter "double out[idx]" form. Performance did not degrade if some of the RHS inputs were gather streams. The closest thing I found in OpenCL was image objects, but these only hold four 32-bit .xyzw components in a 128-bit word. In Brook, it worked great in double precision.

      The Brook scatter output out[idx]=.... was horribly slow, which leads me to think this is the current mode of OpenCL output.

      I did try a simple "out[global_id] = in[global_id]" kernel in both Brook and OpenCL, and I believe I did not see any significant improvement compared to a Brook "out<> = in<>".

      2- We used to have 2D streams, again in double precision. This used to give a 5x improvement in some cases. It was not clear then whether this was because the data was laid out in memory in 2D form or because the threads were arranged in 2D form. In this mode of Brook streams, there was a one-to-one correspondence between thread indices and data indices. This, again, was the mode that did not need thread IDs and did not have scatter outputs: "out<dim1, dim2>". I tried 2D threads in OpenCL, and it did not seem to improve anything.

      The CAL and OpenCL documentation, when it talks about memory access patterns such as tiled patterns, does not clarify the size of the data elements (32b or 64b). My guess is that it was 32b, which does not explain any speedup for DP.

      I hope the documentation will give detailed explanations of DP data arrangements. I think DP is the one used in all numerical packages.


        • brook vs. OpenCL performance enhancers

          Ahh the bad old days...

          OpenCL has no equivalent of "out<>". Every kernel is a scatter kernel, even when you write to images. As long as you're coalescing global writes, this shouldn't impact performance. By the way, you can store doubles in images: put the top and bottom 32 bits of the double in two different channels of the image, then read them back and reassemble them using as_double().
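
          To make the two-channel trick concrete, here is a rough host-side sketch in plain C rather than kernel code. The names packed_double, pack_double and unpack_double are made up for illustration, and the two struct fields simply stand in for two 32-bit image channels. In a kernel, the reassembly step would presumably be a single as_double() on a uint2 built from the two channels; which channel has to hold the low word depends on the device's byte order, so treat the ordering here as an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for two 32-bit channels of one image texel. */
typedef struct {
    uint32_t lo;  /* bottom 32 bits of the double */
    uint32_t hi;  /* top 32 bits of the double */
} packed_double;

/* Split a double into its two 32-bit halves without changing any bits. */
static packed_double pack_double(double value)
{
    uint64_t bits;
    packed_double p;
    memcpy(&bits, &value, sizeof bits);   /* well-defined bit reinterpretation */
    p.lo = (uint32_t)(bits & 0xFFFFFFFFu);
    p.hi = (uint32_t)(bits >> 32);
    return p;
}

/* Rebuild the double from the two halves (the job as_double() does in a kernel). */
static double unpack_double(packed_double p)
{
    uint64_t bits = ((uint64_t)p.hi << 32) | (uint64_t)p.lo;
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

          The round trip is exact because nothing is converted, only reinterpreted, so no precision is lost by going through the 32-bit channels.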

          I suspect the reason 2D streams were faster in Brook+ was because 1D streams were emulated using 2D streams. This required some modulo and division arithmetic (or, at best, ands and bitshifts) to get the 2D index from a 1D index. The dimensionality of threads in OpenCL shouldn't really affect performance so long as you have 64 threads per block. So 8*8 should perform as well as 64. Threads in OpenCL are really just a lightweight index, and you just need to do enough of them to fill a wavefront.
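
          For what that index arithmetic looks like, here is a small sketch in plain C. It is only an illustration of the two variants mentioned above, not what the Brook+ runtime actually generated; index2d, split_divmod and split_pow2 are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical 2D coordinate recovered from a flat 1D index. */
typedef struct {
    size_t x;  /* column within the row */
    size_t y;  /* row */
} index2d;

/* General case: any non-zero width, at the cost of a modulo and a division. */
static index2d split_divmod(size_t i, size_t width)
{
    index2d r;
    r.x = i % width;
    r.y = i / width;
    return r;
}

/* Best case: width is a power of two (width == 1 << log2_width),
   so the same split becomes a mask and a shift. */
static index2d split_pow2(size_t i, unsigned log2_width)
{
    index2d r;
    r.x = i & (((size_t)1 << log2_width) - 1);
    r.y = i >> log2_width;
    return r;
}
```

          For a width of 64, both functions map flat index 70 to column 6 of row 1; the second one just avoids the integer division.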

          Tiling memory accesses applies for both single and double; you want neighboring threads reading neighboring addresses. It doesn't matter whether they're float, double, or int4. However, to truly maximize bandwidth, you need to use 128-bit data types.

          • brook vs. OpenCL performance enhancers

            Thanks, rick, for pointing out as_double(). Sounds interesting, and I should give it a try.

            Somehow I tend to disagree about the scatter outputs. CAL also has this kind of non-scatter stream output, and if CAL is still the backbone for OpenCL (correct?), then there must be a way to direct the compiler to use it?

            As for 2D indices, I believe shader memory (or global memory, then) is 1D, so 2D should be slower (actually, 2D is slower in OpenCL).

            Can someone from AMD confirm?


              • brook vs. OpenCL performance enhancers

                I don't think you can get OpenCL to use the "o" register in CAL, since all textures and memory objects are inherently "scatter" in OpenCL.

                Since global memory is inherently 1D, you are correct. Any higher dimensional indexing must be emulated using 1D indexing. In fact, C's support for multi-dimensional arrays is pretty abysmal so you'll usually write the indexing yourself.
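
                Writing the indexing yourself usually comes down to something like the sketch below, shown in plain C with a row-major layout assumed. flat_index and fill_with_offsets are hypothetical helper names; in a kernel, row and col would typically come from get_global_id(1) and get_global_id(0).

```c
#include <stddef.h>

/* Flat offset of element (row, col) in a row-major array with `width` columns. */
static size_t flat_index(size_t row, size_t col, size_t width)
{
    return row * width + col;
}

/* Fill a height x width buffer so that each element stores its own flat offset.
   Neighbouring columns land on neighbouring addresses. */
static void fill_with_offsets(double *data, size_t height, size_t width)
{
    for (size_t row = 0; row < height; ++row) {
        for (size_t col = 0; col < width; ++col) {
            size_t i = flat_index(row, col, width);
            data[i] = (double)i;
        }
    }
}
```

                With this layout, consecutive col values touch consecutive addresses, which is the access pattern you want neighbouring threads to have.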

              • brook vs. OpenCL performance enhancers
                OpenCL is not a streaming language, so there is no way to target the streaming outputs like existed in Brook+ from OpenCL. It is possible to hit close to peak performance using the scatter outputs as shown in the performance doc.