
    OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?

    Nikolai_
      anyone have any results?

      Hello!

      I just want to see how close to the theoretical FLOPS the Radeon/FireStream cards get... I figure SGEMM results would be a good measure.

        • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
          Nikolai_

          No one knows of any BLAS results?

            • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
              hazeman

              Here you have some results: http://forum.beyond3d.com/showthread.php?t=54842

              EDIT: Prunedtree achieved 1 TFLOPS on the 4890 and 2+ TFLOPS on the 58xx.

              But it should be noted that it isn't a real sgemm. The input matrices are divided into 2 parts ( even and odd columns ). Also, one matrix is in row format ( C style ) and the other in column format ( Fortran style ).
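
              For illustration, here is a minimal host-side sketch of that even/odd column split ( a hypothetical helper assuming row-major float storage and even n; this is not prunedtree's code ):

              #include <stddef.h>

              /* Split an n x n row-major matrix into two n x (n/2) halves holding
                 its even and odd columns, so each half can be uploaded as its own
                 input image. Hypothetical helper; n is assumed even. */
              void split_even_odd(const float *m, float *even, float *odd, size_t n)
              {
                  for (size_t r = 0; r < n; ++r)
                      for (size_t c = 0; c < n; c += 2) {
                          even[r * (n / 2) + c / 2] = m[r * n + c];
                          odd [r * (n / 2) + c / 2] = m[r * n + c + 1];
                      }
              }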

              I'm including an almost exact rewrite of prunedtree's code in more readable form ( he gave ISA code only ). The only difference is that I calculate the block position where prunedtree used precomputed data.

              PS. The code is written in the C++ to IL compiler/converter. Normal C++ if/while/for are used for controlling code generation and loop unrolling ( something like a really advanced preprocessor ). The il_whileloop, il_if, il_breakc give the corresponding instructions in IL.

              #define BX 8
              #define BY 8
              #define BX4 (BX/4)
              #define BY4 (BY/4)

              void kernel_matrixmul( input2d<float4>& A0, input2d<float4>& A1,
                                     input2d<float4>& B0, input2d<float4>& B1,
                                     global<float4>& C, float1 xsize, float1 ysize )
              {
                  float4 R[BY][BX4], ta[2][BY4], tb[2][BX4], k;
                  float2 p;
                  int    i, j;

                  assert( BX==8 && BY==8 );

                  p = ( named_variable<float2>("vWinCoord0.xy") - float2(0.5) ) * float2(BX4,BY);

                  for(i=0;i<BY;i++) {
                      for(j=0;j<BX4;j++) R[i][j] = float4(0);
                  }

                  k = float4( p.y(), p.x(), float1(-2), float1(-1) );

                  il_whileloop {
                      k += float4(0,0,2,2);
                      il_breakc( k.z()>=ysize );

                      ta[0][0] = A0[k.xz()]; ta[0][1] = A1[k.xz()];
                      tb[0][0] = B0[k.yz()]; tb[0][1] = B1[k.yz()];
                      ta[1][0] = A0[k.xw()]; ta[1][1] = A1[k.xw()];
                      tb[1][0] = B0[k.yw()]; tb[1][1] = B1[k.yw()];

                      for(i=0;i<BY4;i++) {
                          for(j=0;j<BX4;j++) {
                              R[4*i+0][j].x() = mad( ta[i][0].x(), tb[j][0].x(), mad( ta[i][1].x(), tb[j][1].x(), R[4*i+0][j].x() ) );
                              R[4*i+0][j].y() = mad( ta[i][0].x(), tb[j][0].y(), mad( ta[i][1].x(), tb[j][1].y(), R[4*i+0][j].y() ) );
                              R[4*i+0][j].z() = mad( ta[i][0].x(), tb[j][0].z(), mad( ta[i][1].x(), tb[j][1].z(), R[4*i+0][j].z() ) );
                              R[4*i+0][j].w() = mad( ta[i][0].x(), tb[j][0].w(), mad( ta[i][1].x(), tb[j][1].w(), R[4*i+0][j].w() ) );
                              R[4*i+1][j].x() = mad( ta[i][0].y(), tb[j][0].x(), mad( ta[i][1].y(), tb[j][1].x(), R[4*i+1][j].x() ) );
                              R[4*i+1][j].y() = mad( ta[i][0].y(), tb[j][0].y(), mad( ta[i][1].y(), tb[j][1].y(), R[4*i+1][j].y() ) );
                              R[4*i+1][j].z() = mad( ta[i][0].y(), tb[j][0].z(), mad( ta[i][1].y(), tb[j][1].z(), R[4*i+1][j].z() ) );
                              R[4*i+1][j].w() = mad( ta[i][0].y(), tb[j][0].w(), mad( ta[i][1].y(), tb[j][1].w(), R[4*i+1][j].w() ) );
                              R[4*i+2][j].x() = mad( ta[i][0].z(), tb[j][0].x(), mad( ta[i][1].z(), tb[j][1].x(), R[4*i+2][j].x() ) );
                              R[4*i+2][j].y() = mad( ta[i][0].z(), tb[j][0].y(), mad( ta[i][1].z(), tb[j][1].y(), R[4*i+2][j].y() ) );
                              R[4*i+2][j].z() = mad( ta[i][0].z(), tb[j][0].z(), mad( ta[i][1].z(), tb[j][1].z(), R[4*i+2][j].z() ) );
                              R[4*i+2][j].w() = mad( ta[i][0].z(), tb[j][0].w(), mad( ta[i][1].z(), tb[j][1].w(), R[4*i+2][j].w() ) );
                              R[4*i+3][j].x() = mad( ta[i][0].w(), tb[j][0].x(), mad( ta[i][1].w(), tb[j][1].x(), R[4*i+3][j].x() ) );
                              R[4*i+3][j].y() = mad( ta[i][0].w(), tb[j][0].y(), mad( ta[i][1].w(), tb[j][1].y(), R[4*i+3][j].y() ) );
                              R[4*i+3][j].z() = mad( ta[i][0].w(), tb[j][0].z(), mad( ta[i][1].w(), tb[j][1].z(), R[4*i+3][j].z() ) );
                              R[4*i+3][j].w() = mad( ta[i][0].w(), tb[j][0].w(), mad( ta[i][1].w(), tb[j][1].w(), R[4*i+3][j].w() ) );
                          }
                          il_breakc( xsize<float1(0) ); // hack to reduce register usage
                      }
                  }
                  il_endloop

                  uint1 s, step;
                  s    = convert_uint1( p.y()*xsize + p.x() );
                  step = convert_uint1( xsize );
                  for(i=0;i<BY;i++) {
                      for(j=0;j<BX4;j++) {
                          C[s+j] = R[i][j];
                      }
                      s += step;
                  }
              }

                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                    oscarbarenys1

                    Can you also post the C++ to IL compiler/converter, in binary or source form, for playing with?

                    Can you elaborate on how easy it would be to port that to OpenCL?

                     

                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                        hazeman

                        The performance results are the same as in the prunedtree post. The ISA generated from the included C++ code is almost the same as in prunedtree's case ( including the hack to reduce register usage ).

                        I think writing this in OpenCL is almost impossible at the moment. First of all, you don't have enough control over the IL code generated from OCL ( without the register usage hack you lose 20-30% of the speed ). Also, matrixmul uses the TUs to access memory. ATI probably won't deliver image support in OCL for a long time ( they need a huge change in the CAL and OCL compiler architecture - they went the easy way to provide an OCL compiler <sarcasm on>in such a short time<sarcasm off> ). Without TUs you need explicit address computation ( an increase in register usage ) and you lose memory caching ( at least on 4xxx ).
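
                        To make the TU point concrete, here is a small illustrative OpenCL sketch ( names and signature are my own, not from the matrixmul code ) contrasting the two fetch paths - an image read goes through the texture units and their cache, while a buffer read needs explicit address arithmetic:

                        __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                                                   CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

                        /* Illustration only: fetch the same float4 element through an image
                           (texture unit, cached) and through a raw buffer (explicit address
                           computation in ALU code). */
                        __kernel void fetch_compare(__read_only image2d_t Aimg,
                                                    __global const float4 *Abuf,
                                                    __global float4 *out,
                                                    uint width4 /* row width in float4 units */)
                        {
                            int x = get_global_id(0);
                            int y = get_global_id(1);
                            float4 viaImage  = read_imagef(Aimg, smp, (int2)(x, y)); /* TU path  */
                            float4 viaBuffer = Abuf[y * width4 + x];                 /* ALU path */
                            out[y * width4 + x] = viaImage + viaBuffer;
                        }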

                        As for the C++ to IL compiler/converter/generator ( I really don't know what it should be called ): it's a set of include files for C++ ( heavy use of templates ) which lets you write kernels directly in C++ in an OCL-like language. It also includes C++ bindings to CAL ( which are very similar to the OCL C++ bindings ). I've been thinking about making it available ( for over a month now ), but I don't have much spare time to do it. Also, I should probably write some docs for it. Ohh, and one more thing - I don't have a name for it.

                         

                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                            oscarbarenys1

                             

                            Well, I have been holding this trick in my head for over a month now...

                            More info on my blog coming soon: oscarbg.blogspot.com

                            You really can enable image support. Set:

                            set GPU_IMAGES_SUPPORT=1

                            ( or export it on Linux ).

                            Tested on a 5870 with AMD Stream 2.0 and hotfix 9.12; it only works for 2D images. You can also get the generated IL code, so the only thing lacking is building programs from that. Nvidia supports this.

                             

                            Similarly, you can enable byte addressable store with GPU_BYTE_ADDRESSABLE_STORE ( though it seems not to use RAW UAVs ), and some Nvidia samples ( histogram64 ) then work.

                            There is also doubles extension reporting with:

                            GPU_DOUBLE_PRECISION

                            and GL sharing with:

                            CL_KHR_GL_SHARING

                            I strongly encourage you to do so ( release the C++ to IL compiler/converter/generator ).

                            Regarding documentation: as long as you document how to compile this matmul code, that is enough!

                            Please don't wait for full documentation; perhaps I can contribute improvements to your code, if you want.

                            Do you support doubles?

                             

                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                    cjang

                    Here is a chart of sgemm performance on an HD 5870 using OpenCL SDK 2.0.

                    http://golem5.org/bucket/gatlas/sgemm.png

                    There are three parameterized kernel implementations. One uses memory buffers (blue curve). The other two use images (red curves). "Dual images" means even/odd column storage of the A and B matrices.

                    For each (square) matrix size, the kernels are combinatorially optimized over: 1) local work item dimension; 2) Nx4 quad inner product blocking; 3) row/column versus tiled matrix data layout.
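
                    As a rough sketch of what such a combinatorial search loop can look like ( all helper names here are hypothetical stand-ins, not GATLAS's actual API ):

                    /* Hypothetical sweep over the three parameters listed above;
                       buildKernelSource() and timeKernel() are stand-ins for the real
                       generate/compile/run machinery. */
                    extern const char *buildKernelSource(int local, int blockN, int tiled);
                    extern double      timeKernel(const char *src);   /* returns GFLOPS */

                    void sweep(int *bestLocal, int *bestBlockN, int *bestTiled)
                    {
                        double best = 0.0;
                        for (int local = 4; local <= 16; local *= 2)          /* LxL work group */
                            for (int blockN = 1; blockN <= 10; ++blockN)      /* Nx4 blocking   */
                                for (int tiled = 0; tiled <= 1; ++tiled) {    /* layout variant */
                                    double gflops = timeKernel(buildKernelSource(local, blockN, tiled));
                                    if (gflops > best) {
                                        best = gflops;
                                        *bestLocal = local; *bestBlockN = blockN; *bestTiled = tiled;
                                    }
                                }
                    }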

                    Is anyone aware of faster OpenCL performance for sgemm? I've run out of ideas, but I feel higher performance must be possible, as CAL IL/ISA performance is so much better.

                    Also, is it possible (or practical) to interoperate OpenCL with CAL IL/ISA? I wonder if they can co-exist and run using the same memory buffers. That would be the best of both worlds.

                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                        hazeman

                        I think you should try comparing the ISA generated from the CAL++/prunedtree version and from the OpenCL version.

                        This would give you an idea of where OpenCL generates bad code, and maybe you could find a workaround.

                        Also, to achieve the highest performance you need to run 8 hardware threads on each SIMD ( so 512 GPU threads per SIMD ). But if you try to force that on a kernel that uses too many registers, then CAL uses the stack ( to store registers ) and the result is a really big slowdown.

                         

                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                            cjang

                            Your suggestion helped. After reordering the image reads and inner product MADs to match CAL++/prunedtree, performance is somewhat improved for the dual image version in OpenCL.

                            I've captured the IL from the /tmp/OCL*.il files generated by the OpenCL runtime and compared it with the CAL++/prunedtree and AMD simple_matmult IL. Though I have no experience with ATI GPU IL/ISA, it's obvious the OpenCL compiler is generating less efficient code. It appears to use registers less efficiently than it could. Everything is slower, as there are lots of unnecessary MOVs to shift data in and out of common registers around the MADs and I/O reads. I don't know what to do about this short of some very ugly hacking to get around the LLVM OpenCL compiler - and in that case, why not use CAL/IL? ( which leads to CAL++ )

                            Here's something that is kind of interesting. It's a graph comparing six different implementations of matrix multiply on an HD 5870. Note that there is only a single timing trial, so these numbers are not averages. I've noticed that the faster implementations tend to have higher variance, so these numbers may be higher or lower than expected. The slower OpenCL implementations have low variance, only one or two percent at most.

                            http://golem5.org/bucket/gatlas/matmult6.png

                            • 1627 GFLOPS - CAL++/prunedtree at 3840x3840
                            • 1182 GFLOPS - CAL simple_matmult at 1280x1280
                            • 915 GFLOPS - OpenCL dual images at 2560x2560
                            • 727 GFLOPS - OpenCL single images at 3840x3840
                            • 486 GFLOPS - OpenCL memory buffers at 2560x2560
                            • 172 GFLOPS - CAL memimport_matmult at 1280x1280

                            At this point, I see IL as between 50% and 100% faster than OpenCL. The difference increases as kernels shift towards being compute bound. OpenCL is more competitive when using UAV memory buffers, as the bottleneck is then local memory bandwidth. With images (texture sampling) as the data source, the memory bandwidth is much higher, so the inner loops matter more; then OpenCL falls further behind IL.

                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                n0thing

                                You are getting 486 GFlops using global memory? Is this possible? Peak bandwidth from global memory is only 153 GB/s IIRC. My kernel using shared memory gets to about 475 GFlops on a 5870.

                                Can you explain a bit more about dual images? I guess it's to increase the cache hit ratio, but I haven't used this technique.

                                I have an internal version using single images and I get around 975 GFlops on a 5850.

                                Can you share your kernels?

                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                    cjang

                                    Here are the top five global-to-local kernels found at 2560x2560 for the test run used to make the matmult6 chart above. The numbers at the end are: <L> <N> <0|1>, where the 2D local work item dimension is LxL; the output matrix blocking is Nx4; 0 means row/column and 1 means tiled matrix data layout.

                                    1. microseconds: 68966 gflops: 486.441 2560 8 10 1
                                    2. microseconds: 87164 gflops: 384.882 2560 8 5 1
                                    3. microseconds: 87823 gflops: 381.994 2560 8 8 1
                                    4. microseconds: 93356 gflops: 359.354 2560 8 10 0
                                    5. microseconds: 106210 gflops: 315.864 2560 8 5 0

                                    Note the matrix blocking is 10x4. I had the idea to parameterize over arbitrary block sizes (not only even quads) after reading an explanation posted by prunedtree. Also note that the data layout is tiled to facilitate coalesced reads; see the sketch below. It's basically a block-outer-product friendly layout.
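
                                    Here is a rough host-side sketch of such a retiling step ( tile size and names are illustrative assumptions, not the actual layout code ):

                                    #define TILE 16  /* illustrative tile edge, not the tuned value */

                                    /* Copy an n x n row-major matrix into TILE x TILE tiles stored
                                       contiguously, so a work group streams one tile with consecutive
                                       ( coalescable ) reads. Assumes n is a multiple of TILE. */
                                    void retile(const float *src, float *dst, int n)
                                    {
                                        int t = 0;
                                        for (int by = 0; by < n; by += TILE)
                                            for (int bx = 0; bx < n; bx += TILE)
                                                for (int y = 0; y < TILE; ++y)
                                                    for (int x = 0; x < TILE; ++x)
                                                        dst[t++] = src[(by + y) * n + (bx + x)];
                                    }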

                                    The dual images are just what the CAL++/prunedtree matrix multiply implementations do; this better hides the latency of texture reads. Take a look at hazeman's CAL++/prunedtree kernel_matrixmul source code in an earlier post on this thread, and note the two MADs per line, even/odd, in the inner product accumulation.
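
                                    A minimal OpenCL rendering of that even/odd pattern might look like the fragment below ( a sketch under assumed names that shows only the read/MAD shape for one accumulator, not a complete or correct sgemm ):

                                    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                                                               CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

                                    /* Sketch only: A0/A1 and B0/B1 hold the even/odd columns of A and B.
                                       Each iteration issues two independent image reads per operand and
                                       two MADs per accumulated element, which helps hide texture read
                                       latency. */
                                    __kernel void dualImageRow(__read_only image2d_t A0, __read_only image2d_t A1,
                                                               __read_only image2d_t B0, __read_only image2d_t B1,
                                                               __global float4 *C, uint width4)
                                    {
                                        int col = get_global_id(0);
                                        int row = get_global_id(1);
                                        float4 sum = (float4)(0.0f);
                                        for (uint k = 0; k < width4; ++k) {
                                            float4 a0 = read_imagef(A0, smp, (int2)(k, row));
                                            float4 a1 = read_imagef(A1, smp, (int2)(k, row));
                                            float4 b0 = read_imagef(B0, smp, (int2)(col, k));
                                            float4 b1 = read_imagef(B1, smp, (int2)(col, k));
                                            sum = mad(a0.xxxx, b0, mad(a1.xxxx, b1, sum)); /* even + odd */
                                        }
                                        C[row * width4 + col] = sum;
                                    }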

                                    After looking at the IL generated by the OpenCL runtime, I believe higher performance is possible using memory buffers. The inner loop accumulating the inner product from registers is not as efficient as the hand-crafted IL implementations I've seen. Even when the memory reads are removed, so only the inner loop and the final output write are left, performance only reaches around 1600 GFLOPS. This seems low, but it makes sense after looking at the IL generated by OpenCL.

                                    Is your 975 GFLOPS single image version in OpenCL? I would be interested in this. How does it work? Presently, I am stuck with single images at around 720 GFLOPS with an OpenCL kernel.

                                    I do plan on sharing soon. Sorry... It's just not ready yet. The approach I've taken is auto-tuning and empirical optimization in the style of ATLAS. So it's not just kernels but a framework around them that finds optimal specializations (to use C++ terminology) for any hardware/software configuration. I want a design that is practical and useful, not only a benchmarking harness.

                                    OpenCL allows many more design possibilities for this optimization approach compared with a purely static compilation toolchain. I mean, it is possible that optimization could be lazy as well as eager and could exploit profiling statistics. That makes it more like a JIT or Oracle optimizing execution planner. However, going down this road also leads to much greater complexity.

                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                        hazeman

                                         

                                        Originally posted by: cjang

                                        I do plan on sharing soon. Sorry... It's just not ready yet. The approach I've taken is auto-tuning and empirical optimization in the style of ATLAS. So it's not just kernels but a framework around them that finds optimal specializations (to use C++ terminology) for any hardware/software configuration. I want a design that is practical and useful, not only a benchmarking harness.

                                        OpenCL allows many more design possibilities for this optimization approach compared with a purely static compilation toolchain. I mean, it is possible that optimization could be lazy as well as eager and could exploit profiling statistics. That makes it more like a JIT or Oracle optimizing execution planner. However, going down this road also leads to much greater complexity.

                                        I think this is one quite big flaw of the OpenCL specification.

                                        OpenCL kernels are supposed to work on many different architectures/devices, and yet the goal of OpenCL is to deliver maximum performance. It's simply impossible to write one kernel that fits them all.

                                        IMHO OpenCL should have been designed with dynamic kernel generation in mind from the beginning. Adding some kind of scripting language on top of the current C code would be a good start. Tools for selecting the best kernel should also be included in the specification.

                                        I know that this can be achieved by using an additional library independent of OpenCL. But IMHO it is so important to the main goal of OpenCL ( speed ) that it should have been included in the specs.
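
                                        For what it's worth, the usual workaround today is string-level kernel generation at run time; a minimal sketch ( hypothetical snprintf-based generator, nothing standardized ):

                                        #include <stdio.h>

                                        /* Hypothetical sketch: bake a blocking factor into OpenCL source at
                                           run time before handing it to clCreateProgramWithSource. Real
                                           generators ( CAL++, GATLAS, ... ) are far more structured. */
                                        int makeKernelSource(char *buf, size_t len, int blockN)
                                        {
                                            return snprintf(buf, len,
                                                "#define BLOCK_N %d\n"
                                                "__kernel void scaleBlocks(__global float4 *c) {\n"
                                                "    size_t i = get_global_id(0);\n"
                                                "    for (int b = 0; b < BLOCK_N; ++b)\n"
                                                "        c[i * BLOCK_N + b] *= 2.0f;\n"
                                                "}\n",
                                                blockN);
                                        }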

                                         

                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                            cjang

                                            You are right about the need for kernel generation and optimization; it's necessary for good performance. If it's not part of the OpenCL specification, then everyone must re-invent the wheel. I find it quite amusing that everyone builds an internal DSL in C++ using some form of metaprogramming in order to achieve good performance in OpenCL or IL. We are forced to write custom languages, which does seem very wasteful.

                                            The state of OpenCL reminds me of CORBA when it was first released. The 1.0 specification was the minimal requirement for source code compatibility. There was no common wire protocol. The 2.0 specification included IIOP, which addressed this and made vendor interoperability possible. I kind of expect the evolution of OpenCL (assuming it catches on) to follow this. The next major release should hopefully address important practical concerns.

                                            I would vote for keeping metaprogramming, kernel generation, and optimization separate from the core OpenCL specification. This seems to be the trend in language design (e.g. the JVM-based languages like Scala and Clojure). It's probably still early for OpenCL; that's why we are forced to write our own internal DSLs.

                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                n0thing

                                                Performance on a 5850 ( overclocked to 900/1300 ):

                                                Executing kernel for 1 iterations
                                                -------------------------------------------

                                                KernelTime (ms) : 62.9427
                                                GFlops achieved : 1041.2

                                                MatrixA MatrixB Time(sec) KernelTime(sec)
                                                3200x3200 3200x3200 0.876951 0.175642

                                                I am using an internal version of ATI's OpenCL binaries, so it may be optimized a bit relative to SDK 2.0.

                                                I have also tried using a 4x8 register blocking to compute 32 values per thread, but performance is lower (around 900 GFlops).

                                                #define TILEX 4
                                                #define TILEX_SHIFT 2
                                                #define TILEY 4
                                                #define TILEY_SHIFT 2

                                                __constant sampler_t imageSampler = CLK_NORMALIZED_COORDS_FALSE |
                                                                                    CLK_ADDRESS_CLAMP |
                                                                                    CLK_FILTER_NEAREST;

                                                __kernel void mmmKernel(__read_only image2d_t matrixA,
                                                                        __read_only image2d_t matrixB,
                                                                        __write_only image2d_t matrixC,
                                                                        uint widthA, uint widthB)
                                                {
                                                    int2 pos = (int2)(get_global_id(0), get_global_id(1));
                                                    float4 sum0 = (float4)(0);
                                                    float4 sum1 = (float4)(0);
                                                    float4 sum2 = (float4)(0);
                                                    float4 sum3 = (float4)(0);

                                                    widthB /= 4;

                                                    for(int i = 0; i < widthA; i = i + 4)
                                                    {
                                                        float4 tempA0 = read_imagef(matrixA, imageSampler, (int2)(i/4, pos.y << TILEY_SHIFT));
                                                        float4 tempA1 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 1));
                                                        float4 tempA2 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 2));
                                                        float4 tempA3 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 3));

                                                        float4 tempB0 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i));
                                                        float4 tempB1 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 1));
                                                        float4 tempB2 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 2));
                                                        float4 tempB3 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 3));

                                                        sum0.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x + tempA0.w * tempB3.x;
                                                        sum0.y += tempA0.x * tempB0.y + tempA0.y * tempB1.y + tempA0.z * tempB2.y + tempA0.w * tempB3.y;
                                                        sum0.z += tempA0.x * tempB0.z + tempA0.y * tempB1.z + tempA0.z * tempB2.z + tempA0.w * tempB3.z;
                                                        sum0.w += tempA0.x * tempB0.w + tempA0.y * tempB1.w + tempA0.z * tempB2.w + tempA0.w * tempB3.w;

                                                        sum1.x += tempA1.x * tempB0.x + tempA1.y * tempB1.x + tempA1.z * tempB2.x + tempA1.w * tempB3.x;
                                                        sum1.y += tempA1.x * tempB0.y + tempA1.y * tempB1.y + tempA1.z * tempB2.y + tempA1.w * tempB3.y;
                                                        sum1.z += tempA1.x * tempB0.z + tempA1.y * tempB1.z + tempA1.z * tempB2.z + tempA1.w * tempB3.z;
                                                        sum1.w += tempA1.x * tempB0.w + tempA1.y * tempB1.w + tempA1.z * tempB2.w + tempA1.w * tempB3.w;

                                                        sum2.x += tempA2.x * tempB0.x + tempA2.y * tempB1.x + tempA2.z * tempB2.x + tempA2.w * tempB3.x;
                                                        sum2.y += tempA2.x * tempB0.y + tempA2.y * tempB1.y + tempA2.z * tempB2.y + tempA2.w * tempB3.y;
                                                        sum2.z += tempA2.x * tempB0.z + tempA2.y * tempB1.z + tempA2.z * tempB2.z + tempA2.w * tempB3.z;
                                                        sum2.w += tempA2.x * tempB0.w + tempA2.y * tempB1.w + tempA2.z * tempB2.w + tempA2.w * tempB3.w;

                                                        sum3.x += tempA3.x * tempB0.x + tempA3.y * tempB1.x + tempA3.z * tempB2.x + tempA3.w * tempB3.x;
                                                        sum3.y += tempA3.x * tempB0.y + tempA3.y * tempB1.y + tempA3.z * tempB2.y + tempA3.w * tempB3.y;
                                                        sum3.z += tempA3.x * tempB0.z + tempA3.y * tempB1.z + tempA3.z * tempB2.z + tempA3.w * tempB3.z;
                                                        sum3.w += tempA3.x * tempB0.w + tempA3.y * tempB1.w + tempA3.z * tempB2.w + tempA3.w * tempB3.w;
                                                    }

                                                    write_imagef(matrixC, (int2)(pos.x, pos.y * 4), sum0);
                                                    write_imagef(matrixC, (int2)(pos.x, pos.y * 4 + 1), sum1);
                                                    write_imagef(matrixC, (int2)(pos.x, pos.y * 4 + 2), sum2);
                                                    write_imagef(matrixC, (int2)(pos.x, pos.y * 4 + 3), sum3);
                                                }

                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                    n0thing

                                                    Update - the 8x4 kernel uses 36 registers, so using a 256 work-group size resulted in only 4 active wavefronts; reducing the group size to 64 gives 6 active wavefronts, hence performance increases a bit.

                                                    Performance on an overclocked 5850:

                                                    Executing kernel for 5 iterations
                                                    -------------------------------------------
                                                    KernelTime (ms) : 51.9327
                                                    GFlops achieved : 1261.94

                                                    KernelTime (ms) : 52.0417
                                                    GFlops achieved : 1259.3

                                                    KernelTime (ms) : 52.0071
                                                    GFlops achieved : 1260.14

                                                    KernelTime (ms) : 52.0453
                                                    GFlops achieved : 1259.21

                                                    KernelTime (ms) : 52.1003
                                                    GFlops achieved : 1257.88

                                                    MatrixA MatrixB Time(sec) KernelTime(sec)
                                                    3200x3200 3200x3200 0.861197 0.152418
                                                    Press any key to continue . . .

                                                    #define TILEX 4
                                                    #define TILEX_SHIFT 2
                                                    #define TILEY 8
                                                    #define TILEY_SHIFT 3

                                                    /* imageSampler is the same declaration as in mmmKernel above */
                                                    __constant sampler_t imageSampler = CLK_NORMALIZED_COORDS_FALSE |
                                                                                        CLK_ADDRESS_CLAMP |
                                                                                        CLK_FILTER_NEAREST;

                                                    __kernel void mmmKernel2(__read_only image2d_t matrixA,
                                                                             __read_only image2d_t matrixB,
                                                                             __write_only image2d_t matrixC,
                                                                             uint widthA, uint widthB)
                                                    {
                                                        int2 pos = (int2)(get_global_id(0), get_global_id(1));
                                                        float4 sum0 = (float4)(0);
                                                        float4 sum1 = (float4)(0);
                                                        float4 sum2 = (float4)(0);
                                                        float4 sum3 = (float4)(0);
                                                        float4 sum4 = (float4)(0);
                                                        float4 sum5 = (float4)(0);
                                                        float4 sum6 = (float4)(0);
                                                        float4 sum7 = (float4)(0);

                                                        widthB /= 4;

                                                        for(int i = 0; i < widthA; i = i + 4)
                                                        {
                                                            float4 tempA0 = read_imagef(matrixA, imageSampler, (int2)(i/4, pos.y << TILEY_SHIFT));
                                                            float4 tempA1 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 1));
                                                            float4 tempA2 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 2));
                                                            float4 tempA3 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 3));
                                                            float4 tempA4 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 4));
                                                            float4 tempA5 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 5));
                                                            float4 tempA6 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 6));
                                                            float4 tempA7 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 7));

                                                            float4 tempB0 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i));
                                                            float4 tempB1 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 1));
                                                            float4 tempB2 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 2));
                                                            float4 tempB3 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 3));

                                                            sum0.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x + tempA0.w * tempB3.x;
                                                            sum0.y += tempA0.x * tempB0.y + tempA0.y * tempB1.y + tempA0.z * tempB2.y + tempA0.w * tempB3.y;
                                                            sum0.z += tempA0.x * tempB0.z + tempA0.y * tempB1.z + tempA0.z * tempB2.z + tempA0.w * tempB3.z;
                                                            sum0.w += tempA0.x * tempB0.w + tempA0.y * tempB1.w + tempA0.z * tempB2.w + tempA0.w * tempB3.w;

                                                            sum1.x += tempA1.x * tempB0.x + tempA1.y * tempB1.x + tempA1.z * tempB2.x + tempA1.w * tempB3.x;
                                                            sum1.y += tempA1.x * tempB0.y + tempA1.y * tempB1.y + tempA1.z * tempB2.y + tempA1.w * tempB3.y;
                                                            sum1.z += tempA1.x * tempB0.z + tempA1.y * tempB1.z + tempA1.z * tempB2.z + tempA1.w * tempB3.z;
                                                            sum1.w += tempA1.x * tempB0.w + tempA1.y * tempB1.w + tempA1.z * tempB2.w + tempA1.w * tempB3.w;

                                                            sum2.x += tempA2.x * tempB0.x + tempA2.y * tempB1.x + tempA2.z * tempB2.x + tempA2.w * tempB3.x;
                                                            sum2.y += tempA2.x * tempB0.y + tempA2.y * tempB1.y + tempA2.z * tempB2.y + tempA2.w * tempB3.y;
                                                            sum2.z += tempA2.x * tempB0.z + tempA2.y * tempB1.z + tempA2.z * tempB2.z + tempA2.w * tempB3.z;
                                                            sum2.w += tempA2.x * tempB0.w + tempA2.y * tempB1.w + tempA2.z * tempB2.w + tempA2.w * tempB3.w;

                                                            sum3.x += tempA3.x * tempB0.x + tempA3.y * tempB1.x + tempA3.z * tempB2.x + tempA3.w * tempB3.x;
                                                            sum3.y += tempA3.x * tempB0.y + tempA3.y * tempB1.y + tempA3.z * tempB2.y + tempA3.w * tempB3.y;
                                                            sum3.z += tempA3.x * tempB0.z + tempA3.y * tempB1.z + tempA3.z * tempB2.z + tempA3.w * tempB3.z;
                                                            sum3.w += tempA3.x * tempB0.w + tempA3.y * tempB1.w + tempA3.z * tempB2.w + tempA3.w * tempB3.w;

                                                            sum4.x += tempA4.x * tempB0.x + tempA4.y * tempB1.x + tempA4.z * tempB2.x + tempA4.w * tempB3.x;
                                                            sum4.y += tempA4.x * tempB0.y + tempA4.y * tempB1.y + tempA4.z * tempB2.y + tempA4.w * tempB3.y;
                                                            sum4.z += tempA4.x * tempB0.z + tempA4.y * tempB1.z + tempA4.z * tempB2.z + tempA4.w * tempB3.z;
                                                            sum4.w += tempA4.x * tempB0.w + tempA4.y * tempB1.w + tempA4.z * tempB2.w + tempA4.w * tempB3.w;

                                                            sum5.x += tempA5.x * tempB0.x + tempA5.y * tempB1.x + tempA5.z * tempB2.x + tempA5.w * tempB3.x;
                                                            sum5.y += tempA5.x * tempB0.y + tempA5.y * tempB1.y + tempA5.z * tempB2.y + tempA5.w * tempB3.y;
                                                            sum5.z += tempA5.x * tempB0.z + tempA5.y * tempB1.z + tempA5.z * tempB2.z + tempA5.w * tempB3.z;
                                                            sum5.w += tempA5.x * tempB0.w + tempA5.y * tempB1.w + tempA5.z * tempB2.w + tempA5.w * tempB3.w;

                                                            sum6.x += tempA6.x * tempB0.x + tempA6.y * tempB1.x + tempA6.z * tempB2.x + tempA6.w * tempB3.x;
                                                            sum6.y += tempA6.x * tempB0.y + tempA6.y * tempB1.y + tempA6.z * tempB2.y + tempA6.w * tempB3.y;
                                                            sum6.z += tempA6.x * tempB0.z + tempA6.y * tempB1.z + tempA6.z * tempB2.z + tempA6.w * tempB3.z;
                                                            sum6.w += tempA6.x * tempB0.w + tempA6.y * tempB1.w + tempA6.z * tempB2.w + tempA6.w * tempB3.w;

                                                            sum7.x += tempA7.x * tempB0.x + tempA7.y * tempB1.x + tempA7.z * tempB2.x + tempA7.w * tempB3.x;
                                                            sum7.y += tempA7.x * tempB0.y + tempA7.y * tempB1.y + tempA7.z * tempB2.y + tempA7.w * tempB3.y;
                                                            sum7.z += tempA7.x * tempB0.z + tempA7.y * tempB1.z + tempA7.z * tempB2.z + tempA7.w * tempB3.z;
                                                            sum7.w += tempA7.x * tempB0.w + tempA7.y * tempB1.w + tempA7.z * tempB2.w + tempA7.w * tempB3.w;
                                                        }

                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8), sum0);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 1), sum1);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 2), sum2);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 3), sum3);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 4), sum4);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 5), sum5);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 6), sum6);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 7), sum7);
                                                    }

                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                        cjang

                                                        Your matrix multiply kernel utilizes the texture cache well (at least better than my kernels did). After changing mine to follow your example, performance jumps by 33%, from around 720 GFLOPS to 966 GFLOPS. Changing the output matrix C from a buffer to an image adds another 3%, to 980 GFLOPS. Thank you very much!

                                                        http://golem5.org/bucket/gatlas/goodbadmatmult.png

                                                        I am using a stock HD 5870. And I think your kernels are "tighter", as they are written for a specific matrix blocking and data layout - you have less miscellaneous boilerplate arithmetic. So our numbers are consistent, as your HD 5850 is overclocked roughly 25% on the core and 30% on the memory.

                                                        The documentation says the texture cache has 2D locality. I didn't really know what this meant until now: there is obviously an underlying texture cache structure of quads in squares. For the benefit of anyone else, here's what I changed to get this jump in performance. For many with experience, this is probably obvious; it wasn't for me. (Note that the code below is semi-pseudocode for clarity.)

                                                        The outermost inner product loop is: for (int idx = 0; idx < matrixSize / VECTOR_LENGTH; idx++), so idx traverses over blocks of A and B.

                                                        How it was before, when it was slow:

                                                        valB[j] = read_imagef(matB, sampler, (int2)(idx, VECTOR_LENGTH * (localSize() * blockCol + col) + j));

                                                        How it is after, when it is fast:

                                                        valB[j] = read_imagef(matB, sampler, (int2)(localSize() * blockCol + col, VECTOR_LENGTH * idx + j));

                                                        This matches n0thing's traversal of matrix B. Also note how the traversal of matrix A is transposed. If the texture for A is changed to match the same pattern as B, performance drops.

                                                        One thing that may help you push performance higher: I've noticed the fastest kernels have a blocking of 5x4. However, this is for the HD 5870, which has a different core configuration than the HD 5850. What I mean is that a non-even quad blocking may be optimal for you. This is also an example of why I think auto-tuning and empirical optimization is very useful - it can find these unusual kernel solutions.

                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                            n0thing

                                                            Nice to see a performance jump there!

                                                            I haven't really benchmarked the texture cache yet, but since the cache line is 256-bit (from the Beyond3D architectural review) and it is 2D, when you read a vec4 the cache line would also contain the vec4 in the vertical direction (above or below it).

                                                            This gives a good texture hit rate when the access pattern is like matrix A's. I didn't transpose matrix B because the linear read pattern gives good global memory bandwidth.

                                                            I will also try the 5x4 and 9x4 blocking sizes, as they should result in the same number of active wavefronts as the 4x4 and 8x4 sizes, assuming this is correct:

                                                            registers per thread x (group size / 64) x number of groups <= 240
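
                                                            Assuming that inequality ( with 240 usable registers per SIMD and 64 threads per wavefront ), a little arithmetic reproduces the wavefront counts quoted earlier:

                                                            /* Rule-of-thumb occupancy from the inequality above (assumed figures:
                                                               240 usable registers per SIMD, 64 threads per wavefront). */
                                                            int activeWavefronts(int regsPerThread, int groupSize)
                                                            {
                                                                int wavesPerGroup = groupSize / 64;
                                                                int maxGroups     = 240 / (regsPerThread * wavesPerGroup);
                                                                return maxGroups * wavesPerGroup;
                                                            }
                                                            /* n0thing's 8x4 kernel (36 registers):
                                                                 activeWavefronts(36, 256) == 4   (one 4-wavefront group fits)
                                                                 activeWavefronts(36,  64) == 6   (six 1-wavefront groups fit) */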



                                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                cjang

                                                                I am noticing that very subtle code changes significantly affect performance.

                                                                After I changed to use your texture-cache-friendly inner block traversal of matrix B, I noticed that the squares of quads are of course transposed: I am reading down columns instead of across rows. So I changed the inner product accumulation loop to match, and performance increases by 5 GFLOPS. For a dual image version, performance increases by 25 GFLOPS (now up to 1045 GFLOPS on my HD 5870).

                                                                Then in the CAL++/prunedtree matrixmult.cpp example, I reversed the order of nested for loops for the inner product to gain 50 GFLOPS at 2048x2048 and 130 GFLOPS at 5120x5120.

                                                                This gives: "Device 0: execution time 1114.64 ms, achieved 1541.29 gflops"

                                                                for (i = 0; i < BY4; i++) {
                                                                    for (j = 0; j < BX4; j++) {
                                                                        // lots of MADs
                                                                    }
                                                                }

                                                                This gives: "Device 0: execution time 1081.06 ms, achieved 1589.17 gflops"

                                                                for (j = 0; j < BX4; j++) {
                                                                    for (i = 0; i < BY4; i++) {
                                                                        // lots of MADs
                                                                    }
                                                                }

                                                                I've seen other more counter-intuitive effects on performance from using the global ID versus group/local IDs. Performance changes in unexpected ways. It's very difficult to predict what will happen with any given change.

                                                                This makes me think that kernel designs should be represented as game trees. Then the combinatorial search over the design space can be a recursive evaluation of variations, kind of like computer chess. The design space is just too strange and unknown to find optimal kernels by hand from first principles.

                                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                    esotericeric

                                                                    Hi folks,

                                                                    Thanks for a really interesting discussion. I'd like to follow up with a question (hopefully someone can help answer). I've mostly studied Nvidia implementations of SGEMM, and was really impressed to see the kind of performance numbers you guys are getting.

                                                                    One concern raised earlier by hazeman was the specific memory layout requirements (e.g., breaking the input matrices into even and odd columns) and the mixed row/column formats. In terms of reshuffling memory to do this, how does that factor into your performance results?

                                                                    My follow-up question is: are there any SGEMM implementations that handle the standard row-based BLAS format (i.e., input and output matrices are in row format)? And if so, does anyone know of the best implementation, and how does its performance compare?

                                                                    Thanks!

                                                                     

                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                        n0thing

                                                                        According to cjang's results, having a mixed row/column format (the dual-images version) gives about a 20% performance advantage over the single-images version.

                                                                        The kernels I posted above use the standard row-based format. I get around 1260 GFlops on an overclocked Radeon 5850 with the 8x4 blocking kernel in OpenCL.

                                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                            cjang

                                                                            The advantage of unusual data layouts has diminished as optimization has improved. Any advantage is either confined to certain matrix sizes (by moving to another size, the advantage disappears) or not enough to justify the complexity (the performance gain may be larger in IL/ISA, but in OpenCL it is not very much). So I've abandoned tiled and even/odd row data layouts. I'm only working with row/column major layout now.

                                                                            The kernels support transposing A and B and have been tested with random data against a simple for-loop CPU implementation. They also support rectangular M/N/K matrix dimensions. However, I've not observed performance to increase by moving away from square shapes. I'm pretty confident in algorithmic correctness (round-off numerical error and stability is another issue, especially in single precision). So this is good for general SGEMM now and does not impose any awkward data layout constraints on the input matrices.

                                                                            Below are the top three kernel specializations found for un-transposed A and B, for memory buffer and image based matrix multiply kernels at 4800x4800 (ten trials).

                                                                            ./bmm_buffer -d gpu -n 4800 -t 10 -x 10
                                                                            [0] 4603984 usec avg: 480.394 stddev: 3.50936 4800 4800 4800 8 10 10 inlineMNK klj
                                                                            [1] 4689310 usec avg: 471.629 stddev: 0.589461 4800 4800 4800 8 5 8 inlineMNK ljk
                                                                            [2] 4742431 usec avg: 466.346 stddev: 0.542847 4800 4800 4800 8 5 10 inlineMNK klj

                                                                            ./bmm_image -d gpu -n 4800 -t 10 -x 10
                                                                            [0] 2125086 usec avg: 1040.72 stddev: 0.343981 4800 4800 4800 16 5 18 inlineMNK jkl group/local
                                                                            [1] 2145931 usec avg: 1030.61 stddev: 0.514469 4800 4800 4800 8 6 2 argsMNK ljk global
                                                                            [2] 2146057 usec avg: 1030.55 stddev: 0.405767 4800 4800 4800 8 6 6 argsMNK jkl group/local

                                                                            Performance depends on whether A and B are transposed. Here are the top three for a memory buffer kernel at 4800x4800 when B is transposed.

                                                                            ./bmm_buffer -d gpu -n 4800 -t 10 -x 10 -b
                                                                            [0] 4518746 usec avg: 489.453 stddev: 3.33775 4800 4800 4800 8 10 2 argsMNK ljk
                                                                            [1] 4550961 usec avg: 485.974 stddev: 2.0284 4800 4800 4800 8 10 5 argsMNK lkj
                                                                            [2] 4650525 usec avg: 475.563 stddev: 0.767519 4800 4800 4800 8 5 8 inlineMNK ljk

                                                                            So memory buffers are a little faster when B is transposed.

                                                                            The kernels are parameterized in different ways, including work group size, inner blocking height, inner product accumulation loop order, and matrix dimension inlining. For given M/N/K matrix dimensions, kernels are specialized over many combinations and individually timed. There is some simple pruning in the combinatorial search, so this takes a reasonable amount of time.
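
                                                                            Read as a sketch, each result line above is one point in a parameter space roughly like the following ( the field meanings are my interpretation of the output format, so treat them as assumptions ):

                                                                            /* Hypothetical decoding of a result line such as
                                                                               "... 4800 4800 4800 8 10 10 inlineMNK klj". */
                                                                            struct Specialization {
                                                                                int  M, N, K;        /* matrix dimensions                         */
                                                                                int  groupSize;      /* work group edge, e.g. 8                   */
                                                                                int  blockHeight;    /* inner blocking height, e.g. 10 ( 10x4 )   */
                                                                                int  extraParam;     /* third numeric field                       */
                                                                                int  inlineMNK;      /* 1: dimensions compiled in, 0: kernel args */
                                                                                char loopOrder[4];   /* inner product loop order, e.g. "klj"      */
                                                                            };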

                                                                            It's still not ready yet. A small command line application does the search right now. It needs to be in a re-usable component and more "directional" to increase speed. That way, it will be cheap enough to (sort of) run on demand instead of requiring hours of GPU time to find good kernels (although map-reduce over a cluster or fleet would alleviate this).

                                                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                cjang

                                                                                Here are some more complete charts of matrix multiply performance and solutions found by the auto-tuning. They include variants with transposed matrices for both memory buffer and image based kernel implementations.

                                                                                http://golem5.org/bucket/gatlas/

                                                                                I am packaging what I have into an initial release. There is still some work to do in that regard. I do not wish to dump a mess of source code and be inconsiderate. I will try and hurry up.

                                                                                One question I have, if anyone has strong opinions... where would automatic tuning and optimization of GPU kernels have the most benefit for applications today? The obvious areas involve signal and image processing. In past experience, I saw others invest substantial effort in machine learning to design classifiers. This sort of thing often becomes compute bound. I am quite new to the GPGPU world and still trying to grasp the scope of what is possible.

                                                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                    hazeman

                                                                                    Looking at your results I have one comment.

                                                                                    The final implementation of (A^T)*(B^T) should have almost identical speed to A*B. Remember that (A^T)*(B^T) = (B*A)^T, and transposing the output of the multiplication kernel is a rather fast operation ( of course I'm not talking about running a separate kernel to do the transposition ).

                                                                                    The same goes for A*B^T and A^T*B.
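
                                                                                    A deliberately naive OpenCL illustration of the identity ( nothing like the fast kernels above, just the index bookkeeping ): compute p = (B*A)[i][j] and store it at the transposed position, so the buffer ends up holding (B*A)^T = (A^T)*(B^T) with no extra pass:

                                                                                    /* Naive sketch ( illustrative names ): each work item computes one
                                                                                       element p of P = B*A and stores it transposed, so C = P^T = (A^T)*(B^T). */
                                                                                    __kernel void mulStoreTransposed(__global const float *A,
                                                                                                                     __global const float *B,
                                                                                                                     __global float *C, uint n)
                                                                                    {
                                                                                        uint j = get_global_id(0);   /* column of P */
                                                                                        uint i = get_global_id(1);   /* row of P    */
                                                                                        float p = 0.0f;
                                                                                        for (uint k = 0; k < n; ++k)
                                                                                            p += B[i * n + k] * A[k * n + j];
                                                                                        C[j * n + i] = p;            /* transposed store, no extra pass */
                                                                                    }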

                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                        cjang

                                                                                        Yes, you are quite right of course. You've given me something else to think about.

                                                                                        The results shown are literally op(A) * op(B) without any algebraic transformations. My thought was that the user will necessarily work out the linear algebra and design the calculation to be efficient, which would subsume any algebraic transformations of this kind. I took the easy way out instead of trying to make the kernel smarter.

                                                                                        Another reason I didn't go further is that the fastest kernels often use inner blocking with rectangles that are non-even quads on a side (e.g. 5x4, 6x4, 10x4). The kernel implementation (which you have not seen, as the source code is unreleased as of yet) accesses memory as float4s. If the inner blocks have to transpose, this becomes very awkward because they no longer fit evenly into the float4 vector elements.

                                                                                        What I hadn't thought of before is that the performance gain from a faster matrix multiply will very likely outweigh any penalty from writing back to global memory in a non-optimal way. The dominant cost is the block inner product; the write to output is less. So as you say, speed should be very close.

                                                                                        Oh another thing - I did have the thought to implement this same kind of auto-tuned optimization with CAL++. I would expect performance to jump by several hundred GFLOPS.

                                                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                            hazeman

                                                                                            There is one more issue which you are probably not considering. I see in n0thing's code that the block position is simply (get_global_id(0), get_global_id(1)).

                                                                                            This isn't the best solution for reusing the cache between GPU threads.

                                                                                            It looks like the CAL++ example takes the same approach, but that isn't the case: in pixel shader mode the GPU hardware assigns positions to threads in tiled order. Whether this or some other method is best for an OpenCL kernel is left for testing.

                                                                                            There is also one important issue involved. There is a total lack of information about this, so parts of what follows are my guesses. A CAL buffer ( image ) can have 2 formats - tiled & linear ( flag CAL_RESALLOC_GLOBAL_BUFFER ). This impacts which points are near each other in the cache. The TU reads 64 contiguous "points" from memory. In tiled format this means that 4 neighbours ( with the exception of edges ) are in the cache ( x-1, x+1, y-1, y+1 ). But for linear format only the ( x-1, x+1 ) neighbours are in the cache.

                                                                                            Of course this cache behaviour has a huge impact on the design of the multiplication kernel.

                                                                                            Now the question is which format OpenCL uses. As far as I know, IL doesn't allow scatter writes to a tiled buffer. This would imply that OpenCL is using the linear format. But maybe there is some new unpublished instruction available.

                                                                                            Maybe someone could post here how write_imagef is translated into IL code.

                                                                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                nou

                                                                                                 

                                                                                                 

                                                                                                __kernel void image_write2(__read_only image2d_t read, __write_only image2d_t img)
                                                                                                {
                                                                                                    sampler_t sampl = CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_NONE | CLK_FILTER_NEAREST;
                                                                                                    size_t x = get_global_id(0);
                                                                                                    size_t y = get_global_id(1);
                                                                                                    float4 c = read_imagef(read, sampl, (int2)(x, y));
                                                                                                    write_imagef(img, (int2)(x, y), (float4)(x/256.0f, y/256.0f, x/256.0f, y/256.0f)); // this works
                                                                                                }

                                                                                                mdef(46)_out(1)_in(2)
                                                                                                mov r0, in0
                                                                                                mov r1, in1
                                                                                                dcl_literal l1, 0x7f800000, 0x7f800000, 0x807fffff, 0x807fffff
                                                                                                dcl_literal l2, 0x7f800000, 0x7f800000, 0, 0
                                                                                                dcl_literal l3, 0x80000000, 0x80000000, 0x80000000, 0x80000000
                                                                                                dcl_literal l4, 0x3f800000, 0x3f800000, 0, 0
                                                                                                dcl_literal l5, 0, 0, 0, 0
                                                                                                dcl_literal l6, 0x7fffffff, 0x80000000, 0x7fffffff, 0x80000000
                                                                                                dcl_literal l7, 0x00800000, 0x00800000, 0x00800000, 0x00800000
                                                                                                dcl_literal l8, 0x00000017, 0x00000017, 0x00000017, 0x00000017
                                                                                                dcl_literal l9, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff
                                                                                                mov r2.x___, r0.x
                                                                                                mov r2._y__, r1.x
                                                                                                and r3.xyzw, r2.xyxy, l1
                                                                                                ieq r4.xyzw, r3.xyxy, l2
                                                                                                and r5.xy__, r2.xy, l3
                                                                                                ior r3.__zw, r3.zwzw, l4.xyxy
                                                                                                cmov_logical r3.__zw, r4.zwzw, r5.xyxy, r3.zwzw
                                                                                                cmov_logical r3.__zw, r4.xyxy, r2.xyxy, r3.zwzw
                                                                                                ior r5.xy__, r4.xz, r4.yw
                                                                                                ior r5.x___, r5.x, r5.y
                                                                                                inegate r5.__z_, r3.yyyy
                                                                                                iadd r3.x___, r3.x, r5.z
                                                                                                cmov_logical r3.x___, r5.xxxx, l5, r3.xxxx
                                                                                                rcp_zeroop(infinity) r2._y__, r3.ww
                                                                                                mul_ieee r2.x___, r3.z, r2.y
                                                                                                and r2.__zw, r2.xxxx, l6.xyzw
                                                                                                ishr r6.x___, r2.z, l8
                                                                                                ishr r6._y__, r3.xxxx, l8
                                                                                                iadd r2.xy__, r2.xzxz, r3.xxxx
                                                                                                iadd r6.x___, r6.x, r6.y
                                                                                                ige r4.__z_, l5, r6.x
                                                                                                ior r4._y__, r2.wwww, l1
                                                                                                ige r4.x, r6.x, l9
                                                                                                cmov_logical r4.x, r5.x, l5, r4.x
                                                                                                cmov_logical r2.x, r4.z, r2.w, r2.x
                                                                                                cmov_logical r0.x, r4.x, r4.y, r2.x
                                                                                                mov out0, r0
                                                                                                mend

                                                                                                il_cs_2_0
                                                                                                dcl_cb cb0[9] ; Constant buffer that holds ABI data
                                                                                                dcl_literal l0, 4, 1, 2, 3
                                                                                                dcl_literal l1, 0x00FFFFFF, -1, -2, -3
                                                                                                dcl_literal l2, 0x0000FFFF, 0xFFFFFFFE, 0x000000FF, 0xFFFFFFFC
                                                                                                dcl_literal l3, 24, 16, 8, 0xFFFFFFFF
                                                                                                dcl_literal l4, 0xFFFFFF00, 0xFFFF0000, 0xFF00FFFF, 0xFFFF00FF
                                                                                                dcl_literal l5, 0, 4, 8, 12
                                                                                                mov r769, cb0[8].x
                                                                                                call 1202 ;$
                                                                                                endmain

                                                                                                func 1202 ; __OpenCL_image_write2_kernel
                                                                                                dcl_max_thread_per_group 256
                                                                                                mov r770, l1.0
                                                                                                dcl_literal l6, 0x43800000, 0x43800000, 0x43800000, 0x43800000 ; float: 2.560000e+02
                                                                                                dcl_literal l7, 0x00000000, 0x00000000, 0x00000000, 0x00000000 ; float: 0.000000e+00
                                                                                                dcl_literal l8, 0x00000018, 0x00000018, 0x00000018, 0x00000018 ; int: 24
                                                                                                dcl_literal l9, 0x00000000, 0x00000000, 0x00000000, 0x00000000 ; int: 0
                                                                                                dcl_resource_id(0)_type(2d)_fmtx(unknown)_fmty(unknown)_fmtz(unknown)_fmtw(unknown)
                                                                                                dcl_uav_id(2)_type(2d)_fmtx(uint)
                                                                                                dcl_raw_uav_id(1)
                                                                                                dcl_arena_uav_id(0)
                                                                                                mov r0.z, vThreadGrpIdFlat.x
                                                                                                mov r1022.xyz0, vTidInGrp.xyz
                                                                                                mov r1023.xyz0, vThreadGrpId.xyz
                                                                                                imad r1021.xyz0, r1023.xyz0, cb0[1].xyz0, r1022.xyz0
                                                                                                iadd r1021.xyz0, r1021.xyz0, cb0[6].xyz0
                                                                                                iadd r1023.xyz0, r1023.xyz0, cb0[7].xyz0
                                                                                                mov r1023.w, r0.z
                                                                                                ishl r1023.w, r1023.w, l0.z
                                                                                                imad r771.x, vAbsTidFlat.x, cb0[3].y, cb0[3].x
                                                                                                dcl_literal l10, 0x00000005, 0x00000005, 0x00000005, 0x00000005 ; int: 5
                                                                                                dcl_literal l11, 0x00000010, 0x00000010, 0x00000010, 0x00000010 ; int: 16
                                                                                                dcl_literal l12, 0x00000004, 0x00000004, 0x00000004, 0x00000004 ; int: 4
                                                                                                dcl_cb cb1[4]
                                                                                                ; Kernel arg setup: read
                                                                                                mov r1, cb1[0]
                                                                                                ; Kernel arg setup: img
                                                                                                mov r2, cb1[2]
                                                                                                call 1206 ; image_write2
                                                                                                ret
                                                                                                endfunc ; __OpenCL_image_write2_kernel
                                                                                                ;ARGSTART:__OpenCL_image_write2_kernel
                                                                                                ;version:1:2:44
                                                                                                ;uniqueid:1202
                                                                                                ;memory:private:0
                                                                                                ;memory:hwlocal:0
                                                                                                ;image:read:2D:RO:0:1:0
                                                                                                ;image:img:2D:WO:2:1:32
                                                                                                ;function:1:1206
                                                                                                ;intrinsic:0
                                                                                                ;sampler:unknown_24:0:1:24
                                                                                                ;ARGEND:__OpenCL_image_write2_kernel

                                                                                                func 1206 ; image_write2
                                                                                                mov r176.x___, r2.xxxx
                                                                                                mov r177.x___, r1.xxxx
                                                                                                call 1111 ; __amdil_get_global_id_int
                                                                                                mov r178, r1
                                                                                                mov r179, l6
                                                                                                mov r178, r178.x000
                                                                                                call 1111 ; __amdil_get_global_id_int
                                                                                                mov r181, r1
                                                                                                itof r182.x___, r178.xxxx
                                                                                                mov r181, r181.y000
                                                                                                utof r183.x___, r178.xxxx
                                                                                                mov r182, r182.xxxx
                                                                                                mov r184, l7
                                                                                                utof r185.x___, r181.xxxx
                                                                                                mov r0.___w, r184.x
                                                                                                mov r0.xyz_, r182
                                                                                                mov r182, r0
                                                                                                ;__fdiv_f32
                                                                                                mcall(46) (r183), (r183, r179)
                                                                                                ;__fdiv_f32
                                                                                                mcall(46) (r179), (r185, r179)
                                                                                                mov r185, r183.xxxx
                                                                                                itof r186.x___, r181.xxxx
                                                                                                mov r0.__z_, r184.x
                                                                                                mov r0.xy_w, r182
                                                                                                mov r182, r0
                                                                                                mov r0.___w, r179.x
                                                                                                mov r0.xyz_, r185
                                                                                                mov r184, r0
                                                                                                mov r0._y__, r186.x
                                                                                                mov r0.x_zw, r182
                                                                                                mov r182, r0
                                                                                                mov r185, l8
                                                                                                mov r186, l9
                                                                                                mov r0.__z_, r183.x
                                                                                                mov r0.xy_w, r184
                                                                                                mov r183, r0
                                                                                                mov r1.x___, r177.xxxx
                                                                                                mov r2.x___, r185.xxxx
                                                                                                mov r113, r182
                                                                                                mov r224.x___, r186.xxxx
                                                                                                sample_resource(0)_sampler(0)_coordtype(normalized) r1, r113 ; __amdil_sample_data_0_0_0_1
                                                                                                mov r0._y__, r179.x
                                                                                                mov r0.x_zw, r183
                                                                                                mov r177, r0
                                                                                                mov r178, r178.xxxx
                                                                                                mov r177, r177
                                                                                                mov r0._y__, r181.x
                                                                                                mov r0.x_zw, r178
                                                                                                mov r178, r0
                                                                                                mov r1.x___, r176.xxxx
                                                                                                mov r2.xy__, r178.xyxy
                                                                                                mov r113, r177
                                                                                                uav_store_id(2) r2.xy, r113
                                                                                                ret
                                                                                                endfunc ; image_write2
                                                                                                ;ARGSTART:image_write2
                                                                                                ;uniqueid:1206
                                                                                                ;memory:private:0
                                                                                                ;memory:hwlocal:0
                                                                                                ;function:1:1111
                                                                                                ;intrinsic:1:46
                                                                                                ;ARGEND:image_write2

                                                                                                func 1078 ; Store32BitsUAV
                                                                                                uav_raw_store_id(1) mem0.x___, r1.x, r2
                                                                                                ret
                                                                                                endfunc ; Store32BitsUAV

                                                                                                func 1111 ; getGlobalIDInt
                                                                                                mov r1, r1021.xyz0
                                                                                                ret
                                                                                                endfunc ; getGlobalIDInt
                                                                                                end

                                                                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                    hazeman

                                                                                                So OpenCL uses a new instruction to write to the image ( uav_store with a 2D format ) - but it doesn't answer whether the image has a linear or tiled layout.

                                                                                                There are two options to verify it. The first is to run a CAL kernel with uav_store and test whether it allows a tiled layout. The second is to catch the calls into the CAL library and check the flags passed to the resalloc function ( see the sketch below ).

                                                                                                <joke on> Of course, we could simply ask someone from ATI for an answer <joke off>
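                                                                                                For the second option, a minimal interposer sketch in C++. This is hedged: the calResAllocLocal2D signature below is written from memory of cal.h and should be checked against the real header before use; the opaque pointer types are stand-ins for the CAL handle types.

// Interposer sketch: log the flags the OpenCL runtime passes to the CAL
// resource allocator. ASSUMPTION: parameter order/types of
// calResAllocLocal2D are from memory - verify against cal.h.
#include <dlfcn.h>
#include <cstdio>

typedef unsigned int CALuint;
typedef int CALresult;

// Opaque pointers stand in for CALresource*/CALdevice; we only pass through.
typedef CALresult (*resalloc_fn)(void*, void*, CALuint, CALuint, int, CALuint);

extern "C" CALresult calResAllocLocal2D(void* res, void* dev, CALuint width,
                                        CALuint height, int format,
                                        CALuint flags) {
    static resalloc_fn real = 0;
    if (!real)
        real = (resalloc_fn)dlsym(RTLD_NEXT, "calResAllocLocal2D");
    // A linear vs. tiled choice should show up in these flags
    // (e.g. the global-buffer flag forces the linear path).
    std::fprintf(stderr, "calResAllocLocal2D %ux%u fmt=%d flags=0x%x\n",
                 width, height, format, flags);
    return real(res, dev, width, height, format, flags);
}

                                                                                                Build with g++ -shared -fPIC -o libspy.so spy.cpp -ldl, then run the OpenCL program with LD_PRELOAD=./libspy.so and watch stderr.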

                                                                                                     

                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                        cjang

                                                                                                        I've noticed the image based matrix multiply kernels seem
                                                                                                        to fall into one of three classes by performance. What you
                                                                                                        are saying is that we may be missing a fourth class that
                                                                                                        is even faster?

                                                                                                        1. ???
                                                                                                        2. 1000 GFLOPS for A * B and At * B
                                                                                                        3. 700 GFLOPS for A * Bt
                                                                                                        4. 400 GFLOPS for At * Bt

                                                                                                        I wouldn't be surprised if this is possible. I admit to
                                                                                                        not understanding how the tiling and texture cache work.
                                                                                                        From my point of view, it is (partially automated) trial
                                                                                                        and error.

                                                                                                        It is kind of surprising that the ATI hardware is capable
                                                                                                        of such high performance for math kernels and yet the
                                                                                                        developer is mostly on his or her own. There is so much
                                                                                                        unrealized potential. This would all be good for AMD/ATI
                                                                                                        as well as the developer. Everyone would benefit.

                                                                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                            hazeman

                                                                                                             

                                                                                                            Originally posted by: cjang I've noticed the image based matrix multiply kernels seem to fall into one of three classes by performance. What you are saying is that we may be missing a fourth class that is even faster?


                                                                                                            Yep. A fourth, a fifth, and maybe more. In the language of optimization, I'm simply saying that you are searching too small a space - there are two more variables which have a huge impact on performance.

                                                                                                            The first is the data representation in memory ( linear, tiled, or some other format - though if the computation of the pixel position gets too complicated, it will cost kernel speed ).

                                                                                                            The other "variable" is packing of workitems into groups - this one also have huge impact on reusing data in cache.

                                                                                                             

                                                                                                            It is kind of surprising that the ATI hardware is capable of such high performance for math kernels and yet the developer is mostly on his or her own. There is so much unrealized potential. This would all be good for AMD/ATI as well as the developer. Everyone would benefit.


                                                                                                            ATI's approach to developers is really bad. The documentation is among the worst I've seen: lots of spelling errors, obvious mistakes, and really stupid, wrong descriptions of some instructions. Also, ATI is still in the cold-war age ( those pesky nvidia spies ). They can't give official info on cache, cache latency, cache behaviour, coalescing, etc. And when by some mistake they do answer on the forum, it's wrong or contradicts an answer from another ATI developer.

                                                                • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                  MicahVillmow
                                                                  hazeman,
                                                                  If there are examples of the documentation that you believe need work and are not fixed in our new release, please list them in an email to streamdeveloper@amd.com, CC: Micah, and I'll work on getting them fixed.
                                                                    • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                      cjang

                                                                      Here is what I have.

                                                                      http://github.com/cjang/GATLAS

                                                                      There is also a homepage with some deeper explanation of what I have tried to do so far. For me, this is very much a work in progress and continually evolving. I have tried to keep the code and design clean and useful. However, it is still quite early. I am still trying to understand how all of this fits into the bigger picture.

                                                                      Next up is an optimized SIFT kernel (or, more generally, convolution-based feature transforms at multiple scales) for computer vision applications. This will be fun. And I think it will reveal whether the approach I have taken so far generalizes.

                                                                        • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                          cjang

                                                                          This thread started with a question about SGEMM performance. However, most of the performance figures have been for plain matrix multiplication, not SGEMM. So...

                                                                          This is the best picture I have right now of SGEMM performance. It takes PCIe bus data transfers into account as well. Note that memory is not pinned. (I have not tried specifying pinned memory - I don't even know if that works; see the sketch at the end of this post.)

                                                                          http://golem5.org/gatlas/bench_sgemm/bench_sgemm_pcie.html

                                                                          Note that all data points are from empirical timings.

                                                                          One surprise is that memory buffer kernels are faster than image kernels when writing A and B from host to device. When all matrices are transferred between host and device, a memory buffer kernel just breaks 300 GFLOPS at 5600x5600, while the corresponding image kernel is at about 260 GFLOPS.

                                                                          I also think there is a very natural tendency to overemphasize kernel benchmarks to the exclusion of PCIe bus data transfer costs. Many applications are likely to see much closer performance between IL/ISA and OpenCL because of the I/O bottleneck, so OpenCL is more competitive than synthetic kernel benchmarks suggest.

                                                                          Another obvious (in hindsight) observation is that more memory allows blocking the problem to reduce the effect of PCIe bus data transfer costs. I do not think the 256 MB UAV memory buffer limitation is really an obstacle as long as we can use as much device memory as possible across multiple buffers.
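                                                                          On the pinned-memory question above, a minimal sketch of the usual OpenCL recipe ( untested here; whether the driver actually page-locks a CL_MEM_ALLOC_HOST_PTR allocation is implementation-defined, and the helper name is hypothetical ):

// Pinned-staging sketch: CL_MEM_ALLOC_HOST_PTR asks the driver for
// host-accessible memory, which the runtime may back with page-locked
// pages; DMA from such memory usually runs at full PCIe speed.
#include <CL/cl.h>
#include <cstddef>

float* map_pinned_staging(cl_context ctx, cl_command_queue queue,
                          size_t bytes, cl_mem* staging) {
    cl_int err;
    *staging = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
    if (err != CL_SUCCESS) return NULL;
    // Blocking map returns a host pointer into the (hopefully pinned) buffer;
    // fill it with matrix data, then clEnqueueWriteBuffer to the device buffer.
    return (float*)clEnqueueMapBuffer(queue, *staging, CL_TRUE, CL_MAP_WRITE,
                                      0, bytes, 0, NULL, NULL, &err);
}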

                                                                            • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                              cjang

                                                                              Just an SGEMM performance update with the current SDK and driver. The 5870 now reaches almost exactly 50% utilization at 1375 gigaFLOPS with an image-based kernel. Memory buffer kernels are well over 500 gigaFLOPS.

                                                                              Using ATI SDK 2.1 and Catalyst 10.4 on Ubuntu 9.10 64 bit

                                                                              OpenCL SGEMM benchmarks (gigaFLOPS; @N is the matrix size at the measured peak)

                                                                                               5440     5670     5770     5870
                                                                                               Cedar    Redwood  Juniper  Cypress

                                                                              memory buffer:   @1280    @1600    @4800    @5440
                                                                              A  B             15       73       257      536
                                                                              At B             13       54       224      446
                                                                              A  Bt            17       78       277      558
                                                                              At Bt            14       56       225      447

                                                                              image:           @1280    @1600    @4800    @2240
                                                                              A  B             52       307      720      1307
                                                                              At B             67       320      722      1375
                                                                              A  Bt            50       182      416      794
                                                                              At Bt            53       163      362      703

                                                                                • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                  nnsan

                                                                                  Hi,

                                                                                  This is a little bit off topic, but I have tried to reproduce an optimized matrix multiply (MM) in IL. I summarized our results at

                                                                                  http://galaxy.u-aizu.ac.jp/trac/note/wiki/MatrixMultiply

                                                                                  I obtain ~2100 GFLOPS on Cypress at 850 MHz. Not the highest performance, but I believe our code is simpler in terms of memory layout than the fastest code posted on beyond3d.

                                                                                    • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                      hazeman

                                                                                       

                                                                                      Originally posted by: nnsan

                                                                                       

                                                                                      I obtain ~2100 Gflops on Cypress 850MHz. Not highest performance but I believe our code is simpler in terms of memory layout than the fastest code posted on beyond3d.

                                                                                       

                                                                                      I have to admit that something doesn't add up for me.

                                                                                      I did some math after looking at your code. For 2100 GFLOPS your kernel needs to transfer 1050 GB/s from the L1 cache. For the 5870 the aggregate cache bandwidth is 1080 GB/s ( 20 * 54 ). So to get your result you would need perfect transfer from L1. That would require ideal efficiency and no cache misses - which simply isn't the case for an ati gpu doing matrix mul.
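                                                                                      A back-of-envelope version of that arithmetic, assuming an 8x8 block of results per work-item fetched as float4s ( 128 bytes of A and B per 256 flops, i.e. 0.5 bytes/flop ) and the 20 * 54 GB/s cache figure above; none of these are official ATI numbers:

// Reproduce the L1-bandwidth estimate above; all inputs are the post's
// assumptions.
#include <cstdio>

int main() {
    const double gflops         = 2100.0;        // claimed throughput
    const double bytes_per_flop = 128.0 / 256.0; // 8x8 block, float4 fetches
    const double need           = gflops * bytes_per_flop; // GB/s from L1
    const double have           = 20 * 54.0;     // aggregate L1 on a 5870
    std::printf("need %.0f GB/s of %.0f GB/s -> %.0f%% of peak\n",
                need, have, 100.0 * need / have);
    return 0;
}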

                                                                                      Also, I have my doubts about hiding the latency of L1 cache hits.

                                                                                      So it would be nice for someone else to verify your results - there might be some error in the benchmarking.

                                                                                       

                                                                                        • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                          nnsan

                                                                                          hazeman,

                                                                                          You might be right so I will double-check my code.

                                                                                          The code inside the loop seems to be the same as prunedtree's code except for the load instructions. My code reads from two streams while prunedtree's reads from four; instead, I use the offset feature of sample_resource. Do you think these differences affect the performance (i.e. the cache hit rate)? I don't fully understand the internal architecture of Cypress...

                                                                                            • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                              hazeman

                                                                                              If I remember correctly, prunedtree achieved >2 TFLOPS with blocks bigger than 8x8 ( 10x12, I think ). A bigger block means less transfer required from the L1 cache.

                                                                                              Also, he used a precomputed position for each "workitem" - and he didn't post info about it. The right position and data layout can increase cache hits. The tiled layout used by default in pixel shaders is better than linear, but I doubt it's optimal.

                                                                                              The code for an 8x8 block with tiled layout ( the same one you use ) is available as part of CAL++. So you should see a similar result ( ~1.6 TFLOPS ).

                                                                                              Like I said, I have some doubts about the hiding of L1 cache latency in your code. To be sure I would have to see the generated ISA.

                                                                                              You have divided the reads into 2 groups of 4 sample instructions each. The CAL compiler sometimes merges those, but usually it doesn't. If they are not merged, then you have 13 ops per load clause. L1 latency is 180 cycles and you have only 13 ops * 8 threads = 104 cycles to hide it. So you could be using <60% of the gpu's power ( rough arithmetic below ).
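                                                                                              The latency-hiding estimate, spelled out ( using the from-memory figures above, not official specs ):

// How much of the ~180-cycle L1 hit latency can 13-op clauses hide?
#include <cstdio>

int main() {
    const int ops_per_clause = 13;  // ALU ops between sample clauses
    const int hw_threads     = 8;   // resident wavefronts per SIMD
    const int l1_latency     = 180; // cycles, estimate from the post
    const int hideable = ops_per_clause * hw_threads; // 104 cycles of work
    std::printf("hide %d of %d cycles -> ~%d%% utilization\n",
                hideable, l1_latency, 100 * hideable / l1_latency);
    return 0;
}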

                                                                                              I'm guessing that you have some error which causes you to overestimate performance by a factor of 2.

                                                                                              PS. I'm quoting these numbers from memory, so I could be wrong somewhere.

                                                                                                • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                  nnsan

                                                                                                  I uploaded the full disassembly code to my site.

                                                                                                  The read instructions are merged, as far as I understand. And it uses 25 registers, so I can have 10 threads - is this correct?

                                                                                                  Thanks for the detailed explanations. I did not realize a 2D space-filling curve access pattern was used in the prunedtree code. Deciphering ISA is really difficult.

                                                                                                    • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                      hazeman

                                                                                                       

                                                                                                      Originally posted by: nnsan And it uses 25 registers so that  I can have 10 threads, is this correct?

                                                                                                      In theory. But in practice, a thread count that is not a power of 2 can have a negative impact on performance.

                                                                                                      Thanks for detailed explanations. I did not realize 2D filling curve access was used in the prunedtree code. Deciphering ISA is really difficult.

                                                                                                       

                                                                                                      True. That's why I've rewritten his 8x8 block code using CAL++.

                                                                                                      So the only difference between your code and the CAL++ version is the usage of aoffimmi. This means you should see similar performance.

                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                        Jawed

                                                                                                         

                                                                                                        Originally posted by: nnsan And it uses 25 registers so I can have 10 threads, is this correct?


                                                                                                        I believe that's correct:

                                                                                                        10 hardware threads * 64 work items * 25 GPRs = 16000 GPRs.

                                                                                                        The clause temporaries used are T0 and T1. These consume:

                                                                                                        2 clause temporaries * 64 work items * 2 hardware threads  = 256 GPRs.

                                                                                                        The total GPR allocation is therefore 16256 GPRs and the capacity of the register file is 16384.
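                                                                                                        The same accounting as a quick check ( the 16384-GPR register file and 64 work-items per hardware thread are the Cypress figures used above ):

// Jawed's GPR budget, computed: main allocation plus clause temporaries.
#include <cstdio>

int main() {
    const int file_size  = 16384; // GPRs in a SIMD's register file
    const int items      = 64;    // work-items per hardware thread
    const int kernel_gpr = 25;    // registers used by nnsan's kernel
    const int threads    = file_size / (items * kernel_gpr);   // 10
    const int main_alloc = threads * items * kernel_gpr;       // 16000
    const int clause_tmp = 2 * items * 2; // T0/T1 on 2 executing threads: 256
    std::printf("%d threads, %d + %d = %d of %d GPRs\n", threads,
                main_alloc, clause_tmp, main_alloc + clause_tmp, file_size);
    return 0;
}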

                                                                                                    • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                      hazeman

                                                                                                       

                                                                                                      Originally posted by: nnsan  My code reads from two streams but prunedtree's code reads from four streams. Instead I use the offset feature of sample_resource.  Do you think these differences affect the performance (i.e. cache hit rate)?

                                                                                                       

                                                                                                      I think it shouldn't have an impact.

                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                        Jawed

                                                                                                        prunedtree also reported extremely high L1 bandwidth usage in his original description of his technique: 444 GB/s out of a theoretical 480 GB/s.

                                                                                                        Additionally, his code uses a tiled, rather than linear, access pattern, maximising efficiency. His kernel reads pre-computed addresses, i.e. work item IDs are assigned along a 2D space-filling curve ( one such curve is sketched below ).

                                                                                                        The use of aoffimmi shouldn't have any effect. It saves some registers though, which is very very very useful.

                                                                                                        Finally, since A is transposed, the memory access pattern is not strictly "matrix multiply".
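                                                                                                        For the curious, one common 2D space-filling assignment is Morton (Z-order) indexing - a sketch only, not necessarily the exact curve prunedtree used. Decoding a flat work-item id this way keeps neighbouring ids spatially close, which is the property that helps the texture L1 hit rate:

// Z-order (Morton) decode: map a flat id to tiled (x, y) coordinates.
#include <cstdint>
#include <cstdio>

// Compact the even-position bits of v into the low 16 bits.
static uint32_t compact1by1(uint32_t v) {
    v &= 0x55555555;
    v = (v | (v >> 1)) & 0x33333333;
    v = (v | (v >> 2)) & 0x0f0f0f0f;
    v = (v | (v >> 4)) & 0x00ff00ff;
    v = (v | (v >> 8)) & 0x0000ffff;
    return v;
}

int main() {
    for (uint32_t id = 0; id < 16; ++id) {
        uint32_t x = compact1by1(id);      // even bits -> x
        uint32_t y = compact1by1(id >> 1); // odd bits  -> y
        // ids 0,1,2,3 land on (0,0) (1,0) (0,1) (1,1): a 2x2 tile, and so on.
        std::printf("id %2u -> (%u, %u)\n", id, x, y);
    }
    return 0;
}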

                                                                                                         

                                                                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                            hazeman

                                                                                                             

                                                                                                            Originally posted by: Jawed prunedtree also reported extremely high L1 bandwidth usage in his original description of his technique, 444 GB/s out of a theoretical 480GB/s.


                                                                                                            Prunedtree wrote that he achieved 444 GB/s in synthetic tests; he had a slightly smaller value for matrix mul. That gives ~90% efficiency, which I have no problem with.

                                                                                                            But the code posted by nnsan would need to reach 99% of the theoretical bandwidth to achieve 2.1 TFLOPS. That simply isn't possible on an ati gpu.

                                                                                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                Jawed

                                                                                                                http://forum.beyond3d.com/showthread.php?p=1369860#post1369860

                                                                                                                 

                                                                                                                Optimizing a little for RV870, I managed to reach up to 1083 GB/s (L1 fetch bandwidth peaks at 1088 GB/s afaik) with 12x8 blocks (that puts the TMU bottleneck at 2.6 TFlop/s) yet this only achieves 2.17 TFlop/s in practice: ALU becomes the bottleneck.


                                                                                                                Also, nnsan's kernel is a pixel shader (I've only just noticed), so it is, by default, using tiling for work item location - which means the kernel should have a very favourable L1 access pattern. In other words, nnsan does not need to worry about trying to duplicate the pre-computed addressing that prunedtree has used.

                                                                                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                    hazeman

                                                                                                                     

                                                                                                                    Also, nnsan's kernel is a pixel shader (I've only just noticed), so it is, by default, using tiling for work item location - which means the kernel should have a very favourable L1 access pattern. In other words, nnsan does not need to worry about trying to duplicate the pre-computed addressing that prunedtree has used.


                                                                                                                    This is exactly the same as the kernel in CAL++. So nnsan should see similar performance.

                                                                                                                    Btw, are you sure that prunedtree used EXACTLY the same pattern that the pixel shader uses ( especially since there is no official info about the pattern used in the gpu* )? A small change there can have a significant impact on cache hit rates.

                                                                                                                    * I think I've seen a comment somewhere that ATI doesn't want to disclose it because it can change with the architecture ( or something ).

                                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                        Jawed

                                                                                                                         

                                                                                                                        Originally posted by: hazeman Btw, are you sure that prunedtree used EXACTLY the same pattern that the pixel shader uses ( especially since there is no official info about the pattern used in the gpu* )? A small change there can have a significant impact on cache hit rates.


                                                                                                                        No, I'm not saying that the pattern used is the same in both cases - merely that prunedtree reported that tiling improves performance in his algorithm. In fact, tiling is essential, as performance for N>1024 "crumbled".

                                                                                                                        prunedtree's tiling, for example, might account for the fact that there's a 4:1 ratio between the vertical and horizontal directions due to float4 packing, whereas the pixel shader's tiling doesn't.

                                                                                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                            cjang

                                                                                                                            Excuse me for being a bit off the current discussion topic.

                                                                                                                            In my experience, it is important to validate output data as there are cases in which a kernel seems perfectly fine except that the results are wrong. A kernel design that works at one combination of matrix dimensions, inner/outer blocking sizes, accumulation loop order, etc. may fail at another point in the kernel solution space. While everything I am doing is with OpenCL, I believe the same philosophy applies to CAL/IL (as it is also translated by a compiler) and even to ISA if there is further translation somewhere down at the device level.

                                                                                                                            I believe data validation should be the ultimate criterion for correctness, rather than code inspection (even at the ISA level). If the output is good, then we know that however the kernel works, the calculation is correct. We then know how much work is done by whatever measure of complexity we are using during the time in which the kernel executes. ( A minimal checker is sketched below. )
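                                                                                                                            A minimal sketch of such a checker ( helper names hypothetical, not from GATLAS ): compare the GPU result against a naive CPU reference, with a relative tolerance since single-precision error grows with the matrix size.

// Naive O(N^3) reference and a relative-error comparison.
#include <cmath>
#include <cstdio>
#include <vector>

// C = A * B for row-major NxN matrices; slow but trustworthy.
void sgemm_ref(int n, const std::vector<float>& a,
               const std::vector<float>& b, std::vector<float>& c) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.f;
            for (int k = 0; k < n; ++k) acc += a[i*n + k] * b[k*n + j];
            c[i*n + j] = acc;
        }
}

// Relative check: a fixed epsilon would false-alarm on large accumulations.
bool validate(int n, const std::vector<float>& gpu,
              const std::vector<float>& ref, float tol = 1e-4f) {
    for (int i = 0; i < n*n; ++i) {
        float denom = std::fabs(ref[i]) > 1.f ? std::fabs(ref[i]) : 1.f;
        if (std::fabs(gpu[i] - ref[i]) / denom > tol) {
            std::printf("mismatch at %d: %g vs %g\n", i, gpu[i], ref[i]);
            return false;
        }
    }
    return true;
}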

                                                                                                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                                hazeman

                                                                                                                                I've recoded nnsan's kernel in CAL++. Here are the results for the 4770:

                                                                                                                                2 images with burst write ( 30 regs ) - 523 GFLOP/s

                                                                                                                                2 images without burst write ( 26 regs ) - 623 GFLOP/s ( this is closest to prunedtree's code - I think he used 28 regs )

                                                                                                                                1 image with burst write ( 30 regs ) - 548 GFLOP/s - the internal loop (ISA) is 2 ops longer than in nnsan's kernel

                                                                                                                                1 image without burst write ( 26 regs ) - 660 GFLOP/s - the internal loop (ISA) is 2 ops shorter than in nnsan's kernel

                                                                                                                                The difference in internal loop length is simply a result of the bad quality of the CAL/IL compiler. Also, a burst write shouldn't increase register usage: registers for a burst must have consecutive indexes, so any reasonable compiler would use the correct indexes from the beginning. But this is too much for the CAL/IL compiler :////// .

                                                                                                                                PS. The code is available in the CAL++ svn.

                                                                                                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                                    hazeman

                                                                                                                                     

                                                                                                                                    Originally posted by: cjang In my experience, it is important to validate output data as there are cases in which a kernel seems perfectly fine except that the results are wrong.


                                                                                                                                    You are totally right here.

                                                                                                                                     

                                                                                                                                    We then know how much work is done by whatever measure of complexity we are using during the time in which the kernel executes.


                                                                                                                                    This is not always a valid assumption - sometimes, by mistake, we can use a different value ( I've seen this happen ).

                                                                                                                                    And with ATI IL, I think looking at the ISA is a basic optimisation tool. The CAL/IL compiler can make really stupid mistakes which can be corrected by slightly changing the code.

                                                                                                                                     

                                                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                                        Jawed

                                                                                                                                        The input data must either be padded with zeroes, to ensure that each work-item never fetches out of bounds, or the kernel must do bounds checking. Generally, though, the latter option produces too much computational overhead.

                                                                                                                                        So the padding has to be adapted to the block dimensions if the implementation has variable-sized blocks.

                                                                                                                                        In general the GPU will have junk data in memory (no different from a PC's memory). Only sufficient zero-padding (done host side, with everything then copied over) will ensure that the junk doesn't affect the result.
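                                                                                                                                        A host-side padding sketch along these lines ( helper name hypothetical ): round each dimension up to the block size and zero-fill before copying to the device.

// Zero-pad a row-major matrix up to multiples of the kernel's block size
// so work-items never fetch junk past the edge.
#include <cstring>
#include <vector>

std::vector<float> pad_matrix(const float* src, int rows, int cols,
                              int block_h, int block_w,
                              int* padded_rows, int* padded_cols) {
    const int pr = ((rows + block_h - 1) / block_h) * block_h;
    const int pc = ((cols + block_w - 1) / block_w) * block_w;
    std::vector<float> dst(static_cast<size_t>(pr) * pc, 0.0f); // zero fill
    for (int i = 0; i < rows; ++i) // copy the real data row by row
        std::memcpy(&dst[static_cast<size_t>(i) * pc],
                    &src[static_cast<size_t>(i) * cols],
                    cols * sizeof(float));
    *padded_rows = pr;
    *padded_cols = pc;
    return dst; // copy this whole padded buffer to the device
}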

                                                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                                        nnsan

                                                                                                                                        Jawed, thanks for the precise numbers.

                                                                                                                                        I think in PS mode it automatically chooses the number of threads depending on the register count, and we have no control. In CS, we can specify the number of threads through the dcl_num_thread instruction (correct me if I'm wrong).

                                                                                                                                         

                                                                                                                                        Originally posted by: hazeman I've recoded nnsan's kernel in CAL++. Here are the results for the 4770:

                                                                                                                                        2 images with burst write ( 30 regs ) - 523 GFLOP/s

                                                                                                                                        2 images without burst write ( 26 regs ) - 623 GFLOP/s ( this is closest to prunedtree's code - I think he used 28 regs )

                                                                                                                                        1 image with burst write ( 30 regs ) - 548 GFLOP/s - the internal loop (ISA) is 2 ops longer than in nnsan's kernel

                                                                                                                                        1 image without burst write ( 26 regs ) - 660 GFLOP/s - the internal loop (ISA) is 2 ops shorter than in nnsan's kernel



                                                                                                                                        Your results seem to indicate that register usage is critical, and hence so is the number of threads. As you know, without the "breakc" hack my code requires 36 registers and shows ~1500 GFLOPS at maximum on Cypress. In my case, with the hack I have 256/25 > 10 threads, and without it, 256/36 > 7 threads. 10/7 is roughly the same as the performance ratio.

                                                                                                            • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                              MicahVillmow
                                                                                                              nnsan,
                                                                                                              Congrats on your work, very impressive. It is nice to see people using the newer features of IL to make the code simpler to read.