
    OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?

    Nikolai_
      anyone have any results?

      Hello!

      I just want to see how close to the theoretical FLOPS the Radeon/FireStream cards get... I figure SGEMM results would be a good measure.

        • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
          Nikolai_

          No one knows of any BLAS results?

            • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
              hazeman

              Here you have some results: http://forum.beyond3d.com/showthread.php?t=54842

              EDIT: Prunedtree achieved 1 TFLOPS on the 4890 and 2+ TFLOPS on the 58xx.

              But it should be noted that it isn't a real sgemm. The input matrices are divided into 2 parts ( even and odd columns ). Also, one matrix is in row format ( C style ) and the other in column format ( Fortran style ).
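
              For illustration, here is a minimal host-side sketch of that even/odd column split ( a hypothetical helper assuming row-major float storage and even n; this is not prunedtree's code ):

              #include <stddef.h>

              /* Split an n x n row-major matrix into two n x (n/2) halves holding
                 its even and odd columns, so each half can be uploaded as its own
                 input image. Hypothetical helper; n is assumed even. */
              void split_even_odd(const float *m, float *even, float *odd, size_t n)
              {
                  for (size_t r = 0; r < n; ++r)
                      for (size_t c = 0; c < n; c += 2) {
                          even[r * (n / 2) + c / 2] = m[r * n + c];
                          odd [r * (n / 2) + c / 2] = m[r * n + c + 1];
                      }
              }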

              I'm including an almost exact rewrite of prunedtree's code in more readable form ( he gave ISA code only ). The only difference is that I calculate the block position where prunedtree used precomputed data.

              PS. The code is written in the C++ to IL compiler/converter. Normal C++ if/while/for are used for controlling code generation and loop unrolling ( something like a really advanced preprocessor ). The il_whileloop, il_if, il_breakc give the corresponding instructions in IL.

              #define BX 8
              #define BY 8
              #define BX4 (BX/4)
              #define BY4 (BY/4)

              void kernel_matrixmul( input2d<float4>& A0, input2d<float4>& A1,
                                     input2d<float4>& B0, input2d<float4>& B1,
                                     global<float4>& C, float1 xsize, float1 ysize )
              {
                  float4 R[BY][BX4], ta[2][BY4], tb[2][BX4], k;
                  float2 p;
                  int    i, j;

                  assert( BX==8 && BY==8 );

                  p = ( named_variable<float2>("vWinCoord0.xy") - float2(0.5) ) * float2(BX4,BY);

                  for(i=0;i<BY;i++) {
                      for(j=0;j<BX4;j++) R[i][j] = float4(0);
                  }

                  k = float4( p.y(), p.x(), float1(-2), float1(-1) );

                  il_whileloop {
                      k += float4(0,0,2,2);
                      il_breakc( k.z()>=ysize );

                      ta[0][0] = A0[k.xz()]; ta[0][1] = A1[k.xz()];
                      tb[0][0] = B0[k.yz()]; tb[0][1] = B1[k.yz()];
                      ta[1][0] = A0[k.xw()]; ta[1][1] = A1[k.xw()];
                      tb[1][0] = B0[k.yw()]; tb[1][1] = B1[k.yw()];

                      for(i=0;i<BY4;i++) {
                          for(j=0;j<BX4;j++) {
                              R[4*i+0][j].x() = mad( ta[i][0].x(), tb[j][0].x(), mad( ta[i][1].x(), tb[j][1].x(), R[4*i+0][j].x() ) );
                              R[4*i+0][j].y() = mad( ta[i][0].x(), tb[j][0].y(), mad( ta[i][1].x(), tb[j][1].y(), R[4*i+0][j].y() ) );
                              R[4*i+0][j].z() = mad( ta[i][0].x(), tb[j][0].z(), mad( ta[i][1].x(), tb[j][1].z(), R[4*i+0][j].z() ) );
                              R[4*i+0][j].w() = mad( ta[i][0].x(), tb[j][0].w(), mad( ta[i][1].x(), tb[j][1].w(), R[4*i+0][j].w() ) );
                              R[4*i+1][j].x() = mad( ta[i][0].y(), tb[j][0].x(), mad( ta[i][1].y(), tb[j][1].x(), R[4*i+1][j].x() ) );
                              R[4*i+1][j].y() = mad( ta[i][0].y(), tb[j][0].y(), mad( ta[i][1].y(), tb[j][1].y(), R[4*i+1][j].y() ) );
                              R[4*i+1][j].z() = mad( ta[i][0].y(), tb[j][0].z(), mad( ta[i][1].y(), tb[j][1].z(), R[4*i+1][j].z() ) );
                              R[4*i+1][j].w() = mad( ta[i][0].y(), tb[j][0].w(), mad( ta[i][1].y(), tb[j][1].w(), R[4*i+1][j].w() ) );
                              R[4*i+2][j].x() = mad( ta[i][0].z(), tb[j][0].x(), mad( ta[i][1].z(), tb[j][1].x(), R[4*i+2][j].x() ) );
                              R[4*i+2][j].y() = mad( ta[i][0].z(), tb[j][0].y(), mad( ta[i][1].z(), tb[j][1].y(), R[4*i+2][j].y() ) );
                              R[4*i+2][j].z() = mad( ta[i][0].z(), tb[j][0].z(), mad( ta[i][1].z(), tb[j][1].z(), R[4*i+2][j].z() ) );
                              R[4*i+2][j].w() = mad( ta[i][0].z(), tb[j][0].w(), mad( ta[i][1].z(), tb[j][1].w(), R[4*i+2][j].w() ) );
                              R[4*i+3][j].x() = mad( ta[i][0].w(), tb[j][0].x(), mad( ta[i][1].w(), tb[j][1].x(), R[4*i+3][j].x() ) );
                              R[4*i+3][j].y() = mad( ta[i][0].w(), tb[j][0].y(), mad( ta[i][1].w(), tb[j][1].y(), R[4*i+3][j].y() ) );
                              R[4*i+3][j].z() = mad( ta[i][0].w(), tb[j][0].z(), mad( ta[i][1].w(), tb[j][1].z(), R[4*i+3][j].z() ) );
                              R[4*i+3][j].w() = mad( ta[i][0].w(), tb[j][0].w(), mad( ta[i][1].w(), tb[j][1].w(), R[4*i+3][j].w() ) );
                          }
                          il_breakc( xsize<float1(0) ); // hack to reduce register usage
                      }
                  }
                  il_endloop

                  uint1 s, step;
                  s    = convert_uint1( p.y()*xsize + p.x() );
                  step = convert_uint1( xsize );
                  for(i=0;i<BY;i++) {
                      for(j=0;j<BX4;j++) {
                          C[s+j] = R[i][j];
                      }
                      s += step;
                  }
              }

                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                    oscarbarenys1

                    Can you also post the C++ to IL compiler/converter, in binary or source form, for playing with?

                    Can you elaborate on how easy it would be to port that to OpenCL?

                     

                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                        hazeman

                        The performance results are the same as in the prunedtree post. The ISA generated from the included C++ code is almost the same as in prunedtree's case ( including the hack to reduce register usage ).

                        I think writing this in OpenCL is almost impossible at the moment. First of all, you don't have enough control over the IL code generated from OCL ( without the register usage hack you lose 20-30% of the speed ). Also, matrixmul uses the TUs to access memory. ATI probably won't deliver image support in OCL for a long time ( they need a huge change in the CAL and OCL compiler architecture - they went the easy way to provide an OCL compiler <sarcasm on>in such a short time<sarcasm off> ). Without TUs you need explicit address computation ( an increase in register usage ) and you lose memory caching ( at least on 4xxx ).
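
                        To make the TU point concrete, here is a small illustrative OpenCL sketch ( names and signature are my own, not from the matrixmul code ) contrasting the two fetch paths - an image read goes through the texture units and their cache, while a buffer read needs explicit address arithmetic:

                        __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                                                   CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

                        /* Illustration only: fetch the same float4 element through an image
                           (texture unit, cached) and through a raw buffer (explicit address
                           computation in ALU code). */
                        __kernel void fetch_compare(__read_only image2d_t Aimg,
                                                    __global const float4 *Abuf,
                                                    __global float4 *out,
                                                    uint width4 /* row width in float4 units */)
                        {
                            int x = get_global_id(0);
                            int y = get_global_id(1);
                            float4 viaImage  = read_imagef(Aimg, smp, (int2)(x, y)); /* TU path  */
                            float4 viaBuffer = Abuf[y * width4 + x];                 /* ALU path */
                            out[y * width4 + x] = viaImage + viaBuffer;
                        }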

                        As for the C++ to IL compiler/converter/generator ( I really don't know what it should be called ): it's a set of include files for C++ ( heavy use of templates ) which lets you write kernels directly in C++ in an OCL-like language. It also includes C++ bindings to CAL ( which are very similar to the OCL C++ bindings ). I've been thinking about making it available ( for over a month now ), but I don't have much spare time to do it. Also, I should probably write some docs for it. Ohh, and one more thing - I don't have a name for it.

                         

                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                            oscarbarenys1

                             

                            Well, I have been holding this trick in my head for over a month now...

                            More info on my blog coming soon: oscarbg.blogspot.com

                            You really can enable image support. Set:

                            set GPU_IMAGES_SUPPORT=1

                            ( or export it on Linux ).

                            Tested on a 5870 with AMD Stream 2.0 and hotfix 9.12; it only works for 2D images. You can also get the generated IL code, so the only thing lacking is building programs from that. Nvidia supports this.

                             

                            Similarly, you can enable byte addressable store with GPU_BYTE_ADDRESSABLE_STORE ( though it seems not to use RAW UAVs ), and some Nvidia samples ( histogram64 ) then work.

                            There is also doubles extension reporting with:

                            GPU_DOUBLE_PRECISION

                            and GL sharing with:

                            CL_KHR_GL_SHARING

                            I strongly encourage you to do so ( release the C++ to IL compiler/converter/generator ).

                            Regarding documentation: as long as you document how to compile this matmul code, that is enough!

                            Please don't wait for full documentation; perhaps I can contribute improvements to your code, if you want.

                            Do you support doubles?

                             

                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                    cjang

                    Here is a chart of sgemm performance on an HD 5870 using OpenCL SDK 2.0.

                    http://golem5.org/bucket/gatlas/sgemm.png

                    There are three parameterized kernel implementations. One uses memory buffers (blue curve). The other two use images (red curves). "Dual images" means even/odd column storage of the A and B matrices.

                    For each (square) matrix size, the kernels are combinatorially optimized over: 1) local work item dimension; 2) Nx4 quad inner product blocking; 3) row/column versus tiled matrix data layout.
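
                    As a rough sketch of what such a combinatorial search loop can look like ( all helper names here are hypothetical stand-ins, not GATLAS's actual API ):

                    /* Hypothetical sweep over the three parameters listed above;
                       buildKernelSource() and timeKernel() are stand-ins for the real
                       generate/compile/run machinery. */
                    extern const char *buildKernelSource(int local, int blockN, int tiled);
                    extern double      timeKernel(const char *src);   /* returns GFLOPS */

                    void sweep(int *bestLocal, int *bestBlockN, int *bestTiled)
                    {
                        double best = 0.0;
                        for (int local = 4; local <= 16; local *= 2)          /* LxL work group */
                            for (int blockN = 1; blockN <= 10; ++blockN)      /* Nx4 blocking   */
                                for (int tiled = 0; tiled <= 1; ++tiled) {    /* layout variant */
                                    double gflops = timeKernel(buildKernelSource(local, blockN, tiled));
                                    if (gflops > best) {
                                        best = gflops;
                                        *bestLocal = local; *bestBlockN = blockN; *bestTiled = tiled;
                                    }
                                }
                    }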

                    Is anyone aware of faster OpenCL performance for sgemm? I've run out of ideas, but I feel higher performance must be possible, as CAL IL/ISA performance is so much better.

                    Also, is it possible (or practical) to interoperate OpenCL with CAL IL/ISA? I wonder if they can co-exist and run using the same memory buffers. That would be the best of both worlds.

                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                        hazeman

                        I think you should try comparing the ISA generated from the CAL++/prunedtree version and from the OpenCL version.

                        This would give you an idea of where OpenCL generates bad code, and maybe you could find a workaround.

                        Also, to achieve the highest performance you need to run 8 hardware threads on each SIMD ( so 512 GPU threads per SIMD ). But if you try to force that on a kernel that uses too many registers, then CAL uses the stack ( to store registers ) and the result is a really big slowdown.

                         

                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                            cjang

                            Your suggestion helped. After reordering the image reads and inner product MADs to match CAL++/prunedtree, performance is somewhat improved for the dual image version in OpenCL.

                            I've captured the IL from the /tmp/OCL*.il files generated by the OpenCL runtime and compared it with the CAL++/prunedtree and AMD simple_matmult IL. Though I have no experience with ATI GPU IL/ISA, it's obvious the OpenCL compiler is generating less efficient code. It appears to use registers less efficiently than it could. Everything is slower, as there are lots of unnecessary MOVs to shift data in and out of common registers around the MADs and I/O reads. I don't know what to do about this short of some very ugly hacking to get around the LLVM OpenCL compiler - and in that case, why not use CAL/IL? ( which leads to CAL++ )

                            Here's something that is kind of interesting. It's a graph comparing six different implementations of matrix multiply on an HD 5870. Note that there is only a single timing trial, so these numbers are not averages. I've noticed that the faster implementations tend to have higher variance, so these numbers may be higher or lower than expected. The slower OpenCL implementations have low variance, only one or two percent at most.

                            http://golem5.org/bucket/gatlas/matmult6.png

                            • 1627 GFLOPS - CAL++/prunedtree at 3840x3840
                            • 1182 GFLOPS - CAL simple_matmult at 1280x1280
                            • 915 GFLOPS - OpenCL dual images at 2560x2560
                            • 727 GFLOPS - OpenCL single images at 3840x3840
                            • 486 GFLOPS - OpenCL memory buffers at 2560x2560
                            • 172 GFLOPS - CAL memimport_matmult at 1280x1280

                            At this point, I see IL as between 50% and 100% faster than OpenCL. The difference increases as kernels shift towards being compute bound. OpenCL is more competitive when using UAV memory buffers, as the bottleneck is then local memory bandwidth. With images (texture sampling) as the data source, the memory bandwidth is much higher, so the inner loops matter more; then OpenCL falls further behind IL.

                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                n0thing

                                You are getting 486 GFlops using global memory? Is this possible? Peak bandwidth from global memory is only 153 GB/s IIRC. My kernel using shared memory gets to about 475 GFlops on a 5870.

                                Can you explain a bit more about dual images? I guess it's to increase the cache hit ratio, but I haven't used this technique.

                                I have an internal version using single images and I get around 975 GFlops on a 5850.

                                Can you share your kernels?

                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                    cjang

                                    Here are the top five global-to-local kernels found at 2560x2560 for the test run used to make the matmult6 chart above. The numbers at the end are: <L> <N> <0|1>, where the 2D local work item dimension is LxL; the output matrix blocking is Nx4; 0 means row/column and 1 means tiled matrix data layout.

                                    1. microseconds: 68966 gflops: 486.441 2560 8 10 1
                                    2. microseconds: 87164 gflops: 384.882 2560 8 5 1
                                    3. microseconds: 87823 gflops: 381.994 2560 8 8 1
                                    4. microseconds: 93356 gflops: 359.354 2560 8 10 0
                                    5. microseconds: 106210 gflops: 315.864 2560 8 5 0

                                    Note the matrix blocking is 10x4. I had the idea to parameterize over arbitrary block sizes (not only even quads) after reading an explanation posted by prunedtree. Also note that the data layout is tiled to facilitate coalesced reads; see the sketch below. It's basically a block-outer-product friendly layout.
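
                                    Here is a rough host-side sketch of such a retiling step ( tile size and names are illustrative assumptions, not the actual layout code ):

                                    #define TILE 16  /* illustrative tile edge, not the tuned value */

                                    /* Copy an n x n row-major matrix into TILE x TILE tiles stored
                                       contiguously, so a work group streams one tile with consecutive
                                       ( coalescable ) reads. Assumes n is a multiple of TILE. */
                                    void retile(const float *src, float *dst, int n)
                                    {
                                        int t = 0;
                                        for (int by = 0; by < n; by += TILE)
                                            for (int bx = 0; bx < n; bx += TILE)
                                                for (int y = 0; y < TILE; ++y)
                                                    for (int x = 0; x < TILE; ++x)
                                                        dst[t++] = src[(by + y) * n + (bx + x)];
                                    }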

                                    The dual images are just what the CAL++/prunedtree matrix multiply implementations do; this better hides the latency of texture reads. Take a look at hazeman's CAL++/prunedtree kernel_matrixmul source code in an earlier post on this thread, and note the two MADs per line, even/odd, in the inner product accumulation.
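
                                    A minimal OpenCL rendering of that even/odd pattern might look like the fragment below ( a sketch under assumed names that shows only the read/MAD shape for one accumulator, not a complete or correct sgemm ):

                                    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                                                               CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

                                    /* Sketch only: A0/A1 and B0/B1 hold the even/odd columns of A and B.
                                       Each iteration issues two independent image reads per operand and
                                       two MADs per accumulated element, which helps hide texture read
                                       latency. */
                                    __kernel void dualImageRow(__read_only image2d_t A0, __read_only image2d_t A1,
                                                               __read_only image2d_t B0, __read_only image2d_t B1,
                                                               __global float4 *C, uint width4)
                                    {
                                        int col = get_global_id(0);
                                        int row = get_global_id(1);
                                        float4 sum = (float4)(0.0f);
                                        for (uint k = 0; k < width4; ++k) {
                                            float4 a0 = read_imagef(A0, smp, (int2)(k, row));
                                            float4 a1 = read_imagef(A1, smp, (int2)(k, row));
                                            float4 b0 = read_imagef(B0, smp, (int2)(col, k));
                                            float4 b1 = read_imagef(B1, smp, (int2)(col, k));
                                            sum = mad(a0.xxxx, b0, mad(a1.xxxx, b1, sum)); /* even + odd */
                                        }
                                        C[row * width4 + col] = sum;
                                    }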

                                    After looking at the IL generated by the OpenCL runtime, I believe higher performance is possible using memory buffers. The inner loop accumulating the inner product from registers is not as efficient as the hand-crafted IL implementations I've seen. Even when the memory reads are removed, so only the inner loop and the final output write are left, performance only reaches around 1600 GFLOPS. This seems low, but it makes sense after looking at the IL generated by OpenCL.

                                    Is your 975 GFLOPS single image version in OpenCL? I would be interested in this. How does it work? Presently, I am stuck with single images at around 720 GFLOPS with an OpenCL kernel.

                                    I do plan on sharing soon. Sorry... It's just not ready yet. The approach I've taken is auto-tuning and empirical optimization in the style of ATLAS. So it's not just kernels but a framework around them that finds optimal specializations (to use C++ terminology) for any hardware/software configuration. I want a design that is practical and useful, not only a benchmarking harness.

                                    OpenCL allows many more design possibilities for this optimization approach compared with a purely static compilation toolchain. I mean, it is possible that optimization could be lazy as well as eager and could exploit profiling statistics. That makes it more like a JIT or Oracle optimizing execution planner. However, going down this road also leads to much greater complexity.

                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                        hazeman

                                         

                                        Originally posted by: cjang

                                        I do plan on sharing soon. Sorry... It's just not ready yet. The approach I've taken is auto-tuning and empirical optimization in the style of ATLAS. So it's not just kernels but a framework around them that finds optimal specializations (to use C++ terminology) for any hardware/software configuration. I want a design that is practical and useful, not only a benchmarking harness.

                                        OpenCL allows many more design possibilities for this optimization approach compared with a purely static compilation toolchain. I mean, it is possible that optimization could be lazy as well as eager and could exploit profiling statistics. That makes it more like a JIT or Oracle optimizing execution planner. However, going down this road also leads to much greater complexity.

                                        I think this is one quite big flaw of the OpenCL specification.

                                        OpenCL kernels are supposed to work on many different architectures/devices, and yet the goal of OpenCL is to deliver maximum performance. It's simply impossible to write one kernel that fits them all.

                                        IMHO OpenCL should have been designed with dynamic kernel generation in mind from the beginning. Adding some kind of scripting language on top of the current C code would be a good start. Tools for selecting the best kernel should also be included in the specification.

                                        I know that this can be achieved by using an additional library independent of OpenCL. But IMHO it is so important to the main goal of OpenCL ( speed ) that it should have been included in the specs.
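
                                        For what it's worth, the usual workaround today is string-level kernel generation at run time; a minimal sketch ( hypothetical snprintf-based generator, nothing standardized ):

                                        #include <stdio.h>

                                        /* Hypothetical sketch: bake a blocking factor into OpenCL source at
                                           run time before handing it to clCreateProgramWithSource. Real
                                           generators ( CAL++, GATLAS, ... ) are far more structured. */
                                        int makeKernelSource(char *buf, size_t len, int blockN)
                                        {
                                            return snprintf(buf, len,
                                                "#define BLOCK_N %d\n"
                                                "__kernel void scaleBlocks(__global float4 *c) {\n"
                                                "    size_t i = get_global_id(0);\n"
                                                "    for (int b = 0; b < BLOCK_N; ++b)\n"
                                                "        c[i * BLOCK_N + b] *= 2.0f;\n"
                                                "}\n",
                                                blockN);
                                        }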

                                         

                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                            cjang

                                            You are right about the need for kernel generation and optimization; it's necessary for good performance. If it's not part of the OpenCL specification, then everyone must re-invent the wheel. I find it quite amusing that everyone builds an internal DSL in C++ using some form of metaprogramming in order to achieve good performance in OpenCL or IL. We are forced to write custom languages, which does seem very wasteful.

                                            The state of OpenCL reminds me of CORBA when it was first released. The 1.0 specification was the minimal requirement for source code compatibility. There was no common wire protocol. The 2.0 specification included IIOP, which addressed this and made vendor interoperability possible. I kind of expect the evolution of OpenCL (assuming it catches on) to follow this. The next major release should hopefully address important practical concerns.

                                            I would vote for keeping metaprogramming, kernel generation, and optimization separate from the core OpenCL specification. This seems to be the trend in language design (e.g. the JVM-based languages like Scala and Clojure). It's probably still early for OpenCL; that's why we are forced to write our own internal DSLs.

                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                n0thing

                                                Performance on a 5850 ( overclocked to 900/1300 ):

                                                Executing kernel for 1 iterations
                                                -------------------------------------------

                                                KernelTime (ms) : 62.9427
                                                GFlops achieved : 1041.2

                                                MatrixA MatrixB Time(sec) KernelTime(sec)
                                                3200x3200 3200x3200 0.876951 0.175642

                                                I am using an internal version of ATI's OpenCL binaries, so it may be optimized a bit relative to SDK 2.0.

                                                I have also tried using a 4x8 register blocking to compute 32 values per thread, but performance is lower (around 900 GFlops).

                                                #define TILEX 4
                                                #define TILEX_SHIFT 2
                                                #define TILEY 4
                                                #define TILEY_SHIFT 2

                                                __constant sampler_t imageSampler = CLK_NORMALIZED_COORDS_FALSE |
                                                                                    CLK_ADDRESS_CLAMP |
                                                                                    CLK_FILTER_NEAREST;

                                                __kernel void mmmKernel(__read_only image2d_t matrixA,
                                                                        __read_only image2d_t matrixB,
                                                                        __write_only image2d_t matrixC,
                                                                        uint widthA, uint widthB)
                                                {
                                                    int2 pos = (int2)(get_global_id(0), get_global_id(1));
                                                    float4 sum0 = (float4)(0);
                                                    float4 sum1 = (float4)(0);
                                                    float4 sum2 = (float4)(0);
                                                    float4 sum3 = (float4)(0);

                                                    widthB /= 4;

                                                    for(int i = 0; i < widthA; i = i + 4)
                                                    {
                                                        float4 tempA0 = read_imagef(matrixA, imageSampler, (int2)(i/4, pos.y << TILEY_SHIFT));
                                                        float4 tempA1 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 1));
                                                        float4 tempA2 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 2));
                                                        float4 tempA3 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 3));

                                                        float4 tempB0 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i));
                                                        float4 tempB1 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 1));
                                                        float4 tempB2 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 2));
                                                        float4 tempB3 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 3));

                                                        sum0.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x + tempA0.w * tempB3.x;
                                                        sum0.y += tempA0.x * tempB0.y + tempA0.y * tempB1.y + tempA0.z * tempB2.y + tempA0.w * tempB3.y;
                                                        sum0.z += tempA0.x * tempB0.z + tempA0.y * tempB1.z + tempA0.z * tempB2.z + tempA0.w * tempB3.z;
                                                        sum0.w += tempA0.x * tempB0.w + tempA0.y * tempB1.w + tempA0.z * tempB2.w + tempA0.w * tempB3.w;

                                                        sum1.x += tempA1.x * tempB0.x + tempA1.y * tempB1.x + tempA1.z * tempB2.x + tempA1.w * tempB3.x;
                                                        sum1.y += tempA1.x * tempB0.y + tempA1.y * tempB1.y + tempA1.z * tempB2.y + tempA1.w * tempB3.y;
                                                        sum1.z += tempA1.x * tempB0.z + tempA1.y * tempB1.z + tempA1.z * tempB2.z + tempA1.w * tempB3.z;
                                                        sum1.w += tempA1.x * tempB0.w + tempA1.y * tempB1.w + tempA1.z * tempB2.w + tempA1.w * tempB3.w;

                                                        sum2.x += tempA2.x * tempB0.x + tempA2.y * tempB1.x + tempA2.z * tempB2.x + tempA2.w * tempB3.x;
                                                        sum2.y += tempA2.x * tempB0.y + tempA2.y * tempB1.y + tempA2.z * tempB2.y + tempA2.w * tempB3.y;
                                                        sum2.z += tempA2.x * tempB0.z + tempA2.y * tempB1.z + tempA2.z * tempB2.z + tempA2.w * tempB3.z;
                                                        sum2.w += tempA2.x * tempB0.w + tempA2.y * tempB1.w + tempA2.z * tempB2.w + tempA2.w * tempB3.w;

                                                        sum3.x += tempA3.x * tempB0.x + tempA3.y * tempB1.x + tempA3.z * tempB2.x + tempA3.w * tempB3.x;
                                                        sum3.y += tempA3.x * tempB0.y + tempA3.y * tempB1.y + tempA3.z * tempB2.y + tempA3.w * tempB3.y;
                                                        sum3.z += tempA3.x * tempB0.z + tempA3.y * tempB1.z + tempA3.z * tempB2.z + tempA3.w * tempB3.z;
                                                        sum3.w += tempA3.x * tempB0.w + tempA3.y * tempB1.w + tempA3.z * tempB2.w + tempA3.w * tempB3.w;
                                                    }

                                                    write_imagef(matrixC, (int2)(pos.x, pos.y * 4), sum0);
                                                    write_imagef(matrixC, (int2)(pos.x, pos.y * 4 + 1), sum1);
                                                    write_imagef(matrixC, (int2)(pos.x, pos.y * 4 + 2), sum2);
                                                    write_imagef(matrixC, (int2)(pos.x, pos.y * 4 + 3), sum3);
                                                }

                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                    n0thing

                                                    Update - the 8x4 kernel uses 36 registers, so using a 256 work-group size resulted in only 4 active wavefronts; reducing the group size to 64 gives 6 active wavefronts, hence performance increases a bit.

                                                    Performance on an overclocked 5850:

                                                    Executing kernel for 5 iterations
                                                    -------------------------------------------
                                                    KernelTime (ms) : 51.9327
                                                    GFlops achieved : 1261.94

                                                    KernelTime (ms) : 52.0417
                                                    GFlops achieved : 1259.3

                                                    KernelTime (ms) : 52.0071
                                                    GFlops achieved : 1260.14

                                                    KernelTime (ms) : 52.0453
                                                    GFlops achieved : 1259.21

                                                    KernelTime (ms) : 52.1003
                                                    GFlops achieved : 1257.88

                                                    MatrixA MatrixB Time(sec) KernelTime(sec)
                                                    3200x3200 3200x3200 0.861197 0.152418
                                                    Press any key to continue . . .

                                                    #define TILEX 4
                                                    #define TILEX_SHIFT 2
                                                    #define TILEY 8
                                                    #define TILEY_SHIFT 3

                                                    /* imageSampler is the same declaration as in mmmKernel above */
                                                    __constant sampler_t imageSampler = CLK_NORMALIZED_COORDS_FALSE |
                                                                                        CLK_ADDRESS_CLAMP |
                                                                                        CLK_FILTER_NEAREST;

                                                    __kernel void mmmKernel2(__read_only image2d_t matrixA,
                                                                             __read_only image2d_t matrixB,
                                                                             __write_only image2d_t matrixC,
                                                                             uint widthA, uint widthB)
                                                    {
                                                        int2 pos = (int2)(get_global_id(0), get_global_id(1));
                                                        float4 sum0 = (float4)(0);
                                                        float4 sum1 = (float4)(0);
                                                        float4 sum2 = (float4)(0);
                                                        float4 sum3 = (float4)(0);
                                                        float4 sum4 = (float4)(0);
                                                        float4 sum5 = (float4)(0);
                                                        float4 sum6 = (float4)(0);
                                                        float4 sum7 = (float4)(0);

                                                        widthB /= 4;

                                                        for(int i = 0; i < widthA; i = i + 4)
                                                        {
                                                            float4 tempA0 = read_imagef(matrixA, imageSampler, (int2)(i/4, pos.y << TILEY_SHIFT));
                                                            float4 tempA1 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 1));
                                                            float4 tempA2 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 2));
                                                            float4 tempA3 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 3));
                                                            float4 tempA4 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 4));
                                                            float4 tempA5 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 5));
                                                            float4 tempA6 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 6));
                                                            float4 tempA7 = read_imagef(matrixA, imageSampler, (int2)(i/4, (pos.y << TILEY_SHIFT) + 7));

                                                            float4 tempB0 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i));
                                                            float4 tempB1 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 1));
                                                            float4 tempB2 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 2));
                                                            float4 tempB3 = read_imagef(matrixB, imageSampler, (int2)(pos.x, i + 3));

                                                            sum0.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x + tempA0.w * tempB3.x;
                                                            sum0.y += tempA0.x * tempB0.y + tempA0.y * tempB1.y + tempA0.z * tempB2.y + tempA0.w * tempB3.y;
                                                            sum0.z += tempA0.x * tempB0.z + tempA0.y * tempB1.z + tempA0.z * tempB2.z + tempA0.w * tempB3.z;
                                                            sum0.w += tempA0.x * tempB0.w + tempA0.y * tempB1.w + tempA0.z * tempB2.w + tempA0.w * tempB3.w;

                                                            sum1.x += tempA1.x * tempB0.x + tempA1.y * tempB1.x + tempA1.z * tempB2.x + tempA1.w * tempB3.x;
                                                            sum1.y += tempA1.x * tempB0.y + tempA1.y * tempB1.y + tempA1.z * tempB2.y + tempA1.w * tempB3.y;
                                                            sum1.z += tempA1.x * tempB0.z + tempA1.y * tempB1.z + tempA1.z * tempB2.z + tempA1.w * tempB3.z;
                                                            sum1.w += tempA1.x * tempB0.w + tempA1.y * tempB1.w + tempA1.z * tempB2.w + tempA1.w * tempB3.w;

                                                            sum2.x += tempA2.x * tempB0.x + tempA2.y * tempB1.x + tempA2.z * tempB2.x + tempA2.w * tempB3.x;
                                                            sum2.y += tempA2.x * tempB0.y + tempA2.y * tempB1.y + tempA2.z * tempB2.y + tempA2.w * tempB3.y;
                                                            sum2.z += tempA2.x * tempB0.z + tempA2.y * tempB1.z + tempA2.z * tempB2.z + tempA2.w * tempB3.z;
                                                            sum2.w += tempA2.x * tempB0.w + tempA2.y * tempB1.w + tempA2.z * tempB2.w + tempA2.w * tempB3.w;

                                                            sum3.x += tempA3.x * tempB0.x + tempA3.y * tempB1.x + tempA3.z * tempB2.x + tempA3.w * tempB3.x;
                                                            sum3.y += tempA3.x * tempB0.y + tempA3.y * tempB1.y + tempA3.z * tempB2.y + tempA3.w * tempB3.y;
                                                            sum3.z += tempA3.x * tempB0.z + tempA3.y * tempB1.z + tempA3.z * tempB2.z + tempA3.w * tempB3.z;
                                                            sum3.w += tempA3.x * tempB0.w + tempA3.y * tempB1.w + tempA3.z * tempB2.w + tempA3.w * tempB3.w;

                                                            sum4.x += tempA4.x * tempB0.x + tempA4.y * tempB1.x + tempA4.z * tempB2.x + tempA4.w * tempB3.x;
                                                            sum4.y += tempA4.x * tempB0.y + tempA4.y * tempB1.y + tempA4.z * tempB2.y + tempA4.w * tempB3.y;
                                                            sum4.z += tempA4.x * tempB0.z + tempA4.y * tempB1.z + tempA4.z * tempB2.z + tempA4.w * tempB3.z;
                                                            sum4.w += tempA4.x * tempB0.w + tempA4.y * tempB1.w + tempA4.z * tempB2.w + tempA4.w * tempB3.w;

                                                            sum5.x += tempA5.x * tempB0.x + tempA5.y * tempB1.x + tempA5.z * tempB2.x + tempA5.w * tempB3.x;
                                                            sum5.y += tempA5.x * tempB0.y + tempA5.y * tempB1.y + tempA5.z * tempB2.y + tempA5.w * tempB3.y;
                                                            sum5.z += tempA5.x * tempB0.z + tempA5.y * tempB1.z + tempA5.z * tempB2.z + tempA5.w * tempB3.z;
                                                            sum5.w += tempA5.x * tempB0.w + tempA5.y * tempB1.w + tempA5.z * tempB2.w + tempA5.w * tempB3.w;

                                                            sum6.x += tempA6.x * tempB0.x + tempA6.y * tempB1.x + tempA6.z * tempB2.x + tempA6.w * tempB3.x;
                                                            sum6.y += tempA6.x * tempB0.y + tempA6.y * tempB1.y + tempA6.z * tempB2.y + tempA6.w * tempB3.y;
                                                            sum6.z += tempA6.x * tempB0.z + tempA6.y * tempB1.z + tempA6.z * tempB2.z + tempA6.w * tempB3.z;
                                                            sum6.w += tempA6.x * tempB0.w + tempA6.y * tempB1.w + tempA6.z * tempB2.w + tempA6.w * tempB3.w;

                                                            sum7.x += tempA7.x * tempB0.x + tempA7.y * tempB1.x + tempA7.z * tempB2.x + tempA7.w * tempB3.x;
                                                            sum7.y += tempA7.x * tempB0.y + tempA7.y * tempB1.y + tempA7.z * tempB2.y + tempA7.w * tempB3.y;
                                                            sum7.z += tempA7.x * tempB0.z + tempA7.y * tempB1.z + tempA7.z * tempB2.z + tempA7.w * tempB3.z;
                                                            sum7.w += tempA7.x * tempB0.w + tempA7.y * tempB1.w + tempA7.z * tempB2.w + tempA7.w * tempB3.w;
                                                        }

                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8), sum0);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 1), sum1);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 2), sum2);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 3), sum3);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 4), sum4);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 5), sum5);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 6), sum6);
                                                        write_imagef(matrixC, (int2)(pos.x, pos.y * 8 + 7), sum7);
                                                    }

                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                        cjang

                                                        Your matrix multiply kernel utilizes the texture cache well (at least better than my kernels did). After changing mine to follow your example, performance jumps by 33%, from around 720 GFLOPS to 966 GFLOPS. Changing the output matrix C from a buffer to an image adds another 3%, to 980 GFLOPS. Thank you very much!

                                                        http://golem5.org/bucket/gatlas/goodbadmatmult.png

                                                        I am using a stock HD 5870. And I think your kernels are "tighter", as they are written for a specific matrix blocking and data layout - you have less miscellaneous boilerplate arithmetic. So our numbers are consistent, as your HD 5850 is overclocked roughly 25% on the core and 30% on the memory.

                                                        The documentation says the texture cache has 2D locality. I didn't really know what this meant until now: there is obviously an underlying texture cache structure of quads in squares. For the benefit of anyone else, here's what I changed to get this jump in performance. For many with experience, this is probably obvious; it wasn't for me. (Note that the code below is semi-pseudocode for clarity.)

                                                        The outermost inner product loop is: for (int idx = 0; idx < matrixSize / VECTOR_LENGTH; idx++), so idx traverses over blocks of A and B.

                                                        How it was before, when it was slow:

                                                        valB[j] = read_imagef(matB, sampler, (int2)(idx, VECTOR_LENGTH * (localSize() * blockCol + col) + j));

                                                        How it is after, when it is fast:

                                                        valB[j] = read_imagef(matB, sampler, (int2)(localSize() * blockCol + col, VECTOR_LENGTH * idx + j));

                                                        This matches n0thing's traversal of matrix B. Also note how the traversal of matrix A is transposed. If the texture for A is changed to match the same pattern as B, performance drops.

                                                        One thing that may help you push performance higher: I've noticed the fastest kernels have a blocking of 5x4. However, this is for the HD 5870, which has a different core configuration than the HD 5850. What I mean is that a non-even quad blocking may be optimal for you. This is also an example of why I think auto-tuning and empirical optimization is very useful - it can find these unusual kernel solutions.

                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                            n0thing

                                                            Nice to see a performance jump there!

                                                            I haven't really benchmarked the texture cache yet, but since the cache line is 256-bit (from the Beyond3D architectural review) and it is 2D, when you read a vec4 the cache line would also contain the vec4 in the vertical direction (above or below it).

                                                            This gives a good texture hit rate when the access pattern is like matrix A's. I didn't transpose matrix B because the linear read pattern gives good global memory bandwidth.

                                                            I will also try the 5x4 and 9x4 blocking sizes, as they should result in the same number of active wavefronts as the 4x4 and 8x4 sizes, assuming this is correct:

                                                            registers per thread x (group size / 64) x number of groups <= 240
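
                                                            Assuming that inequality ( with 240 usable registers per SIMD and 64 threads per wavefront ), a little arithmetic reproduces the wavefront counts quoted earlier:

                                                            /* Rule-of-thumb occupancy from the inequality above (assumed figures:
                                                               240 usable registers per SIMD, 64 threads per wavefront). */
                                                            int activeWavefronts(int regsPerThread, int groupSize)
                                                            {
                                                                int wavesPerGroup = groupSize / 64;
                                                                int maxGroups     = 240 / (regsPerThread * wavesPerGroup);
                                                                return maxGroups * wavesPerGroup;
                                                            }
                                                            /* n0thing's 8x4 kernel (36 registers):
                                                                 activeWavefronts(36, 256) == 4   (one 4-wavefront group fits)
                                                                 activeWavefronts(36,  64) == 6   (six 1-wavefront groups fit) */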



                                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                cjang

                                                                I am noticing that very subtle code changes significantly affect performance.

                                                                After I changed to use your texture-cache-friendly inner block traversal of matrix B, I noticed that the squares of quads are of course transposed: I am reading down columns instead of across rows. So I changed the inner product accumulation loop to match, and performance increases by 5 GFLOPS. For a dual image version, performance increases by 25 GFLOPS (now up to 1045 GFLOPS on my HD 5870).

                                                                Then in the CAL++/prunedtree matrixmult.cpp example, I reversed the order of nested for loops for the inner product to gain 50 GFLOPS at 2048x2048 and 130 GFLOPS at 5120x5120.

                                                                This gives: "Device 0: execution time 1114.64 ms, achieved 1541.29 gflops"

                                                                for (i = 0; i < BY4; i++) {
                                                                    for (j = 0; j < BX4; j++) {
                                                                        // lots of MADs
                                                                    }
                                                                }

                                                                This gives: "Device 0: execution time 1081.06 ms, achieved 1589.17 gflops"

                                                                for (j = 0; j < BX4; j++) {
                                                                    for (i = 0; i < BY4; i++) {
                                                                        // lots of MADs
                                                                    }
                                                                }

                                                                I've seen other more counter-intuitive effects on performance from using the global ID versus group/local IDs. Performance changes in unexpected ways. It's very difficult to predict what will happen with any given change.

                                                                This makes me think that kernel designs should be represented as game trees. Then the combinatorial search over the design space can be a recursive evaluation of variations, kind of like computer chess. The design space is just too strange and unknown to find optimal kernels by hand from first principles.

                                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                    esotericeric

                                                                    Hi folks,

                                                                    Thanks for a really interesting discussion. I'd like to follow up with a question (hopefully someone can help answer). I've mostly studied Nvidia implementations of SGEMM, and was really impressed to see the kind of performance numbers you guys are getting.

                                                                    One concern raised earlier by hazeman was the specific memory layout requirements (e.g., breaking the input matrices into even and odd columns) and the mixed row/column formats. In terms of reshuffling memory to do this, how does that factor into your performance results?

                                                                    My follow-up question is: are there any SGEMM implementations that handle the standard row-based BLAS format (i.e., input and output matrices are in row format)? And if so, does anyone know of the best implementation, and how does its performance compare?

                                                                    Thanks!

                                                                     

                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                        n0thing

                                                                        According to cjang's results, having a mixed row/column format (the dual-images version) gives about a 20% performance advantage over the single-images version.

                                                                        The kernels I posted above use the standard row-based format. I get around 1260 GFlops on an overclocked Radeon 5850 with the 8x4 blocking kernel in OpenCL.

                                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                            cjang

                                                                            The advantage of unusual data layouts has diminished as optimization has improved. Any advantage is either confined to certain matrix sizes (by moving to another size, the advantage disappears) or not enough to justify the complexity (the performance gain may be larger in IL/ISA, but in OpenCL it is not very much). So I've abandoned tiled and even/odd row data layouts. I'm only working with row/column major layout now.

                                                                            The kernels support transposing A and B and have been tested with random data against a simple for-loop CPU implementation. They also support rectangular M/N/K matrix dimensions. However, I've not observed performance to increase by moving away from square shapes. I'm pretty confident in algorithmic correctness (round-off numerical error and stability is another issue, especially in single precision). So this is good for general SGEMM now and does not impose any awkward data layout constraints on the input matrices.

                                                                            Below are the top three kernel specializations found for un-transposed A and B, for memory buffer and image based matrix multiply kernels at 4800x4800 (ten trials).

                                                                            ./bmm_buffer -d gpu -n 4800 -t 10 -x 10
                                                                            [0] 4603984 usec avg: 480.394 stddev: 3.50936 4800 4800 4800 8 10 10 inlineMNK klj
                                                                            [1] 4689310 usec avg: 471.629 stddev: 0.589461 4800 4800 4800 8 5 8 inlineMNK ljk
                                                                            [2] 4742431 usec avg: 466.346 stddev: 0.542847 4800 4800 4800 8 5 10 inlineMNK klj

                                                                            ./bmm_image -d gpu -n 4800 -t 10 -x 10
                                                                            [0] 2125086 usec avg: 1040.72 stddev: 0.343981 4800 4800 4800 16 5 18 inlineMNK jkl group/local
                                                                            [1] 2145931 usec avg: 1030.61 stddev: 0.514469 4800 4800 4800 8 6 2 argsMNK ljk global
                                                                            [2] 2146057 usec avg: 1030.55 stddev: 0.405767 4800 4800 4800 8 6 6 argsMNK jkl group/local

                                                                            Performance depends on whether A and B are transposed. Here are the top three for a memory buffer kernel at 4800x4800 when B is transposed.

                                                                            ./bmm_buffer -d gpu -n 4800 -t 10 -x 10 -b
                                                                            [0] 4518746 usec avg: 489.453 stddev: 3.33775 4800 4800 4800 8 10 2 argsMNK ljk
                                                                            [1] 4550961 usec avg: 485.974 stddev: 2.0284 4800 4800 4800 8 10 5 argsMNK lkj
                                                                            [2] 4650525 usec avg: 475.563 stddev: 0.767519 4800 4800 4800 8 5 8 inlineMNK ljk

                                                                            So memory buffers are a little faster when B is transposed.

                                                                            The kernels are parameterized in different ways, including work group size, inner blocking height, inner product accumulation loop order, and matrix dimension inlining. For given M/N/K matrix dimensions, kernels are specialized over many combinations and individually timed. There is some simple pruning in the combinatorial search, so this takes a reasonable amount of time.
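
                                                                            Read as a sketch, each result line above is one point in a parameter space roughly like the following ( the field meanings are my interpretation of the output format, so treat them as assumptions ):

                                                                            /* Hypothetical decoding of a result line such as
                                                                               "... 4800 4800 4800 8 10 10 inlineMNK klj". */
                                                                            struct Specialization {
                                                                                int  M, N, K;        /* matrix dimensions                         */
                                                                                int  groupSize;      /* work group edge, e.g. 8                   */
                                                                                int  blockHeight;    /* inner blocking height, e.g. 10 ( 10x4 )   */
                                                                                int  extraParam;     /* third numeric field                       */
                                                                                int  inlineMNK;      /* 1: dimensions compiled in, 0: kernel args */
                                                                                char loopOrder[4];   /* inner product loop order, e.g. "klj"      */
                                                                            };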

                                                                            It's still not ready yet. A small command line application does the search right now. It needs to be in a re-usable component and more "directional" to increase speed. That way, it will be cheap enough to (sort of) run on demand instead of requiring hours of GPU time to find good kernels (although map-reduce over a cluster or fleet would alleviate this).

                                                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                cjang

                                                                                Here are some more complete charts of matrix multiply performance and solutions found by the auto-tuning. They include variants with transposed matrices for both memory buffer and image based kernel implementations.

                                                                                http://golem5.org/bucket/gatlas/

                                                                                I am packaging what I have into an initial release. There is still some work to do in that regard. I do not wish to dump a mess of source code and be inconsiderate. I will try and hurry up.

                                                                                One question I have, if anyone has strong opinions... where would automatic tuning and optimization of GPU kernels have the most benefit for applications today? The obvious areas involve signal and image processing. In past experience, I saw others invest substantial effort in machine learning to design classifiers. This sort of thing often becomes compute bound. I am quite new to the GPGPU world and still trying to grasp the scope of what is possible.

                                                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                    hazeman

                                                                                    Looking at your results I have one comment.

                                                                                    The final implementation of (A^T)*(B^T) should have almost identical speed to A*B. Remember that (A^T)*(B^T) = (B*A)^T, and transposing the output of the multiplication kernel is a rather fast operation ( of course I'm not talking about running a separate kernel to do the transposition ).

                                                                                    The same goes for A*B^T and A^T*B.
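
                                                                                    A deliberately naive OpenCL illustration of the identity ( nothing like the fast kernels above, just the index bookkeeping ): compute p = (B*A)[i][j] and store it at the transposed position, so the buffer ends up holding (B*A)^T = (A^T)*(B^T) with no extra pass:

                                                                                    /* Naive sketch ( illustrative names ): each work item computes one
                                                                                       element p of P = B*A and stores it transposed, so C = P^T = (A^T)*(B^T). */
                                                                                    __kernel void mulStoreTransposed(__global const float *A,
                                                                                                                     __global const float *B,
                                                                                                                     __global float *C, uint n)
                                                                                    {
                                                                                        uint j = get_global_id(0);   /* column of P */
                                                                                        uint i = get_global_id(1);   /* row of P    */
                                                                                        float p = 0.0f;
                                                                                        for (uint k = 0; k < n; ++k)
                                                                                            p += B[i * n + k] * A[k * n + j];
                                                                                        C[j * n + i] = p;            /* transposed store, no extra pass */
                                                                                    }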

                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                        cjang

                                                                                        Yes, you are quite right of course. You've given me something else to think about.

                                                                                        The results shown are literally op(A) * op(B) without any algebraic transformations. My thought was that the user will necessarily work out the linear algebra and design the calculation to be efficient, which would subsume any algebraic transformations of this kind. I took the easy way out instead of trying to make the kernel smarter.

                                                                                        Another reason I didn't go further is that the fastest kernels often use inner blocking with rectangles that are non-even quads on a side (e.g. 5x4, 6x4, 10x4). The kernel implementation (which you have not seen, as the source code is unreleased as of yet) accesses memory as float4s. If the inner blocks have to transpose, this becomes very awkward because they no longer fit evenly into the float4 vector elements.

                                                                                        What I hadn't thought of before is that the performance gain from a faster matrix multiply will very likely outweigh any penalty from writing back to global memory in a non-optimal way. The dominant cost is the block inner product; the write to output is less. So as you say, speed should be very close.

                                                                                        Oh another thing - I did have the thought to implement this same kind of auto-tuned optimization with CAL++. I would expect performance to jump by several hundred GFLOPS.

                                                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                            hazeman

                                                                                            There is one more issue which you are probably not considering. I see in n0thing's code that the block position is simply (get_global_id(0), get_global_id(1)).

                                                                                            This isn't the best solution for reusing the cache between GPU threads.

                                                                                            It looks like the CAL++ example takes the same approach, but that isn't the case: in pixel shader mode the GPU hardware assigns positions to threads in tiled order. Whether this or some other method is best for an OpenCL kernel is left for testing.

                                                                                            There is also one important issue involved. There is a total lack of information about this, so parts of what follows are my guesses. A CAL buffer ( image ) can have 2 formats - tiled & linear ( flag CAL_RESALLOC_GLOBAL_BUFFER ). This impacts which points are near each other in the cache. The TU reads 64 contiguous "points" from memory. In tiled format this means that 4 neighbours ( with the exception of edges ) are in the cache ( x-1, x+1, y-1, y+1 ). But for linear format only the ( x-1, x+1 ) neighbours are in the cache.

                                                                                            Of course this cache behaviour has a huge impact on the design of the multiplication kernel.

                                                                                            Now the question is which format OpenCL uses. As far as I know, IL doesn't allow scatter writes to a tiled buffer. This would imply that OpenCL is using the linear format. But maybe there is some new unpublished instruction available.

                                                                                            Maybe someone could post here how write_imagef is translated into IL code.

                                                                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                nou

                                                                                                 

                                                                                                 

                                                                                                __kernel void image_write2(__read_only image2d_t read, __write_only image2d_t img)
                                                                                                {
                                                                                                    sampler_t sampl = CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_NONE | CLK_FILTER_NEAREST;
                                                                                                    size_t x = get_global_id(0);
                                                                                                    size_t y = get_global_id(1);
                                                                                                    float4 c = read_imagef(read, sampl, (int2)(x, y));
                                                                                                    write_imagef(img, (int2)(x, y), (float4)(x/256.0f, y/256.0f, x/256.0f, y/256.0f)); // this works
                                                                                                }

                                                                                                mdef(46)_out(1)_in(2)
                                                                                                mov r0, in0
                                                                                                mov r1, in1
                                                                                                dcl_literal l1, 0x7f800000, 0x7f800000, 0x807fffff, 0x807fffff
                                                                                                dcl_literal l2, 0x7f800000, 0x7f800000, 0, 0
                                                                                                dcl_literal l3, 0x80000000, 0x80000000, 0x80000000, 0x80000000
                                                                                                dcl_literal l4, 0x3f800000, 0x3f800000, 0, 0
                                                                                                dcl_literal l5, 0, 0, 0, 0
                                                                                                dcl_literal l6, 0x7fffffff, 0x80000000, 0x7fffffff, 0x80000000
                                                                                                dcl_literal l7, 0x00800000, 0x00800000, 0x00800000, 0x00800000
                                                                                                dcl_literal l8, 0x00000017, 0x00000017, 0x00000017, 0x00000017
                                                                                                dcl_literal l9, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff
                                                                                                mov r2.x___, r0.x
                                                                                                mov r2._y__, r1.x
                                                                                                and r3.xyzw, r2.xyxy, l1
                                                                                                ieq r4.xyzw, r3.xyxy, l2
                                                                                                and r5.xy__, r2.xy, l3
                                                                                                ior r3.__zw, r3.zwzw, l4.xyxy
                                                                                                cmov_logical r3.__zw, r4.zwzw, r5.xyxy, r3.zwzw
                                                                                                cmov_logical r3.__zw, r4.xyxy, r2.xyxy, r3.zwzw
                                                                                                ior r5.xy__, r4.xz, r4.yw
                                                                                                ior r5.x___, r5.x, r5.y
                                                                                                inegate r5.__z_, r3.yyyy
                                                                                                iadd r3.x___, r3.x, r5.z
                                                                                                cmov_logical r3.x___, r5.xxxx, l5, r3.xxxx
                                                                                                rcp_zeroop(infinity) r2._y__, r3.ww
                                                                                                mul_ieee r2.x___, r3.z, r2.y
                                                                                                and r2.__zw, r2.xxxx, l6.xyzw
                                                                                                ishr r6.x___, r2.z, l8
                                                                                                ishr r6._y__, r3.xxxx, l8
                                                                                                iadd r2.xy__, r2.xzxz, r3.xxxx
                                                                                                iadd r6.x___, r6.x, r6.y
                                                                                                ige r4.__z_, l5, r6.x
                                                                                                ior r4._y__, r2.wwww, l1
                                                                                                ige r4.x, r6.x, l9
                                                                                                cmov_logical r4.x, r5.x, l5, r4.x
                                                                                                cmov_logical r2.x, r4.z, r2.w, r2.x
                                                                                                cmov_logical r0.x, r4.x, r4.y, r2.x
                                                                                                mov out0, r0
                                                                                                mend

                                                                                                il_cs_2_0
                                                                                                dcl_cb cb0[9] ; Constant buffer that holds ABI data
                                                                                                dcl_literal l0, 4, 1, 2, 3
                                                                                                dcl_literal l1, 0x00FFFFFF, -1, -2, -3
                                                                                                dcl_literal l2, 0x0000FFFF, 0xFFFFFFFE, 0x000000FF, 0xFFFFFFFC
                                                                                                dcl_literal l3, 24, 16, 8, 0xFFFFFFFF
                                                                                                dcl_literal l4, 0xFFFFFF00, 0xFFFF0000, 0xFF00FFFF, 0xFFFF00FF
                                                                                                dcl_literal l5, 0, 4, 8, 12
                                                                                                mov r769, cb0[8].x
                                                                                                call 1202 ;$
                                                                                                endmain

                                                                                                func 1202 ; __OpenCL_image_write2_kernel
                                                                                                dcl_max_thread_per_group 256
                                                                                                mov r770, l1.0
                                                                                                dcl_literal l6, 0x43800000, 0x43800000, 0x43800000, 0x43800000 ; float: 2.560000e+02
                                                                                                dcl_literal l7, 0x00000000, 0x00000000, 0x00000000, 0x00000000 ; float: 0.000000e+00
                                                                                                dcl_literal l8, 0x00000018, 0x00000018, 0x00000018, 0x00000018 ; int: 24
                                                                                                dcl_literal l9, 0x00000000, 0x00000000, 0x00000000, 0x00000000 ; int: 0
                                                                                                dcl_resource_id(0)_type(2d)_fmtx(unknown)_fmty(unknown)_fmtz(unknown)_fmtw(unknown)
                                                                                                dcl_uav_id(2)_type(2d)_fmtx(uint)
                                                                                                dcl_raw_uav_id(1)
                                                                                                dcl_arena_uav_id(0)
                                                                                                mov r0.z, vThreadGrpIdFlat.x
                                                                                                mov r1022.xyz0, vTidInGrp.xyz
                                                                                                mov r1023.xyz0, vThreadGrpId.xyz
                                                                                                imad r1021.xyz0, r1023.xyz0, cb0[1].xyz0, r1022.xyz0
                                                                                                iadd r1021.xyz0, r1021.xyz0, cb0[6].xyz0
                                                                                                iadd r1023.xyz0, r1023.xyz0, cb0[7].xyz0
                                                                                                mov r1023.w, r0.z
                                                                                                ishl r1023.w, r1023.w, l0.z
                                                                                                imad r771.x, vAbsTidFlat.x, cb0[3].y, cb0[3].x
                                                                                                dcl_literal l10, 0x00000005, 0x00000005, 0x00000005, 0x00000005 ; int: 5
                                                                                                dcl_literal l11, 0x00000010, 0x00000010, 0x00000010, 0x00000010 ; int: 16
                                                                                                dcl_literal l12, 0x00000004, 0x00000004, 0x00000004, 0x00000004 ; int: 4
                                                                                                dcl_cb cb1[4]
                                                                                                ; Kernel arg setup: read
                                                                                                mov r1, cb1[0]
                                                                                                ; Kernel arg setup: img
                                                                                                mov r2, cb1[2]
                                                                                                call 1206 ; image_write2
                                                                                                ret
                                                                                                endfunc ; __OpenCL_image_write2_kernel
                                                                                                ;ARGSTART:__OpenCL_image_write2_kernel
                                                                                                ;version:1:2:44
                                                                                                ;uniqueid:1202
                                                                                                ;memory:private:0
                                                                                                ;memory:hwlocal:0
                                                                                                ;image:read:2D:RO:0:1:0
                                                                                                ;image:img:2D:WO:2:1:32
                                                                                                ;function:1:1206
                                                                                                ;intrinsic:0
                                                                                                ;sampler:unknown_24:0:1:24
                                                                                                ;ARGEND:__OpenCL_image_write2_kernel

                                                                                                func 1206 ; image_write2
                                                                                                mov r176.x___, r2.xxxx
                                                                                                mov r177.x___, r1.xxxx
                                                                                                call 1111 ; __amdil_get_global_id_int
                                                                                                mov r178, r1
                                                                                                mov r179, l6
                                                                                                mov r178, r178.x000
                                                                                                call 1111 ; __amdil_get_global_id_int
                                                                                                mov r181, r1
                                                                                                itof r182.x___, r178.xxxx
                                                                                                mov r181, r181.y000
                                                                                                utof r183.x___, r178.xxxx
                                                                                                mov r182, r182.xxxx
                                                                                                mov r184, l7
                                                                                                utof r185.x___, r181.xxxx
                                                                                                mov r0.___w, r184.x
                                                                                                mov r0.xyz_, r182
                                                                                                mov r182, r0
                                                                                                ;__fdiv_f32
                                                                                                mcall(46) (r183), (r183, r179)
                                                                                                ;__fdiv_f32
                                                                                                mcall(46) (r179), (r185, r179)
                                                                                                mov r185, r183.xxxx
                                                                                                itof r186.x___, r181.xxxx
                                                                                                mov r0.__z_, r184.x
                                                                                                mov r0.xy_w, r182
                                                                                                mov r182, r0
                                                                                                mov r0.___w, r179.x
                                                                                                mov r0.xyz_, r185
                                                                                                mov r184, r0
                                                                                                mov r0._y__, r186.x
                                                                                                mov r0.x_zw, r182
                                                                                                mov r182, r0
                                                                                                mov r185, l8
                                                                                                mov r186, l9
                                                                                                mov r0.__z_, r183.x
                                                                                                mov r0.xy_w, r184
                                                                                                mov r183, r0
                                                                                                mov r1.x___, r177.xxxx
                                                                                                mov r2.x___, r185.xxxx
                                                                                                mov r113, r182
                                                                                                mov r224.x___, r186.xxxx
                                                                                                sample_resource(0)_sampler(0)_coordtype(normalized) r1, r113 ; __amdil_sample_data_0_0_0_1
                                                                                                mov r0._y__, r179.x
                                                                                                mov r0.x_zw, r183
                                                                                                mov r177, r0
                                                                                                mov r178, r178.xxxx
                                                                                                mov r177, r177
                                                                                                mov r0._y__, r181.x
                                                                                                mov r0.x_zw, r178
                                                                                                mov r178, r0
                                                                                                mov r1.x___, r176.xxxx
                                                                                                mov r2.xy__, r178.xyxy
                                                                                                mov r113, r177
                                                                                                uav_store_id(2) r2.xy, r113
                                                                                                ret
                                                                                                endfunc ; image_write2
                                                                                                ;ARGSTART:image_write2
                                                                                                ;uniqueid:1206
                                                                                                ;memory:private:0
                                                                                                ;memory:hwlocal:0
                                                                                                ;function:1:1111
                                                                                                ;intrinsic:1:46
                                                                                                ;ARGEND:image_write2

                                                                                                func 1078 ; Store32BitsUAV
                                                                                                uav_raw_store_id(1) mem0.x___, r1.x, r2
                                                                                                ret
                                                                                                endfunc ; Store32BitsUAV

                                                                                                func 1111 ; getGlobalIDInt
                                                                                                mov r1, r1021.xyz0
                                                                                                ret
                                                                                                endfunc ; getGlobalIDInt
                                                                                                end

                                                                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                    hazeman

                                                                                                So OpenCL uses a new instruction to write to the image ( uav_store with a 2D format ) - but it doesn't answer whether the image has a linear or tiled layout.

                                                                                                There are two options to verify it. The first is to run a CAL kernel with uav_store and test whether it allows a tiled layout. The second is to catch the calls into the CAL library and check the flags passed to the resalloc function ( see the sketch below ).

                                                                                                <joke on> Of course, we could simply ask someone from ATI for an answer <joke off>
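                                                                                                For the second option, a minimal interposer sketch in C++. This is hedged: the calResAllocLocal2D signature below is written from memory of cal.h and should be checked against the real header before use; the opaque pointer types are stand-ins for the CAL handle types.

// Interposer sketch: log the flags the OpenCL runtime passes to the CAL
// resource allocator. ASSUMPTION: parameter order/types of
// calResAllocLocal2D are from memory - verify against cal.h.
#include <dlfcn.h>
#include <cstdio>

typedef unsigned int CALuint;
typedef int CALresult;

// Opaque pointers stand in for CALresource*/CALdevice; we only pass through.
typedef CALresult (*resalloc_fn)(void*, void*, CALuint, CALuint, int, CALuint);

extern "C" CALresult calResAllocLocal2D(void* res, void* dev, CALuint width,
                                        CALuint height, int format,
                                        CALuint flags) {
    static resalloc_fn real = 0;
    if (!real)
        real = (resalloc_fn)dlsym(RTLD_NEXT, "calResAllocLocal2D");
    // A linear vs. tiled choice should show up in these flags
    // (e.g. the global-buffer flag forces the linear path).
    std::fprintf(stderr, "calResAllocLocal2D %ux%u fmt=%d flags=0x%x\n",
                 width, height, format, flags);
    return real(res, dev, width, height, format, flags);
}

                                                                                                Build with g++ -shared -fPIC -o libspy.so spy.cpp -ldl, then run the OpenCL program with LD_PRELOAD=./libspy.so and watch stderr.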

                                                                                                     

                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                        cjang

                                                                                                        I've noticed the image based matrix multiply kernels seem
                                                                                                        to fall into one of three classes by performance. What you
                                                                                                        are saying is that we may be missing a fourth class that
                                                                                                        is even faster?

                                                                                                        1. ???
                                                                                                        2. 1000 GFLOPS for A * B and At * B
                                                                                                        3. 700 GFLOPS for A * Bt
                                                                                                        4. 400 GFLOPS for At * Bt

                                                                                                        I wouldn't be surprised if this is possible. I admit to
                                                                                                        not understanding how the tiling and texture cache work.
                                                                                                        From my point of view, it is (partially automated) trial
                                                                                                        and error.

                                                                                                        It is kind of surprising that the ATI hardware is capable
                                                                                                        of such high performance for math kernels and yet the
                                                                                                        developer is mostly on his or her own. There is so much
                                                                                                        unrealized potential. This would all be good for AMD/ATI
                                                                                                        as well as the developer. Everyone would benefit.

                                                                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                            hazeman

                                                                                                             

                                                                                                            Originally posted by: cjang I've noticed the image based matrix multiply kernels seem to fall into one of three classes by performance. What you are saying is that we may be missing a fourth class that is even faster?


                                                                                                            Yep. A fourth, a fifth, and maybe more. In the language of optimization, I'm simply saying that you are searching too small a space - there are two more variables which have a huge impact on performance.

                                                                                                            The first is the data representation in memory ( linear, tiled, or some other format - though if the computation of the pixel position gets too complicated, it will cost kernel speed ).

                                                                                                            The other "variable" is packing of workitems into groups - this one also have huge impact on reusing data in cache.

                                                                                                             

                                                                                                            It is kind of surprising that the ATI hardware is capable of such high performance for math kernels and yet the developer is mostly on his or her own. There is so much unrealized potential. This would all be good for AMD/ATI as well as the developer. Everyone would benefit.


                                                                                                            ATI's approach to developers is really bad. The documentation is among the worst I've seen: lots of spelling errors, obvious mistakes, and really stupid, wrong descriptions of some instructions. Also, ATI is still in the cold-war age ( those pesky nvidia spies ). They can't give official info on cache, cache latency, cache behaviour, coalescing, etc. And when by some mistake they do answer on the forum, it's wrong or contradicts an answer from another ATI developer.

                                                                • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                  MicahVillmow
                                                                  hazeman,
                                                                  If there are examples of the documentation that you believe need work and are not fixed in our new release, please list them in an email to streamdeveloper@amd.com, CC: Micah, and I'll work on getting them fixed.
                                                                    • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                      cjang

                                                                      Here is what I have.

                                                                      http://github.com/cjang/GATLAS

                                                                      There is also a homepage with some deeper explanation of what I have tried to do so far. For me, this is very much a work in progress and continually evolving. I have tried to keep the code and design clean and useful. However, it is still quite early. I am still trying to understand how all of this fits into the bigger picture.

                                                                      Next up is an optimized SIFT kernel (or, more generally, convolution-based feature transforms at multiple scales) for computer vision applications. This will be fun. And I think it will reveal whether the approach I have taken so far generalizes.

                                                                        • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                          cjang

                                                                          This thread started with a question about SGEMM performance. However, most of the performance figures have been for plain matrix multiplication, not SGEMM. So...

                                                                          This is the best picture I have right now of SGEMM performance. It takes PCIe bus data transfers into account as well. Note that memory is not pinned. (I have not tried specifying pinned memory - I don't even know if that works; see the sketch at the end of this post.)

                                                                          http://golem5.org/gatlas/bench_sgemm/bench_sgemm_pcie.html

                                                                          Note that all data points are from empirical timings.

                                                                          One surprise is that memory buffer kernels are faster than image kernels when writing A and B from host to device. When all matrices are transferred between host and device, a memory buffer kernel just breaks 300 GFLOPS at 5600x5600, while the corresponding image kernel is at about 260 GFLOPS.

                                                                          I also think there is a very natural tendency to overemphasize kernel benchmarks to the exclusion of PCIe bus data transfer costs. Many applications are likely to see much closer performance between IL/ISA and OpenCL because of the I/O bottleneck, so OpenCL is more competitive than synthetic kernel benchmarks suggest.

                                                                          Another obvious (in hindsight) observation is that more memory allows blocking the problem to reduce the effect of PCIe bus data transfer costs. I do not think the 256 MB UAV memory buffer limitation is really an obstacle as long as we can use as much device memory as possible across multiple buffers.
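                                                                          On the pinned-memory question above, a minimal sketch of the usual OpenCL recipe ( untested here; whether the driver actually page-locks a CL_MEM_ALLOC_HOST_PTR allocation is implementation-defined, and the helper name is hypothetical ):

// Pinned-staging sketch: CL_MEM_ALLOC_HOST_PTR asks the driver for
// host-accessible memory, which the runtime may back with page-locked
// pages; DMA from such memory usually runs at full PCIe speed.
#include <CL/cl.h>
#include <cstddef>

float* map_pinned_staging(cl_context ctx, cl_command_queue queue,
                          size_t bytes, cl_mem* staging) {
    cl_int err;
    *staging = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
    if (err != CL_SUCCESS) return NULL;
    // Blocking map returns a host pointer into the (hopefully pinned) buffer;
    // fill it with matrix data, then clEnqueueWriteBuffer to the device buffer.
    return (float*)clEnqueueMapBuffer(queue, *staging, CL_TRUE, CL_MAP_WRITE,
                                      0, bytes, 0, NULL, NULL, &err);
}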

                                                                            • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                              cjang

                                                                              Just an SGEMM performance update with the current SDK and driver. The 5870 now reaches almost exactly 50% utilization at 1375 gigaFLOPS with an image-based kernel. Memory buffer kernels are well over 500 gigaFLOPS.

                                                                              Using ATI SDK 2.1 and Catalyst 10.4 on Ubuntu 9.10 64 bit

                                                                              OpenCL SGEMM benchmarks (gigaFLOPS; @N is the matrix size at the measured peak)

                                                                                               5440     5670     5770     5870
                                                                                               Cedar    Redwood  Juniper  Cypress

                                                                              memory buffer:   @1280    @1600    @4800    @5440
                                                                              A  B             15       73       257      536
                                                                              At B             13       54       224      446
                                                                              A  Bt            17       78       277      558
                                                                              At Bt            14       56       225      447

                                                                              image:           @1280    @1600    @4800    @2240
                                                                              A  B             52       307      720      1307
                                                                              At B             67       320      722      1375
                                                                              A  Bt            50       182      416      794
                                                                              At Bt            53       163      362      703

                                                                                • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                  nnsan

                                                                                  Hi,

                                                                                  This is a little bit off topic, but I have tried to reproduce an optimized matrix multiply (MM) in IL. I summarized our results at

                                                                                  http://galaxy.u-aizu.ac.jp/trac/note/wiki/MatrixMultiply

                                                                                  I obtain ~2100 GFLOPS on Cypress at 850 MHz. Not the highest performance, but I believe our code is simpler in terms of memory layout than the fastest code posted on beyond3d.

                                                                                    • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                      hazeman

                                                                                       

                                                                                      Originally posted by: nnsan

                                                                                       

                                                                                      I obtain ~2100 Gflops on Cypress 850MHz. Not highest performance but I believe our code is simpler in terms of memory layout than the fastest code posted on beyond3d.

                                                                                       

                                                                                      I have to admit that something doesn't add up for me.

                                                                                      I did some math after looking at your code. For 2100 GFLOPS your kernel needs to transfer 1050 GB/s from the L1 cache. For the 5870 the aggregate cache bandwidth is 1080 GB/s ( 20 * 54 ). So to get your result you would need perfect transfer from L1. That would require ideal efficiency and no cache misses - which simply isn't the case for an ati gpu doing matrix mul.
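                                                                                      A back-of-envelope version of that arithmetic, assuming an 8x8 block of results per work-item fetched as float4s ( 128 bytes of A and B per 256 flops, i.e. 0.5 bytes/flop ) and the 20 * 54 GB/s cache figure above; none of these are official ATI numbers:

// Reproduce the L1-bandwidth estimate above; all inputs are the post's
// assumptions.
#include <cstdio>

int main() {
    const double gflops         = 2100.0;        // claimed throughput
    const double bytes_per_flop = 128.0 / 256.0; // 8x8 block, float4 fetches
    const double need           = gflops * bytes_per_flop; // GB/s from L1
    const double have           = 20 * 54.0;     // aggregate L1 on a 5870
    std::printf("need %.0f GB/s of %.0f GB/s -> %.0f%% of peak\n",
                need, have, 100.0 * need / have);
    return 0;
}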

                                                                                      Also, I have my doubts about hiding the latency of L1 cache hits.

                                                                                      So it would be nice for someone else to verify your results - there might be some error in the benchmarking.

                                                                                       

                                                                                        • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                          nnsan

                                                                                          hazeman,

                                                                                          You might be right so I will double-check my code.

                                                                                          The code inside the loop seems to be the same as prunedtree's code except for the load instructions. My code reads from two streams while prunedtree's reads from four; instead, I use the offset feature of sample_resource. Do you think these differences affect the performance (i.e. the cache hit rate)? I don't fully understand the internal architecture of Cypress...

                                                                                            • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                              hazeman

                                                                                              If I remember correctly, prunedtree achieved >2 TFLOPS with blocks bigger than 8x8 ( 10x12, I think ). A bigger block means less transfer required from the L1 cache.

                                                                                              Also, he used a precomputed position for each "workitem" - and he didn't post info about it. The right position and data layout can increase cache hits. The tiled layout used by default in pixel shaders is better than linear, but I doubt it's optimal.

                                                                                              The code for an 8x8 block with tiled layout ( the same one you use ) is available as part of CAL++. So you should see a similar result ( ~1.6 TFLOPS ).

                                                                                              Like I said, I have some doubts about the hiding of L1 cache latency in your code. To be sure I would have to see the generated ISA.

                                                                                              You have divided the reads into 2 groups of 4 sample instructions each. The CAL compiler sometimes merges those, but usually it doesn't. If they are not merged, then you have 13 ops per load clause. L1 latency is 180 cycles and you have only 13 ops * 8 threads = 104 cycles to hide it. So you could be using <60% of the gpu's power ( rough arithmetic below ).
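                                                                                              The latency-hiding estimate, spelled out ( using the from-memory figures above, not official specs ):

// How much of the ~180-cycle L1 hit latency can 13-op clauses hide?
#include <cstdio>

int main() {
    const int ops_per_clause = 13;  // ALU ops between sample clauses
    const int hw_threads     = 8;   // resident wavefronts per SIMD
    const int l1_latency     = 180; // cycles, estimate from the post
    const int hideable = ops_per_clause * hw_threads; // 104 cycles of work
    std::printf("hide %d of %d cycles -> ~%d%% utilization\n",
                hideable, l1_latency, 100 * hideable / l1_latency);
    return 0;
}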

                                                                                              I'm guessing that you have some error which causes you to overestimate performance by a factor of 2.

                                                                                              PS. I'm quoting these numbers from memory, so I could be wrong somewhere.

                                                                                                • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                  nnsan

                                                                                                  I uploaded the full disassembly code to my site.

                                                                                                  The read instructions are merged, as far as I understand. And it uses 25 registers, so I can have 10 threads - is this correct?

                                                                                                  Thanks for the detailed explanations. I did not realize a 2D space-filling curve access pattern was used in the prunedtree code. Deciphering ISA is really difficult.

                                                                                                    • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                      hazeman

                                                                                                       

                                                                                                      Originally posted by: nnsan And it uses 25 registers so that  I can have 10 threads, is this correct?

                                                                                                      In theory. But in practice, a thread count that is not a power of 2 can have a negative impact on performance.

                                                                                                      Thanks for detailed explanations. I did not realize 2D filling curve access was used in the prunedtree code. Deciphering ISA is really difficult.

                                                                                                       

                                                                                                      True. That's why I've rewritten his 8x8 block code using CAL++.

                                                                                                      So the only difference between your code and the CAL++ version is the usage of aoffimmi. This means you should see similar performance.

                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                        Jawed

                                                                                                         

                                                                                                        Originally posted by: nnsan And it uses 25 registers so I can have 10 threads, is this correct?


                                                                                                        I believe that's correct:

                                                                                                        10 hardware threads * 64 work items * 25 GPRs = 16000 GPRs.

                                                                                                        The clause temporaries used are T0 and T1. These consume:

                                                                                                        2 clause temporaries * 64 work items * 2 hardware threads  = 256 GPRs.

                                                                                                        The total GPR allocation is therefore 16256 GPRs and the capacity of the register file is 16384.
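                                                                                                        The same accounting as a quick check ( the 16384-GPR register file and 64 work-items per hardware thread are the Cypress figures used above ):

// Jawed's GPR budget, computed: main allocation plus clause temporaries.
#include <cstdio>

int main() {
    const int file_size  = 16384; // GPRs in a SIMD's register file
    const int items      = 64;    // work-items per hardware thread
    const int kernel_gpr = 25;    // registers used by nnsan's kernel
    const int threads    = file_size / (items * kernel_gpr);   // 10
    const int main_alloc = threads * items * kernel_gpr;       // 16000
    const int clause_tmp = 2 * items * 2; // T0/T1 on 2 executing threads: 256
    std::printf("%d threads, %d + %d = %d of %d GPRs\n", threads,
                main_alloc, clause_tmp, main_alloc + clause_tmp, file_size);
    return 0;
}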

                                                                                                    • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                      hazeman

                                                                                                       

                                                                                                      Originally posted by: nnsan  My code reads from two streams but prunedtree's code reads from four streams. Instead I use the offset feature of sample_resource.  Do you think these differences affect the performance (i.e. cache hit rate)?

                                                                                                       

                                                                                                      I think it shouldn't have an impact.

                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                        Jawed

                                                                                                        prunedtree also reported extremely high L1 bandwidth usage in his original description of his technique: 444 GB/s out of a theoretical 480 GB/s.

                                                                                                        Additionally, his code uses a tiled, rather than linear, access pattern, maximising efficiency. His kernel reads pre-computed addresses, i.e. work item IDs are assigned along a 2D space-filling curve ( one such curve is sketched below ).

                                                                                                        The use of aoffimmi shouldn't have any effect. It saves some registers though, which is very very very useful.

                                                                                                        Finally, since A is transposed, the memory access pattern is not strictly "matrix multiply".
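                                                                                                        For the curious, one common 2D space-filling assignment is Morton (Z-order) indexing - a sketch only, not necessarily the exact curve prunedtree used. Decoding a flat work-item id this way keeps neighbouring ids spatially close, which is the property that helps the texture L1 hit rate:

// Z-order (Morton) decode: map a flat id to tiled (x, y) coordinates.
#include <cstdint>
#include <cstdio>

// Compact the even-position bits of v into the low 16 bits.
static uint32_t compact1by1(uint32_t v) {
    v &= 0x55555555;
    v = (v | (v >> 1)) & 0x33333333;
    v = (v | (v >> 2)) & 0x0f0f0f0f;
    v = (v | (v >> 4)) & 0x00ff00ff;
    v = (v | (v >> 8)) & 0x0000ffff;
    return v;
}

int main() {
    for (uint32_t id = 0; id < 16; ++id) {
        uint32_t x = compact1by1(id);      // even bits -> x
        uint32_t y = compact1by1(id >> 1); // odd bits  -> y
        // ids 0,1,2,3 land on (0,0) (1,0) (0,1) (1,1): a 2x2 tile, and so on.
        std::printf("id %2u -> (%u, %u)\n", id, x, y);
    }
    return 0;
}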

                                                                                                         

                                                                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                            hazeman

                                                                                                             

                                                                                                            Originally posted by: Jawed prunedtree also reported extremely high L1 bandwidth usage in his original description of his technique, 444 GB/s out of a theoretical 480GB/s.


                                                                                                            Prunedtree wrote that he achieved 444 GB/s in synthetic tests; he had a slightly smaller value for matrix mul. That gives ~90% efficiency, which I have no problem with.

                                                                                                            But the code posted by nnsan would need to reach 99% of the theoretical bandwidth to achieve 2.1 TFLOPS. That simply isn't possible on an ati gpu.

                                                                                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                Jawed

                                                                                                                http://forum.beyond3d.com/showthread.php?p=1369860#post1369860

                                                                                                                 

                                                                                                                Optimizing a little for RV870, I managed to reach up to 1083 GB/s (L1 fetch bandwidth peaks at 1088 GB/s afaik) with 12x8 blocks (that puts the TMU bottleneck at 2.6 TFlop/s) yet this only achieves 2.17 TFlop/s in practice: ALU becomes the bottleneck.


                                                                                                                Also, nnsan's kernel is a pixel shader (I've only just noticed), so it is, by default, using tiling for work item location - which means the kernel should have a very favourable L1 access pattern. In other words, nnsan does not need to worry about trying to duplicate the pre-computed addressing that prunedtree has used.

                                                                                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                    hazeman

                                                                                                                     

                                                                                                                    Also, nnsan's kernel is a pixel shader (I've only just noticed), so it is, by default, using tiling for work item location - which means the kernel should have a very favourable L1 access pattern. In other words, nnsan does not need to worry about trying to duplicate the pre-computed addressing that prunedtree has used.


                                                                                                                    This is exactly the same as the kernel in CAL++. So nnsan should see similar performance.

                                                                                                                    Btw, are you sure that prunedtree used EXACTLY the same pattern that the pixel shader uses ( especially since there is no official info about the pattern used in the gpu* )? A small change there can have a significant impact on cache hit rates.

                                                                                                                    * I think I've seen a comment somewhere that ATI doesn't want to disclose it because it can change with the architecture ( or something ).

                                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                        Jawed

                                                                                                                         

                                                                                                                        Originally posted by: hazeman Btw, are you sure that prunedtree used EXACTLY the same pattern that the pixel shader uses ( especially since there is no official info about the pattern used in the gpu* )? A small change there can have a significant impact on cache hit rates.


                                                                                                                        No, I'm not saying that the pattern used is the same in both cases - merely that prunedtree reported that tiling improves performance in his algorithm. In fact, tiling is essential, as performance for N>1024 "crumbled".

                                                                                                                        prunedtree's tiling, for example, might account for the fact that there's a 4:1 ratio between the vertical and horizontal directions due to float4 packing, whereas the pixel shader's tiling doesn't.

                                                                                                                          • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                            cjang

                                                                                                                            Excuse me for being a bit off the current discussion topic.

                                                                                                                            In my experience, it is important to validate output data as there are cases in which a kernel seems perfectly fine except that the results are wrong. A kernel design that works at one combination of matrix dimensions, inner/outer blocking sizes, accumulation loop order, etc. may fail at another point in the kernel solution space. While everything I am doing is with OpenCL, I believe the same philosophy applies to CAL/IL (as it is also translated by a compiler) and even to ISA if there is further translation somewhere down at the device level.

                                                                                                                            I believe data validation should be the ultimate criterion for correctness, rather than code inspection (even at the ISA level). If the output is good, then we know that however the kernel works, the calculation is correct. We then know how much work is done by whatever measure of complexity we are using during the time in which the kernel executes. ( A minimal checker is sketched below. )
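                                                                                                                            A minimal sketch of such a checker ( helper names hypothetical, not from GATLAS ): compare the GPU result against a naive CPU reference, with a relative tolerance since single-precision error grows with the matrix size.

// Naive O(N^3) reference and a relative-error comparison.
#include <cmath>
#include <cstdio>
#include <vector>

// C = A * B for row-major NxN matrices; slow but trustworthy.
void sgemm_ref(int n, const std::vector<float>& a,
               const std::vector<float>& b, std::vector<float>& c) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.f;
            for (int k = 0; k < n; ++k) acc += a[i*n + k] * b[k*n + j];
            c[i*n + j] = acc;
        }
}

// Relative check: a fixed epsilon would false-alarm on large accumulations.
bool validate(int n, const std::vector<float>& gpu,
              const std::vector<float>& ref, float tol = 1e-4f) {
    for (int i = 0; i < n*n; ++i) {
        float denom = std::fabs(ref[i]) > 1.f ? std::fabs(ref[i]) : 1.f;
        if (std::fabs(gpu[i] - ref[i]) / denom > tol) {
            std::printf("mismatch at %d: %g vs %g\n", i, gpu[i], ref[i]);
            return false;
        }
    }
    return true;
}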

                                                                                                                              • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                                hazeman

                                                                                                                                I've recoded nnsan's kernel in CAL++. Here are the results for the 4770:

                                                                                                                                2 images with burst write ( 30 regs ) - 523 GFLOP/s

                                                                                                                                2 images without burst write ( 26 regs ) - 623 GFLOP/s ( this is closest to prunedtree's code - I think he used 28 regs )

                                                                                                                                1 image with burst write ( 30 regs ) - 548 GFLOP/s - the internal loop (ISA) is 2 ops longer than in nnsan's kernel

                                                                                                                                1 image without burst write ( 26 regs ) - 660 GFLOP/s - the internal loop (ISA) is 2 ops shorter than in nnsan's kernel

                                                                                                                                The difference in internal loop length is simply a result of the bad quality of the CAL/IL compiler. Also, a burst write shouldn't increase register usage: registers for a burst must have consecutive indexes, so any reasonable compiler would use the correct indexes from the beginning. But this is too much for the CAL/IL compiler :////// .

                                                                                                                                PS. The code is available in the CAL++ svn.

                                                                                                                                  • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                                    hazeman

                                                                                                                                     

                                                                                                                                    Originally posted by: cjang In my experience, it is important to validate output data as there are cases in which a kernel seems perfectly fine except that the results are wrong.


                                                                                                                                    You are totally right here.

                                                                                                                                     

                                                                                                                                    We then know how much work is done by whatever measure of complexity we are using during the time in which the kernel executes.


                                                                                                                                    This is not always a valid assumption - sometimes, by mistake, we can use a different value ( I've seen this happen ).

                                                                                                                                    And with ATI IL, I think looking at the ISA is a basic optimisation tool. The CAL/IL compiler can make really stupid mistakes which can be corrected by slightly changing the code.

                                                                                                                                     

                                                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                                        Jawed

                                                                                                                                        The input data must either be padded with zeroes, to ensure that each work-item never fetches out of bounds, or the kernel must do bounds checking. Generally, though, the latter option produces too much computational overhead.

                                                                                                                                        So the padding has to be adapted to the block dimensions if the implementation has variable-sized blocks.

                                                                                                                                        In general the GPU will have junk data in memory (no different from a PC's memory). Only sufficient zero-padding (done host side, with everything then copied over) will ensure that the junk doesn't affect the result.
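                                                                                                                                        A host-side padding sketch along these lines ( helper name hypothetical ): round each dimension up to the block size and zero-fill before copying to the device.

// Zero-pad a row-major matrix up to multiples of the kernel's block size
// so work-items never fetch junk past the edge.
#include <cstring>
#include <vector>

std::vector<float> pad_matrix(const float* src, int rows, int cols,
                              int block_h, int block_w,
                              int* padded_rows, int* padded_cols) {
    const int pr = ((rows + block_h - 1) / block_h) * block_h;
    const int pc = ((cols + block_w - 1) / block_w) * block_w;
    std::vector<float> dst(static_cast<size_t>(pr) * pc, 0.0f); // zero fill
    for (int i = 0; i < rows; ++i) // copy the real data row by row
        std::memcpy(&dst[static_cast<size_t>(i) * pc],
                    &src[static_cast<size_t>(i) * cols],
                    cols * sizeof(float));
    *padded_rows = pr;
    *padded_cols = pc;
    return dst; // copy this whole padded buffer to the device
}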

                                                                                                                                      • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                                                        nnsan

                                                                                                                                        Jawed, thanks for the precise numbers.

                                                                                                                                        I think in PS mode it automatically chooses the number of threads depending on the register count, and we have no control. In CS, we can specify the number of threads through the dcl_num_thread instruction (correct me if I'm wrong).

                                                                                                                                         

                                                                                                                                        Originally posted by: hazeman I've recoded nnsan's kernel in CAL++. Here are the results for the 4770:

                                                                                                                                        2 images with burst write ( 30 regs ) - 523 GFLOP/s

                                                                                                                                        2 images without burst write ( 26 regs ) - 623 GFLOP/s ( this is closest to prunedtree's code - I think he used 28 regs )

                                                                                                                                        1 image with burst write ( 30 regs ) - 548 GFLOP/s - the internal loop (ISA) is 2 ops longer than in nnsan's kernel

                                                                                                                                        1 image without burst write ( 26 regs ) - 660 GFLOP/s - the internal loop (ISA) is 2 ops shorter than in nnsan's kernel



                                                                                                                                        Your results seem to indicate that register usage is critical, and hence so is the number of threads. As you know, without the "breakc" hack my code requires 36 registers and shows ~1500 GFLOPS at maximum on Cypress. In my case, with the hack I have 256/25 > 10 threads, and without it, 256/36 > 7 threads. 10/7 is roughly the same as the performance ratio.

                                                                                                            • OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?
                                                                                                              MicahVillmow
                                                                                                              nnsan,
                                                                                                              Congrats on your work, very impressive. It is nice to see people using the newer features of IL to make the code simpler to read.