
riza_guntur

Beta4, GPU performance is like "ultra slow"

Has anyone noticed this?

The 4850 is very slow compared to Brook+.

I still have data for matmult; even the OpenCL implementation can't compete with the slow simple_matmult in Brook+.

At this rate it will take a dozen years for OpenCL to catch up.

omkaranathan

Brook+ samples use texture memory, whereas the OpenCL implementation uses global memory. So this is not exactly an apples-to-apples comparison.

 


"So these are not exactly apple to apple comparison" - I think you are badly mistaken here.

The real and most important question for developer is "when I write normal program in Brook+ ( or CAL ) and compare it to program written in OpenCL, what will be faster". This is the question behind decision about using Brook+ and OpenCL ( Of course there are other factors as ease of developing, but these is not important when speed differnece is big ).

So for developer it really doesn't matter how did ATI implement OpenCL. If it's too slow then sorry, but we will not use it ( and probably switch to nvidia's cards and better opencl implementation )

On the other hand I think AMD/ATI should really think about hiring more programers ( and maybe firing few ). Cause at the moment the software side of ATI is really behind hardware ( look at the linux ati driver, quality of cal or speed of opencl development ... ).

 


It is definitely not an apples-to-apples comparison, as one sample is using cached memory and the other is not. Images are not yet supported (image support is an optional feature), but you can use shared memory as a programmable cache (only on the R800 series, though).
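To make "shared memory as a programmable cache" concrete, here is a minimal sketch of a tiled matrix multiply that stages blocks of A and B in __local memory (not from the SDK; the kernel name, the 16x16 tile, and the assumption that both matrix dimensions are multiples of the tile size are mine):

#define TILE 16

__kernel void matmul_tiled(__global const float* A, __global const float* B,
                           __global float* C, int widthA, int widthB)
{
    __local float tileA[TILE][TILE];   // staged block of A
    __local float tileB[TILE][TILE];   // staged block of B

    int row = get_global_id(1);
    int col = get_global_id(0);
    int ly  = get_local_id(1);
    int lx  = get_local_id(0);

    float acc = 0.0f;
    for (int t = 0; t < widthA / TILE; ++t) {
        // every work-item loads one element of each tile
        tileA[ly][lx] = A[row * widthA + t * TILE + lx];
        tileB[ly][lx] = B[(t * TILE + ly) * widthB + col];
        barrier(CLK_LOCAL_MEM_FENCE);   // tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += tileA[ly][k] * tileB[k][lx];
        barrier(CLK_LOCAL_MEM_FENCE);   // done reading before the next load
    }
    C[row * widthB + col] = acc;
}

Each tile element is then read TILE times from fast local memory instead of TILE times from global memory.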


The example OpenCL matmul provided by ATI is almost verbatim from the NVIDIA OpenCL SDK, and should be more or less optimal on NVIDIA hardware. It is about 5 times slower than the native (non-OpenCL) CUDA version on the NVIDIA OpenCL implementation.

ATI hardware is explicit float4, so don't expect the NVIDIA kernel to run optimally on ATI hardware. The optimized Brook+ implementation of matmul in the SDK is about 1.5 times faster than the native (non-OpenCL) CUDA version (4870 vs gtx260). I have ported this "in flavour" to ATI OpenCL, and it is only about half as fast as the Brook+ version. The OpenCL version is, however, all gather in and scatter out... such a Brook+ kernel may be just as slow... I'm gonna check it...

Conclusion: the ATI OpenCL compiler, at least for matmul with explicit float4, is doing pretty well, and much better than NVIDIA's compiler.

Try it for yourself... I've attached the explicit float4 port to OpenCL (it's written in PyOpenCL, from http://mathema.tician.de/software/pyopencl).

As for elegance of kernel code, I prefer the float4 versions to memory fences and block sizes.

 

from __future__ import division

kernel_code = """
// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA %(w_a)d  // Matrix A width
#define HA %(h_a)d  // Matrix A height
#define WB %(w_b)d  // Matrix B width
#define HB WA       // Matrix B height
#define WC WB       // Matrix C width
#define HC HA       // Matrix C height

////////////////////////////////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! WA is A's width and WB is B's width
////////////////////////////////////////////////////////////////////////////////
__kernel void matrixMul( __global float4* A, __global float4* B, __global float4* C )
{
    // Declaring and initializing accumulators
    float4 accumulator1 = {0.0f, 0.0f, 0.0f, 0.0f};
    float4 accumulator2 = {0.0f, 0.0f, 0.0f, 0.0f};
    float4 accumulator3 = {0.0f, 0.0f, 0.0f, 0.0f};
    float4 accumulator4 = {0.0f, 0.0f, 0.0f, 0.0f};

    float4 A1, A2, A3, A4;
    float4 B1, B2, B3, B4;

    // Widths in float4 units
    int f4WA = WA/4;
    int f4WB = WB/4;

    // Thread index: each work-item computes a 4x4-float block of C
    int a_i = get_global_id(1);
    int b_j = get_global_id(0);

    int a_start = a_i*f4WA*4;
    int b_start;

    int k = 0;
    for (; k < f4WA; ++k) {
        // Fetching values from A
        A1 = A[a_start+k];
        A2 = A[a_start+k+f4WA];
        A3 = A[a_start+k+2*f4WA];
        A4 = A[a_start+k+3*f4WA];

        b_start = k*f4WB*4;

        // Fetching values from B
        B1 = B[b_start+b_j];
        B2 = B[b_start+b_j+f4WB];
        B3 = B[b_start+b_j+2*f4WB];
        B4 = B[b_start+b_j+3*f4WB];

        accumulator1 += A1.xxxx * B1.xyzw + A1.yyyy * B2.xyzw + A1.zzzz * B3.xyzw + A1.wwww * B4.xyzw;
        accumulator2 += A2.xxxx * B1.xyzw + A2.yyyy * B2.xyzw + A2.zzzz * B3.xyzw + A2.wwww * B4.xyzw;
        accumulator3 += A3.xxxx * B1.xyzw + A3.yyyy * B2.xyzw + A3.zzzz * B3.xyzw + A3.wwww * B4.xyzw;
        accumulator4 += A4.xxxx * B1.xyzw + A4.yyyy * B2.xyzw + A4.zzzz * B3.xyzw + A4.wwww * B4.xyzw;
    }

    int index = a_i*f4WB*4 + b_j;
    C[index]        = accumulator1;
    C[index+f4WB]   = accumulator2;
    C[index+2*f4WB] = accumulator3;
    C[index+3*f4WB] = accumulator4;
}
"""

import pyopencl as cl
from time import time
import numpy

gpu_ctx = cl.Context(dev_type=cl.device_type.GPU)
gpu = 3
assert gpu < len(gpu_ctx.devices)
for dev in gpu_ctx.devices:
    assert dev.local_mem_size > 0
ctx = cl.Context(devices=gpu_ctx.devices[gpu:gpu+1])

#queue = cl.CommandQueue(ctx,
#        properties=cl.command_queue_properties.PROFILING_ENABLE)
queue = cl.CommandQueue(ctx)

a_height = 4096
a_width = 2048
b_height = a_width  # b_height == a_width
b_width = 4096

h_a = numpy.random.rand(a_height, a_width).astype(numpy.float32)
h_b = numpy.random.rand(b_height, b_width).astype(numpy.float32)
h_c = numpy.empty((a_height, b_width)).astype(numpy.float32)

mf = cl.mem_flags
kernel_params = {"w_a": a_width, "h_a": a_height, "w_b": b_width}
kernel = cl.Program(ctx, kernel_code % kernel_params).build().matrixMul

assert a_width % 4 == 0
assert a_height % 4 == 0
assert b_width % 4 == 0
assert b_height == a_width

d_a_buf = cl.Buffer(ctx, mf.READ_ONLY, size=h_a.nbytes)
d_b_buf = cl.Buffer(ctx, mf.READ_ONLY, size=h_b.nbytes)
d_c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=h_c.nbytes)

# host -> device transfer
t1 = time()
write_events = [cl.enqueue_write_buffer(queue, d_b_buf, h_b)]
write_events += [cl.enqueue_write_buffer(queue, d_a_buf, h_a)]
cl.wait_for_events(write_events)
push_time = time() - t1

# one work-item per 4x4 block of C
args = [queue, (int(b_width/4), int(a_height/4))]
args += [d_a_buf]
args += [d_b_buf]
args += [d_c_buf]

# warm-up run (absorbs one-time overheads), then the timed run
event = kernel(*args)
event.wait()

t1 = time()
event = kernel(*args)
event.wait()
gpu_time = time() - t1

# device -> host transfer
t1 = time()
read_events = [cl.enqueue_read_buffer(queue, d_c_buf, h_c)]
cl.wait_for_events(read_events)
pull_time = time() - t1

gpu_total_time = gpu_time + push_time + pull_time

# a full push + compute + pull cycle, timed end to end
t1 = time()
write_events = [cl.enqueue_write_buffer(queue, d_b_buf, h_b)]
write_events += [cl.enqueue_write_buffer(queue, d_a_buf, h_a)]
cl.wait_for_events(write_events)
event = kernel(*args)
event.wait()
cl.enqueue_read_buffer(queue, d_c_buf, h_c).wait()
actual_time = time() - t1

ans1 = h_c

print "GPU (s) total:", gpu_total_time
print "PUSH   ", push_time
print "PULL   ", pull_time
print "COMPUTE", gpu_time
print "ACTUAL ", actual_time

do_cpu = True
if do_cpu:
    t1 = time()
    ans2 = numpy.dot(h_a, h_b)
    cpu_time = time() - t1
    print "CPU (s)", cpu_time
    print "GPU speedup: ", cpu_time / gpu_total_time
    print "GPU==CPU:", numpy.allclose(ans1, ans2)


"OpenCL optionally supports textures (images), in addition to global memory.  This is not yet available in beta4, which would have provided a fairer comparison."

This is a little beside the point. R7xx has 2 methods to access memory. One is by texture unit ( texture read clausule ) and second global memory access ( memory read clausule ). It's more about what kind of instruction to use to load data from memory. Texture unit can be used in a way which imitates simple access to memory. Probably in implementing OpenCL it's easier to use global memory. But if it's significantly slower then texture unit ... then CAL will be the only option to harvest power of Radeon cards, and what will be the sense of having OpenCL.

And 1/2 slower then Brook+ is huge slowdown ( as Brook isn't that fast ).

And about firing staff. Maybe it's my first post here ( I don't like talking, coding is more my thing ), but I'm watching ATI problems with software for a long time now. And I wonder why AMD after taking over ATI gave new linux driver development to Novell.

PS. Here is link to the fastest matrix multiplication  ( it uses texture units to access memory ) http://forum.beyond3d.com/showthread.php?t=54842.

 


I can see both sides of this. Yes, ultimately OpenCL cannot underperform the alternatives by a wide margin. But ATI's OpenCL is NOT done, I assume. OpenCL optionally supports textures (images) in addition to global memory. This is not yet available in beta4; it would have provided a fairer comparison.

It is then up to the DEVELOPER to choose whether to represent data as plain memory or as textures, not some implementation detail chosen by ATI. Both approaches have their strengths and weaknesses. High-I/O kernels can perform dramatically differently. Low-I/O kernels should probably be written using global memory, unless the data actually is image data.

FYI, talking about firing people in your first post?? Amateur.


hazeman,
Although the R7XX has multiple methods to access memory (a lot more than two if you read the ISA doc), OpenCL currently only has one, because the OpenCL programming model is pointer based, so all data has to be fully coherent (this is ignoring images, which are read_only or write_only, not both). This does not allow the texture unit to be used in the same way that Brook+/IL can use it. Brook+ does not allow you to alias pointers (unless you explicitly allow it), and in IL you do so at your own risk. Writing to memory and reading from that same memory with the texture unit does not produce deterministic behavior. OpenCL requires that all writes and reads to global memory are coherent, so this approach is not feasible. This is a performance hit compared to a streaming model, because the GPU is natively a streaming device. There is another performance hit for the R7XX, since it was not designed with OpenCL in mind; our new HD5XXX series was.
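To illustrate the image distinction above: a sketch against the OpenCL 1.0 spec (not the beta4 SDK, which does not yet expose images; the kernel names are illustrative) of the same gather written once with a read_only image, which an implementation may route through the texture unit, and once with a coherent global pointer:

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

// read_only image: the implementation is free to serve these loads
// through the texture unit, because the kernel can never write to A
__kernel void gather_image(read_only image2d_t A, __global float4* out, int width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    float4 v = read_imagef(A, smp, (int2)(x, y));
    out[y * width + x] = v;
}

// coherent global pointer: reads must observe all prior writes,
// so the slower coherent global-memory path is used
__kernel void gather_global(__global const float4* A, __global float4* out, int width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    out[y * width + x] = A[y * width + x];
}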
One of the goals of the Stream SDK is to provide a full software stack for many different types of programmers.
That means that if you want performance, AMD provides CAL/IL. If you want ease of programming against the streaming model, we also provide Brook+. And if you want to program multiple devices in the same language from the same source, there is OpenCL.

As for performance: if you want an apples-to-apples comparison, compare code optimized for our OpenCL platform against code optimized for other vendors' OpenCL platforms. If you see a way we can realistically improve our software stack, feel free to let us know. We have posted email addresses for contacting us on the Stream SDK page.


PS. Here is a link to the fastest matrix multiplication (it uses texture units to access memory): http://forum.beyond3d.com/showthread.php?t=54842.


@hazeman: Thanks for the very nice link! I'll read it when I'm a real programmer ;-)

As for performance: if you want an apples-to-apples comparison, compare code optimized for our OpenCL platform against code optimized for other vendors' OpenCL platforms. If you see a way we can realistically improve our software stack, feel free to let us know. We have posted email addresses for contacting us on the Stream SDK page.


Right on. OK, so my implementation above is about 13x faster than the one in the ATI OpenCL SDK, and about 3 times faster than the one in the NVIDIA OpenCL SDK (4850 vs gtx260). You've all got the code. Improve on either of these and upload it for the rest of us. It would be great if you report relative speedups against the reference implementations above for NVIDIA and/or ATI. I guess both can be improved by a good margin, and I'd love to see how, in OpenCL.

I used the following matrix dimensions:

A: h=4096, w=2048

B: h=2048, w=4096

=> C: h=4096, w=4096

C = A*B

As for suggestions for software stack improvements, I think first some thank-yous are in order for ATI, NVIDIA, Apple, Khronos, and the rest of the OpenCL consortium! The Python wrapper for OpenCL, PyOpenCL (Andreas Klöckner, http://mathema.tician.de/software/pyopencl), was developed using NVIDIA OpenCL, and yesterday I built it against ATI OpenCL without a hitch. It's like they were digging a tunnel from two sides of a mountain and met in the middle. Thank you; this OpenCL opens a very important tunnel.

 

 

 


I posted the link to this matrix multiplication implementation because it pushes the Radeon 4870/4890 to its hardware limits (almost 1 TFLOPS), so it gives a good baseline for comparing other code (where the differences are, and so on). I really didn't mean to suggest someone isn't a real programmer.

emuller, could you calculate the GFLOPS for your implementation? It's easier to evaluate code/OpenCL performance that way. Thank you in advance.

 

 


Originally posted by: hazeman I posted the link to this matrix multiplication implementation because it pushes the Radeon 4870/4890 to its hardware limits (almost 1 TFLOPS), so it gives a good baseline for comparing other code (where the differences are, and so on). I really didn't mean to suggest someone isn't a real programmer.


I don't think I am a proverbial real programmer ...

"The Story of Mel, a Real Programmer"

http://www.ccil.org/jargon/jargon_49.html

"Real Programmer /n./"

http://www.ccil.org/jargon/jargon_33.html#TAG1478

 

emuller, could you calculate the GFLOPS for your implementation? It's easier to evaluate code/OpenCL performance that way. Thank you in advance.


gflop = C.size * (a_width * 2.) / (1000**3.)
gflops = gflop / gpu_time

Here a_width = 2048 and C.size = 4096^2, so the operation works out to 68.72 GFLOP.

Now some gpu_times:

 

1) explicit float4 opencl impl above running on a 4870
gpu_time = 0.586 s

2) explicit float4 brook+ impl running on 4870
gpu_time = 0.226 s

3) demo_meta_matrixmul_cheetah.py example provided in PyCUDA source running on gtx260
gpu_time = 0.378 s

4) matrixMul.cl in the NVIDIA OpenCL SDK (basically the same as the one provided in the ATI OpenCL SDK) on gtx260
gpu_time = 3.28s

5) same kernel as 4 running on 4870
gpu_time = 11.28s

So GFLOPS:

1) float4 opencl, 4870
117 GFLOPS

2) float4 brook+ 4870
304 GFLOPS

3) PyCUDA matmul demo, gtx260
182 GFLOPS

4) NVIDIA OpenCL matmul example, gtx260
20.95 GFLOPS

5) ATI (basically same kernel as 4) OpenCL matmul example, 4870
6.1 GFLOPS

6) I tried to run the explicit float4 OpenCL kernel from 1) on the gtx260, but it failed with "out of resources".

So I ran smaller matrix sizes:

Now GFLOP = 2048**2 * 1024 * 2.0 / 1000**3 = 8.59
gpu_time = 1.11 s
for 7.74 GFLOPS

 


>> Pointer aliasing resolution is actually very difficult in general. 

Thats generally true. But one simple trick would be to detect const pointer and then apply optimizations ( using TMU ).

>> If you cast pointers, embed pointers in structs, things get fun. 

Following const embeded in struct , casted is almost standard thing in C compiler ( as OpenCL is simplified C )

>> Also, statically, the compiler doesn't know if 2 arguments are really the same pointer at runtime. 

Again true. But again solution to that problem isn't hard. You need to generate two versions of code. One assuming no overlapping of variables ( with optimization ) and second one using general (slower) instructions. And during call simply check witch one to use.
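A small C sketch of that two-version scheme (the saxpy example and all names here are mine, purely to illustrate the idea; this is not a description of AMD's compiler):

#include <stddef.h>

/* optimized variant: restrict promises no overlap, so the compiler
   is free to reorder, unroll, and vectorize */
static void saxpy_fast(float * restrict y, const float * restrict x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = 4.0f * x[i] + y[i];
}

/* general variant: no aliasing assumptions, conservative code */
static void saxpy_safe(float *y, const float *x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = 4.0f * x[i] + y[i];
}

/* dispatcher: the runtime check described above */
void saxpy(float *y, const float *x, size_t n)
{
    /* the buffers overlap iff each one starts before the other ends */
    if (x < y + n && y < x + n)
        saxpy_safe(y, x, n);
    else
        saxpy_fast(y, x, n);
}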

If AMD/ATI wants to be competitive with the Bullet physics engine, then they really should think about adding as many optimizations to the OpenCL compiler as they can (and doing it ASAP). IMHO, justifications like "it's not an apples-to-apples comparison" simply won't cut it.

Thanks to emuller for posting GFLOPS for his mult - and to be frank, those numbers for OpenCL don't look too good.

Maybe it looks like I'm bashing ATI, but it's simply because I want the OpenCL compiler to be as good as possible. It's much less work to write in C than to use CAL. But if the numbers stay as they are now, I will have no choice, and that really saddens me.

They have great hardware, and I hope the software side will follow.

 

 


You know what, I think the OpenCL examples work great for the CPU, not the GPU; everything slows down even after accounting for compilation time on both sides. Some of the examples show something like a 1:1 ratio between CPU and GPU, and that's comparing the famous Q6600 of its day with a 4850. A Core i7 or Gulftown comparison with the R700 should be even more frustrating, with the high-end CPU stomping the GPU.

Anyway, to overcome compiler constraints, AMD should publish an "OpenCL Best Practices" guide and set up an OpenCL developer forum with some fancy features like embedding code in posts (not the plain, geeky look it has now), and maybe its own web address.

Those would let us know which methods work best with OpenCL.

AMD, way to go!


As a C/C++ 'fanboy' I'll add another comparison:

Have a look at the usual language shootouts and you will see that, e.g., Java is in general roughly two times slower than C/C++ (depending on the specific code), even though the Java world has modern JIT VMs that know the latest tricks and optimizations.

So what do we expect from:

OpenCL -> CAL -> IL -> eg. R7xx ISA

when

Java -> IL -> e.g. x86 ISA

is two times slower than

C++ -> e.g. x86 ISA

Maybe this is what the Germans (BTW: sorry for my poor English) call a 'milkmaid calculation', because of its triviality. Adding more and more abstraction layers will not produce overhead/slowdown under all circumstances, but in general it does. So this debate is nothing but the plain old 'assembler vs. high-level language' debate.

In general it is absolutely no surprise when AMD people say that implication X results in slowdown Y. Substitute the brand name AMD with NVIDIA, Intel, TI, ARM, or others and you still get a correct statement. Just look at the x86 world and ask yourself why AMD/Intel deliver so-called 'performance intrinsics' to their customers, or why Xilinx/Altera/Actel/[more] do the same in the (intentionally very different) FPGA world.

One question was: are AMD developers stupid and do they deserve to be fired? Maybe, but then Intel, NVIDIA, Sun, TI, ARM, and the developers of many other companies are IMO also stupid. We would end up with you and me as the lead developers for the world. 😉 I am not sure whether that is a good idea...

So if you don't want 'light speed' or 'ridiculous speed' but 'ludicrous speed' (according to the metric in the Spaceballs movie ;-)), I advise you to sign a lot of NDAs and go with the native ISA. The same procedure as everywhere...

 


Originally posted by: Sternenprinz

 

Have a look at the usual language shootouts and you will see that, e.g., Java is in general roughly two times slower than C/C++ (depending on the specific code), even though the Java world has modern JIT VMs that know the latest tricks and optimizations.



Using Java as an example here is really bad. The Java standard enforces strict rules about numerical computations (mainly bit-for-bit reproducibility); as a result it prohibits common code-generation optimizations (http://math.nist.gov/javanumerics/).

 

So what do we expect from:

 

OpenCL -> CAL -> IL -> eg. R7xx ISA

This should be: OpenCL -> IL -> ISA.

CAL is more like a whole programming environment (maybe I'm overstretching the analogy here), not a language itself.

 

C++ -> e.g. x86 ISA

Here again you change the compilation flow to better fit your milkmaid theory. The correct diagram should be more like:

C++ -> ASM -> x86 ISA

But if you want more correctness, then we should be using C as the example (not C++); it is much closer to OpenCL.

So let's make those comparisons once again. On one side we have

OpenCL (reduced C) -> IL -> ISA

and on the other side we have

C -> ASM -> ISA or Brook+ -> IL -> ISA (the OpenCL compiler loses badly here too).

I think those look almost the same. Of course we could argue that IL is a little higher level than ASM. Probably true.

But for me the real problem is the IL code generated from OpenCL. I think it could be done much better (for the HD48xx) with some simple optimizations. So what I expect is the same kind of loss as compiling from C to ASM, or at least the same kind of loss as from Brook+ to IL.

Generally I agree with your milkmaid theory, but I think you are overstretching it a little here.

PS. One more thing. I don't know why people focused on the comment about firing people in my previous post. That was only a maybe. What IMHO ATI should DO is HIRE more programmers. Why didn't you focus on that?

 


Actually the difference between OpenCL on the GPU and the former code in Brook+ (which is compiled directly to IL) is more than a factor of two; it's about 6 times slower...

If, as Sternenprinz said,

Java -> IL -> e.g. x86 ISA

is two times slower than

C++ -> e.g. x86 ISA

then OpenCL is doing even worse than the previous framework...

Brook+ code is compiled to IL ahead of time, not at runtime; the IL code is then compiled to ISA at runtime...

As for OpenCL, CPU compatibility seems to hurt GPU performance a lot, since the data has to be copied CPU<->GPU. The CPU doesn't need a copy, right? It just accesses the data right away. But for the GPU, when does the copy actually happen? When the kernel is called, or when the arguments are set? Maybe I should put timers in different places...

The weirdest thing is that if the normal GPU code needs just 0.3 seconds, why does almost exactly the same code in OpenCL need 6 seconds? That's simple_matmult vs opencl_matmult, for 2048x2048.

Don't even compare with the fastest CAL/IL code that prunedtree achieved... OpenCL is left extremely far behind.


To be fair, the driver is still in beta.

It's in the best interest of everyone involved in the OpenCL working group to bring their OpenCL implementations up to the level of their native, proprietary programming models (CUDA, Brook+, ...). It's actually not very different from the situation of OpenGL in the late 90s, and remember how that one took off!

Or do you still write drawing code for the 3dfx Glide API?


I'm not that old LOL

Let's see how much it improves over time.


As I mentioned in a previous post, the comparison:

1) explicit float4 opencl impl above running on a 4870
gpu_time = 0.586 s

2) explicit float4 brook+ impl running on 4870
gpu_time = 0.226 s

is not a clear message that the OpenCL implementation is less than half as fast as Brook+. The Brook+ kernel here is not a "scatter out" kernel, and thus should compile to a CAL pixel shader. On the other hand, the float4* output matrix in the OpenCL kernel indicates it is a "scatter kernel" and should compile to a CAL compute shader, if I am not mistaken.

So, to compare apples with apples, I wrote a Brook+ scatter version of matrix multiply (kernel attached) which is analogous to the OpenCL version.

Here are the numbers for it on a 4870 with the same matrix size:

gpu_time = 0.82 s, or 84 GFLOPS

i.e. a good bit slower than the equivalent OpenCL kernel.

Apples for apples, it looks like OpenCL is faster than Brook+ for raw GPU time and analogous kernels.

Both are performing way below peak. Is this a direct result of going to a compute shader? Might the 5800 series do better here?

I did notice that OpenCL has slower push and pull times for the streams:

Brook+: CPU->GPU 0.06 s, GPU->CPU 0.07 s

OpenCL: CPU->GPU 0.08 s, GPU->CPU 0.14 s

It seems OpenCL has half the GPU->CPU transfer speed of Brook+.

 

 

kernel void optimized4gs_matmult(int loopVar0, float4 A[][], float4 B[][], out float4 C[][])
{
    // Setting zero
    float4 zero = float4(0.0f, 0.0f, 0.0f, 0.0f);

    // Declaring and initializing accumulators
    float4 accumulator1 = zero;
    float4 accumulator2 = zero;
    float4 accumulator3 = zero;
    float4 accumulator4 = zero;

    // Row number of output position
    int i = instance().y;
    // Column number of output position
    int j = instance().x;

    int k = 0;
    for (; k < loopVar0; ++k) {
        // Fetching values from A
        float4 A1 = A[i*4][k];
        float4 A2 = A[i*4+1][k];
        float4 A3 = A[i*4+2][k];
        float4 A4 = A[i*4+3][k];

        // Fetching values from B
        float4 B1 = B[k*4][j];
        float4 B2 = B[k*4+1][j];
        float4 B3 = B[k*4+2][j];
        float4 B4 = B[k*4+3][j];

        accumulator1 += A1.xxxx * B1.xyzw + A1.yyyy * B2.xyzw + A1.zzzz * B3.xyzw + A1.wwww * B4.xyzw;
        accumulator2 += A2.xxxx * B1.xyzw + A2.yyyy * B2.xyzw + A2.zzzz * B3.xyzw + A2.wwww * B4.xyzw;
        accumulator3 += A3.xxxx * B1.xyzw + A3.yyyy * B2.xyzw + A3.zzzz * B3.xyzw + A3.wwww * B4.xyzw;
        accumulator4 += A4.xxxx * B1.xyzw + A4.yyyy * B2.xyzw + A4.zzzz * B3.xyzw + A4.wwww * B4.xyzw;
    }

    C[i*4][j]   = accumulator1;
    C[i*4+1][j] = accumulator2;
    C[i*4+2][j] = accumulator3;
    C[i*4+3][j] = accumulator4;
}


In which call does the transfer happen in OpenCL, emuller?

In Brook+ it's clear that transfers happen in streamRead, streamWrite, .read, and .write.

I am happy to say I don't want to read the OpenCL spec to learn...


Originally posted by: riza.guntur In which call does the transfer happen in OpenCL, emuller?

In Brook+ it's clear that transfers happen in streamRead, streamWrite, .read, and .write. I am happy to say I don't want to read the OpenCL spec to learn...

 

Transfers can be done explicitly using the clEnqueueReadBuffer / clEnqueueWriteBuffer commands.

Or you can use the CL_MEM_USE_HOST_PTR / CL_MEM_ALLOC_HOST_PTR flags in the clCreateBuffer command to use CPU memory as the storage bits for the buffer, in which case data is transferred automatically across the interconnect.
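For illustration, a PyOpenCL sketch of both styles (using the same API as the listing earlier in the thread; the array shape and variable names are arbitrary):

import numpy
import pyopencl as cl

ctx = cl.Context(dev_type=cl.device_type.GPU)
queue = cl.CommandQueue(ctx)
h_a = numpy.random.rand(1024, 1024).astype(numpy.float32)
mf = cl.mem_flags

# explicit transfer: device-side allocation plus an enqueued write
d_explicit = cl.Buffer(ctx, mf.READ_ONLY, size=h_a.nbytes)
cl.enqueue_write_buffer(queue, d_explicit, h_a).wait()

# implicit transfer: host memory is used as the buffer's storage bits,
# and the runtime moves the data over the interconnect as needed
d_implicit = cl.Buffer(ctx, mf.READ_ONLY | mf.USE_HOST_PTR, hostbuf=h_a)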


I have done a quick test on a friend's 4870X2. I tested the kernel below and got an execution time of 10 ms on 10,000,000 floats, so that's 200 GFLOPS.

Maybe when I use float4 I will get better utilization of the 5D units and four times the performance.

__kernel void templateKernel(__global float *a, __global float *b, __global float *c)
{
    uint tid = get_global_id(0);
    float x, y;
    x = a[tid];
    y = b[tid];
    x = mad(4.0f, x, y); x = mad(4.0f, x, y); x = mad(4.0f, x, y); x = mad(4.0f, x, y);
    /* ... 200 times x = mad(4.0f, x, y); ... */
    c[tid] = x;
}


Originally posted by: riza.guntur In which call does the transfer happen in OpenCL, emuller?

In Brook+ it's clear that transfers happen in streamRead, streamWrite, .read, and .write.

I am happy to say I don't want to read the OpenCL spec to learn...

@ riza.guntur

In the PyOpenCL code I posted, these calls are analogous to stream creation:

d_a_buf = cl.Buffer(ctx, mf.READ_ONLY, size=h_a.nbytes)
d_b_buf = cl.Buffer(ctx, mf.READ_ONLY, size=h_b.nbytes)
d_c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=h_c.nbytes)

and these are the stream.read equivalents:

write_events = [cl.enqueue_write_buffer(queue, d_b_buf, h_b)]
write_events+= [cl.enqueue_write_buffer(queue, d_a_buf, h_a)]

cl.wait_for_events(write_events)

 

What I like about OpenCL is the events... they have some nice features compared to the Brook+ async features.

 


Originally posted by: emuller In the PyOpenCL code I posted, these calls are analogous to stream creation: [...] What I like about OpenCL is the events... they have some nice features compared to the Brook+ async features.

emuller, wow, that Python wrapper shortens the OpenCL code fairly well. Thanks; I never thought it would be this short. It's much nicer than the C code, which is too long and which I don't really understand.

Now time to install Python and figure out how to build it.


I rewrote the kernel to use float4 and tested it. Here are the results. I increased the mad count to 200. On 10,000,000 floats it ran in 4.97 ms, so 400 flops * 10e6 / 0.00497 = 804 GFLOPS. The kernel utilized only 4 of the 5 units; one more unit adds 200 GFLOPS, so reaching 1 TFLOPS in OpenCL on a 4870 is possible. The disassembled kernel has 226 instruction slots, so there is still some overhead: 200 of them are MULADD_e, which is 86% useful instructions, and that corresponds to 83% of peak kernel performance.

Also, OpenCL finds only one device; FindNumDevices from the CAL samples also finds one device.

01 TEX: ADDR(1296) CNT(2)
      22  RD_SCATTER R2, DWORD_PTR[0+R0.x], ELEM_SIZE(3) UNCACHED
      23  RD_SCATTER R1, DWORD_PTR[0+R1.x], ELEM_SIZE(3) UNCACHED
02 ALU: ADDR(65) CNT(126)
      24  x: MULADD_e R123.x, (0x40000000, 2.0f).x, R1.x, R2.x
          y: MULADD_e R123.y, (0x40A00000, 5.0f).y, R1.w, R2.w
          z: MULADD_e R123.z, (0x40800000, 4.0f).z, R1.z, R2.z
          w: MULADD_e R123.w, (0x40400000, 3.0f).w, R1.y, R2.y
      25  x: MULADD_e R123.x, (0x40000000, 2.0f).x, PV24.x, R2.x
          y: MULADD_e R123.y, (0x40A00000, 5.0f).y, PV24.y, R2.w
          z: MULADD_e R123.z, (0x40800000, 4.0f).z, PV24.z, R2.z
          w: MULADD_e R123.w, (0x40400000, 3.0f).w, PV24.w, R2.y
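The float4 source itself isn't posted; judging from the per-component constants visible in the dump above, a reconstruction (mine, for illustration only) would look something like:

__kernel void templateKernel4(__global float4 *a, __global float4 *b, __global float4 *c)
{
    uint tid = get_global_id(0);
    float4 x = a[tid];
    float4 y = b[tid];
    // each mad now works on four components at once
    x = mad((float4)(2.0f, 3.0f, 4.0f, 5.0f), x, y);
    x = mad((float4)(2.0f, 3.0f, 4.0f, 5.0f), x, y);
    /* ... 200 mads in total ... */
    c[tid] = x;
}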


@nou

I had the same problem: with the beta driver upgrade, OpenCL only finds one device. I'm using the Linux driver... for the beta OpenCL driver release I had to

export DISPLAY=:0

Then OpenCL finds all 4 devices (2x 4870X2).

If you then

export DISPLAY=:0.X

you get only one device again: that of card X.

That could actually be useful.

 

 

 

 


Originally posted by: nou I have done a quick test on a friend's 4870X2. I tested the kernel below and got an execution time of 10 ms on 10,000,000 floats, so that's 200 GFLOPS.


Maybe there is a dependency problem (you use the result of the previous operation for the next one: the variable x), but I'm not really sure about this. You could try compiling to IL/asm; then we could see what the compiler is doing with this code (a "how to" is available in another thread).

 


Originally posted by: hazeman Maybe there is a dependency problem (you use the result of the previous operation for the next one: the variable x), but I'm not really sure about this. You could try compiling to IL/asm; then we could see what the compiler is doing with this code (a "how to" is available in another thread).


I tried that OpenCL wrapper and it works, but I didn't save the IL code. I'm going to my friend's this weekend, so I'll post the IL asm then.

@jcpalmer: I think it will show up as two devices. I'll have to check; it was a very quick test because my friend didn't have much time.


nou,

Quick question: how many devices does a 4870X2 show up as?

On your test, I am always a little leery about performance numbers coming out of very small programs, especially when they are used to compare hardware from different vendors. It is less concerning for comparing against Brook+ on the exact same hardware. I can envision programs of this type being used by GPU vendor marketing departments to validate the performance sticker they slap on the box.

When it comes to scaling up, yes, I/O is probably your limiting factor. Also, how you solve this problem could be crucial if you wish to deploy on multiple vendors' GPUs. To get the best overall performance from a single version of a program, an RGBA texture sidesteps most of the differences while fetching 16 bytes at a shot. It would be better if one-dimensional textures with a large maximum dimension were available.

Still, there would have to be runtime examination of the device vendor info so that an optimal work size can be determined, as in the sketch below. I am pretty sure a multiple of 32 is not going to be great for ATI. Most of you may not care about other vendors, but that is OpenCL's purpose in life.
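A minimal PyOpenCL sketch of such a runtime check (the vendor substrings and the 64/32 wavefront/warp multiples are my assumptions, for illustration only):

import pyopencl as cl

ctx = cl.Context(dev_type=cl.device_type.GPU)
for dev in ctx.devices:
    vendor = dev.vendor.lower()
    # 64 matches an R7xx wavefront; 32 matches an NVIDIA warp
    if "amd" in vendor or "ati" in vendor or "advanced micro" in vendor:
        multiple = 64
    else:
        multiple = 32
    print dev.name, "-> work-group size multiple:", multiple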

 


Originally posted by: nou I have done a quick test on a friend's 4870X2. I tested the kernel below and got an execution time of 10 ms on 10,000,000 floats, so that's 200 GFLOPS.

Maybe when I use float4 I will get better utilization of the 5D units and four times the performance.


Here is the ISA generated on the 8xx from your kernel:

 

03 TEX: ADDR(82) CNT(2)
      25  VFETCH R2.x___, R0.w, fc154 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
      26  VFETCH R0.x___, R1.y, fc154 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
04 ALU: ADDR(69) CNT(9)
      27  x: MULADD_e R123.x, (0x40800000, 4.0f).x, R0.x, R2.x VEC_201
          t: LSHR R0.x, R1.x, (0x00000002, 2.802596929e-45f).y
      28  w: MULADD_e R123.w, (0x40800000, 4.0f).x, PV27.x, R2.x
      29  z: MULADD_e R123.z, (0x40800000, 4.0f).x, PV28.w, R2.x
      30  x: MULADD_e R2.x, (0x40800000, 4.0f).x, PV29.z, R2.x
05 MEM_RAT_CACHELESS_STORE_RAW: RAT(0)[R0].x___, R2, MARK VPM


Did I read that right: it did not utilize the whole 5D unit?


nou,

A mad counts as two flops, so if you are doing 200 of them, that is 400 flops and your total is 400 GFLOPS. You have an instruction dependency, as is clear from the ISA, so every ALU clause has only one instruction instead of five. If you remove the instruction dependency you can do 5 x 400 GFLOPS = 2 teraflops. It looks like it is using both cards.


Sorry, I made a mistake when I wrote that post. I used only 100 mad() calls in the kernel; I wrote 200 because I know a mad is two flops.

Anyway, I'll try using float4 as input, or not just x and y but x1-x5 and y1-y5, to remove that dependency (see the sketch below).
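A sketch of that independent-chain idea (my reconstruction of what nou describes, not his actual code; I keep a single shared y for brevity):

__kernel void templateKernel5(__global float *a, __global float *b, __global float *c)
{
    uint tid = get_global_id(0);
    float y = b[tid];
    float x1 = a[tid], x2 = x1, x3 = x1, x4 = x1, x5 = x1;
    // five independent chains: with no dependency between them, the
    // VLIW5 unit can fill all five slots of an ALU instruction
    x1 = mad(4.0f, x1, y); x2 = mad(4.0f, x2, y); x3 = mad(4.0f, x3, y);
    x4 = mad(4.0f, x4, y); x5 = mad(4.0f, x5, y);
    /* ... repeat the block above for the desired mad count ... */
    c[tid] = x1 + x2 + x3 + x4 + x5;
}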


Originally posted by: MicahVillmow OpenCL currently only has one, because the OpenCL programming model is pointer based, so all data has to be fully coherent (this is ignoring images, which are read_only or write_only, not both). This does not allow the texture unit to be used in the same way that Brook+/IL can use it. Brook+ does not allow you to alias pointers (unless you explicitly allow it), and in IL you do so at your own risk. Writing to memory and reading from that same memory with the texture unit does not produce deterministic behavior. OpenCL requires that all writes and reads to global memory are coherent, so this approach is not feasible. This is a performance hit compared to a streaming model, because the GPU is natively a streaming device. There is another performance hit for the R7XX, since it was not designed with OpenCL in mind.


Micah, I understand all the difficulties of implementing OpenCL, especially those features that apparently were added just to make implementation harder... But just because OpenCL has a lot of funky features doesn't mean developers will use all of them (and I hope all of us will make wise decisions on that point...). While read/write buffers are useful in some cases, in (many) others they are useless. It's not that difficult for the compiler to check whether a specific pointer is used only for reading or only for writing, or whether its value is changed or merely assigned, and then allow better optimizations (like accessing memory through the TMUs on older hardware). Also, there are the const and restrict keywords to give it some hints.

 


Pointer aliasing resolution is actually very difficult in general. This is why many C compilers have lots of odd optimization flags for it that behave slightly differently. If you cast pointers or embed pointers in structs, things get fun. Also, statically, the compiler doesn't know if 2 arguments are really the same pointer at runtime. Since you can pass in many different buffers, the permutations can get quite large.

Perhaps a better way to describe Brook+ vs OpenCL is that streaming models like Brook+ make memory-based optimizations and scheduling "almost trivial" compared to a general C language.

There are of course likely things that can be done to improve performance as the various OpenCL implementations mature. But it may also take extended compiler options and attribute tags to do it: basically a contract between the programmer and the compiler about how exactly things are used.
