

afo
Adept I

SDK 2.3 and multiGPU

First impressions of multi-GPU with the new SDK on the HD5970

Hi,

I would like to share some insights about multi-GPU in SDK 2.3 (WinXP32 & Linux64). And yes, I know that AMD officially doesn't support the second GPU on the HD5970; I hope they will add support before Christmas 2011...

WinXP32:

cal findnumdevices shows 2 devices; clinfo shows 2 GPUs + 1 CPU (in the opposite order from before: the GPU data now comes first, then the CPU data).

My application has a parameter that selects which GPU to work on: 0 for the first one, 1 for the second, and so on. I used to have 2x HD5970; I moved one to another machine. If I use GPU 0, everything goes fine; if I use GPU 1, the application crashes...
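(For reference, a minimal sketch of how a GPU-index parameter like this typically maps onto clGetDeviceIDs; the names are illustrative and error checking is stripped, so it is not my exact code:)

/* Illustrative only: pick the N-th GPU reported by the platform.
 * gpu_index comes from the command line (0 = first GPU, 1 = second, ...). */
#include <CL/cl.h>
#include <stdio.h>

cl_device_id pick_gpu(cl_platform_id platform, cl_uint gpu_index)
{
    cl_uint num_devices = 0;
    cl_device_id devices[16];

    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
    if (num_devices > 16)
        num_devices = 16;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);

    if (gpu_index >= num_devices) {
        fprintf(stderr, "Requested GPU %u but only %u reported\n",
                gpu_index, num_devices);
        return NULL;
    }
    return devices[gpu_index];   /* on the HD5970, index 1 is where things go wrong */
}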

Linux64

Again, cal findnumdevices shows 2 devices; clinfo shows 2 GPUs + 1 CPU.

If I open 2 terminals, two instances of the application using different GPUs can work and give correct results. But system performance drops: moving a terminal window goes into slow motion (this didn't happen with SDK 2.2), the four processors in my system are working at 80%, and aticonfig shows 30% usage for the first GPU and 70% usage for the second GPU (with minor variations).

I would like to know if someone sees something similar or has different experiences. Thanks a lot.

best regards,

Alfonso

0 Likes
32 Replies
dravisher
Journeyman III

Pretty disappointing that the HD5970 isn't supported yet

Don't know if anyone from AMD can answer this, but will the upcoming dual-GPU card based on Cayman chips (presumably the HD6990) also not work correctly with SDK 2.3? Currently I want to wait for that card, but if I have to wait an unspecified amount of time for dual GPU + OpenCL to work, I'd rather just get an HD6970.

0 Likes

Dravisher,

There are no plans for launching a 6990. Where did you get this information?

 

Edit: Thanks, nou, for the clarification. I was not aware of this at the time.

0 Likes

All over the internet. For example, here: http://news.softpedia.com/news/Dual-GPU-AMD-HD-6990-Pushed-Back-to-Q1-2011-165713.shtml

0 Likes
zeland
Journeyman III

The multi-GPU issue is not solved in SDK 2.3 on Linux x64.

I run two instances of one program on two different 5870 cards.

The two instances run in 550 seconds each in parallel, while a single instance alone takes only 320 seconds.

0 Likes

Originally posted by: himanshu.gautam Dravisher,

 

There are no plans for launching a 6990. Where did you get this information?



That the dual-GPU card will be called HD6990 seems to be the general consensus in the media, like the article nou linked to, but to be fair the only info from AMD is that a dual-GPU card named "Antilles" is coming in Q1 2011. For instance, in the AMD Codename Decoder – November 9, 2010 blog post:

“Antilles”
Market: Discrete GPUs
What is it? AMD Radeon™ HD 6000 Series graphics card for ultra-enthusiasts that will feature two GPUs on one board.
Planned for introduction: Q1 2011


 So I guess the proper question is: will Antilles have both GPUs working properly with OpenCL at release?

0 Likes

No. Why do you think the new dual card will work when the current multi-GPU doesn't?

0 Likes

Originally posted by: nou No. Why do you think the new dual card will work when the current multi-GPU doesn't?

 

Maybe the current dual-GPU card (5970) has a design error that precludes it from being properly supported as two GP-GPUs.

0 Likes

I have a 5970 and I'm using both GPUs with CAL under Linux, where each GPU gets its own controlling *process* (using "multiprocessing" in Python), i.e. I am not using threads. The two GPUs do work concurrently (I do recall host<->device memory copies for the two GPUs seemed to be serialized, though). The results also seem to be OK. I am currently using Catalyst 10.10 and SDK 2.2 with my 5970.

So is it possible it's only with OpenCL that the 5970 is not supported? Or is it possible that you can get it to work using processes (MPI, ..., or in Python, PyOpenCL and "multiprocessing") instead of threads?
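For what it's worth, a rough sketch of the one-process-per-GPU idea in plain C (fork-based rather than Python/multiprocessing; everything here is illustrative, not the code I actually run):

/* Illustrative sketch: one controlling process per GPU via fork().
 * Each child does its own platform/device discovery and builds a context
 * on a single device, so no OpenCL state is shared between processes. */
#include <CL/cl.h>
#include <sys/wait.h>
#include <unistd.h>

static void run_on_gpu(cl_uint gpu_index)
{
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint n = 0;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &n);
    if (gpu_index >= n)
        _exit(1);

    cl_context ctx = clCreateContext(NULL, 1, &devices[gpu_index], NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, devices[gpu_index], 0, &err);
    /* ... build program, enqueue kernels, read back results ... */
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
}

int main(void)
{
    const cl_uint num_gpus = 2;              /* e.g. the two halves of a 5970 */
    for (cl_uint i = 0; i < num_gpus; ++i)
        if (fork() == 0) { run_on_gpu(i); _exit(0); }
    while (wait(NULL) > 0)
        ;                                    /* wait for both workers */
    return 0;
}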

BTW, on my other box using Catalyst 10.12 and the 2.2 SDK, 2D problems such as those described by afo appear: improper scrolling in windows, slow window moving, ... which I could address by enabling desktop effects (rendering is then done by OpenGL, I presume). So this is a 10.12 problem and not a 2.3 problem, I guess. Either try 2.3 with 10.10 or enable desktop effects as a workaround.

 

0 Likes

Originally posted by: nou No. Why do you think the new dual card will work when the current multi-GPU doesn't?

 

Surely if a software fix is so difficult, they would change the design of the next dual-GPU card so that it will work. Hard to say since I know next to nothing about what the problem is, but I would think AMD could fix it in hardware if they wanted to.  For instance if the problem is the always-on crossfire, they could make it easier to toggle crossfire off on the upcoming card.

0 Likes

We have a test machine with 3 5970 cards in it, and our experience is that performance in OpenCL is somewhat unreliable. If CAL is able to handle the devices well, I do not know why OpenCL has such a hard time. It is also my experience that completely independent workloads smartly distributed among GPUs do not finish 2-3-4-5-6 times faster than the single-GPU counterpart. Scaling with COMPLETELY independent threads is abysmal. I use MPI to launch parallel threads for OpenCL multi-GPU usage, but it is starting to anger me that one year hasn't been enough to solve these sorts of issues.

I would accept this if someone at AMD would take the time to appreciate the efforts of all those who struggle with testing and finding the bugs in the SDK and the runtime, and explain what keeps them from solving these problems. Time, financial/programming resources, architecture, or what?

I do not think I live in a dream world when I say one might even expect dual-GPU solutions such as the 5970 to be presented as a single device in OpenCL. (With no global sync available yet, merging the two GPUs at the device level is only a matter of software support for syncing global memory objects and some events across the two GPUs, which could be fast enough, I believe, over the internal CF connection.) However, not even NV could solve this, and their GPGPU programming resources somewhat surpass AMD's. But to see that the 5970 cannot even be used as two separate devices?! I know it is not officially supported, but it would be nice to get some insight into the problems that have been in place for a year now.

0 Likes

Merging two GPUs into one device is not possible. Imagine that one work-item can read and write the whole memory, while each GPU has its own separate memory. How do you want to merge the written results from the two GPUs' RAM?

You would have to make some assumptions about how each work-item writes the result of its calculation, but those assumptions would be incompatible with the OpenCL specification.

0 Likes

No. Multi-GPU on Linux does not work at all. I tested all architectures: single context/single thread, multiple contexts/single thread, and multiple contexts/multiple threads. None of them worked. The 2nd GPU of the 5970 also does not work.

0 Likes

Originally posted by: quadboon No. Multi-GPU on Linux does not work at all. I tested all architectures: single context/single thread, multiple contexts/single thread, and multiple contexts/multiple threads. None of them worked. The 2nd GPU of the 5970 also does not work.

 

That's not 100% true. We have a system with 2 HD5870s on an X58, running 10.12 & SDK 2.3. While the OpenCL implementation doesn't seem to be fully reentrant (I understand from the specification it should be), I was able to get a speed-up on a multithreaded application. I used OpenMP critical sections liberally on functions such as clCreateBuffer() or clCreateCommandQueue(), and it eventually worked (obviously, the kernel launch wasn't in a critical section). The speed-up was x2, as this test code is embarrassingly parallel and spends 99.99% of its time in the kernel (medium test case: the kernel takes 2.5 *hours* ... on both GPUs in parallel :-).

You may want to try again with the environment variable "GPU_USE_SYNC_OBJECTS" set to 1. This alternative codepath is much better in my experience.
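To make the pattern concrete, here is a rough sketch (assumed names, not the actual test code) of the OpenMP arrangement, assuming a context created over both devices; GPU_USE_SYNC_OBJECTS itself is simply set in the environment before launching the program.

/* Rough sketch of the pattern described above (illustrative only):
 * OpenCL object creation is serialized with OpenMP critical sections
 * because the runtime did not appear fully reentrant, while the kernel
 * launch itself stays outside any critical section. */
#include <CL/cl.h>
#include <omp.h>

void launch_on_all_gpus(cl_context ctx, cl_program prog,
                        cl_device_id *gpus, int num_gpus, size_t global_size)
{
    #pragma omp parallel num_threads(num_gpus)
    {
        int i = omp_get_thread_num();
        cl_int err;
        cl_command_queue queue;
        cl_kernel kern;
        cl_mem buf;

        #pragma omp critical              /* serialize the create calls */
        {
            queue = clCreateCommandQueue(ctx, gpus[i], 0, &err);
            kern  = clCreateKernel(prog, "my_kernel", &err);   /* hypothetical kernel name */
            buf   = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 1 << 20, NULL, &err);
        }

        clSetKernelArg(kern, 0, sizeof(cl_mem), &buf);
        /* the launch is deliberately NOT inside a critical section */
        clEnqueueNDRangeKernel(queue, kern, 1, NULL, &global_size,
                               NULL, 0, NULL, NULL);
        clFinish(queue);

        #pragma omp critical              /* serialize the release calls too */
        {
            clReleaseMemObject(buf);
            clReleaseKernel(kern);
            clReleaseCommandQueue(queue);
        }
    }
}

Build it like any OpenMP + OpenCL program, e.g. gcc -fopenmp file.c -lOpenCL on Linux.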

 

0 Likes

Thanks for the info, Melkhior; I might just try that one also. BTW, is there any collection of implementation-specific env variables, such as GPU_USE_SYNC_OBJECTS or GPU_MAX_HEAP_SIZE and the like? I would experiment with other things too, if possible.

0 Likes

I can confirm multi-GPU on Linux works at nearly 100% speed when setting GPU_USE_SYNC_OBJECTS to 1.

0 Likes

Originally posted by: Melkhior

 

You may want to try again with the environment variable "GPU_USE_SYNC_OBJECTS" set to 1. This alternative codepath is much better in my experience.



strings libatiocl64.so | grep GPU_
-D__GPU__=1
DEBUG_GPU_FLAGS
GPU_MAX_COMMAND_QUEUES
GPU_COMPILER_BACKEND_OPTIONS
GPU_MEMORY_COHERENCY
GPU_INTEROP_EMULATION
GPU_COMPILER_OPTIONS
GPU_MAX_WORKGROUP_SIZE
GPU_DOUBLE_PRECISION
GPU_DEVICE_ORDINAL
GPU_REPORT_EXTENSIONS
GPU_INITIAL_HEAP_SIZE
GPU_MAX_HEAP_SIZE
GPU_HEAP_GROWTH_INCREMENT
GPU_STAGING_BUFFER_SIZE
GPU_DUMP_DEVICE_KERNEL
GPU_BINARY_DUMP_FLAGS
GPU_BUFFER_ALIGNMENT
GPU_BLIT_ENGINE_TYPE
GPU_FLUSH_ON_EXECUTION
GPU_USE_SYNC_OBJECTS
GPU_USE_NEWLIB
GPU_ZERO_COPY_ENABLE
GPU_OPEN_VIDEO
OCL_GPU_NOINLINE
GPU_GLOBAL_RETURN_BUFFER
GPU_PRE_RA_SCHED
GPU_NEW_ALLOC_SCHEME
GPU_BARRIER_DETECTION
GPU_PINNED_XFER_SIZE
GPU_DEBUG_MODE
GPU_ARENA_SEGMENT_SUPPORT
GPU_DISABLE_RAW_UAV

0 Likes

Hi,

This is an update using 2xHD5970:

- Adding "GPU_USE_SYNC_OBJECTS" does some magic and 2 instances run at about 70/80% each one (numbers vary); but if I launch 3 instances, the 3rd runs at 40%; when I launched 4 instances, the system crashed violently (maybe the PSU was not enought for the 4 GPUs). This is independant of what combination of GPUs it uses.

- I see no time difference between running an instance alone and running it with another instance in parallel; that is consistent with the GPU usage reported by aticonfig.

So I believe that in SDK 2.3 we will not see a drop in performance if we use GPU_USE_SYNC_OBJECTS and we limit multi-GPU to 1x HD5970 using both GPUs (plus the other good practices we have discovered so far...).

best regards,

Alfonso

0 Likes

Thanks, nou. I have also found that the env vars are inside the atiocl binary; I just don't know if there are any without the "GPU" string in them.

I will certainly play around with them, although I'm not quite sure at first glance what some of them mean.

GPU_DEBUG_MODE, for example, or GPU_MEMORY_COHERENCY. It would really be nice if there were some way for the GPU to report segmentation faults, but I guess that will be left for CPU testing. Also, GPU_BARRIER_DETECTION sounds very useful, though I have no idea what happens if it is set to zero. And in what way would GPU_DOUBLE_PRECISION set to 1 on a Juniper cause the world to blow up...

Anyhow, thanks for the list and I'll try to play around a little.

0 Likes

Is GPU_USE_SYNC_OBJECTS supposed to work on Win 7? Or is this only for Linux?

0 Likes

Why would any assumptions have to be made? OpenCL defines how memory objects are handled and, if one is modified, when its changes become visible to other contexts or work-items.

Inside a context, where the resource is considered to be a single GPU, what sort of memory relaxation is present? mem_fence is the only point at which memory consistency has to be enforced. (Now that DMA is usable, command-queue syncing by the developer is needed to ensure that data is ready to be read from __global before kernels are launched.)
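(A tiny kernel, just to pin down what mem_fence does and doesn't give you; this is my own illustration, not something from the spec:)

/* mem_fence only orders THIS work-item's own loads/stores: the write to
 * data[] is committed before the write to flag[]. Another work-item on the
 * same device might observe that ordering; a second GPU never will, because
 * it has its own separate copy of both buffers. */
__kernel void publish(__global int *data, __global volatile int *flag)
{
    int id = get_global_id(0);
    data[id] = id * 2;                 /* produce a result            */
    mem_fence(CLK_GLOBAL_MEM_FENCE);   /* order the data write first  */
    flag[id] = 1;                      /* then signal it is available */
}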

If you were to have two devices considered as one, it would only be a matter of making mem_fences span all the compute units, not just those inside one GPU. This is surely solvable, although I cannot tell how self-defeating it would be. If it takes too long, then there is no benefit to having twice as much compute power that appears as a single device. But if mem_fence calls across the internal CF only took 2-3 times longer, one might consider using CF across devices.

I have asked in another topic what the "global synchronization registers" visible on the architectural diagrams of the 5xxx and 6xxx cards are used for, but no one knew the answer. I suspect memory read/write syncing is done there, and that by merging GPUs one would have to make those registers readable/writable by both GPUs, but this is only a guess. It would be nice if someone with relevant knowledge told me why this is such a rough idea.

The only thing coming to my mind that nou might be referring to is that it cannot be foretold at compile time whether the memory objects a kernel receives reside in the VRAM of that very GPU or (partially or entirely) in the other GPU's VRAM. This way, kernel read/write commands cannot be compiled properly. One solution to this problem could be that merging GPUs via CF does not double the available memory; rather, it is "mirrored".

I know it is not a simple mirror, because different work-items modify different parts of memory, and thus syncing these memory objects might become cumbersome to manage. But even this issue, and the compile-time definition of read/write commands, could be solved if the merging of memory could be done somewhat like a striped RAID, for example.

I do realise this last statement is quite wild and would impose significant overhead on memory commands, but smart drivers could do great magic. (Most likely this last idea is defeated by the fact that onboard memory controllers do not have the wiring or compute capacity to calculate RAID-like functions, even though a simple stripe does not require much compute power.)

Any thoughts?

0 Likes

AFAIK each GPU has separate RAM, so any buffer that is needed on both GPUs must be copied to both GPUs. So take the simple kernel I attached below.

index is a read-only buffer; out is write-only. index contains the numbers from 0 to N in random order. Now you launch this kernel with a global work size of N.

Now suppose you have a dual-GPU OpenCL device: for example, the first half of the work-items is launched on the first GPU and the second half on the second GPU. After they finish, you must MERGE the two distinct buffers from both GPUs, and here is the problem: what was written by the first GPU and what by the second?

The main problem is the separate memory.

 

__kernel void k(__global int *index, __global int *out)
{
    int id = get_global_id(0);
    out[index[id]] = id;
}
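To make the merge problem concrete, here is a hedged host-side sketch (all names are illustrative) of what splitting that kernel over two devices would look like; the last step is exactly where it falls apart:

/* Illustrative only: run the first half of the NDRange on GPU 0 and the
 * second half on GPU 1, each with its own full-size copy of "out".
 * Because index[] is a random permutation, each GPU scatters its writes
 * across the WHOLE of out[], so the two read-back copies cannot simply be
 * concatenated -- merging them would require re-doing the index[] lookup
 * on the host. This is the separated-memory problem. */
#include <CL/cl.h>

void split_scatter(cl_command_queue q0, cl_command_queue q1,
                   cl_kernel kern0, cl_kernel kern1,
                   cl_mem out0, cl_mem out1,
                   int *host0, int *host1, size_t N)
{
    size_t half = N / 2;
    size_t offset1 = half;

    /* first half of the work-items on GPU 0, second half on GPU 1 */
    clEnqueueNDRangeKernel(q0, kern0, 1, NULL,     &half, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, kern1, 1, &offset1, &half, NULL, 0, NULL, NULL);

    /* each device hands back its own full-size copy of out */
    clEnqueueReadBuffer(q0, out0, CL_TRUE, 0, N * sizeof(int), host0, 0, NULL, NULL);
    clEnqueueReadBuffer(q1, out1, CL_TRUE, 0, N * sizeof(int), host1, 0, NULL, NULL);

    /* host0 and host1 now each contain ~N/2 valid entries at unknown
     * positions; which positions belong to which GPU is not recoverable
     * from the buffers alone. */
}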

0 Likes

You've got a point. This is what I had in mind when I said it would become cumbersome to manage such memory objects; keeping track of all modifications would take far too long.

However, what about striping data over the two devices' memories? Wouldn't that work? I'm sure that over the CF connection there are serious optimizations for reaching data and other resources.

0 Likes

Unless documented, I would not rely on the behavior of any environment variables as they are not guaranteed to exist in a future release.
0 Likes

Of course... but I have had a 5970 for developing OpenCL apps for almost a year, and I'm desperate to use the hardware.

0 Likes

I agree with you, Micah; I do not like to rely on undocumented solutions either. To phrase it harshly, I would say we do not have any other choice.

It is somewhat unclear to me why AMD seems to put zero effort into making dual-GPU solutions such as the 5970 work properly. It is as if AMD were deliberately sabotaging its own ability to enter the GPGPU market. The different strategies NV and AMD employ (monolithic vs. smaller dies) predestine AMD to create dual-GPU solutions in order to keep up with NV Tesla card performance.

I have asked this earlier, and I will ask it again: what is the hardest part of creating proper software support for these cards, that has prevented AMD from delivering proper drivers and libraries for over a year now? Do not take this personally; we would just like to know whether we should invest any more time in trying to create decent multi-GPU programs (which is clearly the future of HPC), or whether we should go over to the dark side of the force.

So to reflect on Micah's point: it is indeed not a neat solution, but I am furious about this SYNC_OBJECTS variable, because I have had a paper messed up because of it. I developed a program in a group, and everything worked fine last summer. This winter, however, we wanted to finish our project; I added tons of functionality to the program, and scaling broke. I tried to debug the program but could not. It turns out it was the doing of SDK 2.2, which introduced this variable; I had developed the first part of my program with 2.1. I completely forgot that I had changed SDKs since, but even if I hadn't, not even in my wildest nightmares would I have imagined the new SDK would do such a thing. So it is true that using undocumented variables is unreliable, but these variables prevent us from a) working and b) trying to get AMD where it wants to go.

I am being overly polite when I say that a little more documentation and support would be appreciated.

0 Likes

Micah,

thank you for your answer. I am sorry if I seemed to be on the attack, but I think my frustration is understandable. This issue rendered weeks of my time meaningless.

We are all programmers on this forum, so we all know the pain of documenting things; however, we are individuals and AMD is a company, so expectations differ somewhat. Documenting even preview features would be wise, because SDKs are released roughly every 6 months. If nobody had explained these unannounced variables, I might just have trashed my cards as useless in 6 months' time. (There are fellow shader-programming friends of mine who keep telling me they would have given up this constant fight with the API and used CUDA instead. Not cross-vendor, but stable. I do research; they do deadline work. They need rock stability; I have the liberty of trying new things, to some extent.)

My point is, hiding changes of this magnitude in the SDK from programmers is highly risky and can cause a lot of frustration. Taking the 30 minutes to document preview or experimental variables might take away the element of surprise when introducing new features in an upcoming SDK, but it can lower the blood pressure of many people around the globe.

Somewhat in connection with what Rick has said: I understand no insider info can be released as to why dual-GPU cards don't work, but some very general (even content-free) explanation would be nice too. Memory consistency, kernel handling, synchronization, clock issues, or whatever... these probably do not fall into the industrial-secret category. If I had to guess, it would be some hard-wired optimization installed to increase efficiency while gaming (in CrossFire) that is hard to undo in software for OpenCL, even when CrossFire is disabled.

0 Likes

Meanwhile, let me link another topic under General Discussions which I believe holds greater community interest than the number of replies suggests.

http://forums.amd.com/devforum/messageview.cfm?catid=203&threadid=143068&enterthread=y

General Discussions seems to be less frequented, but I believe the OpenCL community is interested in actual implementations of dense GPU servers, and might even have some constructive criticism.

0 Likes

So, what is the latest status on supporting dual-GPU cards in SDK 2.3?

Will SDK 2.3 work with the XFX HD-597X-ENFN, which is a dual-GPU card with 2x1600 cores?

Are things stable?

Is it possible to use 4 such dual-GPU cards in a server and access all 8 GPUs?

How about using 8 of them and accessing all 16 GPUs?

0 Likes

Originally posted by: mike3 So, what is the latest status on supporting dual-GPU cards in SDK 2.3?

 

Will SDK 2.3 work with the XFX HD-597X-ENFN, which is a dual-GPU card with 2x1600 cores?

 

Are things stable?

There are known issues with multi-GPU; we have been improving multi-GPU support.

 

Is it possible to use 4 such dual-GPU cards in a server and access all 8 GPUs?

 

How about using 8 of them and accessing all 16 GPUs?

 

You should be able to access all GPUs, though you may not be able to use them all efficiently.

0 Likes

The maximum is currently 8 GPUs; with more, the X server segfaults. But there is currently some issue with the 5970 and OpenCL; under CAL they run correctly.

http://blog.zorinaq.com/?e=46

0 Likes

Meteorhead,
We are working hard at getting the dual-GPU solutions working with our SDK. We cannot give out specifics on what has been holding back full support, but we have been making steady progress and should allow developers to fully utilize these cards in the near future.

I have also put in a request to update the documentation for the environment variables. I cannot promise anything, as many of them are testing or experimental features; if a feature gets past that stage, it gets documented if required and the environment variable usually gets removed.
0 Likes

If I had to guess, I would say supporting the 5970 is hard because both devices are fused in Crossfire mode. I'm not totally sure on the specifics of how SLI/Crossfire work, but I imagine that the problem comes with memory allocation and kernel launching.

In order to render different frames in parallel (or different parts of the screen), both GPUs must have an entire copy of the scene. This would mean that allocations and data copies are broadcast to both GPUs by the backend. Also, I think both GPUs run the same shader at the same time. I'm not sure how much of this is handled by the hardware and how much by the driver, but the fact that Micah has said it can be done leads me to believe it's mostly done at the driver level.

OpenCL is still in its infancy and there are still kinks and shortcomings in both NVIDIA and AMD's implementations. Time is the only thing that's going to make them better.

0 Likes