32 Replies Latest reply on Feb 15, 2011 6:19 AM by nou

    SDK 2.3 and multiGPU

    afo
      First impressions on multiGPU for the new SDK in HD5970

      Hi,

I would like to share some insights about multi-GPU in SDK 2.3 (WinXP32 & Linux64). And yes, I know that AMD officially doesn't support the second GPU on the HD5970; I hope they will add support before Christmas 2011...

      WinXP32:

cal findnumdevices shows 2 devices; clinfo shows 2 GPUs + 1 CPU (in inverse order: now the GPU data comes first, then the CPU data)

My application has a parameter that selects which GPU to work on: 0 for the first one, 1 for the second, and so on. I used to have 2xHD5970 but moved one to another machine. If I use GPU 0, everything goes fine; if I use GPU 1, the application crashes...

      Linux64

      Again, cal findnumdevices shows 2 devices; clinfo shows 2GPUs + 1 CPU

If I open 2 terminals, two instances of the application using different GPUs can work and give correct results. But system performance drops: moving a terminal window goes in slow motion (this didn't happen with SDK 2.2), the four processors in my system are working at 80%, and aticonfig shows 30% usage for the first GPU and 70% for the second (with minor variations).

I would like to know if anyone sees something similar or has had different experiences. Thanks a lot.

      best regards,

      Alfonso

        • SDK 2.3 and multiGPU
          dravisher

          Pretty disappointing that the HD5970 isn't supported yet

Don't know if anyone from AMD can answer this, but will the upcoming dual-GPU card based on Cayman chips (presumably the HD6990) also not work correctly with SDK 2.3? Currently I want to wait for that card, but if I have to wait an unspecified amount of time for dual GPU + OpenCL to work, I'd rather just get an HD6970.

            • SDK 2.3 and multiGPU
              himanshu.gautam

              Dravisher,

There are no plans for launching a 6990. Where did you get this information?

               

Edit: Thanks, nou, for the clarification. I was not aware of this at the time.

                  • SDK 2.3 and multiGPU
                    zeland

The multi-GPU issue is not solved in SDK 2.3 on Linux x64.

I run two instances of one program on two different 5870 cards.

The two instances take 550 seconds each when run in parallel, while one instance alone takes only 320 seconds.

                  • SDK 2.3 and multiGPU
                    dravisher

                     

Originally posted by: himanshu.gautam

Dravisher,

There are no plans for launching a 6990. Where did you get this information?



That the dual-GPU card will be called HD6990 seems to be the general consensus in the media, like the article nou linked to; to be fair, though, the only official info from AMD is that a dual-GPU card named "Antilles" is coming in Q1 2011. For instance, from the AMD Codename Decoder – November 9, 2010 blog:

                    “Antilles”
                    Market: Discrete GPUs
                    What is it? AMD Radeon™ HD 6000 Series graphics card for ultra-enthusiasts that will feature two GPUs on one board.
                    Planned for introduction: Q1 2011

                     


                     So I guess the proper question is: will Antilles have both GPUs working properly with OpenCL at release?

                      • SDK 2.3 and multiGPU
                        nou

No. Why do you think the new dual card will work when the current multi-GPU doesn't?

                          • SDK 2.3 and multiGPU
                            d.a.a.

                             

Originally posted by: nou No. Why do you think the new dual card will work when the current multi-GPU doesn't?

                             

                            Maybe the current dual-GPU card (5970) has a design error that precludes it from being properly supported as two GP-GPUs.

                              • SDK 2.3 and multiGPU
                                emuller

I have a 5970 and I'm using both GPUs via CAL under Linux, where each GPU gets its own controlling *process* (using "multiprocessing" in Python), i.e. I am not using threads. The two GPUs do work concurrently (though I recall the host<->device memory copies for the two GPUs seemed to be serialized). The results also seem to be OK. I am currently using Catalyst 10.10 and SDK 2.2 with my 5970.

So is it possible that it's only with OpenCL that the 5970 is unsupported? And could you get it to work using processes (MPI, ... or in Python, PyOpenCL and "multiprocessing") instead of threads?
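The one-controlling-process-per-GPU pattern described above can be sketched with Python's standard "multiprocessing" module. The OpenCL calls are left as comments (a hypothetical sketch: they would need PyOpenCL and an actual multi-GPU box, so the device indices and the placeholder arithmetic here are purely illustrative):

```python
import multiprocessing as mp

def worker(device_index, data, result_queue):
    """Controlling process for one GPU; no threads involved.

    The commented lines show where the per-device OpenCL setup would go
    (hypothetical -- requires pyopencl and a multi-GPU system):
    # import pyopencl as cl
    # dev = cl.get_platforms()[0].get_devices()[device_index]
    # ctx = cl.Context([dev])
    # queue = cl.CommandQueue(ctx)
    # ... build program, copy buffers, enqueue kernel on this device ...
    """
    result = [x * x for x in data]  # placeholder for the kernel's work
    result_queue.put((device_index, result))

if __name__ == "__main__":
    q = mp.Queue()
    chunks = {0: [1, 2, 3], 1: [4, 5, 6]}  # half the workload per GPU
    procs = [mp.Process(target=worker, args=(i, chunks[i], q)) for i in (0, 1)]
    for p in procs:
        p.start()
    results = dict(q.get() for _ in procs)
    for p in procs:
        p.join()
    print(results[0], results[1])  # [1, 4, 9] [16, 25, 36]
```

Because each process owns its own context, nothing in the OpenCL runtime is shared between the GPUs, which is presumably why this pattern sidesteps the threading problems discussed in this thread.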

BTW, on my other box, using 10.12 and the 2.2 SDK, 2D problems such as afo describes appear: improper scrolling in windows, slow window moving, etc. I could address them by enabling desktop effects (rendering is then done by OpenGL, I presume). So this is a 10.12 problem and not a 2.3 problem, I guess. Either try 2.3 with 10.10 or enable desktop effects as a workaround.

                                 

                              • SDK 2.3 and multiGPU
                                dravisher

                                 

Originally posted by: nou No. Why do you think the new dual card will work when the current multi-GPU doesn't?

                                 

Surely if a software fix is so difficult, they would change the design of the next dual-GPU card so that it works. Hard to say, since I know next to nothing about what the problem is, but I would think AMD could fix it in hardware if they wanted to. For instance, if the problem is the always-on Crossfire, they could make it easier to toggle Crossfire off on the upcoming card.

                                  • SDK 2.3 and multiGPU
                                    Meteorhead

We have a test machine with 3 5970 cards in it, and our experience is that performance is somewhat unreliable in OpenCL. If CAL is able to handle the devices well, I do not know why OpenCL has such a hard time. It is also my experience that completely independent workloads, smartly distributed among GPUs, do not finish 2-3-4-5-6 times faster than the single-GPU counterpart. Scaling with COMPLETELY independent threads is abysmal. I use MPI to launch parallel threads for OpenCL multi-GPU usage, but it is also starting to anger me that one year hasn't been enough to solve these sorts of issues.

I could accept this if someone at AMD would take the time to appreciate the efforts of all those who struggle with testing and finding bugs in the SDK and the runtime, by explaining what keeps them from solving these problems. Time, financial/programming resources, architecture, or what?

I don't think I live in a dreamworld when I say one might even expect dual-GPU solutions such as the 5970 to appear as a single device in OpenCL. (There is no global sync available yet; merging the two GPUs at the device level is only a matter of software support for syncing global memory objects and some events across the two GPUs, which could be fast enough, I believe, over the internal CF connection.) However, not even NV has solved this, and their GPGPU programming resources somewhat surpass AMD's. But to see that the 5970 cannot even be used as two separate devices?! I know it is not officially supported, but it would be nice to get some insight into the problems that have been in place for a year now.

                                      • SDK 2.3 and multiGPU
                                        nou

Merging two GPUs into one device is not possible. Consider that any work-item can read and write the whole of memory, yet each GPU has its own separate RAM. How would you merge the written results from the two GPUs' RAM?

You would have to make assumptions about how each work-item writes its results, but those assumptions would be incompatible with the OpenCL specification.

                                          • SDK 2.3 and multiGPU
                                            quadboon

No. Multi-GPU on Linux does not work at all. I tested all architectures: single context/single thread, multiple contexts/single thread, and multiple contexts/multiple threads. None of them worked. The 2nd GPU of the 5970 also does not work.

                                              • SDK 2.3 and multiGPU
                                                Melkhior

                                                 

Originally posted by: quadboon No. Multi-GPU on Linux does not work at all. I tested all architectures: single context/single thread, multiple contexts/single thread, and multiple contexts/multiple threads. None of them worked. The 2nd GPU of the 5970 also does not work.

                                                 

That's not 100% true. We have a system with 2 HD5870s on an X58, running 10.12 & SDK 2.3. While the OpenCL implementation doesn't seem to be fully reentrant (I understand from the specification it should be), I was able to get a speed-up on a multithreaded application. I used OpenMP critical sections liberally around functions such as clCreateBuffer() and clCreateCommandQueue(), and it eventually worked (obviously, the kernel launch wasn't in a critical section). The speed-up was x2, as this test code is embarrassingly parallel and spends 99.99% of its time in the kernel (on the medium test case, the kernel takes 2.5 *hours* ... on both GPUs in parallel :-).

You may want to try again with the environment variable "GPU_USE_SYNC_OBJECTS" set to 1. This alternative codepath is much better in my experience.
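For what it's worth, the variable has to be visible to the process before the OpenCL runtime is first initialized. A minimal sketch (assuming, as this thread does, that the AMD runtime reads an undocumented GPU_USE_SYNC_OBJECTS variable whose behavior is not guaranteed):

```python
import os

# Set the (undocumented, unsupported) switch before any OpenCL call is made,
# i.e. before the first clGetPlatformIDs() or before importing a wrapper
# such as pyopencl. The variable name comes from this thread, not from
# official AMD documentation.
os.environ["GPU_USE_SYNC_OBJECTS"] = "1"

# ... create platforms/contexts/queues only after this point ...
print(os.environ["GPU_USE_SYNC_OBJECTS"])  # 1
```

Alternatively, export it in the shell before launching the application; either way, it must happen before the runtime library initializes.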

                                                 

                                                  • SDK 2.3 and multiGPU
                                                    Meteorhead

Thanks for the info, Melkhior, I might just try that one too. BTW, is there any collection of implementation-specific env variables, such as GPU_USE_SYNC_OBJECTS or GPU_MAX_HEAP_SIZE and the like? I would experiment with other things too if possible.

                                                    • SDK 2.3 and multiGPU
                                                      nou

                                                       

                                                      Originally posted by: Melkhior

                                                       

You may want to try again with the environment variable "GPU_USE_SYNC_OBJECTS" set to 1. This alternative codepath is much better in my experience.



                                                      strings libatiocl64.so | grep GPU_
                                                      -D__GPU__=1
                                                      DEBUG_GPU_FLAGS
                                                      GPU_MAX_COMMAND_QUEUES
                                                      GPU_COMPILER_BACKEND_OPTIONS
                                                      GPU_MEMORY_COHERENCY
                                                      GPU_INTEROP_EMULATION
                                                      GPU_COMPILER_OPTIONS
                                                      GPU_MAX_WORKGROUP_SIZE
                                                      GPU_DOUBLE_PRECISION
                                                      GPU_DEVICE_ORDINAL
                                                      GPU_REPORT_EXTENSIONS
                                                      GPU_INITIAL_HEAP_SIZE
                                                      GPU_MAX_HEAP_SIZE
                                                      GPU_HEAP_GROWTH_INCREMENT
                                                      GPU_STAGING_BUFFER_SIZE
                                                      GPU_DUMP_DEVICE_KERNEL
                                                      GPU_BINARY_DUMP_FLAGS
                                                      GPU_BUFFER_ALIGNMENT
                                                      GPU_BLIT_ENGINE_TYPE
                                                      GPU_FLUSH_ON_EXECUTION
                                                      GPU_USE_SYNC_OBJECTS
                                                      GPU_USE_NEWLIB
                                                      GPU_ZERO_COPY_ENABLE
                                                      GPU_OPEN_VIDEO
                                                      OCL_GPU_NOINLINE
                                                      GPU_GLOBAL_RETURN_BUFFER
                                                      GPU_PRE_RA_SCHED
                                                      GPU_NEW_ALLOC_SCHEME
                                                      GPU_BARRIER_DETECTION
                                                      GPU_PINNED_XFER_SIZE
                                                      GPU_DEBUG_MODE
                                                      GPU_ARENA_SEGMENT_SUPPORT
                                                      GPU_DISABLE_RAW_UAV

                                                        • SDK 2.3 and multiGPU (2xHD5970)
                                                          afo

                                                          Hi,

                                                          This is an update using 2xHD5970:

- Adding "GPU_USE_SYNC_OBJECTS" does some magic, and 2 instances run at about 70-80% each (numbers vary); but if I launch 3 instances, the 3rd runs at 40%; when I launched 4 instances, the system crashed violently (maybe the PSU was not enough for the 4 GPUs). This is independent of which combination of GPUs is used.

- I see no time difference between running an instance alone and running it with another instance in parallel, which is consistent with the GPU usage reported by aticonfig.

So I believe that with SDK 2.3 we will not see a drop in performance if we use GPU_USE_SYNC_OBJECTS and limit multi-GPU use to 1xHD5970 using both GPUs (along with the other good practices we have discovered so far...).

                                                          best regards,

                                                          Alfonso

                                                          • SDK 2.3 and multiGPU
                                                            Meteorhead

Thanks nou, I too have found that the env vars are inside the atiocl binary; I just don't know if there are any without the "GPU" string in them.

I will certainly play around with them, although at the outset I'm not quite sure what some of them could mean.

GPU_DEBUG_MODE, for example, or GPU_MEMORY_COHERENCY. It would be really nice if there were some way for the GPU to report segmentation faults, but I guess that will be left for CPU testing. Also, GPU_BARRIER_DETECTION sounds very useful; no idea what happens if it is set to zero. Or in what way the world would blow up if GPU_DOUBLE_PRECISION were set to 1 on a Juniper...

                                                            Anyhow, thanks for the list and I'll try to play around a little.

                                                      • SDK 2.3 and multiGPU
                                                        Meteorhead

Why would any assumptions have to be made? OpenCL defines how memory objects are handled: if one is modified, when its changes become visible to other contexts or work-items.

Inside a context, where the resource is considered to be a single GPU, what sort of memory relaxation is present? mem_fence is the only point at which memory consistency has to be enforced. (Now that DMA is usable, command-queue syncing is needed by the developer to ensure that data is ready to be read from __global before kernels are launched.)

If two devices were considered as one, it would only be a matter of making mem_fences span all the compute units, not just those inside one GPU. This is surely solvable, although I cannot tell how self-defeating it would be. If it takes too long, then there is no benefit in having twice as much compute power appear as a single device. But if mem_fence calls across the internal CF took only 2-3 times longer, one might consider using CF across devices.

I have asked in another topic what the "global synchronisation registers" visible on the architectural diagrams of the 5xxx and 6xxx cards are used for, but no one knew the answer. I suspect memory read/write syncing is done there, and that merging GPUs would require declaring those registers read/writeable by both GPUs, but this is only a guess. It would be nice if someone with relevant knowledge told me why this is such a rough idea.

The only thing that comes to my mind, which nou might be referring to, is that it cannot be foretold at compile time whether the memory objects a kernel receives reside in the VRAM of that very GPU or (partially or entirely) in the other GPU's VRAM. This way, kernel read/write commands cannot be compiled properly. One solution could be that merging GPUs via CF does not double the available memory; rather, it is "mirrored".

I know it is not a simple mirror, because different work-items modify different parts of memory, and syncing these memory objects might become cumbersome to manage; but even this issue, and the compile-time definition of read/write commands, could be solved if the merging of memory were done somewhat like a striped RAID, for example.

I do realise this last statement is quite wild and would impose significant overhead on memory commands, but smart drivers could do great magic. (Most likely this last idea is defeated by the fact that onboard memory controllers do not have the wiring or compute capacity to calculate RAID-like functions, even though a simple stripe does not require much compute power.)

                                                        Any thoughts?

                                                          • SDK 2.3 and multiGPU
                                                            nou

AFAIK each GPU has separate RAM, so any buffer needed on both GPUs must be copied to both. So consider the simple kernel I attached below.

index is a read-only buffer; out is write-only. index contains the numbers from 0 to N in random order. Now you launch this kernel with a global work size of N.

Now suppose you have a dual-GPU OpenCL device, so, for example, the first half of the work-items is launched on the first GPU and the second half on the second. After they finish, you must MERGE the two distinct buffers from the two GPUs, and here is the problem: which entries were written by the first GPU and which by the second?

The main problem is the separate memory.

                                                             

__kernel void k(__global int *index, __global int *out)
{
    int id = get_global_id(0);
    out[index[id]] = id; /* scattered write: the target location depends on the data */
}
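The merge problem nou describes can be illustrated with a toy host-side simulation in Python (purely an illustration of the bookkeeping, not real runtime code; the -1 sentinel trick only works here because this particular kernel writes each output element exactly once, which a driver cannot assume for arbitrary kernels):

```python
import random

def simulate_two_gpus(n):
    """Simulate the scatter kernel above, split across two GPUs
    that each hold a private copy of the 'out' buffer."""
    UNTOUCHED = -1
    index = list(range(n))
    random.shuffle(index)  # numbers 0..n-1 in random order, as described

    out_gpu0 = [UNTOUCHED] * n  # copy of 'out' in GPU 0's RAM
    out_gpu1 = [UNTOUCHED] * n  # copy of 'out' in GPU 1's RAM

    for wi in range(0, n // 2):   # first half of the work-items on GPU 0
        out_gpu0[index[wi]] = wi  # out[index[id]] = id
    for wi in range(n // 2, n):   # second half on GPU 1
        out_gpu1[index[wi]] = wi

    # Host-side merge: keep whichever copy was actually written. A driver
    # has no such sentinel in general -- that is exactly nou's point.
    return [a if a != UNTOUCHED else b for a, b in zip(out_gpu0, out_gpu1)]

merged = simulate_two_gpus(8)
print(sorted(merged))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Every work-item id lands somewhere in the merged buffer, but only because the host can distinguish written from unwritten elements, which the OpenCL memory model does not let an implementation assume.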

                                                              • SDK 2.3 and multiGPU
                                                                Meteorhead

You've got a point. This is what I had in mind when I said it would become cumbersome to manage such memory objects; keeping track of all modifications would take far too long.

However, what about striping data over the two devices' memory locations? Wouldn't that be able to work? I'm sure that over the CF connection there are serious optimizations for reaching data and other resources.

                                              • SDK 2.3 and multiGPU
                                                MicahVillmow
                                                Unless documented, I would not rely on the behavior of any environment variables as they are not guaranteed to exist in a future release.
                                                  • SDK 2.3 and multiGPU
                                                    ivalylo

Of course... but I have had a 5970 for developing OpenCL apps for almost a year, and I'm desperate to use the hardware.

                                                    • SDK 2.3 and multiGPU
                                                      Meteorhead

I agree with you, Micah; I do not like to rely on undocumented solutions either. If I were to phrase it harshly, I would say we do not have any other choice.

It is somewhat unclear to me why AMD seems to put zero effort into making dual-GPU solutions such as the 5970 work properly. It is as if AMD were deliberately sabotaging its own ability to enter the GPGPU market. The different strategies NV and AMD employ (monolithic vs. smaller dies) predestine AMD to create dual-GPU solutions in order to keep up with NV Tesla card performance.

I have asked this before, and I will ask it again: what is the hardest part in creating proper software support for these cards that has prevented AMD from delivering proper drivers and libraries for over a year now? Do not take this personally; we would just like to know whether we should invest any more time in trying to create decent multi-GPU programs (which are clearly the future of HPC), or whether we should go over to the dark side of the force.

So to reflect on Micah's point: it is not a neat solution indeed, but I myself am furious about this SYNC_OBJECTS variable, because I have had a paper messed up because of it. I developed a program in a group, and everything worked fine last summer. This winter, however, we wanted to finish our project; I added tons of functionality to the program, and scaling broke. I tried to debug the program but could not. It turns out it was the doing of SDK 2.2, which introduced this variable; I had developed the first part of my program with 2.1. I had completely forgotten that I changed SDKs in between, but even if I hadn't, not even in my wildest nightmares would I have imagined the new SDK would do such a thing. So it is true that using undocumented variables is unreliable, but these variables prevent us from a) working and b) trying to get AMD where it wants to go.

                                                      I am overly polite, when I say that a little more documentation and support would be appreciated.

                                                        • SDK 2.3 and multiGPU
                                                          Meteorhead

                                                          Micah,

thank you for your answer. I am sorry if I seemed to be on the attack, but I think my frustration is understandable. This issue rendered weeks of my time meaningless.

We are all programmers on this forum, so we all know the pain that documenting things causes; however, we are individuals and AMD is a company, so expectations differ somewhat. Documenting even preview features would be wise, because SDKs are released roughly every 6 months. If nobody had explained these unannounced variables, I might just have trashed my cards as useless within 6 months. (There are fellow shader-programming friends of mine who keep telling me they would long ago have ceased all this constant fighting with the API and used CUDA instead. Not cross-vendor, but stable. I do research; they do deadline work. They need rock-solid stability; I have the liberty of trying new things, to some extent.)

My point is that hiding changes of this magnitude in the SDK from programmers is highly risky and can cause a lot of frustration. Taking 30 minutes to document preview or experimental variables might take away the element of surprise when introducing new features in an upcoming SDK, but it could lower the blood pressure of many people around the globe.

Somewhat in connection with what Rick has said: I understand that no insider info can be released as to why dual-GPU cards don't work, but something very general (even content-free) would be nice too. Memory consistency, kernel handling, synchronization, clock issues, or whatever... these surely do not all fall under the industrial-secret category. If I had to guess, I would say it is some hard-wired optimization installed to increase efficiency while gaming (in CrossFire) that is hard to work around in software for OpenCL, even when CrossFire is disabled.

                                                      • SDK 2.3 and multiGPU
                                                        MicahVillmow
                                                        Meteorhead,
We are working hard at getting the dual-GPU solutions working with our SDK. We cannot give out specifics on what has been holding back full support, but we have been making steady progress and should be able to let developers fully utilize the cards in the near future.

I have also put in a request to update the documentation for the environment variables. I cannot promise anything, as many of them are testing or experimental features; if a feature gets past that stage, it gets documented if required and the environment variable is usually removed.
                                                          • SDK 2.3 and multiGPU
                                                            rick.weber

                                                            If I had to guess, I would say supporting the 5970 is hard because both devices are fused in Crossfire mode. I'm not totally sure on the specifics of how SLI/Crossfire work, but I imagine that the problem comes with memory allocation and kernel launching.

In order to render different frames in parallel (or different parts of the screen), both GPUs must have an entire copy of the scene. This would mean that allocations and data copies are broadcast to both GPUs by the backend. Also, I think both GPUs run the same shader at the same time. I'm not sure how much of this is handled by the hardware and how much by the driver, but Micah saying it can be done leads me to believe it's mostly done at the driver level.

                                                            OpenCL is still in its infancy and there are still kinks and shortcomings in both NVIDIA and AMD's implementations. Time is the only thing that's going to make them better.