
stevenovakov
Adept I

clAmdFft - Multi-Device enqueueTransform Failure

I hope you guys can tell how obsessed I am with trying to get this to work from the sheer volume of my recent posts on the topic.

Anyways, essentially, I have a large matrix of dual polarizations, and I need to FFT each column individually. Here is a summary of how this plays out in the handler (shown below).

for columnIterator:
    for device:
        clAmdFftEnqueueTransform( deviceCommandQueue, columnIterator->data() )
        deviceCommandQueue->flush()
        columnIterator++
    for device:
        deviceCommandQueue->finish()
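For anyone trying to follow the scheduling, the loop above can be written out roughly as below. This is only a sketch of the ordering, not the attached handler: `MockQueue` and `dispatchColumns` are hypothetical names, the queue just records what it was asked to do instead of calling OpenCL or clAmdFft, and the check for "fewer columns left than devices" is an assumption added here because the pseudocode does not show one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-device command queue. It only records
// the calls made on it so the scheduling order can be inspected.
struct MockQueue {
    int device;
    std::vector<std::string> log;

    void enqueueTransform(std::size_t column) {
        log.push_back("enqueue " + std::to_string(column));
    }
    void flush()  { log.push_back("flush"); }
    void finish() { log.push_back("finish"); }
};

// Hand out columns in round-robin batches: one column per device, flush each
// queue right after its enqueue, then wait on every queue used in that batch.
void dispatchColumns(std::size_t nColumns, std::vector<MockQueue>& queues) {
    assert(!queues.empty());  // with no queues the loop below would never advance
    std::size_t column = 0;
    while (column < nColumns) {
        std::size_t used = 0;
        for (auto& q : queues) {
            if (column == nColumns) break;  // fewer columns left than devices
            q.enqueueTransform(column++);
            q.flush();
            ++used;
        }
        for (std::size_t d = 0; d < used; ++d) {
            queues[d].finish();
        }
    }
}
```

With two queues and five columns, device 0 ends up with columns 0, 2 and 4 and device 1 with columns 1 and 3, which matches the pattern in the console output.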

Every time, without fail, for any number of columns greater than 1, I get the following error in the console (here there are 2 columns):

Created CommQueue, Dev: 0

Created CommQueue, Dev: 1

Enqueueing Column : 0

Enqueueing Column : 1

OPENCL_V< CLFFT_INVALID_PROGRAM_EXECUTABLE > (1201): clEnqueueNDRangeKernel failed

OPENCL_V< CLFFT_INVALID_PROGRAM_EXECUTABLE > (1201): clEnqueueNDRangeKernel failed

FINAL Read 0 Complete

FINAL Read 1 Complete

However, upon closer inspection, it seems that it is in fact the second DEVICE that consistently fails to enqueue the kernels:

Created CommQueue, Dev: 0

Created CommQueue, Dev: 1

Enqueueing Column : 0

Enqueueing Column : 1

OPENCL_V< CLFFT_INVALID_PROGRAM_EXECUTABLE > (1201): clEnqueueNDRangeKernel failed

OPENCL_V< CLFFT_INVALID_PROGRAM_EXECUTABLE > (1201): clEnqueueNDRangeKernel failed

FINAL Read 0 Complete

FINAL Read 1 Complete

Enqueueing Column : 2

Enqueueing Column : 3

OPENCL_V< CLFFT_INVALID_PROGRAM_EXECUTABLE > (1201): clEnqueueNDRangeKernel failed

OPENCL_V< CLFFT_INVALID_PROGRAM_EXECUTABLE > (1201): clEnqueueNDRangeKernel failed

FINAL Read 2 Complete

FINAL Read 3 Complete

Enqueueing Column : 4

Enqueueing Column : 5

OPENCL_V< CLFFT_INVALID_PROGRAM_EXECUTABLE > (1201): clEnqueueNDRangeKernel failed

OPENCL_V< CLFFT_INVALID_PROGRAM_EXECUTABLE > (1201): clEnqueueNDRangeKernel failed

FINAL Read 4 Complete

FINAL Read 5 Complete


...etc... for N devices. Notice that enqueues 0, 2, 4, etc. (Device 0) work perfectly well. The handler is attached as a text file.

Any ideas? Thanks.

0 Likes
10 Replies
himanshu_gautam
Grandmaster

Asked relevant people to respond to this. Please be patient.

0 Likes
stevenovakov
Adept I

sorry, typo, wasn't actually being impatient

0 Likes

Hi,

Thanks for reporting this.

What are the 2 devices in your system that you are trying to compute on? Are they the same kind or different? Although the library has, in theory, been designed to run in a multi-device setup, this is something that has not been tested. Have you tried running on just 1 device at a time, meaning run the program using only device 0 and then run it again using only device 1?

Would it be possible to attach a short reproducible test code instead of just parts of code? That would help in investigating.

0 Likes

Hi Bragadeesh,

Right now I have 2x NVIDIA GTX 460. (I know, it's not officially supported, and it still has this problem: Re: OpenCL: clAmdFft (OpenCL FFT lib from AMD) on NVIDIA GPUs.) However, I don't need the code to produce the correct result, per se, at the moment, because I am expecting one or two HD 7970s from a collaborator in the next week or so, and from what I've read clAmdFft generally works quite well with AMD products. I just need this function, as part of a much larger application, to at least execute correctly with 1+ devices.

I will try one device at a time right now and try to get you a single file of testable code in an hour or two. Do you mind writing your own makefile? Obviously I have no idea what your system is, paths, etc.

0 Likes

Reporting that it seems to execute on the first device without error (though I'm still getting garbage data).

I've attached a stand-alone version of my handler. It is fairly simple; please let me know if you need additional clarification.

Using only the second device does not seem to work. To try it, force "nDevices = 1" at the top and then change every "qit = deviceQueues.begin();" to "qit = deviceQueues.begin() + 1;" (this forces the iterator to the second device's command queue).

I get the following from executing in gdb:

[New Thread 0x7ffff587b700 (LWP 17580)]

[New Thread 0x7ffff4274700 (LWP 17581)]

[New Thread 0x7ffff2d6b700 (LWP 17582)]

[New Thread 0x7ffff256a700 (LWP 17583)]

[New Thread 0x7ffff1d69700 (LWP 17584)]

[New Thread 0x7ffff1568700 (LWP 17585)]

[New Thread 0x7ffff0d67700 (LWP 17586)]

Created CommQueue, Dev: 0

Enqueueing Column : 0

Program received signal SIGSEGV, Segmentation fault.

0x00007ffff798bfcc in clGetCommandQueueInfo () from /usr/lib/libOpenCL.so.1

(gdb) bt

#0  0x00007ffff798bfcc in clGetCommandQueueInfo () from /usr/lib/libOpenCL.so.1

#1  0x00007ffff7702bd8 in CompileKernels(_cl_command_queue*, unsigned long, clAmdFftGenerators, FFTPlan*) ()

  from /opt/clAmdFft-1.10.321/lib64/libclAmdFft.Runtime.so.1.10.321

#2  0x00007ffff7707db2 in clAmdFftBakePlan ()

  from /opt/clAmdFft-1.10.321/lib64/libclAmdFft.Runtime.so.1.10.321

#3  0x00007ffff76f432d in clAmdFftEnqueueTransform ()

  from /opt/clAmdFft-1.10.321/lib64/libclAmdFft.Runtime.so.1.10.321

#4  0x00007ffff7bc1b0f in HostClass::forwardFFTAMD (this=0x6021c0)

    at HostClass.cpp:2024

#5  0x00007ffff7bb0b02 in main (argc=2, argv=0x7fffffffdfa8) at main.cpp:164

#6  0x00007ffff734c76d in __libc_start_main ()

  from /lib/x86_64-linux-gnu/libc.so.6

#7  0x0000000000400579 in _start ()

And of course, HostClass.cpp:2024 in that program is:

                clAmdFftEnqueueTransform(  fftPlan,
                                            CLFFT_FORWARD,
                                            1,
                                            &((*qit)()),
                                            0,
                                            NULL,
                                            NULL,
                                            &(fftBufferX.back()()),
                                            NULL,
                                            NULL
                                        );


The rest is behind your magical proprietary black box.

Thanks for the help, much appreciated. (see attachment below)

0 Likes

Thanks for the stand-alone repro code. There were some problems with the code, but I was able to resolve them and reproduce the issue.

It is a problem in the library. Unfortunately, it won't be a quick turnaround for the library fix. Please keep in mind that we don't claim support for multiple devices officially. And as I have mentioned earlier, this is something we have not tested. But it is clear from your use case and from other inputs that this is something that we have to support. We will make an effort to address this soon. I can provide a better time estimate for a library update after discussing this issue internally.

0 Likes

Alright, I'll be eagerly waiting.

Can I ask what the problems with the code were? It may affect my other application. Please let me know, and, again, thanks for the help.

0 Likes

I cannot make it work either, even when following the instructions from the clAmdFft manual:

  1. Currently, multi-device operation must be managed by the user. OpenCL contexts can be created that are associated with multiple devices, but clAmdFft only uses a single device from that context to transform the data. Multi-device operation can be managed by the user by creating multiple contexts, where each context contains a different device, and the user is responsible for scheduling and partitioning the work across multiple devices and contexts.

I get a slightly different error though:

OPENCL_V< CLFFT_INVALID_CONTEXT > (1201): clEnqueueNDRangeKernel failed

Single-GPU operations work fine, and the context is not invalid: it points to a single GPU, has its own command queue, etc.

I have found a temporary workaround for what is essentially a show stopper: running the program 4 times, with each instance using a different single GPU.
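In case it helps anyone using the same workaround, each instance needs to know which share of the columns it owns. Below is a minimal sketch of that bookkeeping only. `ColumnRange` and `partitionColumns` are hypothetical names, and nothing here touches OpenCL or clAmdFft; each instance would still create its own context, command queue and plan for its single GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of column indices owned by one device.
struct ColumnRange {
    std::size_t begin;
    std::size_t end;
};

// Split nColumns into nDevices contiguous ranges whose sizes differ by at
// most one. The first (nColumns % nDevices) devices get one extra column.
std::vector<ColumnRange> partitionColumns(std::size_t nColumns,
                                          std::size_t nDevices) {
    assert(nDevices > 0);
    std::vector<ColumnRange> ranges;
    ranges.reserve(nDevices);
    const std::size_t base = nColumns / nDevices;
    const std::size_t extra = nColumns % nDevices;
    std::size_t begin = 0;
    for (std::size_t d = 0; d < nDevices; ++d) {
        const std::size_t count = base + (d < extra ? 1 : 0);
        ranges.push_back({begin, begin + count});
        begin += count;
    }
    return ranges;
}
```

An instance started for GPU index k would then only process the columns in `partitionColumns(nColumns, 4)[k]`. Contiguous ranges are used here so each instance can read one block of the matrix; a strided split (column i goes to device i % 4) would work just as well.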

0 Likes

We are planning a library update in about a month's time. We should have some answers by then.

0 Likes

Thank you for the heads up. Good timing. Hopefully you will also grace the new Apple Mac Pro with AMD GPUs with a version of the library.

0 Likes