
jpsollie
Adept II

OpenCL clReleaseCommandQueue hangs

Hello everyone,

I'm experimenting with event-driven OpenCL programming.

My program hangs indefinitely while performing its final cleanup.

The target is a TeraScale 5 device, not GCN. I am using fglrx 15.10 and create one context for the CPU family and one for the GPU family.

The cleanup() function in my software is supposed to:

- release all buffer objects,

- clean up the devices,

- release the context

... for all contexts in the system.

This is my gdb output:

enqueueing AA for execution on 0  //the CPU

enqueueing AA for execution on 1 // the VLIW device

enqueueing AA for execution on 2 // the GCN device

received completion of kernel 2

result : �������  (this is an openCL bug in the openCL kernel, I'm working on it)

-4

-1

-1

-1

-1

-1

-1

Breakpoint 1, cleanup () at ./bf_deepsearch2.c:244

244             while(active_devices != 0) sleep(10);

(gdb) step

received completion of kernel 1

received completion of kernel 0

245             for(i = 0; i < num_of_platforms; i++) {

(gdb) step

246                     clReleaseMemObject(input[0]);

(gdb) step

247                     clReleaseMemObject(input[1]);

(gdb) step

248                     clReleaseMemObject(input[2]);

(gdb) step

249                     clReleaseMemObject(output);

(gdb) step

250                     clReleaseProgram(program);

(gdb) step

251                     clReleaseKernel(clkernel);

(gdb) step

252                     for(j = 0; j < num_of_devices; j++) {

(gdb) step

253                             clFlush(command_queue);

(gdb) step

254                             clReleaseCommandQueue(command_queue);

(gdb) step

[Thread 0x7fffeae40700 (LWP 18023) exited]

[Thread 0x7fffebe55700 (LWP 18022) exited]

[Thread 0x7fffd9fd5700 (LWP 18027) exited]

[Thread 0x7fffc6fea700 (LWP 18029) exited]

[Thread 0x7fffb7fff700 (LWP 18031) exited]

[Thread 0x7fffe9e2b700 (LWP 18024) exited]

[Thread 0x7fffa6fea700 (LWP 18035) exited]

[Thread 0x7fff57fff700 (LWP 18049) exited]

[Thread 0x7fff56fea700 (LWP 18050) exited]

[Thread 0x7fff47fff700 (LWP 18052) exited]

[Thread 0x7fffb5fd5700 (LWP 18033) exited]

[Thread 0x7fffb6fea700 (LWP 18032) exited]

252                     for(j = 0; j < num_of_devices; j++) {

(gdb) step

[Thread 0x7fff55fd5700 (LWP 18051) exited]

[Thread 0x7fffa7fff700 (LWP 18034) exited]

[Thread 0x7fffc5fd5700 (LWP 18030) exited]

[Thread 0x7fff46fea700 (LWP 18053) exited]

[Thread 0x7fff65fd5700 (LWP 18048) exited]

[Thread 0x7fff67fff700 (LWP 18046) exited]

[Thread 0x7fff66fea700 (LWP 18047) exited]

[Thread 0x7fff87fff700 (LWP 18040) exited]

[Thread 0x7fff75fd5700 (LWP 18045) exited]

[Thread 0x7fff76fea700 (LWP 18044) exited]

[Thread 0x7fff85fd5700 (LWP 18042) exited]

[Thread 0x7fffc7fff700 (LWP 18028) exited]

[Thread 0x7fff86fea700 (LWP 18041) exited]

[Thread 0x7fffa5fd5700 (LWP 18036) exited]

[Thread 0x7fffdafea700 (LWP 18026) exited]

[Thread 0x7fffebe96700 (LWP 18021) exited]

[Thread 0x7fffdbfff700 (LWP 18025) exited]

256                     clReleaseContext(context);

(gdb) step

[Thread 0x7fff77fff700 (LWP 18043) exited]

[Thread 0x7fff97fff700 (LWP 18037) exited]

[Thread 0x7fff95fd5700 (LWP 18039) exited]

[Thread 0x7fff96fea700 (LWP 18038) exited]

257                     free(command_queue);

(gdb) step

258                     free(device_id);

(gdb) step

245             for(i = 0; i < num_of_platforms; i++) {

(gdb) step

246                     clReleaseMemObject(input[0]);

(gdb) step

247                     clReleaseMemObject(input[1]);

(gdb) step

248                     clReleaseMemObject(input[2]);

(gdb) step

249                     clReleaseMemObject(output);

(gdb) step

250                     clReleaseProgram(program);

(gdb) step

251                     clReleaseKernel(clkernel);

(gdb) step

252                     for(j = 0; j < num_of_devices; j++) {

(gdb) print num_of_devices[1]

$1 = 2

(gdb) step

253                             clFlush(command_queue);

(gdb) step

254                             clReleaseCommandQueue(command_queue);

(gdb) step

Any ideas? Why does clReleaseCommandQueue hang indefinitely?

Thanks

(And sorry for the long output, but as a newbie I'm apparently not allowed to post replies.)

jpsollie
Adept II

Small update:

- On the VLIW device (Turks PRO), clReleaseCommandQueue returns after roughly 2 hours of 100% CPU usage on one core. As this is a 32-core Opteron machine, that is not really a problem for the hardware, but I still get stuck with a zombie process. If I jump over the clReleaseCommandQueue call for Turks and invoke it for the GCN device, it finishes almost immediately, but when I quit the debugging session, the 100% CPU usage still persists for a long time.

- Quitting is not a problem when I quit before starting the kernel on the VLIW device. Do I need to release the kernel invocation before freeing the command queue? And how do I do this?

*Edit: additional info:

Because the 15.12 drivers (the ones from AMD) are no longer VLIW compatible, I installed the libs from the APPSDK-3.0, which do detect the VLIW device. Not sure whether this is the problem, though.

0 Likes

Hi,

You've been white-listed now.

As per the OpenCL spec, clReleaseCommandQueue performs an implicit clFlush() and waits for all previously enqueued commands to finish before releasing the command queue. So it may hang if any pending command is still waiting on events (see the thread "clReleaseCommandQueue at program exit never returns"). I hope that is not your case. If you can share a repro, we could take a look.

Btw, I think HD 5000/6000 series cards are supported by Crimson 15.12 (Desktop) (please check the "supported products" tab).

Regards,

0 Likes

What would you like to see: my OpenCL code or the host code? Both are roughly 300 lines, so walking through them may take some work.

Also, I may need to comment everything before posting. As it is still an experiment, I did not bother writing everything according to proper C conventions...

And yes, you are right, the Crimson driver supports the Turks series for OpenGL and regular work, but unfortunately not for OpenCL (I took it from here after googling "no openCL with crimson"). As there are no monitors connected to any of the cards, I may have to go back to the 15.9 series of drivers, but then I have to migrate the kernel, and I guess that is not what we want.

0 Likes

A test case that reproduces the issue: host code plus any kernel code needed to reproduce the hang. Instead of posting inline, you can use the "Use advanced editor" option to attach the code as a zip file.

I think you are right about Crimson 15.12 and legacy devices. I had simply forgotten about that.

Regards,

0 Likes

All right, I do not feel confident about posting my whole kernel code, so I made a test case with the OpenBSD SHA1 libraries, which gives me exactly the same result (thank you, OpenBSD people!). It is not functional, though: it has been deliberately modified so that it will not find your SHA1 back.

I commented the host code as much as possible and added a clinfo.txt.

@admins: if you think this is unsuitable, please remove it, I'll find another test case 😉

*Edit: program arguments:

1: length of host-sized string iterator

2: the password hash to look for, must be 32 chars long, in hex format

3: the salt

This version does not accept spaces in the salt.

0 Likes

Thanks for sharing the test-case. We'll check and get back to you shortly.

BTW, could you please provide an example of valid input arguments?

Regards,

0 Likes

Sure, I always invoke it as:

gdb --args ./a.out 2 e772ab34e42a30a2d8eeb410d0fd466ad42a1678fbe5729590bbcc009c6c8227 examplesalt

0 Likes

Thanks.

0 Likes

Hi,

It seems that the program logic prevents control flow from reaching the cleanup() function, and the main thread loops around the sleep() call forever. Please check the code once more.

After a quick glance at the program logic, it looks like there is a lot of dependency on events and event callback functions. I have a doubt, though, regarding the event callback functions: are they really thread-safe? (I can see some global variables being modified in those functions.) They need to be thread-safe as per the OpenCL spec. Please ensure that.

Btw, did you try to reproduce the issue in a simpler scenario without any event dependency? I guess that would show whether the issue is related to this particular program logic.

Regards,

Hi Dipak,

Thanks for your analysis, I'll look into that.

Yes, you are right, the code changes a lot of volatile and non-volatile variables after each event:

- mapping the output region (non-volatile): as this is a separate pointer per OpenCL device, it is barely a problem.

- writing the generated string in volatile pwbuf[] to a separate memory item. I might improve this by: generate from (const char * input to char* input[i++]. I'll take a look into that.

Ugh, what do you mean, sleep forever? I have no problem when I set a breakpoint at cleanup...

It may, however, take some time: you have to wait until ALL running kernels have finished, plus 30 seconds. Otherwise, it will not work.

*Edit:

If you comment out the pkcs5_pbkdf2(password, ds + 6, salt, wpal, key, 32); statement in the CL kernel code (.cl file), the system iterates through the whole program string and finishes correctly. Not sure whether that is the case for you too, but it makes the case more interesting, doesn't it?

Also, what compiler are you using, and which compile options and debugger? I am using GCC 5.4.0 with glibc 22 and gdb 7.10. The compile args are:

gcc -Wall -g2 ./wpacrack3.c /usr/lib64/libOpenCL.so

0 Likes

Actually, to run it faster, I was running an empty dummy kernel. That might be the reason.

I was trying to run it on a machine with a Devastator card (TeraScale3) on Windows 7, just to make sure everything was working. I tried a few times, but it never stopped, even after a few minutes. As per the logic, the program checks data_size and active_devices before calling the cleanup() function. I'm not sure whether they hold the correct values. As an experiment, I tried the following command, and the program kept looping around the sleep() call forever.

./a.out 1 e772ab34e42a30a2d8eeb410d0fd466ad42a1678fbe5729590bbcc009c6c8227 examplesalt

In the callback functions run() and turnOffStatus(), the active_devices variable is incremented and decremented, respectively. Is that modification thread-safe? I have my doubts, which is why I was checking with you. The same goes for the terminating condition that depends on data_size.
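A minimal sketch of one way to make such a counter safe, assuming C11 atomics and POSIX threads are available. Only active_devices mirrors a name from the test case; device_started(), device_finished() and stress_counter() are hypothetical stand-ins for illustration, not the actual run() and turnOffStatus() code. With a plain int, concurrent ++ and -- from several callback threads can lose updates, so the counter might never return to 0 and the main thread would keep sleeping.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum { MAX_THREADS = 64 };

/* Stand-in for the global counter; atomic so that concurrent
 * increments and decrements cannot be lost. */
atomic_int active_devices = 0;

/* Hypothetical equivalent of the increment done in run(). */
void device_started(void)
{
    atomic_fetch_add(&active_devices, 1);
}

/* Hypothetical equivalent of the decrement done in turnOffStatus(). */
void device_finished(void)
{
    int before = atomic_fetch_sub(&active_devices, 1);
    assert(before > 0); /* a finish must always follow a start */
}

/* Each worker imitates a callback thread that starts and finishes
 * a device many times in a row. */
void *worker(void *arg)
{
    int rounds = *(int *)arg;
    for (int i = 0; i < rounds; i++) {
        device_started();
        device_finished();
    }
    return NULL;
}

/* Runs up to MAX_THREADS workers concurrently and returns the final
 * counter value, which should be 0 when no update was lost. */
int stress_counter(int nthreads, int rounds)
{
    pthread_t tid[MAX_THREADS];
    int created = 0;

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    for (int t = 0; t < nthreads; t++) {
        if (pthread_create(&tid[created], NULL, worker, &rounds) == 0)
            created++;
    }
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);

    return atomic_load(&active_devices);
}
```

Calling stress_counter(8, 100000) should return 0. A mutex around the counter would work as well; atomics are just the shortest way to show the idea.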

Regards,

0 Likes

I also observed high memory usage (sometimes almost 2 GB) when I ran the program. Did you observe that too?

0 Likes

About your memory usage: I'll check this. I run on 128 GB of RAM and 'top' never reported more than 0.2% RAM usage. Seems like I need a Windows system :s

0 Likes

Ah, then I think I know what's happening:

If it finds a match, the OpenCL kernel writes something into the output buffer. If the prepareNextKernel function on the host sees that result[0] and result[1] are != 0, it sets data_size to 0 instead of 2 (the input arg), so the password iterator stops (pw_failure), and all callback functions see data_size == 0 and return without 'installing' a next step...

If, however, it does not find a match, it keeps generating the next strings until the password iterator reports a password failure. In that case, you have a problem: data_size is != 0 and the password iterator refuses to generate more data strings, so all callback functions stop. As long as data_size is not set to 0 (that is, as long as no result has been found), main() will wait.
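The two cases above could be kept apart with an explicit state, roughly as sketched below. This is only an illustration under assumptions: search_state, next_state() and main_should_wait() are hypothetical names, and the real prepareNextKernel may look quite different. The idea is that an exhausted iterator becomes a terminal state of its own, so main() is not left waiting on data_size when no match exists.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical replacement for overloading data_size as a stop flag. */
typedef enum {
    SEARCH_RUNNING,   /* more candidate strings can be enqueued   */
    SEARCH_FOUND,     /* the kernel reported a match              */
    SEARCH_EXHAUSTED  /* the iterator has no more strings to try  */
} search_state;

/* Decides the next state after a kernel completes.
 * result0/result1 stand for result[0] and result[1] of the output
 * buffer; iterator_has_more tells whether another string exists. */
search_state next_state(unsigned char result0, unsigned char result1,
                        bool iterator_has_more)
{
    if (result0 != 0 && result1 != 0)
        return SEARCH_FOUND;
    if (!iterator_has_more)
        return SEARCH_EXHAUSTED;
    return SEARCH_RUNNING;
}

/* main() keeps waiting while the search is still running or while
 * any device still has work in flight. */
bool main_should_wait(search_state state, int active_devices)
{
    assert(active_devices >= 0);
    return state == SEARCH_RUNNING || active_devices != 0;
}
```

With this split, next_state(0, 0, false) yields SEARCH_EXHAUSTED, and main_should_wait(SEARCH_EXHAUSTED, 0) is false, so the host would proceed to cleanup() even when nothing was found.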

Are you sure your dummy kernel writes something to the output buffer?

0 Likes

Running the original kernel forced the graphics driver to restart. At that point, the host thread was waiting for data_size to become zero.

Is there any way to run the program with a simple kernel? I don't think running a kernel is that important for testing the clReleaseCommandQueue issue.

Regards,

0 Likes

Sure there is, just take my OpenCL code and comment out one line in the following function:

kernel void mainkernel(__constant uchar* pwbuf, uchar ds,  __constant uchar* wpahash, ushort wpahlength, __constant uchar* wpaname, ushort wpal, __global uchar* results)

Comment out the line pkcs5_pbkdf2(password, ds + 6, salt, wpal, key, 32); at line 346 and the program will run just fine: it writes the input string to the output buffer as long as the global id is (0,0,0). But it also removes my problem.

An intermediate solution: comment out the hmac_sha1 call on line 305 instead. The problem is still there, but the program runs much faster.

0 Likes

Hi Dipak,

Have you found anything yet? I just bought myself a Radeon R9 Nano and will phase out the TeraScale device, but I'd still like to know what's happening.

If you want, I could give you SSH access to the specific Linux server, so you can see for yourself. Is this an option for you?

0 Likes

Not really. I couldn't test it on that Windows setup due to a TDR problem, which forced the graphics driver to restart each time I ran the program. Currently, I'm trying to arrange a Linux setup with a TeraScale card. As you mentioned, with a Hawaii card (GCN) I didn't observe any issue either, even on Linux.

You may provide the SSH access. I may need to add a few debugging statements to investigate the issue, and it would be helpful if I could do so.

Regards,

0 Likes