

ebfe
Journeyman III

Second GPU on HD5970 produces incorrect results

How thread-safe is ATI's OpenCL implementation?

Hi,

I'm the author of Pyrit and have yet another possible bug in ATI's OpenCL implementation to ask about. The upstream bug is tracked at http://code.google.com/p/pyrit/issues/detail?id=123

A user reports that his setup crashes with a HD5970. The cause of the error is a self-test in my code that checks if the GPU is actually computing correct results; this catches the second GPU on the HD5970 producing bogus results (see comment #3).
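
A self-test of this kind can be sketched roughly as follows. This is a minimal illustration, not Pyrit's actual code: it assumes the device computes WPA pairwise master keys (PBKDF2-HMAC-SHA1, 4096 iterations, 32 bytes of output), and `device_solve` is a hypothetical stand-in for whatever callable dispatches the work to the GPU. It is written for Python 3.

```python
import hashlib


def reference_pmk(essid, password):
    """CPU reference value.

    Assumption: the device computes PBKDF2-HMAC-SHA1 with the ESSID as salt,
    4096 iterations and a 32-byte result.
    """
    return hashlib.pbkdf2_hmac("sha1", password, essid, 4096, 32)


def self_test(device_solve, essid=b"bla", passwords=(b"blablabla", b"foobarbaz")):
    """Return True if the device's results match the CPU reference.

    device_solve is a hypothetical callable: (essid, passwords) -> sequence of bytes.
    """
    expected = [reference_pmk(essid, pw) for pw in passwords]
    actual = list(device_solve(essid, passwords))
    return actual == expected
```

A check like this catches silently wrong output even when every API call reports success, because it compares the data itself rather than the return codes.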

What puzzles me is that there is a *ton* of error checking in my OpenCL code, which checks every single API-call. No error is reported, all functions return CL_SUCCESS...

Does someone have an idea what might cause this problem?

One possible explanation for this might be that Pyrit uses one host-thread for every GPU, possibly calling the OpenCL-library from different threads at the same time. This *should* not be a problem as there are no global variables (all data is local to the calling thread) and the OpenCL-library is *supposed* to be thread-safe.

You can find the whole source-code at http://code.google.com/p/pyrit/source/browse/#svn/trunk/cpyrit_opencl

0 Likes
12 Replies
ebfe
Journeyman III

Update: The problem also occurs when only the second GPU is used. In that case only one host thread calls into the OpenCL library, so it's probably not a locking issue.

0 Likes
ebfe
Journeyman III

More information about this:

* I've dumped the .il and the .isa file that get generated for both GPUs on the HD5970. They are identical.
* It takes a very long time for clFinish() to return on the second GPU (several seconds). It finally returns CL_SUCCESS. On the first GPU, the same call takes only a few milliseconds.
* CL_EVENT_COMMAND_EXECUTION_STATUS for clEnqueueNDRangeKernel() is always CL_COMPLETE
* The kernel output on the second GPU is always wrong, but it is the same every time the kernel is called. It behaves this way regardless of the input (!).
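
The last observation can be turned into a small diagnostic. The following is a sketch only: `solve` is a hypothetical callable that takes one input and returns the device's output as bytes; it is not part of Pyrit or of any OpenCL binding.

```python
def output_ignores_input(solve, inputs):
    """Return True if solve() yields the same output for every distinct input.

    solve is a hypothetical callable: one input -> bytes.
    """
    distinct = set(inputs)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct inputs")
    # Collect the outputs for all distinct inputs; a single unique output
    # means the result does not depend on the input at all.
    outputs = {solve(item) for item in distinct}
    return len(outputs) == 1
```

If this returns True for a real kernel, the output buffer is most likely never written by the kernel (or never copied back), which would point at the runtime or driver rather than at the kernel code.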

0 Likes

CAL itself isn't thread-safe, as stated in its documentation, so I wouldn't be surprised if there are numerous problems with OpenCL as well.

Also it's usually better to write direct mails to streamdeveloper@amd.com rather than use these forums to report problems/bugs.

 

0 Likes

To clarify: the problem persists even when the second GPU is used alone and called only from the main thread.

0 Likes

Try export DISPLAY=:0.1; it should then use only the second GPU.

Run some CAL samples to see whether they produce incorrect results too. It is possible that this is a hardware issue.

0 Likes

ebfe,
We are looking into reported issues of the second device not working correctly. I'll talk to the developer working on this to see if he has seen this behavior. Also if you have a small test case that can reproduce the problem, it would help us in tracking it down.
0 Likes

Simple testcase:

  * Download http://pyrit.googlecode.com/files/cpyrit-opencl-0.3.0.tar.gz
  * Compile and install the python module via setup.py
  * Run the following command from a console:

python -c "from cpyrit import _cpyrit_opencl; print [(i, _cpyrit_opencl.OpenCLDevice(0,i).solve('bla',('blablabla',)*3)) for i in xrange(_cpyrit_opencl.OpenCLPlatform(0).numDevices)]"

Example output:


[(0,
('\xb8\xb7\xf5E^\xfe\xb1\x8b\xf5\x98\xce\x95z$%\xd4\x05\xdb4\xb6E\xeb\xf4\xfd\x92\xf0\xa18\x18\xae\xaa\x1c',
'\xb8\xb7\xf5E^\xfe\xb1\x8b\xf5\x98\xce\x95z$%\xd4\x05\xdb4\xb6E\xeb\xf4\xfd\x92\xf0\xa18\x18\xae\xaa\x1c',
'\xb8\xb7\xf5E^\xfe\xb1\x8b\xf5\x98\xce\x95z$%\xd4\x05\xdb4\xb6E\xeb\xf4\xfd\x92\xf0\xa18\x18\xae\xaa\x1c')),
(1,
('\xb8\xb7\xf5E^\xfe\xb1\x8b\xf5\x98\xce\x95z$%\xd4\x05\xdb4\xb6E\xeb\xf4\xfd\x92\xf0\xa18\x18\xae\xaa\x1c',
'\xb8\xb7\xf5E^\xfe\xb1\x8b\xf5\x98\xce\x95z$%\xd4\x05\xdb4\xb6E\xeb\xf4\xfd\x92\xf0\xa18\x18\xae\xaa\x1c',
'\xb8\xb7\xf5E^\xfe\xb1\x8b\xf5\x98\xce\x95z$%\xd4\x05\xdb4\xb6E\xeb\xf4\xfd\x92\xf0\xa18\x18\xae\xaa\x1c')),
(2,
('\xc1\xed\xb1dJ\x93\xcb5sg0b\xfd\xc2\x8eO5\xef\xea*\x9c\xff\xc7I\xfc\x16\x14\x0e\xe3\x8f\xa3\xa8',
'\x89XU"\xfa\xfb\xc7\x96\xda\x8c\xc5\xdd\xf4\xe5\xefG\xa4\xb5(B\xc2\x7f\xf6\xbc\xb7\xe0\xfdg\r\x99v<',
'(h|L\xce\xceq\xd46#\xa2{\xcaq\xf7\xd7\xd7\xa0IT\xd1\r\xe4\x00\x1b(DCT\x18\xc1Q'))]


The results for device 2 (the second GPU) are not equal to the results for device 0 (CPU) and device 1 (first GPU).
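
The comparison can also be automated instead of being read off the printed output. This is a minimal Python 3 sketch; both function names are made up for illustration, and `results` is assumed to be a list of (device_index, result_tuple) pairs shaped like the output above.

```python
def mismatching_devices(results, reference_index=0):
    """Indices of devices whose results differ from the reference device.

    results: list of (device_index, result_tuple) pairs.
    reference_index: device treated as correct (assumed here to be the CPU).
    """
    reference = dict(results)[reference_index]
    return [index for index, result in results if result != reference]


def internally_inconsistent(results):
    """Indices of devices that returned differing outputs.

    Only meaningful when every input was identical, as in the test case,
    where the same password is passed three times.
    """
    return [index for index, result in results if len(set(result)) > 1]
```

For the example output above, both functions would flag only device 2: its results differ from device 0, and its three outputs differ from each other although the inputs were identical.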
0 Likes

ebfe,
Thanks for reporting this. We already are investigating this issue so it should be fixed in a future release.
0 Likes

I can report that the issue has not been fixed in 10.3-final. The second GPU still produces random, invalid results without the API indicating any error condition.

I know that ATI is used to ignoring its user base. It would be *so* sweet if you'd at least inform developers like us about known problems, upcoming changes or the lack thereof. This is really frustrating.

0 Likes

ebfe,
In most cases there is a delay of at least two to three months between when a problem is reported and when the fix makes it into a public Catalyst release. Phoronix did an article about the driver development cycle, which can be found here: http://www.phoronix.com/scan.p...rticle&item=735&num=4. As I mentioned in my previous post, it will be fixed in a future release.
0 Likes

Does this issue still persist in the current ATI Stream SDK/Catalyst driver?

0 Likes
Raistmer
Adept II

AFAIK, yes, it still exists in Cat 10.7 + SDK 2.2.
At least my program still suffers on the secondary 5970 core. It makes no progress on that core, though it runs fine on the first one.
Also, this bug is listed in the SDK 2.2 release notes.
0 Likes