cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

galmok
Journeyman III

SDK 2.4/CCC 11.4 BSOD/Freeze

We have a problem with the 11.4 driver (SDK2.4/CCC 11.3 is ok) in that it cause a specific kernel of ours. The problem is that it causes the driver to either freeze the computer/screen or causing a BSOD in ATIKMPAG.SYS.

The kernel in question is attached and is a transpose kernel that takes from src and puts into dst.

The local workgroupsize is 32,8,0 (2 dimensional) and the global size is 256,256.

The first time we ran this kernel, it ran just fine and produced the correct result. The next invocation of the same exe file causes windows 7 to lock up (or it bluescreens).

double extension is enabled using:

#ifdef cl_amd_fp64
#pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif

Original mem pointers are src and dst and from those, clCreateSubBuffer is used to make an offset into the array (the offset for this test is 0): // create pointer to subbuffer subSRC = createSubBuffer(src, CL_MEM_READ_ONLY); subDST = createSubBuffer(dst, CL_MEM_READ_WRITE); // set kernel arguments ccSetKernelArgs(kernel, 4, "cl_mem", &subDST, "cl_uint", dst.lda, "cl_mem", &subSRC, "cl_uint", src.lda);

0 Likes
30 Replies
galmok
Journeyman III

As I feared, clicking the Attach Code button twice erased the first content which was the kernel. It is supplied here.

 

#ifdef cl_amd_fp64 #pragma OPENCL EXTENSION cl_amd_fp64: enable #endif #define BLOCK_SIZE 8 kernel void transpose( global double *dst, int ldd, global double *src, int lds ) { int2 blockIdx = { get_group_id(0), get_group_id(1) }; int2 threadIdx = { get_local_id(0), get_local_id(1) }; src += blockIdx.x*32 + threadIdx.x + ( blockIdx.y*32 + threadIdx.y ) * lds; dst += blockIdx.y*32 + threadIdx.x + ( blockIdx.x*32 + threadIdx.y ) * ldd; local double a[32][33]; // // load 32x32 block // for( int i = 0; i < 32; i += BLOCK_SIZE ) a[i+threadIdx.y][threadIdx.x] = src[i*lds]; barrier(CLK_LOCAL_MEM_FENCE); // // store transposed block // for( int i = 0; i < 32; i += BLOCK_SIZE ) dst[i*ldd] = a[threadIdx.x][i+threadIdx.y]; }

0 Likes

My previously reported inplace transpose works correctly with 11.4 now which is however nice. To bad the already working dst=transposed(src) isn't working anymore. 😕

0 Likes

galmok,

Have you been able to figure out a silution to your problem. If not please explain the problem a little more. Mention you system configuration and post complete kernel code and host code.

0 Likes

The system is a Windows 7 system using a 5870 (but also has a Redwood card that isn't being used for OpenCL). The binary is compiled for 32 bit.

I cannot post the complete kernel file nor the complete host code is it isn't for the general public. The whole kernel that crashes the pc has been posted, though. I'll have to create a standard host code to run the kernel, but basically it is just allocating 2 gpu buffers of size 256*256*sizeof(double) and uploads that amount of data to src and then tries to download the same amount from dst. Profiling is enabled.

The crash could be a correlation with the general size of the .cl file (23KB, about 13 kernels total) but all other kernels work fine. The malfunctioning kernel is however the first one in the file.

0 Likes

So is your code producing wrong results or is it crashing. In any case you can get important information about the issue by using GDB for debugging.

 

If it might be something related to kernel file size( I don't guess 23kb is very large), you can divide your kernel into different files.

0 Likes

The kernel/program either produces the correct result, but if not, Windows crashes hard (freeze/BSOD). The application itself either works or never returns (again, as Windows crashes). The application itself doesn't crash and therefore a debugger isn't worth much. 😕

0 Likes

I found my old thread about this issue and resurrected it. It has the code attached, ready to run and crash your Windows 7.

0 Likes

Well first thing would be that BSOD is not an supported operating system, so its better if you can test on some supported platform( Supported Platforms:

http://developer.amd.com/gpu/AMDAPPSDK/pages/DriverCompatibility.aspx) if you expect some official help. Sorry i slipped that before, but many users have been able to run it on unsupported platforms. If the issue is reproducible there i would suggest you to post a test case either at forums or file a ticket.

 

0 Likes

Are you serious? You do not know what BSOD means?

BSOD = Blue Screen Of Death.

The normal way for Windows (any Windows) to signal that the operating system has experienced a serious error and has halted in response to that.

The code is run on Windows 7 64-bit.

0 Likes

sorry for that. That was really embarrassing

Please provide a test case.

0 Likes

As i wrote a few messages above, the resurrected thread:

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=148654&enterthread=y

has the code that will crash Windows 7. Previously, this code just caused the wrong result, but now Windows crashes hard.

0 Likes

Me too I have freeze problem with the new SDK 2.4/CCC 11.4 combo in an program of mine (see showcase code attached) while the same was not happening with SDK 2.3/CCC 11.3 (although I can't confirm if it is related to the CCC 11.4 or the SDK 2.4) (win 7 pro x64).

The system will have screen frozen (audio playback will confirm that system is not frozen, however), but only if both of the following conditions are fulfilled at the same time:

1- there's 128 or more OpenGL buffers acquired by the OpenCL context;

2- the kernel have at least one of these buffers as argument (although no buffer is really used by the kernel).

It will freeze when executing the code line number 329:

if ((status = clWaitForEvents(1, &events)) != 0) return 1;

just after queuing kernel execution.

 

lao

 

 

#include <string> #include "GL/glew.h" #include "GL/glut.h" #include <CL/cl.h> #include <Cl/cl_gl.h> #include <windows.h> void initGLUT(int argc, char *argv[]); int InitializeComponents(void); int SetCLPlataform(void); int SetCLContext(void); int SetCLDevices(void); int SetCLCommandQueue(void); int SetCLProgram(void); int SetCLKernel(void); int generateGLBuffers(void); int CreateCLAndAcquireGLBuffers(void); int RunTestKernel(void); cl_context context; cl_context_properties * cprops; cl_platform_id platform; cl_device_id device_test; cl_command_queue commandQueue; cl_program program_test; cl_kernel kernel_test; int NbOfBuffers = 127; GLuint GL_IndicesBuffer, * GL_PositionBuffers; cl_mem CL_IndicesBuffer, * CL_PositionBuffers; #define KERNEL_HAVE_ARG_IN 1 int main(int argc, char * argv[]) { cl_uint status; initGLUT(argc, argv); if ((status = InitializeComponents()) != 0) return 1; if ((status = generateGLBuffers()) != 0) return 2; if ((status = CreateCLAndAcquireGLBuffers()) != 0) return 3; if ((status = RunTestKernel()) != 0) return 4; printf("All went fine!\n"); return 0; } void initGLUT(int argc, char *argv[]) { glutInit(&argc, argv); glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB | GLUT_DEPTH); glutInitWindowSize(800, 600); glutCreateWindow("Test"); glewInit(); } int InitializeComponents(void) { cl_int status = 0; if (SetCLPlataform() != 0) return 1; if (SetCLContext() != 0) return 2; if (SetCLDevices() != 0) return 3; if (SetCLCommandQueue() != 0) return 4; if ((status = SetCLProgram()) != 0) return 5; if ((status = SetCLKernel()) != 0) return 6; return 0; } int SetCLPlataform(void) { cl_int status; cl_uint numPlatforms; status = clGetPlatformIDs(0, NULL, &numPlatforms); if(status != CL_SUCCESS) return 1; cl_platform_id* platforms; if(numPlatforms > 0) { platforms = new cl_platform_id[numPlatforms]; status = clGetPlatformIDs(numPlatforms, platforms, NULL); if(status != CL_SUCCESS) return 2; for(unsigned int i=0; i < numPlatforms; ++i) { char pbuff[100]; status = clGetPlatformInfo( platforms, CL_PLATFORM_VENDOR, sizeof(pbuff), pbuff, NULL); if(status != CL_SUCCESS) return 3; platform = platforms; if(!strcmp(pbuff, "Advanced Micro Devices, Inc.")) break; } } else return 4; return 0; } // Create openCL context from the openGL one int SetCLContext(void) { cl_int status; cl_context_properties cpsGL[] = { CL_CONTEXT_PLATFORM, (cl_context_properties) platform, CL_WGL_HDC_KHR, (intptr_t) wglGetCurrentDC(), CL_GL_CONTEXT_KHR, (intptr_t) wglGetCurrentContext(), 0}; cprops = (NULL == platform) ? NULL : cpsGL; if (cprops == NULL) return 1; // Create context for GPU type device context = clCreateContextFromType(cprops, CL_DEVICE_TYPE_GPU, NULL, NULL, &status); if(status != CL_SUCCESS) return 2; return 0; } // Identify and set opencl devices, if any int SetCLDevices(void) { cl_int status; size_t deviceListSize; ///////////////////////////////////////////////////////////////// // First, get the size of device list data ///////////////////////////////////////////////////////////////// status = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &deviceListSize); if(status != CL_SUCCESS) return 1; if(deviceListSize == 0) return 2; ///////////////////////////////////////////////////////////////// // Now, get the device list data ///////////////////////////////////////////////////////////////// cl_device_id* devices = (cl_device_id *)malloc(deviceListSize); status = clGetContextInfo(context, CL_CONTEXT_DEVICES, deviceListSize, devices, NULL); if(status != CL_SUCCESS) return 3; ///////////////////////////////////////////////////////////////// // Identify GPU device, if any. ///////////////////////////////////////////////////////////////// device_test = devices[0]; free(devices); return 0; } // Create the command-queue data structure to coordinate execution of the kernels on the device int SetCLCommandQueue(void) { cl_int status; //Create command-queue and enable commands profiling commandQueue = clCreateCommandQueue(context, device_test, CL_QUEUE_PROFILING_ENABLE /*NULL*/, &status); if(status != CL_SUCCESS) return 1; return 0; } // Create and build program int SetCLProgram(void) { cl_int status; ///////////////////////////////////////////////////////////////// // Load the cl kernel string ///////////////////////////////////////////////////////////////// #if KERNEL_HAVE_ARG_IN std::string test_kernel = "\n\ __kernel void test( __global uint* idx) \n\ { \n\ return; \n\ }\0"; #else std::string test_kernel = "\n\ __kernel void test() \n\ { \n\ return; \n\ }\0"; #endif ///////////////////////////////////////////////////////////////// // Build ///////////////////////////////////////////////////////////////// const char * source = test_kernel.c_str(); program_test = clCreateProgramWithSource(context, 1, &source, NULL, &status); if(status != CL_SUCCESS) return 1; status = clBuildProgram(program_test, 1, &device_test, NULL, NULL, NULL); if(status != CL_SUCCESS) return 2; return 0; } // Create kernel int SetCLKernel(void) { cl_int status; kernel_test = clCreateKernel(program_test, "test", &status); if(status != CL_SUCCESS) return 1; return 0; } int generateGLBuffers(void) { int status; // Triangle index buffer glGenBuffers(1, &GL_IndicesBuffer); status = glGetError(); if (status != 0) return 1; glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, GL_IndicesBuffer); status = glGetError(); if (status != 0) return 2; glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(GLuint) * 3, NULL, GL_DYNAMIC_DRAW); status = glGetError(); if (status != 0) return 3; GL_PositionBuffers = new GLuint[NbOfBuffers - 1]; for (int k = 0; k < (NbOfBuffers - 1); k++) { // Vertex position buffers glGenBuffers(1, &GL_PositionBuffers); status = glGetError(); if (status != 0) return 4; glBindBuffer(GL_ARRAY_BUFFER, GL_PositionBuffers); status = glGetError(); if (status != 0) return 5; glBufferData(GL_ARRAY_BUFFER, sizeof(float) * 3 * 3, NULL, GL_DYNAMIC_DRAW); status = glGetError(); if (status != 0) return 6; } glFinish(); status = glGetError(); if (status != 0) return 7; return 0; } int CreateCLAndAcquireGLBuffers(void) { cl_int status; CL_PositionBuffers = new cl_mem[NbOfBuffers - 1]; memset(CL_PositionBuffers, 0, sizeof(cl_mem) * (NbOfBuffers - 1)); for (int k = 0; k < (NbOfBuffers - 1); k++) { CL_PositionBuffers = clCreateFromGLBuffer(context, CL_MEM_READ_WRITE, GL_PositionBuffers, &status); if(status != CL_SUCCESS) return 1; } status = clEnqueueAcquireGLObjects(commandQueue, NbOfBuffers - 1, CL_PositionBuffers, 0, NULL, NULL); if(status != CL_SUCCESS) return 2; CL_IndicesBuffer = clCreateFromGLBuffer(context, CL_MEM_WRITE_ONLY, GL_IndicesBuffer, &status); if(status != CL_SUCCESS) return 3; status = clEnqueueAcquireGLObjects(commandQueue, 1, &CL_IndicesBuffer, 0, 0, NULL); if(status != CL_SUCCESS) return 4; status = clFinish(commandQueue); if(status != CL_SUCCESS) return 5; return 0; } int RunTestKernel(void) { cl_int status; cl_event events; cl_ulong startTime, endTime, Test_kernel_time; #if KERNEL_HAVE_ARG_IN status = clSetKernelArg(kernel_test, 0, sizeof(cl_mem), (void *)&CL_IndicesBuffer); if (status != CL_SUCCESS) return 1; #endif size_t g = 1, l = 1; status = clEnqueueNDRangeKernel(commandQueue, kernel_test, 1, NULL, &g, &l, 0, NULL, &events); if(status != CL_SUCCESS) return 2; if ((status = clWaitForEvents(1, &events)) != 0) return 1; clGetEventProfilingInfo(events, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &startTime, NULL); clGetEventProfilingInfo(events, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &endTime, NULL); Test_kernel_time = endTime - startTime; if ((status = clReleaseEvent(events)) != 0) return 1; return 0; }

0 Likes

I've just tested with the 11.5a hot-fix and problem persists.

Any comments on what changed from 2.3/11.3 (or possibly 2.4/11.3) that could lead to this problem and about how much time it will take to see a release that will fix it?

cheers

0 Likes

laobrasuca,

I am not able to reproduce the hang on my system with internal packages.So the issue appears to be fixed.I will try to check on sdk 2.4 also.

0 Likes

nice, thx for the news.

0 Likes

Originally posted by: himanshu.gautam laobrasuca,

 

I am not able to reproduce the hang on my system with internal packages.So the issue appears to be fixed.I will try to check on sdk 2.4 also.

 

still having the same problem with catalyst 11.6 + APP SDK 2.4. When you say internal packages, are you referring to APP SDK packages or display driver packages (using opencl runtime coming with it)?

0 Likes

with CCC 11.7 and embedded SDK runtime it works fine now, no freezing problems.

however, since the new sdk is not yet released, can I use .h and .lib files from the current sdk in order to build my apps?

0 Likes

Nice to hear, you got the problem fixed.

IMHO, you can use the headers and libs. Do you face some problems?

0 Likes

I haven't tried the headers/libs yet, but it'll work hopefully

thx 😉

0 Likes

well, now that the SDK 2.5 is released I'm using it. At the time I didn't try using headers and stuff, because I had other aspects of my code to treat in priority.

But, here I am again about the same problem. As I said, with 128 buffers it works (SDK 2.5 / CCC 11.9), but I had to increase the number of buffers and I noticed that with, for example, 350 it will freeze.... so, there's still a problem related the number of buffers to acquire. Any idea why the number of CL buffers acquired from GL would make it freeze? As a side note, if the buffers created are CL only, meaning, I don't create CL buffers from GL buffers, I don't have any problem.

And I'd like to ask a question about clEnqueueAcquireGLObjects(): is there any difference between:

clEnqueueAcquireGLObjects(commandQueue, NbOfCLBuffers, CL_GL_Buffers, 0, NULL, NULL);

and

for(int buff = 0; buff < NbOfCLBuffers; ++buff)

clEnqueueAcquireGLObjects(commandQueue, 1, CL_GL_Buffers+buff, 0, NULL, NULL);

I mean, in the clEnqueueAcquireGLObjects implementation does it use a for loop?

0 Likes

I have a similar problem on Linux (SuSE 11.4/64). There, the machine is still alive, but Xorg spins at 100% CPU somewhere in the kernel (cannot be interrupted so I can't tell in which module). Screen completely frozen, not even mouse movement.

Is there any chance I could get hold of the prelim Linux 11.7 drivers that seemed to fix that for Windows? If not, is there an approximate date for 11.7 release?

Thaks a lot

0 Likes

maybe this wednesday.

0 Likes

Hi,

I have test the 11.7 driver but it fix nothing 😞

On the GPU it freeze my computer... and now... even my CPU device crash my application 😞

What can we do to help AMD to fix the problems ? (Except that you can simply run the CLPP open source project on GPU to see the problem !!)

0 Likes

viewon01,
It looks like the scan kernel is invalid.


You are doing an early return so not all work-items that enter the loop hit all barriers.

What you will need to do is somethign like this.

bool invalidIdx = offsetIdx > (size - 1);

if (!invalidIdx) {
execute code
}
barrier()
if (!invalidIdx) {
execute code
}

for(uint i = 0; i < passesCount; ++i)\n" " {\n" " const uint offset = i * TC + (bidx * B);\n" " const uint offsetIdx = offset + idx;\n" " \n" " if (offsetIdx > size-1) return;\n" " \n" " // Step 1: Read TC elements from global (off-chip) memory to local memory (on-chip)\n" " T input = localBuf[idx] = dataSet[offsetIdx]; \n" " \n" " /*\n" " // This version try to avoid bank conflicts and improve memory access serializations !\n" " if (lane < 1)\n" " {\n" " __global T* currentOffset = inputDatas + offsetIdx;\n" " vstore16(vload16(0, currentOffset), 0, localBuf);\n" " vstore16(vload16(0, currentOffset + 16), 16, localBuf);\n" " }\n" " barrier(CLK_LOCAL_MEM_FENCE);\n"

0 Likes

Wow !!!! Well see...

 

I have to call all the barriers too !!!!???

It will be difficult to solve 😛 because in sub-functions I also have a lot of 'barriers' !!!! Should I do a lot of 'if (isInvalid)..." even in sub-functions ?

BTW: I have no problem with this code on NVidia SDK !

Thanks

Krys

0 Likes
Eudorawhite
Journeyman III

This driver is currently the only version to be able to play Dragon Age 2 without crash. However it would cause the GPU load of the primary card to be stuck at 100% unless I reboot.i cant put up with in any more?so i uninstall it by uninstaller.

0 Likes

viewon01,
This is one of the side effects of our architecture. Another way to solve this problem is to run work-group size equal to wavefront size.
0 Likes

Thanks Micah,

So, I have add a fix for this problem in the scan, I have simply replace with :

But the computer continue to block ! So, I suppose the problem is somewhere else !?

if (offsetIdx > size-1) { // To avoid to lock ! barrier(CLK_LOCAL_MEM_FENCE); barrier(CLK_LOCAL_MEM_FENCE); barrier(CLK_LOCAL_MEM_FENCE); barrier(CLK_LOCAL_MEM_FENCE); barrier(CLK_LOCAL_MEM_FENCE); barrier(CLK_LOCAL_MEM_FENCE); barrier(CLK_LOCAL_MEM_FENCE); barrier(CLK_LOCAL_MEM_FENCE); continue; }

0 Likes

viewon01,

can you please describe how would the modification you did above help in removing the hang.What is the use of putting so many barriers one after the other.

0 Likes

Micah tell me that the problem is that from a workitem I return, and from hte other workitems I continue (and so, it call some barriers).

The result is that theses barriers are waiting for the workitem where I have returned from !!!!

So, in order to avoid  a blocking situation, I continue but only call the barriers !

There are 8 barriers to call (3 in the same method and 5 in a sub-function)

That's right ?

0 Likes