There is a recent interview with some of the AMD devs (http://forums.amd.com/devblog/blogpost.cfm?catid=335&threadid=120276) which includes the comment "...the OpenCL CPU implementation leverages the CPU hardware debug features to provide excellent debug capabilities, using familiar debug environments, at full CPU speeds.".
I've probably missed it, but is there any debug support for Visual Studio 2008 on Vista planned for kernels running on the CPU, or perhaps within a GPU emulator? It would be great to catch kernel memory and build issues in Visual Studio.
I second this query. Even without Visual Studio integration, is there a way to view kernel compiler error messages? Now there is just a numeric code returned that the program build failed when clBuildProgram is executed.
Originally posted by: jmundy is there a way to view kernel compiler error messages? Now there is just a numeric code returned that the program build failed when clBuildProgram is executed.
You can get the build log using clGetProgramBuildInfo() API call.
Yes.. it's pretty close, but you get references like
C:\Users\daiken\AppData\Local\Temp\OCL454.tmp.cl(54): warning: variable "lsb" is used before its value is set
If you double-click on them in the output window they will navigate to the appropriate line in the editor.. or they would if the temporary file still existed. Really what you want, though, is the path to the original .cl file. It's possible to sweep through the output with a regex, replacing the file paths, but a simple fix to the OpenCL implementation would make it much easier.
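The path sweep described above can be sketched in plain C without a regex library. This is only an illustration under stated assumptions: the helper name remap_build_log is hypothetical, and it assumes the temporary path starts at the beginning of a log line and ends in ".tmp.cl" immediately before "(", as in the warning quoted above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of `log` in which,
 * on every line containing ".tmp.cl(", the text from the start of the line
 * through ".tmp.cl" is replaced by `orig_path`. The caller frees the result.
 * Returns NULL if the allocation fails. */
char *remap_build_log(const char *log, const char *orig_path)
{
    static const char marker[] = ".tmp.cl(";
    const size_t keep = strlen(marker) - 1;   /* length of ".tmp.cl" */
    const size_t path_len = strlen(orig_path);
    size_t lines = 1;
    const char *p;
    const char *line;
    char *out;
    char *dst;

    for (p = log; *p; ++p)
        if (*p == '\n')
            ++lines;

    /* Worst case: every line grows by the full length of orig_path. */
    out = malloc(strlen(log) + lines * path_len + 1);
    if (!out)
        return NULL;

    dst = out;
    line = log;
    while (*line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) + 1 : strlen(line);
        const char *hit = strstr(line, marker);

        if (hit && hit < line + len) {
            /* Drop the temp path, keep "(54): warning: ..." unchanged. */
            const char *rest = hit + keep;
            size_t rest_len = (size_t)(line + len - rest);
            memcpy(dst, orig_path, path_len);
            dst += path_len;
            memcpy(dst, rest, rest_len);
            dst += rest_len;
        } else {
            /* Line has no temp-file reference: copy it as-is. */
            memcpy(dst, line, len);
            dst += len;
        }
        line += len;
    }
    *dst = '\0';
    return out;
}
```

The input would be the log string retrieved with clGetProgramBuildInfo() using CL_PROGRAM_BUILD_LOG, and orig_path the full path of the .cl file that was loaded. Printing the result to the output window should then let a double-click land in the original source, provided the line numbers in the temporary file match the original.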
This isn't a big issue for me currently. Catching subtle memory overwrites is. I'm working with a radix sort pulled from the NVidia SDK (it uses the recent paper from Satish et al) and it crashes in clFinish(). I suspect it's due to a memory error, but the code is quite low-level so it's difficult to isolate. They are NVidia kernels so i'm waiting for permission to post it here. If there is some way to use the AMD source or an emulator with runtime error checking i'll do the work myself.
Originally posted by: david_aiken Yes.. it's pretty close, but you get references like
C:\Users\daiken\AppData\Local\Temp\OCL454.tmp.cl(54): warning: variable "lsb" is used before its value is set
If you double-click on them in the output window they will navigate to the appropriate line in the editor.. or they would if the temporary file still existed. Really what you want, though, is the path to the original .cl file. It's possible to sweep through the output with a regex, replacing the file paths, but a simple fix to the OpenCL implementation would make it much easier.
Presently, only clCreateProgramWithSource is supported. You will be able to do what you are expecting with clCreateProgramWithBinary, which will be available in upcoming releases.
This isn't a big issue for me currently. Catching subtle memory overwrites is. I'm working with a radix sort pulled from the NVidia SDK (it uses the recent paper from Satish et al) and it crashes in clFinish(). I suspect it's due to a memory error, but the code is quite low-level so it's difficult to isolate. They are NVidia kernels so i'm waiting for permission to post it here. If there is some way to use the AMD source or an emulator with runtime error checking i'll do the work myself.
is it crashing for both CPU and GPU?
It crashes when running it against an Intel Core 2 Quad Q6600 and AMD Turion 64 X2. I don't have an AMD GPU yet, regrettably.
Originally posted by: david_aiken It crashes when running it against an Intel Core 2 Quad Q6600 and AMD Turion 64 X2. I don't have an AMD GPU yet, regrettably.
What modifications did you make while porting the sample?
Post the code here once you get permission.
Taking the original RadixSort.cl from the NVidia SDK v.2.3, I did the following to get it working with AMD Stream v2.0-beta4:
1) copied scan.cl from oclScan NVidia example next to RadixSort.cl. The code also has to be changed to refer to this file rather than the missing "scan_b.cl".
2) create separate builds for AMD and NVidia.
3) modify the code and project settings to work with the AMD environment. Some of the convenience routines and logging were changed and a memory monitor added. Also added check for CL_DEVICE_TYPE_CPU.
4) copy the following AMD dlls into the AMD output directory:
aticalcl.dll, aticalrt.dll (pulled from recent driver)
OpenCL.dll (from AMD SDK)
5) running results in errors in both scan.cl and radixsort.cl:
<cl file> internal error: array_element_type: non-array type
__local uint numtrue;
^
1 catastrophic error detected in the compilation of <cl file>
Compilation aborted.
This is resolved by passing "-DAMD_BUILD" to clBuildProgram for the AMD builds and conditionally removing the __local in both files.
6) once the .cl files build without errors, running with AMD results in a crash on calling clFinish():
> OCL46C9.tmp.dll!001e14d7()
[Frames below may be incorrect and/or missing, no symbols loaded for OCL46C9.tmp.dll]
OCL46C9.tmp.dll!001e166d()
OpenCL.dll!1001612c()
It fails to allocate device memory for mBlockOffsets on the GPU (line 57, RadixSort.cpp).
Try the following:
Select a small value for numElements.
WORKGROUP_SIZE must be <= 256 for the GPU.
Yes, it is crashing on the CPU at my end also. The algorithm is too complex.
Are you saying that it works for you on the GPU if you change these settings? If so, it would help if you could tell me which GPU you use and how many elements you can sort.
The algorithm is adapted from "
I tried different values of numElements. It crashes in different places.
It takes a lot of time to understand the code. We hope to reply as early as possible.
Is it possible to get access to the AMD OpenCL CPU code under NDA? A call stack with source would really help to track down these mysterious crashes.
Can you tell me where the process for setting the size of these pools is described?
Well.. i reduced the numElements down to 16Kb and, as also reported by genaganna, still got a crash. I can play with different buffers, but i don't know if i'm addressing an underlying problem or just moving the symptoms around.
Which variable in particular do you think would be best?
You have it at the rapidshare link posted above. The kernel is almost identical to the NVidia kernel, but there was a complaint from the AMD compiler regarding one of the local variables. The issue didn't seem like it would cause a problem.
It's an implementation of Satish's recent paper and at time of publication was considered to be the fastest GPU sort. I need to extend it and add other operations and your CPU-based approach seems good, but source would allow us to take full advantage of the dev environment (and GPUs). It would be nice if OpenCL was Open Source.
The problem seems to be due to the local memory variable numtrue in rank4(). Removing the __local isn't a valid workaround because the variable must be updated by the group in order to calculate a valid rank. An invalid rank causes memory corruption in the calling function.
I tried the workaround suggested by mjharvey (http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=120374). It still crashed.
I also tried passing a __local into the kernel. This still crashes with my AMD environment.
The workaround approaches run ok in my NVidia environment. The project is at http://rapidshare.com/files/301718896/oclRadixSortSentToAMDForumWithLocalMemFix.zip.html