I am debugging a kernel that is not very complex. Stepping over source lines is extremely slow. When I let it run to another breakpoint in the middle of the code, it also takes a very long time to get there. I reduced the work size to a 16x16 global size with 8x8 local work groups, and it is still SLOW. This makes the debugger useless.
The kernel does use a lot of global and/or texture memory. This memory is needed for the computation and cannot be reduced. I am not sure if this is the cause of the slowness. I really don't think a debugger would be this slow even if the kernel uses a lot of memory.
Please publish your hardware/software/platform configuration.
1. What GPU Card? What CPU?
2. AMD Driver Version?
3. APP SDK Version if used
4. CodeXL Version
5. OS - Windows or Linux, 32-bit or 64-bit?
6. Are you running your OpenCL code on CPU or GPU?
The Catalyst driver installer I used was the January 2013 version, but under Device Manager the driver version shows 8.961.0.0, dated 4/5/12.
APP SDK is 2.8
Windows 7 64-bit
The PC has only one graphics card (a 7970) and an Intel CPU. The program looks for the platform named "AMD ..." and uses that platform, so it should be using the GPU.
You have a very old driver, older than Catalyst 12.10.
The current stable Catalyst release is 13.1.
The current beta Catalyst release is 13.3.
Did you upgrade your driver? (or) Did you install it fresh?
I ask because there are some known un-install failures. AMD provides a clean un-install tool here:
The following site lists the normal un-install steps
At the moment, I am not sure how the two differ from each other. I would recommend that you first un-install normally (as described in the second link) and then run the un-install utility, which wipes all AMD driver DLLs and libraries. After this, reboot and then do a fresh install of the latest stable driver (13.1) or any beta release, as you please.
I will ask someone from the CodeXL team to respond here. In the future, I would recommend posting CodeXL-related issues in the CodeXL category.
Can you also specify the steps needed to reproduce this slowness? Does it happen with any OpenCL code, or only with your kernel? If it is an application-specific issue, it is better to attach a small test case.
Also, when you say a long time, can you be specific? (Is it seconds, minutes, or hours?)
EDIT: I have moved this thread to the CodeXL category, as it is more likely to get the attention of the right people here. Sorry I did not inform you earlier.
I followed the procedure to clean and reinstall the software, and double-checked that the driver version is now correct (Dec 2012). During installation, the drag-and-drop transcoding and media transcoding components failed to install, and the APP SDK sample installation also failed. I don't know why, but I guess these would not affect debugging.
I used CodeXL again and it's the same thing. It took 18 seconds to step to the second line of the kernel (the first line is a simple assignment). For statements like the following:
for (i=0; i<8; i++)
    for (j=0; j<8; j++)
        b[k++] = read_image(frame, imageSampler, (int2)(j, i)).x;
The for loop line could take up to 18 seconds to run, and the read_image statement could take 2.5 to 5 minutes to run!
I don't think this is specific to my kernel. I reduced the kernel to the simplest possible form: it merely reads values from an image and writes them to global memory.
Please do post this to the CodeXL forum and give me the link afterwards. I am having a hard time finding the forum; the link took me to an "old" forum that does not seem to be in use.
Hi, sorry for the slow reply.
I have added the following code inside an APP SDK sample (SimpleImage), and it seemed to run very well with the current build of CodeXL (which will soon be released as CodeXL 1.2) and the Catalyst driver (Catalyst 13.4).
int i, j;
int k = 0;
__local float b[64];
for (i=0; i<8; i++)
    for (j=0; j<8; j++)
        b[k++] = read_imagef(input, imageSampler, (int2)(j, i)).x;
1. I've noticed something fairly weird in your sample code. It seems (judging from the small excerpt that you gave) that your kernel uses multiple work items to read the same data from the image and write it to the same locations (since i, j and k, and therefore (int2)(j, i), all seem to be private variables).
a. If the work items are indeed reading from the same image and writing to the same buffer, the performance I saw is fairly good. See 2, 3 below for why this might be. I suggest making the kernel more efficient by having the work items share the work in parallel, e.g.:
int id = get_global_id(0);
i = (id / 8);  // row, so that b[id] matches b[k] in the loop version
j = (id % 8);  // column
b[id] = read_imagef(input, imageSampler, (int2)(j, i)).x;
b. If you prevent the work items from writing to the same location somehow (e.g. by initializing k to a different starting offset for each work item, such as 64 * get_global_id(0), and making b a larger buffer), the performance I see is the same.
c. If the prevention is by doing something like using barriers or making k volatile - that in itself might cause the slowdown in stepping. That also would create a slowdown in the kernel when you're not debugging it, so I'd suggest moving away from that algorithm, as you're effectively serializing the operation and losing the benefits of OpenCL. Try using CodeXL's GPU profiling capabilities to see how to improve this.
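To make the indexing in 1a and 1b concrete, here is a minimal host-side sketch in plain C (not OpenCL, and not taken from the sample). It only simulates the index arithmetic. The names fake_pixel, fill_serial, fill_one and slice_offset are hypothetical, and fake_pixel is just a stand-in for read_imagef(...).x. It assumes a row-major mapping (i = id / 8, j = id % 8), which is the one that reproduces the order of the original double loop.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 8                    /* the 8x8 block read by the kernel */
#define TILE_ELEMS (TILE * TILE)  /* 64 values per block */

/* Hypothetical stand-in for read_imagef(...).x: encodes the coordinates
   so that every pixel has a distinct, predictable value. */
static float fake_pixel(int x, int y)
{
    return (float)(y * 100 + x);
}

/* Serial version: what every single work item does in the loop form. */
static void fill_serial(float *b)
{
    int k = 0;
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            b[k++] = fake_pixel(j, i);
}

/* Parallel version (case 1a): the body of one work item, id in [0, 64).
   Row-major mapping, so b[id] holds the same pixel as b[k] above. */
static void fill_one(float *b, int id)
{
    assert(id >= 0 && id < TILE_ELEMS);
    int i = id / TILE;  /* row    (y) */
    int j = id % TILE;  /* column (x) */
    b[id] = fake_pixel(j, i);
}

/* Case 1b: first index of the private 64-element slice of a work item,
   i.e. the starting value of k when b is made large enough. */
static size_t slice_offset(size_t global_id)
{
    return (size_t)TILE_ELEMS * global_id;
}
```

Calling fill_one once for each id from 0 to 63 fills the buffer exactly as fill_serial does, but each call touches only one element, which is the property the parallel kernel relies on.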
2. If you were debugging through the CodeXL Visual Studio extension, specifically on Visual Studio 2012, there is a known issue causing slowdown on that version in CodeXL 1.1. Try using the standalone CodeXL application (CodeXL.exe) until CodeXL 1.2 (which fixes this) comes out.
3. Otherwise, the way CodeXL currently performs kernel debugging requires it to recompile the kernel as you step through it. This is a CPU intensive operation. Have a look at the application you're debugging in Windows' task manager during the 2-5 minute wait you're experiencing.
a. If the application is working and sitting at a high CPU percentage (usually this will be 100 / the number of cores your CPU has, e.g. 50%, 25% or 16%), the wait is probably in the compiler. Try upgrading to the latest Catalyst version or freeing up the CPU and see if that improves anything. Otherwise, if your CPU is not very new, it might simply be that it cannot keep up with the GPU's capabilities. You can also use CodeXL's CPU profiler to see whether your application itself is what's taking up the resources. If most profiling samples are inside the AMD drivers (C:\Windows\[System32|SysWOW64]\amd*.dll), the time is probably going to the kernel compiler.
b. If the application is mostly sitting at 0% CPU, and only occasionally jumps up (or never jumps up at all), this is more likely a wait in the GPU. Either you are running many graphic / compute applications at the same time, or your kernel is too complex or too serialized (such as in case 1c above). Try making the kernel more parallelized or running it through CodeXL's GPU profiler to see what can be improved on this front.
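The rule of thumb in 3a and 3b can be written down as a small sketch in plain C. This is only an illustration of the arithmetic, not something CodeXL provides: the function names and the two thresholds (within 90% of one full core, or at most 1%) are assumptions chosen for the example.

```c
#include <assert.h>

/* Approximate Task Manager reading when one thread saturates one core. */
static int one_core_percent(int cores)
{
    assert(cores > 0);
    return 100 / cores;  /* integer division, e.g. 6 cores -> 16 */
}

typedef enum { WAIT_COMPILER, WAIT_GPU, WAIT_UNCLEAR } wait_kind;

/* Rough classification of a debugger stall from the CPU percentage
   observed for the debugged application. Thresholds are hypothetical. */
static wait_kind classify_wait(int observed_percent, int cores)
{
    assert(cores > 0);
    /* Close to one fully busy core: likely the kernel recompilation. */
    if (observed_percent * cores >= 90)
        return WAIT_COMPILER;
    /* Essentially idle: the host is likely waiting on the GPU. */
    if (observed_percent <= 1)
        return WAIT_GPU;
    return WAIT_UNCLEAR;
}
```

For example, a steady 25% on a 4-core machine falls into the compiler case, while a process sitting at 0% falls into the GPU case. Anything in between needs the profiler to tell the two apart.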
Hope this information is useful for you, and sorry again for the slow reply.