0 Replies Latest reply on Aug 21, 2013 3:57 AM by Meteorhead

    compiler bugs: __local memory unallocated, debug performance




      I have encountered two rather large problems, which seem to be rooted in the compiler itself. I have been debugging kernel code after a major refactoring, and I could not find the source of my problems until now.


      1. I designed my code to behave slightly differently when it detects a debug build. This information is passed on to the kernels, which in that case use debug indexers (they issue a printf should any kernel instance overindex a buffer). Inside the kernels this is practically the only difference. The AMD OpenCL compiler is significantly slower in debug mode, and by this I mean SIGNIFICANTLY! In fact, in debug mode my code cannot even compile, because the compiler eats up all of my RAM and virtual memory (Win8 64-bit, 4GB of RAM). The kernel code is no black magic: there are 2 nested switch statements with the work inside (with the debug indexers inside, no further loops or any added complexity). I would not expect flow control of this complexity to crash the compiler in debug mode. Compiling for the CPU is fine, but GPU compilation cannot finish.
      2. I gave up on finding the bug on the GPU itself, so I went for the faster CPU compilations and highly reduced work sizes. I commented out nearly all of the code inside the kernel, only to find that the problem (or at least one of them) is that one of the 3 __local allocations remains unallocated. Trying to access element [0] of the array fires "Access violation reading location 0xFFFFFFFF". With exactly the same parameters, the Intel OCL implementation works fine. (I have not checked kernel output yet, but it does not segfault on accessing local memory.)
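
For readers unfamiliar with the idea, a debug indexer of the kind mentioned in point 1 can be sketched roughly as follows. This is only an illustration written as plain C so it can be compiled on the host. The names `DEBUG_INDEXERS`, `checked_index` and `overindex_count` are hypothetical and are not taken from the attached project; in an OpenCL build the flag would typically arrive through a `-D` option in the build options string.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical switch: 1 in debug builds, 0 in release builds.
   In an OpenCL build it could be set with "-D DEBUG_INDEXERS=1". */
#ifndef DEBUG_INDEXERS
#define DEBUG_INDEXERS 1
#endif

/* Counts detected out-of-range accesses (illustration only). */
int overindex_count = 0;

/* Returns i unchanged when it is in range. In debug builds an
   out-of-range index is reported with printf and replaced by 0,
   so the following access stays inside the buffer (size > 0). */
size_t checked_index(size_t i, size_t size, const char *name)
{
#if DEBUG_INDEXERS
    if (i >= size) {
        printf("overindex: %s[%lu], size is %lu\n",
               name, (unsigned long)i, (unsigned long)size);
        ++overindex_count;
        return 0;
    }
#else
    (void)size;
    (void)name;
#endif
    return i;
}
```

A buffer access then reads `buf[checked_index(i, n, "buf")]` instead of `buf[i]`. With the flag set to 0 the check disappears, so the only difference between the two builds is the extra branch and the printf call.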


      The problematic kernel is in the QGripper_Kernels.cl file, with the function starting at line 215. All data initialization and nearly everything else is commented out; the problematic line is 259. The original assignment is kept in a comment. Changing Psi_t to either Psi or Psi_r lets the application finish properly (though it does practically no computation). These local allocations are passed in from the calling kernel.
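
The pattern described above (three __local arrays declared in the kernel and handed to a helper function) looks roughly like the sketch below. It is not the code from QGripper_Kernels.cl: the element type `float`, the size `TILE` and the names `touch_locals` and `probe_locals` are assumptions made for illustration, and only the buffer names Psi, Psi_r and Psi_t come from the description. The shim block lets the same text compile as plain C on the host; an OpenCL compiler defines `__OPENCL_VERSION__` and skips it.

```c
/* Host-side shims (assumption: only needed when not compiled as OpenCL C). */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __local
#define __global
static unsigned long get_local_id(unsigned dim) { (void)dim; return 0; }
#endif

#define TILE 8 /* hypothetical number of elements per local buffer */

/* Writes and reads back element 0 of each buffer.
   Returns a bitmask: bit 0 = Psi, bit 1 = Psi_r, bit 2 = Psi_t.
   A buffer that was never allocated would fault on its first access,
   which narrows the failure down to a single pointer. */
int touch_locals(__local float *Psi, __local float *Psi_r, __local float *Psi_t)
{
    int ok = 0;
    Psi[0] = 1.0f;
    if (Psi[0] == 1.0f) ok |= 1;
    Psi_r[0] = 2.0f;
    if (Psi_r[0] == 2.0f) ok |= 2;
    Psi_t[0] = 3.0f;
    if (Psi_t[0] == 3.0f) ok |= 4;
    return ok;
}

/* Calling kernel: declares the three local buffers and passes them on. */
__kernel void probe_locals(__global int *status)
{
    __local float Psi[TILE];
    __local float Psi_r[TILE];
    __local float Psi_t[TILE];

    /* One work-item per group is enough for this probe. */
    if (get_local_id(0) == 0)
        status[0] = touch_locals(Psi, Psi_r, Psi_t);
}
```

If a reduced kernel of this shape still faults on only one of the three buffers, that would point at the allocation itself rather than at the surrounding computation.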


      The host code is part of a much larger application using Qt, but I know people are hesitant to install 3rd party add-ins only to reproduce an issue, so I have removed all the Qt stuff. The VS project should compile with a valid installation of the APP SDK, optionally with the Intel OCL SDK. I have tried both Catalyst 13.8 Beta and Beta 2.