5 Replies Latest reply on Dec 14, 2014 3:58 PM by kulan

    Reuse of Kernel in aparapi / memoryleak ?




      Have I missed something about how to tell Aparapi / OpenCL that I will use a given Kernel more than once, with different data?


      The thing is, I have written a Java program which uses Aparapi to calculate matrix products, and for benchmarking I plan to use something like 8000 x 8000 or even 10,000 x 10,000.

      The input is byte[N][N] and the output is int[N][N].
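      For clarity, the product itself in plain Java looks roughly like this (my own reference sketch, a naive triple loop; the bytes are widened to int so the accumulation does not overflow):

```java
// Plain-Java reference version of the byte[N][N] x byte[N][N] -> int[N][N]
// matrix product (my own sketch, useful for checking GPU results).
public class MatMulReference {
    public static int[][] multiply(byte[][] a, byte[][] b) {
        int n = a.length;
        int[][] c = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                int aik = a[i][k];                // widen the byte once per inner loop
                for (int j = 0; j < n; j++) {
                    c[i][j] += aik * b[k][j];     // accumulate in int
                }
            }
        }
        return c;
    }
}
```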


      I have also written an Iterator so that I can write a loop like:


      for (WorkUnit wu : workUnitBuilder) {
          // ... run the kernel on this work unit ...
      }





      The problem is, when I do it like this, I get a fatal error: EXCEPTION_ACCESS_VIOLATION.

      To avoid this, I started to create a new kernel in every iteration. This seems to work for instances like 4000 x 4000, but when I go to 8000 x 8000 I see that after the ~67th kernel I get a warning from Aparapi that it has failed to compile the kernel to OpenCL (which is irritating, since it worked 65+ times before).

      So I started to monitor the RAM usage of the GPU (AMD HD 5650 Mobile). The result: every time a new calculation starts (new kernel + new data), the memory usage increases by about the size of the array chunks the kernel uses, but it only ever increases... And once it fails to compile further kernels, the memory usage of the GPU stays constant until I kill the Java process, at which point it drops to almost 0.


      Is there any way to free up kernel space that is no longer used? Maybe with kernel.setExplicit(true) and then copying the arrays by hand?


      Thanks in advance

        • Re: Reuse of Kernel in aparapi / memoryleak ?

          As you mentioned, please try “kernel.setExplicit(true)” and see if that helps. It avoids unnecessary buffer transfers between iterations.

          If it does not work as you expected, can you please send a complete test code (Host + kernel) so that we can reproduce it here?



            • Re: Re: Reuse of Kernel in aparapi / memoryleak ?

              Thanks for the answer, I tried it several times.


              The outcome was:

              When using kernel.setExplicit(true) I don't have the memory leak and it is a bit faster. But the drawback is that I always get empty results.


              I wrote a checker which checks the result chunks. For the non-explicit way the results look plausible, but for the explicit way I only get 0 values.


              I have encapsulated the put() + get() methods in the kernel.
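              To make sure I understand the explicit transfer semantics, here is a tiny plain-Java stand-in (NOT the real Aparapi classes; FakeExplicitKernel and its methods are made up) that mimics what I think explicit mode does: put() snapshots the host array to "device" memory, execute() mutates only the device copies, and only get() copies data back. A missing or mismatched get() therefore leaves the host array at its initial zeros, which is exactly the symptom I see:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for explicit-mode transfer semantics (NOT the real Aparapi API).
// Device buffers are keyed by the identity of the host array reference.
class FakeExplicitKernel {
    private final Map<int[], int[]> device = new HashMap<>();

    void put(int[] hostArray) {                      // host -> device
        device.put(hostArray, hostArray.clone());
    }

    void execute(int addend) {                       // "kernel" mutates device copies only
        for (int[] devCopy : device.values())
            for (int i = 0; i < devCopy.length; i++)
                devCopy[i] += addend;
    }

    void get(int[] hostArray) {                      // device -> host
        int[] devCopy = device.get(hostArray);
        if (devCopy != null)
            System.arraycopy(devCopy, 0, hostArray, 0, hostArray.length);
    }
}
```

              With this model, if the checker is handed a different array object than the one get() was called on, it only ever sees zeros.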


              Then I tried:



              Here are the results after the 16th kernel run has finished:

              explicit = true:

              following execution of kernel{

              Resource report 1 resources of type cl_command_queue still in play  1c5f9ff0(1198)

              Resource report 3 resources of type cl_mem still in play  1dfcf2a0(404) 1dfcf130(404) 1dfcefc0(404)


              (I got the same results / addresses 16 times)


              explicit = false:

              following execution of kernel{

              Resource report 16 resources of type cl_command_queue still in play  210ae610(1198) 210ae0d0(1198) 21124020(1198) [...]

              Resource report 48 resources of type cl_mem still in play  26f920f0(404) 26f91f80(404) 26f91e10(404) 26f91850(404) 26f916e0(404) 26f91570(404) [...]



              For managing the workUnits + workResults there is the WorkBuilder, which divides the 2 input arrays / matrices according to numLines, and stores all the result chunks (references).

              The plan was that this WorkBuilder "generates" the work chunks for the kernels, stores the results of each kernel run, and combines the result chunks back into the result array.
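              The splitting/combining part of that plan can be sketched in plain Java like this (my own simplification of the WorkBuilder idea, not the original code; it splits the result into row bands of numLines rows and copies the bands back at the end):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the WorkBuilder idea: the result matrix is held as row bands of
// numLines rows each; each "kernel run" fills one band, and combine() copies
// the bands back into the full n x n result array.
class ChunkedResult {
    final int n;
    final List<int[][]> chunks = new ArrayList<>();

    ChunkedResult(int n, int numLines) {
        this.n = n;
        for (int row = 0; row < n; row += numLines)
            chunks.add(new int[Math.min(numLines, n - row)][n]);  // last band may be smaller
    }

    // Combine all result chunks back into the full n x n array.
    int[][] combine() {
        int[][] full = new int[n][n];
        int row = 0;
        for (int[][] chunk : chunks)
            for (int[] line : chunk)
                full[row++] = line.clone();
        return full;
    }
}
```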


              I also get empty arrays back if I stay inside the "maxWorkItemSizes" of (256, 256, 256) and make chunks of 250 x 250.


              The way I understand kernel.get(array) is: I give it an array (for example the result array), it reads the *right* data back from GPU memory, and after the get(array) I should have the correct data in that array, even if I hold a reference to it elsewhere, for example in the WorkBuilder.
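              That understanding relies on get(array) filling the array in place, so every reference to the same array object sees the new values. A quick plain-Java check of that aliasing assumption:

```java
import java.util.ArrayList;
import java.util.List;

public class AliasingCheck {
    public static void main(String[] args) {
        int[][] resultChunk = new int[2][2];
        List<int[][]> storedRefs = new ArrayList<>();   // e.g. the WorkBuilder's store
        storedRefs.add(resultChunk);

        // Simulate kernel.get(resultChunk): data is written into the array in place...
        resultChunk[1][1] = 42;

        // ...so the reference held elsewhere sees the same values.
        System.out.println(storedRefs.get(0)[1][1]);    // prints 42
    }
}
```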


              Another interesting thing: when I increase N to 8000 (instead of 4000) I have to reduce numLines (the chunk size) to about 500, otherwise the AMD driver crashes and Windows 7 restarts it.

              I haven't figured out whether it depends on the memory consumption of a kernel or on the range (which is numLines x numLines in 2D). I have tested it with the latest (14.12 Omega) driver.
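              One idea I might try is deriving numLines from the device limits instead of hard-coding it (a hypothetical helper of my own, not part of Aparapi; maxWorkItemSize and memBudgetBytes are placeholders for values that would have to be queried from the device, and the ~6 bytes per element is a rough estimate for two byte inputs plus one int output):

```java
// Hypothetical helper: pick the largest numLines that respects both a
// per-dimension work-item limit and a rough per-chunk memory budget.
// Estimate: ~2 bytes of input + 4 bytes of output per element = 6 bytes.
public class ChunkSizer {
    public static int pickNumLines(int n, int maxWorkItemSize, long memBudgetBytes) {
        int numLines = Math.min(n, maxWorkItemSize);
        while (numLines > 1 && 6L * numLines * numLines > memBudgetBytes)
            numLines--;
        return numLines;
    }
}
```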


              Best regards