4 Replies Latest reply on Aug 20, 2010 2:34 PM by ryta1203

    SDK 2.2 profiler timings problem

    ryta1203

      Is anyone else getting inconsistent kernel times with SDK 2.2?

      I was getting some pretty varied results (up to ~40% swing) in my kernel timings with SDK 2.2/10.7b.

      I do not get this problem with SDK 2.1/10.5.

      Is this a known bug?

      For example, with the DCT sample I get a swing from ~6.6 to ~10.xx on SDK 2.2, but a consistent ~6.6 on SDK 2.1. With the Twister sample I get a swing from ~14.xxx to ~18.xx on SDK 2.2, but a consistent ~9.3xxx on SDK 2.1.

      With SDK 2.1 the timings are pretty much rock solid, but with SDK 2.2 they are all over the place within those ranges.
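
      One way to put a number on this kind of run-to-run variation is to collect the reported kernel time over many runs and summarize the spread. The following is a minimal sketch, not the profiler's own method: the function name `summarize_timings` is hypothetical, "swing" is assumed here to mean (max - min) / min, and the sample values are made up for illustration rather than taken from real measurements.

```python
from statistics import median

def summarize_timings(samples):
    """Return min, max, median and relative swing for a list of kernel times."""
    if not samples:
        raise ValueError("need at least one timing sample")
    lo, hi = min(samples), max(samples)
    if lo <= 0:
        raise ValueError("timings must be positive")
    return {
        "min": lo,
        "max": hi,
        "median": median(samples),
        # Assumed definition: swing relative to the fastest run (0.0 = stable).
        "swing": (hi - lo) / lo,
    }

# Hypothetical samples for illustration only, not measurements from this thread.
stable_runs = [6.6, 6.6, 6.7, 6.6]
noisy_runs = [6.6, 9.8, 7.1, 10.0]

print(summarize_timings(stable_runs))
print(summarize_timings(noisy_runs))
```

      Comparing the median as well as the swing helps tell a few slow outliers apart from timings that are slow most of the time.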

        • SDK 2.2 profiler timings problem
          coggy

          I see the same effect for kernels acting on input data with relatively small sizes on a Radeon HD 5870. For example, I have a short kernel acting on 614400 doubles taking anywhere between 30ms and 60ms. Often, a few consecutive kernel calls seem to take similar amounts of time, though. The running time for enqueueReadBuffer() followed by queue.finish() seems to vary just as much.

          Frequency scaling has been ruled out by setting the CPU, the GPU, and the card's memory to their maximum clock frequencies. Maybe there are other (power management?) states the hardware can be in, resulting in different running times depending on the current state?
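
          The observation that a few consecutive calls take similar amounts of time can be checked by grouping a timing series into runs of similar values. This is only a sketch under stated assumptions: `group_plateaus` and its 10% `tolerance` are hypothetical choices, and the sample series is invented for illustration. A few long runs would be consistent with the hardware or runtime sitting in distinct states; many one-element runs would look more like random jitter.

```python
def group_plateaus(samples, tolerance=0.10):
    """Split a sequence of timings into runs of consecutive similar values.

    A sample joins the current run if it is within `tolerance` (relative)
    of that run's first sample; otherwise it starts a new run.
    """
    runs = []
    for t in samples:
        if runs and abs(t - runs[-1][0]) <= tolerance * runs[-1][0]:
            runs[-1].append(t)
        else:
            runs.append([t])
    return runs

# Hypothetical series in milliseconds, for illustration only.
series = [30, 31, 30, 58, 60, 59, 31]
for run in group_plateaus(series):
    print(len(run), "call(s) near", run[0], "ms")
```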

            • SDK 2.2 profiler timings problem
              ryta1203

               

              bpurnomo seems to think it's involved with async memory transfers or something similar, but I'm not sure how that is possible, or why it occurs only in SDK 2.2 and not in SDK 2.1 (unless this is an OpenCL 1.1 thing).

              Also, I'm seeing this effect for 2k*2k problem sizes (not all that large, but not all that small either), i.e. ~4 million threads.

                • SDK 2.2 profiler timings problem
                  laobrasuca

                  echo

                  Me too, my kernels are crazy! I'm not sure whether it is the kernel computation itself or the memory manipulation that causes it, but I see variations from about 9ms to 45ms for one of my kernels. I really don't think it is about OpenCL 1.1 itself, at least not the spec; it's probably AMD's way of implementing it. I need to run the kernel on NVIDIA hardware/drivers to see if such fluctuations persist.

                  PS: my card is an HD 5770.

                • SDK 2.2 profiler timings problem
                  ryta1203

                   

                  Yes, problem size seems to matter: when I run DCT at 4k^2, the timing instability I see with SDK 2.2 seems to go away.

                  It seems like AMD is doing some runtime optimizations?
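
                  To put the problem sizes mentioned in this thread side by side, here is a small arithmetic sketch. It assumes "2k" and "4k" mean 2048 and 4096 (they could also mean 2000 and 4000), and that each element maps to one work-item; `work_items` is a hypothetical helper, not an SDK function.

```python
def work_items(width, height=1):
    """Number of elements in a 1D or 2D problem, assuming one work-item each."""
    return width * height

sizes = {
    "614400 doubles": work_items(614400),
    "2k x 2k": work_items(2048, 2048),  # assumption: 2k = 2048
    "4k x 4k": work_items(4096, 4096),  # assumption: 4k = 4096
}

for name, count in sizes.items():
    print(f"{name}: {count} elements")

# The 4k x 4k case has four times as many elements as the 2k x 2k case.
print(sizes["4k x 4k"] // sizes["2k x 2k"])
```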