understanding measured performance

Discussion created by badkya on Mar 30, 2009
Latest reply on Mar 30, 2009 by ryta1203

I have several questions regarding the performance, scalability and measurement issues on an ATI GPU when using Brook+ on the SDK 1.3beta.

I have a simple "map" kernel that has 3 stream inputs, 2 stream outputs and a bunch of constants. Inside the kernel I perform multiply-adds, sqrts and divides on the 3 input streams to produce the 2 output streams. The output data rate matches the input data rate, and each index is processed in a fully data-parallel manner. The example is very similar in structure to the BlackScholes equations in the samples.

1. I compiled the program to run on my laptop, which has an ATI Mobility Radeon HD3650 GPU. I also have a Firestream 9250 GPU which I can access remotely. When I run it on the laptop and on the Firestream, I get nearly identical performance: the Firestream is only marginally faster (~2-3%). Why isn't performance scaling to use the extra parallelism of the newer architecture?

2. I initially suspected that my stream lengths might be too small. When I tried increasing the stream length, my laptop GPU gave up (with a "cannot allocate stream" error) much earlier than the Firestream device. After a few experiments I found that the maximum supported stream length differs by at least 2-4x between the two. But for the range of stream lengths that both devices could support, I saw little difference in performance.

3. I then thought that data-transfer time might be the bottleneck, but I found no way to measure data-transfer time separately from compute time. Is this possible with Brook [maybe low-level CAL has some support for doing this]?

4. Could my kernels be register limited? If so, will performance fail to scale when using the larger GPU? Does the Brook compiler tell me how many registers are used?

5. Is there a way to measure what % of the GPU is being used? Maybe I can launch multiple kernels in parallel? Is that possible?

6. I even tried forcing the domain size to values from 2 to 128 in power-of-two steps, but there was no change in performance.

7. Also, the measured runtime of the first iteration of the kernel is a lot higher than the rest. I guess this is related to CAL runtime startup time?
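
For questions 3 and 7, a minimal host-side sketch of the kind of measurement meant here, assuming nothing more precise is available from Brook+ or CAL. The three callables are hypothetical placeholders for the host-to-GPU copy, the kernel call and the GPU-to-host copy; none of the names below come from the Brook+ API. Each phase is timed with a wall clock on every iteration, and the first iteration can then be reported separately from the mean of the rest.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Wall-clock seconds taken by one call to fn().
template <typename Fn>
double seconds_for(Fn&& fn) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Seconds spent in each phase of one iteration.
struct PhaseTimes {
    double upload = 0.0;    // host -> GPU copy (placeholder)
    double compute = 0.0;   // kernel call (placeholder)
    double download = 0.0;  // GPU -> host copy (placeholder)
};

// Runs upload/run/download `iterations` times and records one
// PhaseTimes sample per iteration. The callables are hypothetical
// stand-ins supplied by the caller, not Brook+ functions.
template <typename Up, typename Run, typename Down>
std::vector<PhaseTimes> time_iterations(Up&& upload, Run&& run,
                                        Down&& download, int iterations) {
    std::vector<PhaseTimes> samples;
    for (int i = 0; i < iterations; ++i) {
        PhaseTimes t;
        t.upload = seconds_for(upload);
        t.compute = seconds_for(run);
        t.download = seconds_for(download);
        samples.push_back(t);
    }
    return samples;
}

// Per-phase mean over samples[skip..end). With skip = 1 this leaves
// out the first iteration, which may include one-time startup cost.
// Returns all zeros if there are no samples past `skip`.
inline PhaseTimes mean_after(const std::vector<PhaseTimes>& samples,
                             std::size_t skip) {
    PhaseTimes m;
    if (samples.size() <= skip) return m;
    for (std::size_t i = skip; i < samples.size(); ++i) {
        m.upload += samples[i].upload;
        m.compute += samples[i].compute;
        m.download += samples[i].download;
    }
    const double n = static_cast<double>(samples.size() - skip);
    m.upload /= n;
    m.compute /= n;
    m.download /= n;
    return m;
}
```

Comparing samples[0] against mean_after(samples, 1) would show how much of the first iteration is startup cost. One caveat: if the runtime returns from the kernel call before the GPU has finished, the wait would show up in the download phase rather than the compute phase, so the split is only as good as the synchronization around each step.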

I know this is a large laundry list, but feel free to chime in on any subset of the questions...

I also have some miscellaneous questions about Brook, as I am just starting out. Is there a boolean datatype? Are there type-conversion functions from float4 to int4 and vice versa?

Thanks..
