I have several questions regarding the performance, scalability and measurement issues on an ATI GPU when using Brook+ on the SDK 1.3beta.
I have a simple "map" kernel that has 3 stream inputs, 2 stream outputs and a bunch of constants. Inside the kernel I am performing multiply-adds, sqrts and divides on the 3 streams to produce the 2 output streams. The output data rate matches the input data rate, and each index is processed in a fully data-parallel manner. The example is very similar in structure to the BlackScholes equations in the samples.
1. I compiled the program to run on my laptop which has an ATI Mobility Radeon HD3650 GPU. I also have a Firestream 9250 GPU which I can access remotely. When I ran it on my laptop and the Firestream processors, I get almost comparable performance. The Firestream processor is only marginally faster ~2-3%. Why isn't performance scaling to use more parallelism on the new architecture?
2. I initially suspected that my stream lengths might be too small. When I tried increasing the stream length, my laptop GPU gave up (cannot allocate stream error) well before the Firestream device. With a few experiments I noticed a difference of at least 2-4x in the maximum supported stream length. But for the range of stream lengths that both devices could support, I saw little difference in performance.
3. I then thought maybe my data-transfer time is the bottleneck. But I found no way to separately measure data-transfer time vs. compute time. Is this possible with Brook [maybe low-level CAL has some support for doing this]?
4. Could my kernels be register limited? If so, will performance fail to scale when using the larger GPU? Does the Brook compiler tell me how many registers are used?
5. Is there a way to measure what % of the GPU is being used? Maybe I can launch multiple kernels in parallel? Is that possible?
6. I even tried forcing the domain size to values from 2 to 128, in powers of two, but there was no change in performance.
7. Also, the measured runtime of the first iteration of the kernel is a lot higher than the rest. I guess this is related to CAL runtime startup time?
I know this is a large laundry list, but feel free to chime in on any subset of the questions...
I also had some miscellaneous questions about Brook as I am just starting out. Is there a boolean datatype? Are there type-conversion functions between float4->int4 and vice versa?
Here are my best guesses, which could, of course, be way wrong:
1. Sounds like the percentage of your code that is actually using the card is small. For example, if your program is doing a lot of CPU-GPU and GPU-CPU transfers, then that time depends on the bus type and not the card; maybe that is your bottleneck.
2. How large were these streams? And in what dimensions?
3. Sadly, AMD does not have a profiler.... yet (yet is wishful thinking).
4. You can use the Stream KernelAnalyzer to find information like this, among other information specific to kernel code. This program is not a profiler though and lacks a lot of important information needed to really enhance performance.
5. As far as I know multiple kernels can only be launched if you have multiple devices... meaning 1 kernel/1 device.
7. This could also be due to caching.
1) As ryta mentioned, your bottleneck is not the graphics card. It seems your map kernel is memory bound. If you want to improve performance, you should process more than a single element per thread, maybe 4, 8, or 16.
2) Usually the largest stream that can be allocated at one time is 1/2 the memory of the graphics card, or the size of the address space for PCI memory (usually between 192-256MB on Windows), but it can differ due to other constraints of the system/OS.
3) This is a Brook+ issue: it does not submit a kernel to the graphics card until absolutely necessary, so that it can batch up requests and improve performance.
4) Unless you're using 20+ registers, the chance of being register limited on the 9250 is low, but this could be an issue on the 3650 as you have a smaller register pool.
5) This can be guesstimated by calculating the number of wavefronts being executed based on your domain size, calculating the number of wavefronts that can run at once based on the register count, and then dividing the first number by the second. If your ratio is larger than 1, then most likely you are using 100% of the GPU.
6) The GPU needs thousands of threads running in parallel to reach full capacity; such small domain sizes do not come close to achieving this.
7) Brook+ postpones compilation of IL to ISA until the first time a kernel is executed, and then the generated ISA is cached so that recompilation is not required the next time it is executed. This is done to allow Brook+ to run on any Stream-compatible hardware without having to compile the IL for every possible card.
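To keep the one-time cost described in 7) out of the steady-state numbers, the first call can be timed separately from the rest. Here is a minimal sketch in Python rather than Brook+; `time_iterations` and `run_once` are hypothetical names, and `run_once` stands in for whatever launches the kernel. Given point 3), it presumably has to read the output stream back before returning, otherwise the timer may stop before the kernel has actually run.

```python
import time

def time_iterations(run_once, iterations=10):
    """Time each call of run_once and report the first call separately.

    run_once is a hypothetical stand-in for "launch the kernel and read the
    result back"; it must block until the work is really finished.
    Returns (first_call_seconds, mean_seconds_of_the_remaining_calls).
    """
    if iterations < 2:
        raise ValueError("need at least 2 iterations to separate warm-up")
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    rest = samples[1:]
    return samples[0], sum(rest) / len(rest)

if __name__ == "__main__":
    # Placeholder workload; replace with a real kernel launch plus readback.
    first, steady = time_iterations(lambda: sum(range(100_000)))
    print("first call:", first, "steady state:", steady)
```

If the first call is much slower than the steady-state mean, that is consistent with the startup and compilation cost above, though caching effects would look similar.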
Thank you for the replies.. I have some followup questions...
1. I would like to figure out if my performance is memory-bound. What I would ideally want is to be able to separately measure memory transfer time and compute time. I suspect my memory transfer time is not that high.. I perform CPU->GPU communication only on the input and output streams... Will the intermediate variables within a Brook kernel stay on the GPU? If I split my kernel into multiple kernels, how do I tell an intermediate stream to stay on the GPU?
2. My streams are currently 8192x64. But, since I am trying to just understand performance, I can change this to a number large enough until it fails. For this stream, at each index computation is fully data-parallel... does each wavefront bunch together an 8x8 block from the stream? If this is true, it implies I will have 1024x8 independent threads.. My laptop GPU has ~120 processors while the Firestream device has ~800... in either case I have far more threads than the number of processors in the GPU... so there should be more than enough work to keep the GPU busy... So I am guessing the bottleneck is elsewhere? I am not entirely sure, but maybe I am using more than 20 registers in the kernel...
3. I am installing the Kernel Analyzer to see if it tells me how many registers each kernel has... Sounds like something the Brook compiler should be able to report pretty easily?
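The wavefront arithmetic in point 2, and the ratio suggested earlier in the thread, can be checked with a few lines of Python. This is only a sketch under assumptions: the wavefront size of 64 is the 8x8 block guessed at above and may differ between devices, the domain sizes 2 to 128 are treated as thread counts along one dimension, and `resident_wavefronts` is a hypothetical input that would have to come from the register count and the hardware documentation.

```python
def wavefronts_needed(width, height, wavefront_size=64):
    """Number of wavefronts needed to cover a width x height domain.

    wavefront_size=64 is an assumption (one 8x8 block per wavefront).
    """
    threads = width * height
    return (threads + wavefront_size - 1) // wavefront_size  # ceiling division

def occupancy_ratio(width, height, resident_wavefronts, wavefront_size=64):
    """Wavefronts requested divided by wavefronts the device can run at once.

    resident_wavefronts is a hypothetical figure derived from the kernel's
    register count; a ratio above 1 suggests the device is kept full.
    """
    return wavefronts_needed(width, height, wavefront_size) / resident_wavefronts

if __name__ == "__main__":
    print(wavefronts_needed(8192, 64))  # 8192, i.e. the 1024x8 figure above
    # Domain sizes 2..128 need only one or two wavefronts each.
    print([wavefronts_needed(n, 1) for n in (2, 4, 8, 16, 32, 64, 128)])
    # 100 is a made-up placeholder, not a real hardware number.
    print(occupancy_ratio(8192, 64, resident_wavefronts=100))
```

Under these assumptions the 8192x64 stream yields thousands of wavefronts, so the domain itself looks large enough; the 2 to 128 experiments, on the other hand, never produce more than two.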
1. Yes, data can be passed between kernels without having to read it back to the CPU. This is easily achieved:
kernel1(a, b, c);  // a, b inputs; c output
kernel2(c, e);     // c input, e output
kernel3(e, f);     // e input, f output
If you are copying data, calling 1 kernel and then copying data back... it's probably memory bound. Just a guess since I don't know what your code looks like. If you are not memory bound and you are doing this then your kernel is probably really large and uses some insane number of registers or such, in which case you need to think about dividing it into smaller kernels maybe.
Brook+ does not ship with a profiler (I agree that it should, and I still don't understand why there is no profiler for CAL after almost a year). Like I said before, without a profiler there just isn't enough information for the average developer.
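Without a profiler, one rough workaround is to time the whole sequence (copy in, run the kernel N times, copy out) for several values of N and fit a straight line: the slope approximates the cost of one kernel call and the intercept approximates the fixed cost, which is mostly transfer. The sketch below is Python, not Brook+, and `fit_fixed_and_per_call` is a hypothetical helper; it only does the fit, and the timings have to come from your own runs, taken after a warm-up call so the first-run cost is not folded in.

```python
def fit_fixed_and_per_call(samples):
    """Least-squares line through (kernel_calls, total_seconds) pairs.

    Returns (fixed_seconds, per_call_seconds): the intercept estimates the
    cost that does not grow with the number of kernel calls (transfers and
    other overhead), and the slope estimates the cost of one kernel call.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("measurements must use different call counts")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    per_call = sxy / sxx
    fixed = mean_y - per_call * mean_x
    return fixed, per_call

if __name__ == "__main__":
    # Synthetic numbers for illustration only, not measurements.
    made_up = [(1, 0.012), (2, 0.014), (4, 0.018), (8, 0.026)]
    fixed, per_call = fit_fixed_and_per_call(made_up)
    print("fixed:", fixed, "per call:", per_call)
```

If the fixed part dominates for the N you actually use, the run is transfer bound and a faster card will not help much; if the slope dominates, the kernel itself is the place to look.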