any other clues as to what would explain the discrepancy between what the profiler and analyzer are reporting?
It is normal. The static analyzer performs evaluations in a completely insulated world AFAIK while the profiler measures real data from the GPU, after it has been mangled by the driver. The driver can change and every version can produce different code. In general, I would say the compiler is smarter in the driver. You can reasonably use the analyzer only for "order of magnitude" decisions.
Is there a strategy for converting VGPR registers to SGPR registers? The kernel only uses ~22 SGPRs now, so it is apparent that the SGPRs are underused; a better balance should yield better performance.
I haven't seen much. SGPRs are limited in purpose and features. If you need to lower VGPR usage you might have more luck with LDS storage... That said, are you sure you need more occupancy? I've seen kernels run at 100% load even with 100+ VGPRs!
Well, I am uncertain whether fewer VGPRs would yield better performance. To check, it would seem I'd need to remove code to reduce VGPRs and then measure, but the kernel would then run faster simply because there is less code, so that experiment is not fruitful. It does seem that the VGPR reductions I've made so far haven't improved performance much. Currently I am addicted to reducing VGPR usage so that I can get to the next threshold (48 VGPRs, yielding 6 wavefronts in flight; the profiler currently reports 53, the analyzer 44).
There are two accesses to global memory in the kernel: for one I move the data to LDS, and for the other I am relying on the cache to... cache (which seems to be working). There are ~19K VALU instructions, so perhaps the computation is large relative to the stalls from global memory accesses, and therefore fewer VGPRs would not improve performance...
"Currently I am addicted to reducing VGPR usage so that I can get to the next threshold..."
It is indeed something I can understand!
Never forget to look at VALUBusy% and the memory stall metrics. With 19k instructions you should be able to move quite some data before stalling, even with 50 VGPRs... if the computation is heavy enough. Memory access patterns are important; try to have the work-items collaborate in moving data. Async memory transfers are awesome for this if you can afford the LDS cost.
You can estimate the effectiveness of your pattern by looking at the amount of memory I/O. If the measured traffic is much higher than your reasonable guess, the pattern is likely sub-optimal. Some time ago I reformulated an algorithm to work across two work-items: I took a small performance hit, but bandwidth usage was cut in half!
OK, it looks like my VALUBusy% is around 37-40% and SALUBusy% is around 6.5%. VALUUtilization is nearly 100%. MemUnitStalled% = ~46-55%, WriteUnitStalled% = ~30%.
Does this tell you anything interesting?
I speculate you're thinking "1 work-item ~ 1 thread" and have the output from each WI being written as:
output[0] = finalValue[0];
output[1] = finalValue[1];
...
output[N-1] = finalValue[N-1];
This will cause "nearby" work-items to write to strided memory addresses. If the stride is big, they will all hit the same memory channel and cause a stall.
Of course this is speculation.
You're correct. It would seem I could use some sort of hashing algorithm to change the index I write to, so that the outputs are not near each other. The problem is a 1D array where the values are calculated and pushed into the output array based on arbitrary inputs. The benefit of calculating nearby outputs is that they use nearby inputs (from global memory), so there is likely some caching advantage. Perhaps one approach would be to write the results to LDS and move them to global memory all at once? Are there other options?
(Update: I changed each output write to go to LDS and then be written to global memory all at once, and it slowed things down by quite a bit. I added a barrier at the end of the kernel and then an async mem copy from local to global.)