Hi, I have a kernel with a very low ALUBusy of 2%. It reads one value from global memory and writes 8 values (with trivial changes) back to global memory. Any recommendations to make it run faster?
Your ALUBusy is low because the kernel spends nearly all of its time accessing memory. Try reading multiple inputs per work-item to reduce the number of memory accesses. You say the changes are trivial, so this may not be worth doing on the GPU at all; I don't know your hardware or the rest of your workload, but trivial changes to data usually don't benefit from the GPU. So do more work once the data reaches the ALU (if there is anything else you want to add), and read multiple inputs at a time. Also, what hardware are you running on? Is your input data 1D, 2D, or 3D? What data types? It would help if you posted exactly what you are doing.
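To put a rough number on "all your time accessing memory", here is a minimal sketch in plain C (host-side arithmetic, not kernel code). It assumes 4-byte values; the helper names and the bandwidth figure in the note below are hypothetical and only there for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of global-memory traffic for n work-items, where each work-item
   reads `reads` elements and writes `writes` elements of elem_size bytes.
   Hypothetical helper, not part of any OpenCL API. */
static double traffic_bytes(size_t n, int reads, int writes, size_t elem_size)
{
    return (double)n * (double)(reads + writes) * (double)elem_size;
}

/* Lower bound on kernel time (seconds) if the kernel is purely limited by
   memory bandwidth. The bandwidth value is an input you must supply for
   your own device. */
static double min_time_seconds(double bytes, double bandwidth_bytes_per_s)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return bytes / bandwidth_bytes_per_s;
}
```

For one million work-items with 1 read and 8 writes of 4 bytes each, traffic_bytes gives 36,000,000 bytes. At an assumed 100 GB/s that is at least 0.36 ms of pure memory time, against only a handful of ALU operations per work-item, which is consistent with a very low ALUBusy.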
A few general suggestions I can give:
1. Do not do it on the GPU unless the data is already on the GPU because of some previous processing, or is needed there afterwards for more processing.
2. If you are still interested, then try to use the caches. You have L1 and L2 caches. If the access pattern is tile-based, try using images to benefit from the 2D L2 cache. Otherwise, stick to sequential access patterns.
3. Make sure that consecutive work-items, especially those inside the same wavefront, access consecutive memory locations. Also try to write the code so that a wavefront uses only a single channel for global access, so that many wavefronts can run together.
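Tip 3 can be made concrete with a small host-side sketch in plain C (not kernel code). It compares two ways of laying out the 8 outputs per work-item and counts how many memory segments one wavefront touches for a single store instruction. The wavefront size of 64, the 64-byte segment size used in the note below, and all function names are assumptions for illustration; check the real figures for your hardware.

```c
#include <assert.h>
#include <stddef.h>

#define WAVEFRONT_SIZE 64  /* assumption: 64 work-items per wavefront */
#define COPIES 8           /* each work-item writes 8 values, as in the question */

/* Interleaved layout: work-item gid writes out[gid * COPIES + j].
   Neighbouring work-items are COPIES elements apart for a given j. */
static size_t index_interleaved(size_t gid, size_t j, size_t n)
{
    (void)n;  /* unused in this layout */
    return gid * COPIES + j;
}

/* Planar layout: work-item gid writes out[j * n + gid].
   Neighbouring work-items write neighbouring elements for a given j. */
static size_t index_planar(size_t gid, size_t j, size_t n)
{
    return j * n + gid;
}

typedef size_t (*index_fn)(size_t gid, size_t j, size_t n);

/* Count the distinct segments of segment_bytes that the first wavefront
   touches when all its work-items execute their j-th store. Addresses grow
   monotonically with gid in both layouts, so counting changes of the
   segment number is enough. */
static size_t segments_touched(index_fn f, size_t j, size_t n,
                               size_t elem_size, size_t segment_bytes)
{
    size_t count = 0;
    size_t last = (size_t)-1;  /* sentinel: no segment seen yet */
    assert(segment_bytes > 0);
    for (size_t gid = 0; gid < WAVEFRONT_SIZE; ++gid) {
        size_t seg = f(gid, j, n) * elem_size / segment_bytes;
        if (seg != last) {
            ++count;
            last = seg;
        }
    }
    return count;
}
```

With 4-byte elements, 64-byte segments, and n = 1024, the planar layout touches 4 segments per store while the interleaved layout touches 32. In the kernel itself this would correspond to writing out[j * n + gid] rather than out[gid * 8 + j], assuming the later consumers of the data can work with that layout.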
I hope these tips help. But the final decision is in your hands based on your problem.