I'm trying to use CodeXL instead of CodeAnalyst, and I see at least some of the same issues I had with CodeAnalyst.
1) CodeXL can't capture a GPU API trace for my app. I tested this on 2 different hosts. I only get any results back when "write every 100ms" is enabled.
2) Please look at the picture. A 4 KB buffer is copied within GPU memory. But why are the times so different and so large?
Are these times real, or am I getting some artefacts here? Would it be better to use a copy kernel instead of a CopyBuffer call for these 4 KB of data?
I should add that in another part of the profiling timeline (same algorithm, same run, just a different iteration) I got MUCH better times, around 9-10 us. In yet another iteration the times were even worse, >1 ms. It is the same 4 KB buffer being copied, just in a different iteration.
Also, that buffer is recreated from iteration to iteration (the app does quite different processing in different stages, so the buffer has to be re-allocated to reduce the memory footprint). Could it be that in some iterations the buffer was placed so badly that the times were orders of magnitude worse? Or are these just artefacts? If so, how can I avoid them?