Analyze the ISA, not the IL, to see the final instruction schedule and to determine bottlenecks.
I obtain the IL by setting "GPU_DUMP_DEVICE_KERNEL" to 1.
How do I get the final ISA?
Setting it to 3 will get you both.
Thanks for the answers, both of you. I went through the ISA dump as well as the Cayman ISA documentation, and I have reason to believe that the instruction sequence below suffers heavily from dependence-chain problems.
The "PV" operand clearly indicates that instructions may have to wait for completion of previous instructions, leaving the ALU pipeline idle most of the time.
I am new to the AMD architecture. Can somebody enlighten me? Thanks much!
112 z: MULADD_e R127.z, R17.x, R2.x, R29.w
w: MULADD_e R127.w, R13.x, R2.x, R7.z VEC_210
113 x: MULADD_e R127.x, R13.y, R2.y, PV112.w
y: MULADD_e R127.y, R15.x, R2.x, R7.x
z: MULADD_e R127.z, R17.y, R2.y, PV112.z VEC_210
114 x: MULADD_e R127.x, R17.z, R2.z, PV113.z
y: MULADD_e R127.y, R15.y, R2.y, PV113.y
z: MULADD_e R127.z, R13.z, R2.z, PV113.x VEC_210
w: MULADD_e R127.w, R19.x, R2.x, R8.x
115 x: MULADD_e R127.x, R17.w, R2.w, PV114.x
y: MULADD_e R127.y, R13.w, R2.w, PV114.z VEC_210
z: MULADD_e R127.z, R19.y, R2.y, PV114.w
w: MULADD_e R127.w, R15.z, R2.z, PV114.y
116 x: MULADD_e R127.x, R15.w, R2.w, PV115.w
y: MULADD_e R127.y, R16.x, R3.x, PV115.x
z: MULADD_e R127.z, R12.x, R3.x, PV115.y VEC_210
w: MULADD_e R127.w, R19.z, R2.z, PV115.z
117 x: MULADD_e R127.x, R14.x, R3.x, PV116.x
y: MULADD_e R127.y, R12.y, R3.y, PV116.z
z: MULADD_e R127.z, R16.y, R3.y, PV116.y VEC_210
w: MULADD_e R127.w, R19.w, R2.w, PV116.w
118 x: MULADD_e R0.x, R16.z, R3.z, PV117.z
y: MULADD_e R127.y, R14.y, R3.y, PV117.x
z: MULADD_e R127.z, R18.x, R3.x, PV117.w
w: MULADD_e R127.w, R12.z, R3.z, PV117.y VEC_210
119 x: MULADD_e R127.x, R18.y, R3.y, PV118.z
y: MULADD_e R127.y, R21.x, R2.x, R28.y
z: MULADD_e R7.z, R12.w, R3.w, PV118.w
Each ALU group has almost 4 instructions in it (with the exception of the first, second and last groups).
So these should be combined into a VLIW packet and issued straight away in a single cycle.
Moreover, each ALU group has a dependency on the previous ALU group.
Let us assume that a wavefront is issued over 4 cycles (as 4 quarter-wavefronts).
By the end of 4 cycles, the first ALU group would have been scheduled for all 64 threads.
My kernel's workgroup size is 256 and it is very register-intensive:
32 registers per thread, so 256 * 32 = 8K registers.
Since 8K is such a round number, the number of active wavefronts could be either just 4 or 8. (Is there a way to figure this out?)
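For what it's worth, here is the arithmetic I am doing, as a sketch. The 16,384-entry register file per SIMD is my assumption for Cayman; please correct me if that figure is wrong:

```python
# Sketch of the occupancy arithmetic above.
# Assumption: Cayman has a 16,384-entry register file per SIMD;
# the real hardware limit may differ.
REGS_PER_SIMD = 16384
WAVEFRONT_SIZE = 64

def active_wavefronts(workgroup_size, regs_per_thread):
    waves_per_group = workgroup_size // WAVEFRONT_SIZE       # 256/64 = 4
    regs_per_wave = WAVEFRONT_SIZE * regs_per_thread         # 64*32 = 2048
    reg_limited_waves = REGS_PER_SIMD // regs_per_wave       # 16384/2048 = 8
    # Whole workgroups must fit, so round down to a multiple
    # of the wavefronts per workgroup.
    groups = reg_limited_waves // waves_per_group            # 8/4 = 2 groups
    return groups * waves_per_group

print(active_wavefronts(256, 32))  # -> 8
```

Under that assumption the register file fits 2 workgroups, i.e. 8 active wavefronts.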
Assuming 4 active wavefronts, the CU would schedule group 1 within 16 cycles.
At the beginning of the 17th cycle, the dependencies start playing up and will stall the wavefronts.
So the question is: will the GPU finish a MULADD instruction within 16 cycles or not?
If the number of active wavefronts is 8, I would have 32 cycles of leisure time before the dependencies start playing up.
I am not too sure how deep the pipeline is or how much latency a MULADD instruction has.
Can somebody throw light on this? Thanks!
As long as he has two wavefronts active, his code will not stall the machine on a per-ALU-cycle basis. The instruction latency is 8 cycles, and each wavefront executes 64 work-items over 4 cycles (16 work-items per cycle), so two wavefronts cover the ALU latency. The problem comes not from the ALU latency, which in his case is pretty well packed, but at the clause boundary.
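A quick sketch to sanity-check that claim; the 8-cycle latency and the 4-cycle wavefront issue rate are taken from the answer above:

```python
import math

ALU_LATENCY = 8      # cycles until a MULADD result is available (per the answer)
CYCLES_PER_WAVE = 4  # one wavefront issues as 4 quarter-wavefronts

def waves_to_hide_latency(latency=ALU_LATENCY, issue=CYCLES_PER_WAVE):
    # Back-to-back dependent ALU groups need enough other wavefronts
    # in flight to fill `latency` cycles worth of issue slots.
    return math.ceil(latency / issue)

print(waves_to_hide_latency())  # -> 2
```

So with 4 (or 8) active wavefronts the dependence chain is already covered, consistent with the answer.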
Thanks for your answer. Great to know that! So dependencies are not killing the pipe here, and all my performance bottleneck is coming from memory, which is quite understandable. What I have is a strided memory-access pattern.
I did optimize for cache and got around an 80% cache-hit rate in the profiler. But still my ALU is not busy, and I am not having a dependency issue either. How do I interpret that? Any clues?
Thanks for the tip. I am running Linux and sprofile does not show the occupancy; I think Windows has this feature. Anyway, thanks a lot for pointing me to that. It is going to be useful to me someday.
Thanks, all of you.
Hi sarnath, the occupancy calculator is supported under Linux. Use the -O switch when you collect performance counters or an API trace to generate the occupancy file.
e.g. sprofile -t -O -o ~/foo.atp ./MyApp MyAppArg1
You should be able to find the foo.occupancy file after profiling. It's in CSV format.
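In case it helps anyone else, a quick sketch for pulling numbers out of the .occupancy file. The column names below are made up for illustration; check the header line of your actual file, as it varies by profiler version:

```python
import csv
import io

# Stand-in for the contents of foo.occupancy.
# Real column names may differ; inspect your file's header first.
sample = """Kernel,Wavefronts,Occupancy
MyKernel,8,50.0
"""

for row in csv.DictReader(io.StringIO(sample)):
    print(row["Kernel"], row["Wavefronts"], row["Occupancy"])
```

For a real run, replace the StringIO with `open("foo.occupancy")`.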
Thanks a lot, Lihan. I did not know that sprofile had so many options! I've been using it for a long time now... :-)
You can use the APP Profiler to find out the theoretical number of active waves per CU on pre-SI hardware. The calculation is based on GPR usage, LDS usage and work-group size.
Note that the APP Profiler occupancy calculator doesn't work with Catalyst 12.1.