sarnath
Journeyman III

AMD IL - Instruction Schedule

Hi,

Is the instruction schedule in the IL code the final one?

Or will the schedule change as it passes through subsequent compilation phases?

I have code that shows a long dependency chain in the IL.

I am not sure whether this is a performance bottleneck.

How can I verify this?

Thanks for any info,

Best Regards,

Sarnath


Analyze the ISA, not the IL, to see the final instruction schedule or to determine bottlenecks.

Hi,

Thanks!

I obtain the IL by setting "GPU_DUMP_DEVICE_KERNEL" to 1.

How do I get the final ISA?

Thanks,

Best Regards,

Sarnath


Setting it to 3 will get you both (the IL and the ISA).
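For reference, a minimal sketch of setting that variable from the host process itself rather than the shell; it must take effect before the OpenCL runtime compiles the kernel (i.e. before clBuildProgram runs). The value meanings follow this thread: 1 dumps the IL, 3 dumps both the IL and the ISA.

    #include <stdlib.h>

    int main(void)
    {
        /* Must be set before the OpenCL runtime builds the kernel,
           i.e. before clBuildProgram() is reached.
           "1" dumps the IL; "3" dumps both the IL and the final ISA. */
        setenv("GPU_DUMP_DEVICE_KERNEL", "3", 1 /* overwrite */);

        /* ... create the context, build the program, and run the
           kernel as usual ... */
        return 0;
    }

Equivalently, export GPU_DUMP_DEVICE_KERNEL=3 in the shell before launching the application.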

sarnath
Journeyman III

Thanks for the answers, both of you. I went through the ISA dump as well as the Cayman ISA documentation, and I have reason to believe that the instruction sequence below suffers heavily from dependency-chain problems.

The "PV" operand (which reads the previous ALU group's result) clearly indicates that instructions have to wait for the completion of previous instructions, leaving the ALU pipeline idle for most of the time.

I am new to the AMD architecture. Can somebody enlighten me? Thanks very much!

        112  z: MULADD_e    R127.z,  R17.x,  R2.x,  R29.w
             w: MULADD_e    R127.w,  R13.x,  R2.x,  R7.z      VEC_210
        113  x: MULADD_e    R127.x,  R13.y,  R2.y,  PV112.w
             y: MULADD_e    R127.y,  R15.x,  R2.x,  R7.x
             z: MULADD_e    R127.z,  R17.y,  R2.y,  PV112.z      VEC_210
        114  x: MULADD_e    R127.x,  R17.z,  R2.z,  PV113.z
             y: MULADD_e    R127.y,  R15.y,  R2.y,  PV113.y
             z: MULADD_e    R127.z,  R13.z,  R2.z,  PV113.x      VEC_210
             w: MULADD_e    R127.w,  R19.x,  R2.x,  R8.x
        115  x: MULADD_e    R127.x,  R17.w,  R2.w,  PV114.x
             y: MULADD_e    R127.y,  R13.w,  R2.w,  PV114.z      VEC_210
             z: MULADD_e    R127.z,  R19.y,  R2.y,  PV114.w
             w: MULADD_e    R127.w,  R15.z,  R2.z,  PV114.y
        116  x: MULADD_e    R127.x,  R15.w,  R2.w,  PV115.w
             y: MULADD_e    R127.y,  R16.x,  R3.x,  PV115.x
             z: MULADD_e    R127.z,  R12.x,  R3.x,  PV115.y      VEC_210
             w: MULADD_e    R127.w,  R19.z,  R2.z,  PV115.z
        117  x: MULADD_e    R127.x,  R14.x,  R3.x,  PV116.x
             y: MULADD_e    R127.y,  R12.y,  R3.y,  PV116.z
             z: MULADD_e    R127.z,  R16.y,  R3.y,  PV116.y      VEC_210
             w: MULADD_e    R127.w,  R19.w,  R2.w,  PV116.w
        118  x: MULADD_e    R0.x,  R16.z,  R3.z,  PV117.z
             y: MULADD_e    R127.y,  R14.y,  R3.y,  PV117.x
             z: MULADD_e    R127.z,  R18.x,  R3.x,  PV117.w
             w: MULADD_e    R127.w,  R12.z,  R3.z,  PV117.y      VEC_210
        119  x: MULADD_e    R127.x,  R18.y,  R3.y,  PV118.z
             y: MULADD_e    R127.y,  R21.x,  R2.x,  R28.y
             z: MULADD_e    R7.z,  R12.w,  R3.w,  PV118.w

Almost every ALU group has 4 instructions in it (the exceptions are the first, second, and last groups).

So these should be combined into a VLIW packet and issued in a single cycle.

Moreover, each ALU group has a dependency on the previous ALU group.

Let us assume that a wavefront is issued over 4 cycles (as 4 quarter-wavefronts).

By the end of 4 cycles, the first ALU group would have been issued for all 64 threads.

My kernel's workgroup size is 256, and the kernel is very register-intensive.

It uses 32 registers per thread: 256 * 32 = 8K registers per workgroup.

Since 8K is such a round number, the number of active wavefronts could be just 4, or it could be 8. (Is there a way to figure this out?)

Assuming 4 active wavefronts, the CU would issue the first ALU group for all of them within 16 cycles.

At the beginning of the 17th cycle, the dependencies start playing up and will stall the wavefronts.

Now the question is: will the GPU finish a MULADD instruction within 16 cycles or not?

If the number of active wavefronts is 8, then I would have 32 cycles of slack before the dependencies start playing up.

I am not sure how deep the pipeline is or how much latency a MULADD instruction has.

Can somebody shed some light on this? Thanks!
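(As a back-of-the-envelope check on the 4-vs-8 question: assuming a 16384-entry vec4 GPR file per SIMD, i.e. the 256 KB register file described in the Evergreen/Cayman documentation, the register-limited bound works out as below. The register-file size is an assumption about the target device, not something reported by the tools.)

    #include <stdio.h>

    int main(void)
    {
        /* Assumed hardware constant: 16384 vec4 GPRs per SIMD
           (a 256 KB register file, per the Evergreen/Cayman docs). */
        const int gprs_per_simd   = 16384;
        const int wavefront_size  = 64;
        const int gprs_per_thread = 32;   /* from the kernel above */

        int gprs_per_wave = wavefront_size * gprs_per_thread;  /* 2048 */
        int max_waves     = gprs_per_simd / gprs_per_wave;     /* 8    */

        printf("register-limited wavefronts per SIMD: %d\n", max_waves);
        return 0;
    }

Under that assumption the answer would be 8 wavefronts (two 256-work-item workgroups), the larger of the two figures guessed above.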


As long as he has two wavefronts active, his code will not stall the machine on a per-ALU-cycle basis. The instruction latency is 8 cycles, and each wavefront executes 64 work-items over 4 cycles (16 work-items per cycle), so two wavefronts (2 * 4 = 8 issue cycles) cover the ALU latency. The problem comes not from the ALU latency, which in his case is pretty well packed, but at the clause boundary.

Hi Micah,

Thanks for your answer. Great to know that! So the dependencies are not killing the pipe here, and all my performance bottleneck is coming from memory, which is quite understandable. What I have is a strided memory-access pattern...

I did optimize for the cache and got around an 80% cache-hit rate in the profiler, but my ALU is still not busy, and I am not having a dependency issue either. How do I interpret that? Any clues?

Hi Lihan,

Thanks for the tip. I am running Linux, and sprofile does not show the occupancy; I think Windows has this feature. Anyway, thanks a lot for pointing me to it. It is going to be useful to me someday.

Thanks all of you,

Best Regards,

Sarnath


Hi sarnath, the occupancy calculator is supported under Linux. Use the -O switch when you collect a performance counter or API trace to generate the occupancy file.

e.g. sprofile -t -O -o ~/foo.atp ./MyApp MyAppArg1

You should be able to find a foo.occupancy file after profiling. It is in CSV format.

Thanks a lot, Lihan. I did not know that sprofile had so many options! I have been using it for a long time now... 🙂

Thanks,


You can use APP Profiler to find out the theoretical number of active waves per CU on pre-SI hardware. The calculation is based on GPR usage, LDS usage, and work-group size.

Note that the APP Profiler occupancy calculator doesn't work with Catalyst 12.1.
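(For illustration, a sketch of that min-of-three-limits calculation; every hardware constant here is an assumption for an Evergreen/Cayman-class device, and min3 is a hypothetical helper, not the APP Profiler's actual code.)

    #include <stdio.h>

    /* Hypothetical helper: the smallest of three limits. */
    static int min3(int a, int b, int c)
    {
        int m = a < b ? a : b;
        return m < c ? m : c;
    }

    int main(void)
    {
        /* Assumed device limits (illustrative only). */
        const int max_waves_hw    = 32;     /* cap on waves per CU  */
        const int gprs_per_simd   = 16384;  /* 256 KB vec4 GPR file */
        const int lds_bytes       = 32768;  /* 32 KB LDS per CU     */
        const int wavefront_size  = 64;

        /* Kernel parameters (from this thread; LDS use assumed zero). */
        const int gprs_per_thread = 32;
        const int lds_per_group   = 0;
        const int group_size      = 256;

        int waves_per_group = group_size / wavefront_size;  /* 4 */

        int gpr_limit = gprs_per_simd / (gprs_per_thread * wavefront_size);
        int lds_limit = lds_per_group
                      ? (lds_bytes / lds_per_group) * waves_per_group
                      : max_waves_hw;

        int waves = min3(max_waves_hw, gpr_limit, lds_limit);

        /* Waves arrive as whole work-groups, so round down. */
        waves -= waves % waves_per_group;

        printf("theoretical active wavefronts per CU: %d\n", waves);
        return 0;
    }

With the numbers from this thread it prints 8, matching the register-limited bound computed earlier.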
