realhet
Miniboss

7970 ISA Vector/Scalar instruction level parallelism

Hi,

I'm planning to do an ISA-level optimization on my existing kernel, and I'm wondering if anybody has some info on how to feed the S and V ALUs to achieve maximum utilization.

In the GCN .ppt slides I've seen that there is an instruction 'arbitrator' which feeds 4 vector ALUs and 1 scalar ALU. They also say that the S ALU works 4x faster than the V ALUs, which in total gives a 1:1 V:S ALU operation capacity.

For example: what if I interleave independent V_ and S_ instructions in a pattern like SVSVSVSV?

Will the instruction decoder in the SIMD engine be able to feed all four V ALUs every 4 cycles while also feeding the S ALU with one instruction every cycle (one for each of the 4 V ALUs)?

What if I use larger VOP3 instructions and/or 32-bit immediate constants, so the instruction decoder has to look more dwords ahead? Are there any specifications on the capabilities/limitations of the instruction decoder/arbitration unit?

Why am I doing this: I have a quite big kernel (25KB on 7970) which runs at 98% ALU utilization on the 6970, but unfortunately on the 7970 it runs out of VRegs (above 128 VGPRs I noticed a 'task scheduler' bottleneck because it can't put 2 kernels in the queues; when a task is done, the SIMD engines can't immediately start another task, which means a -30% performance loss in my case). My plan is to use more SGPRs instead of some of the VGPRs to get below the 128-VGPR 'limit'. I can move many calculations out to S registers, since they hold the same values for every 16 SIMD vector threads; I just have to learn how to schedule the SOPs and VOPs effectively to achieve maximum ALU utilization.
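To make the idea concrete, here's a minimal hypothetical sketch (made-up kernel, not my real code): anything that depends only on the group id or on kernel arguments is uniform across the whole wavefront and is a candidate for scalar registers, while per-work-item values have to stay in VGPRs.

__kernel void sgpr_friendly(__global const float *in, __global float *out, int stride)
{
    // Uniform for the whole work-group/wavefront -> candidate for SGPRs.
    int row_base = get_group_id(1) * stride;

    // Varies per work-item -> has to live in VGPRs.
    int col = get_global_id(0);

    out[row_base + col] = in[row_base + col] * 2.0f;
}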

In my worst expectations maybe there is no S/V parallelism at all, and in every 4 cycles either 1 VOP or 1 SOP can be executed. That way there would be no need for hardware dependency checks or special instruction reordering by the compiler.

I also noticed the s_buffer_load_dwordx16 thing, another reason to go down below CAL

Thank you for your answers!

Skysnake
Adept II

Yeah, it would be nice to know more about this.

Sorry for the off-topic question, but how big is your VALU busy value?

I have ~80% ALU busy on my 5870 but only ~24% on my 7970 -.-

Is there still a big benefit in using vector types? And if yes, what is the best size: 4, 8 or 16?


I can't tell you sprofile values, sorry (I'm not using VS or OpenCL, and launching sperform manually on my .exe just fails).

According to execution-time measurements, I can utilize the V ALU at 98-99%, but only in special circumstances.

On the 7xxx you can try these hints:

- Unroll a big kernel (16KB ISA size is OK; the max is 32KB on 7xxx and 48KB on 4xxx..6xxx).

- Optimize your algorithm so that it doesn't use more than 64 VRegs (on the 4xxx..6xxx hardware 128 registers will do fine).

I guess that poor 24% ALU utilization can come from using too many VRegs on the 7xxx.

"Is there still a big benefit in using vector types?"

- On 4xxx..6xxx you only need to duplicate long dependency chains so that CalCompile can feed the 4 or 5 VLIW lanes from them. If there is theoretically a way to utilize all VLIW slots, CalCompile will find it.

- On 7xxx there is no ALU benefit from vectorizing code.

- But of course you still need to vectorize for effective memory access.
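For that last point, a minimal hypothetical example (made-up kernel, just to show the access pattern): a float4 load moves 16 contiguous bytes per work-item in a single fetch, which is what the memory system likes.

__kernel void copy_vec4(__global const float4 *in, __global float4 *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid];   // one 128-bit load and one 128-bit store per work-item
}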


realhet wrote:

According to execution-time measurements, I can utilize the V ALU at 98-99%, but only in special circumstances.

On the 7xxx you can try these hints:

- Unroll a big kernel (16KB ISA size is OK; the max is 32KB on 7xxx and 48KB on 4xxx..6xxx).

- Optimize your algorithm so that it doesn't use more than 64 VRegs (on the 4xxx..6xxx hardware 128 registers will do fine).

My kernel is quite small, just under 1 kB, and I use only 10 VRegs and 25 SRegs, with ~40 VALU and ~12 SALU instructions.

realhet wrote:

"Is there still a big benefit in using vector types?"

- on 4xxx..6xxx you only need to duplicate long dependency chains in order to let CalCompile feed the 4 or 5 VLIW lanes from them. If theoretically there is a way to utilize all vliw slots, then calCompile will find it

- on 7xxx there is no ALU benefit from vectorizing code.

- but of course still need to vectorize for effective memory access.

Exactly how I thought it should be...

My program is a heat-blade simulation, so access patterns are always along rows, one element next to the other.

The funny part is, when I use LDS my runtime increases. My cache hit rate goes from ~35% up to ~90%, but the runtime is 45% higher.

Also, when I change the order of accesses to an inverted pattern (because I use a 2D array in LDS), nothing changes.

Really strange... I hope AMD releases the optimization guide for GCN soon.


Since the scalar unit works on data to be shared across the whole wavefront (such as branches, constants, fetch constants), you can't really compare "parallelism" between the vector and scalar units as they work on completely different things.  Just note that a wavefront can typically issue one VALU and one SALU instruction every 4 clocks.
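As a rough sketch of the bookkeeping behind that statement (illustrative only, assuming the usual GCN CU layout of 4 SIMDs of 16 lanes plus one scalar unit):

static double valu_salu_issue_ratio(void)
{
    const double simds_per_cu    = 4.0;  // each SIMD runs a different wavefront
    const double clocks_per_inst = 4.0;  // a 64-wide wave issues over 4 clocks on a 16-lane SIMD

    // Every SIMD issues 1 VALU instruction per 4 clocks:
    double valu_issue_per_clk = simds_per_cu / clocks_per_inst;  // = 1.0 per CU
    // The single scalar unit accepts up to 1 SALU instruction per clock,
    // taken from the 4 SIMDs' waves in rotation:
    double salu_issue_per_clk = 1.0;

    return valu_issue_per_clk / salu_issue_per_clk;  // roughly 1:1 issue capacity per CU
}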


"I have ~80% ALU busy on my 5870 but only ~24% on my 7970 -.-"

And what if you compare the computing performance of your algorithm to the raw teraflops value of your two video cards? Does it show the same difference as the 80% versus 24% busy values?

I mean, maybe we don't understand how the S/VALU busy measurement works compared to the 5xxx ALU busy counter, so what about actual vs. nominal performance on 5xxx and 7xxx?


OK, it took some days, but here are the results:

HD5870: vanilla kernel

__kernel void heatblade_kernel1( __global float *in, __global float *out, float my){
    int gid0=get_global_id(0);
    int gid1=get_global_id(1);
    int size_x=get_global_size(0);//width
    //int size_y=get_global_size(1);//height
    int offset_X  = gid0+size_x+3;
    int offset_X2 = size_x+2;
    int offset_Y  = gid1*offset_X2;
    float myn=0.1;
    //we need size_x*size_y global work-items
    out[offset_X+offset_Y]=in[offset_X+offset_Y]+myn*(
                    in[offset_X+1+offset_Y]+
                    in[offset_X-1+offset_Y]+
                    in[offset_X+(gid1+1)*offset_X2]+
                    in[offset_X+(gid1-1)*offset_X2]-
                    4*in[offset_X+offset_Y]
                    );
}


Columns: Method, ExecutionOrder, ThreadID, CallIndex, GlobalWorkSize, WorkGroupSize, Time, LocalMemSize, VGPRs, SGPRs, FCStacks, ScratchRegs, Wavefronts, LDSFetchInsts, LDSWriteInsts, FetchSize, CacheHit, WriteUnitStalled, LDSBankConflict, ALUInsts, FetchInsts, WriteInsts, ALUBusy, ALUFetchRatio, ALUPacking, FetchUnitBusy, FetchUnitStalled, FastPath, CompletePath, PathUtilization

heatblade_kernel1__k1_Cypress1  1  5952  67  {1920 1080 1}  {64 4 1}  0,17789  0  5  NA  0  0  32400,00  0,00  0,00  8123,50  46,55  0,00  0,00  15,00  5,00  1,00  65,25  3,00  56,00  88,65  1,77  8058,75  0,00  100,00

heatblade_kernel1__k1_Cypress1  2  5952  70  {1920 1080 1}  {64 4 1}  0,17456  0  5  NA  0  0  32400,00  0,00  0,00  8123,50  46,75  0,02  0,00  15,00  5,00  1,00  65,33  3,00  56,00  88,74  1,72  8061,75  0,00  100,00

kernel with LDS:

__kernel void heatblade_kernel3( __global float *in, __global float *out, float my){
    int gid0=get_global_id(0);
    int gid1=get_global_id(1);
    int lid0=get_local_id(0);
    int lid1=get_local_id(1);
    int size_x=get_global_size(0);//width
    //int size_y=get_global_size(1);//height
    int offset_X  = gid0+size_x+3;
    int offset_X2 = size_x+2;
    int offset_Y  = gid1*offset_X2;
    __local float localBuffer[4+2][64+2];
    float result;
    //TODO
    float myn=0.1;
    //the bulk of the tile: in every row the last 2 elements are left out, and the last two rows are left out completely
    localBuffer[lid1][lid0]=in[gid1*offset_X2+gid0];
    //copy the last 2 rows
    if(lid1==0 || lid1==1){
        localBuffer[get_local_size(1)+lid1][lid0]=in[(gid1+get_local_size(1))*offset_X2+gid0];
    }
    //now copy the last 2 elements of every row (not in the last 2 rows)
    if(lid0==0 || lid0==1){
        localBuffer[lid1][get_local_size(0)+lid0]=in[gid1*offset_X2+gid0+get_local_size(0)];
    }
    //copy the last 4 elements (the corner)
    if((lid0==0 || lid0==1) && (lid1==0 || lid1==1)){
        localBuffer[get_local_size(1)+lid1][get_local_size(0)+lid0]=in[(gid1+get_local_size(1))*offset_X2+gid0+get_local_size(0)];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    //calculate the new value
    result=localBuffer[lid1+1][lid0+1]+myn*(
                    localBuffer[lid1+1][lid0+2]+
                    localBuffer[lid1+1][lid0]+
                    localBuffer[lid1+2][lid0+1]+
                    localBuffer[lid1][lid0+1]-
                    4*localBuffer[lid1+1][lid0+1]
                    );
    //copy the result to the output
    out[offset_X+offset_Y]=result;
}


Columns: Method, ExecutionOrder, ThreadID, CallIndex, GlobalWorkSize, WorkGroupSize, Time, LocalMemSize, VGPRs, SGPRs, FCStacks, ScratchRegs, Wavefronts, LDSFetchInsts, LDSWriteInsts, FetchSize, CacheHit, WriteUnitStalled, LDSBankConflict, ALUInsts, FetchInsts, WriteInsts, ALUBusy, ALUFetchRatio, ALUPacking, FetchUnitBusy, FetchUnitStalled, FastPath, CompletePath, PathUtilization

heatblade_kernel3__k1_Cypress1  1  4204  67  {1920 1080 1}  {64 4 1}  0,33744  1664  5  NA  0  1  32400,00  3,00  3,00  8123,50  47,72  0,00  0,00  42,00  3,00  1,00  95,64  14,00  33,10  17,08  0,00  8058,63  0,00  100,00

heatblade_kernel3__k1_Cypress1  2  4204  70  {1920 1080 1}  {64 4 1}  0,33478  1664  5  NA  0  1  32400,00  3,00  3,00  8123,50  47,74  0,00  0,00  42,00  3,00  1,00  95,60  14,00  33,10  17,07  0,03  8057,75  0,00  100,00

HD7970 vanilla Kernel:


Columns: Method, ExecutionOrder, ThreadID, CallIndex, GlobalWorkSize, WorkGroupSize, Time, LocalMemSize, VGPRs, SGPRs, ScratchRegs, FCStacks, Wavefronts, LDSFetchInsts, LDSWriteInsts, FetchSize, CacheHit, WriteUnitStalled, LDSBankConflict, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts, VWriteInsts, VALUUtilization, VALUBusy, MemUnitBusy, MemUnitStalled, WriteSize, GDSInsts

heatblade_kernel1__k1_Tahiti1  1  5828  67  {1920 1080 1}  {64 4 1}  0,09407  0  6  16  0  NA  32400,00  0,00  0,00  8124,13  35,88  9,51  0,00  27,00  7,00  5,00  7,00  1,00  100,00  22,36  68,77  4,95  7893,56  0,00

heatblade_kernel1__k1_Tahiti1  2  5828  70  {1920 1080 1}  {64 4 1}  0,08726  0  6  16  0  NA  32400,00  0,00  0,00  8124,06  35,88  10,33  0,00  27,00  7,00  5,00  7,00  1,00  100,00  22,43  68,88  5,31  7898,13  0,00

LDS kernel:


Columns: Method, ExecutionOrder, ThreadID, CallIndex, GlobalWorkSize, WorkGroupSize, Time, LocalMemSize, VGPRs, SGPRs, ScratchRegs, FCStacks, Wavefronts, LDSFetchInsts, LDSWriteInsts, FetchSize, CacheHit, WriteUnitStalled, LDSBankConflict, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts, VWriteInsts, VALUUtilization, VALUBusy, MemUnitBusy, MemUnitStalled, WriteSize, GDSInsts

heatblade_kernel3__k1_Tahiti1  1  4852  67  {1920 1080 1}  {64 4 1}  0,13585  1664  10  24  0  NA  32400,00  0,00  6,00  8124,31  35,88  7,90  0,00  41,50  17,00  3,00  7,00  1,00  73,74  23,02  66,15  1,53  7874,56  0,00

heatblade_kernel3__k1_Tahiti1  2  4852  70  {1920 1080 1}  {64 4 1}  0,12770  1664  10  24  0  NA  32400,00  0,00  6,00  8124,25  35,88  7,31  0,00  41,50  17,00  3,00  7,00  1,00  73,74  22,76  66,50  1,39  7879,59  0,00

Yes, I know I could increase the amount of work per thread, but I wanted to see whether it is possible to utilize Tahiti without vector data types or more work items per thread, because that would make it much easier to get the GPU above 50% utilization.

And yes, I know that I could reuse the data in LDS more often if I calculated a complete row per workgroup.


Oh, that's memory-intensive code. The VALU doesn't have much work to do here except wait for memory operations.

I don't see the point of passing the data from RAM through LDS.

Cypress has 128 GB/s and Tahiti has 264 GB/s memory bandwidth, so a roughly 2x speedup would be about right. The actual speedup is 0,17478 / 0,08726 = 2,0029796, which is good, I think.
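The same back-of-the-envelope check as code (illustrative only; the bandwidth figures are the ones quoted in this thread, not spec-sheet values):

static double lds_kernel_speedup_check(void)
{
    // Figures as quoted above: bandwidths from this thread, times from the tables.
    const double bw_cypress_gbs = 128.0;
    const double bw_tahiti_gbs  = 264.0;   // ~2.06x the quoted Cypress bandwidth
    const double t_cypress      = 0.17478;
    const double t_tahiti       = 0.08726;

    double measured_speedup = t_cypress / t_tahiti;  // ~2.0x, in line with the bandwidth ratio
    return measured_speedup * bw_cypress_gbs / bw_tahiti_gbs;  // ~0.97 of the bandwidth-predicted gain
}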

realhet
Miniboss

But please let's get back to the topic.

So far I've found out that the V and S instructions can be interleaved in a 1:1 ratio and they will execute simultaneously.

For example, repeating this over and over:

  s_xor_b64 s0, s1, s2

  v_xor_b32 v0, v1, v2

eats only 4 cycles, not 8.

Another one:

  s_mul_i32 s0, s1, s2

  v_mad_u32_u24 v0, v1, v2, v3

also runs in parallel (in this case there is an instruction decoder bottleneck because of the two 64-bit instruction encodings, so the S:V ratio can be at most 3:4).

With this I was able to get an additional 88 Gops of mul32 performance out of the 7970.

This raises a question: is there a hardware mechanism that ensures S and V instructions are not executed in parallel when they depend on each other? Or is it the compiler's responsibility never to send register-dependent S/V code to the GPU, like in the example below?

  s_xor_b64 s0(!!!!), s1, s2

  v_xor_b32 v0, s0(!!!!), v2   ;does the instruction decoder delay the V instruction until the S instruction has produced its result, or does it execute both, using the previous value of s0 in the V instruction?

(My opinion is that there is no extra hardware for this (so more transistors can do math), and the compiler must take care of these situations, but I'm not sure.)

If anyone has accurate info on this, please tell me!


The compiler will handle the dependencies by either scheduling work between the dependent operations or inserting a wait command if needed.


My understanding is that the hardware does not schedule an S and a V instruction simultaneously from the same wavefront, only from two separate waves. So there cannot be any dependency between the two, and the latency of that code will be as it appears, but the throughput would be doubled, as you suggest.


That must be it!

Your answer also explains why I measured such slow times when testing a big kernel filled with SVSVSVSV instructions that used more than 128 vector registers: there was only one wave, and no S+V parallel execution at all.

Thank you all for the answers!


Even with more than 128 vector registers, you can still schedule one wavefront per SIMD, i.e. four per CU.


Let's say 256 VRegs/wave -> 4 bytes * 256 regs * 64 (wave size) * 4 SIMDs/CU = 256KB, which is exactly the amount of vector register RAM in a CU. Exactly like in http://developer.amd.com/afds/assets/presentations/2620_final.pdf
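The same budget check spelled out as code (per-CU figures as in the slides linked above):

static int vgpr_file_bytes_per_cu(void)
{
    const int bytes_per_vgpr   = 4;    // one 32-bit register per lane
    const int vgprs_per_thread = 256;  // the maximum a kernel can allocate
    const int wave_size        = 64;   // work-items per wavefront
    const int simds_per_cu     = 4;

    // 4 * 256 * 64 * 4 = 262144 bytes = 256KB: at 256 VRegs a single wave per
    // SIMD already consumes the whole vector register file.
    return bytes_per_vgpr * vgprs_per_thread * wave_size * simds_per_cu;
}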

But I'm pretty sure that it's not good to use more than 128, and the best performance can be achieved when not using more than 64 VRegs. I just don't know why that is. I'll do some simple tests tomorrow to pin down this behaviour.


If you only have 4 wavefronts per CU, then that will limit how much latency you can hide.  Of course, if your kernel is completely ALU-bound, then that won't matter.  Just note that latency comes in different flavors based on the type of operation.  For example, global memory fetches have a different class of latency than local memory (LDS) fetches.

Also, don't forget about instruction cache latency.  If you have a long, unrolled kernel and only 4 waves per CU, then all the waves will feel the pain of filling the instruction cache.  It's better to use loops so that you get instruction cache hits.
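A hypothetical sketch of what I mean (made-up kernel, illustrative only): keep the hot work in a compact loop instead of fully unrolling it, so every wave re-executes instructions that are already in the instruction cache.

__kernel void looped_body(__global float *data, int iters)
{
    int gid = get_global_id(0);
    float acc = data[gid];

    // A small loop body stays resident in the instruction cache...
    for (int i = 0; i < iters; ++i)
        acc = mad(acc, 1.0001f, 0.5f);

    // ...while fully unrolling a large iteration count inflates the ISA size
    // and makes every wave pay for refilling the cache.
    data[gid] = acc;
}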


Hello again,

I'm only testing raw computing power, no LDS/GDS/UAV at all, and my tests fit well into all of the caches (instruction and data).

I've run some tests.

The test kernel was this in ISA:

  s_mov_b32 s10,24
  label_1:
     ;128 times repeated (unrolled) [inner_code]
  s_add_u32 s10,s10,-1
  s_cbranch_scc1 label_1   ;break on borrow, so it's a 25x loop
  s_endpgm

inner_code was basically 4 V instructions interleaved with 0..3 S instructions.

I tried a mixture of 32-bit and 64-bit S/V instructions. A capital V or S means a 64-bit instruction and lowercase letters mean 32-bit.

For example, VSVsvv stands for:

  v_mad_i32_i24 v0,v0,v1,v2     ;big 3-operand V instr
  s_mul_i32 s0,12345,s1         ;big 2-operand S instr with a 32-bit immediate
  v_mad_i32_i24 v0,v0,v1,v2
  s_mul_i32 s0,s0,s1            ;small 2-operand S instr
  v_xor_b32 v0,v0,v1            ;small 2-operand V instr
  v_xor_b32 v0,v0,v1

The kernel was launched with 40 000 000 work-items (64 work-items per workgroup) and the second time measurement was taken.

Here are the diagrams:

http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_4-12dwords.png

http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_8-16dwords.png

Also the raw data in Excel:

http://x.pgy.hu/~worm/het/7970_isa_test/7970_isa_speedtest.xls

My conclusions:

You simply CANNOT interleave 4 S instructions with 4 V instructions without slowing down the V performance by a minimum of 1-2%.

Also, sending more code dwords to the S ALU causes a slowdown at higher numVGPRs kernel settings:

The numVGPRs range can be partitioned like this:

16..64 numVGPRs -> excellent V and S parallelism (in my thoughts this is when only 4 waves are in the CU)

65..84 numVGPRs -> this starts to struggle when it gets many S instruction dwords (I think 3 waves can sit in a CU)

85..128 numVGPRs -> 2 small S for 4 V is OK, but it starts to get slow (2 waves)

129..255 numVGPRs -> try to avoid ALL S instructions if possible, or else there will be terrible stalls (1 wave)

If there is a 4-stage pipeline thing in there, then of course multiply my waves-in-the-CU estimates by 4.

But I think all this S-instruction 'sensitivity' comes from the complicated instruction decoding/scheduling mechanism. I'd love to know more about it, if it's not a secret.

I don't even want to think about what applies to the v_add instruction when it reads from and writes to both S and V registers.

Anyway, it's really a fun puzzle figuring these things out. (I still have so many questions to ask, lol, I just don't wanna flood the forums.)


Could you please normalize the work done against the runtime?


Yeah, take a fast exec time like 0.27318 (VVVV with no S in it).

Then ops/sec -> 40000000{workitems} * 25{loops} * 128{unrolls} * 4{V instructions} / 0.27318{seconds} = 1.8742e12 ops/sec.

The nominal is 925e6{Hz} * 2048{streams} = 1.8944e12.

actual/nominal = 98.93%, which is pretty good for such a short test (0.27 sec warmup, then 0.27 sec measure).

So the shortest time bars, at 64 numVGPRs and moderate S utilization, represent 99% of the ideal performance.
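And the same arithmetic as code (figures from this post; 925 MHz and 2048 streams are the stock 7970 numbers), in case anyone wants to plug in their own timings:

static double measured_vs_nominal(void)
{
    const double workitems   = 40e6;
    const double loops       = 25.0;
    const double unrolls     = 128.0;
    const double v_per_block = 4.0;      // V instructions per unrolled block
    const double seconds     = 0.27318;  // measured kernel time

    double achieved = workitems * loops * unrolls * v_per_block / seconds; // ~1.8742e12 ops/s
    double nominal  = 925e6 * 2048;                                        // ~1.8944e12 ops/s
    return achieved / nominal;  // ~0.9893 -> ~99% of peak
}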
