realhet
Miniboss

7970 ISA Vector/Scalar instruction level parallelism

Hi,

I'm planning to do an ISA-level optimization on my existing kernel, and I'm wondering if anybody has some info on how to feed the S and V ALUs to achieve maximum utilization.

In the GCN .ppt slides I've seen that there is an instruction 'arbitrator' which feeds 4 vector ALUs and 1 scalar ALU. They also say that the S ALU works 4x faster than the V ALUs, which in total gives a 1:1 V:S ALU operation capacity.

For example: what if I interleave independent V_ and S_ instructions in a pattern like SVSVSVSV?

Will the instruction decoder in the SIMD engine be able to feed all four V ALUs every 4 cycles while also feeding the S ALU with one instruction every cycle (one for each of the 4 V ALUs)?

What if I use larger VOP3 instructions and/or 32-bit immediate constants, so the instruction decoder has to look more dwords ahead? Are there any specifications on the capabilities/limitations of the instruction decoder/arbitration unit?

Why am I doing this: I have a quite big kernel (25KB on 7970) which runs at 98% ALU utilization on the 6970, but unfortunately on the 7970 it runs out of VRegs (above 128 VGPRs I noticed a 'task scheduler' bottleneck because it can't put 2 kernels in the queues; when a task is done, the SIMD engines can't immediately start another task, which means a -30% performance loss in my case). My plan is to use more SGPRs instead of some of the VGPRs to get below the 128-VGPR 'limit'. I can move many calculations out to S registers, since they hold the same values for every 16 SIMD vector threads; I just have to learn how to schedule the SOPs and VOPs effectively to achieve maximum ALU utilization.
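To make the idea concrete, here's a minimal hypothetical sketch (made-up kernel, not my real code): anything that depends only on the group id or on kernel arguments is uniform across the whole wavefront and is a candidate for scalar registers, while per-work-item values have to stay in VGPRs.

__kernel void sgpr_friendly(__global const float *in, __global float *out, int stride)
{
    // Uniform for the whole work-group/wavefront -> candidate for SGPRs.
    int row_base = get_group_id(1) * stride;

    // Varies per work-item -> has to live in VGPRs.
    int col = get_global_id(0);

    out[row_base + col] = in[row_base + col] * 2.0f;
}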

In my worst expectations maybe there is no S/V parallelism at all, and in every 4 cycles either 1 VOP or 1 SOP can be executed. That way there would be no need for hardware dependency checks or special instruction reordering by the compiler.

I also noticed the s_buffer_load_dwordx16 thing, another reason to go down below CAL

Thank you for your answers!

Skysnake
Adept II

Yeah, it would be nice to know more about this.

Sorry for the off-topic question, but how big is your VALU busy value?

I have ~80% ALU busy on my 5870 but only ~24% on my 7970 -.-

Is there still a big benefit in using vector types? And if yes, what is the best size: 4, 8 or 16?


I can't tell you sprofile values, sorry (I'm not using VS or OpenCL, and launching sperform manually on my .exe just fails).

According to execution-time measurements, I can utilize the V ALU at 98-99%, but only in special circumstances.

On the 7xxx you can try these hints:

- Unroll a big kernel (16KB ISA size is OK; the max is 32KB on 7xxx and 48KB on 4xxx..6xxx).

- Optimize your algorithm so that it doesn't use more than 64 VRegs (on the 4xxx..6xxx hardware 128 registers will do fine).

I guess that poor 24% ALU utilization can come from using too many VRegs on the 7xxx.

"Is there still a big benefit in using vector types?"

- On 4xxx..6xxx you only need to duplicate long dependency chains so that CalCompile can feed the 4 or 5 VLIW lanes from them. If there is theoretically a way to utilize all VLIW slots, CalCompile will find it.

- On 7xxx there is no ALU benefit from vectorizing code.

- But of course you still need to vectorize for effective memory access.
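For that last point, a minimal hypothetical example (made-up kernel, just to show the access pattern): a float4 load moves 16 contiguous bytes per work-item in a single fetch, which is what the memory system likes.

__kernel void copy_vec4(__global const float4 *in, __global float4 *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid];   // one 128-bit load and one 128-bit store per work-item
}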


realhet wrote:

According to execution-time measurements, I can utilize the V ALU at 98-99%, but only in special circumstances.

On the 7xxx you can try these hints:

- Unroll a big kernel (16KB ISA size is OK; the max is 32KB on 7xxx and 48KB on 4xxx..6xxx).

- Optimize your algorithm so that it doesn't use more than 64 VRegs (on the 4xxx..6xxx hardware 128 registers will do fine).

My kernel is quite small, just under 1 kB, and I use only 10 VRegs and 25 SRegs, with ~40 VALU and ~12 SALU instructions.

realhet wrote:

"Is there still a big benefit in using vector types?"

- on 4xxx..6xxx you only need to duplicate long dependency chains in order to let CalCompile feed the 4 or 5 VLIW lanes from them. If theoretically there is a way to utilize all vliw slots, then calCompile will find it

- on 7xxx there is no ALU benefit from vectorizing code.

- but of course still need to vectorize for effective memory access.

Exactly how I thought it should be...

My program is a heat-blade simulation, so access patterns are always along rows, one element next to the other.

The funny part is, when I use LDS my runtime increases. My cache hit rate goes from ~35% up to ~90%, but the runtime is 45% higher.

Also, when I change the order of accesses to an inverted pattern (because I use a 2D array in LDS), nothing changes.

Really strange... I hope AMD releases the optimization guide for GCN soon.


Since the scalar unit works on data to be shared across the whole wavefront (such as branches, constants, fetch constants), you can't really compare "parallelism" between the vector and scalar units as they work on completely different things.  Just note that a wavefront can typically issue one VALU and one SALU instruction every 4 clocks.
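As a rough sketch of the bookkeeping behind that statement (illustrative only, assuming the usual GCN CU layout of 4 SIMDs of 16 lanes plus one scalar unit):

static double valu_salu_issue_ratio(void)
{
    const double simds_per_cu    = 4.0;  // each SIMD runs a different wavefront
    const double clocks_per_inst = 4.0;  // a 64-wide wave issues over 4 clocks on a 16-lane SIMD

    // Every SIMD issues 1 VALU instruction per 4 clocks:
    double valu_issue_per_clk = simds_per_cu / clocks_per_inst;  // = 1.0 per CU
    // The single scalar unit accepts up to 1 SALU instruction per clock,
    // taken from the 4 SIMDs' waves in rotation:
    double salu_issue_per_clk = 1.0;

    return valu_issue_per_clk / salu_issue_per_clk;  // roughly 1:1 issue capacity per CU
}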


"I have ~80% ALU busy on my 5870 but only ~24% on my 7970 -.-"

And what if you compare the computing performance of your algorithm to the raw teraflops value of your two video cards? Does it show the same difference as the 80% versus 24% busy values?

I mean, maybe we don't understand how the S/VALU busy measurement works compared to the 5xxx ALU busy counter, so what about actual vs. nominal performance on 5xxx and 7xxx?


OK, it took some days, but here are the results:

HD5870: vanilla kernel

__kernel void heatblade_kernel1( __global float *in, __global float *out, float my){
    int gid0=get_global_id(0);
    int gid1=get_global_id(1);
    int size_x=get_global_size(0);//width
    //int size_y=get_global_size(1);//height
    int offset_X  = gid0+size_x+3;
    int offset_X2 = size_x+2;
    int offset_Y  = gid1*offset_X2;
    float myn=0.1;
    //we need size_x*size_y global work-items
    out[offset_X+offset_Y]=in[offset_X+offset_Y]+myn*(
                    in[offset_X+1+offset_Y]+
                    in[offset_X-1+offset_Y]+
                    in[offset_X+(gid1+1)*offset_X2]+
                    in[offset_X+(gid1-1)*offset_X2]-
                    4*in[offset_X+offset_Y]
                    );
}


Columns: Method, ExecutionOrder, ThreadID, CallIndex, GlobalWorkSize, WorkGroupSize, Time, LocalMemSize, VGPRs, SGPRs, FCStacks, ScratchRegs, Wavefronts, LDSFetchInsts, LDSWriteInsts, FetchSize, CacheHit, WriteUnitStalled, LDSBankConflict, ALUInsts, FetchInsts, WriteInsts, ALUBusy, ALUFetchRatio, ALUPacking, FetchUnitBusy, FetchUnitStalled, FastPath, CompletePath, PathUtilization

heatblade_kernel1__k1_Cypress1  1  5952  67  {1920 1080 1}  {64 4 1}  0,17789  0  5  NA  0  0  32400,00  0,00  0,00  8123,50  46,55  0,00  0,00  15,00  5,00  1,00  65,25  3,00  56,00  88,65  1,77  8058,75  0,00  100,00

heatblade_kernel1__k1_Cypress1  2  5952  70  {1920 1080 1}  {64 4 1}  0,17456  0  5  NA  0  0  32400,00  0,00  0,00  8123,50  46,75  0,02  0,00  15,00  5,00  1,00  65,33  3,00  56,00  88,74  1,72  8061,75  0,00  100,00

kernel with LDS:

__kernel void heatblade_kernel3( __global float *in, __global float *out, float my){
    int gid0=get_global_id(0);
    int gid1=get_global_id(1);
    int lid0=get_local_id(0);
    int lid1=get_local_id(1);
    int size_x=get_global_size(0);//width
    //int size_y=get_global_size(1);//height
    int offset_X  = gid0+size_x+3;
    int offset_X2 = size_x+2;
    int offset_Y  = gid1*offset_X2;
    __local float localBuffer[4+2][64+2];
    float result;
    //TODO
    float myn=0.1;
    //the bulk of the tile: in every row the last 2 elements are left out, and the last two rows are left out completely
    localBuffer[lid1][lid0]=in[gid1*offset_X2+gid0];
    //copy the last 2 rows
    if(lid1==0 || lid1==1){
        localBuffer[get_local_size(1)+lid1][lid0]=in[(gid1+get_local_size(1))*offset_X2+gid0];
    }
    //now copy the last 2 elements of every row (not in the last 2 rows)
    if(lid0==0 || lid0==1){
        localBuffer[lid1][get_local_size(0)+lid0]=in[gid1*offset_X2+gid0+get_local_size(0)];
    }
    //copy the last 4 elements (the corner)
    if((lid0==0 || lid0==1) && (lid1==0 || lid1==1)){
        localBuffer[get_local_size(1)+lid1][get_local_size(0)+lid0]=in[(gid1+get_local_size(1))*offset_X2+gid0+get_local_size(0)];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    //calculate the new value
    result=localBuffer[lid1+1][lid0+1]+myn*(
                    localBuffer[lid1+1][lid0+2]+
                    localBuffer[lid1+1][lid0]+
                    localBuffer[lid1+2][lid0+1]+
                    localBuffer[lid1][lid0+1]-
                    4*localBuffer[lid1+1][lid0+1]
                    );
    //copy the result to the output
    out[offset_X+offset_Y]=result;
}


Columns: Method, ExecutionOrder, ThreadID, CallIndex, GlobalWorkSize, WorkGroupSize, Time, LocalMemSize, VGPRs, SGPRs, FCStacks, ScratchRegs, Wavefronts, LDSFetchInsts, LDSWriteInsts, FetchSize, CacheHit, WriteUnitStalled, LDSBankConflict, ALUInsts, FetchInsts, WriteInsts, ALUBusy, ALUFetchRatio, ALUPacking, FetchUnitBusy, FetchUnitStalled, FastPath, CompletePath, PathUtilization

heatblade_kernel3__k1_Cypress1  1  4204  67  {1920 1080 1}  {64 4 1}  0,33744  1664  5  NA  0  1  32400,00  3,00  3,00  8123,50  47,72  0,00  0,00  42,00  3,00  1,00  95,64  14,00  33,10  17,08  0,00  8058,63  0,00  100,00

heatblade_kernel3__k1_Cypress1  2  4204  70  {1920 1080 1}  {64 4 1}  0,33478  1664  5  NA  0  1  32400,00  3,00  3,00  8123,50  47,74  0,00  0,00  42,00  3,00  1,00  95,60  14,00  33,10  17,07  0,03  8057,75  0,00  100,00

HD7970 vanilla Kernel:


Columns: Method, ExecutionOrder, ThreadID, CallIndex, GlobalWorkSize, WorkGroupSize, Time, LocalMemSize, VGPRs, SGPRs, ScratchRegs, FCStacks, Wavefronts, LDSFetchInsts, LDSWriteInsts, FetchSize, CacheHit, WriteUnitStalled, LDSBankConflict, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts, VWriteInsts, VALUUtilization, VALUBusy, MemUnitBusy, MemUnitStalled, WriteSize, GDSInsts

heatblade_kernel1__k1_Tahiti1  1  5828  67  {1920 1080 1}  {64 4 1}  0,09407  0  6  16  0  NA  32400,00  0,00  0,00  8124,13  35,88  9,51  0,00  27,00  7,00  5,00  7,00  1,00  100,00  22,36  68,77  4,95  7893,56  0,00

heatblade_kernel1__k1_Tahiti1  2  5828  70  {1920 1080 1}  {64 4 1}  0,08726  0  6  16  0  NA  32400,00  0,00  0,00  8124,06  35,88  10,33  0,00  27,00  7,00  5,00  7,00  1,00  100,00  22,43  68,88  5,31  7898,13  0,00

LDS kernel:


Columns: Method, ExecutionOrder, ThreadID, CallIndex, GlobalWorkSize, WorkGroupSize, Time, LocalMemSize, VGPRs, SGPRs, ScratchRegs, FCStacks, Wavefronts, LDSFetchInsts, LDSWriteInsts, FetchSize, CacheHit, WriteUnitStalled, LDSBankConflict, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts, VWriteInsts, VALUUtilization, VALUBusy, MemUnitBusy, MemUnitStalled, WriteSize, GDSInsts

heatblade_kernel3__k1_Tahiti1  1  4852  67  {1920 1080 1}  {64 4 1}  0,13585  1664  10  24  0  NA  32400,00  0,00  6,00  8124,31  35,88  7,90  0,00  41,50  17,00  3,00  7,00  1,00  73,74  23,02  66,15  1,53  7874,56  0,00

heatblade_kernel3__k1_Tahiti1  2  4852  70  {1920 1080 1}  {64 4 1}  0,12770  1664  10  24  0  NA  32400,00  0,00  6,00  8124,25  35,88  7,31  0,00  41,50  17,00  3,00  7,00  1,00  73,74  22,76  66,50  1,39  7879,59  0,00

Yes, I know I could increase the amount of work per thread, but I wanted to see whether it is possible to utilize Tahiti without vector data types or more work items per thread, because that would make it much easier to get the GPU above 50% utilization.

And yes, I know that I could reuse the data in LDS more often if I calculated a complete row per workgroup.


Oh, that's memory-intensive code. The VALU doesn't have much work to do here except wait for memory operations.

I don't see the point of passing the data from RAM through LDS.

Cypress has 128 GB/s and Tahiti has 264 GB/s memory bandwidth, so a roughly 2x speedup would be about right. The actual speedup is 0,17478 / 0,08726 = 2,0029796, which is good, I think.
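The same back-of-the-envelope check as code (illustrative only; the bandwidth figures are the ones quoted in this thread, not spec-sheet values):

static double lds_kernel_speedup_check(void)
{
    // Figures as quoted above: bandwidths from this thread, times from the tables.
    const double bw_cypress_gbs = 128.0;
    const double bw_tahiti_gbs  = 264.0;   // ~2.06x the quoted Cypress bandwidth
    const double t_cypress      = 0.17478;
    const double t_tahiti       = 0.08726;

    double measured_speedup = t_cypress / t_tahiti;  // ~2.0x, in line with the bandwidth ratio
    return measured_speedup * bw_cypress_gbs / bw_tahiti_gbs;  // ~0.97 of the bandwidth-predicted gain
}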

realhet
Miniboss

But please let's get back to the topic.

So far I've found out that the V and S instructions can be interleaved in a 1:1 ratio and they will execute simultaneously.

For example, repeating this over and over:

  s_xor_b64 s0, s1, s2

  v_xor_b32 v0, v1, v2

eats only 4 cycles, not 8.

Another one:

  s_mul_i32 s0, s1, s2

  v_mad_u32_u24 v0, v1, v2, v3

also runs in parallel (in this case there is an instruction decoder bottleneck because of the two 64-bit instruction encodings, so the S:V ratio can be at most 3:4).

With this I was able to get an additional 88 Gops of mul32 performance out of the 7970.

This raises a question: is there a hardware mechanism that ensures S and V instructions are not executed in parallel when they depend on each other? Or is it the compiler's responsibility never to send register-dependent S/V code to the GPU, like in the example below?

  s_xor_b64 s0(!!!!), s1, s2

  v_xor_b32 v0, s0(!!!!), v2   ;does the instruction decoder delay the V instruction until the S instruction has produced its result, or does it execute both, using the previous value of s0 in the V instruction?

(My opinion is that there is no extra hardware for this (so more transistors can do math), and the compiler must take care of these situations, but I'm not sure.)

If anyone has accurate info on this, please tell me!


The compiler will handle the dependencies by either scheduling work between the dependent operations or inserting a wait command if needed.


My understanding is that the hardware does not schedule an S and a V instruction simultaneously from the same wavefront, only from two separate waves. So there cannot be any dependency between the two, and the latency of that code will be as it appears, but the throughput would be doubled, as you suggest.


That must be it!

Your answer also explains why I measured such slow times when testing a big kernel filled with SVSVSVSV instructions that used more than 128 vector registers: there was only one wave, and no S+V parallel execution at all.

Thank you all for the answers!


Even with more than 128 vector registers, you can still schedule one wavefront per SIMD, i.e. four per CU.


Let's say 256 VRegs/wave -> 4 bytes * 256 regs * 64 (wave size) * 4 SIMDs/CU = 256KB, which is exactly the amount of vector register RAM in a CU. Exactly like in http://developer.amd.com/afds/assets/presentations/2620_final.pdf
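The same budget check spelled out as code (per-CU figures as in the slides linked above):

static int vgpr_file_bytes_per_cu(void)
{
    const int bytes_per_vgpr   = 4;    // one 32-bit register per lane
    const int vgprs_per_thread = 256;  // the maximum a kernel can allocate
    const int wave_size        = 64;   // work-items per wavefront
    const int simds_per_cu     = 4;

    // 4 * 256 * 64 * 4 = 262144 bytes = 256KB: at 256 VRegs a single wave per
    // SIMD already consumes the whole vector register file.
    return bytes_per_vgpr * vgprs_per_thread * wave_size * simds_per_cu;
}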

But I'm pretty sure that it's not good to use more than 128, and the best performance can be achieved when not using more than 64 VRegs. I just don't know why that is. I'll do some simple tests tomorrow to pin down this behaviour.


If you only have 4 wavefronts per CU, then that will limit how much latency you can hide.  Of course, if your kernel is completely ALU-bound, then that won't matter.  Just note that latency comes in different flavors based on the type of operation.  For example, global memory fetches have a different class of latency than local memory (LDS) fetches.

Also, don't forget about instruction cache latency.  If you have a long, unrolled kernel and only 4 waves per CU, then all the waves will feel the pain of filling the instruction cache.  It's better to use loops so that you get instruction cache hits.
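A hypothetical sketch of what I mean (made-up kernel, illustrative only): keep the hot work in a compact loop instead of fully unrolling it, so every wave re-executes instructions that are already in the instruction cache.

__kernel void looped_body(__global float *data, int iters)
{
    int gid = get_global_id(0);
    float acc = data[gid];

    // A small loop body stays resident in the instruction cache...
    for (int i = 0; i < iters; ++i)
        acc = mad(acc, 1.0001f, 0.5f);

    // ...while fully unrolling a large iteration count inflates the ISA size
    // and makes every wave pay for refilling the cache.
    data[gid] = acc;
}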


Hello again,

I'm only testing raw computing power, no LDS/GDS/UAV at all, and my tests fit well into all of the caches (instruction and data).

I've run some tests.

The test kernel was this in ISA:

  s_mov_b32 s10,24
  label_1:
     ;128 times repeated (unrolled) [inner_code]
  s_add_u32 s10,s10,-1
  s_cbranch_scc1 label_1   ;break on borrow, so it's a 25x loop
  s_endpgm

inner_code was basically 4 V instructions interleaved with 0..3 S instructions.

I tried a mixture of 32-bit and 64-bit S/V instructions. A capital V or S means a 64-bit instruction and lowercase letters mean 32-bit.

For example, VSVsvv stands for:

  v_mad_i32_i24 v0,v0,v1,v2     ;big 3-operand V instr
  s_mul_i32 s0,12345,s1         ;big 2-operand S instr with a 32-bit immediate
  v_mad_i32_i24 v0,v0,v1,v2
  s_mul_i32 s0,s0,s1            ;small 2-operand S instr
  v_xor_b32 v0,v0,v1            ;small 2-operand V instr
  v_xor_b32 v0,v0,v1

The kernel was launched with 40 000 000 work-items (64 work-items per workgroup) and the second time measurement was taken.

Here are the diagrams:

http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_4-12dwords.png

http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_8-16dwords.png

Also the raw data in Excel:

http://x.pgy.hu/~worm/het/7970_isa_test/7970_isa_speedtest.xls

My conclusions:

You simply CANNOT interleave 4 S instructions with 4 V instructions without slowing down the V performance by a minimum of 1-2%.

Also, sending more code dwords to the S ALU causes a slowdown at higher numVGPRs kernel settings:

The numVGPRs range can be partitioned like this:

16..64 numVGPRs -> excellent V and S parallelism (in my thoughts this is when only 4 waves are in the CU)

65..84 numVGPRs -> this starts to struggle when it gets many S instruction dwords (I think 3 waves can sit in a CU)

85..128 numVGPRs -> 2 small S for 4 V is OK, but it starts to get slow (2 waves)

129..255 numVGPRs -> try to avoid ALL S instructions if possible, or else there will be terrible stalls (1 wave)

If there is a 4-stage pipeline thing in there, then of course multiply my waves-in-the-CU estimates by 4.

But I think all this S-instruction 'sensitivity' comes from the complicated instruction decoding/scheduling mechanism. I'd love to know more about it, if it's not a secret.

I don't even want to think about what applies to the v_add instruction when it reads from and writes to both S and V registers.

Anyway, it's really a fun puzzle figuring these things out. (I still have so many questions to ask, lol, I just don't wanna flood the forums.)


Could you please normalize the work done against the runtime?


Yeah, take a fast exec time like 0.27318 (VVVV with no S in it).

Then ops/sec -> 40000000{workitems} * 25{loops} * 128{unrolls} * 4{V instructions} / 0.27318{seconds} = 1.8742e12 ops/sec.

The nominal is 925e6{Hz} * 2048{streams} = 1.8944e12.

actual/nominal = 98.93%, which is pretty good for such a short test (0.27 sec warmup, then 0.27 sec measure).

So the shortest time bars, at 64 numVGPRs and moderate S utilization, represent 99% of the ideal performance.
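And the same arithmetic as code (figures from this post; 925 MHz and 2048 streams are the stock 7970 numbers), in case anyone wants to plug in their own timings:

static double measured_vs_nominal(void)
{
    const double workitems   = 40e6;
    const double loops       = 25.0;
    const double unrolls     = 128.0;
    const double v_per_block = 4.0;      // V instructions per unrolled block
    const double seconds     = 0.27318;  // measured kernel time

    double achieved = workitems * loops * unrolls * v_per_block / seconds; // ~1.8742e12 ops/s
    double nominal  = 925e6 * 2048;                                        // ~1.8944e12 ops/s
    return achieved / nominal;  // ~0.9893 -> ~99% of peak
}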
