Archives Discussions

FangQ · ‎06-30-2010

I have an OpenCL program that uses 54 registers per thread. It runs 3x slower on a 5870 compared with an NVIDIA GTX 470 using similar configurations.

I heard that the 5870 only allows ~30 registers per thread and the rest will be spilled to global memory. Is this true? Is there anything I can optimize?

ryta1203 · ‎06-30-2010

1. How many registers does the Nvidia version use?

2. YES, I'm sure there are a TON of optimizations you could make. For starters, are you vectorizing? How much control flow? Have you tried splitting the kernel? Etc, etc...

FangQ · ‎06-30-2010

I ran the same OpenCL code, so the number of registers is the same on both hardware platforms.

My OpenCL kernel is almost identical to the CUDA kernel, which you can browse at http://is.gd/d88jw

The main kernel is "mcx_main_loop()". Any suggestions, particularly concerning the differences of this kernel on the two hardware platforms?

ryta1203 · ‎06-30-2010

Originally posted by: FangQ I ran the same OpenCL code, so the number of registers is the same on both hardware platforms.

My OpenCL kernel is almost identical to the CUDA kernel, which you can browse at http://is.gd/d88jw
The main kernel is "mcx_main_loop()". Any suggestions, particularly concerning the differences of this kernel on the two hardware platforms?

1. Why? Register allocation is done by the compiler and you are using two different compilers so this "assumption" is false (though it may be that the two do, in fact, use the same number of registers).

2. You are not going to see good performance from AMD GPUs unless you vectorize your code (or you have no data dependency and no/little control flow).

FangQ · ‎06-30-2010

1. Why? Register allocation is done by the compiler and you are using two different compilers so this "assumption" is false (though it may be that the two do, in fact, use the same number of registers).

I only know how to get register counts for NVIDIA (nvcc --ptxas-options=-v). Can you tell me how to find this out for ATI Stream? (I work on Ubuntu Linux.) Thanks.

2. You are not going to see good performance from AMD GPUs unless you vectorize your code (or you have no data dependency and no/little control flow).

I did run the shader analyzer on this code earlier this year, and most (80%) of the instructions were nicely packed to use the 5 VLIW slots simultaneously. I hope things have not changed too much lately.

Also, do you think connecting the video card to a display will have any impact on speed? My NVIDIA card is dedicated (not used for display), but my ATI card is connected to dual monitors.

ryta1203 · ‎07-01-2010

The profiler or SKA (Stream KernelAnalyzer) can give you the number of GPRs used... or you can simply count the GPRs used in the ISA.

Yes, 80% is nice, but maybe if you vectorized your code you might get more... hey, 10% is 10%. Admittedly though, I haven't looked at your kernel(s).

I'm not positive (though I think you can find an answer if you search this or the ATI Stream forum), but I would imagine that driving a display would have some impact on performance. Again though, don't quote me on that.

Is there any way to split your kernel into multiple kernels? I can't be certain but this may provide some benefit if you are doing a lot of spilling.

Also, I'm not sure how good the compiler is at register allocation (it is a well-researched topic, so I can't imagine it would be bad), but would vectorizing your code help the GPR count? Again, I would think not, but it's possible.

hazeman · ‎07-01-2010

To get the ISA, set the environment variable GPU_DUMP_DEVICE_KERNEL=3.

At the end of the ISA there is info with the number of registers used.
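One way to apply this tip without changing the shell configuration is to set the variable only for the process that runs the OpenCL program. The following is a minimal Python sketch under stated assumptions: the helper names and the binary name `./my_opencl_app` are hypothetical, and where the dump ends up is not stated in this thread, so the sketch does not try to locate it.

```python
import os
import subprocess


def env_with_isa_dump(base=None):
    """Return a copy of an environment with GPU_DUMP_DEVICE_KERNEL=3 set.

    `base` defaults to the current process environment. The variable name
    and value come from the post above; everything else is illustrative.
    """
    env = dict(os.environ if base is None else base)
    env["GPU_DUMP_DEVICE_KERNEL"] = "3"
    return env


def run_with_isa_dump(command):
    """Run `command` (a list of arguments) with the dump variable set."""
    return subprocess.run(command, env=env_with_isa_dump(), check=False)


# Example call (hypothetical binary name, left commented out on purpose):
# run_with_isa_dump(["./my_opencl_app"])
```

Setting the variable per process keeps the dump from being produced for every other OpenCL program started from the same shell.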

And the IL->ISA compiler is sometimes really stupid (or, more accurately, badly written) and can use an excessive number of registers. A sure way to trigger this problem is to compute values inside a loop that do not depend on the loop index. The IL compiler will try to precompute them before the loop and pass those values through registers. This way the kernel can use N extra registers. From what I've seen there is no limit on N. This can totally kill kernel performance by spilling registers or by forcing you to limit the number of threads to 1 or 2 (which is slooow).
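The source pattern described here can be illustrated outside OpenCL. The Python sketch below is only an analogy, since Python has no registers: the first function computes loop-invariant values inside the loop, and the second shows the hoisted form a compiler may produce, where each precomputed value must stay live for the whole loop (on a GPU, that means one register each). The function and parameter names are hypothetical and are not taken from the kernel under discussion.

```python
import math


def accumulate_recompute(steps, mua, mus):
    """Loop-invariant values are computed inside the loop body."""
    total = 0.0
    for i in range(steps):
        # None of these three values depends on i.
        mut = mua + mus
        albedo = mus / mut
        decay = math.exp(-mut)
        total += albedo * decay * i
    return total


def accumulate_hoisted(steps, mua, mus):
    """Same computation after hoisting the invariants out of the loop.

    mut, albedo and decay are now live across every iteration; with N
    such values, a compiler needs N extra registers to hold them.
    """
    mut = mua + mus
    albedo = mus / mut
    decay = math.exp(-mut)
    total = 0.0
    for i in range(steps):
        total += albedo * decay * i
    return total
```

Both forms return the same result; the difference is how many values are held across iterations, which is the cost the post attributes to the IL compiler.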

FangQ · ‎07-01-2010

Thanks, looks like I need to get my Windows install up and running again in order to get this info. I will update if I come up with new questions.

hazeman · ‎06-30-2010

The number of available registers depends on the number of threads (wavefronts/warps).

2 threads (work group size 128) - 128 registers

4 threads (work group size 256) - 64 registers

8 threads (work group size 512) - 32 registers

Usually with 4 threads you can achieve full performance.
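The table works out to 256 registers divided by the number of wavefronts, with 64 work-items per wavefront. A minimal Python sketch of that arithmetic follows. The constants 256 and 64 are assumptions inferred from the table itself, not figures taken from AMD documentation, and the function names are hypothetical.

```python
# ASSUMPTION: both constants are inferred from the table above.
REGISTER_POOL = 256          # registers shared among resident wavefronts
WORK_ITEMS_PER_WAVEFRONT = 64


def registers_per_thread(wavefronts):
    """Per-thread register budget with `wavefronts` resident wavefronts."""
    if wavefronts < 1:
        raise ValueError("wavefronts must be >= 1")
    return REGISTER_POOL // wavefronts


def work_group_size(wavefronts):
    """Work-group size that corresponds to `wavefronts` wavefronts."""
    if wavefronts < 1:
        raise ValueError("wavefronts must be >= 1")
    return wavefronts * WORK_ITEMS_PER_WAVEFRONT


def max_wavefronts(registers_needed):
    """Largest wavefront count whose budget still covers the kernel."""
    if registers_needed < 1:
        raise ValueError("registers_needed must be >= 1")
    return REGISTER_POOL // registers_needed


if __name__ == "__main__":
    for w in (2, 4, 8):
        print(w, work_group_size(w), registers_per_thread(w))
    # The kernel in the original question reportedly uses 54 registers.
    print(max_wavefronts(54))
```

Under these assumptions, a kernel that really uses 54 registers fits within the 64-register budget at 4 wavefronts but not within the 32-register budget at 8. The actual count on the 5870 has to be read from the ISA, because the AMD compiler may allocate a different number than the NVIDIA one.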

FangQ · ‎06-30-2010

Originally posted by: hazeman The number of available registers depends on the number of threads (wavefronts/warps).

2 threads (work group size 128) - 128 registers

4 threads (work group size 256) - 64 registers

8 threads (work group size 512) - 32 registers

Usually with 4 threads you can achieve full performance.

I could not set a work-group size greater than 256 with Catalyst 10.6 (on either the 4890 OC or the 5870), and CLInfo gave me maximum thread dimensions of 256x256x256.

In your opinion, if I use a work-group size of 256, no spilling should happen, correct?

MicahVillmow · ‎06-30-2010

FangQ,
There is an environment variable GPU_MAX_WORKGROUP_SIZE that you can tweak to raise that limit. However, this is not supported and you use it at your own risk.

FangQ · ‎06-30-2010

Originally posted by: MicahVillmow FangQ, There is an environment variable GPU_MAX_WORKGROUP_SIZE that you can tweak to raise that limit. However, this is not supported and you use it at your own risk.

thank you for the tip, will certainly play with it.


maximum registers per thread?