

diapolo
Adept I

ALUBusy - an easy way to raise it? + Vec3 problems

I'm working on a Bitcoin-Mining kernel and took a look at the output of AMD APP Profiler. I saw that the value of ALUBusy is only at ~68% and I guess the goal should be a higher number.

What causes ALUBusy to get higher in general? Any hints?

Dia

24 Replies
gat3way
Journeyman III

ALUBusy - an easy way to raise it?

I happen to have written a bitcoin kernel too 🙂

Basically, it is an extremely ALU-bound kernel and you should not have any global or local memory reads at all (just a global memory write at the end). Thus, the easiest thing you could do is vectorize more. If you are using, say, uint2 (as most bitcoin kernels do), try uint4 or uint3. Note that uint3 was broken with SDK 2.4 and generated bad ISA; it is also an OpenCL 1.1 feature, so it is probably unsupported by some earlier SDKs.

By increasing the vector size, you provide more ALU operations that have no dependencies on each other, so they can be packed into the same VLIW bundle. Another thing is that some ALU ops (like bitalign and bfi_int) cannot run on the t unit, so larger vectors mean a better chance of having enough ALU operations to fill the x, y, z, w and t units and get closer to 100% ALUBusy.
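To make the "no dependencies" point concrete, here is a minimal sketch in plain C rather than OpenCL C. The `u2` struct is only a hypothetical stand-in for `uint2`, and the helper names are made up for illustration. The two lanes of `sigma1_u2` never read each other's results, which is the property that lets a VLIW compiler pack them side by side.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by n bits; valid for 0 < n < 32. */
static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32u - n));
}

/* SHA-256 big-sigma-1 for a single nonce (a scalar uint in OpenCL). */
static uint32_t sigma1(uint32_t e) {
    return rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25);
}

/* Hypothetical stand-in for OpenCL's uint2: two independent nonces. */
typedef struct { uint32_t s0, s1; } u2;

/* Same function over two lanes. The s0 and s1 computations share no
   intermediate values, so they can be scheduled in parallel. */
static u2 sigma1_u2(u2 e) {
    u2 r;
    r.s0 = rotr32(e.s0, 6) ^ rotr32(e.s0, 11) ^ rotr32(e.s0, 25);
    r.s1 = rotr32(e.s1, 6) ^ rotr32(e.s1, 11) ^ rotr32(e.s1, 25);
    return r;
}

/* Returns 1 if both lanes agree with the scalar version, else 0. */
static int lanes_match(uint32_t a, uint32_t b) {
    u2 in = { a, b };
    u2 out = sigma1_u2(in);
    return out.s0 == sigma1(a) && out.s1 == sigma1(b);
}

/* Small self-check that can be called from main(). */
static void check_sigma1(void) {
    assert(lanes_match(0x510e527fu, 0x9b05688cu));
}
```

In a real kernel you would simply declare the state as `uint2` or `uint4` and let the compiler generate the per-lane code; the struct above only spells out what that expands to.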

That said, bfi_int patching in general replaces a couple of instructions that can run on the t unit with a single one that cannot. To achieve better ALUPacking, you may need to reorder some stuff in your round function. I can give you no specific advice on that: just experiment, profile, set GPU_DUMP_DEVICE_KERNEL=3 and look at the ISA dumps until you find the sweet spot.
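For reference, these are the bit-level identities that bfi_int and bitalign tricks rely on, sketched in plain C. `bitselect_u32` and `bitalign_u32` are my own C models of OpenCL's `bitselect` and of `amd_bitalign` from the cl_amd_media_ops extension, written from their documented semantics; they are not the real intrinsics, and whether a given compiler maps them to single instructions is something to confirm in the ISA dump.

```c
#include <assert.h>
#include <stdint.h>

/* Reference SHA-256 Ch: AND, NOT, AND, XOR. */
static uint32_t ch_ref(uint32_t x, uint32_t y, uint32_t z) {
    return (x & y) ^ (~x & z);
}

/* C model of OpenCL bitselect(a, b, c): bit of b where c is 1, else bit of a. */
static uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t c) {
    return (a & ~c) | (b & c);
}

/* Ch written as one select: where x is 1 take y, otherwise take z. */
static uint32_t ch_select(uint32_t x, uint32_t y, uint32_t z) {
    return bitselect_u32(z, y, x);
}

/* Plain rotate right; valid for 0 < n < 32. */
static uint32_t rotr_plain(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32u - n));
}

/* C model of amd_bitalign(hi, lo, shift): shift the 64-bit value hi:lo right. */
static uint32_t bitalign_u32(uint32_t hi, uint32_t lo, uint32_t shift) {
    uint64_t v = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(v >> (shift & 31u));
}

/* Rotate right expressed as a single bitalign of x with itself. */
static uint32_t rotr_align(uint32_t x, uint32_t n) {
    return bitalign_u32(x, x, n);
}

/* Simple linear congruential generator, only used to produce test inputs. */
static uint32_t lcg(uint32_t *s) {
    *s = *s * 1664525u + 1013904223u;
    return *s;
}

/* Checks both identities on 1000 pseudo-random inputs. */
static void check_identities(void) {
    uint32_t s = 12345u;
    for (unsigned i = 0; i < 1000u; ++i) {
        uint32_t x = lcg(&s), y = lcg(&s), z = lcg(&s);
        unsigned n = 1u + i % 31u; /* rotation amounts 1..31 */
        assert(ch_ref(x, y, z) == ch_select(x, y, z));
        assert(rotr_plain(x, n) == rotr_align(x, n));
    }
}
```

The point is only that Ch collapses to one select and a rotate collapses to one align; where those land in the VLIW bundle is what you have to check per kernel.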

Also keep in mind that increasing the vector size beyond a certain threshold does not help, just the opposite. That's because it uses more GPRs and makes the kernel bigger. The number of GPRs used limits the number of wavefronts in flight and thus the device utilization. Larger kernels are slower because they don't fit in the GPU instruction caches, and I've heard that the OpenCL compiler gives up on some register allocation optimizations once kernels get too big (though that might not be true).
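A rough way to see the GPR effect is a back-of-the-envelope calculation like the one below. This is only a sketch: the register budget of 256 and the cap of 24 wavefronts used in the check are illustrative assumptions, not values taken from this thread, so look up the real figures for your device in the programming guide.

```c
#include <assert.h>

/* Wavefronts that can be resident on one SIMD when GPRs are the only limit.
   gprs_used:  GPRs per work-item, as reported by the profiler
   gpr_budget: GPRs available per work-item slot (ASSUMED, device specific)
   hw_cap:     hardware limit on resident wavefronts (ASSUMED, device specific) */
static unsigned wavefronts_by_gpr(unsigned gprs_used, unsigned gpr_budget,
                                  unsigned hw_cap) {
    unsigned n;
    if (gprs_used == 0u)
        return hw_cap;
    n = gpr_budget / gprs_used; /* integer division rounds down */
    return n < hw_cap ? n : hw_cap;
}

/* Illustrative numbers only: doubling GPR use halves the wavefront count. */
static void check_wavefronts_by_gpr(void) {
    assert(wavefronts_by_gpr(20u, 256u, 24u) == 12u);
    assert(wavefronts_by_gpr(40u, 256u, 24u) == 6u);
}
```

So if going from uint2 to uint4 roughly doubles the GPR count, the number of wavefronts available to hide latency can drop by about half under this model.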

In my case, I found the sweet spot between uint2 and uint4 (and like I said, uint3 was broken). Thus I interleaved one uint2 sha256 operation with one uint sha256 operation, and that ended up being the fastest for VLIW5 hardware. On 69xx, using just uint2 was faster.

genaganna
Journeyman III

ALUBusy - an easy way to raise it?

1. Make sure you have more wavefronts per group.

2. Other things as per gat3way:

a. Vectorization

b. Avoid dependent statements if possible.
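As a small illustration of point b, here is a plain C sketch (function names are made up for the example). In `sum_chain` every addition has to wait for the previous one; in `sum_split` the four accumulators are independent, so four additions can be issued together. Because unsigned arithmetic wraps modulo 2^32, both versions return the same value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add depends on the result of the previous add. */
static uint32_t sum_chain(const uint32_t *v, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}

/* Four accumulators: the adds inside one iteration are independent. */
static uint32_t sum_split(const uint32_t *v, size_t n) {
    uint32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < n; ++i) /* leftover elements */
        a0 += v[i];
    return a0 + a1 + a2 + a3;
}

/* Returns 1 if both versions agree on an 11-element test array. */
static int sums_agree(void) {
    uint32_t v[11];
    for (size_t i = 0; i < 11; ++i)
        v[i] = 0xf0000000u + (uint32_t)i * 0x01010101u; /* forces wraparound */
    return sum_chain(v, 11) == sum_split(v, 11);
}

/* Small self-check that can be called from main(). */
static void check_sums(void) {
    assert(sums_agree());
}
```

The same idea applies inside a round function: if two statements do not need each other's results, keeping them on separate variables gives the compiler more freedom to pack them.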

diapolo
Adept I

ALUBusy - an easy way to raise it?

Are 3-component vectors working with SDK 2.5 without the use of the AMD_vec3 extension? I guess I will try this ... reordering seems to help, but for me that's kind of trial and error, because I have no deep understanding of IL or ASM code :-/. What's your ALU OP usage and GPR usage for vec3?

How can I have more wavefronts per group? I didn't get that statement ...

Thanks,

Dia

diapolo
Adept I

ALUBusy - an easy way to raise it?

Originally posted by: genaganna

1. Make sure you have more wavefronts per group

2. Other things as per gat3way

a. Vectorization

b. Avoid dependent statements if possible.

So it could be faster to keep the same value in two different variables, if that makes the following commands independent?

Dia

maximmoroz
Journeyman III

ALUBusy - an easy way to raise it?

An ALUBusy of 68% for an ALU-bound kernel might mean that you failed to fully hide global memory access. What are your global work size and group size?

Besides, what are the ALUPacking and LDSBankConflict values?

gat3way
Journeyman III

ALUBusy - an easy way to raise it?

Damn, I confused ALUPacking with ALUBusy. ALUPacking should be the VLIW utilization, while ALUBusy is the percentage of GPU time during which ALU instructions are being processed.

If you have round constants in a __constant array, try moving them to __private memory; this should help.

genaganna
Journeyman III

ALUBusy - an easy way to raise it?

Originally posted by: diapolo

Are 3 component vectors working with SDK 2.5 without the use of AMD_vec3 extension? I guess I will try this ... reordering seems to help, but for me that's kind of trial and error, because I have no deep understanding of IL or ASM code :-/. What's your ALU OP usage and GPR usage for vec3?

As per the OpenCL 1.1 spec, 3-component vectors are part of the core spec, which means there is no need to use extensions.

3-component vectors should work in SDK 2.5. I am not sure what the problem with SDK 2.4 was. As per my understanding, it should work in SDK 2.4 as well.

It is always recommended to use vec4 instead of vec3 because of the better ALU utilization and the lower overhead in initializing vec4 data.

 

How can I have more Wavefronts per group? Didn't get that statement ...

Make sure your work-group size is as big as possible, up to the maximum work-group size allowed by the device, i.e. 256 for GPUs in general.
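To relate work-group size to wavefronts, the arithmetic is just a rounded-up division. A minimal sketch in plain C follows; the wavefront size of 64 used in the check is an assumption that holds for many AMD GPUs of this generation but not all, so query your device rather than hard-coding it.

```c
#include <assert.h>

/* Number of wavefronts needed to run one work-group.
   wavefront_size is device specific (ASSUMED to be 64 in the check below). */
static unsigned wavefronts_per_group(unsigned wg_size, unsigned wavefront_size) {
    return (wg_size + wavefront_size - 1u) / wavefront_size; /* round up */
}

/* With a wavefront size of 64, a 256-item group spans 4 wavefronts. */
static void check_wavefronts_per_group(void) {
    assert(wavefronts_per_group(256u, 64u) == 4u);
    assert(wavefronts_per_group(64u, 64u) == 1u);
}
```

A group size that is not a multiple of the wavefront size still occupies a whole extra wavefront, so 100 work-items cost as much as 128 under this assumption.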

gat3way
Journeyman III

ALUBusy - an easy way to raise it?

Well, it generally works in 2.4, as I have used it at other times. This time, however, with that particular kernel (the bitcoin one), switching from uint4 to uint3 caused the runtime to crash. The kernel compiled fine with no errors; at least clBuildProgram() returned CL_SUCCESS. Some time after that, it crashed. I checked the ISA dump and found some error about a used port (?!?), and the ISA dump ended at that error. Switching back to uint4 or uint2 produced a valid kernel binary. I noticed the same thing with uint3 in one more kernel too. I don't know what causes it, though.

genaganna
Journeyman III

ALUBusy - an easy way to raise it?

Originally posted by: gat3way Well it generally works in 2.4 as I have used it other times. This time however, with that particular kernel (the bitcoin one) switching to uint3 from uint4 caused the runtime to crash. The kernel compiled fine with no errors - at least clBuildProgram() returned CL_SUCCESS. Then some time after that it crashes. I checked the ISA dump to find some error about used port (?!?) and with this error, isa dump ended.  Switching back to uint4 or uint2 produced valid kernel binary. I noticed the same thing with uint3 with one more kernel too. I don't know what causes it though.


Please send us simplified code that reproduces this so that it can be fixed in future releases.

If you don't want to post the code here, please file a ticket at

http://developer.amd.com/support/KnowledgeBase/pages/HelpdeskTicketForm.aspx?Category=8
