Archives Discussions

diapolo · ‎08-12-2011

I'm working on a Bitcoin-Mining kernel and took a look at the output of AMD APP Profiler. I saw that the value of ALUBusy is only at ~68% and I guess the goal should be a higher number.

What causes ALUBusy to get higher in general? Any hints?

Dia

gat3way · ‎08-12-2011

I happen to have written a bitcoin kernel too 🙂

Basically it is an extremely ALU-bound kernel and you should not be having global or local memory reads at all (just global memory write at the end). Thus, easiest thing you could do is vectorize more. If you are using say uint2 (as most bitcoin kernels do), try uint4 or uint3 (uint3 was broken with SDK2.4 and generated bad ISA though - it is also opencl1.1 feature probably unsupported by some earlier SDKs).

Increasing vector size, you provide more ALU operations that have no dependencies so that they can fit in a VLIW pipeline. Another thing is that some ALU ops can't operate on t unit (like bitalign and bfi_int) so that larger vectors == more chance to have enough alu operations to fill the x,y,z,w,t units and get closer to 100% ALUBusy.

That said, bfi_int patching in general replaces a couple of instructions that can operate on t unit with a single one that does not work on t unit. To achieve better ALUPacking, you may need to reorder some stuff in your round function. I can give you no specific advice on that - just experiment, and profile, use GPU_DUMP_DEVICE_KERNEL=3, look at ISA dumps until you find the sweet spot.

Also keep in mind that increasing vector size after a certain threshold does not help, just the opposite. That's because it involves more GPRs and makes the kernel bigger. Number of used GPRs limit the number of wavefront thus the device utilization. Larger kernels are slower because they don't fit in GPU instruction caches and I've heard that the OpenCL compiler gives up some register allocation optimizations once kernels get too big (though it might not be true).

In my case, I found the sweet spot between uint2 and uint4 (and like I said, uint3 was broken). Thus I interlaced one uint2 sha256 operation and one uint sha256 operation and that ended to be fastest for VLIW5 hardware. On 69xx, using just uint2 was faster.

genaganna · ‎08-12-2011

1. Make sure you have more more wavefronts per group

2. Other things as per gat3way

a. Vectorization

b. Avoid dependent statements if possible.

diapolo · ‎08-12-2011

Originally posted by: genaganna 1. Make sure you have more more wavefronts per group

2. Other things as per gat3way

a. Vectorization

b. Avoid dependent statements if possible.

So it could be faster to have a value 2-times in different variables, if this makes the following comands independent?

Dia

maximmoroz · ‎08-12-2011

ALUBusy 68% for ALU bound kernel might mean that you failed to hide global memory access fully. What is your global worksize and group size?

Besides, what are ALUPacking and LDSBankConflict?

gat3way · ‎08-12-2011

Damn, I confused ALUPacking for ALUBusy. ALUPacking should be the VLIW utilization while ALUBusy is the ratio of ALU ops.

If you have round constants in an __constant array, try offseting them to __private memory, this should help.

diapolo · ‎08-13-2011

Originally posted by: gat3way Damn, I confused ALUPacking for ALUBusy. ALUPacking should be the VLIW utilization while ALUBusy is the ratio of ALU ops.

If you have round constants in an __constant array, try offseting them to __private memory, this should help.

You mean copy them into a private variable during kernel execution instead of directly use the __constant?

Dia

diapolo · ‎08-12-2011

Are 3 component vectors working with SDK 2.5 without the use of AMD_vec3 extension? I guess I will try this ... reorderning seems to help, but for me that's kind of trial and error, because I have no deep understanding of IL or ASM code :-/. What's your ALU OP usage and GPR usage for vec3?

How can I have more Wavefronts per group? Didn't get that statement ...

Thanks,

Dia

genaganna · ‎08-12-2011

Originally posted by: diapolo Are 3 component vectors working with SDK 2.5 without the use of AMD_vec3 extension? I guess I will try this ... reorderning seems to help, but for me that's kind of trial and error, because I have no deep understanding of IL or ASM code :-/. What's your ALU OP usage and GPR usage for vec3?

As per OpenCL 1.1 spec, 3 component vectors are in core spec means no need to use extensions.

3 component vectors should work in SDK2.5. I am not what was the problem with SDK2.4. As per my understanding, it should work in SDK2.4.

It is always recommanded to use vec4 instead of vec3 becuase more ALU untilization and less over head in initailzation of vec4 data.

How can I have more Wavefronts per group? Didn't get that statement ...

Make sure your work group size as big as possible and less than maximum allowed work group size as per device. i.e 256 for GPU in general.

gat3way · ‎08-12-2011

Well it generally works in 2.4 as I have used it other times. This time however, with that particular kernel (the bitcoin one) switching to uint3 from uint4 caused the runtime to crash. The kernel compiled fine with no errors - at least clBuildProgram() returned CL_SUCCESS. Then some time after that it crashes. I checked the ISA dump to find some error about used port (?!?) and with this error, isa dump ended. Switching back to uint4 or uint2 produced valid kernel binary. I noticed the same thing with uint3 with one more kernel too. I don't know what causes it though.

genaganna · ‎08-12-2011

Originally posted by: gat3way Well it generally works in 2.4 as I have used it other times. This time however, with that particular kernel (the bitcoin one) switching to uint3 from uint4 caused the runtime to crash. The kernel compiled fine with no errors - at least clBuildProgram() returned CL_SUCCESS. Then some time after that it crashes. I checked the ISA dump to find some error about used port (?!?) and with this error, isa dump ended. Switching back to uint4 or uint2 produced valid kernel binary. I noticed the same thing with uint3 with one more kernel too. I don't know what causes it though.

Please send simplified code for us so that it will be fixed in future releases.

If you don't want to copy code here please file at ticket at

http://developer.amd.com/support/KnowledgeBase/pages/HelpdeskTicketForm.aspx?Category=8

diapolo · ‎08-13-2011

I tried to implement 3-component vectors, but this just crashes ... could well be my fault, but is it working for someone in SDK 2.5?

This is with Cat 11.7 / Win7 x64 and 2- and 4-component vectors work well, no errors or crashes. Tested with 5870 and 6550D! APP KernelAnalyzer doesn't show any errors, too. APP Profiler doesn't give anything useful because of the crash if uint3 is used!

The ALUpacking value is round about 98,5%.

Global Work-size: 134217728

Work-Group-size: 256

Dia

diapolo · ‎08-14-2011

To the Vec3 thing, KernelAnalyzer outputs an error message directly in the Asembly tab: "Error: Another scalar op (gpr 6) has already used GPR read port 0 for chan 1 (gpr 127)".

Any ideas?

Dia

gat3way · ‎08-14-2011

Looks like you had exactly the same problem as me with uint3. Is that using 2.5?

As for __constant: do not copy anything. Just initialize a __private variable with the needed value.

The ALUpacking value is round about 98,5%.
Global Work-size: 134217728
Work-Group-size: 256

By the way, I just tried to compile my bitcoin kernel using the offline devices extension. When I compile from source and do clBuildProgram() it is OK. The precompiled kernel for Barts does not execute though. Profiling it with sprofile displays the same "strange" global work size. I have an error about jump to non-existant address or something like that in the ISA dump.

diapolo · ‎08-14-2011

Yes, this is for SDK 2.5 and it seem's to be some kind of bug. I opened a developer ticked, but got no response till now.

If I get new informations, I will post here!

Edit: Using __private instead of __constant leads only to 1 more GPR used for 58XX cards. No improvement in KernelAnalyzer.

Dia

genaganna · ‎08-15-2011

Originally posted by: diapolo I tried to implement 3-component vectors, but this just crashes ... could well be my fault, but is it working for someone in SDK 2.5?

This is with Cat 11.7 / Win7 x64 and 2- and 4-component vectors work well, no errors or crashes. Tested with 5870 and 6550D! APP KernelAnalyzer doesn't show any errors, too. APP Profiler doesn't give anything useful because of the crash if uint3 is used!
The ALUpacking value is round about 98,5%.
Global Work-size: 134217728
Work-Group-size: 256

Is it crashing in Runtime code or kernel code? Make sure alignment are proper for vec3 as it is required memory exactly same as vec4.

diapolo · ‎08-15-2011

Originally posted by: genaganna
Originally posted by: diapolo I tried to implement 3-component vectors, but this just crashes ... could well be my fault, but is it working for someone in SDK 2.5?

This is with Cat 11.7 / Win7 x64 and 2- and 4-component vectors work well, no errors or crashes. Tested with 5870 and 6550D! APP KernelAnalyzer doesn't show any errors, too. APP Profiler doesn't give anything useful because of the crash if uint3 is used!

The ALUpacking value is round about 98,5%.

Global Work-size: 134217728

Work-Group-size: 256

Is it crashing in Runtime code or kernel code? Make sure alignment are proper for vec3 as it is required memory exactly same as vec4.

As I said in the KernelAnalyzer the right tab holds the object code, which is asembly for a specific device for which the OpenCL kernel is compiled. In this window I get the above mentioned error message (Error: Another scalar op (gpr 6) has already used GPR read port 0 for chan 1 (gpr 127)), if I try to use vec3 in my kernel.

The generated object code is way to small, so it's clear now, why the application, which uses the kernel, crashes with Vec3 enabled.

What do you mean the memory layout has to be the same as for vec3? There is 1 kernel parameter, which is uint3 for the vec3 version and uint4 for the vec4 version. They are filled with 0, 1, 2, 0, 1, 2 ... for vec3 and 0, 1, 2, 3, 0, 1, 2, 3 ... for vec4.

Dia

genaganna · ‎08-15-2011

Originally posted by: diapolo
Originally posted by:

As I said in the KernelAnalyzer the right tab holds the object code, which is asembly for a specific device for which the OpenCL kernel is compiled. In this window I get the above mentioned error message (Error: Another scalar op (gpr 6) has already used GPR read port 0 for chan 1 (gpr 127)), if I try to use vec3 in my kernel.
Is it possible for to copy kernel code here?

What do you mean the memory layout has to be the same as for vec3? There is 1 kernel parameter, which is uint3 for the vec3 version and uint4 for the vec4 version. They are filled with 0, 1, 2, 0, 1, 2 ... for vec3 and 0, 1, 2, 3, 0, 1, 2, 3 ... for vec4.

I am taking about alignment in runtime code. In kernel, compiler handles these properly.

diapolo · ‎08-16-2011

@genaganna:

I prefer to not post the kernel in here, but I could send it to you for a review. Just leave an E-Mail address.

The used Vec3 kernel parameter is an array, which consists of uints in the runtime, which works for vec2 and vec4 :-/.

Isn't any developer from AMD here, who can comment on the error I get in the KernelAnalyzer?

Dia

genaganna · ‎08-16-2011

Originally posted by: diapolo @genaganna:

I prefer to not post the kernel in here, but I could send it to you for a review. Just leave an E-Mail address.

The used Vec3 kernel parameter is an array, which consists of uints in the runtime, which works for vec2 and vec4 :-/.

Isn't any developer from AMD here, who can comment on the error I get in the KernelAnalyzer?

I am from AMD. It looks like you filed a developer ticket and did not select appropriate field. I did not get an email for that ticket. Please paste ticket number here.

diapolo · ‎08-16-2011

Originally posted by: genaganna
Originally posted by: diapolo @genaganna:

I prefer to not post the kernel in here, but I could send it to you for a review. Just leave an E-Mail address.
The used Vec3 kernel parameter is an array, which consists of uints in the runtime, which works for vec2 and vec4 :-/.
Isn't any developer from AMD here, who can comment on the error I get in the KernelAnalyzer?

I am from AMD. It looks like you filed a developer ticket and did not select appropriate field. I did not get an email for that ticket. Please paste ticket number here.

Indeed, I filed a ticked, but didn't get any ID via e-mail :-/. Can I create a new one, with just a link to this thread?

Edit: Ticket ID 1481

Dia

genaganna · ‎08-17-2011

Originally posted by: diapolo

Indeed, I filed a ticked, but didn't get any ID via e-mail :-/. Can I create a new one, with just a link to this thread?

Edit: Ticket ID 1481

We are able to reproduce the issue. Issue is reported to developers.

I have few questions.

1. Where it is crashing when you run? Is it crashing in clEnqueueNDRangeKernel or somewhere else? Please tell us at which API, your code is crashing.

2. What happens when you run on CPU device?

diapolo · ‎08-18-2011

to 1. I can't tell the exact time, when it's crashing, but I know, that the kernel gets somehow compiled, because a binary kernel file (*.elf) is generated (for GPU and CPU).

to 2. The API of the main software is Python AFAIK, I have not written it by myself, it's the Phoenix bitcoin-miner, and what you got is the kernel for it. CPU device, doesn't work, too.

Any further news about the bug?

genaganna · ‎08-19-2011

Originally posted by: diapolo to 1. I can't tell the exact time, when it's crashing, but I know, that the kernel gets somehow compiled, because a binary kernel file (*.elf) is generated (for GPU and CPU).

to 2. The API of the main software is Python AFAIK, I have not written it by myself, it's the Phoenix bitcoin-miner, and what you got is the kernel for it. CPU device, doesn't work, too.

It would good if you ask Python developers where exactly crashing.

Any further news about the bug?

Developers are looking the issue.

diapolo · ‎08-27-2011

Any news on this issue genaganna?

Thx, Dia

Archives Discussions

ALUBusy - an easy way to raise it? + Vec3 problems