yurtesen
Miniboss

Using AVX registers on Bulldozer...

I have been trying to get my kernel to use the AVX YMM* registers. However, while it seems to be using AVX instructions, it only uses the SSE XMM* registers. Is there a trick to get a kernel to use them? Or are there any hints, or a pointer to an example kernel that would use them?

Thanks!

0 Likes
8 Replies
craft_coder
Adept II

Which tool is showing you this?

My understanding is that the registers are the same. Each YMM register is 256 bits wide, and its lower 128 bits are the XMM register with the same number.

0 Likes

The AMD APP Kernel Analyzer

http://developer.amd.com/tools/AMDAPPKernelAnalyzer/Pages/default.aspx

Yes, but I should be able to see YMM0-15 in the assembly code. If only XMM0-15 are visible, that would mean only the lower 128 bits of the registers are being used.

0 Likes
feryno
Journeyman III

Hi yurtesen,

I'm currently developing a UEFI PE32+ asm-level debugger.

I attached the current state of my project (which is not ready for release yet).

Look in cpu_features.asm to see how to set up the CPU to support AVX.

Look in exceptions.asm to see how to save and restore the AVX registers during an exception.

It works well on my AMD FX-4100. I tried to make it compatible with Intel as well (not tested, I only implemented what the Intel manuals suggest, and I don't plan to buy such a CPU).

Cheers, Feryno

0 Likes

I am pretty sure it works once you write it directly in assembly. I meant to ask about using them from OpenCL.

0 Likes

To fully utilize AVX, the kernel should use the float8 or float16 data types on an AVX-enabled processor (Bulldozer/Sandy Bridge).

0 Likes

tzachi, what about this:

__kernel void cl_muladdcos(global const float8 *a,
                           global const float8 *b,
                           global const float8 *c,
                           global float8 *result) {
  int id = get_global_id(0);
  result[id] = acos(a[id] * b[id] + c[id]);
}

This only uses XMM registers and does not even use FMA instructions. Intel's OpenCL is able to use YMM registers.

0 Likes

From the AMD Accelerated Parallel Processing OpenCL Programming Guide:

Use float4 and the OpenCL built-ins for vector types (vload, vstore, etc.). These enable the AMD Accelerated Parallel Processing OpenCL implementation to generate efficient, packed SSE instructions when running on the CPU. Vectorization is an optimization that benefits both the AMD CPU and GPU.

I have no experience with using vload/vstore stuff, but hope this helps.

0 Likes

If I use float4, it will simply default to SSE registers. I am not sure whether there is any difference between taking an input array of float8 and using vload8 on an input array of float, but I will try some of these tomorrow. Thanks. In general, it looks like there is room for improvement in the AMD OpenCL SDK.
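
For reference, a vload8/vstore8 variant of the cl_muladdcos kernel posted earlier in the thread might look like the sketch below. The kernel name cl_muladdcos_vload is made up for illustration, and the sketch assumes each buffer holds at least 8 floats per work-item. One difference the OpenCL specification does guarantee is alignment: dereferencing a float8 pointer requires 32-byte alignment, while vload8 and vstore8 only require the address to be aligned to a single float. Whether either form makes a given CPU compiler emit YMM registers or FMA instructions is up to the implementation, so the generated assembly still has to be checked.

```c
// Sketch only: same arithmetic as cl_muladdcos, but reading from plain
// float buffers with vload8/vstore8 instead of float8 pointers.
// Assumption: each buffer holds at least 8 * get_global_size(0) floats.
__kernel void cl_muladdcos_vload(__global const float *a,
                                 __global const float *b,
                                 __global const float *c,
                                 __global float *result)
{
    size_t id = get_global_id(0);

    // vload8(offset, p) reads the 8 floats starting at p + offset * 8.
    float8 va = vload8(id, a);
    float8 vb = vload8(id, b);
    float8 vc = vload8(id, c);

    // vstore8(data, offset, p) writes 8 floats starting at p + offset * 8.
    vstore8(acos(va * vb + vc), id, result);
}
```

With this layout the global work size is the number of floats divided by 8, the same as when the buffers are declared as float8.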

0 Likes