Archives Discussions

vladant · ‎07-30-2012

Hi,

Let us consider two multiplicatoin kernels.

1. Simple multiplication using the *operator:

__kernel void multiplication(__global uint4 *src, __global uint4 *dst)
{
size_t i = get_global_id(0);

dst = src * 3;
};

X86 Assembler:

pushl %eax

leal 8191(%esp), %eax

movl 8(%esp), %ecx

movl 12(%esp), %edx

andl $-8192, %eax # imm = 0xFFFFFFFFFFFFE000

movl -16(%eax), %eax

shll $4, %eax

movdqa (%ecx,%eax), %xmm0

pshufd $3, %xmm0, %xmm1 # xmm1 = xmm0[3,0,0,0]

pshufd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0,0,0]

movd %xmm1, %ecx

leal (%ecx,%ecx,2), %ecx

movd %ecx, %xmm1

movd %xmm2, %ecx

leal (%ecx,%ecx,2), %ecx

movd %ecx, %xmm2

movd %xmm0, %ecx

movhlps %xmm0, %xmm0 # xmm0 = xmm0[1,1]

leal (%ecx,%ecx,2), %ecx

punpckldq %xmm1, %xmm2 # xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]

movd %ecx, %xmm1

movd %xmm0, %ecx

leal (%ecx,%ecx,2), %ecx

movd %ecx, %xmm0

punpckldq %xmm0, %xmm1 # xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]

punpckldq %xmm2, %xmm1 # xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1]

movdqa %xmm1, (%edx,%eax)

popl %eax

ret

2. Simple multiplication using mul24 (or mul_hi):

__kernel void multiplication(__global uint4 *src, __global uint4 *dst)
{
size_t i = get_global_id(0);

dst = mul24(src, 3);
};

X86 Assembler:

pushl %eax

leal 8191(%esp), %eax

movl 8(%esp), %ecx

movl 12(%esp), %edx

movdqa LCPI0_0, %xmm1

andl $-8192, %eax # imm = 0xFFFFFFFFFFFFE000

movl -16(%eax), %eax

shll $4, %eax

movdqa (%ecx,%eax), %xmm0

movdqa %xmm0, %xmm2

psrldq $4, %xmm0

pmuludq %xmm1, %xmm2

psrldq $4, %xmm1

pmuludq %xmm0, %xmm1

pshufd $-40, %xmm2, %xmm2 # xmm2 = xmm2[0,2,1,3]

pshufd $-40, %xmm1, %xmm0 # xmm0 = xmm1[0,2,1,3]

punpckldq %xmm0, %xmm2 # xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]

movdqa %xmm2, (%edx,%eax)

popl %eax

ret

The question is why the code in the first case does not use pmuludq instruction as mul24 ? It is hard to follow what is going on there (where is a multiplication or addition?) but I do not believe that this code could be faster and by the way mul24 should be fastest method. pmuludq gives 64 bit result that is a good one to cast into any format CL compiler wants.

Thanks.

binying · ‎07-30-2012

Is there any example of multiplication using assembly language in Randall Hyde's "The Art of Assembly Language"?

realhet · ‎07-30-2012

Hi!

In the 1st code there is the multiplication -> leal (%ecx,%ecx,2), %ecx it means ecx <- ecx + ecx shl 1.

Seems like the traditional cpu does the heavy work, and sse is used to unpack and repack each individual dwords.

It would be only a movaps plus 2x paddd instructions if the compiler could do it with a brain

I think the 2nd code does it fast as it can, with the 2x pmuludq instructions. But have no idea why it doesn't use pmuludq for the first one also.

I think the fastest generated code would be dst = src + src<<1; for the given example source codes.

vladant · ‎07-31-2012

Hi,

I see, multiplication by 3 is too trivial, if to multiply by 13 for example everything become strightforward.

Another sad example:

__kernel void test02(__global uchar8 *src, __global uchar8 *dst)
{
size_t i = get_global_id(0);

ushort8 t = convert_ushort8(src);

t = mul_hi(t, 13);

dst = convert_uchar8(t);
}

I am not going to put whole x86 assempler code here (it is a little bit long) but it seems compiler cannot convert uchar8 vector into ushort8 although it could be done by one PUNPCKLBW instruction. Instead compiler uses pextrw/punpcklwd a lot. upsample() does not help either.

Thanks.

Archives Discussions

Multiplication in x86 OpenCl kernels