
divij
Journeyman III

Interface between CAL and OpenCL

Calling a CAL routine from OpenCL

I have to call a routine written in CAL from OpenCL.

Is there a way to link/interface CAL and OpenCL code?

0 Likes
11 Replies

No, we don't allow interop between CAL and OpenCL.  What do you need from CAL?

Jeff


Hey Jeff,

I just want to use a highly efficient SGEMM routine in OpenCL code.

The matrix multiplication code bundled as a sample in the AMD APP SDK achieves ~500 GFlops on the HD699 in single precision.

However, I have found codes developed by others, such as this one, which reach up to 2 TFlops. But all of those are written using CAL, with the kernels written in IL assembly.

So, if I want to use OpenCL, does that mean I wouldn't be able to use highly optimised code?

If we talk of OpenCL as the standard of the future, it should at least be able to achieve optimizations comparable to other languages.
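For reference, GFlops figures like those quoted above typically come from the standard 2·n³ operation count for an n×n GEMM divided by elapsed time. A minimal sketch of that bookkeeping (the matrix size and timing below are made-up numbers, not measurements from the sample):

```python
def gemm_gflops(n, seconds):
    """GFlops for an n x n matrix multiply: 2*n^3 floating-point
    operations (one multiply + one add per inner-product term),
    divided by elapsed time, scaled to giga-flops."""
    flops = 2.0 * n ** 3
    return flops / seconds / 1e9

# Hypothetical example: a 4096x4096 SGEMM finishing in 0.275 s
# works out to roughly 500 GFlops.
print(round(gemm_gflops(4096, 0.275)))  # → 500
```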


If you're a god, you can edit the ELF file spat out by the OpenCL program binary, inject your IL code, and it should run that on the fly. I have no idea what the parameter mapping between OpenCL and its IL backend is, though.


Originally posted by: divij If we talk of OpenCL as the standard of the future, it should at least be able to achieve optimizations comparable to other languages.

I believe you are encountering a limit of the sample, not the API.  The blocksize on the sample is much smaller than in the optimized CAL version.

Jeff



I have timed the sample provided in the SDK after increasing the blocksize as well; the result still remains ~500 GFlops.

Yes, I agree that it is a limit of the sample and not the API. But to optimize GEMM beyond the provided sample, we'd have to use registers, the texture cache, etc., because those are what the optimized CAL version uses. And direct user control over the VLIW units and other hardware details is not possible from OpenCL.

Jeff, can you suggest any optimization, other than those already used in the sample, that is possible using OpenCL alone?

Thanks for your replies.

P.S. Do you work at AMD?
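The register/blocking optimization being discussed can be illustrated on the host side. Below is a hypothetical pure-Python sketch of a tiled matrix multiply (names and sizes are made up; this is not the SDK sample's kernel). It shows the blocking structure that lets a GPU kernel keep tile-sized strips of A and B in registers or local memory:

```python
def matmul_tiled(A, B, n, tile):
    """Multiply two n x n matrices (flat row-major lists) with a
    tile x tile blocking of the output, mirroring how a GPU kernel
    would assign one output block per work-group."""
    C = [0.0] * (n * n)
    for bi in range(0, n, tile):          # output-block row
        for bj in range(0, n, tile):      # output-block column
            for bk in range(0, n, tile):  # strip along the k dimension
                for i in range(bi, bi + tile):
                    for j in range(bj, bj + tile):
                        acc = C[i * n + j]
                        for k in range(bk, bk + tile):
                            acc += A[i * n + k] * B[k * n + j]
                        C[i * n + j] = acc
    return C

# Tiny check against the naive triple loop.
n = 4
A = [float(i) for i in range(n * n)]
B = [float((i * 3) % 7) for i in range(n * n)]
naive = [sum(A[i * n + k] * B[k * n + j] for k in range(n))
         for i in range(n) for j in range(n)]
assert matmul_tiled(A, B, n, 2) == naive
```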


divij,
Try this change:

__kernel void mmmKernel(__global float4 *matrixA,
                        __global float4 *matrixB, ...)

to:

__kernel void mmmKernel(const __global float4 * restrict matrixA,
                        const __global float4 * restrict matrixB, ...)

This will give you caching on matrixA and matrixB and should improve performance.

Second, the IL code uses an 8x8 outer product, the SDK sample uses a 4x4. That alone will dramatically reduce what you can get.
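As an aside, the outer-product formulation Micah describes can be sketched abstractly. In this hypothetical Python illustration (not the actual IL or SDK kernel), each k-step loads one T-element column strip of A and one T-element row strip of B and accumulates their outer product into a T x T tile, so a larger T does more math per value loaded:

```python
def outer_product_tile(a_col, b_row, acc):
    """One k-step of a register-tiled GEMM: accumulate the outer
    product of a T-element column strip of A and a T-element row
    strip of B into a T x T accumulator held 'in registers'."""
    T = len(a_col)
    for i in range(T):
        for j in range(T):
            acc[i][j] += a_col[i] * b_row[j]
    return acc

# One step with a 2x2 tile: loads 4 values, performs 8 flops.
acc = [[0.0, 0.0], [0.0, 0.0]]
outer_product_tile([1.0, 2.0], [3.0, 4.0], acc)
print(acc)  # → [[3.0, 4.0], [6.0, 8.0]]
```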

Originally posted by: MicahVillmow This will give you caching on matrixA and matrixB and should improve performance. Second, the IL code uses an 8x8 outer product, the SDK sample uses a 4x4.


Thank you for your reply.

1) I'll make the modifications you suggest soon and come back with the performance details.

2) Please correct me if I am wrong, but the block size for the sample can be changed with the -b parameter at execution time. I have tested the code with the maximum blocksize my card supports, i.e. 8x8, and it gives 465 GFlops.


Have you tried the matrixmulImage sample? It is expected to give a better GFLOPS value.


Originally posted by: himanshu.gautam Have you tried the matrixmulImage sample? It is expected to give a better GFLOPS value.


Thank you again Himanshu.

While using tile sizes of 4x8 it achieves 1.6 TFlops, and I guess it can be further optimized with an 8x8 tile size.

I am very curious to know what happened here. What is the difference between the two implementations, and where did the performance boost come from?

Any suggestions for reading material on these aspects?


divij,
You can find some information here:
http://forum.beyond3d.com/showthread.php?t=54842

Basically, going from 4x4 -> 4x8 -> 8x8 decreases the amount of bandwidth required to do the calculation, thus increasing the performance of the algorithm.
For example, prunedtree showed that with an 8x4 on RV770 the peak is 600 GFlops, but with 8x8 the peak is 960 GFlops.
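Micah's bandwidth argument can be made concrete with a little arithmetic: for each k-step, an RxC output tile loads R+C values and performs 2·R·C flops, so flops per value loaded grows with the tile size. A quick hypothetical sketch (the function name is made up; the RV770 peak numbers above are from the post, not computed here):

```python
def flops_per_load(tile_rows, tile_cols):
    """Per k-step of a tiled GEMM: loads tile_rows values of A and
    tile_cols values of B, and performs 2 * tile_rows * tile_cols
    flops (one multiply + one add per tile element)."""
    loads = tile_rows + tile_cols
    flops = 2 * tile_rows * tile_cols
    return flops / loads

print(flops_per_load(4, 4))  # → 4.0 flops per value loaded
print(flops_per_load(8, 4))  # ≈ 5.33 (64 flops / 12 loads)
print(flops_per_load(8, 8))  # → 8.0
```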

Just for the information:

I tried using the restrict keyword in the MatrixMultiplication kernel as suggested by Micah Villmow, and it didn't bring about a performance difference.
