
jaidotsh
Staff

Optimum performance

I wanted to know whether my code (CSR matrix multiplication) will give optimum performance if I use all of the optimizations below together:

1. float to float4 (current implementation)

2. Blocking (yet to add; i.e., grouping rows into warp-sized blocks)

Does the 2nd optimization matter much in terms of performance? (A sketch of what I mean by blocking is below.)
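
To give an idea of optimization 2, here is a rough sketch of what I have in mind (I haven't implemented it yet, so the names and details are illustrative): one work-group maps to one wavefront of 64 work-items, each work-group handles a single row, and the per-work-item partial sums are reduced in local memory.

__kernel void csr_spmv_vector(__global const unsigned int * row_indices,
                              __global const unsigned int * column_indices,
                              __global const float * elements,
                              __global const float * vector,
                              __global float * result,
                              unsigned int num_rows,
                              __local float * partial)   /* one float per work-item */
{
    const unsigned int lid = get_local_id(0);
    const unsigned int row = get_group_id(0);            /* one work-group (wavefront) per row */

    if (row < num_rows)
    {
        /* work-items stride through the row's non-zeros together,
           so the elements/column_indices loads are coalesced */
        float dot_prod = 0.0f;
        for (unsigned int i = row_indices[row] + lid; i < row_indices[row + 1]; i += get_local_size(0))
            dot_prod += elements[i] * vector[column_indices[i]];
        partial[lid] = dot_prod;

        /* tree reduction in local memory down to one value per row */
        for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1)
        {
            barrier(CLK_LOCAL_MEM_FENCE);
            if (lid < s)
                partial[lid] += partial[lid + s];
        }
        if (lid == 0)
            result[row] = partial[0];
    }
}

It would be launched with a local size of 64 and a global size of 64 times the number of rows.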

jaidotsh
Staff

Also, can you suggest any other optimization techniques that I can use?

maximmoroz
Journeyman III

Honestly, using float4 is the last thing that would come to my mind when optimizing CSR matrix multiplication. May I ask how you benefited from float4?

What are your current position and target? That is, what is the current efficiency of your kernel against theoretical ALU throughput and memory bandwidth, and what would you like to achieve?


Actually, I am improving on the sparse matrix library for OpenCL, ViennaCL. In fact, they have used float4, float8, and float16 in their implementation. I'm developing a custom kernel using warp-based optimization techniques, which I believe would give better performance, but I'm not quite sure about it.

Using float4 I can fetch 4 non-zeros together in a single instruction. The current implementation is fairly efficient, but, as I said, I want to improve it. So can you suggest some other optimization techniques?
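
The float4 idea looks roughly like the sketch below (simplified, with illustrative names, and not the actual ViennaCL kernel). It assumes each row's non-zeros are padded to a multiple of 4 with zero values and a valid dummy column index, so that the elements and column indices can be read as float4/uint4:

__kernel void csr_spmv_float4(__global const unsigned int * row_indices,   /* 4-aligned row pointers */
                              __global const uint4 * column_indices4,
                              __global const float4 * elements4,
                              __global const float * vector,
                              __global float * result,
                              unsigned int num_rows)
{
    for (unsigned int row = get_global_id(0); row < num_rows; row += get_global_size(0))
    {
        float4 dot_prod = (float4)(0.0f);
        /* each iteration consumes 4 non-zeros via one float4 and one uint4 load */
        for (unsigned int i = row_indices[row] / 4; i < row_indices[row + 1] / 4; ++i)
        {
            float4 e = elements4[i];
            uint4  c = column_indices4[i];
            /* gather the 4 matching vector entries (these loads stay scalar) */
            dot_prod += e * (float4)(vector[c.x], vector[c.y], vector[c.z], vector[c.w]);
        }
        result[row] = dot_prod.x + dot_prod.y + dot_prod.z + dot_prod.w;
    }
}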


I have just downloaded ViennaCL and failed to find any compressed matrix multiplication (C = A×B) OpenCL code. Which file is the kernel located in, please? Or maybe you could show the kernel here? That would be helpful.


It's located in viennacl/linalg/kernels/compressed_matrix_source.h

Thanks. So it is just matrix by vector multiplication. Here are my thoughts:

1) What is the reason to have a single work-item calculate a bunch of output elements (size/get_global_size(0) of them)? What is size here?

2) Try replacing "__global const float * vector" with "const __global float * restrict vector". This will enable cached reads from the "vector" buffer. Actually, you might as well specify "restrict" for every buffer in the parameter list (see the sketch after this list).

3) Yes, it makes sense to use float4 (or even float8), since one cannot unroll a loop whose upper bound is dynamic. Note that the speed-up here comes not from vector ALU operations, and not from reading 4 subsequent floats as such, but from reading those 4 floats (or 8, or think about 16) in a single TEX clause.

4) For the time being, I don't see any way local memory would help here, so I would suggest the following approach:

a) Minimize the number of clauses (you are already doing this by using float4 and float8).

b) Minimize register usage, so that you can run the maximum number of wavefronts on a single CU.

c) You might need to implement some logic for determining the best local work size, one that does not itself limit the number of active wavefronts.
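
To make point 2 concrete, here is a sketch of the vec_mul kernel with "const" and "restrict" applied to every buffer. Only the parameter declarations change; the body is untouched:

__kernel void vec_mul(const __global unsigned int * restrict row_indices,
                      const __global unsigned int * restrict column_indices,
                      const __global float * restrict elements,
                      const __global float * restrict vector,   /* cached reads now possible */
                      __global float * restrict result,
                      unsigned int size)
{
    for (unsigned int row = get_global_id(0); row < size; row += get_global_size(0))
    {
        float dot_prod = 0.0f;
        unsigned int row_end = row_indices[row + 1];
        for (unsigned int i = row_indices[row]; i < row_end; ++i)
            dot_prod += elements[i] * vector[column_indices[i]];
        result[row] = dot_prod;
    }
}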


"size" is the number of work-items per work-group. But when I am multiplying it with a 1D array I need not cache anything because there is only one column.

Can you tell me what is 1.Vector ALU Operations 2.TEX clause ??

And how do I minimize the register usage?


Clauses are groups of instructions that can execute in one go. GPUs generally divide kernels into clauses such as ALU clauses, control-flow clauses, and fetch clauses.

Refer to the AMD OpenCL Programming Guide for details about clauses.


Honestly, I don't understand the partitioning of the work in the kernel.

Below is the code for the non-vectorized kernel.

The outer loop limits "row" to be less than "size", so the kernel writes to the "result" buffer for indexes [0, size - 1]. It looks like "size" is the number of elements in the resulting vector. Yet you say that "size" is the number of work-items per work-group; these are quite different entities.

"Vector ALU Operations" - I used this phrase to indicate that using float4 operations helps filling VLIW slots. Kind of bad word usage, sorry.

"Minimize register usage" - Use Kernel Analyzer, check ISA-code generated.

Read AMD OpenCL Programming Guide.

__kernel void vec_mul(__global const unsigned int * row_indices,
                      __global const unsigned int * column_indices,
                      __global const float * elements,
                      __global const float * vector,
                      __global float * result,
                      unsigned int size)
{
    /* grid-stride loop: each work-item handles rows row, row + global size, ... */
    for (unsigned int row = get_global_id(0); row < size; row += get_global_size(0))
    {
        float dot_prod = 0.0f;
        unsigned int row_end = row_indices[row + 1];
        /* dot product of the row's non-zeros with the gathered vector entries */
        for (unsigned int i = row_indices[row]; i < row_end; ++i)
            dot_prod += elements[i] * vector[column_indices[i]];
        result[row] = dot_prod;
    }
}


Yeah, you are right, it's not the number of work-items per work-group.

I haven't used the Kernel Analyzer; it seems very helpful. I'm still learning OpenCL, so I've got to get the hang of it. Thank you 🙂
