cipoint
Journeyman III

Best way to pass 3D vector fields to the kernel

Hello,

What's the best / most efficient way to pass three-dimensional vector fields to the kernel? My current solution is to pass the data (float or double) as a 1D array in this order:

X_000 Y_000 Z_000 X_001 Y_001 Z_001 ... X_010 Y_010 Z_010 X_011 Y_011 Z_011 ...

Inside the kernel I have to manipulate the vector field at each site, so I create a 3D vector like the following:

(float3) V = dataInput[nx + Nx * ny + Nx * Ny * nz]

where nx = get_global_id(0), Nx = get_global_size(0) and so on.

Is there a better way to do this? Unfortunately I can't use vector data types directly in the definition of the kernel arguments, because OpenCL gives float3 variables the size and alignment of float4 ...
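For reference, here is a minimal sketch of the indexing the question describes, written as plain host-side C so it can be compiled anywhere. It assumes three floats per site, packed without padding, with nx varying fastest (as in the index formula above); the names vec3f, site_index, load_vec3 and store_vec3 are made up for illustration. Inside a kernel, the OpenCL built-in vload3(site, dataInput) performs the same packed read from dataInput + 3 * site and returns a float3, so the kernel argument can stay a plain __global float pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one vector of the field. */
typedef struct { float x, y, z; } vec3f;

/* Linear site index, assuming nx varies fastest (as in the post's formula). */
static size_t site_index(size_t nx, size_t ny, size_t nz,
                         size_t Nx, size_t Ny)
{
    return nx + Nx * (ny + Ny * nz);
}

/* Read the vector stored at one site of an interleaved X Y Z array.
   Each site occupies three consecutive floats, hence the factor 3. */
static vec3f load_vec3(const float *data, size_t site)
{
    assert(data != NULL);
    vec3f v;
    v.x = data[3 * site + 0];
    v.y = data[3 * site + 1];
    v.z = data[3 * site + 2];
    return v;
}

/* Write a vector back to the same packed position. */
static void store_vec3(float *data, size_t site, vec3f v)
{
    assert(data != NULL);
    data[3 * site + 0] = v.x;
    data[3 * site + 1] = v.y;
    data[3 * site + 2] = v.z;
}
```

Note the factor 3: indexing the float array with the site index alone would land on the wrong element for every site except the first.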

3 Replies
himanshu_gautam
Grandmaster

From a performance point of view, it is recommended to lay out the buffer so that it is accessed in a coalesced manner. Your approach looks fine to me in that respect.
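Whether an access is coalesced depends on how far apart the addresses read by neighbouring work-items are. As a rough sketch, assuming the interleaved X Y Z layout with nx varying fastest and one work-item per site, the distance between two neighbours along each dimension can be computed as below (the function name is made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Distance, in floats, between the data of two work-items whose global id
   differs by one in dimension dim (0, 1 or 2).
   Assumes three packed floats per site and nx varying fastest. */
static size_t neighbour_stride_floats(int dim, size_t Nx, size_t Ny)
{
    assert(dim >= 0 && dim <= 2);
    switch (dim) {
    case 0:  return 3;            /* next site in x: adjacent in memory */
    case 1:  return 3 * Nx;       /* next site in y: one full row away  */
    default: return 3 * Nx * Ny;  /* next site in z: one full plane away */
    }
}
```

For a 64 x 64 x 64 float field this gives 3, 192 and 12288 floats respectively, so only neighbours along dimension 0 touch adjacent memory.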

Now, it looks like you actually need to traverse this buffer along different dimensions. Traversing along the columns or the height will most likely result in channel conflicts. You can try these suggestions in that case, if they fit your problem:

1. Try using LDS: If you can move a section of your buffer into LDS and then access it there (in a non-coalesced manner), it can improve performance. LDS bank conflicts are still an important bottleneck in many cases. Go through AMD's OpenCL Programming Guide for LDS optimizations.

2. You can also check other global memory optimization techniques (like staggered offsets) in the Programming Guide.
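The idea behind suggestion 1 can be sketched in plain C: read a small patch of the field row by row (contiguous reads) into a small buffer, then do the strided column walk inside that buffer. This is only a host-side illustration under the same layout assumption as before (interleaved X Y Z, nx fastest); TILE, stage_tile and column_sum are hypothetical names. In a kernel the tile would be a __local array filled cooperatively by the work-group, with a barrier(CLK_LOCAL_MEM_FENCE) between the fill and the reads.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* hypothetical tile edge, in sites */

/* Copy a TILE x TILE patch of sites from plane nz into a small buffer.
   Each row of the patch is read as one contiguous run of 3 * TILE floats. */
static void stage_tile(const float *field, size_t Nx, size_t Ny,
                       size_t x0, size_t y0, size_t nz,
                       float tile[TILE][TILE][3])
{
    assert(x0 + TILE <= Nx && y0 + TILE <= Ny);
    for (size_t ty = 0; ty < TILE; ++ty) {
        size_t row = 3 * (x0 + Nx * ((y0 + ty) + Ny * nz));
        for (size_t tx = 0; tx < TILE; ++tx)
            for (size_t c = 0; c < 3; ++c)
                tile[ty][tx][c] = field[row + 3 * tx + c];
    }
}

/* Walk down one column of the staged tile and sum component c.
   The strided access now happens in the small buffer, not in the field. */
static float column_sum(float tile[TILE][TILE][3], size_t tx, size_t c)
{
    float sum = 0.0f;
    for (size_t ty = 0; ty < TILE; ++ty)
        sum += tile[ty][tx][c];
    return sum;
}
```

Whether this pays off depends on how often each staged element is reused and on the bank conflicts the column walk causes in LDS.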


The 3D field contains a 3-vector with components X, Y and Z:

X - height

Y - width

Z - depth

X_000 Y_000 Z_000, X_001 Y_001 Z_001, X_002 Y_002 Z_002.

himanshu_gautam
Grandmaster

Going by your access pattern "dataInput[nx + Nx * ny + Nx * Ny * nz]", it is hard to say anything.

Memory bandwidth usage depends on what successive "workitems" in a workgroup are doing, and on what each "workgroup" is doing.

Are you spawning a 3D workgroup and partitioning the 3D volume among multiple workgroups?

Also, what does "X_000 Y_000 Z_000 X_001 Y_001 Z_001 ... X_010 Y_010 Z_010 X_011 Y_011 Z_011 ..." mean?

Are you storing the data of (0,0,0) followed by (1,1,1), followed by (2,2,2), etc.? It looks like you are storing the diagonal...

Please clarify.
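To make the workgroup question concrete: in each dimension OpenCL splits the global range into workgroups so that get_global_id = get_group_id * get_local_size + get_local_id (for a zero global offset). A small C sketch of that decomposition, with made-up helper names and assuming the global size is a multiple of the local size:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pair: which workgroup an id falls in, and where inside it. */
typedef struct { size_t group, local; } split_id;

/* Split a global id in one dimension (zero global offset assumed). */
static split_id split_global_id(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    split_id s = { global_id / local_size, global_id % local_size };
    return s;
}

/* Number of workgroups in one dimension. */
static size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, a 64 x 64 x 64 volume with a hypothetical 8 x 8 x 4 local size is covered by 8 * 8 * 16 = 1024 workgroups, each handling a small brick of the volume.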
