notyou
Adept III

Ways of accessing vectors in kernel?

I've seen two different ways of accessing a vector and I'm not quite sure of the difference between them. Let's say we have a float4 vector. Then in the code I can say:

vector[global_id] = global_id; // assume global_id = 0

and then in other places I see

vector[global_id].x = global_id;

vector[global_id].y = global_id;

...

 

Will these two do the same thing? If so, is there any reason to use the second method if we write the same data to all locations?

 

Also, if we have something like the above:

// assume vector is a 16 element array

vector[local_id] = local_id // assume local_id = 0

Will this write elements[0-3] with local_id? And then thread 1 writes elements[4-7]?

Or would we have to manually map which elements are executed by each thread?

vector[local_id + X] = local_id // where X is basically making it a loop, in which case we wouldn't be executing 4 items at a time if I'm not mistaken.

Thanks.

-Matt

0 Likes
7 Replies
antzrhere
Adept III

In your first example both forms give the same result. With "vector[global_id] = global_id" there is an implicit scalar-to-vector conversion, which OpenCL permits. You could also write "vector[global_id] = (type4)(global_id)", with type being the respective data type. In theory the compiler should (and does) convert all of these (including the component-wise vector[global_id].x = global_id form) to the same code if it can vectorise well.

Answering your second question: looking at your code, I think you're confusing what the elements of the array are.

Are you talking about a vector data type that is 16 elements wide (i.e. vector.s0-vector.sF, the maximum width of any vector type in OpenCL) OR are you talking about a vector data type that is 4 elements wide (vector.s0-vector.s3 or alternatively x, y, z, w) but which has an array size of 16 (i.e. vector[0]-vector[15])?

If vector is a type4 vector, then the code "vector[local_id] = local_id" means that the first thread does [ vector[0].s0=0; vector[0].s1=0; vector[0].s2=0; vector[0].s3=0; ], the second thread does [ vector[1].s0=1; vector[1].s1=1; vector[1].s2=1; vector[1].s3=1; ], etc. (s0-s3 is interchangeable with x, y, z, w.)
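To make the equivalence concrete, here is a minimal sketch in plain host-side C rather than OpenCL C. The names float4_emu, splat4, write_whole and write_components are made up for illustration and are not part of any OpenCL API; splat4 only imitates what the implicit scalar-to-vector conversion does.

```c
#include <stddef.h>

/* Plain-C stand-in for OpenCL's built-in float4 (hypothetical name). */
typedef struct { float x, y, z, w; } float4_emu;

/* Imitates the implicit scalar-to-vector conversion:
   every component receives the same scalar value. */
static float4_emu splat4(float s) {
    float4_emu v = { s, s, s, s };
    return v;
}

/* Form 1: one whole-element assignment, like vector[id] = id. */
static void write_whole(float4_emu *vector, size_t id) {
    vector[id] = splat4((float)id);
}

/* Form 2: four component-wise assignments, like vector[id].x = id; ... */
static void write_components(float4_emu *vector, size_t id) {
    vector[id].x = (float)id;
    vector[id].y = (float)id;
    vector[id].z = (float)id;
    vector[id].w = (float)id;
}
```

Calling either function for id = 0..3 on a four-element array leaves the same contents, with element 1 holding four 1s, element 2 four 2s, and so on.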

 

 

0 Likes
notzed
Challenger

The OpenCL specification has details on automatic type promotion, such as converting a scalar to the target vector type; see e.g. section 6.2.6 (and 6.2 in general) and section 6.4. That document is the best place to answer such questions.

But basically, a scalar is duplicated into each element of the target vector, so yes, the statements above are equivalent, assuming 'vector' is a 2-element type such as uint2, or that the writes to the other elements are simply not shown.

// assume vector is a 16 element array

vector[local_id] = local_id // assume local_id = 0

Will this write elements[0-3] with local_id? And then thread 1 writes elements[4-7]?

Or would we have to manually map which elements are executed by each thread?

vector[local_id + X] = local_id // where X is basically making it a loop, in which case we wouldn't be executing 4 items at a time if I'm not mistaken.

I think you're confusing memory addresses with array elements.

vector[localid] is a single array element whose size is that of the vector type, e.g. a float4 is 4 floats (16 bytes) wide. A write to the localid'th element necessarily writes all 16 bytes and may be implemented as multiple separate writes, but the programmer only sees it as a single array element (which is 4 floats wide).

The actual details of the memory storage itself vary ... e.g. see section 6.2.5 of the specification.

0 Likes

 

I was talking about a 4-element vector. Let's say the vector is 4 elements wide (e.g. float4) and I want an array of size 4 (for simplicity). Where I was originally confused was whether OpenCL laid the array out as:

(My original belief): [0.x, 0.y, 0.z, 0.w, 1.x, 1.y, 1.z, 1.w, ..., 3.x, 3.y, 3.z, 3.w] --> 16 total elements in the array.

I believe my original problem was that I thought OpenCL flattened the 4-element vector into the array to make it 4 times the actual size (in the above example, I originally thought there would be 16 elements in the array).

Now, if I'm understanding correctly, for visualization, basically it's a 2D array with the columns being handled automatically (because of the vectorization).

i.e.

[0.x, 1.x, 2.x, 3.x]

[0.y, 1.y, 2.y, 3.y]

[0.z, 1.z, 2.z, 3.z]

[0.w, 1.w, 2.w, 3.w]

Where thread 0 handles the 0.x-0.w elements, thread 1 handles 1.x-1.w and so on.

 

 


This, combined with antzrhere's reply, makes much more sense. Thanks.

Then, one last question: let's say I have an array of 4 elements (i.e. int[4]). How would I go about vectorizing this properly with, say, int2, so that each thread handles 2 elements (thread0 handles elements 0-1 and thread1 takes elements 2-3)? Would it be as simple as changing the array from int[4] to int2[2] (and rearranging the elements from int[4] to fit properly into int2[2]), so that thread0 takes 0.x-0.y and thread1 takes 1.x-1.y, or is there something I'm missing? Thanks.

0 Likes

If I'm reading it correctly, your visualisation isn't quite right, but it's nearly there (maybe just a technicality). As a memory layout it should be:

[0.x, 0.y, 0.z, 0.w]            = vector[0]

[1.x, 1.y, 1.z, 1.w]            = vector[1]

[2.x, 2.y, 2.z, 2.w]            = vector[2]

[3.x, 3.y, 3.z, 3.w]           = vector[3]

...and rows are handled automatically (this may be what you meant; the standard notation is just that things are read left to right, then top to bottom, which matches the order in which they would be read from or written to a 1D memory buffer).

Now if you were to pass a simple 1-dimensional array of floats from your C++ program to your kernel (in the form of a global buffer), they would map like this (in order): 0.x, 0.y, 0.z, 0.w, 1.x, 1.y, 1.z, 1.w, 2.x ... etc.
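The mapping from the flat float buffer to vector elements can be sketched in plain host-side C as below. This is only an illustration of the index arithmetic: float4_emu, flat_index and element_from_flat are hypothetical names, and the struct is a stand-in for OpenCL's float4.

```c
#include <stddef.h>

/* Plain-C stand-in for OpenCL's float4: four packed floats (hypothetical name). */
typedef struct { float x, y, z, w; } float4_emu;

/* Position in the flat float buffer of component c (0 = x ... 3 = w)
   of vector element i. */
static size_t flat_index(size_t i, size_t c) {
    return 4 * i + c;
}

/* Gathers vector element i out of the flat buffer, component by component. */
static float4_emu element_from_flat(const float *flat, size_t i) {
    float4_emu v;
    v.x = flat[flat_index(i, 0)];
    v.y = flat[flat_index(i, 1)];
    v.z = flat[flat_index(i, 2)];
    v.w = flat[flat_index(i, 3)];
    return v;
}
```

So with 16 floats in the buffer, element 1 is made of floats 4 to 7 and element 3 of floats 12 to 15, which matches the row-by-row layout above.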

Now if you were to perform the following code: "result = vector[0] + vector[1]", it would equate to the following:

result.x = vector[0].x + vector[1].x
result.y = vector[0].y + vector[1].y
result.z = vector[0].z + vector[1].z
result.w = vector[0].w + vector[1].w
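The same expansion, written as a small plain-C function for anyone who wants to check it on the host. float4_emu and add4 are hypothetical names; in OpenCL C the + operator on float4 does this for you.

```c
/* Plain-C stand-in for OpenCL's float4 (hypothetical name). */
typedef struct { float x, y, z, w; } float4_emu;

/* Component-wise addition, spelling out what a + b means for two float4 values. */
static float4_emu add4(float4_emu a, float4_emu b) {
    float4_emu r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}
```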

Regarding your last question:

Yes, that is how you do it. However, at least on AMD hardware it may be better to use int4 to ensure best performance (a more natural fit for the current VLIW design, and for SSE to boot).
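As a rough host-side sketch of the int[4] to int2[2] split, assuming a made-up kernel body that just doubles its data: int2_emu, work_item and run_all are hypothetical names, and the loop in run_all only emulates sequentially what the OpenCL runtime would do with one work-item per element.

```c
#include <stddef.h>

/* Plain-C stand-in for OpenCL's int2 (hypothetical name). */
typedef struct { int x, y; } int2_emu;

/* Body of a hypothetical kernel: work-item `id` owns exactly one int2 element
   and doubles both of its components. */
static void work_item(int2_emu *data, size_t id) {
    data[id].x *= 2;
    data[id].y *= 2;
}

/* Host-side emulation of launching `count` work-items, run one after another. */
static void run_all(int2_emu *data, size_t count) {
    for (size_t id = 0; id < count; ++id) {
        work_item(data, id);
    }
}
```

With the original values {0, 1, 2, 3} repacked as {{0, 1}, {2, 3}}, work-item 0 touches only the first pair and work-item 1 only the second.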

Just make sure that when you're reading/writing vectors to memory, the memory addresses are properly aligned. To avoid using vloadn/vstoren, all memory addresses used to read or write vectors must be aligned to the size of the data type, e.g. int2 must be aligned to 8 bytes, int4 to 16 bytes.

Good luck.

 

0 Likes

Yes, I understand that my 2D array analogy isn't quite right (it was more of a visual representation for me just to see if it made sense) and yes, I did mean rows.

I also realize that using a 2-element vector isn't good for performance, but I wanted an extremely simple example to make sure I understood.

Next, you talk about aligning the memory addresses. Just to make sure I understand, this should only be necessary inside a kernel (if I pass a kernel parameter, it should automatically be aligned if it's either cl_mem or float4 [for example], correct?). If not, can you provide an example of how/where to do this? The only thing I found in the OpenCL specification was

float x[4];

float4 v = vload4( 0, x );

which I assume is inside the kernel (and not being passed as a parameter; if it was, I would need to do this since the basic data type is different, correct?). Thanks again for your help.



0 Likes

Yes, memory only needs to be correctly aligned within the kernel itself (it's part of the OpenCL spec). Misaligned reads and writes result in undefined behaviour.

But you don't need to worry about this yet: all buffers allocated by OpenCL are aligned on boundaries that satisfy the alignment requirements of every data type, so passing one to a kernel will be 100% fine.

In future, the only time you need to worry about alignment is when typecasting between pointers where you calculate a new base address using an offset (within the kernel itself). In those cases, if you cannot guarantee alignment, you can use vloadn/vstoren. But you needn't worry about this unless that is what you're doing.
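A host-side C analogy of the offset case, as a sketch only: is_aligned and load4_any are hypothetical helpers, not OpenCL functions. load4_any copies four floats starting at p + 4 * offset, which mirrors the addressing vload4(offset, p) uses, and memcpy is used here because it does not require the source to be 16-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain-C stand-in for OpenCL's float4 (hypothetical name). */
typedef struct { float x, y, z, w; } float4_emu;

/* Returns 1 when address p is a multiple of `alignment` bytes
   (e.g. 16 for a float4), 0 otherwise. */
static int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}

/* Copies four consecutive floats starting at p + 4 * offset.
   The source only needs float alignment, not float4 alignment. */
static float4_emu load4_any(size_t offset, const float *p) {
    float4_emu v;
    memcpy(&v, p + 4 * offset, sizeof v);
    return v;
}
```

For example, a base pointer that has been advanced by one float is generally no longer 16-byte aligned, so the four floats behind it would be fetched with load4_any (vload4 in a kernel) rather than by casting the pointer to a float4 pointer.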

0 Likes

Perfect. Thanks.

0 Likes