Hi,
I found a struct like this in our code:
struct MyStruct
{
    float x1, x2, x3;
    float y1, y2, y3;
    float PAD1, PAD2;
};
and we use this struct somewhere like this:
... = (float4)((myStruct->x1 + myStruct->y1) * 0.5f, (myStruct->x2 + myStruct->y2) * 0.5f, (myStruct->x3 + myStruct->y3) * 0.5f, 0.0f);
I thought this was bad, so I changed it to this:
struct MyStruct
{
    float3 x;
    // No padding needed: a float3 already has the size and alignment of a float4
    float3 y;
    // No padding needed: a float3 already has the size and alignment of a float4
};
... = (float4)((myStruct.x + myStruct.y) * 0.5f, 0.0f);
I compared the kernel times, and my new version is slower: the kernel took 585 ms before my change and takes 751 ms now. My question is: why? Maybe coalesced memory access helped the old version, because there is no padding between x3 and y1. But I thought the GPU would compute one float3 faster than three floats. Maybe the compiler is smart enough to use the same registers, so there is nothing to gain from converting to float3. But this is not just no gain, it is a loss. If floats are faster, I will change all our float3s to floats.
I also tried float4 to see the result: 728 ms. That is faster than float3 but still a lot slower than floats.
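To make the layout difference between the two structs concrete, here is a minimal host-side C sketch. It is an illustration under stated assumptions, not the original code: HostFloat3, MyStructScalars, MyStructVec3, repack, midpoint_scalars, midpoint_vec3 and check_layouts_agree are hypothetical names, and HostFloat3 only emulates the OpenCL rule that a float3 has the size and alignment of a float4. It assumes a 4-byte float. Both structs come out at 32 bytes, but y starts at byte 12 in the scalar version and at byte 16 in the float3 version, so any host code that fills the buffer has to be repacked to match. The sketch does not by itself explain the timing difference.

```c
#include <assert.h>   /* assert, used by the runtime self-check below */
#include <stddef.h>   /* offsetof */

/* Original layout: six scalars plus two explicit padding floats. */
typedef struct {
    float x1, x2, x3;
    float y1, y2, y3;
    float PAD1, PAD2;
} MyStructScalars;

/* Hypothetical host-side stand-in for an OpenCL float3: three floats
   stored in a 16-byte slot with 16-byte alignment. */
typedef struct {
    _Alignas(16) float s[4];   /* s[3] is the unused fourth lane */
} HostFloat3;

/* Emulation of the float3 version of the struct. */
typedef struct {
    HostFloat3 x;
    HostFloat3 y;
} MyStructVec3;

/* Same total size, different field offsets (assumes sizeof(float) == 4). */
_Static_assert(sizeof(MyStructScalars) == 32, "scalar layout is 32 bytes");
_Static_assert(sizeof(MyStructVec3) == 32, "float3 layout is 32 bytes");
_Static_assert(offsetof(MyStructScalars, y1) == 12, "y follows x directly");
_Static_assert(offsetof(MyStructVec3, y) == 16, "y starts on a 16-byte boundary");

/* Convert one element from the scalar layout to the float3 layout. */
static MyStructVec3 repack(const MyStructScalars *in)
{
    MyStructVec3 out = { { { in->x1, in->x2, in->x3, 0.0f } },
                         { { in->y1, in->y2, in->y3, 0.0f } } };
    return out;
}

/* Midpoint as written in the scalar version of the kernel. */
static void midpoint_scalars(const MyStructScalars *p, float out[4])
{
    out[0] = (p->x1 + p->y1) * 0.5f;
    out[1] = (p->x2 + p->y2) * 0.5f;
    out[2] = (p->x3 + p->y3) * 0.5f;
    out[3] = 0.0f;
}

/* Midpoint as written in the float3 version, one lane at a time. */
static void midpoint_vec3(const MyStructVec3 *p, float out[4])
{
    for (int i = 0; i < 3; ++i)
        out[i] = (p->x.s[i] + p->y.s[i]) * 0.5f;
    out[3] = 0.0f;
}

/* Runtime self-check: after repacking, both versions give the same result. */
static void check_layouts_agree(const MyStructScalars *in)
{
    float a[4], b[4];
    MyStructVec3 v = repack(in);
    midpoint_scalars(in, a);
    midpoint_vec3(&v, b);
    for (int i = 0; i < 4; ++i)
        assert(a[i] == b[i]);
}
```

Calling check_layouts_agree on a few elements of the existing host buffer is a quick way to confirm that the repacked data gives the same midpoints, which helps rule out a layout mismatch before looking at the kernel timing itself.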
You created a lot of duplicate posts; I have deleted them now.
Which GPU are you running the code on? Which driver, APP SDK version, operating system, etc.?
Ideally you should not lose performance if the only change is from 3 floats to a float3 and the work-items are still doing the same amount of work. Could you share a repro case?