Just experienced a bizarre problem. I can't past the entire code to confuse you. The program is like this conceptually:
struct st_a{
...
int3 vInA;
...
int num;
};
struct st_b{
...
float3 vInB;
...
}
__kernel void MyKernel()
{
__local struct st_a A;
__local struct st_b B;
A.vInA = (int3)0;
B.vInB = (float3).6f;
// OMG, A.vInA is now very big numbers!!!
...
}
I only launched one work item, so there is no write conflict.
Does anybody have any suggestions?
Thank you in advance!
VisCocoa
Solved! Go to Solution.
viscocoa,
I've found the problem. The reason this is occurring is that the code in question was written a long time ago, before OpenCL 1.1 and vec3 existed. In OpenCL, vec3 types actually store the same amount of memory as a vec4 type and this section of code wasn't taking that into account, so the offsets into the structure were based on the wrong size. The only work-around that is guaranteed to work is to use vec4 types instead of vec3 types. This will provide the correct size calculation and will fix all the issues.
Expect this fix to make it into catalyst in the April or May driver.
Can you give the size of the structs st_a and st_b? Maybe we can reproduce this in-house with that info.
Hi Jeff,
It's not a question on bank conflict.
I know that when passing a structure from the host to a kernel, the structure must be padded.
Now, it seems that a structure in local memory has to be padded to be the multiples of 16 bytes? Otherwise, the next structure will overlap the previous one?
I have worked around the overwrite by adding a float4 in the end of the first structure:
struct st_a{
...
int3 vInA;
...
int num;
int4 padding;
};
I don't think this comply with OpenCL standard.
Thank you for you answer.
No, of course it's not complying with the standard. I just want to get a test case that we can reproduce in-house.
Thank you Jeff. I hope that you can confirm this if you can reproduce it. I will write a small example kernel when I get some time.
Only aligning the first structure does not solve the problem. However, if I keep the second variable far away from the first by declaring two local variables based on st_a, they don't conflict at all. To make things clear:
struct st_a{
...
int3 vInA;
...
int num;
};
struct st_b{
...
float3 vInB;
...
}
__kernel void MyKernel()
{
__local struct st_a A;
__local struct st_a padding; // to keep B far away from A in local memory
__local struct st_b B;
A.vInA = (int3)0;
B.vInB = (float3).6f;
// Now A is intact
...
}
Can you give the size of st_a and st_b and the offset of vInA and vInB within each struct? Thanks.
Thanks, that helped. I am able to see the issue here. CPU device works fine, so it's probably a real bug on our side.
Thank you Jeff!
On my computer:
&foo = 0
sizeof(stA) = 752
&bar = 704
There are 48 bytes overlapping.
Yep, seems to be something going on with the alignment of the pointers within the struct. For a workaround, sort your structs from small to large. I.e.
__local st_B bar;
__local st_A foo;
That worked for me and Micah, who is one of our compiler engineers who posts here regularly, thinks that is a safe workaround until he gets the bug(s) sorted out.
Thanks again for reporting the issue.
I actually used:
__local st_A foo;
__local char padding[48];
__local st_C bar;
In my program, there are other structures following the two.
Thank you and have a good night!
Ack, I just found that while foo may come out okay, bar isn't correct, at least according to printf. I changed the code as follows:
foo.m9 = (int3)0;
bar.m1 = (float3).6f;
barrier(CLK_LOCAL_MEM_FENCE);
printf("bar.m1: (%f, %f, %f)\n", bar.m1.x, bar.m1.y, bar.m1.z);
// output: (0.600000, 0.600000, 0.600000)
out[0] = foo.m9.x;
CPU device:
bar.m1: (0.600000, 0.600000, 0.600000)
GPU device:
bar.m1: (0.600000, 0.000000, 0.000000)
Hopefully we can get a proper fix soon. I didn't check whether your padding would fix this case, just whether reordering the structs would.
Thanks Jeff. The padding worked well. I would suggest, for now, that all variables declared in local memory should be checked for possible overlaps.
viscocoa,
I've found the problem. The reason this is occurring is that the code in question was written a long time ago, before OpenCL 1.1 and vec3 existed. In OpenCL, vec3 types actually store the same amount of memory as a vec4 type and this section of code wasn't taking that into account, so the offsets into the structure were based on the wrong size. The only work-around that is guaranteed to work is to use vec4 types instead of vec3 types. This will provide the correct size calculation and will fix all the issues.
Expect this fix to make it into catalyst in the April or May driver.
Hi Micah,
Thank you very much for finding out the reason. This problem has been solved totally.
May I ask another related question?
Since now a vec3 uses as much memory as a vec4, can I still expect that the compiler will pack a vec3 and a scalar variable automatically into a single vector register?
I used to manually store an int3 and a scalar integer into an int4, in hope that a register can be saved. Is this wise? Or just stupid because the compiler will do it if necessary?
The compiler will do it in most cases. There might be some cases where register porting issues don't allow it for certain sequences of instructions however.
Thank you Micah. This is very helpful!