Archives Discussions

viscocoa · ‎02-15-2012

Just experienced a bizarre problem. I can't past the entire code to confuse you. The program is like this conceptually:

struct st_a{

...

int3 vInA;

...

int num;

};

struct st_b{

...

float3 vInB;

...

}

__kernel void MyKernel()

{

__local struct st_a A;

__local struct st_b B;

A.vInA = (int3)0;

B.vInB = (float3).6f;

// OMG, A.vInA is now very big numbers!!!

...

}

I only launched one work item, so there is no write conflict.

Does anybody have any suggestions?

Thank you in advance!

VisCocoa

MicahVillmow · ‎02-20-2012

viscocoa,

I've found the problem. The reason this is occurring is that the code in question was written a long time ago, before OpenCL 1.1 and vec3 existed. In OpenCL, vec3 types actually store the same amount of memory as a vec4 type and this section of code wasn't taking that into account, so the offsets into the structure were based on the wrong size. The only work-around that is guaranteed to work is to use vec4 types instead of vec3 types. This will provide the correct size calculation and will fix all the issues.

Expect this fix to make it into catalyst in the April or May driver.

View solution in original post

jeff_golds · ‎02-15-2012

Can you give the size of the structs st_a and st_b? Maybe we can reproduce this in-house with that info.

viscocoa · ‎02-15-2012

Hi Jeff,

It's not a question on bank conflict.

I know that when passing a structure from the host to a kernel, the structure must be padded.

Now, it seems that a structure in local memory has to be padded to be the multiples of 16 bytes? Otherwise, the next structure will overlap the previous one?

I have worked around the overwrite by adding a float4 in the end of the first structure:

struct st_a{

...

int3 vInA;

...

int num;

int4 padding;

};

I don't think this comply with OpenCL standard.

Thank you for you answer.

jeff_golds · ‎02-15-2012

No, of course it's not complying with the standard. I just want to get a test case that we can reproduce in-house.

viscocoa · ‎02-15-2012

Thank you Jeff. I hope that you can confirm this if you can reproduce it. I will write a small example kernel when I get some time.

viscocoa · ‎02-15-2012

Only aligning the first structure does not solve the problem. However, if I keep the second variable far away from the first by declaring two local variables based on st_a, they don't conflict at all. To make things clear:

struct st_a{

...

int3 vInA;

...

int num;

};

struct st_b{

...

float3 vInB;

...

}

__kernel void MyKernel()

{

__local struct st_a A;

__local struct st_a padding; // to keep B far away from A in local memory

__local struct st_b B;

A.vInA = (int3)0;

B.vInB = (float3).6f;

// Now A is intact

...

}

jeff_golds · ‎02-15-2012

Can you give the size of st_a and st_b and the offset of vInA and vInB within each struct? Thanks.

viscocoa · ‎02-15-2012

Below is a runable kernel that demonstrates the problem. I appreciate if you can confirm that the kernel reproduces the problem on your computer, so that I am confident that my software environment is all right.

typedef struct _C {

int m0;

float3 m1;

float m2;

float3 m3;

} st_C;

typedef struct _D {

float3 m0;

float3 m1;

} st_D;

typedef struct _A {

int m0;

st_C m1[10];

__global float3* m2;

int m3;

__global float3 *m4;

__global float2 *m5;

__global int4 *m6;

int m7;

st_D m8;

int3 m9; // focus on this

__global int2 *m10;

__global int *m11;

int m12;

} st_A;

typedef struct _B {

int m0;

float3 m1; // focus on this

float3 m2;

} st_B;

__kernel void experiment()

{

__local st_A foo;

__local st_B bar;

foo.m9 = (int3)0;

printf("foo.m9: (%d, %d, %d)\n", foo.m9.x, foo.m9.y, foo.m9.z);

// output: (0, 0, 0)

bar.m1 = (float3).6f;

printf("foo.m9: (%d, %d, %d)\n", foo.m9.x, foo.m9.y, foo.m9.z);

// output: (1058642330, 1058642330, 1058642330)

}

jeff_golds · ‎02-15-2012

Thanks, that helped. I am able to see the issue here. CPU device works fine, so it's probably a real bug on our side.

viscocoa · ‎02-15-2012

Thank you Jeff!

viscocoa · ‎02-15-2012

On my computer:

&foo = 0

sizeof(stA) = 752

&bar = 704

There are 48 bytes overlapping.

jeff_golds · ‎02-15-2012

Yep, seems to be something going on with the alignment of the pointers within the struct. For a workaround, sort your structs from small to large. I.e.

__local st_B bar;

__local st_A foo;

That worked for me and Micah, who is one of our compiler engineers who posts here regularly, thinks that is a safe workaround until he gets the bug(s) sorted out.

Thanks again for reporting the issue.

viscocoa · ‎02-15-2012

I actually used:

__local st_A foo;

__local char padding[48];

__local st_C bar;

In my program, there are other structures following the two.

Thank you and have a good night!

jeff_golds · ‎02-16-2012

Ack, I just found that while foo may come out okay, bar isn't correct, at least according to printf. I changed the code as follows:

foo.m9 = (int3)0;

bar.m1 = (float3).6f;

barrier(CLK_LOCAL_MEM_FENCE);

printf("bar.m1: (%f, %f, %f)\n", bar.m1.x, bar.m1.y, bar.m1.z);

// output: (0.600000, 0.600000, 0.600000)

out[0] = foo.m9.x;

CPU device:

bar.m1: (0.600000, 0.600000, 0.600000)

GPU device:

bar.m1: (0.600000, 0.000000, 0.000000)

Hopefully we can get a proper fix soon. I didn't check whether your padding would fix this case, just whether reordering the structs would.

viscocoa · ‎02-16-2012

Thanks Jeff. The padding worked well. I would suggest, for now, that all variables declared in local memory should be checked for possible overlaps.

MicahVillmow · ‎02-20-2012

viscocoa,

I've found the problem. The reason this is occurring is that the code in question was written a long time ago, before OpenCL 1.1 and vec3 existed. In OpenCL, vec3 types actually store the same amount of memory as a vec4 type and this section of code wasn't taking that into account, so the offsets into the structure were based on the wrong size. The only work-around that is guaranteed to work is to use vec4 types instead of vec3 types. This will provide the correct size calculation and will fix all the issues.

Expect this fix to make it into catalyst in the April or May driver.

viscocoa · ‎02-21-2012

Hi Micah,

Thank you very much for finding out the reason. This problem has been solved totally.

May I ask another related question?

Since now a vec3 uses as much memory as a vec4, can I still expect that the compiler will pack a vec3 and a scalar variable automatically into a single vector register?

I used to manually store an int3 and a scalar integer into an int4, in hope that a register can be saved. Is this wise? Or just stupid because the compiler will do it if necessary?

MicahVillmow · ‎02-21-2012

The compiler will do it in most cases. There might be some cases where register porting issues don't allow it for certain sequences of instructions however.

viscocoa · ‎02-22-2012

Thank you Micah. This is very helpful!

Archives Discussions

local memory conflict?