lantis
Journeyman III

maximum local variable usage

I'm trying to maximize my use of local variables as temporaries for computation, but judging from the results I'm getting, I seem to be hitting bank conflicts (at least, that's what I think my problem is).

What I'm trying to do is use a uint4 array of four elements for the computation, something like:

uint4 Y[4] = {some initialized values here};
uint gid = get_local_id(0);
uint offset = gid*4;
__local uint4 *x;

for (uint i=3; --i; ){
    x[offset] = Y[0] + Y[1] + Y[2] + Y[3];
    x[offset+1] = x[offset] + x[offset];
    x[offset+2] = x[offset+1] * x[offset];
    x[offset+3] = x[offset+1] * x[offset+2];

    Y[0] = x[offset];
    Y[1] = x[offset+1];
    Y[2] = x[offset+2];
    Y[3] = x[offset+3];
}

First: is my offset correct? Multiplying the local id by four should give me the four-element space I need for each local id. Or should it be get_group_id(0) * get_local_id(0) * 4?

Second: uint4 is 4 unsigned ints, so 4 bytes * 4 = 16 bytes? I need a work-group size of at least 64. What work-group size do I need to make full use of the 32 KB of local memory while avoiding bank conflicts and out-of-bounds accesses?
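
For the first question, the indexing can be checked with a small host-side sketch in plain C (not kernel code). It assumes that x points to a __local buffer, which OpenCL allocates per work-group, so only the local id needs to enter the offset. The helper names first_index, required_elements, and ranges_are_disjoint are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper: first uint4 slot used by a work-item,
   matching offset = get_local_id(0) * 4 in the post. */
size_t first_index(size_t local_id) {
    return local_id * 4;
}

/* Hypothetical helper: number of uint4 elements the __local buffer must
   hold so that the last work-item's offset+3 stays in bounds.
   Assumes local_size >= 1. */
size_t required_elements(size_t local_size) {
    return first_index(local_size - 1) + 4;
}

/* Returns 1 if no two work-items of the group share a slot, else 0. */
int ranges_are_disjoint(size_t local_size) {
    for (size_t lid = 1; lid < local_size; ++lid) {
        size_t prev_last = first_index(lid - 1) + 3;
        if (first_index(lid) <= prev_last) return 0;
    }
    return 1;
}
```

Under these assumptions, a work-group of 64 needs required_elements(64) = 256 uint4 slots, i.e. 256 * 16 bytes of local memory, and each work-item's four slots do not overlap with its neighbours'.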

himanshu_gautam
Grandmaster


The local pointer "x" must be declared as "uint" and not "uint4".

You are already multiplying "gid with 4".

-

Bruhaspati

lantis
Journeyman III


The local pointer "x" must be declared as "uint" and not "uint4"


I don't get this.  I am storing uint4 in x, but you're telling me it must be uint?


You are already multiplying "gid with 4"


So the offset is correct? How about my second question (which is actually the problem I am trying to solve)?

uint4 is 4 unsigned ints, so 4 bytes * 4 = 16 bytes? I need a work-group size of at least 64. What work-group size do I need to make full use of the 32 KB of local memory while avoiding bank conflicts and out-of-bounds accesses?


Oh, sorry, I missed your logic. I think "x" as uint4 is perfectly fine.

Each work-item accesses 16 bytes of data (one uint4), which occupies 4 banks.

Successive work-items are 4 uint4s apart, so effectively they are 16 banks apart.

There are a total of 32 banks.

So within a quarter wavefront (16 work-items), every second work-item lands on the same 4 banks, and you have an 8-way bank conflict per quarter wavefront!

That effectively serializes everything, coupled with severe under-utilization of local memory bandwidth (25%: 8 banks active out of 32).

This assumes that the compiler will create 1 local load instruction to load 1 uint4.

If it breaks the load down into smaller pieces, you will still have the same 8x bank conflict, with even lower local memory bandwidth utilization.
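
The bank arithmetic above can be reproduced with a small host-side simulation in plain C. This is only a sketch under the reply's assumptions (32 banks, each 4 bytes wide, and one load instruction per uint4); the function name max_lanes_per_bank and its parameters are hypothetical.

```c
#include <string.h>

#define NUM_BANKS 32   /* assumption from the reply: 32 banks, 4 bytes wide */

/* Simulates `lanes` work-items that each read `dwords_per_lane` consecutive
   32-bit words, where lane l starts at dword l * stride_dwords.
   Returns the largest number of accesses that fall on a single bank and
   stores the number of distinct banks touched in *active_banks. */
int max_lanes_per_bank(int lanes, int stride_dwords, int dwords_per_lane,
                       int *active_banks) {
    int hits[NUM_BANKS];
    memset(hits, 0, sizeof hits);

    for (int l = 0; l < lanes; ++l)
        for (int d = 0; d < dwords_per_lane; ++d)
            hits[(l * stride_dwords + d) % NUM_BANKS]++;

    int worst = 0, active = 0;
    for (int b = 0; b < NUM_BANKS; ++b) {
        if (hits[b] > worst) worst = hits[b];
        if (hits[b] > 0) ++active;
    }
    *active_banks = active;
    return worst;
}
```

For the layout in the question, a quarter wavefront is 16 lanes, the stride is 4 uint4s = 16 dwords, and each lane reads 4 dwords, so the call is max_lanes_per_bank(16, 16, 4, &active). It reports 8 lanes on the busiest bank with 8 banks active, matching the 8-way conflict and 25% utilization described above.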

The best bet is to do your math in registers. It is not going to cost you that much.

If you still prefer local memory, you need to interleave your ints and load them efficiently, i.e. store your 4 uint4s like this, seen logically as 4 sets (two uint2 rows per uint4):

uint2_wi0, uint2_wi1, ...... uint2_wiLast
uint2_wi0, uint2_wi1, ...... uint2_wiLast
uint2_wi0, uint2_wi1, ...... uint2_wiLast
uint2_wi0, uint2_wi1, ...... uint2_wiLast
uint2_wi0, uint2_wi1, ...... uint2_wiLast
uint2_wi0, uint2_wi1, ...... uint2_wiLast
uint2_wi0, uint2_wi1, ...... uint2_wiLast
uint2_wi0, uint2_wi1, ...... uint2_wiLast

You can always read "uint2"s, stitch them together into a uint4, and work on that.

In any case, your math on local memory will still need to go via the registers, but yes, the compiler might re-use...
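
One way to index such an interleaved layout is sketched below as host-side plain C. It assumes row-major storage with one uint2 per work-item per row and the same 32 banks of 4 bytes as above; the names interleaved_index and row_conflict_degree are hypothetical, and this is not necessarily the exact scheme the reply had in mind.

```c
#include <stddef.h>

#define NUM_BANKS 32   /* assumption from the thread: 32 banks, 4 bytes wide */

/* Hypothetical helper: index, in uint2 units, of row `row` (0..7) for
   work-item `lid`, when each row holds one uint2 per work-item. */
size_t interleaved_index(size_t row, size_t lid, size_t local_size) {
    return row * local_size + lid;
}

/* Largest number of accesses landing on one bank when `lanes` consecutive
   work-items each read their uint2 (2 dwords) from the same row. */
int row_conflict_degree(size_t row, size_t lanes, size_t local_size) {
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (size_t lid = 0; lid < lanes; ++lid) {
        size_t dword = 2 * interleaved_index(row, lid, local_size);
        for (size_t d = 0; d < 2; ++d) {
            int bank = (int)((dword + d) % NUM_BANKS);
            if (++hits[bank] > worst) worst = hits[bank];
        }
    }
    return worst;
}
```

With a work-group of 64, the 16 lanes of a quarter wavefront reading one row touch 32 consecutive dwords, so row_conflict_degree(row, 16, 64) is 1 for every row, i.e. no bank is hit twice under these assumptions.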

- Bruhaspati

I can't use the registers. Yes, they perform fine, but only for a short while, and then they spill to global memory (since I'm doing a lot of computations).

I tried moving some of the computations to local memory and got a huge boost (it was inside a big loop), but wrong results, so I'm trying to correct that.

Just when I had stitched all my uints into uint4s to use vector math optimizations, now you're telling me to break them down again.

Let me try your approach. Hopefully the additional instructions for breaking this up will add minimal overhead, and using local memory will give more speed.

So what's the maximum (optimal) work-group size (at least a multiple of 64) for this configuration?


Hi,

Work-group size depends on your local memory needs.

4 uint4s is 64 bytes per work-item, i.e. 4096 bytes per wavefront (64 work-items).

40 wavefronts might be needed to fully hide your global memory latency.

That puts your local memory at 160K. That's obviously not possible.

So you will be working with 32K, which is 1/5th of what is needed to hide your latencies completely, i.e. 8 wavefronts.

To be on the safe side, launch with a work-group size of 64.

Even if there are small adjustments to the local memory size, you will have at least 7 of them active.
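
The numbers in this reply follow from simple arithmetic, sketched here in host-side plain C. The constants come from the thread (16-byte uint4, 4 uint4s per work-item, 64 work-items per wavefront, 32K of local memory); the function names are hypothetical.

```c
#include <stddef.h>

enum {
    BYTES_PER_UINT4    = 16,
    UINT4_PER_WORKITEM = 4,
    WAVEFRONT_SIZE     = 64   /* assumption: 64 work-items per wavefront */
};

/* Local memory consumed by one wavefront with this per-work-item layout. */
size_t local_bytes_per_wavefront(void) {
    return (size_t)BYTES_PER_UINT4 * UINT4_PER_WORKITEM * WAVEFRONT_SIZE;
}

/* Local memory that `wavefronts` resident wavefronts would need. */
size_t local_bytes_for(size_t wavefronts) {
    return wavefronts * local_bytes_per_wavefront();
}

/* How many wavefronts fit into a local memory of the given size. */
size_t wavefronts_that_fit(size_t local_mem_bytes) {
    return local_mem_bytes / local_bytes_per_wavefront();
}
```

Here local_bytes_per_wavefront() is 4096, local_bytes_for(40) is 163840 bytes (160K), and wavefronts_that_fit(32 * 1024) is 8, which is where the 1/5th figure comes from.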

-

Bruhaspati...


I'll try that-- thanks!
