Archives Discussions

notyou · ‎09-08-2011

I have a very simple algorithm which I'm using to give an introduction to OpenCL and I'm working on optimizing the algorithm to use VLIW to increase performance.

What I'm wondering is, when I vectorize the kernel, do I need to change the overall number of work items? i.e. size_t global[1] = {NUM_ITEMS/4}. I tested both with and without the /4 and the only difference (after verifying off of a sequential run) is the execution speed is halved when using the /4. Can anyone shed some light on this? Attached is the kernel I'm using.

__kernel void basic_kernel(__global int4* ocl_buffer, __local int4* local_ocl_buffer, int LOOP_ITERATIONS)
{
    /* get our thread's global ID so we know where to write in memory */
    int id = get_global_id(0);
    int i = 0;

    local_ocl_buffer[id] = ocl_buffer[id]; /* copy the data to local memory */

    for ( i = 0; i < LOOP_ITERATIONS; i += 4 ) /* simple constant to loop many instructions to create work */
    {
        /* this is a bad example as we can actually use registers instead of global/local memory. */
        local_ocl_buffer[id] *= 2;
        local_ocl_buffer[id] /= 2;
    }

    local_ocl_buffer[id] += 1; /* offset by 1 just to verify work has been done since we set the original value to id */
    ocl_buffer[id] = local_ocl_buffer[id]; /* transfer the data to global memory so we can retrieve it from the CPU */
}

notzed · ‎09-09-2011

Originally posted by: notyou
I have a very simple algorithm which I'm using to give an introduction to OpenCL and I'm working on optimizing the algorithm to use VLIW to increase performance.

What I'm wondering is, when I vectorize the kernel, do I need to change the overall number of work items? i.e. size_t global[1] = {NUM_ITEMS/4}. I tested both with and without the /4 and the only difference (after verifying off of a sequential run) is the execution speed is halved when using the /4. Can anyone shed some light on this? Attached is the kernel I'm using.

If you are still processing the same number of integers, then yes, one that processes 4 at a time will need 1/4 as many work items.
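
To make the arithmetic concrete, here is a minimal host-side sketch in C. The helper name `vectorized_global_size` is made up for illustration, and rounding up is an assumption on my part: it only matters when NUM_ITEMS is not a multiple of the vector width, in which case the buffer would need padding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work items needed when each work item
 * handles vec_width elements (4 for an int4 kernel). Rounds up so that a
 * trailing partial vector still gets a work item. */
static size_t vectorized_global_size(size_t num_items, size_t vec_width)
{
    assert(vec_width > 0);
    return (num_items + vec_width - 1) / vec_width;
}
```

With this, the host code from the question would become something like size_t global[1] = { vectorized_global_size(NUM_ITEMS, 4) }; so the kernel still touches NUM_ITEMS integers in total rather than 4 * NUM_ITEMS.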

(intermediate:) BTW you don't normally want to write vector types to local memory on a GPU: typical hardware has 32 banks of 32-bit data. Writing 4 words at a time will guarantee bank conflicts if you have the wavefront populated. If you then need to access this local memory more than 1-2 times, the bank conflicts will cause slow-downs and you'll find it's faster to store the data as scalar types, even though you have to do it with 4 separate operations.
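
A rough way to see the bank-conflict effect is to count how many work items land in the same bank under a simplified model. The model below is an assumption, not a description of any particular GPU: work item i touches the 32-bit word at i * stride_words, and a word lives in bank (word index % num_banks). The function name and MAX_BANKS limit are made up for the sketch.

```c
#include <assert.h>

#define MAX_BANKS 64

/* Simplified model (an assumption): work item i accesses 32-bit word
 * i * stride_words, and word w lives in bank w % num_banks. Returns the
 * largest number of work items that hit the same bank in one access. */
static int max_bank_conflict(int num_work_items, int stride_words, int num_banks)
{
    int hits[MAX_BANKS] = {0};
    int worst = 0;

    assert(num_banks > 0 && num_banks <= MAX_BANKS);
    assert(num_work_items >= 0 && stride_words >= 0);

    for (int i = 0; i < num_work_items; ++i) {
        int bank = (i * stride_words) % num_banks;
        hits[bank] += 1;
        if (hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under this model, 32 work items with a stride of 4 words (int4 elements) and 32 banks pile up 4 deep on 8 banks, while a stride of 1 word (scalar int elements) spreads them one per bank. Real hardware may handle wide accesses differently, so treat this as intuition only.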

hint: You wouldn't normally ever use the global id to index local memory. By definition, local store is per work-group, which spans the local size, so indexing it by global id only works if local size == global size, which is normally pretty pointless.
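
The index relationship can be sketched in plain C. This mirrors the OpenCL identity get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0), assuming a zero global offset; the two helper names are made up for illustration. In the kernel itself you would simply index the __local buffer with get_local_id(0).

```c
#include <assert.h>
#include <stddef.h>

/* Returns the local ID that corresponds to a global ID, assuming a zero
 * global offset. This is the index that stays inside a __local buffer
 * holding local_size elements. */
static size_t local_index(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id % local_size;
}

/* A __local buffer with local_size elements is only safely indexed by the
 * global ID for work items in the first work group. */
static int global_id_fits_local(size_t global_id, size_t local_size)
{
    return global_id < local_size;
}
```

For example, with a local size of 64, the work item with global id 70 belongs to the second work group and should use local index 6; using 70 directly would run past a 64-element local buffer.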

I know you're only using a contrived example, but it's probably better to find something real to do, since it will show the sort of problems you will actually hit.

notyou · ‎09-09-2011

Originally posted by: notzed

If you are still processing the same number of integers, then yes, one that processes 4 at a time will need 1/4 as many work items.

OK, that's what I thought. I just needed confirmation that I was doing 4 times the work when I didn't cut the global work size to 1/4 of the original.

Originally posted by: notzed

(intermediate:) BTW you don't normally want to write vector types to local memory on a GPU: typical hardware has 32 banks of 32-bit data. Writing 4 words at a time will guarantee bank conflicts if you have the wavefront populated. If you then need to access this local memory more than 1-2 times, the bank conflicts will cause slow-downs and you'll find it's faster to store the data as scalar types, even though you have to do it with 4 separate operations.

This I hadn't thought of. I'll try to remember this for the future.

Originally posted by: notzed

hint: You wouldn't normally ever use the global id to index local memory. By definition, local store is per work-group, which spans the local size, so indexing it by global id only works if local size == global size, which is normally pretty pointless.

I know you're only using a contrived example, but it's probably better to find something real to do, since it will show the sort of problems you will actually hit.

Yes, I know and agree, but I want to keep things as simple as possible and minimize the amount of code. This is meant to be a hello-world type of program that walks through a few very simple steps to increase performance, without using an actual algorithm that I would then need to spend time explaining.

Thanks for your input.

Question about optimizing to use VLIW