2 Replies Latest reply on Sep 9, 2011 1:18 PM by notyou

    Question about optimizing to use VLIW

    notyou

       

      I have a very simple algorithm which I'm using to give an introduction to OpenCL and I'm working on optimizing the algorithm to use VLIW to increase performance.

      What I'm wondering is: when I vectorize the kernel, do I need to change the overall number of work items, i.e. size_t global[1] = {NUM_ITEMS/4}? I tested both with and without the /4, and the only difference (after verifying against a sequential run) is that the execution speed is halved when using the /4. Can anyone shed some light on this? Attached is the kernel I'm using.



      __kernel void basic_kernel(__global int4* ocl_buffer, __local int4* local_ocl_buffer, int LOOP_ITERATIONS)
      {
          /* get our thread's global ID so we know where to write in memory */
          int id = get_global_id(0);
          int i = 0;

          local_ocl_buffer[id] = ocl_buffer[id]; /* copy the data to local memory */

          /* simple constant to loop many instructions to create work */
          for ( i = 0; i < LOOP_ITERATIONS; i += 4 )
          {
              /* this is a bad example as we can actually use registers instead of global/local memory. */
              local_ocl_buffer[id] *= 2;
              local_ocl_buffer[id] /= 2;
          }

          local_ocl_buffer[id] += 1; /* offset by 1 just to verify work has been done since we set the original value to id */
          ocl_buffer[id] = local_ocl_buffer[id]; /* transfer the data to global memory so we can retrieve it from the CPU */
      }

        • Question about optimizing to use VLIW
          notzed

           


          If you are still processing the same number of integers, then yes, one that processes 4 at a time will need 1/4 as many work items.
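
          As a rough host-side sketch of that arithmetic (the helper name work_items_for is made up for illustration and is not from the original post), the global size can be derived from the element count and the vector width:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work items needed when each one
 * handles vec_width elements (4 for an int4 kernel). Rounds up so a
 * trailing partial vector is still covered. */
static size_t work_items_for(size_t num_items, size_t vec_width)
{
    assert(vec_width > 0);
    return (num_items + vec_width - 1) / vec_width;
}
```

          On the host this would be used as size_t global[1] = { work_items_for(NUM_ITEMS, 4) };. If NUM_ITEMS is not a multiple of 4, the rounding up assumes the buffer is padded to a whole int4, otherwise the last work item would touch memory past the end.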

          (intermediate:) BTW, you don't normally want to write vector types to local memory on a GPU: typical hardware has 32 banks, each 32 bits wide.  Writing 4 words at a time will guarantee bank conflicts if you have the wavefront populated.  If you then need to access this local memory more than 1-2 times, the bank conflicts will cause slow-downs and you'll find it's faster to store the data as scalar types, even though you have to do it with 4 separate operations.
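
          To make the bank arithmetic concrete, here is a small plain-C model. It is only a sketch under a simplifying assumption (word w lives in bank w % 32, and work items hitting the same bank in one access get serialized); real hardware may treat wide accesses differently, and worst_bank_load is an illustrative name, not an OpenCL function.

```c
#include <assert.h>
#include <string.h>

#define NUM_BANKS 32  /* 32 banks of 32-bit words, as described above */

/* Simplified model (an assumption, not a hardware spec): word w lives
 * in bank w % NUM_BANKS. Work item i touches word i * stride + offset.
 * Returns the largest number of work items landing on any one bank. */
static int worst_bank_load(int num_work_items, int stride, int offset)
{
    int hits[NUM_BANKS];
    int worst = 0;
    memset(hits, 0, sizeof hits);
    for (int i = 0; i < num_work_items; ++i) {
        int bank = (i * stride + offset) % NUM_BANKS;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

          With an int4 layout, component c of work item i sits at word 4 * i + c, so 32 work items give worst_bank_load(32, 4, c) == 4 in this model, i.e. a 4-way conflict. Storing each component in its own scalar array gives stride 1 and worst_bank_load(32, 1, 0) == 1, i.e. no conflict.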

          hint: You wouldn't normally ever use the global id to index local memory.  By definition, local store is per work group, so it only spans the local size.  The exception is when local size == global size, which is normally pretty pointless.
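
          A plain-C emulation of how the 1-D ids relate may help here. It is a sketch that assumes no global offset and a __local buffer sized to exactly local_size elements; the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C stand-ins for the 1-D OpenCL index functions. */
typedef struct {
    size_t global_id;  /* what get_global_id(0) would return */
    size_t group_id;   /* what get_group_id(0) would return  */
    size_t local_id;   /* what get_local_id(0) would return  */
} work_item_ids;

/* Split a global id into group and local ids (assumes no global offset). */
static work_item_ids ids_for(size_t global_id, size_t local_size)
{
    work_item_ids ids;
    assert(local_size > 0);
    ids.global_id = global_id;
    ids.group_id = global_id / local_size;
    ids.local_id = global_id % local_size;
    return ids;
}

/* Assumption: the __local buffer holds local_size elements per work
 * group, so an index is valid only if it is below local_size. */
static int index_fits_local_buffer(size_t index, size_t local_size)
{
    return index < local_size;
}
```

          For example, with a local size of 64, the work item with global id 130 is local id 2 in group 2. Indexing the local buffer with 2 is fine; indexing it with 130 runs off the end. In the kernel that would mean get_local_id(0) for the local buffer and get_global_id(0) for the global one.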

          I know you're only using a contrived example, but it's probably better to find something real to do, since it will show the sort of problems you will actually hit.

           

            • Question about optimizing to use VLIW
              notyou

               

              Originally posted by: notzed

              If you are still processing the same number of integers, then yes, one that processes 4 at a time will need 1/4 as many work items.

              OK. That's what I thought; I just needed confirmation that I was doing 4 times the work if I didn't cut the global work size to 1/4 of the original.

               

              Originally posted by: notzed

              (intermediate:) BTW, you don't normally want to write vector types to local memory on a GPU: typical hardware has 32 banks, each 32 bits wide.  Writing 4 words at a time will guarantee bank conflicts if you have the wavefront populated.  If you then need to access this local memory more than 1-2 times, the bank conflicts will cause slow-downs and you'll find it's faster to store the data as scalar types, even though you have to do it with 4 separate operations.



              This I hadn't thought of. I'll try to remember this for the future.

               

               

              Originally posted by: notzed

              hint: You wouldn't normally ever use the global id to index local memory.  By definition, local store is per work group, so it only spans the local size.  The exception is when local size == global size, which is normally pretty pointless.

              I know you're only using a contrived example, but it's probably better to find something real to do, since it will show the sort of problems you will actually hit.



              Yes, I know and agree, but I want to keep things as simple as possible to reduce the amount of code. It's supposed to be a hello-world type of program that goes through some very simple steps to increase performance, without an actual algorithm that I would then need to spend time explaining.

              Thanks for your input.