Originally posted by: notyou I have a very simple algorithm which I'm using to give an introduction to OpenCL, and I'm working on optimizing it to use VLIW to increase performance. What I'm wondering is: when I vectorize the kernel, do I need to change the overall number of work items, i.e. size_t global[1] = {NUM_ITEMS/4}? I tested both with and without the /4, and the only difference (after verifying against a sequential run) is that the execution speed is halved when using the /4. Can anyone shed some light on this? Attached is the kernel I'm using.
If you are still processing the same number of integers, then yes: a kernel that processes 4 at a time needs 1/4 as many work items.
BTW, you don't normally want to write vector types to local memory on a GPU: typical hardware has 32 banks of 32-bit data, so writing 4 words at a time will guarantee bank conflicts if the wavefront is fully populated. If you then need to access this local memory more than once or twice, the bank conflicts will cause slow-downs, and you'll find it's faster to store the data as scalar types, even though that takes 4 separate operations.
Hint: you wouldn't normally ever use the global id to index local memory. By definition, local store is per-work-group, so it spans the local size (unless local size == global size, which is normally pretty pointless).
I know you're only using a contrived example, but it's probably better to find something real to do, since that will show the sort of problems you will actually hit.