
pmlopes
Journeyman III

aparapi 140x slower than single thread java?! what am i doing wrong?

I've tried to port the bitonic sort to Aparapi, and oddly the GPU code takes 140x more time to execute than simple single-threaded Java. What is wrong with this code (I only made a translation from the SDK samples to Java...)?
public class BitonicSort {

    public static final int TRUE = 1;
    public static final int FALSE = 0;

    public static class BitonicSortKernel extends Kernel {
        int[] theArray;
        int stage;
        int passOfStage;
        int width;
        int direction;

        @Override
        public void run() {
            int sortIncreasing = direction;
            int threadId = getGlobalId();

            int pairDistance = 1 << (stage - passOfStage);
            int blockWidth = 2 * pairDistance;

            int leftId = (threadId % pairDistance)
                       + (threadId / pairDistance) * blockWidth;
            int rightId = leftId + pairDistance;

            int leftElement = theArray[leftId];
            int rightElement = theArray[rightId];

            int sameDirectionBlockWidth = 1 << stage;

            if ((threadId / sameDirectionBlockWidth) % 2 == 1)
                sortIncreasing = 1 - sortIncreasing;

            int greater = rightElement;
            int lesser = leftElement;
            if (leftElement > rightElement) {
                greater = leftElement;
                lesser = rightElement;
            }

            if (sortIncreasing != 0) {
                theArray[leftId] = lesser;
                theArray[rightId] = greater;
            } else {
                theArray[leftId] = greater;
                theArray[rightId] = lesser;
            }
        }
    }

    private final BitonicSortKernel kernel = new BitonicSortKernel();

    public BitonicSort() {
        System.out.println("Execution mode=" + kernel.getExecutionMode());
    }

    public void sort(final int[] array) {
        int passOfStage;
        int stage;
        int numStages = 0;
        int temp;

        /*
         * This algorithm is run as NS stages. Each stage has NP passes,
         * so the total number of times the kernel call is enqueued is NS * NP.
         *
         * For every stage S, we have S + 1 passes.
         * e.g. for stage S = 0, we have 1 pass; for stage S = 1, we have 2 passes.
         *
         * If length is 2^N, then the number of stages (numStages) is N.
         * Keep in mind that the algorithm only works for arrays whose size
         * is a power of 2.
         *
         * For an explanation of how the algorithm works, please go through
         * the documentation of this sample.
         */

        /*
         * 2^numStages should be equal to length, i.e. the number of times
         * you halve length to get 1 should be numStages.
         */
        for (temp = array.length; temp > 1; temp >>= 1)
            ++numStages;

        /*** Set appropriate arguments to the kernel ***/

        /* the input array - also acts as output for this pass (input for next) */
        kernel.theArray = array;

        /* length - i.e. number of elements in the array */
        kernel.width = array.length;

        /* whether sort is to be in increasing order. true implies increasing */
        kernel.direction = TRUE;

        for (stage = 0; stage < numStages; ++stage) {
            kernel.stage = stage;

            /* Every stage has stage + 1 passes. */
            for (passOfStage = 0; passOfStage < stage + 1; ++passOfStage) {
                /* pass of the current stage */
                kernel.passOfStage = passOfStage;

                /*
                 * Enqueue a kernel run call.
                 * For simplicity, the group size used is 1.
                 *
                 * Each thread writes a sorted pair, so the number of
                 * threads (global) is half the length.
                 */
                kernel.execute(array.length);
            }
        }
    }

    public void CPUSort(int[] array) {
        final int halfLength = array.length / 2;
        for (int i = 2; i <= array.length; i *= 2) {
            for (int j = i; j > 1; j /= 2) {
                boolean increasing = true;
                final int half_j = j / 2;
                for (int k = 0; k < array.length; k += j) {
                    final int k_plus_half_j = k + half_j;
                    if (i < array.length) {
                        if ((k == i) || ((k % i) == 0) && (k != halfLength))
                            increasing = !increasing;
                    }
                    for (int l = k; l < k_plus_half_j; ++l) {
                        if (increasing) {
                            if (array[l] > array[l + half_j]) {
                                int temp = array[l];
                                array[l] = array[l + half_j];
                                array[l + half_j] = temp;
                            }
                        } else {
                            if (array[l + half_j] > array[l]) {
                                int temp = array[l + half_j];
                                array[l + half_j] = array[l];
                                array[l] = temp;
                            }
                        }
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        BitonicSort sort = new BitonicSort();
        int[] data = new int[1024];
        int[] data2 = new int[1024];
        for (int i = 0; i < data.length; i++) {
            data[i] = Math.abs((int) (Math.random() * 255.0));
        }
        System.arraycopy(data, 0, data2, 0, data.length);
        long t0 = System.currentTimeMillis();
        sort.sort(data);
        long t1 = System.currentTimeMillis();
        sort.CPUSort(data2);
        long t2 = System.currentTimeMillis();
        System.out.println(t1 - t0);
        System.out.println(t2 - t1);
    }
}

(Note: the forum software stripped the [i] and [l] index tokens from the original post; they have been restored above so the code compiles.)

Starglider
Adept I

With such a tiny data set, any parallel speedup the GPU can contribute will be overwhelmed by the (considerable) overhead of calling the kernel ten times. Try it with a much larger set (data array size between one and one hundred megabytes).
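For example, sizing the harness's array at 2^24 ints keeps it a power of two (as bitonic sort requires) and comes to 64 MB, inside that range; the exact size is just an illustration:

    // 1 << 24 ints at 4 bytes each is 64 MB of data, large enough that
    // kernel-launch overhead should no longer dominate the measurement.
    int[] data = new int[1 << 24];
    int[] data2 = new int[1 << 24];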


I did, and I get the same result. I defined the size of my array as Math.pow(2, 24) and I see almost the same relative speed; the single-threaded CPU version is still faster...


pmlopes, I just took your code and, sadly, replicated your results and timings.

I did notice that when I ran the Kernel code using JTP (Java Thread Pool) (-Dcom.amd.aparapi.executionMode=JTP on the java command line), the Kernel was accessing beyond the end of the array (for 1024 I saw accesses up to 2048), so I was actually getting ArrayIndexOutOfBoundsExceptions. If we have translated the code to OpenCL properly, this will be an issue on the GPU as well (but clearly we don't have array bounds checks on the GPU, so the results there are probably even scarier).
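For anyone reproducing this, the execution mode can also be forced programmatically rather than via the system property; this one-liner assumes Aparapi's Kernel.EXECUTION_MODE enum:

    // Equivalent to -Dcom.amd.aparapi.executionMode=JTP: run the kernel on
    // a Java thread pool, where out-of-bounds accesses throw exceptions.
    kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP);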

My suspicion on this workload is that the cost of moving the buffer backwards and forwards to the GPU is what is killing the performance.  Clearly the whole array is being moved each time we hit the inner loop.

I am not familiar with the BitonicSort algorithm, but I will endeavour to isolate the issue. Can you also check the implementation (try running in JTP mode) and determine whether the ArrayIndex errors are an algorithmic bug, or possibly something we are doing wrong in Aparapi?

Thanks for posting the code. 

I will post comments back here once I get more information.

A few more notes. 

Looking at a few more implementations of bitonic sort, I think that

   kernel.execute(array.length);

should have been

  kernel.execute(array.length/2);

That certainly got rid of the JTP ArrayIndexOutOfBounds errors.   

I changed your code slightly to run the kernel twice; this is because the first kernel.execute() incurs the cost of the bytecode-to-OpenCL conversion. Adding -Dcom.amd.aparapi.logLevel=FINE will show the cost of codegen and also show the generated OpenCL.
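A sketch of that timing change against the original main(), keeping the warm-up call outside the timed region (the arraycopy that restores the unsorted input is my addition):

    sort.sort(data);                      // warm-up: pays the one-time
                                          // bytecode-to-OpenCL codegen cost
    System.arraycopy(data2, 0, data, 0, data.length);  // restore unsorted input

    long t0 = System.currentTimeMillis();
    sort.sort(data);                      // timed run: kernel already compiled
    long t1 = System.currentTimeMillis();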

At this point the GPU time was down to only about 4x that of the CPU.

And as we increase the count, the ratio starts to come down:

      size    GPU (ms)   single-thread CPU (ms)
       512        7             2
      1024       10             3
     65536       77            38
    131072      112            80

However, as the buffer size increases we clearly also increase the overhead of the host-to-GPU copy each time. So (as I suspected above) these buffer copies will stop us from beating the single-threaded version.

A handcrafted OpenCL version of this code can avoid the copy back and forth to the device each time kernel.execute() is called: it can essentially copy the data once, leave it on the GPU, execute multiple times, and at the end copy all of the data back. In our case we don't know whether the host (Java code) is actually expecting to use the buffer between each execute() call, and as such we have to play it safe and transfer both ways.

We recognize this pattern is one where GPUs are currently used, and we do have some ideas for a Beta release on how we might alleviate this.
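A sketch of the idea as it might look in sort(): the setExplicit()/put()/get() calls shown here belong to the explicit buffer-management API that later Aparapi releases added, not to the build discussed in this thread, so treat them as assumptions:

    kernel.setExplicit(true);    // caller takes over host<->device transfers
    kernel.put(array);           // copy the data to the GPU once, up front
    for (stage = 0; stage < numStages; ++stage) {
        kernel.stage = stage;
        for (passOfStage = 0; passOfStage <= stage; ++passOfStage) {
            kernel.passOfStage = passOfStage;
            kernel.execute(array.length / 2);   // no per-call buffer copies
        }
    }
    kernel.get(array);           // copy the sorted result back once, at the end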

I will experiment with this code and see if I come up with any ideas.

Thanks (gfrostamd) for the reply. Maybe it would be a nice feature to have DirectBuffers used as arrays in the kernels instead of Java native arrays, since under JNI they can be mapped to contiguous memory...
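A small sketch of what that suggestion means: a direct NIO buffer allocates its storage outside the Java heap, so JNI code can obtain a stable native address for it (via GetDirectBufferAddress) instead of copying the data:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.IntBuffer;

    // Off-heap storage that native code can read and write in place:
    IntBuffer data = ByteBuffer.allocateDirect(1024 * 4)   // 1024 ints
            .order(ByteOrder.nativeOrder())
            .asIntBuffer();
    data.put(0, 42);   // still usable from Java like any other buffer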

Another thing I didn't understand, because I am a total noob to OpenCL, is what the globalSize in the Kernel.execute(int globalSize) call should be. What does it mean: the total number of executions? The size of our array (probably not, since half gives the correct values)...

Regarding DirectBuffers, this is a good suggestion and is an idea we have been toying with.

The parameter to Kernel.execute(int globalSize) determines how many parallel kernels are launched; each kernel receives a unique 'globalId' when it invokes getGlobalId(). So in the BitonicSort example, where you expect each kernel to deal with potentially swapping two array elements, you want to request that length/2 kernels be launched. Each kernel then deals with potentially swapping one pair of array elements, so the whole array is covered.
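To make the mapping concrete, here is a minimal sketch (the kernel body is hypothetical, just to show the id-to-pair relationship):

    Kernel k = new Kernel() {
        @Override
        public void run() {
            // run() executes once per work-item; getGlobalId() returns a
            // unique id in 0 .. globalSize-1 for each one.
            int pair = getGlobalId();
            // ... compare and possibly swap the two elements of this pair ...
        }
    };
    k.execute(array.length / 2);   // one work-item per pair covers the array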

Gary