I've tried to port the bitonic sort to Aparapi, and oddly the GPU code takes 140x more time to execute than simple single-threaded Java. What is wrong with this code? (I only made a translation from the SDK samples to Java...)

```java
import com.amd.aparapi.Kernel;

public class BitonicSort {

    public static final int TRUE = 1;
    public static final int FALSE = 0;

    public static class BitonicSortKernel extends Kernel {
        int[] theArray;
        int stage;
        int passOfStage;
        int width;
        int direction;

        @Override
        public void run() {
            int sortIncreasing = direction;
            int threadId = getGlobalId();

            int pairDistance = 1 << (stage - passOfStage);
            int blockWidth = 2 * pairDistance;

            int leftId = (threadId % pairDistance)
                    + (threadId / pairDistance) * blockWidth;
            int rightId = leftId + pairDistance;

            int leftElement = theArray[leftId];
            int rightElement = theArray[rightId];

            int sameDirectionBlockWidth = 1 << stage;
            if ((threadId / sameDirectionBlockWidth) % 2 == 1)
                sortIncreasing = 1 - sortIncreasing;

            int greater = rightElement;
            int lesser = leftElement;
            if (leftElement > rightElement) {
                greater = leftElement;
                lesser = rightElement;
            }

            if (sortIncreasing != 0) {
                theArray[leftId] = lesser;
                theArray[rightId] = greater;
            } else {
                theArray[leftId] = greater;
                theArray[rightId] = lesser;
            }
        }
    }

    private final BitonicSortKernel kernel = new BitonicSortKernel();

    public BitonicSort() {
        System.out.println("Execution mode=" + kernel.getExecutionMode());
    }

    public void sort(final int[] array) {
        int passOfStage;
        int stage;
        int numStages = 0;
        int temp;

        /*
         * This algorithm is run as NS stages. Each stage has NP passes,
         * so the total number of times the kernel call is enqueued is NS * NP.
         *
         * For every stage S, we have S + 1 passes.
         * e.g.: For stage S = 0, we have 1 pass.
         *       For stage S = 1, we have 2 passes.
         *
         * If length is 2^N, then the number of stages (numStages) is N.
         * Keep in mind that the algorithm only works for arrays whose
         * size is a power of 2.
         *
         * For an explanation of how the algorithm works, please go through
         * the documentation of this sample.
         */

        /*
         * 2^numStages should be equal to length,
         * i.e. the number of times you halve length to get 1 should be numStages.
         */
        for (temp = array.length; temp > 1; temp >>= 1)
            ++numStages;

        /*** Set appropriate arguments to the kernel ***/
        /* the input array - also acts as output for this pass (input for next) */
        kernel.theArray = array;
        /* length - i.e. the number of elements in the array */
        kernel.width = array.length;
        /* whether the sort is to be in increasing order; TRUE implies increasing */
        kernel.direction = TRUE;

        for (stage = 0; stage < numStages; ++stage) {
            kernel.stage = stage;
            /* Every stage has stage + 1 passes. */
            for (passOfStage = 0; passOfStage < stage + 1; ++passOfStage) {
                /* pass of the current stage */
                kernel.passOfStage = passOfStage;
                /*
                 * Enqueue a kernel run call.
                 * For simplicity, the group size used is 1.
                 *
                 * Each thread writes a sorted pair,
                 * so the number of threads (global) is half the length.
                 */
                kernel.execute(array.length);
            }
        }
    }

    public void CPUSort(int[] array) {
        final int halfLength = array.length / 2;
        for (int i = 2; i <= array.length; i *= 2) {
            for (int j = i; j > 1; j /= 2) {
                boolean increasing = true;
                final int half_j = j / 2;
                for (int k = 0; k < array.length; k += j) {
                    final int k_plus_half_j = k + half_j;
                    if (i < array.length) {
                        if ((k == i) || ((k % i) == 0) && (k != halfLength))
                            increasing = !increasing;
                    }
                    for (int l = k; l < k_plus_half_j; ++l) {
                        if (increasing) {
                            if (array[l] > array[l + half_j]) {
                                int temp = array[l];
                                array[l] = array[l + half_j];
                                array[l + half_j] = temp;
                            }
                        } else {
                            if (array[l + half_j] > array[l]) {
                                int temp = array[l + half_j];
                                array[l + half_j] = array[l];
                                array[l] = temp;
                            }
                        }
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        BitonicSort sort = new BitonicSort();
        int[] data = new int[1024];
        int[] data2 = new int[1024];
        for (int i = 0; i < data.length; i++) {
            data[i] = Math.abs((int) (Math.random() * 255.0));
        }
        System.arraycopy(data, 0, data2, 0, data.length);

        long t0 = System.currentTimeMillis();
        sort.sort(data);
        long t1 = System.currentTimeMillis();
        sort.CPUSort(data2);
        long t2 = System.currentTimeMillis();

        System.out.println(t1 - t0);
        System.out.println(t2 - t1);
    }
}
```

With such a tiny data set, any parallel speedup the GPU can contribute will be overwhelmed by the (considerable) overhead of enqueuing the kernel 55 times (for 1024 elements there are 10 stages, and stage s makes s + 1 passes). Try it with a much larger set (data array size between one and one hundred megabytes).
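To see how the launch count scales, here is a quick standalone calculation in plain Java (no Aparapi needed; the class name `LaunchCount` is just for illustration). It mirrors the stage-counting loop from `sort()`: for an array of length 2^N there are N stages, and the total number of kernel enqueues is N * (N + 1) / 2, which is fixed per-call overhead that only pays off once each call does enough work.

```java
public class LaunchCount {
    public static void main(String[] args) {
        int n = 1024; // array length, must be a power of 2

        // Same stage count as in sort(): halve n until it reaches 1.
        int numStages = 0;
        for (int t = n; t > 1; t >>= 1) {
            ++numStages;
        }

        // Stage s performs s + 1 passes, so the total number of
        // kernel enqueues is 1 + 2 + ... + numStages.
        int launches = numStages * (numStages + 1) / 2;

        System.out.println(numStages + " stages, " + launches
                + " kernel launches for n = " + n);
        // prints "10 stages, 55 kernel launches for n = 1024"
    }
}
```

Since the launch count grows only as O(log² n) while the work per launch grows as O(n), a bigger array shifts the balance away from enqueue overhead and toward actual sorting work on the GPU.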