I'm trying to figure out how to port some sequential code to a kernel so I can test the speedup, but I'm having a lot of trouble. None of the tutorials I've found help with this aspect, only with setting up the kernel to run and receive data. So far I have this for multiplying polynomials together.
sequential:
int *mult(int size, int *a, int *b)
{
    int i, j;
    /* calloc zero-initializes, so we can accumulate with += below */
    int *out = (int *)calloc(size * 2 - 1, sizeof(int));
    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            out[i + j] += a[i] * b[j];
        }
    }
    return out;
}
Naive Kernel:
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
__kernel void poly_mul(__global int *a, __global int *b, __global int *c)
{
    size_t i = get_global_id(0);
    size_t j = get_global_id(1);
    c[i + j] = a[i] * b[j];
}
This doesn't return the correct result; I think it's because of the c[i + j] indexing at the end. Does anyone have any advice, or a link to a good tutorial on this aspect of OpenCL?
Thanks
I'm a beginner, but it seems to me that you don't protect memory:
For example, in your code both threads (1, 2) and (2, 1) write different results to the address c[3] at the same time. Anything can end up there: the first result, the second, or a combination of the two (they all run in parallel, so the writes collide). So the result is just junk.
Yep, mihzaha is right. But he's missing an explanation of the underlying reason:
Data layout in memory: I assume your source memory is a 2D array of ints, correct? Well, in memory (and after passing it to your kernel) these items have to be laid out sequentially, and in C that's done in what's called "row-major" format. You need to adjust your indexing from c[i + j] to something along the lines of c[i * rowSize + j] to switch from 2D indexing to 1D.
Also, I presume you're working with a small problem size, so get_global_id is excusable, but you'll want to get used to using get_local_id and get_group_id. Once your problem size gets bigger you'll have to use multi-tiered indexing.
If you want to use a 2D range, for this problem you have to be careful not to write to the same address concurrently from another work-item (race condition). I guess you could output a size*size matrix with: matrix[i*size + j] = a[i] * b[j];
Wow, I completely blew it on my first look at this problem. I blame lack of coffee.
Algorithmically, your kernel is correct, but fpaboim hit the nail on the head. This problem results in a lot of threads trying to write to the same memory address simultaneously and clobbering each other's data.
For polynomials of lower order you could probably get away with using atomic adds for the c[i + j] update (e.g. atomic_add(&c[i + j], a[i] * b[j])), though atomics will cost you some performance.
Is there any way you can map the algorithm so that each thread writes to a unique memory location and then afterwards call a reduction kernel which spawns a thread for each power? (AKA one thread sums up all the x^2 terms, another for x^3, another for x^4, ... x^n)
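Sketching that idea as a pair of OpenCL C kernels (untested device code; kernel and buffer names are my invention, and a real host program would still need to enqueue both kernels with the right NDRanges):

```c
/* Phase 1: one work-item per (i, j) pair, launched over a 2D range of
 * size x size. Each work-item writes to a unique slot, so no races. */
__kernel void poly_partial(__global const int *a,
                           __global const int *b,
                           __global int *matrix,
                           int size)
{
    size_t i = get_global_id(0);
    size_t j = get_global_id(1);
    matrix[i * size + j] = a[i] * b[j];
}

/* Phase 2: one work-item per output power k, launched over a 1D range
 * of 2*size - 1. Each work-item sums the anti-diagonal i + j == k. */
__kernel void poly_reduce(__global const int *matrix,
                          __global int *c,
                          int size)
{
    int k = (int)get_global_id(0); /* 0 .. 2*size - 2 */
    int sum = 0;
    for (int i = 0; i < size; i++) {
        int j = k - i;
        if (j >= 0 && j < size)
            sum += matrix[i * size + j];
    }
    c[k] = sum;
}
```

The reduction here is the simple serial-per-output-power variant; for large sizes you'd want a proper parallel reduction within a work-group, but this keeps every write to a unique address.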