Hi, welcome to the forums. Please make sure to mark an answer if you find it helpful.
First you need to make sure you know what similar means for your application - I'll carry forward with what you've decided is OK to use:
-The dimension to attack parallelism here is the index of an array - you can see then that this is a 1d problem
-Read in one input from both arrrays in global datastore/memory to local registers
-perform an absolute difference operation and threshold to your desired tolerance, store the result to a local register, as an 32 bit [unsigned] integer.
-do a parallel reduction on this threshold integer - in particular the algorithm you would be using is called parallel prefix sum or [parallel] scan with a addition reduction
The final sum is a count of how many elements are "same" by your metric.
Yes it will be fast, that depends on hardware but for a 7990 - probably a hair under 500 microseconds for 4 million elements ignoring the upload time provided you find the right parallel prefix sum implementation and tune everything.
Note on the parallel prefix sum - it is the algorithm that occurs everywhere in GPU programming and so it is good to understand it. OpenCL 2.0 has the hard part built in as part of the workgroup reduction functions but alas it's hard to get it fast in OpenCL 1 since there's no official implementation/library. You might find inspiration from CUDA CUB or AMD's Bolt library but it is not for the faint of heart to figure these things out.
Thank you for the timely reply.
There is one thing that I still do not quite understand
Since the two arrays are different in size, how should I do the comparison?
A ~do abs_diff(A,B), then B....B, if ans<threshold then C=1 or C=C++;
A ~do abs_diff(A,B), then B....B, if ans<threshold then C=1 or C=C++
Then sum(C) using parallel prefix sum.
So it will be 300 comparisons per worker and 2000 workers? Or should I do 2000 comparisons per worker and 300 workers? Or I am totally wrong?
I will need to run this method (Compare A and B) around 200,000 times, will upload time be a significant factor here then?
We have a machine with i7, 32GB memory, 1 x R9 290x/4GB card, will this be fast enough to do the job within 20min? Otherwise can we use 4 cards in parallel to further accelerate it?
Sorry if I have asked too many questions here....
your original notation confused me, I had figured it was a simple case of sum(abs(A - B) < 1e-4), instead you are looking at every possible pairing or something akin to sets - that makes this the problem a bit harder and now you will need to also make use of shared memory to keep things fast. I'm not very sure what the timing would be with this but I still think this is still fast with the input sizes you have mentioned. Note that this will count repeat values - it's unspecified if that matters to you or not and the problem is trickier otherwise.
with a revised understanding, I would try with the following approach
-read one element per work group from array B and store into a private register as b
-for each nwork_items sized block in A:
read nwork_items from A and store into shared memory as Atile
compare Atile with b using the metric you've defined as similar - if similar increment a private counting register c
-parallel prefix sum each private c to obtain d, the count of similar pairs this work group has found, and store to the GDS for another kernel or host to use
-sum all the ds from each work group either on the host or in another kernel to get D
300 batches of 256 work-items is probably going to keep your hardware fully busy.
Alternatively you can try and keep with your histogram like approach - simply treat each float as an integer (the same bits would be set on each of them). The problem is that you may not be able to fit C as you defined it previously in memory all at once, so you will have to divide up these "ints" in ranges corresponding to a given passes, parallel prefix sum that pass and keep a running total for D.
I'm quite positive both approaches are << 20 minutes with just one R290X. If the first method does what you want, I would expect it fastest.
What about sorting both arrays and process them in ascending order? At this array size (few thousand) it takes no time on the CPU either.
agreed - sorting both arrays would be fast at these sizes. You could determine what is similar via a modified binary search too. Should be decent on a CPU.
1 of 1 people found this helpful
I'm thinking of a better way than binary searching:
Having the 2 sorted arrays A, B.
We need 2 indexes i and j initialized to 0.
You can iterate the arrays like this:
if A[i]<B[j] then i++ else j++;
//so you step through the array elements in a way that A[i] and B[j] are maximally similar. Only need to count that similarity and I guess you have to deal with repeating values in the arrays with extra caution in order to copy the exact behavior of the original algorithm.
Thank you very much for your suggestion.
We have tried similar things before and we can actually keep A natively sorted upon initialization without any efforts. However, B is impossible to be natively sorted because it is experimental observations, so sorting is needed. Sorting will take about 500-700 operations for 300 elements array and so in total it will be slower than the one that I described in my original post.
This function takes about 65% of total runtime of our program that is running 24x7 on our cluster that is approx. 20 TFlops with 10 TB of memory. We just recently get a little budget to expand it for some extra performance. We are trying to determine how we should spend the money, either more CPU nodes or CPU+GPU. Because GPU is more energy efficient and also more cost effective ($/flops, correct me if I am wrong) so we are trying to see if we can implement something fast on GPU. The goal here is really to get maximum number of comparisons done per $, either on CPU or GPU. (For $2000, which machine configuration + what algorithm will run more comparisons per hour and/or per watt?)
I know this problem seems nothing in most of the situations, but for us every +1% of performance counts, +10% means a month more calculation per year.
I will try to implement what you and Jason have suggested as well as our old method once I am back home and I will run a comparison to see which is faster on GPU/CPU/GPU+CPU.
Please let me know if you have some other suggestions for this situation!
so is this a situation where you have a static A a large batch of Bs? And you're continuously chewing thru a lot of Bs?
If that's the case you might try batch-radix sorting B and doing a whole bunch of what @realhet outlined in parallel
It wasn't clear yet but keep in mind to do any work on the GPU, you will encounter latency - typically in the neighbourhood of 100usec per kernel launch + whatever IO latencies. It's a tad lower on the APUs.
Well if the memory hungry version is faster than the sorted version, then I'd try to make it a bit less memory hungry:
We have the C[0..99999999] array that is 12 megabytes long. That is too big for the cache so I'd do it with a 2 level table-array instead.
For example we want to check the 1234.5678 number then we first check it in the D[0..9999] array which point to the actual bit array part.
So basically it's a 2 stage lookup D.
And to make it work with the actual numbers in A, one should experiment with different sizes of D. The aim would be to separate large areas of adjacent zeroes and they could all point to the same array of zeroes in the memory, so the whole thing would be able to fit in the cache. You could choose a CPU with the biggest cache that is big enough to fit the data.
The bottleneck is random memory access, that's were GPU-es are not that good at, so I'd stay on CPU. Also there are so few math for so many memory operations, the GPU would be kinda under utilized while eating the Watts.
And avoid IF-s in the main loop, rather do the 2 lookups. Finally with a bit of SSE, the little amount of math could be processed parallel with the memory read operations.
If you think about the GPU-sorting version, that'd give work for the GPU for sure but you have to reorganize the program in order to process a lot more data than this because as @jason mentioned earlier, the latency is too big on GPU for this small size of data. Ideally you should plan at least 0.5 seconds of batches for optimal utilization.