Thanks for the question. I'm exlicitly leaving this for our community to answer, as it is a general education question, not a support-style question.
Can anyone help out Mercator?
1 of 1 people found this helpful
if you can devote a day or two to reading through some code examples, you'll be done with a working implementation on day 2 or 3. Modify one of those simple examples that come with the developer tools; the vector-matrix multiply example (i'm not sure if it's in the sdk) is a good one as it does something similar to what you want.
this describes a problem similar to your's: evaluating one/two points against all available.
when you're along the way, you're main kernel will probably consist mainly with about 5-10 lines of code
one for a bitwise operator (^) and then counting the number of nonzeros (popcount)
1 of 1 people found this helpful
In OpenCL there is a function for this: popcount: https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/popcount.html
This is implemented in the hardware by SSE4 CPUes and GCN GPUes too. (Not sure if it is supported by earlier GPUes, though...)
So the "5'000'0000'0000 comparisons" is not a problem at all.
For the ASCII -> binary conversion there is also a very fast instruction on the CPU: War on Theism: x86 Instruction Set Reference
It puts every upper bit of 16 bytes into the lower 16 bits of a given register. So you have to load 16 bytes of "00100101101..." into a 128bit xmm register, and then shift each byte in it left by 7 bits, and finally you can extract the resulting 16bits by the PMOVMSKB instruction. Note that, this is not the C/C++ way, it either needs an external assembler or inline asm or cpu intrinsics.
I think the fastest solution for the problem would be CPU asm on an SSE4 compatible CPU, as the task needs relatively small amount of math compared to memory IO.
Even if you don't have that modern SSE4 CPU, there are lots of older SIMD instrutions with which you can build a faster program than:
a) processing the whole data in a loop byte after byte with a classic C program.
b) transfering large (compared to the required math) amounts of data to and from the GPU.
Happy Easter to You too!
That would be really easy as there is a popcount builtin function in OpenCL which returns the number of non-zero bits in a number.
In just one day you can have a decent implementation in OpenCL!
You can take a look on the introductory tutorials about OpenCL on AMD OpenCL zone for example, and then adapt one of those to do your computations.
I believe you just want to iterate in 32 bit chunks over the 128 bits per vector you have (ie 8 32-bit chunks per 128) - and take the source 128 bit-vector, and read it in as a kernel parameter pair as a low and high uint4's (to save GDS reading) - I wouldn't expect the compiler to do the right thing under uint8 but you might try working with it as well. From there the naive approach to an efficient computation would be popcount(x_lo ^ target_lo) + popcount(x_hi ^ target_hi). I'd assume one work item to one hamming distance computation. A work-group would probably just be any multiple of 64 and be just as good and that along with the number of bit-vectors will determine the size of the global dimensions.
It'll be more efficient to submit the whole 100'000x100'000 all at once though and to support this you'd need to add another dimension of work - but it can be mostly the same approach. Since you want 100'000x100'000 outputs, there's really not all that much optimization to be done as far as I can tell - although I'm thinking you could use shared memory to reduce rereading. That's 1e10th elements though which is a bit above what most graphics cards can deal with - so you'll have to block it into smaller large batches. Maybe 100'000x1'000 or 10'000.
The difference in language for C/C++ -> OpenCL C99 is not a large jump, in fact its nicer at times because it has alot of things (vectorized types, math ops, and utilities like popcount) built in and you don't have to worry about other things you would on host code - it lets you think about the algorithm more. The hard part is thinking in multiple dimensions and types of parallel and dealing with/exploiting hardware architecture including register/shared memory usage and the tuning game while balancing an aggressive, unpredictable and untunable compiler. Testing religiously/rigorously which can be kind of harder to do/asinine in this domain but always pays off. Dealing with vendor based issues also takes an unhealthy amount of time and getting used to - especially since it can be a very long until they fix an issue while you sit powerlessly waiting for acks and commitments or news of any sort that they will fixing the problems you experience and in the mean time are forced to work around or put off affected components. I think in about 3 months or so an experienced veteran in C++ (with some hardware architecture skills + prior understanding of parallel programming) will have just started being dangerous - knowing what's fast and how to naively break down the problem in a useful way.
Thanks a lot to everybody who has replied. I especially liked the sentence "it lets you think about the algorithm more. The hard part is thinking in multiple dimensions and types of parallel" as it reminded me of a long-past-time when I was a student at ETH and had to program on a Cray, where it was all about thinking in - well not parallels but vectors. I am somehow looking forward to getting back into programming again. I'll give Open CL a try. Thanks for getting me jump-started.