I'm learning OpenCL and making slow but steady progress, but I'm not sure I understand enqueueNDRangeKernel, work-groups, and work-group sizes. I think it has something to do with the contiguous byte buffers the kernel works on: each kernel instance may start at some index and end at another, the instances should be thought of as executing in "random" order, and I have to wait for all of them to finish before accessing the kernel's output buffer(s). So if a kernel executes multiple times on different buffer ranges, I shouldn't assume it processes indices 0 to N-1 first, then N to 2N-1, and so on.
I first wanted to find a SHA-256 (or other hashing) kernel that I could just load and try out, with obvious execution and kernel parameters. I found one here: https://github.com/Fruneng/opencl_sha_al_im
I tested it with an input whose hash I could easily verify, which was nice, since I had struggled to translate a straight C implementation into OpenCL. In particular, I wasn't sure where to put k[0..63] and how to mark it as constant memory, and whether to pass it in with a buffer copy from host to device or just put it in the sha256.cl file. Putting it in the .cl file seems to be the usual way, but it's not clear to me why. I also wonder how I can pre-compute (pad, etc.) the input buffer so it is ready to hash and finalize, and then copy the 32-byte output back to the host.
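For the padding part, this is a minimal host-side sketch of what I have in mind. It assumes the standard SHA-256 padding rule (a 0x80 byte, zero bytes up to 56 mod 64, then the message length in bits as a 64-bit big-endian integer). The helper name pad_sha256 is my own and is not taken from the linked repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns a copy of msg with SHA-256 padding applied,
// so the result is a whole number of 64-byte blocks.
std::vector<uint8_t> pad_sha256(const std::vector<uint8_t>& msg) {
    const uint64_t bit_len = static_cast<uint64_t>(msg.size()) * 8;
    std::vector<uint8_t> out(msg);
    out.push_back(0x80);                 // a single 1 bit followed by zero bits
    while (out.size() % 64 != 56) {      // leave 8 bytes for the length field
        out.push_back(0x00);
    }
    for (int shift = 56; shift >= 0; shift -= 8) {
        // message length in bits, most significant byte first
        out.push_back(static_cast<uint8_t>(bit_len >> shift));
    }
    assert(out.size() % 64 == 0);
    return out;
}
```

If this is right, every padded input is a multiple of 64 bytes, which might make it easier to lay many inputs out back to back in one device buffer.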
Now for the actual question: I'm trying to find a kernel like this that I can execute across a whole work-group; on my Radeon mobile card the work-group size is 256. I assume that's how many hash operations the kernel will do for me at once; is that correct? So if there were a "simple" kernel such as __kernel void sha256(const char *input, const int input_size, uint8_t *out_hash) or similar, would a work-group size of 256 mean I could queue up 256 input, size and output buffers, ready for me to copy in the data to hash, the size of each block of data, and the range of memory for the output hashes? And is the number of work-items relevant to set, or the local work-group size?
So is there a way to have OpenCL launch as many sha256 kernel instances as possible, given a buffer of input char strings, a buffer of their sizes, and an output buffer of 256 x 32 bytes? With enqueueNDRangeKernel I can specify the command queue, the kernel, work dimension = 1, global work offset = ?, local work size = ?, and the event parameters. But would I need a kernel whose arguments I can set in a loop? And how do I copy in the input buffers to hash and the sizes of those byte buffers? Essentially I'm looking for the equivalent of what in CUDA might look like sha256<<<1, 256>>>(...).
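From what I've read, the CUDA launch sha256<<<1, 256>>> would correspond to a global size of 256 and a local size of 256 in OpenCL, because the OpenCL global size counts work-items in total rather than groups. Here is a small sketch of that arithmetic. The helper names round_up_global and cuda_to_global are my own and are not OpenCL API. The rounding matters on OpenCL 1.x, where the global size has to be a multiple of the local size when a local size is given.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest multiple of local_size that is >= n_items.
std::size_t round_up_global(std::size_t n_items, std::size_t local_size) {
    assert(local_size > 0);
    return ((n_items + local_size - 1) / local_size) * local_size;
}

// Hypothetical helper: CUDA's <<<blocks, threads>>> expressed as an OpenCL
// global size (total work-items); the local size is simply `threads`.
std::size_t cuda_to_global(std::size_t blocks, std::size_t threads) {
    return blocks * threads;
}
```

With the C++ wrapper, I'd expect these values to go into cl::NDRange(global) and cl::NDRange(local). If the global size was rounded up, the kernel would presumably need a bounds check against the real number of inputs.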
I think what I have to do is specify the input parameters 256 times (unless I am misunderstanding something about work group size; I just want it to work on 256 hashes in parallel), sort of like this:
char *vec[256];       // 256 pointers to the char/uint8_t data to hash; malloc each one and copy the data in
int sizes[256];       // The sizes of the input strings to hash
uint8_t *out_hashes;  // Host memory for 256 output hashes of 32 uint8_t each (256 * 32 bytes)
// Set kernel arguments. It doesn't seem like I can just loop over the inputs and set arguments 0, 1, 2 each time as in the example kernel, but that's what we are supposed to do, isn't it? And I assume the NDRange kernel looks at the global size and offsets and figures out which block of data it's working on
// Do this 256 times? For the input data to hash and the sizes of the char strings; internally they are unsigned char* or uint8_t*, which is fine
// I'm actually using the C++ wrapper, which I like for hacking on things quickly and trying them out
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(???), ...);
// Then finally read back the 256 hashes; out_buf is a placeholder name for the cl::Buffer that holds them on the device
err = queue.enqueueReadBuffer(out_buf, CL_TRUE, 0, 256 * 32, out_hashes);
// Wait either way: block above with CL_TRUE, or pass CL_FALSE (non-blocking) and call queue.finish() here
So basically I'm looking for a kernel that can hash some data, and launch it as many times as I can, with all the bits it needs inside so all I have to specify is the input data to hash, the size of each block of char/unsigned char to hash, and the pointer to where the hashed data goes in a big block of N_HASHES * 32 (size of sha256 output = 32 bytes).
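To make the layout I'm describing concrete, here is a host-only C++ sketch that packs all inputs into one contiguous block with per-item offsets and sizes, and writes each result to out + gid * 32. Everything in it is my own assumption: the names (PackedInputs, pack, work_item, run_all) are hypothetical, fake_digest is a placeholder and NOT SHA-256, and the plain loop only emulates what I imagine each work-item would do with its get_global_id(0).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t kHashBytes = 32;  // SHA-256 digest size in bytes

// Hypothetical host-side layout: all inputs back to back in one buffer,
// plus one offset and one size per input.
struct PackedInputs {
    std::vector<uint8_t> data;
    std::vector<uint32_t> offsets;
    std::vector<uint32_t> sizes;
};

PackedInputs pack(const std::vector<std::string>& inputs) {
    PackedInputs p;
    for (const std::string& s : inputs) {
        p.offsets.push_back(static_cast<uint32_t>(p.data.size()));
        p.sizes.push_back(static_cast<uint32_t>(s.size()));
        p.data.insert(p.data.end(), s.begin(), s.end());
    }
    return p;
}

// Placeholder, NOT SHA-256: XOR-folds the input into 32 bytes so the
// indexing can be checked without a real hash implementation.
void fake_digest(const uint8_t* in, uint32_t len, uint8_t* out) {
    for (std::size_t i = 0; i < kHashBytes; ++i) out[i] = 0;
    for (uint32_t i = 0; i < len; ++i) out[i % kHashBytes] ^= in[i];
}

// What a single work-item would do; gid plays the role of get_global_id(0).
void work_item(std::size_t gid, const PackedInputs& p, std::vector<uint8_t>& out) {
    assert(gid < p.sizes.size());
    fake_digest(p.data.data() + p.offsets[gid], p.sizes[gid],
                out.data() + gid * kHashBytes);
}

// Emulates the launch with a sequential loop; on a device the work-items
// would run in no guaranteed order.
std::vector<uint8_t> run_all(const PackedInputs& p) {
    std::vector<uint8_t> out(p.sizes.size() * kHashBytes);
    for (std::size_t gid = 0; gid < p.sizes.size(); ++gid) work_item(gid, p, out);
    return out;
}
```

If this picture is right, the three vectors would become three device buffers set once as kernel arguments, and the output buffer would be read back as one block of N_HASHES * 32 bytes.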
I *think* I have to find a kernel that takes the input, its size (the size could be specified elsewhere with other OpenCL calls, but I'd prefer to give it to the kernel, which I hope will make things simpler), and the output hash pointer for the 256 bits / 32 uint8_t output bytes. What I then need to do is:
1) Create the buffers ready for input and output.
2) Create the kernel.
3) Set the kernel arguments somehow, with contiguous blocks for input, sizes and output. Do I only have to do this once?
4) Create a queue.
5) enqueueFillBuffer for the input data and the size of each input data block.
6) Some other buffer call to specify where the output goes, marking it global like the others so both host and device have access.
7) enqueueNDRangeKernel().
8) enqueueReadBuffer() to read back the output hashes.
Then of course print them out and observe.
Thanks for any help!