Hi, I have some questions.
Are the CPU implementations optimized? Specifically, are the calculations blocked to fit in cache? An example of this kind of tuning is the ATLAS BLAS library. And do the CPU implementations use all of the cores? I ask because the usual wisdom is that CPUs have the advantage when arithmetic intensity is low, due to their very high effective memory bandwidth.
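For reference, a rough sketch of the kind of cache blocking I mean, in plain C++ (illustrative only; the block size 64 and the loop order are placeholders that a library like ATLAS would tune per machine):

#include <algorithm>
#include <cstddef>
#include <vector>

// C += A * B for row-major N x N matrices, tiled so that each block of
// the working set stays cache-resident while it is being reused.
const std::size_t BS = 64;  // hypothetical block size, normally auto-tuned

void blocked_gemm(std::vector<double> const & A,
                  std::vector<double> const & B,
                  std::vector<double>       & C, std::size_t N)
{
  for (std::size_t ii = 0; ii < N; ii += BS)
    for (std::size_t kk = 0; kk < N; kk += BS)
      for (std::size_t jj = 0; jj < N; jj += BS)
        for (std::size_t i = ii; i < std::min(ii + BS, N); ++i)
          for (std::size_t k = kk; k < std::min(kk + BS, N); ++k)
          {
            double a = A[i * N + k];  // reused across the whole j-block
            for (std::size_t j = jj; j < std::min(jj + BS, N); ++j)
              C[i * N + j] += a * B[k * N + j];
          }
}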
The other thing I wonder about is performance with dense matrices, specifically something like a covariance matrix (which is symmetric positive definite and dense).
This looks like an interesting project. I have a few questions:
Is there a document about the internals of ViennaCL? For example, are you trying to merge several operations into one kernel to reduce overheads or is the OpenCL code static and you have a kernel call for each operation? (I only scanned the manual, maybe this is stated in there...)
You said you're planning to support CPUs in the next release. What exactly is that going to look like? Will you have different OpenCL kernel versions for CPUs and GPUs and then let the user decide which device to use? Or are you trying to automatically distribute the work across multiple devices?
Thanks for your questions and interest :-)
The implementations of the solvers are generic, so the same code is used for CPUs (via e.g. ublas) and GPUs (via the built-in types of ViennaCL). Other linear algebra types can also be used after registering them with the wrapper functions (norm_2, inner_prod and the like). Hence, the level of optimization of the CPU implementations depends e.g. on ublas.
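To sketch what "generic" means here (illustrative only; cg_solve is a made-up name, and the real ViennaCL solver code differs in detail):

// A conjugate-gradient iteration written purely in terms of overloadable
// free functions (prod, inner_prod, norm_2) and vector arithmetic. The
// same template then works for ublas types and ViennaCL types alike, as
// long as suitable overloads are visible for MatrixT and VectorT.
template <typename MatrixT, typename VectorT>
VectorT cg_solve(MatrixT const & A, VectorT const & b, double tol = 1e-8)
{
  VectorT x = b;
  x.clear();                         // start from x = 0 (assumes clear() zeroes the vector)
  VectorT r = b;                     // residual r = b - A*x
  VectorT p = r;                     // search direction
  double rr = inner_prod(r, r);
  while (norm_2(r) > tol)
  {
    VectorT Ap = prod(A, p);
    double alpha = rr / inner_prod(p, Ap);
    x += alpha * p;
    r -= alpha * Ap;
    double rr_new = inner_prod(r, r);
    p = r + (rr_new / rr) * p;       // update search direction
    rr = rr_new;
  }
  return x;
}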
As with all calculations, performance for dense matrices depends on the type of operation and the sizes involved (i.e. whether the OpenCL overhead is amortized). I remember getting good performance even for partly serial operations such as LU factorization, but I can't give you exact numbers right now. Benchmark results will follow soon.
At present, we don't have a document about the internals apart from the standard Doxygen output, which I admit is not very satisfactory. We try to collect as many operations as possible into a single kernel by using expression templates. For example, the operation
vec1 += alpha * vec2
is fused into a single OpenCL kernel using expression templates. This is fine in many cases. The more complicated operation
vec1 = alpha * vec2 - beta * vec3 + gamma * vec4
creates a few temporaries, but fewer than with "naive C++". We have already thought about creating compute kernels on the fly from the expression templates, but the overhead of compiling a custom OpenCL kernel at runtime is usually too high for this to pay off.
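A stripped-down CPU analogue of the fusion mechanism (not the actual ViennaCL code; there the evaluation loop is replaced by a single OpenCL kernel launch):

#include <cstddef>

// alpha * vec builds a lightweight proxy instead of a temporary vector;
// operator+= then walks the proxy once, so vec1 += alpha * vec2 becomes
// a single fused pass over the data.
struct vector
{
  double *    data;
  std::size_t size;

  template <typename Expr>
  vector & operator+=(Expr const & e)
  {
    for (std::size_t i = 0; i < size; ++i)
      data[i] += e[i];               // one pass, no temporary
    return *this;
  }
};

struct scaled_vector                 // proxy representing alpha * v
{
  double alpha;
  vector const & v;
  double operator[](std::size_t i) const { return alpha * v.data[i]; }
};

inline scaled_vector operator*(double alpha, vector const & v)
{
  return scaled_vector{alpha, v};
}

// usage: vec1 += alpha * vec2;   // builds a proxy, evaluated in one loop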
As for multi-core CPU support in ViennaCL, we aim to provide a fall-back solution (if no GPU is present) in version 1.0.3 (scheduled for no later than next week). In later versions, we plan to support multiple devices. This is, however, really tricky, because operands may reside on different devices.
Thanks for the hint, dominik_g. It was on my schedule anyway :-)