Whenever a kernel is launched across a set of CUs, is there any way to get back info on what executed where? Or how the kernel was mapped to the existing hardware?
I want to achieve a forced software-hardware map for a given GPU, i.e. given some known constraints on the kernel and the exact hardware configuration, determine whether a range of IDs executes on a certain CU or set of CUs.
Is this possible to any degree? If not, will it be possible in the future?
Thanks in advance.
Not possible in a portable way, and most likely never going to work; the whole point of GPU programming, even before OpenCL entered the works, is that you have a massive array of identical processors and work can be dispatched to any of them.
I see you might want to do that as a means of sharing L1. It still seems too limited a benefit to justify this.
I recall reading about an extension for "slices" which might do more or less what you want. I'm dropping this here FYI, but I'm not sure myself.
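For what it's worth, if you only want to observe (not force) the mapping, there are non-portable tricks. The kernel below is a minimal sketch assuming a clang-based AMD compiler (e.g. ROCm's) that exposes the AMDGPU builtin __builtin_amdgcn_s_getreg, and assuming the GCN HW_ID layout where CU_ID sits in bits 11:8; none of this is part of the OpenCL spec, and I haven't verified it on every driver stack.

/* Sketch only: assumes a clang-based AMD OpenCL compiler exposing the
   AMDGPU builtin __builtin_amdgcn_s_getreg, and the GCN HW_ID layout.
   Not portable OpenCL; this observes where a wavefront landed after
   the fact, it does not force placement. */

/* s_getreg_b32 immediate encoding: id | (offset << 6) | ((size - 1) << 11) */
#define HWREG(id, off, size)  ((id) | ((off) << 6) | (((size) - 1) << 11))
#define HW_ID_REG   4    /* GCN hardware ID status register */
#define CU_ID_OFF   8    /* CU_ID field: bits 11:8 of HW_ID */
#define CU_ID_SIZE  4

__kernel void where_did_i_run(__global uint *cu_of_group)
{
    if (get_local_id(0) == 0)   /* one report per work-group */
        cu_of_group[get_group_id(0)] =
            __builtin_amdgcn_s_getreg(HWREG(HW_ID_REG, CU_ID_OFF, CU_ID_SIZE));
}

Even where this compiles, it only tells you where each work-group happened to land; it gives you no affinity control, and the HW_ID layout differs between GPU generations.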
Well, if I have for instance only an OpenCL application with a single command queue, it should be possible to build/launch a kernel such that I would know which ID (get_global_id) executed on which compute unit. In other words, why can't I determine a hardware-software map (which ID executes on which compute unit/SIMD)?
There are several articles on topics like concurrent kernel execution, where the kernels are each executed in smaller chunks, thus forcing a sort of "software" round-robin time slice (e.g. "Improving GPGPU Energy-Efficiency through Concurrent Kernel Execution and DVFS"); see the sketch at the end of this post.
It's not only about the L1 cache; I see several advantages to having SIMD/CU affinity.
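To sketch what I mean by a "software" time slice (host-side C; the function name and the chunk/local sizes are placeholders of mine, and whether the driver actually interleaves the two kernels is not guaranteed):

#include <CL/cl.h>

/* Enqueue two kernels in alternating chunk-sized slices of the same
   NDRange, approximating a round-robin time slice in software.
   Assumes 'chunk' is a multiple of 'local' and divides 'total'. */
void round_robin_slices(cl_command_queue queue,
                        cl_kernel kernelA, cl_kernel kernelB,
                        size_t total, size_t chunk, size_t local)
{
    for (size_t off = 0; off < total; off += chunk) {
        /* global_work_offset shifts get_global_id(), so each slice
           covers its own part of the full index space */
        clEnqueueNDRangeKernel(queue, kernelA, 1, &off, &chunk, &local,
                               0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, kernelB, 1, &off, &chunk, &local,
                               0, NULL, NULL);
    }
    clFinish(queue);
}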
In other words, why can't I determine a hardware-software map (which ID executes on which compute unit/SIMD)?
Because it's not in the current programming model. The current programming model is: everybody can execute everything (so we don't spend transistors on control circuitry, only on stuff that does useful work).
You want the model extended. Fine. Then the question is: why would you want this functionality?
Please share those articles, I'd be interested in giving them a quick read. The few I've read in the past were nonsense. In particular, I remember reading the article you mention; while I don't consider it snake oil, I consider it ivory-tower academic blabber, with DVFS being the epitome of academic obscurantism, basically "tweak your GPU".
I also don't see any kind of "software" round robin going on there. If you launch multiple kernels they will execute concurrently; I have observed this in practice and yes, it does improve performance.
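(To be concrete about "launch multiple kernels": I mean something like the sketch below, two plain in-order queues on the same device; whether the kernels truly overlap is still the scheduler's decision, and the names are placeholders.)

#include <CL/cl.h>

/* Sketch: kernels enqueued on two independent queues are free to run
   concurrently on the same device; nothing orders them against each other. */
void launch_concurrently(cl_context ctx, cl_device_id dev,
                         cl_kernel k1, cl_kernel k2, size_t global)
{
    cl_int err;
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue q2 = clCreateCommandQueue(ctx, dev, 0, &err);

    clEnqueueNDRangeKernel(q1, k1, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q2, k2, 1, NULL, &global, NULL, 0, NULL, NULL);

    clFinish(q1);   /* wait for both; completion order is unspecified */
    clFinish(q2);

    clReleaseCommandQueue(q1);
    clReleaseCommandQueue(q2);
}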
Anyway, CL2.0 pipes might be able (in some future implementation) to exploit L1/LDS coherency. We still don't get to know who executes what, but I don't see that being a problem.
Again, please elaborate on your needs. I'm interested in a specific use case.
I'm currently working on safety-critical products/applications, and GPUs are just starting to be integrated there. Most GPU IPs don't, to my knowledge, have ways to detect problems if parts go defective. Likewise, there may be a need to run things concurrently in real time, and being able to have affinity over a certain range of CUs would, I think, be helpful. You could expand on these ideas to find other use cases; so it's not so much about performance as it is about correctness and latency.
So, coming back to my initial question, a bit rephrased:
For instance, if I executed only 1 thread that used few registers and fit inside a CU/SIMD, would it execute on CU0, or is it random? And if I had 8, 32, or 64 threads, would I be able to pinpoint the scheduling given the known resource constraints (register requirements, memory stalls, etc.)?
OR
If I have, for instance, only an OpenCL application with a single command queue, is it possible to build/launch a kernel such that I would know which ID (get_global_id) executed on which compute unit (CU)? In other words, can I determine a hardware-software map? Which ID executes on which compute unit/SIMD?
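To make the single-queue case concrete, the kind of instrumentation I have in mind is the sketch below (my own naming); note that it only reveals the order in which work-groups were dispatched, not which CU ran them:

/* Sketch: record the order in which work-groups get dispatched.
   The host must zero 'counter' before launch. This exposes dispatch
   order only; it says nothing about which CU/SIMD ran each group. */
__kernel void observe_dispatch(__global int *order,
                               volatile __global int *counter)
{
    if (get_local_id(0) == 0)
        order[get_group_id(0)] = atomic_inc(counter);
}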
As for those articles, here are the ones I've read or am still studying:
http://www.cs.virginia.edu/~skadron/Papers/meng_dws_isca10.pdf
https://www.ece.ubc.ca/~aamodt/papers/wwlfung.micro2007.pdf
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7108420
https://www.ece.ubc.ca/~aamodt/papers/eltantawy.hpca2014.pdf
http://cva.stanford.edu/publications/2000/cstream.pdf
http://www.comp.nus.edu.sg/~tulika/CGO15.pdf
Thank you for your elaboration. I think I'm starting to see your perspective.
I'll take a look at the resources you linked in the next few days.
In the meantime,
glupescu wrote:
I'm currently working on safety-critical products/applications, and GPUs are just starting to be integrated there. Most GPU IPs don't, to my knowledge, have ways to detect problems if parts go defective. Likewise, there may be a need to run things concurrently in real time, and being able to have affinity over a certain range of CUs would, I think, be helpful. You could expand on these ideas to find other use cases; so it's not so much about performance as it is about correctness and latency.
So, coming back to my initial question, a bit rephrased:
For instance, if I executed only 1 thread that used few registers and fit inside a CU/SIMD (1), would it execute on CU0 or is it random (2)? And if I had 8, 32, or 64 threads, would I be able to pinpoint the scheduling given the known resource constraints (register requirements, memory stalls, etc.)? (3)
OR
If I have, for instance, only an OpenCL application with a single command queue, is it possible to build/launch a kernel such that I would know which ID (get_global_id) executed on which compute unit (CU)? (4) In other words, can I determine a hardware-software map? (5) Which ID executes on which compute unit/SIMD? (6)
Thank you for the detailed answer; I now understand your point.