glupescu · ‎06-10-2015

Whenever a kernel is launched across a set of CUs, is there any way to get back information on what executed where, or on how the kernel was mapped to the existing hardware?

I want to achieve a forced software-hardware map for a given GPU, i.e. given some known constraints on the kernel and the exact hardware configuration, to determine whether a range of IDs executes on a certain CU or set of CUs.

Is this possible to any degree? If not, will it be possible in the future?

Thanks in advance

maxdz8 · ‎06-10-2015

Not possible in a portable way, and most likely never going to work; the whole point of GPU programming, even before OpenCL entered the picture, is that you have a massive array of equal processors and work can be dispatched to all of them.

I can see you might want to do that as a means of sharing L1. That still seems a very limited benefit to justify it.

I recall reading about an extension involving "slices" which might do more or less what you want. I'm dropping this here FYI, but I'm not sure myself.

glupescu · ‎06-10-2015

Well, if I have, for instance, only one OpenCL application with a single command queue, it should be possible to build/launch a kernel such that I know which ID (get_global_id) executed on which compute unit. In other words, why can't I determine a hardware-software map (which work-item executes on which compute unit/SIMD)?

There are several articles on topics like concurrent kernel execution, where each kernel is executed in smaller chunks, thus forcing a sort of "software" round-robin time slicing (e.g. "Improving GPGPU Energy-Efficiency through Concurrent Kernel Execution and DVFS").

It's not only about the L1 cache; I see several advantages to being able to set SIMD/CU affinity.

maxdz8 · ‎06-10-2015

glupescu wrote:
In other words, why can't I determine a hardware-software map (which work-item executes on which compute unit/SIMD)?

Because it's not in the current programming model. The current programming model is: everybody can execute everything (so we don't spend transistors on control circuitry, only on hardware that does useful work).

You want the model extended. Fine. Then the question is: why would you want this functionality?

Please share those articles; I'd be interested in giving them a quick read. The few I've read in the past were nonsense. In particular, I remember reading the article you mention; while I don't consider it snake oil, I do consider it ivory-tower academic blabber, with DVFS being the epitome of academic obscurantism, as it basically means "tweak your GPU".

I also don't see any kind of "software" round robin going on there. If you launch multiple kernels they will execute concurrently; I have observed this in practice and yes, it does improve performance.

Anyway, CL2.0 pipes might be able (in some future implementation) to exploit L1/LDS coherency. We still don't get to know who executes what but I don't see it being a problem.

Again, please elaborate on your needs. I'm interested in your specific use case.

glupescu · ‎06-11-2015

I'm currently working on safety-critical products/applications, and GPUs are just starting to be integrated here. To my knowledge, most GPU IPs have no way to detect problems if parts become defective. Likewise, there may be a need to run things concurrently in real time, and being able to set affinity over a certain range of CUs would, I think, be helpful. You could expand on these ideas to find other use cases, so it's not so much about performance as it is about correctness and latency.

So, coming back to my initial question, slightly rephrased:

For instance, if I execute only 1 thread that uses few registers and fits inside a CU/SIMD, would it execute on CU0 or is it random? And if I have 8, 32 or 64 threads, would I be able to pinpoint the scheduling given the known resource constraints (register requirements, memory stalls etc.)?

OR

If I have, for instance, only one OpenCL application with a single command queue, is it possible to build/launch a kernel such that I know which ID (get_global_id) executed on which compute unit (CU)? In other words, can I determine a hardware-software map? Which work-item executes on which compute unit/SIMD?

Here are the articles I've read or am still studying:

http://www.cs.virginia.edu/~skadron/Papers/meng_dws_isca10.pdf

https://www.ece.ubc.ca/~aamodt/papers/wwlfung.micro2007.pdf

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7108420

https://www.ece.ubc.ca/~aamodt/papers/eltantawy.hpca2014.pdf

http://cva.stanford.edu/publications/2000/cstream.pdf
http://www.comp.nus.edu.sg/~tulika/CGO15.pdf

maxdz8 · ‎06-11-2015

Thank you for your elaboration. I think I'm starting to see your perspective.

I'll take a look at the resources you linked in the next few days.

In the meantime,

glupescu wrote:
I'm currently working on safety-critical products/applications, and GPUs are just starting to be integrated here. To my knowledge, most GPU IPs have no way to detect problems if parts become defective. Likewise, there may be a need to run things concurrently in real time, and being able to set affinity over a certain range of CUs would, I think, be helpful. You could expand on these ideas to find other use cases, so it's not so much about performance as it is about correctness and latency.
So, coming back to my initial question, slightly rephrased:
For instance, if I execute only 1 thread that uses few registers and fits inside a CU/SIMD (1), would it execute on CU0 or is it random (2)? And if I have 8, 32 or 64 threads, would I be able to pinpoint the scheduling given the known resource constraints (register requirements, memory stalls etc.)? (3)
OR
If I have, for instance, only one OpenCL application with a single command queue, is it possible to build/launch a kernel such that I know which ID (get_global_id) executed on which compute unit (CU)? (4) In other words, can I determine a hardware-software map? (5) Which work-item executes on which compute unit/SIMD? (6)

(1) A "thread" always fits inside a CU or SIMD by definition, regardless of how many resources it uses: it is always mapped to a single lane of one SIMD, and therefore to a single CU. The private registers (VGPRs) get spilled to memory if necessary; you cannot "take registers from other lanes". If you run out of LDS, you won't be able to run the kernel. Some drivers might spill LDS to global memory as well, but given the vastly different performance characteristics I wouldn't consider that a good thing.
Try to move away from the word "threads"; it is a CPU-oriented term. The CL term is work-item, and what behaves like a "thread" is really a wavefront (AMD parlance) or sub-group (CL 2.1 parlance, though I only skimmed the preliminary 2.1 spec).
(2) In the wild, it is equivalent to random. Keep in mind that work-items don't get scheduled individually; they are grouped into wavefronts and work-groups ("local work size"), the latter being the way to associate "nearby threads" with coherent HW resources (such as LDS/L1 banks).
(3) On AMD GCN, those would always be grouped into a single wavefront and go to the same SIMD of a CU.
(4) Not in base CL, and I see no good reason to know this. Note that cl_arm_get_core_id seems to allow exactly that. I haven't found much on the "slices" I mentioned earlier; it looks to be an Intel-specific concept.
Note that if mission-critical reliability is your target, you don't want to map CUs either: the probability of a CU going defective is close to zero. If it does happen, you really want to throw the device away, exactly as you do with RAID disks as soon as one starts degrading. I agree vendors could improve detection of those cases. Correctness is not a problem: GPUs have provided correct evaluation of nearly arbitrary functions for close to 20 years now. If latency is your concern, use CPU cores, as those are optimized for low latency.
(5) No. Maybe you can, as in (4), but I don't see how you would use this knowledge in any realistic way, as branching on the condition still requires a full dispatch, potentially across all CUs. Maybe with sub-devices this might be a thing. Sub-devices are not exposed by GPUs for a reason: AFAIK those partitions don't exist at the hardware level.
(6) Everything can execute everywhere. Just to be safe: CUs and SIMDs are two different things. The latter in particular is an implementation-specific concept, so there is no chance it will be supported in base CL. Maybe with some future extension, but again I don't see the point. The programming model is geared towards massive scale, and this means no restrictions.

glupescu · ‎06-11-2015

Thank you for the detailed answer. I understand your point now.

GPU hardware scheduler