
Meteorhead
Challenger

GCN working as x86(_64)

Hi everyone!

I have a highly theoretical question: how big of a challenge would it be to enhance the GCN architecture to be x86 compatible?

I'm thinking of the following... all vendors are aiming at the fusion of CPUs and GPUs, just in slightly different manners. Intel had the idea of using x86 cores for doing graphics, and tried to create the smallest possible core in order to build (at first) a 10-12 core CPU that could dynamically dedicate each core as an x86 core or a GPU core. We all know how that story went: many failures until Phi was released as the first viable product, but they have not implemented a matching software renderer for any of the graphics APIs, so it is practically incapable of modern graphics (sadly). How challenging would it be to tackle the problem from the other side? How self-defeating would it be to enhance the GCN architecture to semi-efficiently execute x86 instructions?

Naturally, most of the surrounding logic would have to be altered: the instruction feeder, the branching unit, and the GCN cores themselves would have to receive additional wiring to be able to execute x86 instructions collectively. I know a lot less about CPU architectures (although I do read the articles about new architectures when they appear), so I do not know how complex things can get, but I figure there's a reason why CPU cores are that much larger and why it takes roughly 100x more power to execute one instruction on a CPU than on a GPU. Could it be that, similarly to how 64-bit instructions are done by joining two GCN cores, x86 instructions could be done by joining two or more of them? Naturally the SSE and AVX instructions would not be that hard to implement, since AFAIK they always issue the same instruction down each lane, and since there are 16 scalar cores coupled together it cannot be that hard to do vector operations collectively; perhaps the 4 separate bundles of 16 cores could even serve as 4 lanes of an SSE operation (see the small sketch below). Naturally the cores could not operate in GPU and CPU mode at the same time (similar to how wavefront switching is done), so CPU mode would be exclusive to an entire CU at a time; it would also be a very special mode of the instruction feeder, and in this sole case the GCN cores would do a very limited form of branching, where they each handle different parts of an operation. Although many things are different, the 64 KiB LDS could serve as an L1 cache for the cores, saving die space, etc. I have the feeling that many things which already work could be reused. The biggest challenge, I feel, would be coming up with the collective behaviour that results in x86 operations in the fastest way possible: branching, peeking ahead at instructions, etc... all the things that make a CPU latency-optimized.
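
To make the lane analogy concrete, here is a minimal sketch in plain C++ with SSE intrinsics (nothing GCN-specific; the scalar loop merely stands in for one add issued across GCN lanes, it is not real GCN code):

#include <xmmintrin.h>  // SSE intrinsics
#include <cstdio>

int main() {
    alignas(16) float a[4] = {1, 2, 3, 4};
    alignas(16) float b[4] = {10, 20, 30, 40};
    alignas(16) float c[4], c2[4];

    // On a CPU, one SSE instruction (addps) adds all four lanes at once:
    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(c, _mm_add_ps(va, vb));

    // On GCN, the same work would be one add issued across the SIMD,
    // with each lane holding one element -- conceptually:
    for (int lane = 0; lane < 4; ++lane)   // each iteration = one lane
        c2[lane] = a[lane] + b[lane];

    for (int i = 0; i < 4; ++i)
        std::printf("%g %g\n", c[i], c2[i]);
}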

I understand that the resulting GPU would be gigantic overkill for graphics in terms of capability, while being a very dumb CPU. However, it could be one of the most advanced GPGPU architectures, and it could be a 16-32 core CPU even in a notebook, with moderate single-core performance, humongous parallel compute power, and a fully dynamic nature in terms of relative CPU-GPU performance.

I know that AMD would never comment on ongoing research, even if there were such a project as the next step of Fusion, but forumers and AMD employees alike could comment on how much nonsense the things I just said are. Is the entire idea self-defeating, or is it all just too hard to do? What are your opinions?

0 Likes
5 Replies
yurtesen
Miniboss

I guess I may be wrong, but below are my thoughts about this. What I don't get is why make GPUs become x86-compatible hardware? It doesn't make sense in my mind.

First of all, Phi is also a failure (and not x86 compatible at all). I tried running Intel's own OpenCL examples, which are optimized for Phi (we tried native code and offload code too, and failed on performance), and they run at least twice as fast on AMD GPUs. Of course, GPU-optimized programs run at least 10 times faster so far... If you check Intel's own literature, their speed figures always talk about 'theoretical' speeds (in small letters at the bottom of every page), and even they don't have graphs showing how Phi stacks up against GPUs (should be a hint, eh?). But of course this won't stop Intel from selling Phis like candy; they will just push it out using brute-force advertising.

About x86 on GPUs... even if it were possible, it would never be 100% compatible, and it would be useless, because you can't just run x86 programs that were designed for one or a few cores on a massively parallel architecture. Again, Phi is a great example: Intel told everyone that they would be able to re-compile their code and run their programs (yes, and you can run them, but very slowly, heh). But you really have to re-design your program to be efficient on Phi. You would have the same problem if you were to use a GPU.

In addition, I would expect the power usage to be tremendous if all the x86 logic were implemented. Basically you are talking about making a 64/128-core processor. It is simply impossible unless you go back to something like the original Pentium's complexity (which is what Intel did). Today you can't run x86 programs on Phi anyway: you need to re-compile them for the Phi architecture, and those programs won't run on x86 processors either. What a mess. Intel will soon have problems similar to what the VLIW cards had: they will get bad performance for programs which do not vectorize, and most programs unfortunately do not vectorize well; you need specific programming for it. Intel realized this and made a pseudo-x86-compatible device, but they support OpenCL because otherwise they would fail to get people to re-program their programs only for their own device.
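
As a small, generic C++ illustration of that vectorization point (not tied to Phi or any particular compiler): the first loop below has independent iterations and auto-vectorizes readily, while the second carries its result from one iteration to the next, which is exactly the kind of code that resists vectorization.

#include <cstddef>

// Vectorizes well: every iteration is independent, so the compiler can
// process 4/8/16 elements per SIMD instruction.
void saxpy(float* y, const float* x, float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Resists vectorization: each iteration depends on the previous value of
// 'acc', forcing essentially scalar execution.
float serial_chain(const float* x, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 1; i < n; ++i)
        acc = (x[i] > x[i - 1]) ? acc + x[i] : acc * 0.5f;
    return acc;
}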

I see today that if I make an OpenCL program, it is sort of pointless for me to think about which architecture my program is running on; it will just run on anything. (So does it matter whether the GPU supports x86 assembly?)
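
That portability is visible already in the host setup. A minimal sketch (OpenCL 1.x C API called from C++, error handling mostly omitted) that takes whatever device the first platform offers, CPU, GPU or accelerator, and builds the same kernel source for it:

#include <CL/cl.h>
#include <cstdio>

// The same kernel source runs unchanged on any OpenCL device type.
static const char* kSrc =
    "__kernel void scale(__global float* d, float f) {"
    "    size_t i = get_global_id(0); d[i] *= f; }";

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    // CL_DEVICE_TYPE_ALL: we genuinely do not care whether this is a CPU,
    // a GPU or an accelerator such as Phi.
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr);

    char name[256];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
    std::printf("Running on: %s\n", name);

    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, &err);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);  // device-specific compile happens here

    clReleaseProgram(prog);
    clReleaseContext(ctx);
}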

Also, when I think about this, it doesn't make much sense. We have x86, which was originally designed to excel at single-core processing, and we have GPUs, which were designed to excel at massive parallelism. Why try to turn a GPU into a CPU? Wouldn't it be like making a car which can also be used as a boat? (The problem would be that the car's shape is not suitable for water, so it would be inefficient...)

I think the future is in programs which can offload heavy calculations to GPUs or any accelerator; OpenCL / HSA is a great way to accomplish this. Perhaps they can make it so CPUs take advantage of GPU elements automatically, but it makes little sense for GPUs to become x86 CPUs, in my opinion.

0 Likes
realhet
Miniboss

Hi,

IMO the x86-64 instruction set is a terrible mess compared to GCN. It would be a waste of die space to implement that complicated x86 decoder in a GCN chip. It is a mess because of roughly 25 years of backward compatibility: it still supports old instructions that are never used in today's programs, for example BCD arithmetic. Those instructions have the smallest encodings, in contrast to the new SIMD instructions, which are encoded as prefixes over prefixes and yet more prefixes (REX, VEX, DataSize, AddressSize, etc.) -> X86-64 Instruction Encoding - OSDev Wiki. Also, I bet x86-64 can address at least 100 register names, but many of those are just different names for the same location: the general registers (al, ax, eax, rax are all views of one register), the 8 x 80-bit FPU/MMX registers (mm0 aliases st0), the 16 x 256-bit SIMD registers (xmm0 is the low half of ymm0), so many bits and bytes are used just to distinguish those. GCN simply uses 9 bits to choose from 256 vregs, 128 sregs, 90 inline constant values and even a direct_lds value.
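
To show what those layered prefixes look like in practice, here is a small, self-contained C++ table of hand-assembled x86-64 encodings (byte values follow the standard encoding rules; treat it as an illustration, not a reference), contrasted in the comments with GCN's fixed-size instruction words:

#include <cstdio>
#include <vector>

struct Encoding {
    const char* assembly;
    std::vector<unsigned char> bytes;
};

int main() {
    // x86-64 instruction lengths vary from 1 to 15 bytes depending on how
    // many prefixes (REX, VEX, operand/address size, ...) pile up.
    const Encoding samples[] = {
        {"add al, 1",               {0x04, 0x01}},              // legacy 8-bit form, 2 bytes
        {"add rax, rbx",            {0x48, 0x01, 0xD8}},        // REX.W prefix + opcode + ModRM
        {"addps xmm0, xmm1",        {0x0F, 0x58, 0xC1}},        // SSE, two-byte opcode map
        {"vaddps ymm0, ymm1, ymm2", {0xC5, 0xF4, 0x58, 0xC2}},  // AVX, 2-byte VEX prefix
    };

    for (const Encoding& e : samples) {
        std::printf("%-26s ->", e.assembly);
        for (unsigned char b : e.bytes) std::printf(" %02X", b);
        std::printf("  (%zu bytes)\n", e.bytes.size());
    }

    // GCN, by contrast, encodes every instruction in a fixed 32-bit or
    // 64-bit word, with a single 9-bit field selecting a source operand
    // (VGPR, SGPR, inline constant, ...). There is no prefix mechanism.
    return 0;
}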

That said, the scalar ALU is fully capable of doing the things an x86 can do: it can branch anywhere in the 64-bit address space, and even self-modifying code works on GCN.

0 Likes

I understand why it would be painful to incorporate the entire x86-64 instruction set into GCN; I do not know whether ARM differs in that matter.

To yurtesen: the motivation for merging the two types is mainly twofold. One is to save die space; it might be a naive idea that parts of one could be reused in the other, and it might just be the other way around, that it would actually take more wiring to merge the two than to leave them separate. More importantly, however, dynamic allocation of resources to CPU or GPU compute power would do a lot more in the long run. Let's say I have a notebook processor sporting 8-12 GCN-like CUs, which could all function as 8-12 regular x86 (or ARM) cores. When I'm playing games, 1-2 of them run the engine and the others do graphics; when I'm developing on the desktop, 10 compile my app and run IntelliSense while 2 render the desktop, for instance. Naturally this could all be done with separate architectures, but as I see things, there is no viable solution to leveraging GPU compute power unless it's done completely transparently.

This is not at all due to the laziness of us programmers. It's simply that there's no simple API that could serve as a convenient means of programming the GPU. OpenCL is very cumbersome and is only useful at that very last stage of computation when everything boils down to C structs. As far as I can see, with NV's sabotage, OpenCL is ceasing to be a portable API; if I were any big company, I'd fear building on top of it if I intended to target all customers. C++AMP is the closest to being a good solution, but there's no Linux implementation and the spec lacks many things. Just look at the MSDN blog post with the C++AMP wishlist. Still, I'd say it's one of the best.
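
For reference on why C++AMP feels close to "good": a minimal vector-add in it reads like ordinary C++. This sketch assumes MSVC, since, as noted above, there is no Linux implementation:

#include <amp.h>
#include <vector>

// Adds two vectors on whatever accelerator the C++AMP runtime picks.
void vec_add(const std::vector<float>& a, const std::vector<float>& b,
             std::vector<float>& c) {
    using namespace concurrency;
    array_view<const float, 1> av(static_cast<int>(a.size()), a);
    array_view<const float, 1> bv(static_cast<int>(b.size()), b);
    array_view<float, 1> cv(static_cast<int>(c.size()), c);
    cv.discard_data();  // no need to copy c's old contents to the device

    parallel_for_each(cv.extent, [=](index<1> i) restrict(amp) {
        cv[i] = av[i] + bv[i];
    });
    cv.synchronize();   // copy the result back into c
}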

All APIs miss out either on a vendor or on a platform. Or, if things do sort of meet, then it's just outright too hard to implement logic like a compiler would need. (I know there is an OpenCL-based compiler with that watercooled beast machine; we just haven't heard of them since.)

As long as there is no good, standard, portable API out there, nothing's gonna change unless the HW really starts to merge. That is how I see things.

0 Likes

As far as I know, even if all else is possible, no company has gotten the clock speed of the massively parallel chips much above 1 GHz. So you would be faced with running an x86 instruction set for an operating system on a GPU-style chip at 1 GHz instead of a CPU-style chip at 3 GHz. That performance hit can't be overcome in most apps.

0 Likes

I think GCN would probably be very slow if you ran an operating system on it, even if it supported x86 instructions natively. Perhaps it would achieve even worse performance than the first Atom processors, while using more power at the same time. That is why I said it makes no sense.

Do you think a 16-CU GCN would be faster than a 2-core CPU when running an operating system? Why would it be? You have a processor running at 3 GHz, with advanced logic to efficiently handle the single-threaded tasks most programs are made of, compared to a GPU architecture which is relatively dumb and running at 1 GHz. I wouldn't want to exchange one CPU core for 8 CUs. To be honest, even some parallel tasks work quite well on CPUs... There is a reason why OpenCL is only useful in the last stage of a computation: your code would just run very slowly if you ran the whole thing on the GPU, and only certain parts of it benefit from the GPU's architecture.

I think AMD is already working on what you want: it is HSA.

http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture...

You don't really want a single device which can switch between CPU/GPU modes. You want a programming environment which sends work to whichever device is more efficient at solving that specific piece of work.
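
That "send the work to whoever handles it best" idea can already be approximated by hand in OpenCL today. A rough sketch (same OpenCL C API as above; pick_device and the 2^16 threshold are hypothetical, and error handling is omitted) that prefers a GPU for large data-parallel jobs and falls back to the CPU otherwise:

#include <CL/cl.h>
#include <cstddef>

// Hypothetical helper: pick a GPU for big data-parallel jobs, otherwise a CPU.
// HSA aims to make this decision (and the data movement) transparent instead.
cl_device_id pick_device(cl_platform_id platform, std::size_t work_items) {
    const cl_device_type preferred =
        (work_items >= (1u << 16)) ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU;

    cl_device_id device = nullptr;
    if (clGetDeviceIDs(platform, preferred, 1, &device, nullptr) == CL_SUCCESS)
        return device;

    // Fall back to whatever is available if the preferred type is missing.
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr);
    return device;
}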

0 Likes