Will graphics functionality be a part of the CPU instruction set?

You are completely wrong, since all experts say that in six years graphics cards will be obsolete.
I don't think a "graphics" card will be obsolete; it will just have more uses than just graphics. As stream computing intensifies over the years, expect dedicated routines to be offloaded to a VPU/GPU.
We are seeing it with GPGPU and physics; the possibilities of embedding encryption routines on a GPU are endless. Supercomputer workloads can now be handled at a fraction of the cost. In a way, the dedicated graphics processor has matured.

Graphics don't become obsolete; they evolve into something way beyond the current CPU/GPU split.
 
You are completely wrong, since all experts say that in six years graphics cards will be obsolete.

You seem pretty cocksure of yourself, bearing in mind some of the newbie questions you've been asking in other parts of the forum. Please also bear in mind that some of the people who hang around this forum are actually "experts". Maybe they know more than you do?

"Experts" have made many predictions over the years about the shape of the future. Funnily enough, a goodly fraction of them have been wrong. Many of the "experts" spouting on this issue recently actually have an intrinsic bias and are motivated by financial self-interest; their pronouncements are often expressions of wishful thinking rather than predictions.
 
"Experts" have made many predictions over the years about the shape of the future. Funnily enough, a goodly fraction of them have been wrong. Many of the "experts" spouting on this issue recently actually have an intrinsic bias and are motivated by financial self-interest; their pronouncements are often expressions of wishful thinking rather than predictions.

The advent of the 3D accelerator wasn't regarded as something good. Every programmer I knew back then had their doubts at first: 3D accelerators were only good for one or two things, and their own code would always be better. This was from the time when you could hand-code Phong shading and it would run better than doing Gouraud at the hardware level. 3D cards have taken off from there and opened a gap that seems insurmountable in a six-year timeframe, unless of course Intel is cooking up something brilliant.
 
I was going through some articles when I found this statement by Charlie Demerjian:

"if you add in GPU functionality to the cores, not a GPU on the die, but integrated into the x86 pipeline, you have something that can, on a command, eat a GPU for lunch"

If and only if this happens, I think this type of chip will be the only chip a gamer needs.
 
"if you add in GPU functionality to the cores, not a GPU on the die, but integrated into the x86 pipeline, you have something that can, on a command, eat a GPU for lunch"
It's clear that CPU and GPU technologies are slowly merging... it's arguable which is becoming more like the other (personally I think CPUs are going to become more like GPUs, but we'll see ;)).

That statement is pretty vague though. What does "adding GPU functionality to the cores" mean? Make the "cores" SIMD? Adopting their branching and threading model? Changing the memory model to be more like that of GPUs? Adding a rasterizer?

I think you see my point... GPUs *are* the embodiment of "GPU functionality". If that sounds nonsensical to you, I'd argue that so is the original statement.

The following are pretty clear to me (but I'm quite willing to be wrong!):

1) Parallelism is a must, and it's coming from every front. New programming models will be required, and some have already started to emerge.

2) GPUs are already quite parallel, and have solved some of the associated problems. CPUs are a bit behind in that game, but still have a few advantages of their own.

3) Memory models and access are going to play a huge role in the future, and will most likely be the bottleneck rather than math. The current CPU memory model is wasteful and inefficient for parallel code. GPUs have some clever ideas on how to handle this, but algorithms will still have to be written with coherency in mind (see the sketch after this list).

4) Rasterizers are quite handy to have as a piece of hardware (they are fairly inefficiently implemented in software). They can even be useful as a generalized scatter unit for non-graphics work.
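To make point 3 a bit more concrete, here is a minimal C++ illustration of my own (not taken from any vendor's documentation): the same reduction written with a coherent, sequential access pattern and with a strided one. The arithmetic is identical; the access order is what decides whether the memory system can keep up, and it's exactly the kind of thing the algorithm has to be designed around.

```cpp
// Illustration only: identical math, different memory behaviour.
#include <cstddef>
#include <vector>

float sum_coherent(const std::vector<float>& img, std::size_t w, std::size_t h) {
    float s = 0.0f;
    for (std::size_t y = 0; y < h; ++y)      // walk memory in the order it is laid out,
        for (std::size_t x = 0; x < w; ++x)  // so every cache line / DRAM burst is fully used
            s += img[y * w + x];
    return s;
}

float sum_strided(const std::vector<float>& img, std::size_t w, std::size_t h) {
    float s = 0.0f;
    for (std::size_t x = 0; x < w; ++x)      // stride down the columns instead;
        for (std::size_t y = 0; y < h; ++y)  // most of each cache line fetched is wasted
            s += img[y * w + x];
    return s;
}
```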
 
I was going through some articles when I found this statement by Charlie Demerjian:

"if you add in GPU functionality to the cores, not a GPU on the die, but integrated into the x86 pipeline, you have something that can, on a command, eat a GPU for lunch"

If and only if this happens, I think this type of chip will be the only chip a gamer needs.

I wish I could ask just how tightly integrated he thinks it would be.

There are varying degrees of integration; some of which are barely integrated at all.

The least integrated ones (probably not the integration Charlie Demerjian was talking about):

1) The CPU and GPU are on the same chip, with no other changes. If the driver is aware of this, the software layer could route commands and data so that they are kept in cache.

2) The CPU and GPU are on the same chip, and they share some kind of link: either the memory controller is aware of the command traffic and routes loads and stores appropriately, or the chips share a buffer, or they have common cache access.
This is faster and mostly transparent, though it's not really integration to the degree the INQ writer seemed to be describing.

More integrated, but still pretty much separate:
3) Everything from 2, but the x86 core is given a handful of special function calls and escape instructions that can shift a code and data stream to the GPU very quickly (see the sketch after this list). The x86 pipeline has to handle a few extra instructions and some really long-distance branch jumps to the next portion of CPU code.

More integrated still, but both still distinct:
4) Everything from 3, but there is a direct silicon link that can fastpath commands and stores to and from the GPU. The units are still separate, and besides the link, you can easily tell them apart. The x86 pipeline is still essentially independent; the burden is a few extra commands that sit for a while in the ROB, which could be problematic without some special measures or multi-threading.

More integrated, some fuzziness:
5) Everything from 4, but there is a bit of overlap, where some of the FP units in the CPU are shared with the GPU. Some scheduling ports now link to an interface with the GPU.
The integer pipeline is still x86; FP traffic is more complex. Possibly a few extra instructions in SSE5 or whatever.

Highly integrated, still some demarcation:
6) Everything from 5, though the routing instructions may be fewer. The GPU and CPU share more units, and both are able to make calls to a crossbar that allows them to offload instruction packets the other is better at. The two may run from the same cache.

Fully integrated from the ISA standpoint (microcode):
7) A common decoder handles both x86 and GPU instructions, with the GPU becoming a heavy-duty parallel decode path, analogous to a microcode or vector path.
The back end may still show signs of demarcation (go to 9 if no, 10 if yes).

Fully integrated from the ISA standpoint (hardware):
8) The common decoder flits to and from x86 and GPU instructions. The back end may still show specialization (go to 9 if no, 10 if yes).

Fully integrated to a stupid, stupid, stupid, degree:
9) The decoded x86 and GPU micro-ops go wherever they please, to any unit, and take up space in the ROB and scheduling hardware.

Fully integrated, but faking it like a porn actress:
10) The decoded x86 and GPU micro-ops go wherever they please, but the chip really isn't trying too hard to be that good at one of the target workloads, it's just mapping execution however it wants and will get back to you when it's done.
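To give a rough feel for what level 3 might look like from the programmer's side, here is a toy C++ sketch. Everything in it (gpu_escape, gpu_fence, KernelDescriptor) is invented purely for illustration; no such x86 extension, compiler intrinsic, or driver interface exists.

```cpp
// Purely hypothetical sketch of level (3): the x86 instruction stream hands a
// kernel off to the on-die GPU through an "escape" instruction. The escape and
// fence are modelled here as stub functions.
#include <cstddef>
#include <cstdint>

struct KernelDescriptor {        // what the escape instruction would point at
    std::uint32_t program_id;    // which GPU program to run
    const void*   input;         // source data, ideally already in a shared cache
    void*         output;        // where the results land
    std::size_t   elements;      // amount of work
};

// Stand-ins for the special instructions: in hardware each would be a single
// opcode that the x86 pipeline issues and then mostly forgets about.
inline void gpu_escape(const KernelDescriptor&) { /* queue the descriptor for the GPU */ }
inline void gpu_fence()                         { /* stall until the GPU reports completion */ }

void process(const float* in, float* out, std::size_t n) {
    KernelDescriptor k{42u, in, out, n};  // 42 is an arbitrary, made-up program id
    gpu_escape(k);   // fire-and-forget dispatch from the x86 stream
    // ... the CPU keeps executing independent x86 code here ...
    gpu_fence();     // long-latency wait: the kind of instruction that sits in the ROB (level 4)
}
```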


Change at the instruction decode level is not to be taken lightly. The x86 decoders are already monstrous, and the ISA is pretty bloated. No function, not even a routing one, should be added unless you are sure it's going to be used by a lot of users for a long time.
Merging decoders is downright "My name's Janet: Miss Jackson if you're nasty" nasty.
Creating a rift in x86 between the classes of chip is also a risky bet.


Merging scheduling is suicidal. The P4's ROB is ~120 micro-instructions with two thread contexts. A single cluster in the G80 has the context of hundreds of threads, with hundreds upon hundreds of in-flight instructions, any one of them being massively long-latency.
The integer x86 side is going to be stalled stupid, so it's unlikely scheduling will merge.


Merging units: might happen, sort of.
Some specialty hardware might be good to go. The CPU could borrow a SQRT unit or some transcendental functions that take hundreds or thousands of cycles anyway.

More than that, and the question arises about how bypass and communications will work at high speeds.

The CPU needs high clocks; it can't rely on data parallelism to just make work appear. The tighter it is coupled to the GPU, the more it suffers.
Even the moderately integrated approaches make the probability of being CPU-limited higher.

The GPU needs wide fetch, wide data, and a lot of scheduling context. It tolerates huge latencies, because there's always another pixel to work on.
The more it has to cater to the CPU, the less it has to work with.

Sharing caches is problematic, especially if it is a shared L1; even one additional port can be painful. Lower, slower levels are more likely candidates, with the first level that is already shared between all cores being the most likely. However, doing this means that the GPU and CPU cores stay rather distinct.

Sharing a socket: very painful for a top GPU.
The memory bandwidth will never be what it could be if the GPU had its own card.
Whatever can be done to increase the bandwidth of a socket can just as easily be done to increase it for a board, and the board can do it better.

Since CPUs and GPUs differ in their market segmentation and their need for changeable memory configurations, CPUs aren't likely to give up on having a socket.

If I had my way, I'd have a full-size video card that also plugged into a Torrenza socket, just to get the (relatively) best of both worlds.
 
Although I didn't understand most of what you said (gotta admit), you seem to know what you're talking about. So I gather that what Charlie said is just imagination. Perhaps adding more instructions would be better; a year back, Phil Hester said:

"When referring to the future goals for AMD's architecture, the only example Phil Hester provided for FPU Extensions to AMD64 was the idea of introducing extensions that would accelerate 3D rendering. We got the impression that these extensions would be similar to a SSEn type of extension, but more specifically focused on usage models like 3D rendering.

Through the use of extensions to the AMD64 architecture, Hester proposed that future multi-core designs may be able to treat general purpose cores as almost specialized hardware, but refrained from committing to the use of Cell SPE-like specialized hardware in future AMD microprocessors. We tend to agree with Hester's feelings on this topic, as he approached the question from a very software-centric standpoint; the software isn't currently asking for specialized hardware, it is demanding higher performance general purpose cores, potentially augmented with some application specific instructions."
 
Although I didn't understand most of what you said (gotta admit), you seem to know what you're talking about. So I gather that what Charlie said is just imagination.
If you have any questions, I can try to answer them. I'm not a professional in the field, so there may be some angles I have missed.

What Charlie says is technically possible, but it takes more than specialized instructions to beat a GPU at its own game. GPUs and CPUs handle different tasks, and they have design trade-offs that are often the opposite of what makes the other one perform better.

It's not that the situation is gloom and doom, but it is better not to get too enthusiastic and overlook the very real challenges to making either a CPU or GPU perform well.

Mashing them together without great care would have the effect of making a big chip that isn't very good at being either a GPU or CPU.

"When referring to the future goals for AMD's architecture, the only example Phil Hester provided for FPU Extensions to AMD64 was the idea of introducing extensions that would accelerate 3D rendering. We got the impression that these extensions would be similar to a SSEn type of extension, but more specifically focused on usage models like 3D rendering.

Through the use of extensions to the AMD64 architecture, Hester proposed that future multi-core designs may be able to treat general purpose cores as almost specialized hardware, but refrained from committing to the use of Cell SPE-like specialized hardware in future AMD microprocessors. We tend to agree with Hester's feelings on this topic, as he approached the question from a very software-centric standpoint; the software isn't currently asking for specialized hardware, it is demanding higher performance general purpose cores, potentially augmented with some application specific instructions."

Adding even a few instructions can mean a lot of things. There could still be some kind of integration or reorganization going on, but it's hard to say until we know what kind of instructions they are.
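For a rough sense of what "SSEn-style extensions focused on 3D rendering" would be building on, here is what a small 3D-math kernel looks like with the SSE intrinsics that already exist today. This is plain SSE, my own example, not anything Hester announced; a rendering-oriented extension would presumably add more instructions in this vein (a one-instruction dot product, for instance).

```cpp
// Standard SSE, not a hypothetical SSEn: a 4-component dot product, the kind
// of small 3D-math kernel such extensions would target.
#include <xmmintrin.h>   // SSE intrinsics

float dot4(const float* a, const float* b) {   // a and b each point to 4 floats
    __m128 va = _mm_loadu_ps(a);               // load 4 floats from a
    __m128 vb = _mm_loadu_ps(b);               // load 4 floats from b
    __m128 m  = _mm_mul_ps(va, vb);            // component-wise multiply

    // Horizontal add of the four products (SSE1 has no single horizontal-add).
    __m128 shuf = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(m, shuf);
    shuf        = _mm_movehl_ps(shuf, sums);
    sums        = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```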

The last part with some cores being "almost specialized" could mean some kind of merger of GPU and CPU units or a reorganization of some designs. Without details, I can't say.
 
Although I didn't understand most of what you said (gotta admit), you seem to know what you're talking about. So I gather that what Charlie said is just imagination.

What if they both, equally, appear to know what they're saying?

Making bold claims about the future some experts predict, in front of a forum full of experts in the field, is rather like sticking your tongue in a light socket: vaguely amusing but mostly inadvisable. So, please, join the rest of us lurkers who enjoy nibbling around the fringes but don't have enough invested time and knowledge to contribute without resorting to quoting more apt individuals. It'll save the annoyance of the lights flickering constantly.

Yes, I'm a big meanie.
 
What's the use of putting 4 or more x86 cores on the same die? None (well, except for servers with Opterons, perhaps). What better to add than a GPU to make it worth more money? Although splitting things up and adding GP vector units sounds even better. Like Cell, or unified GPUs. And getting rid of MMX, SSE, FP, and lots of cache on the CPU parts. The area left over can be used as local storage.
 
What's the use of putting 4 or more x86 cores on the same die? None (well, except for servers with Opterons, perhaps).
x86 manufacturers seem to mostly agree; they've already stated they are aiming to add different kinds of cores once they reach 4 x86 cores.

Although splitting things up and adding GP vector units sounds even better.
How many general-purpose vector machines have you ever seen? Being vector-based is, in my book, a sign that a chip might not be general-purpose.

Like Cell, or unified GPUs.

Like Cell or like an SPE? Cell is a combination of parts that adds up to general purpose. Its components aren't all equally flexible.

I also don't think you can qualify unified GPUs as being general-purpose either.

And getting rid of MMX, SSE, FP, and lots of cache on the CPU parts. The area left over can be used as local storage.

The first three probably aren't going to leave if we're talking about the x86 cores. Backwards compatibility remains paramount.
The cache isn't likely to go anywhere either. All those extra units need to be fed, and cache is still needed if the CPUs are to keep from backsliding in performance. Future chips can't go backwards on the workloads current cores are already good at. Backwards compatibility is a burden the x86 cores must handle, so they can't do worse on older apps.
 
The first three probably aren't going to leave if we're talking about the x86 cores. Backwards compatibility remains paramount.
Not to mention that the actual execution units are only a modest fraction of a modern core.

The cache isn't likely to go anywhere either. All those extra units need to be fed, and cache is still needed if the CPUs are to keep from backsliding in performance. Future chips can't go backwards on the workloads current cores are already good at. Backwards compatibility is a burden the x86 cores must handle, so they can't do worse on older apps.

And not only for backwards compatibility. More cache will equal more performance in the future. With umpteen cores on the die saturating the main memory bus, cache is going to be the primary way to alleviate contention.

Cheers
 
More cache? For which architecture? The old Intel NetBurst one? AMD processors don't gain much, if anything, by adding more than about 1 MB of cache.

Instead of cramming 4 x86 cores and a whopping amount of cache on the chip, you're better off with only 2 cores and using all the area left over for main memory. THAT will speed things up significantly.
 
G80 exposes itself to the programmer as having only scalar units, which makes it significantly easier to program efficiently. So I'm not sure what you're thinking of here... :)


Uttar

I don't understand what you are saying. How do scalar units change the way you program your application? How does the G80 expose itself?
 
GPUs and CPUs are the same, just with emphasis on different abilities. CPUs are optimized for single threads and GPUs for many. CPUs have full-featured, high-precision math capabilities, while GPUs have more limited, lower-precision math. GPUs require tonnes of bandwidth, and CPUs require significantly less. There is nothing stopping someone from developing a CPU with GPU-like features or a GPU with CPU-like features. Only cost, complexity, and imagination prevent these two from becoming one and the same.
 
I don't understand what you are saying. How do scalar units change the way you program your application? How does the G80 expose itself?
Programmer might be the wrong term; I meant that the person interested in optimizing a shader for the G80 would take into consideration the fact that there is no need to try to make the code vectorizable. Techo+'s assumption was that an architecture designed around vectors is easier to program for, and that's all I was refuting. Obviously, a language with native vector support, swizzling, etc. might be easier to program for if that's the kind of thing you're using it for, but that's another thing completely, as it might also support scalars.
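As a toy illustration of that point (my own example, not real shader or G80 code): the dot product below is written completely naively. A 4-wide vector ISA pushes you toward the kind of packing and shuffling seen in the SSE sketch earlier in the thread, whereas on scalar hardware the straightforward form below already maps cleanly onto the ALUs, so the programmer never has to think about vectorization.

```cpp
// Naive scalar form: perfectly fine for a scalar architecture, no packing,
// swizzling, or horizontal adds required.
float dot4_scalar(const float* a, const float* b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```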


Uttar
 
Instead of cramming 4 x86 cores and a whopping amount of cache on the chip, you're better off with only 2 cores and using all the area left over for main memory. THAT will speed things up significantly.

Only 2 cores? The more cores the better; enter virtualization. That path has already been set for the near future, not only on x86 but also on other platforms.

AMD's future GPU integration is the simplest of forms, of course: integrating a small VPU on the die, sharing the same crossbar for memory access. It's mostly a cost reduction, as it takes away the costs of packaging and implementing a separate VPU on a laptop board. But for the next year or so, don't expect any benefit other than the fact that the VPU doesn't have to access memory through a northbridge.

What you can do with multiple cores and a suitably compiled environment is dedicate threads to specific subsystems: one core assisting the GPU calls, one core handling physics, one core handling AI, and so on.
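As a rough sketch of that idea (the subsystem functions and the division of labour are invented for illustration, and pinning threads to particular cores is left out), with ordinary C++ threads it might look like this:

```cpp
// Sketch: dedicate one worker thread (and, ideally, one core) per subsystem.
// The subsystem bodies are placeholders.
#include <atomic>
#include <thread>

std::atomic<bool> running{true};

void drive_renderer() { while (running) { /* build and submit GPU work */ } }
void step_physics()   { while (running) { /* advance the physics simulation */ } }
void step_ai()        { while (running) { /* run AI / game logic */ } }

int main() {
    std::thread render_thread(drive_renderer);   // one core assisting the GPU calls
    std::thread physics_thread(step_physics);    // one core for physics
    std::thread ai_thread(step_ai);              // one core for AI
    // ... main loop, input handling, etc. on the remaining core ...
    running = false;
    render_thread.join();
    physics_thread.join();
    ai_thread.join();
    return 0;
}
```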

I don't think memory bandwidth for CPUs is scaling in step with the memory demand created by ever-increasing CPU logic, hence the push for more cache in the near future.
 
More cache? For which architecture? The old Intel NetBurst one? AMD processors don't gain much, if anything, by adding more than about 1 MB of cache.

For most user applications and popular benchmarks, this is true. Server loads and other programs with large working sets can still benefit.
As more cores are placed on-die, it's also helpful to have more cache to keep memory coherence traffic down.

That's probably one of the reasons AMD is adding a shared L3 to its server chips.

Instead of cramming 4 x86 cores and a whopping amount of cache on the chip, you're better off with only 2 cores and using all the area left over for main memory. THAT will speed things up significantly.

Main memory? That's way far away on the motherboard. It would be difficult to get as much main memory onto the die as most programs use. Vista insists on gobbling most of a gig of RAM, and that's not going on-die.
 
Main memory? That's way far away on the motherboard. It would be difficult to get as much main memory onto the die as most programs use. Vista insists on gobbling most of a gig of RAM, and that's not going on-die.

Well... you could add it in a smaller portion, make it run at the same speed as the core, and consider it cache :(
 
Well, if you add 4 × 1 MB of L2 cache and 1 × 4 MB cache, you could consider using 8 MB of local storage instead, paging to main memory in 4 kB pages with DMA. It wouldn't change the virtual memory size. And with only two processors, you might be able to add 16 MB. But you would have to write your programs with that in mind to get any benefits from it.
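As a toy sketch of what programming against that kind of software-managed local store might look like (memcpy stands in for the DMA engine; the sizes, names, and placeholder work are my own):

```cpp
// Toy sketch: explicitly page 4 kB blocks between far-away main memory and a
// small, fast local store, doing all the work out of the local copy.
#include <cstddef>
#include <cstring>

constexpr std::size_t PAGE = 4 * 1024;               // 4 kB pages, as suggested above
alignas(64) static unsigned char local_page[PAGE];   // one resident page, for simplicity

// Placeholder for the real computation, working entirely out of local storage.
void process_page(unsigned char* page, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) page[i] += 1;
}

void run(unsigned char* main_memory, std::size_t bytes) {
    for (std::size_t off = 0; off < bytes; off += PAGE) {
        std::size_t n = (bytes - off < PAGE) ? bytes - off : PAGE;
        std::memcpy(local_page, main_memory + off, n);    // "DMA in"
        process_page(local_page, n);                      // compute from the local store
        std::memcpy(main_memory + off, local_page, n);    // "DMA out"
    }
}
```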

I agree: improving and adding cache is better for now.
 