Software Rendering Will Return!

I purposefully noted northbridge or combination northbridge/southbridge.
It's not like A64 motherboards suddenly lost their chipsets after the architecture picked up an IMC.
Ok, there's a little mixup in northbridge/southbridge definitions. Intel calls the northbridge the Memory Controller Hub, and when it has an IGP it's the GMCH. Clearly all of that is going to get integrated into the CPU. The southbridge as I know it is going to stick around though.
Depends on what you consider a core. The way they are set up now, with the exception of the multi-GPU setups, they are not truly multicore.
You can parallelize 3D rendering at many different levels. Multi-GPU setups deal with very low bandwidth between the 'cores'. CPUs on the other hand are increasing the inter-core bandwidth and lowering the latency. So they become a lot more like (single-die) GPUs. GPUs on the other hand get tasked with new workloads, so they have to behave more like independent cores. As long as they are converging, I don't see a point in what you're saying.
The question is how much, and here is where diminishing returns comes in.
For the bulk of the portable and desktop markets, 8-way symmetric multicore is an utter waste of time right now and will be incredibly wasteful in the future.
ALU density for those cores will be an order of magnitude less than what more specialized designs produce.
The slides comparing Larrabee and Sandy Bridge (Gesher) show Larrabee with 24 cores capable of 8-16 DP SSE ops, while Sandy Bridge shows 4-8 cores capable of 7 DP SSE ops.
Even factoring in Larrabee's lower core clock doesn't change the fact that Larrabee's core count is 3-6 times that of Sandy Bridge and each core is capable of slightly more to double (likely closer to double) the throughput.
It's not about having high efficiency. It's about being able to deliver adequate performance for the task at hand. The CPU is not the most efficient at a lot of things, but it's adequate at most anything, making it very cost effective. I can't see why in the not too distant future 3D can't be part of that.
The increasing influence of the laptop market is going to severely impact the prevalence of octo-core general purpose only chips.
I think you're mistaken. People play the same games on laptops as they do on desktops. Heck, there are even desktop replacements with desktop CPUs. And there are lots of other applications for which people want laptops that are not trailing too far behind desktops performance-wise. And given that the Pentium M formed the basis of the Core 2 (which then made it back to mobile), there's clearly a tight connection between the laptop and desktop market, and performance/watt is just as important for both.
You are advocating the removal of specialized hardware. You are taking away a world of choice.
Quite the contrary. First of all, even in my first post I tried to make it clear that I don't expect GPUs to disappear. Secondly, I'm adding an option by making applications run that would otherwise not be supported at all. And I'm hoping that one day people will have the option of buying a system capable of meeting minimal 3D demands without paying for extra hardware.
Here's where we'll probably have to differ.
The return on investment is not going to be all that great with 2-4 billion extra transistors thrown into an extra 4 general purpose cores.
The incremental gain of a single modest streaming core with less footprint than a single CPU will be significantly better.
Like I said before, today an octa-core isn't all that interesting. But let's have this conversation again in a few years when the majority of software is highly threaded. The adoption of a new ISA or API for a streaming processor, likely from different vendors, isn't all that attractive. The industry as a whole loves constant evolution, not radical changes. x86 never got replaced, instead it evolved. And AGEIA's software is a hundred times more popular than its hardware...
I don't see how. There's little if any point for abstraction or dynamic recompilation to target a core if all the cores are the same general-purpose core.
Multi-core programming is hard. Developers need abstractions and stream processing is just one of them, among many that the CPU supports. Dynamic compilation is essential for eliminating branches and making CPUs much faster at 'fixed-function' processing. It's an optimization by specializing for semi-constants that are only known at run-time.
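A minimal sketch of what I mean by specializing for semi-constants (illustrative names, plain C++; a dynamic recompiler goes further and emits the specialized loop at run time, so the branch disappears entirely rather than merely being hoisted out of the pixel loop):

```
#include <cstddef>
#include <cstdint>

enum class BlendMode { Opaque, Alpha };

template <BlendMode MODE>
static void fillSpan(std::uint32_t* dst, const std::uint32_t* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (MODE == BlendMode::Opaque)   // resolved at compile time for each instantiation
            dst[i] = src[i];
        else                             // cheap 50/50 blend, no per-pixel branch survives
            dst[i] = ((dst[i] >> 1) & 0x7F7F7F7F) + ((src[i] >> 1) & 0x7F7F7F7F);
    }
}

void drawSpan(std::uint32_t* dst, const std::uint32_t* src, std::size_t n, BlendMode mode) {
    // The blend mode is a semi-constant: it's only known at run time, but it's fixed
    // for the whole span, so pick the specialized loop once per span, not once per pixel.
    if (mode == BlendMode::Opaque) fillSpan<BlendMode::Opaque>(dst, src, n);
    else                           fillSpan<BlendMode::Alpha>(dst, src, n);
}
```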
This future will be delayed or possibly cancelled 2H 2009, by both AMD and Intel.
Do you have any sources or arguments for that?
It's more fascinating to watch the generalized hardware waste most of those billions of cycles and pump out hundreds of watts for no return.
Jack of all trades, master of none, though ofttimes better than master of one.

Besides, my GPU is doing diddly-squat 95% of the time. This percentage varies between users, so clearly there are bound to be people for whom it makes a lot of sense to do 3D in software, no matter how inefficient it may be. And for many GPGPU workloads the GPU isn't exactly an example of efficiency either, requiring expensive graphics cards to beat single-threaded unoptimized C code.
You'd rather have those octocores hammering away at software rendering that could have been handled adequately by a GPU 4 years prior using a fraction of the transistors and you're lecturing on optimality?
I'd rather have those octa-cores hammering away at software rendering than having no 3D at all due to not having a GPU, not having a GPU supporting necessary features, broken GPU drivers, extremely weak IGPs, etc. So yes, optimization is key. I also wouldn't hesitate for a second when choosing between an optimized physics engine or A.I. library versus forcing my whole audience to buy extra or more powerful hardware. Laptops are often not upgradable at all, so sooner or later the specialized hardware is no longer going to be able to run newer applications, even if they're not so demanding. Tons of casual games already use DirectX 9 features today, or would like to...
 
Please specify.

What's there to specify exactly? I've tried both SW renderers several times on different setups and the output is lightyears apart from what I was used to from the very release of the game. It's of course great when s.o. doesn't have any other alternative, but I personally wouldn't bother seriously playing even a game as old as UT2k4 even on an IGP.

I only mentioned Unreal Tournament 2004 because that's a game that now, four years later, runs perfectly in software. That's performance-wise. Feature-wise, we're already four years later than that and we're certainly not limited to DirectX 7/8 games.
Which doesn't come as a surprise since UT2k4 is so dreadfully CPU bound on today's high end systems/GPUs, that even if you go all the way up to 1920*1200+AA/AF you still have fillrate to spare.

You don't even need a 3D capable graphics card.
For Oblivion? I was under the impression that it's an SM2.0 game.
 
What's there to specify exactly? I've tried both SW renderers several times on different setups and the output is lightyears apart from what I was used to from the very release of the game.
For a start, please specify "lightyears apart".
I personally wouldn't bother seriously playing even a game as old as UT2k4 even on an IGP.
With all due respect this isn't about how you "personally" like to play games. A lot of people are willing to pay only for an IGP, and if they'd get the option to save even that cost by using a software renderer some of them would.
Which doesn't come as a surprise since UT2k4 is so dreadfully CPU bound on today's high end systems/GPUs, that even if you go all the way up to 1920*1200+AA/AF you still have fillrate to spare.
Your point being?

The only thing I conclude is that you don't need anything more to run Unreal Tournament 2004. It's still the same fun game. If you squint your eyes there's hardly any difference with Unreal Tournament 3. Not everyone feels the urge to upgrade regularly to keep up with advances in eye candy. Proof of that is the Wii selling like hot cakes. It might not be your personal idea about gaming but this is where the market is heading.
For Oblivion? I was under the impression that it's an SM2.0 game.
It is.
 
It's not about having high efficiency.
Then where have I been for the last 5 years of CPU and GPU design?

It's about being able to deliver adequate performance for the task at hand. The CPU is not the most efficient at a lot of things, but it's adequate at most anything, making it very cost effective. I can't see why in the not too distant future 3D can't be part of that.
The CPU is adequate at most anything the developers haven't given up running on it.
If a design is efficient, you get more with less.
Physical constraints on power, die size, diminishing returns, and other limitations mean that inefficient use of transistors and power leads to suboptimal products.

I think you're mistaken. People play the same games on laptops as they do on desktops. Heck, there are even desktop replacements with desktop CPUs. And there are lots of other applications for which people want laptops that are not trailing too far behind desktops performance-wise. And given that the Pentium M formed the basis of the Core 2 (which then made it back to mobile), there's clearly a tight connection between the laptop and desktop market, and performance/watt is just as important for both.
Which is why we should expect a Skulltrail laptop in the Fall?
If performance/watt is just as important, then the octocore still loses.
The static leakage of the chip alone keeps it from mainstream laptop usage, and desktop replacement does not count in that category.

If the laptop market is as important as you say, then I'd expect AMD and Intel's addition of GPUs to their mobile products would tell you something.

Quite the contrary. First of all, even in my first post I tried to make it clear that I don't expect GPUs to disappear. Secondly, I'm adding an option by making applications run that would otherwise not be supported at all. And I'm hoping that one day people will have the option of buying a system capable of meeting minimal 3D demands without paying for extra hardware.
AMD and Intel will have something for you next year.

Like I said before, today an octa-core isn't all that interesting. But let's have this conversation again in a few years when the majority of software is highly threaded.
Define a few, and tell me why the emergence of tools for highly parallel programming cannot also improve the performance of parallel cores.

The adoption of a new ISA or API for a streaming processor, likely from different vendors, isn't all that attractive.
We can do what AMD and Intel do and have multiple vector ISAs for x86 that vary from vendor to vendor and chip to chip.

The industry as a whole loves constant evolution, not radical changes. x86 never got replaced, instead it evolved. And AGEIA's software is a hundred times more popular than its hardware...
The industry as a whole loves single-threaded performance. Physics stomped on that.
Physics will still be a concern for reckless expenditure of transistors.


Do you have any sources or arguments for that?
Nehalem's graphics variant (perhaps even 1H 2009) and Fusion.
The new baseline for the mainstream market will be dual cores with an IGP or GPU included or on-die PCI-E for peripheral devices to interface with the CPU.

That will be far more widespread than octocores. Affixing graphics/streaming to the die will also mean the CPU vendors will have an incentive to maintain continuity in their 3d software stack, to provide at a bare minimum the same kind of compatibility they provide for SSE support (or did, until the SSE4+ crap).

Jack of all trades, master of none, though ofttimes better than master of one.
When all you have is a hammer, everything starts to look like a nail.

Besides, my GPU is doing diddly-squat 95% of the time. This percentage varies between users, so clearly there are bound to be people for whom it makes a lot of sense to do 3D in software, no matter how inefficient it may be. And for many GPGPU workloads the GPU isn't exactly an example of efficiency either, requiring expensive graphics cards to beat single-threaded unoptimized C code.
However often the GPU is idle is irrelevant considering how even a modest GPU can deliver significantly better performance in a number of common applications.
With CPUs often hitting IPCs <1, that's the same as having 2/3 to 3/4 of the chip's performance off the table.
I could shave off one or two mostly idle CPU cores and put in more specialized hardware that can do about 10 times that in the same footprint.

That gain is not something designers have missed, and both AMD and Intel have brought up the eventual need for more heterogeneous solutions because of diminishing returns.

I'd rather have those octa-cores hammering away at software rendering than having no 3D at all due to not having a GPU, not having a GPU supporting necessary features, broken GPU drivers, extremely weak IGPs, etc.
A six-core design with one on-die GPU would be 3/4+ as effective as the octocore (thanks to diminishing returns) on general apps and likely two to three times as effective on apps suited for the GPU.

A quad-core with attached GPU would be better than half as effective as the octocore, and then 2-3 times better on anything running on the GPU.

The marginal cost to the user for omitting a few extra CPU cores is relatively small compared to the potential payoff on workloads that suit the GPU (one that would probably save die space in the end).

So yes, optimization is key. I also wouldn't hesitate for a second when choosing between an optimized physics engine or A.I. library versus forcing my whole audience to buy extra or more powerful hardware. Laptops are often not upgradable at all, so sooner or later the specialized hardware is no longer going to be able to run newer applications, even if they're not so demanding. Tons of casual games already use DirectX 9 features today, or would like to...

I can run new apps on many-years old graphics cards.
APIs are far more stable, so it is likely a decent on-die GPU or stream processor will serve for the effective lifespan of the CPU cores.
As much as you like the infinite features software rendering provides, the most optimal software renderer will not make a Pentium III a viable option because the performance is not there.
The lack of raw performance is a consideration, and you can't optimize below the bare minimum of necessary work.
 
For a start, please specify "lightyears apart".

Kindly let me know if I missed anything in terms of settings for the Pixo renderer and ignore for now resolutions and IQ improving (GPU specific) features used in the 2nd case:

http://users.otenet.gr/~ailuros/Pixo.jpg

http://users.otenet.gr/~ailuros/normal.jpg

I did of course put that scale2x thingy on false and filtering quality on "3", but what appears on my screen in the first case is lightyears apart from the second case.

With all due respect this isn't about how you "personally" like to play games. A lot of people are willing to pay only for an IGP, and if they'd get the option to save even that cost by using a software renderer some of them would.
The average gamer won't resort to a software renderer anyway, and an IGP can nowadays play UT2k4 with much better IQ and settings, and that at least in 1024*768.

Your point being?
My point is that there isn't anything particularly unexpected about the fact that a recent high end CPU can render a game like UT2k4 in SW mode. When the game was released it was playable on a Ti4400 and a high end CPU at 1280*960 with 2xAA/4xAF (the texturing stage optimisation for AF included). The example above is with completely unoptimised 16xAF at 1920*1200 with 16xS (4xRGMS + 4xOGSS), and I enabled supersampling only because I wouldn't know what else to invest the spare fillrate in. With such a giant leap over the past 4 years on the high end GPU front, and the game being completely CPU bound on today's high end systems, where exactly is the "impressive" part if a CPU by itself can yield playable performance at barely 800*600 with the above output?

The only thing I conclude is that you don't need anything more to run Unreal Tournament 2004.
Do I? I apologize up front if the above result is inaccurate and s.o. can get better output out of a SW renderer from UT2k4; yet if it's better, just by how much exactly? And if in any other case the above is all there is, that IS Voodoo3 material. If games didn't advance in IQ through the years we wouldn't have needed any GPUs from the get-go.


It's still the same fun game. If you squint your eyes there's hardly any difference with Unreal Tournament 3.
If I squint my eyes I could also say the same for the original Unreal Tournament.

Not everyone feels the urge to upgrade regularly to keep up with advances in eye candy.
For that minority of gamers (which I wouldn't even call casual gamers anyway) SW rendering isn't exactly "returning". As I said it has always been there and never went away. And just in case you've missed it, there aren't any spectacular changes or innovative ideas in terms of gameplay in newer games. Whatever "evolutions" occur are mostly in the direction of IQ; and yes, exaggerations and redundancy are perfectly debatable there, but if I dumb down settings in Crysis nowadays to get it playable I get a FarCryOnSteroids with no really remarkable changes in gameplay compared to FarCry.

Finally, you get many times more out of a balanced system with a midrange CPU coupled with a midrange GPU for games, and both together will still cost less than a high end CPU.

Proof of that is the Wii selling like hot cakes. It might not be your personal idea about gaming but this is where the market is heading.
Proof of what exactly? The console market is a totally different animal than the desktop standalone market. What you pay is what you get anyway, and if I really were to buy a console today I wouldn't buy anything less than an Xbox 360 and use it on a low resolution CRT monitor.

If PDA/mobile devices weren't tailored for small screens/low resolutions, even a high clocked MBX would leave a Wii's potential behind, and that's the other direction another market is running in, with a many times higher projected growth for the foreseeable future than either the console or desktop standalone market. The real question would be why, in a market where power consumption and die size are the most critical of them all, the definite choice for now still seems to be high performance 3D accelerators.

Well if a software renderer truly does what I can see above to a game like UT2k4 then I shudder at the thought what Oblivion would look like in an equivalent case.
 
Then where have I been for the last 5 years of CPU and GPU design?
By your logic we'd absolutely need dedicated hardware for everything. But instead everything is pointing towards integration and unification. You're going to lose efficiency at specific tasks no matter what. But thanks to Moore's law things still get faster overall. And that's how GPUs are evolving too. It's all about the overall ability to adequately support many different workloads and optimize performance / dollar for all of them combined. For someone to whom 3D graphics is not a priority software rendering is rapidly becoming a viable option.
The CPU is adequate at most anything the developers haven't given up running on it.
If a design is efficient, you get more with less.
Physical constraints on power, die size, diminishing returns, and other limitations mean that inefficient use of transistors and power leads to suboptimal products.
Suboptimal from a hardware point of view doesn't mean suboptimal from a system point of view (hardware + software) and/or commercial point of view.
We can do what AMD and Intel do and have multiple vector ISAs for x86 that vary from vendor to vendor and chip to chip.
Are you talking about 3DNow! and SSE4/5? History shows that only one ISA extension survives and the other has to adopt it. Bridging the differences is no big deal as long as you can get the same end result. Differentiation is an obvious result of trying to find out what people want and trying to have an edge over the competition. But for every feature it's only a temporary effect.
The industry as a whole loves single-threaded performance. Physics stomped on that.
Physics will still be a concern for reckless expenditure of transistors.
The industry loved single-threaded performance until around the Pentium 4 Prescott. Besides, adding a PPU is not exactly a single-threaded solution. Developers quickly realized that multi-core is a great idea and more importantly it's coming to every system.
Nehalem's graphics variant (perhaps even 1H 2009) and Fusion.
The new baseline for the mainstream market will be dual cores with an IGP or GPU included or on-die PCI-E for peripheral devices to interface with the CPU.
Like you say, they will be variants. They won't become the new baseline because there will still be all sorts of GPU combinations. Like I said before it's not a great idea to let an IGP do anything other than graphics.
With CPUs often hitting IPCs <1, that's the same as having 2/3 to 3/4 of the chip's performance off the table.
IPC has vastly increased again since the Pentium 4 was ditched. Also, (next-gen) HyperThreading ups ALU usage once more. I personally have found it quite easy to reach IPCs well over 1 with a tiny bit of optimization effort. Especially for streaming workloads.
A six-core design with one on-die GPU would be 3/4+ as effective as the octocore (thanks to diminishing returns) on general apps and likely two to three times as effective on apps suited for the GPU.
Here's another option: a Sandy Bridge quad-core cheaper than your six-core+IGP and offering more GFLOPS than a GeForce 8800 GTS 320.
I can run new apps on many-years old graphics cards.
I can't run any DirectX 9 application on a two-year-old laptop with an 855GM. Its Pentium M can though.
As much as you like the infinite features software rendering provides, the most optimal software renderer will not make a Pentium III a viable option because the performance is not there.
The lack of raw performance is a consideration, and you can't optimize below the bare minimum of necessary work.
Who's talking about a Pentium III? I'm talking about four times the cores, four times the clock frequency and four times the SIMD width. I don't know your definition of raw performance, and frankly I don't need to know, that's plenty to have systems without a graphics chip and sell them to half of the people who previously bought IGPs.

In the Pentium III era performance density wasn't all that great. But with increasing transistor budgets things have gotten a lot better. The area spent on decoding x86 and handling O.S. tasks is getting smaller. There's no reason why CPUs can't reach comparable performance density. They call it Larrabee and it's definitely going to influence future mainstream CPUs. GPUs on the other hand still have some steps to take toward complete general-purpose programmability.

A gather instruction would be the end of IGPs and make the CPU the primary choice for stream processing.
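To illustrate (a rough SSE sketch with illustrative names): without gather, fetching four values addressed by an index vector has to be assembled from four scalar loads plus packing, which is exactly the "multiple load instructions" route; a gather instruction would collapse this into a single operation, which is what texture sampling and most stream workloads need constantly.

```
#include <xmmintrin.h>  // SSE

__m128 gather4(const float* base, const int idx[4]) {
    // Four scalar loads plus packing; _mm_set_ps takes the highest element first.
    return _mm_set_ps(base[idx[3]], base[idx[2]],
                      base[idx[1]], base[idx[0]]);
}
```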
 
The industry loved single-threaded performance until around the Pentium 4 Prescott. Besides, adding a PPU is not exactly a single-threaded solution. Developers quickly realized that multi-core is a great idea and more importantly it's coming to every system.

I am not sure which developers you have talked to. So far I have found no software engineer in our industry who really thinks that multi-core is a great idea. It's more like: "If I can't get more power with a single core I take a second one. It's better than nothing."
 
Kindly let me know if I missed anything in terms of settings for the Pixo renderer and ignore for now resolutions and IQ improving (GPU specific) features used in the 2nd case:
I could give you another rant about how that's a four year old software renderer and how software rendering has absolutely no limitations in terms of features, but maybe a screenshot says it better: ImageShack. Sorry, I don't have a 1920x1200 screen. And that's a one year old version of a software renderer included in a totally different product, in case you care.
I apologize up front if the above result is inaccurate and s.o. can get better output out of a SW renderer from UT2k4...
I'm sorry but who or what is "s.o."?
...yet if it's better, just by how much exactly? And if in any other case the above is all there is, that IS Voodoo3 material.
Radeon X800 material I'd say. But it doesn't matter, anything is possible in due time. Larrabee would be screwed if that wasn't a fact.
If games didn't advance in IQ through the years we wouldn't have needed any GPUs from the get-go.
We only need hardware for a bump in performance. Software rendering IQ can range from anything between Quake and Finding Nemo.
For that minority of gamers (which I wouldn't even call casual gamers anyway) SW rendering isn't exactly "returning". As I said it has always been there and never went away.
Except that before casual games used DirectDraw and now they use Direct3D 9. And I definitely consider people playing World of Warcraft and Spore casual gamers.
Well if a software renderer truly does what I can see above to a game like UT2k4 then I shudder at the thought what Oblivion would look like in an equivalent case.
The same as hardware rendered.
 
I am not sure which developers you have talked to. So far I have found no software engineer in our industry who really thinks that multi-core is a great idea. It's more like: "If I can't get more power with a single core I take a second one. It's better than nothing."
I won't deny that there is a learning curve. But once you're past that you take a dual-core over another 200 MHz any day. Ask the video processing guys.

And like I just said something like a PPU isn't exactly a single-core solution either. You deal with the same concurrent processing issues. There's no escaping that.

And I strongly believe that in time, with new tools and languages, multi-threaded development on homogeneous cores will be abstracted to a level that is no harder than single-threaded development. Things like SystemC are already half way there...
 
I could give you another rant about how that's a four year old software renderer and how software rendering has absolutely no limitations in terms of features, but maybe a screenshot says it better: ImageShack. Sorry, I don't have a 1920x1200 screen. And that's a one year old version of a software renderer included in a totally different product, in case you care.

That's obviously not the Pixo renderer, and yes, that's many times better than what I can get from that Pixo thing. However, the above is still not identical to hardware output.

As for the resolution, ignore it. It's the highest resolution available in UT2k4, otherwise I would have gone even higher. In any case, if I disable the supersampling stuff I'd have a lot more performance than that.

I'm sorry but who or what is "s.o."?

Someone.

Radeon X800 material I'd say. But it doesn't matter, anything is possible in due time. Larrabee would be screwed if that wasn't a fact.

For the time being Larrabee remains a funky paper presentation. We'll see its advantages and disadvantages when it finally arrives. Can you safely predict with what settings and with how much performance it could theoretically run Crysis? No? Then it doesn't stand a chance as an argument for the time being.

I'll accept the R4x0 material for the debate's sake, yet it's still two entire generations apart both in terms of IQ and performance from what we have today. Let alone that an X800 would render that stuff at a much higher resolution with a lot more performance.

We only need hardware for a bump in performance. Software rendering IQ can range from anything between Quake and Finding Nemo.

What we need is balanced systems, and so far there's not a single indication that general purpose hardware can or will replace dedicated hardware. A balanced system would consist of both a very capable CPU as well as a very capable GPU, and for the time being there's not a single difference in that aspect whether it's a SoC for a small form factor device, a console or a PC; not waiting almost half a decade to get a fraction of GPU performance at low resolutions.

And just for the argument's sake resolutions have been scaling since 1998 on PC monitors.

Except that before casual games used DirectDraw and now they use Direct3D 9. And I definitely consider people playing World of Warcraft and Spore casual gamers.

You just haven't told me yet what the majority of casual gamers use for gaming in their machines, or by extension what minority those you call casual gamers actually represent.


The same as hardware rendered.

Hopefully not in the same sense as above.
 
However, the above is still not identical to hardware output.
Sigh. ImageShack
Please don't abbreviate unless it actually saves you lots of time and doesn't make things harder to read or understand.
For the time being Larrabee remains a funky paper presentation. We'll see its advantages and disadvantages when it finally arrives. Can you safely predict with what settings and with how much performance it could theoretically run Crysis? No? Then it doesn't stand a chance as an argument for the time being.
We were talking about features. But if you must know I'm not too concerned about Larrabee performance-wise either.
And just for the argument's sake resolutions have been scaling since 1998 on PC monitors.
That's just as true for hardware rendering as it is for software rendering. No argument there. In fact when resolution increases vertex processing and primitive setup take relatively less, increasing pixel throughput. As long as CPU performance keeps scaling the way it does, resolution doesn't worry me one bit.
You just haven't told me yet what the majority of casual gamers use for gaming in their machines, or by extension what minority those you call casual gamers actually represent.
All I know and all I need to know is that it's a growing minority.
 
I won't deny that there is a learning curve. But once you're past that you take a dual-core over another 200 MHz any day. Ask the video processing guys.

And like I just said something like a PPU isn't exactly a single-core solution either. You deal with the same concurrent processing issues. There's no escaping that.

And I strongly believe that in time, with new tools and languages, multi-threaded development on homogeneous cores will be abstracted to a level that is no harder than single-threaded development. Things like SystemC are already half way there...

Video processing is something that fits naturally to multi-core. Games in general don't do this. Therefore even if you have mastered the problem of multithreaded software architecture (multithreading is IMHO an architecture problem more than a code problem) you can still end up in situations where 200 MHz gives you more than one additional core.

GPUs aren't single core either, but we don't care as the APIs hide this fact from us most of the time. I know that there are APIs like OpenMP that do the same for multi-core CPUs, but unfortunately they don't work that well for games.

Better languages and tools may help a little bit, but as already noted I see multi-core as an architecture problem that cannot be fully solved at lower levels.
 
Video processing is something that fits naturally to multi-core. Games in general don't do this. Therefore even if you have mastered the problem of multithreaded software architecture (multithreading is IMHO an architecture problem more than a code problem) you can still end up in situations where 200 MHz gives you more than one additional core.
I know it's not as simple as 1+1. Indeed you can end up in situations with little or no gain, even negative gain! But that doesn't change the fact that with a sound architecture you can get very high performance increases with every doubling of cores.

By the way, Valve Loves Quad Core. "Valve believes that their hybrid approach yields good scalability in the foreseeable future, even beyond quad-core processors." That was in 2006...
GPUs aren't single core either, but we don't care as the APIs hide this fact from us most of the time. I know that there are APIs like OpenMP that do the same for multi-core CPUs, but unfortunately they don't work that well for games.
The exact same APIs used for GPUs and PPUs can be used for CPUs (I should know). So whatever heterogeneous core(s) you add, it's not going to be any easier or harder to program for than adding more homogeneous cores.

But with heterogeneous cores you do introduce new bottlenecks. There's always going to be one maxed out while the others are twiddling their thumbs. Think about the vertex and pixel shader unification evolution. CPUs with homogeneous cores can handle many different tasks and balance things automatically.

Only when a heterogeneous core is vastly more powerful does it stand a chance of surviving. For things like physics and A.I. the difference would be minimal, and they are useless for anything other than gaming (and more specifically games that actually use heavy physics and A.I.). Make them more general-purpose and there's no difference with a homogeneous core. Graphics is a different story, but mainly because of the texture samplers. Add scatter/gather instructions to all of the CPU cores and the difference with an IGP fades...
Better language and tools may help a little bit but as already noted I see multi core as an architecture problem that could not be fully solved at lower levels.
I fully agree. But there are also going to be significant changes in how we think about programming and software architecture. Today we still think about starting from a single-threaded application and making it multi-threaded. New software developed with multi-core in mind gets a completely different architecture from the get-go. It takes a while because dependent libraries need to be re-architected as well, but we're getting there. Also, students who start an advanced programming course today get in touch with multi-core quite quickly, making them think completely differently about these problems. In fact it's not a problem, it's an opportunity.

Also note that Nehalem will already vastly increase inter-core bandwidth, and introduce new hardware assisted locking primitives. This improves a few problems with fine-grain scheduling, making it easier to get good gains.
 
I know it's not as simple as 1+1. Indeed you can end up in situations with little or no gain, even negative gain! But that doesn't change the fact that with a sound architecture you can get very high performance increases with every doubling of cores.

Unfortunately core game code has large sequential parts and the number of tasks that can be run in parallel to the game code is limited. Therefore you would reach the point of no more significant scaling very fast.
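A rough Amdahl's-law sketch of why that happens (the serial fraction s = 0.3 here is only an assumed value for illustration, not a measurement of any real engine):

```
#include <cstdio>

int main() {
    const double s = 0.3;  // assumed serial fraction of the per-frame work
    // Speedup on n cores is 1 / (s + (1 - s) / n): even a modest serial part
    // caps the gain well below n.
    for (int n = 1; n <= 16; n *= 2)
        std::printf("%2d cores -> %.2fx\n", n, 1.0 / (s + (1.0 - s) / n));
    // Prints roughly 1.00x, 1.54x, 2.11x, 2.58x, 2.91x -- each doubling buys less.
    return 0;
}
```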

By the way, Valve Loves Quad Core. That was in 2006...

I can find only two direct statements there, and to be honest I would have said something similar in an interview in that time frame. Four cores are more interesting than two for sure, as it becomes harder to make use of them. It's nevertheless the most significant development, as it forces us developers to do things differently than before.

The exact same APIs used for GPUs and PPUs can be used for CPUs (I should know). So whatever heterogeneous core(s) you add, it's not going to be any easier or harder to program for than adding more homogeneous cores.

As long as there is an API that hides the fact that there is more than one core. But we don't have such a general API for general CPU tasks. We might even need a language other than C++ to get this right.

Also note that Nehalem will already vastly increase inter-core bandwidth, and introduce new hardware assisted locking primitives. This improves a few problems with fine-grain scheduling, making it easier to get good gains.

As long as not every CPU you target supports such features, it doesn't help you much.
 
CPUs with homogeneous cores can handle many different tasks and balance things automatically.
I don't think so. Unless you mean that someone has to program the CPU so that it can autobalance its workload. Fair enough, I wouldn't call it 'automatic' though.
 
By your logic we'd absolutely need dedicated hardware for everything.
Reread what examples I have been using to see that there must be some stage in the process where my point is getting lost.

Example: Octocore whatever x86 vs 6-core x86+1GPU on die

The octocore's general-purpose performance will in the very best case be 25% greater than the heterogeneous solution's.
We see from even the best multithreaded apps that scaling is less than ideal, and we may be looking at a 10-20% shortfall on the subset of apps that can fully leverage 8 cores.
For any apps that only need up to 6 cores (this includes single-threaded apps and apps that divide only into 4 or so compute-intensive threads), the shortfall is 0%.

On the other hand, any apps that can leverage the GPU have 6 x86 cores and one GPU capable of about 10-20x the raw throughput of the missing x86 cores.

In the worst case, I can't use the GPU. I can shut the GPU off almost entirely.

In the best case, I can use the GPU to better effect than a 10-core x86 and still have 6 x86 cores left over for whatever I want.

In the more common case, there will probably be a mixture of x86 and GPU threads going around that on average either produce higher throughput or possibly lower power output.

So I could have marginally lower performance across the subset of applications that just absolutely needs 8 general-purpose cores and there is no way to take advantage of the hardware the GPU provides.

If instead, it's just a smaller chip with 4 x86 and 1 GPU, the percentage arguments are in the same vein with slightly adjusted ratios.
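A back-of-the-envelope model of that trade-off, under assumed numbers (purely illustrative, not measurements): every x86 core delivers 1 unit of throughput, the on-die GPU delivers 10 units but only on the fraction f of the work that suits it, and the CPU cores and the GPU work concurrently on their respective parts.

```
#include <algorithm>
#include <cstdio>

int main() {
    const double g = 10.0;                          // assumed GPU rate in "x86-core units"
    const double fractions[] = {0.0, 0.25, 0.5, 0.75};
    for (double f : fractions) {
        double octoTime   = 1.0 / 8.0;                        // all work on 8 cores
        double heteroTime = std::max((1.0 - f) / 6.0, f / g); // 6 cores and GPU in parallel
        std::printf("f = %.2f: 6-core+GPU delivers %.2fx the octocore\n",
                    f, octoTime / heteroTime);
    }
    // f = 0.00 gives ~0.75x (the "3/4 as effective" worst case), f = 0.25 roughly
    // breaks even, and anything beyond that favours the heterogeneous chip.
    return 0;
}
```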

But instead everything is pointing towards integration and unification. You're going to lose efficiency at specific tasks no matter what. But thanks to Moore's law things still get faster overall.
Not faster, just a combination of denser and cheaper. Faster cannot be assumed, and current lithography challenges indicate the other two are not a given either.
Fast circuits are less dense, and highly tweaked processes are not cheap. Large dies with fast circuits on highly tweaked processes are also not cheap.

Suboptimal from a hardware point of view doesn't mean suboptimal from a system point of view (hardware + software) and/or commercial point of view.
If the hardware that can be built with a given amount of resources cannot provide a significant benefit over the older product it doesn't sell.
If the designers can offer a product that appeals to the market and meets a number of stringent requirements, and it just happens that more efficient use of transistors leads to a cheaper die, why not go that route?

Are you talking about 3DNow! and SSE4/5? History shows that only one ISA extension survives and the other has to adopt it.
3DNow!, 3DNow!+, MMX, SSE, SSE2, SSE3, SSE4, SSE4.1, SSE4a, SSE4 on Nehalem, SSE5, Larrabee, Sandy Bridge.

With the last two, we can't even guarantee one vendor will adopt its own extensions. That's a fight that won't start for 2 years and won't be resolved probably for another 2-4.
There might be one API transition or further exposure of the parallel languages that are hardware-agnostic in that time frame.

The industry loved single-threaded performance until around the Pentium 4 Prescott. Besides, adding a PPU is not exactly a single-threaded solution. Developers quickly realized that multi-core is a great idea and more importantly it's coming to every system.
The industry still loves single-threaded performance. They just can't get much more of it with every hardware generation.
Serial performance is useful everywhere on all software, old or new.
Parallel performance can only help a subset of software.
Developers didn't realize that multi-core is a great idea, they were bludgeoned over the head by the hardware designers who said they can't make single-core performance improve the way they did in the 1990s.

Hardware designers are looking at heterogeneous and asymmetric solutions with good reason.
People want X amount of improvement every generation.
There are physical constraints and economic concerns that mean hardware designers have to look everywhere to manage it.

IPC has vastly increased again since the Pentium 4 was ditched. Also, (next-gen) HyperThreading ups ALU usage once more. I personally have found it quite easy to reach IPCs well over 1 with a tiny bit of optimization effort. Especially for streaming workloads.
I thought unoptimized software was a fact of life, if I were to go back a few posts.
The average IPC is still way below the maximum the architectures offer, and even full IPC is a fraction of what other hardware can offer on a number of math-intensive workloads.

Especially for streaming workloads.
If only there were some kind of architecture capable of similar levels of utilization on streaming workloads, but could offer 10x the peak performance. I must only have seen that in a dream.

Here's another option: a Sandy Bridge quad-core cheaper than your six-core+IGP and offering more GFLOPS than a GeForce 8800 GTS 320.
More FLOPs than a design that will be 2 years old by the time Sandy Bridge comes out is impressive?
There will be new GPUs out by that time.

My six-core+GPU is still cheaper than your octocore, which is what has been debated up until this point. I can change the core counts and specialized units however I want (and both AMD and Intel in their presentations say the same).
I move my goal posts as easily as you can.

I can't run any DirectX 9 application on a two-year-old laptop with an 855GM. Its Pentium M can though.
The market's changed significantly since then, and new options are coming that coincide with the improvements you only acknowledge working on CPUs.

Who's talking about a Pentium III? I'm talking about four times the cores, four times the clock frequency and four times the SIMD width.
Move the clock forward a few years, and the argument for today's quad cores is the same.
All the features in the world will not matter if those features cannot be performed at an acceptable level of performance.
Feature completeness without performance was a losing proposition in the GPU space and in the CPU space.
I doubt we'll be seeing a sea change on this issue.

I don't know your definition of raw performance, and frankly I don't need to know,
I thought you wrote software that ran on hardware.
If you've never run into a workload where the amount of available operations per clock cycle exceeded what the hardware could do per clock, I'd be surprised.

In the Pentium III era performance density wasn't all that great. But with increasing transistor budgets things have gotten a lot better. The area spent on decoding x86 and handling O.S. tasks is getting smaller. There's no reason why CPUs can't reach comparable performance density.
This is where we'll probably never agree.
There are a number of good reasons why general purpose CPUs cannot reach the same density, and it's not because of x86 or the OS.
Unless of course you imagine that all other silicon products will suddenly cease to scale at all, and move back a few years even.

They call it Larrabee and it's definitely going to influence future mainstream CPUs. GPUs on the other hand still have some steps to take toward complete general-purpose programmability.
If you think Larrabee is general-purpose, I'm not sure what you think special purpose is.

A gather instruction would be the end of IGPs and make the CPU the primary choice for stream processing.
I didn't know CPUs couldn't just issue multiple load instructions.
 

The link doesn't work.

We were talking about features. But if you must know I'm not too concerned about Larrabee performance-wise either.
I don't have to know at this point; theoretical estimates don't render any pixels on any screen. I feel much better expecting less and ending up pleasantly surprised.

That's just as true for hardware rendering as it is for software rendering. No argument there. In fact when resolution increases vertex processing and primitive setup take relatively less, increasing pixel throughput. As long as CPU performance keeps scaling the way it does, resolution doesn't worry me one bit.
Even if I accept that a SW renderer now delivers output identical to a GPU's (I can't really see anything from that link above), UT2k4 is still a 4 year old game, and 800*600 with mediocre performance is still too far behind to convince me of anything.

All I know and all I need to know is that it's a growing minority.
Meaning sw rendering has always been around.
 
Unfortunately core game code has large sequential parts and the number of tasks that can be run in parallel to the game code is limited. Therefore you would reach the point of no more significant scaling very fast.

I can find only two direct statements there, and to be honest I would have said something similar in an interview in that time frame. Four cores are more interesting than two for sure, as it becomes harder to make use of them. It's nevertheless the most significant development, as it forces us developers to do things differently than before.
Sure, it's not without effort. But clearly if you don't do it the competition will. And it's really not a sudden jump from single-core to octa-core or anything. There's plenty of time to come up with good architectures, scheduling schemes, locking approaches, etc. one CPU generation at a time.

Besides, you can't escape the fact that the only way forward is to run every possible thing in parallel. Superscalar, pipelining, out-of-order, SIMD, HyperThreading, multi-core, ... They all have the goal of increasing the amount of parallel work being done. They tried to keep the developer's life simple for a long time but it's just no longer an option (though future compilers could be a great help). And you can add as many heterogeneous cores as you like, sooner or later you're CPU bound again and the only thing you can do is add more parallelism.
As long as there is an API that hides the fact that there is more than one core. But we don't have such a general API for general CPU tasks. We might even need a language other than C++ to get this right.
It's only a matter of time before APIs, tools, compilers, etc. for multi-core appear. And I think you're very right that new languages are going to appear. There are plans to extend C++ too, but it seems to be only a few careful small steps. Anyway, millions are being poured into multi-core programming research. You'd have to be quite pessimistic to see no opportunity for gradual progression at all. Sure, some software will always be less suited for multi-core, but that doesn't mean it's not worth it for everything else. Besides, that software isn't going to be helped much/long with heterogenous cores either.
As long as not every CPU you target supports such features, it doesn't help you much.
It's introduced with Intel's first native quad-core. Seems right in time to me. Dual-cores and dual-die quad-cores wouldn't benefit much from it anyway. I admit though it requires a bit of flexibility in the software to deal with having the features or not. But it always worked that way (for GPUs too).
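A minimal sketch of that kind of flexibility: detect the feature at run time and dispatch to whichever path the machine supports. __builtin_cpu_supports is a GCC/Clang builtin; the transform_* functions are illustrative stand-ins, not any real library's API.

```
// Both paths produce the same result; only the implementation chosen differs per machine.
static void transform_sse41(float* v, int n)   { for (int i = 0; i < n; ++i) v[i] *= 2.0f; }
static void transform_generic(float* v, int n) { for (int i = 0; i < n; ++i) v[i] *= 2.0f; }

void transform(float* v, int n) {
    if (__builtin_cpu_supports("sse4.1"))
        transform_sse41(v, n);      // would hold the SSE4.1-tuned loop
    else
        transform_generic(v, n);    // plain scalar fallback
}
```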
 
3dilettante, your arguments are all very good... when talking about optimizing a system meant for 3D rendering at a low cost.

What I'm talking about is systems for which the end-user doesn't even ask for 3D 'acceleration' at all. It's the same people who ten years ago would never have bought a Voodoo 2. These systems are hardly even meant for casual gaming. So let's call them office systems for a second if that makes things clearer.

There's no reason to equip these systems with 3D hardware. All of today's newly sold systems already come equipped with dual-core CPUs capable of providing a more than adequate fallback for the rare occasion that 3D is used in office applications. Surely it's used more and more, but they don't nearly need the performance of even a casual game. And should these people feel the urge, they can still play popular casual games or even shooters from a couple years ago.

Now, in the not too distant future every system will be equipped with a Sandy Bridge quad-core CPU or better. That's twice the cores, twice the clock frequency and twice the SIMD width. Three-operand instructions and other AVX extensions are likely going to increase efficiency as well. So what is already true for today's dual-core CPUs is only going to get more true.

So thanks for the discussion. It's clear to me now where software rendering belongs and what its future success depends on. I hope you can somehow relate with my perspective, and I wish you all the luck with whatever you do to improve 3D hardware rendering.
 