Software Rendering Will Return!

By the way, Pixomatic and SwiftShader are perfectly capable of running a game like Unreal Tournament 2004, exactly four years old now.

Where's the achievement in that exactly considering the awful output/resolution?
 
That's what I said. But you do have to use SIMD to get rid of redundant control logic and optimize performance / area. The only reason not to use at least 4-way SIMD would be if you don't aim the chip primarily at graphics.
Unless your engineers are smart enough to make the control logic really simple and efficient... You'll always have *some* overhead, but an IEEE-compliant FP32 unit is hardly cheap so the control costs should be considered relative to that too.

Regarding MIMD vs 4-way SIMD, remember that vertices often have no reason to be grouped, and there the finer branching granularity is a very clear win (animation with a variable number of bones, for example). According to the SDK, SGX is also capable of doing per-batch computations and so forth; this is especially useful in the handheld world, where the ARM core might not have an FPU, or its FPU might not be as fast or power efficient.
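To make the branching-granularity point concrete, here is a minimal scalar sketch (hypothetical names, plain C-style code standing in for a vertex shader) of the variable-bone-count case:

```cpp
// Per-vertex skinning with a variable bone count: each vertex only needs to
// loop over its own bones.
float skinComponent(const float* weights, const float* boneValues, int boneCount)
{
    float result = 0.0f;
    for (int i = 0; i < boneCount; ++i)      // boneCount differs per vertex
        result += weights[i] * boneValues[i];
    return result;
}
// On a 4-wide SIMD unit, four vertices share one program counter, so every
// lane keeps iterating until the largest boneCount in the group is reached
// (finished lanes are simply masked off). A MIMD unit retires each vertex as
// soon as its own loop ends.
```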

As for pixels, obviously small polygons are a problem. However, assuming you just need ddx and ddy based on a single quad, it should be easy to see that you can do ALU work for only 3 pixels in a truly-MIMD architecture. Or only one pixel if you've only got texturing based on vertex attributes... Saving ALU work in this way is either going to improve performance *or* save a little bit of power on a handheld (depending on the shader's 'normal' ALU-TEX ratio).

Things could get more complicated (and interesting) when you can offload some or all of the texture addressing/filtering into the MIMD shader core (which SGX can't do); then the ALU-TEX ratio also becomes less important. But that's so incredibly more complex it's another debate entirely and probably we won't see that for quite a while, if ever.

EDIT: Also, I presume this was obvious, but if you've got this program: { if(...) result = func1(input); else result = func2(input); tex(result); } then, if there are no texture instructions in the functions, you can skip them completely for individual pixels for which the test fails. And even if there are texture instructions in there, it doesn't matter as long as the texture coordinates were calculated outside any conditionally executed instruction.
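Spelled out a bit more, with hypothetical names and plain C-style code standing in for the pixel shader:

```cpp
// 'coord' is computed outside the branch, so even if func1 and func2 contain
// texture fetches, a pixel can skip the branch it doesn't take: the fetch's
// coordinate (and therefore its screen-space derivatives) never depends on
// conditionally executed code.
float tex(float coord)   { return coord * 0.25f; }      // stand-in for a texture fetch
float func1(float coord) { return tex(coord) * 2.0f; }  // taken when the test passes
float func2(float coord) { return tex(coord) + 0.5f; }  // taken when the test fails

float shadePixel(float input, bool test)
{
    float coord = input * 2.0f - 1.0f;   // coordinate math, outside any conditional
    return test ? func1(coord) : func2(coord);
}
```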
 
Where's the achievement in that exactly considering the awful output/resolution?
What awful output? For Pixomatic, make sure you set Scale2X=False and FilterQuality3D=3. It's perfectly playable at 800x600 on my 2.4 GHz Core 2. And that's a 2004 single-threaded software renderer that makes very little use of SSE and was developed by two people.
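For reference, the relevant bits of my config look something like this (the two keys are the ones named above; the section name is from memory and may differ in your UT2004.ini, so double-check it):

```ini
; Pixomatic renderer settings in UT2004.ini (section name from memory)
[PixoDrv.PixoRenderDevice]
Scale2X=False
FilterQuality3D=3
```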
 
I play Crysis on my laptop's Quadro FX 350M (G72). It isn't great but it's perfectly playable. Now, even if that GPU was suited for A.I., the dual-core CPU would still be a much better fit.
Depending on the type of AI and what methods it uses to evaluate its environment, this can be very true or very wrong.
Depending on what phase in the AI's calculations you are describing, it can be the case that it is both very true and very wrong.

In the 32 nm era, when quad-core makes it into the mainstream, there will still be a majority of systems equipped with low-end graphics cards. But the CPU will have four powerful cores at your disposal...

I might as well rewrite your phrase and replace "graphics" with "CPU".

At 32 nm, chipset designers are going to be killing themselves to put something on their chipsets. The northbridge or northbridge/southbridge will have its capacity for size reduction capped by the need for the many hundreds of pins needed to interface with the system.

The manufacturers are going to put something in that space.
To top it off, the returns for quad core are not double that of the returns for transitioning to dual core from single core.

The transition to eight general cores will be for the foreseeable future a highly dubious prospect for the mainstream.
That means your core scaling argument is good for about 4 years.

Excellent. But that core isn't going to be significantly faster at anything other than graphics. It will have SIMD units very much like the other cores, and texture samplers. And unless texture samplers can vastly accelerate A.I. you'd better let the other cores handle it.
It will depend on the AI and the methods used, which should be determined by the developers, not the limitations of the hardware. Or is your harping on how developers need more freedom only applicable to algorithms that favor the very limiting design assumptions that general purpose CPUs rely upon?

Also, heterogeneous cores again make it more difficult for developers. What's Intel going to do and what is AMD going to do? Will all CPUs have an IGP or just the mobile or low-end ones?
Wait for integration and the probable abstraction layers that Peakstream, Intel, AMD, Nvidia, and everybody else are pursuing. It might be a pretty well-established paradigm around the time you expect octo-cores to be introduced to the mainstream.

Yes it costs cycles, but it's nonsense that this is why some games have poor A.I. Do you honestly think that extra cycles alone are going to fix that? It has a lot more to do with programmer skills, budget, time, and creativity than just raw cycles.
A lot of it is cycles. AI can either have a time dependency in a dynamic environment or a large problem space to evaluate in a more strategic game.
There is a bare minimum of complexity that cannot be avoided with heuristics, and an AI that heavily relies on assumptions or hard-coded values that save cycles tends to be increasingly fragile or ineffective.

There's only so much creativity that can replace actual work.
A significant portion of the initial evaluation phases for a number of AI designs relies on number crunching of raw data to coalesce it into a more refined form for applying rules or decision making.
The grunt work phase is often the sort of thing where a lot of parallel, non-dependent work is present, and due to the varying nature of simulations, it can easily be a cache buster or amenable to the kind of access texture units can handle.
 
Unless your engineers are smart enough to make the control logic really simple and efficient... You'll always have *some* overhead, but an IEEE-compliant FP32 unit is hardly cheap so the control costs should be considered relative to that too.
How about the cost in allowing the more flexible register access? Connecting the multiplicity of units to allow an equivalent amount of reads and writes even from local storage or the register file must have an impact.
 
EDIT: Also, I presume this was obvious, but if you've got this program: { if(...) result = func1(input); else result = func2(input); tex(result); } then, if there are no texture instructions in the functions, you can skip them completely for individual pixels for which the test fails. And even if there are texture instructions in there, it doesn't matter as long as the texture coordinates were calculated outside any conditionally executed instruction.
That's a very good point. I didn't realize this before although it's obvious now. Thanks for pointing it out!

I also wonder whether the SGX has dedicated hardware for primitive setup. If not, it makes a lot of sense to have MIMD units that can immediately process the next primitive when the previous one is culled (outside the viewport, back-facing, etc.).
 
How about the cost in allowing the more flexible register access? Connecting the multiplicity of units to allow an equivalent amount of reads and writes even from local storage or the register file must have an impact.
Yeah you need extra decoders, extra constant buffer read ports, likely some synchronization logic, and you need to route a lot of extra wires. But for a mobile chip ALU usage might just have a bigger impact on performance / watt.
 
What awful output? For Pixomatic, make sure you set Scale2X=False and FilterQuality3D=3. It's perfectly playable at 800x600 on my 2.4 GHz Core 2. And that's a 2004 single-threaded software renderer that makes very little use of SSE and was developed by two people.

While it's an achievement compared to former CPUs and/or software renderers, it's hardly any kind of achievement even compared to a recent IGP. Been there, done that, and no, the result is Voodoo3 material at best.
 
At 32 nm, chipset designers are going to be killing themselves to put something on their chipsets. The northbridge or northbridge/southbridge will have its capacity for size reduction capped by the need for the many hundreds of pins needed to interface with the system.

The manufacturers are going to put something in that space.
What northbridge? Nehalem will have integrated memory controllers and some variants will have IGP cores. I'm sure some manufacturers are not happy about this evolution but eventually it will all get integrated (e.g. with four CPU cores or more a software RAID controller is fine). They can keep coming up with new stuff but once it gets to the point where it's too small to fit on a separate chip it will get integrated into something else or become software.
To top it off, the returns for quad core are not double that of the returns for transitioning to dual core from single core.
True, but it all depends on the workload. GPUs are doing great with a lot more cores so I'm not sure what your point is. One way or another you're going to add additional processing units that work concurrently. And in my opinion additional CPU cores is not the worst choice at all.
The transition to eight general cores will be for the foreseeable future a highly dubious prospect for the mainstream. That means your core scaling argument is good for about 4 years.
When looking at current software, yes, it can be hard to imagine any purpose for an octa-core. But in 4 years a lot can change. Like I said before some game developers are already certain to be able to make good use of quad-cores. In 4 years they'll have more tools to work with and likely some hardware assisted locking/scheduling. In fact, Nehalem will re-introduce HyperThreading. So clearly Intel is confident that eight threads is manageable in the not too distant future.

I also expect a paradigm shift in programming languages. Just like the object-oriented revolution allowed us to abstract raw data and functions into something more manageable, there will very likely be extensions or entirely new languages that abstract threads and locks into tasks and dependencies. It might take a decade or so to mature but I'm confident that we won't run into a dead end beyond quad-core.
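As a rough illustration of what I mean by tasks and dependencies (using C++ futures purely as a sketch; std::async only arrived a few years after this discussion, and a real engine would want something richer):

```cpp
#include <cstdio>
#include <future>

// Two independent chunks of per-frame work (toy stand-ins)...
int updatePhysics(int frame) { return frame * 2; }
int updateAI(int frame)      { return frame + 7; }
// ...and one task that depends on both of their results.
int buildRenderList(int physics, int ai) { return physics + ai; }

int main()
{
    // Describe the work as tasks and let the runtime map them onto cores.
    std::future<int> physics = std::async(std::launch::async, updatePhysics, 42);
    std::future<int> ai      = std::async(std::launch::async, updateAI, 42);

    // The dependency is expressed by consuming the results -- no explicit
    // threads, no locks.
    std::printf("render list id: %d\n", buildRenderList(physics.get(), ai.get()));
    return 0;
}
```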
It will depend on the AI and the methods used, which should be determined by the developers, not the limitations of the hardware...
What the heck are you saying? Anyone calling himself a software engineer is going to take hardware limitations as an equally serious factor in determining the right approach.
...or is your harping on how developers need more freedom only applicable to algorithms that favor the very limiting design assumptions that general purpose CPUs rely upon?
What limitations? Except for rasterization, GPUs are a lot more limiting. For any random application, developers are first going to look at running it on the CPU, simply because it can run anything imaginable. Only if it's clearly going to run faster on even an IGP might they consider deviating from the default. This isn't going to change in the next few years, not with multi-core steadily evolving and IGPs sticking around.
Wait for integration and the probable abstraction layers that have been explored by Peakstream, Intel, AMD, Nvidia, and everybody else are pursuing.
These attempts are more successful for multi-core CPUs than for GPUs. Even with GPUs getting more programmable every generation, you still have to deal with slow IGPs and crappy drivers.
It might be a pretty well-established paradigm around the time you expect octo-cores to be introduced to the mainstream.
Exactly. But again, the only reliably evolving baseline is the CPU, so they're not just going to stop investing in that. Hence octa-cores are coming our way sooner or later and software rendering will compete with IGP's from the bottom up.
A lot of it is cycles. AI can either have a time dependency in a dynamic environment or a large problem space to evaluate in a more strategic game.
There is a bare minimum of complexity that cannot be avoided with heuristics, and an AI that heavily relies on assumptions or hard-coded values that save cycles tends to be increasingly fragile or ineffective.
Sure, a minimum of complexity is unavoidable. But modern CPUs offer billions of cycles per second to turn your agent into something that doesn't run in circles. If I see a game having good A.I. and a nearly identical game having terrible A.I. I'm not going to conclude that I have to upgrade my hardware... Badly optimized software is a reality, and throwing more cycles at it is not the right answer.
 
While it's an achievement compared to former CPUs and/or software renderers, it's hardly any kind of achievement even compared to a recent IGP.
Granted, a recent IGP is still a better option in most cases. But software rendering is, slowly but steadily, catching up. Games that previously required serious hardware now run smoothly in software. And modern casual games that use the same level of 3D technology are already a perfect match for software rendering. And the range of applications for which software rendering is adequate is only getting bigger every year. Interestingly, as ALU/TEX ratios go up CPUs have relatively less trouble processing pixels. Mark my words: Well within five years we'll be able to play Crysis smoothly on the CPU. Ironically though, some might not think of it as an achievement any more then...

You really have to look at it as a proof-of-concept. It's just one step in the direction of making software rendering viable again. In this light I think it's a major achievement that we can smoothly run games that were originally intended to run only on hardware, just a few years after their release.

What I also consider an achievement is that while a recent IGP still costs a few bucks a software renderer that comes with an application costs essentially nothing (it pays for itself by reducing support calls). I'm sure at least a few people are happy they can run certain casual games without upgrading their outdated hardware.
Been there, done that...
Been where done what exactly?
...and no, the result is Voodoo3 material at best.
Last time I checked, Voodoo3 did not support DirectX 9 at all. One gigantic benefit of software rendering is that you can upgrade it. People who are still stuck with a Radeon 8500, GeForce 4 MX or a 855GM can actually run DirectX 9 applications in software. Your application is also going to look exactly the same on any machine. For casual games and medical applications that's already a very serious argument.

The reason IGPs exist is because there's a demand for very cheap chips that can handle the minimal 3D demands. For the same reason software rendering has a very good chance of returning even if it's no match against an IGP any time soon. It's not going to fulfill everyone's expectations, but in my eyes for those playing only casual games it already has returned...
 
Granted, a recent IGP is still a better option in most cases. But software rendering is, slowly but steadily, catching up. Games that previously required serious hardware now run smoothly in software. And modern casual games that use the same level of 3D technology are already a perfect match for software rendering. And the range of applications for which software rendering is adequate is only getting bigger every year. Interestingly, as ALU/TEX ratios go up CPUs have relatively less trouble processing pixels. Mark my words: Well within five years we'll be able to play Crysis smoothly on the CPU. Ironically though, some might not think of it as an achievement any more then...

I don't doubt any of the above; even gaming on an IGP means that you have to sacrifice a ton of in-game settings, pick a very low resolution and forget about any kind of IQ-improving features.

You really have to look at it as a proof-of-concept. It's just one step in the direction of making software rendering viable again. In this light I think it's a major achievement that we can smoothly run games that were originally intended to run only on hardware, just a few years after their release.

What I also consider an achievement is that while a recent IGP still costs a few bucks a software renderer that comes with an application costs essentially nothing (it pays for itself by reducing support calls). I'm sure at least a few people are happy they can run certain casual games without upgrading their outdated hardware.

If vendors in the future create hybrid CPUs with some GPU-specific functionality, I wouldn't be in the least surprised if those were eventually capable of entirely replacing IGPs. That would still be the ultra-low-end segment of the market, and folks rarely buy an IGP for serious gaming.

Been where done what exactly?

Seen it time and time again.

Last time I checked, Voodoo3 did not support DirectX 9 at all.

I truly wish UT2k4 even deserved to be called a D3D9.0 title. It's in its vast majority a DX7 T&L-optimized game, and in a few spare spots it might have a couple of DX8.0 shaders. With V3 material I meant resolution and filtering quality, as examples.

One gigantic benefit of software rendering is that you can upgrade it. People who are still stuck with a Radeon 8500, GeForce 4 MX or a 855GM can actually run DirectX 9 applications in software. Your application is also going to look exactly the same on any machine. For casual games and medical applications that's already a very serious argument.

I'd love to see such a SW renderer for Oblivion, especially on the 4 MX or the 855GM.

The reason IGPs exist is because there's a demand for very cheap chips that can handle the minimal 3D demands. For the same reason software rendering has a very good chance of returning even if it's no match against an IGP any time soon. It's not going to fulfill everyone's expectations, but in my eyes for those playing only casual games it already has returned...

IGPs and any future relevant ultra-low-end HW are office material at best. The truth is SW rendering never went away; it was always present one way or another. It's only natural that as CPU processing power scales, SW rendering catches up by some degree over the years. I don't see anything returning, not even in the small-form-factor mobile/PDA market; instead, GPU IP is seeing such high uptake that it only looks set to continue scaling as a necessity for relevant 3D-capable devices.
 
The reason IGPs exist is because there's a demand for very cheap chips that can handle the minimal 3D demands. For the same reason software rendering has a very good chance of returning even if it's no match against an IGP any time soon. It's not going to fulfill everyone's expectations, but in my eyes for those playing only casual games it already has returned...
I certainly think you might be right in the moderately long term there, but only for desktops and servers. In the laptop market (and even more so for UMPCs/MIDs), the higher power efficiency of IGPs will likely keep them highly relevant unless we're talking about Larrabee-like GPU-oriented cores being on every CPU. Either way what happens there depends a lot on what a few executives decide is best, and that's rarely very predictable sadly.

Regarding southbridges, you seem not to be realizing that the cost there is also related to analogue and I/O... You can't just put SATA in software; it's not just a digital controller. There obviously is some digital logic in there too, but it's very specialized and, except for things like PCI Express Gen2, it's far from being as big as the non-digital stuff AFAIK (and the latter doesn't even scale much if at all). Doing that digital logic in software would, AFAIK, be astonishingly inefficient. Either way, Intel found the near-perfect solution to this southbridge problem: Fab68 in Dalian, China. It'll lag behind in terms of process technology, but will have noticeably lower cost/wafer.

And I just thought I'd point out that obviously most of us on this forum don't take software rendering very seriously because of our "heritage" but I'm personally glad that you keep defending it so vigorously and try to dispel some myths from time to time! It's certainly nice to be able to have debates on the subject with someone who has so much first-hand experience! :)
 
What northbridge? Nehalem will have integrated memory controllers and some variants will have IGP cores. I'm sure some manufacturers are not happy about this evolution but eventually it will all get integrated (e.g. with four CPU cores or more a software RAID controller is fine). They can keep coming up with new stuff but once it gets to the point where it's too small to fit on a separate chip it will get integrated into something else or become software.
I purposefully noted northbridge or combination northbridge/southbridge.
It's not like A64 motherboards suddenly lost their chipsets after the architecture picked up an IMC.

True, but it all depends on the workload. GPUs are doing great with a lot more cores so I'm not sure what your point is.
Depends on what you consider a core. The way they are set up now, with the exception of the multi-GPU setups, they are not truly multicore.

One way or another you're going to add additional processing units that work concurrently. And in my opinion additional CPU cores is not the worst choice at all.
The question is how much, and here is where diminishing returns comes in.
For the bulk of the portable and desktop markets, 8-way symmetric multicore is an utter waste of time right now and will be incredibly wasteful in the future.
ALU density for those cores will be an order of magnitude less than what more specialized designs produce.
The slides comparing Larrabee and Sandy Bridge (Gesher) show Larrabee with 24 cores capable of 8-16 DP SSE ops, while Sandy Bridge shows 4-8 cores capable of 7 DP SSE ops.
Even factoring in Larrabee's lower core clock is insufficient to change the fact that Larrabee's core count is 3-6 times that of Sandy Bridge, and each core is capable of anywhere from slightly more to double (likely closer to double) the throughput.

When looking at current software, yes, it can be hard to imagine any purpose for an octa-core. But in 4 years a lot can change.
The increasing influence of the laptop market is going to severely impact the prevalence of octo-core, general-purpose-only chips.

Like I said before some game developers are already certain to be able to make good use of quad-cores.
Those that make occasionally decent use of quad cores already make significantly better use of GPUs to do far more than the quad cores could dream of handling. I don't see how that changes anything.

In 4 years they'll have more tools to work with and likely some hardware assisted locking/scheduling. In fact, Nehalem will re-introduce HyperThreading. So clearly Intel is confident that eight threads is manageable in the not too distant future.
Look at the more mainstream versions of Nehalem and see what Intel finds to have more bang for buck.
The SMT, if anything, would likely delay octocores from achieving general acceptance, because 8 threads on 4 cores is as good as 8 not-quite-so-good cores.
If the more specialized cores can match just a few of those not so good cores, they can do so with a far smaller footprint and far higher peak capability.

It might take a decade or so to mature but I'm confident that we won't run into a dead end beyond quad-core.
It's too bad the market's not going to wait decades for this.
The same advances you expect to cure what ails general-purpose multicores will not benefit only them. The designs capable of 10x the raw performance will benefit as well, not magically stagnate.

What the heck are you saying? Anyone calling himself a software engineer is going to take hardware limitations as an equally serious factor in determining the right approach.
You are advocating the removal of specialized hardware. You are taking away a world of choice.

Only if it's clearly going to run faster on even an IGP might they consider deviating from the default. This isn't going to change in the next few years, not with multi-core steadily evolving and IGPs sticking around.
Here's where we'll probably have to differ.
The return on investment is not going to be all that great with 2-4 billion extra transistors thrown into an extra 4 general purpose cores.
The incremental gain of a single modest streaming core with less footprint than a single CPU will be significantly better.

These attempts are more successful for multi-core CPUs than for GPUs. Even with GPUs getting more programmable every generation, you still have to deal with slow IGPs and crappy drivers.
I don't see how. There's little if any point for abstraction or dynamic recompilation to target a core if all the cores are the same general-purpose core.

Exactly. But again, the only reliably evolving baseline is the CPU, so they're not just going to stop investing in that. Hence octa-cores are coming our way sooner or later and software rendering will compete with IGP's from the bottom up.
This future will be delayed or possibly cancelled 2H 2009, by both AMD and Intel.

Sure, a minimum of complexity is unavoidable. But modern CPUs offer billions of cycles per second to turn your agent into something that doesn't run in circles.
It's more fascinating to watch the generalized hardware waste most of those billions of cycles and pump out hundreds of watts for no return.

If I see a game having good A.I. and a nearly identical game having terrible A.I. I'm not going to conclude that I have to upgrade my hardware... Badly optimized software is a reality, and throwing more cycles at it is not the right answer.
You'd rather have those octocores hammering away at software rendering that could have been handled adequately by a GPU 4 years prior using a fraction of the transistors and you're lecturing on optimality?
 
Ohh sorry, to be honest my knowledge of it only extends to knowing of its existence,
but here goes:
It's not meant for end users (in answering your question "Is there already a way to run Crysis in software mode" I answered yes, but meant yes only for the devs of Crysis; they could run it in software mode).

To use it you would need MS Visual Studio + the latest DirectX development kit, and write your program to use the reference rasteriser (this is how devs could see what their game would look like before there was any DX10 hardware available).
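To give an idea, on the D3D9 side it boils down to asking for D3DDEVTYPE_REF instead of D3DDEVTYPE_HAL when creating the device (a minimal sketch with window handling omitted; as far as I know the REF device also needs the SDK installed, since the end-user runtime doesn't ship it):

```cpp
#include <windows.h>
#include <d3d9.h>
#pragma comment(lib, "d3d9.lib")

// Create a Direct3D 9 device that runs on the reference rasteriser instead of
// the GPU. REF renders everything in software, exactly to spec, but it is far
// too slow for actual play -- it's a correctness tool for developers.
IDirect3DDevice9* createReferenceDevice(HWND hwnd)
{
    IDirect3D9* d3d = Direct3DCreate9(D3D_SDK_VERSION);
    if (!d3d)
        return NULL;

    D3DPRESENT_PARAMETERS pp = {};
    pp.Windowed         = TRUE;
    pp.SwapEffect       = D3DSWAPEFFECT_DISCARD;
    pp.BackBufferFormat = D3DFMT_UNKNOWN;

    IDirect3DDevice9* device = NULL;
    d3d->CreateDevice(D3DADAPTER_DEFAULT,
                      D3DDEVTYPE_REF,               // the reference rasteriser
                      hwnd,
                      D3DCREATE_SOFTWARE_VERTEXPROCESSING,
                      &pp,
                      &device);
    return device;                                  // NULL if creation failed
}
```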

dx7 software renderer:
http://www.radgametools.com/pixomain.htm

you can run dx9 games in software with this (expect it to be very slow)
http://www.transgaming.com/products/swiftshader/
 
Seen it time and time again.
Please specify.
I truly wish UT2k4 even deserved to be called a D3D9.0 title. It's in its vast majority a DX7 T&L-optimized game, and in a few spare spots it might have a couple of DX8.0 shaders. With V3 material I meant resolution and filtering quality, as examples.
I only mentioned Unreal Tournament 2004 because that's a game that now, four years later, runs perfectly in software. That's performance-wise. Feature-wise, we're already four years later than that and we're certainly not limited to DirectX 7/8 games.
I'd love to see such a SW renderer for Oblivion, especially on the 4 MX or the 855GM.
You don't even need a 3D capable graphics card.
 
I certainly think you might be right in the moderately long term there, but only for desktops and servers. In the laptop market (and even more so for UMPCs/MIDs), the higher power efficiency of IGPs will likely keep them highly relevant unless we're talking about Larrabee-like GPU-oriented cores being on every CPU.
True, power efficiency is very important in the mobile world. But there are plenty of arguments why I believe it's only going to take one generation longer:
  • Of every desktop CPU architecture there's a laptop version sooner or later.
  • Lower clocking drastically reduces power while performance only reduces moderately.
  • Quad-core laptop chips are already on the roadmaps.
  • IGP's don't exactly consume nothing. In fact gaming itself is not a great idea while on battery.
  • Intel is putting a lot of effort in increasing performance/watt. 45 nm does great and they promise even better things for 32 nm.
  • Lots of mobile devices are fine with only having an OpenGL ES software renderer.
Either way what happens there depends a lot on what a few executives decide is best, and that's rarely very predictable sadly.
Yeah, it's about a lot more than just feasibility. But once it's technically sound, what the executives decide can work either against or in favor of software rendering. So it's no more unpredictable than for anything else.
Regarding southbridges, you seem not to be realizing that the cost there is also related to analogue and I/O...
I wasn't implying the removal of southbridges any time soon. I was only saying that things like the digital side of RAID controllers, for which software drivers already exist, are only going to get pushed to the (multi-core) CPU side more and more. We're obviously always going to need analogue and I/O in hardware.
And I just thought I'd point out that obviously most of us on this forum don't take software rendering very seriously because of our "heritage" but I'm personally glad that you keep defending it so vigorously and try to dispel some myths from time to time! It's certainly nice to be able to have debates on the subject with someone who has so much first-hand experience! :)
You're welcome. I might be barking up the wrong tree most of the time, but it's interesting to see how hardware rendering is so vigorously defended against harmless software rendering. ;) It only makes me more certain of where software rendering technology is/should be heading... So thank you!
 