Optimizing the Rendering Pipeline of Animated Models Using the Intel SSE

The article seems to be more about balancing out what does what in an overall system, with backward compatibility in mind. Sounds all good and dandy, but once we start doing physics on the GPU, that will mostly remove the burden on bus bandwidth between the CPU and GPU for animated objects on newer cards. So in essence the article is great if developers are targeting a two-year-old system, but that's about it.
 
I moved this thread since the answer is quite short and basic imo...

In the Doom engine, things are rendered multiple times because of the way shadows are computed; older GPUs cannot store transformed vertex data between passes, so with Doom's multiple rendering passes the GPU would have to re-animate the same meshes several times and waste computing power on that. So, instead, they decided to simply do the animation on the CPU so that it's less likely T&L would be a bottleneck. This is one of the reasons why Doom/Quake are relatively CPU-limited.
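To make that concrete, here is a minimal sketch of CPU-side skinning with SSE intrinsics, the kind of per-vertex work the Intel article optimizes so it only has to happen once per frame no matter how many passes reuse the result. This is not the article's or the engine's code; the data layout, names and four-weights-per-vertex assumption are illustrative only.

```cpp
// Hypothetical sketch of CPU-side skinning with SSE intrinsics.
// Struct names and layout are assumptions, not the article's code.
#include <xmmintrin.h>
#include <cstdio>

struct BoneMatrix { __m128 row0, row1, row2; };   // 3x4 transform, one row per __m128

// Blend up to four bone matrices by weight, then transform (x, y, z, 1).
static void SkinPosition(const float pos[3], const BoneMatrix* bones,
                         const int index[4], const float weight[4], float out[3])
{
    __m128 r0 = _mm_setzero_ps(), r1 = _mm_setzero_ps(), r2 = _mm_setzero_ps();
    for (int i = 0; i < 4; ++i) {
        __m128 w = _mm_set1_ps(weight[i]);
        r0 = _mm_add_ps(r0, _mm_mul_ps(bones[index[i]].row0, w));  // blend rows
        r1 = _mm_add_ps(r1, _mm_mul_ps(bones[index[i]].row1, w));
        r2 = _mm_add_ps(r2, _mm_mul_ps(bones[index[i]].row2, w));
    }
    __m128 v = _mm_set_ps(1.0f, pos[2], pos[1], pos[0]);           // (x, y, z, 1)
    float a[4], b[4], c[4];
    _mm_storeu_ps(a, _mm_mul_ps(r0, v));
    _mm_storeu_ps(b, _mm_mul_ps(r1, v));
    _mm_storeu_ps(c, _mm_mul_ps(r2, v));
    out[0] = a[0] + a[1] + a[2] + a[3];                            // horizontal adds
    out[1] = b[0] + b[1] + b[2] + b[3];
    out[2] = c[0] + c[1] + c[2] + c[3];
}

int main() {
    BoneMatrix identity = { _mm_set_ps(0.0f, 0.0f, 0.0f, 1.0f),    // identity 3x4 rows
                            _mm_set_ps(0.0f, 0.0f, 1.0f, 0.0f),
                            _mm_set_ps(0.0f, 1.0f, 0.0f, 0.0f) };
    float pos[3] = { 1.0f, 2.0f, 3.0f }, out[3];
    int   idx[4] = { 0, 0, 0, 0 };
    float w[4]   = { 1.0f, 0.0f, 0.0f, 0.0f };
    SkinPosition(pos, &identity, idx, w, out);
    std::printf("%f %f %f\n", out[0], out[1], out[2]);             // prints 1 2 3
    return 0;
}
```

In a multi-pass renderer, something like SkinPosition would run once per vertex per frame, and every shadow or lighting pass would then draw from the already-skinned buffer.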

On modern GPUs, you can use Render to Vertex Buffer or similar techniques to be able to compute these things once on the GPU, instead of using the CPU. That approach is thus outdated. Also, because the rendering passes involved are often very "light" in terms of pixel shading, it makes sense not to even bother with this approach on unified architectures, such as G80 or R600, because it's very unlikely the VS would be the bottleneck (bandwidth, triangle setup, etc. are more likely candidates).

As such, the approach proposed in that link has been used, but it is now completely and utterly outdated imo. Heck, even doing stuff multiple times in the VS on older GPUs was rarely a bottleneck anyway, and AFAIK this was just a minor performance boost even for those engines. Given that GPUs' performance evolved faster than CPUs', it's most likely a performance loss for the Doom 3 engine nowadays, actually - heh.


Uttar
 
Given that GPUs' performance evolved faster than CPUs'...
I question whether this is still true in the multi-core era. It has taken four years to increase the Pentium 4's clock frequency from 1.3 to 3.8 GHz, a 300% theoretical performance increase. In less than two years Intel introduced dual-core and quad-core. So in four years we might have a 1600% theoretical performance increase. While GPUs are close to the limits of die size / silicon cost, CPUs can still vastly increase their throughput by placing many smaller cores on a larger die. They both benefit equally from advances in process technology now.

What I'm trying to say is that we shouldn't be too quick to move all workload to the GPU. In a couple years a typical budget system could very well have a quad-core CPU but in comparison a weak GPU...
 
I question whether this is still true in the multi-core era.
I completely agree that the theoretical performance of CPUs is no longer going to stall in the future. However, I believe a key point to take into consideration is that the ALU-TEX ratio is going to increase in the future. So, in the near/mid-term, I think you'll see arithmetic performance of GPUs scale up faster than that of CPUs, because the percentage of the die dedicated to ALUs is going to increase.

Furthermore, if many-core CPUs would be seen as 3x less efficient per mm2 in terms of massively parallel arithmetic workloads, it'd make sense to do many of these computations on the GPU. The key would be heterogeneous architectures, but in the current landscape, I'm not even sure what the advantage would be against a GPU for arithmetic-heavy workloads, unless it's also x86-based. We'll see how AMD's APU initiative works out, I guess.


Uttar
P.S.: Considering Moore's law, unless the per-core performance degrades, homogeneous architectures which would not scale by clock speed are unlikely to become more than 2x faster every 2 years. Quad-core having been introduced in such a short timeframe is an exception, imo, and not the rule. Of course, Nehalem might hold some interesting surprises... (4 cores, 8 threads - if it's wider than Conroe, that'd be interesting!)
 
I question whether this is still true in the multi-core era. It has taken four years to increase the Pentium 4's clock frequency from 1.3 to 3.8 GHz, a 300% theoretical performance increase.
That's a 192% increase, actually.

In the 40 months from R300 (31.2 GFLOPS) to R580 (374.4 GFLOPS) there is an 1100% increase (the programmable arithmetic pipeline being the same format the entire time: MAD+ADD, 2x vec4). We'll know soon what R600 can do...
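For reference, those figures drop out of the usual pipes × flops-per-clock × clock arithmetic, assuming the shipping clocks (roughly 325MHz for R300, 650MHz for R580), 8 pixel pipes on R300 versus 48 PS ALUs on R580, and MAD+ADD over a vec4 counted as 12 flops per pipe per clock:

\[
8 \times 12 \times 0.325\,\text{GHz} = 31.2\ \text{GFLOPS}, \qquad
48 \times 12 \times 0.65\,\text{GHz} = 374.4\ \text{GFLOPS}, \qquad
\tfrac{374.4}{31.2} = 12\times.
\]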

Of course GPUs only have to do "single-precision" mathematics, but R300 is 24-bit and R580 is 32-bit.

The GPUs wouldn't scale as well when you take into account the fixed function pipelines, such as TMUs and ROPs.

In less than two years Intel introduced dual-core and quad-core. So in four years we might have a 1600% theoretical performance increase. While GPUs are close to the limits of die size / silicon cost, CPUs can still vastly increase their throughput by placing many smaller cores on a larger die. They both benefit equally from advances in process technology now.
GPUs have 1 process node in hand, 90nm versus 65nm.

What I'm trying to say is that we shouldn't be too quick to move all workload to the GPU. In a couple years a typical budget system could very well have a quad-core CPU but in comparison a weak GPU...
A weak GPU in a budget system in 2 years' time will be something like RV530 running at 400MHz, presumably on 65nm, about 160M transistors. 57.6 GFLOPS and, say, 16 GB/s of bandwidth.
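Presumably that 57.6 GFLOPS comes from RV530's 12 PS ALUs at the same 12 flops per ALU per clock as above:

\[
12 \times 12 \times 0.4\,\text{GHz} = 57.6\ \text{GFLOPS}.
\]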

Anyway, those Intel documents are 2 years out of date. They totally ignore what D3D10 is all about and how D3D10 effectively opens up the GPU to non-graphics computing functionality.

Jawed
 
However, I believe a key point to take into consideration is that the ALU-TEX ratio is going to increase in the future. So, in the near/mid-term, I think you'll see arithmetic performance of GPUs scale up faster than that of CPUs, because the percentage of the die dedicated to ALUs is going to increase.
That really doesn't scale. If it's 60/40 now and 80/20 in the future then you only get a 33% increase and that's it. Besides, CPUs are abandoning the idea of long and narrow pipelines and going for short and wide pipelines, thereby also increasing the ALU/area ratio.
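To spell out that ceiling: at a fixed die size, going from a 60% ALU share to an 80% share only buys

\[
\frac{0.8}{0.6} \approx 1.33\times
\]

the ALU area, a one-time gain rather than an ongoing scaling advantage.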
Furthermore, if many-core CPUs would be seen as 3x less efficient per mm2 in terms of massively parallel arithmetic workloads, it'd make sense to do many of these computations on the GPU.
3x is barely enough to offload the work to the GPU without risking making things slower. The communication latency is not negligible, and you have to consider the ease of the programming model. Game developers are much more inclined to reduce the workload to a third than to put in the effort to run it efficiently on the GPU.

And like I said before, there's a huge variation in GPU performance. With a quad-core CPU and an R600 I'm sure it's worth it to do as much as possible on the R600. But with a quad-core CPU and a budget GPU it's best to stress the latter as little as possible. If a compromise has to be made, game developers choose to make it run well on budget systems...

The relative gap between high-end GPUs and CPUs isn't really getting bigger any more, but the gap between CPUs and low-end GPUs certainly is. People who only need to run Vista and the occasional game will buy a system with a CPU that is two times slower than the high-end but a GPU that has only a tiny fraction of a high-end GPU's performance.
The key would be heterogeneous architectures, but in the current landscape, I'm not even sure what the advantage would be against a GPU for arithmetic-heavy workloads, unless it's also x86-based. We'll see how AMD's APU initiative works out, I guess.
Unless they can standardize it (i.e. Intel implements the exact same functionality), it's not going to live long. Programmers are really not asking for yet another variation in system configuration...
 
That's a 192% increase, actually.
Ok, 292% relative performance.
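Both readings describe the same jump, for what it's worth:

\[
\frac{3.8}{1.3} \approx 2.92,
\]

i.e. 292% of the original clock, which is a 192% increase over it.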
In the 40 months from R300 (31.2 GFLOPS) to R580 (374.4 GFLOPS) there is an 1100% increase (the programmable arithmetic pipeline being the same format the entire time: MAD+ADD, 2x vec4).
You also have to consider that GPUs have been catching up with silicon and design technology. So I really think GPUs and CPUs are bound by the same laws now.
GPUs have 1 process node in hand, 90nm versus 65nm.
Intel is almost ready with 45 nm... But even if GPUs do catch up completely, that's still a one-time advantage in relative performance. On the other hand, GPUs really don't have any room left in terms of thermal dissipation. Die size is also a problem, and even if GPUs split into multiple chips, CPUs can do the same.
A weak GPU in a budget system in 2 years' time will be something like RV530 running at 400MHz, presumably on 65nm, about 160M transistors. 57.6 GFLOPS and, say, 16 GB/s of bandwidth.
And a budget 3 GHz quad-core CPU would offer 48 GFLOPS with 192 GB/s of L1 cache bandwidth and about 12 GB/s of RAM bandwidth. Using the GPU for anything it isn't particularly designed for would be a bad idea...
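Presumably those numbers assume one 4-wide single-precision SSE operation and one 16-byte L1 access per core per clock:

\[
4 \times 3\,\text{GHz} \times 4 = 48\ \text{GFLOPS}, \qquad
4 \times 3\,\text{GHz} \times 16\,\text{B} = 192\ \text{GB/s}.
\]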
 
That really doesn't scale. If it's 60/40 now and 80/20 in the future then you only get a 33% increase and that's it.
The basis of your argument is correct, but your numbers are horribly wrong. Let's take R520 and R580 as examples. The first is 288mm2, the second is 352mm2. Excluding the VS' ALUs, the latter has 2.45x the GFLOPS per mm2. Furthermore, through some slightly more subjective calculations, we can conclude that only 10 to 15% of the R580's die was dedicated to the ALUs! (((352-288)/2)/288 = 11.1%...)
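Spelling that estimate out: tripling the PS ALUs means the 64mm2 of growth is roughly two extra copies of the original ALU block, so one copy is about 32mm2, i.e.

\[
\frac{(352 - 288)/2}{288} = \frac{32}{288} \approx 11\%
\]

of a 288mm2 die.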

And it's worth noting that they enlarged the PS register file at the same time. Arguably, this means R580 is far from a perfectly scalable architecture (which can easily be "proved" by looking at the die sizes of RV570 and RV530), but this is not today's debate. If you look at G80, they've got a gigantic amount of filtering power and ROP power. They also have a lot of SFU/Interpolator units. I'd be surprised if the G80 was more than 20/80. Now, imagine if architectures in the 2010 timeframe were 80/20 instead...

Besides, CPUs are abandoning the idea of long and narrow pipelines and going for short and wide pipelines, thereby also increasing the ALU/area ratio.
Very true. As I said above, if Nehalem is wider than Conroe and handles 2 threads per core, things could be very interesting indeed.

And like I said before, there's a huge variation in GPU performance. With a quad-core CPU and an R600 I'm sure it's worth it to do as much as possible on the R600. But with a quad-core CPU and a budget GPU it's best to stress the latter as little as possible. If a compromise has to be made, game developers choose to make it run well on budget systems...
While this is true today, I question whether it will be tomorrow. If we consider a many-core future, die sizes would suddenly become a lot more variable in the CPU world. As such, it'd make sense to sell 2-core, 4-core and 8-core offerings at the same time (roughly 100mm2, 180mm2 and 350mm2, for example). And then, the budget system might also be equipped with a much weaker CPU, not just a weaker GPU. Another possibility is that the ratio between CPU and GPU power would begin varying even more, which would make balancing tasks between the CPU and GPU absolutely key for optimal performance.

Unless they can standardize it (i.e. Intel implements the exact same functionality), it's not going to live long. Programmers are really not asking for yet another variation in system configuration...
What most programmers would want is a GPU that's exclusively dedicated to graphics and a CPU architecture that is single-core and that scales wonderfully in terms of serial performance over time. How likely is either of those things, though? In the end, what matters is middleware. It is increasingly unlikely we will see teams that create physics implementations from scratch for many-core CPUs, for example. And it's downright unthinkable that more than the most hardcore development teams would write a GPU physics engine from scratch!

You know, if I were a venture capitalist today, I'd invest in Ageia. Not because their hardware will take over the world. No, it'll most certainly flop pitifully and they'll be cash-starved soon enough. But considering their API is now 100% free, it's likely they'll get some serious developer adoption not only from indie studios, but also from AAA engine and game developers. Heck, UE3 and Gamebryo already use it as standard nowadays.

I would be extremely surprised if neither NVIDIA nor AMD (or even Intel!) decided to outright buy them within a year or two and made the API's acceleration work via DX10 instead. And whichever company buys Ageia will also port it to its own proprietary API (CUDA or CTM), which would be a key advantage. Of course, this is partially dependent on the PhysX API getting some further traction from game developers in the next 6-9 months. And either way, open-source physics and dynamics engines will be ready for GPU physics and many-core acceleration eventually.

Now, things other than physics are another matter completely, and it remains to be seen what will happen there, of course.


Uttar
 
The basis of your argument is correct, but your numbers are horribly wrong. Let's take R520 and R580 as examples. The first is 288mm2, the second is 352mm2. Excluding the VS' ALUs, the latter has 2.45x the GFLOPS per mm2. Furthermore, through some slightly more subjective calculations, we can conclude that only 10 to 15% of the R580's die was dedicated to the ALUs! (((352-288)/2)/288 = 11.1%...)
I admit I pulled those numbers out of thin air, but still, I don't believe they can repeat this ad infinitum. First of all your calculation seems a bit off. I get 12.5% for R520 and 27.3% for R580. But the rest isn't just texture samplers, there's also lots of other logic you need to keep the pipelines filled. You can't just sacrifice that to allow more shader units and expect higher performance. G80 likely set the ratio for the next generations.
And it's worth noting that they enlarged the PS register file at the same time.
They tripled the register file as well? Interesting if true. Do you have a reference for that? I was under the impression that R580 was quickly register starved for complex shaders.
I'd be surprised if the G80 was more than 20/80. Now, imagine if architectures in the 2010 timeframe were 80/20 instead...
If R580 is 27.3% then G80 is likely 30-40%. But leaving 20% for dedicated texture samplers and all the logic to keep the pipelines filled? Good luck with that. Seriously, I believe G80 already pushes the ALU/TEX ratio to the extreme. If increasing it even more effectively improved performance, I'm sure they would have.
While this is true today, I question whether it will be tomorrow. If we consider a many-core future, die sizes would suddenly become a lot more variable in the CPU world. As such, it'd make sense to sell 2-core, 4-core and 8-core offerings at the same time...
Some variation seems very likely indeed. But I expect a few things to make the difference much smaller than what we see in the GPU market. First of all, CPU performance is still the primary buying factor. People on a budget still want the fastest thing available for their money, and if Intel offers a quad-core for $100 then they won't choose a dual-core for $75. Secondly, Intel already offers dual-core CPUs for less than $100 today, making it very likely that quad-core will become affordable in little time as well. Finally, it is becoming increasingly interesting to produce only dual-core dies (currently packaging two together to form quad-core). So once they produce quad-core on a single die, dual-core will disappear from the budget market as well.
What most programmers would want is a GPU that's exclusively dedicated to graphics and a CPU architecture that is single-core and that scales wonderfully in terms of serial performance over time.
Going multi-threaded and doing extra work on the GPU are both investments. But going multi-threaded has to happen sooner or later anyway, and it avoids the risk of bogging down cheaper GPUs.
 
I don't believe they can repeat this ad infinitum.
Completely agreed. All I'm saying is that the GFLOPS/transistor of GPUs is likely to increase faster in the next 3-4 years than the CPUs'. What matters, however, is perf/transistor. If the GPU had a significant advantage there for computation-heavy tasks and the CPU's architecture didn't significantly evolve (two big IFs, I'll admit!), the CPU would become increasingly less important, at least for gaming. We aren't quite there yet, but the next 5 years will be decisive in determining whether it remains a key system component or not, and it'll be interesting to see how all the involved companies try to push their long-term agendas.

I get 12.5% for R520 and 27.3% for R580.
Oopsy, my number was for R520. No idea why I brainfarted R580 there.

But the rest isn't just texture samplers, there's also lots of other logic you need to keep the pipelines filled. You can't just sacrifice that to allow more shader units and expect higher performance.
Yeah, obviously. A fairer comparison would also have to account for the control logic and register file needed to keep those extra ALUs fed.

They tripled the register file as well? Interesting if true. Do you have a reference for that?
They definitely did, according to Eric Demers. They didn't triple the control logic though, and it'd be hard to estimate how much silicon that represents.

If R580 is 27.3% then G80 is likely 30-40%.
I fail to see how you get to 30-40% for G80. The ALU ratio is most likely a fair bit lower than R580's, IMO, no matter how we count the VS pipelines. The R580 has 16 TMUs and 16 ROPs, while the G80 has 32 or 64 TMUs (depending on how you count) and 24 ROPs. Both unit types are a fair bit more capable than R580's (4x MSAA zixel rates should illustrate that nicely for the ROPs ;)). Furthermore, if you exclude the MUL (which the R580 very arguably also has, they just don't advertise it), the R580 has more PS-only GFLOPS than G80 has for VS+PS.
G80 likely set the ratio for the next generations. [...] Seriously, I believe G80 already pushes the ALU/TEX ratio to the extreme. If increasing it even more effectively improved performance, I'm sure they would have.
NVIDIA's strategy is and has always been the same: optimize for the games that will be benchmarked on the card's release date, not the ones that will be benchmarked 6-12 months later. If you want to give yourself a rough idea of why that makes sense, look at Anandtech's review of G80. Out of the 7 games they used, I can only see 2 that might kinda sorta stress G80's ALU-TEX ratio. And even then, those two games (F.E.A.R. and Oblivion) benefit a lot more from other attributes of G80's architecture, such as the extremely fast Z and stencil rates for F.E.A.R. and cheap FP16 filtering/blending for Oblivion.
Given G8x's apparent architectural flexibility, and future workloads, it is extremely safe to say that G80 did NOT set NVIDIA's ratio for the next generations. I would be extremely surprised if that ratio didn't go up within the next 6 months.

Finally, it is becoming increasingly interesting to produce only dual-core dies (currently packaging two together to form quad-core). So once they produce quad-core on a single die, dual-core will disappear from the budget market as well.
Yup, that's kinda true. Intel doesn't seem to be planning a single-die quad-core on the Conroe architecture though, so their first native quad-core will be Nehalem. Who knows how much bigger (or smaller?) each of those dies will be. But then, the budget CPU might be 2 wider cores with 4 threads total, so applications would definitely have to be ready to handle more than 2 threads anyway.

Going multi-threaded or doing extra work on the GPU is both an investment. But going multi-threaded has to happen sooner or later anyway, and avoids the risk of bogging down cheaper GPUs.
That's definitely true for the next 2 years or so for core gameplay elements imo. However, for things like effects physics (which you can easily scale down without affecting gameplay), I don't think it really matters if you bog down a low-end GPU; if it didn't make sense to do that, it wouldn't make sense to design any feature that requires more performance than low-end GPUs (or CPUs) can offer.

In the end, I wonder how much of this argument makes sense, since it's very debatable at this point if we'll even see a "many-core" future. AMD seems to want to max out at 4 cores for the desktop, while Intel seems to favour 4 cores and 8 threads for the 2008 timeframe. It's likely they'll try 8 cores and 16 threads with two chips, but how likely is that to give any boost whatsoever to applications in the 2009 timeframe? It would certainly be interesting if Intel also took the APU road by 2010 with Gesher... (in fact, it looks like they're aiming at (differentiated?) micro-cores... hmm. Scheduling a single thread's instructions across those cores would certainly come in handy for them!)


Uttar
 
Actually, I was thinking about Fusion a little bit. I know 2 or 3 groups (companies) that are working on real-time raytracing engines. If Fusion takes off, could we see an explosion of this kind of technology? Wouldn't the Fusion GPU be able to deliver raytracing to the mainstream?
 
All I'm saying is that the GFLOPS/transistor of GPUs is likely to increase faster in the next 3-4 years than the CPUs'.
We're both looking into crystal balls here, but I believe there are important indications that the relative gap isn't going to get any larger. The Pentium 4 era is over. The primary reason AMD gained so much market share is that Intel bet everything on clock frequency to keep performance following Moore's curve. But AMD showed them where they should have been, with older silicon technology. With short but wide pipelines and multi-core, they're taking a big turn and catching up on what they missed in the previous years.

The primary reason why GPU performance has increased faster than Moore's law is clock frequency increases. Now that CPUs are riding the same waves of parallelism and clock frequency, I see little reason why one would do particularly better or worse than the other. Yes, architectural advances are clearly just as important, but in the long term neither has exclusive advantages.
That's definitely true for the next 2 years or so for core gameplay elements imo. However, for things like effects physics (which you can easily scale down without affecting gameplay), I don't think it really matters if you bog down a low-end GPU; if it didn't make sense to do that, it wouldn't make sense to design any feature that requires more performance than low-end GPUs (or CPUs) can offer.
Ok, you can scale down physics on the GPU, but that doesn't make a lot of sense if future CPUs will offer lots of GFLOPS over the entire high-end to low-end range. I also think it's easier to scale down graphics without affecting gameplay than it is to scale down physics (or even more crucial computations). Reducing the resolution and the detail goes a long way toward running the graphics on weaker GPUs. But even for scaled-down physics I don't want it to affect my framerate. If I have a quad-core I want it to be used effectively.
In the end, I wonder how much of this argument makes sense, since it's very debatable at this point if we'll even see a "many-core" future. AMD seems to want to max out at 4 cores for the desktop, while Intel seems to favour 4 cores and 8 threads for the 2008 timeframe. It's likely they'll try 8 cores and 16 threads with two chips, but how likely is that to give any boost whatsoever to applications in the 2009 timeframe?
As long as you have extra threads to run, more cores are always better. For multimedia there certainly can't be too many cores. But I also expect a paradigm shift for other applications. Nowadays it appears hard to even go dual-core, but once the idea of concurrent computing settles in and tools and languages appear that make it easier to manage, I don't think more than 4 cores will be a problem.

Ever used SystemC? It's a C++ framework for describing hardware. But it can be used for writing software as well (it's still C++). The basic entity is actually not a class but a thread (cf. concurrently running hardware functionality). So even in a simple SystemC project there are lots of threads, making it scale easily with the number of CPU cores...
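To give a flavour of that, here's a minimal SystemC sketch with two concurrently scheduled threads in one module; the module name, channel and timing values are made up for illustration:

```cpp
// Minimal SystemC sketch: two concurrently scheduled threads in one module.
// Module/channel names are illustrative, not from any real project.
#include <systemc.h>
#include <iostream>

SC_MODULE(PingPong) {
    sc_fifo<int> queue;                       // simple channel between the threads

    void producer() {                         // runs as its own thread
        for (int i = 0; i < 4; ++i) {
            queue.write(i);
            wait(10, SC_NS);                  // yield back to the scheduler
        }
    }

    void consumer() {                         // a second, concurrent thread
        for (int i = 0; i < 4; ++i) {
            int v = queue.read();             // blocks until data is available
            std::cout << sc_time_stamp() << ": got " << v << std::endl;
        }
    }

    SC_CTOR(PingPong) : queue(2) {
        SC_THREAD(producer);                  // the basic entities are threads,
        SC_THREAD(consumer);                  // not plain member-function calls
    }
};

int sc_main(int, char*[]) {
    PingPong top("top");
    sc_start();                               // run until no more events
    return 0;
}
```

Every SC_THREAD is an independently schedulable unit, so even a toy model like this already exposes more threads than a dual-core can run at once.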
 