G80: Physics inflection point

DemoCoder

Veteran
The G80 delivers stellar increases in IQ/performance on DX9 games over G7x, but many people will reason that the true power won't be unlocked until DX10 games arrive, which are a long way off, and thus, in the meantime, the card will only be a decent upgrade over top-end G7x/R5xx.

However, some thinking on the crapper has convinced me that the G80 will kick-start the physics market long before DX10 games even arrive, in a way that Ageia can only dream of. We've seen hints of the G80's physics power in the "smoke box" demo, which runs real-time Navier-Stokes computational fluid dynamics calculations on a very fine grid concurrently with a raytracing/raymarching rendering app, showing off single-GPU usability of physics.
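For a sense of what that kind of workload looks like on the shader array, here's a minimal sketch of the inner loop of a grid-based fluid solver: one Jacobi relaxation step of the pressure-Poisson solve, the sort of kernel a solver like the one in the smoke box demo presumably iterates many times per frame. The demo's actual internals aren't public, so the kernel and its names are purely illustrative CUDA:

Code:
// One Jacobi relaxation step for the pressure-Poisson solve in a 2D
// grid-based (stable fluids) solver. Each thread updates one interior
// cell from its four neighbours and the local velocity divergence.
__global__ void jacobi_step(const float* p_in, const float* divergence,
                            float* p_out, int N)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x > N - 2 || y > N - 2)
        return;                                   // skip boundary cells

    int i = y * N + x;
    // p' = (p_left + p_right + p_down + p_up - div) / 4
    p_out[i] = 0.25f * (p_in[i - 1] + p_in[i + 1] +
                        p_in[i - N] + p_in[i + N] - divergence[i]);
}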

Whereas Ageia was coupled to Novodex on their HW (which IMHO was a Physics *decelerator*), NVidia is pushing Havok, which has the majority of the middleware physics marketshare. This opens up the possibility that a recompile/relink/patch of existing games could take advantage of GPU physics, perhaps with increased "Effects Physics" loads in the beginning.

NVidia's "Quantum Physics Engine" seems to be a low-level port of Havok to NVidia's CUDA architecture, which seems to repurpose L1/L2/GPR/onchip storage into scratch-pad RAM ala CELL, albeit with 8 times the scalar MADD power of a single CELL, as well as a much faster memory bus and better TLP design/branching performance. If the QPE can be dropped into existing Havok apps with minimal upgrade work, it could show off immediate benefits/killer demos in a way that Ageia never could, in that one doesn't need to buy a "PPU only" card.

Besides the fact that, for the first time, a GPU card is on the market which may be able to accelerate physics reasonably well (and indeed, accelerate more than just "effects physics" with no collision detection), SLI becomes even more worthwhile, even for low-end G8x cards, as a way to add on PPU power later. It's unlikely that people with SLI 8800GTXes will encounter resolution limits in the near term on DX9 games, and higher SLI AA levels have diminishing returns (unless we're talking about getting into temporal supersampling, etc.), so there is less pressure to own an SLI 8800GTX setup given its rendering performance on existing games. However, QPE performance would be a major selling point if adding a second card let you cut down on context-switch time and memory/cache thrashing (between GPU rendering and CUDA mode), as well as maximize physics performance.

Leaving aside Folding@Home stuff, I also think we will see this go beyond Newtonian dynamics and branch out into audio processing (resurrect Aureal Wavetracing please), as well as the mundane, such as boosting multimedia apps (Adobe) and even server apps (think SSL acceleration).

But my main point is, Ageia was too expensive for a single-purpose solution, as well as, IMHO, being unable to compete with NVidia/ATI from an engineering perspective (process, memory bus, design, et al.). Physics on previous-gen non-DX10 chips seems to be limited by architecture. Stream-out helps, but is less useful than additionally having shared general-purpose random-access memory plus synchronization primitives (G80). Now granted, the total amount of on-chip memory is small, which means it won't be a cakewalk to get it working, but the potential is a lot higher.
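To make that last distinction concrete, here's a rough sketch (the names and the toy force model are mine, purely illustrative) of the pattern that shared scratch-pad plus barriers enables and that stream-out alone doesn't: a block of threads cooperatively stages a tile of particle data on-chip, synchronizes, and then every thread random-accesses the whole tile at near-register latency, the kind of thing broad-phase collision or n-body style physics kernels live on.

Code:
// Illustrative CUDA: tiled pairwise interactions using shared memory as
// scratch-pad plus barrier synchronization. Assumes the kernel is launched
// with blockDim.x == TILE, n is a multiple of TILE, and exactly n threads.
#define TILE 256

__global__ void pairwise_forces(const float4* pos, float4* force, int n)
{
    __shared__ float4 tile[TILE];                 // on-chip scratch-pad RAM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 self = pos[i];
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);

    for (int base = 0; base < n; base += TILE) {
        tile[threadIdx.x] = pos[base + threadIdx.x];  // cooperative load
        __syncthreads();                          // barrier: tile is full
        for (int j = 0; j < TILE; ++j) {          // random access, on-chip
            float3 d = make_float3(tile[j].x - self.x,
                                   tile[j].y - self.y,
                                   tile[j].z - self.z);
            float r2    = d.x * d.x + d.y * d.y + d.z * d.z + 1e-6f;
            float invr3 = rsqrtf(r2 * r2 * r2);   // 1 / r^3
            acc.x += d.x * invr3;
            acc.y += d.y * invr3;
            acc.z += d.z * invr3;
        }
        __syncthreads();                          // don't overwrite tile early
    }
    force[i] = make_float4(acc.x, acc.y, acc.z, 0.0f);
}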

Thus, 2007 could be the year of GPU physics acceleration. True ground-up DX10 games (as opposed to quickies which use some DX10 features) might not arrive until 2008. However, middleware upgrades that use the GPU may bridge the gap, letting the extra (and in some cases hidden/unexposed-by-DX10) GPGPU power of these cards upgrade existing engines with nice add-on features (remember how various games got EAX/Aureal/MMX/SSE "patches"?).

Now, it is true that 2007 will also be the year of quad-core CPUs and ubiquitous dual-core chips, and that means games with increased physics workloads will run on those systems too, but NVidia is already claiming 10x physics performance vs. a 2.66 GHz Core 2 Duo, so whatever additional CPU power these machines have will only help: it can speed up QPE setup, load-balance with the QPE (CPU + GPU physics acceleration), or relieve CPU limits elsewhere.

A counter argument would be that G80 ain't gonna run Havok much faster than a Core 2 Duo, that perhaps Havok collision detection won't run on it, or will run, but won't have a stellar speedup. Or, that a 10x speedup given n^2 or n^3 scaling won't lead to that much of a visual difference. I think those are legitimate points to consider.

Finally, there's the aspect of a "physics-only G80". Why, when I just explained how consumers will be put off by single-use solutions? I believe this may come to fruition anyway, as a way for owners of non-SLI mainboards to boost performance: a cut-down G80 core with a smaller memory bus on a single-slot card, installable in a secondary PCI slot. During the NVidia G80 launch, they held up an enthusiast 680i mainboard with *FOUR* NVidia cards in it. Two were 8800GTXs, which each consume two slots. The other two were thin single-slot cards, presumably sharing a third space where you can plug in either another dual-slot 8800GTX or one or two single-slot GPUs. There are two potential markets for this, depending on price: 1) if they can get reasonable performance from a cut-down G8x for under $100, it might sit well with mainstream users; 2) the SLIzone overclocking "I spent $2000 on my gfx system alone" crowd.

Ok, enough bathroom crapper theories.
 
Yes.

G8x and DX10 will finally crack the chicken-or-egg problem. We need a killer app, and up to now devs were reluctant to invest in significant physics love because the install base was so low. This changes with DX10 GPUs: now everyone with a DX10 GPU also has a 'PPU'.
 
To some extent, yes. I think, however, that GPGPU physics engines won't be written to the DX10 API, but will use low-level, IHV-specific GPU drivers to take advantage of the underlying architecture. They *could* be written to the abstract DX10 API, but the performance would probably be worse.

Moreover, given the paucity of DX10 GPU suppliers (ATI, NVIDIA, and maybe Intel), middleware developers like Havok only need to tune/port their engine for three different proprietary "GPGPU" interfaces, making the problem tractable from a developer standpoint.

Realistically, they'll just do G80 and R600 versions, ignore Intel GMA, and probably not bother too much with G7x/R5xx versions.
 
What's the latest on Microsoft adding physics to DX? I recall this was hinted at a year ago, but I haven't heard anything in a while. It would simplify things greatly for developers. The IHVs would be responsible for writing backends that map to their hardware, be it PPU, G8x, R5xx, etc. MS could even supply a generic backend that maps physics onto the DX10 graphics pipe as a fallback, for lazy IHVs that don't want to write another driver.
 
I think it died (DirectPhysics), but given that ATI and NVidia are shipping separate GPGPU drivers that allow developers to bypass DX, I think it is only a matter of time before MS takes note and tries to reconcile the two IHV proprietary interfaces into a standardized one.
 
I think it died (DirectPhysics), but given that ATI and NVidia are shipping separate GPGPU drivers that allow developers to bypass DX, I think it is only a matter of time before MS takes note and tries to reconcile the two IHV proprietary interfaces into a standardized one.

Interestingly, the G80 CUDA driver works alongside the standard driver, it doesn't replace it.
 
DemoCoder,

Please slightly reduce the fibre in your diet (joke)... as that was a most enjoyable post to read :)
 
Interestingly, the G80 CUDA driver works alongside the standard driver, it doesn't replace it.

Yes, but developers trying to support both CUDA and ATI's DPVM/CTM may start to yearn for a higher-level tool/API. That may be hard given architectural limitations, but there may be a way to use profiles to get around it. MS may eventually offer something like CUDA/Brook with caveats, to which NVidia and ATI supply drivers. Of course, NV and ATI would also continue to ship their proprietary versions as well. (Just like HLSL/GLSL didn't kill off Cg.)
 
Great post, however just a small nitpick:
Leaving aside Folding@Home stuff, I also think we will see this go beyond Newtonian dynamics and branch out into audio processing (resurrect Aureal Wavetracing please), as well as the mundane, such as boosting multimedia apps (Adobe) and even server apps (think SSL acceleration).
What's common between F@H, audio, Adobe, and every other program that has been or could be mapped onto a GPU is that 99% of their computation is floating point or can be made floating point (mapping integer audio or images -> fp -> int is fine), just like graphics rendering. This is a huge potential multimedia market that GPUs can target.

There's no way that they can do any server tasks. SSL and the like are all purely integer. Having a slight rounding error when mapping audio from fp -> int is acceptable. Having a small rounding error on your SSL connection, well, kinda makes it useless. :rolleyes:
 
There's no way that they can do any server tasks. SSL and the like are all purely integer. Having a slight rounding error when mapping audio from fp -> int is acceptable. Having a small rounding error on your SSL connection, well, kinda makes it useless. :rolleyes:

G80 SPs have an integer mode with no rounding issues, the only limitation being they aren't full 32-bit integers, and most likely use the 24-bit mantissa to do all integer arithmetic. But this is no problem for integer code. People have been using 24-bit integer DSPs for years. Secondly, with true IEEE754 semantics, one can get away with doing integer arithmetic as long as one is careful; overflow, division, and bitwise ops are the areas to watch out for.

Moreover, most public-key cryptography algorithms are based on arbitrary-precision modular arithmetic. For those based on the integers, one can essentially view big-number multiplication as an FFT, and GPUs eat FFTs for breakfast. Pretty much all of the basic RSA operations would be no sweat for the G80: modular exponentiation, the Chinese remainder theorem, etc. In fact, it's the inherent scalability of these operations that makes embedding RSA accelerators into the smart chips on credit cards feasible. SSL acceleration is also inherently TLP, since the goal is to deal with hundreds or thousands of concurrent incoming SSL session negotiations, and the scalar, thread-parallel nature of the G80 is perfect for this. People today are selling FPGA solutions and embedded multi-core Linux boxes as "SSL accelerators" with far worse performance/cost.
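As a toy illustration of why handshake crypto maps onto a thread-parallel part (this is my sketch, not anything NVidia or Havok has shown): give each thread one pending handshake and have it run square-and-multiply modular exponentiation. It's deliberately simplified to 32-bit moduli; real RSA/DH operands are 1024+ bits and need multi-word big-number arithmetic layered on top of the same structure.

Code:
// One thread per pending SSL handshake: out = base^exp mod m via
// square-and-multiply. 32-bit moduli only, so 64-bit intermediates
// never overflow; a real implementation would add multi-word
// (arbitrary-precision) arithmetic on top.
__global__ void modexp_batch(const unsigned int* base, const unsigned int* exp,
                             const unsigned int* mod, unsigned int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned long long m = mod[i];
    unsigned long long b = base[i] % m;
    unsigned long long r = 1;
    unsigned int e = exp[i];

    while (e) {
        if (e & 1) r = (r * b) % m;   // multiply step
        b = (b * b) % m;              // square step
        e >>= 1;
    }
    out[i] = (unsigned int)r;
}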
 
G80 SPs have an integer mode with no rounding issues, the only limitation being they aren't full 32-bit integers, and most likely use the 24-bit mantissa to do all integer arithmetic. But this is no problem for integer code. People have been using 24-bit integer DSPs for years. Secondly, with true IEEE754 semantics, one can get away with doing integer arithmetic as long as one is careful; overflow, division, and bitwise ops are the areas to watch out for.
Interesting, I hadn't thought about that. I have no doubt that *if* you can get a working SSL engine on a G80, it'll be blazingly fast. Most ciphers in use are designed to be stream-based and easy to implement in hardware, so they're likewise easy to implement with little branching on DSPs, FPGAs, and I guess now GPUs. I'm just not convinced it's entirely feasible from a computation standpoint. Are there any public details of what exactly can execute on the SPs, or do we have to wait for the CUDA SDK to get out of NDA?

Also how long before we have the first sysadmin ask his boss for a G80 so he can do some server testing :LOL: .
 
Well, DX10 requires integer registers, so I'm pretty sure both G80 and R600 will support accurate integer arithmetic.

The most expensive operation in SSL isn't the bulk cipher (RC4, 3DES, etc.), but the connection setup in which the session key is negotiated. That's what requires a lot of high-precision modular arithmetic, a la Diffie-Hellman/RSA/ElGamal/etc. ECC uses Galois fields, so it's slightly different (and more expensive). Accelerating that will take a huge weight off connection-setup costs, and level-3 routers will be able to set up the SSL/TLS connection and parse the HTTP headers before handoff.

Of course, G80 could also be used in a cluster for key-cracking purposes too. :)
 
Nice post and great idea. It would be nice to have an 8800GTX polymorphed into an [8200 physics card + 8600 graphics card] when you like/need it :D
 
G80 SPs have an integer mode with no rounding issues, the only limitation being they aren't full 32-bit integers, and most likely use the 24-bit mantissa to do all integer arithmetic.
They are full 32-bit integers. If they weren't, NVidia would have a huge problem.
 
They are full 32-bit integers. If they weren't, NVidia would have a huge problem.

There's a difference between, say, supporting 32-bit integer storage and precision, and having fully orthogonal native support. No GPU supports double-precision FP, but that doesn't mean it could not be supported by compilers via emulation, just like GPUs used to support transcendentals and other functions via power-series expansion. In the case of integer ops, have a look at the NV_gpu_program4 GL extension. Notice how the MUL instruction supports a special U24/S24 modifier for "fast" integer multiplies? That would seem to imply that a 32-bit integer MUL is "emulated" and that MUL truly operates with 24-bit precision. This does not appear to be the case for ADD. So there are some special circumstances (right now, it appears to be only MUL) where the HW does ops with internal 24-bit integer precision. But it's not really a big deal, since one can do arbitrary-precision arithmetic in software if one wants. DX10 also requires registers with 4 components, but the G80's registers are inherently scalar internally; the SIMD vectorization is an illusion.
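For what it's worth, "emulated" here would just mean composing the wide multiply out of narrow ones. Here's a hedged sketch of how a compiler could build a 32-bit integer multiply from 24-bit-capable hardware multiplies, written with CUDA's __umul24 intrinsic; the exact sequence NVidia's compiler actually emits isn't public.

Code:
// Build a 32-bit unsigned multiply (mod 2^32) out of 24-bit-capable
// hardware multiplies by splitting each operand into 16-bit halves.
// __umul24(a, b) multiplies the low 24 bits of a and b and returns the
// low 32 bits of the product, so 16x16-bit partial products are exact.
__device__ unsigned int mul32_via_24bit(unsigned int a, unsigned int b)
{
    unsigned int a_lo = a & 0xFFFFu, a_hi = a >> 16;
    unsigned int b_lo = b & 0xFFFFu, b_hi = b >> 16;

    unsigned int lo    = __umul24(a_lo, b_lo);
    unsigned int cross = __umul24(a_lo, b_hi) + __umul24(a_hi, b_lo);

    // a_hi * b_hi only contributes to bits >= 32, so it vanishes mod 2^32.
    return lo + (cross << 16);
}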
 
A G8x cluster currently has 16 MADDs and 16 MULs. I believe the MADD is FP40 internally and can execute either FP32 or INT32 operations; the MUL (which is currently only exposed for perspective correction) is FP32 internally and can execute either FP32 or INT24 operations. So, sure, if your program was a huge bunch of dependent integer MULs, INT24 would be twice as fast as INT32. But that doesn't mean INT32 is emulated; one of the two units just isn't capable of it.

As for physics acceleration, call me back when I can use one in an indie game without paying :LOL:$, okay? :( Havok FX actually carries an extra price compared to plain Havok. On the plus side of things, I hope CUDA will become popular within the open-source community, and engines like Bullet or ODE (or others) will get some much-needed love from GPGPU.

It remains to be seen, of course, what exactly NVIDIA's business model for CUDA will be...


Uttar
 