G80: Physics inflection point

So, sure, if your program were a huge bunch of dependent integer MULs, INT24 would be twice as fast as INT32. But that doesn't mean INT32 is emulated; it just means one of the two units isn't capable of it.

Well, it's a theory that's testable (assuming the driver works correctly in extracting ILP/co-issue). Write an OGL shader that uses the U24 MUL, with/without perspective correction, and see what happens. Then try it with the 32-bit MUL, with/without perspective correction, as a control group.

I don't buy the theory of a separate perspective-correct MUL unit, since you'd have to compute 1/w first, wait for it to finish, and then issue the MUL, so it doesn't make much sense to make it separate from the SF unit. Rather, I think the SF unit is where the MUL is located; that is, it is used to do an RCP, followed by a MUL. If you ever find the "missing MUL", I bet you can't co-issue it with an SF or interpolant read. Whether the main MADD unit can do full 32-bit integer MULs at full speed is another question that hopefully someone will test. Given that 32-bit ADD is supported, it would seem probable, but you never know.
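If the OGL route turns out to be a pain, the same thing could presumably be probed from CUDA once it's public. Here's a minimal sketch under the assumption that CUDA's __umul24 intrinsic maps onto the fast 24-bit path; the kernel name and constants are made up, and the idea is simply to time a long dependent chain of multiplies, swapping the marked line between __umul24 and a plain 32-bit * as the control:
Code:
// Hypothetical microbenchmark: a long dependent chain of integer MULs.
__global__ void mulChain(unsigned int *out, unsigned int seed, int iters)
{
    unsigned int x = seed + threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = __umul24(x, 16777213u);    // 24-bit path; use "x * 16777213u" as the 32-bit control
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;   // keep the chain live so it isn't optimized away
}
A couple of million iterations per thread, timed with cudaEvent timers around the launch, should make a 2x throughput gap pretty hard to miss.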


As for physics acceleration, call me back when I can use one in an indie game without paying :LOL:$, okay? :( Havok FX actually carries an extra price compared to plain Havok. On the plus side of things, I think CUDA will hopefully become popular within the open source community, and engines like Bullet or ODE, or other ones, will get some much-needed love from GPGPU.

It remains to be seen, however, what exactly NVIDIA's business model for CUDA will be...

I don't think it will be any different from Cg. I think CUDA will be free when it exits NDA, but Quantum Physics won't be, just like Gelato isn't. It will then only be a matter of time before ODE et al. are ported, although ODE IMHO needs a lot of work to match half of what Havok does. It's possible that QPE might be free or reduced-cost for "The Way It's Meant to Be Played" developers, since inclusion of QPE practically drives demand to own an NVidia G8x card.
 
In the case of integer ops, have a look at NV_gpu_program4 GL extension. Notice how the MUL instruction supports a special U24/S24 modifier for "fast" integer multiplies? That would seem to imply that a 32-bit integer MUL is "emulated" and that MUL truly operates with 24-bit precision.
Not necessarily. The hardware could be doing the multiply over multiple clock cycles and just stopping earlier for a 24x24 mul <shrug>
This does not appear to be the case for ADD.
Well, adders are a lot cheaper than multipliers! (except, for the pedants out there, 1 and 2-bit multipliers/adders! :) )
 
DemoCoder: Sure, give me a Series-100 driver to support those OGL extensions, with the proper revision (105+?) for the MUL to also be exposed (assuming it will be in the future), and I might be able to test that :p

And yes, I believe it is correct to assume that it is part of the SF unit and that you couldn't co-issue interpolation/SF and that MUL (except for persp. correction). That doesn't mean the MUL isn't there and couldn't theoretically be exposed in certain circumstances instead of letting it idle when interpolation is idling... But of course, that's also impossible to prove at this point in time.


Uttar
 
However, some thinking on the crapper has convinced me that the G80 will kick-start the physics market long before even DX10 games arrive, in a way that Ageia can only dream of. We've seen hints of the G80's physics power in the "smoke box" demo, which runs real-time Navier-Stokes computational fluid dynamics calculations on a very fine grid concurrently with a raytracing/raymarching rendering app, showing off single-GPU usability of physics.


I was thinking that perhaps nV was willing to depart from their judicious use of transistors (not to suggest that the G80 is not efficient) for the purpose of cementing their importance in the PC. After all, the improvements over last gen are so great that they almost become academic. They could have introduced the 8800 GTX with the same specs as the GTS and everyone would still have been pleased as punch. Even when DX10 arrives I don't see any shortcomings with this chip. It should still have enough 'spare' processing power to be utilized as a sort of co-processor. So with that in mind I would suggest that nV was willing to go the route they did in order to embed themselves as a necessary component of a modern PC architecture.
 
NVidia's "Quantum Physics Engine" seems to be a low-level port of Havok to NVidia's CUDA architecture, which seems to repurpose L1/L2/GPR/onchip storage into scratch-pad RAM ala CELL, albeit with 8 times the scalar MADD power of a single CELL, as well as a much faster memory bus and better TLP design/branching performance.

Not that I don't agree that G80 or future GPUs look great as competitors for PPUs, but could you clarify that last part?

Is it better branch performance compared to previous GPUs, or to CELL?
An SPE has a branch granularity of 1, while the GPU in the best-case will be 16x worse.

I may have misread the article, but it also looks like the GPU's total storage is also less than that of an SPE.
 
The GPU won't "mispredict" though, while the SPE definitely will. For short branches, that might even make the GPU slightly better; for big ones with a lot of unique code on both sides, obviously not. As for local storage, the SPE definitely has a theoretical advantage there, but I'm not sure it's such a big one for many algorithms if you consider that a major reason for that amount of LS is prefetching, hiding latency, etc. - a GPU is stupidly good at that kind of stuff by design.

I don't have the G80 register file figures, but I figure that stuff must be quite damn big; it can, after all, have 256+ threads in flight per cluster to hide its own latency, and a few times that to hide texturing latency. I wouldn't be surprised at all if you had 256KiB+ of SRAM just for that register file.

Obviously, if what you're doing is having 100 threads hitting stuff randomly in a shared 128KiB memory area you can prefetch, and with exclusively Vec3/Vec4 FP32 operations and not scalar ones, CELL is going to eat your GPU for breakfast. But in slightly more realistic circumstances where you still have massive parallelism to exploit, the opposite is just as true, if not even more so, imo.

Now, imagine if the G90 (ETA: Q4 2007?) had 3x the GFlops, and support for FP64 operations at a lower speed. And maybe slightly bigger caches for the same price too. Ouch @ CELL.


Uttar
 
Obviously, if what you're doing is having 100 threads hitting stuff randomly in a shared 128KiB memory area you can prefetch, and with exclusively Vec3/Vec4 FP32 operations and not scalar ones, CELL is going to eat your GPU for breakfast. But in slightly more realistic circumstances where you still have massive parallelism to exploit, the opposite is just as true, if not even more so, imo.

?! Did I miss something? The GPU's register file isn't split into ALU-specific regions? Or are you referring to the SPE's local memory area? Maybe I didn't miss something so much as got lost ;^/

Now, imagine if the G90 (ETA: Q4 2007?) had 3x the GFlops, and support for FP64 operations at a lower speed. And maybe slightly bigger caches for the same price too. Ouch @ CELL.

A lot may depend on IBM's/Sony's actual interest in continuing to build/improve Cell. And how are you getting to 3x the GFlops? [Besides peering over at that NDA'd R600 ;-) ] Assuming the same ginormous die size with a shrink to 65nm, you only get about 1.9x (a linear shrink of 90/65, squared, is roughly 1.9 - all other things being equal). That's a pretty steep increase in speeds (approaching 2GHz) that you seem to be forecasting.... Or do you believe that a fair amount of transistor savings or ALU/TEX rebalancing is in the cards for G90?
 
I think it (DirectPhysics) died, but given that ATI and NVidia are shipping separate GPGPU drivers that allow developers to bypass DX, I think it is only a matter of time before MS takes note and tries to reconcile the two IHVs' proprietary interfaces into a standardized one.

That would seem to make sense. Whatever the drivers are doing, it would be interesting in a bit, once CTM and CUDA are better understood, to get an analysis of what they do similarly and what they don't, and the performance envelope of each vendor's hardware doing those tasks.

I don't know if G80 is the inflection point, other than as a matter of NV joining ATI in pushing this area hard with hardware that is up to the task, which G7x clearly was not. The two IHVs working together to push an area (whether it be SM3 or physics, or whatever) are always going to be a much, much stronger force than either of them pushing a given area by themselves. It's very much a case of the whole being greater than the sum of the parts when both IHVs put their shoulder to the wheel.
 
The GPU won't "mispredict" though, while the SPE definitely will.
I read through the article section on branching, but I didn't see how what was written indicated there wouldn't be branch mispredicts. Does G80 stall a given thread (or group of 16 threads) and try to overlap it with the execution of other threads?
How many branch execution units are there?

For short branches, that might even make the GPU slightly better; for big ones with much unique code for both branches, obviously not.
I would have thought that if the goal is to hide branch latency, then longer branches would be better in order to fill in more execution cycles without another group of branches hitting the pipeline.

What exactly is G80's branching mechanism?
 
The idea on G8x, just like on any other modern SM3.0+ GPU (at least, in terms of pixel shaders), is that there is *no* prediction going on. All instructions are executed one after the other, and the vast number of threads in the GPU, rather than prediction, is what hides the ALU's and the scheduler's latencies. You are fundamentally never going to see a new instruction begin before the results of the last instruction are fully known... (on G80, that is, at least!)

If you have two short branches, at worst it'll execute all potential instructions for both. On a CPU, a short branch might mean it has to "go back" even further, because it has been speculatively executing instructions past the branch along the wrong prediction. All AFAIK, and I give no guarantee of perfect accuracy of course.
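To put that in CUDA-ish terms (purely illustrative; the kernel and data layout are made up): when threads that are scheduled together disagree on a branch, the hardware just runs one side with the other threads masked off, then the other side. No prediction, no rollback - the cost for a divergent group is simply the sum of both sides.
Code:
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i & 1)
        data[i] = data[i] * 2.0f;   // odd threads execute this side...
    else
        data[i] = data[i] + 1.0f;   // ...even threads execute this one; each side is run in
                                    // turn, with the non-participating threads masked off
}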


Uttar
 
I just found the GPGPU slides from the SuperComputing 2006 tutorial, but not from the workshop (which I think goes into more depth). Main page, CTM, and CUDA:
http://www.gpgpu.org/sc2006/
http://www.gpgpu.org/sc2006/slides/08a.segal.ctm.pdf
http://www.gpgpu.org/sc2006/slides/08b.buck.cuda.pdf

The CTM slides are similar to ones we've seen before. I think these are the first public CUDA slides. First thing I notice, proving DemoCoder correct, is "Full Integer and Bit instructions". So we not only have full int support, but also bit operations. As far as I can tell, all the HLSLs mark bit operations as reserved / not implemented.

Also says "Parallel Data Cache per cluster 16KB". I assume this is the size of the L1 cache, but it also says it's "As Fast Registers".

There's also a mini code sample:
Code:
dim3 DimGrid(100, 50); // 5000 thread blocks
dim3 DimBlock(4, 8, 8); // 256 threads per block
size_t SharedMemBytes = 64; // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);
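For what it's worth, a kernel body matching that launch might look something like this - entirely hypothetical, since the slides don't show one - with the third launch parameter backing the dynamically sized __shared__ array and a couple of the advertised integer/bit ops thrown in:
Code:
__global__ void KernelFunc(unsigned int *out)
{
    extern __shared__ unsigned int scratch[];   // sized by SharedMemBytes at launch: 64 bytes = 16 uints
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int tid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;  // flatten the 4x8x8 block
    if (tid < 16)
        scratch[tid] = tid ^ (tid << 3);        // integer/bit instructions: XOR and shift
    __syncthreads();                            // wait for the shared-memory writes
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    out[block * threadsPerBlock + tid] = scratch[tid & 15];
}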
 
"Both vendors committed to double precision soon" :D

Well, I don't have a specific reason to be happy, but with broad support it would seem that DP is going to ramp up on GPUs nice and quick.

Jawed
 
If you have two short branches, at worst it'll execute all potential instructions for both. On a CPU, a short branch might mean it has to "go back" even further, because it has been speculatively executing instructions past the branch along the wrong prediction. All AFAIK, and I give no guarantee of perfect accuracy of course.
In a CPU, the mispredict penalty is fixed at the number of stages between branch issue and branch evaluation. It doesn't matter how many instructions follow a branch, that's the penalty.
Longer stretches of instructions that follow a branch instruction are preferable, since the mispredict penalty is fixed and can be compensated for in the long run.

For a GPU, what you've outlined indicates that the primary assumption is that there are enough threads that provide work to overlap execution during the stall period for the branched threads.
I got the impression that having long stretches of shader code without branches would be best here as well, since you can't hide branch latency if everything you try to overlap is branch-heavy.

Running both sides of a branch is also somewhat risky, since that cuts throughput in half in the best case, and keeps cutting it down further the more aggressively it is used.

I'm not touting CELL or any other CPU as being a better bet for physics, I just don't see completely avoiding branch prediction as being the same as having good branch performance.

The workload may not need good branch performance, but that doesn't by default make GPUs good at branching.
 
I got the impression that having long stretches of shader code without branches would be best here as well, since you can't hide branch latency if everything you try to overlap is branch-heavy.

Well, much like the CPU case, branch penalty is branch penalty. If there is a latency between issue and evaluation, each of those clocks is filled by issuing different threads. So, as long as the number of jobs you have to work on is greater than your latency, it doesn't really matter what kinds of operations are issued.

BTW, I haven't seen any notion of a branching unit in G80, it appears to be part of the scheduler, as far as I know. I would think the hardest (latency-ridden) part of performing a branch on G80, though, is making sure the instruction that needs to be issued is currently in your instruction cache.... I suppose that highly branchy code could conceivably keep your scheduler so busy that it couldn't manage to schedule ALU instructions, though it could hardly be said that the GPU was idling or not performing under those circumstances....

Running both sides of a branch is also somewhat risky, since that cuts throughput in half in the best case, and keeps cutting it down further the more aggressively it is used.

Predication has its downsides, which is why we'd all like to know what this "32" number is all about :)
Of course, batching has its advantages when your batch also happens to have memory-coherent access, and retaining the latter while improving branch performance appears to be ... an entertaining engineering challenge.
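As a toy example of the trade-off (names and numbers made up), this is what predicating a short branch amounts to: both sides get evaluated for every thread, and the "branch" degenerates into a per-thread select. Perfectly fine for a line or two of work, increasingly wasteful as the sides grow.
Code:
__global__ void shortBranch(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float doubled = in[i] * 2.0f;                  // "taken" side, always computed
        float bumped  = in[i] + 1.0f;                  // "not taken" side, always computed
        out[i] = (in[i] > 0.0f) ? doubled : bumped;    // per-thread select instead of a jump
    }
}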
 
Well, much like the CPU case, branch penalty is branch penalty. If there is a latency between issue and evaluation, each of those clocks is filled by issuing different threads. So, as long as the number of jobs you have to work on is greater than your latency, it doesn't really matter what kinds of operations are issued.
We know that the threading capacity of a given cluster is finite, and that every new thread that is pulled in takes up resources. I don't know the particulars, but it sounds to me like the best balance, if branching must be done, would be occasional branches with plenty of other non-branches to overlap execution.

G80 could be designed heavily enough that each cluster can support as many threads as could possibly be in flight even if every shader were just a long string of branches, but that sounds like overkill.

BTW, I haven't seen any notion of a branching unit in G80, it appears to be part of the scheduler, as far as I know.
The article mentioned dedicated branch units, so the other units wouldn't be tied up by a pending branch.

I may have overlooked the numbers, because I don't know how many there are.
 
dnavas: fwiw, here's my current guess for G90. This is 100% speculation, and should not be taken as more than that. Quoting me on this randomly or "leaking" this is not "fun", no... :p
- 65nm, Q4 2007; 400mm2+
- 1.5GHz GDDR4, 384-bit Bus
- 1.5GHz Shader Core Clock
- 650MHz Core Clock
- 32 MADDs/Cluster
- 24 Interps/Cluster
- 10 Clusters


Uttar
 
I don't think anyone is saying GPUs don't suffer a penalty for branches, just that they are designed (presumably) with enough threads in flight to hide the penalty down to a certain level. However, really small branches can just be replaced with predication. One may also need to do "extra work" by partitioning one's data manually to increase coherence. A future G80 offshoot could add a branch predictor or branch hint mechanism, but with the number of threads continuing to scale up, I think they'll just continue to improve latency hiding, since for GPUs it is throughput, not latency, that matters.

CELL SPE programming seems to require a lot more extra work and compiler magick to obtain optimality: auto-SIMDization, branch hinting, manual cache strategy, and of course a big pain if you want to truly do a scalar algorithm, since you can't write anything but a quad-word, not to mention the pain of integer ops (a 32-bit integer MUL requires 5 instructions).

This is not to say that CELL is bad, but GPUs (outside the NV3x) aren't as sensitive to instruction scheduling and latency. It's easy to block a CPU from doing any work and having it sit there idle for dozens or hundreds of cycles. The beauty of the GPU designs is the sheer simplicity of the processing elements, like Niagara on steroids, scaled up to a huge number of cores and threads. As long as your workload is relatively data parallel in nature, or throughput oriented, it's a big win.
 