Nvidia GT300 core: Speculation

Because you need 10s or hundreds of atomics in flight to make the performance bearable.

Seems to me that sparse or frequent atomic usage doesn't matter as long as you can keep the ALUs busy with work (unless you are bandwidth bound). Throughput is what matters here; GPU atomics are only for unordered usage by design. An atomic is just like a high-latency texture or global memory access. I cannot think off-hand of cases where we are latency bound on the GPU now (tiny draw calls, CUDA kernels with tiny grids - and with proper workload balancing that issue might go away). Maybe when the current GPU scaling starts to level off and on-chip networking becomes a bottleneck, we can return to the latency problem.
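
To make the unordered, throughput-first usage concrete, here's a rough CUDA sketch (just a toy histogram): the return value of atomicAdd is discarded, so nothing depends on the atomic completing and the scheduler can keep issuing independent work while the atomics are in flight.

[CODE]
__global__ void histogram256(const unsigned char *data, unsigned int *bins, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (; i < n; i += stride) {
        // Fire-and-forget: the old value returned by atomicAdd is ignored, so the
        // latency is hidden behind the next iteration's independent loads and math.
        atomicAdd(&bins[data[i]], 1u);
    }
}
[/CODE]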

Are these applications even trying to do other work in the shadow of the atomic latency?

In a lot of cases they simply cannot, by hardware design. I attempt to explain below, but keep in mind this is from my in-brain knowledge; I don't have the time to look all the info up again in detail, so someone else here might need to correct the rough spots.

In the "don't try, the extra work comes for free" case - i.e. where you have a hyperthreaded CPU and the atomic operation doesn't do nasty stuff to the shared ALU pipe - you still only get the far smaller number of hardware threads a CPU offers (say 2) vs the up to 32 resident warps ("hyperthreads") per multiprocessor on GT200. So you lose the ability to hide the latency behind full-throughput ALU work.

The problem with typical PC usage of atomics is the memory barrier (because atomics are commonly used in cases where order is important, and often the hardware forces the barrier), which restricts compiler instruction reordering and stalls the CPU. With GCC, in most cases a full memory barrier (read+write) is built into the atomic intrinsics. With MSVC you have special acquire and release variants which do a read-only or write-only barrier (these only make a difference on some platforms, like IA-64, etc.).
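
A rough host-side sketch of what I mean, assuming GCC's legacy __sync builtins (the MSVC variants appear only in the comments):

[CODE]
static long counter = 0;

long bump_counter(void)
{
    // GCC: __sync_fetch_and_add acts as a full (read+write) barrier in most cases --
    // the compiler may not reorder memory accesses across it, and a fence is emitted
    // where the architecture needs one.
    long old = __sync_fetch_and_add(&counter, 1);

    // The MSVC equivalent would be InterlockedIncrement(&counter); on platforms like
    // IA-64 there are also Acquire/Release variants that impose only a one-way barrier.
    return old;
}
[/CODE]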

PowerPC (old Macs, consoles, etc.) has special lwarx (load word and reserve indexed) and stwcx. (store word conditional indexed) instructions. These two instructions, in combination with other regular instructions, emulate atomic operations. The load-and-reserve marks the cache line as reserved. The CPU will toss that reservation if another thread attempts a reservation, or if the line gets dirtied before the conditional store. If the reservation is lost, the store word conditional fails and a software retry loop goes around again.
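
A rough C-level sketch of that retry loop: on PowerPC, GCC's compare-and-swap builtin compiles down to just such a lwarx / stwcx. sequence, with the loop retrying in software whenever the reservation is lost.

[CODE]
int atomic_add_retry(volatile int *addr, int value)
{
    int old, desired;
    do {
        old     = *addr;         /* corresponds to lwarx: load and reserve            */
        desired = old + value;   /* ordinary ALU work on the reserved value           */
        /* corresponds to stwcx.: the store only succeeds if the reservation holds */
    } while (__sync_val_compare_and_swap(addr, old, desired) != old);
    return old;
}
[/CODE]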

Old x86 chips had the LOCK instruction prefix physically lock the bus for a given instruction; from the P6 onwards, I think, Intel moved this to something saner and cache coherency handles the problem (I don't have clock-cycle counts in my head for real vs false cache line contention, however). Locking does, however, serialize all outstanding loads and stores (i.e. a forced memory barrier). x86 does stores in order, but loads can go ahead of stores.
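
For illustration only, here's roughly what a locked read-modify-write looks like at the instruction level with GCC inline asm; the "memory" clobber models the full barrier the LOCK prefix implies:

[CODE]
static inline int locked_fetch_add(volatile int *p, int v)
{
    __asm__ __volatile__("lock; xaddl %0, %1"
                         : "+r"(v), "+m"(*p)
                         :
                         : "memory", "cc");  /* LOCK implies a full barrier; tell the compiler */
    return v;  /* xadd leaves the previous value of *p in the source register */
}
[/CODE]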

Bottom line is that CPU-side atomics have nasty side effects which reduce ALU throughput. It could be 20-250 cycles during which the CPU isn't doing real work (i.e. just executing, say, an Interlocked*() atomic function).

The issue with NVidia's design is that it's write-through with no concept of MRU/LRU, so it generates worst-case latency for everything (this is how it seems, anyway - maybe there are blending tests out there that show otherwise). It's like having no texture cache at all. The GPU can hide the latency, but it takes a lot more non-dependent instructions to do so than if basic caching were implemented. It's just using the ROPs as they are, i.e. minimal cost.

I'm not convinced that this is a problem or that caches would solve it. ROPs do have coalescing, so effectively a write-combining cache. And it seems as if the ROP/MC, or whatever it is, does the actual ALU work for the atomic operation, and NOT the SIMD units. This is what is important. In the worst case, i.e. address collisions, the normal SIMD ALU units keep right on working, and sure, the ROP/MC atomic ALU work gets serialized. It is likely this serialization and the extra ROP/MC ALU unit are the true reason for the atomic operation latency (the extra ALU unit is throughput bound, which increases latency). Also remember, CUDA atomics return the old value fetched from memory, before the atomic operation happens (so the latency would be about the same as a global fetch, apart from the extra writeback costing some throughput, if not bottlenecked by something else).
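
A minimal CUDA sketch of that last point (a made-up filtering kernel): the value atomicAdd hands back is the old memory contents, and any independent math between issuing the atomic and consuming that value keeps running on the SIMD units as usual.

[CODE]
__global__ void append_indices(int *writeIndex, int *outList, const float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (data[i] > 0.0f) {
        int slot = atomicAdd(writeIndex, 1);  // returns the OLD counter value = our slot
        // independent ALU work could be scheduled here while the atomic is in flight;
        // only this store actually depends on 'slot'
        outList[slot] = i;
    }
}
[/CODE]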

So really, what would you rather have: a cache plus the CPU doing the atomic operation and the CPU thread stalling, or issuing the atomic operation and having the memory controller or ROP do the work while the core keeps on processing (this is what I like)?

EDIT: sure would be funny if the normal SIMD ALUs did do the atomic operations... clearly I'm making a lot of assumptions here!
 
However, I don't see DX11 hitting big until the PS4/720 generation, likely because developers simply don't have the time or budget for engine and tools rewrites. I think TS for character rendering might very well end up a must-do within the DX11 lifetime, even if TS isn't used for static and world geometry. Not that I really like TS, especially given the extra draws per patch type, but like it or not, if it is the 3x-as-fast path for 100x less memory (clearly less of a performance and space advantage at smaller LODs), it will be used.
OT
You broke my day :cry: I was hoping that the unveiling of DirectX 11 would at least convince more developers to make better use of the tessellator within Xenos.
/OT
 
That's not at all how I recall it.
I recall it more like this:
http://www.hexus.net/content/item.php?item=749&page=5
More than 90% of the 3DMark 2001 score represents DX6-DX7 performance. The score is calculated from the "game" tests, not from the feature tests. You can run the game tests in full quality on every DX6-compatible graphics card. The exception is the Nature test, which uses PS1.1 (no PS1.4 advantage there) for the water surface (the lake). The Nature test runs for a minute, while the PS1.1 water surface is visible for about 10 seconds total (and typically covers 1/4 of the screen or less), so if the card has good enough DX6 performance and scores well in the previous tests, those 10 seconds of a simple PS1.1 effect won't change anything. Even the DX7 feature use is quite poor (TnL), because the TnL capability of a ~2GHz (PR) CPU is better than that of an average DX7 TnL GPU according to 3DM01 results.

That's why I'm quoting DX8 tests only.
 
More than 90% of the 3DMark 2001 score represents DX6-DX7 performance. The score is calculated from the "game" tests, not from the feature tests. You can run the game tests in full quality on every DX6-compatible graphics card. The exception is the Nature test, which uses PS1.1 (no PS1.4 advantage there) for the water surface (the lake). The Nature test runs for a minute, while the PS1.1 water surface is visible for about 10 seconds total (and typically covers 1/4 of the screen or less), so if the card has good enough DX6 performance and scores well in the previous tests, those 10 seconds of a simple PS1.1 effect won't change anything. Even the DX7 feature use is quite poor (TnL), because the TnL capability of a ~2GHz (PR) CPU is better than that of an average DX7 TnL GPU according to 3DM01 results.

That's why I'm quoting DX8 tests only.

Okay, so in synthetic tests the FX is slower, but in actual game content it is faster. Guess I forgot how much fixed-function stuff there still was in those old games...
So I should have said: "in fixed-function integer + a bit of ps1.x..."
Oh well.
 
I don't have pseudocode, but it's a lot of math, so the reason for the fixed-function stage is performance-related.
If you tessellate to, say, >10 pixel triangles it's just not a lot of math compared to contemporary pixel shading... the only way it's really going to take a significant percentage of the FLOPs is if you tessellate to near pixel-sized triangles, but that causes a whole lot of other problems.
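
A back-of-envelope sketch of that argument - every number below is a made-up but plausible assumption - showing that with >10-pixel triangles the per-vertex evaluation math is a small fraction of the per-pixel shading math:

[CODE]
#include <cstdio>

int main()
{
    const double pixelsPerTriangle = 12.0;   // ">10 pixel triangles"
    const double psFlopsPerPixel   = 150.0;  // assumed contemporary pixel shader cost
    const double dsFlopsPerVertex  = 100.0;  // assumed patch-evaluation cost per vertex
    const double vertsPerTriangle  = 0.5;    // vertices are shared in a regular tessellation

    double psWork = pixelsPerTriangle * psFlopsPerPixel;
    double dsWork = vertsPerTriangle  * dsFlopsPerVertex;
    std::printf("pixel shading %.0f FLOPs/tri vs evaluation %.0f FLOPs/tri (%.1f%%)\n",
                psWork, dsWork, 100.0 * dsWork / psWork);
    return 0;
}
[/CODE]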
 
I'm not sure what to say about the GS path, other than that after the DX10 release I tried out the simple cases - replicating data to multiple cubemap faces, expanding point data to motion-stretched particles, etc. - and found it wasn't nearly as fast as other methods (actually, performance was horrid). Of course I did this on early NVidia cards (hence my jaded outlook). So GS IMO isn't useful (perhaps just for me?) and seems even less useful with DX11. Maybe it isn't going to be fast on DX11 and that doesn't matter; it will be there for backwards compatibility. Was DX10 GS an ATI, NVidia or Microsoft pushed feature?
I'm not sure if GS is used to create the cubemap array used for the light probes here:

http://ati.amd.com/developer/gdc/2008/DirectX10.1.pdf

I see from that I was wrong about the blending capability of ROPs in NVidia - they must do fp32 blending. So the connection between atomics (where there's no support for fp32) and ROP blending is less clear...

I was thinking TS would be less demanding because once HS is finished for a primitive, that output data is reused many, many times in DS. Huge data reuse.
Yes. This seems similar to the barycentric data that's held somewhere for the lifetime of a fragment (or a triangle during pixel shading, if you prefer), so that each interpolation can be performed on-demand. This seems (patent-interpretation) to use shared memory, which is obviously a substantial buffer.

So the same could apply in the TS-centred pipeline, though now there's competition amongst DS, GS and PS for space in shared memory. Well, shared memory is set to at least double. There's then an issue of whether data in shared memory is too localised, e.g. if TS output ends up spread across multiple multiprocessors.

Logically when triangles span warps of fragments and screen-space tiles, the barycentric data will be copied to all the multiprocessors that need it. So similar would happen in the TS scenario.

Also, perhaps HS input is easier compared to how GS shares large amounts of neighboring VS data? What I'm more interested in is how ATI, NVidia and Intel are going to ensure good SIMD occupancy with HS and DS groups that are not SIMD-sized or SIMD-aligned. It seems like it should be possible, but some points/patches might perform better than others due to SIMD (warp) packing and shared-memory access patterns? Maybe it will just be a case of getting, in the most common worst case, 2 (or at most 3) different primitives into a SIMD group, and you just end up eating the 2-3x cost of shared-memory broadcast when threads access shared data. Given all the abuse with vertex-skinning waterfalling, I'm likely way over-thinking this...
I'm not sure what the finest granularity could be in this case. The presentation I linked earlier aggregates multiple control point positions and normals into threads "to balance the workload". That seems to imply the developer is responsible for evaluating workload balancing.
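
On the packing question, a toy illustration (made-up numbers, and assuming whole patches are kept within one 32-wide SIMD group) of how occupancy would fall out of the patch size:

[CODE]
#include <cstdio>

int main()
{
    const int warpSize = 32;
    for (int pointsPerPatch = 3; pointsPerPatch <= 32; ++pointsPerPatch) {
        int patchesPerWarp = warpSize / pointsPerPatch;   // no splitting across warps
        int usedLanes      = patchesPerWarp * pointsPerPatch;
        std::printf("%2d control points/patch -> %2d of 32 lanes used (%5.1f%% occupancy)\n",
                    pointsPerPatch, usedLanes, 100.0 * usedLanes / warpSize);
    }
    return 0;
}
[/CODE]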

This sort of swapping and PC-like behavior is exactly why I'd rather be programming a console, though most developers I know don't agree with that. Clearly many people want demand paging so they can attempt to get rid of sloppy texture streamers.
Has anyone delivered a megatexture-like texturing engine in a game yet? I'm not a graphics programmer, so I don't really appreciate this: but why is it that megatexture has graphics programmers scratching their heads? Is Carmack's work something that's too specific to his view on game engines, or will all game engines be doing this eventually? As far as I can tell he's been asking for virtualised texture resources for years, so now he's rolling his own. Does a future D3D essentially level the playing field?

A compromise would be allowing developers a back door to lock things (CUDA's page locked memory is a good step in that direction!). I do understand the desire for ATI and NVIDIA to broaden their markets by eating into tasks which would traditionally be CPU side and to make it easy for programmers to make use of the hardware...
I see this fundamentally as being about having a processor in the system that doesn't cause the rest of the system to behave erratically just because it's busy. e.g. if GT300 is genuinely able to tackle the entire h.264 encoding pipeline with a mixture of task-parallel and data-parallel kernels all purring away, you want the rest of the system to behave as if that work wasn't being done.

DX11 = no R2VB and no EDRAM,
R2VB is part of D3D10 isn't it? And EDRAM is moot with the bandwidth efficiency and sheer throughput of current PC GPUs.

and all hardware having the feature.
DX9 and D3D11 versions of games? :p

Depends on too many factors (shader system, lighting model, etc.). BTW, NVidia's bindless GL graphics extensions enabled a 7.5x improvement in draw-call speed: in one test they were pushing 400K draws/sec before and 3,000K draws/sec after. Add to this the possibility that DX11 GPUs can better load-balance shaders (perhaps less pipeline emptying) and maybe draw calls aren't an issue.
I had a look here:

http://developer.nvidia.com/object/bindless_graphics.html

but my perspective on OpenGL is seriously limited. Does this capability tell us new stuff about G80's capabilities? Is this the OpenGL mirror of D3D10 capabilities?

TS could be quite awesome for things like shadow maps where you might easily be vertex or draw call bound since Z fill is so damn high.
I suppose it's also a work-around for setup limitations, since you're only shadow-buffering triangles that have a meaningful size from the point of view of each light, therefore not swamping setup with useless triangles.

Jawed
 
Guys, have we covered those two Nvidia patents yet? They could relate to GT300: one of them deals with a two-stage rasterizer, using a coarse-grained stage to generate a non-rectangular footprint for primitives and a "finer grained stage" with a "boustrophedonic" scheme for traversing pixels. The other deals with a method for each of a multitude of clusters to be able to generate pixel data from coverage data.

In short: this could be the G(T)300's scheme of parallelizing rasterization.

"METHOD FOR RASTERIZING NON-RECTANGULAR TILE GROUPS IN A RASTER STAGE OF A GRAPHICS PIPELINE "
http://www.google.com/patents?id=kG2hAAAAEBAJ

"Parallel Array Architecture for a Graphics Processor"
http://www.google.com/patents?id=B_-BAAAAEBAJ
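
"Boustrophedonic" just means serpentine - "as the ox ploughs" - with the traversal direction reversing on alternate rows so consecutive pixels stay adjacent. A toy sketch of that order (an illustration, not taken from the patent):

[CODE]
#include <cstdio>

// Hypothetical stand-in for whatever per-pixel work the fine raster stage does.
static void visit_pixel(int x, int y) { std::printf("(%d,%d) ", x, y); }

// Serpentine traversal of a tile: even rows left-to-right, odd rows right-to-left,
// so successive pixels are always neighbours, which is friendly to caches and
// framebuffer locality.
void traverse_tile(int tileW, int tileH)
{
    for (int y = 0; y < tileH; ++y) {
        bool leftToRight = ((y & 1) == 0);
        for (int i = 0; i < tileW; ++i) {
            int x = leftToRight ? i : (tileW - 1 - i);
            visit_pixel(x, y);
        }
        std::printf("\n");
    }
}
[/CODE]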
 
I don't have pseudocode, but it's a lot of math, so the reason for the fixed-function stage is performance-related. If Microsoft makes the refrast code available, it contains the algorithm.
What about precision? Is any of this beyond fp32? Beyond int32? Seems unlikely. I can't help thinking it's "fixed-function" because there's one already in ATI. Maybe the sense of this being "fixed-function" is no different, logically, from the "fixed function" rasterisation rules that we take for granted. Tessellation by 3.5 always means the same thing.

Jawed
 
Thanks for the responses on Physics on the upcoming GPU's.

DX11 is supposed to have some physics implementation, and with the new GPUs being a lot more powerful than the current crop, it had me thinking and wondering whether physics could be implemented with little or no performance loss in fps.

Looking at the PhysX Sacred 2 patch (YouTube video of the differences here) has me thinking that physics on GPUs will be a really good thing.

The amount of memory and bandwidth the new GPUs will have, along with faster and more efficient cores and shaders, should help with getting more and better physics in games, or at least that's my hope.

As mentioned, fluid and cloth physics should get a boost IMO (being easier to simulate); environmental destruction physics, though, is a bit more taxing.

Still, all this would be nice if it gets the support it requires, whether through PhysX, Havok Physics or further physics implementations in DirectX from MS.
 
Looking at the PhysX Sacred 2 patch (YouTube video of the differences here) has me thinking that physics on GPUs will be a really good thing.

I think you have to see physics much like shadows.
When the first games with dynamic shadows arrived (e.g. Doom 3), the effect was VERY expensive, and didn't do much for gameplay itself.
But they did make the game look nicer and more realistic, and now all games have it, and people take the performance hit for granted.
 
More than 90% of the 3DMark 2001 score represents DX6-DX7 performance. The score is calculated from the "game" tests, not from the feature tests. You can run the game tests in full quality on every DX6-compatible graphics card. The exception is the Nature test, which uses PS1.1 (no PS1.4 advantage there) for the water surface (the lake). The Nature test runs for a minute, while the PS1.1 water surface is visible for about 10 seconds total (and typically covers 1/4 of the screen or less), so if the card has good enough DX6 performance and scores well in the previous tests, those 10 seconds of a simple PS1.1 effect won't change anything. Even the DX7 feature use is quite poor (TnL), because the TnL capability of a ~2GHz (PR) CPU is better than that of an average DX7 TnL GPU according to 3DM01 results.

That's why I'm quoting DX8 tests only.


1. I think you are confused there, as I believe you mean 3DM2k, not 2k1. 2k was a DX6/7 tester and 2k1 was DX7/8; 03/05/06 are DX9 for the most part, with maybe a tiny bit of DX8.
2. Secondly, I've scored damn near 10k in 2k1 with a GF2 (a 3D Prophet that made NV angry for being as fast as a Pro card because of its core and memory OC) and a 1GHz P3, and I've yet to see any 2GHz single-core processor with any non-TnL GPU come close to that. Hell, my laptop with an ATI Xpress 200 and a Socket 754 A64 3200+ doesn't even top 5k in 2k1.
 
Seems to me that sparse or frequent atomic usage doesn't matter as long as you can keep the ALUs busy with work (unless you are bandwidth bound). Throughput is what matters here; GPU atomics are only for unordered usage by design. An atomic is just like a high-latency texture or global memory access. I cannot think off-hand of cases where we are latency bound on the GPU now (tiny draw calls, CUDA kernels with tiny grids - and with proper workload balancing that issue might go away). Maybe when the current GPU scaling starts to level off and on-chip networking becomes a bottleneck, we can return to the latency problem.
Agreed, GPUs are supposed to hide the latency. And yes, when performance really matters it'll be optimised. Analogous to the foolishness of using incoherent branching on NV40, and how things are much better now - though still a bit of a gotcha.

If you look at any reasonably interesting CUDA kernel the latency stares you in the face. Even with decent arithmetic intensity it's still possible to have almost no latency-hiding due to occupancy (the weather forecasting micro-physics kernel has a monster amount of state per thread). So once you start mixing this and atomics you're forced to consider carefully partitioning your algorithm so that the atomics happen in some kind of reduced-resource-consumption window - e.g. by breaking it up into passes.

In a lot of cases they simply cannot, by hardware design. I attempt to explain below, but keep in mind this is from my in-brain knowledge; I don't have the time to look all the info up again in detail, so someone else here might need to correct the rough spots.
Wow, that's quite a brain-dump. I'm way, way out of my depth here - which is why I was cautious originally about saying anything about atomics!

In the "don't try, the extra work comes for free" case - i.e. where you have a hyperthreaded CPU and the atomic operation doesn't do nasty stuff to the shared ALU pipe - you still only get the far smaller number of hardware threads a CPU offers (say 2) vs the up to 32 resident warps ("hyperthreads") per multiprocessor on GT200. So you lose the ability to hide the latency behind full-throughput ALU work.
I suppose I view this away from hardware threads and more in terms of application threading. Fibres on a CPU should be independent in their use of atomics (in my opinion).

The problem with typical PC usage of atomics is the memory barrier (because atomics are commonly used in cases where order is important, and often the hardware forces the barrier), which restricts compiler instruction reordering and stalls the CPU. With GCC, in most cases a full memory barrier (read+write) is built into the atomic intrinsics. With MSVC you have special acquire and release variants which do a read-only or write-only barrier (these only make a difference on some platforms, like IA-64, etc.).
So the restrictions are at the compiler/library level?

PowerPC (old Macs, consoles, etc.) has special lwarx (load word and reserve indexed) and stwcx. (store word conditional indexed) instructions. These two instructions, in combination with other regular instructions, emulate atomic operations. The load-and-reserve marks the cache line as reserved. The CPU will toss that reservation if another thread attempts a reservation, or if the line gets dirtied before the conditional store. If the reservation is lost, the store word conditional fails and a software retry loop goes around again.
So you end up with cache-line-sized atomic variables and carefully aligned operations in order to get anywhere?

Old x86 chips had the LOCK instruction prefix physically lock the bus for a given instruction; from the P6 onwards, I think, Intel moved this to something saner and cache coherency handles the problem (I don't have clock-cycle counts in my head for real vs false cache line contention, however). Locking does, however, serialize all outstanding loads and stores (i.e. a forced memory barrier). x86 does stores in order, but loads can go ahead of stores.
So the memory barrier is enforced even if the operations have nothing to do with the atomic variable.

Bottom line is that CPU-side atomics have nasty side effects which reduce ALU throughput. It could be 20-250 cycles during which the CPU isn't doing real work (i.e. just executing, say, an Interlocked*() atomic function).
I dare say I'm getting a sense that GPU/game programmers will be blazing a trail, from what you've described. Though there's still a very tricky scaling question beyond a single GPU. I stumbled into this:

http://insidehpc.com/2009/05/12/argonne-researchers-receive-award-for-mpi-performance-study/

which paints a grim picture.

I'm not convinced that this is a problem or that caches would solve it. ROPs do have coalescing, so effectively a write-combining cache. And it seems as if the ROP/MC, or whatever it is, does the actual ALU work for the atomic operation, and NOT the SIMD units. This is what is important. In the worst case, i.e. address collisions, the normal SIMD ALU units keep right on working, and sure, the ROP/MC atomic ALU work gets serialized. It is likely this serialization and the extra ROP/MC ALU unit are the true reason for the atomic operation latency (the extra ALU unit is throughput bound, which increases latency). Also remember, CUDA atomics return the old value fetched from memory, before the atomic operation happens (so the latency would be about the same as a global fetch, apart from the extra writeback costing some throughput, if not bottlenecked by something else).
Are global fetches cached? I disagree fundamentally on the cache question - just because you can hide latency doesn't mean performance is fine without a cache. ATI GPUs have pre-fetching into texture-caches and the thread processor schedules texturing operations around the readiness of the cache, quite separate from the hiding of latency by scheduling non-dependent ALU work. It's no different, conceptually, from a memory controller that does bank-aware scheduling and coalescing. The alternative is longer queues and more non-dependent work being required.

Anyway, with append/consume being a first class citizen of D3D11 apps, and with shared memory atomics saving much of this bother, maybe there won't be much pressure on global atomic performance.
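
A minimal CUDA sketch of the shared-memory-atomics point (shared-memory atomics need compute capability 1.2+): each block counts its survivors with cheap shared-memory atomics, then reserves output space with a single global atomicAdd, so global atomic traffic drops to one operation per block.

[CODE]
__global__ void compact_positive(const int *in, int *out, int *globalCount, int n)
{
    __shared__ int blockCount;
    __shared__ int blockBase;

    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int  i    = blockIdx.x * blockDim.x + threadIdx.x;
    int  v    = (i < n) ? in[i] : 0;
    bool keep = (i < n) && (v > 0);                        // arbitrary predicate for the sketch

    int local = keep ? atomicAdd(&blockCount, 1) : 0;      // cheap shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        blockBase = atomicAdd(globalCount, blockCount);    // one global atomic per block
    __syncthreads();

    if (keep) out[blockBase + local] = v;
}
[/CODE]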

So really, what would you rather have: a cache plus the CPU doing the atomic operation and the CPU thread stalling, or issuing the atomic operation and having the memory controller or ROP do the work while the core keeps on processing (this is what I like)?
If with the right library the CPU doesn't stall (if that's even possible), then that's what you want I guess. Otherwise the GPU approach is most likely useful as you're very likely using multiple atomics.

Anyway, since you're the one with experience building interesting data structures and getting stuff running on the GPU, your continued experiments will be very interesting :D Bit of a pity that you can't compare with append/consume in D3D11 but they have separate strengths and weaknesses so it's not all one way.

Jawed
 
1. I think you are confused there, as I believe you mean 3DM2k, not 2k1. 2k was a DX6/7 tester and 2k1 was DX7/8; 03/05/06 are DX9 for the most part, with maybe a tiny bit of DX8.

Yea, 3DMark03 is the one that the FX failed in.
It contained two ps1.x game tests, and the nature test with ps2.0.
FX did fine in the ps1.x tests, but completely died in the ps2.0 test.
Then a driver update appeared where the ps2.0 performance was 'fixed'... nVidia had replaced everything with int and half-precision shaders, and also 'optimized' some other things, like not rendering things that were outside the visible range (abusing the fact that the camera path was fixed). It ran about as fast as ATi's stuff, but it suffered from blocky aliasing because of the limited precision.
That's when Futuremark started with the whole driver approval thing.
Funnily enough, many people couldn't believe the FX series was THAT bad in ps2.0, and suspected foul play from FM/ATi instead. Then again, who could blame them, really? Games only used fixed-function or ps1.x, and there was no reason to assume performance problems based on that.
 
Has anyone delivered a megatexture-like texturing engine in a game yet? I'm not a graphics programmer, so I don't really appreciate this: but why is it that megatexture has graphics programmers scratching their heads? Is Carmack's work something that's too specific to his view on game engines, or will all game engines be doing this eventually? As far as I can tell he's been asking for virtualised texture resources for years, so now he's rolling his own. Does a future D3D essentially level the playing field?

Quake Wars had something like megatextures for terrain, but it wasn't exactly a good example of the technology IMO. If you are thinking of megatexture in terms of unique texturing on everything, I'm not sure everyone is going to get on board with that idea until the storage problems are solved.

but my perspective on OpenGL is seriously limited. Does this capability tell us new stuff about G80's capabilities? Is this the OpenGL mirror of D3D10 capabilities?

I don't really have numbers for a good comparison of GL3 vs DX10 draw call performance.
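
The gist of the bindless extensions, though, is roughly the following (a rough sketch, assuming the NV_shader_buffer_load / NV_vertex_buffer_unified_memory entry points are loaded, e.g. via GLEW, and omitting the glVertexAttribFormatNV format setup): you query a 64-bit GPU address for a buffer once, then reference it by address per draw, skipping the bind-and-validate work - which is where the draw-call speedup comes from.

[CODE]
#include <GL/glew.h>   // assumes the NV extension entry points are loaded

void draw_bindless(GLuint vbo, GLsizeiptr sizeBytes, GLsizei vertexCount)
{
    GLuint64EXT addr = 0;

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);   // make the buffer GPU-resident
    glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);

    glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);  // attributes come from raw addresses
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, sizeBytes);

    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}
[/CODE]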
 
Guys, have we covered those two Nvidia patents yet? They could relate to GT300: one of them deals with a two-stage rasterizer, using a coarse-grained stage to generate a non-rectangular footprint for primitives and a "finer grained stage" with a "boustrophedonic" scheme for traversing pixels. The other deals with a method for each of a multitude of clusters to be able to generate pixel data from coverage data.

In short: this could be the G(T)300's scheme of parallelizing rasterization.

"METHOD FOR RASTERIZING NON-RECTANGULAR TILE GROUPS IN A RASTER STAGE OF A GRAPHICS PIPELINE "
http://www.google.com/patents?id=kG2hAAAAEBAJ
I haven't seen this before, but it just looks like a piece of the hierarchical rasterisation feature set that we've seen in other NVidia patents.

"Parallel Array Architecture for a Graphics Processor"
http://www.google.com/patents?id=B_-BAAAAEBAJ
I can't see anything there that's meaningfully beyond G80.

Jawed
 
I haven't seen this before, but it just looks like a piece of the hierarchical rasterisation feature set that we've seen in other NVidia patents.

Yeah I was surprised to see recent patents for something like this. I wonder how much other obvious stuff there is out there waiting to be discovered :)

Though it would be interesting to see whether Nvidia elects to perform tile rasterization on the shader core. They're already sending down the triangle attributes for JIT interpolation so why not? Though I suppose a fixed function tile rasterizer would be tiny and worth the transistor budget.
 