Nvidia GT300 core: Speculation

I still don't understand tessellation well enough to have any decent idea why it's a "fixed-function stage" in D3D terminology, or what the shader code for a D3D11 tessellator would look like.

Anyone want to punt by posting reasonably detailed pseudo-code for D3D11's tessellator?

Why is it fixed-function? When D3D is adding programmable concepts and orthogonalising resource usage, why introduce another FF stage?

Apart from anything else, is tessellation likely to see anything other than cursory usage for years yet? Tweaks such as silhouette-edge enhancement don't affect much in the way of graphics engine or the art pipeline/assets. But all the major stuff tessellation is targeted at seems like a huge investment for engine developers and game studios.

Jawed
 
I'd like to know as well ... I guess adaptive tessellation doesn't suit large branch granularity SIMD units, but that's not a good enough argument to make it completely fixed function. It's an argument to include some smaller granularity shaders :)
 
Hmm, Charlie seems convinced that NVidia is building something very similar to Larrabee, with little fixed-function hardware. Does that seem likely?
Maybe it could happen, though like Charlie I would question whether it would be wise to try to out-Larrabee Larrabee.
Nvidia has set a precedent for shaking things up a fair amount on occasion, thanks to G80, but there was at least a significant API shakeup to coincide with that change.

A clean-sheet design that would basically abandon a huge chunk of the G80-GT200 framework would take time and resources to bring about. Given the time cycles for something like that, the roughly four years since the completion of G80 (assuming GT200's somewhat underwhelming improvements meant it was a secondary effort) would be a frighteningly tight timeline to architect a general purpose VLSI architecture.

Larrabee-esque is also something of a broad categorization with a lot of wiggle room. That there could be commonalities seems inevitable, since anything extending programmability or closing the read/write loop on GPUs would appear to be Larrabee-like.
The amount of time since Intel released actionable details of Larrabee would not be enough to complete G300, so a direct rip-off seems unlikely and also wrongheaded (might be a patent minefield there as well). There are corners of Larrabee's design that were necessitated by the choice of a specific x86 core and the x86 architecture. There would be no reason to voluntarily inflict them on a new design.

To top it off, with Theo, Fuad, and Charlie postulating, we have three current or former Inq writers spouting on future GPU designs (three and a half if we count the gallstone Charlie must have, given the sheer excess of bile we see in his writing. Seriously, I think he gave my monitor jaundice. Nvidia must have taken advantage of him on prom night or something) whose various stories need to be stitched together or need to be tracked to see in a few quarters how much each was on the "make shit up" train.

Could it be that NVidia is simply adding D3D11 features by running them on the shaders, e.g. the tessellator?

If that's so, then that doesn't necessarily mean NVidia's ditching most fixed-function units, such as ROPs.
Possibly, maybe.

There are ways to make special-purpose hardware sit alongside general-purpose cores, such as making them units within cores, coupling them through memory messages, or adding special signalling paths.
x86 made the first difficult, the second necessary*, and the third for the most part impossible for Larrabee.
Nvidia had the advantage of adding whatever bells and whistles they wanted.

*Maybe if they had done what I would have done... ;)

Could that strategy be playing out in this shader-centric D3D11-features model? Most of the new stuff runs solely on the ALUs. If it performs like shit who cares? The D3D10 features will have maximised die-space and the increase in compute performance is most important for CUDA's sake.
A big problem I see, as was noted in the discussion concerning the latency of Nvidia's atomic ops, is how long the read-write-read round trip is for GPUs with their read-only caches.
As far as general computation is concerned, the rearchitecting of how caches interact would be something Nvidia would be interested in looking at...
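
To make that read-write-read point concrete, here's a minimal CUDA sketch of the pattern (names and sizes are made up for illustration): one kernel produces a buffer, a second kernel consumes it, and on a read-only-cache architecture like G80/GT200 the intermediate data has to round-trip through DRAM because the write path never populates the caches the read path uses.

```cuda
#include <cuda_runtime.h>

// Stage 1: produce intermediate data. On a read-only-cache GPU the writes
// go straight out to DRAM; nothing stays on-chip for a later reader.
__global__ void produce(float* buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = i * 0.5f;
}

// Stage 2: consume the same data. The caches were not populated by the
// write, so this dependent read comes all the way back from DRAM.
__global__ void consume(const float* buf, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = buf[i] * 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *buf, *out;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    produce<<<(n + 255) / 256, 256>>>(buf, n);
    consume<<<(n + 255) / 256, 256>>>(buf, out, n);  // read-after-write: full memory round trip

    cudaDeviceSynchronize();
    cudaFree(buf);
    cudaFree(out);
    return 0;
}
```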
 
I have actually started to enjoy Charlie's articles; I just have to remind myself to discern between the well-informed rumours (he obviously has some good sources...) and fiction.

NVIDIA has been pushing DX11 tessellation *a lot* in the last year or so with conferences and papers, and they have some of their best hardware and software people (Henry Moreton and Ignacio Castano) working on it. Obviously complex machinery can go wrong in so many ways, but I have good (if not great) expectations for their tessellation-related hardware & software.
 
I have actually started to enjoy Charlie's articles; I just have to remind myself to discern between the well-informed rumours (he obviously has some good sources...) and fiction.
Sure. Although in that sense I much prefer Fudo to either Charlie or Theo; he's not arrogant, just a bit naive sometimes, and doesn't feel obliged to include highly subjective opinions in everything he writes...

NVIDIA has been pushing DX11 tessellation *a lot* in the last year or so with conferences and papers, and they have some of their best hardware and software people (Henry Moreton and Ignacio Castano) working on it. Obviously complex machinery can go wrong in so many ways, but I have good (if not great) expectations for their tessellation-related hardware & software.
Ah, I didn't know Ignacio's working on that now. As for Moreton, I wonder how much of his professional life he has dedicated to this problem... :) Here's hoping the tessellation hardware is everything he dreamed it to be. I'm sure the last few years must have been both very fun and very hectic for him...
 
Tim summarised stuff from last year's GPU 2013 presentation:

http://forum.beyond3d.com/showpost.php?p=1288136&postcount=76

Stuff relating to a virtual pipeline and adaptive workflow partitioning seems quite relevant here. At the same time I'd caution that VS->GS->PS in ATI already expresses these objectives, so NVidia might not have to make such a big step to effect these kinds of changes...

Pre-emption, running multiple concurrent arbitrary compute kernels in effect (rather than concurrent, but specifically labelled kernels such as VS or HS etc.), is to me the "big bang" for traditional GPUs.

Jawed
 
Yea, with the exception of GeForce FX :)
But I don't think D3D11 will be like that. It's too similar to D3D10 + Cuda for that.
I may be mistaken, but if I remember well, the FX performed poorly even in the majority of DX8/8.1 tests, but nobody cared, because shader effects were used quite modestly at that time.
 
Maybe it could happen, though like Charlie I would question whether it would be wise to try to out-Larrabee Larrabee.
Nvidia has set a precedent for shaking things up a fair amount on occasion, thanks to G80, but there was at least a significant API shakeup to coincide with that change.
I don't think it's hard to argue that NVidia's D3D10 architecture is, in graphics terms, sufficiently far from D3D11 that it's quite an effort...

A clean-sheet design that would basically abandon a huge chunk of the G80-GT200 framework would take time and resources to bring about. Given the time cycles for something like that, the roughly four years since the completion of G80 (assuming GT200's somewhat underwhelming improvements meant it was a secondary effort) would be a frighteningly tight timeline to architect a general purpose VLSI architecture.
I don't know if it's pre-empting Dally's appointment - where his expertise in "on-die communications fabrics" seems key - to suggest that chopping out some of the fixed functions of a traditional GPU is really about getting data from memory into the ALUs in timely fashion - whereas before the data only had to go from memory to colour-cache for use by the ROPs, say.

ATI was able to do a software MSAA resolve in R6xx because the DMA infrastructure twixt memory and ALUs exists (mem_import and mem_export and the like) and didn't have to be funnelled through the texture units. That's not to say the bandwidths are good enough, but it's a key function point.
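
For illustration only: once the ALUs can read the multisampled surface directly, a box-filter resolve is just an average over each pixel's samples. A rough CUDA-flavoured sketch (the sample layout and names here are assumptions, not R6xx's actual path):

```cuda
#include <cuda_runtime.h>

#define NUM_SAMPLES 4   // assumed 4xMSAA, samples stored contiguously per pixel

// Software box-filter resolve: average the samples of each pixel.
__global__ void resolve_msaa(const float4* samples, float4* resolved,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int base = (y * width + x) * NUM_SAMPLES;
    float4 sum = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int s = 0; s < NUM_SAMPLES; ++s) {
        float4 c = samples[base + s];
        sum.x += c.x; sum.y += c.y; sum.z += c.z; sum.w += c.w;
    }
    float inv = 1.0f / NUM_SAMPLES;
    resolved[y * width + x] = make_float4(sum.x * inv, sum.y * inv,
                                          sum.z * inv, sum.w * inv);
}
```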

There are ways to make special-purpose hardware sit alongside general-purpose cores, such as making them units within cores, coupling them through memory messages, or adding special signalling paths.
x86 made the first difficult, the second necessary*, and the third for the most part impossible for Larrabee.
Texturing in Larrabee apparently interacts with the cores' TLBs - I guess this is how a thread can discern "misses" in texturing.

A big problem I see, as was noted in the discussion concerning the latency of Nvidia's atomic ops, is how long the read-write-read round trip is for GPUs with their read-only caches.
The other side of the coin is that atomics in NVidia currently are a cheap add-on based on existing ROP functionality. There are no float atomics because floating point blending (at least, fp32 blending) is not possible in the ROPs. It's not even a "designed-by-committee" fumble, it merely expresses what's often bad about GPUs trying to do general computation - they're coaxed into the task, not designed for it.
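
As an aside, the usual CUDA workaround for the missing fp32 atomic add is a compare-and-swap loop on the value's bit pattern; a sketch of that well-known idiom:

```cuda
#include <cuda_runtime.h>

// Emulate atomicAdd on a float using 32-bit integer compare-and-swap.
// Classic workaround when the hardware only offers integer atomics.
__device__ float atomicAddFloat(float* address, float val)
{
    unsigned int* addr_as_uint = (unsigned int*)address;
    unsigned int old = *addr_as_uint;
    unsigned int assumed;
    do {
        assumed = old;
        float updated = __uint_as_float(assumed) + val;
        old = atomicCAS(addr_as_uint, assumed, __float_as_uint(updated));
    } while (assumed != old);   // another thread won the race: retry with its value
    return __uint_as_float(old);
}
```

Functionally it works, but it's a retry loop through the memory system rather than a blend in the ROPs, which rather underlines the "coaxed, not designed" point.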

As far as general computation is concerned, the rearchitecting of how caches interact would be something Nvidia would be interested in looking at...
For what it's worth it seems unlikely NVidia will drop the ROPs (floating point blending is required by D3D11, so there's your fp32 atomics) so I can imagine the caches behaving the same - i.e. being entirely un-adapted.

Jawed
 
Sure. Although in that sense I much prefer Fudo to either Charlie or Theo; he's not arrogant, just a bit naive sometimes, and doesn't feel obliged to include highly subjective opinions in everything he writes...

Ah, I didn't know Ignacio's working on that now. As for Moreton, I wonder how much of his professional life he has dedicated to this problem... :) Here's hoping the tessellation hardware is everything he dreamed it to be. I'm sure the last few years must have been both very fun and very hectic for him...

Henry has been in Argentina the last year on sabbatical, and is just back in the last quarter.

http://www.plaxo.com/directory/profile/77312125370/23984925/Henry/Moreton

The last time I talked to him was Spring 2008 just before he left. So I suspect whatever influence he had, was a bit ago in the design cycle.
 
Pre-emption, running multiple concurrent arbitrary compute kernels in effect (rather than concurrent, but specifically labelled kernels such as VS or HS etc.), is to me the "big bang" for traditional GPUs.
Am I correct in reading this sentence that it is defining pre-emption as concurrent kernel execution, or are you listing pre-emption and concurrency as separate items?

The established definition for preemption used elsewhere would be orthogonal to concurrent kernels, and would actually be a pretty significant change, since it would mean shader units would have to service interrupts--something that so far they are explicitly defined as not doing.

I don't know if it's pre-empting Dally's appointment - where his expertise in "on-die communications fabrics" seems key - to suggest that chopping out some of the fixed functions of a traditional GPU is really about getting data from memory into the ALUs in timely fashion - whereas before the data only had to go from memory to colour-cache for use by the ROPs, say.
As far as G300 would be concerned, Dally would be several years late in the process. Hopefully his predecessor was already moving in the same direction.
 
That sounds too absolute to me as a POV. How about if we say that it could be possible under X circumstances but the investment currently would not be worth the outcome?

No, it may be absolute but it is very true. Either you are using the shader core, thereby reducing shader performance, or you have a separate physics computational core and always pay the overhead.

The only case where this might not be true is where the design is bad to begin with and the performance is not shader- or memory-limited in any way.
 
According to the AMD "GDC'09: Your Game Needs Direct3D 11, So Get Started Now!" presentation, tessellation is "3 times faster [than rendering the high polygon count geometry without the tessellator] with 1/100th the size!". Isn't it safe to assume in AMD's case that the data flow from VS->HS->TS->DS->PS is all on-chip (GS is another story, as previously covered)? I only see indications from both AMD and NVidia that tessellation is going to be the fastest path on the hardware to push huge amounts of geometry. That says to me that it isn't going to hit DRAM. I'm also assuming that NVidia keeps VS->HS->TS->DS->PS on-chip as well to be competitive. So if TS is "CUDA-like software emulated", I think something would have to change to enable on-chip data routing (hence my previous posts on queuing) instead of piping through DRAM.

(developer hat on) TS clicked for me when I realized that for the same quality I also likely get a reduction in overall tri count (view-dependent TS) and thus better quad occupancy (assuming PS still works on quads in DX11). However I don't see DX11 hitting big until the PS4/720 generation, likely because developers simply don't have the time for the expense of engine and tools rewrites. I think TS for character rendering in the DX11 lifetime might very well end up as a probable must-do, even if not using TS for static and world geometry. Not that I really like TS, especially given the extra draws for patch types, but like it or not, if it is the 3x-as-fast path using 100x less memory (clearly less of a performance and space advantage at smaller LODs), it will be used.
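
To put the on-chip routing idea in compute terms, here's a crude sketch (purely illustrative, with a placeholder bilinear evaluator standing in for real patch evaluation and a dummy reduction standing in for the downstream stages): one block per patch, control points staged into shared memory, each thread evaluating one domain point, and only a tiny per-patch result ever written back, so the amplified geometry itself never touches DRAM.

```cuda
#include <cuda_runtime.h>

#define CTRL_PTS 16       // 4x4 control points per patch, for the sake of the example
#define GRID     16       // 16x16 domain points per patch -> launch 256 threads per block

// Placeholder "evaluator": bilinear interpolation of the four corner control
// points. A real implementation would evaluate the actual patch basis.
__device__ float3 eval_patch(const float3* cp, float u, float v)
{
    float3 p;
    p.x = (1 - u) * (1 - v) * cp[0].x + u * (1 - v) * cp[3].x
        + (1 - u) * v * cp[12].x      + u * v * cp[15].x;
    p.y = (1 - u) * (1 - v) * cp[0].y + u * (1 - v) * cp[3].y
        + (1 - u) * v * cp[12].y      + u * v * cp[15].y;
    p.z = (1 - u) * (1 - v) * cp[0].z + u * (1 - v) * cp[3].z
        + (1 - u) * v * cp[12].z      + u * v * cp[15].z;
    return p;
}

// One block = one patch; expects blockDim.x == GRID*GRID. Only the small
// control-point set is fetched from DRAM; the 256 expanded vertices live in
// registers/shared memory and are consumed on the spot (here: a trivial
// reduction standing in for DS/PS work).
__global__ void expand_and_consume(const float3* control_points, float* per_patch_out)
{
    __shared__ float3 cp[CTRL_PTS];
    __shared__ float scratch[GRID * GRID];

    if (threadIdx.x < CTRL_PTS)
        cp[threadIdx.x] = control_points[blockIdx.x * CTRL_PTS + threadIdx.x];
    __syncthreads();

    float u = (threadIdx.x % GRID) / float(GRID - 1);
    float v = (threadIdx.x / GRID) / float(GRID - 1);
    float3 pos = eval_patch(cp, u, v);

    scratch[threadIdx.x] = pos.x + pos.y + pos.z;   // stand-in for downstream work
    __syncthreads();

    if (threadIdx.x == 0) {
        float sum = 0.f;
        for (int i = 0; i < GRID * GRID; ++i)
            sum += scratch[i];
        per_patch_out[blockIdx.x] = sum;            // only a tiny summary hits DRAM
    }
}
```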
 
And who would actually care? :p
We've seen Mirror's Edge physx and the masses don't care...

Exactly: unless it has an EFFECT on gameplay, it doesn't much matter. The problem with "physics" in Mirror's Edge is that, in general, it amounts to little more than eye candy. Even with it on, you hardly notice it because you are paying attention to other things.

Unlike better graphics, it doesn't help you disambiguate things in the scene. It is just sideband animation.
 
I still don't understand tessellation well enough to have any decent idea why it's a "fixed-function stage" in D3D terminology, or what the shader code for a D3D11 tessellator would look like.

Can't access the full paper but these guys seem to think it's doable directly on the shader units.

Since today’s Transformation Units are programmable processors, we envisage our implementation not as a silicon implementation but rather as reprogramming an already existent Transformation Unit inside an already existent GPU. Our simulations show that due to the reduced transformation effort (we replace the transformation of the triangle vertices with the transformation of the surface control points) we can easily reprogram one of the Transformation Units to act as a Tesselator Unit.
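
The "easy" core of the job certainly looks shader-friendly. A toy CUDA sketch of what a uniform integer tessellation of a quad domain amounts to (this deliberately glosses over exactly the hard parts of the D3D11 tessellator: fractional and odd/even partitioning, independent per-edge factors, and crack-free stitching between them):

```cuda
#include <cuda_runtime.h>

// Generate the (u,v) domain points and triangle connectivity for a quad
// patch tessellated uniformly with integer factor 'tf' (tf >= 1).
// One thread per generated vertex; each interior cell's owner also emits
// the cell's two triangles.
__global__ void tessellate_quad_domain(int tf, float2* uv, uint3* tris)
{
    int verts_per_row = tf + 1;
    int total_verts   = verts_per_row * verts_per_row;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= total_verts) return;

    int x = i % verts_per_row;
    int y = i / verts_per_row;
    uv[i] = make_float2(x / (float)tf, y / (float)tf);

    if (x < tf && y < tf) {
        int cell = y * tf + x;
        int v00 = i;
        int v10 = i + 1;
        int v01 = i + verts_per_row;
        int v11 = i + verts_per_row + 1;
        tris[2 * cell + 0] = make_uint3(v00, v10, v11);
        tris[2 * cell + 1] = make_uint3(v00, v11, v01);
    }
}
```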

 
Well, I do.
Rigid-body physics are only one part of the equation.
You also want fluids, smoke, cloth and all that. And not entirely coincidentally, these are far more computationally expensive, so much harder to do on a CPU.
In other words, this video isn't very impressive as far as physics go.

I would like fluids/smoke/cloth too, but I only want them if they are integrated into the design of the game, not put in as an afterthought due to marketing money, with no effect in-game.
 
Am I correct in reading this sentence that it is defining pre-emption as concurrent kernel execution, or are you listing pre-emption and concurrency as separate items?

The established definition for preemption used elsewhere would be orthogonal to concurrent kernels, and would actually be a pretty significant change, since it would mean shader units would have to service interrupts--something that so far they are explicitly defined as not doing.
At some point WDDM will require pre-emptive interrupts at the kernel level within the GPU, in order to prioritise a specific kernel. The way I interpret this is that pre-emption is only meaningful for a GPU's internals if the GPU is capable of running multiple pipelines concurrently, i.e. truly independent contexts. One context being 3D rendering, another being compute shader, etc. My understanding is that Aero's resilience and responsiveness can only be guaranteed by a truly pre-emptive approach to kernel (context) operation within the GPU.
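
For reference, a sketch of what concurrent, arbitrary kernels look like from the CUDA API side today via streams (whether the hardware actually overlaps them rather than serialising is a separate question, and note that nothing here pre-empts a kernel once it's running):

```cuda
#include <cuda_runtime.h>

__global__ void render_like_kernel(float* fb, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fb[i] = 1.0f;                        // stand-in for graphics work
}

__global__ void compute_like_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;     // stand-in for a compute kernel
}

int main()
{
    const int n = 1 << 20;
    float *fb, *data;
    cudaMalloc(&fb, n * sizeof(float));
    cudaMalloc(&data, n * sizeof(float));

    // Two independent "contexts" of work submitted to separate streams.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    render_like_kernel<<<(n + 255) / 256, 256, 0, s0>>>(fb, n);
    compute_like_kernel<<<(n + 255) / 256, 256, 0, s1>>>(data, n);

    // At best these run concurrently; there is no way through this API to
    // stop one mid-flight to prioritise the other, which is what true
    // pre-emption would mean.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(fb);
    cudaFree(data);
    return 0;
}
```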

As far as G300 would be concerned, Dally would be several years late in the process. Hopefully his predecessor was already moving in the same direction.
The key question is really what level of innovation he can bring. What's truly new, left to do, in this field? Are we really waiting for 3D chips and optical on-die networks?

Intel's Terascale is about a mesh interconnect for cores. Larrabee and Cell have rings. ATI currently has a mishmash of buses (and I'm honestly puzzled how they'll scale up to 20 clusters with, say, 800GB/s of L2->L1 bandwidth over a "crossbar" of some type) and NVidia's doing who knows what, but there's something logically equivalent to a crossbar in there somewhere as far as I can tell.

If clusters start re-distributing work amongst themselves, can this be packaged neatly like VS->PS apparently does (seemingly with tightly defined, fairly small buffers and constrained data types), or, if it's truly ad-hoc, how do you deal with the lumpiness?

Jawed
 
I may be mistaken, but if I remember well, the FX performed poorly even in the majority of DX8/8.1 tests, but nobody cared, because shader effects were used quite modestly at that time.

I think you are mistaken.
The performance on ps1.x was just fine, because the FX series had full integer pipelines. The problem was that they added a floating-point unit as an afterthought, so ps2.0 ran very slowly.
The Radeons on the other hand had pipelines completely dedicated to ps2.0, and ps1.x was run on those.
The result was that the FX series was slightly faster in ps1.x, but completely useless in ps2.0.
 
The other side of the coin is that atomics in NVidia currently are a cheap add-on based on existing ROP functionality. There are no float atomics because floating point blending (at least, fp32 blending) is not possible in the ROPs. It's not even a "designed-by-committee" fumble, it merely expresses what's often bad about GPUs trying to do general computation - they're coaxed into the task, not designed for it.

Regardless of how atomics are implemented in hardware on NVidia's GT200 architecture, how are you getting GPU atomic performance to be poor? In my typical usage test on the GTX 275, it was a 7-8 instruction latency (29 hot clocks) *with the ability to run other work in that latency time*. If we are to compare GPU atomics to what we have on the CPU on consoles (360, PS3), and even PCs, the difference is amazing. Of course most of the numbers I have for the PC relate to needing to do a memory barrier (a rough estimate of 20-100 cycles on many types of PCs), which likely doesn't match up with the kind of unordered atomics happening on GPUs. Also, PPCs use cache-line reservation-loss retry loops, so performance there is measured in hundreds of cycles, with no real work being done in the background.
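
Roughly the flavour of usage being described: a fire-and-forget atomic with no fence, whose latency gets covered by the independent work that follows (and by other warps). A small sketch, with made-up buffer names:

```cuda
#include <cuda_runtime.h>

// Unordered atomic: bump a global counter and carry straight on.
// No memory barrier is needed because nothing depends on the ordering,
// and the atomic's latency is hidden by the independent ALU work below.
__global__ void count_and_work(int* counter, const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f)
        atomicAdd(counter, 1);      // fire-and-forget, no fence issued

    // Independent work that can overlap the atomic's latency.
    float x = in[i];
    out[i] = x * x + 0.5f * x;
}
```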
 