Nvidia GT300 core: Speculation

At some point WDDM will require pre-emptive interrupts at the kernel level within the GPU, in order to prioritise a specific kernel. The way I interpret this is that pre-emption is only meaningful for a GPU's internals if the GPU is capable of running multiple pipelines concurrently, i.e. truly independent contexts. One context being 3D rendering, another being compute shader, etc. My understanding is that Aero's resilience and responsiveness can only be guaranteed by a truly pre-emptive approach to kernel (context) operation within the GPU.

Jawed, I've been waiting for this magical guarantee of desktop/user responsiveness since Win3.1. :devilish: Somehow I don't think GPU preemption is going to get us there...
 
Why would folks care about GPU physics when we have games like Ghostbusters that make use of the CPU for it?

I must admit they are damn impressive physics effects. Reminds me a lot of CellFactor. It's great to see a game fully leveraging multi-core CPUs for these types of effects; I would much rather have that than have these effects run off the GPU while two CPU cores sit idle.

On the other hand, I would also like to see more advanced physics like liquids, smoke, soft bodies, etc., and if that requires GPU intervention then I'm all for it.

For the time being though, I can certainly appreciate Ghostbusters for being near the top of the physics pile without having to rely on a GPU to do it. In fact, I think I'll pick up the game after seeing that video!

I wonder how the consoles handle those physics, i.e. are they scaled back, do they utilise the GPU. Cell will probably be ok but anything stressing a quad (Penryn/i7??) at 80% should have a very hard time running on Xenon.
 
At some point WDDM will require pre-emptive interrupts at the kernel level within the GPU, in order to prioritise a specific kernel. The way I interpret this is that pre-emption is only meaningful for a GPU's internals if the GPU is capable of running multiple pipelines concurrently, i.e. truly independent contexts. One context being 3D rendering, another being compute shader, etc. My understanding is that Aero's resilience and responsiveness can only be guaranteed by a truly pre-emptive approach to kernel (context) operation within the GPU.
Single-threaded CPUs have had preemption. It's the way they've maintained the illusion of concurrent execution for so long. At its coarsest implementation, a GPU could be interrupted and forced to load a new context for a timeslice. There may at any time be only one kernel on-chip.
Given the considerable amount of setup involved in initializing a graphics context, it might not be ideal, but perhaps something like a value GPU would save die space this way.
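As a toy illustration of that coarsest scheme (one context resident at a time, swapped wholesale every timeslice), here's a host-side sketch; the structures, names and costs are all invented, and no real driver or GPU works like this:

```cpp
// Toy model of the coarsest preemption scheme: one context on-chip at a
// time, swapped wholesale every timeslice. All structures, names and costs
// here are invented for illustration; no real driver or GPU works like this.
#include <algorithm>
#include <cstdio>
#include <queue>
#include <string>

struct GpuContext {
    std::string name;
    int remaining_work;   // abstract units of work left in this kernel
};

const int CONTEXT_SWAP_COST = 5;   // pretend cost of draining + reloading state
const int TIMESLICE         = 20;  // units of work per slice

int main() {
    std::queue<GpuContext> runnable;
    runnable.push({"3D render", 60});
    runnable.push({"compute",   35});
    runnable.push({"video",     10});

    int clock = 0;
    while (!runnable.empty()) {
        GpuContext ctx = runnable.front();
        runnable.pop();
        clock += CONTEXT_SWAP_COST;                        // pay the swap every slice
        int ran = std::min(TIMESLICE, ctx.remaining_work);
        ctx.remaining_work -= ran;
        clock += ran;
        std::printf("t=%4d  ran '%s' for %2d units (%2d left)\n",
                    clock, ctx.name.c_str(), ran, ctx.remaining_work);
        if (ctx.remaining_work > 0) runnable.push(ctx);    // round-robin
    }
    return 0;
}
```

The swap cost gets paid on every slice, which is why the considerable context-setup overhead mentioned above matters so much to whether this is worth doing.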

The key question is really what level of innovation he can bring. What's truly new, left to do, in this field? Are we really waiting for 3D chips and optical on-die networks?

Intel's Terascale is about a mesh interconnect for cores; Larrabee and Cell have rings. ATI currently has a mishmash of buses (and I'm honestly puzzled how they'll scale up to 20 clusters, with say 800GB/s of L2->L1 bandwidth over a "crossbar" of some type) and NVidia's doing who knows what, but there's something logically equivalent to a crossbar in there somewhere as far as I can tell.

If clusters start re-distributing work amongst themselves, can this be packaged neatly like VS->PS apparently does (seemingly with tightly defined, fairly small buffers and constrained data types), or, if it's truly ad hoc, how do you deal with the lumpiness?

A truly satisfactory manycore on-chip interconnect and communications scheme has not been created as of yet.
Cell's designers said they would have done more than a ring-bus had they the time and resources.
Ring buses do not like hotspots of activity, but a dynamic MIMD environment does not always play nice.
Whether Larrabee will have issues is not yet known.
Terascale's mesh is a testbed to see what works at the scale of hundreds of cores, but what lessons Intel has learned have not been spelled out yet.

It is good to question whether the crossbar setup used for the texture cache read path on GPUs is scalable, and I'm willing to bet that it is nearing its limits. Things would have been far worse if the crossbar weren't one-way, with writes going in the other direction or, worse, to another client on the crossbar.

Nvidia's hiring of someone whose expertise lies in creating low overhead packetized network communications is a sign that they do not expect the crossbar setup to continue onwards unamended.

As the unit count rises, there will be no physical signalling scheme capable of servicing the peak demands of the entire chip.
It's another boon/bane of specialized hardware pipes: their inflexibility also massively simplifies the routing, implementation, latency and throughput problems a general network would have.

Cache coherency also runs into the constraints that signalling networks face. Coherency injects more latency the more hops there are on the network to every other cache, but networks need serious physical overhead to keep the hop count low.
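To put rough numbers on the hop-count point, here's a small sketch comparing average hop counts under uniform random traffic for a bidirectional ring versus a square 2D mesh. The formulas are textbook approximations and the core counts are arbitrary:

```cpp
// Average hop count under uniform random traffic for a bidirectional ring
// and a square 2D mesh of N cores. Textbook approximations, arbitrary core
// counts; real interconnects have non-uniform traffic, concentrators, etc.
#include <cmath>
#include <cstdio>

double ring_avg_hops(int n) {
    return n / 4.0;                               // ~N/4 for a bidirectional ring
}

double mesh_avg_hops(int n) {
    int k = (int)std::round(std::sqrt((double)n));
    double per_dim = (k * k - 1.0) / (3.0 * k);   // mean |i - j| over a k-wide line
    return 2.0 * per_dim;                         // two dimensions
}

int main() {
    for (int n : {16, 36, 64, 144, 256}) {
        std::printf("N=%3d cores: ring ~%5.1f hops, mesh ~%4.1f hops\n",
                    n, ring_avg_hops(n), mesh_avg_hops(n));
    }
    return 0;
}
```

The mesh keeps the average hop count growing like sqrt(N) instead of N, but pays for it in links and routers, which is exactly the physical-overhead trade-off above.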

The question of how to handle dynamic traffic needs on a manycore chip has not been answered.

A fully generalized communications network may not fully satisfy the peak needs of a chip's functionality.
Perhaps a mixture of more localized buses and specialized storage can reduce the load on a more generalized global network.
 
* 55nm technology
* 576 sq.mm die area
* 512bit GDDR5 memory controller
* GDDR5 2GB memory, doubled GTX280
* 480 stream processors
* DirectX 10, no DX 10.1 support yet.

They'll certainly skip to 40nm since they're not TSMC's only customer, even if the 40nm leakage rumours turn out not to be all that accurate for RV740. Well, it's a small chip :rolleyes:

No way they make the same mistake twice in a row with a mega-monster chip. Maybe 350mm^2 @ 40nm, and only if they stay on that old R&D path you anticipate, with superclocked 480 ALUs ... that's too many for them if they only continue on the G80-GT200 route ;) Maybe even 384 ALUs at 2GHz will be too much this time :LOL:

And only DX10.1? This is the biggest bullshit of all. Back when RV670 arrived, Nvidia stated they'd skip that minor 10.1 update and concentrate on the DX11 specification. It would be more than ridiculous to buy this kind of BS; they're in the money-making business, not the money-losing one.
On the other hand, they announced a total redesign of their GPUs for DX11. I hope they really do that.
 
Latency is not bad if you can hide it (at a decent cost)

Latency is not bad 1) if you can hide it 2) if the overall cost is decent

That's two qualifiers where the details can matter a lot.

In what manner is the latency being hidden, and what knock-on effects does it have on the whole design?

What is a decent cost, and is there nothing else that resources can be dedicated towards with more gain?
 
Should I really have to do this?
You can't say anything about an architecture's efficiency from numbers like 512 SPs or a 512-bit GDDR5 bus or 2.5 TFLOPS of peak math performance, so it doesn't really matter whether you believe these numbers or not -- you still don't know anything about the underlying architecture and you still can't make anything but guesses about its potential.

Since you apparently know GT300's die size why don't you go ahead and post it?
I kinda already did. But here's the latest I heard on this: 495mm^2. No idea how accurate that is.
 
In what manner is the latency being hidden, and what knock-on effects does it have on the whole design?
I am looking at what modern GPUs already do in this regard.

What is a decent cost, and is there nothing else that resources can be dedicated towards with more gain?
I don't know enough to determine what a decent cost is, that's work left to do for the architecture guys :)
 
DegustatoR said:
you still don't know anything about the underlying architecture and you still can't make anything but guesses about its potential.
Well, thank goodness I am in the GT300 core: Speculation thread.

DegustatoR said:
I kinda already did. But here's the latest I heard on this: 495mm^2.
DegustatoR said:
Well, it says "less than 490mm^2", doesn't it...
DegustatoR said:
Hint: GT200b is 490mm^2. So it's more like "less than GT200b" really.

So by implying that it could be significantly smaller than 490mm^2, what you actually meant is that it's larger than 490mm^2... right.
 
I am looking at what modern GPUs already do in this regard.
They struggle to not run out of registers and have terrible read after write performance.
They have high-bandwidth read crossbars, at the price that their scheme only carries data one way. The choice of a crossbar might mean that uniformity of service and minimization of contention are other design constraints.
On-chip latency apparently becomes harder and harder to hide.
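A back-of-the-envelope sketch of that register/latency tension, with every number an assumption in a vaguely GT200-ish range rather than a measurement:

```cpp
// Back-of-the-envelope on the register/latency-hiding trade-off. Every
// number is an assumption in a vaguely GT200-ish range, not a measurement,
// and the model (one warp instruction issued per cycle, zero ILP per
// thread) is deliberately simplistic.
#include <cstdio>

int main() {
    const int mem_latency   = 400;    // assumed cycles for an off-chip read
    const int warp_size     = 32;
    const int regfile_words = 16384;  // assumed 32-bit registers per core

    for (int regs_per_thread : {16, 32, 64, 128}) {
        int resident_warps = (regfile_words / regs_per_thread) / warp_size;
        // With no ILP, each resident warp can supply one instruction while
        // another warp waits, so coverage is simply the warp count:
        double covered = 100.0 * resident_warps / mem_latency;
        std::printf("%3d regs/thread -> %2d warps resident -> ~%4.1f%% of a "
                    "%d-cycle read covered without ILP\n",
                    regs_per_thread, resident_warps, covered, mem_latency);
    }
    return 0;
}
```

Cutting registers per thread buys more resident warps, but past a point the kernel spills; either way, a long off-chip read still needs a pile of independent instructions per thread to stay hidden.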

I don't know enough to determine what a decent cost is, that's work left to do for the architecture guys :)
The one Nvidia hired has a skill set that includes producing efficient and low-overhead networking.
 

that's WDDM 2.0 and the Advanced Scheduler.

that got cancelled for Vista and isn't coming back for Win7 since that's on WDDM 1.1.

ATI in particular felt burned by the decision to can the advanced scheduler, since gate count for pre-emption features had already been allocated and was thus wasted gates due to this decision.

MS's history of hyping a feature, getting partner support at significant dollar and manpower cost to the partner, and then dropping said feature - this does not help bring about advanced support and features.

fwiw, there is a lot that can be done without WDDM 2.0 and the advanced scheduler. when you open a menu or window and get a pause, that's User allocating memory in what should be a fast path. any memory allocation is an opportunity for the scheduler to swap you out, and worse yet, if the malloc causes page faults, the virtual disk driver can disable interrupts while it services that page fault. hence the "feel" of being unresponsive.

the DX architects know about this, and want to fix it, but it's going to take a while.
 
Regardless of how atomics are implemented in hardware on NVidia's GT200 arch, how are you getting poor performance from GPU atomics? In my typical usage test on the GTX 275, it was a 7-8 instruction latency (29 hot clocks) with the ability to run other work in that latency time.
Because you need 10s or hundreds of atomics in flight to make the performance bearable.

If we are to compare GPU atomics to what we have on the CPU on consoles (360, PS3), and even PCs, the difference is amazing. Of course, most of the numbers I have for the PC relate to needing to do a memory barrier (a rough estimate of 20-100 cycles on many types of PC), which likely doesn't match up with the kind of unordered atomics happening on GPUs. Also, the PPCs use cache-line reservation-loss retry loops, so performance is measured in hundreds of cycles there, with no real work being done in the background.
Are these applications even trying to do other work in the shadow of the atomic latency?
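For what it's worth, here's a minimal CUDA sketch of the "work in the shadow" case being described: the atomic's return value is never consumed, so independent math can issue while it completes. The kernel and sizes are invented for illustration, not a benchmark of any particular chip:

```cpp
// Minimal CUDA sketch of "other work in the shadow of the atomic": the
// atomic's return value is never consumed, so independent math can issue
// while it completes. Kernel and sizes are invented for illustration,
// not a benchmark of any particular chip.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void histogram_with_shadow_work(const float* in, int n,
                                           unsigned int* bins, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v   = in[i];
    int   bin = min(255, max(0, (int)(v * 256.0f)));

    // Fire the atomic early and ignore its return value...
    atomicAdd(&bins[bin], 1u);

    // ...so this independent arithmetic can run in the shadow of its latency.
    float acc = v;
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        acc = acc * 1.0001f + 0.5f;
    out[i] = acc;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    unsigned int* bins;
    cudaMallocManaged(&in,   n * sizeof(float));
    cudaMallocManaged(&out,  n * sizeof(float));
    cudaMallocManaged(&bins, 256 * sizeof(unsigned int));
    for (int i = 0; i < n; ++i)   in[i]   = (i % 1000) / 1000.0f;
    for (int i = 0; i < 256; ++i) bins[i] = 0;

    histogram_with_shadow_work<<<(n + 255) / 256, 256>>>(in, n, bins, out);
    cudaDeviceSynchronize();
    std::printf("bin[0]=%u  bin[128]=%u\n", bins[0], bins[128]);

    cudaFree(in); cudaFree(out); cudaFree(bins);
    return 0;
}
```

A kernel that immediately consumes the returned value (say, to compute a unique output slot) creates a dependency on the full round-trip latency instead, which is presumably where it starts to hurt.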

The issue with NVidia's design is it's write-through with no concept of MRU/LRU, so it's generating worst-case latency for everything (this is how it seems, anyway - maybe there are some blending tests out there that show otherwise). It's like having no texture cache at all. The GPU can hide the latency, but it takes a lot more non-dependent instructions to do so than if basic caching were implemented. It's just using the ROPs as they are, i.e. minimal cost.

Jawed
 
that's WDDM 2.0 and the Advanced Scheduler.
Ha, well it's actually 2.1 that does what we really want in this regard, "mid-pixel" pre-emption :D

that got cancelled for Vista and isn't coming back for Win7 since that's on WDDM 1.1.

ATI in particular felt burned by the decision to can the advanced scheduler, since gate count for pre-emption features had already been allocated and was thus wasted gates due to this decision.
Hmm, I think those transistors must have been deleted because there doesn't seem to be anything like this functionality in there :???:

MS's history of hyping a feature, getting partner support at significant dollar and manpower cost to the partner, and then dropping said feature - this does not help bring about advanced support and features.
I presume all the partners are designing hardware functionality and then jostling to get the features into D3D, too. Cat fight.

fwiw, there is a lot that can be done without WDDM 2.0 and the advanced scheduler. when you open a menu or window and get a pause, that's User allocating memory in what should be a fast path. any memory allocation is an opportunity for the scheduler to swap you out, and worse yet, if the malloc causes page faults, the virtual disk driver can disable interrupts while it services that page fault. hence the "feel" of being unresponsive.
It bugs the hell out of me that Windows prioritises caching disk (to the tune of hundreds of MB) over keeping allocated memory in RAM. But that's another thread.

the DX architects know about this, and want to fix it, but it's going to take a while.
Any chance D3D11.1 has WDDM 2.0?...

Jawed
 
I think you are mistaken.
The performance on ps1.x was just fine, because the FX series had full integer pipelines. The problem was that they added a floating-point unit as an afterthought, so ps2.0 ran very slowly.
The Radeons on the other hand had pipelines completely dedicated to ps2.0, and ps1.x was run on those.
The result was that the FX series was slightly faster in ps1.x, but completely useless in ps2.0.
I found my old archive...

R9600XT (catalyst 4.6):
3DM01 Nature: 73.8 FPS
VS: 178.5 FPS
PS 1.1: 238.6 FPS
PS 1.4: 121.9 FPS

FX5900XT: 25% more expensive (ForceWare 61.40):
3DM01 Nature: 73.5 FPS
VS: 159.0 FPS
PS 1.1: 191.1 FPS
PS 1.4: 107.6 FPS
 
Hmm, I think those transistors must have been deleted because there doesn't seem to be anything like this functionality in there :???:

typically unused features get designed out in the next big silicon rev.

given that the news about the Advanced Scheduler being dropped came far enough in advance of Vista shipping, ATI might have removed the feature before the 2xxx series.

I left ATI in 2004 after the Vista reset and the news about WDDM so I dont know for sure when the feature got pulled. Being I was responsible for strategic relationships between the 2 companies, neither of those actions helped me in that role. ;)

I presume all the partners are designing hardware functionality and then jostling to get the features into D3D, too. Cat fight.

some partners' words get heard louder than others.

this was a case of aiming too far ahead of some sectors of the market. I have to watch what I say here.
 
According to the AMD "GDC'09: Your Game Needs Direct3D 11, So Get Started Now!" presentation, tessellation is "3 times faster [than rendering the high polygon count geometry without the tessellator] with 1/100th the size!". Isn't it safe to assume in AMD's case that the data flow from VS->HS->TS->DS->PS is all on chip (GS is another story, as previously covered)?
Two problems with this: what happens when GS is switched on (hmm, is it meaningful? Jolly Jeffers's nice picture has the GS collecting data from DS)? And secondly I think the bandwidth savings they're talking about are on current ATI hardware which multi-passes through memory to perform tessellation, going as far as using R2VB:

http://developer.amd.com/gpu_assets/Real-Time_Tessellation_on_GPU.pdf

I only see indications from both AMD and NVidia that tessellation is going to be the fastest path on the hardware to push huge amounts of geometry. That says to me that it isn't going to hit DRAM. I'm also assuming that NVidia keeps VS->HS->TS->DS->PS on-chip as well to be competitive. So if TS is "CUDA like software emulated" I think something would have to change to enable on-chip data routing (hence my previous posts on queuing) instead of piping through DRAM.
I was surprised to discover the double-ring-buffer VS->GS->PS thing in ATI. What hasn't been answered yet is if that's avoidable. And if not, why would a pipeline centred upon TS be less demanding?

The interesting thing about looking at that WDDM2.0 stuff, and it'll prolly make your blood boil, is the idea that the GPU demand-pages resources and switches on faults etc. It seems to imply that WDDM is running towards a memory-centric view of tasks, whether they're texturing (which page faults to suck stuff off disk) or buffering data twixt pipeline stages. If everything's a virtualised resource and you've built hardware that can find and do other things while pages arrive, then keeping stuff on die is a task reserved for caches :p

(developer hat on) TS clicked for me when I realized that for the same quality, I also likely get a reduction of overall tri count (view dependent TS) and thus better quad occupancy (assuming PS still works on quads in DX11).
I've been arguing most of this for years, ever since the tessellator first appeared in Xenos. There's always an excuse for why it won't happen soon...

However I don't see DX11 hitting big until PS4/720 generation likely because developers simply don't have the time for the expense of engine and tools rewrites. I think TS for character rendering in DX11 lifetime might very well end up as a probable must-do even if not using TS for static and world geometry. Not that I really like TS, especially given the extra draws for patch types, but like it or not, if it is the 3x as fast path for 100x less memory (clearly less performance and space advantage for smaller LOD), it will be used.
Things like silhouette-enhancement are low-hanging fruit and will make a really great difference. I was thinking Trinibwoy can have procedural door knobs, but now I'm wondering if tessellating such trivial objects is worth the draw call cost.

Jawed
 
I'm hoping for really good GPU multitasking and orthogonality some day :
- have it just work with no arbitrary limit on multitasking
- allow multiple vendors' GPUs running on the same system
- allow multiple OSes to use the resources, either with a native host and VM guests, or all VMs (with Xen). we may need hardware I/O virtualization on the system at some point.
- network transparency as well. easy remote use just like X11 and VNC.
- really good power management

easy network transparency might not be much sought after because it may hamper GPU sales, I guess. You could do a lot with a single fat GPU in a household.
Some work has been done, with VirtualGL (I haven't yet really grasped how to use it).

Look at what a high-end computer will be in 2010. A hexacore 32nm CPU, 6 gigs of RAM, a GT300 card and a 2TB hard drive.

That thing is so hugely powerful, with 12 hardware CPU threads, a huge pile of RAM, a monster GPU with half a thousand units, etc.
With proper software and thin-client hardware (imagine ultra-cheap, ultra-low-power boxes doing hardware decoding of MJPEG-based VNC) it would be practical to run a small LAN party from that single PC.

Another possibility along the thin-client idea: decoupling GPU(s) and video outputs. So you can have a setup with four GPUs working on one output, or one GPU with four or eight outputs. (Add USB keyboards and mice; that was done for 2D multi-seat already, using multiple video cards.)

I have no idea whether the vaporware WDDM helps with those points/requests, besides cheap context switches. The technical details are out of my reach :) and I can't pretend to understand all of the conversations going on here.

Multi-user/remote use is also a bitch on Windows, as Microsoft will make you pay hundreds or thousands of dollars/euros for fake "server versions" and "client access licenses"; but it's pretty awesome stuff that Free software allows.

I thus wonder if we'll see something similar to WDDM or vaporware WDDM running on Linux (paired with a vaporware OpenGL version and D3D10/D3D11 stuff wrapped with Wine dlls)
 
Two problems with this: what happens when GS is switched on (hmm, is it meaningful? Jolly Jeffers's nice picture has the GS collecting data from DS)? And secondly I think the bandwidth savings they're talking about are on current ATI hardware which multi-passes through memory to perform tessellation, going as far as using R2VB:

I was thinking that they weren't referring to the pre-DX11 required R2VB pass, but rather the combination of (1.) having a majority of data in compressed textures, and (2.) keeping data on chip --- in comparison to doing a standard high poly model in IA->VS->PS.

I'm not sure what to say about the GS path, other than that after the DX10 release I tried out the simple cases of replicating data to multiple cubemap faces, expanding point data to motion-stretched particles, etc., and found it wasn't nearly as fast as other methods (actually, performance was horrid). Of course I did this on early NVidia cards (hence my jaded outlook). So GS IMO isn't useful (perhaps just for me?) and seems even less useful with DX11. Maybe it isn't going to be fast on DX11 and that doesn't matter; it will be there for backwards compatibility. Was DX10 GS an ATI, NVidia or Microsoft pushed feature?

I was surprised to discover the double-ring-buffer VS->GS->PS thing in ATI. What hasn't been answered yet is if that's avoidable. And if not, why would a pipeline centred upon TS be less demanding?

I was thinking TS would be less demanding because once HS is finished for a primitive, that output data is reused many, many times in DS. Huge data reuse. Also, perhaps HS input is easier compared to how GS shares large amounts of neighbouring VS data? What I'm more interested in is how ATI, NVidia and Intel are going to ensure good SIMD occupancy with the non-SIMD-sized and -aligned HS and DS groups. It seems like it should be possible, but some points/patches might perform better than others due to SIMD (warp) packing and shared-memory access patterns? Maybe it will just be a case of getting, in the most common worst case, 2 (or 3 max) different primitives in a SIMD group, and you just end up eating the 2-3x cost of shared-memory broadcast when threads access shared data. Given all the abuse with vertex skinning waterfalling, I'm likely way overthinking this...
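As a crude illustration of that packing question, here's a sketch that packs per-patch DS invocations (counts made up) into fixed-width SIMD groups under two hypothetical policies, one that never mixes patches in a group and one that packs tightly:

```cpp
// Crude sketch of the SIMD-packing question: pack per-patch domain-shader
// invocations into fixed-width SIMD groups under two hypothetical policies.
// Patch sizes and the SIMD width are made up.
#include <cstdio>
#include <vector>

int main() {
    const int simd = 32;
    // Domain-shader invocations per patch (driven by per-patch tess factors):
    std::vector<int> patch_points = {6, 15, 3, 45, 10, 7, 21, 3, 66, 12};

    int total = 0;
    int groups_unshared = 0;                         // policy A: never mix patches
    for (int p : patch_points) {
        total += p;
        groups_unshared += (p + simd - 1) / simd;    // round up per patch
    }
    int groups_packed = (total + simd - 1) / simd;   // policy B: pack tightly

    std::printf("unshared: %2d groups, %5.1f%% lanes busy\n",
                groups_unshared, 100.0 * total / (groups_unshared * simd));
    std::printf("packed  : %2d groups, %5.1f%% lanes busy\n",
                groups_packed, 100.0 * total / (groups_packed * simd));
    return 0;
}
```

Tight packing keeps the lanes busy but means mixed-patch groups, which is exactly where the shared-memory broadcast cost comes in.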

The interesting thing about looking at that WDDM2.0 stuff, and it'll prolly make your blood boil, is the idea that the GPU demand-pages resources and switches on faults etc. It seems to imply that WDDM is running towards a memory-centric view of tasks, whether they're texturing (which page faults to suck stuff off disk) or buffering data twixt pipeline stages. If everything's a virtualised resource and you've built hardware that can find and do other things while pages arrive, then keeping stuff on die is a task reserved for caches :p

This sort of swapping and PC like behavior is exactly why I'd rather be programming a console, and most developers I know don't agree with that. Clearly many people want demand paging so they can attempt to remove sloppy texture streamers. A compromise would be allowing developers a back door to lock things (CUDA's page locked memory is a good step in that direction!). I do understand the desire for ATI and NVIDIA to broaden their markets by eating into tasks which would traditionally be CPU side and to make it easy for programmers to make use of the hardware...

I've been arguing most of this for years, ever since the tessellator first appeared in Xenos. There's always an excuse for why it won't happen soon...

DX11 = no R2VB and no EDRAM, and all hardware having the feature. Devs who haven't done the DX10 conversion are going to have to make lots of changes to how draws are done (i.e. uniform buffers, and threading) to push DX11 performance. It doesn't take much: say Unreal ships with TS character support, then you've already hit what percentage of developers? With the rest realizing that they need the feature support to be competitive visually.

Things like silhouette-enhancement are low-hanging fruit and will make a really great difference. I was thinking Trinibwoy can have procedural door knobs, but now I'm wondering if tessellating such trivial objects is worth the draw call cost.

Depends on too many factors (shader system, lighting model, etc). BTW, NVidia's bind-less GL graphics updates enabled a 7.5x improvement in draw call speed; in one test they were pushing 400K draws/sec before and 3000K draws/sec after. Add this to the possibility that DX11 GPUs can better load-balance shaders (perhaps less pipeline emptying) and maybe draw calls aren't an issue.
 
I still don't understand tessellation well enough to have any decent idea why it's a "fixed-function stage" in D3D terminology, or what the shader code for a D3D11 tessellator would look like.

Anyone want to punt by posting reasonably detailed pseudo-code for D3D11's tessellator?

Why is it fixed-function? When D3D is adding programmable concepts and orthogonalising resource usage, why introduce another FF stage?
I don't have pseudo-code, but it's a lot of math, so the reason for the fixed-function stage is performance-related. If Microsoft makes the refrast code available, it contains the algorithm.
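Not the D3D11 algorithm, but as a rough sketch of the shape of the problem, here's uniform integer tessellation of a triangle domain; the real stage adds per-edge factors, fractional spacing modes, quad and isoline domains, and watertightness rules, which is where most of the math (and the case for fixed function) lives:

```cpp
// Not the D3D11 algorithm -- just the shape of the problem: uniform integer
// tessellation of a triangle domain. The real stage adds per-edge factors,
// fractional spacing, quad/isoline domains and watertightness rules, which
// is where most of the math lives.
#include <cstdio>
#include <vector>

struct DomainPoint { float u, v, w; };   // barycentric coords handed to the DS

int main() {
    const int tf = 4;                    // same tess factor on all edges/inside
    std::vector<DomainPoint> points;
    std::vector<int> indices;            // triangle list over 'points'

    // Domain points of the uniformly subdivided triangle: (tf+1)(tf+2)/2 of them.
    for (int i = 0; i <= tf; ++i)
        for (int j = 0; j <= tf - i; ++j) {
            float u = (float)i / tf;
            float v = (float)j / tf;
            points.push_back({u, v, 1.0f - u - v});
        }

    // Index of point (i, j) in the order generated above.
    auto idx = [&](int i, int j) { return i * (tf + 1) - i * (i - 1) / 2 + j; };

    // Connectivity: each row contributes "upward" and "downward" triangles.
    for (int i = 0; i < tf; ++i)
        for (int j = 0; j < tf - i; ++j) {
            indices.insert(indices.end(), {idx(i, j), idx(i + 1, j), idx(i, j + 1)});
            if (j < tf - i - 1)
                indices.insert(indices.end(),
                               {idx(i + 1, j), idx(i + 1, j + 1), idx(i, j + 1)});
        }

    std::printf("%zu domain points, %zu triangles\n",
                points.size(), indices.size() / 3);
    return 0;
}
```

The output is just domain coordinates plus connectivity; positions are only computed later in the domain shader.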

Not that I really like TS, especially given the extra draws for patch types, but like it or not, if it is the 3x as fast path for 100x less memory (clearly less performance and space advantage for smaller LOD), it will be used.
What extra draws for patches are you referring to?

typically unused features get designed out in the next big silicon rev.

given that the news about the Advanced Scheduler being dropped came far enough in advance of Vista shipping, ATI might have removed the feature before the 2xxx series.
Nope. R600 paid the price.
 