NV40 Coming @ Comdex Fall / nVIDIA Software Strategies

Yeah, it's not like Depth Test (together with Fog, Alpha Test and Stencil Test) has totally disappeared from the end of the pipeline, however mundane they look in a discussion about PS 3.0. :p

Anyway, all this talk about pipelines and dynamic allocation of resources got me thinking: As I understand it, the pipelines will always be bound to work on the same triangle (fragments sent from the rasterizer stage). It means they would normally be working on the exact same shader program with the same textures and data etc. So unless a decent number of the pipelines are often idle, an array of units that is dynamically allocated may not make much sense: the fragments need the same level of ops power (and you need to add complex allocation silicon).

Vertex shaders are a totally different ballgame of course: here you're not bound by something like the rasterizer stage, the ops are much less dependent on large data (like textures), and the data just keeps coming in a nice flow.
 
Well, I think the allocation mechanism is fairly simple. The Z-units just enqueue pixels that are to be rendered. The fragment processors dequeue these in FIFO order. This won't cause locality to deteriorate; it just means that all fragment shaders are kept busy.

Imagine you have 8 Z-test units in a 4x2 configuration, testing part of a triangle like:

1100
1110

1s are to be rendered, 0s culled. In a traditional rasterizer 3 fragment shaders will be idle, whereas in the NV3x rasterizer, where each section is decoupled from the next with a buffer, you keep all shaders busy (as long as the Z-units can test fast enough to ensure the buffer never empties). The important part is that spatial locality doesn't deteriorate: all fragments are still part of the same tri, and probably rendered in a tile-by-tile fashion.
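As a rough sketch of that decoupling idea (all unit counts and the multi-tile assumption are mine, just for illustration): with a lock-stepped rasterizer the zeros in the mask become idle shader slots, while with a FIFO between the Z-units and the shaders, survivors from successive tiles keep every shader fed.

```python
from collections import deque

# Coverage mask from the post (4x2 flattened): 1 = fragment to render, 0 = culled.
mask = [1, 1, 0, 0,
        1, 1, 1, 0]

NUM_UNITS = 8  # assume 8 Z-test units feeding 8 fragment shaders

# Traditional rasterizer: shaders are lock-stepped to the mask positions,
# so every culled position becomes an idle shader slot.
idle_traditional = mask.count(0)   # the "3 fragment shaders will be idle" case

# Decoupled (NV3x-style): Z-units enqueue survivors from successive tiles
# into a buffer; shaders just dequeue, so none sits idle while work remains.
fifo = deque()
for tile_mask in [mask, mask, mask]:        # survivors from several tiles
    fifo.extend(i for i, alive in enumerate(tile_mask) if alive)

busy = min(NUM_UNITS, len(fifo))            # all 8 shaders get a fragment
print(idle_traditional, busy)               # 3 8
```

The key point carries over: the FIFO only reorders *when* fragments are shaded, not *which* triangle they belong to, so spatial locality survives.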

However, if you were imagining shaders processing different contexts at the same time you are of course right: locality would plummet and hence performance would suffer (unless you spend much more effort on caches).

Cheers
Gubbi
 
Gubbi said:
In a traditional rasterizer 3 fragment shaders will be idle. Whereas in the NV3x rasterizer, where each section is decoupled with a buffer from the next, you keep all shaders busy (as long as the Z-units can test fast enough to ensure the buffer never empties).

The NV3x rasterizer way is of course the most efficient to implement if you have enough Z detect/reject units (as you point out), but are you suggesting that R3x0 doesn't do this already? I honestly don't know.

Gubbi said:
The important part is that spatial locality doesn't deteriorate: all fragments are still part of the same tri, and probably rendered in a tile-by-tile fashion.

Yes, that was my mundane point: one triangle = one shader program and one data set/texture set, so if you keep all pipelines busy - as mentioned above - the idea of a grand array of tex and ops units doesn't seem to really be worth the effort (all pipelines already need an equal amount of ops power).

Gubbi said:
However if you were imagining shaders processing different context at the same time you are of course right: locality would plummit and hence performance suck (unless you spend much more effort on caches).

No! That was a negative reference to make my point! 8)
 
My new favourite: The representative from the Santa Clara, California-based company said the NV40 would be twice as fast as the NV35 and would feature entirely new architecture. - oh yeah, we know that. Haven't we heard this somewhere before? ;)

And I love 'logical assuming' :rolleyes:... like this: Note that in the second quarter next year Intel rolls out its Grantsdale platform with PCI Express-based connection for graphics cards, so, it is logical to assume that either NVIDIA NV40 already features PCI Express or... (...) Of course, they would love to risk again... :rolleyes:

Xbit...
 
LeStoffer - I would think that quite a few triangles share the same shader program and context, especially as triangles get really small. So maybe it's worth it to add logic to process pixels coming from different tris simultaneously.

Gubbi - how does what you propose handle the ddx/ddy instructions?

Serge
 
psurge said:
LeStoffer - I would think that quite a few triangles share the same shader program and context, especially as triangles get really small. So maybe it's worth it to add logic to process pixels coming from different tris simultaneously.

In a perfect world, absolutely! :)

But the main question is: will we have silicon to burn on NV40, R400 & PowerVR Super8 to implement this? Maybe. I don't know, but I do remember people arguing some time ago whether 8 pipelines wouldn't be too much because of this 'limitation'.

Anyway, my point is that this 'problem' has to be solved before a true dynamic allocation of ops units in the pixel pipeline part of the GPU makes much sense [to me].
 
I've only skimmed the thread, so sorry if someone has already made this point...

Regarding the unification of VS/PS units: this is actually of key importance in an IMR architecture, as in current designs with separate VS and PS units either unit can end up being stalled by the other for long periods of time. E.g. consider a complex PS applied to a large triangle followed by a large number of small triangles with a complex VS applied; this can result in the VS processing of the later triangles being stalled by the PS processing of the first large triangle. This makes it quite hard for an IMR to come close to its peak throughput, resulting in relatively poor sustained performance.

By integrating VS and PS into a single unit it should be possible to make sure that the unit is busy all of the time by giving cycles to whatever needs them, be it geometry or pixel processing. So, if you combine the area associated with two separate units you might get one with, say, 2x the peak performance of one of the separate units, but it will also be fully utilised at all times.
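A toy back-of-the-envelope version of that argument (all cycle counts invented): one heavy-PS triangle followed by many heavy-VS triangles. Even granting the separate units a best case with no stalls at all, the unified pool of the same combined throughput finishes sooner because it spends its cycles wherever the work actually is.

```python
# Toy workload: (vs_cycles, ps_cycles) per triangle. One big triangle with
# a heavy pixel shader, then 20 small triangles with a heavy vertex shader.
# Numbers are made up purely for illustration.
triangles = [(1, 100)] + [(10, 1)] * 20

total_vs = sum(v for v, _ in triangles)   # 201 cycles of vertex work
total_ps = sum(p for _, p in triangles)   # 120 cycles of pixel work

# Separate VS and PS units (1 op/cycle each), generously ignoring all
# inter-stage stalls: limited by the slower of the two streams.
separate_best = max(total_vs, total_ps)   # 201 cycles

# Unified unit of the same combined area (2 ops/cycle), fully utilised:
unified = (total_vs + total_ps) / 2       # 160.5 cycles

print(separate_best, unified)             # 201 160.5
```

And in a real IMR the separate-unit case is worse than this best case, since the VS really does back up behind the PS as described above.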

However the combining of the units doesn't come for free...

John.

PS - This is also nothing revolutionary.
 
JohnH: I don't think anyone explicitly made the point, but I personally supposed most people in the discussion knew about it.

And the more I think about it, the cheaper dynamic allocation looks. In the NV35/NV4x architecture, that is!

In the NV35, and presumably NV40, you've really got only TWO types of shading units:
- FP32 with the possibility to emulate two TMUs instead of doing one FP op
- FP32 ( plain )
( that is my understanding, of course; hard to say for sure right now )

Now, if we assume no more than 1 instruction in 2 is texturing even in the PS, there'd be no problem simply giving 1+1 "packages" to the VS as "additional power".
So the dynamic allocation would mostly be in the sense of giving more VS power, not more VS or more PS power.
So, you could have something like:
4FP32 plain units, 4 FP32 Texture-capable units in the VS.
Done as a 4x(1+1)
So you wouldn't need additional power if there's only a little bit of texturing.
Heck, you could send additional power to the PS too, but I don't know if that'd be worth the transistor cost.

As for the PPP - I'd guess that might share power with the PS too, maybe? That is... if it got dynamic allocation.

To determine where you need to put those units, looking at the caches is crucial. If a cache is beginning to fill up too much, you're going to have a stall soon - so, based on cache occupancy, you should be able to determine where you need more power and where you need less.
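That occupancy heuristic could look something like this (a hedged sketch - the thresholds, function name, and unit-shifting granularity are all hypothetical, not anything confirmed about NV4x):

```python
# Sketch of the cache/queue-watermark idea: if the queue feeding a stage is
# filling up, that stage is falling behind and should get more units.
def rebalance(vs_queue_fill, ps_queue_fill, vs_units, ps_units,
              high=0.75, low=0.25):
    """Shift one unit toward whichever stage's input queue is near full.

    Fill levels are fractions in [0, 1]; thresholds are made-up values.
    """
    if vs_queue_fill > high and ps_queue_fill < low and ps_units > 1:
        vs_units += 1
        ps_units -= 1
    elif ps_queue_fill > high and vs_queue_fill < low and vs_units > 1:
        ps_units += 1
        vs_units -= 1
    return vs_units, ps_units

# Vertex queue nearly full, pixel queue nearly empty -> move a unit to VS.
print(rebalance(0.9, 0.1, 4, 4))  # (5, 3)
```

The hysteresis band between the low and high marks keeps the allocation from thrashing back and forth every cycle.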

BTW, if the NV40 doesn't have dynamic allocation:

According to The Inquirer ( I don't trust them too much for that, but heck, let's assume for a while ) , the NV40 got 8 pipes ( I trust them on that, heard it from other people too - or well, according to me, the equivalent of 8 pipes ) with 2 FPUs per pipe.

I've got no idea if that 2FPUs/pipe thing is accurate. But if it is, there are many possibilities, among which:
1) Both can emulate 1 TMU ( -> 8 texture operations *or* 8 FP operations )
2) One can emulate 2 TMUs, the other is a plain FP32 unit. Thus it would be equivalent to or better than an 8x2, depending on the situation
3) TMUs are now again decoupled, thus being an 8x1 with twice the shading power of a traditional 8x1 ( how the heck to fit that in 150M transistors, though... )
4) One of them can emulate 1 TMU, the other cannot emulate any ( most likely, slightly better than a traditional 8x1 )


In other words, if they call it an "8x1", we'll have yet another architectural description nightmare. Great, huh?
If they call it an 8x2, then solution 2 is fairly obvious - it would be VERY nice, IMO. If the NV35 takes 130M transistors, they'd have to do some serious compression work - but it should be possible. Remember it would have 8 plain FP32 FPUs and 8 texture-capable ones, while the NV35 has 8 and 4, respectively.


Uttar

EDIT: Corrected a few mistakes, added the idea of using cache.
 
I only posted the comment as I'd noticed a couple of statements about VS detracting from PS perf.

There are any number of ways you can split the shader unit across the tasks, e.g. divvy up the units themselves, or treat it as a single lump which switches between input buffers when they fill (more like multi-threading); both have advantages and disadvantages wrt area costs and/or perf limitations.

No reason why the 'PPP' couldn't use the same units again, although there might be some part of that functionality that doesn't map...

John.
 
But the main question is: will we have silicon to burn on NV40, R400 & PowerVR Super8 to implement this? Maybe. I don't know, but I do remember people arguing some time ago whether 8 pipelines wouldn't be too much because of this 'limitation'.

You may exclude future PVR iterations, since AFAIK PS and VS on a TBDR are perfectly de-coupled (read JohnH's post above).

***edit: the first one, not the second one directly above mine.
 
Although that's true, there are other advantages to combining the units, e.g. in theory it should be possible to increase performance for less net area than equivalent-performance separate units. So it's ultimately still worth doing on a TBDR.

John.
 
JohnH said:
Although that's true, there are other advantages to combining the units, e.g. in theory it should be possible to increase performance for less net area than equivalent-performance separate units. So it's ultimately still worth doing on a TBDR.

John.
It might be even better for a TBDR than an IMR. This way the tiler could use all of its functional units all the time without having to bother with doing both vertex and fragment ops at the same time to maximize performance. An IMR may have to contend with lowered memory locality when attempting to maximize usage of a unified shader architecture; a TBDR would not (which may be a reason for the unified IMR to have some caching between doing vertex and fragment ops... though it shouldn't need that much - just a dozen or so triangles should do it).
 
Hmm, that seems correct to me, Chalnoth.
But I think we may have to see it differently there. With a unified IMR, you ideally need to allow very high speed allocation of the units, and not just "all here" and then "all there" ( although you could, really, would be rather odd though... )

With a unified TBDR, you may dedicate ALL units to VS, then all units to PS. You don't need to be able to do 50-50, 25-75, 10-90, ... - just 0-100 and 100-0! :)

So, it might actually also be easier to implement in a TBDR...


Uttar
 
Uttar said:
Hmm, that seems correct to me, Chalnoth.
But I think we may have to see it differently there. With a unified IMR, you ideally need to allow very high speed allocation of the units, and not just "all here" and then "all there" ( although you could, really, would be rather odd though... )

With a unified TBDR, you may dedicate ALL units to VS, then all units to PS. You don't need to be able to do 50-50, 25-75, 10-90, ... - just 0-100 and 100-0! :)

So, it might actually also be easier to implement in a TBDR...

Uttar

Why not just have 2 FIFOs feeding the shader array: one for fragments and one for vertices. Use a low-mark/high-mark algorithm to determine which gets processed, possibly taking rendering context into account (i.e. switch FIFOs on a context switch and start prefetching the new context).

Cheers
Gubbi
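Gubbi's two-FIFO scheme could be sketched like this (a minimal toy - the mark values, `pick_fifo` name, and switch conditions are my assumptions, not anything from the posts beyond the low-mark/high-mark idea):

```python
from collections import deque

# Two FIFOs feed the shader array; the array drains one until it runs low
# while the other has backed up past its high mark. Marks are made-up.
HIGH, LOW = 8, 2

def pick_fifo(current, vertex_fifo, fragment_fifo):
    """Decide which FIFO the shader array should service next."""
    other = fragment_fifo if current is vertex_fifo else vertex_fifo
    # Switch when the current FIFO is nearly drained and the other is full.
    if len(current) <= LOW and len(other) >= HIGH:
        return other
    # Always switch away from an empty FIFO if the other has work.
    if not current and other:
        return other
    return current

vertex_fifo = deque(range(2))     # nearly drained
fragment_fifo = deque(range(10))  # backed up past the high mark
chosen = pick_fifo(vertex_fifo, vertex_fifo, fragment_fifo)
print(chosen is fragment_fifo)    # True
```

Sticking with one FIFO between the marks amortizes the cost of a context switch (and of any prefetching) over a batch of work, which is exactly why the hysteresis is there.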
 
It's a little more complicated than that, as the processing elements will invariably have quite deep pipelines which you need to keep full for best perf, but you're basically right.

Uttar/Chalnoth: in order to avoid stalling the host, a TBDR would still want to fine-grain schedule the processing elements between the two tasks. But yes, the granularity could be coarser than on an IMR, which might make the pipelining issue easier to solve.

John.
 
Joe DeFuria said:
DaveBaumann said:
I...then I'll have a 3GHz 800MHz FSB CPU/Mobo in the test rig to tick over with when its not testing! :p

:oops:

Sorry for being OT, but I have been running a 3GHz (2.4C @ 3GHz) machine with a 1GHz FSB for a while now :mrgreen:

So yes, mine is bigger than yours! 8)
 