NV40 Coming @ Comdex Fall / nVIDIA Software Strategies

IIRC R390/Loci/R400/420 are all the same solution, the new high-end chip, probably on 130 nm. But! It isn't the same as the original (year-old), brand-new hyppasuppa' architecture-based R400: that became R500... so the current R390/Loci/R400/420/whatever-you-want/Stanley-Cup will be kind of an 'intermediate' high-end chip 'til the R500 hits the market...
Hehe thanks for clearing that one up.

Regarding these new features like Vertex and Pixel Shader 3.0, what differentiates them from the version 2.0 we have with today's DX9 hardware?
 
So, when are we supposed to expect R390/Loci/R400/420/whatever to be announced? Will it trounce NV40? (or NV45, depending on the timeframe)
 
Typedef Enum said:
Dave,

Do you know something about NV40's AA possibilities? IE have you spoken to somebody from nVidia, and they said something like...

"Yeah, yeah...We're well aware of the fact that most people, in the know, regard our AA as sub-standard...But I can promise you that it will all change with our next gen. part."

Something like that?

Over at planet3dnow, there were some marketing people from Nvidia last week answering user questions.
They said there will be a change in the AA department, but nothing specific.
That could mean anything, but then, they were marketing people...
 
DaveBaumann

Given that NV30 is still very similar to NV25 in much of its configuration

Now that is not true. Technologically, the jump from NV25 to NV30 is a much bigger one than the jump from NV35 to NV40 will be...

But I doubt that NV40 will introduce full dynamic allocation either. It's just too early for that, I think.

Uttar

I agree PS3.0./VS3.0. isn't particularly ambitious.

...for NVIDIA :)

But do you have facts proving there's no dynamic allocation?

"Facts" will be released around november i hope :) Just thoughts and rumors for now...

T2k

AFAIK R360 is some R3x0-based chip on 150 nm - due around July, IMHO.

I'm not sure that this is a new chip... It looks like a speed-bin of R350 to me.

Josiah

So, when are we supposed to expect R390/Loci/R400/420/whatever to be announced? Will it trounce NV40? (or NV45, depending on the timeframe)

R360 -- soon(TM) :)
R400(390) -- November, Comdex Fall
NV40 -- November, Comdex Fall

Which will be faster/better in quality, we'll see in November.
 
NV30 and NV25 differ in the way their internal units are partitioned and utilized. In NV30 there are no set pipeline pathways, just a collection of units which are allocated as dictated by any particular application. NV25 is incapable of this.
 
Sounds like Luminescent's agreeing with me :) Ah well, DaveB doesn't, I think, but trying to annoy him some more with it might not be lost time ;)

Although a little clarification, Luminescent: there is not just "a" collection of units, there are two: one for the PS, one for the VS. *No* sharing between those two parts is going on, AFAIK.

I believe that what nVidia claims for the NV30's VS is also true for the PS. It's just that the VS can only output 1 vertex/clock, I think, while the PS can output many pixels per clock (4 or 8, depending on whether you output color or not), so even though it's the same idea, there are some real differences.

How could we, at least partially, prove this theory?
Well, if we could prove there *is* a register-usage performance hit in the VS too, then we'd be on to something. Nothing has been tried in this domain yet, however, because it's much harder to measure accurately than in the PS case, since you can't directly output pixels.
Just seeing whether there's a performance hit at all, even if we can't quantify it precisely, shouldn't be too hard (a rough sketch of the idea follows below). I'd try it, but I still don't have any NV3x :(


Uttar
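
For what it's worth, the register-pressure theory can be pictured with a toy model: the register file is a fixed pool, each in-flight vertex (or fragment) claims some slots, and throughput falls once too few threads fit to hide latency. The pool size and latency figure below are invented for illustration, not NV3x specifics.

Code:
# Toy model of the register-pressure idea discussed above: treat the
# register file as a fixed pool that limits how many vertices (or
# fragments) can be in flight, and assume throughput drops once too few
# fit to hide latency. All sizes and latencies here are invented for
# illustration; they are not measured NV3x figures.

def relative_throughput(regs_per_thread, register_file_slots=16, threads_to_hide_latency=8):
    """Throughput relative to the no-pressure case under this toy model."""
    threads_in_flight = register_file_slots // regs_per_thread
    return min(1.0, threads_in_flight / threads_to_hide_latency)

for regs in (1, 2, 4, 8):
    print(f"{regs} temp register(s) per thread: {relative_throughput(regs):.2f}x")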
 
And what proof is there of that?

IMO the single biggest giveaway that this isn't the case is that the FP units on NV30 are also the texture address operators - if they were truly reallocatable, then some could be used for texture addressing and some could be used for processing in the same clock. However, Dawn's code and the other tests we've seen on this forum do not bear that out at all - texture addressing happens in one cycle and FP operations happen on the next. As I said before, NV25's texture address processors were also floating point, and it looks as though these were extended with the added shader functionality.

Many of the basic principles of NV30 are still based on a 2x2 configuration, which also belies a truly configurable pipeline.
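
The cycle pattern described here (texture addressing on one clock, FP math on the next) is what you would expect if the two kinds of work compete for the same units; a freely reallocatable design could overlap them. The little sum below is only the arithmetic of that argument, with made-up instruction counts, not a statement about actual NV30 scheduling.

Code:
# Cycle accounting for the argument above: if the FP ALUs double as the
# texture address processors, texture addressing and FP math cannot share
# a clock, so a shader pays for them serially; truly reallocatable units
# could overlap the two streams. Instruction counts are made up.

def cycles_shared_units(tex_ops, fp_ops):
    return tex_ops + fp_ops            # every op needs its own pass

def cycles_reallocatable_units(tex_ops, fp_ops):
    return max(tex_ops, fp_ops)        # the longer stream dominates

tex, fp = 4, 4
print("shared tex-address/FP units:", cycles_shared_units(tex, fp), "cycles")
print("reallocatable units:        ", cycles_reallocatable_units(tex, fp), "cycles")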
 
Luminescent probably meant that NV30 has a configuration somewhere along the lines of:

Code:
-----------    --------    -------------    --------    ---------------- 
| z-test/ |--\ | FIFO |--\ | Fragment  |--\ | FIFO |--\ | Framebuffer  |
| stencil |--/ |      |--/ | Processors|--/ |      |--/ | blend units  |
-----------    --------    -------------    --------    ----------------

Whereas NV25 is a more traditional 2x2 pipe design.

Cheers
Gubbi
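
One way to read that diagram is as stages decoupled by queues: a fast Z/stencil front end keeps filling a FIFO while the slower fragment processors drain it, and a second FIFO feeds the blend units. The little simulation below is only meant to show the decoupling; the queue depths, rejection rate and shader cost are invented.

Code:
# Toy simulation of the FIFO-decoupled layout sketched above: a Z/stencil
# front end enqueues surviving pixels, fragment processors drain that FIFO
# at their own pace, and blended output comes off a second FIFO. Depths,
# rates and the rejection ratio are invented for illustration.
from collections import deque
import random

random.seed(1)
to_shade = deque(maxlen=16)   # FIFO between z-test and fragment processors
to_blend = deque(maxlen=16)   # FIFO between fragment processors and blend
SHADER_CYCLES = 4             # cycles one fragment occupies the shader
busy_until, blended = 0, 0

for cycle in range(200):
    # Front end: test 4 pixels per clock; most fail the Z-test.
    for _ in range(4):
        if random.random() > 0.7 and len(to_shade) < to_shade.maxlen:
            to_shade.append(cycle)
    # Fragment processor: when free, start the next queued pixel and queue
    # its (eventual) output for blending -- timing is deliberately crude.
    if cycle >= busy_until and to_shade and len(to_blend) < to_blend.maxlen:
        to_shade.popleft()
        to_blend.append(cycle)
        busy_until = cycle + SHADER_CYCLES
    # Blend unit: retire one fragment per clock.
    if to_blend:
        to_blend.popleft()
        blended += 1

print("fragments blended in 200 cycles:", blended)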
 
Dave: And just how useful would it be for nVidia to support that?
As I've said before, I believe the NV3x *still* works on 4 (or 8) pixels simultaneously. It isn't loads and loads of calculation functionality for one pixel at a time; that wouldn't be efficient at all.

So, considering that, it's more optimal to run the same instructions on all pixels in the same clock cycle, isn't it?

Now, another thing which indicates the NV3x doesn't have pipelines:
In pipelines, you'd have to respect an order. For example, I think the R300 needs to do texturing, then arithmetic (or it could be the opposite). That means if you do a texturing operation dependent on an arithmetic operation, you need two clock cycles (the counting behind this is sketched after this post).
In the case of the NV3x, however, I seem to remember that the order has no importance. My memory COULD be tricking me, however, so I might have to check on this.

As I said before, I'm not sure of this. I believe it is correct, but there are, as you say, things which might have been done another way if it were true. I attribute this to the NV3x having had many things not implemented, or badly implemented, due to lack of time. Of course, it could simply be that my theory is incorrect. Only time will tell.


Uttar
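
The two-cycle claim in that post is easy to make concrete: if a design imposes a fixed order within each pass (say texture lookups first, then arithmetic), an instruction stream that needs to go "backwards" across that boundary costs an extra pass, while an order-free design does not. Whether R300 or NV3x actually impose such an order is exactly the uncertain part; the counting itself is below, with the order as a stated assumption.

Code:
# Pass counting behind the argument above: assume a fixed phase order
# inside each pass (texture stage before arithmetic stage -- this order is
# the debated assumption, not an established R300/NV3x fact). Any time the
# instruction stream has to step "back" to an earlier phase, a new pass is
# needed; a design with no ordering constraint never pays that.

PHASE = {"tex": 0, "alu": 1}    # assumed fixed order within one pass

def passes_with_fixed_order(kinds):
    passes, current = 1, 0
    for k in kinds:
        if PHASE[k] < current:  # would have to revisit an earlier phase
            passes += 1
        current = PHASE[k]
    return passes

# An arithmetic result feeding a dependent texture lookup:
shader = ["alu", "tex"]
print("fixed tex->alu order:", passes_with_fixed_order(shader), "passes")
print("no ordering constraint: 1 pass")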
 
Uttar said:
So, considering that, it's more optimal to run the same instructions on all pixels in the same clock cycle, isn't it?

How can you possibly do this with data-dependent branches? (Isn't NV3x supposed to do this?)

I'd think that each fragment processor has its own PC (program counter); whether the instruction cache/memory is multiported or simply duplicated, I have no idea.

Cheers
Gubbi
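
For what it's worth, lock-step issue and data-dependent branching aren't strictly incompatible: a group of lanes sharing a single program counter can execute both sides of a branch and keep results only for the lanes whose condition matched. Whether NV3x's vertex units do this, or carry a per-lane PC as Gubbi suggests, is the open question; the snippet just shows the generic masking trick.

Code:
# The execution-mask trick: lanes sharing one program counter handle a
# data-dependent branch by evaluating both sides and keeping each lane's
# result from whichever side its condition selected. Generic SIMD idea
# only -- not a claim about how NV3x vertex units are actually built.

def simd_select(values, threshold):
    mask = [v > threshold for v in values]       # per-lane branch condition
    then_side = [v * 2 for v in values]          # "if" side, run for all lanes
    else_side = [v + 100 for v in values]        # "else" side, run for all lanes
    return [t if m else e for m, t, e in zip(mask, then_side, else_side)]

print(simd_select([1, 5, 9, 3], threshold=4))    # -> [101, 10, 18, 103]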
 
Gubbi said:
Uttar said:
So, considering that, it's more optimal to run the same instructions on all pixels in the same clock cycle, isn't it?
How can you possibly do this with data-dependent branches? (Isn't NV3x supposed to do this?)

Yes, but only in the VS. In the PS, there's no branching at all. So in the VS you don't run the same thing at the same time if it's not possible, but I'd guess you'd still try to do so when possible, because it should result in better performance, I think. In fact, there was even a B3D article discussing the problems with dynamic branching! :)


Uttar
 
Uttar said:
Gubbi said:
Uttar said:
So, considering that, it's more optimal to run the same instructions on all pixels in the same clock cycle, isn't it?
How can you possibly do this with data-dependent branches? (Isn't NV3x supposed to do this?)

Yes, but only in the VS. In the PS, there's no branching at all. So in the VS you don't run the same thing at the same time if it's not possible, but I'd guess you'd still try to do so when possible, because it should result in better performance, I think. In fact, there was even a B3D article discussing the problems with dynamic branching! :)

Thanks for educating me :)

However, I still think it would be very inefficient to run all shaders in lock-step: when one of the fragment shaders accesses a texel that isn't in the texture cache, all shader units stall (a toy comparison of the two approaches follows below).

Considering the limited length of pixel shader programs, I don't think it is out of the question that instruction memory is simply duplicated.

Cheers
Gubbi
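
Gubbi's efficiency worry can be put into toy numbers: if the group runs in lock-step, a single lane's texture-cache miss holds everyone up, whereas independently sequenced units only stall the lane that missed. The miss rate and penalty below are invented.

Code:
# Toy comparison behind the stall argument above: lock-step execution pays
# the full miss penalty whenever *any* lane misses the texture cache, while
# independently sequenced lanes each pay only for their own misses (the
# group finishes with the slowest lane). Miss rate and penalty are invented.
import random

random.seed(2)
LANES, INSTRUCTIONS, MISS_PENALTY, MISS_RATE = 4, 100, 20, 0.05

misses = [[random.random() < MISS_RATE for _ in range(INSTRUCTIONS)]
          for _ in range(LANES)]

lockstep_cycles = sum(
    1 + (MISS_PENALTY if any(lane[i] for lane in misses) else 0)
    for i in range(INSTRUCTIONS))

independent_cycles = max(
    sum(1 + (MISS_PENALTY if missed else 0) for missed in lane)
    for lane in misses)

print("lock-step group:  ", lockstep_cycles, "cycles")
print("independent lanes:", independent_cycles, "cycles")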
 
What indication is inspiring this dedication to "not a pipeline"?

A pipeline is a conceptual "start->end" organization of processing throughput. The NV3x has those, AFAICS. The rest of what you are describing is implementation detail.

When you say it doesn't have pipelines, it doesn't make sense to me.

It does if you say it doesn't have a "traditional" pipeline, but "programmable pixel pipeline" already indicates that evolution as programmability increases.

It does if you say it has flexibility in pipeline organization, but that wouldn't be saying anything new regarding the NV3x. Going further and saying it is completely flexible in that ability also contradicts things having to be changed so the NV35 could (possibly) output 8 color outputs per clock while the NV30 could not (and not even that is confirmed yet).

Once you are processing multiple data concepts, in whatever design, you seem to have pipelines. And things still have to be done in order; the pipeline is just a tool to hide the undesirable effects of that, with the implementation determining the method.


Concerning ILDP, if your comments in that thread are related to what you are proposing: I'll mention two opinions I formed when reading through the PDF: 1) its improvements seem inapplicable as a GPU solution unless you replicated them for parallelism (in which case they'd be pipelines); 2) they still talk about the ILDP design as a pipeline, and the benefits it offers are for interdependency optimization (i.e., not a substitute for parallelism, at least for a GPU's inherently parallel workload, but a tool for enhancing the functionality of the parallelism) and for a hardware implementation allowing higher clock speeds. EDIT: It does seem interesting for both branching and looping evolution and for the idea of shader-output AA solutions, however.

In pipelines, you'd have to respect an order. For example, I think the R300 needs to do texturing, then arithmetic (or it could be the opposite). That means if you do a texturing operation dependent on an arithmetic operation, you need two clock cycles.
In the case of the NV3x, however, I seem to remember that the order has no importance. My memory COULD be tricking me, however, so I might have to check on this.

Pipeline discussion.

What you are describing looks to me like a pipeline implementation decision to hide dependency latency, if true. It even makes sense for this "component cascade" idea, if you implemented it towards the goal of more processing throughput per pixel (rather than more pixels, and higher efficiency of processing for each one).

Since the data dependency doesn't actually disappear, it would have to be hidden. What are you calling the conceptual execution structure that would hide that, if not a pipeline?
 
Guys, your discussion is interesting, but the original topic of this thread was "NV40 Coming @ Comdex Fall / nVIDIA Software Strategies" and that's also the reason why I'm watching this thread. Would it be possible to move your NV2x/NV3x related discussion to another thread? Thank you!! :D
 
Uttar said:
overclocked said:
Maybe the Nv40 will use SOI/LOW-K or?

I think the NV40 will use SOI, but not Low K. Could be wrong though.

DegustatoR: I agree PS3.0./VS3.0. isn't particularly ambitious. The PPP is quite nice, as you say, but yes, it's nothing "revolutionary". But do you have facts proving there's no dynamic allocation?


Uttar

Do you have facts that prove it does?! I'm not so sure - CMKRNL didn't say anything about it (AFAIK), and even if she did, I'm not sure I'd trust it given her occupation and the timeframe.

Dynamic allocation sounds way off and fairly difficult to implement in hardware *and* software - my vote goes for a fairly conventional, fully-FP 8x1 architecture with limited programmable tessellation support. Also new AF/AA algorithms! The things Kirk hinted at in that recent ET article suggest very fast FP throughput more than anything else. If they've dropped FX hardware support recently then that's already a big step. It's not like nV to overhaul too much of their tech.

MuFu.
 
Madshi, welcome to the wonderful world of online messageboards.

On pipelines: there seems to be a bit of confusion as to what level the discussion is at. There is the overall configuration of the GPU (the rasterizer in particular), and then there are the internal workings of the shaders (processors in their own right).

How the internals of particular shader units work, I have no idea (except that the R3xx pixel shaders are 3-way VLIW, or did I get that wrong?).

My posts above concern the overall organization of the units in NV30.

In a traditional pipeline you first test Z to see if you should render the pixel at all (edit: actually, in older designs Z is tested last, I believe). If the test passes, fragment(s) are calculated for the pixel, and the pixel is eventually blended.

In a modern application this is wasteful, because:

1.) Many pixels are supposed to be rejected by the Z-test, starving the rest of the units in the pipeline.
2.) Modern shaders use (or are expected to use) multiple cycles to calculate a fragment, thus starving the blend unit.

I think this is the rationale behind NV3x.

Number 1 is the *big* sinner, since fragment shaders are now so expensive (floating-point support, what have you...) in terms of die area.

On the other hand, Z-test (and stencil) units are really cheap, just glorified comparators (they are bandwidth-hungry, however). So it makes sense to put a lot of Z-test units at the front of the pipeline, which then just enqueue passed pixel candidates for further processing (a back-of-the-envelope version of this balance follows below).

Subsequently you can do with fewer framebuffer blend stages, since your fragment processors won't produce output every cycle. This, of course, causes fillrate to be reported as low when fillrate is measured as single-texture performance.

It appears that NV35 is just a tweak over NV30: better float16 performance, more blend units and much more bandwidth, which is good, because then we can reject more pixels at the front of the pipeline, zooming past all the culled pixels before our fragment shaders are starved.

Cheers
Gubbi
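
The balance Gubbi describes can be sanity-checked with back-of-the-envelope numbers: if most pixels die at the early Z test and each survivor ties up a shader for several cycles, a handful of cheap Z units is enough to keep a smaller pool of expensive fragment shaders busy. All the figures below are made up.

Code:
# Back-of-the-envelope check of the unit balance argued above: cheap
# Z/stencil testers up front, fewer expensive fragment shaders behind a
# FIFO. If most pixels are rejected early and survivors cost several
# shader cycles each, the shaders stay fed. Every figure is invented.

Z_UNITS          = 8      # pixels Z/stencil-tested per clock
REJECT_RATE      = 0.75   # fraction killed by the early Z test
SHADER_CYCLES    = 4      # cycles an average surviving fragment costs
FRAGMENT_SHADERS = 4

survivors_per_clock = Z_UNITS * (1 - REJECT_RATE)        # 2.0 pixels/clock
shader_throughput   = FRAGMENT_SHADERS / SHADER_CYCLES   # 1.0 fragments/clock

print(f"survivors per clock: {survivors_per_clock}")
print(f"shader throughput:   {shader_throughput} fragments/clock")
print("shaders saturated" if survivors_per_clock >= shader_throughput
      else "shaders starved")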
 
Actually, I don't think it's 'traditional' to Z-test first; AFAIK this was done last - it's only relatively recently that it's been done first. However, pixel-level Z reject is the last option for the likes of NV30 and R300, since they have ZCull or Hier-Z, which are capable of rejecting multiple pixels outside of the pipeline.
 
You're right about early/late Z-test; I edited my post, but probably too late...

...And you're right about more advanced Z-culling, but it still makes sense to have more Z-test units than fragment shaders, for tris that fail hierarchical Z-culling (small tris and tris with a high edge/area ratio) and for MSAA.

Cheers
Gubbi
 