Outstanding performance of the NV40 = old school 3dfx mojo?

Funny that the NV40 isn't even out yet, NVIDIA hasn't made the specs public, and anyone in the know isn't allowed to talk about it... and yet people are still setting themselves up for a disappointment, like they do every time NVIDIA releases a major new core.

Are we really all resigning ourselves to the R420 getting beaten by the NV40, even though we know nothing but unsubstantiated rumours about the two new cores from ATI and NVIDIA?

I just hope NVIDIA does not disappoint its many fans once again... and all indications seem to point towards more smoke from the masters of hype. In fact, I think NVIDIA's most ardent fans do it no favours by proclaiming it the best thing since sliced bread.

Here's wishing people would just 'shhhshhh' for a while and have a little patience.
 
Sage said:
as for geometry caching being a problem-

what about with a PPP? Try using a hierarchical subdivision surface and suddenly you don't have anywhere near the bandwidth requirements for geometry. It would also be useful for LOD: you wouldn't even have to load all of the levels of detail, depending on distance. Of course, it would still require a rather large on-chip buffer, but the bandwidth requirements would be greatly reduced. And remember that a TBDR can get comparable speed with fewer pipelines and lower clock speeds. That means you have plenty of transistors to use on that buffer.
A PPP would help an IMR much more than it would help a TBDR; the problem with TBDR and geometry is that you need to build, for each tile, a list of polygons touching the tile, and for the whole frame, a list of *post-transform* vertices. (Think about it: you cannot know which tiles a polygon covers until AFTER you have done the transform, and when rendering the tile - which you cannot do until you have completed all geometry calculations for the full frame - you need to fetch or generate post-T&L data from somewhere.) A PPP would decrease the amount of data that needs to be sent in to geometry processing, but with the large number of vertices/polys it generates, it will dramatically increase the memory traffic needed for post-T&L poly data in the TBDR. The problem with throwing an on-chip buffer at this task is that the size needed for this buffer is unbounded; it is proportional to the total number of actual polygons rendered in the frame, so if my PPP produces 10 million triangles for the frame, your TBDR needs a ~200MB (?) buffer to handle it.
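To make the bookkeeping concrete, here is a minimal software sketch of the binning step described above. All names, the tile size, and the screen dimensions are made up for illustration; real hardware uses dedicated units rather than growable lists, but the key constraints are the same: binning can only happen after transform, and the per-tile lists grow without bound as the frame's polygon count rises.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int kTile   = 32;                          // assumed tile size
constexpr int kTilesX = (1600 + kTile - 1) / kTile;  // assumed 1600x1200 screen
constexpr int kTilesY = (1200 + kTile - 1) / kTile;

struct ScreenVtx { float x, y; };                    // post-T&L screen position

struct Binner {
    // Per-tile triangle lists: this storage is unbounded -- it grows with the
    // total number of polygons in the frame, which is exactly the problem.
    std::vector<uint32_t> tileList[kTilesY][kTilesX];

    // Binning can only happen AFTER transform, because screen-space positions
    // are needed to know which tiles the triangle touches.
    void bin(uint32_t tri, const ScreenVtx v[3]) {
        float minX = std::min({v[0].x, v[1].x, v[2].x});
        float maxX = std::max({v[0].x, v[1].x, v[2].x});
        float minY = std::min({v[0].y, v[1].y, v[2].y});
        float maxY = std::max({v[0].y, v[1].y, v[2].y});
        int x0 = std::max(0, int(minX) / kTile);
        int x1 = std::min(kTilesX - 1, int(maxX) / kTile);
        int y0 = std::max(0, int(minY) / kTile);
        int y1 = std::min(kTilesY - 1, int(maxY) / kTile);
        for (int ty = y0; ty <= y1; ++ty)            // conservative bbox binning
            for (int tx = x0; tx <= x1; ++tx)
                tileList[ty][tx].push_back(tri);
    }
};
```

As a sanity check on the ~200MB figure: at 10 million triangles, even a tight ~20 bytes of post-T&L vertex and index data per triangle already adds up to 200MB, before counting the per-tile list entries themselves.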

There are some solutions to this problem; for tessellation of HOS patches, you can tessellate once to determine which tiles are covered by the patch at T&L time, then again to re-do the tessellation at the time of rendering the tile. This solution can cut the buffer size by a factor of 5-100 (by storing HOS indices instead of polygon indices in the per-tile lists), at the cost of doing all the tessellation work multiple times. Or you can do what the 'scene manager' in Kyro does in the face of excessive polycounts:
1. Collect N polygons from T&L.
2. Render a frame with the N polygons.
3. Collect N additional polygons from T&L.
4. For each tile, preload the tile with the current framebuffer contents, render the new polygons on top of that, and write the result back to the framebuffer.
5. If you still have any polygons left, repeat from step 3.
The 'scene manager' thus solves the unbounded-buffer-size problem of TBDRs, at a performance and memory usage penalty.
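For illustration, here is how those five steps fit together as a loop, written as a tiny software simulation. The names and the polygon budget N are made up; the actual Kyro hardware obviously differs in detail.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct Polygon { int id; };                    // stand-in for post-T&L data

constexpr std::size_t kCapacity = 4;           // N: assumed polygon budget

void renderTilePass(const std::vector<Polygon>& batch, bool preload) {
    // Step 4: on every pass after the first, each tile is preloaded with the
    // partial framebuffer AND an external z-buffer before rendering on top.
    std::printf("pass over %zu polygons (preload=%d)\n", batch.size(), preload);
}

void renderFrame(const std::vector<Polygon>& scene) {
    std::vector<Polygon> batch;
    bool firstPass = true;
    std::size_t i = 0;
    while (i < scene.size()) {                 // step 5: loop while polys remain
        batch.clear();
        while (i < scene.size() && batch.size() < kCapacity)
            batch.push_back(scene[i++]);       // steps 1/3: collect N polygons
        renderTilePass(batch, !firstPass);     // steps 2/4: render the batch
        firstPass = false;
    }
}

int main() {
    std::vector<Polygon> scene(10);            // 10 polygons, budget of 4
    renderFrame(scene);                        // -> 3 passes: 4 + 4 + 2
}
```

Note where the memory usage penalty comes from: after the first pass, depth values have to survive between passes, so the z-buffer (which a single-pass TBDR keeps entirely on-chip) has to live in external memory.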
 
elroy said:
TBDR:
Advantages

Less work to do
Lower transistor count
Lower power consumption
FSAA for "free"
On less work: Not necessarily true. If the IMR does an initial z-pass, it ends up doing the same rendering work as the TBDR, while the TBDR must also bin and sort the triangles for storage in the scene buffer (which takes a relatively small amount of time, but it's still more work).
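To illustrate what an initial z-pass buys an IMR, here's a toy software sketch (my own illustration, not any particular piece of hardware): pass 1 fills only the depth buffer, so pass 2 shades each visible pixel exactly once, the same 'shade only the winner' behaviour a TBDR gets from deferring.

```cpp
#include <array>
#include <cstdio>

constexpr int W = 4, H = 4;
struct Frag { int x, y; float z; int colour; };

int main() {
    std::array<float, W * H> depth; depth.fill(1.0f);    // cleared to far plane
    std::array<int,   W * H> frame{};

    // Three fragments; the two at (1,1) overlap and only one survives.
    Frag frags[] = { {1, 1, 0.9f, 111}, {1, 1, 0.3f, 222}, {2, 2, 0.5f, 333} };

    for (const Frag& f : frags)                          // pass 1: depth only,
        if (f.z < depth[f.y * W + f.x])                  // no shading cost
            depth[f.y * W + f.x] = f.z;

    for (const Frag& f : frags)                          // pass 2: shade only
        if (f.z == depth[f.y * W + f.x])                 // fragments that won
            frame[f.y * W + f.x] = f.colour;             // the depth test

    std::printf("pixel(1,1) = %d\n", frame[1 * W + 1]);  // 222: the occluded
}                                                        // 111 was never shaded
```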

On lower transistor count: Certainly not for the same number of pipelines/same amount of features.

On lower power consumption: Again, not necessarily true. This would hinge on doing less work and having better memory bandwidth usage. Neither is necessarily the case for a TBDR, as the z-buffer is replaced by a scene buffer, which, with high geometry densities, ends up being less efficient.

On free FSAA: Almost. MSAA will be nearly free, but then it's not expensive in modern hardware anyway. The main problem is that the worst-case scenario for a TBDR can be exacerbated if FSAA is enabled.

Disadvantages

Problems with high geometry counts (is this an invalid point?)
Not proven at a high end level
The only question about the validity of the high-geometry-counts point is how high the geometry density has to go before it becomes an issue.

IMR

Disadvantages (the ones I have listed are all compared to a TBDR)

Requires higher memory bandwidth
Requires more pipelines, and therefore transistors
Produces more heat
On higher memory bandwidth: not necessarily. Frame/z-buffer compression with MSAA, combined with the host of other memory bandwidth saving technologies we've seen to date in IMRs, levels the playing field significantly. The main pull for TBDRs isn't necessarily using less bandwidth, but rather that it's easier with a TBDR to make more efficient use of the available bandwidth.
On more pipelines: this assumes that a TBDR with fewer pipelines will perform equally. That is based upon a comparison between old TBDRs and the immediate-mode renderers of the day. Those IMRs had none of the memory bandwidth saving techniques that today's parts do, and so the comparison is invalid.
 
arjan de lumens said:
1. Collect N polygons from T&L.
2. Render a frame with the N polygons.
3. Collect N additional polygons from T&L.
4. For each tile, preload the tile with the current framebuffer contents, render the new polygons on top of that, and write the result back to the framebuffer.
5. If you still have any polygons left, repeat from step 3.
The 'scene manager' thus solves the unbounded-buffer-size problem of TBDRs, at a performance penalty.
Well, there's another possible issue as well. PowerVR claims that one of the benefits of their architecture is that it allows automatic depth-sorting for transparent objects. If any software assumes this automatic depth-sorting, doing the above will break that depth sorting.

Oh, and don't forget that the above algorithm does require an external z-buffer. That's the main source of the performance hit of doing the above on the Kyro.

Actually, though, I think the above solution is great. If some hardware actually did the above, but instead had a small scene buffer that wasn't meant to store the entire scene (say, small enough to fit in on-chip cache), the efficiency of immediate-mode rendering could be increased significantly, particularly in how well the available memory bandwidth is used.
 
Tahir said:
Funny that the NV40 isn't even out yet, NVIDIA hasn't made the specs public, and anyone in the know isn't allowed to talk about it... and yet people are still setting themselves up for a disappointment, like they do every time NVIDIA releases a major new core.

Are we really all resigning ourselves to the R420 getting beaten by the NV40, even though we know nothing but unsubstantiated rumours about the two new cores from ATI and NVIDIA?

I just hope NVIDIA does not disappoint its many fans once again... and all indications seem to point towards more smoke from the masters of hype. In fact, I think NVIDIA's most ardent fans do it no favours by proclaiming it the best thing since sliced bread.

Here's wishing people would just 'shhhshhh' for a while and have a little patience.

Most of the NV40 rumours are about performance numbers, but it seems the features are exposed in the OpenGL SDK. As for ATI, the rumours seem to indicate it will not have SM3.0. Several hints point that way: one is in the recent GDC PowerPoint presentations, another is in the DX9.0c SDK, where it's mentioned that PS2.0 can work with VS3.0 (which kind of hints that some IHV will not have both PS3.0 and VS3.0) :cry:
So far, feature-wise, we know one part is more 'next-gen' than the other, just like the R3x0 is certainly more next-gen in the sense that it has MRT and full PS2.0 features and speed. Still, there is more to these cores that we don't know about. Can't wait until April 13th.
 
arjan de lumens said:
There are some solutions to this problem
My point exactly! The solution isn't one-step, but it's certainly achievable and will happen (well, as long as someone is making TBDRs, it will).
 
Sage said:
Chalnoth said:
Secondly, I hope you meant expand scene buffer space, as it's impossible to expand on-chip buffer space (without a new chip).
Uhh, isn't that what we're talking about? New chips?
Well, I was originally talking about scene buffer overflow, not about an inability to store all the data for a single tile (I don't think this is even an issue, actually: I really don't think that the data for a single tile must all be on-chip at once).

Good point. Then it certainly needs some work done to find an efficient method of doing this. Of course, you wouldn't be so naive as to think that IMRs haven't had huge speed bumps of their own to overcome, would you?
The point I'm trying to show is that the problems with IMRs are more manageable than the problems with TBDRs. Essentially, I'm saying that a TBDR's worst case differs from its best case by much more than an IMR's does. I don't like that (for example, imagine a game where you can play 99% of the time at high framerates at 1600x1200x32 with 16x FSAA. Then, once you have 11 characters on the screen instead of 10, and all happen to be firing at once, the framerate tanks. Whoops.).

edit:
Oh yes, and let's not forget that IMRs are going to hit a geometry limit as well, imposed by memory space and bandwidth. Higher-order surfaces are a must for the future, and they will effectively remove the limitation of both IMRs and TBDRs if done properly. Again, I direct your attention to the PPP and hierarchical sub-d's.
You're assuming that you can easily transform the patches into screenspace and still use the same tessellation algorithms. I'm not sure this is possible, given the divide by w that occurs in the perspective transform. And you'd also have to re-do the tessellation for each tile that includes the patch. I'd say that's definitely not an optimal situation, and given that longer shaders should hide the memory bandwidth limitations of future video cards anyway, I don't think this would be beneficial to performance.
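A quick numeric illustration of the divide-by-w point (toy projection, made-up numbers): projection is not affine, so the screen-space midpoint of an edge is not the projection of the edge's midpoint, which is why a tessellator can't simply subdivide in screenspace with the same arithmetic it would use in eye space.

```cpp
#include <cstdio>

int main() {
    // Two eye-space points on an edge; toy projection x' = x / z.
    double ax = 1.0, az = 1.0;   // projects to x' = 1.0
    double bx = 1.0, bz = 3.0;   // projects to x' = 0.3333

    double projOfMid = ((ax + bx) / 2) / ((az + bz) / 2); // project the midpoint
    double midOfProj = (ax / az + bx / bz) / 2;           // average the projections

    std::printf("project(mid) = %.4f, mid(projected) = %.4f\n",
                projOfMid, midOfProj);                    // 0.5000 vs 0.6667
}
```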
 
gunblade said:
just like the R3x0 is certainly more next-gen in the sense that it has MRT and full PS2.0 features and speed.
The NV3x has similar technologies (and in many ways more advanced technologies). Just because they're not all exposed in DirectX doesn't make it less "next-gen" than the R3xx in my mind. The importance of FP performance in calling the R3xx "more next-gen" is debatable.

In other words I'd say that your statement is certainly not certain.
 
Tagrineth said:
Shrink the tiles, and expand your on-chip buffer space.

Boom, problem solved.

I had a tangentially related question:
IMRs render in 'quads' (2x2 pixel blocks) for greater efficiency. Could there be more to be gained by rendering in 2x3, or 3x3, or 2xN, or NxN blocks? I am not a 3D expert, but I would guess there may be some speed gain at the likely expense of circuit complexity. I realize these 'super-quads' would not be much like TBDR, or would they?
Or perhaps doing more than one quad at a time is better than doing fewer, bigger blocks.

Thanks,
Eric
 
Agreed. But how many samples are we talking here? 4x FSAA was supposed to be "free" on NV3x, and from some leaked info, it looks like it will be on NV40.

Depends how you define "free" anyway.

MSAA is essentially fill-rate and bandwidth "free" on TBDRs; yet there most certainly has to be a threshold in terms of the number of samples too. It's all pretty relative.

On higher memory bandwidth: not necessarily. Frame/z-buffer compression with MSAA, combined with the host of other memory bandwidth saving technologies we've seen to date in IMRs, levels the playing field significantly. The main pull for TBDRs isn't necessarily using less bandwidth, but rather that it's easier with a TBDR to make more efficient use of the available bandwidth.

There's an upper threshold in the number of samples for IMRs up to which the combination of those methods shows said benefits. Not only does efficiency start to degrade beyond a certain number of samples (always with pure MSAA), but the buffer consumption becomes huge: for 6x MSAA it exceeds the 100MB range, last time I checked.
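A back-of-envelope check of that figure, assuming 1600x1200 with 32-bit colour and 32-bit depth/stencil per sample (my assumptions, not vendor numbers):

```cpp
#include <cstdio>

int main() {
    const long long w = 1600, h = 1200, samples = 6;
    const long long bytesPerSample = 4 /*colour*/ + 4 /*z + stencil*/;
    long long multisampled = w * h * samples * bytesPerSample;  // ~88 MiB
    long long resolved     = w * h * 4 * 2;    // resolved front + back buffers
    std::printf("%.1f MiB\n", (multisampled + resolved) / (1024.0 * 1024.0));
    // ~102.5 MiB -- in the ballpark of the "exceeds 100MB" figure above.
}
```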

Again, the same question concerning bandwidth: what about high-precision framebuffers?

On more pipelines: this assumes that a TBDR with fewer pipelines will perform equally. That is based upon a comparison between old TBDRs and the immediate-mode renderers of the day. Those IMRs had none of the memory bandwidth saving techniques that today's parts do, and so the comparison is invalid.

I wouldn't dare claim just yet that TBDRs have a clear advantage with floating-point pipelines, not until I see it.

Besides, a SIMD channel != a SIMD channel even between IMRs; in order to reach such a conclusion for a TBDR vs. IMR SIMD channel, they would have to be identical or pretty damn close.

The only question about the validity of the high-geometry-counts point is how high the geometry density has to go before it becomes an issue.

I'm going to be bold here and guesstimate that if you reach those heights, an IMR will have it just as hard; always based on today's standards and not future evolutions.

On lower transistor count: Certainly not for the same number of pipelines/same amount of features.

No idea about transistor counts; so far I haven't seen any significant differences in existing solutions.

I have an objection to the pipeline count, though (similar to the above): I can easily think of very long or deep SIMDs, or very short ones, which makes that particular point a rather weird generalisation.
 
The problems with IMRs only appear more manageable because we are talking about a market that has invested many, MANY times more resources into developing IMRs than TBDRs, so underdeveloped TBDRs are trying to compete against much more developed IMRs in an arena built around IMRs. Give TBDRs the same treatment that IMRs have received and you will see they are not nearly as bad as they seem.
 
Chalnoth said:
gunblade said:
just like the R3x0 is certainly more next-gen in the sense that it has MRT and full PS2.0 features and speed.
The NV3x has similar technologies (and in many ways more advanced technologies). Just because they're not all exposed in DirectX doesn't make it less "next-gen" than the R3xx in my mind. The importance of FP performance in calling the R3xx "more next-gen" is debatable.

In other words I'd say that your statement is certainly not certain.

More advanced technologies are rendered virtually useless if your overall arithmetic efficiency is lower than the competition's.

Of course anyone can come up with excuses or alternative POVs on that one, yet it must be a coincidence that internally NV doesn't seem to be particularly proud of the entire NV3x line.
 
Sage said:
The problems with IMRs only appear more manageable because we are talking about a market that has invested many, MANY times more resources into developing IMRs than TBDRs, so underdeveloped TBDRs are trying to compete against much more developed IMRs in an arena built around IMRs. Give TBDRs the same treatment that IMRs have received and you will see they are not nearly as bad as they seem.

If you're referring to API "incompatibilities", IMHO those were mostly solved with Series3 from PowerVR. It's not the API's fault if the hardware is lacklustre, and in that case there wasn't any support for cube maps or T&L.
 
T2k said:
This is the saddest, crappiest piece of yak-yak-yak in recent memory. This Sander Sassen guy doesn't give ANY information; this whole piece of something (WTF is this? Not news, not an article, it lacks ANY info...) is 100% BSing, a very sad attempt to catch your attention for 25 seconds... :rolleyes:

He quotes Mark Rein as a developer and talks about his "programming" for Epic. :rolleyes:
 
Tahir, people are not really that excited.

It's more like: look, the NV30 sucked, so they're hoping the NV40 will be competitive; not dominate, just put up a good fight across the board instead of only in some odd situations.
 
ERK said:
IMRs render in 'quads' (2x2 pixel blocks) for greater efficiency. Could there be more to be gained by rendering in 2x3, or 3x3, or 2xN, or NxN blocks?
Well, current architectures take a performance hit when different pixels in a quad do different things. So if you increase the size of your quads, you will increase the likelihood of lower efficiency. As a simple example, if only half of the pixels in your quad are on the triangle being rendered, only two of the four pipelines dedicated to rendering that quad will be used.

So, if you increase the size of your quads, memory efficiency may be increased, but processing efficiency will be decreased.
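A toy experiment makes the trade-off visible: rasterise one small triangle in NxN blocks and measure what fraction of the pixels launched per touched block is actually covered. The triangle and numbers are made up; only the trend matters.

```cpp
#include <cstdio>

bool inside(double x, double y) {              // one small right triangle
    return x >= 10 && y >= 10 && x + y <= 40;
}

double utilisation(int n) {                    // n x n pixel blocks
    long covered = 0, launched = 0;
    for (int by = 0; by < 64; by += n)
        for (int bx = 0; bx < 64; bx += n) {
            int hits = 0;
            for (int y = by; y < by + n; ++y)
                for (int x = bx; x < bx + n; ++x)
                    if (inside(x + 0.5, y + 0.5)) ++hits;
            // A touched block launches all n*n pipelines even when only a few
            // of its pixels are actually on the triangle.
            if (hits) { covered += hits; launched += n * n; }
        }
    return launched ? double(covered) / launched : 0.0;
}

int main() {
    int sizes[] = {2, 3, 4, 8};
    for (int n : sizes)                        // bigger blocks -> more idle lanes
        std::printf("%dx%d blocks: %.0f%% of launched pixels useful\n",
                    n, n, 100.0 * utilisation(n));
}
```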
 
Chalnoth said:
elroy said:
TBDR:
Advantages

Less work to do
Lower transistor count
Lower power consumption
FSAA for "free"
On less work: Not necessarily true. If the IMR does an initial z-pass, it ends up doing the same rendering work as the TBDR, while the TBDR must also bin and sort the triangles for storage in the scene buffer (which takes a relatively small amount of time, but it's still more work).

On lower transistor count: Certainly not for the same number of pipelines/same amount of features.

On lower power consumption: Again, not necessarily true. This would hinge on doing less work and having better memory bandwidth usage. Neither is necessarily the case for a TBDR, as the z-buffer is replaced by a scene buffer, which, with high geometry densities, ends up being less efficient.

On free FSAA: Almost. MSAA will be nearly free, but then it's not expensive in modern hardware anyway. The main problem is that the worst-case scenario for a TBDR can be exacerbated if FSAA is enabled.

Disadvantages

Problems with high geometry counts (is this an invalid point?)
Not proven at a high end level
The only question about the validity of the high-geometry-counts point is how high the geometry density has to go before it becomes an issue.

IMR

Disadvantages (the ones I have listed are all compared to a TBDR)

Requires higher memory bandwidth
Requires more pipelines, and therefore transistors
Produces more heat
On higher memory bandwidth: not necessarily. Frame/z-buffer compression with MSAA, combined with the host of other memory bandwidth saving technologies we've seen to date in IMRs, levels the playing field significantly. The main pull for TBDRs isn't necessarily using less bandwidth, but rather that it's easier with a TBDR to make more efficient use of the available bandwidth.
On more pipelines: this assumes that a TBDR with fewer pipelines will perform equally. That is based upon a comparison between old TBDRs and the immediate-mode renderers of the day. Those IMRs had none of the memory bandwidth saving techniques that today's parts do, and so the comparison is invalid.

I think your last point sort of concludes everything, Chalnoth. All we have are the TBDRs of yesteryear, so I can only compare them to what we had back then. I think I'll leave this argument for the moment, at least until PVR Series 5 comes out (if it comes out! *runs from The Baron* :)). That way we can get a better comparison of how they stack up today. At the moment, the overall consensus of others on the board (who know this stuff better than me!) seems to be that the advantages of TBDR have been largely nullified by the HSR techniques of the current-gen stuff. Which is what I was trying to figure out in the first place :).
 
elroy,

That way we can get a better comparison of how they stack up today. At the moment, the overall consensus of others on the board (who know this stuff better than me!) seems to be that the advantages of TBDR have been largely nullified by the HSR techniques of the current-gen stuff. Which is what I was trying to figure out in the first place.

I don't think they have been nullified.

KYRO/Series3 had one major problem IMHO: while it had an understandable advantage in terms of bandwidth over competing solutions, as soon as vertex bandwidth requirements started to rise that very same advantage got somewhat eliminated. Put more simply: you have an advantage of +5 in one case, yet a disadvantage of -5 in another; where does that leave you? Square one.

KYRO was only a value solution and was laid out as such (hence the lack of HW T&L), and both K1 and K2 also appeared too late on shelves.

Yes, IMRs have raised their efficiency with a clever combination of bandwidth-saving techniques and will continue to do so, yet that doesn't mean PowerVR has been sitting idle in terms of development, that they've left the problematic cases unsolved, or that they have no advantages at all. They just have to get such a design out the door, preferably on time, to prove to all the naysayers that a TBDR can very well be an alternative.

In order to keep things on a more reasonable level, in answer to that one:

Those IMRs had none of the memory bandwidth saving techniques that today's parts do, and so the comparison is invalid.

Those TBDRs of the past had none of the current techniques that today's designs do, and so the comparison is invalid.
 
Chalnoth said:
ERK said:
IMRs render in 'quads' (2x2 pixel blocks) for greater efficiency. Could there be more to be gained by rendering in 2x3, or 3x3, or 2xN, or NxN blocks?
Well, current architectures take a performance hit when different pixels in a quad do different things. So if you increase the size of your quads, you will increase the likelihood of lower efficiency. As a simple example, if only half of the pixels in your quad are on the triangle being rendered, only two of the four pipelines dedicated to rendering that quad will be used.

So, if you increase the size of your quads, memory efficiency may be increased, but processing efficiency will be decreased.
Isn't a quad by definition four? Maybe you mean tile?
 