New "Are You Ready" video

Discussion in 'Architecture and Products' started by Ascended Saiyan, Oct 29, 2002.

  1. Mephisto

    Newcomer

    Joined:
    Feb 7, 2002
    Messages:
    200
    Likes Received:
    0
    No, there isn't. The R300 does 16 textures per pass, it does depth and color buffer compression, and it has an efficient hierarchical Z-buffer with a per-pixel depth test as the final pre-Z test where necessary. In addition to this, there is a crossbar memory controller. Where is there room left?
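
    As a rough illustration of the two-level Z testing described above, here is a toy Python sketch: a coarse per-tile maximum depth rejects whole blocks of fragments early, and survivors fall through to the exact per-pixel test. The class and method names and the tile size are illustrative only, not ATI's actual design.

```python
TILE = 8  # coarse-Z granularity (illustrative; real hardware tile sizes vary)

class HierZ:
    """Toy two-level Z-buffer: per-tile max depth plus per-pixel depths."""
    def __init__(self, w, h):
        self.w, self.h = w, h
        self.zbuf = [[1.0] * w for _ in range(h)]            # per-pixel depth
        tw, th = (w + TILE - 1) // TILE, (h + TILE - 1) // TILE
        self.tile_max = [[1.0] * tw for _ in range(th)]      # coarse max depth per tile

    def test_block(self, tx, ty, zmin):
        """Early reject: if the nearest fragment of the block is still behind
        everything already in the tile, the whole block can be skipped."""
        return zmin < self.tile_max[ty][tx]

    def write_pixel(self, x, y, z):
        if z < self.zbuf[y][x]:                              # final per-pixel test
            self.zbuf[y][x] = z
            tx, ty = x // TILE, y // TILE
            # conservative update: recompute this tile's max depth
            ys = range(ty * TILE, min((ty + 1) * TILE, self.h))
            xs = range(tx * TILE, min((tx + 1) * TILE, self.w))
            self.tile_max[ty][tx] = max(self.zbuf[yy][xx] for yy in ys for xx in xs)
            return True
        return False
```

    After a tile has been filled at depth 0.5, a later block whose nearest fragment is at 0.7 is rejected without touching any per-pixel depths, which is exactly the bandwidth win being discussed.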

    The only thing left to improve efficiency requires either developer support (sappy idea IMO) or is based on tile-based approaches (either true deferred rendering, or just a tile-based IMR without the full overdraw removal through geometry raycasting, but with the benefits of on-chip blending).
     
  2. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Yeah..sure..fine..whatever...
     
  3. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    That's not a particularly constructive statement. While I agree that we've hardly seen the end of the road for improving IMR efficiency / overdraw reduction, something a little more useful would be good... :)
     
  4. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    Ah, that reminds me of something ...

    "Everything that can be invented has been invented."
    - Charles H. Duell, U.S. Commissioner of Patents, in 1899.

    ;)
     
  5. Gollum

    Veteran

    Joined:
    May 14, 2002
    Messages:
    1,217
    Likes Received:
    8
    Location:
    germany
    Great display of open-mindedness, Mephisto! There are always a million things in a chip's design that can be improved upon. One of them is inventing new technologies or ways of doing things differently; others require tweaking and changing existing parts for greater efficiency. You just have to look at the history of x86 processors to see how much can be done with an architecture given enough time.

    SA, one of this board's most respected contributors, only recently made a post about just this topic:
     
  6. Mephisto

    Newcomer

    Joined:
    Feb 7, 2002
    Messages:
    200
    Likes Received:
    0
    I know, but a lot of his suggestions require either developer support (don't you think developers already have enough to care about? Or do you think spending CPU cycles on boring things like sorting triangles is a good idea for today's CPU-limited games?), are tile-based approaches (like I mentioned), or are slight improvements over current implementations (hierarchical Z).

    My question was meant seriously. The big steps are over; all the cool features we discussed over the last one or two years are implemented in some way in today's hardware, except maybe the fancy Z3. My response was targeted at Chalnoth's claim that "a lot more" can be done. Might someone tell me what? I'm not talking about small percentages, but the 20%++ the NV30 needs to match the R300's performance in bandwidth-limited situations.
     
  7. T2k

    T2k
    Veteran

    Joined:
    Jun 12, 2002
    Messages:
    2,004
    Likes Received:
    0
    Location:
    The Slope & TriBeCa (NYC)
    ...just as you or Chalnoth don't have ANY idea about NV30 - so all these kinds of assumptions are based on stupid speculation, aren't they?

    :roll:
     
  8. T2k

    T2k
    Veteran

    Joined:
    Jun 12, 2002
    Messages:
    2,004
    Likes Received:
    0
    Location:
    The Slope & TriBeCa (NYC)
    :D
     
  9. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Got me thinking - where in the basic IMR architecture is there any potential for any improvement over R300? (that is, other than adding brute force: bandwidth, pipelines, texture and vertex units etc)
    • Z-buffering/Early Z-test/Z-compression/hierarchical Z: R300 is ATI's third pass at hierarchical Z - I doubt there is much left to gain here other than in conjunction with bounding volumes.
    • Bounding volumes rejection - may be useful combined with Hierarchical Z - requires developer support.
    • Anisotropic mapping - very little left to gain. Given the kind of performance hit R300 takes when doing aniso, it looks like ATI has actually superseded the Feline algorithm (!).
    • Texture compression - Some room for improvement over S3TC - VQTC looks like a better method in general. Requires some developer support.
    • Immediate mode tiling - requires extensive developer support, as long as OpenGL/Direct3D don't get scene graph support. You can do this on an R300 today, using OpenGL's scissor test to define a 'tile', if you feel so inclined :-?
    • Geometry data compression - R300 supports N-patches and displacement mapping, which are, after all, just compact ways to represent complex geometry - other than that, there may be a little room for compressing vertex arrays.
    • Antialiasing - with any given number of samples per pixel, R300's compressed multisampling should be about comparable to Z3 wrt bandwidth usage - Z3 may offer slightly better quality. There are faster AA methods as well, but they tend to require substantial developer effort in order not to break down all the time.
    • Stencil buffer compression - here, there seems to be room for substantial improvements (I guess; Nvidia and ATI have been silent on this issue so far)
    • Framebuffer compression (other than collapsing same-color samples for multisampling) - potential for moderate improvements for skies, featureless walls and other surfaces with sufficiently gradual color changes. Possibly difficult to do efficiently enough to be useful.
    • In vertex and pixel shaders, conditional jumps may be used to skip useless calculations, in particular lighting calculations for vertices/surfaces facing away from a light source. Can easily be used to speed up static T&L (I suspect NV30 is doing this); otherwise, requires developer support.
    • Any other ideas, anyone?
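
    The scissor-based "immediate mode tiling" mentioned in the list above can be sketched as follows. This is an illustrative Python simulation, not real OpenGL: `gl_scissor` and `draw` are stand-in callables for what would be `glScissor`/`glEnable(GL_SCISSOR_TEST)` and a draw call.

```python
def tiles(width, height, tile_w, tile_h):
    """Yield (x, y, w, h) scissor rectangles covering the screen."""
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            yield (x, y, min(tile_w, width - x), min(tile_h, height - y))

def render_tiled(draw_list, width, height, tile_w, tile_h, gl_scissor, draw):
    """Replay the frame's entire draw list once per scissored tile."""
    for rect in tiles(width, height, tile_w, tile_h):
        gl_scissor(*rect)          # clamp rasterization to this tile
        for cmd in draw_list:      # every draw call is resubmitted per tile
            draw(cmd)
```

    Note that the whole draw list is resubmitted for every tile: without scene-graph knowledge, the application (or driver) cannot bin draws to only the tiles they actually touch, which is exactly why this scheme needs the "extensive developer support" mentioned above.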
     
  10. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Ned Greene's hierarchical-Z occlusion culling. Wavelet compression of texture data. Geometry compression (not amplification) via stuff like topological surgery, etc. Scene graph acceleration, using bounding volumes, etc.
     
  11. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,902
    Likes Received:
    218
    Location:
    Seattle, WA
    Well, one thing:

    I don't feel immediate-mode tiling necessarily needs to have developer support.

    So that you understand what I'm trying to say, what I mean by immediate-mode tiling is simply an architecture that forward-caches geometry in order to do occlusion tests not only on geometry that has already gone through the pixel pipelines, but also on geometry that has yet to go through them (by a reasonable amount...depending on how much geometry is cached).
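
    The forward-caching idea described above might be sketched like this in Python (the names `Prim` and `flush` are hypothetical, purely for illustration): buffer a window of primitives, depth-resolve the whole window first, and only then shade the fragments that actually won, so a primitive can be occluded by geometry submitted *after* it.

```python
from collections import namedtuple

# pixels: set of (x, y) the primitive covers; z: constant depth (toy model)
Prim = namedtuple("Prim", "pixels z")

def flush(cache):
    """Depth-resolve the whole cached window, then return, per primitive,
    the set of pixels it still owns and would actually need to shade."""
    zbuf = {}                                # (x, y) -> (depth, winning prim index)
    for i, p in enumerate(cache):
        for px in p.pixels:
            if px not in zbuf or p.z < zbuf[px][0]:
                zbuf[px] = (p.z, i)
    shaded = [set() for _ in cache]
    for px, (_, i) in zbuf.items():
        shaded[i].add(px)
    return shaded
```

    With a far primitive submitted first and a near one covering the same pixels submitted second, the first primitive ends up shading nothing, even though a plain IMR would have shaded and then overwritten it.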

    On the programming side, from what we've seen, it may take fewer passes and/or CPU power to do some algorithms used in the near future on an NV30. The primary example seen in the NV30's white papers is matrix blending for skeletal animation. With the NV30, you could potentially use a single program for the entire model, whereas with the R300 you'd need to split up your model, doing more CPU work overall. Not a huge improvement in speed, but I'd be surprised if this was an optimal situation for describing the NV30's programming strengths.

    Other than that, there are certainly better ways to compress the frame and z-buffers than what ATI is currently doing (Not based on any special knowledge of ATI's design...more based on the fact that it is an impossibility for the best possible algorithm to have been discovered yet).

    The NV30 will also likely use an 8-way crossbar memory controller, based on the doubling of pipelines over the GeForce4. nVidia's experience with this sort of controller will also likely lead to a more efficient design than ATI's.

    And regardless of which way you slice it, it's never "as good as it's going to get." There's just no such thing.
     
  12. Althornin

    Althornin Senior Lurker
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,326
    Likes Received:
    5
    Pfft.
    *crackle* Pot to kettle! Pot to Kettle! Come in, Kettle! *crackle*
     
  13. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    OK ... Wavelet compression offers an obvious and compact way to store a mipmap pyramid, but to decompress even one texel, you need to read about 5x5 or so texels from every mipmap level above it, making hardware decompression at texture fetch time surprisingly slow and difficult.
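
    The cost argument above can be made concrete with a minimal 1-D Haar example (a toy sketch, far simpler than any real texture codec): wavelet-coded data is stored as one coarse average plus per-level detail coefficients, so reconstructing even a single fine-level sample means reading a coefficient at every level of the pyramid.

```python
def haar_forward(samples):
    """Haar-transform a power-of-two list into (coarse average, details per level)."""
    levels, cur = [], list(samples)
    while len(cur) > 1:
        avg = [(cur[2*i] + cur[2*i+1]) / 2 for i in range(len(cur) // 2)]
        det = [(cur[2*i] - cur[2*i+1]) / 2 for i in range(len(cur) // 2)]
        levels.append(det)
        cur = avg
    return cur[0], levels[::-1]              # details ordered coarse -> fine

def haar_fetch(coarse, levels, index):
    """Reconstruct samples[index]; returns (value, coefficients read)."""
    value, reads = coarse, 1
    for lvl, det_row in enumerate(levels):
        bit = (index >> (len(levels) - 1 - lvl)) & 1
        d = det_row[index >> (len(levels) - lvl)]
        value = value + d if bit == 0 else value - d   # left child: +d, right: -d
        reads += 1
    return value, reads
```

    For 8 samples, fetching any one of them touches 4 coefficients (one per level plus the coarse value); a 2-D mipmap pyramid with filtering footprints makes the per-texel cost correspondingly worse, which is the point being made.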

    Geometry compression is a subject I probably need to read more about.

    I believe I more or less mentioned the other points?
    I'm afraid that I don't understand :cry: - I don't see how this can become 'immediate-mode tiling' - sounds more like a partially-deferred scheme to me, and I don't see the connection to tiling. Care to explain further?

    Which doesn't preclude the current ATI algorithm from being within, say, 0.1% of the "best possible" algorithm (although I do believe there is a bit more room than that left)....

    For frame/Z compression, you can always improve the compression ratio by making each block larger, like 16x16 pixels instead of the 8x8 that ATI is currently using. Doing so can increase bandwidth usage substantially, though, because every time you touch even one pixel in a block, you need to decompress and recompress the entire block. You can also get better compression ratios by using algorithms like Huffman or arithmetic coding on each block, but such algorithms result in very slow reads because the data must be unpacked serially.

    So in all, there is a tradeoff between block size, algorithm complexity & parallelism, and bandwidth usage - if you have better suggestions than the ATI method, bring them on.
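
    The block-size tradeoff described above can be shown with a back-of-the-envelope sketch (toy run-length coding, not ATI's actual scheme): a larger block compresses a flat region no better here, yet a single-pixel write must decompress and recompress the whole block, so the bytes touched per isolated update grow with the square of the block edge.

```python
def rle(values):
    """Toy run-length encoder: list of (value, run length) pairs."""
    runs, prev, count = [], values[0], 1
    for v in values[1:]:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

def block_stats(n, flat_color=0):
    """For an n x n flat block: compressed size in runs, and the bandwidth
    cost in pixels of touching one pixel (read + rewrite the whole block)."""
    block = [flat_color] * (n * n)
    compressed = rle(block)
    touch_cost = 2 * n * n        # decompress everything, recompress everything
    return len(compressed), touch_cost
```

    For a flat region, an 8x8 block and a 16x16 block both compress to a single run, but the 16x16 block costs 512 pixels of traffic per isolated touch versus 128, which is the bandwidth penalty being weighed against the better ratio on non-trivial content.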
     
  14. SA

    SA
    Newcomer

    Joined:
    Feb 9, 2002
    Messages:
    100
    Likes Received:
    2
    It is possible to render to an IMR with no sorting, no triangle binning, no tiling and yet have no overdraw and use very little z buffer bandwidth even for large depth complexities.
     
  15. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    :eek:

    How? What kind of preprocessing is needed on polygon data to do this kind of magic?
     
  16. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    None! Just make them all backfacing and cull them ;)

    P.S. What's the correct answer, SA?
     
  17. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    Can anyone give a clue as to how this differs from what ATI has already?

    I have my doubts over that, actually, whether it's a 256-bit bus or not. If it's 128-bit then I'd say almost definitely not. Plus, we still don't know the configuration of the pipes - is it 8 pixels per clock only in FP16 mode?
     
  18. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
  19. no_way

    Regular

    Joined:
    Jul 2, 2002
    Messages:
    301
    Likes Received:
    0
    Location:
    estonia
    Hmm... batched primitive processing can be improved, I guess. I.e., if you draw a batch of 1000 triangles, generally you don't care in what order they are drawn, so the chip can take care of sorting and culling within the batch. Maybe that's what SA is talking about?

    Otherwise, I just don't understand. Let's say you have just begun to draw a scene, and you draw a single large tri across half the screen. The chip doesn't know whether you'll do an EndScene next or draw something else; it has no info about the next primitives you are going to draw. So how can it decide by itself whether the triangle needs to be drawn or not? It _has_ to draw it, or defer the rendering until more info becomes available (more primitives are sent, or the scene is finished).
    So where's the catch here?
     
  20. Prometheus

    Newcomer

    Joined:
    Jul 9, 2002
    Messages:
    97
    Likes Received:
    2
    Location:
    Greece
    I think nvidia should stop with this stupid "are you ready" game and just release the NV30. :wink: We have been ready for a few months now and are getting impatient with the continued delays. Screensavers and flash games - how lame!!! :evil:
     