Getting the most out of current GPU design.

BoardBonobo · Jan 14, 2003

I remember that somebody made a post saying that there is a lot more low-hanging fruit on the 3D optimisation tree to be picked before a big change has to occur in GPU/VPU design. This seems especially pertinent now that complexity and speed issues are effectively starting to cap PCB design.

What are these further optimisations, and how would they be implemented to the best effect?

Fuz · Jan 14, 2003

Good question. I believe it was SA who posted the info before.
I too would like to know what the "low hanging fruit" that have not been picked yet are.

mboeller · Jan 14, 2003

IMO;

he talked about the 2-pass deferred rendering now used in the DeltaChrome. So maybe he works for S3.

nAo · Jan 14, 2003

mboeller said:
he talked about the 2-pass deferred rendering now used in the DeltaChrome. So maybe he works for S3.

3DLabs..I believe.

Arun · Jan 14, 2003

AFAIK:
2 Pass rendering can increase FPS, so it can be nice.
But it also increases latency. So it's futile and annoying for Q3 or UT2K3, since latency is the most important factor there.
As for RPGs, where mostly smoothness is important, this could be really useful.

To know what "low hanging fruit" are left, it's a good idea to try to see what are the main problems in the GPU pipeline.

1. Memory Bandwidth. Where is most bandwidth used today, if we activate 4x AA?
First, Z Reads & Writes. Z Compression and Hierarchical Z are already used, and little more can be done here.
Second, Static Geometry. Vertices which aren't transferred over AGP every frame are still read from memory every frame. Very little is done about that. A solution is to store the vertices compressed and decompress them in the VS. This is currently possible, but few programmers use it. In the future, maybe it'll become more common if memory bandwidth becomes even more of an issue.
Third, Color Writes. This probably takes little memory bandwidth thanks to Color Compression.
2. Bottlenecks. Current architectures are either transform-bound or fillrate-bound. As I said in another thread, an idea might be to use shared calculation units, so that such bottlenecks don't really exist anymore. I'm not saying everything should be shared, but a good part.

Uttar
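
The vertex compression idea in point 1 can be sketched roughly as follows. This is only an illustration in Python, not code from the thread: positions are quantised to 16 bits per component when the mesh is built, and the vertex shader would undo that with one multiply-add per component. The function names and the 16-bit choice are assumptions made for the example.

```python
import struct

def quantize_positions(positions):
    """Pack xyz positions into three 16-bit unsigned ints per vertex (6 bytes)."""
    mins = [min(p[i] for p in positions) for i in range(3)]
    maxs = [max(p[i] for p in positions) for i in range(3)]
    # Size of one quantisation step per axis; fall back to 1.0 on a flat axis.
    scales = [(maxs[i] - mins[i]) / 65535.0 or 1.0 for i in range(3)]
    packed = b"".join(
        struct.pack("<3H", *(round((p[i] - mins[i]) / scales[i]) for i in range(3)))
        for p in positions
    )
    return packed, mins, scales

def dequantize_position(packed, index, mins, scales):
    """Stand-in for the vertex shader: one multiply-add per component."""
    q = struct.unpack_from("<3H", packed, index * 6)
    return tuple(q[i] * scales[i] + mins[i] for i in range(3))
```

Compared with three 32-bit floats (12 bytes), each position takes 6 bytes, so the static vertex stream read from memory every frame is halved for that attribute, at the cost of a bounded rounding error of half a step per axis.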

Basic · Jan 14, 2003

2 pass rendering for IMRs (z-pass first) can increase framerate. And if it does so, it will also decrease latency.
There wouldn't be any interleaving of the passes. So if you get higher framerate, that means that the sum of rendering times for the two passes is decreased, which in turn means lower latency.
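
The arithmetic behind this argument can be written out as a toy model. The timings are made-up inputs and the function names are hypothetical; the only thing taken from the post is that the two passes of a frame run back to back with no interleaving.

```python
def single_pass(t_frame_ms):
    """Ordinary rendering: the frame is done after one pass."""
    return {"latency_ms": t_frame_ms, "fps": 1000.0 / t_frame_ms}

def two_pass_serial(t_z_ms, t_color_ms):
    """Z pass, then colour pass, for the same frame; frames do not overlap."""
    frame_ms = t_z_ms + t_color_ms
    # Frame time and latency are the same quantity here, so if one
    # improves the other must improve with it.
    return {"latency_ms": frame_ms, "fps": 1000.0 / frame_ms}
```

For instance, if a 20 ms single-pass frame became a 4 ms Z pass plus a 12 ms colour pass, the model gives 62.5 fps instead of 50 fps and 16 ms of latency instead of 20 ms.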

KimB · Jan 14, 2003

Basic said:
2 pass rendering for IMRs (z-pass first) can increase framerate. And if it does so, it will also decrease latency.
There wouldn't be any interleaving of the passes. So if you get higher framerate, that means that the sum of rendering times for the two passes is decreased, which in turn means lower latency.

Correct, but if done in the driver, then it will require a significant amount of caching by the driver, increasing overall CPU and system memory bandwidth requirements. Depending on the game, this may or may not be a good thing. This will also double the power needed in the geometry stage of the pipeline, which could be problematic for performance in certain situations.

Personally, I'd just like to see games do this themselves. Doing it in the driver is challenging at the very least, and a potential performance problem at the worst (exchanging one bottleneck for another).

Arun · Jan 15, 2003

Basic said:
2 pass rendering for IMRs (z-pass first) can increase framerate. And if it does so, it will also decrease latency.
There wouldn't be any interleaving of the passes. So if you get higher framerate, that means that the sum of rendering times for the two passes is decreased, which in turn means lower latency.

Actually, I was assuming IMR 2 pass meant that you first do Z and cache everything. Then, you simultaneously prepare the next frame Z and render the scene with the cache.
That would seem to exploit more parallelism, IMO...

And yes, it could decrease latency. But then fillrate would have to be the bottleneck. If you're memory-bound, it's completely useless. And since you've got to read static vertices twice and so on, it might indeed make you memory-bound slightly more easily. But there's also only one Color Write. All of that is not too important, however. But if the game is geometry limited, for example (not like that exists...), you'd better not hope for lower latency!

Uttar
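
The overlapped reading described in this post (Z for the next frame prepared while the current frame is rendered from the cache) can be simulated with a small scheduler. This is a rough sketch under loud assumptions: the two stages run fully in parallel without slowing each other down, only one frame of Z data can be cached, and the timings are invented.

```python
def pipelined_schedule(t_z_ms, t_color_ms, frames):
    """Return (z_start, color_end) per frame for an overlapped two-pass renderer."""
    z_free = 0.0      # when the Z stage may start its next frame
    color_free = 0.0  # when the colour stage finishes its current frame
    schedule = []
    for _ in range(frames):
        z_start = z_free
        z_end = z_start + t_z_ms
        color_start = max(z_end, color_free)  # wait for Z data and a free colour stage
        color_end = color_start + t_color_ms
        # Assumption: the single-frame cache is released when the colour
        # pass starts reading it, so the next Z pass can begin then.
        z_free = color_start
        color_free = color_end
        schedule.append((z_start, color_end))
    return schedule
```

With a 4 ms Z pass and a 12 ms colour pass, the simulation settles at one finished frame every 12 ms but 24 ms from the start of a frame's Z pass to its completion, against 16 ms for both if the passes simply ran back to back. Under these assumptions the overlap raises framerate while also raising latency, which is the trade-off this reading implies.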

DemoCoder · Jan 15, 2003

Deferred shading can be done to increase speed on DX9 GPUs. First pass: render Z, and write pixel shader parameters to an FP frame buffer/MRT. Second pass, effectively a 2D video post-processing pass: render one full-screen quad and set up your huge 128+ instruction pixel shader. Voila: no overdraw, no wasted expensive pixel shading, no wasted recomputation of T&L on second pass.

If you had true dynamic branching, you could even pack multiple shaders into one pixel shader and branch based on an object ID value written in the frame buffer.
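
The two passes can be mimicked in a few lines of Python to show where the saving comes from. This is a CPU-side toy, not GPU code: the fragment tuples, the `shaders` table keyed by object ID, and the function names are all hypothetical, and the per-object lookup stands in for the dynamic branch on an object ID.

```python
def forward_render(fragments, shaders, width, height):
    """Shade every fragment that passes the Z test at the time it arrives."""
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    invocations = 0
    for x, y, z, obj_id, params in fragments:
        if z < depth[y][x]:
            depth[y][x] = z
            color[y][x] = shaders[obj_id](params)  # may be overwritten later
            invocations += 1
    return color, invocations

def deferred_render(fragments, shaders, width, height):
    """Pass 1 stores shader inputs per pixel; pass 2 shades each pixel once."""
    depth = [[float("inf")] * width for _ in range(height)]
    gbuffer = [[None] * width for _ in range(height)]
    for x, y, z, obj_id, params in fragments:
        if z < depth[y][x]:
            depth[y][x] = z
            gbuffer[y][x] = (obj_id, params)  # cheap write, no shading yet
    color = [[None] * width for _ in range(height)]
    invocations = 0
    for y in range(height):          # the "full-screen quad"
        for x in range(width):
            if gbuffer[y][x] is not None:
                obj_id, params = gbuffer[y][x]
                color[y][x] = shaders[obj_id](params)  # branch on object ID
                invocations += 1
    return color, invocations
```

When fragments arrive back to front, the forward version runs the expensive shader once per layer of overdraw, while the deferred version runs it once per covered pixel and produces the same image.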

KimB · Jan 15, 2003

DemoCoder said:
If you had true dynamic branching, you could even pack multiple shaders into one pixel shader and branch based on an object ID value written in the frame buffer.

Speaking of which, we may well see even larger supported packed framebuffer types (256 bits per pixel and up) in order to store varied information for multipass rendering.

I remember seeing one technique already that uses a packed 128-bit framebuffer to do all of the lighting in a DOOM3-style technique by just rendering the one screenspace quad.

Out of curiosity, I wonder if there will ever be an incentive to move to 64-bit floating-point precision in the pipelines? If we move to full 32-bit z-buffers soon (which I really want to see), and z-buffer errors are not yet eliminated, we may need to for optimal precision.
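
To make the z-buffer precision concern concrete, here is a small sketch that quantises depth the way a fixed-point buffer would. It assumes a Direct3D-style perspective mapping of eye-space distance to [0, 1] and an integer buffer; the near and far planes are example values, and `quantized_depth` is a hypothetical helper, not any real API.

```python
def quantized_depth(z_eye, near, far, bits):
    """Map an eye-space distance to an integer depth value of the given width."""
    d = (far / (far - near)) * (1.0 - near / z_eye)  # 0 at near, 1 at far
    return round(d * (2 ** bits - 1))

near, far = 0.1, 10000.0
# Two surfaces one unit apart, halfway to the far plane.
same_24 = quantized_depth(5000.0, near, far, 24) == quantized_depth(5001.0, near, far, 24)
same_32 = quantized_depth(5000.0, near, far, 32) == quantized_depth(5001.0, near, far, 32)
```

With these example planes the two surfaces land on the same 24-bit value, which is where z-fighting comes from, while a 32-bit integer buffer still separates them. The values fed into the buffer also need enough precision to make use of the extra bits, which is the link to the question about pipeline precision.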

Dave Baumann · Jan 15, 2003

Uttar said:
But then fillrate would have to be the bottleneck. If you're memory-bound, it's completely useless.

Why would you have to be fillrate limited? That's the entire point of rendering Z first: to optimise your early z rejection routines and save on a lot of fillrate (well, texel/shader).

Nagorak · Jan 15, 2003

Uttar said:
AFAIK:
2 Pass rendering can increase FPS, so it can be nice.
But it also increases latency. So it's futile and annoying for Q3 or UT2K3, since latency is the most important factor there.
As for RPGs, where mostly smoothness is important, this could be really useful.

What RPGs need is competent programmers who actually have a clue what they are doing. As it stands now, ATi and Nvidia could specifically tailor their drivers for RPG performance and the games would still run like total crap because of the horrible coding behind those games.

Arun · Jan 15, 2003

DaveBaumann said:
Uttar said:

But then fillrate would have to be the bottleneck. If you're memory-bound, it's completely useless.

Why would you have to be fillrate limited? That's the entire point of rendering Z first: to optimise your early z rejection routines and save on a lot of fillrate (well, texel/shader).

That's exactly what I meant.
Here's the full quote:

And yes, it could decrease latency. But then fillrate would have to be the bottleneck.

My point is that for it to decrease latency, fillrate has got to be the bottleneck and not memory.

Uttar

pcchen · Jan 16, 2003

Why? You can still be memory bound and 2-pass can still reduce latency. For example, if most of your memory bandwidth goes to texture fetches, 2-pass can eliminate most of them and therefore reduce latency.

However, if most of your bandwidth goes to frame buffer access, 2-pass won't buy you much, perhaps even slow you down.
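
A back-of-the-envelope bandwidth model shows both cases. Everything here is an assumption for illustration: worst-case back-to-front submission, no Z compression or hierarchical Z, no texture cache, and invented byte counts per fragment.

```python
def frame_bytes(pixels, overdraw, tex_bytes, z_bytes, color_bytes, two_pass):
    """Rough bytes moved per frame under the assumptions stated above."""
    fragments = pixels * overdraw
    if not two_pass:
        # Every fragment passes the Z test: Z read + Z write, texture
        # fetches and a colour write for each layer of overdraw.
        return fragments * (2 * z_bytes + tex_bytes + color_bytes)
    # Pass 1: Z read + Z write for every fragment, nothing else.
    z_pass = fragments * 2 * z_bytes
    # Pass 2: every fragment reads Z again, but only the visible one per
    # pixel fetches textures and writes colour.
    color_pass = fragments * z_bytes + pixels * (tex_bytes + color_bytes)
    return z_pass + color_pass

pixels = 1024 * 768
texture_heavy = [frame_bytes(pixels, 3, 32, 4, 4, tp) for tp in (False, True)]
framebuffer_heavy = [frame_bytes(pixels, 3, 1, 4, 4, tp) for tp in (False, True)]
```

With 32 bytes of texture data per fragment, the two-pass total is 72 bytes per pixel against 132 for one pass. With only 1 byte of texture data it is 41 against 39, so the extra Z traffic makes two-pass slightly worse.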
