R700 Inter-GPU Connection Discussion

Valid point. Being a consumer, I would say that the 2x scaling has to be valid across all games (or at least the games you care about).

With respect to being setup limited -- I can't think of a single application that's setup limited. Which is probably why we haven't seen either side crow about improving setup dramatically.

I'm thinking about being AFR/SFR agnostic.
 
(or at least the games you care about).


This is the problem with having to rely on someone else for your profiles. It ends up being "all the games THEY care about". Ask John Reynolds about ATI scaling in flight sims, for instance.
 
This is the problem with having to rely on someone else for your profiles. It ends up being "all the games THEY care about". Ask John Reynolds about ATI scaling in flight sims, for instance.

Very good point. The weak profiles are one of the major disadvantages of ATI cards.
 
With respect to being setup limited -- I can't think of a single application that's setup limited. Which is probably why we haven't seen either side crow about improving setup dramatically.
Rarely is anything strictly limited by one thing. There are always parts of a frame limited by one thing and parts limited by another. I don't know how you're determining whether any given application is setup limited, anyway.

I guarantee you that significant portions of rendering time are spent on vertex setup. Shadow maps are a perfect example. Assume 35 GPix/s of Z-only fillrate (GT200 can double that). Even a big 2048x2048 shadow map with 3x net overdraw (i.e. that which the GPU cannot eliminate with occlusion culling) will only take a third of a millisecond to render. You can't even set up a quarter million triangles in that time. With cascaded shadow maps, you have to do this multiple times per frame.
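
To put rough numbers on that (my arithmetic; the 700 MHz one-triangle-per-clock setup rate is an assumption, not a quoted spec):

```python
# Back-of-envelope check of the shadow map example above.
zfill_rate = 35e9                      # Z-only fill, pixels/second
setup_clock = 700e6                    # assumed: 1 triangle per clock

shadow_pixels = 2048 * 2048 * 3        # 2048^2 map, 3x net overdraw
fill_ms = shadow_pixels / zfill_rate * 1e3
print(f"fill:  {fill_ms:.2f} ms")      # fill:  0.36 ms

setup_ms = 250_000 / setup_clock * 1e3
print(f"setup: {setup_ms:.2f} ms")     # setup: 0.36 ms for 250k tris
```

So setting up a quarter million triangles takes about as long as filling the entire map.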

Also, consider lengthy sections of triangle lists that generate no pixels at all. Triangles outside the viewing frustum and backfacing triangles take time to weed out, and unless you are fortunate enough to have tens of thousands of pixels already queued up, most of your chip will be idling.

All these things make framerate increase slower than math/texturing/ROP ability, and thus cost you scaling if you don't improve setup speed.

Neither IHV is making noise about improving setup because neither has done it yet. This is the last part of the graphics pipeline that remains serial.
 
Rarely is anything strictly limited by one thing. There are always parts of a frame limited by one thing and parts limited by another. I don't know how you're determining whether any given application is setup limited, anyway.

I guarantee you that significant portions of rendering time are spent on vertex setup. Shadow maps are a perfect example. Assume 35 GPix/s of Z-only fillrate (GT200 can double that). Even a big 2048x2048 shadow map with 3x net overdraw (i.e. that which the GPU cannot eliminate with occlusion culling) will only take a third of a millisecond to render. You can't even set up a quarter million triangles in that time. With cascaded shadow maps, you have to do this multiple times per frame.

Also, consider lengthy sections of triangle lists that generate no pixels at all. Triangles outside the viewing frustum and backfacing triangles take time to weed out, and unless you are fortunate enough to have tens of thousands of pixels already queued up, most of your chip will be idling.

All these things make framerate increase slower than math/texturing/ROP ability, and thus cost you scaling if you don't improve setup speed.

Neither IHV is making noise about improving setup because neither has done it yet. This is the last part of the graphics pipeline that remains serial.

That's an interesting point. However, I would say that shadowmaps are relatively degenerate in terms of workload -- the vertices are simple.

Also, most modern hardware is very good at tossing out invalid vertices.

If you look at the Xenos hardware:

http://en.wikipedia.org/wiki/Xenos

It can set up 500M verts/second. For a 1920x1080 monitor that's 1 vert per pixel with an overdraw of 4.
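
The arithmetic behind that, as a quick sanity check of my own:

```python
# Checking the Xenos figure above: 500M verts/s at 60 fps
# against a 1920x1080 target.
verts_per_frame = 500e6 / 60        # ~8.3M vertices per frame
pixels = 1920 * 1080                # ~2.07M pixels
print(verts_per_frame / pixels)     # ~4.0 -> 1 vert/pixel at 4x overdraw
```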

Anyway, I can see what you're saying. My guess is still that a decent bump (~10Gb/sec) of inter-chip communication will remove the bottleneck for 99% of the applications.

/crouches and waits for benchmarks
 
A lot of hw out there can remove a subset of non-visible triangles before the setup stage.
 
A lot of hw out there can remove a subset of non-visible triangles before the setup stage.
Depending on what you call setup, anyway.

To me it's everything between the vertex shader (well, GS in the DX10 era) and rasterizer (actually, even parts of the rasterizer). You separate some of the front end stuff, like culling/clipping. My discussions with Arun tell me he likes to separate some of the back end stuff, like interpolator and rasterizer stuff.
 
That's an interesting point. However, I would say that shadowmaps are relatively degenerate in terms of work load -- the verticies are simple.
That's just it though - in the unified shading era, almost all vertices are "simple". BTW, shadow maps still have to do all the same position calculations as scene vertices, including matrix blending.

Also, most modern hardware is very good at tossing out invalid verticies.
Not any faster than one per clock, which is no faster than the setup rate on modern GPUs, so.

It can set up 500M verts/second. For a 1920x1080 monitor that's 1 vert per pixel with an overdraw of 4.
Forget about pixels per polygon, because it doesn't work that way. There are lots of vertices off the screen. You only have a limited granularity in scene culling by the CPU. You have lots of vertices not facing the camera.

On top of that you have to draw more than what's inside the view frustum for cascaded shadow maps and reflection maps.

Finally, vertex loads are very clumpy in nature, so it's extremely rare to have pixel and vertex load balanced for much of a frame. If you have 5M polygons to draw, you'd be setup-limited for, IMO, around 4M of them because they only cover ~10% of the pixels. That costs you 8 ms per frame on Xenos. All your other rendering - 90% of the pixels - have to be done in the remaining time, e.g. 8.7 ms for 60fps. Furthermore, no matter how many shader units and ROPs you have, they can only reduce this latter part.
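
For anyone following along, the numbers work out like this (assuming the Xenos-class 500M tris/s setup rate from earlier):

```python
# The 5M-polygon frame above, worked through at a Xenos-class rate.
setup_rate = 500e6                       # triangles per second
frame_ms = 1000 / 60                     # 16.7 ms budget at 60 fps

setup_ms = 4e6 / setup_rate * 1e3        # the ~4M setup-limited tris
print(f"setup-limited: {setup_ms:.1f} ms")               # 8.0 ms
print(f"left for pixels: {frame_ms - setup_ms:.1f} ms")  # 8.7 ms
```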

This vertex load is 40% lower than your example, and we're still triangle-limited 48% of the time.

Anyway, like I was saying earlier, this only applies to SFR. The best way to look at SFR, then, is that it doubles resolution, not framerate.
 
Not any faster than one per clock, which is no faster than the setup rate on modern GPUs, so.

Bull. You can reject as fast as you can read in/when they pop out of vertex shading. Assuming the verts are behind the view frustum all you have to do is check the sign bit. There are also trivial checks for the other six sides.
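
The kind of trivial tests I mean, sketched in clip space (my illustration, not any particular GPU's logic; assumes a D3D-style 0..w depth range):

```python
def outcode(x, y, z, w):
    """Bitmask of frustum planes the clip-space vertex lies outside."""
    code = 0
    if x < -w: code |= 1    # left
    if x >  w: code |= 2    # right
    if y < -w: code |= 4    # bottom
    if y >  w: code |= 8    # top
    if z <  0: code |= 16   # near (0..w depth range)
    if z >  w: code |= 32   # far
    return code

def trivially_rejected(v0, v1, v2):
    """Reject when all three vertices are outside the same plane --
    just a handful of compares and an AND per triangle."""
    return (outcode(*v0) & outcode(*v1) & outcode(*v2)) != 0

# Triangle entirely left of the frustum -> rejected:
print(trivially_rejected((-2, 0, 0.5, 1), (-3, 1, 0.5, 1), (-2, -1, 0.5, 1)))  # True
# Triangle inside the frustum -> kept:
print(trivially_rejected((0, 0, 0.5, 1), (0.5, 0, 0.5, 1), (0, 0.5, 0.5, 1)))  # False
```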

Forget about pixels per polygon, because it doesn't work that way. There are lots of vertices off the screen. You only have a limited granularity in scene culling by the CPU. You have lots of vertices not facing the camera.

But that's my point. See above.

Finally, vertex loads are very clumpy in nature, so it's extremely rare to have pixel and vertex load balanced for much of a frame. If you have 5M polygons to draw, you'd be setup-limited for, IMO, around 4M of them because they only cover ~10% of the pixels. That costs you 8 ms per frame on Xenos. All your other rendering - 90% of the pixels - have to be done in the remaining time, e.g. 8.7 ms for 60fps. Furthermore, no matter how many shader units and ROPs you have, they can only reduce this latter part.

This vertex load is 40% lower than your example, and we're still triangle-limited 48% of the time.

Anyway, like I was saying earlier, this only applies to SFR. The best way to look at SFR, then, is that it doubles resolution, not framerate.

I can see that. I would like to know if there's any statistical data out there that examines the workload.

I think you make a much more effective argument for a doubling in framerate being CPU limited before it's GPU limited.
 
On top of that you have to draw more than what's inside the view frustum for cascaded shadow maps and reflection maps.

I'd phrase that as "you have to draw different stuff than what's inside the view frustum". It's not necessarily more stuff to draw. You draw the stuff that's inside the shadowmaps' view frustums, which could be either more or less than what's in the regular view frustum.

But I agree, shadow maps tend to be setup limited. So you can often boost resolution a fair amount until you start seeing any drop in performance.
 
I'd phrase that as "you have to draw different stuff than what's inside the view frustum". It's not necessarily more stuff to draw. You draw the stuff that's inside the shadowmaps' view frustums, which could be either more or less than what's in the regular view frustum.

Actually the shadow caster object has to be visible in the shadowmap view frustum and its shadow volume has to be visible in the camera view frustum. Otherwise it's culled. So the "different stuff" is usually "less stuff".
 
Bull. You can reject as fast as you can read in/when they pop out of vertex shading. Assuming the verts are behind the view frustum all you have to do is check the sign bit. There are also trivial checks for the other six sides.
Bull? Go run some tests. No GPU rejects more than one triangle per clock. That is my point.

You're entirely wrong in thinking "most modern hardware is very good at tossing out invalid vertices." It's still only one per clock, and since G80, that hasn't been any faster than setup on any GPU.

I can see that. I would like to know if there's any statistical data out there that examines the workload.
You might be able to get some from ATTILA. I did this stuff at ATI a few years ago, but of course the games were different then.

I think you make a much more effective argument for a doubling in framerate being CPU limited before it's GPU limited.
That depends on the game, but there aren't many games that run below 60 fps on decent CPUs with graphics settings turned down. However, there are games that run below 30 fps on high end GPUs.

Remember that you only need the GPU to be maybe 20% setup limited for enhanced setup to make a difference.

I'd phrase that as "you have to draw different stuff than what's inside the view frustum". It's not necessarily more stuff to draw. You draw the stuff that's inside the shadowmaps' view frustums, which could be either more or less than what's in the regular view frustum.
Well, I was thinking about a full shadowing solution, where your shadow frustum(s) must enclose the view frustum. Otherwise, you have the chance of missing some shadows.
 
I stand corrected. Doesn't this imply that setup is not a bottleneck?
Again, it depends on what you define as setup. When I say setup-limited, I mean "polygon throughput limited but not VS/GS limited".

If you remember my example above, 5M total polys per frame -- whether visible or not -- will take up half the render time at 60fps. I'd say that's a pretty relevant limit when that could mean only 500k front-facing polys on the screen.
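
Checking that claim with an assumed ~600 MHz one-triangle-per-clock setup rate (my number, not a spec):

```python
# "Half the render time": 5M polygons at one triangle per clock on an
# assumed ~600 MHz setup clock.
frame_ms = 1000 / 60                # 16.7 ms at 60 fps
setup_ms = 5e6 / 600e6 * 1e3        # 8.3 ms just to set up 5M tris
print(setup_ms / frame_ms)          # ~0.5 -> half the frame gone
```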
 
r700 will have 2GB. damn.


I saw that too, although that only says it supports it. If it really is 2GB though, then it's the final nail in the coffin for any hope of shared memory, since 2GB only makes sense if it's 1GB X 2.
 
Many games still animate particles on CPU, but this is not necessary. You can also implement the whole animation step on the GPU. For example like this...

OK, thanks for the example. Basically, that's what I was getting at in the first place - this method needs persistent data on the GPU, and in a multi-GPU case with AFR, this needs to be traded between the GPUs for every frame. Things may even get worse if the buffer is small enough to stay in the L2 cache.
 
So, does that mean the premium volume production of GDDR5 currently goes to the X2 boards?

Not necessarily -- there will also be GDDR3 models, but right now I don't think anyone could give us a clear picture of how many GDDR3 and GDDR5 models ATI will produce, for both the single and dual GPU setups.
 