R700 Inter-GPU Connection Discussion

Let's see...
- lower cost
- higher yields
- better performance

If they can share the same framebuffer then there goes your argument.
Only because manufacturing (apparently) can no longer handle the transistor count of monolithic GPUs. Without this restriction, I think monolithic is always superior (more efficient, less troublesome) to a split-up, glued-together solution. It seems to make sense to me. SLI, CF, R680 and R700 are basically just split-up monolithic chips.
 
That's what I'm talking about.


I think Geo's spidey sense is right, we should not get carried away and set ourselves up for disappointment. :)
 
If they can share the same framebuffer then there goes your argument.

It will be much better when they're sharing the framebuffer, although it will still be difficult to share the cache contents - some applications (like Global Illumination) seem to rely on that heavily. But then, there are always extreme cases :smile:

Mintmaster: OK. I really wish I had more time to read these forums :cry:
 
So on the down side, it looks like it won't utilise shared memory, it's still an AFR solution (and hence will suffer input lag), and it's still possible that it will suffer from micro stuttering and profile requirements (but fingers crossed on those ones!).

On the bright side, it's looking like it might attain 175% of the 4870's performance on average and only sell at $499. In other words, it slaughters the GTX 280 in performance and costs $150 less :D

At this point it all comes down to how reliable and seamless the scaling is, and whether micro stuttering is still an issue. Solve those, and ATI has a hands down winner.
 
On the bright side, it's looking like it might attain 175% of the 4870's performance on average and only sell at $499. In other words, it slaughters the GTX 280 in performance and costs $150 less :D

At this point it all comes down to how reliable and seamless the scaling is, and whether micro stuttering is still an issue. Solve those, and ATI has a hands down winner.

Yes. Of course it's really wonderful pre-release (before any nasty old facts arrive) when you can just assume those problems away, innit? :LOL:

I think I prefer to keep my expectations at the "keep whipping them to make profile creation and management at least as good (and preferably better) than their competitors" level for now.

Tho getting an extra 15% beyond traditional CF is still worth a "nice job, guys".
 
If they can share the same framebuffer then there goes your argument.
You need to do more than that to avoid doubling the vertex load in SFR. AFR having to copy persistent data if necessary is always going to cost BW, too.

Yield problems for monolithic designs can be avoided with good redundancy, so cost doesn't have to be affected, and there's no fundamental reason that performance will be better.

We don't really have the whole story as to why GT200 is less efficient per mm2, nor why NVidia couldn't clock it as high as G92. The last time a flagship was clocked slower than the midrange was 2004 with NV40. It could be that a 5-cluster, 256-bit version would be half as fast.
 
How much bandwidth is required?

You mentioned that an "insanely fast" link would be needed. I'm not so sure. That makes an assumption about where you're sharing data, or where data sharing is needed to achieve the magic ~2X speedup.

If the application is not vertex shader limited, you're only talking about sharing render targets and some texture data.

I tend to think this is the case 99+% of the time

Just my thinking...
 
With unified shaders they never are ... on the flipside, doubling vertex load always eats into your performance.

So what's the delta performance improvement? For traditional SLI/CrossFire setups the Z and render buffers are shared perfectly, while the vertex, shader, and texture/render targets are segregated. Traditional multi-GPU systems go non-linear when:

1. You have to share data (implying synchronization between the engines)
2. You are bottlenecked above the texture pipeline.

My guess is that you don't have to be unified that far up the pipe before the vast majority of the non-linear bottlenecks go away. I have no idea how you would prove that until benchmarks come out :)
 
So let's be clear, you think GPUs are texture limited 99% of the time ... and they would still be texture limited 99% of the time even if you doubled the vertex load?
 
Yeah. And by limited, I mean non-linear. My guess is that the compactness of vertex data (or more correctly, vertex shader operands) vs. texture data is a leading factor.
 
I was thinking of stuff like waterfalls or explosions generated with GS. It was only my assumption that it preserves data across multiple frames - wouldn't it?

Most likely the animation data is looped back to the vertex shader, and the geometry shader is run every frame (it generates the particle quads). Reusing the quads is not something you usually want.
 
I see - and in this case, the CPU takes care of all the animation, with no persistent data on the GPU?

Mintmaster: no, I didn't take it like that. It just saddens me sometimes to think of this huge bunch of knowledge floating at arm's length, and I simply don't have the time to give it a good reading through.
 
I see - and in this case, the CPU takes care of all the animation, with no persistent data on the GPU?

Many games still animate particles on CPU, but this is not necessary. You can also implement the whole animation step on the GPU. For example like this:

When animating the particles, the GPU reads a hardware buffer that contains the previous frame's positions, direction vectors (speed), particle lifetimes, and other animation parameters, animates the particles one frame ahead, and writes the updated dynamic particle data (positions and direction vectors) to a different buffer (ping-ponging between two buffers).

Each frame, the particle renderer uses this data to form quads (either by fixed-function point sprite quad expansion, a geometry shader, or R2VB) and renders the quads to the screen. This also happens completely on the GPU; no data transfer between CPU and GPU is needed.

This way you don't need to process the particles on the CPU at all, and you save a nice amount of data transfer bandwidth between CPU and GPU (and ease up the synchronization overhead between the two).
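Not from the post above, but here's a minimal sketch of that ping-pong update written as a CUDA kernel, just to make the idea concrete. The Particle struct, buffer names, and the simple gravity integration are my own assumptions; a game of that era would more likely do this step in a vertex shader with R2VB or stream-out, then expand each particle to a quad for rendering.

```cuda
// Minimal sketch of GPU-side particle animation with two ping-pong buffers.
// All names and the integration scheme are illustrative assumptions.
#include <cuda_runtime.h>

struct Particle {
    float3 position;   // world-space position
    float3 velocity;   // direction vector * speed
    float  lifetime;   // remaining life in seconds
};

__global__ void animateParticles(const Particle* prev,  // last frame's state
                                 Particle*       next,  // this frame's state
                                 int count, float dt, float3 gravity)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    Particle p = prev[i];                    // read previous frame's data
    p.velocity.x += gravity.x * dt;          // simple forward integration
    p.velocity.y += gravity.y * dt;
    p.velocity.z += gravity.z * dt;
    p.position.x += p.velocity.x * dt;
    p.position.y += p.velocity.y * dt;
    p.position.z += p.velocity.z * dt;
    p.lifetime   -= dt;
    next[i] = p;                             // write updated data to the other buffer
}

// Per frame: animate one step ahead, then swap the two device buffers so the
// particle data never leaves the GPU; the renderer reads the fresh buffer to
// expand quads.
void stepParticles(Particle*& bufA, Particle*& bufB, int count, float dt)
{
    const float3 gravity = make_float3(0.0f, -9.81f, 0.0f);
    int threads = 256;
    int blocks  = (count + threads - 1) / threads;
    animateParticles<<<blocks, threads>>>(bufA, bufB, count, dt, gravity);
    Particle* tmp = bufA; bufA = bufB; bufB = tmp;   // ping-pong the buffers
}
```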
 
You mentioned that an "insanely fast" link would be needed. I'm not so sure. That makes an assumption about where you're sharing data, or where data sharing is needed to achieve the magic ~2X speedup.
It all depends on what you define as "achieving 2x speedup". Does it have to be every application? Every game? 90% of games? 50% of games?



If the application is not vertex shader limited, you're only talking about sharing render targets and some texture data.

I tend to think this is the case 99+% of the time
With unified shaders you're rarely vertex shader limited, but you can very easily be setup limited. One triangle per clock just isn't good enough to make poly count a non-factor for scaling.

Now, are you talking about AFR or SFR? With AFR, it doesn't really matter if the game is vertex/triangle limited or not. Scaling is only hampered in terms of losing space and needing extra memory transfers.
 