The R7xx Architecture Rumours & Speculation Thread

From what I'm reading in this thread, the likely outcome is AFR mode, where each individual chip gets a frame.

If it's quadfire with quad GPUs, we're looking at a case where we get 16 frames in flight.

I don't know what it means to have 16 frames in-flight. I guess there are supposed to be ways to compensate, but how can a GPU be reliably tasked to render frame 16 if a user input changes world state in frame 14?

Forcing a halt basically negates a good slice of the achievable throughput.
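
Just to put very rough numbers on that - a back-of-the-envelope sketch, nothing vendor-specific, assuming a 60 fps output rate and that input sampled for a frame is only seen once that frame reaches the display:

Code:
#include <cstdio>

int main()
{
    // Hypothetical numbers, purely to show the scale of the problem: if the
    // card stack flips a new frame to the display every 16.7 ms (60 fps
    // output) and there are N whole frames queued in flight, input sampled
    // now is first visible roughly N flips later.
    const double flip_interval_ms = 1000.0 / 60.0;
    const int queue_depths[] = {2, 4, 8, 16};

    for (int frames_in_flight : queue_depths)
    {
        double input_lag_ms = frames_in_flight * flip_interval_ms;
        std::printf("%2d frames in flight -> ~%3.0f ms from input to display\n",
                    frames_in_flight, input_lag_ms);
    }
    return 0;
}

At 16 frames in flight that's over a quarter of a second before the game loop is even counted, which is why forcing a flush looks unavoidable despite the throughput cost.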
 
3dilettante - I completely agree, although I suppose you could play some games along the lines of: if user input causes sharp camera movement, flush the frames in flight and render a frame with significantly less detail (hopefully the detail loss could be hidden with motion blur). That doesn't really work if user input doesn't necessarily correspond to rapid camera motion (e.g. a 3D platformer)... but maybe in cases like those you could composite the part of the scene that must respond to user input quickly (the player avatar) with the background. The background in such a case would be rendered from the viewpoint of a fairly smoothly moving camera and would be much less sensitive to latency issues. Not being a graphics programmer I don't know how workable that is, but it seems like it could be a bit of a bear to implement...
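
Something like this is the shape of the heuristic I mean - a minimal sketch, where the threshold and the flush/render calls are entirely made up rather than any real API:

Code:
#include <cmath>
#include <cstdio>

// Hypothetical per-frame decision: if user input just yanked the camera,
// dump the queued frames and render one cheap frame instead of waiting for
// the deep AFR queue to drain on its own. None of these functions map to a
// real API; they stand in for whatever the driver/engine would expose.
struct CameraState { float yaw, pitch; };

void flushFramesInFlight()                      { std::printf("  flush frames in flight\n"); }
void renderFrame(float detail, bool motionBlur) { std::printf("  render at %.0f%% detail, blur=%d\n",
                                                              detail * 100.0f, motionBlur); }

void submitNextFrame(const CameraState& prev, const CameraState& curr)
{
    const float kSharpTurn = 0.35f;  // made-up threshold, radians per frame

    float delta = std::fabs(curr.yaw - prev.yaw) + std::fabs(curr.pitch - prev.pitch);
    if (delta > kSharpTurn)
    {
        // Latency suddenly matters more than quality: drop the queue and
        // hope motion blur hides the reduced detail.
        flushFramesInFlight();
        renderFrame(0.5f, true);
    }
    else
    {
        // Smooth camera: let the deep AFR pipeline run at full detail.
        renderFrame(1.0f, false);
    }
}

int main()
{
    CameraState a{0.0f, 0.0f}, b{0.05f, 0.0f}, c{1.2f, 0.1f};
    std::printf("smooth frame:\n"); submitNextFrame(a, b);
    std::printf("sharp turn:\n");   submitNextFrame(b, c);
    return 0;
}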

Another thought I had is that maybe AFR is being pushed because GPU makers don't ever really expect the kind of setup you describe to be affordable or common enough that it warrants changing things. Given what I'm hearing, I guess I don't expect to see many-core GPU setups (where core in this case is equivalent to a full GPU rather than a processing "cluster") in the gaming space...

If/when we do, from a developer's standpoint (and assuming this isn't already available), what about having the individual command streams of multiple GPUs exposed (rather than heroic efforts from driver teams to spread a single command stream across N GPUs)? That way you could divvy up work in a game-specific way, e.g. send the different passes needed for a given frame to different GPUs... Fairly coarse-grained tiling (a la Xenos, but with different tiles on different GPUs) could be implemented on top of such an API for those games where latency is really important.
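
To sketch the kind of API I mean (every type and call here is invented for illustration - nothing like it exists in current D3D):

Code:
#include <cstdio>
#include <vector>

// Invented stand-in for what an explicit multi-GPU API might expose:
// one command queue per physical GPU instead of a single driver-split stream.
struct CommandQueue
{
    int gpuIndex;
    void submit(const char* passName) const
    {
        std::printf("GPU %d <- %s\n", gpuIndex, passName);
    }
};

int main()
{
    // Pretend the runtime enumerated four GPUs.
    std::vector<CommandQueue> gpus = { {0}, {1}, {2}, {3} };

    // Game-specific split of one frame's work across the chips:
    gpus[0].submit("shadow map pass");
    gpus[1].submit("reflection / environment pass");
    gpus[2].submit("main opaque pass, screen tiles 0-1");
    gpus[3].submit("main opaque pass, screen tiles 2-3");

    // Coarse-grained tiling a la Xenos is just a special case of this:
    // chop the back buffer into a handful of big tiles and hand each
    // tile's command list to a different queue.
    return 0;
}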
 
Well, in R4xx vs NV4x, they did trail in features, because NV had support for VS/PS3.0, VTF, FP render targets, etc.

Even R300 supported FP render targets.

This feature should definitely stay that way! Hardware MSAA resolve has to die: we need shader-based resolves, we need to know where the subsamples are, and we also need to know whether a pixel has been fully compressed (all subsamples equal) or not.

Agreed. Once you put HDR into the mix, shader-based AA is really the only option, since a regular resolve doesn't take tonemapping into account, which can totally break AA.

I thought about this a bit more, and it seems to me that compression cannot *guarantee* you that a pixel's samples are not identical when it is not compressed.

Do we really need such a guarantee though? I'm envisioning such a feature to be a conservative check to avoid computations, rather than being very precise. If you really really need it to be precise you could always compare all samples yourself.
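
For what it's worth, the resolve I'm picturing looks roughly like this - CPU-side C++ standing in for a resolve shader, with the "fully compressed" flag being exactly that purely hypothetical conservative hint:

Code:
#include <array>
#include <cstdio>

// A made-up Reinhard-style tonemap operator; the point is only that it is
// non-linear, so averaging before or after it gives different results.
float tonemap(float hdr) { return hdr / (1.0f + hdr); }

// One 4xMSAA pixel: the HDR samples plus a hypothetical hint from the
// hardware that all samples are known to be identical (fully compressed).
struct MsaaPixel
{
    std::array<float, 4> samples;
    bool fullyCompressed;
};

float resolve(const MsaaPixel& p)
{
    if (p.fullyCompressed)
        return tonemap(p.samples[0]);   // cheap path: one sample speaks for all

    // General path: tonemap each subsample, *then* average. Averaging the raw
    // HDR values first and tonemapping the result is what breaks AA on edges.
    // (If the hint had to be exact, you could still compare the samples
    // yourself right here.)
    float sum = 0.0f;
    for (float s : p.samples) sum += tonemap(s);
    return sum / p.samples.size();
}

int main()
{
    // An edge pixel: two bright samples, two dark ones.
    MsaaPixel edge{{8.0f, 8.0f, 0.1f, 0.1f}, false};

    float correct = resolve(edge);
    float naive   = tonemap((8.0f + 8.0f + 0.1f + 0.1f) / 4.0f); // resolve-then-tonemap

    std::printf("tonemap-then-resolve: %.3f\n", correct);  // ~0.490
    std::printf("resolve-then-tonemap: %.3f\n", naive);    // ~0.802
    return 0;
}

A pixel wrongly flagged as "not fully compressed" just takes the slower general path and still gets the right answer, which is why a conservative hint is all that's needed.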
 
If/when we do, from a developer's standpoint (and assuming this isn't already available), what about having the individual command streams of multiple GPUs exposed (rather than heroic efforts from driver teams to spread a single command stream across N GPUs)? That way you could divvy up work in a game-specific way, e.g. send the different passes needed for a given frame to different GPUs... Fairly coarse-grained tiling (a la Xenos, but with different tiles on different GPUs) could be implemented on top of such an API for those games where latency is really important.
Or things like physics effects, skinning, shadow buffer rendering all running on different GPUs in parallel?

Multiple concurrent contexts all running as "parallel threads" with minimal synching (e.g. from skinning to shadowing)?

Or, you could distribute the rendering of shadow buffers across all available GPUs?

etc.?

I figure Xenos supports this style of operation, with up to 8 contexts running in parallel, so why not generalise this under D3D and let the OS schedule across the swarm of GPUs (perhaps with hints from the developer?)
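
To make the "minimal synching" point concrete, here's a toy sketch - the queue and fence types are invented, none of this is a real API - of the one dependency in that list: skinning on one GPU feeding shadow rendering on another, while physics runs free on a third.

Code:
#include <cstdio>

// Invented types: one command queue per GPU plus a fence object one queue can
// signal and another queue can wait on.
struct Fence { bool signalled = false; };

struct GpuQueue
{
    int gpu;
    void submit(const char* pass) { std::printf("GPU %d: %s\n", gpu, pass); }
    void signal(Fence& f)         { f.signalled = true; }
    void wait(const Fence& f, const char* why)
    {
        // A real implementation would block or poll here; the sketch just notes it.
        (void)f;
        std::printf("GPU %d: wait (%s)\n", gpu, why);
    }
};

int main()
{
    GpuQueue gpu0{0}, gpu1{1}, gpu2{2};
    Fence skinningDone;

    // Three contexts running as "parallel threads", with the only sync point
    // being skinning -> shadowing:
    gpu0.submit("physics effects");                   // nobody waits on this
    gpu1.submit("skin characters to vertex buffer");
    gpu1.signal(skinningDone);

    gpu2.wait(skinningDone, "needs skinned geometry");
    gpu2.submit("render shadow buffers");
    return 0;
}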

Jawed
 
I figure Xenos supports this style of operation, with up to 8 contexts running in parallel, so why not generalise this under D3D and let the OS schedule across the swarm of GPUs (perhaps with hints from the developer?)
I might be wrong on this, but I don't think Xenos runs more than one context at a time; afaik contexts are just a nice way to quickly switch between different render state bundles.
 
I might be wrong on this, but I don't think Xenos runs more than one context at a time; afaik contexts are just a nice way to quickly switch between different render state bundles.
I have a slide somewhere that shows them overlapping by varying amounts. There's a general comment that this is a good way to hide render state changes.

But why 8? Wouldn't 2 or 3 do if it was just for hiding state changes?

I'd like to know more about them, but it's always a dead end when I raise the subject :cry:

Jawed
 
Jawed.. any decent GPU does that, otherwise every state change would kill performance :)
Contexts are useful because they can be preloaded and managed; as you noted yourself, it would be basically impossible to use more than 2 or 3 contexts during normal rendering.
 
They seem to have ducked the issue by going with AFR.

Apart from bandwidth and latency, presumably the "logical single GPU" also has a problem with having distributed highest-level control processors, while the work they're trying to organise is actually "serial"? A vertex stream, for example, doesn't readily parallelise across two or more physical GPUs because each chip requires the whole buffer, and geometry assembly and at least part of setup work serially.

The earliest command level parallelism appears to come at the time of rasterisation.

Perhaps we're thinking about the multi-chip approach all wrong... Just because it's multi-chip, doesn't mean all those chips are the same.

On that note: Rampage, anyone?
 
Jawed.. any decent GPU does that, otherwise every state change would kill performance :)
Contexts are useful because they can be preloaded and managed; as you noted yourself, it would be basically impossible to use more than 2 or 3 contexts during normal rendering.
So why have 8?

Jawed
 
Instead of wasting CPU time, push buffer space and GPU time generating, storing and loading groups of render state changes over and over, you can 'cache' them and reuse them: preload once, use many times.
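
i.e. something along these lines - a hypothetical sketch, where StateBlock and the cache are invented and the real thing would live in the driver/runtime:

Code:
#include <cstdio>
#include <string>
#include <unordered_map>

// A "context"/state block: a bundle of render state changes recorded once.
struct StateBlock
{
    int id;
    // ... blend, depth, sampler state etc. would live here ...
};

class StateBlockCache
{
public:
    // Generate and preload the bundle the first time it is asked for,
    // then hand back the cached copy on every later use.
    const StateBlock& get(const std::string& key)
    {
        auto it = cache_.find(key);
        if (it == cache_.end())
        {
            std::printf("building + preloading state block '%s'\n", key.c_str());
            it = cache_.emplace(key, StateBlock{nextId_++}).first;
        }
        return it->second;
    }
private:
    std::unordered_map<std::string, StateBlock> cache_;
    int nextId_ = 0;
};

int main()
{
    StateBlockCache cache;

    // Preload once...
    cache.get("opaque");
    cache.get("alpha-blended");

    // ...use many times: later frames just bind the cached bundle instead of
    // re-spending CPU time and push buffer space on the individual changes.
    for (int frame = 0; frame < 3; ++frame)
    {
        const StateBlock& s = cache.get("opaque");
        std::printf("frame %d binds state block %d\n", frame, s.id);
    }
    return 0;
}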
 
What if...

...
Multi-chip solutions are asynchronous in another way - the 3Dlabs way - just as ShaidarHaran implied?

You'd then have almost perfect scalability (since threads already stay in a given SIMD with current cards) and the only thing you'd have to build a bit oversized would be the dispatcher/control-logic.

After that you could simply add as many SIMD/ROP chips as your gatekeeper chip can possibly feed.

The only thing I can think of right now that could hamper this would be the stream-out path, which would have to go via VRAM and not a fast on-die path. Unfortunately I cannot guess how important just that feature would be...
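
Roughly what I imagine the gatekeeper doing, as a toy model (all numbers invented): it owns the stream of thread groups and hands each one to whichever SIMD chip has the least outstanding work, and once assigned, a group never migrates.

Code:
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy model of a "gatekeeper" chip feeding N identical SIMD/ROP worker chips.
// A thread group, once dispatched, stays on the chip it was given to, which
// is why only the dispatcher/control logic has to be built oversized.
int main()
{
    const int numWorkerChips = 4;          // add as many as the gatekeeper can feed
    std::vector<int> outstandingWork(numWorkerChips, 0);

    for (int group = 0; group < 12; ++group)
    {
        // Pick the least-loaded worker chip.
        auto least = std::min_element(outstandingWork.begin(), outstandingWork.end());
        int chip = static_cast<int>(least - outstandingWork.begin());

        *least += 1;   // pretend every group is the same amount of work
        std::printf("thread group %2d -> SIMD chip %d\n", group, chip);
    }

    // The awkward bit noted above: anything like stream-out that has to leave
    // a chip would now round-trip through VRAM rather than an on-die path.
    return 0;
}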
 
The problem with multi-chip like this is your memory subsystem starts to scale like crap as you add workers. Are you building a memory controller chip too, in your system? How are you going to connect them all? What if those chips want to live in different sockets or slots? Will they all have their own discrete memory? Share a big pool? Both?

I don't see it myself, mostly because of the above.

I can think of a reasonably cute way to fix the frame-ahead AFR problem where there's feedback for a small number of GPUs, but when I think about say 8 GPUs taking part in interactive rendering, just the CPU and bus overhead required to keep them all working properly is scary enough, never mind everything else.
 
The problem with multi-chip like this is your memory subsystem starts to scale like crap as you add workers. Are you building a memory controller chip too, in your system? How are you going to connect them all? What if those chips want to live in different sockets or slots? Will they all have their own discrete memory? Share a big pool? Both?

Well, the problem is already solved - albeit only inside a single die. You'd only have to port it to an MCM solution.
 
Well, the problem is already solved - albeit only inside a single die. You'd only have to port it to an MCM solution.
Yeah. You only have to do a couple more 512-bit buses for a chip that already has one. Why hasn't this been done already??.. 8)
 
Yeah. You only have to do a couple more 512-bit buses for a chip that already has one. Why hasn't this been done already??.. 8)
The 512-bit bus is only external, to VRAM - not internal between each individual shader cluster and ROP partition. I don't think you'd need to replicate all that.
 
Yeah. You only have to do a couple more 512-bit buses for a chip that already has one. Why hasn't this been done already??.. 8)

According to a slide, R600 has an internal 1024-bit ring bus at, I think, 742MHz. So the bandwidth is ~95GB/sec.

If you use a hypothetical 4-chip R700 solution with two 32-bit HT3.0 interfaces at full speed (~2600MHz) per chip, you would end up with roughly 2x 41.6GB/sec of bandwidth per chip (as far as I understand the wiki article about HT, this "bus" can send and receive data at the same time - I hope this is correct).

So from a bandwidth perspective it should be possible to divide a big chip like R600 into smaller chips without having to curtail bandwidth. Also, such a solution would be a big stepping stone towards AMD's Fusion system, which seems to use HT3.0 too.
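
As a quick sanity check on those figures (assuming the ring bus moves its full 1024 bits per clock, and that a 32-bit HT3.0 link at a 2.6GHz clock is double-pumped, counting both directions):

Code:
#include <cstdio>

int main()
{
    // R600 internal ring bus: 1024 bits wide at ~742 MHz.
    double ringBytesPerClock = 1024.0 / 8.0;               // 128 bytes per clock
    double ringGBps = ringBytesPerClock * 742e6 / 1e9;     // ~95 GB/s
    std::printf("ring bus: ~%.0f GB/s\n", ringGBps);

    // One 32-bit HyperTransport 3.0 link at a 2.6 GHz clock, double data rate:
    // 5.2 GT/s * 4 bytes = 20.8 GB/s each way, 41.6 GB/s counting both
    // directions. Two links per chip doubles that again.
    double htGTps = 2.6 * 2.0;                             // 5.2 GT/s
    double htPerDirection = htGTps * 4.0;                  // 20.8 GB/s
    double htPerLink = htPerDirection * 2.0;               // 41.6 GB/s both ways
    std::printf("per 32-bit HT3.0 link: ~%.1f GB/s, two links: ~%.1f GB/s\n",
                htPerLink, 2.0 * htPerLink);
    return 0;
}

Those are raw link rates, of course - protocol overhead would eat into them - but the orders of magnitude line up with the numbers above.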
 
What, no one likes my asymmetric processing theory?

I think one of the attractions for the IHVs of multi-chip is predictable scalability and price/performance while not adding materially to other costs like design costs, storage costs, parts tracking costs, etc. I won't say there will never be asymmetric cases, as arguably NVIO is exactly such a case. Nevertheless, they'd seem to like to avoid them.

I think part of what's driving this is the expansion of the pricing envelope, from a top end of say $399 for a video solution in 2002 to well over $1,000 for top-end consumer video solutions today. When you consider that entry-level discrete is around $49... that's a huge scope to hit with one basic design, and it seems to me the IHVs are hunting around right now for the best way to do that... i.e. what's the best spot in the entire scope to aim at as the reference point from which everything else is a variation for its target price/market?
 