The R7xx Architecture Rumours & Speculation Thread

I agree with your points entirely. However, it doesn't reduce the irony that 3dfx went out of business in part because they opted for a multi-chip solution but the current IHVs are moving back towards this choice to stay in business. :smile:

Well, 3dfx went multi-chip at a point in time when it was suboptimal and scaling through single monolithic designs was still possible and desirable. Some things that are suboptimal at one point in time can become optimal at another:D.
 
I've had this discussion with Uttar several times on MSN. But when/if multi-chip solutions become the norm, SLI/Crossfire are simply going to have to evolve with them if they are to remain feasible at all.
 
"Linked Adapters" is Microsoft's name for the OS bits required for SLI/Crossfire. Since with Vista the OS handles memory management and scheduling between multiple contexts, it's not possible for drivers to implement multi-GPU without OS involvement.

Slide 15 in this WinHEC slide deck is the best reference I can come up with at the moment.

So in other words, it's current technology, not something for the future.
What I'm wondering is how much the "single logical GPU" concept might enhance game compatibility. Does this merely enable CF/SLI under Vista or does it also smooth performance/compatibility?

Jawed
 
IF R700 is supposed to be a multi-chip configuration, might it be possible that it will be configured as a cluster of GPU cores on the same board? That is, there would be a MASTER core holding the control units, i.e. the memory controller, thread scheduling and load distribution to the other cores (the master core would do no calculation itself; its task would mainly be to control the cluster). The master core could also come in different configurations, e.g. a 512-, 256- or 128-bit memory bus (or perhaps a serial bus interface instead), depending on the class of the product line (high-end, mainstream, low-end).

The slave cores in the cluster might contain the ALU and shader logic (my guess; I have very little knowledge of this), with a bus that loops around the other cores and returns to the master core (yes, a ring topology). Oh, and in this case my idea is that the UVD & AVIVO units would be on the master core too.

With this configuration, I hope it would present itself as "hardware Crossfire", set up at the BIOS level rather than at the driver level under the OS, thus ensuring compatibility, since the OS would see the pool of GPUs in the cluster as a single GPU. And this would make developing drivers for different OSes a bit easier (or not).

This idea may sound like nonsense... it's just a guess... :oops:

Regards,
 
Considering how difficult it is to get games to scale with multi-GPU using driver workarounds, what are the odds of there being a feasible hardware implementation of the same?
 
They better think of something.

I'd hate to see a future where quad-Crossfire setups of quad-GPU boards default to AFR because AMD opted not to make a profile for a game.

I don't know if I can handle 16 frames worth of latency.
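Rough arithmetic behind that worry (a Python sketch; the 60 fps figure is just for illustration):

```python
# Illustrative arithmetic: pure AFR pipelines one frame per GPU, so input
# latency grows with GPU count even when the frame rate looks healthy.
def afr_latency_ms(num_gpus, fps):
    frame_time_ms = 1000.0 / fps       # time per displayed frame
    return num_gpus * frame_time_ms    # frames in flight before display

for gpus in (1, 2, 4, 16):
    print(f"{gpus} GPUs @ 60 fps -> {afr_latency_ms(gpus, 60):.0f} ms latency")
# 16 GPUs at a steady 60 fps would still mean ~267 ms from input to display.
```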
 
Well it looks like any multi-chip solution is going to be a lot more integrated than the SLI/Crossfire setups we see today, so I doubt AFR and SFR will be relevant. Scaling has to be guaranteed, and the only way I can see to do that is to maintain the same general resource access and sharing capabilities of current single-chip architectures, albeit with higher latency.
 
Trouble is, Huddy seemed quite emphatic that AFR is the preferred mode and that ATI will set about encouraging developers to code with AFR scaling in mind.

Jawed
 
The AFR latency concern is definitely one I share. With dual GPUs it's not really noticeable to me (some are more sensitive to it than others), but with quad SLI it was certainly noticeable. And Nvidia seems pretty peachy right now about AFR and its scaling.

I had recently asked the Nvidia SLI team about the possible advantages of SFR on a unified architecture, where the pixel/vertex bottlenecks are different, and the scaling would definitely be better. However, it seems Nvidia has set AFR as the preferred scaling mode, which I think is a result of the simplicity and effectiveness of the rendering mode. And devrel seems to be focused on making games more AFR-friendly.

Chris
 
Well AFR might not be so bad if all chips have access to a global VRAM pool and render targets and other buffers are easily shared. What exactly does it mean for a developer to code with AFR in mind though?
 
Well AFR might not be so bad if all chips have access to a global VRAM pool and render targets and other buffers are easily shared. What exactly does it mean for a developer to code with AFR in mind though?

As far as I know, it's basically about making sure developers don't render into the next frame, which can really mess up AFR scaling.
 
As far as I know, it's basically about making sure developers don't render into the next frame, which can really mess up AFR scaling.
Welcome tone mapping and exposure computations!
No big deal... one can defer usage of the exposure computed at frame N to frame N + M - 1, where M is the number of GPUs in the system. As long as developers are aware of these issues, it shouldn't be that hard to fix them.
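A minimal sketch of that deferral, assuming M GPUs in AFR (everything here is illustrative, not engine code):

```python
from collections import deque

def run_frames(num_gpus, measured_exposures, initial=1.0):
    """Frame N's tone mapping reads the exposure computed at frame
    N - (num_gpus - 1), so no frame waits on a result still in flight
    on another GPU. Returns the exposure each frame actually used."""
    lag = num_gpus - 1
    history = deque([initial] * lag, maxlen=lag) if lag else None
    used = []
    for measured in measured_exposures:
        used.append(history[0] if lag else measured)  # fed to the tone mapper
        if lag:
            history.append(measured)  # published for frame N + (num_gpus - 1)
    return used

# Two GPUs in AFR: frame N consumes the exposure that frame N - 1 produced.
print(run_frames(2, [0.9, 1.1, 1.3, 0.8]))  # -> [1.0, 0.9, 1.1, 1.3]
```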
 
One thing that's bothered me about R600 is that if AMD planned it to be a stepping stone to a multi-chip R700, why did they go with a 2D organization of SIMDs/textures? One dimension is SIMDs within a cluster (which I assume all share a ring stop), and the other dimension is SIMDs that talk to a given texture unit.

In R600, this means lots of cross-chip communication, since each cluster has to talk to each texture unit. In a multi-chip system (assuming they're going for something more integrated than two independent chips with classical Crossfire) it translates to lots of inter-chip communication, which implies a pretty impressive bus between the two chips.

Thoughts? Is this really a problem? What architectural changes might they make to address it?
 
In R600, this means lots of cross-chip communication, since each cluster has to talk to each texture unit. In a multi-chip system (assuming they're going for something more integrated than two independent chips with classical Crossfire) it translates to lots of inter-chip communication, which implies a pretty impressive bus between the two chips.
Yep, you nailed it down very well.
Thoughts? Is this really a problem? What architectural changes might they make to address it?
Well... if they're pushing AFR, that's the answer you're looking for -> they're not going to address it.
 
Thoughts? Is this really a problem? What architectural changes might they make to address it?
They seem to have ducked the issue by going with AFR.

Apart from bandwidth and latency, presumably the "logical single GPU" also has a problem with having distributed highest-level control processors, while the work they're trying to organise is actually "serial"? A vertex stream, for example, doesn't readily parallelise across two or more physical GPUs, because each chip requires the whole buffer, and geometry assembly and at least part of setup work serially.

The earliest command level parallelism appears to come at the time of rasterisation.

I suppose you could argue that geometry-related data is low in volume and relatively compact (compared with pixels/fragments/texels) so coherency traffic shouldn't be too costly. If they've built a cache-coherent multi-GPU system, then it should work, right? Erm...

---

How big is a texture result? Presumably, regardless of the source format, the filtered (bilinear/trilinear/AF) texture result output by a TU is always Vec4 FP32, because the result is put directly into a register. So I'm guessing that for trilinear or better filtering, the texture result would consume less chip-to-chip bandwidth (in a multi-chip GPU) than sending the raw texels for the chip to filter locally. Anyway, the AFR plan appears to obviate this consideration.
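Back-of-envelope bytes for that comparison (a sketch; the RGBA8 format and the AF worst case are assumptions):

```python
# Illustrative byte counts only. A filtered result is one Vec4 FP32 per fetch;
# shipping raw texels means the whole filter footprint crosses the link.
RESULT_BYTES = 4 * 4                     # Vec4 FP32 result: 16 bytes

def raw_footprint_bytes(texel_bytes, texels_sampled):
    return texel_bytes * texels_sampled

# Assuming 4-byte RGBA8 texels: bilinear touches 4 texels, trilinear 8,
# and 16xAF can touch up to 128 in the worst case.
for name, texels in (("bilinear", 4), ("trilinear", 8), ("16xAF worst case", 128)):
    print(f"{name}: raw {raw_footprint_bytes(4, texels)} B vs result {RESULT_BYTES} B")
```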

Jawed
 
Latency would be lower between the two chips if they were on the same package, right? So if the two chips are rendering via AFR, what exactly would happen if another R700 was dropped in? You would have AFR going on within each board, with AFR between the boards... right? That sounds kinda... complex.
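If it really nested like that, the frame assignment would flatten into plain 4-way AFR anyway; a sketch assuming 2 boards with 2 GPUs each:

```python
BOARDS, GPUS_PER_BOARD = 2, 2  # assumed topology

def assign(frame):
    # Outer AFR picks the board, inner AFR picks the GPU on that board;
    # the net effect is a flat 4-frame rotation, i.e. 4-way AFR.
    board = frame % BOARDS
    gpu = (frame // BOARDS) % GPUS_PER_BOARD
    return board, gpu

for f in range(8):
    b, g = assign(f)
    print("frame", f, "-> board", b, "GPU", g)
```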

Why is R600 the stepping stone for R700? Why not RV670/R670?
 
Enforced serialization on my massively parallel workload makes me a sad panda.

How would this affect frame rates?

I would suppose that in optimal cases, max fps would go through the roof.
The minute any serialization happens, it sounds like the minimum fps is going to crater.
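Amdahl-style arithmetic puts numbers on that (a sketch; the 30 fps base and the serial fractions are made up):

```python
def scaled_fps(base_fps, num_gpus, serial_fraction):
    # Amdahl's law: only the parallel share of each frame scales with GPUs.
    return base_fps / (serial_fraction + (1.0 - serial_fraction) / num_gpus)

for serial in (0.0, 0.1, 0.5):
    fps = [round(scaled_fps(30, n, serial), 1) for n in (1, 2, 4)]
    print(f"{serial:.0%} serialized:", fps)
# 0% serialized scales 30 -> 120 fps on 4 GPUs; at 50% it limps to 48 fps.
```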
 
If all the GPUs were coalesced then you'd need, for example, a post-transform vertex cache that's shared amongst all the GPUs. It's interesting to note that R600 can "spill" PTVC into VRAM (it's virtualised) - this is to support amplification from GS.

If that's the case, then couldn't PTVC readily support sharing (writing and reading) by all GPUs?

There's a performance hit in PTVC spillage, but I'm not sure why, if the on-die PTVC can cache what's in VRAM. Put another way, I don't know why PTVC latency can't be hidden. If you know you're going to spill (GS has to declare maximum output) or you know the GPU is coalesced, then you can schedule around the latency, can't you?
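The "schedule around the latency" part is just a threads-in-flight calculation; a sketch with assumed latency and issue numbers:

```python
import math

def threads_to_hide(latency_cycles, issueable_cycles_per_thread):
    # A known, fixed latency can be covered by keeping enough other threads
    # in flight to fill the wait with useful ALU work.
    return math.ceil(latency_cycles / issueable_cycles_per_thread)

# e.g. ~400 cycles to reach spilled PTVC data in VRAM, ~8 independent ALU
# cycles per thread between dependent fetches (both numbers assumed).
print(threads_to_hide(400, 8), "threads in flight cover the spill latency")
```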

As far as I can tell, PTVC is 32 or 64KB in R600 ("upto 8x increase in vertex cache size vs. X1000 series" - "vertex cache ... 4k/8k on X1000"). I presume it's talking KB, though it could be talking k vertices.

Jawed
 
Well AFR might not be so bad if all chips have access to a global VRAM pool and render targets and other buffers are easily shared. What exactly does it mean for a developer to code with AFR in mind though?
Therein lies the challenge. Memory can only be wired to one chip or the other. Having one chip access the other chip's memory pool to the point that it can be considered equivalent to a global memory pool is really hard.
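A rough feel for why, with made-up numbers (neither figure is a real spec):

```python
# Illustrative NUMA-style arithmetic; all numbers are assumptions, not specs.
LOCAL_GBPS = 100.0      # bandwidth to a chip's own GDDR pool (assumed)
LINK_GBPS = 8.0         # bandwidth of the chip-to-chip link (assumed)

remote_share = 0.25     # fraction of traffic hitting the other chip's pool
effective = 1.0 / ((1.0 - remote_share) / LOCAL_GBPS + remote_share / LINK_GBPS)
print(f"effective: {effective:.1f} GB/s out of {LOCAL_GBPS:.0f} GB/s local")
# Even 25% remote traffic drags ~100 GB/s down to ~26 GB/s effective.
```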
 
Can anybody answer how Crossfire would work if you have a card in Crossfire mode that is itself in Crossfire with another card in Crossfire mode?:???:


More interesting is that ATi is introducing Tri-Fire and Quad-Fire, so I'm even more confused about how R700 will react to that kind of setup.
 