The R7xx Architecture Rumours & Speculation Thread

But how to connect them? Three-digit GB/s numbers are not easy to reach on a package. :???:
I think a package is the perfect place to put "high bandwidth". I assume that a package can be made multi-layer.

Xenos is a nice example with the GPU and EDRAM dies sharing a package - although the bandwidth there is only 32GB/s.

Alternatively, you could argue that the marginal cost of putting the interconnect on the main circuit board wouldn't be that high as it already needs to support 1GB of RAM, say, with a total memory bus of 512-bit.

Jawed
 
I must admit this is very far from my area of expertise, and at risk of making a complete fool out of myself: wouldn't it be possible to just hide the multiple cores, so that from a user/software point of view it would look like a normal one-chip solution, while on the inside it would be multiple cores?
Sort of like replacing the Stream Processing Unit clusters in the picture below with dedicated chips.
[Image: r700-chip-concept.jpg]


Then again, that might just complicate things a lot and make it a nightmare to actually get close to maximum theoretical performance.


Not sure whether what I was trying to ask actually made any sense to anyone, but oh well :)
 
Amusingly enough I've just been looking at this thread:

http://forum.beyond3d.com/showthread.php?t=31207

This stands out:

http://forum.beyond3d.com/showpost.php?p=769619&postcount=17

oh look!

The Windows Vista operating system will include native support for multiple graphics accelerators through an ATI sponsored technology called Linked Adapter. Linked Adapter will treat multiple graphics accelerators as a single resource (GPU and memory), and working together with parallel engine support, schedule the most efficient workload possible across the graphics processors and graphics memory pool to maximize performance.

Jawed
 
I must admit this is very far from my area of expertise, and at risk of making a complete fool out of myself: wouldn't it be possible to just hide the multiple cores, so that from a user/software point of view it would look like a normal one-chip solution, while on the inside it would be multiple cores?
Sort of like replacing the Stream Processing Unit clusters in the picture below with dedicated chips.
I think parts of it could be done that way, but it would be best if reads from the register file and caches didn't require hops to other chips.

There are a number of data paths that would be cut by partitioning by unit type, and the division seems to have cut the samplers and command processor off from accessing memory directly.

There would have to be some shifting of hardware around and perhaps additional queues and mirrored caches not present currently.
 
But maybe, as an extension of your L2 sharing idea, we can think of the R700 as a single processor on multiple dies? Like a 386+387, a Voodoo 1 or a Pentium Pro (with one or two L2 dies). One of the dies would be the master and speak to the PCIe bus, and you'd effectively have a single GPU software-wise. The on-package interconnects would have to be really fast (how is that done on the Pentium Pro? Or the L3 dies on POWER chips?)

Is that feasible, and could they still easily use a single die for midrange boards?
This is much like my original conception of R700, from when we first heard about it being multi-chip.

Though I interpreted it as likely requiring a dedicated "IO" chip (much like NVIO, in a way), acting as a "ring stop" dedicated to PCI Express, doing AVIVO/UVD and other stuff.

Huddy's comments, though, point to a software model that's founded upon explicit chip splitting. It makes my heart sink.

The "Linked Adapter" thing may actually ride to the rescue. It appears to be a part of D3D, going forwards. But it might only be a minor "unification" handling simple cases and making two or more chips "appear as one".

Multiple chips acting as a single logical chip is clearly hard to implement. I'm afraid to say I saw ATI's virtual memory/multiple render contexts infrastructure (along with the latency-tolerance of GPUs, generally) as the silver bullet.

I have to admit, it'd be nice to have a thread where we discuss current and forward-looking rendering techniques and assess their suitability for CF/SLI-type GPU arrangements: deferred rendering, effects physics, high-quality shadowing (e.g. VSM or CSM) etc.?

I wonder if the use of multiple contexts, in parallel, is what'll soften the blow here. If, going forwards, D3D encourages developers to program rendering passes as contexts that are capable of parallel execution (with the obvious caveat that serially-dependent rendering passes need to be respected), then it may be that having multiple physical chips disappears as an issue.

The programmers will already, effectively, be programming multiple logical GPUs, to run in parallel.

Jawed
 
That's why there's no guarantee that DX11/12 features implemented now in the R600 will actually be the optimal way to implement DX11/12 features when the market is actually ready for them. You might find ATI jettisoning the work they did in the R600 by the time the R800 rolls along.
Yeah, TruForm. Or the fog ALU.

Or, MSAA resolve, which is no longer implemented in hardware.

Jawed
 
Or, MSAA resolve, which is no longer implemented in hardware.
This feature should definitely stay that way! Hardware MSAA resolve has to die; we need shader-based resolves, we need to know where subsamples are, and we also need to know if a pixel has been fully compressed (all subsamples are equal) or not.
 
we also need to know if a pixel has been fully compressed (all subsamples are equal) or not.
I haven't noticed any sign of this coming in D3D. Presumably it's right at the back of the queue behind all the other MSAA-related stuff.

Could this be realised as an alternate resource view? e.g. two concurrent views of an MSAA'd buffer: all samples as one view and compression flag as another view (1 bit per "element", whoopee!). D3D10 can't support multiple views on the same resource at the moment, though, can it? So that'd be a way off I suppose.

Easier to copy the compression flag into one bit of stencil?

Jawed
 
D3D10 can't support multiple views on the same resource at the moment, though, can it? So that'd be a way off I suppose.
It can support multiple surface views. The limitation is binding them to the pipe as input and output simultaneously, which isn't supported (for obvious reasons).
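
For what it's worth, here's a minimal C++ sketch of what that looks like with the public D3D10 API: a typeless MSAA texture with two views created over the same resource, one render-target view and one shader-resource view. The dimensions and formats are just placeholder values, and error handling is omitted.

Code:
#include <d3d10.h>

// Sketch only: two views over the same D3D10 resource. Assumes a valid
// ID3D10Device*; formats/dimensions are arbitrary and errors are ignored.
void CreateTwoViews(ID3D10Device* device)
{
    ID3D10Texture2D* tex = NULL;
    D3D10_TEXTURE2D_DESC td = {};
    td.Width = 1280;  td.Height = 720;
    td.MipLevels = 1; td.ArraySize = 1;
    td.Format = DXGI_FORMAT_R8G8B8A8_TYPELESS;   // typeless so each view can re-type it
    td.SampleDesc.Count = 4;                     // 4x MSAA
    td.Usage = D3D10_USAGE_DEFAULT;
    td.BindFlags = D3D10_BIND_RENDER_TARGET | D3D10_BIND_SHADER_RESOURCE;
    device->CreateTexture2D(&td, NULL, &tex);

    // View 1: render-target view, used while the buffer is being written.
    D3D10_RENDER_TARGET_VIEW_DESC rtvd = {};
    rtvd.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    rtvd.ViewDimension = D3D10_RTV_DIMENSION_TEXTURE2DMS;
    ID3D10RenderTargetView* rtv = NULL;
    device->CreateRenderTargetView(tex, &rtvd, &rtv);

    // View 2: shader-resource view, used later to read the samples back in.
    D3D10_SHADER_RESOURCE_VIEW_DESC srvd = {};
    srvd.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    srvd.ViewDimension = D3D10_SRV_DIMENSION_TEXTURE2DMS;
    ID3D10ShaderResourceView* srv = NULL;
    device->CreateShaderResourceView(tex, &srvd, &srv);

    // Both views coexist; the restriction is only that rtv and srv can't be
    // bound to the pipeline as output and input at the same time.
}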
 
I haven't noticed any sign of this coming in D3D. Presumably it's right at the back of the queue behind all the other MSAA-related stuff.

Could this be realised as an alternate resource view? e.g. two concurrent views of an MSAA'd buffer: all samples as one view and compression flag as another view (1 bit per "element", whoopee!). D3D10 can't support multiple views on the same resource at the moment, though, can it? So that'd be a way off I suppose.
I don't see why it couldn't be exposed directly as an HLSL function, since querying MSAA sub-samples is also pretty "custom". Just a simple function that takes a location and returns a boolean.

That said, upon further reflection (since our last discussion of this), I think checking whether the sub-sample depths are equal is quite sufficient for most cases (excepting perhaps some issues with EQUAL depth functions... not sure though). I don't recall whether one can create a multi-sampled depth buffer resource view, though, so it may be necessary to write out a second buffer to store the depths. That said, since this functionality is arguably the most useful for deferred rendering, you may already have such a buffer sitting around (position, view-space z, etc.).
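
To make the idea concrete, here's a small C++ sketch of the per-pixel test I mean, written as plain CPU code rather than shader code; the sample count, depth layout and epsilon are all made up for illustration.

Code:
#include <cmath>

// Illustrative CPU-side version of the test: look at each sub-sample's stored
// depth (or view-space z from the G-buffer) and fall back to per-sample work
// only when they differ. Hypothetical layout; not real shader code.
bool pixelNeedsPerSampleShading(const float* subSampleDepth, int sampleCount,
                                float epsilon = 1e-6f)
{
    for (int s = 1; s < sampleCount; ++s)
        if (std::fabs(subSampleDepth[s] - subSampleDepth[0]) > epsilon)
            return true;    // depths differ: edge pixel, shade every sample
    return false;           // all equal: interior pixel, shade once
}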
 
I thought about this a bit more, and it seems to me that compression cannot *guarantee* you that a pixel's samples are not identical when it is not compressed. This is because the compression works at a tile level - so it can guarantee that compressed pixels are actually identical, but it doesn't guarantee anything for non-compressed pixels. Doing things at a tile level is necessary because of common memory burst sizes.

Furthermore, it seems to me that if, for a reasonably sized tile, only *two* (or at least very few) samples over the entire tile were identical, then the compressed version of the tile might be larger than the uncompressed version of it! This is implementation dependent, of course, but it shouldn't be hard to see that it might happen.

In that case, you have two choices to give guaranteed information back to DirectX: Either you check for identical samples at reading time (expensive, extra mostly useless hardware and/or routing, might as well do it in the shader) or you allow the tile to take more memory than its uncompressed peak. The latter solution, of course, forces you to reserve more memory than otherwise necessary, so it's both useless and expensive (well, unless you had memory-burst-sized paging, but that's ridiculous).

The third option, of course, is to gently ask Microsoft never to add that feature and tell the game programmers to just do it in the shader. And then never think about it again. All AFAICT, of course...
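
As a toy illustration of that size argument (all numbers hypothetical, real tile formats differ): suppose a scheme that stores a 1-bit flag per pixel plus either one sample or all of them.

Code:
#include <cstdio>

// Toy arithmetic only: hypothetical tile/compression parameters, not a real
// hardware format. Shows how per-pixel flags become pure overhead when almost
// nothing in the tile is compressible.
int main()
{
    const int pixels          = 8 * 8;       // assumed tile size
    const int samplesPerPixel = 4;           // 4x MSAA
    const int bytesPerSample  = 4;           // 32-bit colour
    const int flagBytes       = pixels / 8;  // 1 flag bit per pixel

    const int uncompressed = pixels * samplesPerPixel * bytesPerSample;  // 1024

    // If no pixel (or almost none) is compressible, every sample gets stored
    // anyway and the flags push the "compressed" tile past the raw size.
    const int compressedWorst = uncompressed + flagBytes;                // 1032

    std::printf("uncompressed: %d bytes, 'compressed' worst case: %d bytes\n",
                uncompressed, compressedWorst);
    return 0;
}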
 
That said, upon further reflection (since our last discussion of this), I think checking whether the sub-sample depths are equal is quite sufficient for most cases (excepting perhaps some issues with EQUAL depth functions... not sure though).
Umh... since depth is the only supersampled information that you get while using multisampling, I think you really don't want to check it to understand whether your pixel has been 'compressed' or not. But maybe you were referring to something else...
 
I thought about this a bit more, and it seems to me that compression cannot *guarantee* you that a pixel's samples are not identical when it is not compressed. This is because the compression works at a tile level
I call it compression, but I never referred to it as tile compression, only as pixel compression.
I don't even know if latest-gen GPUs could, in theory, easily determine if a pixel is fully 'compressed' or not.

Furthermore, It seems to me that if, for a reasonably sized tile, only *two* (or at least very few) samples over the entire tile were identical, then the compressed version of the tile might be larger than the uncompressed version of it! This is implementation dependent, of course, but it shouldn't be hard to see that it might happen.
Yes, if you only have a few samples per pixel, retrieving such a compression flag probably wouldn't be any faster than manually reading and checking the samples, but if you have many subsamples per pixel (8 or more in the future) you really don't want to sample all of them and check if they're all equal.
 
Oh and there's the ATI patent application for uncompressed, partially compressed or fully compressed MSAA samples. Just to muddy the waters on the meaning of "compressed" even further :smile:

Jawed
 
Umh..since depth is the only supersampled information that you get while using multisample I think you really don't want to check it to understand if your pixel has been 'compressed' or not. But maybe you were referring to something else..
Right yeah, I guess it wouldn't work properly if reading from the depth buffer (which apparently you can't do right now anyways). However it should work properly with deferred rendering if you write out position/depth to the G-buffer, which you currently have to do anyways because of the aforementioned limitation about reading MSAA depth buffers (although Humus says this limitation is going away in DX 10.1 - will be interesting to see how the sampling interface is specified).

Anyway, you probably have/want to store position/depth when using deferred shading, so it should be sufficient to compare that single attribute for equality, no?
 
"Linked Adapters" is Microsoft's name for the OS bits required for SLI/Crossfire. Since with Vista the OS handles memory management and scheduling between multiple contexts, it's not possible for drivers to implement multi-GPU without OS involvement.

Slide 15 in this WinHEC slide deck is the best reference I can come up with at the moment.

So in other words, it's current technology, not something for the future.
 
Hi;

after reading this thread I'm wondering why everyone assumes that R700 is a two-chip architecture.

Why no four-chip architecture?

IMHO it would be far more logical to have a four-chip high-end implementation, a two-chip midrange implementation and a single-chip low-end/mainstream implementation of the same basic architecture.


Sounds rather 3dfx-ish to me.

If I remember correctly, back in the days of VSA-100, multi-chip solutions were criticised for not being able to compete on cost with larger single chip cards.

There will be a touch of irony if we now find the IHVs moving back to multi-chip solutions because they are more economically feasible to produce than the huge chips which have been needed to increase performance in recent times!
 
Sounds rather 3dfx-ish to me.

If I remember correctly, back in the days of VSA-100, multi-chip solutions were criticised for not being able to compete on cost with larger single chip cards.

There will be a touch of irony if we now find the IHVs moving back to multi-chip solutions because they are more economically feasible to produce than the huge chips which have been needed to increase performance in recent times!

How so? You can't scale clocks and transistor count endlessly, and given the way the complexity of these things goes up, a wall is imminent for monolithic designs, IMO, unless some fab revolution occurs. There's also the issue of yields... I'm not convinced that some huge 1-billion-transistor chip, made on the latest and greatest 55/45nm process, would have satisfactory yields, whilst two 500-million-transistor ones done on the already established (at that point) 65nm node would probably fare quite well on that front (yes, this is a simplified view).
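
A quick back-of-envelope sketch of that yield argument, using the standard Poisson defect model (yield ≈ exp(−defect density × die area)); the defect density and die areas below are illustrative guesses, not real process data.

Code:
#include <cmath>
#include <cstdio>

// Back-of-envelope Poisson yield model: yield = exp(-defect_density * area).
// All numbers are illustrative assumptions, not real foundry data.
int main()
{
    const double defectsPerCm2 = 0.5;  // assumed defect density
    const double bigDieCm2     = 4.0;  // one ~1B-transistor die (hypothetical)
    const double smallDieCm2   = 2.0;  // each of two ~500M-transistor dies

    const double yieldBig   = std::exp(-defectsPerCm2 * bigDieCm2);   // ~13.5%
    const double yieldSmall = std::exp(-defectsPerCm2 * smallDieCm2); // ~36.8%

    std::printf("big die yield:   %4.1f%%\n", 100.0 * yieldBig);
    std::printf("small die yield: %4.1f%%\n", 100.0 * yieldSmall);
    // Two good small dies are needed per high-end board, but each yields far
    // better and more candidates fit per wafer, which is the cost argument.
    return 0;
}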

I'm sure that both players are exploring the multi-chip path, but when they'll actually walk it is another issue.
 
I agree with your points entirely. However, it doesn't reduce the irony that 3dfx went out of business in part because they opted for a multi-chip solution but the current IHVs are moving back towards this choice to stay in business. :smile:
 