AMD: R7xx Speculation

Bouncing Zabaglione Bros. · Feb 21, 2008

Mariner said:
To be cynical, if NVidia plan to continue to compete at the high-end with large monolithic GPUs versus AMD's dual-chips then it isn't in their best interests to push development of AFR in TWIMTBP games at the present time.

Such a stance would, in theory at least, benefit NVidia in the short-term as the dual-GPU solutions are still very much in the minority.

Personally, I think NVidia are hard-nosed enough to hold back from assisting developers too much with multi-GPU solutions for this reason. This kind of attitude is why they have such a successful business.

Just a speculation.

Nvidia is still going to want X2 or SLI type solutions for their ultra-high-end, grab-the-benchmarks products. If those give only single-card performance, it won't look very good for Nvidia either.

Jawed · Feb 21, 2008

DegustatoR said:
That makes me question if there is some fundamental incompatibility between AFR (or MGPU rendering in general) and newer DX10 games -- some methods and algorithms maybe that make AFR/SFR rendering quite ineffective?

http://www.hexus.net/content/item.php?item=11874&page=2

Multi-GPU scaling works very well with DirectX9 code but is rather more difficult to program for in a DX10 environment, we were told.

Later in the article:

Crysis suffers from two major problems with respect to multi-GPU scaling, as far as ATI is concerned. The engine doesn't lend itself well to queuing up the necessary frames and DX10 scaling is inherently poorer than DX9. These two factors combine to diminish the returns.

Now, Crysis, whatever it is, is certainly not a D3D10 title in ground-up terms - but the message seems loud and clear regardless: there's something innately useless about D3D10 and multi-GPU, which is ironic as 3 and 4 ATI GPU configurations appear to be a Vista-only proposition.

Jawed

nicolasb · Feb 21, 2008

DegustatoR said:
That makes me question if there is some fundamental incompatibility between AFR (or MGPU rendering in general) and newer DX10 games -- some methods and algorithms maybe that make AFR/SFR rendering quite ineffective?

From Anandtech (link):

Due to the state of AMD's driver optimizations DX10 games currently only scale well to 3 GPUs and not much beyond (Crysis/Bioshock), while DX9 games will generally scale better all the way up to 4 GPUs. We expected the opposite to be true but AMD provided us with technical insight as to why it is the case:

"The biggest issue is DX10 has a lot more opportunities for persistent resources (resources rendered or updated in one frame and then read in subsequent frames). In DX9 we only had to handle texture render targets, which we have a good handle on in the DX10 driver. In addition to texture render targets DX10 allows an application to render to IBs and VBs using stream out from the GS or as a traditional render target. An application can also update any resource with a copy blt operation, but in DX9 copy blt operations were restricted to offscreen planes and render targets. This additional flexibility makes it harder to maximize performance without impacting quality.
Another area that creates issues is constant buffers, which is new for DX10. Some applications update dynamic constant buffers every frame while other apps update them less frequently. So again we have to find the right balance that generally works for quality without impacting performance.
We are also seeing new software bottlenecks in DX10 that we continue to work through. These software bottlenecks are sometimes caused by interactions with the OS and the Vista driver model that did not exist for DX9, most likely due to the limited feature set. Software bottlenecks impact our multi-GPU performance more than single GPU and can be a contributing factor to limited scaling.
We’re continuing to push hard to find the right solution to each challenge and boost performance and scalability wherever we can. As you can see, there are a lot of things that factor in."
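
A toy model may help show why persistent resources hurt AFR in particular. This is only a sketch under stated assumptions, not a description of AMD's driver: frames are handed out round-robin, every frame costs the same, and a frame that reads a resource written by the previous frame has to wait for that frame to finish (plus an optional copy cost) before it can continue. The names `afr_speedup`, `dep_fraction` and `transfer` are hypothetical.

```python
def afr_speedup(num_gpus, frames=1000, frame_time=1.0,
                dep_fraction=None, transfer=0.0):
    """Estimate AFR speedup over a single GPU in a toy scheduling model.

    dep_fraction: None means frames are independent. Otherwise it is the
    fraction of a frame's work that can be done before the frame needs a
    resource produced by the previous frame (0.0 = needed immediately).
    transfer: assumed time to copy that resource to the other GPU.
    """
    gpu_free = [0.0] * num_gpus   # time at which each GPU becomes idle
    prev_finish = 0.0             # finish time of the previous frame
    for i in range(frames):
        gpu = i % num_gpus        # alternate frame rendering: round-robin
        start = gpu_free[gpu]
        if dep_fraction is None:
            finish = start + frame_time
        else:
            # Work up to the dependency, then wait for the previous frame.
            ready = max(start + dep_fraction * frame_time,
                        prev_finish + transfer)
            finish = ready + (1.0 - dep_fraction) * frame_time
        gpu_free[gpu] = finish
        prev_finish = finish
    return frames * frame_time / prev_finish


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n,
              round(afr_speedup(n), 2),
              round(afr_speedup(n, dep_fraction=0.5), 2),
              round(afr_speedup(n, dep_fraction=0.0), 2))
```

In this model independent frames scale linearly with GPU count, a dependency needed at the very start of each frame serialises everything, and a dependency halfway through a frame caps the speedup at just under 2 no matter how many GPUs are added. Real drivers have more options (copying resources early, duplicating work), so the numbers are illustrative only.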

Jawed · Feb 21, 2008

nicolasb said:
From Anandtech (link):

Thanks for the extra detail.

So it would seem that multi-GPU could only be of practical use if the multiple GPUs function as a single logical GPU, with a single memory space and no AFR nonsense.

Jawed

DegustatoR · Feb 21, 2008

Jawed said:
Now, Crysis, whatever it is, is cer...Us via something other than todays AFR/SFR...

Jawed · Feb 21, 2008

DegustatoR said:
The question is -- is it a DX10 API-specific problem (and we can expect to see a 'fix' in DX11) or is it a more general problem with some rendering algorithm (motion blur?) which is widely used in new games and will be used even more in the future?

D3D10, according to the text from Anandtech. The flexibility it affords developers appears to be the source of the problem.

I don't see how the problem can be fixed in the API. As time goes by the API will provide more and more flexibility.

If it's the latter then I don't see how AMD can seriously be planning to use only MGPU solutions in the high end in the future... Unless they find a way to combine GPUs via something other than today's AFR/SFR...

That's the nub and I can't see how R780 will be accepted unless it solves this problem directly.

What mystifies me is that ATI isn't using the obvious "high AA" scenario with Crysis - where each GPU renders the same frame but with 2xMSAA to create a 4xMSAA result. In theory this is a get out of jail free card, because it isn't AFR-based - and anyone who's remotely enthusiastic about gaming graphics would like to be able to turn on MSAA if possible.

So, why has MSAA-scaling become a non-subject?
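
The "2xMSAA per GPU, 4xMSAA result" idea above can be sketched in a few lines. This is a minimal illustration under assumptions, not any vendor's implementation: both GPUs render the same frame, each with its own half of a 4-sample pattern, and the two resolved images are averaged. The sample positions, the toy scene in `covered`, and all function names are hypothetical.

```python
# Illustrative 4-sample pattern (offsets inside a pixel), split between GPUs.
PATTERN_4X = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]
GPU0_SAMPLES = PATTERN_4X[0::2]   # two samples rendered by the first GPU
GPU1_SAMPLES = PATTERN_4X[1::2]   # the other two, rendered by the second GPU


def covered(x, y):
    """Toy scene: everything below the edge y = 0.5 * x + 0.3 is white."""
    return 1.0 if y < 0.5 * x + 0.3 else 0.0


def resolve(px, py, samples):
    """Average the coverage of all sample points inside pixel (px, py)."""
    return sum(covered(px + sx, py + sy) for sx, sy in samples) / len(samples)


def render(width, height, samples):
    """Render the toy scene with the given per-pixel sample pattern."""
    return [[resolve(x, y, samples) for x in range(width)]
            for y in range(height)]


def blend(img_a, img_b):
    """Combine two resolved images with equal weight."""
    return [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]


if __name__ == "__main__":
    combined = blend(render(8, 8, GPU0_SAMPLES), render(8, 8, GPU1_SAMPLES))
    reference = render(8, 8, PATTERN_4X)
    print(combined == reference)
```

Because both GPUs work on the same frame, there is no inter-frame dependency to manage; the cost is that geometry and shading work is duplicated on each GPU, so only the per-sample work is shared out.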

Jawed

Ailuros · Feb 21, 2008

I truly often wonder as a layman if (whenever IHVs truly come close to hitting a manufacturing process wall) the ultimate solution is truly multi-chip and not some sort of multi-core on a single die. Apart from all the software problems (applications not coping well with the load balancing methods used), the driver resources spent to overcome those problems etc., there's always the redundancy problem that is bound to such solutions.

DegustatoR · Feb 21, 2008

Ailuros said:
I truly often wonder as a layman if (whenever IHVs truly come close to hitting a manufacturing process wall) the ultimate solution is truly multi-chip and not some sort of multi-core on a single die.

A process wall means that you can't increase the complexity of a chip anymore since you're already hitting the TDP barrier.
From that POV any solution beyond this 'wall' should have its own cooler -- so we're talking about a multi-card system instead of multi-GPU-on-a-card or multi-GPU-in-a-package here.
This is another reason why I think that AMD is wrong about multi-GPU right now -- in the end the fastest multi-GPU solution will always be the one with the fastest single GPU.

ShaidarHaran · Feb 21, 2008

DegustatoR said:
A process wall means that you can't increase the complexity of a chip anymore since you're already hitting the TDP barrier.
From that POV any solution beyond this 'wall' should have its own cooler -- so we're talking about a multi-card system instead of multi-GPU-on-a-card or multi-GPU-in-a-package here.
This is another reason why I think that AMD is wrong about multi-GPU right now -- in the end the fastest multi-GPU solution will always be the one with the fastest single GPU.

By the same logic the fastest multi-CPU system would always be the one with the fastest discrete single CPUs, however we know this not to be true.

Silent_Buddha · Feb 21, 2008

DegustatoR said:
A process wall means that you can't increase the complexity of a chip anymore since you're already hitting the TDP barrier.
From that POV any solution beyond this 'wall' should have its own cooler -- so we're talking about a multi-card system instead of multi-GPU-on-a-card or multi-GPU-in-a-package here.
This is another reason why I think that AMD is wrong about multi-GPU right now -- in the end the fastest multi-GPU solution will always be the one with the fastest single GPU.

Unless, of course, they find a way to make multiple GPUs interact as if they were a single monolithic GPU. Without all the kludges that are needed for AFR (bleh), SFR (slightly less bleh) or any other kludge that has multiple GPUs render alternate frames or separate parts of frames.

That ideally is the future of high end graphics once you hit the wall for monolithic GPUs.

You can be sure Nvidia is also putting significant cash into R&D of this same issue. However, unlike AMD/ATI they have a much larger pool of cash to fund R&D and can afford to continue investment in R&D of monolithic GPUs while also funding R&D for multi-GPU.

AMD/ATI on the other hand "appears" to be putting everything into R&D of multi-GPU.

And currently ATI has the upper hand in this (multi-monitor with multi-GPU rendering being the latest "breakthrough").

We'll see what Nvidia unveils when they release their dual GPU on a card interpretation and if they also have some major advances in multi-GPU rendering. Cross fingers for something good.

Regards,
SB

DegustatoR · Feb 21, 2008

ShaidarHaran said:
By the same logic the fastest multi-CPU system would always be the one with the fastest discrete single CPUs, however we know this not to be true.

This is true for the same CPU platform and architecture.

Silent_Buddha said:
Unless, of course, they find a way to make multiple GPUs interact as if they were a single monolithic GPU. Without all the kludges that are needed for AFR (bleh), SFR (slightly less bleh) or any other kludge that has multiple GPUs render alternate frames or separate parts of frames.

Still the same rule applies -- the system with the fastest single GPU will be the fastest multi-GPU system too.

Kaotik · Feb 21, 2008

DegustatoR said:
This is true for the same CPU platform and architecture.

Still the same rule applies -- the system with the fastest single GPU will be the fastest multi-GPU system too.

You completely disregard the fact that not all architectures/chips/technologies scale equally well when you start putting more and more chips to work together, be it SFR or AFR.

MfA · Feb 21, 2008

Apart from AFR there are no ways of going multichip with trivial overhead.

You can do sort first, which pushes your transformed geometry through memory (tiling). You can do sort middle, which pushes your transformed geometry through chip interconnect (closest to how parallelism works internally). You can do sort last, which means you need one framebuffer per chip, combining them at the end.

If you control the entire pipeline you can do sort first without going through memory, but that's not the case for PCs.
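
As a rough illustration of the sort-last option described above (one framebuffer per chip, combined at the end), here is a minimal sketch. It assumes opaque geometry only and uses axis-aligned rectangles as stand-in primitives; `render`, `composite` and the `SCENE` data are hypothetical names invented for the example.

```python
import math

FAR = math.inf  # depth value meaning "nothing drawn here yet"


def render(rects, width, height):
    """Rasterise rectangles (x0, y0, x1, y1, depth, colour) with a depth test.

    Returns a (colour buffer, depth buffer) pair, as one chip would produce.
    """
    colour = [[0] * width for _ in range(height)]
    depth = [[FAR] * width for _ in range(height)]
    for x0, y0, x1, y1, z, c in rects:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                if z < depth[y][x]:      # nearer fragment wins
                    depth[y][x] = z
                    colour[y][x] = c
    return colour, depth


def composite(buffers, width, height):
    """Sort-last merge: per pixel, keep the nearest fragment of any chip."""
    colour = [[0] * width for _ in range(height)]
    depth = [[FAR] * width for _ in range(height)]
    for c_buf, z_buf in buffers:
        for y in range(height):
            for x in range(width):
                if z_buf[y][x] < depth[y][x]:
                    depth[y][x] = z_buf[y][x]
                    colour[y][x] = c_buf[y][x]
    return colour, depth


# Four overlapping rectangles with distinct depths (illustrative data).
SCENE = [(0, 0, 6, 6, 5.0, 1), (2, 2, 8, 8, 3.0, 2),
         (4, 0, 8, 4, 7.0, 3), (1, 5, 5, 8, 1.0, 4)]

if __name__ == "__main__":
    W, H = 8, 8
    per_chip = [render(SCENE[0::2], W, H), render(SCENE[1::2], W, H)]
    print(composite(per_chip, W, H) == render(SCENE, W, H))
```

The merge step touches every pixel of every chip's colour and depth buffer, which is the overhead of this approach: full framebuffers have to cross the interconnect each frame. The sketch also ignores blending, where draw order matters and a simple nearest-depth merge is not enough.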

Ailuros · Feb 22, 2008

DegustatoR said:
A process wall means that you can't increase the complexity of a chip anymore since you're already hitting the TDP barrier.
From that POV any solution beyond this 'wall' should have its own cooler -- so we're talking about a multi-card system instead of multi-GPU-on-a-card or multi-GPU-in-a-package here.
This is another reason why I think that AMD is wrong about multi-GPU right now -- in the end the fastest multi-GPU solution will always be the one with the fastest single GPU.

My reasoning for a multi-core vs. a multi-chip solution is that in the first case it sounds a lot easier to remove redundancy. A single high-end chip might be X times faster than a mainstream variant of the same family, yet it doesn't have X times more units across the chip either.

Silent_Buddha · Feb 22, 2008

DegustatoR said:
Still the same rule applies -- the system with the fastest single GPU will be the fastest multi-GPU system too.

Not really, if at some point in the future you hit a manufacturing wall and the most transistors you can fit on a chip are say 20 billion (as an example)...

Yet your competition has been developing multi-chip for the past few years. You've been able to beat him up until that point because one monolithic chip is "generally" more efficient than multiple cores/GPUs that have to communicate with each other.

And let's say that other GPU company is able to make a system using 4x10 billion transistor chips.

Granted, by doing so it's not as efficient, so even though it has 2x the transistors it's only 1.5x faster.

As a monolithic chip, you've hit your wall and there isn't a bloody thing you can do about it. So you end up losing to the other vendor. Sure, you could tie yours together using AFR or something else kludgy. However, if you haven't been doing at least as much R&D as they have and haven't come up with more elegant solutions, your 2x monolithic chip is likely going to lose to the other solution.

However, with a system in place to scale up multiple chips you can continue to scale as much as needed, as long as it doesn't outpace the consumer's willingness to spend that much on it. And thus you can continue to add more components to the system you've devised and continue to beat the company that didn't put as much R&D into multi-chip.

We're most likely still years away from such an event, however it's evident that AMD/ATI, Intel, and Nvidia all see that wall coming at some point.

Does that mean multi-chip working as a monolithic chip is going to beat a monolithic chip with the same amount of transistors? Unlikely. But while one company can afford to R&D both the other can't. And if the future is multi-chip, you'd better start or have started R&D already as development of a working system could easily take half a decade.

And if you are relying on AFR for the future...you've already lost.

Regards,
SB

DegustatoR · Feb 22, 2008

Ailuros said:
My reasoning for a multi-core vs. a multi-chip solution is that in the first case it sounds a lot easier to remove redundancy. A single high-end chip might be X times faster than a mainstream variant of the same family, yet it doesn't have X times more units across the chip either.

What is multi-core? Every single one of today's GPUs can be considered multi-core. When we're talking about multi-GPU we're talking about multi-chip in one package (a la Xenos, NV45, Yorkfield etc) or multi-chip on one board (X2, Voodoo 5, Rage MAXX etc). Multiple GPU cores in one _chip_ is pretty pointless since you're making essentially the same die as one monolithic GPU (from the cost and R&D point of view) but you're leaving a lot of redundancy in such a chip, which will eventually make it much slower than a proper single chip of the same transistor count.
Redundancy of any multi-GPU design is another issue which hasn't been widely discussed, I believe. Look at the 3870 X2: it's a card with two ~700M transistor chips (~1400M combined) made on a 55nm process and yet it's very close in performance to a card with one ~700M transistor chip made on 90nm more than a year before the X2.
If you ask me that's a VERY ineffective design, which makes me doubt this "multi-GPU solution can bring tomorrow's performance today" argument. From the looks of it any multi-GPU solution needs a better process to be able to compete with the latest monolithic GPU at all. So if RV770 (whether it'll use AFR or something better doesn't matter here, since we're talking about the best case for the X2: a nearly twofold increase in performance in CF mode) is made on 55nm then I have severe doubts that any multi-GPU RV770-based solution will be able to compete with G100, which is supposedly being made on 55nm also. I'm afraid that RV770 in a multi-GPU configuration needs to go to 45nm or even further to compete with G100.
(That's assuming that all the latest rumours about G100 and RV770 are true of course.)

Silent_Buddha said:
Not really, if at some point in the future you hit a manufacturing wall and the most transistors you can fit on a chip are say 20 billion (as an example)...

Yet your competition has been developing multi-chip for the past few years. You've been able to beat him up until that point because one monolithic chip is "generally" more efficient than multiple cores/GPUs that have to communicate with each other.

This is exactly what I'm saying -- up until we hit a process wall any multi-GPU solution will be less attractive than a single-chip solution that is comparable in any way (complexity, performance, TDP etc).
But once we hit the wall and can't increase the performance of a single chip anymore, the fastest multi-chip will still be based on the fastest single chip.
Now what 'competition' are you talking about? AFAIK NV has been R&Ding multi-GPU solutions far longer than ATI/AMD. If they are doing a single-GPU top end right now that doesn't mean that they don't have a multi-GPU solution in R&D for 'after the wall' times. It's just that TODAY single GPUs are still the best way to go since this 'wall' hasn't been hit yet.
All in all what I'm saying is that ATI is too early with this multi-GPU switch (again, as it was with the ALU:TMU ratio) and while I do understand their reasoning (it allows them to save some money by not developing a top-end chip and replacing it with a multi-GPU solution) I believe that they're playing a dangerous game here since they're leaving their high-end sector unprotected from a highly possible monolithic GPU retaliation by NV.
Another thing to consider is that mid-range and high-end GPUs have to be balanced differently, and using mid-range GPUs for a high-end solution (or vice versa) isn't that great since you're losing some percentage of effectiveness here.
Well, guess we'll have to wait and see. AMD seems to be pretty confident in their multi-GPU-for-high-end approach. NV has been quiet for the last year. Maybe they're considering going MGPU too, since there really is no reason for them to try harder now after AMD has chosen this path. The 9800GX2 instead of a new single-GPU solution at the end of 2007 is a confirmation of this, I'm afraid =(

Arun · Feb 22, 2008

I think what people kinda mean with the whole multi-core mumbo jumbo in GPUs is that it's easy for the hardware designers to scale it. A CPU like Barcelona that's designed for 4 cores with a shared L3 cache should be incredibly easy to scale to 8 cores as long as the cache & memory controllers have been designed with an arbitrary number of cores in mind.

I don't think many GPUs in the industry's history were like that. Many derivatives either weren't on the same process node or had changes here or there. The whole notion of a "family" in the GPU world is more about the (majority of the) RTL than the synthesis or the verification phases. This also leads to necessary differences in the drivers. There might be a few exceptions, but that's not the point.

Certainly you might be able to improve this and reduce your per-chip costs that way, making it more attractive to go single-chip at the high-end and also release more low-end derivatives to target more price points 'natively'. It's not a proven strategy, but I guess it could work. Is that what you're thinking of, Ailuros? Also, regarding a transistor count wall: that's senseless, we'll hit a power wall way before then, most likely.

mczak · Feb 22, 2008

Ailuros said:
Further down STALKER which uses MRTs afaik (not the most advanced engine around anymore either heh), seems to prove your point; it's just unfortunate that AA cannot be enabled in that application:

http://www.computerbase.de/artikel/...ti_radeon_hd_3870_rv670/21/#abschnitt_stalker

It's not just STALKER either which exhibits a large hit when only 16xAF (but not AA) is applied. There are now lots of reviews of the 9600GT, and I had to read countless times how the 9600GT has a much smaller AA hit without any evidence whatsoever that it wasn't in fact a much smaller AF hit (argh) (though, arguably, the 9600GT should indeed have a very small AA hit, smaller than the 8800GT).
Some numbers from the computerbase 9600GT review (http://www.computerbase.de/artikel/hardware/grafikkarten/2008/test_nvidia_geforce_9600_gt_sli/) for games that don't support AA, so only 16xAF was enabled (this should even be without HQAF btw):
Gothic 3 1600x1200 no AF: 9600GT 15% faster than 3850 512MB
Gothic 3 1600x1200 16xAF: 9600GT 63% faster than 3850 512MB
(the 9600GT has pretty much free 16xAF here, the 3850 certainly has not...)
Stalker 1600x1200 no AF: 9600GT 24% faster than 3850 512MB
Stalker 1600x1200 16xAF: 9600GT 52% faster than 3850 512MB
UT3 1600x1200 no AF: 9600GT 4% slower than 3850 512MB
UT3 1600x1200 16xAF: 9600GT 4% faster than 3850 512MB
(not a huge difference this time but enough to swap rankings of the cards)
Bioshock 1600x1200 no AF: 9600GT 2% faster than 3850 512MB
Bioshock 1600x1200 16xAF: 9600GT 5% faster than 3850 512MB
(not really a difference here, this game does not appear to be texture limited at all since both cards essentially get 16xAF for free)
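
The percentages above can be turned into a rough "relative AF cost" figure. This is only a back-of-the-envelope sketch: if the 9600GT leads by a% without AF and by b% with 16xAF, then the 3850's AF-on/AF-off frame-rate ratio divided by the 9600GT's is (1 + a/100) / (1 + b/100). The function name `relative_af_retention` is made up for the example, and the inputs are just the leads quoted in the list.

```python
# (game, 9600GT lead without AF in %, 9600GT lead with 16xAF in %),
# both relative to the HD 3850 512MB at 1600x1200, as listed above.
RESULTS = [
    ("Gothic 3", 15, 63),
    ("Stalker", 24, 52),
    ("UT3", -4, 4),
    ("Bioshock", 2, 5),
]


def relative_af_retention(lead_no_af, lead_af):
    """How much of its frame rate the 3850 keeps with AF, relative to the 9600GT.

    1.0 means both cards take the same proportional AF hit; lower values
    mean the 3850 loses proportionally more when 16xAF is enabled.
    """
    return (1 + lead_no_af / 100) / (1 + lead_af / 100)


if __name__ == "__main__":
    for game, no_af, af in RESULTS:
        print(f"{game:10s} {relative_af_retention(no_af, af):.2f}")
```

For Gothic 3 this gives about 0.71: if 16xAF really is close to free on the 9600GT there, the 3850 keeps only around 71% of its no-AF frame rate. The same calculation gives roughly 0.82 for Stalker, 0.92 for UT3 and 0.97 for Bioshock. Note that this says nothing about either card's absolute AF hit, only the difference between them.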

So if the rv770 indeed somehow has more texture units this should help it be more competitive in some games where rv670 currently isn't really (at least not with the quality options AA/AF enabled). If that happens, I already bet that everybody will write "AA is fixed!" when in reality it's just as fast as ever but the AF hit got smaller... That's not to say though that the performance hit from enabling AA might not get smaller indeed, if ROPs get updated or memory bandwidth increased.

DegustatoR · Feb 22, 2008

Arun said:
Also, regarding a transistor count wall: that's senseless, we'll hit a power wall way before then, most likely.

It's the same wall actually =)

DegustatoR · Feb 22, 2008

mczak said:
There are now lots of reviews of the 9600GT

It's OT, but what bugs me the most about G94GT is how close it is to G92GT in real-world apps.
I can't explain it but I suspect that by going with a 256-bit bus in G92 NV has made a mistake, and G92GT is heavily b/w limited right now.
This also gives me a reason to suspect that the G92-based 9800GTX may use GDDR4/5 memory b/c that would be the area which, if improved, would give a G92-based card the most benefit.
Maybe I'm wrong however =)
