Prediction: In a year, NVIDIA buys the combined AMD/ATI

Now, what happens with gaming?

Hecifino. You've seen, I presume, discussion of how DX10 is a much more uniform feature set, and that the IHVs will differentiate by performance?

Well, that still requires developer support, one presumes. Are developers going to be shoveling more vertices at ATI if NV can't handle them? Or, as they've so often done, will they look at both cards and say "okay, lowest common denominator it is -- these from column green, and those from column red -- GO!"

I'm also unsure where, if at all, the PCIe bus gets saturated with vertex data -- before or after ATI's VS capability is saturated?

Well, you might say, if that is a limiting factor, couldn't they step around it a bit with the new GS and amplify vertices inside the card? Except that, as we already saw in another thread, the API puts a real limit on that route as well.

So, still a goodly number of unanswered questions on those points of what happens when the theoretical steps out of the shadows into the sometimes cruel sunshine of reality. Interesting times ahead (dunno about the rest of you, but that's what I love about this industry).
 
trinibwoy said:
Do you know something about G80 that you're not sharing? :smile: Or is the concept of unified shaders so romantic that it just "has" to be more efficient?
It is more efficient (more of the GPU runs at full utilisation - the goal of PU design for performance is that as much as possible of the entire PU is running continuously: CPU or GPU). This is what you get from simulations, and trivial thought experiments show the same. The only questionable matter is what degree of efficiency gain unification brings: 10, 50, 100%? And how does that scale with the overall size of the GPU (e.g. do value GPUs gain more than enthusiast GPUs, or vice versa)? Also, are the control mechanisms for unification (and the changes in GPU behaviour from the devs' point of view) worth it?

In my view the overhead for unification scales more slowly than the overall scale of the GPU as you add more and more pipelines. This is because it's a memory-intensive architecture and memory's denser than logic (and much easier to make redundant). The start-up cost (die area) is high but there's significantly higher performance.

R520's pixel shader pipelines can easily run 20% faster than R4xx after accounting for their theoretical clock differences. This is due to the combination of better caching, better memory interfacing and out-of-order threading. The latter, in R580, can show further substantial gains (which indicate that the TMUs in R520 are often not running at full utilisation). All in all, merely running threads out of order provides such a massive boost in efficiency (30%+?) that the gains to be had when you mix in VS and GS threads for out of order scheduling look to be significant. 50% plus...
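To make the out-of-order point concrete, here's a trivial toy model (Python) of the kind of thought experiment I mean -- the latency, burst lengths and thread counts are all invented for illustration, it's not a model of any real GPU:

Code:
# Toy model: ALU utilisation with and without out-of-order thread issue.
# All numbers (latency, burst length, fetch count) are invented for
# illustration; this is not a model of any real GPU.

TEX_LATENCY = 100   # cycles before a texture fetch returns
ALU_BURST   = 20    # ALU cycles of work between fetches
FETCHES     = 4     # texture fetches per thread

def in_order(num_threads):
    """Threads issue strictly in order: the ALU idles for every fetch."""
    busy  = num_threads * FETCHES * ALU_BURST
    total = num_threads * FETCHES * (ALU_BURST + TEX_LATENCY)
    return busy / total

def out_of_order(num_threads):
    """Scheduler may switch to any thread whose data is ready."""
    # While one thread waits, the other (num_threads - 1) threads each have
    # an ALU burst available to cover the latency.
    covered = (num_threads - 1) * ALU_BURST
    idle_per_fetch = max(0, TEX_LATENCY - covered)
    busy  = num_threads * FETCHES * ALU_BURST
    total = busy + num_threads * FETCHES * idle_per_fetch
    return busy / total

for n in (1, 2, 4, 8, 16):
    print(f"{n:3d} threads: in-order {in_order(n):6.1%}, "
          f"out-of-order {out_of_order(n):6.1%}")

With one thread the ALU idles for the whole fetch; with enough threads in flight the latency disappears behind other threads' work.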

And since ATI's unification is built up from the out-of-order threading hardware that is already within R580, the incremental cost for ATI to go to full unification is lower because the cost is shared with the concept of "performant dynamic branching" in pixel shaders. (Though dynamic branching performance in VS/GS will be hindered due to batching in ATI's unified architecture, as opposed to a traditional architecture that doesn't use batching for this kind of work. But ALU pipeline performance whilst running VS or GS either is rarely a performance consideration, or R600 will simply switch in all 64 ALU pipelines to ease away the bottleneck.)

Unification doesn't merely increase ALU utilisation (and therefore overall shader efficiency) but it also increases texture pipe (or vertex fetch pipe) utilisation by scheduling that work out of order too. Finally, by sharing these texture/vertex-fetch pipes globally across all functions (VS, GS, PS) their overheads (from being decoupled) are shared and so the total overhead is reduced.
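And as a trivial thought experiment on the shared-pool side (Python again, every number invented): give a split design 8 VS + 24 PS units and a unified design the same 32 units, then feed both a few frames whose vertex/pixel mix swings around. The split design is gated by whichever pool happens to be the bottleneck each frame; the unified pool just eats the total.

Code:
# Toy comparison of a split VS/PS design against a unified ALU pool.
# Unit counts and per-frame workloads are invented purely for illustration.

WORKLOADS = [          # (vertex work, pixel work) per frame, arbitrary units
    (10, 90),          # pixel-heavy frame
    (60, 40),          # geometry-heavy frame (shadow pass, say)
    (30, 70),
]

def split_time(vs_units=8, ps_units=24):
    """Fixed pools: each frame takes as long as its busier pool."""
    return sum(max(v / vs_units, p / ps_units) for v, p in WORKLOADS)

def unified_time(units=32):
    """One pool: all units chew through whatever work exists."""
    return sum((v + p) / units for v, p in WORKLOADS)

print("split  :", split_time())     # 15.0 time units
print("unified:", unified_time())   #  9.375 time units

The open question, as above, is what the arbitration and scheduling hardware needed to do that costs you in die area.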

I expect G80 to be partially unified in two different ways:
  • VS and GS shaders will run on the same ALU pipes
  • some (perhaps not all) texturing or vertex fetch (point-sampling) pipes will be shared between VS/GS and PS.
Those expectations are founded on the patents we've seen.

I don't get the first point either. Even when Nvidia had more features and a bigger chip (NV40) their margins were better than ATi's.
287mm² versus 282mm²? That's not a "bigger chip" in any meaningful way. I think any difference in margin would have been due to the process: 130nm low-k at TSMC was a "premium" node, for higher clocking.

I don't remember what kind of difference in margins there was 18 months to 2 years ago :oops:

Jawed
 
geo said:
Now, what happens with gaming?

Hecifino. You've seen, I presume, discussion of how DX10 is a much more uniform feature set, and that the IHVs will differentiate by performance?
They should learn to do so pretty quickly. If they don't have to test for caps (beyond checking for D3D10, D3D10.1 etc.) then they can streamline the options that determine performance.

Just because G80 runs at half the speed of R600, say, doesn't mean that devs will hold back work from R600. Valve set the trend there. Devs will be able to see which GPU is a better match for future algorithms and start there.

I'm also unsure where, if at all, the PCIe bus gets saturated with vertex data -- before or after ATI's VS capability is saturated?

Well, you might say, if that is a limiting factor, couldn't they step around it a bit with the new GS and amplify vertices inside the card? Except that, as we already saw in another thread, the API puts a real limit on that route as well.
GS, and D3D10 in general, is all about moving great wodges of work off the CPU. So you can do more for the same amount of data sent over PCI Express. WAY WAY MORE. It's not merely amplification you're doing, but complete rendering passes (for a single frame) that run entirely on the GPU - with the CPU merely supervising.

If you've played with ATI's R2VB demos, then that gives you an idea of the entry level. Those demos are all about running stuff on the GPU not across the CPU/GPU divide (with the attendant batch performance problems of DX9).
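On geo's PCIe question, a back-of-the-envelope (Python; every figure is a rough assumption, not a measurement -- roughly PCIe 1.x x16 bandwidth, a 32-byte vertex, and a guessed VS throughput):

Code:
# Rough back-of-the-envelope: vertices/sec the bus can deliver versus
# vertices/sec the vertex shaders might chew through.  All numbers are
# assumptions for illustration, not measurements of any real part.

PCIE_X16_BW   = 4e9     # bytes/sec, ~PCIe 1.x x16, one direction, theoretical
VERTEX_SIZE   = 32      # bytes: position + normal + a texcoord, say
VS_THROUGHPUT = 600e6   # vertices/sec a high-end VS array might sustain (guess)

bus_limit = PCIE_X16_BW / VERTEX_SIZE
print(f"bus-limited: {bus_limit / 1e6:.0f} M vertices/s")      # ~125 M/s
print(f"VS-limited : {VS_THROUGHPUT / 1e6:.0f} M vertices/s")  # ~600 M/s
# If the bus number is the smaller one, streaming raw vertex data saturates
# PCI Express long before the VS units do -- which is exactly why keeping the
# data resident on the card (R2VB, stream out, GS amplification) matters.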

Jawed
 
Jawed said:
It is more efficient (more of the GPU runs at full utilisation - the goal of PU design for performance is that as much as possible of the entire PU is running continuously: CPU or GPU).....Also, are the control mechanisms for unification (and the changes in GPU behaviour from the devs' point of view) worth it?

Well I think that's what geo was getting at. If you define efficiency that way yet use more transistors to get there, are you really improving efficiency? Also, if those transistors that you're keeping busy more of the time are slower at a particular task then what will performance look like in the end?

It seems that when we discuss unified we assume equality on certain levels and apply unification on top of that. I'm positive that it's not going to turn out that way.

287mm² versus 282mm²? That's not a "bigger chip" in any meaningful way. I think any difference in margin would have been due to the process: 130nm low-k at TSMC was a "premium" node, for higher clocking.

Wow were they that close in size? Wasn't there like a 60M transistor count difference!?
 
Jawed said:
In my view the overhead for unification scales more slowly than the overall scale of the GPU as you add more and more pipelines. This is because it's a memory-intensive architecture and memory's denser than logic (and much easier to make redundant). The start-up cost (die area) is high but there's significantly higher performance.

This might also be the core of the difference in technical decisions made by each company, and where to implement them. Much as ATI made a decision regarding where to spend the transistors attendant to what they felt it took to do SM3 right... NV could have decided that the cost balance pointed at above didn't look favorable to their eyes until further down the road.
 
Razor1 said:
I think the R600 will have fairly large transistor counts if it's based off of Xenos tech, which is about 330 mill transistors for the Xenos chip, not sure if that includes the eDRAM though. Anyways IMO I think it will end up around 480 mil along with the G80; depending on the process used, it will make a bit of difference on clocks as you said.

Xenos is only ~232M transistors for the parent die and ~105M for the eDRAM if my memory is correct. About 80-90M of the eDRAM transistors are for memory. Xenos also has 16 disabled shader ALUs for yield purposes. D3D10 will require integer and bitwise support, full Geometry Shader support (although the basics seem to be roughly in place on Xenos) and some other tweaks and caches. Xenos also lacks Avivo. But Xenos has the same essential threading and scheduling as the R580 series and really is not huge by any means. Part of this is because the shaders are more fine grained--1 robust ALU per shader, which goes along with the general theme of dynamic load balancing.

It is hard to imagine R600 only having 64 shader ALUs. It could happen, and maybe they will use a shader array similar to R580 where there are multiple ALUs per shader. But if ATI was able to squeeze in 64 shader ALUs on 90nm with ~250M logic transistors, even with all the additional D3D10 features and requirements it is hard to imagine ATI would be incapable of increasing the shader array over Xenos on 80nm/65nm. Hopefully ATI keeps the general shader performance envelope and adds more TMUs to R600; R580 has plenty of raw shader power, hopefully the focus shifts back to balancing out the rest of the design. Based on the Xenos die size and transistor count I don't see why they couldn't.
 
trinibwoy said:
Well I think that's what geo was getting at. If you define efficiency that way yet use more transistors to get there, are you really improving efficiency? Also, if those transistors that you're keeping busy more of the time are slower at a particular task then what will performance look like in the end?
Not all transistors are doing the same thing.

As I hinted earlier, memory transistors are more densely packed, and can be easily configured with extremely high redundancy (i.e. lead to a massive boost in yield per unit area for the die as a whole).

It's my understanding of the unified architecture that ATI's created that it uses a relatively large amount of memory (compared to traditional GPUs). This is partly because the number of threads in flight is very high (to increase the chances of hiding texture fetch latency) and is thus a feature of R580 too. It's also because a unified architecture needs to have a more involved "state machine" for thread scheduling (queue, thread status/prioritisation, arbitration between ALU pipelines, TEX/VFU pipelines, ROP pipelines and inter-stage buffers, not to mention the relative memory consumption of threads depending on whether they're VS, GS or PS threads and the dizzying wonder of allocating post-GS cache and constant buffers and how they might all contend for register file space). Not all of that is pure memory, obviously, there's extra logic required to control and interpret all that stuff.

So, if a unified architecture has ~ the same number of transistors as a traditional architecture (which has more pipes/ALU/TMUs in total) the radical differences in the usage of those transistors makes the comparison pretty much void.

You can only count absolute performance when the GPU is running flat out with all features firing. Obviously with DX7 GPUs that was relatively easy to test. Nowadays it seems pretty tricky because of the huge variation in possible workloads. G71 performs better than R580 at old-fashioned texture-heavy shaders and shaders with no dynamic branching and light register count.

I hope that more emphasis will be placed on performance minima in the future and I think that's where unified will show a clean pair of heels.

Wow were they that close in size? Wasn't there like a 60M transistor count difference!?
Yep, which is why transistor counting is so utterly pointless across IHVs. You have exactly the same problem in comparing R580 and G71, I think it's a 30% disparity.

Jawed
 
All of this is predicated on the assumption that DB performance is very important and that near-future DX10 workloads are going to make heavy use of it.
 
DemoCoder said:
All of this is predicated on the assumption that DB performance is very important and that near-future DX10 workloads are going to make heavy use of it.

To which my guess is no, based upon the craptastic dynamic branching performance of most SM3.0 GPUs (the very chips that will be the baseline for the near future, even for D3D10 "enabled" games). But then again, it only takes 1 killer app (like HL2 with SM2.0) to demonstrate the need and importance. That could create a bit of a crysis.
 
Forgetting the question of efficiency, there's still every chance that R600, being considerably "second gen", will have good margins. The major architectural stuff is now very well understood by ATI, so it's detailed features (such as bitwise operations or constant buffers) that will soak up some effort, while a lot of the rest will be spent making the whole thing slicker.

Maybe a bit like NV40->G70 (extra ALU capability in the pipeline) or G70->G71 (lower-cost pixel shader pipeline).

NVidia, as I keep saying, is climbing a much steeper mountain so it's going to be much harder for them to dot all the Is and cross all the Ts for maximum margins.

Jawed
 
Jawed said:
NVidia, as I keep saying, is climbing a much steeper mountain so it's going to be much harder for them to dot all the Is and cross all the Ts for maximum margins.

Jawed

Of course NV hasn't been building 2 console GPUs during the last 2 years, having migrated their PC architecture over instead, so maybe they have had the time and resources to do some G80 magic? Tall mountain, but maybe they have been head-faking us on what G80 is?
 
Acert93 said:
Of course NV hasn't been building 2 console GPUs during the last 2 years, having migrated their PC architecture over instead, so maybe they have had the time and resources to do some G80 magic? Tall mountain, but maybe they have been head-faking us on what G80 is?

Jen-Hsun made one of those pre-NV30-launch kind of statements a few months back about how much G80 R&D cost. I know that's not a comfort for some people, but it does indicate to me that they didn't mail it in on the technology side. We'll see soon enough.
 
(Post meant to follow Acert's, but geo's just too quick.)

Well, it's not that simple. Xenos by all accounts represents work that goes into R600, so while it took serious manpower it didn't necessarily leave ATI standing still. And one would assume Hollywood would be some sort of extension of the GameCube's GPU, so maybe not so much work there. Still, you're right, all that plus R5x0 at the same time surely leaves something short-handed.

And I agree, no way R600 is only 64 ALUs when Xenos is already there (16 disabled for yields). Would it be much of a stretch to expect R600 to at least double Xenos (be it twice 48 or 64 ALUs and possibly twice 16 bilinear--but still 16 point-filtered--TMUs), corresponding to its doubled memory bus?

But that's for another thread....
 
Just because you have a large market cap doesn't mean you automatically have huge wads of cash, it just means you have a high value stock. Although, in this case I'm pretty sure NV has huge wads of cash anyways. I wouldn't put it outside of the world of possibility, but it is on the very edge. I'm pretty sure AMD/ATI will have bigger wads of cash by that time anyways.
 
DudeMiester said:
Just because you have a large market cap doesn't mean you automatically have huge wads of cash, it just means you have a high value stock. Although, in this case I'm pretty sure NV has huge wads of cash anyways. I wouldn't put it outside of the world of possibility, but it is on the very edge. I'm pretty sure AMD/ATI will have bigger wads of cash by that time anyways.

Keep in mind that AMD has been profiting these past 3 years because they could afford the luxury of selling their products at a premium price (perceived, with good reason, as the best in the market on a price/power consumption/performance basis).

Not so anymore.
With Core 2 Duo available now, AMD will have to sacrifice profits to stay competitive (hence the recent price cuts to match Intel's).
I foresee AMD back in "the red zone" (pun not intended :D) for several quarters to come, at least until K8L, and even then, we don't know much yet about "Nehalem".
 
I don't know about the red zone, but certainly margins will be lowered. On the other hand, if they can use GPUs as a cash cow as Nvidia does so well, it should be ok. The nice thing about supplying Intel chipsets and GPUs for that platform is that even if Intel starts beating AMD, AMD still makes money.
 
Intel and AMD are offset right now.

AMD released the A64 halfway through the lifetime of the P4. Now that the A64 is getting a little older, the Core 2 comes out; halfway through the Core 2 lifespan an improved A64 comes out, etc. etc.

Intel is trying as hard as possible right now to basically skip a step and match AMD pace for pace with their previously "slightly ahead" chips. If that makes any sense.

If Intel does that then they'll sit pretty, but right now I see them fighting back and forth with periods of profit for both and then periods of loss.
 
Acert93 said:
Xenos is only ~232M transistors for the parent die and ~105M for the eDRAM if my memory is correct. About 80-90M of the eDRAM transistors are for memory. Xenos also has 16 disabled shader ALUs for yield purposes. D3D10 will require integer and bitwise support, full Geometry Shader support (although the basics seem to be roughly in place on Xenos) and some other tweaks and caches. Xenos also lacks Avivo. But Xenos has the same essential threading and scheduling as the R580 series and really is not huge by any means. Part of this is because the shaders are more fine grained--1 robust ALU per shader, which goes along with the general theme of dynamic load balancing.

It is hard to imagine R600 only having 64 shader ALUs. It could happen, and maybe they will use a shader array similar to R580 where there are multiple ALUs per shader. But if ATI was able to squeeze in 64 shader ALUs on 90nm with ~250M logic transistors, even with all the additional D3D10 features and requirements it is hard to imagine ATI would be incapable of increasing the shader array over Xenos on 80nm/65nm. Hopefully ATI keeps the general shader performance envelope and adds more TMUs to R600; R580 has plenty of raw shader power, hopefully the focus shifts back to balancing out the rest of the design. Based on the Xenos die size and transistor count I don't see why they couldn't.

Ah thx for the correction Acert93 :smile:. Well, given that ATi is definitely going unified, I don't see them going with any shader structure other than 1x1 per pipeline; anything else would just defeat the purpose of unification.

example:

Let's say a program calls for 5 vertex and 5 geometry shader threads, and you have a 1x2 type shader structure with 32 pipes each with 2 ALUs. It would end up utilizing 2 and a half pipes for the vertex shaders and 2 and a half pipes for the geometry shaders, which leaves two pipes each with one ALU not utilized. It's not much of a loss, but again it defeats the purpose of unification. And I think that's why we saw Xenos use a 48x1 structure.
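Putting that same example into a quick calc (Python; these are purely the hypothetical numbers from above):

Code:
import math

# Hypothetical example from above: 5 vertex + 5 geometry shader threads on a
# "1x2" layout (32 pipes x 2 ALUs) versus a "1x1" layout (64 pipes x 1 ALU),
# assuming a pipe can only run one thread type at a time.

def alus(threads, alus_per_pipe):
    """Return (ALUs doing work, ALUs tied up) for one thread type."""
    pipes_needed = math.ceil(threads / alus_per_pipe)
    return threads, pipes_needed * alus_per_pipe

for label, width in (("1x2 (32x2)", 2), ("1x1 (64x1)", 1)):
    busy = occupied = 0
    for threads in (5, 5):              # 5 VS threads, 5 GS threads
        b, o = alus(threads, width)
        busy += b
        occupied += o
    print(f"{label}: {busy} of {occupied} occupied ALUs doing work "
          f"({busy / occupied:.0%})")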

This is all speculation, but assuming ATi sticks with their 64 pipes with 1 ALU each, the GPU will be around 310 mill + the Avivo engine + additional DX10 features; I think it will end up around a 500 mil transistor count, otherwise I don't see why ATi would need to have the R600 on the 65nm node. Actually, I forgot that Xenos didn't have an Avivo engine ;).
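Or stated as quick arithmetic (Python) -- the 310 mill is just my guess above, and the other figures are equally rough guesses to show how it could land near 500 mil:

Code:
# Speculative transistor budget, restating the guesses above.  Nothing here
# is a real spec; every figure is a guess.

core_64_pipes = 310e6   # guessed 64 x 1-ALU unified core
avivo         = 70e6    # guessed display/video (Avivo) engine
d3d10_extras  = 120e6   # guessed integer/bitwise ops, GS support, extra caches

total = core_64_pipes + avivo + d3d10_extras
print(f"~{total / 1e6:.0f}M transistors")   # ~500M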
 
I don't think the G80 needs world-beating DB performance, it just needs DB good enough to avoid pathologically bad behavior, since I don't believe the performance of next-gen titles is going to be dominated by DB. I am not saying that they won't use DB, but that DB will only constitute a fraction of the workload. Consider VTF, for example: I don't think an implementation that is twice as slow as the market leader's will matter that much. What might matter is if your implementation runs 500-1000% slower when it is enabled in a game.

IIRC, the current Nvidia DB architecture uses batches of around 1024. The R580 uses 48 (or is it 64? I don't remember). If the G80 could manage to reduce batch size down to 128 (a factor of 8), that alone would go a long way to keeping them competitive. It won't win in synthetic benchmarks meant to showcase DB, but I think in real-world titles it will only amount to a small loss.
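A crude way to see why the batch size matters so much (Python; the coherence model and every size in it are invented for illustration -- only the 48 and 1024 batch figures come from the guesses above): assume branch decisions arrive in coherent screen-space runs a couple of hundred pixels long, and that a batch pays for both sides of the branch if it contains any mix of decisions.

Code:
import random

# Crude model of dynamic-branching cost versus batch size.  Branch decisions
# come in coherent runs of roughly REGION pixels; a batch that mixes decisions
# is charged for both paths.  All sizes and the model itself are invented.

REGION = 256           # rough length of a coherent run of identical decisions
PIXELS = 1 << 20       # pixels shaded per trial
TRIALS = 5

def relative_cost(batch_size):
    costs = []
    for _ in range(TRIALS):
        decisions, taken = [], False
        while len(decisions) < PIXELS:
            decisions.extend([taken] * random.randint(REGION // 2, REGION * 2))
            taken = not taken
        decisions = decisions[:PIXELS]
        cost = 0
        for i in range(0, PIXELS, batch_size):
            batch = decisions[i:i + batch_size]
            diverged = any(batch) and not all(batch)
            cost += len(batch) * (2 if diverged else 1)   # both paths if mixed
        costs.append(cost / PIXELS)   # 1.0 = ideal, 2.0 = branching buys nothing
    return sum(costs) / TRIALS

for batch in (16, 48, 128, 1024):
    print(f"batch {batch:5d}: ~{relative_cost(batch):.2f}x shading cost")

With a model like this a 1024-wide batch gets essentially no benefit from the branch, while 128 already recovers most of it -- which is the point: good enough, not world-beating.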

I don't think they're going to spend a whole lot of effort making GS and PS DB run ultra fast, precisely because it lowers margins and won't matter to most DX10 games that will arrive within the lifespan of the G80.
 