GF100 evaluation thread

aaronspink · Mar 28, 2010

neliz said:
I've had quite a few PMs with Ailuros, and right up to the launch he called my 295W measured power draw bullshit. I guess this whole launch drama will put a dent in a lot of people's faith.

Ahh so 250W TDP is Nvidia's version of AMD's ACP! Unless your design has technology to prevent going over the TDP for any thermally significant periods, you better not exceed your TDP.

BTW 110C is well into electromigration range. If the 95C figures people are measuring are really Tj (junction) temps, these chips will likely have issues in short order.
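For a rough sense of scale, electromigration lifetime is usually modelled with Black's equation, MTTF = A · J^(-n) · exp(Ea / kT). The sketch below holds current density constant and compares 95C with 110C. The activation energy of 0.8 eV is an assumed, illustrative value, not a figure for GF100 or its process, so treat the result as an order-of-magnitude guide only.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_lifetime_ratio(t_cool_c: float, t_hot_c: float, activation_ev: float) -> float:
    """Return MTTF(t_cool) / MTTF(t_hot) according to Black's equation.

    Current density is held constant, so only the Arrhenius term remains.
    """
    t_cool_k = t_cool_c + 273.15
    t_hot_k = t_hot_c + 273.15
    exponent = (activation_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_cool_k - 1.0 / t_hot_k)
    return math.exp(exponent)


if __name__ == "__main__":
    # 0.8 eV is an assumption chosen for illustration, not a measured value.
    ratio = em_lifetime_ratio(95.0, 110.0, activation_ev=0.8)
    print(f"Modelled lifetime at 95C is about {ratio:.1f}x the lifetime at 110C")
```

Under that assumption, a 15C rise costs well over half the modelled lifetime, which is the point being made about junction temperatures.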

aaronspink · Mar 28, 2010

Squilliam said:
I wonder how reliability will hold up? I remember a survey showing some fairly high failure rates (IMO) for some of the GTX 285s especially. If the card is now hotter and more complex and comes with some supposed Via issues, how long will a GTX 480 last and how often will it fail?

They just have to hold up for 1 year, then it's neither Nvidia's nor their partners' problem!

Bouncing Zabaglione Bros. · Mar 28, 2010

aaronspink said:
They just have to hold up for 1 year, then it's neither Nvidia's nor their partners' problem!

Nvidia continues to burn bridges (if not houses down with the high temps). At least with customers, because I can't see many OEMs being interested in these things beyond niche special order products.

Bumpgate and drivers that melt cards can be ascribed to mistakes/incompetence. Do it again and Nvidia will be cementing a reputation for products that are faulty by design.

Jawed · Mar 28, 2010

rpg.314 said:
BTW, it appears that the ALU:TEX in Fermi is 8:1. Wouldn't that be too high?

If you want to normalise to scalar MAD ALU operations per cycle of texture fetching, then GF100 is nominally 16:1, while Cypress is 20:1.
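The normalisation can be checked with a few lines of arithmetic. This is a sketch that assumes the commonly quoted unit counts (512 ALUs and 64 texture units for a full GF100, 1600 ALU lanes and 80 texture units for Cypress) and assumes that GF100's ALUs run at twice the texture clock. None of those numbers appear in the post itself.

```python
def alu_tex_ratio(alu_lanes: int, tex_units: int, alu_clock_multiplier: float = 1.0) -> float:
    """Scalar MAD ALU operations per texture unit, per texture-unit clock."""
    return alu_lanes / tex_units * alu_clock_multiplier


# Unit counts and the 2x hot clock are assumptions taken from public specs.
gf100 = alu_tex_ratio(alu_lanes=512, tex_units=64, alu_clock_multiplier=2.0)
cypress = alu_tex_ratio(alu_lanes=1600, tex_units=80)

print(f"GF100 {gf100:.0f}:1, Cypress {cypress:.0f}:1")
```

The 8:1 figure is the raw unit count for GF100. Doubling it for the hot clock gives the nominal 16:1.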

rpg.314 said:
Yes and no. Big changes are due, but they need not arrive all at once. EG adds r/w caches (albeit at L2 level), 2 rasterizers for limited support of distributed ff processing, and most importantly DX11 so that you don't lose a fight by not turning up.

Evergreen doesn't have R/W L2. It has some small R/W caches dotted around (most attached to the ROPs).

rpg.314 said:
Hecaton could conceivably add a unified cache hierarchy with multiple tessellators without too invasive changes.

Why bother? HD5970 is faster than GTX480 at tessellated workloads.

rpg.314 said:
Larger LDS might be a convenient way of solving the bank conflicts we saw in tessellation.

Or AMD could just change the driver so that it lays out the data to avoid bank conflicts.

rpg.314 said:
Distributed setup/raster might be doable too.

Rasterisation is distributed. The problem is that AMD didn't distribute it enough. If Cypress had 4 banks of SIMDs each with 8 fragments per clock rasterisation, that might have been a start. But I suspect there are more fundamental problems in ATI's architecture to do with the way triangles of fragments are packed per hardware thread (are they? I have my doubts) and the way that hardware threads are globally launched and scheduled, rather than doing so locally.

rpg.314 said:
It is due, but shorter refresh cycles mean that there is no need to throw the entire kitchen sink into one chip.

See my signature

Jawed said:
Though it also seems like the era of TSMC's tick-tocking node/half-node has ended, which looks like a really serious problem.

Jawed

CarstenS · Mar 28, 2010

Jawed said:
Why bother? HD5970 is faster than GTX480 at tessellated workloads.

I wonder if tessellation, especially the adaptive stuff, would lead to an increase in necessary inter-GPU communication?

Jawed · Mar 28, 2010

CarstenS said:
I wonder if tessellation, especially the adaptive stuff, would lead to an increase in necessary inter-GPU communication?

It seems adaptive factors are generated intra-frame in current techniques. Though you hint at something that sounds like a good idea, as I suppose it's possible to update these factors progressively across all geometry, i.e. in phases, over successive frames, with factors changing in clumps. Varying the frequency of factor update for a clump of geometry depending on distance to camera?

I don't have any idea of the cost of storing these factors or transmitting them...
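One way to picture the progressive update idea is a per-clump schedule in which the interval between factor recomputations grows with distance to the camera. The sketch below is purely hypothetical: `Clump`, `update_interval`, the doubling rule and the distance thresholds are invented for illustration and do not describe any shipping technique.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clump:
    """A group of patches that share one update schedule (hypothetical)."""
    name: str
    distance: float  # distance to camera, arbitrary units


def update_interval(distance: float, near: float = 10.0, max_interval: int = 8) -> int:
    """Frames between factor updates: every frame up close, less often far away."""
    interval = 1
    threshold = near
    while distance > threshold and interval < max_interval:
        interval *= 2
        threshold *= 2
    return interval


def clumps_to_update(clumps: List[Clump], frame: int) -> List[str]:
    """Names of the clumps whose tessellation factors are recomputed this frame."""
    due = []
    for index, clump in enumerate(clumps):
        interval = update_interval(clump.distance)
        # Offset by index so clumps with equal intervals are spread over frames.
        if (frame + index) % interval == 0:
            due.append(clump.name)
    return due


if __name__ == "__main__":
    scene = [Clump("near", 5.0), Clump("mid", 15.0), Clump("far", 1000.0)]
    for frame in range(4):
        print(frame, clumps_to_update(scene, frame))
```

Staggering the updates this way spreads the factor computation across frames, at the cost of factors for distant geometry lagging behind the camera.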

Jawed

chiadog · Mar 28, 2010

I am pretty indifferent about the GF100. Of course, I feel the exact same way about the HD58xx series*. These two architectures' performance gains over the previous generation are laughable in comparison to how many SP/CC/(whatever they're calling it these days) they've added. Ah well, looking forward to the next round, as this match-up is pretty underwhelming. I hope that for the next round they will put stricter guidelines on power consumption, as it is getting to obscene levels. It's like we're back in the Prescott days.

*I own a 5850.

rpg.314 · Mar 28, 2010

Jawed said:
If you want to normalise to scalar MAD ALU operations per cycle of texture fetching, then GF100 is nominally 16:1, while Cypress is 20:1.

So nv's almost there in alu:tex apart from the t unit.

Jawed said:
Evergreen doesn't have R/W L2. It has some small R/W caches dotted around (most attached to the ROPs).

It has write-combining caches attached to each MC. Cache lines are read in, and entire cache lines are evicted, leading to wider write transactions. If it has an LRU eviction policy, then how are they not L2s?
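The behaviour described here (small writes merged into a line, whole lines written back on LRU eviction) can be modelled in a few lines. This is a toy software model with an assumed 64-byte line and a hypothetical `WriteCombiningCache` class. It says nothing about how Evergreen's hardware is actually organised.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed line size, for illustration only


class WriteCombiningCache:
    """Toy LRU cache that merges small writes and evicts whole lines."""

    def __init__(self, capacity_lines: int) -> None:
        self.capacity_lines = capacity_lines
        self.lines = OrderedDict()  # line address -> bytearray, oldest first
        self.memory_writes = []     # (line address, full line) sent to memory

    def write(self, address: int, data: bytes) -> None:
        offset = address % LINE_BYTES
        line_addr = address - offset
        if offset + len(data) > LINE_BYTES:
            raise ValueError("write crosses a line boundary")
        if line_addr not in self.lines:
            if len(self.lines) >= self.capacity_lines:
                self._evict_lru()
            # A real cache would read the line in from memory; zeros stand in here.
            self.lines[line_addr] = bytearray(LINE_BYTES)
        self.lines[line_addr][offset:offset + len(data)] = data
        self.lines.move_to_end(line_addr)  # mark as most recently used

    def _evict_lru(self) -> None:
        line_addr, line = self.lines.popitem(last=False)  # least recently used
        self.memory_writes.append((line_addr, bytes(line)))


if __name__ == "__main__":
    cache = WriteCombiningCache(capacity_lines=1)
    cache.write(0, b"\x01")
    cache.write(4, b"\x02")   # merged into the same line, no memory traffic yet
    cache.write(64, b"\x03")  # new line forces one full-line write-back
    print(len(cache.memory_writes), "memory write(s) for three small stores")
```

Two small stores leave the cache as a single 64-byte transaction, which is the "wider write transactions" effect.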

Jawed said:
Why bother? HD5970 is faster than GTX480 at tessellated workloads.

To stagger the changes over many chips.

Jawed said:
Or AMD could just change the driver so that it lays out the data to avoid bank conflicts.

Yes, but it seems more capacity induced to me. I think it would have been reduced by now if it was so easily resolvable. Anyway, we'll know for sure when B3D does its Hecaton/10.3 analysis.

Jawed said:
Rasterisation is distributed. The problem is that AMD didn't distribute it enough. If Cypress had 4 banks of SIMDs each with 8 fragments per clock rasterisation, that might have been a start. But I suspect there are more fundamental problems in ATI's architecture to do with the way triangles of fragments are packed per hardware thread (are they? I have my doubts) and the way that hardware threads are globally launched and scheduled, rather than doing so locally.

Fair enough. But there's no way in hell they won't pack fragments from multiple triangles into one hw thread. 64 wide threads are too wasteful otherwise.
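The waste being argued about is easy to quantify with a made-up workload. The sketch below compares giving each triangle its own 64-wide hardware thread with packing fragments from several triangles together. The workload is invented, and the model ignores quad granularity and any real packing rules, which the thread itself treats as unknown.

```python
import math
from typing import List

WAVEFRONT_LANES = 64  # hardware thread width discussed in the thread


def wavefronts_unpacked(fragments_per_triangle: List[int]) -> int:
    """Each triangle gets its own wavefront(s); leftover lanes stay idle."""
    return sum(math.ceil(count / WAVEFRONT_LANES) for count in fragments_per_triangle)


def wavefronts_packed(fragments_per_triangle: List[int]) -> int:
    """Fragments from consecutive triangles share wavefronts."""
    return math.ceil(sum(fragments_per_triangle) / WAVEFRONT_LANES)


# Hypothetical workload: 32 small triangles covering 8 fragments each.
triangles = [8] * 32
print("unpacked:", wavefronts_unpacked(triangles))
print("packed:  ", wavefronts_packed(triangles))
```

For this workload the unpacked case needs eight times as many hardware threads, which is why packing looks hard to avoid for small triangles.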

Jawed said:
Though it also seems like the era of TSMC's tick-tocking node/half-node has ended, which looks like a really serious problem.

Yeah, good bye to yearly gpu launches.

Rys · Mar 28, 2010

gkar1 said:
Must be really upsetting ending up being another marketing tool used to generate hype.

When the number comes direct from the lead architect, you can forgive me for having confidence in it

Florin · Mar 28, 2010

I just came back from holiday and I haven't read most reviews yet, but from what I've seen, this launch wins a sympathy vote, so yay/yay

CarstenS · Mar 28, 2010

Jawed,
Yes and no. Some kind of grouping depending on z-distance could be as useful as not updating tessellated geometry every frame - but that would inevitably lead to a more coarse level-of-detail system with geo-detail popping up.

But here's why I was wondering: is every GPU in current systems given only the base mesh, expanding it again every frame on its own? Since every GPU renders only every other frame, the differences in drawn geometry would be noticeably larger compared to a single GPU which has to do every frame by itself. I could imagine this isn't too helpful for caching performance, don't you think? And the expansion also has to be done on many more triangles, as many more change from one LoD into another.

If, instead, every GPU transmitted its frame's geometry state to the other to avoid that, you'd end up with a lot of geometry data that has to pass back and forth over the buses for every frame.

CarstenS · Mar 28, 2010

rpg.314 said:
So nv's almost there in alu:tex apart from the t unit.

which is also there in GF100, but sans FMA.

Alexko · Mar 28, 2010

Rys said:
When the number comes direct from the lead architect, you can forgive me for having confidence in it

Indeed. Is this where the 1700MHz figure came from as well?

PeterAce · Mar 28, 2010

Early Fermi thoughts...

While I have not read and digested all the reviews out there (and there are many data points missing at this early stage of the analysis; I'm looking at you, B3D and Tech Report), so far, what I am impressed/pleasantly surprised with:

- Minimum frame rates seem great.

- Low res frame rates seem great.

What I'm disappointed with:

- Heat/Noise

As an enthusiast I replaced an 8800 GTX SLI setup with a 280 GTX SLI setup (using the same case). I'm unsure if 480 GTX SLI will allow the same without extra side case fans or maybe a new case!

Anyway, as availability is still 'pre-order', I've got a little time to think about cooling changes

Bouncing Zabaglione Bros. · Mar 28, 2010

Rys said:
When the number comes direct from the lead architect, you can forgive me for having confidence in it

Is it the sort of thing you would expect the lead architect to keep to himself as company confidential? If so, you've got to treat any such information offered as suspect.

It just goes to show that everyone in the company will be under orders to either keep quiet or spread misinformation.

Jawed · Mar 28, 2010

rpg.314 said:
So nv's almost there in alu:tex apart from the t unit.

If you assume ATI has a typical best case of 80% ALU utilisation...
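Spelled out, the remark is a single multiplication. The 80% figure is the assumption stated in the post, and the nominal 20:1 is the Cypress ratio quoted earlier in the thread. The function name is only illustrative.

```python
def effective_alu_tex(nominal_ratio: float, alu_utilisation: float) -> float:
    """Scale a nominal ALU:TEX ratio by the fraction of ALU lanes doing useful work."""
    return nominal_ratio * alu_utilisation


# 20:1 nominal for Cypress, 80% utilisation as assumed in the post.
cypress_effective = effective_alu_tex(20.0, 0.80)
print(f"Cypress effective ALU:TEX is about {cypress_effective:.0f}:1")
```

At that utilisation Cypress lands on the same nominal figure quoted for GF100, which would put the two on a par.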

rpg.314 said:
It has write-combining caches attached to each MC. Cache lines are read in, and entire cache lines are evicted, leading to wider write transactions. If it has an LRU eviction policy, then how are they not L2s?

L2 implies there's an L1. But it's not a two-level hierarchy.

rpg.314 said:
To stagger the changes over many chips.

We need evidence that TS throughput is a bottleneck in games, first...

rpg.314 said:
Fair enough. But there's no way in hell they won't pack fragments from multiple triangles into one hw thread. 64 wide threads are too wasteful otherwise.

That's not an argument though, I want evidence one way or the other.

AMD says 8 fragments per triangle is the reasonable lower limit. There's ~ an order of magnitude between 2 quads occupying a 64-capacity hardware thread and 1 fragment occupying the same thread.
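The "order of magnitude" is simply the lane occupancy of the two cases. A minimal sketch, assuming one triangle per 64-lane hardware thread and counting only covered fragments (helper lanes for incomplete quads are ignored):

```python
WAVEFRONT_LANES = 64  # capacity of one hardware thread


def lane_occupancy(active_fragments: int, lanes: int = WAVEFRONT_LANES) -> float:
    """Fraction of lanes in a hardware thread that carry a real fragment."""
    return active_fragments / lanes


two_quads = lane_occupancy(8)       # AMD's suggested lower limit: 8 fragments
single_fragment = lane_occupancy(1)

print(f"2 quads: {two_quads:.1%}, 1 fragment: {single_fragment:.1%}, "
      f"ratio {two_quads / single_fragment:.0f}x")
```

Both cases leave most of the thread empty if nothing else is packed alongside. The gap between them is a factor of eight.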

In games with no tessellation I imagine there's quite a bit of that 64-capacity going unused - even with fairly high fragment counts per triangle on average. But, I am just guessing at the workings here...

rpg.314 said:
Yeah, good bye to yearly gpu launches.

I think it basically means that AMD's GPUs will creep up in size - sweet spot isn't about making the smallest die possible, anyway. Cypress is the second biggest ATI chip, after all (though there's some argument about the size of R580 :???:). Indeed Cypress could have ended up the biggest ATI chip ever, depending on how you interpret the comments on the shrinkage it suffered.

Jawed

trinibwoy · Mar 28, 2010

aaronspink said:
Ouch, that's really bad. Though honestly all the cards should be able to handle multiple displays with less power/heat, but the 480 is ridiculous.

Well you're looking at a card that's burning up even at idle clocks of 50MHz core, 100MHz memory. It's barely running and still using a lot of juice. And there's nowhere to go but up.

CarstenS · Mar 28, 2010

Jawed said:
Cypress is the second biggest ATI chip, after all (though there's some argument about the size of R580 )

From what we've measured, R580 was about 10 mm² larger than Cypress (measured with equal parameters, so the error should be more or less equal too).

Jawed · Mar 28, 2010

CarstenS said:
Jawed,
Yes and no. Some kind of grouping depending on z-distance could be as useful as not updating tessellated geometry every frame - but that would inevitably lead to a more coarse level-of-detail system with geo-detail popping up.

You can trade LOD for the reduced effort in computing LOD per frame, therefore avoiding popping.

CarstenS said:
But here's why I was wondering: is every GPU in current systems given only the base mesh, expanding it again every frame on its own?

That's the basic concept. No different from texturing each triangle each frame, etc.

Time-coherent shading is possible, e.g.

http://developer.amd.com/media/gpu_...ith_Reverse_Reprojection_Caching(GH07).ppt#35

and things like ambient occlusion approximation can be generated progressively over multiple frames. So it seems reasonable to do tessellation factors the same way.

CarstenS said:
Since every GPU renders only every other frame, the differences in drawn geometry would be noticeably larger compared to a single GPU which has to do every frame by itself. I could imagine this isn't too helpful for caching performance, don't you think? And the expansion also has to be done on many more triangles, as many more change from one LoD into another.

AFR spoils lots of things, which is why I don't like it. I dislike the "X2 to compete with NVidia's top-end" strategy.

CarstenS said:
If, instead, every GPU transmitted its frame's geometry state to the other to avoid that, you'd end up with a lot of geometry data that has to pass back and forth over the buses for every frame.

Yes. Early days yet as we don't know how expensive the data involved in adaptive tessellation factors is.

Jawed

rpg.314 · Mar 28, 2010

Jawed said:
If you assume ATI has a typical best case of 80% ALU utilisation...

Well, this sort of analysis is always modulo these inconveniences...

Jawed said:
L2 implies there's an L1. But it's not a two-level hierarchy.

It's L1 in the sense that there is no level of cache above it. It's L2 in the sense that there are caches above it which are not unified with it. At any rate, it does seem to have r/w caches.

Jawed said:
We need evidence that TS throughput is a bottleneck in games, first...

TS might not be a bottleneck, but in tessellated scenes setup/raster quite likely are.

Jawed said:
AMD says 8 fragments per triangle is the reasonable lower limit. There's ~ an order of magnitude between 2 quads occupying a 64-capacity hardware thread and 1 fragment occupying the same thread.

In games with no tessellation I imagine there's quite a bit of that 64-capacity going unused - even with fairly high fragment counts per triangle on average. But, I am just guessing at the workings here...

More reasons to go deferred. Yeah, I am a sucker for deferred shading. :smile: With tessellation however, it might be more beneficial to rasterize everything and then shade, instead of just buffering up all the geometry.

Jawed said:
I think it basically means that AMD's GPUs will creep up in size - sweet spot isn't about making the smallest die possible, anyway. Cypress is the second biggest ATI chip, after all (though there's some argument about the size of R580). Indeed Cypress could have ended up the biggest ATI chip ever, depending on how you interpret the comments on the shrinkage it suffered.

Or maybe they can go to GF and ask 'em to do half nodes.

GF100 evaluation thread

Whatddya think?

Yay! for both

480 roxxx, 470 is ok-ok

Meh for both

480's ok, 470 suxx

WTF for both
