NVIDIA Fermi: Architecture discussion

3dilettante · Jan 22, 2010

I forgot about the mapping of the L2s to memory controllers, negating the need to keep them coherent with one another.

Jawed · Jan 22, 2010

chavvdarrr said:
Fact is, detecting 0 cosmic-ray-induced errors is far more suspicious than detecting 2 orders of magnitude fewer errors than expected.

Yes, it's suspicious, the authors are clearly surprised. You will find more than two orders of magnitude variation in estimated FIT by the way, if you look around.

Maybe the hunt for cosmic-ray-induced soft errors in graphics cards is a bit like the hunt for neutralinos.

Jawed

mczak · Jan 22, 2010

3dilettante said:
The L1 figure appears to be a straightforward 4 byte*16 load/store*16 cores*1.5 GHz.
The L2 sounds like it has core-local partitions with similar bandwidth, though probably not of 32-bit granularity in transfers.

If that very similar speed for L2 is indeed true, that would be very impressive. It would be about 3 times what Cypress offers (though speculation says L1->L2 bandwidth could indeed be a bottleneck in Cypress).
Cypress L2 partitions apparently offer 128 bytes per clock of bandwidth each (reportedly the same as RV770; by that logic, though, Juniper should have only half the bandwidth, as it has two instead of four MCs/L2 partitions).
I can't quite come up with any numbers that would give Fermi "similar" L2 bandwidth compared to L1, though. At 256 bytes per clock per partition (with 6 partitions and a 600 MHz clock) that would "only" be ~920 GB/s: still twice that of Cypress, but a bit of a stretch for "similar" (and twice that would give more L2 bandwidth than L1, which doesn't make sense).
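
For anyone who wants to check the arithmetic, here is a rough sketch of the figures being thrown around. The Fermi and Cypress per-clock widths and partition counts are the ones quoted in this post; the 850 MHz Cypress clock is an assumption (the HD 5870 core clock, not stated above), and `bandwidth_gb_s` is just an illustrative helper name.

```python
# Back-of-the-envelope cache bandwidth from the numbers quoted in the thread.
# Assumption: Cypress L2 runs at 850 MHz (HD 5870 core clock); the post does not say.

def bandwidth_gb_s(bytes_per_clock: float, units: int, clock_ghz: float) -> float:
    """Aggregate bandwidth in GB/s: bytes per clock per unit * units * clock (GHz)."""
    return bytes_per_clock * units * clock_ghz

# Fermi L1: 4 bytes * 16 load/store units per core, 16 cores, 1.5 GHz.
fermi_l1 = bandwidth_gb_s(4 * 16, 16, 1.5)

# Fermi L2 guess: 256 bytes/clock per partition, 6 partitions, 600 MHz.
fermi_l2_guess = bandwidth_gb_s(256, 6, 0.6)

# Cypress L2: 128 bytes/clock per partition, 4 partitions, assumed 850 MHz.
cypress_l2 = bandwidth_gb_s(128, 4, 0.85)

print(f"Fermi L1:        {fermi_l1:7.1f} GB/s")
print(f"Fermi L2 guess:  {fermi_l2_guess:7.1f} GB/s")
print(f"Cypress L2:      {cypress_l2:7.1f} GB/s")
print(f"Fermi L2 guess / Cypress L2: {fermi_l2_guess / cypress_l2:.2f}x")
print(f"Fermi L1 / Cypress L2:       {fermi_l1 / cypress_l2:.2f}x")
```

Under those assumptions the L2 guess lands at roughly twice Cypress, and doubling the per-partition width would overshoot the L1 figure, which is the mismatch described above.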

trinibwoy · Jan 22, 2010

I swear I saw something about L2 having its own clock domain but I can't remember where. Maybe that's the missing component.

3dilettante · Jan 22, 2010

The L2 and ROPs are in what's left of the old slow-clock domain of earlier designs.

Jawed · Jan 22, 2010

Fermi's L1 bandwidth is very nice - need to add in the TMU's L1, what's the bandwidth there?

Jawed

Groo The Wanderer · Jan 22, 2010

rpg.314 said:
When Juniper does not have DP, what makes you think that Llano will have it?

However, I am expecting Llano to have HT3, as it connects AMD CPUs to the northbridge.

Intel competition perhaps? That said, Llano is aimed at the netbook set, wait for Bulldozer based fusion parts before you decide to jump in or not. Think of Llano as an architectural preview.

-Charlie

MfA · Jan 22, 2010

Unlike the texture cache the L1 bandwidth is almost certainly highly dependent on access patterns (ie. bank conflicts).
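
A toy model of that dependence, for illustration only: the bank count, word size, and thread count below are placeholder assumptions (32 banks of 4-byte words, 32 threads per access), not figures given in this thread, and the model ignores broadcast of identical addresses.

```python
from collections import Counter

def conflict_degree(stride_words: int, threads: int = 32, banks: int = 32) -> int:
    """Worst-case number of threads mapping to the same bank for a strided access.

    Thread t reads word t * stride_words; the word index modulo the bank count
    picks the bank. A degree of 1 means conflict-free; a degree of N means the
    access is serialized into N passes in this simplified model.
    """
    hits = Counter((t * stride_words) % banks for t in range(threads))
    return max(hits.values())

def effective_fraction(stride_words: int) -> float:
    """Fraction of peak bandwidth left after serialization (toy model)."""
    return 1.0 / conflict_degree(stride_words)

for stride in (1, 2, 3, 32):
    print(f"stride {stride:2d}: {conflict_degree(stride):2d}-way, "
          f"{effective_fraction(stride):.3f} of peak")
```

In this model odd strides stay conflict-free while power-of-two strides degrade quickly, which is why a peak L1 figure says little without knowing the access pattern.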

Groo The Wanderer · Jan 22, 2010

chavvdarrr said:
In that topic, both aaronspink and dkanter said you are wrong in downplaying cosmic rays. Fact is, detecting 0 cosmic-ray-induced errors is far more suspicious than detecting 2 orders of magnitude fewer errors than expected.
And frankly, on such a topic I'd believe what dkanter says.

I don't think it matters as much as people think. Not because it is happening or not happening, but because of the cost of a failure. Generally failures cost a LOT of time, money and annoyance, way out of proportion to the cost of the machine.

Then there is the fact that ECC does not just protect against cosmic rays, it can shield you from bad memory, electrical interference, and tons of other causes. If you have a server that is poorly placed against a wall with a transformer behind the sheetrock that you did not know about, well, you have some rather annoying transient errors in your future.

For most people, the cost of ECC is worth it to prevent the errors on anything that has a large value for downtime. The cost of the ECC'd parts usually is trivial compared to the cost of most downtimes, so you just do it. Quantifying the hit rate of cosmic rays is nice and fine, but how do you quantify 'shit happens', something that I would argue is much more prevalent than cosmic ray strikes in critical areas.

-Charlie

Ailuros · Jan 23, 2010

Groo The Wanderer said:
Intel competition perhaps? That said, Llano is aimed at the netbook set, wait for Bulldozer based fusion parts before you decide to jump in or not. Think of Llano as an architectural preview.

-Charlie

Vastly OT, but I was under the impression that Llano is more notebook material and Ontario is meant for netbooks (or perhaps anything even lower)?

chavvdarrr · Jan 23, 2010

Bouncing Zabaglione Bros. said:
As I said before, Nvidia would be foolish not to design hardware limitations into the product, rather than rely on crackable software lockouts.

btw, what was the reason for AMD removing DP execution from non-58xx parts?
The die size impact is negligible. Just to screw hobbyists?

Jawed · Jan 23, 2010

chavvdarrr said:
btw, what was the reason for AMD removing DP execution from non-58xx parts?
The die size impact is negligible. Just to screw hobbyists?

The die overhead for DP in ATI should be low (since it's a few extra bits on the four multipliers + wider dot-product paths which serve to connect it all together).

I'm reasonably sure, now, that Juniper and lower GPUs cannot do single-precision FMA, either.

So the die overhead is jointly DP and FMA. FMA adds overhead because of wider sub-normal handling.

AMD would justify this on the basis that it's a 1 or 2% die space difference, I guess. Certainly for something like Cedar 1 or 2% is a big deal because margins are thin. The low ALU:TEX in Cedar theoretically reduces the die cost further.

So then you get into an argument over where to draw the line. Redwood? Juniper?

Jawed

FrameBuffer · Jan 23, 2010

chavvdarrr said:
btw, what was the reason for AMD removing DP execution from non-58xx parts?
The die size impact is negligible. Just to screw hobbyists?

My guess is so that 5700 and below cards wouldn't cannibalize 5800 sales. Where performance is key, sure, the 5800 would be the best option; however, when it comes to programming, functionality and cost would seem more important. Plain and simple, it's my humble opinion that ATI doesn't want hobbyists to buy "cheap" sub-5800 products when instead they would have to resort to more expensive (profitable) 5800+ products. While "ATI Radeon HD 5800 Series Graphics Cards - Designed by the Community" might be right for some, I think "ATI Radeon HD 5700-5400 Series Graphics Cards - Designed by the Accountants" might be more applicable.

Sontin · Jan 23, 2010

I had a little fun with the Unigine numbers from NVIDIA and completed them with three simulated "Hemlocks".

better version: http://i49.tinypic.com/rm2xbt.jpg

My GF100 and 5870 numbers are very accurate - 99%.
Scaling of 70% is the best case. I don't know how good the profile for the Unigine benchmark is, but in these 60 seconds AMD needs a scaling of 60%.

Gipsel · Jan 23, 2010

Jawed said:
I'm reasonably sure, now, that Juniper and lower GPUs cannot do single-precision FMA, either.

I just looked it up: the Evergreen ISA docs even say that FMA works on double precision parts only.

Groo The Wanderer · Jan 23, 2010

Jawed said:
The die overhead for DP in ATI should be low (since it's a few extra bits on the four multipliers + wider dot-product paths which serve to connect it all together).

I'm reasonably sure, now, that Juniper and lower GPUs cannot do single-precision FMA, either.

So the die overhead is jointly DP and FMA. FMA adds overhead because of wider sub-normal handling.

AMD would justify this on the basis that it's a 1 or 2% die space difference, I guess. Certainly for something like Cedar 1 or 2% is a big deal because margins are thin. The low ALU:TEX in Cedar theoretically reduces the die cost further.

So then you get into an argument over where to draw the line. Redwood? Juniper?

I would be shocked if the units are not physically there; ripping them out takes more work than disabling them. It also means you need a new die for the Firewhateveritiscallednow variant, and that is very unlikely to be a sane proposition.

-Charlie

Ninjaprime · Jan 23, 2010

Sontin said:
I had a little fun with the Unigine numbers from NVIDIA and completed them with three simulated "Hemlocks".

My GF100 and 5870 numbers are very accurate - 99%.
Scaling of 70% is the best case. I don't know how good the profile for the Unigine benchmark is, but in these 60 seconds AMD needs a scaling of 60%.

Actually, from what I've seen, 5870s in CrossFire scale almost 100% in Unigine; it's like 95%+. I would be curious to see, though, whether ATI is really setup limited in those tessellation-heavy parts NV is trying to sell us, and whether 2 rast/setup/tri units on two chips actually show a 100% difference.

GZ007 · Jan 23, 2010

Sontin said:
My GF100 and 5870 numbers are very accurate - 99%.
Scaling of 70% is the best case. I don't know how good the profile for the Unigine benchmark is, but in these 60 seconds AMD needs a scaling of 60%.

The problem would be if the 5870 dropped to 10 frames or lower, but that's not happening. The 5870's frame rates are stable, without high peaks and lows. They showed this graph as a tessellation showcase for GF100, but somehow I don't see it there. There is only one peak between 22 and 30 and that's all (anyway, who cares about random fps jumps). The rest is quite close to the 5870.

It seems the only advantage for GF100 in the graph is the new redesigned 512 CUDA cores (raw ALU power) and not the 16 polymorph engines.
I would rather ask whether the polymorph engines will be good for anything other than custom NVIDIA demos in the next few years?

KimB · Jan 23, 2010

GZ007 said:
It seems the only advantage for GF100 in the graph is the new redesigned 512 CUDA cores (raw ALU power) and not the 16 polymorph engines.
I would rather ask whether the polymorph engines will be good for anything other than custom NVIDIA demos in the next few years?

Well, the pixel processing power is most definitely the thing that's going to mean the most for today's games. The polymorph engines are more for future games (which doesn't necessarily mean that the card itself will be more future-proof, but it should give developers a new tool to make use of to enhance future games, even if this first implementation ultimately turns out to be flawed).

It will be interesting to see if the polymorph engine actually turns out to have other side benefits, though, such as more stable framerates.

Ninjaprime · Jan 23, 2010

Chalnoth said:
Well, the pixel processing power is most definitely the thing that's going to mean the most for today's games. The polymorph engines are more for future games (which doesn't necessarily mean that the card itself will be more future-proof, but it should give developers a new tool to make use of to enhance future games, even if this first implementation ultimately turns out to be flawed).

It will be interesting to see if the polymorph engine actually turns out to have other side benefits, though, such as more stable framerates.

From the graph it looks to be the opposite of more stable framerates... very spiky.
