NVIDIA Fermi: Architecture discussion

Fact is, detecting 0 cosmic-ray-induced errors is far more suspicious than detecting two orders of magnitude fewer errors than expected.
Yes, it's suspicious, the authors are clearly surprised. You will find more than two orders of magnitude of variation in estimated FIT rates, by the way, if you look around.
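For intuition on why zero detections is the more suspicious outcome, here is a quick Poisson sketch (the expected count is a made-up illustration, not a real FIT estimate): if the error-rate estimate were right, seeing zero events would be astronomically unlikely, while seeing ~100x fewer than expected only implies the estimate was off by ~100x.

```python
import math

expected = 50.0  # hypothetical: errors a FIT estimate predicts over the test window

# If errors arrive as a Poisson process with mean `expected`,
# the chance of observing exactly zero of them is e^(-lambda).
p_zero = math.exp(-expected)
print(f"P(0 errors | expecting {expected:.0f}) = {p_zero:.2e}")  # ~1.93e-22

# Observing two orders of magnitude fewer errors is far less damning: it only
# requires the FIT estimate to be ~100x too high, and published FIT estimates
# themselves vary by more than two orders of magnitude.
expected_low = expected / 100
p_zero_low = math.exp(-expected_low)
print(f"P(0 errors | expecting {expected_low:.1f}) = {p_zero_low:.2f}")  # ~0.61
```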

Maybe the hunt for cosmic ray induced soft errors in graphics cards is a bit like the hunt for neutralinos :LOL:

Jawed
 
The L1 figure appears to be a straightforward 4 bytes × 16 load/store units × 16 cores × 1.5 GHz.
The L2 sounds like it has core-local partitions with similar bandwidth, though probably not of 32-bit granularity in transfers.
If the L2 really does run at a very similar speed, that would be very impressive - about 3 times what Cypress offers (though speculation says L1->L2 bandwidth could indeed be a bottleneck in Cypress).
Cypress L2 partitions apparently offer 128 bytes per clock of bandwidth (reportedly it's the same for RV770 - by that logic, though, Juniper should only have half the bandwidth, as it has two instead of four MCs/L2 partitions).
I can't quite come up with any numbers which would give "similar" L2 bandwidth compared to L1 for Fermi, though - at 256 bytes per clock and partition (with 6 partitions at a 600 MHz clock) that would "only" be ~920 GB/s, still twice that of Cypress, but calling it "similar" is a bit of a stretch (and twice that would give more L2 bandwidth than L1, which doesn't make sense).
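Putting the arithmetic from the last few posts in one place (the Fermi numbers are this thread's speculation, not confirmed specs; Cypress is assumed to run its L2 at the 850 MHz core clock):

```python
GB = 1e9  # decimal gigabytes, as is conventional for bandwidth

# Speculated Fermi L1: 4 bytes x 16 load/store units x 16 cores x 1.5 GHz
fermi_l1 = 4 * 16 * 16 * 1.5e9 / GB    # ~1536 GB/s

# Speculated Fermi L2: 256 bytes/clock x 6 partitions x 600 MHz
fermi_l2 = 256 * 6 * 600e6 / GB        # ~922 GB/s

# Cypress L2: 128 bytes/clock x 4 partitions x 850 MHz core clock
cypress_l2 = 128 * 4 * 850e6 / GB      # ~435 GB/s

print(f"Fermi L1  : {fermi_l1:7.1f} GB/s")
print(f"Fermi L2  : {fermi_l2:7.1f} GB/s")
print(f"Cypress L2: {cypress_l2:7.1f} GB/s")
```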
 
I swear I saw something about L2 having its own clock domain but I can't remember where. Maybe that's the missing component.
 
If Juniper does not have DP, what makes you think that Llano will have it? :LOL:

However, I am expecting Llano to have HT3, as that is what connects AMD CPUs to the northbridge.

Intel competition perhaps? That said, Llano is aimed at the netbook set, wait for Bulldozer based fusion parts before you decide to jump in or not. Think of Llano as an architectural preview.

-Charlie
 
Unlike the texture cache, the L1 bandwidth is almost certainly highly dependent on access patterns (i.e. bank conflicts).
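A minimal sketch of that dependence, modelling shared memory/L1 as 32 four-byte banks (32 banks is the commonly cited Fermi figure; this is a toy model, not measured behaviour). The worst thread-to-bank mapping within a warp sets the serialization factor:

```python
from collections import Counter

BANKS = 32  # Fermi shared memory is commonly described as 32 banks, 4 bytes wide
WARP = 32

def conflict_degree(stride_words: int) -> int:
    """Worst-case number of threads in a warp hitting the same bank."""
    banks = [(tid * stride_words) % BANKS for tid in range(WARP)]
    return max(Counter(banks).values())

for stride in (1, 2, 4, 32):
    d = conflict_degree(stride)
    print(f"stride {stride:2} words -> {d:2}-way conflict "
          f"(~1/{d} of peak bandwidth)")
# stride 1 -> conflict-free; stride 2 -> 2-way; stride 32 -> fully serialized
```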
 
In that thread, both aaronspink and dkanter said you are wrong in downplaying cosmic rays. Fact is, detecting 0 cosmic-ray-induced errors is far more suspicious than detecting two orders of magnitude fewer errors than expected.
And frankly, on such a topic, I'd believe what dkanter says.

I don't think it matters as much as people think - not because it is or isn't happening, but because of the cost of a failure. Generally failures cost a LOT of time, money, and annoyance, way out of proportion to the cost of the machine.

Then there is the fact that ECC does not just protect against cosmic rays, it can shield you from bad memory, electrical interference, and tons of other causes. If you have a server that is poorly placed against a wall with a transformer behind the sheetrock that you did not know about, well, you have some rather annoying transient errors in your future.

For most people, the cost of ECC is worth it to prevent errors on anything that has a large cost of downtime. The cost of the ECC'd parts is usually trivial compared to the cost of most downtime, so you just do it. Quantifying the hit rate of cosmic rays is nice and fine, but how do you quantify "shit happens" - something that I would argue is much more prevalent than cosmic ray strikes in critical areas?

-Charlie
 
Intel competition perhaps? That said, Llano is aimed at the netbook set, wait for Bulldozer based fusion parts before you decide to jump in or not. Think of Llano as an architectural preview.

-Charlie

Vastly OT, but I was under the impression that Llano is more notebook material and Ontario is meant for netbooks (or anything even lower, perhaps)?
 
btw, what was the reason for AMD removing DP execution from non-58xx?
Die size is negligible. Just to screw hobbyists?
The die overhead for DP in ATI should be low (since it's a few extra bits on the four multipliers + wider dot-product paths which serve to connect it all together).

I'm reasonably sure, now, that Juniper and lower GPUs cannot do single-precision FMA, either.

So the die overhead is jointly DP and FMA. FMA adds overhead because of wider subnormal handling.
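As a side note on what the fused path buys: FMA rounds once, after an exact multiply-add, instead of rounding the product first. A minimal Python sketch, using exact rational arithmetic to stand in for the fused datapath (it illustrates the single rounding, not the subnormal-width cost itself):

```python
from fractions import Fraction

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

separate = a * b + c  # a*b = 1 - 2**-60 rounds to 1.0 first, so this gives 0.0
fused = float(Fraction(a) * Fraction(b) + Fraction(c))  # one rounding, like FMA

print(separate)  # 0.0
print(fused)     # -8.673617379884035e-19  (= -2**-60, the exact answer)
```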

AMD would justify this on the basis that it's a 1 or 2% die space difference, I guess. Certainly for something like Cedar, 1 or 2% is a big deal because margins are thin. The low ALU:TEX in Cedar theoretically reduces the die cost further.

So then you get into an argument over where to draw the line. Redwood? Juniper?

Jawed
 
btw, what was the reason for AMD removing DP execution from non-58xx?
Die size is negligible. Just to screw hobbyists?

My guess is that it's so 5700-and-below cards don't cannibalize 5800 sales. Where performance is key, sure, the 5800 would be the best option; but for programming, functionality and cost would seem more important. Plain and simple, it's my humble opinion that ATI doesn't want hobbyists to buy "cheap" sub-5800 products when they could instead be pushed toward more expensive (more profitable) 5800+ products. While "ATI Radeon HD 5800 Series Graphics Cards - Designed by the Community" might be right for some, I think "ATI Radeon HD 5700-5400 Series Graphics Cards - Designed by the Accountants" might be more applicable.
 
I had a little fun with the Unigine numbers from nVidia and completed the numbers with three simulated "Hemlocks".

[charts: Unigine numbers for GF100, 5870 and simulated Hemlock - better version: http://i49.tinypic.com/rm2xbt.jpg]


My GF100 and 5870 numbers are very accurate - 99%.
A scaling of 70% is the best case. I don't know how good a fit the Unigine benchmark is, but over these 60 seconds AMD needs a scaling of 60%.
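For clarity, "scaling" here reads as: simulated Hemlock = single-5870 result × (1 + scaling). A minimal sketch with a hypothetical frame rate (the 40 fps input is illustrative, not taken from the chart):

```python
def simulated_hemlock(cypress_fps: float, scaling: float) -> float:
    """Dual-GPU estimate from a single-GPU result and a CrossFire scaling factor."""
    return cypress_fps * (1.0 + scaling)

fps_5870 = 40.0  # hypothetical single-5870 average over the 60-second run
for s in (0.6, 0.7):
    print(f"scaling {s:.0%}: simulated Hemlock = {simulated_hemlock(fps_5870, s):.1f} fps")
# 60% scaling -> 64.0 fps; 70% (best case) -> 68.0 fps
```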
 
The die overhead for DP in ATI should be low (since it's a few extra bits on the four multipliers + wider dot-product paths which serve to connect it all together).

I'm reasonably sure, now, that Juniper and lower GPUs cannot do single-precision FMA, either.

So the die overhead is jointly DP and FMA. FMA adds overhead because of wider subnormal handling.

AMD would justify this on the basis that it's a 1 or 2% die space difference, I guess. Certainly for something like Cedar, 1 or 2% is a big deal because margins are thin. The low ALU:TEX in Cedar theoretically reduces the die cost further.

So then you get into an argument over where to draw the line. Redwood? Juniper?

I would be shocked if the units are not physically there, ripping them out takes more work than disabling them. It also means you need a new die for the Firewhateveritiscallednow variant, and that is very unlikely to be a sane proposition.

-Charlie
 
I had a little fun with the Unigine numbers from nVidia and completed the numbers with three simulated "Hemlocks".

My GF100 and 5870 numbers are very accurate - 99%.
A scaling of 70% is the best case. I don't know how good a fit the Unigine benchmark is, but over these 60 seconds AMD needs a scaling of 60%.

Actually, from what I've seen, in Unigine 5870s in CrossFire scale almost 100% - it's like 95%+. I would be curious to see, though, whether ATI really is setup-limited in those tessellation-heavy parts NV is trying to sell us, and whether 2 rast/setup/tri units on two chips actually show a 100% difference.
 
My GF100 and 5870 numbers are very accurate - 99%.
A scaling of 70% is the best case. I don't know how good a fit the Unigine benchmark is, but over these 60 seconds AMD needs a scaling of 60%.


It would be a problem if the 5870 dropped to 10 fps or lower, but that's not happening. The 5870's framerate is stable, without high peaks and lows. They showed this graph as a tessellation showcase for GF100, but somehow I don't see it there. There is only one peak between seconds 22 and 30, and that's all (anyway, who cares about random fps jumps). The rest is quite close to the 5870.

It seems the only advantage for GF100 in the graph is the new redesigned 512 CUDA cores (raw ALU power) and not the 16 PolyMorph engines.
I would rather ask whether the PolyMorph engines will be good for anything other than custom NVIDIA demos in the next few years?
 
It seems the only advantage for GF100 in the graph is the new redesigned 512 CUDA cores (raw ALU power) and not the 16 PolyMorph engines.
I would rather ask whether the PolyMorph engines will be good for anything other than custom NVIDIA demos in the next few years?
Well, the pixel processing power is most definitely the thing that's going to mean the most for today's games. The PolyMorph engines are more for future games (which doesn't necessarily mean that the card itself will be more future-proof, but it should give developers a new tool to make use of to enhance future games, even if this first implementation ultimately turns out to be flawed).

It will be interesting to see if the PolyMorph engine actually turns out to have other side benefits, though, such as more stable framerates.
 
Well, the pixel processing power is most definitely the thing that's going to mean the most for today's games. The PolyMorph engines are more for future games (which doesn't necessarily mean that the card itself will be more future-proof, but it should give developers a new tool to make use of to enhance future games, even if this first implementation ultimately turns out to be flawed).

It will be interesting to see if the PolyMorph engine actually turns out to have other side benefits, though, such as more stable framerates.

From the graph it looks to be the opposite of more stable framerates... very spiky. ;)
 