NVIDIA Fermi: Architecture discussion

Hopefully NV will make the smart move and price it the same, making Cypress obsolete, at least until AMD lowers prices. I think forcing them to do so earns a lot of mind share, though.

I would bet that pricing is going to be as dependent on part availability as anything else (relative performance, power draw, etc.). Even if it had worse performance than Cypress, you know that there are enough dyed-in-the-wool NV fans that they'd have no problem selling it at a premium over Cypress if there weren't a lot of parts available to sell.

If it's faster, unless they have a metric ton of these to sell on day one, it's all but guaranteed to carry a price premium over Cypress. By then, anyone who wants a Cypress will have bought one, and with stock of other high-end Nvidia GPUs drying up, there will be pent-up demand for team Green's finest. Plus, the 5870 isn't all that expensive now (historically speaking), even with no competition from Nvidia. By the consumer Fermi launch time frame, it will no doubt be even cheaper, so I don't see how Nvidia will have much of an opportunity to price Fermi below Cypress.
 
That certainly seems quite plausible. So basically you're saying you'd agree that the other guy's claim, that most server farms typically rely on budget single-socket desktop components, is unlikely?

Still, even lower-TDP, higher-efficiency components and the like don't do much to change the likelihood that any Nehalem-based solution will still be a factor or so off in computational density and power efficiency compared to a Fermi-based Tesla setup.

No, not at all. I know who he is, and I believe every word he says in this and several other contexts. I also know what several data centers use, and they almost all have 1S boxes now.

-Charlie
 
It's interesting to look at the clocks (core/shader, in MHz) on the various Nvidia 40nm parts, in rough release order:
GT218 589/1402
GT216 625/1360
GT215 550/1340
GT215 (OC) 585/1420
GF100 A1 (Hotboy) 495/1100
GF100 A2 (various) ???/1200-1350
GF100 A3 (Nvidia) ???/1250-1400

They seem very clustered around 1300-1400 MHz for the shaders (excluding GF100 A1, obviously). Assuming they keep the same core:shader ratio as GF100 A1, the core clock is looking like maybe 550-630?

Contrast that with the G9x generation, which covered 1375-1836 MHz.
Yeah, those A2 Teslas were ~550 MHz, so maybe up to 600 MHz is doable.
I had been expecting ~G200/G200b clocks for a while.
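For what it's worth, that ratio estimate is easy to sanity-check. A minimal sketch (Python, purely illustrative) that scales the A1 (Hotboy) core:shader ratio onto the rumored A2/A3 shader clocks from the list above:

```python
# Scale the GF100 A1 core:shader ratio onto the rumored later shader clocks.
a1_core, a1_shader = 495, 1100
ratio = a1_core / a1_shader          # ~0.45

for shader_mhz in (1200, 1250, 1300, 1350, 1400):
    print(f"{shader_mhz} MHz shader -> ~{shader_mhz * ratio:.0f} MHz core")
# 1200 -> 540, 1250 -> 562, 1400 -> 630: right around the 550-630 guess
```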
 
That certainly seems quite plausible. So basically you're saying you'd agree that the other guy's claim, that most server farms typically rely on budget single-socket desktop components, is unlikely?

The only way they won't use the DT CPUs is if they get a deal on something better for the same cost. The CPU literally makes up 3/4 of the BOM as is.

Still, even lower-TDP, higher-efficiency components and the like don't do much to change the likelihood that any Nehalem-based solution will still be a factor or so off in computational density and power efficiency compared to a Fermi-based Tesla setup.

PEAK != reality.
 
No, not at all. I know who he is, and I believe every word he says in this and several other contexts. I also know what several data centers use, and they almost all have 1S boxes now.

-Charlie

I guess we just have different experiences. Well, I'll readily admit that I'm not familiar with trends in the US, but of the 8 or 9 data centers I've visited in Europe and Asia over the last 2 years, pretty much all of them are at the sweet spot of 2-socket blades, down from 4-socket rack units, and I don't know anyone who is seriously using i7s and Phenoms in place of Xeons and Opterons.
 
For yield, speed, or power, an A4 won't do much. (I'm a bit on a mission to kill the idea in people's brains that metal spins have a lot of impact on those. ;)) A B spin is something else entirely, of course, but that would be a much longer-term solution.
I don't remember if this caveat was mentioned previously, but in at least one specific case a metal spin can increase clock speed. If a small number of paths are the significant bottleneck, they may be fixable; sometimes all it takes is some added buffering or increased drive strength along the offending path.
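To see why that works, remember that the achievable clock is set by the slowest timing path, so pulling in a couple of outlier paths can lift the whole chip. A toy sketch with invented delay numbers (nothing here is real GF100 data):

```python
# fmax is limited by the worst path delay; fixing a few outliers lifts it.
path_delays_ns = [0.70, 0.71, 0.72, 0.90, 0.95]   # two outlier paths dominate

def fmax_mhz(delays_ns):
    return 1000.0 / max(delays_ns)   # period in ns -> frequency in MHz

print(f"before spin: {fmax_mhz(path_delays_ns):.0f} MHz")   # ~1053 MHz

# Model a metal spin that buffers / upsizes drivers on the two offenders:
fixed = [min(d, 0.72) for d in path_delays_ns]
print(f"after spin:  {fmax_mhz(fixed):.0f} MHz")            # ~1389 MHz
```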

The only time I've seen a spin significantly reduce power drain is when a stupid mistake was made in the previous revision, like shorting phantom cells to power.

In general I agree with silent_guy: metal spins are usually there to fix logic bugs.
 
I guess we just have different experiences. Well, I'll readily admit that I'm not familiar with trends in the US, but of the 8 or 9 data centers I've visited in Europe and Asia over the last 2 years, pretty much all of them are at the sweet spot of 2-socket blades, down from 4-socket rack units, and I don't know anyone who is seriously using i7s and Phenoms in place of Xeons and Opterons.

That's the exact world I live in, and I haven't seen much different of late. I call BS on 1-socket installs in DCs; it makes no sense. If you guys realised the cost per square metre of a datacenter, you would quickly see that density is king. There are so many fronts on which it doesn't make sense.

It's inefficient in terms of power consumption, both from the increased inefficiency of AC-to-DC conversion in 1P servers vs blades and from the redundant parts within each server.

It's very low density: I can get 576 Opteron cores in a 40 RU rack right now, vs 14.4 racks to do that with 1-socket servers.

It's very costly in terms of localized cooling. For example, I would need one APC half-rack AC unit for 3 blade enclosures in a 40 RU rack, but 3-4 of them for 14.4 racks, plus much bigger water pumps.

Given that a 2-socket blade with 64 GB of RAM and 2 Istanbul CPUs is around 8-10K (AUD), single-socket servers don't make sense. Sure, there is hard disk to consider, but for most things in this space local disk doesn't have the performance anyway.


PS: I'm a network "engineer" in a DC ;)

Edit: lol, I completely forgot about networking costs. With 1-socket servers you're pretty much forced to go top-of-rack switching; with blades you can go with something like a Catalyst 6500 or Nexus 7000 and just run fibre back to a couple of central points. Assuming a redundant network design, it is significantly cheaper, as well as far more scalable, to centralise your network access layer.
 
I guess we just have different experiences. Well, I'll readily admit that I'm not familiar with trends in the US, but of the 8 or 9 data centers I've visited in Europe and Asia over the last 2 years, pretty much all of them are at the sweet spot of 2-socket blades, down from 4-socket rack units, and I don't know anyone who is seriously using i7s and Phenoms in place of Xeons and Opterons.

Nobody with serious internet scalability intentions has used anything over 2S for almost a decade now. Unless your application requires wide-scale coherent transactions, you aren't getting any of the benefits of the larger systems and you're getting all of the cost disadvantages. And even in those cases where you do need some coherence, when you are talking about a monthly purchase volume of 10-50K sockets, programmers really are fairly cheap, even the good ones.
 
How is this any different on any other architecture?

Because it's a lot easier to get an application running close to peak on a homogeneous system with 20+ years of tools infrastructure than it is to get it running close to peak on a heterogeneous system with <2 years of tools infrastructure.
 
That's the exact world I live in, and I haven't seen much different of late. I call BS on 1-socket installs in DCs; it makes no sense. If you guys realised the cost per square metre of a datacenter, you would quickly see that density is king. There are so many fronts on which it doesn't make sense.

1P servers currently have a ~50% density advantage. It takes only minutes to go to the websites of the various vendors that play in this space.

It's inefficient in terms of power consumption, both from the increased inefficiency of AC-to-DC conversion in 1P servers vs blades and from the redundant parts within each server.

Blades are just marketing. People were selling racks with shared PSUs long before blades ever became a catchphrase.

It's very costly in terms of localized cooling. For example, I would need one APC half-rack AC unit for 3 blade enclosures in a 40 RU rack, but 3-4 of them for 14.4 racks, plus much bigger water pumps.

The problem is you are thinking in terms of racks. The people doing this at scale are thinking about what a cargo container requires. And they are doing their best to do it almost entirely on ambient air. The data centers of the Now and Future look a lot more like shipping warehouses than anything high tech.

Given that a 2-socket blade with 64 GB of RAM and 2 Istanbul CPUs is around 8-10K (AUD), single-socket servers don't make sense. Sure, there is hard disk to consider, but for most things in this space local disk doesn't have the performance anyway.

MS/Google/Amazon/etc. are pushing in the range of <$500 per socket.

PS: I'm a network "engineer" in a DC ;)

Ahh, the people all the other people are trying to make redundant.

Edit: lol, I completely forgot about networking costs. With 1-socket servers you're pretty much forced to go top-of-rack switching; with blades you can go with something like a Catalyst 6500 or Nexus 7000 and just run fibre back to a couple of central points. Assuming a redundant network design, it is significantly cheaper, as well as far more scalable, to centralise your network access layer.

Side of rack, actually. Here's a shocker: making everything a blade doesn't decrease the number of switch ports required.
 
A3 will have to ship, but will it ship as a GeForce?

Did they tape out any smaller GF10X parts yet?

My understanding so far is that by the time they give the green light for GF100 production, the smaller parts will go for their tape-outs. No idea, though, whether NV has sent those to the fab yet or when they're planning to.

Basically, yes. If they can't get the clocks up after 3 tries for metal layer tweaks, a fourth probably won't produce miracles. It could, but I doubt it will in practice.

A3 didn't strike me as a spin to improve frequencies directly. I'm under the impression that if you go for a spin to improve yields (which at this stage sounds less like a respin on the IHV's side and more like a pure TSMC headache), higher possible frequencies could be an indirect result of the better yields.

I am betting on a serious revamp of the architecture before a 28nm part; either that, or NV will just suck it down in the benchmarks while proclaiming a crushing lead in some really odd benchmark that they do well in.

-Charlie

Maybe it's just me, but I don't think there's enough time for a "serious revamp" between 40nm and whenever they could go to 28nm with whichever fab. What I'm more interested in for the foreseeable future is whether they will stick mostly with TSMC or whether we'll see a sudden change of winds toward Globalfoundries somewhere down the line.
 
1P servers currently have a ~50% density advantage. It takes only minutes to go to the websites of the various vendors that play in this space.

Ummm, I don't know where you got that from. An HP c7000 is 10 RU and can take 16 half-height dual-socket blades; that's 192 cores based on a 6-core CPU. 10 single-RU servers is 60 cores, or 120 cores if you go with the half-width side-by-side setup. So where is this 50% density advantage?
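Put as cores per rack unit, the arithmetic above looks like this; a quick sketch using only those figures (the half-width case assumes two single-socket 6-core nodes per RU):

```python
# Cores per rack unit, using the figures quoted above.
blade_density = (16 * 2 * 6) / 10   # c7000: 16 blades x 2 sockets x 6 cores in 10 RU = 19.2
one_socket_1u = (1 * 6) / 1         # single-socket 6-core 1U box                     = 6.0
half_width_1u = (2 * 6) / 1         # two half-width 1S nodes per RU                  = 12.0

print(blade_density, one_socket_1u, half_width_1u)
# 19.2 vs 6.0 vs 12.0 -- even the densest 1S option here trails the blades
```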


Blades are just marketing. People were selling racks with shared PSUs long before blades ever became a catchphrase.
Yes, but not as general deployment, and that's hardly worth noting; there is also a lot more integration with blade enclosures.

The problem is you are thinking in terms of racks. The people doing this at scale are thinking about what a cargo container requires. And they are doing their best to do it almost entirely on ambient air. The data centers of the Now and Future look a lot more like shipping warehouses than anything high tech.

Funny, I've helped deploy 3 data centres in the last 2 years and none meet that description. When your outside ambient air can range from -15 to 40+, letting Mother Nature take care of it isn't an option.


MS/Google/Amazon/etc. are pushing in the range of <$500 per socket.
So? That's a very small percentage of data centre workloads; most VMs in our environment (we have 1000s) are pulling well under a gig.

Ahh, the people all the other people are trying to make redundant.
And failing. Let's see: we took voice, we're taking storage, and I don't see routers/switches/load balancers/IPS/WDM etc. going away any time soon.

Side of rack, actually. Here's a shocker: making everything a blade doesn't decrease the number of switch ports required.

Yes, it does, actually. I can feed a 16-blade enclosure off 2 fibres for LAN (assuming redundancy) and 2 for SAN; if you buy all Cisco gear I can do both over 2. It's just a question of whether we do 1 or 10 gig, maybe even 40 soon enough.
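As a rough illustration of the cabling argument (not the port count inside the enclosures, which, as noted above, doesn't go away): a toy tally assuming a redundant 2 LAN + 2 SAN design and a hypothetical 96 sockets of compute:

```python
# Links leaving the compute layer under the assumed redundant design.
sockets = 96                          # hypothetical deployment size

links_1u = sockets * (2 + 2)          # 1U 1S boxes: 4 links each = 384

enclosures = sockets // (2 * 16)      # dual-socket blades, 16 per enclosure = 3
links_blade = enclosures * (2 + 2)    # 4 uplinks per enclosure = 12

print(links_1u, links_blade)          # 384 vs 12 fibre runs back to the core
```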
 
One benefit of the Quadro/Tesla setups shown is that they're using Mellanox cards.
The only Nvidia DC I've helped set up so far was just using 2S SuperMicro servers with GeForce cards. This was a financial setup; they just need a lot of numbers, so cost was definitely a factor.

They were much more concerned with getting their intra-site (10G) communications up to support their internal structures (40G)... That's what's limiting them: not how much data they can process on-site, but how to distribute it globally (7 sites).
 
Because it's a lot easier to get an application running close to peak on a homogeneous system with 20+ years of tools infrastructure than it is to get it running close to peak on a heterogeneous system with <2 years of tools infrastructure.

This is true. However, when a heterogeneous system has a much higher peak than a homogeneous system, even "less close to peak" on the heterogeneous system may be much faster than the homogeneous system.

A similar example is the Earth Simulator vs other clusters. The Earth Simulator has a crazy amount of interconnect and good tools, which make it generally much easier to program than other clusters. I've heard that it's easy to get 15% of peak with a normal Fortran program, compared to a normal cluster's 5% of peak. That's quite impressive. However, when a normal, cheaper cluster is 10 times faster than the Earth Simulator in peak performance, it doesn't matter how easy it is to achieve 100% of peak on the Earth Simulator, because even 15% of peak on the normal cluster is still faster.

Right now, a GPU generally has around 10 times or more the raw computation power of a similarly priced CPU, and generally 5 times or more the memory bandwidth. Although this does not guarantee better performance, it's at least worth trying.
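The whole argument reduces to sustained = peak × efficiency. A minimal sketch with the Earth Simulator numbers quoted above (peaks normalized to the ES; the 10x cluster figure is the post's assumption):

```python
# "PEAK != reality" as arithmetic: sustained = peak * efficiency.
def sustained(peak, efficiency):
    return peak * efficiency

es_peak, cluster_peak = 1.0, 10.0     # cluster assumed 10x the ES peak

print(sustained(es_peak, 1.00))       # 1.0 -- ES even at a perfect 100% of peak
print(sustained(cluster_peak, 0.15))  # 1.5 -- cluster at an ES-like 15% wins
print(sustained(cluster_peak, 0.05))  # 0.5 -- at a typical 5% it loses
```

The same arithmetic is what makes the 10x-flops GPU worth trying even when it runs far from its peak.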
 