NVIDIA Fermi: Architecture discussion

As things go small, both the compute density advantage and the power efficiency become less important, and at the same time the relative cost of software development goes up. For the cost of a C2050 you can get what ... 4 quadcore blades with Infiniband HCAs? (A lot more if you go with gigabit ethernet.)
 
Again, Tesla competition is not Geforce. You keep making this irrelevant comparison and I don't know why. The comparison is between Tesla and current CPU based setups. In certain workloads the perf/$ of Tesla will be much higher and that's what Nvidia is targeting. So in fact they are banking on the cheapness of the HPC folks.

Because Geforce can do EVERYTHING that a Tesla can! They both can run the exact same software stack. They both have the same feature set. The difference is that one costs 5-10x as much. Why spend 5-10x as much to get the exact same functionality?
 
As things go small, both the compute density advantage and the power efficiency become less important, and at the same time the relative cost of software development goes up. For the cost of a C2050 you can get what ... 4 quadcore blades with Infiniband HCAs?

And 4 quadcore Nehalem blades offer what, about 200 DP GFLOPS if you use SSE, in a 6U form factor?

Let's be charitable and fill up that chassis, and you're talking around 500 DP GFLOPS in those 6Us. Using, oh, let's say about 3-4KW of power.

Or you could use an S2050 for about 2 DP TFLOPS in a 1U package at around 1KW.
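
(For reference, rough arithmetic behind those numbers - a sketch assuming single-socket 2.66GHz quad-core blades and NVIDIA's quoted ~515 DP GFLOPS per Fermi GPU, both assumptions rather than measurements:)

# Peak DP FLOPS, back-of-the-envelope, in Python.
# SSE on Nehalem: one 2-wide DP add + one 2-wide DP mul per cycle
# = 4 DP FLOPs per cycle per core.
def cpu_peak_dp_gflops(sockets, cores, ghz, flops_per_cycle=4):
    return sockets * cores * ghz * flops_per_cycle

blade = cpu_peak_dp_gflops(1, 4, 2.66)   # ~42.6 GFLOPS per blade
print(4 * blade)                         # ~170 GFLOPS for 4 blades
print(12 * blade)                        # ~511 GFLOPS for a filled chassis

# S2050: four Fermi GPUs at the quoted ~515 DP GFLOPS each
print(4 * 515 / 1000.0)                  # ~2.1 DP TFLOPS in 1U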
 
The premium is actually huge. I'm not sure which CPUs you have been looking at.

The ones people buy. Also, it helps to pay attention to things besides just clock speed, FYI.

As for server farms, particularly HPC, they are all about the lowest amount of cooling and power possible for rooms full of rackmounted equipment. 1P boxes using desktop parts? Maybe.. for a handful of poor university student projects.

For a handful of massive installations for people like google, amazon, yahoo, MS, etc. You think they are using top end xeons and opterons? They are using the cheapest parts they can generally get their hands on and then putting them into 4 per 1U boxes pre-packaged into racks/shipping containers.
 
And 4 quadcore Nehalem blades offer what, about 200 DP GFLOPS if you use SSE, in a 6U form factor?

Let's be charitable and fill up that chassis, and you're talking around 500 DP GFLOPS in those 6Us. Using, oh, let's say about 3-4KW of power.

Or you could use an S2050 for about 2 DP TFLOPS in a 1U package at around 1KW.


Depends on the workload. For the vast majority, the blades will have infinitely higher performance...

And if you really want to talk low cost, you're better off with a couple of Geforces...
 
The ones people buy. Also, it helps to pay attention to things besides just clock speed, FYI.

Got any examples?

For a handful of massive installations for people like google, amazon, yahoo, MS, etc. You think they are using top end xeons and opterons? They are using the cheapest parts they can generally get their hands on and then putting them into 4 per 1U boxes pre-packaged into racks/shipping containers.

Here's what a Google server looks like - hey, it's a 2 socket box. Of all the companies you mention, Google has by far the cheapest server design philosophy, and none except Amazon (arguably, with ec2) are in the HPC business.

Maybe you should just admit already that you're basically blowing smoke :LOL:
 
Because Geforce can do EVERYTHING that a Tesla can! They both can run the exact same software stack. They both have the same feature set. The difference is that one costs 5-10x as much. Why spend 5-10x as much to get the exact same functionality?

Because it's not the exact same functionality at all. Tesla offers much larger memory configurations and pervasive ECC protection.
 
Because it's not the exact same functionality at all. Tesla offers much larger memory configurations and pervasive ECC protection.

You forgot the fact that even though the software will run on Geforces just as well as the Teslas, it will run vastly faster on the Tesla than it will on the Geforce, so people who buy them are also paying for the performance they'll get over the "desktop" counterpart.

It's apparent Aaron has never seen the comparison between Geforces and their bigger cousins the Quadros running the same software. In almost all cases, the Quadros are 2-3x faster than the Geforce versions of the cards at the professional apps. I can only imagine that the margin of difference is only larger in the HPC sector.
 
Got any examples?

TDP for one.


Here's what a Google server looks like - hey, it's a 2 socket box.

Yes, it's one server of umpteen tens of different designs they've used in various data centers.


Of all the companies you mention, Google has by far the cheapest server design philosophy, and none except Amazon (arguably, with ec2) are in the HPC business.

Maybe you should just admit already that you're basically blowing smoke :LOL:

They are all in the HPC business. The only difference in many of their setups and other public/commercial HPC sites is applications.
 
Because it's not the exact same functionality at all. Tesla offers much larger memory configurations and pervasive ECC protection.

It offers larger memory configurations. That's it. The ones that have ECC haven't been released yet, and everything points to it being effectively software-based error detection, which can be implemented on current cards, Geforce and Tesla.
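
(If it really is software-based, the basic idea is easy to sketch. Purely illustrative Python below - none of this is NVIDIA's actual mechanism, and read_back() is a placeholder, not a real API:)

import hashlib

# Software error detection, the simple way: run the computation twice
# (or checksum a buffer that should not have changed) and compare.
def buffers_match(run_a: bytes, run_b: bytes) -> bool:
    return hashlib.md5(run_a).digest() == hashlib.md5(run_b).digest()

# Usage sketch: read_back() stands in for a device-to-host copy.
# ok = buffers_match(read_back(result_run1), read_back(result_run2))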
 
You forgot the fact that even though the software will run on Geforces just as well as the Teslas, it will run vastly faster on the Tesla than it will on the Geforce, so people who buy them are also paying for the performance they'll get over the "desktop" counterpart.

Given the same software, Geforces will tend to run the application faster since they have more flops and more bandwidth.

It's apparent Aaron has never seen the comparison between Geforces and their bigger cousins the Quadros running the same software. In almost all cases, the Quadros are 2-3x faster than the Geforce versions of the cards at the professional apps. I can only imagine that the margin of difference is only larger in the HPC sector.

Yes, it's always nice when a vendor intentionally cripples performance in drivers. The hardware is exactly the same. You can bypass the ID detection in the drivers and get the same performance on a Geforce.

And in HPC, the margin is likely INVERTED, since both actually do run the same software stack and the Geforces are the higher-performance devices.
 
TDP for one.

Yeah, the cores that are validated for multiprocessor capability are quite often tested and packaged for a lower TDP. Of course they're still the same chips, at a premium of usually at least several hundred dollars. Your argument that Intel and AMD are not able to ask a substantial price difference between desktop and professional parts is nonsense. They do exactly that.

Yes, it's one server of umpteen tens of different designs they've used in various data centers.

They are all in the HPC business. The only difference in many of their setups and other public/commercial HPC sites is applications.

No, they're in web serving and transaction processing and data warehousing and many other different tasks. This is not the same as HPC, which is about computational performance.
 
A few notes:

1. Cell is a mess for programmability and only made sense when Sony was paying all the NRE. Once Sony learned their lesson, it was cancelled. Also note that IBM tried to sell Cell blades for insane prices (even relative to GPUs). NV GPUs are easier to program than Cell by far (instruction caches FTW).

2. GPU hardware is around 4-8X faster than a CPU in an ideal case. Larger reported speedups usually come down to algorithmic or tuning issues in the baseline. However, many optimal algorithms aren't supported efficiently on GPUs, which is a big issue. Algorithms typically have the biggest impact on performance, and can easily put 4-8X gains into the noise.

3. Many HPC buyers are very cheap and knowledgeable (have graduate students, will port code!).

4. Where ECC is needed, the implications for performance will be considered in comparison to CPUs. If Fermi is 4X faster than a 6-core Westmere DP, then with ECC overhead it's probably more like 3X (25% overhead); a quick worked version of this arithmetic follows the list. People will notice.

5. Some clever HPC users may not care about ECC (because of their workload or the way they wrote their software), and may be unwilling to pay a premium at all. I've certainly talked to a few myself. They would rather buy desktop cards and not get gouged.

6. To make money in HPC, NV needs to move far beyond academia, since they don't pay well. It's commercial uses that are most lucrative.

7. HPC cannot support a company of NV's size - look at SGI, Cray and the other HPC-focused companies. They need the volume economics of desktop GPUs to make things work.
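
(A quick worked version of the arithmetic in point 4, with the speedup and overhead as stated assumptions:)

# Illustrative only: assumed 4X Fermi-vs-Westmere speedup and an
# assumed ~25% ECC cost in bandwidth/capacity.
speedup_no_ecc = 4.0
ecc_overhead = 0.25
print(speedup_no_ecc * (1 - ecc_overhead))   # 3.0 -- "more like 3X"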
 
No, they're in web serving and transaction processing and data warehousing and many other different tasks. This is not the same as HPC, which is about computational performance.

I don't think you really understand how the server systems at places like google/amazon/MS work. They are maxing out computational performance, memory performance, storage performance, and network performance. If anything, they research and optimize their hardware MORE than most other HPC sites.

And HPC is LESS about computational performance than it is about data movement efficiency. The top end HPC machines are generally defined by their networks and not their processors.
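
(One way to make that concrete is the roofline bound: delivered FLOPS = min(peak, bandwidth x arithmetic intensity). The figures below are illustrative, not measurements:)

# Roofline sketch: data movement caps delivered performance.
def attainable_gflops(peak_gflops, bw_gb_s, flops_per_byte):
    return min(peak_gflops, bw_gb_s * flops_per_byte)

# An assumed 515 DP GFLOPS part with 148 GB/s of bandwidth: code at
# 0.5 flops/byte is pinned at ~74 GFLOPS; you need ~3.5 flops/byte
# before peak throughput matters at all.
print(attainable_gflops(515, 148, 0.5))   # 74.0
print(attainable_gflops(515, 148, 10.0))  # 515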
 
I don't think you really understand how the server systems at places like google/amazon/MS work. They are maxing out computational performance, memory performance, storage performance, and network performance. If anything they do more research and optimize their hardware MORE than most other HPC sites.

Haha, so now they are suddenly not about picking the cheapest desktop stuff off the shelf but about maxing out uhh.. everything?

You're making this up as you go along aren't you :LOL:
 
Good post and I agree completely with most statements but just a few selective responses:

A few notes:
2. GPU hardware is around 4-8X faster than a CPU in an ideal case. Larger reported speedups usually come down to algorithmic or tuning issues in the baseline. However, many optimal algorithms aren't supported efficiently on GPUs, which is a big issue. Algorithms typically have the biggest impact on performance, and can easily put 4-8X gains into the noise.

Absolutely. But then again, the more parallel your problem, the bigger the gain a GPU is likely to have, and there is potential for an even larger speedup. When it comes to algorithms and optimising your code for your architecture, that actually ties in rather nicely with:

3. Many HPC buyers are very cheap and knowledgeable (have graduate students, will port code!).

I would say that exactly that could actually be a major boon for the adoption of non-x86 ISA computing like CUDA, OpenCL etc :)
 
Heh, which is why the 2.66GHz Bloomfield Xeon W3520 goes for $309 on Newegg, while the equivalent DP-capable X5550 is $999. Ever wonder why the modest 2.6GHz Opteron 8435 sells for upwards of $2600?

The Opteron 8435 scales up to 8 sockets; it's not a DP part. The low-end Istanbul starts at ~$450 while delivering more flops than Nehalem, and Shanghai starts at $174.

Also, sometimes DP boxes are cheaper than two UP boxes: you can cut costs by not duplicating hard disks for checkpointing (for the guys who do it...), Infiniband cards and switches, and sometimes the motherboard/mount. Oh, and some also have to pay per-server licenses.

And don't forget, a large portion of the HPC space has an army of slaves, sometimes called grad and undergrad students, to code things that would never make sense in a commercial setting. $10K can pay for a lot of coding time in an academic environment. :)

From an economic point of view this is often the most expensive workforce: not having the best code means more hardware to do the same work, and with a thousand or so nodes that's a lot of hardware, maybe expensive hardware.

At universities we may call it an "educational cost", so it's OK.

And about Tesla margins... buy two Geforces and check the results: it's 2.5 times cheaper. Or better, buy two Radeons: likely 4 times cheaper for the same performance.
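
(As rough arithmetic, with assumed street prices of ~$2500 for a C2050, ~$500 per comparable Geforce, and ~$300 per Radeon:)

# Assumed prices, just to show the ratios being claimed.
tesla_c2050 = 2500
geforce = 500
radeon = 300
print(tesla_c2050 / (2 * geforce))   # 2.5x cheaper with two Geforces
print(tesla_c2050 / (2 * radeon))    # ~4.2x cheaper with two Radeons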

Let's see how well nVidia PR will work to sell Teslas, let's see if they do better than ClearSpeed...
 
Haha, so now they are suddenly not about picking the cheapest desktop stuff off the shelf but about maxing out uhh.. everything?

You're making this up as you go along aren't you :LOL:

Maybe you should try some reading comprehension and then rejoin the conversation.
 
Haha, so now they are suddenly not about picking the cheapest desktop stuff off the shelf but about maxing out uhh.. everything?

You're making this up as you go along aren't you :LOL:

No, no he isn't. It's the reason Cray builds the biggest supercomputers, it's the reason the biggest x86 supercomputers are AMD and not Intel, and it's the reason why things like this exist. If you can't move your data efficiently or scale your bandwidth, then you lower your real-world perf vs your peak perf.

This is what I have wondered about with GPUs. Cray has its SeaStar routers connected directly to the HyperTransport buses and currently has 40Gbit/s of bandwidth per node, configured in very complex partial-mesh designs. How exactly does NV plan to scale across nodes/racks/rows?
 
I think you guys are looking at this from the wrong angle. GPUs are not going to come in overnight and be better at everything that people have built over the last few decades. It's easy to point out situations where current GPU technology doesn't fit but so what? That's not the point. The point is to provide a better solution to those problems where the technology makes sense and mature from there.

aaron, I still don't understand your take on it. If Tesla is cheaper (perf/$) than current CPU solutions are you saying it won't be adopted because Geforce is even cheaper? Not sure I buy that logic.
 