NVIDIA Fermi: Architecture discussion

aaron, I still don't understand your take on it. If Tesla is cheaper (perf/$) than current CPU solutions, are you saying it won't be adopted because GeForce is even cheaper? I'm not sure I buy that logic.

I'm merely saying that they won't make any higher profit on Tesla than they will on their gaming/desktop parts, because the HPC market is well versed in taking high volume commodity parts and using them instead of the "special super high end" part, which is really just a slightly repackaged high volume commodity part.

AKA, Tesla will likely fail; GeForce has a chance, though.
 
aaronspink said:
You can bypass the id detection on the drivers and get the same performance on a geforce [as a Quadro].
How, exactly, do you accomplish that on G80 and beyond? I'm rather interested in the details. Please, do elaborate. Also, do you happen to have performance measurements on (eg) Viewperf showing that a modded GeForce driver can get you the Quadro performance?

and everything points to [ECC] being effectively software based error detection
Please do elaborate on how that works too. I'm very interested in how you'd square that with this paragraph from the Fermi whitepaper:
Fermi Whitepaper said:
Fermi’s register files, shared memories, L1 caches, L2 cache, and DRAM memory are ECC
protected, making it not only the most powerful GPU for HPC applications, but also the most
reliable.
 
How, exactly, do you accomplish that on G80 and beyond? I'm rather interested in the details. Please, do elaborate. Also, do you happen to have performance measurements on (eg) Viewperf showing that a modded GeForce driver can get you the Quadro performance?

Numbers are nice but we all know there's no physical difference between Geforce and Quadro so why wouldn't a Geforce modded to work with the Quadro driver provide the same (or better) performance?

Please do elaborate on how that works too. I'm very interested in how you'd square that with this paragraph from the Fermi whitepaper:

Nothing in that quote precludes the use of ECC via software.
 
ShaidarHaran said:
Numbers are nice but we all know there's no physical difference between Geforce and Quadro so why wouldn't a Geforce modded to work with the Quadro driver provide the same (or better) performance?
Yes, let's not let "facts" like "numbers" get in the way. We all just "know". It's Common Sense!

Nothing in that quote precludes the use of ECC via software.
So every time a register is accessed, SW ECC computation must be performed. Great. With 10-30 instructions per register access, and 4 registers accessed per multiply-add, that's a lot of overhead. An ARM11 would probably get you better math performance.
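Back-of-the-envelope, a rough Python sketch of that overhead (the 10-30 instructions per protected access is the assumption above, nothing measured):

# Rough estimate of software-ECC overhead per multiply-add, using the
# assumptions above: 4 register accesses per FMA, and 10-30 extra
# instructions per protected register access. Illustrative only.
regs_per_fma = 4
for instrs_per_access in (10, 30):
    extra = regs_per_fma * instrs_per_access
    print(f"{instrs_per_access} instr/access -> ~{extra} extra instructions per FMA")
# Prints roughly 40 to 120 extra instructions for every single FMA.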

So why isn't AMD touting the amazing (software) ECC capabilities of RV870?
 
It offers larger memory configurations. That's it. The ones that have ECC haven't been released yet, and everything points to it being effectively software-based error detection which can be implemented on current cards, GeForce and Tesla.
Why couldn't they simply use interleaved ECC codes with hardware support for the DRAM? I don't see the big deal.

You multiply the memory address before retrieving the data, take out the ECCs, barrel-shift the remaining bytes to form complete 64-bit chunks again, check them, and the data can go on its merry way (and in reverse for writing). Partial writes become RMWs, but you have that with "normal" ECC memory too.

What's the problem? Transistors are cheap and an extra cycle of latency is not a huge deal either.
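For the sake of argument, here is a toy Python sketch of the address math I mean, assuming a hypothetical inline layout of 8 data bytes followed by 1 check byte per 64-bit chunk; the check byte is a plain XOR stand-in, not a real SECDED code:

# Toy model of inline ECC: every 8-byte chunk is followed by a 1-byte check
# code in physical memory, so logical addresses are scaled before the access
# and the check byte is stripped out on the way back.
DATA_BYTES = 8    # one 64-bit chunk
ECC_BYTES  = 1    # check byte stored inline after each chunk

def check_byte(data):
    # Placeholder check code: XOR of the data bytes (a real design would use SECDED).
    out = 0
    for b in data:
        out ^= b
    return out

def write64(memory, logical_addr, data):
    assert logical_addr % DATA_BYTES == 0 and len(data) == DATA_BYTES
    # Scale the logical address so it skips over the interleaved check bytes.
    phys = (logical_addr // DATA_BYTES) * (DATA_BYTES + ECC_BYTES)
    memory[phys:phys + DATA_BYTES] = data
    memory[phys + DATA_BYTES] = check_byte(data)

def read64(memory, logical_addr):
    assert logical_addr % DATA_BYTES == 0
    phys = (logical_addr // DATA_BYTES) * (DATA_BYTES + ECC_BYTES)
    data = memory[phys:phys + DATA_BYTES]
    if check_byte(data) != memory[phys + DATA_BYTES]:
        raise RuntimeError("ECC error at logical address 0x%x" % logical_addr)  # detection only in this sketch
    return data

# Usage:
# mem = bytearray(9 * 1024)
# write64(mem, 16, bytes(range(8)))
# print(read64(mem, 16))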
 
Numbers are nice but we all know there's no physical difference between Geforce and Quadro so why wouldn't a Geforce modded to work with the Quadro driver provide the same (or better) performance?
Because it's trivial to irreversibly disable or enable pieces of hardware with a fuse? It's done everywhere for RAM repair, security keys and, yes, product differentiation.

Nothing in that quote precludes the use of ECC via software.
Not the kinds of ECC coverage indicated by Bob. Those are similar to the ECC that's present in, e.g., SPARC processors for the register files. They are calculated and checked every cycle. There's no way you can do this in SW.

That said, I don't know how important this kind is to the HPC crowd: in many cases (e.g. Google), it can be avoided due to system level redundancy or by detecting errors with a result sanity check at the end. But I suppose the knowledge that everything went fine in a commercial MRI machine may be worth the additional money... if that still qualifies as HPC, that is. I believe the ECC feature is going to be more important for smaller scale deployments such as engineering workstations where an additional couple of $1K don't really register compared to the overall cost of SW licenses and engineer salary.
 
Haha, so now they are suddenly not about picking the cheapest desktop stuff off the shelf but about maxing out uhh.. everything?

You're making this up as you go along aren't you :LOL:

Take a look at this. If you honestly believe they just buy cheap PCs and hope for the best you are being naive.

Of course these companies invest a lot of time and money making sure things like the network or CPUs don't become a bottleneck, and figuring out how to build cheap storage systems that are still reliable.
 
In all fairness, I would be careful about comparing universities to commercial HPC users.

I cannot talk about details, but the idea of using a modded driver to turn a GeForce into a Quadro is simply not an option. You sign a support contract with the software vendor which demands fully compatible hardware, and in exchange the vendor guarantees 100% uptime or pays heavy compensation for downtime.
And seriously: if I think of what it would cost to have 10-12 engineers (in one department alone) sitting around idle until we mod the driver to fit the newest updates, the Quadros are cheap.
 
That said, I don't know how important this kind is to the HPC crowd: in many cases (e.g. Google), it can be avoided due to system level redundancy or by detecting errors with a result sanity check at the end. But I suppose the knowledge that everything went fine in a commercial MRI machine may be worth the additional money... if that still qualifies as HPC, that is. I believe the ECC feature is going to be more important for smaller scale deployments such as engineering workstations where an additional couple of $1K don't really register compared to the overall cost of SW licenses and engineer salary.

This makes sense to me. For smaller scale deployments, a few thousand dollars here or there won't be much of an issue for the convenience of the Tesla products. For a large scale deployment, things are totally different, and the cost savings could potentially be huge if one is careful about which option to go with.

Based on the projected DP GFlops and retail pricing of Tesla 2xxx systems (C2050 is $2499 for 520 DP GFlops, C2070 is $3999 for 630 DP GFlops, S2050 is $12,995 for 2080 DP GFlops, and S2070 is $18,995 for 2520 DP GFlops), and assuming that GTX 380 retails at $599 with 700 DP GFlops (retail price intentionally overestimated and GFlops intentionally underestimated):

Cost to reach 1 PetaFlop (Double Precision) with GPUs would be as follows:

C2050: $4.81 million
C2070: $6.35 million
S2050: $6.25 million
S2070: $7.54 million
GTX 380: $0.86 million


Cost to reach 1 ExaFlop (Double Precision) with GPUs would be as follows:

C2050: $4.81 billion
C2070: $6.35 billion
S2050: $6.25 billion
S2070: $7.54 billion
GTX 380: $856 million

What a huge difference in GPU cost for these large scale systems! I know this is highly oversimplifying things because this doesn't include CPU Flops, doesn't include additional hardware costs for the C2050/C2070/GTX 380 which the rack-mountable S2050/S2070 systems don't incur, and doesn't include performance and reliability differences due to amount of RAM, ECC mem, driver tuning, etc., but still pretty interesting.
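For anyone who wants to re-run the arithmetic, a quick Python sketch that reproduces the figures above from list price and peak DP GFLOPS (the GTX 380 entry uses the same deliberately pessimistic guesses as in the post):

# Cost to reach 1 PFLOPS / 1 EFLOPS of peak double precision, from list
# price and peak DP GFLOPS per part (prices and ratings as quoted above).
parts = {
    "C2050":   (2499,  520),
    "C2070":   (3999,  630),
    "S2050":   (12995, 2080),
    "S2070":   (18995, 2520),
    "GTX 380": (599,   700),   # overestimated price, underestimated GFLOPS
}
PFLOPS_IN_GFLOPS = 1e6

for name, (price_usd, dp_gflops) in parts.items():
    cost_per_pflops = PFLOPS_IN_GFLOPS / dp_gflops * price_usd
    # Scaling from PFLOPS to EFLOPS multiplies the cost by 1000 (millions -> billions).
    print(f"{name:8s} ${cost_per_pflops / 1e6:5.2f}M per PFLOPS, "
          f"${cost_per_pflops / 1e6:5.2f}B per EFLOPS")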

The reality is that large scale HPC systems are to some extent driven by marketing reasons to reach world record GFlop/TFlop/PFlop peak performance levels, in which case simply providing the most Flops per dollar will be very beneficial in achieving this goal (and the next big push over the next 5-10 years is for an ExaFlop system). But is this really the right approach?

I would hope that HPC systems of the future will put less focus on peak performance, and more focus on real world performance per watt, real world performance per dollar, long-term reliability, etc. in which case NVIDIA's products should do fine in this space given the efforts they have put forth with the Fermi architecture.

I'd say it is up to NVIDIA to demonstrate to their HPC customers that Tesla is the way to go, with the larger amount of RAM, ECC memory, and optimized drivers providing a real and tangible performance and reliability benefit which would make the above cost per PFlop/EFlop comparison moot. In the gaming market, we already know that real world gaming performance differences between NV and AMD cards do not correlate well with differences in peak SP Flops. Maybe the situation will be similar in the HPC market too, so NVIDIA has to clearly show this as well.
 
Who said that? You might want to read my post again ;). And do you have any basic understanding of thermal output vs. leakage vs. voltage at a given frequency in silicon chips?

This is exactly why I stated you can't look at Tesla's flops and really work backwards to find out much about the GeForce line. Also, A2 silicon was in the range of 550 MHz to 650 MHz base clocks if we use a multiplier of 2.2 to 2.4 to get the GFLOPS range they are going for. So come again? If you listened to the web conference call, they did state that the reason the flops are lower for Tesla is the end requirements of the systems they are going to be in.

Admittedly, I'm a little late to the party, but I've got a few questions left:
Do we already know that hotclock multipliers are the same for GeForce and Tesla? Do we know if the base clock will be the same between the two product lines? Even more important: do we know which parts of the pipeline will be hot-clocked and which will be base-clocked? And finally, are there set limitations on how high this multiplier can be chosen? I think those are very important points when we try to draw conclusions from what is (perceived to be) known about Tesla with respect to the corresponding GeForce characteristics.
 
Numbers are nice but we all know there's no physical difference between Geforce and Quadro so why wouldn't a Geforce modded to work with the Quadro driver provide the same (or better) performance?

It should, although these days you'd have to mod the Quadro driver rather than the board, as the hardware mods aren't as simple as they used to be. And sure, the GeForce should also somehow be able to pretend that it is a Tesla with less memory. But all of that really isn't as relevant as some people seem to think it is.

What is next, someone going to suggest HPC users buy budget CPUs and overclock them? Sure, it's all possible, but successful businesses just don't need to resort to these kinds of shortcuts.

And in general, people shouldn't overestimate hardware costs as a factor in most projects. It's not like any big buyer who is considering going Tesla will be paying anywhere near list price anyway.

Nothing in that quote precludes the use of ECC via software.

That'd be great. Then ECC drivers for all other graphics cards should be right around the corner.
 
While reading DKanter's piece about Fermi ECC, I noticed that the whitepaper he links to has been site-jacked :/

Are the transmission errors on GDDR5 so frequent that ECC actually becomes beneficial over simple error detection and resend? I can't find anything about this in the Qimonda or Hynix whitepapers. Are they using 8- or 32-bit CRC?
 
If that is true then they aren't competing against commodity CPUs, they are competing against commodity GPUs, which I might remind you have even LOWER margins than CPUs. In the HPC space you generally aren't competing against the other guy's expensive parts, but against your own cheapest parts.

AKA, why would I pay Nvidia 2K for something they are selling for $300 when it's them that want to push it?
Given a few things, such as:
• AMD and Nvidia have been sitting on the knowledge of TSMC's 40nm problems for at least nine months now
• GT212, the 40nm replacement for GT200b, seems to have been cancelled mid-2008
• GeForce-Fermi seems to be due no earlier than another two months from now
• Nvidia clearly stated that they could design DP-disabled and thus smaller chips without major effort

I wonder how likely it would be that, in light of recent (economic) developments, their targeting of the HPC market, the competition between Tesla and GeForce for HPC, AMD's one-third smaller Cypress, and other factors, Nvidia did work on separate silicon to present as GeForce.
 
In any truly large scale deployment you'll get your board pricing direct from NVIDIA anyway, not the current channels where market forces (and NVIDIA) are dictating things. Buy big, NV will lower margin, simple as that.

This whole discussion about whether the modern consumer GPU is a good fit for high performance computing (helps to spell it out, people get blinkered by the contraction) is mad. Of course it is, it has an immense flop/area/watt/dollar advantage compared to other devices for certain codes.

So any company that wants to try and exploit NV hardware to meaningfully accelerate their software should do this if they have an ounce of sense:

  • Buy two GeForces for each engineer doing the investigation into whether it's worth it
  • Have them experiment and work with both chips
  • If the experiments are fruitful and it looks like the GPU is great, think about Tesla
  • Think about it properly, because the extra cost does buy you useful things
  • For example, if you think more board memory will accelerate your app, borrow a Tesla or a Quadro and find out. NV and partners can and will do that to help get sales. No need to buy at this point.
  • Make a final decision, based on perf/watt/dollar vs other devices (lots of metrics here, some non-obvious, but fairly easy if your algorithm and software make it a no brainer versus other devices on perf alone)
  • If you're going ahead, negotiate a price with NV or a channel partner depending on hardware volume. Don't pay attention to list prices. Money talks.

It's really not that difficult or expensive (hardware wise) to perform the evaluation, then you use your buying power if you decide to buy to leverage the cost in your favour. That's good at this point, especially with NV, because they're keen to get deployments to see what's working and what's not.

To say that Tesla will fail is laughable. Unless I'm mistaken, NV have already amortised the engineering and productisation costs for GT200-based Tesla, so it's mostly profit margin now. That'll happen even quicker with Fermi. Yes, GeForce or Quadro might be a nicer fit, but depending on your codes and performance the large board memory or support from NV engineers might be crucial.

I know CPU people want to sell CPUs, but the market for GPUs in high performance computing (just checking you're not skim reading) is terribly real at this point. You don't have to be a genius to figure out that there are some types of code that will just go much faster on the GPU. Graphics says hello.
 
http://www.digitimes.com/news/a20091228PD207.html
Nvidia is expected to delay its next-generation DirectX 11-supporting GPU (Fermi) to March, 2010

From the article:

Nvidia plans to launch a 40nm GDDR5 memory-based Fermi-GF100 GPU in March, and will launch a GF104 version in the second quarter to target the high-end market with its GeForce GTX295/285/275/260, the sources pointed out.

Will the first GPUs based on GF100 fall short on performance? Or will they be so good that they will sell for $700 and Nvidia will bring out another chip to target $300-$500 market?
 
From the article:

Nvidia plans to launch a 40nm GDDR5 memory-based Fermi-GF100 GPU in March, and will launch a GF104 version in the second quarter to target the high-end market with its GeForce GTX295/285/275/260, the sources pointed out.

Will the first GPUs based on GF100 fall short on performance? Or will they be so good that they will sell for $700 and Nvidia will bring out another chip to target $300-$500 market?

GF104 is probably just a half-GF100 with ~256 SPs and a 192~256-bit bus, meant to replace GT200b-based products.
 