NVIDIA Fermi: Architecture discussion

Layman perspective:

I'm not sure that the 5870 / 5890 are really the most important competitors of the Fermi for HPC. IMHO the real competitor could be Llano:

- wait ~half a year
- use a 4 CPU server board
- 16GB of ECC RAM
- 4 Llano APUs with HT3 [ ~ 800GFlops - 1TFlop DP Performance ]

- have fun ;)

- If AMD is not stupid enough to include too few HT3 interfaces in the Llano APUs, then such a system would have far more HPC performance than Fermi at a comparable price (depending on how much AMD wants for the additional HT3 interfaces for server applications).

If Juniper doesn't have DP, what makes you think that Llano will have it? :LOL:

However, I am expecting Llano to have HT3, as that is what connects AMD CPUs to the northbridge.
 
Theoretical flops aren't what matters; what matters is what you can achieve for an expected set of workloads. Even computers like BlueGene have sustained flops that pale compared to their theoretical peaks. If slapping together a bunch of dumb ALUs were all you needed to make a good supercomputer, IBM or Intel would have done it years ago.

What's also being ignored is the software side. Scientists, the main consumers of HPC, suck as software engineers. We know this from the leaked Climategate scandal. You can have higher theoretical flops, but if it's so hard to write code that achieves those flops, then in practice they aren't achieved. The people voting for Red have been through this before with Xbox vs PS3 CELL, where you had the potential for high-performance CPU execution that was damn difficult to achieve in practice.

The best thing NVidia could do would be to produce a Fortran compiler for Fermi. It looks like they're already going to do C++. There are a few third-party Fortran compilers already. In any case, if the Fermi architecture is better at executing more 'SIMD CPU-like' code, and the software tools are there to generate efficient code that can let the average person attain high utilization, then they have a powerful argument for choosing Fermi.

Honestly, the biggest thing they have against them in HPC is not Cypress's theoretical TFLOP advantage, but their power consumption and heat.
 
The best thing NVidia could do would be to produce a Fortran compiler for Fermi. It looks like they're already going to do C++. There are a few third-party Fortran compilers already.
Well, no, what they need is a set of libraries that offload certain expensive operations to the GPU. It is probably too much to expect any but a tiny fraction of scientists to bother with programming GPUs directly. Instead they should just ship versions of LAPACK and BLAS that are GPU-accelerated, as well as other common operations that benefit, such as FFTs.
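To make that concrete: here's a rough sketch of what such an offload layer can look like underneath, using CUBLAS for the GPU side. The wrapper function, the square-matrix setup and the tiny test in main() are my own illustration, not any shipping library's actual interface.

```c
/* Hedged sketch: a BLAS-style SGEMM that quietly runs on the GPU via CUBLAS.
 * From the caller's point of view it's still just "call SGEMM". */
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* C = alpha*A*B + beta*C for n*n column-major matrices (standard SGEMM convention). */
void gpu_sgemm(int n, float alpha, const float *A, const float *B, float beta, float *C)
{
    cublasHandle_t handle;
    float *dA, *dB, *dC;
    size_t bytes = (size_t)n * n * sizeof(float);

    cublasCreate(&handle);
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    /* Copy operands to video memory, run the GPU kernel, copy the result back. */
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, bytes, cudaMemcpyHostToDevice);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}

int main(void)
{
    /* Tiny 2x2 smoke test: C = A*B, stored column-major. */
    float A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4] = {0};
    gpu_sgemm(2, 1.0f, A, B, 0.0f, C);
    printf("C = [%g %g; %g %g]\n", C[0], C[2], C[1], C[3]);
    return 0;
}
```

The point is that the scientist's own code never touches CUDA directly, which is exactly why shipping accelerated LAPACK/BLAS/FFT libraries matters more than compilers.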

I actually know somebody right now that's working on a version of Healpix (a way of pixelizing a sphere that is optimized for spherical harmonic transforms) in order to offload some of the computations to the GPU. He's working on a training program to determine, for each machine, how much it should offload for optimal performance.

If he can get that working, then scientists in my field will make copious use of it, provided the cost/performance is there, because it would be basically no different from using the current Healpix routines that already see wide use.
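The "training" part is conceptually simple, too. A toy sketch of the idea; the cost model below is completely made up so the example is self-contained, whereas a real tuner would time the actual Healpix routines on each machine:

```c
/* Toy auto-tuner: try a few CPU/GPU offload fractions, time each, keep the fastest.
 * timed_run() is a synthetic stand-in for timing the real workload. */
#include <stdio.h>

static double timed_run(double gpu_fraction)
{
    /* Made-up model: CPU and GPU parts overlap, GPU ~5x faster, plus PCIe overhead. */
    double cpu_time = (1.0 - gpu_fraction) * 10.0;
    double gpu_time = gpu_fraction * 2.0;
    double transfer = gpu_fraction > 0.0 ? 0.5 : 0.0;
    return (cpu_time > gpu_time ? cpu_time : gpu_time) + transfer;
}

int main(void)
{
    double best_fraction = 0.0, best_time = timed_run(0.0);

    for (double f = 0.1; f <= 1.0001; f += 0.1) {
        double t = timed_run(f);
        printf("offload %3.0f%% -> %.2f s\n", f * 100.0, t);
        if (t < best_time) { best_time = t; best_fraction = f; }
    }
    printf("best split for this machine: offload %.0f%%\n", best_fraction * 100.0);
    return 0;
}
```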
 
Well, no, what they need is a set of libraries that offload certain expensive operations to the GPU. It is probably too much to expect any but a tiny fraction of scientists to bother with programming GPUs directly. Instead they should just ship versions of LAPACK and BLAS that are GPU-accelerated, as well as other common operations that benefit, such as FFTs.

Exactly, but AMD seems to be doing quite well in that respect with their ACML-GPU. I haven't used it yet, though, so I don't know if it has many shortcomings.
 
The issue isn't just hardware problems, but cosmic rays flipping your bits as well.
I know. I guess you've not been following the discussion on ECC and measured error rates on GPUs without ECC.

Soak testing will do nothing to stop that. Build a big enough cluster and run it long enough, and the probability of failure becomes non-trivial.
Actually, the only test to date with GPUs shows no failures (while Aaron reckons the method of that test is faulty). What it does show is that graphics cards delivered with faulty memory are a serious problem.

The estimate for cosmic ray bit flips is about 1 event per 256MB of memory per month. Amazon was taken down for 24 hours in the 90s by a cosmic ray event.
Yeah, you've really been out of the loop on this subject:

http://forum.beyond3d.com/showthread.php?t=54676

No-one has demonstrated a need for video memory to be protected by ECC. GPU on-die memory? Not that either. Fact is, the error rates in contemporary systems do not match with "received wisdom".

People building HPC clusters are going to be using several hundred cards and running them on jobs which could run for weeks or months and consume huge $$$ in power costs and time, so to have the results fscked up halfway through is a bitch.
In the end, ECC isn't terribly expensive to implement in hardware, the way NVidia's implemented it. The performance loss isn't a deal breaker either. NVidia took the easy way forward, relatively speaking, and will be marketing it purely on FUD, as there is no evidence that GPU video memory suffers from cosmic ray events.

By the way, I'm not saying GPU video memory can't suffer from cosmic ray events - I'm saying the evidence one way or the other (or any measurements of failure rate) doesn't exist.

Even if fears are overrated, the people in the position of purchasing huge amounts of equipment, especially for government laboratories, are risk averse and like to buy safety.
Yeah, it's why "no-one ever got fired for buying IBM" became a paradigm. Of course, the cost of the FLOPS can make one quite pragmatic about risk. Like these guys:

http://forum.beyond3d.com/showthread.php?p=1353418#post1353418

using GPUs that are obviously not ECC protected. I wonder if they'll be reporting about their cosmic ray problems...

Honestly, I don't think it really matters - the option's there for those who "need" it and the reality is that NVidia has, effectively, not priced ECC as a premium option (since DP throughput and memory amount are the dominant facets of the premium option). But the science on cosmic ray events is sorely, or is that "hilariously", lacking.

Of course, now that NVidia's built ECC it will be possible to experimentally compare ECC, one test with it turned on and one with it turned off and see how they fare. Though as I understand it, the on-die ECC can't be turned off - so it's not an entirely controlled experiment.

Jawed
 
And who would run such a system on hacked drivers? The price of a Quadro or Tesla is nothing compared to one engineer being forced to sit around doing nothing for a day, or having to redo a calculation that took 24 hours and throwing the whole schedule off.

Apart from that, most software vendors won't give you support if you run GeForces on hacked drivers either.


Any of the hobbyists that won't spend thousands, any companies that don't want to spend multiples of that but still have large amounts of computing power, companies in China, etc.
 
Jawed, one paper does not disprove anything, but even if they were right, the fact remains that the market *believes* protection is needed from these errors. Do you think NVidia went through the trouble to throw ECC into Fermi for no reason, or because some very important customers had actually asked them for it?

On the subject of LINPACK/BLAS/etc, I'd go one step further and say that shipping something like MATLAB with GPU acceleration would help even more. Still, anyone building a cluster for a specific purpose (like nuclear weapons simulations) will probably require more hands-on development, and having a superior software tool chain and easier to program hardware certainly isn't going to be a negative.

Both AMD and NVidia still face intense competition from Intel. A few years ago, before GPGPU, I was doing Smith-Waterman and HMMR software (hand-tuned for SMP), and our competitors were selling FPGA solutions. The FPGA systems were faster, but still not as flexible as the CPU, and every time a new CPU was released we got a lot faster, while their FPGA systems became legacy.

The problem still remains that, even if a GPU is 10x faster than a CPU, investing in a server farm of cheap commodity CPUs gives you more flexibility in what you can run, and how much it'll cost.
 
Yeah I don't really get Jawed's point here. It's not like Nvidia is running around spreading FUD about cosmic rays to get people to buy into their ECC enabled products. It's the customers themselves who are demanding the protection.
 
Jawed, one paper does not disprove anything, but even if they were right, the fact remains that the market *believes* protection is needed from these errors. Do you think NVidia went through the trouble to throw ECC into Fermi for no reason, or because some very important customers had actually asked them for it?
Do you think I'm arguing that NVidia should not have implemented ECC? I'm not.

All I'm saying is that the evidence is lacking - hence my earlier post: "I'm still waiting for any data that shows GPUs without ECC suffer from memory errors (once the memory has passed being soak-tested for hardware problems)."

Also, since NVidia's solution minimises development costs as well as die-space and system-implementation costs, it's actually a very good solution. I'm not criticising NVidia for engineering this solution - not at all (I've noted before the "customer demand").

I'm merely criticising the lack of science. Which is super-ironic given that it's scientists who'll be amongst the biggest consumers (if not the biggest) of the ECC capability. If they've been "educated" to insist on ECC, then more fool them.

In the meantime, ECC is there, and NVidia's produced what is, in effect, a non-premium-priced ECC option. And it will make the basis of a nice test for the continuing need for ECC. Seriously, it's a good thing NVidia's implemented it, because the way they've done it makes it optional (a software switch) and not the predominant product differentiator.

The only fly in the ointment with ECC is that ECC on its own is not a guarantee of exactitude, as there are other error mechanisms in a GPU cluster. If you really care about exactitude for months-long compute runs then ECC doesn't buy much - you're looking at truly byzantine (super-expensive) configurations to ensure correctness.

The problem still remains that, even if a GPU is 10x faster than a CPU, investing in a server farm of cheap commodity CPUs gives you more flexibility in what you can run, and how much it'll cost.
Not to mention that CPUs are on the cusp of an explosion in core count (we'll be at 16 cores soon enough - and Fusion/AVX/LRBni promise good things on the 5 years+ timeline). And, because GPUs have done much of the heavy-lifting in engineering ultra-high bandwidth consumer memory, CPUs will be able to incorporate much more bandwidth as core counts per chip increase...

Anyway, there's still plenty of things where GPUs are 50-100x faster (explicitly data-parallel workloads)...

Jawed
 
Do you think I'm arguing that NVidia should not have implemented ECC? I'm not.

All I'm saying is that the evidence is lacking - hence my earlier post: "I'm still waiting for any data that shows GPUs without ECC suffer from memory errors (once the memory has passed being soak-tested for hardware problems)."
GPU memory isn't different or special compared to system memory. It's going to suffer just as many errors per die area as system memory does. If supercomputing systems need ECC memory for system memory in order to operate stably for long periods of time, then so will GPUs.
 
By the way, I'm not saying GPU video memory can't suffer from cosmic ray events - I'm saying the evidence one way or the other (or any measurements of failure rate) doesn't exist.

But evidence *does* exist. It's not like DRAM chips suddenly become immune to alpha decay in the packaging material or cosmic rays just because they are soldered onto a GPU PCB, and there are plenty of studies on DRAM soft errors.

Now, the risk of encountering a soft error on a single GPU with traditional amounts of memory (½-1GB) might not be big. But once you have a non-trivial number of racks with GPGPU solutions in them, your MTBF tanks.
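A quick back-of-envelope, using the ~1 event per 256MB per month estimate quoted earlier in the thread (the per-board memory and cluster size below are just example numbers):

```c
/* Expected soft-error rate for a hypothetical GPGPU cluster, based on the
 * ~1 upset per 256MB per month figure quoted earlier in the thread. */
#include <stdio.h>

int main(void)
{
    const double events_per_mb_month = 1.0 / 256.0;  /* quoted estimate */
    const double mb_per_board = 3072.0;              /* assume a 3GB compute board */
    const double boards = 500.0;                     /* assume a modest cluster */

    double events_per_month = events_per_mb_month * mb_per_board * boards;
    double hours_between = (30.0 * 24.0) / events_per_month;

    printf("expected upsets per month: %.0f\n", events_per_month);
    printf("mean time between upsets:  %.2f hours\n", hours_between);
    return 0;
}
```

With those numbers you are looking at an upset somewhere in the cluster every few minutes, which is the "tanking" in question.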

Cheers
 
But evidence *does* exist. It's not like DRAM chips suddenly become immune to alpha decay in the packaging material or cosmic rays just because they are soldered onto a GPU PCB, and there are plenty of studies on DRAM soft errors.

Now, the risk of encountering a soft error on a single GPU with traditional amounts of memory (½-1GB) might not be big. But once you have a non-trivial number of racks with GPGPU solutions in them, your MTBF tanks.
The one experiment that's been done with racks of GPUs should have shown this tanking - that's the issue, it was nowhere to be seen. Anyway, further experiments are going to be much easier...

Jawed
 
L2 cache scaling

"We asked Henry Moreton, Distinguished Engineer at nVidia what the overall bandwidth of the caches involved was and we learned that nVidia's GF100 packs more than 1.5 TB/s of bandwidth for L1 and a very similar speed for the L2 cache."
http://www.brightsideofnews.com/new...100-architecture-alea-iacta-est.aspx?pageid=1

The nearly 1.5 TB/s is quite fast for a single read/write cache. If the 2*GPC variants still have 768KB of L2 cache, then it could be fast enough even with slower memory.
 
The one experiment that's been done with racks of GPUs should have shown this tanking - that's the issue, it was nowhere to be seen. Anyway, further experiments are going to be much easier...

Jawed

We need a survey of server admins who check for ECC events in their logs.
 
"We asked Henry Moreton, Distinguished Engineer at nVidia what the overall bandwidth of the caches involved was and we learned that nVidia's GF100 packs more than 1.5 TB/s of bandwidth for L1 and a very similar speed for the L2 cache."
http://www.brightsideofnews.com/new...100-architecture-alea-iacta-est.aspx?pageid=1

The nearly 1.5 TB/s is quite fast for a single read/write cache. If the 2*GPC variants still have 768KB of L2 cache, then it could be fast enough even with slower memory.
The L1 figure appears to be a straightforward 4 bytes * 16 load/store units * 16 cores * 1.5 GHz.
The L2 sounds like it has core-local partitions with similar bandwidth, though probably not of 32-bit granularity in transfers.
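Spelling that L1 estimate out (the unit counts and the 1.5 GHz hot clock are the assumptions above, not confirmed specs):

```c
/* Aggregate L1 bandwidth estimate: 16 SMs * 16 load/store units * 4 bytes/clock * 1.5 GHz. */
#include <stdio.h>

int main(void)
{
    const double bytes_per_ldst = 4.0;
    const double ldst_per_sm    = 16.0;
    const double sms            = 16.0;
    const double clock_hz       = 1.5e9;

    double bytes_per_sec = bytes_per_ldst * ldst_per_sm * sms * clock_hz;
    printf("aggregate L1 bandwidth: %.3f TB/s\n", bytes_per_sec / 1e12);  /* ~1.536 TB/s */
    return 0;
}
```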
 
The one experiment that's been done with racks of GPUs should have shown this tanking - that's the issue, it was nowhere to be seen. Anyway, further experiments are going to be much easier...

Jawed
In that topic, both aaronspink and dkanter said you are wrong in downplaying cosmic rays. Fact is, detecting zero cosmic-ray-induced errors is far more suspicious than detecting two orders of magnitude fewer errors than expected.
And frankly, on such a topic I'd believe what dkanter says.
 
Any of the hobbyists that won't spend thousands, any companies that don't want to spend multiples of that but still have large amounts of computing power, companies in China, etc.

Hobbyists, sure; Chinese companies, maybe; but apart from that I can see few organisations willing to take that risk. Deliver faulty results for one reason or another, and if the other side finds out that you have been running non-certified systems, you're screwed.
 
Hobbyists, sure; Chinese companies, maybe; but apart from that I can see few organisations willing to take that risk. Deliver faulty results for one reason or another, and if the other side finds out that you have been running non-certified systems, you're screwed.

The risk doesn't matter if it saves them millions in hardware costs. They'll just make sure the hacks work properly, and there are many applications where a small error may not mean too much. For instance, you might be raytracing microcell coverage in cellphone network planning. If you're a bit off, it doesn't matter, as long as you can produce coverage maps to show you're (apparently) sticking to the contracted coverage requirement.

When I worked for software houses, I personally knew of several big, official Chinese companies that as a matter of course hacked software lockouts so they could use more licences than they had paid for. I have no reason to think they would do otherwise to save themselves millions in hardware costs.

These are the people Nvidia want to make sure buy a $5000 Fermi rather than the $500 variant. Nvidia don't care about the hobbyists, but about the companies that could replace millions of dollars' worth of professional Fermis with a few tens of thousands of dollars' worth of gaming cards. These are the companies that have programmers by the bucketload, but not a lot of cash, so it's economically very attractive to them to hack a software lockout.

As I said before, Nvidia would be foolish not to design hardware limitations into the product, rather than rely on crackable software lockouts.
 
The L2 sounds like it has core-local partitions with similar bandwidth
The L2 partitions are aligned to the ROP partitions/memory controllers. So there are six 128kB partitions on GF100 (a version with a 320-bit memory interface will have 640kB of L2). There is quite a powerful crossbar between all the 16 SMs and the 6 L2 partitions.
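So the total L2 simply tracks the number of 64-bit memory controllers. A quick sketch of that scaling (the 256-bit case is my own extrapolation, not a confirmed SKU):

```c
/* L2 size as a function of memory bus width, assuming 128kB per 64-bit controller. */
#include <stdio.h>

int main(void)
{
    const int kb_per_partition = 128;
    int bus_widths[] = { 384, 320, 256 };

    for (int i = 0; i < 3; i++) {
        int partitions = bus_widths[i] / 64;  /* one L2 partition per 64-bit controller */
        printf("%d-bit bus -> %d partitions -> %d kB L2\n",
               bus_widths[i], partitions, partitions * kb_per_partition);
    }
    return 0;
}
```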
 