NVIDIA Fermi: Architecture discussion

Actually, ECC provides quite a wide range of both correction and detection. Most high-end servers, for instance, are 2-bit correct, 3-bit+ detect.
As far as I know, ECC (with 72-bit DIMMs) uses a (128,120) extended Hamming code, which can detect (but not correct) 2 bits of error.
 
As far as I know, ECC (with 72-bit DIMMs) uses a (128,120) extended Hamming code, which can detect (but not correct) 2 bits of error.

Agreed. Edit: Never mind, I will check this instead of posting information without verifying it first (it's been too long since I did coding :))
 
ECC DIMMs don't actually implement any error-correcting code themselves; they just provide 1/8 more chips where the memory controller can store extra information, for example the check bits of an error-correcting code.

Opterons used to implement a BCH code with 4-bit symbols; Istanbul and Nehalem implement a BCH code with 8-bit symbols.
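
To make the division of labour concrete: the DIMM only supplies the extra storage (8 check bits alongside every 64 data bits on a 72-bit module), while the controller decides what code to put in it. Here's a minimal Python sketch of a plain SECDED extended Hamming code of the kind a simple controller could use. It's only a toy illustration of the single-correct/double-detect behaviour discussed above, not any vendor's actual (and, per the later posts, often proprietary) scheme.

```python
# Toy SECDED (single-error-correct, double-error-detect) extended Hamming code.
# Purely illustrative of the general idea; real memory controllers use their own codes.

def hamming_parity_bits(n_data):
    """Number of Hamming parity bits r such that 2**r >= n_data + r + 1."""
    r = 0
    while (1 << r) < n_data + r + 1:
        r += 1
    return r

def encode(data_bits):
    """data_bits: list of 0/1. Parity bits live at power-of-two positions,
    plus one overall parity bit appended at the end."""
    r = hamming_parity_bits(len(data_bits))
    n = len(data_bits) + r
    code = [0] * (n + 1)              # positions 1..n, index 0 unused
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):           # not a power of two -> data position
            code[pos] = next(it)
    for p in range(r):                # fill Hamming parity bits
        mask = 1 << p
        code[mask] = sum(code[pos] for pos in range(1, n + 1) if pos & mask) % 2
    overall = sum(code[1:]) % 2       # extended (overall) parity bit
    return code[1:] + [overall]

def decode(codeword):
    """Returns (status, codeword); status is 'ok', 'corrected' or 'double-error-detected'."""
    overall_rx = codeword[-1]
    code = [0] + codeword[:-1]        # re-add dummy index 0
    n = len(code) - 1
    syndrome = 0
    for p in range(n.bit_length()):
        mask = 1 << p
        if sum(code[pos] for pos in range(1, n + 1) if pos & mask) % 2:
            syndrome |= mask
    overall_ok = (sum(code[1:]) % 2) == overall_rx
    if syndrome == 0 and overall_ok:
        return 'ok', codeword
    if not overall_ok:                # odd number of flips: assume one and fix it
        fixed = codeword[:]
        if syndrome:                  # error in a Hamming-covered position
            fixed[syndrome - 1] ^= 1
        else:                         # error was in the overall parity bit itself
            fixed[-1] ^= 1
        return 'corrected', fixed
    return 'double-error-detected', codeword   # even number of flips, nonzero syndrome

# A 64-bit word needs 7 Hamming bits + 1 overall bit = 8 check bits,
# which is exactly what the ninth chip on a 72-bit ECC DIMM provides.
word = [1, 0, 1, 1] * 16              # 64 data bits
cw = encode(word)
cw[10] ^= 1                           # flip one bit
print(decode(cw)[0])                  # -> 'corrected'
cw[20] ^= 1                           # flip a second bit
print(decode(cw)[0])                  # -> 'double-error-detected'
```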
 
I don't see why DECTED ECC wouldn't be possible with a different Hamming code or BCH code - aaronspink did say this was for high-end systems (== mainframes?). I don't work with anything considered high-end, so I have no idea how widespread the usage of DECTED is. Don't these kinds of systems also use chipkill (to tolerate the failure of any single memory chip)? Is that in any way related to DECTED?
 
IMHO (layman perspective):
The 6000 series, Bulldozer, or even Larrabee is not the biggest threat to Fermi regarding GPGPU; Fusion is.

Fusion is said to have 480 ALUs and ~1 TFLOP of SP performance, so DP throughput should be around 200 GFLOPS. On top of that, Fusion has access to the complete main RAM and four K10.5 cores.
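
For what it's worth, here's the back-of-envelope math behind those numbers. The 480 ALUs and ~1 TFLOP SP figures are from the rumor; the 2 FLOPs/ALU/clock (multiply-add) and the 1/5 SP:DP ratio of AMD's current VLIW design are my assumptions, and the clock is just whatever makes the numbers fit.

```python
# Back-of-envelope check of the rumored Fusion figures (assumptions noted above).
alus            = 480
flops_per_clock = 2          # one multiply-add per ALU per cycle (assumed)
sp_target_tflop = 1.0        # rumored single-precision throughput

clock_ghz = sp_target_tflop * 1e3 / (alus * flops_per_clock)    # implied clock in GHz
dp_gflops = sp_target_tflop * 1e3 / 5                           # assumed DP at 1/5 the SP rate

print(f"implied clock : {clock_ghz:.2f} GHz")     # ~1.04 GHz
print(f"estimated DP  : {dp_gflops:.0f} GFLOPS")  # ~200 GFLOPS
```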

If AMD made no big mistakes and included support for ECC and more than one HT3 interface, it should be simple to plug 2-4 Fusion chips onto a server board and, voilà, HPC performance en masse. Bandwidth could be a problem, though.

The big problem there is that HPC apps tend to be written for CPUs or for GPUs, and not for some kind of load-balanced simultaneous combination. So for a GPU-centric app, it's likely that any significant die area used for a serial CPU would be wasted, and conversely for a CPU-centric app.

Even for an application which has been efficiently ported to both domains (something like Folding@home, perhaps), it's likely much more efficient in one domain than the other. Fusion won't be great for Folding@home, since the workload runs so much faster on the GPU that it simply prefers a full, undiluted GPU.

What kind of apps will Fusion boost over a lone CPU or a lone GPU? Apps that need to rapidly send data from serial code to parallel code and back with little overhead. This isn't an obvious niche, though perhaps sparse matrix multiplies might be an application. One can argue that Cell is built around exactly this tight connection between serial and parallel processors, and we've all seen how difficult it is to use to its full theoretical potential.
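
As a purely illustrative sketch of that niche (every timing number below is invented, just to show the shape of the trade-off): when serial and parallel phases alternate every iteration, the per-iteration hand-off cost is what a tightly coupled part like Fusion would attack.

```python
# Toy model of a workload that ping-pongs between serial and parallel code each iteration.
# All numbers are made up for illustration only.
iterations        = 1000
serial_ms         = 0.2     # CPU phase per iteration
parallel_ms       = 0.5     # GPU phase per iteration
handoff_ms_pcie   = 0.4     # assumed round-trip hand-off over PCIe per iteration
handoff_ms_fused  = 0.02    # assumed hand-off with shared memory (Fusion-style)

discrete = iterations * (serial_ms + parallel_ms + handoff_ms_pcie)
fused    = iterations * (serial_ms + parallel_ms + handoff_ms_fused)
print(f"discrete GPU: {discrete:.0f} ms, fused: {fused:.0f} ms")
# The tighter the coupling (more iterations, smaller phases), the more the hand-off dominates.
```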

Fermi, with its more generalized kernel handling and parallel kernel scheduling, is likely a more practical approach. The use of parallel kernels (and faster synchronization via on-chip atomics) will help with the scheduling complexities that current GPUs depend on CPUs to handle.

So the lack of on-chip CPU cores is likely not a big deal. The lack of direct access to system RAM is a bigger problem, though in many (even most) apps it can be minimized as long as the GPU has enough RAM of its own. The other bottleneck is communication between multiple GPUs, which has a similar and related bandwidth issue. PCIe is very fast, but it's not fast enough for these modern computing approaches.
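
To put rough numbers on that last point, here's a sketch assuming PCIe 2.0 x16 and a ~200 GB/s class of on-board GDDR5 bandwidth for comparison (neither figure is an official Fermi spec):

```python
# Rough sense of the gap between bus and local-memory bandwidth.
lanes         = 16
gts_per_lane  = 5.0                      # PCIe 2.0 transfer rate per lane
payload_ratio = 8 / 10                   # 8b/10b line-coding overhead
pcie_gbs      = lanes * gts_per_lane * payload_ratio / 8   # GB/s per direction

local_mem_gbs = 200.0                    # assumed on-board GDDR5 bandwidth for comparison

print(f"PCIe 2.0 x16 : {pcie_gbs:.0f} GB/s per direction")   # ~8 GB/s
print(f"on-board RAM : {local_mem_gbs:.0f} GB/s")
print(f"ratio        : ~{local_mem_gbs / pcie_gbs:.0f}x")    # ~25x
```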
 
IMHO (layman perspective):

The 6000 series, Bulldozer, or even Larrabee is not the biggest threat to Fermi regarding GPGPU; Fusion is.

Fusion is said to have 480 ALUs and ~1 TFLOP of SP performance, so DP throughput should be around 200 GFLOPS. On top of that, Fusion has access to the complete main RAM and four K10.5 cores.

If AMD made no big mistakes and included support for ECC and more than one HT3 interface, it should be simple to plug 2-4 Fusion chips onto a server board and, voilà, HPC performance en masse. Bandwidth could be a problem, though.

Fusion comes to market (best case) in H1 2011. By then you would have AMD GPUs with ~2400 ALUs on the market. The Fusion chip you are looking at has a LOT more die area and perimeter given to the CPU side of things and is meant to go into low-end desktops, not HPC racks.

The real KO punch (*if* AMD brings it to market in time) would be a 6870 with a Bobcat core on die, going into CPU sockets. This would give it far higher inter-socket bandwidth and much lower network latency (i.e. sending data in GPU memory across the network), not to mention that the PCIe bottleneck would be almost done away with.
 
Isn't it more in terms of total board cost/(performance+features) vs. the same from the other guys? You would ask whether the wider bus and the added board complexity are worth it relative to the cost and performance gained vs. what the other guys can do with 256-bit.

Yeah, you put that more accurately than I did.
 
And Ion, for all intents and purposes, is a dying product as well, which leaves them with Tegra. The fundamental problem with chipsets for any non-CPU vendor is that, simply due to the natural progress of silicon, chipsets are a dying commodity. Though I don't think we'll ever see notebook/desktop/server CPUs integrate the southbridge functionality, simply because there is little to no engineering or financial incentive to do so; in fact, there are both engineering and financial incentives not to integrate.

Actually, I think that with the emergence of the netbook market and specialized OSes like Chrome OS, the merger of the southbridge with the CPU is only a matter of time.

Today, southbridges have I/O interfaces for HDD, USB, audio and Ethernet. In a netbook you don't need a dozen USB ports; 2-3 are enough. HDDs will likely be replaced by flash, and Ethernet's future in netbooks is doubtful. Audio doesn't take much space anyway.

Today, the southbridge is a pad-limited chip made on an older process so that the CPU die size (on an advanced node) remains under control, and this arrangement makes sense. Five years down the line, I think the southbridge will merge with the CPU or vanish altogether.

After all, in the mobile/embedded market the trend is toward a single chip.
 
As far as I know, ECC (with 72-bit DIMMs) uses a (128,120) extended Hamming code, which can detect (but not correct) 2 bits of error.

There are a large number of different ECC schemes in commercial use currently, many of which are proprietary. Suffice it to say that most commercial vendors aren't using the 8-over-64b or even the 9-over-128b schemes, but more exotic ones.
 
Fusion comes to market (best case) in H1 2011.

Are you sure?

From 2 Dec. 2009:
Link: http://translate.google.de/translat.../11050183.shtml&sl=zh-CN&tl=en&hl=de&ie=UTF-8

AMD Medek: the first APU will be released early next year

The "release" shouldn't be in Q1/2010 but the first parts. Why do you think that Fusion will need more than 1 year from first parts till serial production?

IMHO, I wouldn't be surprised to see the first Fusion CPUs in retail as early as Q4/2010 (Nov.-Dec.).

By the way, wombat seems to have seen the first Bulldozer too (at 2 GHz).
 
You seem to be assuming that your problem is compute-bound, and from what I have seen, that is not the case for Google and many other similar installations.

-Charlie

Google's cluster handles both I/O-bound and compute-bound tasks. They run lots of very CPU-intensive internal batch jobs.
 
Somewhere in between the 5870 and the 5970, according to the latest news, closer to the former than the latter (unless I got it completely wrong and it's 10% faster than a 5970).

Or you could take the simple route: take two times GT200b (GTX 295) and use that as a reference. If it's close to a GTX 295, they have done a good job.

But that's not even two times GT200b. For it not to be a failure, given how late it is, the high-end Fermi-based GeForce needs to be, at the very least, as fast as two GTX 285s in SLI.
 
But that's not even two times GT200b. For it not to be a failure, given how late it is, the high-end Fermi-based GeForce needs to be, at the very least, as fast as two GTX 285s in SLI.

Late or not, you'd expect a healthy performance increase over its predecessors. AFAIK, IHVs typically target roughly up to twice the performance in that regard. Now of course there will be cases where, due to limitations of some games or the underlying platform, differences tend to be smaller (as we've seen with RV870 vs. RV770), but also peak cases where the difference is rather big.

I realize that most are interested in establishing an average, but at this point even if, say, five hand-selected benchmark results were available, they would only present a rather subjective, small slice of the entire story. What we need is a wide selection of benchmarks/tests from a variety of independent sources.

If the result then lands, on average, slightly below say a Hemlock (with the latter trouncing it in several peak cases) and has an equally attractive MSRP, they're quite well positioned. Only a small notch above the 5870 would be in the yawn realm, and a healthy notch above the 5970 would be any fanboy's unrealistic wet dream.

But it most likely won't be, since it hasn't got enough bandwidth.

What's more important for a GPU: how much theoretical bandwidth it has, or how well it actually uses it? That's the nasty byproduct of dealing only with sterile theoretical numbers. If all factors are unchanged, it's easy to multiply and estimate; the real question is IF and where any changes have been made. I for one wouldn't expect the same old tired G80 memory controller in a new architecture.
 
But it most likely won't be, since it hasn't got enough bandwidth.

How can you say that when there's nothing to prove it yet? We have no idea if bandwidth will be a limiting factor. This is a new architecture anyway, so what may have been the bottleneck in previous generations may not be a bottleneck now.

If we take Rys's numbers, we are looking at over 200 GB/s of memory bandwidth, but as was discussed (speculated, actually) before, Rys may be a bit conservative on the memory clock frequency. If we take 4500 MHz (instead of Rys's 4200) we are looking at 216 GB/s, and if we take 4800 MHz (same as Cypress), we are looking at 230 GB/s.
It's still definitely not double what GTX 285 SLI has (theoretically speaking), but this single GPU won't be constrained by SLI limitations.
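
For reference, those figures fall out of a simple bandwidth formula if one assumes a 384-bit GDDR5 bus (the bus width is my inference from the numbers quoted, not something confirmed):

```python
# bandwidth = bus_width / 8 * effective data rate
bus_width_bits = 384          # assumed bus width implied by the figures above

def bandwidth_gbs(data_rate_mtps):
    """GB/s for a given effective GDDR5 data rate in MT/s (the 'MHz' in the post)."""
    return bus_width_bits / 8 * data_rate_mtps / 1000

for rate in (4200, 4500, 4800):
    print(f"{rate} MT/s -> {bandwidth_gbs(rate):.1f} GB/s")
# 4200 -> 201.6, 4500 -> 216.0, 4800 -> 230.4, matching the figures quoted above.
```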
 
Late or not, you'd expect a healthy performance increase over its predecessors. AFAIK, IHVs typically target roughly up to twice the performance in that regard. Now of course there will be cases where, due to limitations of some games or the underlying platform, differences tend to be smaller (as we've seen with RV870 vs. RV770), but also peak cases where the difference is rather big.

I realize that most are interested in establishing an average, but at this point even if, say, five hand-selected benchmark results were available, they would only present a rather subjective, small slice of the entire story. What we need is a wide selection of benchmarks/tests from a variety of independent sources.

If the result then lands, on average, slightly below say a Hemlock (with the latter trouncing it in several peak cases) and has an equally attractive MSRP, they're quite well positioned. Only a small notch above the 5870 would be in the yawn realm, and a healthy notch above the 5970 would be any fanboy's unrealistic wet dream.

I have doubts it will surpass Hemlock, except in cases where CrossFire scaling just sucks. But I also can't see it not coming very close to it. Let's say exactly like the HD 5870 is with the GTX 295: slower on average, but not by much, and winning in some cases, while consuming less power. If they price this GPU (under these performance speculations) at $550, the fact that they are late will be almost completely ignored, given the performance/$ they are offering.
 