NVIDIA Fermi: Architecture discussion

However, I do not know of a chip (cpu or gpu) that was 7-8 months late and didn't prove to be an epic fail. Maybe someone more knowledgeable can enlighten us.

Yeah, that's exactly how I feel about it too. By definition a delay means it isn't good enough to bring to market in its current state, so it has to catch up to "good enough" status first. Now you could say that it's due to yields/leakage/etc., but those things are usually indicative of performance levels too - i.e. you've got to lower clocks to improve yields and lower power consumption.
 
I guess we just have different experiences. Well, I'll readily admit that I'm not familiar with trends in the US, but of the 8 or 9 data centers I've visited in Europe and Asia over the last 2 years, pretty much all of them are on the sweet spot of 2 socket blades, down from 4 socket rack units, and I don't know anyone who is seriously using i7s and Phenoms in place of Xeons and Opterons.

How many of them were Google/Amazon/AmazonEC2/Yahoo/WhateverMSbrandsthingsthishour type installations?

-Charlie
 
That's the exact world I live in and I haven't seen much different of late. I call BS on 1-socket installs in DCs; it makes no sense. If you guys realised the cost per square metre of a datacenter you would quickly see that density is king. There are so many fronts on which it doesn't make sense.

It's inefficient in terms of power consumption, from both the increased inefficiencies of AC-to-DC conversion in 1P servers vs blades, and the redundant parts within each server.

It's very low density: I can get 576 Opteron cores in a 40 RU rack right now, vs 14.4 racks to do that with 1-socket servers.

It's very costly in terms of localised cooling. For example, I would need one APC half-rack AC unit for 3 blade enclosures in a 40 RU rack; I would need 3-4 of them for 14.4 racks, and much bigger water pumps.

Given that a 2-socket blade with 64 GB of RAM and 2 Istanbul CPUs is around 8-10K (AUD), single-socket servers don't make sense. Sure, there is hard disk to consider, but for most things in this space local disk doesn't have the performance anyway.


PS. I'm a network "engineer" in a DC ;)

edit: lol, I completely forgot about networking costs. With 1-socket servers you're pretty much forced into top-of-rack switching; with blades you can go with something like a Catalyst 6500 or Nexus 7000 and just run fibre back to a couple of central points. Assuming a redundant network design, it is significantly cheaper, as well as far more scalable, to centralise your network access layer.
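
To put rough scaling on that (a toy sketch with assumed device counts, not figures from any real build):

```python
# Toy model of how the access layer scales with rack count, under assumed
# (not real-world) figures: a redundant pair of top-of-rack switches per
# rack of 1RU servers, vs blade enclosures whose built-in switch modules
# uplink over fibre to a redundant pair of central chassis
# (Catalyst 6500 / Nexus 7000 class).

def access_switches_tor(racks: int, tor_per_rack: int = 2) -> int:
    """Standalone switches you buy and manage with top-of-rack."""
    return racks * tor_per_rack

def access_switches_central(racks: int, central_chassis: int = 2) -> int:
    """Central chassis stay constant; enclosure modules ship with the blades."""
    return central_chassis

for racks in (1, 5, 15):
    print(racks, access_switches_tor(racks), access_switches_central(racks))
# Top-of-rack grows linearly with rack count; the centralised design doesn't.
```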

You seem to be assuming that your problem is compute bound, and from what I have seen, that is not the case for Google and many other similar installations.

-Charlie
 
blades are just marketing. People were selling racks with shared PSUs long before blades ever became a catch phrase.

I would disagree with this point slightly; you seem to be saying that there is nothing to blades but shared PSU/NIC/IO etc. I can think of one thing that a blade does better than an equivalently dense set of rack servers: management.

If you have a full management scheme that is data-center wide and covers every bit of the systems, then there is little difference. If you don't, blades can be a bit easier to manage, and with the scale of some of these farms, a bit to a seriously high power adds up. :)

Then again, all the farms I know of don't manage as much as rip and replace anything that twitches wrong, it isn't worth the time or effort.

-Charlie
 
ECC RAM is often registered because it is primarily used in servers, where registered RAM is used to allow more DIMMs per channel.

That is correct. It could also be an FB-DIMM.

And this is not OT, because it clears up some misconceptions about Fermi's use of memory when ECC is enabled.
 
That's why there's a lot of speculation (despite Nvidia's denials) that they are going primarily for HPC with mainstream markets now a secondary thing. They can't compete with ATI in the mainstream space (unless they are willing to effectively subsidise every chip), so they are sidestepping that battle and trying to carve out their own market in the form of GPU-based HPC. The alternative is to follow AMD's "smaller is more" design philosophy, or come up with something spectacularly ahead of the times. It looks like current processes or their designers are not up to the latter.

If this was 20 years ago, or possibly even 10 years ago, this could be a reasonable thing to consider. But given Nvidia's size and cost structures, and given the HPC market, it is unreasonable for any executive to think they can treat HPC as the primary market and the broader GPU market as secondary. The simple issue is that, given the functionality requirements in the consumer space, it would be hard if not impossible to create enough product differentiation between the HPC SKU and the game SKU to get a significant portion of the HPC crowd to pay 5-10x higher prices. There is little in the support realm that can reasonably justify that cost differential.

Nvidia have pretty much done the same thing with their chipset business. There were lots of denials there, and now it all seems to be concentrated on Ion and Tegra, with the desktop chipset business effectively abandoned by Nvidia in the face of crippling competition from AMD and Intel. Nvidia can no longer charge a premium now that they no longer provide extra features and performance, and Intel and AMD have eaten that business.

And Ion, for all intents and purposes, is a dying product as well, which leaves them with Tegra. The fundamental problem with chipsets for any non-CPU vendor is that, simply due to the natural progress of silicon, chipsets are a dying commodity. Though I don't think we'll ever see notebook/desktop/server CPUs integrate the southbridge functionality, simply because there is little to no engineering or financial incentive to do so; in fact, there are both engineering and financial incentives not to integrate.

Nvidia has had to snake out into a different field, and it looks to me like they may be doing the same with Fermi and what seems to be a heavy focus on GPGPU/HPC markets. It carries extra cost for the mainstream market whilst not giving that market any benefit in return for the extra cost.

The problem is that the mainstream is called mainstream for a reason. History is littered with failed HPC companies that had products very, very similar to Nvidia's GPUs. In order to make the HPC market viable for Nvidia, they need to be competitive enough in the mainstream that it can basically pay for all the R&D. If they can't, then their HPC product dies too.
 
As far as I know all their Opteron models support ECC; it would be weird if they suddenly didn't.
The main difference of course being that Fermi's ECC is calculated on the GDDR chip, while Opterons use an ECC chip on the DIMM.

Unless they are using some pretty exotic DRAMs, their GDDR5 isn't calculating any ECC for the memory. Also Opterons do not use an "ECC" chip on the DIMM.
 
Yep, and that's IMHO one of the weirdest things. AFAIK ECC is normally done with 9-bit DRAM, with the ninth bit used for error correction. Fermi uses normal GDDR5 with an 8-bit "organisation" and access (multiples of 8). Therefore, with 1 bit reserved for error correction, Fermi can, as far as I understand it (remember: layman perspective), only access data in 7-bit increments. That seems strange and inefficient.

ECC hasn't been done with 9b DRAMs since the dark ages (dark ages being defined as the era of computing before modern DIMMs). Given a standard 64b-wide DIMM that uses either 16b, 8b, or 4b devices (though 4b is becoming an exceedingly rare and costly thing), it makes the most engineering and economic sense to add a ninth 8b DRAM to a 64b-wide DIMM to get 72b of data transmission.

Fermi is most likely doing 2 separate memory accesses when it is using ECC: first an access to the data, and then a secondary access to a reserved ECC storage space. Given reasonable memory locality, the cost would be roughly 1 additional access per every ~30 reads, and given an ECC write cache, 1 additional access per every ~30 writes.
 
No. You couldn't even do error correction with 1 extra bit, only error detection for 1 bit errors. ECC provides 1 bit error correction and 2 bit error detection.

Actually ECC provides quite a wide range of both correction and detection. Most high end servers for instance are 2b correct, 3b+ detect.
 
As for Fermi ... my assumption is that they simply interleave the ECC codes with the data. IMO if it wasn't for legacy and inertia this should be the way to implement it in PCs as well.

Except that in cases of bad data locality it has increased bandwidth costs. In addition, it would require DRAM chips capable of higher prefetch as well. ECC is done the way it is done in current systems primarily because of cost. The DRAM chips are exactly the same, the boards they go on are a minor cost factor, and the performance difference effectively doesn't exist. Doing interleaved ECC/data would add increased costs, reduced performance, or both.

For Fermi, I think it makes the most sense if they simply have a contiguous reserved section of the DRAM map for the ECC storage, which enables a variety of optimizations not reasonably doable with an interleaved mapping.
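
Something like this toy mapping is what I have in mind; the 1/9 carve-out and the one-ECC-byte-per-8-data-bytes ratio are purely illustrative assumptions, not anything Nvidia has confirmed:

```python
# Toy address mapping for out-of-band ECC held in a reserved region.
# Illustrative assumptions only: one ECC byte protects 8 data bytes, and
# the top 1/9 of device memory is carved out to hold the codes.

TOTAL_BYTES = 1536 * 1024 * 1024        # hypothetical 1536 MB card
ECC_REGION = TOTAL_BYTES // 9           # reserved for ECC codes
DATA_BYTES = TOTAL_BYTES - ECC_REGION   # what software actually sees
ECC_BASE = DATA_BYTES                   # codes live above the data

def ecc_address(data_addr: int) -> int:
    """Where the ECC byte covering data_addr would live."""
    assert 0 <= data_addr < DATA_BYTES
    return ECC_BASE + data_addr // 8

print(DATA_BYTES / TOTAL_BYTES)         # ~0.889: about 1/9 of capacity is lost
print(ecc_address(0), ecc_address(4096))
# One 64-byte ECC fetch covers 8 * 64 = 512 contiguous data bytes, which is
# why reasonable locality amortises the extra ECC accesses.
```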
 
That doesn't explain the lower capacity when ECC is enabled.

You need extra space to store that ECC data, so capacity is lower (the reserved area mentioned earlier). Just as an ECC DIMM uses 9 chips instead of the 8 on a normal DIMM, if you want to do the normal 2b-detect/1b-correct ECC you will have to reserve one byte for every 8 bytes, thus losing 1/9 of the capacity.
 
I think I'm pretty much right. It's called an EOS (ECC On SIMM) chip. You'll find that any A-brand (not Dell ;) ) x86/x64 server has EOS chips on its DIMMs.

Check this DIMM and notice the small IC in between the two rows of memory chips. That's the ECC module.

Um..........

NO and NO!

No ECC calculations are done on the DIMMs of any modern-day computer. Even in the dark ages it was an utterly small minority of ECC DIMMs.
 
Pretty sure that's the register/buffer chip ... wasn't ECC on SIMM basically a way to use ECC with motherboards which only understood parity?

They were used for motherboards/systems that had no memory error detection (basically all non-server systems). For a vanishingly short period of time, some insane people thought it was a reasonable retrofit to enable non-server systems to be used as servers. Luckily, those people are no longer employed, probably a consequence of spending 2x the cost on specialty RAM for a system that was barely any cheaper to begin with.
 
That doesn't explain the lower capacity when ECC is enabled.

Um, yeah, it does. If you have to store an additional, say, 8b per 64, 9b per 128, 10b per 256, or 11b per 512 of ECC data, then you have to store it somewhere. Unless they have additional data lanes and some exotic, unrevealed way to get extra data lines, they would have to store the ECC codes in the normal memory pool. This would require the hardware/drivers to map out sections of the DRAM for ECC code storage.

You don't see this on normal systems because they utilize 72b memory channels in which the additional 1/9th memory capacity is never actually advertised. For various reasons, Nvidia doesn't have this luxury.
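
Those per-width overheads are just what a plain Hamming-style SECDED code works out to. A quick sketch to check the arithmetic (assuming straight SECDED, nothing fancier like chipkill):

```python
# Minimum check bits for a Hamming SEC code over m data bits: the smallest
# r with 2**r >= m + r + 1, plus one extra parity bit for double-error
# detection (SECDED).

def secded_check_bits(m: int) -> int:
    r = 1
    while 2 ** r < m + r + 1:
        r += 1
    return r + 1

for m in (64, 128, 256, 512):
    print(m, secded_check_bits(m))
# 64 -> 8, 128 -> 9, 256 -> 10, 512 -> 11, matching the figures above.
```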
 
Yep, but I was following on from your comment - "it makes the most engineering and economic sense to add a ninth 8b DRAM to a 64b-wide DIMM to get 72b of data transmission. Fermi is most likely doing 2 separate memory accesses when it is using ECC: first an access to the data, and then a secondary access to a reserved ECC storage space."

I thought you were implying that this would be the setup in Fermi, which would be contrary to what we know. I see now that the two statements were unrelated.
 
How many of them were Google/Amazon/AmazonEC2/Yahoo/WhateverMSbrandsthingsthishour type installations?

-Charlie

Well, none actually, because like I said before to that other guy, and as you happen to reiterate in your very next post, all but EC2 (and even that one is arguable) are not examples of HPC.

Do you think that Tesla's primary purpose is to compete for your webserver business, or your database?

You seem to be assuming that your problem is compute bound, and from what I have seen, that is not the case for Google and many other similar installations.

-Charlie
 
Yes, that's what expanding into new markets means. What does that have to do with leaving existing ones? Have you heard anything from Nvidia that indicates they think HPC can replace graphics revenues by Q1 2010?

I don't think they believe HPC will replace graphics, more that it will complement it, similar to how Professional provides the profit while graphics keeps the volumes high - i.e. in economics, an example of cross-subsidisation. The problem with graphics currently is that profitability appears to be progressively declining; rather than sit back, they are trying to find new markets to compensate for this.

This Chinese site a few months ago had some figures for the total HPC market, I think from Nvidia's mid-year presentation to analysts on Tegra and Tesla (about halfway down, under the S1070 picture):
http://news.mydrivers.com/1/145/145916_1.htm
Over the next 18 months, HPC TAM (US$):
Seismic: $300m
Supercomputing: $200m
University: $150m
National Defense: $250m
Finance: $230m
So larger than $1bn in total ($1.13bn by those figures). Now you just need to work out what percentage of that Nvidia can get ;)
 
Except that in cases of bad data locality it has increased bandwidth costs.

Random accesses of size 1-48 bytes get their ECC for free, bandwidth-wise, with interleaved ECC (burst 8, 64-bit bus). It's normal ECC which wastes bandwidth. There is only this very small region of 48-64 byte accesses (and their multiples) where it wastes any significant amount of bandwidth (and relatively little at that).

In addition, it would require DRAM chips capable of higher prefetch as well.

Why exactly? I see no problem using burst length 8 ... the memory controller simply caches bytes from an incomplete 8/1 pair for a bit to be able to handle sequential accesses; if they weren't needed in the end you can just drop them.

For Fermi, I think it makes the most sense if they simply have a contiguous reserved section of the DRAM map for the ECC storage, which enables a variety of optimizations not reasonably doable with an interleaved mapping.

Oh man, now THAT would fuck you up when you have bad data locality.
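
For concreteness, a toy picture of the interleaved layout I'm arguing for, assuming a 64-byte burst (64-bit bus, burst length 8) split as 56 data bytes plus 8 ECC bytes; the 8:1 ratio is just an illustrative assumption:

```python
# Toy layout for interleaved ECC, assuming a 64-bit bus at burst length 8:
# each 64-byte burst carries 56 data bytes plus 8 ECC bytes (an assumed
# 8:1 ratio, not a known Fermi parameter).
BURST_BYTES = 64
DATA_PER_BURST = 56
ECC_PER_BURST = BURST_BYTES - DATA_PER_BURST

def bursts_touched(offset: int, length: int) -> int:
    """How many bursts a length-byte access starting at data offset
    `offset` has to read; the ECC comes along in the same bursts."""
    first = offset // DATA_PER_BURST
    last = (offset + length - 1) // DATA_PER_BURST
    return last - first + 1

print(bursts_touched(0, 48))    # 1 burst, ECC included in the same transfer
print(bursts_touched(40, 48))   # 2 bursts when the access straddles a boundary
```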
 