NVIDIA Fermi: Architecture discussion

Hmmm...

First, I don't think they use a "multiplier" between the ROP domain and the shader domain; they use the same frequency generator, but that's all they share, and each one has its own multiplier. So there's no required relationship between those frequencies. Given that Tesla usage is far from being blending/texturing/rasterizing intensive, it's more than likely they'll lower the ROP domain for this product family.
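Just to picture it, a trivial sketch with one shared reference clock and made-up multipliers (not real Fermi numbers):

Code:
# One reference oscillator, independent per-domain multipliers.
# Multiplier values are invented purely for illustration.
REF_CLK_MHZ = 27.0

domains = {
    "ROP/core domain": 22,   # 27 * 22 ~= 594 MHz
    "shader domain":   48,   # 27 * 48 ~= 1296 MHz
}

for name, mult in domains.items():
    print(f"{name}: {REF_CLK_MHZ * mult:.0f} MHz")

# Because each domain has its own multiplier, one can be lowered
# (e.g. the ROP domain on a Tesla SKU) without touching the other.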

That is correct, I never stated anything to the contrary. But they will also need the ROPs at certain frequencies for certain operations.

Second, RAM doesn't consume that much power. Say 2 watts for a 1Gb part, so that amounts to an astronomical 24 watts of excess power consumption.

Oh, I was under the impression it was higher than that. The heat off RAM chips tends to be quite high, directly proportional to the amount of energy they need.

Now, given that Tesla is already given a max board power of 225 watts, it leaves us with a 200 watt absolute minimum for the exact same core configuration; add 10% for a fully featured core and another 10% for a frequency boost, and you end up with a "GTX380" consuming no less than 250 watts max, and that's quite optimistic.
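For reference, a back-of-envelope restatement of that estimate; every input is just one of the assumptions above (12 chips at ~2 W, 225 W Tesla board power, the two 10% adders), nothing measured:

Code:
# Back-of-envelope restatement of the estimate above. All inputs are the
# assumptions from this post, not measured figures.
TESLA_BOARD_W = 225          # quoted max board power for the Tesla card
RAM_W         = 12 * 2       # 12 GDDR5 chips at ~2 W each (assumed)

core_w          = TESLA_BOARD_W - RAM_W   # ~200 W left for the core and the rest
geforce_core_w  = core_w * 1.10 * 1.10    # +10% fully enabled core, +10% clocks
geforce_board_w = geforce_core_w + RAM_W  # add the memory back in

print(f"core budget today : ~{core_w:.0f} W")
print(f"'GTX380' estimate : ~{geforce_board_w:.0f} W board power")
# -> lands around 265 W, comfortably above the 250 W floor quoted above.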

Can't say that at this point.

Something I read there was clearly laughable, btw: if Tesla doesn't need full math units, why the hell would they design such an absurdly big chip? They could simply design a 256SP core with the exact same functionality, a 256-bit bus and a slightly smaller L2. Who cares, since it's the features that give it all the appeal it needs?

Where did you read that?

Remember, GeForce products are not going to be available for another 2 to 3 months, basically 6 months after the Cypress launch, and add to that AMD's clear statement that they are going to refresh their own GPUs on a yearly basis, so it had better be almost 50% faster than a Cypress XT, which doesn't seem to be fully usable at the moment.

I don't know that either, unless you are assuming certain rumors.

And as a conclusion to this, where are the famous "Fermi derivatives"? Are they trying to sell us GT200 derivatives as such? (GeForce G310, anyone?)

We weren't really discussing those, were we?
 
Aside from the previous reply, I'm still wondering how they achieve ECC memory access on Tesla with any plausible controller width.

If we consider some sort of data interleaving, we always end up with a "9", which doesn't work, be it for the bus width or for the prefetch burst length. Bandwidth is going to be mediocre when using this functionality if it requires a memory access for just one word, be it 16-, 32- or 64-bit: that would not be a "1/8th" bandwidth penalty, but a 50% penalty.

I didn't understand why DP performance was so poor during the particle demonstration at GTC either: with a little more SP compute power than a GT200-based Tesla, they were only able to deliver half the throughput, which points to a severe bottleneck.


And Fermi with ECC incurs around a 10% increase in bandwidth usage, internal (caches) and external (VRAM), on some operations; it could be more or less depending on the operation.
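A toy model of where those numbers could come from, assuming the ECC bits live in ordinary GDDR5 and have to be fetched as extra bursts (the actual scheme isn't publicly documented, so this is only a guess):

Code:
# Toy model: every 8 data bursts need 1 ECC burst, and even a lone
# single-word access has to pull in a whole ECC burst.
def ecc_overhead(data_bursts):
    ecc_bursts = -(-data_bursts // 8)     # ceil(data_bursts / 8)
    return ecc_bursts / (data_bursts + ecc_bursts)

for n in (1, 2, 8, 64, 1024):
    print(f"{n:5d} data burst(s): {ecc_overhead(n):5.1%} lost to ECC")
# 1 burst   -> 50.0%   (the single-word worst case above)
# 8+ bursts -> ~11.1%, i.e. 1/9, in the same ballpark as "around 10%"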
 
Aside from the previous reply, I'm still wondering how they achieve ECC memory access on Tesla with any plausible controller width.

If we consider some sort of data interleaving, we always end up with a "9", which doesn't work, be it for the bus width or for the prefetch burst length. Bandwidth is going to be mediocre when using this functionality if it requires a memory access for just one word, be it 16-, 32- or 64-bit: that would not be a "1/8th" bandwidth penalty, but a 50% penalty.

I didn't understand why DP performance was so poor during the particle demonstration at GTC either: with a little more SP compute power than a GT200-based Tesla, they were only able to deliver half the throughput, which points to a severe bottleneck.

Fermi uses 8+1 bits for ECC; you can see it in the available RAM numbers when ECC is turned on. There are no special chips, other than the larger 64 x 32 parts. If they had special chips, it wouldn't do them much good without widening the bus by 9/8ths. :)
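A quick sketch of both points, assuming a 384-bit-class bus and a hypothetical 3 GB board; the exact figures are not confirmed, this is just the 8+1 arithmetic:

Code:
# Why CPU-style side-band ECC doesn't fit, and what in-band ECC costs.
# Bus width and board capacity are assumptions for illustration.
DATA_BUS_BITS = 384                      # assumed Fermi-class memory bus
print(f"side-band ECC bus: {DATA_BUS_BITS * 9 // 8}-bit")   # 432-bit, i.e. 9/8ths wider

installed_gb = 3.0                       # hypothetical board capacity
print(f"in-band ECC: ~{installed_gb * 8 / 9:.2f} GB usable of {installed_gb} GB")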

ECC probably adds a bit of latency, likely burns a bit of power as well, and might cost them clock speed on top of it all. The 3.6-4.0GHz GDDR5 seems way low, so I am guessing they have a bottleneck in other areas, like whatever they use to compute the ECC check.

The problem Fermi has is that GT200b had GDDR3 at ~1200MHz (2400 effective) for 240 shaders. It doesn't take much math to see that's roughly 10MHz of effective memory clock per shader. Fermi has (best case) GDDR5 at 1GHz (4GHz effective) for 448 shaders, or about 8.9MHz per shader. With 512 shaders, it drops to ~7.8MHz per shader.

I know those numbers are pretty meaningless technically, but they do serve to illustrate a bandwidth problem for the chip. I was personally expecting a much faster memory clock, 4.8GHz effective or so.
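For what it's worth, the same crude metric worked out in a few lines:

Code:
# "Effective memory clock per shader", the same rough metric as above.
# It ignores bus width and is only meant to show the trend.
configs = [
    ("GT200b: 2400 MHz eff. GDDR3, 240 SP", 2400, 240),
    ("Fermi : 4000 MHz eff. GDDR5, 448 SP", 4000, 448),
    ("Fermi : 4000 MHz eff. GDDR5, 512 SP", 4000, 512),
]
for name, eff_mhz, shaders in configs:
    print(f"{name} -> {eff_mhz / shaders:.1f} MHz per shader")
# ~10.0, ~8.9 and ~7.8: a steady slide downward, and note GT200b also had
# the wider bus (512-bit vs 384-bit), which this metric doesn't even count.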

Some of this can be made up for by caches, but things like FP work tend to be much more bandwidth sensitive than raw compute sensitive, especially with large data sets. Caches will help, but by how much?

-Charlie
 
The problem Fermi has is that GT200b had GDDR3 at ~1200MHz (2400 effective) for 240 shaders. It doesn't take much math to see that's roughly 10MHz of effective memory clock per shader. Fermi has (best case) GDDR5 at 1GHz (4GHz effective) for 448 shaders, or about 8.9MHz per shader. With 512 shaders, it drops to ~7.8MHz per shader.

I take it you're absolutely devastated by the bandwidth/flop ratio in Cypress? :smile: But is there a reason why you're comparing Tesla to GeForce? The GT200-based Tesla parts had 800MHz GDDR3.
 
You mean you still pay for the leakage on the disabled clusters? WTF? That makes no sense to me. :oops:

Unless you have either on-die power gates or partitioned power grids with independent off-chip regulation, all that disabling a section of the die does is remove the active switching power of the clocks and transistors. It has no effect on the leakage (aka static) power being burned in those regions.

The only production design I'm aware of with disclosed power gating is the Core i series, aka Nehalem. It enables effectively zero power when a core is gated off. This is a big win, because static power in modern processes tends to be fairly high.
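A crude way to put numbers on that distinction: two disabled clusters still leak unless they can be power gated. The per-cluster watts below are invented purely for illustration:

Code:
# Crude power model: disabling a cluster without power gating removes the
# switching power only; power gating removes the leakage as well.
def cluster_power_w(active, power_gated=False, dynamic_w=12.0, static_w=4.0):
    if power_gated:
        return 0.0          # supply cut off: no switching, no leakage
    if not active:
        return static_w     # fused-off / clock-gated: leakage still paid
    return dynamic_w + static_w

clusters = 16               # e.g. 16 SMs, the last 2 disabled
print(sum(cluster_power_w(i < 14) for i in range(clusters)))                         # 232.0 W
print(sum(cluster_power_w(i < 14, power_gated=(i >= 14)) for i in range(clusters)))  # 224.0 W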
 
Full-on power gating for Nehalem involves sticking big transistors right on the power feeds to the cores.
I'm not a materials scientist, but getting that right is apparently a non-trivial task, and something Intel is quite proud of accomplishing.

Considering that only one company has shipped a production part with it, it is likely non-trivial.

I wonder if a less performant option would be possible for a disabled core or cluster, where there is no concern about turning the core on and off with a fast response time as there is in Nehalem's power scheme.

The issue is simple electronics: the power transistors have to be able to supply enough current to the devices being powered. The only extra issue caused by fast on/off would be transients, and those realistically have to be handled to some extent anyway.
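To get a feel for the current such a gate has to carry (illustrative numbers only):

Code:
# Rough feel for the current a power gate must pass; numbers are invented.
block_power_w = 100.0   # assume a ~100 W cluster of logic behind the gate
vcore_v       = 1.0     # assume a ~1.0 V core supply
print(f"~{block_power_w / vcore_v:.0f} A through the gating transistors")  # ~100 A
# ...which is why sizing those transistors, their on-resistance and the
# switching transients are non-trivial to get right.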
 
And as a conclusion to this, where are the famous "Fermi derivatives"? Are they trying to sell us GT200 derivatives as such? (GeForce G310, anyone?)
We weren't really discussing those, were we?
Why not? The topic is called "NVIDIA Fermi: ..." and some people were eager to point out that Fermi is not a chip.
So we are free to discuss any future chips with the Fermi architecture, no? Or will some people object if it's called an architecture, or if it's called a chip? :LOL:

Now, when can we expect some mid- and low-end Fermi-architecture chips?
Christmas 2010?
Many people don't give a **** about any card which costs >> $500
 
Why not? The topic is called "NVIDIA Fermi: ..." and some people were eager to point out that Fermi is not a chip.
So we are free to discuss any future chips with the Fermi architecture, no? Or will some people object if it's called an architecture, or if it's called a chip? :LOL:

Now, when can we expect some mid- and low-end Fermi-architecture chips?
Christmas 2010?
Many people don't give a **** about any card which costs >> $500


Is there any info on those yet? I do agree mid-range cards for DX11 are very important, but low-end cards don't matter.
 
Is there any info on those yet? I do agree mid-range cards for DX11 are very important, but low-end cards don't matter.
I disagree; even now a puny 8200 can be as fast as a Core Duo on certain tasks (like cracking passwords :D).
A low-end sub-$100 card on the Fermi arch which not only plays casual games but also accelerates internet and antivirus products can sell a lot more than a supercomputer. Common people don't care about supercomputers, but tell them that they'll be better defended against viruses... I don't think NV needs a "halo", they need money and wider GPGPU penetration.
 
I disagree; even now a puny 8200 can be as fast as a Core Duo on certain tasks (like cracking passwords :D).
A low-end sub-$100 card on the Fermi arch which not only plays casual games but also accelerates internet and antivirus products can sell a lot more than a supercomputer. Common people don't care about supercomputers, but tell them that they'll be better defended against viruses... I don't think NV needs a "halo", they need money and wider GPGPU penetration.

They've been doing that for the past two years. CUDA is even taught at universities. The "killer apps" are slowly but surely appearing and GPGPU computing is gaining a lot of strength. Most people already know that GPUs can do a lot more than just graphics.

The HPC market is a clear target, now more than ever, because of the high profit margins it provides.
 
They've been doing that for the past two years. CUDA is even taught at universities. The "killer apps" are slowly but surely appearing and GPGPU computing is gaining a lot of strength. Most people already know that GPUs can do a lot more than just graphics.

The HPC market is a clear target, now more than ever, because of the high profit margins it provides.

Hate to break it to you, but most people don't know what a GPU is.
 
I disagree; even now a puny 8200 can be as fast as a Core Duo on certain tasks (like cracking passwords :D).
A low-end sub-$100 card on the Fermi arch which not only plays casual games but also accelerates internet and antivirus products can sell a lot more than a supercomputer. Common people don't care about supercomputers, but tell them that they'll be better defended against viruses... I don't think NV needs a "halo", they need money and wider GPGPU penetration.


Well, is that software available right now? ;) But yeah, that's a viable option once software takes advantage of the GPU.
 
Is there any info on those yet? I do agree mid-range cards for DX11 are very important, but low-end cards don't matter.
That's probably why the GT21x parts were designed.

In fact, value/entry-level GPUs have almost a 1/3rd market share, quite similar to mainstream and high end.

AMD already has mainstream and high-end boards on sale and is expected to launch its entry/value GPUs next month, so even if NVIDIA releases a competitive high end in 2-3 months, if they don't release derivatives within 6 months their DX11 market penetration will be negligible.

There are multiple views of the "Fermi problem", and that's one of them.
 
That's probably why the GT21x parts were designed.

In fact, value/entry-level GPUs have almost a 1/3rd market share, quite similar to mainstream and high end.

AMD already has mainstream and high-end boards on sale and is expected to launch its entry/value GPUs next month, so even if NVIDIA releases a competitive high end in 2-3 months, if they don't release derivatives within 6 months their DX11 market penetration will be negligible.

There are multiple views of the "Fermi problem", and that's one of them.


Low-end cards don't matter at all; as long as they are Windows 7 compliant, that's all that matters. DX11 won't be very playable on low-end cards, if at all.
 
Low-end cards don't matter at all; as long as they are Windows 7 compliant, that's all that matters. DX11 won't be very playable on low-end cards, if at all.
That's forgetting the main benefit of CS 5.

As NV themselves like to say, a GPU is no longer "just a gaming chip", and even for games CS could have a noticeable effect on playability if used right (see HDAO in STALKER CoP: a 40% framerate increase vs. the PS path, for a lower final cost than medium SSAO).

Yeah, tessellation seems to be the thing to throw away, but that's too much of a shortcut.
 
As NV themselves like to say, a GPU is no longer "just a gaming chip", and even for games CS could have a noticeable effect on playability if used right (see HDAO in STALKER CoP: a 40% framerate increase vs. the PS path, for a lower final cost than medium SSAO).

That doesn't really matter, since those low-end cards will barely play games at high settings. Yeah, you could still play at medium settings, but then what's the point of the "extra" effects like SSAO/HDAO?


Certainly a crazy idea, but would it be possible to offload things like post-processing and tessellation to a dedicated low-end/mainstream card, just like PhysX? ;)
 