NVIDIA GF100 & Friends speculation

Tegra 2 is around 50 mm^2. Ion2 (aka GT218) is around 60 mm^2. And you'd think they'd get a bit more yield out of them than Fermi...

I'm under the impression that Tegra2 is on 45LP; but then again, during silly season you can throw tractors and flippers into the same pot :rolleyes:

You do realize that GT200 is basically 2x G80 and G100 is basically 2x GT200, right?

Yes to the first and no to the second. I don't see 160 TMUs on GF100, and I don't see more than 324M Tris/clock in real time on GT200 either.
 
G80/GT200, not so much. G100 is unlikely to prove itself much different from G80/GT200, since it's still fundamentally the same architecture.
But then what exactly makes G80 so fundamentally different from G70? The fact that it's unified? The differences between GF100 and GT200 are way bigger than the differences between G80 and GT200.
 
PSU-Failure said:
- points 1 and 5 are part of point 6, and as point 6 is an SM5 prerequisite it's part of the differences between R700 and Evergreen too
There is more to IEEE-754-2008 than just FMAs and denormals, and SM5 does not require all of IEEE-754-2008. AFAIK, the lower end RV8xx chips do not implement FMAs.
- points 10 and 11 are part of point 3
They are not disjoint points, true, but the overlap is small.

(and btw addresses are locally way less than 40-bit, that's just a schematic view with almost no hardware requirement... only exception handling is new)
How so? I get 40-bit pointers from the host CPU (on 64-bit OSes anyway). They'd better damn well work on the GPU. Physical addresses may be less than 40 bits, but that has little to do with the SMs, which all run in the virtual address space.
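Just to make that concrete, here's a minimal CUDA sketch (my own example, nothing from the whitepaper): on a 64-bit OS the pointer the host hands to a kernel is a full-width virtual address, and the SM dereferences that same value regardless of how many physical address bits sit underneath.

[code]
#include <cstdio>
#include <cuda_runtime.h>

// The kernel works directly with the virtual address the host passed in.
__global__ void touch(int *p)
{
    *p = 42;
}

int main()
{
    int *d = NULL;
    cudaMalloc((void **)&d, sizeof(int));   // a 64-bit virtual address on a 64-bit OS
    printf("device pointer %p (%zu-bit host pointers)\n", (void *)d, sizeof(void *) * 8);

    touch<<<1, 1>>>(d);                     // the SM dereferences that same virtual address

    int h = 0;
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("kernel wrote %d\n", h);
    cudaFree(d);
    return 0;
}
[/code]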


To others, GT200 is not "just" 2x G80. It doesn't have 2x the TPCs, 2x the TMUs, 2x the SMs, etc.
 

Typo; it should have read 40LP. That's still not 40G. And, albeit vastly off topic, it sounds more and more like a die-shrunk T1 at twice the frequency.

---------------------------------------------------------------------------------------

Quick oversimplification:

G80=
[8*(2*8)], 4TA/8TF per cluster

GT200=
[10*(3*8)], 8TA/8TF per cluster

GF100=
[16*(2*16)], 4TA/4TF per cluster
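
Spelling out the bracketed shorthand, the ALU totals come out to:

[code]
G80:   8 * (2 * 8)   = 128 SPs
GT200: 10 * (3 * 8)  = 240 SPs
GF100: 16 * (2 * 16) = 512 SPs
[/code]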
 
My reason for hypothesizing along those lines is a bit more... mundane. My guess is that, even for ASICs like GPUs, which have a lot of cut-and-paste structures, the larger die size per unit of engineering effort leads to longer design cycles. IOW, all other things being equal, a 5770 will have a shorter design cycle than a 5870, even though one is just half of the other. I'd be happy to hear your thoughts on this matter.
My thoughts are that it doesn't matter much. Time from netlist to tape-out has a very low correlation with the size of the die. That's true even for chips without a lot of cut and paste.

The trailing end of the schedule is what people always remember, but when you check during a project's post-mortem where the battle was lost, it was almost without exception during front-end design. Bob's recent post only strengthens that argument. Architecture change or not, I think it's fairly obvious that the changes from GT200 to Fermi are more than just doubling up and some tweaks here and there. Even if the shader architecture had stayed the same, a move to an R/W L2 cache alone is not something to be taken lightly. Those things are famous for their gnarly corner cases.

IMHO the design cycles of AMD were shortened by doing things in manageable pieces. I don't doubt that internally quite a few units have seen full rewrites in the process, but I suspect changes were done in such a way as to cause minimal top-level disruption, which makes schedules more predictable. It's a very smart thing to do, if your base architecture allows you to do it. (It's also less sexy for the outside world. ;))
 
There is more to IEEE-754-2008 than just FMAs and denormals, and SM5 does not require all of IEEE-754-2008. AFAIK, the lower end RV8xx chips do not implement FMAs.

Actually, -2008 is pretty much just FMA/MAD and clarifications to what is quite possibly one of the worst-written documents ever to come into widespread use. The reality is FMA/MAD is pretty much tomAto/tomato in the scheme of things.
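
For reference, a minimal CUDA sketch of my own (not anything from the spec text) showing what that single rounding actually changes: a fused multiply-add keeps the full product and rounds once, while a MUL followed by an ADD rounds twice, so the difference only ever shows up in the last bits.

[code]
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void fma_vs_mad(float a, float b, float c, float *out)
{
    out[0] = __fmaf_rn(a, b, c);             // FMA: exact product, rounded once
    out[1] = __fadd_rn(__fmul_rn(a, b), c);  // MAD-style: product rounded, then sum rounded
}

int main()
{
    // Operands chosen so the exact product needs more than 24 mantissa bits:
    // (1 + 2^-12)^2 = 1 + 2^-11 + 2^-24.
    float a = 1.0f + exp2f(-12.0f);
    float b = a;
    float c = -(1.0f + exp2f(-11.0f));

    float *d;
    cudaMalloc((void **)&d, 2 * sizeof(float));
    fma_vs_mad<<<1, 1>>>(a, b, c, d);

    float h[2];
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("FMA:     %.10e\n", h[0]);   // 2^-24: the low bit survives the single rounding
    printf("MUL+ADD: %.10e\n", h[1]);   // 0: it was rounded away in the intermediate product
    cudaFree(d);
    return 0;
}
[/code]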

Oh noes, you changed the ALU BB. That, however, does not a new architecture make, as significant changes happen all the time in the world of CPUs and we don't consider those new architectures.

They are not disjoint points, true, but the overlap is small.

Meh, address-space changes, once again, do not a new architecture make. These happen with pretty much every product release in the CPU world.

To others, GT200 is not "just" 2x G80. It doesn't have 2x the TPCs, 2x the TMUs, 2x the SMs, etc.

You mean you can double some things but not others? OMG.
 
But then what exactly makes G80 so fundamentally different from G70? The fact that it's unified? The differences between GF100 and GT200 are way bigger than the differences between G80 and GT200.

I would argue that going unified is a more significant architectural change than GT200->G100, even if all the PR about G100 is true. Unified is a fundamental shift in the way that the device operates. Everything in G100 is a low-to-medium-complexity feature request to various parts of an already existing design.
 
Yes to the first and no to the second. I don't see 160 TMUs on GF100, and I don't see more than 324M Tris/clock in real time on GT200 either.

I'm hoping that everyone can understand the concept of doubling some things and keeping others the same? Things like TMUs/ROPs, etc. have a more direct correlation with the memory subsystem than with the shading subsystem as far as resource requirements go.
 
Do I have to point you at the Fermi Compute whitepaper again? *sigh* Even in the SMs alone, the list of differences from GT200 is pretty long (let alone differences from G80).

1-6

So you switched out to a different ALU BB. That does not a new architecture make.

7- More SPs per SM.

So if, say, AMD adds another SSE unit, it becomes a new architecture?


8- Cache hierarchy

That's pretty normal.

9- Configurable shared memory / cache size, instead of a dedicated RAM.

A mux does not a new architecture make (a quick sketch of that mux is at the end of this post).

10- Unified address space.
11- 40-bit pointers instead of 32-bit pointers.

We've had CPUs go from unified to split to unified caches without considering them different architectures.


14- ECC in the register file, L1 cache, shared memory, L2 cache, and DRAM.

Um, RAS additions over time are the default in the industry. Trying to include them as examples that make G100 a new architecture is laughable at best.
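
To picture the mux from point 9: for what it's worth, it's a per-kernel knob in the CUDA runtime. A minimal sketch of the Fermi-era 48/16 KB shared-memory/L1 split (the kernel names here are just hypothetical placeholders):

[code]
#include <cuda_runtime.h>

__global__ void smem_heavy_kernel()  { /* ... lots of __shared__ traffic ... */ }
__global__ void cache_happy_kernel() { /* ... scattered global reads ... */ }

int main()
{
    // 48 KB shared / 16 KB L1 for the kernel that stages its data manually...
    cudaFuncSetCacheConfig(smem_heavy_kernel, cudaFuncCachePreferShared);
    // ...and 16 KB shared / 48 KB L1 for the one that leans on the cache.
    cudaFuncSetCacheConfig(cache_happy_kernel, cudaFuncCachePreferL1);

    smem_heavy_kernel<<<1, 1>>>();
    cache_happy_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
[/code]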
 
I'm hoping that everyone can understand the concept of doubling some things and keeping others the same? Things like TMUs/ROPs, etc. have a more direct correlation with the memory subsystem than with the shading subsystem as far as resource requirements go.

You may twist and turn whatever reasoning you want, but vast oversimplifications can eventually step into traps. TMU clusters weren't part of the SMs in past architectures, while they are on GF100, and that's probably just another "minor change".
 
Unified is a fundamental shift in the way that the device operates.
Given that we're never going to go back from unified to non-unified, and that they'll always have some kind of massively parallel SIMD shader architecture, according to your definition we're unlikely to see a new architecture in our lifetime...
 
There is no set, industry-wide, agreed-upon definition of what constitutes an "architecture", so the whole thing is an exercise in personal opinion and aesthetics.

I'd agree with you that if I took a chip and simply cut-pasted in more units, or moved blocks around, but the blocks themselves didn't change, it would not constitute a new architecture. For one, it doesn't alter efficiency; all it does is alter scale. Imagine I had a factory that could produce 2 widgets per machine per input, and I added 10 new machines. I can now produce 20 widgets per input; it's scaling up, but it's not a new architecture. If, on the other hand, by arranging the machines differently I could get a super-linear increase, or if I put in machines that could do 3 widgets per machine per input, I'd call it a new architecture.

Where I disagree is when the blocks themselves change. If AMD adds another SSE unit, it's a stretch to call it a new architecture. If they redesigned the SSE unit to have significantly more functionality, new instructions, and new behavior, I'd call it a new architecture.

And when virtually every major block is getting tweaked with new features, it's absurd not to call it new. NVidia, from what we can gather, rev'ed EVERYTHING. Unlike the G200, which was a cut-paste job for the most part, they have new register file functionality, new cache architecture, new ECC, new FP functionality (DP, denorms, exceptions, etc.), new scheduler functionality, new tessellation units, new setup unit architecture, new TMU clocking arrangement, new ROP CSAA functionality, and on and on.

If you look at the new cache architecture alone, it has potentially major implications for efficiency and performance (in contradiction to your claims, PSU-Failure); even Jawed recognized that the cache changes alone could be some rocketsauce for Fermi on certain kinds of algorithms.

Let's put it this way: if, clock for clock, when normalized for differing numbers of SP units and onboard memory, Fermi beats GT200 because algorithms run more efficiently on its cache architecture, would you admit it's a new architecture?

Or rather, if Intel switched the L1/L2 caches in the Core architecture to support the kind of software management/partitioning that Fermi supports, would you consider it a new architecture?

For me, whether or not something is new depends on whether it executes a *different algorithm* or different logic, rather than simply higher clocked, or parallel-cut-and-pasted versions of the same logic. Taking the same logic blocks, and moving them around using a clone brush == not a new architecture. Changing the memory controller, cache, scheduler, ALUs, to have altered implementations == new architecture.
 
[edit: the following is a reply to silent_guy]

No no, you don't understand. Each and every Intel CPU chip is an individual new architecture. But everything NVIDIA ever did, that's all just one and the same (with a few tweaks here and there, over time). *snicker* Double standards are fun!

FWIW, GPUs were "unified" before they became fixed-function. You'll have to go a long way back to remember.
 
You may twist and turn whatever reasoning you want, but vast oversimplifications can eventually step into traps. TMU clusters weren't part of the SMs in past architectures, while they are on GF100, and that's probably just another "minor change".
Going this route, we could ask whether split frequencies are still required, since everything points to quite similar ROP/scheduler frequencies...

I think they initially targeted way higher ALU and perhaps lower ROP frequencies, as it's quite obvious GF100 is G80's real successor, and the latter was capable of 1500 MHz inside its shader core. Perhaps some Netburst "10 GHz" syndrome here.
 
That's pretty normal.
In the GPU world, since when?


While we're on this topic, will you please explain why Conroe->Lynnfield is a new microarchitecture? A few instructions, a "pretty normal" packet-switched memory subsystem, a cut-paste IMC, a cut-paste reorganization of the cache hierarchy. SMT is pretty normal too, as shown by Niagara and the P4. So why exactly does Intel insist on calling it a new microarchitecture?

For that matter, Sandy Bridge adds a few more ALUs and copy-pastes in a GPU. Why exactly is that a new microarchitecture?
 
There is no set, industry-wide, agreed-upon definition of what constitutes an "architecture", so the whole thing is an exercise in personal opinion and aesthetics.

Among computer architects there tends to be some agreement. Certainly in the CPU space it is easier to make a determination: the companies tend to publish significantly more detail, people can generally trust the detail that is published, and we can test the bare hardware in ways that are somewhat difficult in the GPU area due to the various levels of indirection, though ATI deserves some kudos for publishing their HRMs.


Where I disagree is when the blocks themselves change. If AMD adds another SSE unit, it's a stretch to call it a new architecture. If they redesigned the SSE unit to have significantly more functionality, new instructions, and new behavior, I'd call it a new architecture.

But both AMD and Intel have done that numerous times. The engineers and architects that designed said chips would hardly describe the resulting chip as a new architecture.


And when virtually every major block is getting tweaked with new features, it's absurd not to call it new. NVidia, from what we can gather, rev'ed EVERYTHING. Unlike the G200, which was a cut-paste job for the most part, they have new register file functionality, new cache architecture, new ECC, new FP functionality (DP, denorms, exceptions, etc.), new scheduler functionality, new tessellation units, new setup unit architecture, new TMU clocking arrangement, new ROP CSAA functionality, and on and on.

In any shrink in the CPU space, you are going to see virtually every block tweaked in some microarchitectural or circuit way, yet the general consensus would be that it is basically an evolutionary derivative design. New architectures of things such as modern CPUs and GPUs tend to be fairly rare, simply because of all the work involved.


If you look at the new cache architecture alone, it has potentially major implications for efficiency and performance (in contradiction to your claims, PSU-Failure); even Jawed recognized that the cache changes alone could be some rocketsauce for Fermi on certain kinds of algorithms.

Let's put it this way: if, clock for clock, when normalized for differing numbers of SP units and onboard memory, Fermi beats GT200 because algorithms run more efficiently on its cache architecture, would you admit it's a new architecture?

This actually hasn't been proven and the results so far don't look good.

Or rather, if Intel switched the L1/L2 caches in the Core architecture to support the kind of software management/partitioning that Fermi supports, would you consider it a new architecture?

No. Nor would the vast majority of computer architects. It's a new feature. That's all.
 
Going this route, we could ask whether split frequencies are still required, since everything points to quite similar ROP/scheduler frequencies...

G8x/G9x/GT2x0 have TMUs on the core clock, while on GF100 the TMUs run at 1/2 hot clock (nothing we don't already know here...)

I think they initially targeted way higher ALU and perhaps lower ROP frequencies, as it's quite obvious GF100 is G80's real successor, and the latter was capable of 1500 MHz inside its shader core. Perhaps some Netburst "10 GHz" syndrome here.

If the GTX470 has a hot clock of 1300 MHz, then the TMUs (at least I'm sure about the TAs) run at 650 MHz. If the GTX480 should have a hot clock of 1400-1450 MHz, then a 700-725 MHz half hot clock might not be a ground-breaking difference from the 648 MHz core clock of a GTX285, but it's still not the same frequency either. Core clock on the 480 might be, give or take, on the same level as on the 285.

If they initially targeted a hot clock in the, say, 1600 MHz region, yes, that would mean TMUs at +/- 800 MHz, but that doesn't suggest a huge difference in texel fillrate on paper either, compared to the above hypothetical 700+ MHz TMU frequency.
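
Spelling out the half-hot-clock arithmetic (all the hot clocks here are the hypothetical figures from above, not confirmed numbers):

[code]
TMU clock = hot clock / 2
1300 MHz      -> 650 MHz
1400-1450 MHz -> 700-725 MHz
1600 MHz      -> 800 MHz
vs. the 648 MHz core (= TMU) clock of a GTX285
[/code]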

AF performance is still one of the big question marks for GF100 to me, and since reviewers usually deal with just sterile benchmark numbers, I'll probably have to wait for someone willing to spend a bit more time investigating output quality rather than just some timedemo results.
 