NVIDIA GF100 & Friends speculation

In the GPU world, since when?

Umm, since at least R600/G80?

While we are on this topic, will you please explain why Conroe -> Lynnfield is a new microarchitecture? A few instructions, a "pretty normal" packet-switched memory subsystem, a cut-and-paste IMC, a cut-and-paste reorganization of the cache hierarchy. SMT is pretty normal too, as shown by Niagara and the P4. So why exactly does Intel insist on calling it a new microarchitecture?

The core certainly isn't a new architecture. The uncore, however, is fundamentally different between Conroe and Nehalem.
 
Umm, since at least R600/G80?
:oops:
Having read the CUDA docs cover to cover ever since the first one, I can't find a single reference to cached global memory transactions. Could you put me out of my misery and point out where in the world it is mentioned that G80 has r/w caches? ;)
 
In any shrink in the CPU space, you are going to see virtually every block tweaked in either some microarchitectural or circuit way, yet the general consensus would be that it is basically an evolutionary, derivative design. New architectures for things as complex as modern CPUs and GPUs tend to be fairly rare simply because of all the work involved.

Well, for one, I didn't say die shrink. There's a difference between tweaking for a new process and adding new blocks or sub-blocks of functionality that simply weren't there before. Secondly, if moving setup into the SM pipeline and coordinating between GPCs isn't a new architecture, I don't know what is. They fundamentally changed the way geometry is piped into the compute layer for the first time since setup units were first added to consumer GPUs.

That is, *they altered the on-chip DATAFLOW*

This actually hasn't been proven and the results so far don't look good.

Well, you seem to act like you know a lot about architecture, so why would you expect current results, from running old code optimized for non-coherent caches, to show some huge speedup? This isn't even true of coherent CPU caches when designing algorithms for SMP: you have to design for the cache and memory hierarchy to extract maximum benefit. What I would expect is that code written for the purported new cache architecture would perform badly on older GPUs, and if it did, that would be indicative of a new cache architecture.
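To make that concrete, here's a rough sketch of what I mean (my own toy example; kernel names and tile size are made up). The top kernel is written the pre-Fermi way, hand-staging data through shared memory because global loads weren't cached; the bottom one just reads global memory and leans on an assumed L1/L2. The first gains almost nothing from a new cache, and the second would fall flat on older parts.

[code]
// Toy illustration only. Assumes blockDim.x == TILE and n a multiple of TILE.
#define TILE 256

// Pre-Fermi style: stage data in shared memory by hand, since G80/GT200
// global loads bypass any read/write cache.
__global__ void sum_neighbors_staged(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2];                  // explicit on-chip staging
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    if (gid < n) tile[lid] = in[gid];                 // one global load per thread
    if (threadIdx.x == 0 && gid > 0)
        tile[0] = in[gid - 1];                        // left halo
    if (threadIdx.x == blockDim.x - 1 && gid < n - 1)
        tile[TILE + 1] = in[gid + 1];                 // right halo
    __syncthreads();

    if (gid > 0 && gid < n - 1)
        out[gid] = tile[lid - 1] + tile[lid] + tile[lid + 1];
}

// Cache-reliant style: let an (assumed) L1/L2 capture the reuse of the
// neighboring loads. On a cache-less part these all go to DRAM.
__global__ void sum_neighbors_cached(const float* in, float* out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid > 0 && gid < n - 1)
        out[gid] = in[gid - 1] + in[gid] + in[gid + 1];
}
[/code]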

Of course, you've already decided that fundamentally altering the dataflow of the chip, or changing its whole register file and cache system, isn't a new architecture.


No. Nor would the vast majority of computer architects. It's a new feature. That's all.

Well, it's great that you're their appointed representative to speak for them. It's also clear that nothing can really be said to change your mind; this is like arguing with movie critics over whether Jim Cameron did anything new in cinematography with Avatar. Half will say he invented a whole new camera system, editing system, and direction process in which you can reshoot scenes using augmented reality on set. The other half will say he did nothing to change the fundamental film grammar since Birth of a Nation/Triumph of the Will/Citizen Kane/etc.; he just changed the camera.

But what's really going on, IMHO, is that you made an overly broad assertion about Fermi and are unwilling to admit hyperbole. Even if you don't think it's a "new architecture", there's more change from GT200 -> Fermi in the compute section (leaving out the frontend) than from G80 -> GT200. Not recognizing that they changed a lot more this time around, besides just bumping up texture cache and the number of units, is just being willfully stubborn and afraid of admitting error.
 
Well, for one, I didn't say die shrink. There's a difference between tweaking for a new process and adding new blocks or sub-blocks of functionality that simply weren't there before. Secondly, if moving setup into the SM pipeline and coordinating between GPCs isn't a new architecture, I don't know what is. They fundamentally changed the way geometry is piped into the compute layer for the first time since setup units were first added to consumer GPUs.

That is, *they altered the on-chip DATAFLOW*

So they say. They also said that G80 had 128 *CORES*. Until we see results that actually back up the marketing slides, I've seen enough in the history of Nvidia PR to be doubtful.


Well, you seem to act like you know a lot about architecture, so why would you expect current results, from running old code optimized for non-coherent caches, to show some huge speedup? This isn't even true of coherent CPU caches when designing algorithms for SMP: you have to design for the cache and memory hierarchy to extract maximum benefit. What I would expect is that code written for the purported new cache architecture would perform badly on older GPUs, and if it did, that would be indicative of a new cache architecture.

One of the first rules of architecture is that existing code is more important than future hypothetical code. If your design change causes performance to be significantly worse with the same resources, then it probably wasn't a good change.

Of course, you've already decided that fundamentally altering the dataflow of the chip, or changing its whole register file and cache system, isn't a new architecture.

No, I simply have seen no proof that they've fundamentally altered the dataflow of the chip.

Well, it's great that you're their appointed representative to speak for them. It's also clear that nothing can really be said to change your mind,

Real data and real facts can certainly change my mind. But so far none have been presented from valid sources to support your point.

Even if you don't think it's a "new architecture", there's more change from GT200 -> Fermi in the compute section (leaving out the frontend) than from G80 -> GT200. Not recognizing that they changed a lot more this time around, besides just bumping up texture cache and the number of units, is just being willfully stubborn and afraid of admitting error.

Oh, sure, they likely did change more from GT200 -> G100. That doesn't mean it's a new architecture.
 
:oops:
Having read the CUDA docs cover to cover ever since the first one, I can't find a single reference to cached global memory transactions. Could you put me out of my misery and point out where in the world it is mentioned that G80 has r/w caches? ;)

I don't think it has been mentioned that G80 has r/w caches. Fortunately for me, the point was merely changes to cache hierarchies. I'm pretty sure that G80 has multiple cache levels, a.k.a. a cache hierarchy. It's pretty standard to modify the cache hierarchy between design revisions.

As far as R/W caches go, we'll have to see how it actually works in G100, but it's likely an extension of the shared memory functionality found in the GT200 design.
 
I'm not hung up on semantics, only measured phenomena. Whether something is labeled "new architecture" or not is practically irrelevant. I'm interested in what's changed that either alters efficiency, adds features, or expands the domain of what's computable or tractable. What irked me about your assertions is not so much the labels, but the implication that not much has changed. From features alone, much has been added, and from CUDA documentation, important memory architecture has changed.

Yes, NVidia PR typically labels everything as revolutionary. What's new? Most companies do; seen Apple's PR? "Magic," "Revolutionary," "Incredible," "Unbelievable." I discount stuff like "Lightspeed Memory Architecture 3.0". But developer SDK headers and API docs usually tell a different story, and they'd be idiotic to add a bunch of new memory architecture features to the CUDA SDK that aren't actually HW features but simply software emulations of real features, because the first dev to compile a sample would instantly detect the bullshit. It's pretty hard to bullshit about making memory writable that was previously immutable.
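And any dev can check what the silicon actually reports before trusting a slide. Something as simple as this (my sketch; device 0 is picked arbitrarily, and the compute-capability 2.0 cutoff is my assumption for the Fermi-class cache features) already separates marketing from hardware:

[code]
// Minimal device-capability check: what the runtime reports is hard to fake.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                // query device 0
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    if (prop.major >= 2)                              // assumed Fermi-class cutoff
        printf("Fermi-class part: cached, writable global memory path expected.\n");
    else
        printf("Pre-Fermi part: only read-only texture/constant caches.\n");
    return 0;
}
[/code]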
 
I don't think it has been mentioned that G80 has r/w caches. Fortunately for me, the point was merely changes to cache hierarchies. I'm pretty sure that G80 has multiple cache levels, a.k.a. a cache hierarchy. It's pretty standard to modify the cache hierarchy between design revisions.
G80 had a cache hierarchy, but it was read-only.
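That read-only hierarchy is exactly what the early CUDA docs describe: constant and texture fetches are cached, while plain global loads and all stores go straight to DRAM. A minimal sketch of the G80-era idiom (the coefficient table is just an example of mine):

[code]
// G80-era idiom: small, read-only broadcast data lives in __constant__ memory,
// which is served by the read-only constant cache; the store still hits DRAM.
__constant__ float coeffs[16];                        // illustrative table

__global__ void apply_coeffs(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i % 16];              // cached read, uncached write
}

// Host side (sketch): cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(float) * 16);
[/code]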
 
As far as R/W caches go, we'll have to see how it actually works in G100, but it's likely an extension of the shared memory functionality found in the GT200 design.

So you think the G80 could split its L1 caches into separate partitions and elect to have them semi-coherent? This smells like a new hardware feature to me, and if the G8x was capable of it, why wasn't it exposed?
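For reference, the way the Fermi-era CUDA runtime appears to expose that split is a per-kernel preference between L1 and shared memory, roughly like the sketch below (my reading of the 3.x runtime API; the kernel name is made up, and nothing comparable exists for G8x):

[code]
// Hedged sketch of Fermi's configurable 64 KB L1 / shared-memory split.
#include <cuda_runtime.h>

__global__ void stream_kernel(float* data, int n);    // hypothetical kernel

void pick_cache_split()
{
    // Prefer the larger L1 partition for a kernel that leans on cached
    // global memory rather than explicit shared-memory staging...
    cudaFuncSetCacheConfig(stream_kernel, cudaFuncCachePreferL1);

    // ...or prefer the larger shared-memory partition instead:
    // cudaFuncSetCacheConfig(stream_kernel, cudaFuncCachePreferShared);
}
[/code]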

There's quite a difference between having a shared global cache, and having some kind of cache-snooping logic to sync distributed caches.

So your new tack is to admit that there's something new, but write it off as something trivial? Yeah, cache-snooping is just a few lines of code, la-de-da.
 
Aaron, maybe you could outline for us what, in your eyes, would be the checkmarks on the way to a new architecture?
 
There's quite a difference between having a shared global cache, and having some kind of cache-snooping logic to sync distributed caches.

So your new tack is to admit that there's something new, but write it off as something trivial? Yeah, cache-snooping is just a few lines of code, la-de-da.

No, but keeping a couple of small memory arrays in sync across a large chip, when you control all the software and hardware and can easily special-case it, is quite different from having a fully coherent global cache hierarchy.
 
LOL, was that supposed to be funny? Or just sarcasm? Maybe both?

It's wrong. GF100 is a G80 x 2. :devilish:
16 SMs vs. 8 --> 2x
48 ROPs vs. 24 ROPs --> 2x
GDDR5 vs. GDDR3 --> 2x
1,536 MB vs. 768 MB --> 2x
Two dispatchers vs. one --> 2x
32 cores vs. 16 cores --> 2x
64 FUs vs. 32 FUs --> 2x
4 SFUs vs. 2 SFUs --> 2x
 
Doesn't that result in reinventing the wheel over and over again? I mean, starting with a clean slate is nice, but also incredibly work-intensive. Why do such things if there's no need?
 
It's wrong. GF100 is a G80 x 2. :devilish:
16 SMs vs. 8 --> 2x
48 ROPs vs. 24 ROPs --> 2x
GDDR5 vs. GDDR3 --> 2x
1,536 MB vs. 768 MB --> 2x
Two dispatchers vs. one --> 2x
32 cores vs. 16 cores --> 2x
64 FUs vs. 32 FUs --> 2x
4 SFUs vs. 2 SFUs --> 2x

Interesting take overall. :D
 
Are you starting with new RTL, or just modifying and rearranging existing RTL? That's a pretty fundamental one.

Most companies almost never completely replace the RTL. Many units, especially complex (but mature and well-tested) ones, are usually just modified/rearranged. For instance, Intel probably modified/rearranged the P54C RTL for use in Larrabee.
 
Doesn't that result in reinventing the wheel over and over again? I mean, starting with a clean slate is nice, but also incredibly work-intensive. Why do such things if there's no need?

Generally, if there is no need to write a new code base, then you aren't doing a new architecture but an evolution of an existing architecture. I've worked on a large number of designs, and one thing every new architecture had in common was new-from-scratch RTL.
 
Originally Posted by aaronspink
You do realize that...g100 is basically 2x gt200 right?

Please specifically list out reasons why GF100 is "basically" 2x GT200 from an architectural point of view. Please. I dare you. No, I double dare you. :D

NVIDIA whitepapers have already revealed information about the large efficiency gains (well over 2x) in texture filtering performance, 8xAA performance, and compute performance (among many other things) when comparing GT200 to GF100, let alone full hardware support for DX11 in GF100. Calling GF100 "basically" 2x GT200 surely seems to be a convenient way to trivialize the R&D effort that went into improving efficiency and overhauling the architecture with GF100. Of course, in the back rooms of your office at Intel, maybe this is a popular line of thought.
 
Most companies almost never completely replace the RTL. Many units, especially complex (but mature and well-tested) ones, are usually just modified/rearranged. For instance, Intel probably modified/rearranged the P54C RTL for use in Larrabee.

You don't think that AMD wrote pretty much all new RTL when they went from K6 to K7? Or Intel from P5 to P6, or from P6 to P4? Or DEC from EV5 to EV6? Or IBM from Power6 to Power7, or from Power4/5 to Power6?
 