NVIDIA Fermi: Architecture discussion

I thought it would be somewhat simpler because the FPUs appear to be running along the edges of each core, leaving the other hardware in the center.

I can't really tell from the die photo; perhaps someone with a better knack for it can.

A large part of the core is agnostic to the capability of the FPUs, and Nvidia has a history of maintaining both DP and SP pipeline designs.

So what happens when you try to run some code that uses DP on such a design?

I did not mean to imply one can wave a wand over the DP pipeline and it magically becomes SP, just that this is but one component of the core and that the rest is very much unaware of FP capability.
Anything that isn't an ALU, decoder, or scheduler won't care (and in a naive implementation the latter two barely need to), and it looks like a fair amount of the rest of the core is kept isolated from the parts that do.

Register files care as well, since they have to allocate and access 64b regs differently.
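
To make that concrete (toy kernel only, not anything from Fermi): NVIDIA's register file is organized as 32-bit entries, so a double ends up occupying an aligned pair of registers - which is exactly the allocation/access difference being described.

    // Illustration only. Compiling with `nvcc --ptx` shows .f64 operands;
    // at the machine level each double occupies a pair of 32-bit registers,
    // which the register file has to allocate and address as a unit --
    // something SP-only code never needs.
    __global__ void axpy_f64(double a, const double* x, double* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];  // every double operand = one register pair
    }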

I guess my point is that doing a totally modular design would be really really inefficient in many regards. GT200 was modular, but that's because the DP implementation was a total kludge and *inefficient*.

It's true that large portions of the SM don't care - caches, TMUs, etc. - but without knowing how the DP was designed, it's really hard to say anything intelligent about how easy or hard they are to remove, and what the engineering effort, die area and power implications are...

Nvidia has undertaken the expense of designing both a double precision and single precision floating point pipeline before.
As you've said, one is an incremental evolution over the other.

Taking what amounts to a decrement for consumer hardware could have been planned for and budgeted into the development effort, and there is the potential for a significant amount of reuse.
Since Nvidia has gone and applied full IEEE compliance to both SP and DP, there doesn't appear to be anything disjoint enough to make a core with just the FPUs replaced impractical.

You certainly can do an SP-only core. But I'm saying that you would have to redo the physical design to really compact things. DP is really not just an add-on; it's deeply entangled with SP for good performance.

Changing this seems handy, but I'd be curious why modifying the scheduler or dispatch would be either prohibitive or strictly necessary.
The current DP-capable issue hardware is fully capable of not running DP instructions, and if the SP-only hardware is not modified to change any of the SP instruction latencies, what would the scheduler notice?
Granted, a front end that dispensed with the extra instructions entirely would probably be somewhat smaller.

Scheduling DP is kind of complicated since it requires both pipelines. You could just use the same scheduler and turn off DP, but then you are wasting area...
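
Purely as a toy model (NVIDIA hasn't disclosed the real issue logic, so every name below is made up): if DP issue simply ties up both pipelines for a cycle, an SP-only variant could keep the same scheduler and just never take that path - which is the "turn off DP but waste area" option.

    // Hypothetical issue check, host-side toy model only -- not Fermi's scheduler.
    enum Op { SP_MAD, SFU_OP, DP_FMA };

    struct CyclePorts { bool pipeA_busy; bool pipeB_busy; };

    bool try_issue(CyclePorts* p, Op op, bool dp_supported)
    {
        switch (op) {
        case SP_MAD:                       // needs one SP pipeline
            if (p->pipeA_busy) return false;
            p->pipeA_busy = true;  return true;
        case SFU_OP:                       // needs the other pipeline
            if (p->pipeB_busy) return false;
            p->pipeB_busy = true;  return true;
        case DP_FMA:                       // needs BOTH pipelines this cycle
            if (!dp_supported) return false;            // SP-only part: never issued
            if (p->pipeA_busy || p->pipeB_busy) return false;
            p->pipeA_busy = p->pipeB_busy = true;  return true;
        }
        return false;
    }
    // An SP-only chip could reuse this unchanged with dp_supported = false;
    // a leaner front end would delete the DP_FMA case (and its decode) entirely.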

It all depends on implementation - is it one scheduler or two, how do they communicate, etc. etc.

That's pulling in concerns outside of the engineering difficulty.
An x86 host processor that suddenly blows up on code that ran on older chips is worse than useless.

What a slave chip does behind a driver layer that obscures all or part of the internals is much less constrained (or let's hope this is the case for Larrabee).

So do you think they will do DP in a library as part of the driver?
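
For what it's worth, "DP in a library" would presumably mean float-float ("double-single") emulation rather than true IEEE doubles - something like the sketch below (illustrative only; df64 and the function name are made up, and it assumes strict single-precision rounding, i.e. no fast-math):

    // Double-single arithmetic sketch: one "wide" value is an unevaluated
    // sum of two floats (hi + lo). Roughly 2x float precision, not real IEEE DP.
    struct df64 { float hi; float lo; };

    __device__ df64 df64_add(df64 a, df64 b)
    {
        // Knuth two-sum of the high parts, capturing the rounding error e.
        float s = a.hi + b.hi;
        float v = s - a.hi;
        float e = (a.hi - (s - v)) + (b.hi - v);

        // Fold in the low parts, then renormalize into (hi, lo).
        e += a.lo + b.lo;
        df64 r;
        r.hi = s + e;
        r.lo = e - (r.hi - s);
        return r;
    }

It works, but every emulated add costs on the order of ten SP instructions, which is why hardware DP is worth arguing about in the first place.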

The layout of the core would possibly change, or at least some of it would need to change.
Given that this is a fraction of one component of a whole chip that would likely need global layout changes anyway for a reduced mass-market variant, is this necessarily prohibitive or unthinkable?

No, my point is that it requires a global layout change and that doesn't happen overnight!!!

DK
 
How do you know they are doing fine? There are no tape-out rumours, not even any rumours about possible specs. Maybe they have not even taped out yet?
I think you missed Ailuros' joke. "Seriously" was the give away.
 
Nothing truly new, as that is delegated to mobile parts... something both sides have done in the past.

Have they? To be honest I can't remember, feature-set wise, any case other than G92 = GTX2x0M, and even then it's just some CUDA features (at least I think the GT2xx parts support some CUDA features that the G92 doesn't, don't they?)
 
Given the economy and the existing installed base, I'm skeptical any large swings in market share are even going to take place in the next 6-12 months. The real problem for NVidia is lost revenues and earnings. Looking at their fundamentals, it seems like they have enough cash to survive a few quarters of bleeding ($1.7 billion cash, only $25M debt).

NVidia has the benefit that they built up a lot of developer mindshare with the G80 and CUDA, mindshare that will only slowly be eroded by OpenCL/DX11 and new offerings. The bulk of developers still have to target earlier class chips.

What I'm saying is, NVidia opened up an opportunity for AMD, and the new open standards will gradually displace CUDA, but 6 months is too short a time for anything dramatic to happen, unlike, say, the huge opportunity Intel yielded to AMD with the P4.

I think AMD will score many design wins in the OEM space, especially mid-range machines and notebooks, and that will continue for a while until NVidia gets a low-end part out, with NVidia hoping that large margins in the high-end "big chip" segment, the workstation market, and HPC will pad out losses in the low-end and chipset markets in the meantime. I think NVidia's new suite of developer tools will do much to keep attracting developers by offering extra support.

In general, this round, AMD benefits, but the doom-and-gloom is clearly misplaced. In hindsight, I think Fermi, if performance works out, was the right bet. Take the pain pill early before LRB arrives. They didn't have much choice in the matter. It's just unfortunate that it had to occur during an economic downturn that drives consumers to shop for big bargains.

How can you be so sure that CUDA will *ever* be displaced?

Nvidia is now doing with developers and coders what they did with twiimtbp
- who is going to take the initiative to catch them?

Only Intel; not AMD
(imo)
 
Well, Cg got subsumed by HLSL/GLSL. What percentage of devs still use Cg? That might inform how well CUDA will be displaced by OpenCL/DX11.

I do have a feeling that middleware will tend to be tuned to low-level architecture, since there may still be advantages in doing that.
 
In general, this round, AMD benefits, but the doom-and-gloom is clearly misplaced. In hindsight, I think Fermi, if performance works out, was the right bet. Take the pain pill early before LRB arrives. They didn't have much choice in the matter. It's just unfortunate that it had to occur during an economic downturn that drives consumers to shop for big bargains.
I think the danger for NVidia is that if open standards start to dominate, then ATI can jump right in and steal this new market that NVidia is investing heavily to create. ATI hasn't even topped out in efficiency yet, because they still have the low-hanging fruit of making scalar SIMD units (I've outlined before how it's a very cheap modification to make).
 
Well, Cg got subsumed by HLSL/GLSL. What percentage of devs still use Cg? That might inform how well CUDA will be displaced by OpenCL/DX11.

I do have a feeling that middleware will tend to be tuned to low-level architecture, since there may still be advantages in doing that.

The real difference is that CUDA has had a huge head start (3 years now?) while Cg had nowhere near that kind of lead.
 
What's your idea, Mintmaster? Running 4 vertices/pixels per 5-vector?
Although working on 256-pixel batches doesn't sound terribly efficient when it comes to dynamic branching, not to mention the increased register pressure and less-than-optimal instruction bandwidth/instruction cache usage.
 
I think the danger for NVidia is that if open standards start to dominate, then ATI can jump right in and steal this new market that NVidia is investing heavily to create. ATI hasn't even topped out in efficiency yet, because they still have the low-hanging fruit of making scalar SIMD units (I've outlined before how it's a very cheap modification to make).

What speaks against the more expensive MIMD theory in such a case?
 
How can you be so sure that CUDA will *ever* be displaced?

Nvidia is now doing with developers and coders what they did with twiimtbp
- who is going to take the initiative to catch them?

Only Intel; not AMD
(imo)

Because it's proprietary and a single vendor solution (and that vendor is not known for playing nice). You tell me the last piece of proprietary technology to survive in the mainstream and high volume market on a continuous basis:

Myrinet, Infiniband, IA64, PPC, zArch, Alpha, Cray, SGI, RapidMind, etc.

In contrast you have:
Ethernet, x86, ARM, DirectX, OpenGL, Linux, MySQL, etc.

The list of proprietary corpses covering the field of computing is endless. More to the point, if you are a mass market SW developer, you have a choice:
1. Target CUDA and address perhaps 30% of the market with NV hardware
2. Target OpenCL and address the entire market

80% > 30% --> Target OpenCL. It's really quite easy.

If you are a niche SW dev (e.g. oil and gas), then CUDA is a lot more reasonable, since the costs of buying into something proprietary are not big relative to the money involved, and you're used to expensive, proprietary stuff. Oh - and you can port your code relatively easily since you have skilled programmers.
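
(To illustrate the "relatively easy port" point: for simple kernels the gap between the two APIs is mostly spelling. A minimal sketch - the kernel itself is made up:)

    // CUDA version of a trivial kernel. The OpenCL C equivalent is nearly
    // line-for-line the same: __global__ becomes __kernel, the pointer
    // parameters gain __global qualifiers, the index comes from
    // get_global_id(0), and the <<<grid, block>>> launch on the host is
    // replaced by clEnqueueNDRangeKernel. The hard part of a port is usually
    // the host-side setup and vendor-specific tuning, not the kernels.
    __global__ void vadd(const float* a, const float* b, float* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }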

David
 
What speaks against the more expensive MIMD theory in such a case?
Sorry if I wasn't being clear. The emphasis was on scalar as opposed to vec4 (plus transcendental), so that you don't have to parallelize streams. I can't really speculate on how future workloads will affect the largely orthogonal SIMD vs. MIMD issue.
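
A toy example of what "scalar" buys there (illustrative only, not real compiler output): a vec4/5-wide unit has to find independent operations within one thread to fill its lanes, while a scalar design fills its width with different threads.

    // Dependent chain in a single thread -- illustration only.
    __device__ float chain(float x, float y, float z)
    {
        float a = x * y;   // a VLIW5-style bundle can co-issue up to 5 independent
        float b = a + z;   // scalar ops from ONE thread; this chain offers only one
        return b * b;      // per bundle, leaving the other lanes idle.
    }
    // A "scalar SIMD" unit (G80-style) instead runs the same instruction across
    // many threads at once, so a serial chain in one thread costs latency,
    // not lane utilization.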
 
I think you missed Ailuros' joke. "Seriously" was the give away.

If there had been a tape-out, it would have been reported by a news/rumour site.

Charlie did show a roadmap with no D12P chip until the end of Q2/10. That seems realistic; why should he report false facts about NV?
 
Because it's proprietary and a single vendor solution (and that vendor is not known for playing nice). You tell me the last piece of proprietary technology to survive in the mainstream and high volume market on a continuous basis:

Myrinet, Infiniband, IA64, PPC, zArch, Alpha, Cray, SGI, RapidMind, etc.

In contrast you have:
Ethernet, x86, ARM, DirectX, OpenGL, Linux, MySQL, etc.

The list of proprietary corpses covering the field of computing is endless. More to the point, if you are a mass market SW developer, you have a choice:
1. Target CUDA and address perhaps 30% of the market with NV hardware
2. Target OpenCL and address the entire market

80% > 30% --> Target OpenCL. It's really quite easy.

If you are a niche SW dev (e.g. oil and gas), then CUDA is a lot more reasonable, since the costs of buying into something proprietary are not big relative to the money involved, and you're used to expensive, proprietary stuff. Oh - and you can port your code relatively easily since you have skilled programmers.

David
Let's see, off the top of my head, the last piece of proprietary SW technology to survive long-term in an open market would be Windows; MS is not known for playing nice either. :p
Apple has done pretty well also, and so has Adobe. The list goes on and on... and on.

Nvidia has done extraordinarily well with their twiimtbp. They are pushing the industry and providing the tools, and I believe there are open source apps being written for CUDA, debuggers for Linux, etc.

Who else is going to do that? Developers on their own? Nvidia has had at least a 2-3 year head start over everyone and it still has the option to make CUDA open source also.
 
Sorry if I wasn't being clear. The emphasis was on scalar as opposed to vec4 (plus transcendental), so that you don't have to parallelize streams.

Why would they want to do this? The extra efficiency would come at the cost of how compact and numerous they can make those 5-way units.
 
Let's see, off the top of my head, the last piece of proprietary SW technology to survive long-term in an open market would be Windows; MS is not known for playing nice either. :p
Apple has done pretty well also, and so has Adobe. The list goes on and on... and on.

CUDA isn't just proprietary software - it's proprietary software tied to hardware from a single vendor. Don't think Windows, think VMS, OS/400, Unicos or GCOS.

Apple isn't high volume in computing, but nice try. Flash is pretty ubiquitous but it's also not tied to a single vendor's hardware either.

I think the right comparison you're looking for is actually Glide...yeah, that sounds about right.

Nvidia has done extraordinarily well with their twiimtbp.

Which is a puerile marketing campaign where they give developers money. And you wonder why it's a success? It's no different than Intel Inside.

If I was giving out $20 bills on the corner to anyone who said they were my friend, I'd be pretty damn popular too.

They are pushing the industry and providing the tools, and I believe there are open source apps being written for CUDA, debuggers for Linux, etc.

Who else is going to do that? Developers on their own?

What the hell do you think OpenCL is? A company that develops software decided to write its own tools and turn them into an open standard. Why do you think they did that?

Folks like ATI, Intel, etc. will provide support for OpenCL, as well as 3rd party software companies.

Nvidia has had at least a 2-3 year head start over everyone and it still has the option to make CUDA open source also.

Nvidia does have a head start, but the momentum of the industry is behind OpenCL. And open sourcing stuff won't change that...

DK
 
If there had been a tape-out, it would have been reported by a news/rumour site.

Charlie did show a roadmap with no D12P chip until the end of Q2/10. That seems realistic; why should he report false facts about NV?

What false facts? My first sentence was an obvious joke and the second one indicates the joke and denies anything regarding a hypothetical tapeout beyond D12U.

Sorry if I wasn't being clear. The emphasis was on scalar as opposed to vec4 (plus transcendental), so that you don't have to parallelize streams. I can't really speculate on how future workloads will affect the largely orthogonal SIMD vs. MIMD issue.

Dumb layman's question then: wouldn't MIMD be a better idea overall for small triangle efficiency?
 
Folks like ATI, Intel, etc. will provide support for OpenCL, as well as 3rd party software companies.

NVIDIA will provide support for OpenCL too of course :D After all, they are one of the first companies to support this open standard!


Nvidia does have a head start, but the momentum of the industry is behind OpenCL. And open sourcing stuff won't change that...

The purpose of CUDA is not to thwart open standards. The purpose of CUDA is to give developers quicker and more timely access to new development tools than they could have from any open standard. NVIDIA can implement new features much more quickly in CUDA than they could be implemented in an open standard such as OpenCL. That's not to say that open standards such as OpenCL are bad (quite the contrary, really), just that CUDA can actually be a very good thing too.
 
Dumb layman's question then: wouldn't MIMD be a better idea overall for small triangle efficiency?
The pixel shader doesn't really care what triangle a quad is from; you can pool them and run them in a single thread group.
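
A toy sketch of that pooling idea (all names made up): quads from different small triangles can be packed into one fixed-size batch, so thread-group occupancy doesn't depend on triangle size - the residual cost of tiny triangles is the partially covered quads themselves.

    // Illustration only: pack 2x2 quads into warp-sized batches regardless of
    // which triangle each quad came from.
    struct Quad { int tri_id; unsigned coverage_mask; /* 4-bit pixel coverage */ };

    #define QUADS_PER_BATCH 8   // e.g. 8 quads * 4 pixels = one 32-wide group

    int pack_quads(const Quad* quads, int num_quads,
                   Quad batches[][QUADS_PER_BATCH])
    {
        int b = 0, k = 0;
        for (int i = 0; i < num_quads; ++i) {
            batches[b][k++] = quads[i];          // triangles mix freely here
            if (k == QUADS_PER_BATCH) { b++; k = 0; }
        }
        return b + (k != 0);                     // batches used (last may be partial)
    }
    // The inefficiency from small triangles shows up in the coverage masks
    // (helper pixels in partly covered quads), not in how many quads fit a batch.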
 