NVIDIA Fermi: Architecture discussion

Jensen made it clear that "Tesla" was never the name for the GT200 architecture, and Brian Burke told me that Fermi is the first GPU architecture named after a scientist.
Fermi Temperature. It's still in line with Celsius (NV10), Kelvin (NV20), Rankine (NV30), Curie (NV40), Tesla (NV50). They are supposed to be interfaces though, not actual hardware architectures. Newer chips also support older interfaces (in a compatibility mode).

The only new thing is that they're now using the interface name in marketing as well (or I'm just mistaken and they're using another name internally).
 
A few questions:

I haven't understood the notion of semi-coherent caches. Quoting from here:

Each core can access any data in its shared L1D cache, but cannot generally see the contents of remote L1D caches. At the end of a kernel, cores must write-through the dirty L1D data to the L2 to make it visible to both other cores in the GPU and the host. This might be described as a lazy write-through policy for the L1D, but it seems more accurately described as write back with periodic synchronization.

1) Can an SM see the contents of another SM's L1D or not?

2) If the L1D is private to an SM, what happens if two SMs write different data to the same location without atomics?

3) Is it the programmer's job to make sure that doesn't happen?

4) And how does an SM know what to cache?

5) What about the L2 cache? Everything is supposed to go through it, so how does the notion of semi vs. full coherence fit here?

I'd highly appreciate it if anyone could answer these questions.

Thanks,
 
That is really disappointing. I can't wait until they share more information about what features it has that will benefit graphics.

Exactly. As a gamer rather than a developer, I'm not sure this new product is actually for me or aimed at me. I wonder how many people waiting to see what Nvidia has against 58xx will see this and think that Fermi is going to be late, expensive, and not really a gaming product, but more of a science/supercomputing chip?

RealWorldTech said:
Perhaps the most significant demonstration of Nvidia's commitment to compute is the fact that a great deal of the new features are not particularly beneficial for graphics. Double precision is not terribly important, and while cleaning up the programming model is attractive, it's hardly required. The real question is whether Nvidia has strayed too far from the path of graphics, which again depends on observing and benchmarking real products throughout AMD's, Nvidia's and Intel's line up; but it seems like the risk is there, particularly with AMD's graphics focus.
 
A few questions:

I haven't understood the notion of semi-coherent caches.

1) Can an SM see the contents of another SM's L1D or not?

2) If the L1D is private to an SM, what happens if two SMs write different data to the same location without atomics?

3) Is it the programmer's job to make sure that doesn't happen?

4) And how does an SM know what to cache?

5) What about the L2 cache? Everything is supposed to go through it, so how does the notion of semi vs. full coherence fit here?

I'd highly appreciate it if anyone could answer these questions.

Thanks,


1) No.
2) It's undefined on current hardware, so it presumably remains undefined here too.
3) Yes.
4) Like any other cache, with an LRU-style replacement policy, I presume.
5) The L2 cache is supposedly tied to the memory controllers (each memory controller has 128 KB of L2 cache), so it's coherent by itself because each slice only has to cache its own data.
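To make answers 2) and 3) concrete, here's a minimal CUDA sketch (hypothetical kernel and variable names, nothing NVIDIA-specific): plain stores to the same global address from blocks running on different SMs race and leave an undefined result, so it's on the programmer to use atomics when cross-SM writes have to be well defined.

```
#include <cstdio>

// Plain stores from blocks running on different SMs to the same global
// address race with each other; the final value is undefined.
__global__ void racy_write(int *out)
{
    *out = blockIdx.x;          // undefined: the "last" writer wins in no defined order
}

// Using an atomic makes the result well defined (here: the sum of all block IDs),
// because each read-modify-write on the location is performed atomically.
__global__ void atomic_write(int *out)
{
    atomicAdd(out, blockIdx.x);
}

int main()
{
    int *d_out;
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_out, 0, sizeof(int));

    atomic_write<<<16, 1>>>(d_out);   // 16 blocks, spread across the SMs

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum of block IDs: %d\n", h_out);   // 0 + 1 + ... + 15 = 120

    cudaFree(d_out);
    return 0;
}
```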
 
So, to those who understand: is this new chip a Yay, a Meh, or a Wtf?
Yay²

In the past few years it has become pretty clear that TFLOPS don't mean anything if you don't get high utilization. NVIDIA has made significant changes to increase efficiency and keep all units busy. But they've definitely also not neglected the raw specs.

It remains to be seen exactly how well they can make use of the efficiency enhancements, but there's no doubt there's great potential. Expect driver revisions to offer significant performance increases over time. A lot of complexity is shifting to the software side...
 
My gut reaction is that NVidia's built a slower version of Larrabee in about the same die size with no x86 and added ECC. It might be a bit smaller.

Slower, why?
We don't know how either chip performs yet...
But from what we do know, Intel aims at GTX285 performance, and so far it looks like Fermi is going to be considerably faster than GTX285, hence probably faster than Larrabee.
With full C++ support, I don't really think there's much that x86 can do that Fermi can't.
 
Is it just me or is nVidia running a really huge risk here? I can't imagine this having a better price/performance ratio than the HD5xx0 line, so it really looks like they're trying to transition away from mainstream consumer graphics, or else just couldn't adjust their plans fast enough to have anything else in this time frame.

They can probably scale this design down to mainstream parts, like they've always done.
Obviously they're showing the big guns to the press first, just like they showed the 8800 at launch; an 8600 was available not much later, and it was basically G80 scaled down (with even some extra additions, in that particular case).
 
Rys,

16 pixels/clock of address and setup per SM? Are you sure 256 TMUs aren't overkill for that kind of bandwidth?

Also, when you state 8Z/8C samples/clock for the ROPs, I assume it's either/or, as in today's GPUs?

That's my question also.

Also, hardware.fr is reporting that each SFU unit can do 8 interpolations (if I understood correctly).
Each SM has one SFU unit, so GT300 can do 16 × 8 = 128 interpolations.
Wouldn't the design (256 TMUs) be limited that way (to 128 interpolations)?

EDIT:
I just rechecked hardware.fr.
It reports 16 interpolations.
So we probably do have 256 TMUs.
 
I wonder how many people waiting to see what Nvidia has against 58xx will see this and think that Fermi is going to be late, expensive, and not really a gaming product, but more of a science/supercomputing chip?

Why can't it be both?
G80 was also WAY more than just a DX10 chip, with a large part of the transistor budget dedicated to CUDA (those three-year-old chips can now run OpenCL and DirectCompute!). Still, G80 (and G92/G92b after it) dominated the gaming market, despite AMD making chips that were more 'vanilla' DX10/DX10.1 graphics chips.
 
Thanks pcchen for those answers.

1) No.
2) It's undefined on current hardware, so it presumably remains undefined here too.
3) Yes.
4) Like any other cache, with an LRU-style replacement policy, I presume.
5) The L2 cache is supposedly tied to the memory controllers (each memory controller has 128 KB of L2 cache), so it's coherent by itself because each slice only has to cache its own data.

If the L1Ds are not supposed/expected to communicate, then where do you see possible uses for a unified L2? Faster atomic operations?

Some more questions for the B3D people:

a) Looking at the block diagram, are you intrigued by the thread scheduler sitting on the periphery instead of inside?

b) Which fixed-function hardware do you think was dropped, if any?

c) This is going to invite some irritation/criticism/flames, but why is implementing trees and linked lists in G80-esque shared memory inefficient, or plain hard? I haven't tried doing it, so please make allowance for that before you berate me.
 
HPC is a niche market. No way companies building out massive cloud infrastructure (Google, Microsoft, Amazon, Rackspace, et al) are going to go with GPUs. It's too risky and too incompatible when going with cheap, commodity, x86 hardware (even if inefficient at tasks NV does well).

By most accounts, Google's cluster is about 1 million servers. NVidia sells far more than 1 million GPUs to consumers. If you add up all of the server-side customers and HPC installations, it won't come close to the revenue numbers they pull in from retail/OEM.

Sure, HPC margins might be higher, but overall, concentrating on GPGPU at the expense of retail/OEM/mobile would doom the company into bit-player status.

In short, there is no way Fermi is built on some insane business model. Well, there is, but then Jensen should be shot. I think the reality is, NV is just evolving their architecture, and this is where it took them. They are not going to reboot it into an ATI-like architecture, they're going to continue to refine it and see where it takes them.

I think they added DP and ECC not as a major focus, but because they had spare transistor budget to do it, and customers were asking. It's a risk, but why is no one postulating that AMD is forsaking graphics by adding DP when it did?

The marketers are just trumping up DP and ECC as major bullet points at this point; I wouldn't read too much into it.
 
That's my question also.

Also, hardware.fr is reporting that each SFU unit can do 8 interpolations (if I understood correctly).
Each SM has one SFU unit, so GT300 can do 16 × 8 = 128 interpolations.
Wouldn't the design (256 TMUs) be limited that way (to 128 interpolations)?

EDIT:
I just rechecked hardware.fr.
It reports 16 interpolations.
So we probably do have 256 TMUs.

If true, that's a disaster in terms of bandwidth, IMHLO (the L doesn't stand for lame but for layman's, heh). Unless of course we'll see some >16xAF sample mode "for free" (due to the bandwidth restriction). That last bit is more of a joke, of course; I personally could make better use of something like better filter kernels than of more than 16x samples.

Else, if it truly has something like 256 TMUs or equivalents, I doubt real-time fillrate could even peak to such heights.

One aspect no one has asked about so far, and for which there's obviously no data available yet, is 8xMSAA performance on GF100. I'd dare say it has 48 ROPs (wtf it would need 48 pixels/clock for is beyond my uneducated imagination), but the 8Z/8C note in Rys' diagram isn't particularly telling in that department yet either, IMO.

Finally, if it ends up with something like 1200MHz GDDR5, the 384-bit bus "grants" it 230.4GB/s, which might be sufficient and marks a roughly 50% increase over the GTX285; that's closer to what Dally allowed me to speculate from his PCGH interview, and way less than anything the 512-bit scenarios floated so far would grant.
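For reference, the 230.4GB/s figure falls straight out of those numbers; here's a quick host-side sketch of the arithmetic, assuming the usual 4 data transfers per GDDR5 command clock:

```
#include <cstdio>

int main()
{
    // Assumed figures from the post above: 1200 MHz GDDR5 command clock,
    // 4 data transfers per clock (GDDR5), 384-bit bus = 48 bytes per transfer.
    const double clock_hz      = 1200e6;
    const double transfers_clk = 4.0;
    const double bus_bytes     = 384.0 / 8.0;

    const double gb_per_s = clock_hz * transfers_clk * bus_bytes / 1e9;
    printf("peak bandwidth: %.1f GB/s\n", gb_per_s);   // prints 230.4 GB/s
    return 0;
}
```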
 
If the L1Ds are not supposed/expected to communicate, then where do you see possible uses for a unified L2? Faster atomic operations?

That'd be an obvious application. Another one is that it can work as a reorder buffer, to make memory access more bandwidth-efficient. For example, on G8X/G92, memory accesses have to be strictly sequential and aligned to be most efficient. On GT200 the rules are less strict, but there are still situations which can lead to poor memory bandwidth utilization. Now, with a large L2 cache, the restrictions will be much looser, because the cache can act as a large buffer collecting all the non-regular reads/writes into a more regular access pattern. For example, I suspect that with Fermi's L2 cache it's now possible to write a fast histogram program with the histograms stored in global memory (and hopefully cached). Similar algorithms such as radix sort can also benefit.
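To illustrate the histogram case, here's a minimal CUDA sketch (hypothetical names, not any official sample) that keeps the bins in global memory and hammers them with atomics; the hope on Fermi is that the L2 absorbs most of these scattered read-modify-writes instead of sending each one to DRAM.

```
#include <cstdio>

#define NUM_BINS 256

// Each thread reads one input byte and bumps the matching global-memory bin.
// The access pattern into `bins` is data dependent and irregular; a shared L2
// can collect these scattered atomics instead of sending each one to DRAM.
__global__ void histogram_global(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}

int main()
{
    const int n = 1 << 20;
    unsigned char *d_data;
    unsigned int *d_bins;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemset(d_data, 7, n);                              // dummy input: every byte is 7
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    histogram_global<<<(n + 255) / 256, 256>>>(d_data, n, d_bins);

    unsigned int h_bins[NUM_BINS];
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
    printf("bin 7 = %u (expected %d)\n", h_bins[7], n);

    cudaFree(d_data);
    cudaFree(d_bins);
    return 0;
}
```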

c) This is going to invite some irritation/criticism/flames, but why is implementing trees and linked lists in G80-esque shared memory inefficient, or plain hard? I haven't tried doing it, so please make allowance for that before you berate me.

Do you mean putting the trees/linked lists in shared memory? That'd be possible, but could lead to a lot of bank conflicts. Also, 16KB of shared memory is not really very large, especially when you want to have a different tree for each thread...
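To make that concrete, here's a toy sketch (hypothetical node layout and sizes, not anything from NVIDIA's docs) of per-thread linked lists living in shared memory; it runs into both problems at once: the pools eat the entire 16KB budget of a G80-class SM, and the natural contiguous-per-thread layout gives worst-case bank conflicts on 16-bank hardware.

```
#include <cstdio>

// A toy singly linked list kept entirely in shared memory, one small pool per
// thread (hypothetical layout, sized against G80-class 16 KB shared memory).
struct Node {
    int   value;
    short next;      // index into this thread's pool, -1 marks end of list
    short pad;
};

#define THREADS_PER_BLOCK 256
#define NODES_PER_THREAD  8   // 8 nodes * 8 B * 256 threads = 16 KB, i.e. the whole
                              // per-SM budget on G80 (in practice even slightly more
                              // than one block could really get there)

__global__ void per_thread_lists(const int *input, int *out)
{
    __shared__ Node pool[THREADS_PER_BLOCK * NODES_PER_THREAD];

    // Each thread owns a contiguous 64-byte slice of the pool. With 16 banks of
    // 32-bit words, that 16-word stride means the 16 threads of a half-warp that
    // touch "their" node at the same time all hit the same bank and get serialized.
    Node *mine = &pool[threadIdx.x * NODES_PER_THREAD];

    int head = -1;
    for (int i = 0; i < NODES_PER_THREAD; ++i) {       // build a tiny list
        mine[i].value = input[threadIdx.x * NODES_PER_THREAD + i];
        mine[i].next  = (short)head;
        head = i;
    }

    int sum = 0;                                       // ...and walk it
    for (int i = head; i >= 0; i = mine[i].next)
        sum += mine[i].value;

    out[threadIdx.x] = sum;
}

int main()
{
    const int n = THREADS_PER_BLOCK * NODES_PER_THREAD;
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, THREADS_PER_BLOCK * sizeof(int));
    cudaMemset(d_in, 0, n * sizeof(int));

    per_thread_lists<<<1, THREADS_PER_BLOCK>>>(d_in, d_out);

    int h_out[THREADS_PER_BLOCK];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("thread 0 list sum: %d\n", h_out[0]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```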
 
Is it just me or is nVidia running a really huge risk here? I can't imagine this having a better price/performance ratio than the HD5xx0 line, so it really looks like they're trying to transition away from mainstream consumer graphics, or else just couldn't adjust their plans fast enough to have anything else in this time frame.

But that puts them in a position where they're forced to cede even more of the consumer market to AMD, while Intel's Larrabee looks like it could compete directly with this sort of concept. It's like they're running toward the giant behemoth that is Intel while being nipped at the heels by an AMD with a far stronger bite than expected.

I can't believe that's an enviable position to be in at all.
That's been my feeling since yesterday, especially after I read (from Anand) that Tesla sales last quarter amounted to 1.3% of their business. Nvidia is forcefully trying to carve out its niche in the HPC market.
I understand, though: IGPs will disappear in about a year, and since they anticipate that loss they have to extend their market to support huge R&D costs. It's risky, but they have to do something anyway; this is just a bit extreme if you ask me.
 
HPC is a niche market. [...] Concentrating on GPGPU at the expense of retail/OEM/mobile
Of course HPC is a niche. With GF100 they can cover the mass market and the niche as well. They are not leaving the GPU market; they have hopefully built a product which can compete in both the GPU market and the HPC market (niche chips into a niche market).

see where it takes them
I see the opposite: this is exactly what NVidia has to do now if it wants to survive seven years from now, when both Intel and AMD have integrated/combined/scalable CPUs and GPUs.

I think they added DP and ECC not as a major focus, but because they had spare transistor budget to do it
They'd only have spare budget if the chip interconnects were dictating the minimum die size; in R700's case, for example, that is why it got 800 SPs instead of some smaller number.

GF100 is 3 billion transistors. If the product were good enough at half the size, the features would get cut without a second thought. But it isn't.
Half-rate double precision is on par with CPUs from AMD and Intel, and more than twice as fast as ATI's. ECC on a GPU is a first in the industry, as is support for C++ language features.

By limiting themselves to DX11 and OpenCL 1.0, NVidia would be just another GPU maker. As far as I can see, they continue to be that, but they have also chosen to pursue a larger goal and move in the general-computing direction. Lots of luck to them, and thanks for pushing the industry forward.
 
GF100 is 3 billion transistors. If the product were good enough at half the size, the features would get cut without a second thought. But it isn't.

I'm not too sure about this. I think nVidia just wants to stretch the transistor budget as far as they can for the high-end.
They can then release salvage parts which are still 'good enough', at more attractive prices... And they can build some scaled-down versions of the architecture for the lower markets.
All these products may be 'good enough' in their own right. The HD5850 is also 'better' than the HD5870, in that it is almost as fast and considerably cheaper. AMD *could* have built the HD5850 as the actual design, rather than the salvage part with excess baggage that it is today, and it would have been 'good enough'.
 
The industry is littered with failures trying to get into the CPU space. There was once a thriving industry in the workstation/server business prior to the rise of commodity Linux. You had SPARC, you had PA-RISC, you had MIPS, you had PPC, you had Alpha, you had 68k. These were once protected islands until the free BSD variants and Linux arrived. Now most are in the dead pool, including Sun. Only IBM has survived, and just barely.

There are only two areas where recent success has occurred: consoles and the mobile/low-power markets. But even there, you can see Intel making an assault.

Competitors could have commoditized x86 if not for the simple fact that fabs are incredibly expensive; the barrier to entry is now so high that you're unlikely to see new challengers. We're lucky AMD is still hanging on. Intel's biggest threat probably comes from a mainland Chinese competitor in the next decade.

If Nvidia plans to go up against Intel, not having x86 compatibility is a non-starter. And even if they did plan to go that route, they'd still face the fact that ultimately, they can't bet their entire future on TSMC while Intel and AMD control their own process.

So IMHO, a long-term strategy of trying to beat Intel at their own game is a losing strategy. The real strategy is to de-emphasize traditional processing, which is already hitting limits: your web browser or Microsoft Office won't run much faster on super-duper x86 chips. Rather, the kinds of workloads that will stress desktop systems are inherently media/parallel tasks anyway.

We're looking at hitting the limits of process shrinks in the coming decade anyway, and the only way left to scale is parallelism. So, long term, NVidia's strategy should be to transition developers to a new model rather than adapt their hardware to the old one. And I think you're seeing them do that, especially with the new developer tools they have coming out.

They just can't do it too quickly, but graphics still has to fund this transition.

Honestly, I have a hard time seeing anyone challenging Intel in the future, except perhaps state-subsidized players in Asia or maybe the large Japanese conglomerates. As we get closer to fundamental limits, costs are rising so high that very few entities can raise the kind of capital, and wait long enough for the ROI, needed to meet future needs.

As great as R8xx/GF1xx are, the reality is that Intel is a very large company with lots of smart people, lots of money, and other market-position advantages that make it very hard to unseat, and if need be it can put enough resources behind countering a threatening GPGPU. Even with a clusterfuck as large as Prescott, AMD was only able to bite off a small niche.
 