NVIDIA Fermi: Architecture discussion

According to an interview with Andy Keane and PCGH, each CUDA core consists of:

a DP-FMA, a SP-FMA and an integer ALU

... and they say that in some cases the DP-FMA can be used for SP tasks.

So are we talking about up to 4 FLOPs per clock per CUDA core? :???:


Nvidia's Whitepaper said:
Each SM features 32 CUDA processors—a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic.

A frequently used sequence of operations in computer graphics, linear algebra, and scientific
applications is to multiply two numbers, adding the product to a third number, for example,
D = A × B + C. Prior generation GPUs accelerated this function with the multiply-add (MAD)
instruction that allowed both operations to be performed in a single clock. The MAD instruction
performs a multiplication with truncation, followed by an addition with round-to-nearest even.
Fermi implements the new fused multiply-add (FMA) instruction for both 32-bit single-precision
and 64-bit double-precision floating point numbers (GT200 supported FMA only in double
precision) that improves upon multiply-add by retaining full precision in the intermediate stage.
The increase in precision benefits a number of algorithms, such as rendering fine intersecting
geometry, greater precision in iterative mathematical calculations, and fast, exactly-rounded
division and square root operations.
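
To make the precision difference concrete, here is a small CUDA sketch (my own illustration, not from the whitepaper) contrasting a separate multiply-then-add with a fused multiply-add; the inputs are picked so the product loses its low bits when rounded on its own:

```cuda
// Hypothetical demo: separate multiply+add vs fused multiply-add.
// Requires the CUDA toolkit and any FMA-capable (Fermi or later) GPU.
#include <cstdio>

__global__ void mad_vs_fma(float a, float b, float c, float *out)
{
    // Separate multiply and add, each rounded to nearest even.
    out[0] = __fadd_rn(__fmul_rn(a, b), c);
    // Fused multiply-add: the full-precision product feeds the add,
    // with a single rounding at the end.
    out[1] = __fmaf_rn(a, b, c);
}

int main()
{
    float h_out[2];
    float *d_out;
    cudaMalloc(&d_out, 2 * sizeof(float));

    // a*b is below 1.0 by ~2^-46, far smaller than the single-precision ULP at 1.0.
    float a = 1.0f + 1e-7f, b = 1.0f - 1e-7f, c = -1.0f;
    mad_vs_fma<<<1, 1>>>(a, b, c, d_out);

    cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("mul+add: %g   fma: %g\n", h_out[0], h_out[1]);
    cudaFree(d_out);
    return 0;
}
```

With these inputs the separate path rounds the product to 1.0f and prints 0, while the FMA path keeps the exact product and prints roughly -1.4e-14.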

They aren't talking about two separate ALUs for DP and SP operations in their whitepaper. I still believe it's 2 flops/clock for each core.
 
What's so inefficient about putting circular append buffers between producer/consumer kernels and branching to consumers when they have full warps? It takes storage, but running strands on Larrabee with only a few active fibers won't be efficient either ... the storage is a necessity.
That will work. But you lose all scheduling granularity between the producer and consumer. They're now 1:1, which means the register allocation for both has to be carried by each.

D3D11 does have features for tackling resource allocation for this "uber shader" type problem, with dynamic linkage. But I think the allocation is still static at run time, so I don't think it solves this problem.
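
For concreteness, here's a minimal sketch of the append-buffer idea under discussion, simplified to two separate kernels with a host round trip in between (all names and sizes are illustrative; a true uber-kernel would branch into the consumer in-kernel once a warp's worth of items is queued, avoiding the round trip but tying the two register allocations together as noted above):

```cuda
// Hypothetical sketch (not GF100- or Larrabee-specific): a producer kernel
// appends work items to a global buffer via an atomic cursor, then a
// consumer kernel is launched over exactly the items produced.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void producer(int n, float *queue, unsigned int *count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Stand-in for real producer work: only some threads emit an item.
    if (tid % 3 == 0) {
        unsigned int slot = atomicAdd(count, 1u);   // claim an output slot
        queue[slot] = (float)tid;
    }
}

__global__ void consumer(const float *queue, unsigned int n_items, float *out)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_items) return;
    out[tid] = queue[tid] * 2.0f;                   // stand-in for consumer work
}

int main()
{
    const int N = 1 << 20;
    float *d_queue, *d_out;
    unsigned int *d_count, h_count = 0;
    cudaMalloc(&d_queue, N * sizeof(float));
    cudaMalloc(&d_out,   N * sizeof(float));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    producer<<<(N + 255) / 256, 256>>>(N, d_queue, d_count);
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    // Launch the consumer sized to the items actually queued: it runs
    // (nearly) full warps, at the cost of the intermediate storage and a
    // trip back through the host.
    if (h_count > 0)
        consumer<<<(h_count + 255) / 256, 256>>>(d_queue, h_count, d_out);

    cudaDeviceSynchronize();
    printf("queued %u items\n", h_count);
    return 0;
}
```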

Jawed
 
FP and INT share a data path (and possibly more hardware for all we know) and cannot coissue.
Yeah I agree with this, now.

Having something like that thrown back on the shader core would effectively end the decoupled texturing that led to such efficiencies in earlier GPUs. I'd worry that shader work couldn't progress until texturing was done.
NVidia's instruction scheduler allows these instructions to be issued out of order and irrespective of thread. So if the INT functionality were a separate SIMD (like the SFU is) then there'd be no problem.

How did you derive this rate for Larrabee, particularly the Z rate?
I'm basing 16 colour on HD5870's 32 colour, at ~ half the likely clock of Larrabee. Larrabee might be as fast as 2GHz, of course. As for Z, HD5870's 4x rate seems to be more than adequate, too. NVidia's 8x rate is clearly wasted on GT200. Though a real 8x rate would be useful. Of course on Larrabee the absolute Z-rate is down to whatever else the hardware's doing.
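
In absolute terms that's roughly a wash (assuming HD5870's 850MHz and a Larrabee clock around twice that):

$$32 \times 0.85\,\text{GHz} = 27.2\ \text{Gpixels/s} \quad \text{vs.} \quad 16 \times 1.7\,\text{GHz} = 27.2\ \text{Gpixels/s}$$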

So one or more threads on the core will wait around for the producer to complete, then pick up, or is it multiple working threads, then a context switch to pull in a consumer?
See MfA's idea: you have an uber-kernel that does both sides on a GF100 core - or you split the kernels across the cores, some are producers and some are consumers.

It seems GF100 doesn't support context switches per core, only per GPU.

R600 supports up to 8 states at a time. It seemingly uses these to do multiple concurrent contexts, but the documentation is vague. AMD, according to TechReport, has claimed multiple kernels per core on R800:

(Incidentally, AMD tells us its Cypress chip can also run multiple kernels concurrently on its different SIMDs. In fact, different kernels can be interleaved on one SIMD.)
but I'm doubtful that means multiple compute kernels, merely multiple graphics kernels, i.e. VS and PS kernels can run on a single core (like they do on R600, I presume). I'm dubious (until I see documentation that confirms otherwise) that two or more compute kernels can timeslice on a core. Maybe the 8-state support that R600 has for graphics kernels has been extended to compute kernels.

Jawed
 
ninelven, 1/5 of Cypress is 544, not 272 GFLOPS

If my memory serves from the way it was in RV670, which first introduced DP on Radeons, basic add & subtraction functions can be carried out at 2/5 rate, while multiply and divide functions go at 1/5 rate. (I doubt it has got any worse since then.)
DP Divide on R600/R700 is 12 cycles, it's a "macro" effectively.
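
For reference, the 544 figure is just Cypress's peak single-precision rate divided by five (assuming the stock HD5870 configuration: 1600 lanes at 850MHz):

$$1600 \times 2 \times 0.85\,\text{GHz} = 2720\ \text{GFLOPS SP}, \qquad 2720 / 5 = 544\ \text{GFLOPS DP}$$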

Jawed
 
Sure. But if your GPU can "emulate" a new feature then it's technically compatible.
Wasn't Intel planning on doing something like this for LRB's DX compatibility?
Yeah, Intel's chosen a pretty good time for Larrabee, as fixed-function texture decompression/filtering is unlikely to need to progress to any great degree beyond what's in D3D11. Nothing else new seems likely to need to be fixed-function for decent performance.

Jawed
 
How will anyone use DP on HD5870? It appears to be a pure tick-box feature, with no OpenCL support for ~ a year. Writing Brook+ and IL, both of which are no longer supported?
IIRC it's already exposed under DirectCompute. And as pointed out some support will come to OpenCL sooner as well.
 
It's really a question of whether they can map DX11 tessellation to their SMs well enough. I'm thinking that they may have chosen s/w tessellation because they are certain that a s/w solution is preferable in the long run, in the same way that unified PS/VS/GS is preferable to separate pipelines right now. Take Cell for example. AFAIK it's pretty good for tessellation. Does it have a h/w tessellator? Will it get one in the future? Will LRB have a h/w tessellator? Right now it looks like AMD may end up being the only one on the market with a h/w tessellator in their chips. But who knows, maybe AMD's right and then everyone will be forced to implement a separate h/w tessellator at some point.
We need some benchmarks =)
At most the TS can produce 32 new points per input patch. But the hardware probably can't rasterise more than one resulting triangle per clock, i.e. 1 new point per clock is the actual TS rate required. Though you can argue for more if there's culling of various types to do (back-face, screen-clip).

DS throughput is also going to be a potential bottleneck, i.e. there's a lot of work to do to convert a point into a vertex - lots of interpolations, at least.

On GF100 at say 750MHz, and assuming it can rasterise 750M triangles per second, there'd be 1024 scalar operations per triangle (assuming ALU clock is twice core clock). So I can't see how software tessellation is going to be meaningfully constrained.
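
Spelling that arithmetic out (assuming GF100's rumoured 512 ALU lanes):

$$\frac{512 \times 2 \times 750\,\text{MHz}}{750\,\text{Mtriangles/s}} = 1024\ \text{scalar ops per triangle}$$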

I don't understand why AMD implemented ALU-based interpolation (deleting SPI) but kept fixed function tessellation. The only thing I can think of is that it's a functional block that also does vertex/geometry assembly (to feed setup) and its deletion will come when the architecture is properly overhauled.

Jawed
 
It could have been their RV770, but they are being more ambitious and simultaneously conservative, putting off a shrinkage/optimization pass until the next revision and opting instead to rearchitect pieces.
Until we know what's happened with TMUs and ROPs, it's pretty murky. And the DP is definitely a monster improvement - though such a low base blunts that somewhat.

Their hand may have been forced. Looking at Larrabee and other product roadmaps, they probably felt they needed to get a much more general-purpose GPU out this generation or else be caught with their pants down by Intel next year.
It'll be interesting to find out how much of a response to Larrabee it was - since Larrabee's been rumoured/outlined for quite a while now.

Jawed
 
IIRC it's already exposed under DirectCompute.
Isn't it optional? Slide 32:

http://developer.amd.com/gpu_assets/Your Game Needs Direct3D 11, So Get Started Now.pps

And as pointed out some support will come to OpenCL sooner as well.
Michael Chu clarified:

To clarify about DPFP support in OpenCL:

It is not currently in the OpenCL beta release. We will start introducing various pieces of DPFP support in OpenCL over the next year. Some of the basic arithmetic functions will be introduced first, with the math function support likely taking longer. Since full DPFP support is an optional extension in the OpenCL spec, we are concentrating first on improving the performance and other aspects of the required spec.
Curious why DirectCompute's optional DP is getting much higher priority than OpenCL's optional DP.

Jawed
 
Well, sooner than you'd expect turned out to be late Q4. I think you should concede this, especially since Nvidia itself acknowledged it's late. Unless you know more (your AFAIK != Nvidia) about their internal timelines, you can't say otherwise.
I'm sorry, but what's with all the italics and underlines? I don't remember NV ever saying anything along the lines of "sooner than you'd expect", and I don't remember them acknowledging that it's late. They would want to have it right now, of course, but that doesn't mean it's late. A1, August. If that chip were A3 and had been done in spring then you would have a reason to say it's late. Right now it's on track. Whether that track is late in itself is another issue.

I think it was Fuad who said it would be GF100 based with disabled clusters.
So you're now trusting Fuad again? -)

sorry for being late,
but why do you think that 256 independent 32-bit wide load/store units would not be enough?
I don't know whether this number will be enough or not. I'm just saying that TMUs are necessary for compute too, not just for graphics.
I don't understand the stance that Fermi is good for compute and bad for graphics. Most of what's needed for compute is needed for graphics too. If anything, Fermi should be better for graphics than the previous-generation architecture.
 
I have a question.
I see that there is a lot of speculation on the net regarding what performance Fermi-based designs are going to have in games (relative to the old GT200 or the DX11 5870).

I guess the logical thing for Nvidia is to have only one Fermi design for both the Tesla market and the gaming market (cost/time/resource related issues).

The potential Tesla TAM, according to NV, in the next 18 months will be something like $1.2 billion.
The total revenue for NV over 18 months is close to $5 billion now (in the recent past it was $6 billion or more).

If Tesla accounted for less than 1.3% of NVIDIA's total revenue last quarter, and this is indicative of all the quarters, then Tesla revenue over those 18 months was something like $65 million.

So what i am asking is this:

Is it impossible for NV to have 2 designs, if NV thinks that the Tesla revenue will increase fivefold, for example?
($325 million, a little more than 25% of the potential Tesla TAM)
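
The arithmetic behind those figures, for anyone checking:

$$1.3\% \times \$5\,\text{B} \approx \$65\,\text{M}, \qquad 5 \times \$65\,\text{M} = \$325\,\text{M} \approx 27\%\ \text{of the}\ \$1.2\,\text{B TAM}$$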

Also, does anyone know if, and by what percentage, there is going to be a performance hit from the ECC implementation? (ECC helps the scientific sectors, but I don't think it matters in gaming; even if GDDR5 leads to errors at a higher rate than before, I suspect this isn't an issue for gaming applications.)

I am also worried about FP64 performance.
Why should the gaming part dedicate this much transistor space to FP64 performance?
Isn't it more logical for Nvidia to use that transistor space in a more efficient way for the gaming sector?
The certain thing, IMO, is that at least the DX11 value parts are not going to have these features and FP64 ratios.
 

Arstechnica said:
With Fermi acting as the tip of the wedge to drive NVIDIA up into HPC, and Tegra driving NVIDIA down into mobile computing products, it's worth asking where NVIDIA's volume discrete and integrated graphics products are headed. The short answer is that NVIDIA's prospects in these markets don't look so good.

Now tell me if that statement makes sense given the numbers below. Here I used relatively conservative clocks of 650/1300 for Fermi and the currently rumoured 128 TMUs. I counted only MAD flops; adjust as required if you consider the "missing MUL" useful.

[Attached image: comp.png — throughput comparison chart]
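
With those assumptions (and Fermi's full 16 SM x 32 ALU = 512-lane configuration), the raw rates work out to roughly:

$$512 \times 2 \times 1.3\,\text{GHz} \approx 1331\ \text{GFLOPS MAD}, \qquad 128 \times 650\,\text{MHz} = 83.2\ \text{Gtexels/s}$$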


Does that look like they've abandoned graphics, keeping in mind the lowball clock estimates?
 
Is it impossible for NV to have 2 designs, if NV thinks that the Tesla revenue will increase fivefold, for example?

That's far riskier and more expensive than their current approach. People hype the bigger die sizes but only Nvidia knows how much that hurts the bottom line in the end. Also much of this is financed by "cheap" dies like G92 and lower where they are very competitive. In the end the big dies on the high end may not be as big a deal as commonly thought and it's a far easier proposition for them to leverage that investment in multiple markets.
 
That's far riskier and more expensive than their current approach. People hype the bigger die sizes but only Nvidia knows how much that hurts the bottom line in the end. Also much of this is financed by "cheap" dies like G92 and lower where they are very competitive. In the end the big dies on the high end may not be as big a deal as commonly thought and it's a far easier proposition for them to leverage that investment in multiple markets.

Yes, I agree.

That's why I wrote:

"I guess the logical thing for Nvidia is to have only one Fermi design for both the Tesla market and the gaming market (cost/time/resource related issues)."

I just like surprises.
 