NVIDIA Fermi: Architecture discussion

FrameBuffer · Jan 5, 2010

XMAN26 said:
If they turn out to be true, I hope Charlie removes all his negitivity about the card from here and his site as he will have to eat major crow.

and if not we're to assume you'll do the same ??

Gipsel · Jan 5, 2010

CarstenS said:
For some reasons, Nvidia doesn't seem to want or to not be able to go the AMD way. Their chips tend to be quite big anyway and so making the basic units even bigger for coupling is maybe a worse choice wrt to their basic architecture than had they gone also for a 5-way VLIW.

Plus, it would go completely against their traditional strategy of first implemention for experiments, then making the feature usefull and only after that to go fully along that route.

But it would be quite a waste of resources. Remember that all units (in both subblocks) contain extended multiplier capabilities anyway. Why would they add that to a unit in the SP only subblock if it is wasted for DP, which needs an even larger multiplier (and the size of a multiplier scales roughly with the square of the number of bits to multiply)? They could have solved this by issuing 32bit integer multiplies only to the DP subblock. That would still be an enourmous increase in throughput and actually sensible as 32bit integer multiplies are not that important anyway (compare with the integer multiplier capabilities of CPUs).

Cypress on the other hand can't do 32bit integer multiplies in all ALUs (only in the t ALU, the other ones can do only 24bit, internally probably 27bits). That is the reason they need to combine 4 of those for multiplies (for adds 2 are enough).

And the last point, have you looked at the SMs visible in the Fermi die shot? The units look remarkably symmetric, don't you think? If there would be a DP and a SP subblock, some units would have a different size, isn't it?

Mize · Jan 5, 2010

Hey if those benches are right and nv's yields and die size as rumored it means an excellent new top-end part and 5870s for $299 so what's the matter with that?

mczak · Jan 5, 2010

trinibwoy said:
I don't think that's accurate. At least one patent highlights the ability of the dispatcher (running at base clock) to issue instructions to the SFU and ALU pipelines in alternate cycles, even in G80 class hardware. Each ALU instruction runs for 4 hot-clocks or 2 base-clocks which provides a window to do so. The SFU and ALU pipelines presumably have dedicated operand collectors to support this as well.

Yes, that's right for the older chips. Do we know if Fermi can still do this? ALU instructions now only run for 2 (hot) clocks, of course even if the chip can't do it it would still be able to co-issue at least one fp op from the other scheduler.
Thinking about this, I'm not sure though actually attribute interpolation rate will suffer from having fewer SFU units. The missing MUL is gone apparently, and maybe interpolation is gone from SFU as well, I see little reason why this couldn't be performed in the regular SPs. There are certainly enough adders and multipliers there to be able to do one scalar interpolation using both pipes per clock (AMD is doing it like that, 2 lanes combined for one scalar interpolation).

trinibwoy · Jan 5, 2010

mczak said:
Yes, that's right for the older chips. Do we know if Fermi can still do this? ALU instructions now only run for 2 (hot) clocks, of course even if the chip can't do it it would still be able to co-issue at least one fp op from the other scheduler.

I haven't seen any specific info about that but considering the schedulers are responsible for issuing instructions to the load/store units as well and they obviously aren't going to block waiting on those then it's fair to assume they don't block on the SFU either.

XMAN26 · Jan 5, 2010

FrameBuffer said:
and if not we're to assume you'll do the same ??

I dont run a site making posts about how bad fermi is or ATI is, charlie does. I haven't advertised that fermi is broken, doesn't work, underwhelms, underperforms or anything to the like. Nor about ATI cards, again, charlie has. I also said I hope it true, didn't say it was.

Gipsel · Jan 5, 2010

mczak said:
Yes, that's right for the older chips. Do we know if Fermi can still do this? ALU instructions now only run for 2 (hot) clocks

Right, the instructions are issued (they run much longer) over two clocks (one half warp per clock) now instead of a half warp every two (hot) clocks. But the same was true in a sense for G80/GT200 too. The scheduler only alternated between the ALUs and SFUs, now the schedulers can issues all types of instructions every two cycles.

digitalwanderer · Jan 5, 2010

A.L.M. said:
GF100's maximum load temperature is 55 C.

OMG, if that one is true I will eat my Razer hat!

Mize · Jan 5, 2010

digitalwanderer said:
OMG, if that one is true I will eat my Razer hat!

That's why Thermaltake has a special ducted-fan case for Fermi

Ailuros · Jan 5, 2010

Mize said:
Hey if those benches are right and nv's yields and die size as rumored it means an excellent new top-end part and 5870s for $299 so what's the matter with that?

http://www.pcgameshardware.com/aid,...DX-11-Update-Radeon-HD-5970-results/Practice/

Dirt2 performance; now if a GF100 can yield 148 fps in 1920*1200 with 4x Supersampling and 16xAF then of course pigs can fly

Silus · Jan 5, 2010

Ailuros said:
http://www.pcgameshardware.com/aid,...DX-11-Update-Radeon-HD-5970-results/Practice/

Dirt2 performance; now if a GF100 can yield 148 fps in 1920*1200 with 4x Supersampling and 16xAF then of course pigs can fly

Not saying it's true, but as we all discussed before, some loads favor one architecture over the other. In GRID for example, RV770s mopped the floor with GT200s and yet GT200 were faster in almost everything else. Why can't this be a similar case for GF100 vs RV870 (if true) ?

mczak · Jan 5, 2010

Gipsel said:
Right, the instructions are issued (they run much longer) over two clocks (one half warp per clock) now instead of a half warp every two (hot) clocks. But the same was true in a sense for G80/GT200 too. The scheduler only alternated between the ALUs and SFUs, now the schedulers can issues all types of instructions every two cycles.

oops, good catch there for "run for two cycles"...
I think I still don't quite get how the dual warp schedulers work.
The fermi whitepaper states that "Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs." (note the or) and "Most instructions can be dual issued; two integer instructions, two floating instructions, or a mix of integer, floating point, load, store, and SFU instructions can be issued concurrently".
And also (for SFU) "a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied."
That's confusing to me, the latter somehow seems to suggest that after an instruction is issued to the SFU it'll just run for eight clocks without any further dispatch needed but in turn it's not possible to issue anything else at the same time (from one scheduler).
Though following that logic it would be impossible to get both 16 alu blocks running at the same time as the load/stores too which as mentioned doesn't make much sense.
I'm missing something...

DavidGraham · Jan 5, 2010

Silus said:
Not saying it's true, but as we all discussed before, some loads favor one architecture over the other. In GRID for example, RV770s mopped the floor with GT200s and yet GT200 were faster in almost everything else. Why can't this be a similar case for GF100 vs RV870 (if true) ?

I don't know , but he corrected it later , saying that Dirts 2 was MSAA , but AVP3 was SSAA !!

trinibwoy · Jan 5, 2010

"The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied."

That pretty explicitly says that the dispatch unit can issue more instructions while the SFU is running. Why do you take it to mean the opposite? Are you taking "dispatch unit" and "scheduler" to be two different things because I believe they're one and the same.

Ailuros · Jan 5, 2010

Silus said:
Not saying it's true, but as we all discussed before, some loads favor one architecture over the other. In GRID for example, RV770s mopped the floor with GT200s and yet GT200 were faster in almost everything else. Why can't this be a similar case for GF100 vs RV870 (if true) ?

Have you opened the link? A 5970 is under the DX9 mode clearly CPU bound with 4xMSAA/16xAF according to PCGH measurements:

1280*1024 = 128,3 fps
1680*1050 = 121,1 fps
1920*1200 = 113,9 fps

Now I'll be generous and I'll grant you the fabulous 148 fps in a CPU bound scenario; how much more generous can I actually be and claim the very same for something with 4x sample Supersampling, especially since the 5970 scales quite well considering it's much higher texel fillrate compared to a 5870. Add to that that I'd expect a 5970 to lose a significant amount of performance if you'd switch to supersampling.

Yes of course could NV have used their own benchmark for the game, but then I'd have to wonder if it's a single car race case in an empty desert heh...

mczak · Jan 5, 2010

trinibwoy said:
"The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied."

That pretty explicitly says that the dispatch unit can issue more instructions while the SFU is running. Why do you take it to mean the opposite? Are you taking "dispatch unit" and "scheduler" to be two different things because I believe they're one and the same.

No that's not the reason. I thought the decoupling meant it could sort of self-schedule for handling the full warp in 8 clocks, but it still needed "normal" instruction issue first (hence only on the first (two) clocks nothing else could be issued). If that's fully decoupled from dispatch unit, that's kind of weird SFU instructions would have a completely separate scheduler? That also doesn't really fit into the other quotes about dual-issue.

trinibwoy · Jan 5, 2010

Are you sure you're not over-complicating it? "Decoupled" simply means that the schedulers can issue an instruction and then go about their business issuing other instructions without blocking on the completion of the first one. It doesn't imply that there's a dedicated SFU scheduler.

In terms of "self-scheduling", I'm not sure that's the right term. In past architectures the ALUs would process a warp over 2 base-clocks but the instruction itself is issued in only the first. It's the operands that are fed over 2 clocks, not instruction issue. Depending on operand collector bandwidth all the operands for a Fermi SFU instruction may be provided in the first clock as well or fed in over 2 or 4 clocks.

thatdude90210 · Jan 5, 2010

So we found JHH's drinking buddy. The guy got JHH drunk and spill his guts on Fermi and Tegra 2 orders.

Gipsel · Jan 5, 2010

mczak said:
oops, good catch there for "run for two cycles"...
I think I still don't quite get how the dual warp schedulers work.
The fermi whitepaper states that "Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs." (note the or) and "Most instructions can be dual issued; two integer instructions, two floating instructions, or a mix of integer, floating point, load, store, and SFU instructions can be issued concurrently".
And also (for SFU) "a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied."
That's confusing to me, the latter somehow seems to suggest that after an instruction is issued to the SFU it'll just run for eight clocks without any further dispatch needed but in turn it's not possible to issue anything else at the same time (from one scheduler).

What trinibwoy said.
Actually this is the typical nv speak which lets the people in the dark about the real pipeline length. If nv says an instructions "executes over x clocks" it means that the next instruction can be issued to the unit after x clocks. It isn't a latency figure at all. In fact nvidia almost always states only throughput figures in a somewhat weird and convoluted way.
So translated it means the SFUs have a throughput of 1 operation per cycle. As the instruction is issued for a whole warp (32 threads) and there are 4 SFUs, you arrive at a througput number of 8 cycles per instruction.

mczak · Jan 5, 2010

trinibwoy said:
Are you sure you're not over-complicating it? "Decoupled" simply means that the schedulers can issue an instruction and then go about their business issuing other instructions without blocking on the completion of the first one. It doesn't imply that there's a dedicated SFU scheduler.

Yes ok but then it's unrelated to dual issue.

In terms of "self-scheduling", I'm not sure that's the right term. In past architectures the ALUs would process a warp over 2 base-clocks but the instruction itself is issued in only the first. It's the operands that are fed over 2 clocks, not instruction issue. Depending on operand collector bandwidth all the operands for a Fermi SFU instruction may be provided in the first clock as well or fed in over 2 or 4 clocks.

Ah ok I thought that's how it handled SFU issue - first clock for alu, second for sfu.
So how do you keep all units (2 alu blocks, ld/st, sfu) busy in fermi? I can't see how this should work given the wording in the fermi whitepaper. Or is the dual-issue per scheduler? Doesn't really fit with that wording neither.

NVIDIA Fermi: Architecture discussion

FrameBuffer

Gipsel

Mize

3dfx Fan

mczak

trinibwoy

Meh

XMAN26

Gipsel

digitalwanderer

Mize

3dfx Fan

Ailuros

Epsilon plus three

Silus

mczak

DavidGraham

trinibwoy

Meh

Ailuros

Epsilon plus three

mczak

trinibwoy

Meh

thatdude90210

Gipsel

mczak

Similar threads