NVIDIA Fermi: Architecture discussion

> So how do you keep all units (2 alu blocks, ld/st, sfu) busy in fermi? I can't see how this should work given the wording in the fermi whitepaper.
To tell the truth, you don't :D
You can issue instructions to two functional blocks per (scheduler) cycle (only one if you have DP instructions). So if you have an instruction for the L/S or the SFU pipe, one of the ALU blocks doesn't get one.
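
As a rough illustration, here's a toy CUDA kernel mixing the three pipes (the kernel and its unroll count are made up for illustration, not from the whitepaper): every __sinf() or memory access that gets issued occupies a slot that can't go to one of the two ALU blocks that cycle.

```cuda
// Toy kernel: each instruction class below is issued to a different pipe
// on a Fermi-style SM, so they compete for the scheduler's issue slots.
__global__ void mix_alu_sfu(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];               // LD/ST pipe (long latency)
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        acc = fmaf(acc, x, 1.0f);  // ALU pipe (CUDA cores)
    acc += __sinf(x);              // SFU pipe: steals an issue slot
    out[i] = acc;                  // LD/ST pipe again
}
```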
 
Yes ok but then it's unrelated to dual issue.

Dual-issue is just Nvidia's marketing term for issuing a maximum of two instructions per clock. Don't think there's more to it than that.

Ah ok, I thought that's how it handled SFU issue - first clock for the alu, second for the sfu.
So how do you keep all units (2 alu blocks, ld/st, sfu) busy in fermi? I can't see how this should work given the wording in the fermi whitepaper. Or is the dual-issue per scheduler? That doesn't really fit with that wording either.

Well you can't issue to all units in the same clock but you can certainly have all of them busy at a given point in time. The SFU and LD/ST units have much higher latency than the ALU pipes so presumably there will be situations where they are working away while new instructions are fed to the ALUs.
 
> Well you can't issue to all units in the same clock but you can certainly have all of them busy at a given point in time. The SFU and LD/ST units have much higher latency than the ALU pipes so presumably there will be situations where they are working away while new instructions are fed to the ALUs.
Yes, but in terms of throughput you won't be able to keep all units busy all the time then. It would be impossible to achieve peak FP throughput unless you never load/store any values. Yet nvidia claims the design makes it easy to achieve close to full utilization.
 
Sounds like GT3XXM is actually GT2XXM + the ability to somehow interface with PM55 and the Arrandale IGP.

That might explain the Fudzilla optimism over nVidia's design wins in the market, but again I think that doesn't cover the whole story either.


p/s: More aggressive clock gating I do get, but power gating, and on the Fermi flagship of all chips?
 
> Yes, but in terms of throughput you won't be able to keep all units busy all the time then. It would be impossible to achieve peak FP throughput unless you never load/store any values. Yet nvidia claims the design makes it easy to achieve close to full utilization.

Sure, every SFU or L/S instruction issued potentially results in a 2 cycle bubble in one of the ALU pipelines but it's a self-healing sorta problem. If your kernel is ALU bound then you issue to the other units less and have fewer bubbles in the ALU pipeline. If it's not ALU bound then you have more bubbles but it doesn't matter anyway since that's not your bottleneck.
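
To put a rough number on those bubbles (a back-of-the-envelope bound, assuming every issue slot spent on the LD/ST or SFU pipes is one not spent on an ALU block):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
If a fraction $f$ of a kernel's dynamic instructions go to the LD/ST or
SFU pipes, those issue slots cannot feed the ALU blocks, so
\[
  \text{ALU utilization} \le 1 - f .
\]
For example, one load per ten instructions ($f = 0.1$) already caps FP
throughput at $90\%$ of peak, which is the ``can't reach peak FP unless
you never load/store'' point made above.
\end{document}
```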
 
If TSMC offered power gating for such high currents (even mobile GPUs have very large currents, relatively speaking) I think they would be a bit more vocal about it ... it's not really something NVIDIA could develop.
 
> Sounds like GT3XXM is actually GT2XXM + the ability to somehow interface with PM55 and the Arrandale IGP.
>
> That might explain the Fudzilla optimism over nVidia's design wins in the market, but again I think that doesn't cover the whole story either.
>
> p/s: More aggressive clock gating I do get, but power gating, and on the Fermi flagship of all chips?

There's no indication whatsoever that they've changed the chips even one bit as far as I'm aware, and aren't pretty much all mobile chips from both manufacturers now supporting some sort of switchable graphics with Intel IGPs?
 
> Sure, every SFU or L/S instruction issued potentially results in a 2 cycle bubble in one of the ALU pipelines but it's a self-healing sorta problem. If your kernel is ALU bound then you issue to the other units less and have fewer bubbles in the ALU pipeline. If it's not ALU bound then you have more bubbles but it doesn't matter anyway since that's not your bottleneck.
Right. Though for that perfect code which uses just the right mix of alu and other instructions to keep all execution units busy, it would still incur a small performance hit :).
I just wondered how that works for AMD - loads/stores will end up in different clauses than alu instructions, so in theory it looks like it would be possible to keep the alus busy all the time and still do loads/stores (since different clauses can run in parallel at the same time). I doubt it'll really work out that way in practice, though. Of course the special functions have no problem getting issued at the same time as other alu instructions from the hardware point of view on AMD, but we all know it's not easy for the driver to always fill those slots...
Anyway, so dual issue isn't actually an improvement over the pseudo dual issue of previous chips. In fact it's a step backwards in terms of being able to achieve peak throughput in some ways - sure, previous chips didn't have true dual issue, but they didn't have to feed two alu blocks either (so their limited dual issue was able to keep both the SFUs and the alus busy without bubbles, I think - dunno about loads/stores). It probably makes sense for perf/area though, and they needed some changes for DP, plus there are now more warps to choose from to keep those alus busy.
 
> There's no indication whatsoever that they've changed the chips even one bit as far as I'm aware, and aren't pretty much all mobile chips from both manufacturers now supporting some sort of switchable graphics with Intel IGPs?

1. Software block + price discrimination ;)
2. Yes they do, but is it seamless? Implementations range from downright horrible (restart) to mildly tolerable (logoff/logon or "wait").
 
> Anyway, so dual issue isn't actually an improvement over the pseudo dual issue of previous chips. In fact it's a step backwards in terms of being able to achieve peak throughput in some ways - sure, previous chips didn't have true dual issue ...
I think it will be a step forward in the sense that it will work more often than the dual issue of G80/GT200. The "missing" in the term "missing MUL" came from the fact that it didn't work so well there, you know? ;)
And nvidia added those MULs to their peak performance figures. It should be easier for Fermi to approach the advertised throughput than it was for the previous generations.

From a general point of view the design got cleaner and therefore a bit simpler (it should save some transistors compared to a GT200-like design with 4 SMs in a TPC) while not sacrificing (much) performance. A good choice in my opinion. Only SFU-limited code will probably suffer.
 
> I think it will be a step forward in the sense that it will work more often than the dual issue of G80/GT200. The "missing" in the term "missing MUL" came from the fact that it didn't work so well there, you know? ;)
Yeah, true, though the missing mul was easily found in GT200 :).

> And nvidia added those MULs to their peak performance figures. It should be easier for Fermi to approach the advertised throughput than it was for the previous generations.
Yes, for normal ops only this shouldn't be an issue (it wasn't an issue in G80 either, if you disregard the missing mul).

> From a general point of view the design got cleaner and therefore a bit simpler (it should save some transistors compared to a GT200-like design with 4 SMs in a TPC) while not sacrificing (much) performance. A good choice in my opinion. Only SFU-limited code will probably suffer.
Surely you meant 3 SMs in a TPC for GT200. You're right that it should be more efficient, certainly in terms of transistor count - a TPC in GT200 had 3 (pseudo dual issue) schedulers whereas there's now only 1 (true dual issue) - not to mention those 3 schedulers in GT200 were only for 24 cores, not 32. So even if core utilization didn't go up at all, or even went slightly down in some cases, it should still be a win presumably. We'd need to know the die area of a TPC/SM though to know for sure.
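
Putting rough numbers on that comparison (core and scheduler counts as stated in the posts above; the per-scheduler framing is mine):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Cores served per scheduler, using the counts from the thread:
\[
  \underbrace{\tfrac{24}{3} = 8}_{\text{GT200 TPC}}
  \qquad \text{vs.} \qquad
  \underbrace{\tfrac{32}{1} = 32}_{\text{Fermi SM}}
\]
Each Fermi scheduler covers four times the cores, which is where the
transistor saving would come from even at equal core utilization.
\end{document}
```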
 
Now I'm pretty sure some of you have clicked the link in my sig (not too many of you, though), but here is a related newsblurb: http://www.slashgear.com/nvidia-optimus-teases-frugal-discrete-notebook-graphics-0567741/

Who is betting on lower idle power for GF100 over HD5870?

This TechReport article is based on an nvidia blog posting on "Optimus", quoting the blog post directly:
> As we approach CES we wanted to tell you about an upcoming mobile technology that we will be introducing in Q1. It is called NVIDIA Optimus technology. NVIDIA Optimus technology works on notebook platforms with NVIDIA GPUs. It is unique to NVIDIA. It is seamless and transparent to the user. Its purpose is to optimize the mobile experience by letting the user get the performance of discrete graphics from a notebook while still delivering great battery life. Look for more details next month.

It will be interesting to see if they can get the switching working nicely with the onboard intel graphics; at the moment it is pretty clumsy (physical switch or reboot at least).

It also appears in some driver docs on Nvidia's (Russian?) website:
http://ru.download.nvidia.com/Window...tart_Guide.pdf
Page 15 mentions "Nvidia Optimus Hybrid technology" under some option that looks like it downloads game profiles or general driver updates.

As an aside, if nvidia keeps up this facebook/twitter/blog posting as a cheap substitute for fully catered press junkets, respectable hard-working journalists are soon going to be starving in the street :cry:
 
> This TechReport article is based on an nvidia blog posting on "Optimus", quoting the blog post directly: [...]
>
> It will be interesting to see if they can get the switching working nicely with the onboard intel graphics; at the moment it is pretty clumsy (physical switch or reboot at least).
>
> It also appears in some driver docs on Nvidia's (Russian?) website:
> http://ru.download.nvidia.com/Window...tart_Guide.pdf
> Page 15 mentions "Nvidia Optimus Hybrid technology" under some option that looks like it downloads game profiles or general driver updates.
>
> As an aside, if nvidia keeps up this facebook/twitter/blog posting as a cheap substitute for fully catered press junkets, respectable hard-working journalists are soon going to be starving in the street :cry:


Sounds like a bunch of hooha to drum up business for their failing chipset business if you ask me... like a cross between switchable graphics (already in place) and their older hybrid graphics that was put out to pasture a while back.
 
> It will be interesting to see if they can get the switching working nicely with the onboard intel graphics; at the moment it is pretty clumsy (physical switch or reboot at least).
Why don't they just let the intel chipset handle all the display tasks but intercept 3D rendering calls so frames are rendered on the discrete GPU? Then you could just blit the framebuffer into the integrated graphics chip's memory when doing 3D, and turn the discrete chip off without a care in the world when not doing 3D.
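
A minimal sketch of that idea, assuming the IGP's scanout surface is reachable as ordinary host-visible memory (the function and buffer names are made up for illustration; a real driver would use its own internal interfaces):

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Hypothetical per-frame present step: the discrete GPU renders into
// d_frame, then the finished frame is copied over PCIe into a surface
// the integrated GPU scans out from. With no 3D app running, no frames
// are produced and the discrete GPU could be powered off entirely.
void present_to_igp(const uint32_t *d_frame,   // frame in dGPU memory
                    uint32_t *igp_surface,     // host-mapped IGP framebuffer
                    size_t frame_bytes)
{
    // One bus transfer per frame: at 1920x1200, 32 bpp, 60 Hz this is
    // about 0.55 GB/s, comfortably within PCIe bandwidth.
    cudaMemcpy(igp_surface, d_frame, frame_bytes, cudaMemcpyDeviceToHost);
}
```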
 
Reminds me of the hacky drivers used back in the day that let 3dfx Voodoo 1/2 cards do windowed mode in Glide. They would create a DDraw surface on the 2D card and render on the Voodoo card with display passthrough enabled. They'd lock the framebuffer on the Voodoo and the DDraw surface and copy it over. It wasn't 'fast'; transferring that much data over the PCI bus with no DMA was obviously going to be slow.
 