NVIDIA Fermi: Architecture discussion

> So how do you keep all units (2 alu blocks, ld/st, sfu) busy in fermi? I can't see how this should work given the wording in the fermi whitepaper.
To tell the truth, you don't :D
You can issue instructions to two functional blocks per (scheduler) cycle (only one if you have DP instructions). So if you have an instruction for the L/S or the SFU pipe, one of the ALU blocks doesn't get one.
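
As a rough illustration, here's a toy CUDA kernel mixing the three pipes (the kernel and its unroll count are made up for illustration, not from the whitepaper): every __sinf() or memory access that gets issued occupies a slot that can't go to one of the two ALU blocks that cycle.

```cuda
// Toy kernel: each instruction class below is issued to a different pipe
// on a Fermi-style SM, so they compete for the scheduler's issue slots.
__global__ void mix_alu_sfu(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];               // LD/ST pipe (long latency)
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        acc = fmaf(acc, x, 1.0f);  // ALU pipe (CUDA cores)
    acc += __sinf(x);              // SFU pipe: steals an issue slot
    out[i] = acc;                  // LD/ST pipe again
}
```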
 
Yes ok but then it's unrelated to dual issue.

Dual-issue is just Nvidia's marketing term for issuing a maximum of two instructions per clock. Don't think there's more to it than that.

Ah ok, I thought that's how it handled SFU issue - first clock for the alu, second for the sfu.
So how do you keep all units (2 alu blocks, ld/st, sfu) busy in fermi? I can't see how this should work given the wording in the fermi whitepaper. Or is the dual-issue per scheduler? That doesn't really fit with that wording either.

Well you can't issue to all units in the same clock but you can certainly have all of them busy at a given point in time. The SFU and LD/ST units have much higher latency than the ALU pipes so presumably there will be situations where they are working away while new instructions are fed to the ALUs.
 
> Well you can't issue to all units in the same clock but you can certainly have all of them busy at a given point in time. The SFU and LD/ST units have much higher latency than the ALU pipes so presumably there will be situations where they are working away while new instructions are fed to the ALUs.
Yes, but in terms of throughput you won't be able to keep all units busy all the time then. It would be impossible to achieve peak FP throughput unless you never load/store any values. Yet nvidia claims the design makes it easy to achieve close to full utilization.
 
Sounds like GT3XXM is actually GT2XXM + the ability to somehow interface with PM55 and the Arrandale IGP.

That might explain the Fudzilla optimism over nVidia's design wins in the market, but again I think that doesn't cover the whole story either.


p/s: More aggressive clock gating I do get, but power gating, and on the Fermi flagship of all chips?
 
> Yes, but in terms of throughput you won't be able to keep all units busy all the time then. It would be impossible to achieve peak FP throughput unless you never load/store any values. Yet nvidia claims the design makes it easy to achieve close to full utilization.

Sure, every SFU or L/S instruction issued potentially results in a 2 cycle bubble in one of the ALU pipelines but it's a self-healing sorta problem. If your kernel is ALU bound then you issue to the other units less and have fewer bubbles in the ALU pipeline. If it's not ALU bound then you have more bubbles but it doesn't matter anyway since that's not your bottleneck.
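
To put a rough number on those bubbles (a back-of-the-envelope bound, assuming every issue slot spent on the LD/ST or SFU pipes is one not spent on an ALU block):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
If a fraction $f$ of a kernel's dynamic instructions go to the LD/ST or
SFU pipes, those issue slots cannot feed the ALU blocks, so
\[
  \text{ALU utilization} \le 1 - f .
\]
For example, one load per ten instructions ($f = 0.1$) already caps FP
throughput at $90\%$ of peak, which is the ``can't reach peak FP unless
you never load/store'' point made above.
\end{document}
```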
 
If TSMC offered power gating for such high currents (even mobile GPUs have very large currents, relatively speaking) I think they would be a bit more vocal about it ... it's not really something NVIDIA could develop.
 
> Sounds like GT3XXM is actually GT2XXM + the ability to somehow interface with PM55 and the Arrandale IGP.
>
> That might explain the Fudzilla optimism over nVidia's design wins in the market, but again I think that doesn't cover the whole story either.
>
> p/s: More aggressive clock gating I do get, but power gating, and on the Fermi flagship of all chips?

There's no indication whatsoever that they've changed the chips even one bit as far as I'm aware, and aren't pretty much all mobile chips from both manufacturers now supporting some sort of switchable graphics with Intel IGPs?
 
> Sure, every SFU or L/S instruction issued potentially results in a 2 cycle bubble in one of the ALU pipelines but it's a self-healing sorta problem. If your kernel is ALU bound then you issue to the other units less and have fewer bubbles in the ALU pipeline. If it's not ALU bound then you have more bubbles but it doesn't matter anyway since that's not your bottleneck.
Right. Though for that perfect code which uses just the right mix of alu and other instructions to keep all execution units busy, it would still incur a small performance hit :).
I just wondered how that works for AMD - loads/stores will end up in different clauses than alu instructions, so in theory it looks like it would be possible to keep the alus busy all the time and still do loads/stores (since different clauses can run in parallel at the same time). I doubt it'll really work out that way in practice, though. Of course the special functions have no problem getting issued at the same time as other alu instructions from the hardware point of view on AMD, but we all know it's not easy for the driver to always fill those slots...
Anyway, so dual issue isn't actually an improvement over the pseudo dual issue of previous chips. In fact it's a step backwards in terms of being able to achieve peak throughput in some ways - sure, previous chips didn't have true dual issue, but they didn't have to feed two alu blocks either (so their limited dual issue was able to keep both the SFUs and the alus busy without bubbles, I think - dunno about loads/stores). It probably makes sense for perf/area though, and they needed some changes for DP, plus there are now more warps to choose from to keep those alus busy.
 
> There's no indication whatsoever that they've changed the chips even one bit as far as I'm aware, and aren't pretty much all mobile chips from both manufacturers now supporting some sort of switchable graphics with Intel IGPs?

1. Software block + price discrimination ;)
2. Yes they do, but is it seamless? Implementations range from downright horrible (restart) to mildly tolerable (logoff/logon or "wait").
 
> Anyway, so dual issue isn't actually an improvement over the pseudo dual issue of previous chips. In fact it's a step backwards in terms of being able to achieve peak throughput in some ways - sure, previous chips didn't have true dual issue ...
I think it will be a step forward in the sense that it will work more often than the dual issue of G80/GT200. The "missing" in the term "missing MUL" came from the fact that it didn't work so well there, you know? ;)
And nvidia added those MULs to their peak performance figures. It should be easier for Fermi to approach the advertised throughput than it was for the previous generations.

From a general point of view the design got cleaner and therefore a bit simpler (it should save some transistors compared to a GT200-like design with 4 SMs in a TPC) while not sacrificing (much) performance. A good choice in my opinion. Only SFU-limited code will probably suffer.
 
> I think it will be a step forward in the sense that it will work more often than the dual issue of G80/GT200. The "missing" in the term "missing MUL" came from the fact that it didn't work so well there, you know? ;)
Yeah, true, though the missing mul was easily found in GT200 :).

> And nvidia added those MULs to their peak performance figures. It should be easier for Fermi to approach the advertised throughput than it was for the previous generations.
Yes, for normal ops only this shouldn't be an issue (it wasn't an issue in G80 either, if you disregard the missing mul).

> From a general point of view the design got cleaner and therefore a bit simpler (it should save some transistors compared to a GT200-like design with 4 SMs in a TPC) while not sacrificing (much) performance. A good choice in my opinion. Only SFU-limited code will probably suffer.
Surely you meant 3 SMs in a TPC for GT200. You're right that it should be more efficient, certainly in terms of transistor count - a TPC in GT200 had 3 (pseudo dual issue) schedulers whereas there's now only 1 (true dual issue) - not to mention those 3 schedulers in GT200 were only for 24 cores, not 32. So even if core utilization didn't go up at all, or even went slightly down in some cases, it should still be a win presumably. We'd need to know the die area of a TPC/SM though to know for sure.
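
Putting rough numbers on that comparison (core and scheduler counts as stated in the posts above; the per-scheduler framing is mine):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Cores served per scheduler, using the counts from the thread:
\[
  \underbrace{\tfrac{24}{3} = 8}_{\text{GT200 TPC}}
  \qquad \text{vs.} \qquad
  \underbrace{\tfrac{32}{1} = 32}_{\text{Fermi SM}}
\]
Each Fermi scheduler covers four times the cores, which is where the
transistor saving would come from even at equal core utilization.
\end{document}
```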
 
Now I'm pretty sure some of you have clicked the link in my sig (not too many of you, though), but here is a related newsblurb: http://www.slashgear.com/nvidia-optimus-teases-frugal-discrete-notebook-graphics-0567741/

Who is betting on lower idle power for GF100 over HD5870?

This TechReport article is based on an nvidia blog posting on "Optimus", quoting the blog post directly:
> As we approach CES we wanted to tell you about an upcoming mobile technology that we will be introducing in Q1. It is called NVIDIA Optimus technology. NVIDIA Optimus technology works on notebook platforms with NVIDIA GPUs. It is unique to NVIDIA. It is seamless and transparent to the user. Its purpose is to optimize the mobile experience by letting the user get the performance of discrete graphics from a notebook while still delivering great battery life. Look for more details next month.

It will be interesting to see if they can get the switching working nicely with the onboard intel graphics; at the moment it is pretty clumsy (physical switch or reboot at least).

It also appears in some driver docs on Nvidia's (Russian?) website:
http://ru.download.nvidia.com/Window...tart_Guide.pdf
Page 15 mentions "Nvidia Optimus Hybrid technology" under some option that looks like it downloads game profiles or general driver updates.

As an aside, if nvidia keeps up this facebook/twitter/blog posting as a cheap substitute for fully catered press junkets, respectable hard-working journalists are soon going to be starving in the street :cry:
 
> This TechReport article is based on an nvidia blog posting on "Optimus", quoting the blog post directly: [...]
>
> It will be interesting to see if they can get the switching working nicely with the onboard intel graphics; at the moment it is pretty clumsy (physical switch or reboot at least).
>
> It also appears in some driver docs on Nvidia's (Russian?) website:
> http://ru.download.nvidia.com/Window...tart_Guide.pdf
> Page 15 mentions "Nvidia Optimus Hybrid technology" under some option that looks like it downloads game profiles or general driver updates.
>
> As an aside, if nvidia keeps up this facebook/twitter/blog posting as a cheap substitute for fully catered press junkets, respectable hard-working journalists are soon going to be starving in the street :cry:


Sounds like a bunch of hooha to drum up business for their failing chipset business if you ask me... like a cross between switchable graphics (already in place) and their older hybrid graphics that was put out to pasture a while back.
 
> It will be interesting to see if they can get the switching working nicely with the onboard intel graphics; at the moment it is pretty clumsy (physical switch or reboot at least).
Why don't they just let the intel chipset handle all the display tasks but intercept 3D rendering calls so frames are rendered on the discrete GPU? Then you could just blit the framebuffer into the integrated graphics chip's memory when doing 3D, and turn the discrete chip off without a care in the world when not doing 3D.
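
A minimal sketch of that idea, assuming the IGP's scanout surface is reachable as ordinary host-visible memory (the function and buffer names are made up for illustration; a real driver would use its own internal interfaces):

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Hypothetical per-frame present step: the discrete GPU renders into
// d_frame, then the finished frame is copied over PCIe into a surface
// the integrated GPU scans out from. With no 3D app running, no frames
// are produced and the discrete GPU could be powered off entirely.
void present_to_igp(const uint32_t *d_frame,   // frame in dGPU memory
                    uint32_t *igp_surface,     // host-mapped IGP framebuffer
                    size_t frame_bytes)
{
    // One bus transfer per frame: at 1920x1200, 32 bpp, 60 Hz this is
    // about 0.55 GB/s, comfortably within PCIe bandwidth.
    cudaMemcpy(igp_surface, d_frame, frame_bytes, cudaMemcpyDeviceToHost);
}
```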
 
Reminds me of the hacky drivers used back in the day that let 3dfx Voodoo 1/2 cards do windowed mode in Glide. They would create a DDraw surface on the 2D card and render on the Voodoo card with display passthrough enabled. They'd lock the framebuffer on the Voodoo and the DDraw surface and copy it over. It wasn't 'fast'; transferring that much data over the PCI bus with no DMA was obviously going to be slow.
 