NVIDIA GF100 & Friends speculation

Architecturally, no - compute is required for D3D. In terms of execution, though, NVidia clearly has serious problems.

NVidia's old architecture has a severe setup bottleneck in theory - though one that was never seen in games in practice. NVidia seemingly had no choice but the kind of significant architectural change we see, in order to implement decent tessellation performance. Do we believe the engineers who say that the distributed setup system was a ball breaker?...

There's very little about compute in Fermi that's beyond D3D spec: virtual function support is required, so we're looking at a few percent spent on double-precision and ECC as the CUDA 3.0 tax.

NVidia even attempted to pre-empt 40nm by moving "early", originally planning to release 40nm chips in autumn 2008 before AMD. But NVidia's execution is generally so bad (see the string of GPUs before this) that that came to naught. Charlie's argument that the architecture (G80-derived, essentially) is unmanufacturable appears to hold some water, because G80 is the only large chip that appeared in a timely fashion.

GDDR5 might have been the straw that broke the camel's back: leaving implementation until the 40nm chips seems like a mistake, but the execution quagmire drags that whole thing down anyway.

Jawed


That doesn't explain the larger chips (and their 2 highest-end cards) from AMD being very hard to get. The HD5870 situation seems to be somewhat better now, but I can't find the HD5970 anywhere on the major internet shops.
 
I find this

http://forum.beyond3d.com/showpost.php?p=1396322&postcount=5533

to be a very good explanation for that.


For the HD5970, yes, but still, there are limited quantities of the HD5870: every single HD5870 on Newegg is limited to 1 card per customer (looks like one allows 2). And early last month it was very hard to get. The only thing that seems likely is that TSMC is having a lot of problems - unless AMD is artificially keeping the supply low, and I don't see why they would, since they have free market share at this point.
 
For the HD5970, yes, but still, there are limited quantities of the HD5870: every single HD5870 on Newegg is limited to 1 card per customer (looks like one allows 2). And early last month it was very hard to get. The only thing that seems likely is that TSMC is having a lot of problems - unless AMD is artificially keeping the supply low, and I don't see why they would, since they have free market share at this point.

Going down the list from newegg:

1
10
unlimited
4
4
4
10
unlimited
unlimited
10
99
99

So yeah, all but one are limited to 4 or more per customer. Amazon has plenty in stock too, as do ZZF, Tiger, CompUSA, etc. Availability of the 5870 is no longer an issue.
 
For the HD5970, yes, but still, there are limited quantities of the HD5870: every single HD5870 on Newegg is limited to 1 card per customer (looks like one allows 2). And early last month it was very hard to get. The only thing that seems likely is that TSMC is having a lot of problems - unless AMD is artificially keeping the supply low, and I don't see why they would, since they have free market share at this point.

What?

Sorted by "lowest price" criterion:

— VisionTek 900298 Radeon HD 5870 (limit 1 per customer)
— XFX HD-587A-ZNF9 Radeon HD 5870 (limit 10 per customer)
— ASUS EAH5870/2DIS/1GD5/A Radeon HD 5870 (limit 10 per customer)
— XFX HD-587X-ZNFC Radeon HD 5870 (apparently, no limit)
— several SKUs at 4 to 10 per customer
— SAPPHIRE Vapor-X 100281VX-2SR Radeon HD 5870 (limit 99 per customer)

http://www.newegg.com/Product/Produ...&bop=And&ActiveSearchResult=False&Order=PRICE

Edit: looks like Aaron was faster...
 
If NV could do 1 tri/clk pre-Fermi, then why is it any bigger a bottleneck than for AMD, which also does 1 tri/clk and doesn't claim any significant benefit from improving setup rates?
NVidia's core clocks are notably lower and there's even some question over whether NVidia actually does 1 triangle per clock :???:

I am curious about the kind of virtual function support required for D3D11. I was under the impression that the support needed was a little bit below the full generality needed by C++.
I don't understand the gory detail.

Could it be that designing reticle-sized chips for a cutting-edge process is what is causing extra problems, not just normal execution ones? Similar doubts have been expressed about LRB before.

http://forum.beyond3d.com/showpost.php?p=1395490&postcount=5517
Hence "quagmire".

Didn't they have working GDDR5 chips in the form of the GT21x on 40nm? So shouldn't that have relieved the Fermi design effort? Or are you implying that the delay in GT21x rippled over to Fermi?
This is why I specifically qualified my comment with "40nm chips". Fermi and GT215 (the only chip with GDDR5, though GT214 and GT212 presumably were intended to have GDDR5 - but note that all these chips seem to have been scheduled after GT218 and GT216) are NVidia's first two chips with GDDR5. Remember the G9x GPU that was referenced on an Nvidia engineer's LinkedIn profile as a candidate for GDDR5 testing?...

It seems that GDDR5 in GT215 has low clocks which might indicate implementation problems. But with only 8 ROPs there's little justification for significantly more and we now know that GDDR5 physical interfaces vary "substantially" in area depending on target clocks (don't know what that amounts to, though).

Jawed
 
NVidia's core clocks are notably lower and there's even some question over whether NVidia actually does 1 triangle per clock :???:
I think it's pretty clear that nVidia is likely to do much more than one triangle per clock. The only way they would end up doing less than that would be if they completely fucked up the architecture. And given that they are showing dramatic performance improvements in geometry-limited situations, it seems pretty clear that this isn't the case.
 
There's very little about compute in Fermi that's beyond D3D spec: virtual function support is required, so we're looking at a few percent spent on double-precision and ECC as the CUDA 3.0 tax.
There's the full IEEE compliance, which would add some bulk where none would be otherwise.

Then there's the exception handling.
I think Nvidia's claim was precise exceptions, which is something that can bloat a pipeline significantly. After paying the price for the ability to catch exceptions and save contexts for the handling thereof, what virtual function support Fermi has would be mostly paid for.

The shader cores are closer to being fully featured compute engines, but looking at the area that is not part of a SIMD, it seems that it has hurt compute density.

On the other hand, if Charlie's estimates of Fermi's power consumption are accurate, Fermi's FLOP count would be maxed out anyway. On yet another hand, a dumber Fermi derivative would be smaller and perform close to the same or better for graphics.
 
There's the full IEEE compliance, which would add some bulk where none would be otherwise.
ATI has the same. In that case part of it appears to be a side-effect of having double-precision capability - though NVidia's DP capability is more involved (as it likely has a dedicated path for the sub-normal adder).

Then there's the exception handling.
Which is at least partially in ATI too. Though I haven't taken a close look at the flags and performance.

I think Nvidia's claim was precise exceptions, which is something that can bloat a pipeline significantly. After paying the price for the ability to catch exceptions and save contexts for the handling thereof, what virtual function support Fermi has would be mostly paid for.
It's good design and makes it hard to point fingers.

The shader cores are closer to being fully featured compute engines, but looking at the area that is not part of a SIMD, it seems that it has hurt compute density.
L1/L2 are significant changes in Fermi. The way these have been knitted into the entire operation of the GPU, not simply tacked-on, would count for some growth. I expect GF100 to benefit strongly from the cache architecture. At minimum compute should be happy and it may turn out to be a key part of achieving high tessellation performance (keeping as much tessellation data on-die as possible - I still don't have a good idea how much tessellation data ATI will be shoving off die). Though some tessellation data needs to go off die (e.g. for multi-pass tessellation), so there's a question of the efficiency of doing that, something the caches might be a significant factor in (see previous discussions on ATI stream out architecture and why GS on ATI is not architecturally limited to 1024 DWords per primitive).

On the other hand, if Charlie's estimates of Fermi's power consumption are accurate, Fermi's FLOP count would be maxed out anyway. On yet another hand, a dumber Fermi derivative would be smaller and perform close to the same or better for graphics.
NVidia's promising greater ALU efficiency. The wider SIMDs in Fermi also imply less die-overhead in ALU control hardware.

The power consumption, if really that bad, seems to be a runaway 40nm process problem. We've only got the comments on RV740 as a guide. It appears NVidia needs B1 to solve those problems, provided it doesn't hit reticle limits... 40nm maturity will help, too (newest RV770s are cooler and consume less power than initial RV770s).

Jawed
 
There's the full IEEE compliance, which would add some bulk where none would be otherwise.

Then there's the exception handling.
I think Nvidia's claim was precise exceptions, which is something that can bloat a pipeline significantly. After paying the price for the ability to catch exceptions and save contexts for the handling thereof, what virtual function support Fermi has would be mostly paid for.

If the exception handling is there, then wouldn't it be better to handle the IEEE compliance (I mean the denormal stuff) in software, like CPUs do? ATM, I can't recall whether or not Fermi has full-speed denormal handling.

At any rate, I can't see any point whatsoever in full-speed denormal handling in hw. DP denormals are O(10^-317). Hell, even string theorists don't handle numbers that small. And with an SP : DP ratio of 2:1, there is not much point in using SP for precision-sensitive calculations; being lazy and just using DP throughout is probably how 99% of people will use it. And graphics people couldn't care less about SP denormals either.

Apart from PR, I am sorry, but I just don't see the point of handling denormals in hw when you can throw per-vector lane exceptions.
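For reference, a quick way to see the magnitudes being argued about here is just to print the limits from <float.h> - a trivial C snippet I knocked up (FLT_TRUE_MIN/DBL_TRUE_MIN are the C11 names for the smallest denormals, so it needs a reasonably recent compiler):

Code:
#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("SP smallest normal   : %g\n", FLT_MIN);       /* ~1.18e-38  */
    printf("SP smallest denormal : %g\n", FLT_TRUE_MIN);  /* ~1.4e-45   */
    printf("DP smallest normal   : %g\n", DBL_MIN);       /* ~2.23e-308 */
    printf("DP smallest denormal : %g\n", DBL_TRUE_MIN);  /* ~4.94e-324 */
    return 0;
}

So SP gradual underflow only matters between roughly 1e-38 and 1e-45, and DP between roughly 1e-308 and 1e-324.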
 
Hell, even string theorists don't handle numbers that small.
Well, string theorists don't typically handle numbers at all, so...

Anyway, the nice thing about denormals is with comparisons to zero. For example, if I have some sort of mask I want to apply to an image, which multiplies parts of it by zero, and I want to go back later and ask which pixels were masked and which weren't, it's a lot easier to directly compare against zero. If you don't have denormals, you can't do that and expect it to work.
 
Well, string theorists don't typically handle numbers at all, so...

Really?

Anyway, the nice thing about denormals is with comparisons to zero. For example, if I have some sort of mask I want to apply to an image, which multiplies parts of it by zero, and I want to go back later and ask which pixels were masked and which weren't, it's a lot easier to directly compare against zero. If you don't have denormals, you can't do that and expect it to work.

Good example, bad numerics. Do you or anybody you know have images with a dynamic range from 1 to 10^-38 in the first place, with a LOT of denormal pixels in them? If not, then why would you need SP denormal handling? And why would you need it in hw?

Or is it the case that you or anybody you know has images with a dynamic range from 1 to 10^-317? :runaway:
 
ATI has the same. In that case part of it appears to be a side-effect of having double-precision capability - though NVidia's DP capability is more involved (as it likely has a dedicated path for the sub-normal adder).
It's got most of what Fermi has. There are certain exceptions that will not be handled in the same manner. Exception handling of certain math conditions would segue into the implementation of the more general handling.

Which is at least partially in ATI too. Though I haven't taken a close look at the flags and performance.
At least part of the Cypress scheme is the generation of exception tokens, which can be used to reconstruct exceptions later. I think most of this can be done in a precise manner, but there were certain exceptions in FP that AMD's slides seemed to single out as not having the full treatment.

An exception is precise so long as the state of the system at the time of the exception can be determined; it's not necessary that handling happen right at the moment of the exception, and it still meets the definition if the exception is flagged in such a way that software can reconstruct the state at the time of the exception.
I think Cypress made a worthwhile trade-off there. It seems Nvidia's scheme is more expensive in terms of hardware, but also more performant and complete.

It's good design and makes it hard to point fingers.
Exception handling that is precise and handled by hardware would require the capability to save and restore sizeable contexts at speed, and could be invoked at many points in a potentially very long pipeline.
I would think it a safe bet this did add bloat, and with present graphics loads it would be mostly irrelevant.

L1/L2 are significant changes in Fermi.
The generalized load/store units would be significantly larger than typical texture caches with their fixed stride lengths.

The way these have been knitted into the entire operation of the GPU, not simply tacked-on, would count for some growth.
This is a potential source of bloat. There are a lot of clients to be served by whatever ties them together.

NVidia's promising greater ALU efficiency. The wider SIMDs in Fermi also imply less die-overhead in ALU control hardware.
In the die shot, the area in the cores devoted to the SIMDs is pretty easy to distinguish. The significant area not part of a SIMD is also very easy to distinguish.
 
Most any purely theoretical work hardly ever deals with actual numbers. Most of it is involved in examining the self-consistency (or lack thereof) of a particular theory.

Good example, bad numerics. Do you or anybody you know have images with a dynamic range from 1 to 10^-38 in the first place, with a LOT of denormal pixels in them? If not, then why would you need SP denormal handling? And why would you need it in hw?

Or is it the case that you or anybody you know has images with a dynamic range from 1 to 10^-317? :runaway:
It's a matter of convenience, mostly. Basically, if you don't do it in hardware, you can't do a simple comparison to check for zeroes. And if you don't do a simple comparison to check for zeroes, then you need to examine your system for a reasonable minimum cutoff, a cutoff that may potentially change with different data sources, and would be difficult to automate. In other words, it's quite a bit of extra work just to avoid doing "if (x == 0)"
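Here's a quick C sketch of that failure mode, using x86 SSE's flush-to-zero bit as a stand-in for a GPU that flushes denormals (so purely an illustration, nothing GPU-specific); the interesting case is an unmasked pixel whose value underflows into the denormal range:

Code:
/* Illustration only: SSE flush-to-zero standing in for hardware without
   denormal support.  Assumes x86-64 SSE float math; build with e.g. gcc -O0. */
#include <stdio.h>
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */

static void classify(const char *mode, float pixel, float gain)
{
    float v = pixel * gain;   /* may underflow below FLT_MIN (~1.2e-38) */
    printf("%s: %g -> %s\n", mode, v, v == 0.0f ? "masked" : "unmasked");
}

int main(void)
{
    volatile float unmasked_pixel = 1e-30f;  /* tiny, but definitely nonzero */
    volatile float gain = 1e-10f;            /* product 1e-40 is a denormal  */

    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);  /* gradual underflow */
    classify("denormals on ", unmasked_pixel, gain);

    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);   /* flush-to-zero */
    classify("flush-to-zero", unmasked_pixel, gain);
    return 0;
}

With gradual underflow the product stays nonzero and the pixel is correctly reported as unmasked; with flush-to-zero it silently becomes 0.0 and the "== 0" test lumps it in with the masked pixels, which is exactly where you'd be forced to invent a cutoff instead.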
 
Exception handling that is precise and handled by hardware would require the capability to save and restore sizeable contexts at speed, and could be invoked at many points in a potentially very long pipeline.
I would think it a safe bet this did add bloat, and with present graphics loads it would be mostly irrelevant.
The concept of a context in a GPU currently isn't like a CPU context though. If the exception is handled by call, then the handler sub-routine is part of the context in GPU terms. NVidia's scalar ALUs would appear to limit the amount of context that's in flux at the moment an exception occurs, i.e. a flush for a 12-stage pipeline is about the limit of the time overhead.

Predication around this handler call means that the context silently awaits the return, at which point context re-awakening is trivial.

The generalized load/store units would be significantly larger than typical texture caches with their fixed stride lengths.
Typical, traditional texture caches follow almost none of the rules of CPU caches, so anything that's more CPU-like in a GPU is going to look "bloaty" :p

This is a potential source of bloat. There are a lot of clients to be served by whatever ties them together.
I can't tell how much of the GPC-originated L2 traffic is routed through L1 (i.e. homogenised as part of the cache hierarchy's operation). Apart from that traffic, there is ROP traffic and TEX-L1 read-only traffic.

The mixed ROP/TEX capability in GF100's L2 is relatively expensive, agreed, in comparison with GT200's distinct pools.

Some kind of improved cache capability is a requirement for performant UAVs or append/consume, though.

In the die shot, the area in the cores devoted to the SIMDs is pretty easy to distinguish. The significant area not part of a SIMD is also very easy to distinguish.
I'd be interested in your specific interpretation of this, because unlike GT200's ALUs I can't see anything in GF100 that looks like ALUs.

Jawed
 
If the exception handling is there, then wouldn't it be better to handle the IEEE compliance (I mean the denormal stuff) in software, like CPUs do? ATM, I can't recall whether or not Fermi has full-speed denormal handling.
According to Nvidia statements, they do.
This must come at some cost in hardware, but it is one of the bullet points that Nvidia has used to compare itself to x86 IEEE compliance, where they claim orders of magnitude better cycle counts than a software trap.
 
According to Nvidia statements, they do.
This must come at some cost in hardware, but it is one of the bullet points that Nvidia has used to compare itself to x86 IEEE compliance, where they claim orders of magnitude better cycle counts than a software trap.

I know GT200 has hw denormal support; I was just wondering about Fermi.

On second thought, removing full-speed denormals would be a regressive step, even though it was pointless in the first place, so the PR people will probably shoot that down.

Having said that, there is a reason why it is not there in hw in CPUs: it's an unnecessary piece of silicon.
 
Nvidia needs to release the $1000+ Quadro and Tesla cards at the same time as the GF100 to at least make some money from it.
It would be the first time Quadro and Tesla cards are released at the same time as the gaming cards from the same architecture. That doesn't even sound good. :rolleyes:
 
Most any purely theoretical work hardly ever deals with actual numbers. Most of it is involved in examining the self-consistency (or lack thereof) of a particular theory.

Fair enough. My intention was to say that even string theorists don't deal with numbers of this magnitude, even if they don't do much numerical work on a day-to-day basis. I should have been clearer up front.

It's a matter of convenience, mostly. Basically, if you don't do it in hardware, you can't do a simple comparison to check for zeroes. And if you don't do a simple comparison to check for zeroes, then you need to examine your system for a reasonable minimum cutoff, a cutoff that may potentially change with different data sources, and would be difficult to automate. In other words, it's quite a bit of extra work just to avoid doing "if (x == 0)"
If you multiply an SP FP number by zero, and then want to compare it with 0, why would you need denormal support? Multiplication by 0 is an exact operation even on ALUs which flush input denormals to zero, ain't it?

AFAIK, floating point comparisons are also implemented by subtracting the two operands. And why would you need denormals when neither of your two operands is denormal, and in fact one is zero?
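A trivial C check of the exact-zero-multiply part (again using the SSE FTZ/DAZ bits as a stand-in for hardware that flushes denormals - just an illustration, and you may need something like gcc -msse3 for the DAZ macro's header):

Code:
#include <assert.h>
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE (DAZ) */
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (FTZ) */

int main(void)
{
    /* Flush denormal results (FTZ) and treat denormal inputs as zero (DAZ). */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    volatile float pixel = 1e-20f;          /* any finite value      */
    volatile float masked = pixel * 0.0f;   /* x * 0 is exactly 0    */
    assert(masked == 0.0f);                 /* holds with or without */
    return 0;                               /* denormal support      */
}

Masked pixels compare equal to zero no matter what the denormal handling is.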
 
The concept of a context in a GPU currently isn't like a CPU context though. If the exception is handled by call, then the handler sub-routine is part of the context in GPU terms.
Registers are allocated at invocation for all the potential operands used by whatever exception handler is coded?

NVidia's scalar ALUs would appear to limit the amount of context that's in flux at the moment an exception occurs, i.e. a flush for a 12-stage pipeline is about the limit of the time overhead.
Is this 12 cycles from the point of view of the affected thread, or from the hardware? With the warp schedulers, we have multiple warps in progress, and we would need to track them separately.
Worst-case, the hardware would have to track exceptions (possibly different ones?) from every lane in every warp that is currently in progress at the point of the first exception, wherever that first appears in the pipeline.

Nvidia claims to have changed the internal ISA to a load/store one, whereas the earlier variants had memory operands that would have been nightmarish to track as part of an ALU instruction.

I'd be interested in your specific interpretation of this, because unlike GT200's ALUs I can't see anything in GF100 that looks like ALUs.
My interpretation is that within each core, there are rectangular bands of straight silicon on the upper and lower edges, with one end marked by regions that look like the SRAMs for the register file.
Sandwiched between them would be stuff I attribute to special function and scheduling.
 