NVIDIA Fermi: Architecture discussion

Dave answered this question, but there are probably multiple reasons for dual rasterizers without increasing the setup rate. One is that by working on two triangles or tiles in parallel, you speed up rendering of triangles that don't cover multiple tiles.
How? You still only get one triangle per clock fed to the rasterizers.

Previous designs rasterized one triangle at a time, so all pixels had to come from one triangle to achieve full rate. As Dave said, each now works on a different group of tiles.
Are you talking about pixels to the rasterizer or pixels to the shader engine? For the latter, you can definitely have pixels from multiple triangles. For the former, why does it matter? You only have a maximum of one triangle per clock going into the rasterizer.

The only way dual rasterizers help is to reduce bubbles a bit. If the shader engines are busy working on pixels from bigger triangles, the setup engine will fill the post-setup polygon cache, and these may be small triangles. When there's room for more pixels in the shader engine, the rasterizers can empty that cache, since together they process it twice as fast as the setup engine fills it. Then, when a big triangle comes, the setup engine can still keep plugging away until the cache fills again.

The result is that instead of always having #cycles = #triangles + #quads/8 for simple pixels, you now have a best case of #cycles = max{#triangles, #quads/8} if the big and small triangles are sufficiently interleaved. For uniform loads, though, a single rasterizer is just as fast as dual.
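To put rough numbers on that best case, here's a toy model of the two formulas above (a sketch only: it assumes 1 triangle/clock setup, 8 quads/clock aggregate raster rate, and perfect interleaving of big and small triangles; the workload mix is made up):

```python
# Toy cycle-count model for the single vs. dual rasterizer cases above.
# Assumed rates (illustrative only): setup = 1 triangle/clock,
# rasterization = 8 quads/clock in aggregate.

def cycles_single(triangles, quads):
    # Serialized model: every triangle pays its setup clock plus raster clocks.
    return triangles + quads / 8

def cycles_dual_best_case(triangles, quads):
    # Best case with interleaving: setup of small triangles hides behind
    # rasterization of the big ones, and vice versa.
    return max(triangles, quads / 8)

# Made-up mix: 1000 tiny triangles (1 quad each) plus 10 big ones (4096 quads each).
tris  = 1010
quads = 1000 * 1 + 10 * 4096               # 41,960 quads
print(cycles_single(tris, quads))          # 1010 + 5245 = 6255 cycles
print(cycles_dual_best_case(tris, quads))  # max(1010, 5245) = 5245 cycles
```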

Interestingly, this also affects where you decide to put your cache. With a 1 tri/clk rasterizer, you have no need for a post-setup cache, and instead buffer pixels after the rasterizer.
 
One serves one half of the tiles; the other serves the other half of the tiles.

Dave, could you elaborate on why this design decision was made? Did the design staff believe the existing rasterizer couldn't handle more than 16 RBEs or is there more at work here?

Are the rasterizers able to operate across the RBEs for their assigned tiles or does each rasterizer have a (static) assigned set of RBEs?
 
Thanks DemoCoder, your post made a lot of sense. :) I think I have slowly come to terms with reality and toned down my expectations for GF100. I understand they needed to concentrate on the growing threat of LRB from Intel, but I still expect some radical tweaking/optimizing in the GF100 refresh at the minimum.

Since G80 their cores have consumed more die area because of the large frequency differences, probably in order to keep power consumption under control with a relatively low transistor density. Theoretically they could have gone for, say, 10+10 clusters (i.e. 640 SPs), but then they would have also needed 64 ROPs and a 512-bit bus. If they had gone that route, though, I'm not so sure their hardware budget for 40nm would have also fit all the added computing functionality, and I honestly doubt that the whole added package is only worth a couple of hundred million transistors.
Why? They don't need to go all out on everything.

Take a look at Cypress: it doubled just about every aspect compared to RV770 and so far looks to have an outstanding price/performance ratio, though not really up to the point some would have expected from the raw specs. This of course could change as drivers mature and/or more demanding games kick in in the future. However, under your reasoning I could have wished for more than 20 clusters here too (not really), but the immediate next question would have been where the additional needed bandwidth would come from. And yes, the obvious answer would be a wider bus, but then you end up with higher chip complexity, more die area and higher costs, and you no longer have a performance part either, which would make their entire strategy (including Hemlock) completely redundant.
AMD made the 2.5x jump already; expecting another such jump wouldn't be as reasonable as expecting Nvidia to do something similar.

What NV could have done, in my humble layman's opinion, is focus from the get-go on developing/releasing a performance part first and the high-end part later on. I don't think it would change their so far monolithic, large-single-chip strategy, but it could speed up time to market a bit. One has to be blind not to see that their execution has been lacking since GT200, and no, G80 isn't an example that speaks against it, as it had a high-end chip in the form of R600 to counter.
I wholeheartedly agree on this; in fact, I was expecting GF100 to be something along these lines.

It depends on what you've planned more than on what your competition did. AFAIK they never planned to launch earlier than 4Q. Thus it will be "late" only if it misses that target. And last time I checked it's still 2009 here.
Well, "sooner than you'd expect" turned out to be late 4Q. I think you should concede this, especially since Nvidia itself acknowledged it's late. Unless you know more about their internal timelines (your AFAIK != Nvidia), you can't say otherwise.

Was it GF100 though or just Fermi in general?
I think it was Fuad who said it would be GF100 based with disabled clusters.
 
With regards to it being a 'software' tessellator, what I mean by that is that I think there's some hardware support, mainly in terms of memory management, but they didn't manifest the entirety of what they could do in silicon.

Regardless, the support will be fully DX11-compliant.

But inquiring minds want to know how fast it will be relative to other offerings.
 
Of course, there are limits to how dense they can go, but I wouldn't say they can't do better. Really, this is not much different from software development, where whenever you introduce major architectural changes, you end up going for correctness first, and then later you go back and optimize everything.

Their hand may have been forced. Looking at Larrabee and other product roadmaps, they probably felt they needed a much more general-purpose GPU this generation or else be caught with their pants down by Intel next year.

I agree. Larrabee is in their way, so that seems more likely their target. But they've got a long way to go before they have a GP PU.

I do hope they optimize the hardware. It *seems* like there's an awful lot of processing power idling in this core: ALU -or- FPU; FMA, but no MUL and ADD co-issue. But I think I look forward more to the generalizing of the processor. 25µs context switch times are nice, but shutting down a 16-SM core before doing so is ... sad. That's only going to get worse as the chip gets larger. It's great that you can run more than one kernel on the processor now, but they can't run in the same SM. That makes streaming kernels tough to deal with unless you run the stream over the L2. And speaking of caches, there's a cache coherence problem that needs dealing with.
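Just to put a rough number on why that shutdown hurts more as the chip grows (a back-of-the-envelope sketch; the clock and lane counts below are assumptions for illustration, not announced GF100 figures):

```python
# Rough cost of idling the whole chip around a context switch.
# Every number here is an assumption for illustration only.

hot_clock_hz  = 1.5e9     # assumed shader clock
fma_lanes     = 16 * 32   # 16 SMs x 32 lanes = 512
switch_time_s = 25e-6     # the quoted ~25us context switch

idle_cycles    = switch_time_s * hot_clock_hz   # ~37,500 cycles
lost_fma_slots = idle_cycles * fma_lanes        # ~19 million FMA issue slots

print(f"~{lost_fma_slots / 1e6:.0f}M FMA slots idle per switch")
# Double the SM count and the lost work doubles, while the 25us stays fixed.
```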

There's lots of interesting problems in there, and that's without even considering interesting stuff like dynamic warp formation, or transferring the responsibility of ROPs and/or TUs to the main cores, as Jawed keeps pining for :)

Could be an exciting few years -- I hope it isn't like the ones that followed G80, as those were somewhat dull....

-Dave
 
Games utilizing PhysX should be far superior on GF100 vs GT200 derivatives, with much greater chance of having playable framerates at high resolutions.

This is especially true with Multi GPU. At least until Nvidia gets its multi GPU PhysX drivers in a decent shape.
 
All these co-issue schemes imply significantly higher register file read bandwidth and 2x register write bandwidth, which from what I understand is quite expensive. I would venture to guess that the cost of an integer ALU is small relative to those things and would be pretty surprised if the fp mul HW weren't reused in some way for integer multiplies. So I don't think I agree that there really is "an awful lot of processing power idling in this core", especially when you consider all the infrastructure necessary to feed the ALUs. With regards to implementing stream producer/consumer communication via L2 - why is that a bad thing? I would imagine interleaved execution on a single SM would often end up spilling from L1 to L2...
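A crude operand-port tally shows where that cost comes from (a sketch only; it ignores operand collectors, register reuse and banking, which change the real numbers):

```python
# Per-lane register file traffic per issued instruction: (reads, writes).
FMA = (3, 1)   # d = a * b + c
MUL = (2, 1)
ADD = (2, 1)
INT = (2, 1)   # assuming a generic two-source integer op

def traffic(*ops):
    reads  = sum(r for r, _ in ops)
    writes = sum(w for _, w in ops)
    return reads, writes

print(traffic(FMA))       # (3, 1)  plain FMA issue
print(traffic(MUL, ADD))  # (4, 2)  MUL+ADD co-issue: more reads, 2x writes
print(traffic(FMA, INT))  # (5, 2)  FMA with a co-issued integer op
```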
 
Why? They don't need to go all out on everything.

Well, then you'd have to clarify what exactly you mean; in the case of having, say, >600 SPs but still the same 128 TMUs / 48 ROPs they have today, by what percentage would final gaming performance theoretically increase? Certainly not by the percentage the peak arithmetic values would increase. I am thinking in clusters because it would bring a healthier performance increase, and by the time you increase even just the texel fillrate, bandwidth won't be enough, etc.

You'd have to be a bit more specific don't you think?

AMD made the 2.5x jump already, expecting another such jump wouldnt be as reasonable as expecting Nvidia to do something similar.

Isn't Cypress rather 2x RV790 (raw bandwidth excluded)? I see 10 clusters @ 850MHz on RV790 and 20 clusters @ 850MHz on Cypress. Now, on the other side, I don't see a 2x increase in terms of TMUs, that's true, and "only" a 50% increase in ROP count. But that's still 128 TMUs and 48 pixels/clock, and given that no one so far has hinted at or confirmed a dual rasterizer, I still haven't understood what they need the 48 pixels/clock for.

In Cypress' case it makes perfect sense so far, especially if you consider that there's one rasterizer per 10 clusters and how tiles get handled, as Dave just explained. There, 32 pixels/clock, or rather 2*16, makes sense to the layman here.

I still have far too many question marks concerning GF100; so far what we have is mostly computing-oriented stuff, but little to nothing about the "3D transistors" or efficiency concerns.

I wholeheartedly agree on this; in fact, I was expecting GF100 to be something along these lines.

Remember that Fudo article about architectures/roadmaps not being able to get changed on short notice? How's the boomerang effect for a change? ;)

Well, "sooner than you'd expect" turned out to be late 4Q. I think you should concede this, especially since Nvidia itself acknowledged it's late. Unless you know more about their internal timelines (your AFAIK != Nvidia), you can't say otherwise.

Arty, I'd say that beyond doubt NVIDIA would have wished to be able to launch before Win7, and preferably before AMD too. And before someone says "yeah, but it ended up only a couple of months later", the next best question is what kind of launch.

I think it was Fuad who said it would be GF100 based with disabled clusters.

I'm not interested in dual-whatever, thanks very much ;)
 
If the L2 cache is divided per memory controller such that each section covers the memory space covered by its linked controller, it does open up questions about contention.
Even if no real sharing occurs, happening to write the same portion of the address space could lead to many writes to the same L2 section.
It might inject some additional latency, and the program will need to try to keep accesses spread out to prevent internal bandwidth from going down to 1/6 peak in certain cases.
Graphics has exactly the same problem. I don't think there's any explicit solution - merely the law of averages.
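A toy address-to-partition mapping shows how that 1/6-of-peak case can fall out (a sketch with made-up parameters: 6 L2/memory partitions and a 256-byte interleave; the real address hash isn't public):

```python
# How an unlucky stride can land every access on the same L2/memory partition.
# Parameters are illustrative only; real hardware hashes addresses differently.
PARTITIONS = 6
INTERLEAVE = 256   # bytes mapped to one partition before rotating to the next

def partition(addr):
    return (addr // INTERLEAVE) % PARTITIONS

good_stride = 256        # rotates through all six partitions
bad_stride  = 6 * 256    # always resolves to partition 0

for name, stride in (("good", good_stride), ("bad", bad_stride)):
    print(name, [partition(i * stride) for i in range(12)])
# good: [0, 1, 2, 3, 4, 5, 0, 1, ...]  -> full internal bandwidth
# bad:  [0, 0, 0, 0, 0, 0, ...]        -> ~1/6 of peak in this model
```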

Visually, it's always struck me how different the caches appear in die shots between CPUs and GPUs.
CPU caches are very regular and grid-like. On-board SRAM on GPUs always looks so chaotic.
Is it a difference in how they take the pictures, or a difference in the amount of porting for those caches?
You only need to look at the Larrabee die picture to see that architecture is the key.

Jawed
 
dnavas: I don't know if they're separate like that, but one GT200 diagram at the Tesla Editor Day clearly indicated separate INT units and then an engineer told me outright that was only marketing when I asked. Maybe it's the same this time around, or maybe it isn't. Heck, maybe that's what Bob is implying here! (i.e. there are cases where the units can actually both be used at the same time)
Well it seems to come down to there really only being a single execution unit that's 16-wide, whose INT and FP paths are overloaded for double-precision computations.

So that knocks on the head any of my interpretations about the feasibility of there being no TMUs. Undecided about ROP blending though...

Jawed
 
DPFP perf/w is going to be an interesting one as well...
How will anyone use DP on HD5870? It appears to be a pure tick-box feature, with no OpenCL support for ~ a year. Writing Brook+ and IL, both of which are no longer supported?

Jawed
 
I found this very ironic and so decided to share it with you guys: the next-gen Radeon is already in the works, slated for launch in Q3 2010 (which means a late-Q1 to early-Q2 tapeout), and it's called RADEON 100. And Fermi is GF100. :D
 
They have value ... but per-cacheline MOESI is an extreme: in cost, in the amount of effort necessary to scale it, and in the fragility of scaled-up implementations.
Yeah, it'll be interesting to compare the more complex Larrabee cache scheme (whose cost is amortised over 8MB of cache).

And it'll be interesting to compare Larrabee's in-order instruction issue, as opposed to NVidia's out-of-order.

Nvidia also seems to have halved the number of cycles per instruction - GT200 MAD is 4 cycles (though I still wonder whether convoys make it effectively 8 cycles), but GF100 is 2 cycles. Larrabee is 1 cycle per instruction.
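Those cycle counts follow straight from batch width over SIMD width (a quick check using the widths as discussed in the thread; the GT200 "convoy" question is left aside):

```python
# cycles per instruction = threads issued per batch / SIMD lanes per unit
configs = {
    "GT200 SM  (32-thread warp over 8 lanes)":  (32, 8),   # -> 4 cycles
    "GF100 SM  (32-thread warp over 16 lanes)": (32, 16),  # -> 2 cycles
    "Larrabee  (16-wide vector over 16 lanes)": (16, 16),  # -> 1 cycle
}
for name, (batch, lanes) in configs.items():
    print(f"{name}: {batch // lanes} cycle(s)/instruction")
```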

Jawed
 