NVIDIA Fermi: Architecture discussion

It can only cull 4 per clock.
Which article is that from? I was going by what I saw here:
http://www.anandtech.com/video/showdoc.aspx?i=3721&p=2

Attribute setup is in the PolyMorph engine. It makes zero sense to do backface culling after that. At the very least you can do the majority of backface culling before edge setup and rasterization with a cross product. The geometry shader and tessellator need to fetch vertices anyway, and that's definitely in the PolyMorph engine.
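To make that concrete, here's a minimal sketch of the screen-space facing test (my own Python illustration, not anything out of the whitepaper; the CCW-front winding convention is an assumption):

# Hedged sketch (not Nvidia's actual pipeline logic): the cheap backface test
# alluded to above, done with a single 2D cross product on projected vertices.
def is_backfacing(a, b, c, front_face_ccw=True):
    """a, b, c are (x, y) screen-space positions of the triangle's vertices."""
    # Twice the signed area, via the 2D cross product of the two edge vectors.
    signed_area = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return signed_area <= 0 if front_face_ccw else signed_area >= 0

# A counter-clockwise triangle is front-facing under CCW winding:
print(is_backfacing((0, 0), (4, 0), (0, 4)))   # False -> keep
print(is_backfacing((0, 0), (0, 4), (4, 0)))   # True  -> cull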
 
I don't think that's right. It rasterizes 1 per clock, and according to the B3D article the two rasterizers in the diagrams were just an error.

Having more rasterizers than setup units does nothing for you, because you can't feed the rasterizers any quicker than 1 per clock. Buffering gobs of triangles between setup and rasterization is going to be very costly and not very effective.

GF100 has 16 PolyMorph engines and 4 rasterizers because it can then cull triangles at up to 16 per clock. This is a very useful ability. This is also where unified shader architectures made the biggest impact, because long vertex shaders creating off-screen or backfacing triangles would be processed at one per clock instead of one every 5 clocks or whatever, depending on the vertex shader. Fermi will now blast through them an order of magnitude faster.

R100/GF3 did a similar thing with pixels. Blast through the invisible ones as fast as possible, because they're going to come in clumps too large to buffer and they'll hold back the rest of the pipeline.
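Some purely illustrative arithmetic for that (none of these numbers are measured), just to show how much a clump of invisible triangles holds up the front end at different cull rates:

# Back-of-the-envelope sketch: clocks spent discarding a burst of
# backfacing/off-screen triangles at various cull rates (illustrative only).
def clocks_to_discard(num_culled_tris, cull_rate_per_clock):
    # Ceiling division: a final partial group still costs a full clock.
    return -(-num_culled_tris // cull_rate_per_clock)

clump = 10_000  # an assumed burst of invisible triangles
for rate in (1, 4, 16):
    print(f"cull rate {rate:>2}/clk -> {clocks_to_discard(clump, rate):>6} clocks")
# 1/clk -> 10000 clocks, 4/clk -> 2500, 16/clk -> 625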
As Dave said, there really are 2 rasterizers in Cypress, so I'm not sure what's with Rys' wink. I thought maybe there was a ninja edit of the diagram, but it doesn't seem that's the case. I didn't go back and look at the text of the article.

I agree it doesn't make sense to buffer gobs of triangles, but it does make sense to buffer some and it can smooth out the workload for small groups.

To turn the discussion back to GF100... even with the ability to set up multiple primitives per clock, Nvidia will need a large buffer between setup and the rasterizer (a read/write cache, probably) in order to maintain order and parallelism.
 
As Dave said, there really are 2 rasterizers in Cypress, so I'm not sure what's with Rys' wink.
That's just semantics, and he never said that it can rasterize two triangles per clock.

To turn the discussion back to GF100... even with the ability to set up multiple primitives per clock, Nvidia will need a large buffer between setup and the rasterizer (a read/write cache, probably) in order to maintain order and parallelism.
Depends on your definition of large. You're going to get diminishing returns very quickly.

It's from Nvidia's whitepaper.
Nvidia have said that they can get up to 8x the triangle throughput of GT200 with GF100, and that jibes perfectly with a 4x raster rate and a much faster cull rate.

I don't believe that all backface culling is done there, because that would just be silly. I'll accept that you can't BF cull every last triangle in view space due to precision issues, but you can do it for most.
 
Depends on your definition of large. You're going to get diminishing returns very quickly.
I figure the smallest split of the index buffer is on a patch basis and a single patch can generate hundreds, maybe 1000+ prims. I don't expect games to use the highest tessellation level anytime soon, but it would generate a lot of prims. So you need to buffer at least that much. If groups of patches are sent to each SM then you need even more buffering. If for some reason further amplification is done in the GS it increases the number of prims to be buffered.
 
I figure the smallest split of the index buffer is on a patch basis and a single patch can generate hundreds, maybe 1000+ prims. I don't expect games to use the highest tessellation level anytime soon, but it would generate a lot of prims. So you need to buffer at least that much.
Why do you need to buffer at least that much? You start generating primitives, and when the buffer gets full, you wait, even if it's in the middle of a patch. When you finish working on one set of indices, you work on the next patch. Unless you have alternating patches of big triangles and small triangles, buffering a whole patch's triangles buys you very little.
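A toy model of that behaviour (my own sketch, not a description of anyone's hardware; the FIFO depth, setup rate and patch sizes are made-up numbers):

from collections import deque

BUFFER_SLOTS  = 32                  # assumed FIFO depth between setup and raster
SETUP_PER_CLK = 4                   # assumed peak setup/cull output per clock
patches = [500, 500, 500]           # tessellated patches, hundreds of prims each

fifo, stalled_clocks, clock = deque(), 0, 0
while patches or fifo:
    clock += 1
    # Setup side: push prims until the clock's budget or the FIFO runs out.
    pushed = 0
    while patches and pushed < SETUP_PER_CLK and len(fifo) < BUFFER_SLOTS:
        fifo.append(1)
        pushed += 1
        patches[0] -= 1
        if patches[0] == 0:
            patches.pop(0)
    if patches and pushed < SETUP_PER_CLK:
        stalled_clocks += 1          # buffer full mid-patch: setup just waits
    # Raster side: consume at most one primitive per clock.
    if fifo:
        fifo.popleft()

print(f"drained in {clock} clocks; setup was throttled on {stalled_clocks} of them")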

Yeah, after I wrote the line you just quoted. I'll go reply in that thread.
 
Why do you need to buffer at least that much? You start generating primitives, and when the buffer gets full, you wait, even if it's in the middle of a patch. When you finish working on one set of indices, you work on the next patch. Unless you have alternating patches of big triangles and small triangles, buffering a whole patch's triangles buys you very little.
My example was off base. How much buffering you need is determined by the latency of the DS which allows multiple clusters to work in parallel.
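Reading that through Little's law (in-flight work = throughput x latency), the sizing comes out roughly like this; the 4 prims/clock target and the latencies are assumptions purely for illustration:

# Rough buffer sizing from DS latency, a la Little's law (numbers assumed).
def prims_in_flight(target_prims_per_clock, ds_latency_clocks):
    # To sustain the target rate, roughly this many primitives must be in
    # flight (and hence buffered/reorderable) to cover the DS latency.
    return target_prims_per_clock * ds_latency_clocks

for latency in (100, 300, 600):     # assumed DS latencies in clocks
    print(f"{latency} clk DS latency -> ~{prims_in_flight(4, latency)} prims in flight")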
 
It's interesting to see Nvidia change from having a large amount of texture throughput relative to pixel processing to having less, so they can go to a more geometry-focused architecture. I guess something had to give...

Let's hope tessellation takes off this time, unlike when the first GPUs got some chip real estate assigned to it.

(That's my rather non-technical view anyway - I'm more of a driver than someone with their head under the bonnet like you guys.)
 
At Tech Report, Scott Wasson mentioned two big hints about GF100's clock speeds:

1) Theoretical texture filtering rate on GF100 will be lower than GT200b (even though real world texture filtering performance for GF100 will often be superior to GT200b)

2) Running texturing hardware at half [hot clock] frequency will result in 12-14% boost compared to running at core clock frequency.

GT200b has a core clock frequency of 648 MHz, and a hot clock frequency of 1476 MHz.

Well, look at this: if GF100 has the exact same core clock frequency and hot clock frequency as GT200b, then the two conditions above are nicely met!

So here are my predictions on first iteration GF100 clock frequencies:

Core Clock: 648 MHz
Hot Clock [Half Rate]: 738 MHz
Hot Clock: 1476 MHz
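For anyone who wants to check the arithmetic (the clocks themselves are still just my guesses):

# Sanity check of the prediction above; these clock values are guesses, not specs.
core_clock = 648.0          # MHz, GT200b core clock reused as the GF100 guess
hot_clock  = 1476.0         # MHz, GT200b hot clock reused as the GF100 guess
half_hot   = hot_clock / 2  # 738 MHz

boost   = (half_hot / core_clock - 1) * 100   # texturing at half hot clock vs core clock
deficit = (1 - core_clock / half_hot) * 100   # the same gap, expressed the other way
print(f"half hot clock = {half_hot:.0f} MHz")
print(f"{boost:.1f}% above core clock / core clock {deficit:.1f}% below half hot clock")
# -> 13.9% and 12.2%, i.e. squarely inside the quoted 12-14% range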
 
At Tech Report, Scott Wasson mentioned two big hints about GF100's clock speeds:

1) Theoretical texture filtering rate on GF100 will be lower than GT200b (even though real world texture filtering performance for GF100 will often be superior to GT200b)

2) Running texturing hardware at half [hot clock] frequency will result in 12-14% boost compared to running at core clock frequency.

GT200b has a core clock frequency of 648 MHz, and a hot clock frequency of 1476 MHz.

Well, look at this: if GF100 has the exact same core clock frequency and hot clock frequency as GT200b, then the two conditions above are nicely met!

So here are my predictions on first iteration GF100 clock frequencies:

Core Clock: 648 MHz
Hot Clock [Half Rate]: 738 MHz
Hot Clock: 1476 MHz

Although those frequencies sound quite reasonable to me and are roughly what I'd expect, trying to calculate frequencies from one of NV's real-time texturing-increase graphs seems a bit odd at best. Usually when you get a 40-70% benefit in real-world texturing rates the theoretical peak should be a lot higher; and yes, that would then break my own expectations again and leave me at the same dead end as before.

It would be way too easy for someone to figure out the frequencies if all those hints aligned; they could just as well have given the frequencies away. All the second point tells me is that the core clock (ROPs/L2) is likely 12-14% lower than half the hot clock.

------------------------------------

As for triangle rates:

GF100 can thus claim a peak theoretical throughput rate of four polygons per cycle, although Alben called that "the impossible-to-achieve rate," since other factors will limit throughput in practice. Nvidia tells us that in directed tests, GF100 has averaged as many as 3.2 triangles per clock, which is still quite formidable.
http://www.techreport.com/articles.x/18332/2

Those who know exactly what "directed tests" stands for will be able to make an even more realistic estimate for real-time gaming conditions, which should be well below 3.2 tris/clock; but I'd personally have a hard time believing that something around 2 tris/clock with 4 rasterizers would be impossible.
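Converting those per-clock figures into absolute rates, with the caveat that the clock domain the setup/raster units run in hasn't been confirmed here, so the 648 MHz below is only an assumption:

assumed_setup_clock_mhz = 648          # illustrative only, not a confirmed GF100 clock

for tris_per_clock in (4.0, 3.2, 2.0):  # peak, "directed test", guessed real-world
    tris_per_sec = tris_per_clock * assumed_setup_clock_mhz * 1e6
    print(f"{tris_per_clock:>3} tris/clk at {assumed_setup_clock_mhz} MHz "
          f"~= {tris_per_sec / 1e9:.2f} billion tris/s")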
 
Although those frequencies sound quite reasonable to me and are roughly what I'd expect, trying to calculate frequencies from one of NV's real-time texturing-increase graphs seems a bit odd at best.

That's not what I was doing. Those two conditions above reportedly came directly from the NVIDIA engineers, according to TR (see TR GF100 graphics preview article for details).
 
That's not what I was doing. Those two conditions above reportedly came directly from the NVIDIA engineers, according to TR (see TR GF100 graphics preview article for details).

Techreport's and your assumptions may well be damn close or exactly spot on. However: "Running texturing hardware at half [hot clock] frequency will result in a 12-14% boost compared to running at core clock frequency." Why can't I read that as simply a 12-14% difference on GF100 between its core clock and 1/2 hot clock? In that case it could just as well be 600/672 as 725/825.

As I said, it sounds like too easy a riddle to solve, considering that NV wasn't willing to reveal zilch about frequencies yet. And in such a case it usually means that frequencies hadn't been finalized at that point, with a possible +/- not exceeding ~50 MHz IMHO.
 
Techreport's and your assumptions may well be damn close or exactly spot on. However: "Running texturing hardware at half [hot clock] frequency will result in a 12-14% boost compared to running at core clock frequency." Why can't I read that as simply a 12-14% difference on GF100 between its core clock and 1/2 hot clock? In that case it could just as well be 600/672 as 725/825.

As I said, it sounds like too easy a riddle to solve, considering that NV wasn't willing to reveal zilch about frequencies yet. And in such a case it usually means that frequencies hadn't been finalized at that point, with a possible +/- not exceeding ~50 MHz IMHO.

Point taken, but considering that power consumption has to be kept under control at stock clock speeds for this beast, and considering that GF100 will support GPU overvolting for "extreme" overclocking, I would be shocked if clock speeds were significantly higher than GT200b's. Also, if it is indeed true that GF100's theoretical texture filtering rate is lower than GT200b's, that would cap the hot clock frequency at no more than ~1620 MHz at stock settings (which would put the half-rate hot clock limit at ~810 MHz).
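For reference, the arithmetic behind that ~1620 MHz ceiling, assuming GT200b filters with 80 texture units at 648 MHz and GF100 with 64 units at half its hot clock (the unit counts are my assumption for the sake of the calculation):

# GF100's theoretical filtering rate stays below GT200b's only if
#   64 * (hot_clock / 2) < 80 * 648 MHz  =>  hot_clock < ~1620 MHz
gt200b_bilinear_peak = 80 * 648e6          # ~51.8 GTexels/s
max_half_hot = gt200b_bilinear_peak / 64   # ~810 MHz
print(f"half hot clock ceiling ~= {max_half_hot / 1e6:.0f} MHz "
      f"(hot clock ceiling ~= {2 * max_half_hot / 1e6:.0f} MHz)")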

One thing I feel pretty confident about is that GF100 will have a core clock / half-rate hot clock ratio of 648/738. Why do I say this? Because that would mean either the half-rate hot clock is 14% higher than the core clock, or the core clock is 12% lower than the half-rate hot clock, depending on how one calculates the difference.
 
The GTX 285 has a 205W TDP, and if the rumors I've heard aren't wrong, GF100 should end up at a 280W TDP (one 6-pin + one 8-pin connector probably, which gives the theoretical headroom for overclocking).
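The connector arithmetic behind that "theoretical headroom" remark (connector limits per the PCI Express power specs; the 280W TDP is just the rumor above):

slot_w, six_pin_w, eight_pin_w = 75, 75, 150     # W, per the PCIe power specs
board_limit = slot_w + six_pin_w + eight_pin_w   # 300 W total
rumored_tdp = 280                                # W, rumored GF100 TDP
print(f"connector budget {board_limit} W vs rumored TDP {rumored_tdp} W "
      f"-> ~{board_limit - rumored_tdp} W of in-spec headroom")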
 
I needed a laugh this morning...

... and boy does this deliver:

http://www.techreport.com/articles.x/18332/5

I should pause to explain the asterisk next to the unexpectedly low estimate for the GF100's double-precision performance. By all rights, in this architecture, double-precision math should happen at half the speed of single-precision, clean and simple. However, Nvidia has made the decision to limit DP performance in the GeForce versions of the GF100 to 64 FMA ops per clock—one fourth of what the chip can do. This is presumably a product positioning decision intended to encourage serious compute customers to purchase a Tesla version of the GPU instead. Double-precision support doesn't appear to be of any use for real-time graphics, and I doubt many serious GPU-computing customers will want the peak DP rates without the ECC memory that the Tesla cards will provide. But a few poor hackers in Eastern Europe are going to be seriously bummed, and this does mean the Radeon HD 5870 will be substantially faster than any GeForce card at double-precision math, at least in terms of peak rates.
Jawed
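To put that 1/4-rate cap into numbers (using the 1476 MHz hot clock guessed earlier in the thread, which is only an assumption, plus the HD 5870's published peak for comparison):

hot_clock_ghz  = 1.476     # assumed, from the earlier clock prediction
full_rate_fma  = 256       # DP FMAs per clock at half SP rate (512 SPs / 2)
capped_fma     = 64        # the GeForce limit quoted above

full_dp_gflops   = full_rate_fma * 2 * hot_clock_ghz    # an FMA counts as 2 flops
capped_dp_gflops = capped_fma   * 2 * hot_clock_ghz
print(f"uncapped ~{full_dp_gflops:.0f} GFLOPS vs GeForce-capped ~{capped_dp_gflops:.0f} GFLOPS")
print("Radeon HD 5870 published peak DP: 544 GFLOPS")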
 
After ATI's decision to cut out DP on the lower end parts it was dead in the water for games anyway ... so meh. All this does is make it easier for ATI to compete in HPC, which is boring anyway.
 
After ATI's decision to cut out DP on the lower end parts it was dead in the water for games anyway

Was it ever alive for games? Except for that lone dude who wanted to set up/rasterize his big triangles himself on the ALUs, or something? I can't remember anyone caring about DP in the game world (which may simply mean I wasn't looking where I should've been looking), but I do remember a number of devs being quite explicit about it NOT being needed there :-?
 
I think it just indicates NVidia thinks that neither ECC nor increased memory capacity is enough to justify buying a Tesla. Those seemed like decent differentiating factors - well, to those who think ECC will make a difference; I'm still waiting for any data showing that GPUs without ECC suffer from memory errors (once the memory has passed being soak-tested for hardware problems).

Or maybe GF100s will catch fire at consumer clocks if DP is run at full rate?

Jawed
 