NVIDIA Fermi: Architecture discussion

Tessellation

Just a question about tessellation to clear things up. :oops:
GF100 is supposed to be heavily tessellation oriented. The main advantage of tessellation is that it can increase geometry complexity (and control it) without taking up space and bandwidth in off-chip memory. So the final result should be nearly the same as if you put a real 1M-poly model in the scene, but with all of its downsides too.
The domain shader calculates the final vertex positions and passes them on to the geometry shader.
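To make that concrete, here's a rough CPU-side sketch (illustrative Python only, not real shader code, and using plain linear interpolation instead of a proper curved-surface evaluation) of what the DS conceptually does per generated vertex:

```python
# A purely illustrative Python sketch (not real HLSL) of what the domain
# shader conceptually does for ONE vertex generated by the tessellator on a
# triangle-domain patch: the tessellator only supplies barycentric (UVW)
# coordinates, and the DS turns them into an actual vertex.

def lerp3(a, b, c, u, v, w):
    # Barycentric interpolation of three 3-component attributes.
    return [u * x + v * y + w * z for x, y, z in zip(a, b, c)]

def domain_shader(positions, normals, uvw, displacement=0.0):
    u, v, w = uvw                                        # from the tessellator
    pos  = lerp3(positions[0], positions[1], positions[2], u, v, w)
    norm = lerp3(normals[0],   normals[1],   normals[2],   u, v, w)
    # Displacement (e.g. sampled from a height map) pushed along the normal.
    return [p + displacement * n for p, n in zip(pos, norm)]

# One generated vertex at the centre of a flat patch:
print(domain_shader([[0, 0, 0], [1, 0, 0], [0, 1, 0]],
                    [[0, 0, 1]] * 3, (1/3, 1/3, 1/3), displacement=0.1))
```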
And now here comes my question :rolleyes:
The geometry side of things is solved now, but doesn't this heavily increase the shading/lighting work that depends on the new geometry? Or things like multisampling on the new, finer edges?
Without deferred rendering, too many lights on a heavily tessellated scene and you are done, no? :LOL:
So the question is: could tessellation be limited more by raw shader throughput with advanced real-time lighting, shadows (and other effects that need the new geometry) than by setup rate? :?: (I mean really heavy tessellation like in the Heaven engine.)
 
Though it is highly subjective, it would also mean that as resolutions increase dramatically (5760x1200) and pixel pitch remains the same, the need for higher levels of AA decreases.

Here's a quote from a Metro 2033 exec:

games.on.net: Speaking of which, what frame rate are you targeting on the 360?

Olez: The 360 is always running at thirty. We try to run at 30 as well for PC, maybe a bit more because of v-synch.

I'm not talking about pixel pitch, I'm talking about the kind of performance developers will target with new engines/games. Each pixel is going to get more expensive. He didn't specify which resolution or hardware they're targeting for 30fps but it most certainly isn't going to be 5760x1200 on a single GPU.
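Just on raw pixel count alone (nothing else about the workload factored in):

```python
# Raw pixel counts per frame, single 1920x1200 display vs. a 5760x1200
# triple-wide setup (purely illustrative, nothing else factored in).

single = 1920 * 1200        # ~2.3 MPixels
triple = 5760 * 1200        # ~6.9 MPixels
print(triple / single)      # 3.0x the pixels to shade every frame
```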
 
A shame devs target lowish fps, instead of as high as possible. Over the years, gameplay has become more in sync with lower frame-rates.
 
Chasing DP in GPUs is not very effective. I would rather say that both NVIDIA/ATI could make pure compute chips with full DP and without the GPU pipeline bloat. :rolleyes:
For the same transistor budget they could be monster DP chips. And the costs would come back very fast with professional hardware margins.
Actually they are monster DP chips :D. At least for the right problems, but that is always an issue with GPGPU.

And I guess there's a bit of confusion created by the similarity of names between the Milkyway One installation in China (using downclocked HD4870X2s) and the Milkyway@home project Jawed referred to first. The application used there is extremely well suited to GPU computing and actually runs with higher efficiency on GPUs than on CPUs (the speedup is higher than the theoretical peak ratio) due to the latency-hiding features of GPUs and the limited ILP in the code. Just look at the table on page 6 here for a performance comparison. To give you a real number, an HD5870 achieves more than 400 GFlop/s in DP there, quite close to peak. One would need roughly 100 CPU cores @ 3 GHz to achieve the same performance.

And on a related note, a HD5870 is 6.4 times as fast as a GTX285 there. Will get interesting if Fermi is able to catch Cypress in that DP heavy scenario ;)
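For reference, a quick back-of-envelope check against commonly quoted peak specs (my own numbers, not taken from the paper linked above):

```python
# Back-of-envelope peak DP throughput from commonly quoted specs.

def peak_dp_gflops(units, flops_per_unit_per_clock, clock_ghz):
    return units * flops_per_unit_per_clock * clock_ghz

# HD5870 (Cypress): 320 VLIW units, each 1 DP FMA (2 flops) per clock @ 850 MHz
hd5870 = peak_dp_gflops(320, 2, 0.85)    # ~544 GFlop/s peak

# GTX285: 30 SMs, each 1 DP FMA (2 flops) per clock @ 1.476 GHz shader clock
gtx285 = peak_dp_gflops(30, 2, 1.476)    # ~88.6 GFlop/s peak

print(hd5870, gtx285, hd5870 / gtx285)   # peak ratio ~6.1x vs. ~6.4x measured
```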
 
Those guys certainly care but a market they do not make.
Apparently NVidia thinks so - and wants to charge something like 5x the $/DP-FLOP that they prolly paid for that supercomputer's GPUs. Which is where Aaron comes in...

Did NVidia hear that AMD was adding DP to RV670 and so tacked-on the DP for GT200? Is that why it was late and huge etc.?

The MilkyWay project alone isn't going to subsidize the added cost of including DP in mainstream chips where the vast majority of consumers won't perceive any added value. The "waste" was bad enough with GT200, it'll only get progressively worse with more price sensitive parts.
As has been discussed earlier in the thread, the DP overhead is likely a couple of percent in ATI - the ALUs as a whole are likely to be in the range of 25-40% of the die. Until we see a die picture for a DP-capable and a non-DP-capable GPU from the same family, we're stuck.
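Just to put rough numbers on it (the DP area penalty inside the ALUs is a pure assumption on my part, only there to show the order of magnitude):

```python
# Rough illustration of why the DP overhead ends up small at the die level.
# The DP penalty inside the ALUs is an assumed range, not a measured one.

alu_fraction_of_die = (0.25, 0.40)   # ALUs as a share of the die (from above)
dp_penalty_in_alus  = (0.05, 0.10)   # assumed extra ALU area to support DP

low  = alu_fraction_of_die[0] * dp_penalty_in_alus[0]   # ~1.3% of the die
high = alu_fraction_of_die[1] * dp_penalty_in_alus[1]   # ~4.0% of the die
print(f"DP overhead: {low:.1%} - {high:.1%} of total die area")
```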

Supposedly GT215 has DP (have to admit I'm not sure whether to believe that), but GT216/218 don't, but of course it's a cluster fuck.

Jawed
 
The geometry side of things is solved now, but doesn't this heavily increase the shading/lighting work that depends on the new geometry? Or things like multisampling on the new, finer edges?
It really depends on whether adaptive tessellation is being used to control the level of detail throughout the scene. Tessellation doesn't simply mean "more triangles" but "better triangles".

If the total number of triangles does increase then yes the slowdowns you've mentioned will occur. We still don't know what the overall effect is, for a given triangle budget.

Initially games are more likely simply to add triangles rather than remove them. I think that's what we're seeing in Dirt 2 and STALKER COP. So the early games may well turn out to be the worst offenders in terms of performance impact in comparison with overall visual quality.

Though it's worth noting that tessellation is normally switched on at the same time as other D3D11 options are switched on. So it's really hard to tell if the hit is just because of tessellation.

Also, we don't know if Evergreen is any good at tessellation. Just have to wait for GF100 to arrive...

Finally, GF100 may turn out to have such a significant boost in geometry processing performance, even when tessellation is not being used, that it leaves HD5870 way behind - e.g. shadow buffer rendering could be dramatically faster with the combination of huge setup rate and monster caching of ROP data :D

Lots of unknowns :cry:

Without deferred rendering, too many lights on a heavily tessellated scene and you are done, no? :LOL:
STALKER COP is a deferred rendering engine (like the earlier games in that series) and appears to feature not very much tessellation. It doesn't seem to be much of a gain - and appears to be tacked-on.

So the question is: could tessellation be limited more by raw shader throughput with advanced real-time lighting, shadows (and other effects that need the new geometry) than by setup rate? :?: (I mean really heavy tessellation like in the Heaven engine.)
Yes tessellation may not be the bottleneck. But exploration of these issues is pending...

Jawed
 
Unlike the texture cache the L1 bandwidth is almost certainly highly dependent on access patterns (ie. bank conflicts).

The texture cache could be subject to bank conflicts as well.

Fermi's L1 bandwidth is very nice - need to add in the TMU's L1, what's the bandwidth there?

A recently published Nvidia texture patent calls for 256 bytes per clock from a 32 bank L1 texture cache. Not sure if that's counting compression but it amounts to 3 TB/s.

Data is read from the banks. In the specific embodiment, eight bytes of data could be read each clock cycle from each of 32 banks, for a total of 256 bytes. Each bank is uniquely addressable during a read, that is, data from different cache lines can be read from each bank. Again, to support an eight-bilerp data rate, during each clock cycle, up to eight texels, two bilerps, can be read for each pixel in a pixel quad or four bilerps can be read for two pixels in a pixel quad.

http://patft.uspto.gov/netacgi/nph-...TXT&S1=nvidia.ASNM.&OS=an/nvidia&RS=AN/nvidia
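The ~3 TB/s works out roughly like this, assuming 16 texture L1s in a full chip and a texture clock around 700 MHz (both are my guesses, not values from the patent):

```python
# Where the ~3 TB/s aggregate figure comes from; unit count and clock are
# my assumptions for a full GF100, not values stated in the patent.

banks           = 32
bytes_per_bank  = 8
bytes_per_clock = banks * bytes_per_bank     # 256 B/clk per texture L1
tex_clock_hz    = 700e6                      # assumed ~half the hot clock
l1_count        = 16                         # assumed one per SM

per_l1    = bytes_per_clock * tex_clock_hz   # ~179 GB/s per L1
aggregate = per_l1 * l1_count                # ~2.9 TB/s chip-wide
print(per_l1 / 1e9, aggregate / 1e12)
```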
 
A recently published Nvidia texture patent calls for 256 bytes per clock from a 32 bank L1 texture cache. Not sure if that's counting compression but it amounts to 3 TB/s.
Isn't texture data in L1 actually uncompressed (but compressed in L2)? I thought that was usually the case...

Jawed said:
Supposedly GT215 has DP (have to admit I'm not sure whether to believe that), but GT216/218 don't, but of course it's a cluster fuck.
Where did you see that? Seems like GT215 is CUDA 1.2, hence no DP. Or did you mean it's just not enabled?
Would be slower than any quad-core CPU anyway (well, OK, the GDDR5 versions would have a memory bandwidth advantage, so it would still be faster in some cases, probably).
 
Isn't texture data in L1 actually uncompressed (but compressed in L2)? I thought that was usually the case...

Not in this particular patent. Here the decompression logic is between L1 and the filtering hardware. It's probably referring to G80 though where there was a 2:1 ratio on filtering to addressing. Fermi could very well have lower bandwidth per L1.
 
Yes tessellation may not be the bottleneck. But exploration of these issues is pending...

Given that NV did revise the number of units in their former-TPC-now-SM quite a bit, I'd wager that they've done some simulations and perf analysis with developers as to how much shading power needs to sit "behind" the tessellator to not overly bottleneck the pipeline. But with 8k triangles from a single geometry patch, you could saturate a few shader cycles, I think.
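For reference, that 8k figure falls straight out of the D3D11 maximum tessellation factor of 64, assuming a quad domain with uniform factors (a triangle domain gives roughly half):

```python
# Rough triangle count from a single patch at the D3D11 maximum
# tessellation factor of 64, uniform factors assumed.

max_tess_factor = 64

quad_domain_tris = 2 * max_tess_factor ** 2   # 8192 triangles (~8k)
tri_domain_tris  =     max_tess_factor ** 2   # 4096 triangles

# At 4 triangles/clock of setup, one maximally tessellated quad patch
# keeps the front end busy for ~2048 clocks on its own.
print(quad_domain_tris, tri_domain_tris, quad_domain_tris / 4)
```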
 
What's the rate at which Fermi generates new "vertices" in the TS stage? Is each tessellator (there are 16 in a "full spec" chip) producing 1 vertex per clock, or more than that?

Jawed
 
What's the rate at which Fermi generates new "vertices" in the TS stage? Is each tessellator (there are 16 in a "full spec" chip) producing 1 vertex per clock, or more than that?

Jawed

The only fixed-function unit is the tessellator; HS and DS run on a single SM (32 CUDA cores).
The HS does the control point evaluation and the tessellation level evaluation. The DS evaluates the surface at the given parametric UV coordinates, interpolates attributes, applies displacements and generates the final vertex from the HS and tessellator inputs.
I don't know how much shader work is needed for all this, but I think reaching 1 vertex per clock from each tessellation engine would be quite good for NVIDIA, and with a higher number of control points (32 is the max) it could be less than 1 per clock.
 
What's the rate at which Fermi generates new "vertices" in the TS stage? Is each tessellator (there are 16 in a "full spec" chip) producing 1 vertex per clock, or more than that?

Jawed

I don't know, but you'd be limited by the 4 tri/clk afterwards anyway, wouldn't you? I did not think about that in the first place, but I seem to remember that I read somewhere about it being 4 clocks per tri in the PM engine (could be remembering wrong though, especially since I cannot find where I read it).
 
Not necessarily, if you do the culling in the PM engine as well. You could simply generate, for instance, 8 triangles/clock and cull 4, so you still get only 4 triangles per clock that have to be fed into the rasterizers. In case not all PMs are busy, it could also be important to create more than 1 triangle/clock per PM or so - as we have no idea what conditions have to be met for all PMs to really work in parallel, as well as the rasterizers. Maybe there is some intermediate binning stage or so, which limits how many PMs can run in parallel afterwards. After all, multiple rasterizers seem to hint at a LRB-style tiled rendering approach, but in the Fermi case NVIDIA seems to have implemented the sorting/binning completely in hardware (I wouldn't be surprised though if that part is implemented in CUDA, with special commands to dispatch work to HW units like the PM).
 
They will need to sync the tessellation together somehow too.
What if they just parallelise the stages across the 4 tessellators (in each GPC), since the inputs are the same for a single vertex that gets tessellated into more vertices?
The DS can only take the input (control points, tess factors, UV{W} coordinates) for a single vertex and output one new vertex at a time. Everything before the DS could be running in parallel; just the final inputs to the DS would be 4 different vertices at a time going into 4 different DS invocations, if each of them knew which one to take.
But then again, what would this help you if you'd be limited by the 4 tri/clk afterwards anyway? :eek:
 
I don't know, but you'd be limited by the 4 tri/clk afterwards anyway, wouldn't you?
I was thinking the limit would be 1 triangle per clock per GPC, because the Raster Engine has a throughput of 1 per clock (first stage is setup).

I did not think about that in the first place, but I seem to remember that I read somewhere about it being 4 clocks per tri in the PM engine (could be remembering wrong though, especially since I cannot find where I read it).
Hmm, if it's really 4 clocks that's effectively 1 triangle per clock per GPC. Of course this makes it super-cheap to implement, useful when there's lots of them.
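In numbers, assuming the full 16-SM / 4-GPC configuration:

```python
# How the rates line up if it really is 4 clocks per triangle in each
# PolyMorph engine (full 16-SM / 4-GPC configuration assumed).

pm_engines_per_gpc   = 4
clocks_per_tri_in_pm = 4
gpcs                 = 4

tris_per_clk_per_gpc = pm_engines_per_gpc / clocks_per_tri_in_pm  # 1.0
tris_per_clk_chip    = tris_per_clk_per_gpc * gpcs                # 4.0
# ...which exactly matches one Raster Engine per GPC at 1 tri/clock.
print(tris_per_clk_per_gpc, tris_per_clk_chip)
```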

Still, I was pondering whether the rate might be faster, because stream out to memory is logically not setup limited (i.e. not 1-primitive per clock limited) and primitives don't necessarily go to setup when stream output is used. That's simply because setup comes after stream out.

One of the peculiarities of the whitepaper is it describes Stream Output as the final stage of the PolyMorph Engine, but logically in the D3D pipeline it comes before both Viewport Transform and Attribute Setup (both of which work on completed triangles output by GS).

It might be that Stream Output with no primitives being sent to the Raster Engine causes Viewport Transform and Attribute Setup to be skipped. It'd make sense. But, they might be bottlenecking Stream Output, causing it to be 1 primitive per clock limited, regardless.

Since GS can amplify primitives, too, it might have been seen as pointless to go faster than 1 vertex per clock in TS. Additionally, with or without TS, GS input might be deliberately limited to one primitive per clock, because it can amplify (and theoretically produce more than one triangle per clock when stream out is active but rasterisation is off).

Still, it seems to me like a potential missed opportunity, as dumping geometry into memory using a stream-out-only rendering pass is a useful technique which, not being setup limited, should go as fast as the wind. Though to be honest I don't know what proportion of frame rendering time it'd amount to - or how widely used this technique would be.

Jawed
 
But as the DS is limited to a single vertex per input, you would waste 4 times less time creating the new geometry, even if you only use 4/clock afterwards.
So saving time on the tessellation itself could be the point of the 16 PolyMorph engines. :?:
 
I don't see DS as relevant here. It's merely a vertex shader that consumes naked vertices and paints them with attributes (position, normal, colour...). That's how DS was implemented all these years in ATI's pre-D3D11 tessellation pipeline.

My interpretation of there being 4 PolyMorph engines, rather than 1, per GPC is that this enables each SM to take vertices from VS all the way through TS, HS, DS, GS, SO and pre-setup triangle operations to produce a completed triangle. By localising the processing of primitives like this, keeping them private to a SM, NVidia has minimised the amount of communication outside of each SM. This is important because communication is expensive and slow.

Jawed
 