NVIDIA Fermi: Architecture discussion

Which AMD chips are on 40 nm and which chips are supply constrained? Which AMD chips are being sold right now, and what process are they made on? To my knowledge all AMD chips are made on 40nm now and have been since sometime last year. And flooding the market with nV 40nm chips is not the case. Why would one AIB be supply constrained and lose market share while the other one can flood the market with theirs to cause a market share loss? The only possibility is that AMD isn't purchasing production wafers, yet for 2 quarters they have had the same problem? Doesn't make sense.

And the %'s are off.

The only 40nm chips from AMD are the DX11 parts and RV740. RV740 probably isn't made anymore, so only Cypress and Juniper come from AMD's 40nm lines (and of course Redwood/Cedar, but they're not in the Q4 numbers yet).

The %'s are counted for the ~1.6-1.8 million DX11 chips from the 7.5 million total shipped
 
Could you explain this?
Rasterisation is fundamentally bandwidth limited because it's pixel shading (with other bottlenecks such as texture rate at times). Or if the pixel shader has no texturing and is very short then the ROPs are likely to be running at full speed. The count of ROPs is usually limited by design to what's meaningful for the available bandwidth (not if it's R600) which is why ROPs scale with MCs, generally (though ATI's at it again with the peculiarities of Redwood).

Jawed
 
I would rather question whether GF100's architecture will be fully used in the near future, or whether you will need another generation until the rest of the chip catches up with the rasterizers (bandwidth, number of cores) :?:
I'd say the question mark is on ALU capability. But as I've queried before, we don't know if DS is more typically TEX limited than math limited.

I can't remember what a typical vertex shader load is for current games (as a percentage of total ALU workload per frame). Is it about 10%? If it's that high then a 4x increase in triangles per frame would kill just about any GPU configuration, wouldn't it?

Jawed
 
Cypress has twice the ALUs of Juniper but only one setup engine and one tessellation unit. That results in a larger performance drop in the Unigine benchmark with tessellation. Without tessellation the 5870 is 81% faster than Juniper; with tessellation it is only 58% faster.
http://www.hardocp.com/article/2009/11/06/unigine_heaven_benchmark_dx11_tessellation/
HD5870 with tessellation is faster than HD5770 without tessellation, with tessellation and setup running at the same rate in both GPUs. If HD5870 is tessellation/setup limited in this test, what's going on there?

Is there a more up-to-date comparison with current drivers?

Jawed
 
I'd say the question mark is on ALU capability. But as I've queried before, we don't know if DS is more typically TEX limited than math limited.

I can't remember what a typical vertex shader load is for current games (as a percentage of total ALU workload per frame). Is it about 10%? If it's that high then a 4x increase in triangles per frame would kill just about any GPU configuration, wouldn't it?

Jawed
It's not 10% anymore. It was probably somewhere in that neighborhood before unified GPUs came around (with some games a lot more than others), but now they just blast through vertex shaders. A cache-friendly mesh approaches two triangles per vertex transformation, so if you look at Cypress, you'd need 3200 MADs per vertex before the VS becomes a factor. Remember that portions of the scene that are geometry limited have very little pixel load at the same time.

Now, it's true that Fermi will bring this down to 512 MADs per vertex due to higher setup speed and lower ALU count, but that's still plenty of ops for all but the craziest vertex/geometry/hull shaders.
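As a rough check of the 3200 and 512 figures, here is a minimal sketch of the arithmetic. The per-clock inputs are assumptions chosen to match the numbers in this thread (1600 MADs and 1 triangle per clock for Cypress; 1024 MADs and 4 triangles per setup clock for Fermi), and `mads_per_vertex` is a hypothetical helper name.

```python
def mads_per_vertex(mads_per_clock, triangles_per_clock, triangles_per_vertex=2.0):
    """MADs available per vertex when the chip runs at its peak setup rate.

    A cache-friendly mesh approaches two triangles per transformed vertex,
    so the vertex rate is half the triangle rate.
    """
    vertices_per_clock = triangles_per_clock / triangles_per_vertex
    return mads_per_clock / vertices_per_clock

# Assumed inputs, picked to match the figures quoted in the thread.
cypress = mads_per_vertex(mads_per_clock=1600, triangles_per_clock=1)
fermi = mads_per_vertex(mads_per_clock=1024, triangles_per_clock=4)

print(cypress, fermi)
```

A vertex shader shorter than this budget leaves setup, not the ALUs, as the limit under these assumptions.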
 
I'm not talking about only Dx11 chips. I'm talking about ALL 40nm chips.

But the DX11 chips are the only ATI 40nm chips assuming they've stopped making RV740s, if they still make them, then it's DX11 + RV740.
All the other 7xx's are 55nm
 
JPR's latest reports stated that because of AMD's supply constraints due to TSMC issues, they lost desktop discrete market share to nV. That is quite interesting: how bad were the yields of AMD's 40nm chips, and imagine how a chip that is ~35% larger would be affected.
Except bear in mind that nVidia is using this process months later, which likely means higher yields. Since they are using the same fab, it's going to have (more or less) the same level of maturity for both of them by the time the GF100 starts shipping. So while ATI's yields are likely to improve as well, nVidia is just less likely to have the same problem, despite the larger chip size.
 
I'd say the question mark is on ALU capability. But as I've queried before, we don't know if DS is more typically TEX limited than math limited.

I can't remember what a typical vertex shader load is for current games (as a percentage of total ALU workload per frame). Is it about 10%? If it's that high then a 4x increase in triangles per frame would kill just about any GPU configuration, wouldn't it?

Jawed

The ratio of vertex shader load should decrease with increasing resolution. That was the main reason why shading power has increased more than 100 times while geometry processing rate has increased only a few times. Higher resolution increased the pixel count quadratically while the geometry stayed the same.
HardOCP could have tested the Heaven benchmark from 640*480 up to max resolution with tessellation. The numbers could be interesting.

Anyway I think that tessellation should be used for adding complex detail up close and not wasted on roof tiles and facing stones that could be modeled in 3D with far fewer triangles. Also using tessellation on such basic things as brick walls is quite overkill. What if u have a really long brick wall and u watch it from a small angle. U will get an insane amount of triangles in a single pixel.:rolleyes:
 
What if u have a really long brick wall and u watch it from a small angle. U will get an insane amount of triangles in a single pixel.:rolleyes:
It'd be nice if you could use whole words, but the fact is that with tessellation, you can dynamically change the density of polygons, such that you never get too many triangles per pixel. I'm not sure what the current targets are for GF100-level hardware, but it is certainly reasonable to have it target, for example, one triangle per pixel.
 
Anyway I think that tessellation should be used for adding complex detail up close and not wasted on roof tiles and facing stones that could be modeled in 3D with far fewer triangles. Also using tessellation on such basic things as brick walls is quite overkill. What if u have a really long brick wall and u watch it from a small angle. U will get an insane amount of triangles in a single pixel.:rolleyes:


You're not thinking about the goal of tessellation, which is to reduce bumps and curves ideally to imperceptible levels of tessellation but no more. This is not necessarily "brick walls need less tessellation" but more a way of imagining each surface broken into roughly pixel-sized polygons. A sampling of about two polygons per pixel is completely feasible and produces no aliasing. In your brick wall example you'd never want the nearby bricks undertessellated and the distant bricks overtessellated. It must be adaptive in screen space.
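To illustrate what "adaptive in screen space" could mean in practice, here is a minimal sketch that picks a per-edge tessellation factor from the edge's projected length in pixels. Everything here is an assumption for illustration: the function names, the simple pinhole projection with points given in camera space (z is depth, z > 0), the `target_edge_px` parameter, and the clamp to 64, which is the D3D11 maximum tessellation factor. It is not taken from any engine or from the posts above.

```python
import math

def project(point, focal_px):
    """Pinhole projection of a camera-space point (x, y, z) to pixel offsets."""
    x, y, z = point
    return (focal_px * x / z, focal_px * y / z)

def edge_tess_factor(a, b, focal_px, target_edge_px=1.0, max_factor=64.0):
    """Tessellation factor so that sub-edges are roughly target_edge_px long."""
    ax, ay = project(a, focal_px)
    bx, by = project(b, focal_px)
    projected_px = math.hypot(bx - ax, by - ay)
    # Never go below 1 (no subdivision) or above the hardware maximum.
    return min(max(projected_px / target_edge_px, 1.0), max_factor)

# Hypothetical 1050-pixel-high view with a 90 degree vertical field of view.
focal_px = 525.0
near_brick = edge_tess_factor((0.0, 0.0, 2.0), (0.1, 0.0, 2.0), focal_px)
far_brick = edge_tess_factor((0.0, 0.0, 200.0), (0.1, 0.0, 200.0), focal_px)

print(near_brick, far_brick)
```

With these made-up numbers the nearby brick edge is subdivided heavily while the distant one is left alone, so a long wall seen at a shallow angle does not pile many triangles into one pixel.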

Such tessellation is effectively how most 3D movie graphics are rendered (when using the REYES algorithm) and recent research (Renderants for example) has shown that real time tessellation to similar pixel-sized levels is not just possible but inevitable.
 
HD5870 with tessellation is faster than HD5770 without tessellation, with tessellation and setup running at the same rate in both GPUs. If HD5870 is tessellation/setup limited in this test, what's going on there?
Just because this test has setup limitations doesn't mean that it's purely setup limited. If we assume that the 5870 is twice as fast as the 5770 at pixel loads but equal at geometry, then those scores suggest the following:

No tessellation:
5870 - 11.3 ms pixel, 2.6 ms geometry
5770 - 22.7 ms pixel, 2.6 ms geometry

With tessellation:
5870 - 14.2 ms pixel, 10.2 ms geometry
5770 - 28.4 ms pixel, 10.2 ms geometry

I would say that spending 42% of your time on geometry is pretty significant. Look at it another way: 41 fps isn't exactly framerate nirvana, and only if you lowered resolution from 1680x1050 down to 1024x768 could you crack 60 fps.
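For anyone wanting to reproduce the split, here is a minimal sketch of the arithmetic under the stated assumption (pixel time doubles on the 5770, geometry time is identical). The frame times are simply the sums of the with-tessellation figures above; the function names are hypothetical, and the resolution estimate further assumes that pixel time scales linearly with pixel count while geometry time stays fixed.

```python
# Frame times in ms: sums of the with-tessellation pixel and geometry figures.
T_5870_MS = 24.4
T_5770_MS = 38.6

def split_frame_time(t_fast_ms, t_slow_ms, pixel_ratio=2.0):
    """Solve t_fast = p + g and t_slow = pixel_ratio * p + g for (p, g)."""
    pixel_fast = (t_slow_ms - t_fast_ms) / (pixel_ratio - 1.0)
    geometry = t_fast_ms - pixel_fast
    return pixel_fast, geometry

def fps_at_resolution(pixel_ms, geometry_ms, old_res, new_res):
    """Estimate fps after a resolution change, scaling only the pixel time."""
    scale = (new_res[0] * new_res[1]) / (old_res[0] * old_res[1])
    return 1000.0 / (pixel_ms * scale + geometry_ms)

pixel_ms, geometry_ms = split_frame_time(T_5870_MS, T_5770_MS)
geometry_share = geometry_ms / T_5870_MS
fps_low_res = fps_at_resolution(pixel_ms, geometry_ms, (1680, 1050), (1024, 768))

print(round(pixel_ms, 1), round(geometry_ms, 1))
print(round(geometry_share * 100), round(fps_low_res, 1))
```

The split is only as good as the two-to-one assumption behind it; any shared bottleneck that is neither pixel nor geometry bound would land in one of the two buckets.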

I'm not sure what the current targets are for GF100-level hardware, but it is certainly reasonable to have it target, for example, one triangle per pixel.
GF100 has fast setup speed, but let's not get carried away. One-pixel triangles would mean pixel shading power is cut by a factor of more than four, due to each triangle occupying four threads (a quad) in the shaders.
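A minimal sketch of the quad argument, assuming pixels are shaded in aligned 2x2 quads and that every quad a triangle touches costs four shader threads. The function name and the pixel-list representation are made up for illustration.

```python
def quad_efficiency(covered_pixels):
    """Fraction of launched pixel-shader threads that produce a visible pixel.

    covered_pixels is a list of (x, y) integer pixel coordinates covered by
    one triangle. Each touched 2x2 quad launches four threads.
    """
    pixels = set(covered_pixels)
    quads = {(x // 2, y // 2) for x, y in pixels}
    return len(pixels) / (4 * len(quads))

one_pixel = quad_efficiency([(5, 7)])
full_quad = quad_efficiency([(0, 0), (1, 0), (0, 1), (1, 1)])
straddling = quad_efficiency([(1, 0), (2, 0)])

print(one_pixel, full_quad, straddling)
```

A one-pixel triangle uses a quarter of the threads it launches, which is the factor of four; anything beyond that would come from overheads this sketch does not model.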
 
JPR's latest reports stated that because of AMD's supply constraints due to TSMC issues, they lost desktop discrete market share to nV. That is quite interesting: how bad were the yields of AMD's 40nm chips, and imagine how a chip that is ~35% larger would be affected.

I was merely stating the obvious. If you play with the thought that NV didn't use 512SPs in the CES PCs, the upside is that there's possibly a tad more performance than shown, and the downside is that there's a possibility that you'll need a shotgun to find 512SP variants. And I'm not going in the direction that they did it on purpose, because that sounds too silly to me at the moment.
 
You're not thinking about the goal of tessellation, which is to reduce bumps and curves ideally to imperceptible levels of tessellation but no more. This is not necessarily "brick walls need less tessellation" but more a way of imagining each surface broken into roughly pixel-sized polygons. A sampling of about two polygons per pixel is completely feasible and produces no aliasing. In your brick wall example you'd never want the nearby bricks undertessellated and the distant bricks overtessellated. It must be adaptive in screen space.

Such tessellation is effectively how most 3D movie graphics are rendered (when using the REYES algorithm) and recent research (Renderants for example) has shown that real time tessellation to similar pixel-sized levels is not just possible but inevitable.

We will need to wait and see what game developers can make of it. Tech demos are narrow and quite primitive compared to real games, where plenty of things need to be balanced out (and can go wrong).
It will take a lot of time, experimenting and experience to use tessellation in games so that you won't have tessellation where you don't need it and fps stays balanced (even on slower cards).
 
But the DX11 chips are the only ATI 40nm chips assuming they've stopped making RV740s, if they still make them, then it's DX11 + RV740.
All the other 7xx's are 55nm


What 40 nm chips was nV making? Last quarter they were still making the RV740; I think it was around December that it went EOL. AMD hasn't been making any 55nm graphics chips since the middle of two quarters ago.
 
It's not 10% anymore.
Apparently it's about 23% in Unigine with tessellation off on HD5870 according to your follow up post. But other bottlenecks are in play, i.e. HD5870 isn't ALU limited for geometry here - it could be vertex bandwidth limited or fillrate (Z rate) limited.

In this post:

http://forum.beyond3d.com/showpost.php?p=1383133&postcount=1004

AvP without tessellation is 96fps, but in wireframe is ~500fps and that's without the geometry workload of shadows. 500fps could be setup limited rather than VS limited, or it could be vertex fetch limited.

Also, I don't have a decent idea of how expensive wireframe rendering, itself, is.

It was probably somewhere in that neighborhood before unified GPUs came around (with some games a lot more than others), but now they just blast through vertex shaders. A cache-friendly mesh approaches two triangles per vertex transformation, so if you look at Cypress, you'd need 3200 MADs per vertex before the VS becomes a factor.
Does that take account of multi-pass geometry, shadow buffer rendering passes, overdraw, transparency passes and shrinking triangles?

Remember that portions of the scene that are geometry limited have very little pixel load at the same time.
I don't understand what you mean by portions of the scene.

Now, it's true that Fermi will bring this down to 512 MADs per vertex due to higher setup speed and lower ALU count, but that's still plenty of ops for all but the craziest vertex/geometry/hull shaders.
I'm assuming this is the scenario you're painting: that VS is setup limited, that's 4 triangles per SM clock or 2 vertices per SM clock. There are 1024 ALU instructions per SM clock (hot clock is 2x SM clock), so 512 ALU instructions per vertex per clock is the limit for 100% VS usage of the GPU. I wasn't thinking of 100% usage, I was contemplating VS becoming the dominant shading workload (>50%) after tessellation (4x multiplier of triangles), which is 64 instructions per vertex.
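The step from 512 to 64 can be sketched as follows, using only the figures in the paragraph above. The helper name and its parameters are hypothetical, and dividing by the share and the triangle multiplier is one reading of the reasoning, not a confirmed model.

```python
def vs_budget_per_vertex(alu_per_clock, triangles_per_clock,
                         triangles_per_vertex=2.0,
                         vs_share=1.0, triangle_multiplier=1.0):
    """ALU instructions per vertex at which the VS reaches vs_share of the GPU.

    triangle_multiplier models tessellation increasing the triangle count,
    which shrinks the per-vertex budget by the same factor.
    """
    vertices_per_clock = triangles_per_clock / triangles_per_vertex
    full_budget = alu_per_clock / vertices_per_clock
    return full_budget * vs_share / triangle_multiplier

full_usage = vs_budget_per_vertex(1024, 4)
dominant_after_tess = vs_budget_per_vertex(1024, 4, vs_share=0.5,
                                           triangle_multiplier=4.0)

print(full_usage, dominant_after_tess)
```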

Jawed
 
Just because this test has setup limitations doesn't mean that it's purely setup limited.
That was my point. There's a suggestion that tessellation/setup rate limit is what's occurring here, but at worst this is not a continuous bottleneck.

If we assume that the 5870 is twice as fast as the 5770 at pixel loads but equal at geometry, then those scores suggest the following:

No tessellation:
5870 - 11.3 ms pixel, 2.6 ms geometry
5770 - 22.7 ms pixel, 2.6 ms geometry

With tessellation:
5870 - 14.2 ms pixel, 10.2 ms geometry
5770 - 28.4 ms pixel, 10.2 ms geometry
How are you calculating this?

Tessellation's resulting smaller triangles (apparently a lot of them are smaller than a pixel in this benchmark) will hugely increase the pixel shading workload. Additionally tessellation will create overdraw in parts of the scene (e.g. the cobbled road) where previously there was zero (it was merely a bump-mapped road).

I would say that spending 42% of your time on geometry is pretty significant. Look at it another way: 41 fps isn't exactly framerate nirvana, and only if you lowered resolution from 1680x1050 down to 1024x768 could you crack 60 fps.
I agree, the resulting framerates are troublingly low. Even without tessellation they're troublingly low. But then, that's what synthetic benchmarks do.

GF100 has fast setup speed, but let's not get carried away. One-pixel triangles would mean pixel shading power is cut by a factor of more than four, due to each triangle occupying four threads (a quad) in the shaders.
Which is why I'm dubious of the pixel/geometry balance you've derived above, as well as other factors.

Still, I think this is the right direction.

Additionally, it seems to me that GF100's substantial read/write L1/L2 cache can't help but be a significant factor in performance here. NVidia describes L2 as replacing various FIFOs, and it's certainly a key part of geometry processing.

Jawed
 
How are you calculating this?
Tessellation's resulting smaller triangles (apparently a lot of them are smaller than a pixel in this benchmark) will hugely increase the pixel shading workload. Additionally tessellation will create overdraw in parts of the scene

It's just the numbers made to fit the above assumption about double/same performance for pixel/geometry load (I think shader load would be more correct than pixel load). So the conclusion about the 42% is mostly a consequence of that assumption.
I was also sceptical about the assumption, but doing the same thing on the 1920a8 result (i.e. keeping the 10.2ms geometry) gives 34/63 ms for the pixel (shader) load, which is still quite close to double.
Regarding the difference in pixel numbers w/o tessellation: the result is without MSAA, so the number of fragments should be the same except for the overdraw. Then we would have to add the additional shading load from the tessellation stages, so maybe 25% isn't too far off (and it's 25% for both GPUs, making the assumption a better fit).
 
Tessellation's resulting smaller triangles (apparently a lot of them are smaller than a pixel in this benchmark) will hugely increase the pixel shading workload. Additionally tessellation will create overdraw in parts of the scene (e.g. the cobbled road) where previously there was zero (it was merely a bump-mapped road).

TBH, tessellation seems like another big push (apart from the relatively small increase in bandwidth from rv770->rv870) towards deferred rendering to me. With MSAA now possible with deferred rendering as well.

And while your rendering has been deferred, why not render in tiles to increase bandwidth efficiency as well. LRB FTW :LOL:

If only an IHV will take the initiative to include hw acceleration for sorting fragments in tiles and exposing it in OCL. ;)

Absent that, I guess they could try slapping a 256MB DRAM module on top of the GPU die using PoP tech.

http://en.wikipedia.org/wiki/Package_on_package

The only problem with doing that with GPUs seems to be heat. :rolleyes:
 
TBH, tessellation seems like another big push (apart from the relatively small increase in bandwidth from rv770->rv870) towards deferred rendering to me. With MSAA now possible with deferred rendering as well.
Huh what? In what way is tessellation in any way related to deferred rendering?
 