HSR mechanisms - as evident in the fact that those huge advantages only show up in outdoor environments.
Are there any HSR performance figures across the different generations? Would be interesting to see.
I only read the page you linked to (conclusion), but didn't see anything that singled out the HD 2900XT as you write above. They didn't like the performance of any cards.
If you're referring to my post: Oblivion's outdoor areas tax Nvidia GPUs up until G80 heavily, regardless of whether HDR or Bloom is selected.
> It should be noted that Xenos has 4 blend units per ROP, and we don't know how many G80 has. It may only have 1, because if the samples aren't identical (and hence not compressed) it'll take a while to transfer the data anyway. Also, I think G80 has half-speed blend like NV40. The 3DMark numbers are showing precisely 12 pixels per clock (6.9 GTex/s). That will save a lot of transistors.

And presuming that NVidia hardware still doesn't do AA resolve in its ROPs (but somewhere in the display chain), there's even less need for lots of blending, I guess.
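As a quick sanity check on the pixels-per-clock reading in the quote above, a minimal sketch assuming the 8800 GTX's 575 MHz core clock:

```python
# 3DMark's ~6.9 G/s fill figure on an 8800 GTX at an assumed 575 MHz core
# clock works out to ~12 pixels per clock, as stated above.
fill_rate = 6.9e9        # per-second rate from the 3DMark figure quoted
core_clock = 575e6       # Hz, 8800 GTX core clock (assumed)
print(fill_rate / core_clock)    # ~12.0 pixels per clock
```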
> Nonetheless, let's assume you're right with 100M. That's still in line with what I was talking about. Suppose 16 DX10 ROPs doing 2 samples per clock cost 40M transistors. I think that's fair. 60M additional transistors is under 10% of G80, and though I can never prove it, I think G80 has benefitted more than 10% in AA performance (the only performance that counts for this market segment). I really don't think they've been wasteful here.

The architecture, binding MCs and ROPs together (with L2), doesn't really give them much choice in this matter if they want a bus that's bigger than 256 bits. So, it's deliberate.
> Even with AF disabled, R580 was very often only marginally faster than R520. (Trilinear has always had a low impact with the mipmap optimizations we've seen, so bilinear rate = fetch rate for all intents and purposes.)

ATI tweaked drivers for R5xx so that no-AF performance suffered at the same time as AF performance improved. No-AF really isn't interesting, it should be always-on.
> Originally, yes. I thought it was wasteful because you only need a bit more hardware to make double speed bilinear (it's probably even a greater fraction than what's shown by that chart once you take the sample fetching, cache, and MC into account). However, then I realized that you would need twice the threads and registers to truly double bilinear texturing speed. That was a pretty big oversight in my original analysis. Moreover, I did not consider that it would also double the single mipmap speed of FP16, I16, FP32, other >8bpc HDR formats, and 8bpc volume textures.

All those >8bpc formats are a tiny proportion of frame rendering time. The "HDR" and "shadowing" impact on games isn't TMU-centric, it's ROP-centric and bandwidth-centric.
> That's true, and I've brought that point up many times myself too. I wasn't saying AF is the biggest cause of dips in framerate, but just suggesting that it's a load that could wildly fluctuate when everything else stays the same, especially if texture resolutions start going up the way we all want them to.

The irony being that as screen resolutions gradually climb and game worlds "expand", the AF workload per screen pixel drops, because we're stuck at 16xAF - the "backwall" of AF is not receding with increased resolution. I think we need to go to 64x or higher...
> Even if you disable AF to almost negate the filtering advantage, the 8800GTS-640 is not far from the HD2900XT. With less fillrate (well, w/o AA only I guess), way less bandwidth, and way less math ability, you have to think the equal texturing rate has something to do with it, right?

You've prolly got a point with 2003 and earlier games, but z fillrate seems dominant in all recent game tests and math is nowhere to be seen. Trying to find a game that's simultaneously bilinear-texturing and bandwidth-bound today has got to be a real chore.
> But that would mean 48 bilinear TMUs! Weren't you just saying fetch rate is less important than filtering rate?

My theory is that G80 is TA-limited and the TMUs are woefully inefficient (or there's just too damn many of them for the bandwidth available). So I'm suggesting cutting the TFs back to 75% and doubling the TAs.
> Remember how the original Geforce could do single cycle trilinear (i.e. 2 filtered quads per clock) in 15M transistors. It cost NVidia 10M transistors to double the addressing ability and double the pixels in flight for the Geforce 2. Though I suspect they increased texture blending math and I guess increased clock speed had some cost as well.

That's way before my time :smile:
> I'm not following you on the impact of pixel shader vs. vertex shader on the register file. In a scalar architecture it doesn't really matter.

No, G8x has register-file bandwidth corner cases and hazards. And MAD+SF co-issue scheduling is much easier if you join two batches together (a warp) than if you try to schedule 16 object batches. All of this is statically schedulable as far as I can tell.
> You store your data in a SOA manner in the RF. Parallel MADDs still need to load the same 8 floats for each of 3 operands as with serial MADDs. The clock count between instruction changes makes sense, though. A vec4 op on a batch of 16 would use the same instruction for 8 clocks on 8-wide SIMD arrays.

We're in a grey area here because NVidia won't enumerate the register file limitations, "if the performance is weird and you've tried everything else it's prolly the register file".
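Spelling out the vec4 issue-rate point in the quote above, a minimal sketch assuming a batch of 16 objects on an 8-wide SIMD array, one lane-operation per clock:

```python
# Each instruction is repeated until every component of every object in the
# batch has gone through the 8-wide SIMD array.
batch_size = 16   # objects per batch (assumed, per the discussion above)
simd_width = 8    # ALU lanes per array

def clocks_per_instruction(components):
    return batch_size * components // simd_width

print(clocks_per_instruction(4))  # vec4 op: 8 clocks on the same instruction
print(clocks_per_instruction(1))  # scalar op: 2 clocks
```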
> Also, making the register file wider isn't nearly as costly as making it bigger. The number of operands is still the same.

Doubling the width of all those buses is hardly trivial! There are 3 buses feeding the ALU pipeline: RF, constant and PDC.
> One quarter of R600 fetches data for 80 SPs each clock. The sixteen arrays in G80 only fetch data for 8 SPs. Anyway, regardless of how you approach it, it would definitely be more costly to keep the warp size the same and double the math instead of doubling both.

A SIMD might be 16-wide (5-way), but the RF is prolly organised as 4 separate blocks, each block private to a set of 20 ALU pipes. I say that because each "private quarter-RF" gets its results from just one quad-TU - the timings of RF writes by the quad-TU should be independent across the set of 4 quad-TUs that are all producing texture results for a batch in one 16-SIMD.
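For reference, the per-quarter and per-array SP counts being compared in this exchange, taking the organisations described here at face value (R600 as 4 SIMD arrays of 16 pipes x 5-way ALUs, G80 as 16 arrays of 8 scalar SPs):

```python
# R600: 4 SIMD arrays, each 16 pipes wide with 5-way ALUs.
r600_quarter = 16 * 5             # SPs fed by one quarter (one array) = 80
r600_total = 4 * r600_quarter     # 320 SPs

# G80: 16 SIMD arrays of 8 scalar SPs each.
g80_per_array = 8
g80_total = 16 * g80_per_array    # 128 SPs

print(r600_quarter, g80_per_array)   # 80 vs. 8, the figures quoted above
```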
> All those >8bpc formats are a tiny proportion of frame rendering time. The "HDR" and "shadowing" impact on games isn't TMU-centric, it's ROP-centric and bandwidth-centric.

I'm not totally convinced of that... particularly with 1-component formats (like standard shadow mapping) I find myself more often limited by total *number* of texture fetches rather than bandwidth, etc. Even on cards with stupidly large amounts of bandwidth, performance is barely increased, whereas performing fewer texture fetches generally improves performance much more, even if the same amount of data is being fetched.
> I'm not totally convinced of that... particularly with 1-component formats (like standard shadow mapping) I find myself more often limited by total *number* of texture fetches rather than bandwidth, etc. Even on cards with stupidly large amounts of bandwidth, performance is barely increased, whereas performing fewer texture fetches generally improves performance much more, even if the same amount of data is being fetched.

This is true. Shader and texture units are designed to hide latency. Fetching one component when the HW is designed to cover the latency of four likely won't make much difference in performance. Also note that, in general, texture lookups in the shader can't be optimized whereas math can be.
> I don't claim to completely understand what's going on with the hardware here, but it is annoying to me that the *number* of texture fetches is often so important - it really restricts one's choice of algorithms.

See above.
> I'm not totally convinced of that... particularly with 1-component formats (like standard shadow mapping) I find myself more often limited by total *number* of texture fetches rather than bandwidth, etc. Even on cards with stupidly large amounts of bandwidth, performance is barely increased, whereas performing fewer texture fetches generally improves performance much more, even if the same amount of data is being fetched.

Ah, I was under the impression that shadow map resolution and number of lights/radius/direction is causing a fillrate bottleneck in games with shadows and that's the primary bottleneck for shadowing.
> This is particularly evident with Fetch4... if it was just about bandwidth, there would be no need for Fetch4 at all. However the performance difference that it makes is quite pronounced, although my guess is that it would be less so on NVIDIA hardware (perhaps why they don't support it).

I'd forgotten about fetch4. Do you have an HD2900XT now or is this on X1k?
> I don't claim to completely understand what's going on with the hardware here, but it is annoying to me that the *number* of texture fetches is often so important - it really restricts one's choice of algorithms.

When a texture fetch is performed, it's going to populate texture cache with neighbouring shadow map samples anyway. So it seems to me the difference you're seeing with fetch4 is purely utilisation of texture addressing and fetch pipes. Er, stating the obvious.
> Potentially stupid question: do PCF and fetch4 run at the same speed (if you have an HD2900XT) - I presume it's the same number of fetches, and then it's down to the ALU capability of HD2900XT, which I presume is fast enough to make fetch4 fetch-bound.

It would depend on the shader. If, with PCF, the shader is exactly balanced between ALU and TEX instructions, then swapping the PCF lookup(s) for FETCH4 could incur a performance hit. This could also be true if the original PCF shader was ALU-biased.
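A toy issue-balance model of that argument (the instruction counts are made up purely for illustration): PCF does the 2x2 weighting inside the TMU, while Fetch4 returns the four raw depths and leaves the weighting to the ALUs, so an already ALU-bound shader can come out slower while a fetch-count-bound one comes out faster.

```python
# Toy model: the shader is limited by whichever of ALU or TEX instruction
# counts is larger, assuming perfect overlap and one instruction per clock
# on each side. All counts are illustrative, not measured.
def shader_cost(alu_instr, tex_instr):
    return max(alu_instr, tex_instr)

# Balanced shader, one shadow tap:
print(shader_cost(alu_instr=8, tex_instr=8))      # PCF tap:    8 clocks
print(shader_cost(alu_instr=8 + 3, tex_instr=8))  # Fetch4 tap: 11 clocks (hit)

# Fetch-heavy shader needing 16 single-channel point samples:
# Fetch4 returns a 2x2 block per fetch, so only 4 fetches are issued.
print(shader_cost(alu_instr=8, tex_instr=16))     # point sampled: 16 clocks
print(shader_cost(alu_instr=8, tex_instr=4))      # with Fetch4:    8 clocks
```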
> Ah, I was under the impression that shadow map resolution and number of lights/radius/direction is causing a fillrate bottleneck in games with shadows and that's the primary bottleneck for shadowing.

You may be right, particularly for "simple" shadow implementations that take a small number of samples. I'm just relaying my experience.
> I'd forgotten about fetch4. Do you have an HD2900XT now or is this on X1k?

This is on an X1k. We just got a HD2900XT at the office so I'll get to mess with it a bit at some point, but unfortunately it's in an XP machine right now so no D3D10.
> When a texture fetch is performed, it's going to populate texture cache with neighbouring shadow map samples anyway. So it seems to me the difference you're seeing with fetch4 is purely utilisation of texture addressing and fetch pipes. Er, stating the obvious.

Makes sense. Actually the application in which I saw the largest difference was when fetching bilinear samples from four corners of a summed-area table. (Fetch4 at each corner, SAT math for the four rectangles, then bilinearly weight the sums). It was almost worth splitting into *two* textures for VSM (and paying the MRT and SAT generation pass costs) just to get the use of Fetch4.
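For anyone who hasn't met the trick: the sum over any rectangle of a summed-area table comes from four corner reads, and the Fetch4 win is that each corner's 2x2 neighbourhood arrives in one fetch. A rough CPU-side sketch of just the arithmetic (not the actual shader; the helper names are mine):

```python
import numpy as np

def build_sat(values):
    # Summed-area table: sat[y, x] = sum of values[0..y, 0..x] inclusive.
    return values.cumsum(axis=0).cumsum(axis=1)

def box_sum(sat, x0, y0, x1, y1):
    # Inclusive rectangle sum via four corner reads (the "SAT math" above).
    total = sat[y1, x1]
    if x0 > 0:
        total -= sat[y1, x0 - 1]
    if y0 > 0:
        total -= sat[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += sat[y0 - 1, x0 - 1]
    return total

data = np.random.rand(64, 64).astype(np.float32)
sat = build_sat(data)
# Average over an 8x8 filter footprint, as a shadow filter would:
print(box_sum(sat, 8, 8, 15, 15) / 64.0)
print(data[8:16, 8:16].mean())   # same value, computed directly
```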
> It'd be interesting to compare the shadow map filtering performance on 8600GTS versus 8800GTX to see how the ratios impact the fetch-count scaling that you're observing: TA:TF, ROP:bandwidth, bilinear-rate:zixel-rate ratios.

Yeah that would be quite interesting. I have an 8800GTX in my desktop and an 8600M in my laptop, but the performance delta between the two is so great as to make a comparison largely useless.
> Also, I'm mildly puzzled why you say fetch4 prolly wouldn't improve G80 performance much (or NVidia hardware in general)? Is that because you're always using PCF, anyway? Or because of the lower ALU:TEX ratio?

This is for general shader use actually (not just PCF), and I have no real numbers to back that up. It has just been my experience that I can throw *gobs* of texture lookups at NVIDIA hardware whereas on ATI I have to be careful with the *number* of lookups, although very wide formats (4xfp32) and Fetch4 are fine. I haven't done a direct comparison though so take that with a grain of salt.
> We are also hearing that some of the exclusive DX10 features that could enable unique and amazing effects DX9 isn't capable of just don't perform well enough on current hardware. Geometry shader heavy code, especially involving geometry amplification, does not perform equally well on all available platforms (and we're looking at doing some synthetic tests to help demonstrate this). The performance of some DX10 features is lacking to the point where developers are limited in how intensely they can use these new features.
> (...)
> we really need to see faster hardware before developers can start doing more impressive things with DirectX 10.

Those quotes prove my point. Anandtech didn't single out a specific card as being bad at any particular DX10 feature.
> The architecture, binding MCs and ROPs together (with L2), doesn't really give them much choice in this matter if they want a bus that's bigger than 256 bits. So, it's deliberate.

What's your point? You were claiming that "G80 is so wildly wasteful of its texturing and raster capabilities". I was demonstrating to you that it probably isn't.
> I wish there was more analysis of shadowing performance - shadowmap fillrate is the one area that G8x is deliciously fast in - it's here I can't tell whether the spec is "just right" or well over the top. Stalker's dynamic lighting seems to demonstrate justice for G8x ROPs, but I don't know.

I don't think shadowmap fillrate is much of a limitation on the high end cards. Even a 2048x2048 shadow map with 2x overdraw (remember that aerial views have low overdraw) would get filled in under a millisecond on R580, so that's 57fps vs. 60fps for an infinitely fast fillrate. R600 has 2.3x that rate, G80 much more. Triangle setup is usually the big cost for shadow maps.
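The fill-time arithmetic behind that, assuming R580 writes 16 Z samples per clock at 650 MHz (which is also roughly what makes R600's rate come out at the quoted 2.3x):

```python
zixels = 2048 * 2048 * 2            # 2048^2 shadow map with 2x overdraw
r580_z_rate = 16 * 650e6            # Z samples/clock * core clock (assumed)
fill_ms = zixels / r580_z_rate * 1e3
print(fill_ms)                      # ~0.81 ms, i.e. "under a millisecond"

# Impact on a 60 fps frame if that fill were instead free:
frame_ms = 1000 / 60
print(1000 / (frame_ms + fill_ms))  # ~57 fps vs. 60 fps
```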
> ATI tweaked drivers for R5xx so that no-AF performance suffered at the same time as AF performance improved. No-AF really isn't interesting, it should be always-on.

I was just talking about a test where we can take G80's 1:2 TA:TF advantage out of the equation. In such a scenario the GTS is still close to R600 in games, despite having much lower math ability, bandwidth, vertex setup, etc. It has equal texturing rate and faster AA fillrate (though the latter isn't true with 2xAA or w/ blending). These are the only things that would let the GTS catch R600, so saying "G80 is so wildly wasteful of its texturing and raster capabilities" is just hyperbole.
> HD2900XT is ~50% faster than X1950XT in the most texture-fetch bound gaming scenario I can find, when it's 14% faster theoretically (25%, 45% and 72% at the three resolutions). Trouble is, zixel rate prolly dominates (but HD2900XT has got 128% more zixel rate).

Don't forget hierarchical stencil. That makes a huge difference in Doom3. I don't think this game's performance is indicative of texture fetch performance at all.
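For reference, the theoretical figures in that quote fall straight out of clocks times unit counts, assuming 742 MHz, 16 bilinear TMUs and 32 Z per clock for the HD 2900 XT against 650 MHz, 16 TMUs and 16 Z per clock for R580 (i.e. X1950 XTX clocks):

```python
r600_bilinear = 16 * 742e6   # ~11.9 GTexels/s
r580_bilinear = 16 * 650e6   # ~10.4 GTexels/s
print(r600_bilinear / r580_bilinear - 1)   # ~0.14 -> "14% faster theoretically"

r600_zixel = 32 * 742e6      # ~23.7 Gzixels/s
r580_zixel = 16 * 650e6      # ~10.4 Gzixels/s
print(r600_zixel / r580_zixel - 1)         # ~1.28 -> "128% more zixel rate"
```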
> To be fair, TA rate is where I think G80 falls down - it's not fetch rate per se, but I'm wondering if it's effectively what you mean when you say fetch rate? What percentage bigger would G80 have been with double the TAs?

Effectively, yeah. I just avoided "TA rate" because ATI claims more than 16 per clock when really that isn't true for pixel shader texturing. Not sure what you mean by "G80 falls down", though. It's still 55% more than R600 for the GTX, and matches R600 in the GTS.
Why does G84 have a 1:1 TA:TF ratio, not 1:2? Is it because 32 TFs would have made it way too large, or is it because an extra 8 TAs (from 8 to 16) was low cost?
> All those >8bpc formats are a tiny proportion of frame rendering time. The "HDR" and "shadowing" impact on games isn't TMU-centric, it's ROP-centric and bandwidth-centric.

Right now, yeah, but ATI also wanted to avoid halving the filter rate for these formats, and they did it in a way that didn't have the auxiliary benefits of NVidia's approach. If both IHVs are doing this, they clearly think there's some future for these formats.
The "backwall" has little to do with AF load. The reason AF load per pixel drops is that all the near pixels don't need as much AF, so the "frontwall" (i.e. the point where at least 2xAF is needed) moves back. In any case, I believe that texture size will increase faster than resolution, and thus AF load per pixel will increase, all else being equal.The irony being that as screen resolutions gradually climb and game worlds "expand", the AF workload per screen pixel drops, because we're stuck at 16xAF - the "backwall" of AF is not receding with increased resolution. I think we need to go to 64x or higher...
Also, the gradual switch, with newer games, to developer-managed AF rather than CP-activated AF frees up a load of AF workload.
> In that D3 test I linked above, at 2560x1600, 8800GTS-640 is 38% faster than X1950XTX, with the same bandwidth. 8800GTX is 42% faster than 8800GTS-640, with 32% more bandwidth, 53% more bilinear rate and 38% more fillrate. So 8800GTS might be just feeling the first pinch of bandwidth limitation there and it might just be a texturing-bound test. At that resolution, HD2900XT is 24% faster, despite having 30% of the zixel rate and the same bilinear rate.

Again, there are too many variables to make any conclusion like that. D3 performance is also heavily dependent on early fragment rejection (Z and stencil), which is quite different between R600 and G80 (rate, granularity, corner cases, etc). Add in drivers (sometimes the GTX is more than 53% faster than the GTS!), memory efficiency, per-frame loads (resolution scaling isn't perfect), etc, and now there are a lot of unquantified factors from which you're trying to isolate the effect of just one.
> My theory is that G80 is TA-limited and the TMUs are woefully inefficient (or there's just too damn many of them for the bandwidth available). So I'm suggesting cutting the TFs back to 75% and doubling the TAs.

You have no evidence for such a theory. Texturing tests show that efficiency is just fine. Game tests have too many variables for you to blame TA ability.
> No, G8x has register-file bandwidth corner cases and hazards. And MAD+SF co-issue scheduling is much easier if you join two batches together (a warp) than if you try to schedule 16 object batches. All of this is statically schedulable as far as I can tell.

I still don't see why, for a warp size of 16, these would affect scalar/vec2 code more so than vec4 code. I understand how instruction issue rate changes, but not register related issues. Are you talking about a latency between writing a result and reading it again? That's easy to solve with a temp register caching the write port, so this is not an issue that's holding NVidia back from reducing warp size.
> Doubling the width of all those buses is hardly trivial! There are 3 buses feeding the ALU pipeline: RF, constant and PDC.

Yeah, it's pretty trivial. PDC might be different since I don't really know the details, but for a fixed number of ports in the RF and CB, doubling bus width is easy. An equally sized RF or CB with double the bus width and double the granularity simply halves the number of partitions you're selecting from and doubles some wires.
> A SIMD might be 16-wide (5-way), but the RF is prolly organised as 4 separate blocks, each block private to a set of 20 ALU pipes.

So R600 has 16 of these 20-ALU blocks. G80 has 16 arrays of 8 ALUs. Care to explain why it's so hard for NVidia to double the SIMD width next gen when it's still smaller than ATI this gen?
Fix what?
> Fix why it's so slow. I get it, drivers are a big issue... well, DX10 drivers should be an issue, but DX9 drivers should've been up there from the start. Very shoddy, AMD.

And you can't say that with a completely straight face either. There's nothing easy in writing drivers for new GPU architectures, and the opportunities to leverage expertise in writing a driver for an old arch on the same API aren't as wide ranging as you might think. It'd be nice to see AMD or NVIDIA publish a diagram or two of the driver stack that highlights the reality of the complexity they contain.
> You may be right, particularly for "simple" shadow implementations that take a small number of samples. I'm just relaying my experience.

Looking at the ATI PCF demonstration for R580:
> Yeah that would be quite interesting. I have an 8800GTX in my desktop and an 8600M in my laptop, but the performance delta between the two is so great as to make a comparison largely useless.

I realised afterwards that your comments were mostly directed at R580, so it may be that there wouldn't be much mileage in a comparison across the G8x family anyway.
> This is for general shader use actually (not just PCF), and I have no real numbers to back that up. It has just been my experience that I can throw *gobs* of texture lookups at NVIDIA hardware whereas on ATI I have to be careful with the *number* of lookups, although very wide formats (4xfp32) and Fetch4 are fine. I haven't done a direct comparison though so take that with a grain of salt.

8800GTX is 77% faster in theoretical bilinear rate than X1950XTX.
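Same back-of-the-envelope as earlier, assuming 32 bilinear filter results per clock at 575 MHz for the 8800 GTX versus 16 per clock at 650 MHz for X1950 XTX:

```python
g80_bilinear = 32 * 575e6    # ~18.4 GTexels/s
r580_bilinear = 16 * 650e6   # ~10.4 GTexels/s
print(g80_bilinear / r580_bilinear - 1)   # ~0.77 -> the 77% figure above
```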
> I'd love to get some hard data on how one can trade-off number of texture lookups and bandwidth, etc. but it seems like a particularly complicated function once register pressure and latency hiding are considered (and they really must be on a GPU).

I expect newer GPUs are more forgiving. R600 should have "practically zero" register pressure problems from what I can discern, and it has vast amounts of latency-hiding. R600's complication should be to do with instruction issue rate. This is NVidia's argument, that things like shadow buffers, being scalar, will always have their math run at full speed.
> R600 should have "practically zero" register pressure problems from what I can discern.

Why do you think that?
> ... is that R580's cache isn't too hot when it comes to supplying the same texels repeatedly, whereas G80 is good. See the 4-Component Floating Point Input Bandwidth test, with SGL access pattern (137GB/s for G80, 39GB/s for R580). I think R600 is radically better here (has anyone run GPUBench on R600?), which should make a huge difference.

Over 160 GB/s on R600 for the SGL part.
> Is it showing that R580's texture cache architecture is running into a wall? We know R600's texture cache architecture is designed for "better performance".

Yeah, that's interesting, and those results are certainly in line with what I've seen. Has anyone run that demo on R600? I'll maybe try if I get a chance at the office this week (it's DX9, right?).
> 8800GTX is 77% faster in theoretical bilinear rate than X1950XTX.

That comment was actually more directed at a comparison between G70 and R580, which is when I was doing more direct comparisons - since G80 there's really no competition unfortunately, excepting that G80's geometry shader does appear to be almost unusably slow (although I've not played with R600's yet).
> I expect newer GPUs are more forgiving. R600 should have "practically zero" register pressure problems from what I can discern, and it has vast amounts of latency-hiding. R600's complication should be to do with instruction issue rate.

Right. Well I should really make some time to play around with R600 this week and next to get a better idea of the capabilities. Unfortunately I can't run a lot of our applications/benchmarks because OpenGL is still pretty broken on ATI (as always - sigh), and it's currently in an XP machine. So basically all I have is DX9 stuff for now...