What's your point? You were claiming that "G80 is so wildly wasteful of its texturing and raster capabilities". I was demonstrating to you that it probably isn't.
This is getting there:
http://www.ixbt.com/video/itogi-video/0607/itogi-video-ss1-wxp-2560-pcie.html
but 8800GTX is 35% faster than HD2900XT, despite its 55% theoretical advantage in bilinear texturing.
I don't think shadowmap fillrate is much of a limitation on the high end cards. Even a 2048x2048 shadow map with 2x overdraw (remember that aerial views have low overdraw) would get filled in under a millisecond on R580, so that's 57fps instead of 60fps compared with an infinitely fast fillrate. R600 has 2.3x that rate, G80 much more. Triangle setup is usually the big cost for shadow maps.
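Back-of-the-envelope, in Python, assuming R580's z-only fill is simply 16 ROPs x 650MHz (the real z-only rate may well be higher, which only strengthens the point):

pixels = 2048 * 2048 * 2             # 2048x2048 shadow map, 2x overdraw
fill_rate = 16 * 650e6               # assumed z-only pixels per second on R580
fill_ms = pixels / fill_rate * 1000  # ~0.81 ms to fill the map
fps = 1.0 / (1.0 / 60 + fill_ms / 1000)
print(fill_ms, fps)                  # ~0.81 ms, ~57 fps instead of 60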
OK, so setup rate and texel fetch rate are the key bottlenecks, for current "simple PCF" type shadowing in games. It'd be nice to see some results that confirm this...
I was just talking about a test where we can take G80's 1:2 TA:TF advantage out of the equation. In such a scenario the GTS is still close to R600 in games, despite having much lower math ability, bandwidth, vertex setup, etc.
Math is moot: it's hard to find a game that's math-bound (certain driving games known for their high math load rarely seem to be benchmarked). Bandwidth is also moot, because without AA/AF the bottleneck moves somewhere else, e.g. CPU or zixel rate. Though that SS2 test I linked above hints that bandwidth may be very important - I dunno what's happening with fillrate there, either.
It's also why I introduced the G84/RV630 comparison, since G84 is 1:1. But looking at how the two couplings (G80 v R600 and G84 v RV630) scale against each other makes me think the results are sabotaged by drivers. 8600GTS holds a bigger theoretical advantage over HD2600XT-GDDR4 (69% bilinear, 69% colour fillrate) than 8800GTX does over HD2900XT (55% bilinear, 16% colour fillrate), yet it's the latter coupling that shows the bigger performance margin in games.
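For reference, those percentages fall straight out of the usual published unit counts and clocks (treat the figures as assumptions):

def advantage(a, b):
    return round((a / b - 1) * 100)

# bilinear texel rate = address units * core clock
print(advantage(16 * 675e6, 8 * 800e6))    # 8600GTS vs HD2600XT-GDDR4: ~69%
print(advantage(32 * 575e6, 16 * 742e6))   # 8800GTX vs HD2900XT: ~55%
# colour fillrate = ROPs * core clock
print(advantage(8 * 675e6, 4 * 800e6))     # ~69%
print(advantage(24 * 575e6, 16 * 742e6))   # ~16%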
It could be that G84's ALU:TEX ratio, being effectively lower for bilinear than G80, is running into the "not enough latency hiding" problem you mentioned earlier. Whoops. Not enough objects in flight. Yet those patents imply that math throughput in G84 should be higher than in G80.
Don't forget hierarchical stencil. That makes a huge difference in Doom3. I don't think this game's performance is indicative of texture fetch performance at all.
It was the closest I could find with AF/AA off
I don't know enough about SS2's engine to analyse the new data.
Effectively, yeah. I just avoided "TA rate" because ATI claims more than 16 per clock when really that isn't true for pixel shader texturing.
Yeah, point filtering with existing games seems to be moot, except perhaps when dealing with shader-filtered shadows (rather than PCF). Which prolly only ever happens on ATI hardware because of NVidia's historical advantage with PCF.
Not sure what you mean by "G80 falls down", though. It's still 55% more than R600 for the GTX, and matches R600 in the GTS.
I'm thinking of the percentage of the time that 32 of G80's TFs are idle because the GPU is doing bilinear or point-sampling. We also don't know if vertex texture fetch is performed using the TMU hardware in G80...
Everyone's made a ballyhoo about the "100% utilisation" of the ALU pipeline in G80 (which is far from true - hello SFU). I think TMUs prolly use up dramatically more die space, but hey, I don't know of any decent NVidia GPU die shots.
For G84, I think it's partly because reviewers don't enable AF for cards of that budget as consistently as they do for the high end cards (or at least that's what NVidia thought at design time).
It's got a vast amount of theoretical filtering rate, 91% of HD2900XT for bilinear and 82% more trilinear
It seems like people really want to run at native LCD resolutions, even if they have to disable AF/AA (which I think is stupid, but whatever).
Native LCD resolutions do seem to have mucked-up the stratification of GPUs, what with 1280x1024 being practically "entry-level" for LCDs. But, yeah, that's a whole other topic.
Right now, yeah, but ATI also wanted to avoid halving the filter rate for these formats, and they did it in a way that didn't have the auxiliary benefits of NVidia's approach. If both IHVs are doing this, they clearly think there's some future for these formats.
Right now I'm thinking ATI's ratio is the right way. Xmas's point about RGBE, which is also relevant to sRGB-aware texture filtering (sRGB will presumably become the majority of the usage once devs embrace the concept), is fairly important. But those are both D3D10-specific arguments for 32-bits per texel formats, which need "fp16" capable TUs.
G80's TA:TF ratio just seems to be a matter of circumstance. I'd be surprised to see it retained in the future.
The "backwall" has little to do with AF load.
I disagree. Driving games, for example, already show that AF runs out well short of the view distance. Though you might argue that this only affects a relatively small portion of the screen area (one that grows perceptually as resolution increases) - it's just a pity that it's where you're looking...
The reason AF load per pixel drops is that all the near pixels don't need as much AF, so the "frontwall" (i.e. the point where at least 2xAF is needed) moves back. In any case, I believe that texture size will increase faster than resolution, and thus AF load per pixel will increase, all else being equal.
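A toy ground-plane model of what I mean (camera height, focal length and texel density are all made-up numbers, purely to show the trend):

from math import sqrt

# A pixel looking at the ground at distance d covers roughly T*d*d/(f*h) texels along the
# view direction and T*d/f across it, so "2xAF needed" starts about where the long axis
# spans 2 texels.
f, h = 1000.0, 2.0                    # focal length in pixels, eye height in metres (made up)
for T in (256, 512):                  # ground texture density, texels per metre
    frontwall = sqrt(2 * f * h / T)   # distance where a pixel spans ~2 texels lengthwise
    print(T, round(frontwall, 1))     # 256 -> ~4.0m, 512 -> ~2.8m
# Doubling texel density pulls the frontwall closer, so more of the frame needs AF work
# even at the same screen resolution.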
I agree about the front wall, and that's where a dramatic difference due to higher-resolution texturing appears. And, yeah, screen resolution has hit 2560x1600 and it seems like it'll be years before we get better than that.
However, your second point makes sense, so all else is not equal. Assuming they enable AF selectively (say, on just the base texture), you're probably right about AF performance playing a lesser role.
And deferred rendering will also chip away at this issue, with overdraw falling off dramatically.
I also wonder if MegaTexturing changes the landscape of AF load
Again, there are too many variables to draw any conclusion like that. D3 performance is also heavily dependent on early fragment rejection (Z and stencil), which is quite different between R600 and G80 (rate, granularity, corner cases, etc). Add in drivers (sometimes the GTX is more than 53% faster than the GTS!), memory efficiency, per-frame loads (resolution scaling isn't perfect), etc., and you have a lot of unquantified factors from which you're trying to isolate the effect of just one.
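(That 53% is the biggest single theoretical gap between GTX and GTS, the texturing one - unit counts and clocks below are the published figures, so take them as assumptions:)

gtx, gts = 575e6, 500e6                           # core clocks
print(round((8 * 4 * gtx) / (6 * 4 * gts), 2))    # bilinear TA rate, 8 vs 6 clusters: ~1.53
print(round((128 * 1350e6) / (96 * 1200e6), 2))   # ALU: ~1.5
print(round((24 * gtx) / (20 * gts), 2))          # colour fill: ~1.38
print(round(86.4 / 64, 2))                        # bandwidth, GB/s: ~1.35
# So a GTX lead beyond ~53% can't come from any single unit's throughput.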
Hence my general displeasure with the "analysis" in the reviews out there, though ATI's drivers are a major source of grief, making analysis really difficult. Oh well.
You have no evidence for such a theory. Texturing tests show that efficiency is just fine.
http://www.techreport.com/reviews/2007q2/radeon-hd-2900xt/index.x?pg=4
I'm not sure of the quality of the 3DMk06 single-texture test though.
Game tests have too many variables for you to blame TA ability.
Game tests show that 8800GTS is pissing its TF (and fillrate) up the wall, while HD2900XT is pissing its bandwidth up the wall
Perhaps this is why R6xx appears to be scaling down better - even though the advantage should lie with G8x based on theoreticals? All R6xx variants seem to have an "excess" of bandwidth against the competing NVidia parts.
Are you complaining about why the filtering units aren't saturated? Is that what this whole rant is about? You're complaining about why the GTX isn't 3x the speed of R600 in games???
Who cares? There's no reason to expect that. They're there to reduce the performance hit of high TF scenarios.
Relatively, they're expensive to build and fly in the face of "high utilisation" as a design goal. As I've said before, I think G80 just came out the way it did through "least resistance". Off-die NVIO is clearly a bugbear. My suggestions for alternative configurations (TA:TF 1:1 or 12 clusters) seem to be the victims of die size one way or another. Die size will be much less of a problem in the new variants, I trust.
I still don't see why, for a warp size of 16, these would affect scalar/vec2 code more so than vec4 code. I understand how instruction issue rate changes, but not register related issues. Are you talking about a latency between writing a result and reading it again? That's easy to solve with a temp register caching the write port, so this is not an issue that's holding NVidia back from reducing warp size.
It's true that's a solution (to latency), something you'll find in Xenos and R600. Wasn't one of the recent patents pointing at that? Anyway, the signs are that warp size will stay at 32 - it's my suspicion that batch size will increase from 16 to 32 (G80's warp is two batches married).
It's also about co-issue - MAD and SF co-issue breaks apart if you only have 16 objects (one batch, half a warp). SFs last for 4 or 8 clocks, and co-issue becomes problematic if you run out of issuable instructions on the MAD pipe, which is more likely when the MAD operands are scalars, not vec4s.
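Counting clocks (SF duration and batch size as above; the rest is my assumption about how issue works):

sf_clocks = 8                 # an SF instruction occupies the SFU for 4 or 8 clocks
mad_clocks_per_16 = 2         # 16 objects over an 8-wide MAD SIMD
print(sf_clocks // mad_clocks_per_16)        # 4 independent MADs needed to cover the SF
print(sf_clocks // (2 * mad_clocks_per_16))  # only 2 needed with a full 32-object warp
# Scalar code has less instruction-level parallelism to draw on, so the 16-object case
# runs out of issuable MADs sooner and the co-issue falls apart.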
Finally, it's about RF address rate. If you have lots of scalar operands, then the RF is being pushed to access more addresses than it can handle. A 2-clock instruction (a single 16-pixel batch) requiring 3 operands is requesting 3 different addresses in 2 clocks.
A 4-clock instruction (2 batches in a warp, so 32 pixels) requiring 3 operands is requesting 3 different addresses in 4 clocks.
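Or, per clock:

operands = 3
print(operands / 2)   # 16-object batch, 2-clock instruction: 1.5 RF addresses per clock
print(operands / 4)   # 32-object warp, 4-clock instruction: 0.75 addresses per clock
# Halving the warp doubles the address rate the RF has to sustain.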
Yeah, it's pretty trivial.
32 scalar operands, each of 32 bits, that's 1024 bits.
PDC might be different since I don't really know the details, but for a fixed number of ports in the RF and CB, doubling bus width is easy. An equally sized RF or CB with double the bus width and double the granularity simply halves the number of partitions you're selecting from and doubles some wires.
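The wire counts are just lanes x 32 bits per source operand, so the figures below merely restate the 1024-bit number above:

for scalar_operands_per_clock in (16, 32):
    print(scalar_operands_per_clock * 32)   # 512 -> 1024 bits per operand bus
print(3 * 32 * 32)                          # 3 MAD source operands at 32 per clock: 3072 bits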
RF is banked. I think it's 16-banked in G80. Same for CB and PDC I expect. Each of these 3 data sources is presumably feeding an operand re-order buffer.
Also, you have to double the size of the RF (you can't keep the size constant), because you want to double the number of objects in flight, since your ALU pipe is now chewing through them twice as fast. Otherwise you've just lost half your latency-hiding.
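It's basically Little's law (the latency figure is illustrative):

latency = 200                           # clocks to hide, e.g. a texture fetch (made up)
for objects_per_clock in (8, 16):       # ALU throughput before/after doubling SIMD width
    print(latency * objects_per_clock)  # 1600 -> 3200 objects in flight needed
# Twice the throughput needs twice the objects in flight, hence twice the RF, to hide
# the same latency.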
Doubling SIMD width and warp size is way cheaper than doubling the number of arrays. I can't see why you'd think otherwise.
I'm sceptical about the feasibility of the 3x1024-bit buses and the doubled scale of the operand re-ordering "crossbar" (doubling its bank/port count I guess, erm...). 4-clock instructions in the ALU pipe are even cheaper to implement, I would say...
So R600 has 16 of these 20-ALU blocks. G80 has 16 groups of 8-ALU array. Care to explain why it's so hard for NVidia to double the SIMD width next gen when it's still smaller than ATI this gen?
I don't know R600's RF fetch rate. At minimum it's presumably 640-bits wide (20x32) into the 8KB cache (which is used as an operand reordering buffer) - or at least a quarter of the 8KB cache, presumably dedicated to those 20 pipes. It might be double that, of course, 1280 bits.
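Just the widths that implies:

print(20 * 32)        # one operand per lane for a 20-ALU block: 640 bits per clock
print(2 * 20 * 32)    # two operands per clock: 1280 bits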
Jawed