HSR mechanisms - as evident in the fact that those huge advantages only show up in outdoor environments.
Are there any HSR performance figures across the different generations? Would be interesting to see.
I only read the page you linked to (conclusion), but didn't see anything that singled out the HD 2900XT as you write above. They didn't like the performance of any cards.
If you're referring to my post: Oblivion's outdoor areas tax Nvidia GPUs up until G80 heavily, regardless of whether HDR or Bloom is selected.
> It should be noted that Xenos has 4 blend units per ROP, and we don't know how many G80 has. It may only have 1, because if the samples aren't identical (and hence not compressed) it'll take a while to transfer the data anyway. Also, I think G80 has half-speed blend like NV40. The 3DMark numbers are showing precisely 12 pixels per clock (6.9 GTex/s). That will save a lot of transistors.

And presuming that NVidia hardware still doesn't do AA resolve in its ROPs (but somewhere in the display chain), there's even less need for lots of blending, I guess.
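As a quick sanity check on the pixels-per-clock reading in the quote above, a minimal sketch assuming the 8800 GTX's 575 MHz core clock:

```python
# 3DMark's ~6.9 G/s fill figure on an 8800 GTX at an assumed 575 MHz core
# clock works out to ~12 pixels per clock, as stated above.
fill_rate = 6.9e9        # per-second rate from the 3DMark figure quoted
core_clock = 575e6       # Hz, 8800 GTX core clock (assumed)
print(fill_rate / core_clock)    # ~12.0 pixels per clock
```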
> Nonetheless, let's assume you're right with 100M. That's still in line with what I was talking about. Suppose 16 DX10 ROPs doing 2 samples per clock cost 40M transistors. I think that's fair. 60M additional transistors is under 10% of G80, and though I can never prove it, I think G80 has benefitted more than 10% in AA performance (the only performance that counts for this market segment). I really don't think they've been wasteful here.

The architecture, binding MCs and ROPs together (with L2), doesn't really give them much choice in this matter if they want a bus that's bigger than 256 bits. So, it's deliberate.
> Even with AF disabled, R580 was very often only marginally faster than R520. (Trilinear has always had a low impact with the mipmap optimizations we've seen, so bilinear rate = fetch rate for all intents and purposes.)

ATI tweaked drivers for R5xx so that no-AF performance suffered at the same time as AF performance improved. No-AF really isn't interesting, it should be always-on.
> Originally, yes. I thought it was wasteful because you only need a bit more hardware to make double speed bilinear (it's probably even a greater fraction than what's shown by that chart once you take the sample fetching, cache, and MC into account). However, then I realized that you would need twice the threads and registers to truly double bilinear texturing speed. That was a pretty big oversight in my original analysis. Moreover, I did not consider that it would also double the single mipmap speed of FP16, I16, FP32, other >8bpc HDR formats, and 8bpc volume textures.

All those >8bpc formats are a tiny proportion of frame rendering time. The "HDR" and "shadowing" impact on games isn't TMU-centric, it's ROP-centric and bandwidth-centric.
> That's true, and I've brought that point up many times myself too. I wasn't saying AF is the biggest cause of dips in framerate, but just suggesting that it's a load that could wildly fluctuate when everything else stays the same, especially if texture resolutions start going up the way we all want them to.

The irony being that as screen resolutions gradually climb and game worlds "expand", the AF workload per screen pixel drops, because we're stuck at 16xAF - the "backwall" of AF is not receding with increased resolution. I think we need to go to 64x or higher...
> Even if you disable AF to almost negate the filtering advantage, the 8800GTS-640 is not far from the HD2900XT. With less fillrate (well, w/o AA only I guess), way less bandwidth, and way less math ability, you have to think the equal texturing rate has something to do with it, right?

You've prolly got a point with 2003 and earlier games, but z fillrate seems dominant in all recent game tests and math is nowhere to be seen. Trying to find a game that's simultaneously bilinear-texturing and bandwidth-bound today has got to be a real chore.
> But that would mean 48 bilinear TMUs! Weren't you just saying fetch rate is less important than filtering rate?

My theory is that G80 is TA-limited and the TMUs are woefully inefficient (or there's just too damn many of them for the bandwidth available). So I'm suggesting cutting the TFs back to 75% and doubling the TAs.
> Remember how the original Geforce could do single cycle trilinear (i.e. 2 filtered quads per clock) in 15M transistors. It cost NVidia 10M transistors to double the addressing ability and double the pixels in flight for the Geforce 2. Though I suspect they increased texture blending math and I guess increased clock speed had some cost as well.

That's way before my time :smile:
> I'm not following you on the impact of pixel shader vs. vertex shader on the register file. In a scalar architecture it doesn't really matter.

No, G8x has register-file bandwidth corner cases and hazards. And MAD+SF co-issue scheduling is much easier if you join two batches together (a warp) than if you try to schedule 16 object batches. All of this is statically schedulable as far as I can tell.
> You store your data in a SOA manner in the RF. Parallel MADDs still need to load the same 8 floats for each of 3 operands as with serial MADDs. The clock count between instruction changes makes sense, though. A vec4 op on a batch of 16 would use the same instruction for 8 clocks on 8-wide SIMD arrays.

We're in a grey area here because NVidia won't enumerate the register file limitations, "if the performance is weird and you've tried everything else it's prolly the register file".
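Spelling out the vec4 issue-rate point in the quote above, a minimal sketch assuming a batch of 16 objects on an 8-wide SIMD array, one lane-operation per clock:

```python
# Each instruction is repeated until every component of every object in the
# batch has gone through the 8-wide SIMD array.
batch_size = 16   # objects per batch (assumed, per the discussion above)
simd_width = 8    # ALU lanes per array

def clocks_per_instruction(components):
    return batch_size * components // simd_width

print(clocks_per_instruction(4))  # vec4 op: 8 clocks on the same instruction
print(clocks_per_instruction(1))  # scalar op: 2 clocks
```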
> Also, making the register file wider isn't nearly as costly as making it bigger. The number of operands is still the same.

Doubling the width of all those buses is hardly trivial! There are 3 buses feeding the ALU pipeline: RF, constant and PDC.
> One quarter of R600 fetches data for 80 SPs each clock. The sixteen arrays in G80 only fetch data for 8 SPs. Anyway, regardless of how you approach it, it would definitely be more costly to keep the warp size the same and double the math instead of doubling both.

A SIMD might be 16-wide (5-way), but the RF is prolly organised as 4 separate blocks, each block private to a set of 20 ALU pipes. I say that because each "private quarter-RF" gets its results from just one quad-TU - the timings of RF writes by the quad-TU should be independent across the set of 4 quad-TUs that are all producing texture results for a batch in one 16-SIMD.
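For reference, the per-quarter and per-array SP counts being compared in this exchange, taking the organisations described here at face value (R600 as 4 SIMD arrays of 16 pipes x 5-way ALUs, G80 as 16 arrays of 8 scalar SPs):

```python
# R600: 4 SIMD arrays, each 16 pipes wide with 5-way ALUs.
r600_quarter = 16 * 5             # SPs fed by one quarter (one array) = 80
r600_total = 4 * r600_quarter     # 320 SPs

# G80: 16 SIMD arrays of 8 scalar SPs each.
g80_per_array = 8
g80_total = 16 * g80_per_array    # 128 SPs

print(r600_quarter, g80_per_array)   # 80 vs. 8, the figures quoted above
```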
> All those >8bpc formats are a tiny proportion of frame rendering time. The "HDR" and "shadowing" impact on games isn't TMU-centric, it's ROP-centric and bandwidth-centric.

I'm not totally convinced of that... particularly with 1-component formats (like standard shadow mapping) I find myself more often limited by total *number* of texture fetches rather than bandwidth, etc. Even on cards with stupidly large amounts of bandwidth, performance is barely increased, whereas performing fewer texture fetches generally improves performance much more, even if the same amount of data is being fetched.
> I'm not totally convinced of that... particularly with 1-component formats (like standard shadow mapping) I find myself more often limited by total *number* of texture fetches rather than bandwidth, etc. Even on cards with stupidly large amounts of bandwidth, performance is barely increased, whereas performing fewer texture fetches generally improves performance much more, even if the same amount of data is being fetched.

This is true. Shader and texture units are designed to hide latency. Fetching one component when the HW is designed to cover the latency of four likely won't make much difference in performance. Also note that, in general, texture lookups in the shader can't be optimized whereas math can be.
> I don't claim to completely understand what's going on with the hardware here, but it is annoying to me that the *number* of texture fetches is often so important - it really restricts one's choice of algorithms.

See above.
> I'm not totally convinced of that... particularly with 1-component formats (like standard shadow mapping) I find myself more often limited by total *number* of texture fetches rather than bandwidth, etc. Even on cards with stupidly large amounts of bandwidth, performance is barely increased, whereas performing fewer texture fetches generally improves performance much more, even if the same amount of data is being fetched.

Ah, I was under the impression that shadow map resolution and number of lights/radius/direction is causing a fillrate bottleneck in games with shadows and that's the primary bottleneck for shadowing.
> This is particularly evident with Fetch4... if it was just about bandwidth, there would be no need for Fetch4 at all. However the performance difference that it makes is quite pronounced, although my guess is that it would be less so on NVIDIA hardware (perhaps why they don't support it).

I'd forgotten about fetch4. Do you have an HD2900XT now or is this on X1k?
> I don't claim to completely understand what's going on with the hardware here, but it is annoying to me that the *number* of texture fetches is often so important - it really restricts one's choice of algorithms.

When a texture fetch is performed, it's going to populate texture cache with neighbouring shadow map samples anyway. So it seems to me the difference you're seeing with fetch4 is purely utilisation of texture addressing and fetch pipes. Er, stating the obvious.
> Potentially stupid question: do PCF and fetch4 run at the same speed (if you have an HD2900XT) - I presume it's the same number of fetches, and then it's down to the ALU capability of HD2900XT, which I presume is fast enough to make fetch4 fetch-bound.

It would depend on the shader. If, with PCF, the shader is exactly balanced between ALU and TEX instructions, then swapping the PCF lookup(s) for FETCH4 could incur a performance hit. This could also be true if the original PCF shader was ALU-biased.
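A toy issue-balance model of that argument (the instruction counts are made up purely for illustration): PCF does the 2x2 weighting inside the TMU, while Fetch4 returns the four raw depths and leaves the weighting to the ALUs, so an already ALU-bound shader can come out slower while a fetch-count-bound one comes out faster.

```python
# Toy model: the shader is limited by whichever of ALU or TEX instruction
# counts is larger, assuming perfect overlap and one instruction per clock
# on each side. All counts are illustrative, not measured.
def shader_cost(alu_instr, tex_instr):
    return max(alu_instr, tex_instr)

# Balanced shader, one shadow tap:
print(shader_cost(alu_instr=8, tex_instr=8))      # PCF tap:    8 clocks
print(shader_cost(alu_instr=8 + 3, tex_instr=8))  # Fetch4 tap: 11 clocks (hit)

# Fetch-heavy shader needing 16 single-channel point samples:
# Fetch4 returns a 2x2 block per fetch, so only 4 fetches are issued.
print(shader_cost(alu_instr=8, tex_instr=16))     # point sampled: 16 clocks
print(shader_cost(alu_instr=8, tex_instr=4))      # with Fetch4:    8 clocks
```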
> Ah, I was under the impression that shadow map resolution and number of lights/radius/direction is causing a fillrate bottleneck in games with shadows and that's the primary bottleneck for shadowing.

You may be right, particularly for "simple" shadow implementations that take a small number of samples. I'm just relaying my experience.
> I'd forgotten about fetch4. Do you have an HD2900XT now or is this on X1k?

This is on an X1k. We just got a HD2900XT at the office so I'll get to mess with it a bit at some point, but unfortunately it's in an XP machine right now so no D3D10.
> When a texture fetch is performed, it's going to populate texture cache with neighbouring shadow map samples anyway. So it seems to me the difference you're seeing with fetch4 is purely utilisation of texture addressing and fetch pipes. Er, stating the obvious.

Makes sense. Actually the application in which I saw the largest difference was when fetching bilinear samples from four corners of a summed-area table. (Fetch4 at each corner, SAT math for the four rectangles, then bilinearly weight the sums). It was almost worth splitting into *two* textures for VSM (and paying the MRT and SAT generation pass costs) just to get the use of Fetch4.
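For anyone who hasn't met the trick: the sum over any rectangle of a summed-area table comes from four corner reads, and the Fetch4 win is that each corner's 2x2 neighbourhood arrives in one fetch. A rough CPU-side sketch of just the arithmetic (not the actual shader; the helper names are mine):

```python
import numpy as np

def build_sat(values):
    # Summed-area table: sat[y, x] = sum of values[0..y, 0..x] inclusive.
    return values.cumsum(axis=0).cumsum(axis=1)

def box_sum(sat, x0, y0, x1, y1):
    # Inclusive rectangle sum via four corner reads (the "SAT math" above).
    total = sat[y1, x1]
    if x0 > 0:
        total -= sat[y1, x0 - 1]
    if y0 > 0:
        total -= sat[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += sat[y0 - 1, x0 - 1]
    return total

data = np.random.rand(64, 64).astype(np.float32)
sat = build_sat(data)
# Average over an 8x8 filter footprint, as a shadow filter would:
print(box_sum(sat, 8, 8, 15, 15) / 64.0)
print(data[8:16, 8:16].mean())   # same value, computed directly
```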
> It'd be interesting to compare the shadow map filtering performance on 8600GTS versus 8800GTX to see how the ratios impact the fetch-count scaling that you're observing: TA:TF, ROP:bandwidth, bilinear-rate:zixel-rate ratios.

Yeah that would be quite interesting. I have an 8800GTX in my desktop and an 8600M in my laptop, but the performance delta between the two is so great as to make a comparison largely useless.
> Also, I'm mildly puzzled why you say fetch4 prolly wouldn't improve G80 performance much (or NVidia hardware in general)? Is that because you're always using PCF, anyway? Or because of the lower ALU:TEX ratio?

This is for general shader use actually (not just PCF), and I have no real numbers to back that up. It has just been my experience that I can throw *gobs* of texture lookups at NVIDIA hardware whereas on ATI I have to be careful with the *number* of lookups, although very wide formats (4xfp32) and Fetch4 are fine. I haven't done a direct comparison though so take that with a grain of salt.
> We are also hearing that some of the exclusive DX10 features that could enable unique and amazing effects DX9 isn't capable of just don't perform well enough on current hardware. Geometry shader heavy code, especially involving geometry amplification, does not perform equally well on all available platforms (and we're looking at doing some synthetic tests to help demonstrate this). The performance of some DX10 features is lacking to the point where developers are limited in how intensely they can use these new features.
> (...)
> we really need to see faster hardware before developers can start doing more impressive things with DirectX 10.

Those quotes prove my point. Anandtech didn't single out a specific card as being bad at any particular DX10 feature.
> The architecture, binding MCs and ROPs together (with L2), doesn't really give them much choice in this matter if they want a bus that's bigger than 256 bits. So, it's deliberate.

What's your point? You were claiming that "G80 is so wildly wasteful of its texturing and raster capabilities". I was demonstrating to you that it probably isn't.
> I wish there was more analysis of shadowing performance - shadowmap fillrate is the one area that G8x is deliciously fast in - it's here I can't tell whether the spec is "just right" or well over the top. Stalker's dynamic lighting seems to demonstrate justice for G8x ROPs, but I don't know.

I don't think shadowmap fillrate is much of a limitation on the high end cards. Even a 2048x2048 shadow map with 2x overdraw (remember that aerial views have low overdraw) would get filled in under a millisecond on R580, so that's 57fps vs. 60fps for an infinitely fast fillrate. R600 has 2.3x that rate, G80 much more. Triangle setup is usually the big cost for shadow maps.
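The fill-time arithmetic behind that, assuming R580 writes 16 Z samples per clock at 650 MHz (which is also roughly what makes R600's rate come out at the quoted 2.3x):

```python
zixels = 2048 * 2048 * 2            # 2048^2 shadow map with 2x overdraw
r580_z_rate = 16 * 650e6            # Z samples/clock * core clock (assumed)
fill_ms = zixels / r580_z_rate * 1e3
print(fill_ms)                      # ~0.81 ms, i.e. "under a millisecond"

# Impact on a 60 fps frame if that fill were instead free:
frame_ms = 1000 / 60
print(1000 / (frame_ms + fill_ms))  # ~57 fps vs. 60 fps
```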
> ATI tweaked drivers for R5xx so that no-AF performance suffered at the same time as AF performance improved. No-AF really isn't interesting, it should be always-on.

I was just talking about a test where we can take G80's 1:2 TA:TF advantage out of the equation. In such a scenario the GTS is still close to R600 in games, despite having much lower math ability, bandwidth, vertex setup, etc. It has equal texturing rate and faster AA fillrate (though the latter isn't true with 2xAA or w/ blending). These are the only things that would let the GTS catch R600, so saying "G80 is so wildly wasteful of its texturing and raster capabilities" is just hyperbole.
> HD2900XT is ~50% faster than X1950XT in the most texture-fetch bound gaming scenario I can find, when it's 14% faster theoretically (25%, 45% and 72% at the three resolutions). Trouble is, zixel rate prolly dominates (but HD2900XT has got 128% more zixel rate).

Don't forget hierarchical stencil. That makes a huge difference in Doom3. I don't think this game's performance is indicative of texture fetch performance at all.
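For reference, the theoretical figures in that quote fall straight out of clocks times unit counts, assuming 742 MHz, 16 bilinear TMUs and 32 Z per clock for the HD 2900 XT against 650 MHz, 16 TMUs and 16 Z per clock for R580 (i.e. X1950 XTX clocks):

```python
r600_bilinear = 16 * 742e6   # ~11.9 GTexels/s
r580_bilinear = 16 * 650e6   # ~10.4 GTexels/s
print(r600_bilinear / r580_bilinear - 1)   # ~0.14 -> "14% faster theoretically"

r600_zixel = 32 * 742e6      # ~23.7 Gzixels/s
r580_zixel = 16 * 650e6      # ~10.4 Gzixels/s
print(r600_zixel / r580_zixel - 1)         # ~1.28 -> "128% more zixel rate"
```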
> To be fair, TA rate is where I think G80 falls down - it's not fetch rate per se, but I'm wondering if it's effectively what you mean when you say fetch rate? What percentage bigger would G80 have been with double the TAs?

Effectively, yeah. I just avoided "TA rate" because ATI claims more than 16 per clock when really that isn't true for pixel shader texturing. Not sure what you mean by "G80 falls down", though. It's still 55% more than R600 for the GTX, and matches R600 in the GTS.
Why does G84 have a 1:1 TA:TF ratio, not 1:2? Is it because 32 TFs would have made it way too large, or is it because an extra 8 TAs (from 8 to 16) was low cost?
> All those >8bpc formats are a tiny proportion of frame rendering time. The "HDR" and "shadowing" impact on games isn't TMU-centric, it's ROP-centric and bandwidth-centric.

Right now, yeah, but ATI also wanted to avoid halving the filter rate for these formats, and they did it in a way that didn't have the auxiliary benefits of NVidia's approach. If both IHVs are doing this, they clearly think there's some future for these formats.
The "backwall" has little to do with AF load. The reason AF load per pixel drops is that all the near pixels don't need as much AF, so the "frontwall" (i.e. the point where at least 2xAF is needed) moves back. In any case, I believe that texture size will increase faster than resolution, and thus AF load per pixel will increase, all else being equal.The irony being that as screen resolutions gradually climb and game worlds "expand", the AF workload per screen pixel drops, because we're stuck at 16xAF - the "backwall" of AF is not receding with increased resolution. I think we need to go to 64x or higher...
Also, the gradual switch, with newer games, to developer-managed AF rather than CP-activated AF frees up a load of AF workload.
> In that D3 test I linked above, at 2560x1600, 8800GTS-640 is 38% faster than X1950XTX, with the same bandwidth. 8800GTX is 42% faster than 8800GTS-640, with 32% more bandwidth, 53% more bilinear rate and 38% more fillrate. So 8800GTS might be just feeling the first pinch of bandwidth limitation there and it might just be a texturing-bound test. At that resolution, HD2900XT is 24% faster, despite having 30% of the zixel rate and the same bilinear rate.

Again, there are too many variables to make any conclusion like that. D3 performance is also heavily dependent on early fragment rejection (Z and stencil), which is quite different between R600 and G80 (rate, granularity, corner cases, etc). Add in drivers (sometimes the GTX is more than 53% faster than the GTS!), memory efficiency, per-frame loads (resolution scaling isn't perfect), etc, and now there are a lot of unquantified factors from which you're trying to isolate the effect of just one.
> My theory is that G80 is TA-limited and the TMUs are woefully inefficient (or there's just too damn many of them for the bandwidth available). So I'm suggesting cutting the TFs back to 75% and doubling the TAs.

You have no evidence for such a theory. Texturing tests show that efficiency is just fine. Game tests have too many variables for you to blame TA ability.
> No, G8x has register-file bandwidth corner cases and hazards. And MAD+SF co-issue scheduling is much easier if you join two batches together (a warp) than if you try to schedule 16 object batches. All of this is statically schedulable as far as I can tell.

I still don't see why, for a warp size of 16, these would affect scalar/vec2 code more so than vec4 code. I understand how instruction issue rate changes, but not register related issues. Are you talking about a latency between writing a result and reading it again? That's easy to solve with a temp register caching the write port, so this is not an issue that's holding NVidia back from reducing warp size.
> Doubling the width of all those buses is hardly trivial! There are 3 buses feeding the ALU pipeline: RF, constant and PDC.

Yeah, it's pretty trivial. PDC might be different since I don't really know the details, but for a fixed number of ports in the RF and CB, doubling bus width is easy. An equally sized RF or CB with double the bus width and double the granularity simply halves the number of partitions you're selecting from and doubles some wires.
> A SIMD might be 16-wide (5-way), but the RF is prolly organised as 4 separate blocks, each block private to a set of 20 ALU pipes.

So R600 has 16 of these 20-ALU blocks. G80 has 16 arrays of 8 ALUs. Care to explain why it's so hard for NVidia to double the SIMD width next gen when it's still smaller than ATI this gen?
Fix what?
> Fix why it's so slow. I get it, drivers are a big issue... well, DX10 drivers should be an issue, but DX9 drivers should've been up there from the start. Very shoddy, AMD.

And you can't say that with a completely straight face either. There's nothing easy in writing drivers for new GPU architectures, and the opportunities to leverage expertise in writing a driver for an old arch on the same API aren't as wide ranging as you might think. It'd be nice to see AMD or NVIDIA publish a diagram or two of the driver stack that highlights the reality of the complexity they contain.
> You may be right, particularly for "simple" shadow implementations that take a small number of samples. I'm just relaying my experience.

Looking at the ATI PCF demonstration for R580:
> Yeah that would be quite interesting. I have an 8800GTX in my desktop and an 8600M in my laptop, but the performance delta between the two is so great as to make a comparison largely useless.

I realised afterwards that your comments were mostly directed at R580, so it may be that there wouldn't be much mileage in a comparison across the G8x family anyway.
> This is for general shader use actually (not just PCF), and I have no real numbers to back that up. It has just been my experience that I can throw *gobs* of texture lookups at NVIDIA hardware whereas on ATI I have to be careful with the *number* of lookups, although very wide formats (4xfp32) and Fetch4 are fine. I haven't done a direct comparison though so take that with a grain of salt.

8800GTX is 77% faster in theoretical bilinear rate than X1950XTX.
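Same back-of-the-envelope as earlier, assuming 32 bilinear filter results per clock at 575 MHz for the 8800 GTX versus 16 per clock at 650 MHz for X1950 XTX:

```python
g80_bilinear = 32 * 575e6    # ~18.4 GTexels/s
r580_bilinear = 16 * 650e6   # ~10.4 GTexels/s
print(g80_bilinear / r580_bilinear - 1)   # ~0.77 -> the 77% figure above
```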
> I'd love to get some hard data on how one can trade-off number of texture lookups and bandwidth, etc. but it seems like a particularly complicated function once register pressure and latency hiding are considered (and they really must be on a GPU).

I expect newer GPUs are more forgiving. R600 should have "practically zero" register pressure problems from what I can discern, and it has vast amounts of latency-hiding. R600's complication should be to do with instruction issue rate. This is NVidia's argument, that things like shadow buffers, being scalar, will always have their math run at full speed.
> R600 should have "practically zero" register pressure problems from what I can discern.

Why do you think that?
> ... is that R580's cache isn't too hot when it comes to supplying the same texels repeatedly, whereas G80 is good. See the 4-Component Floating Point Input Bandwidth test, with SGL access pattern (137GB/s for G80, 39GB/s for R580). I think R600 is radically better here (has anyone run GPUBench on R600?), which should make a huge difference.

Over 160 GB/s on R600 for the SGL part.
> Is it showing that R580's texture cache architecture is running into a wall? We know R600's texture cache architecture is designed for "better performance".

Yeah, that's interesting, and those results are certainly in line with what I've seen. Has anyone run that demo on R600? I'll maybe try if I get a chance at the office this week (it's DX9, right?).
> 8800GTX is 77% faster in theoretical bilinear rate than X1950XTX.

That comment was actually more directed at a comparison between G70 and R580, which is when I was doing more direct comparisons - since G80 there's really no competition unfortunately, excepting that G80's geometry shader does appear to be almost unusably slow (although I've not played with R600's yet).
> I expect newer GPUs are more forgiving. R600 should have "practically zero" register pressure problems from what I can discern, and it has vast amounts of latency-hiding. R600's complication should be to do with instruction issue rate.

Right. Well I should really make some time to play around with R600 this week and next to get a better idea of the capabilities. Unfortunately I can't run a lot of our applications/benchmarks because OpenGL is still pretty broken on ATI (as always - sigh), and it's currently in an XP machine. So basically all I have is DX9 stuff for now...