Sir Eric Demers on AMD R600

There. So, if your card spends 3% of its time working on the Geometry shader, no matter HOW FAST that Geometry shader might be, it's not making up for the other 97% of the frame. Get it now?
But it's relative. If card A spends 3% on GS but card B is 50x slower in GS, then it's spending 150% of the frame time; all else being equal, card B is taking greater than 2x the time to render a frame!

;)
 
But it's relative. If card A spends 3% on GS but card B is 50x slower in GS, then it's spending 150% of the frame time; all else being equal, card B is taking greater than 2x the time to render a frame!

;)

All else being equal. Is that always the case? Some of that delta can be made up in other places, as you are certainly well aware. I simply don't see the GS as being a trump card this round, a table-turner or whatever. Just IMHO, with all due respect.

Dave, are you at liberty to say whether or not an X64 version of the Lost Planet hotfix will be made available? Or are we (the X64 crowd) to wait for the next Cats? Thank you.
 
But it's relative. If card A spends 3% on GS but card B is 50x slower in GS, then it's spending 150% of the frame time; all else being equal, card B is taking greater than 2x the time to render a frame!

;)

Well, you're leaving out the other 97% of the equation, which means we could flip it backwards:

If Card A takes 1 msec to compute the GS, and then 100 msec to compute the rest, whereas Card B takes 50 msec to compute the GS and 35 msec to do the rest, then the 50x speedup still got you nowhere.

;)
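To make the arithmetic on both sides of this exchange explicit, here is a minimal sketch; the millisecond figures are the hypothetical ones from the posts above, not measurements:

```python
# Hypothetical frame-time arithmetic from the posts above (illustrative numbers only).
# A big GS speedup only matters in proportion to the share of the frame the GS occupies.

def frame_time(gs_ms, rest_ms):
    """Total frame time is just the GS portion plus everything else."""
    return gs_ms + rest_ms

# Card A: very fast GS, slower at everything else.
card_a = frame_time(gs_ms=1.0, rest_ms=100.0)   # 101 ms/frame
# Card B: 50x slower GS, but faster at the other work.
card_b = frame_time(gs_ms=50.0, rest_ms=35.0)   # 85 ms/frame
print(card_a, card_b)   # Card A's 50x GS advantage still loses the frame.

# The inverse argument: if GS is 3% of card A's frame and card B is 50x slower at it,
# card B's GS alone costs the equivalent of 150% of card A's frame (0.03 * 50 = 1.5).
```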
 
You are a bad prophet, aren't you? :LOL:

[attached image: Steam hardware survey results]


This is the most recent survey from the Steam site; although it catches just a fraction of the total user base, it should be representative enough to show the trends.

Well, either way, this picture gives a bad name to the whole DX10 craze so far, though... :cool:

Almost 1.5% of the userbase are using Vista and 8800s? Wow, I'm actually quite surprised at how high that is!!

In fact, I just looked at the total results, and a staggering (relatively speaking!) 4% of the total userbase is on the 8800! Damn, do you hear that, game devs? Start catering to us! ;)

EDIT: Looking at those results, it seems around 15% of the PC gaming market is operating at the "high end" of 7800 or above. That's not bad, I guess, but consider that it's only about 100,000 users, and I really don't see Steam accounting for less than 10% of the total high-end PC gaming market. So 1 million high-end PC gamers at most.....
 
I don't know what the hoopla is all about. There are plenty of cases where GS runs faster on G80 than R600.
 
I don't know what the hoopla is all about. There are plenty of cases where GS runs faster on G80 than R600.

Just for purposes of this thread and informational value, do you have any links to such testing?
 
Not only that, but the 8-bit "HDR" formats (texture + render target) in D3D10 make fp16 texture filtering (or fp32 filtering) something of a historical blip, I suspect, caught between those formats and deferred rendering engines.
They're 32 bpp formats, not 8 bpc. That's an important distinction since these formats still require high precision filtering, at least for 3 channels.
 
They're 32 bpp formats, not 8 bpc. That's an important distinction since these formats still require high precision filtering, at least for 3 channels.
Thanks.

Hmm, so the fp16 texture filtering functionality is presumably deployed to filter this format. So 32-bit RGBE doesn't save on filtering hardware, it only saves on bandwidth.

Jawed
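For reference, here is roughly what a 32 bpp shared-exponent texel (D3D10's R9G9B9E5-style format) unpacks to. The exact bit layout below is an assumption for illustration, but it shows why the three colour channels still need wide-precision filtering after decode, rather than plain 8-bit-per-channel filtering:

```python
# Rough sketch of decoding a 32 bpp shared-exponent HDR texel (R9G9B9E5-style).
# Assumed layout: R in bits 0-8, G in 9-17, B in 18-26, shared 5-bit exponent in 27-31.
# The 9-bit mantissas have no implicit leading 1; exponent bias is 15, mantissa scale 2^-9.

def decode_shared_exponent(packed):
    r_m = (packed >> 0)  & 0x1FF
    g_m = (packed >> 9)  & 0x1FF
    b_m = (packed >> 18) & 0x1FF
    exp = (packed >> 27) & 0x1F
    scale = 2.0 ** (exp - 15 - 9)   # one exponent shared by all three channels
    return (r_m * scale, g_m * scale, b_m * scale)

# The packed texel saves storage and bandwidth versus fp16/fp32 per channel, but after
# decoding, filtering still has to blend three wide-range float values per texel.
```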
 
Any ideas why the AA performance of the R600 takes a huge nosedive? I find it odd that it takes a larger hit than the R580 (not talking about the cases where it really gets hit hard and performs worse than an R580; those are obvious bugs).
 
My wild idea is that the whole AA implementation in R600 is based on the assumption that this kind of post-processing (in the foreseeable DX10 environment) would be carried out exclusively by the developer, using their own sample-resolve code.
That said, the current AA support provided at the driver level for R600 is a bit too rough and "wasteful", but it works for nearly everything... duh!
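To illustrate the distinction being speculated about here: a driver-level resolve is essentially a fixed box filter over the MSAA samples, while a D3D10 developer doing their own resolve in a shader can process the samples first (tone-map, weight by edges, and so on). A minimal sketch with made-up sample values:

```python
# Illustrative only: fixed box-filter resolve vs. a developer-controlled resolve.

samples = [0.9, 4.0, 0.8, 1.1]   # made-up HDR colour samples for one pixel (4xAA)

# Driver-style resolve: straight average of the raw samples.
box_resolve = sum(samples) / len(samples)

# Developer-controlled resolve: e.g. tone-map each sample before averaging, so a single
# very bright sample doesn't dominate the final pixel after tone mapping.
def tonemap(x):
    return x / (1.0 + x)         # simple Reinhard-style operator

custom_resolve = sum(tonemap(s) for s in samples) / len(samples)
print(box_resolve, custom_resolve)
```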
 
Geometry amplification will not be used extensively for quite a while, because even if we have relative numbers between IHVs, we have jack in terms of absolute performance numbers when doing amplification, and I have a hunch that they're not that extraordinary for first-gen hardware.

Well, if you're using the GS chances are pretty high you'll be doing geometry amplification. Most meaningful uses of the GS include some amount of amplification.

Well, you're leaving out the other 97% of the equation, which means we could flip it backwards:

If Card A takes 1 msec to compute the GS, and then 100 msec to compute the rest, whereas Card B takes 50 msec to compute the GS and 35 msec to do the rest, then the 50x speedup still got you nowhere.

;)

Dave's point was that you can't assign a percentage of the workload to GS like that. If the R600 spends 3% of the time on GS, the G80 might spend 95% on it. Just because, say, 3% of the draw calls in a game use the GS doesn't mean it's going to be 3% of the workload. Being incredibly slow at 3% of the rendering calls can most certainly bring down the framerate below another card's, even if you're faster on the other 97%.
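Putting that in concrete (entirely made-up) numbers: the same GS workload can be a small slice of one card's frame and the dominant cost on another, so "x% of the draw calls" says nothing about "x% of the frame time":

```python
# Made-up numbers showing why "3% of the draw calls use GS" != "3% of the frame time".
# Both cards do the same 10 ms of non-GS work per frame and the same GS workload,
# but one card processes that GS workload 50x slower.

non_gs_ms = 10.0
gs_fast = 0.3                    # GS is ~3% of this card's frame (0.3 / 10.3)
gs_slow = gs_fast * 50           # identical GS work, 50x slower = 15 ms

for name, gs in (("fast-GS card", gs_fast), ("slow-GS card", gs_slow)):
    frame = non_gs_ms + gs
    print(f"{name}: {frame:.1f} ms/frame, GS share = {gs / frame:.0%}")
# fast-GS card: 10.3 ms/frame, GS share ~3%
# slow-GS card: 25.0 ms/frame, GS share 60% -- well over 2x slower overall.
```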
 
Xenos's 8 ROPs, which have essentially no compression functions, cost ~20M transistors and are less functional than D3D10 ROPs. 24 ROPs in G80 prolly cost in the region of 100M transistors. 8 Zs per ROP per clock is a lot.
Okay, that's a good point. They still have to decompress the data stream, but I'll agree that it's less than what an EDRAM-less GPU has to deal with.

It should be noted that Xenos has 4 blend units per ROP, and we don't know how many G80 has. It may only have 1, because if the samples aren't identical (and hence not compressed) it'll take a while to transfer the data anyway. Also, I think G80 has half-speed blend like NV40. The 3DMark numbers are showing precisely 12 pixels per clock (6.9 GTex/s). That will save a lot of transistors.

Nonetheless, let's assume you're right with 100M. That's still in line with what I was talking about. Suppose 16 DX10 ROPs doing 2 samples per clock cost 40M transistors. I think that's fair. 60M additional transistors is under 10% of G80, and though I can never prove it, I think G80 has benefited more than 10% in AA performance (the only performance that counts for this market segment). I really don't think they've been wasteful here.
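For what it's worth, the back-of-the-envelope numbers above work out as follows; the ROP figures are the estimates from the posts, and the ~681M total and 575 MHz core clock are the commonly quoted G80 specs, not anything new:

```python
# Back-of-the-envelope check of the transistor and fillrate figures quoted above.

g80_total = 681e6                # commonly quoted G80 transistor count
rops_24   = 100e6                # estimate above for G80's 24 ROPs
rops_16   = 40e6                 # hypothetical 16 DX10 ROPs at 2 samples/clock
extra = rops_24 - rops_16
print(extra / 1e6, extra / g80_total)   # ~60M extra, roughly 9% of the chip

# The "precisely 12 pixels per clock" blend figure: 6.9 G/s at a 575 MHz core clock.
print(6.9e9 / 575e6)                    # = 12.0
```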
Texture fetch (as opposed to bilinear or AF filter) is rarely a bottleneck.
Even with AF disabled, R580 was very often only marginally faster than R520. (Trilinear has always had a low impact with the mipmap optimizations we've seen, so bilinear rate = fetch rate for all intents and purposes.)

You lamented single-cycle trilinear as wasteful back just before G80 launched.

Filtering is the single most expensive part of texturing:
Originally, yes. I thought it was wasteful because you only need a bit more hardware to make double speed bilinear (it's probably even a greater fraction than what's shown by that chart once you take the sample fetching, cache, and MC into account). However, then I realized that you would need twice the threads and registers to truly double bilinear texturing speed. That was a pretty big oversight in my original analysis. Moreover, I did not consider that it would also double the single mipmap speed of FP16, I16, FP32, other >8bpc HDR formats, and 8bpc volume textures.
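The "twice the threads and registers" point is essentially Little's law: sustained throughput times latency equals the work that has to be in flight. A sketch with assumed (not measured) numbers:

```python
# Little's law sketch (assumed numbers): in-flight fetches = fetch rate x fetch latency.
# Doubling bilinear throughput at the same texture latency means keeping twice as many
# fetches outstanding, i.e. roughly twice the threads and the registers to hold them.

latency_clocks = 200             # assumed texture fetch latency
fetch_rate_1x  = 16              # fetches issued per clock, hypothetical baseline
fetch_rate_2x  = 32              # doubled bilinear rate

print(fetch_rate_1x * latency_clocks)   # 3200 fetches (threads) in flight
print(fetch_rate_2x * latency_clocks)   # 6400 -- about 2x the thread/register budget
```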

I think alpha-blending and overdraw are what cause the worst framerate minima, things like explosions, clouds of smoke, lots of characters running around the screen, large areas of foliage. I think in comparison, "texture load" is relatively constant in regions of heavy texturing - you don't get a 50% reduction in framerate from crouching.
That's true, and I've brought that point up many times myself too. I wasn't saying AF is the biggest cause of dips in framerate, but just suggesting that it's a load that could wildly fluctuate when everything else stays the same, especially if texture resolutions start going up the way we all want them to.

<other points>
Agreed.

best-case. So, ahem, G84 v RV630 is a solid win with 4xAA/AF at 1280x1024, but 8800GTS-640 against HD2900XT is a narrower victory at 1600x1200 8xAA/AF:
Even if you disable AF to almost negate the filtering advantage, the 8800GTS-640 is not far from the HD2900XT. With less fillrate (well, w/o AA only I guess), way less bandwidth, and way less math ability, you have to think the equal texturing rate has something to do with it, right?

I'm thinking of an alternate history where G80 was 12 clusters, with a 1:1 TA:TF ratio, so more ALU capability and less TF.
But that would mean 48 bilinear TMUs! Weren't you just saying fetch rate is less important than filtering rate? By definition a cluster has a TMU quad, so if you wanted more math and less texturing, you'd go for fewer clusters with wider SIMDs (and a larger warp size).

Remember how the original GeForce could do single-cycle trilinear (i.e. 2 filtered quads per clock) in 15M transistors. It cost NVidia 10M transistors to double the addressing ability and double the pixels in flight for the GeForce 2 - though I suspect they also increased the texture blending math, and I guess the increased clock speed had some cost as well.

G8x architecture was set too far back for the "prove to the world" thing. Don't forget that a batch in G80 is actually 16 objects in size. A warp is two batches married, because it makes the more complex register operations of pixel shaders kinder on the register file. Pixel shaders will tend towards vec2 or scalar operands, while vertex shaders will tend towards vec4 operands.

There are 2 parameters in warp size: clocks per instruction and SIMD width. The problem with making the SIMD wider, say 16 objects, is that the register file needs to get twice as wide. G80 fetches 16 scalars per clock (and 16 constants per clock and 16 scalars per clock from PDC). All of these would have to be doubled if the SIMD gets wider.
Okay, we can scrap my "prove to the world" idea, but given that ATI went nuts for R520's tiny batch size, I can see how NVidia would do the same when planning for G80. They obviously knew how useless G70's DB was.

I'm not following you on the impact of pixel shader vs. vertex shader on the register file. In a scalar architecture it doesn't really matter. You store your data in an SOA manner in the RF. Parallel MADDs still need to load the same 8 floats for each of 3 operands as with serial MADDs. The clocks between instruction changes make sense, though. A vec4 op on a batch of 16 would use the same instruction for 8 clocks on 8-wide SIMD arrays.

Also, making the register file wider isn't nearly as costly as making it bigger. The number of operands is still the same. One quarter of R600 fetches data for 80 SPs each clock. The sixteen arrays in G80 only fetch data for 8 SPs each. Anyway, regardless of how you approach it, it would definitely be more costly to keep the warp size the same and double the math instead of doubling both.
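To put numbers on the two warp-size parameters being discussed (SIMD width and clocks per instruction), using the G80-style figures from the posts above:

```python
# Warp-size arithmetic per the discussion above: an instruction is held for
# (objects / SIMD width) clocks, and the register file must feed every lane each clock.

simd_width = 8                   # lanes per SIMD array (G80-style, per the posts)
batch      = 16                  # objects per batch
warp       = 2 * batch           # two batches "married" -> 32 objects per warp

scalar_clocks = warp // simd_width        # 4 clocks to issue one scalar op to a warp
vec4_clocks   = batch * 4 // simd_width   # vec4 op on a 16-object batch = 8 clocks
print(scalar_clocks, vec4_clocks)

# Doubling the SIMD width while keeping the warp at 32 halves the clocks per instruction,
# but the register file, constant, and PDC paths that the post says deliver 16 scalars
# per clock would each have to deliver 32 -- which is where the extra cost shows up.
```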
 
http://www.anandtech.com/video/showdoc.aspx?i=3029&p=7


It says here that even the HD2900XT doesn't have the hardware to do well in DirectX 10, and also that geometry shaders don't perform well on the GPU.

True?


Well, I thought that R600 was a lot more powerful than G80 in GS...
I only read the page you linked to (conclusion), but didn't see anything that singled out the HD 2900XT as you write above. They didn't like the performance of any cards.
 
ATI's chips, including R4x0, have a notable advantage in Oblivion that I've always been curious about. NV40 really bit the dust hard in that game, for some reason. G70 wasn't hugely better, either.
HSR mechanisms - as evident in the fact that those huge advantages only show up in outdoor environments.
 
But the R4x0 can't be tested with HDR, as they don't support FP blending... so I'm guessing those comparisons are made with bloom (sorry, I wasn't really interested in GPUs at the time, so I don't have benchies in my mind ATM).
 
But the R4x0 can't be tested with HDR, as they don't support FP blending... so I'm guessing those comparisons are made with bloom (sorry, I wasn't really interested in GPUs at the time, so I don't have benchies in my mind ATM).

If you're referring to my post: Oblivion's outdoor areas tax Nvidia GPUs heavily up until G80 - regardless of whether HDR or Bloom is selected.
 
If you set AF to app. pref., you get much better (AF) performance than when it's forced via CCC! This with an X2900XT 1 GB GDDR4 and Cat 7.6.
 
But the R4x0 can't be tested with HDR, as they don't support FP blending... so I'm guessing those comparisons are made with bloom (sorry, I wasn't really interested in GPUs at the time, so I don't have benchies in my mind ATM).

You mention HDR, but he mentioned HSR -- you know, hidden surface removal? Different stuff there ;)
 
You mention HDR, but he mentioned HSR -- you know, hidden surface removal? Different stuff there ;)

AAAAHHHH, so that's what it was.... thank you, I feel most educated now :p:D

The point of my post was that an apples to apples comparison between nV and ATi parts of the time could only be done without HDR, in Bloom mode. Which is quite irrelevant in the context of this thread anyway.
 