Rasterization is cheap. Remember that the 15M transistor GeForce could do 4 pixels per clock, and only a fraction of its transistors would be devoted to rasterization. Going from 32 to 96 samples per clock won't triple performance, but even a 10% increase is probably worth the die space.
Xenos's 8 ROPs, which have essentially no compression functions, cost ~20M transistors and are less functional than D3D10 ROPs. 24 ROPs in G80 prolly cost in the region of 100M transistors. 8 Zs per ROP per clock is a lot.
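A quick back-of-envelope on why 8 Zs per ROP per clock is a lot (the 575MHz clock is my assumption, taken from 8800GTX, not something stated above):

```python
# Rough z-only fill rate sketch for G80's ROPs. Assumptions: 24 ROPs,
# 8 Z samples per ROP per clock, 575 MHz core clock (8800GTX).
rops = 24
z_per_rop_per_clock = 8
core_clock_hz = 575e6

z_samples_per_sec = rops * z_per_rop_per_clock * core_clock_hz
print(f"z-only fill: {z_samples_per_sec / 1e9:.1f} G samples/s")  # ~110.4 G/s
```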
Texture fetching is very useful. The 8800GTX has the same fetch rate as R600 and is pretty close in performance, even though it has a heavy deficit in math throughput and bandwidth.
Texture fetch (as opposed to bilinear or AF filter) is rarely a bottleneck.
Texture filtering may seem like overkill,
You lamented single-cycle trilinear as wasteful back just before G80 launched.
but NVidia basically decided that, rather than stopping at free FP16 filtering like ATI, adding double-speed 8-bit doesn't cost a whole lot extra.
Filtering is the single most expensive part of texturing.
Not only that, but the 8-bit "HDR" formats (texture+render target) in D3D10 make fp16 texture filtering (or fp32 filtering) something of an historical blip I suspect, caught between those formats and deferred rendering engines. They'll always have their uses, but devs will use them as a last resort simply because of the space and bandwidth they consume.
Fast filtering could really help with min framerate. Game situations can arise where a different view or simply crouching can drastically increase the AF load, especially when texture resolutions are high (as we hope they will be in upcoming games).
I think alpha-blending and overdraw are what cause the worst framerate minima, things like explosions, clouds of smoke, lots of characters running around the screen, large areas of foliage. I think in comparison, "texture load" is relatively constant in regions of heavy texturing - you don't get a 50% reduction in framerate from crouching.
Here I agree, though I'm not sure what you're saying about Xenos. Was it a mistake, or did it mislead you into what you expected for R600?
Ooh, I don't think it was a mistake, it was a great decision. It also gave the impression that ATI considered the time right to go with 4xAA per loop.
This, to me, is a bit disconcerting. They clearly weren't aiming high enough with what to do with 700M transistors. What did they expect? That NVidia would just sit around?
Precisely. At the same time I think their architecture, specifically the virtualisation and the multiple concurrent contexts, is very costly. Look at the huge size of RV630, 390M transistors, even if a fair chunk of that is texture cache. So the architecture has a high cost of entry.
As these architectural concepts become more important in future D3D versions, NVidia will have to play ball, just like it had to implement small batches, out-of-order threading and decoupled texturing.
As much as they talk about the high end getting too expensive, I'm sure the executives at ATI would love to sell a smaller chip at the same price and/or this chip at a higher price. If NVidia can halve G80 and get 60% (due to a higher clock) of the performance when going to 65nm, then ATI will really feel it.
Yep, though I think 70%+ is more likely, there's a lot of clock headroom...
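Rough numbers on why that would hurt (the ~681M transistor count for G80 is my figure, not something from this exchange):

```python
# Sketch of the perf-per-transistor argument: a halved G80 on 65nm that keeps
# 60-70% of the performance thanks to higher clocks. The ~681M figure for G80
# (excluding NVIO) is my assumption.
g80_transistors = 681e6
for perf_fraction in (0.6, 0.7):
    ratio = perf_fraction / 0.5   # perf per transistor versus the full chip
    print(f"half the transistors ({g80_transistors / 2 / 1e6:.0f}M), "
          f"{perf_fraction:.0%} of the performance -> {ratio:.1f}x perf/transistor")
```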
G84 doesn't seem to be as effective as G80 per transistor per clock, so I'm not so sure about writing off the filtering ability just for that reason.
There's NVIO "on-die" plus the fixed-function overheads: thread scheduling, VP2, etc.
8600GTS at 10.8G texels/s bilinear and 43.2 G zixels/s versus HD2600XT at 6.4G texels/s bilinear and 6.4G zixels/s does look barmy. For all that, 8600GTS comes out about 28% better:
http://www.computerbase.de/artikel/..._xt/24/#abschnitt_performancerating_qualitaet
best-case. So, ahem, G84 v RV630 is a solid win with 4xAA/AF at 1280x1024, but 8800GTS-640 against HD2900XT is a narrower victory at 1600x1200 8xAA/AF:
http://www.computerbase.de/artikel/..._xt/32/#abschnitt_performancerating_qualitaet
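For reference, the headline rates I quoted above fall straight out of units x clock. The unit counts and clocks in this little sketch are my assumptions, but they reproduce those figures:

```python
# Reconstructing the quoted rates from units x clock. Assumed clocks:
# 675 MHz for 8600GTS, 800 MHz for HD2600XT.
def rates(name, clock_hz, bilerps_per_clock, z_per_clock):
    print(f"{name}: {bilerps_per_clock * clock_hz / 1e9:.1f}G texels/s bilinear, "
          f"{z_per_clock * clock_hz / 1e9:.1f}G zixels/s")

rates("8600GTS",  675e6, 16, 8 * 8)   # 16 filter units; 8 ROPs x 8 Z/clock
rates("HD2600XT", 800e6,  8, 4 * 2)   # 8 filter units;  4 ROPs x 2 Z/clock
```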
The good thing about double speed filtering is that you improve texturing performance without increasing the number of registers needed for latency hiding.
That's the only positive I've heard so far. And G80 is not exactly thread-happy, having "under-sized" register files and extremely slow register file spill to memory.
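To put toy numbers on the latency-hiding point (everything here is an illustrative assumption, not a real G80 figure): what costs registers is the fetch issue rate, not the filtering rate.

```python
# Little's-law style sketch: pixels that must be in flight to hide texture
# latency scale with fetches issued per clock, so doubling the filtering rate
# (TF) leaves register demand alone, while doubling the fetch rate (TA) doubles it.
tex_latency_clocks = 200     # assumed texture/memory latency
regs_per_pixel = 10          # assumed live registers per pixel

def regs_needed(fetches_per_clock):
    pixels_in_flight = tex_latency_clocks * fetches_per_clock   # Little's law
    return pixels_in_flight * regs_per_pixel

print(regs_needed(4))   # baseline, e.g. 4 TA per cluster -> 8000 registers
print(regs_needed(4))   # double the TF rate: same issue rate, same registers
print(regs_needed(8))   # double the TA rate instead: twice the registers
```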
Your theory is sort of weird, too. They have to have one TMU quad per cluster with their current design (they actually call it a "texture processing cluster"), so if they didn't have this "cap" then what? If they were planning on fewer TMUs, NVidia would have had fewer clusters, not more.
I'm thinking of an alternate history where G80 was 12 clusters, with a 1:1 TA:TF ratio, so more ALU capability and less TF.
But I suspect this wasn't practical, because thread scheduling hardware is just gobbling up die space, or because a 12:6 crossbar twixt clusters and ROPs/MCs might have been too costly.
1:2 TA:TF just looks like the easiest to construct. But it forced NVIO off-die. I have a theory that a year earlier, G80 was planned to be 1:1, but they revised upwards to get the performance scaling they desired. It was the simplest "fix" to an under-performing part.
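Putting rough unit counts on that alternate history; the per-cluster figures are my assumptions, only the 12 clusters and the 1:1 TA:TF ratio come from what I said above:

```python
# Totals for the real G80 versus the hypothetical 12-cluster, 1:1 TA:TF G80.
def totals(name, clusters, sps, ta, tf):
    print(f"{name}: {clusters * sps} SPs, {clusters * ta} TA, {clusters * tf} TF")

totals("actual G80      ", clusters=8,  sps=16, ta=4, tf=8)   # 128 SPs, 32 TA, 64 TF
totals("hypothetical G80", clusters=12, sps=16, ta=4, tf=4)   # 192 SPs, 48 TA, 48 TF
```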
As for the thread scheduling, G84 could simply double the warp size to match ATI. That would halve their scheduling cost with almost no performance cost. I think they wanted to prove to the world that they could lead technologically in dynamic branching, and that's why they did it. Now they can go back to a more sensible warp size, and I suspect they'll do so with G92.
The G8x architecture was settled too far back for the "prove to the world" thing. Don't forget that a batch in G80 is actually 16 objects in size. A warp is two batches married, because that makes the more complex register operations of pixel shaders kinder on the register file. Pixel shaders will tend towards vec2 or scalar operands, while vertex shaders will tend towards vec4 operands.
There are 2 parameters in warp size: clocks per instruction and SIMD width. The problem with making the SIMD wider, say 16 objects, is that the register file needs to get twice as wide. G80 fetches 16 scalars per clock (and 16 constants per clock and 16 scalars per clock from PDC). All of these would have to be doubled if the SIMD gets wider.
So I suspect it's easier for NVidia to go with a longer instruction duration, 4 clocks in future GPUs. This also has the side-effect of reducing the read after write etc. hazards against the register file, which also makes it easier to fetch operands for those tricky corner cases of MAD+MUL co-issue. At this point the batch size and warp size will coincide: 32 objects, and that will be that for the foreseeable future, I suspect.
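A tiny sketch of the trade-off as I see it; the widths and operand counts are illustrative assumptions, only the "warp = SIMD width x instruction duration" framing and the 16-scalars-per-clock figure come from above:

```python
# The two knobs on batch/warp size: SIMD width and clocks per instruction.
# Register-file fetch width scales with SIMD width only, so lengthening the
# instruction grows the batch without widening the RF.
SCALARS_PER_OBJECT_PER_CLOCK = 2   # assumed operand fetch per object per clock

def config(name, simd_width, clocks_per_instr):
    batch = simd_width * clocks_per_instr
    rf_width = simd_width * SCALARS_PER_OBJECT_PER_CLOCK
    print(f"{name}: batch of {batch}, RF fetch {rf_width} scalars/clock")

config("G80-ish today     ", simd_width=8,  clocks_per_instr=2)  # batch 16, 16 scalars/clock
config("wider SIMD        ", simd_width=16, clocks_per_instr=2)  # batch 32, RF width doubles
config("longer instruction", simd_width=8,  clocks_per_instr=4)  # batch 32, RF width unchanged
```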
Jawed