AMD: R7xx Speculation

It seems the HD 3850 is limited by its 256 MB of memory, while X1950 and HD 3870 are happily blasting away with 512 MB.
 
Silent_Buddha,
Yes, I am aware of that. But neither MS nor DX10 prevents you from using parts of an API through a hack. So even though a game doesn't officially support DX10.1, that would not mean - AFAIK - that it cannot use this.

But my question was whether or not both UE3 hacks do in fact use this per-sample access to the depth buffer to enable AA.

In STALKER's case on the GeForce 8, that is true AFAIK.
UE3 being partially deferred - do you want to draw parallels to that?

Using DX10+ shader resolve does not seem worth it when a separate shader path has to be written, and within DX10 code execution itself hacks don't seem to be given much freedom. No forced TSAA, no custom filters, no funky SSAO shaders...

@Lukfi
But that still doesn't explain how the XTX almost caught up with the 3870 with AA when without it, it loses by 10 frames.
 
HD2900XT is 13% faster than X1950XTX. Based on clocks it should be 14% faster.

HD3870 is slower than HD2900XT on that page, whereas here it is the same speed:

http://www.computerbase.de/artikel/...e_9800_gx2/9/#abschnitt_clive_barkers_jericho

26.1fps for HD3870 (but HD2900XT not listed).

This page:

http://www.computerbase.de/artikel/...hd_3850_x2/8/#abschnitt_clive_barkers_jericho

shows no recent improvement in RV670 performance.

Meanwhile, HD3850 512MB is clocked 2.7% faster than X1950XTX, but HD3850's memory clock is 82.8% of X1950XTX's. 22.5fps for HD3850 is 97.4% of X1950XTX's 23.1.

Jawed
 
Lukfi: No, I don't think so. As seen in one of Jawed's links, the difference seems to be smaller than 1%.

Jawed:
I was under the impression you wanted to be shown an instance where RV670 was faster than R580 with AA applied - hence my question whether you'd limit that to specific conditions and/or models. You didn't bother to comment on that, so there you have it. Personally, I don't care about this very much, as almost anything can be shown by cherry-picking reviews and/or benchmarks.

edit:
Besides - what's your point? That this is an exception to the rule? I'll give you that. That this benchmark is not sound? Maybe - I don't know, but CB has changed test platforms between my link and your two.
 
=>Quasar: Of course I was referring to the chart in your link. I know that 256 MB is usually not a problem for Radeon cards (as opposed to GeForces), but apparently Jericho doesn't like 256MB cards, whether they have good memory management or not.

=>Tchock: You're right, it's strange. Seems this particular game is tailored for R580 rather than for R600/RV670...
 
Just for kicks, and since I'm currently reading a good book, I have just done a quick bunch of 3DMark06 runs on my 3870 with various AA/AF configs.
E6600 @ stock, 3870 @ stock, Cat 8.5

no AF / no AA: 9883
16x AF / 8x AA: 8098
8x AF / no AA: 8552
16x AF / no AA: 8164
no AF / 4x AA: 9456
no AF / 8x AA: 9449
no AF / 24x AA: 8903

This was just one run of each config and I had stuff open in the background, so it's not by any means scientific, but there seems to be a pretty clear pattern.
WTF, even 8x AF is more expensive than 8x AA? This is unacceptable; 8x or 16x AF are the most important settings for increasing overall visual quality (I have an FW900 CRT, so I can get away without AA or with only 2x). If this doesn't improve in the 4000 series, then it's no buy from me.
 
Hmmm, do any reviews corroborate that? It would be ironic if all the hype around the AA performance hit was actually due to AF being turned on at the same time.

It would have been nice if B3D had done an analysis of the performance hit for enabling various levels of AA and AF on R600 or RV670, similar to what was done for G80. You know it would be the only analysis of that type... further improving B3D's reputation and raking in the page hits ;)
 
AF is a form of texture filtering, and texture filtering throughput was horribly weak in R600 and RV670 - which is why AF sucks on HD 2xxx and 3xxx.

Hopefully the rumors we're hearing are true and the texture units have indeed been doubled in RV770; but if all the rumors are true, that would also mean roughly a doubling of computational power (to ~1 TFLOPS), in which case the texture units could still be a bottleneck. :S
 
Jawed:
I was under the impression you wanted to be shown an instance where RV670 was faster than R580 with AA applied - hence my question whether you'd limit that to specific conditions and/or models. You didn't bother to comment on that, so there you have it. Personally, I don't care about this very much, as almost anything can be shown by cherry-picking reviews and/or benchmarks.
I presume you meant RV670 slower than R580.

The lack of AA-only tests and the use of immature drivers certainly causes a lot of grief if one wants to try to understand an architecture. It'd be nice if _xxx_ could point us in the direction of the compelling evidence he has for "a slower AA implementation".

Take the FEAR results in the test you linked. They show HD2900XT as only 11% faster at 1280 with 16xAF/4xAA, but 13% faster at 1600. CPU limited? Driver-threading problem? Is R6xx more CPU-limited at lower resolutions than R5xx? Vista versus XP problem? etc. We know these variables are in the mix.

There were clear cases in the early days (erm, months) of R600 where performance was way below X1950XTX. Sadly the latter has dropped off most benchmark charts so it's hard to track. It would be nice to have some results that we can discuss in technical terms.

A 4xMSAA resolve shader:

Code:
struct samples
{
    float4 colorA : COLOR0;  // sample 0
    float4 colorB : COLOR1;  // sample 1
    float4 colorC : COLOR2;  // sample 2
    float4 colorD : COLOR3;  // sample 3
};

// Average the four MSAA samples to produce the resolved pixel colour.
float4 main(samples IN) : COLOR
{
    return (IN.colorA + IN.colorB + IN.colorC + IN.colorD) * 0.25;
}

Code:
    ps_3_0
    def c0, 0.25, 0, 0, 0      ; 0.25 = 1/4 for the average
    dcl_color v0               ; sample 0
    dcl_color1 v1              ; sample 1
    dcl_color2 v2              ; sample 2
    dcl_color3 v3              ; sample 3
    mov r0, v0
    add r0, r0, v1
    add r0, r0, v2
    add r0, r0, v3
    mul oC0, r0, c0.x
runs in 4 cycles.

That shader isn't actually of the right form, because it's using interpolated vertex attributes to provide the samples to the shader - whereas in reality four registers (r0...r3, say) would be populated with the samples directly (since R600 can dump data straight into the register file - I presume this is the "special path between RBEs and shaders" that's alluded to). When the correct shader is executed, R600's throughput would be 16 pixels per clock (64 shader pipes each taking 4 cycles: 64/4 = 16).

I'm an HLSL noob, so apologies for the code :LOL: If R580 resolves 4xMSAA at more than 16 pixels per clock, then R600 loses.

To be honest I don't know R580's resolve rate for 4xMSAA data. Potentially it's complicated by the fact that a render target is a mixture of single-sample tiles and 4-sample tiles. It's also complicated by the format of the samples, int8 or fp16.

Jawed
 
HD2900XT is 13% faster than X1950XTX. Based on clocks it should be 14% faster.

HD3870 is slower than HD2900XT on that page, whereas here it is the same speed:

So both the 2900 XT and the HD 3870 are slower per clock than an X1950 XTX? That's sad; so much for architectural improvements.
 
Jawed: your shader to resolve a 4x render target can probably be much simpler; assuming bilinear filtering is enabled, the resolve can be carried out in one clock cycle (and with just one tex2D instruction).
 
Jawed: your shader to resolve a 4x render target can probably be much simpler; assuming bilinear filtering is enabled, the resolve can be carried out in one clock cycle (and with just one tex2D instruction).
Aha, interesting.

In the past AMD people have been adamant that texturing hardware isn't used. Or, at the very least, that texture fetches aren't performed in order to get samples into the shader hardware.

As far as I can tell the RBE sample de-compression system must be used to extract samples for MSAA resolve - so it's a question of whether a route from the RBEs to L2 texture cache is available...

Well, unless a patent document turns up, we'll prolly never know the route that samples take.

Jawed
 
Anyone know of a game screenshots comparison of HQ and regular AF on R6xx? What about shimmer during motion?

Curious if HQ AF is of any benefit for IQ.

Jawed
 
Whoa, the 2900 and 3800 series take massive hits with 8x HQ or better AF - almost 50% with 16x HQ! The biggest hit for the 8800 GT with 16x HQ AF is 20%. They'd better have fixed this for the 4800 series.

What is the difference between HQAF and regular AF? And which do most reviewers use?

Regular AF doesn't appear to carry much of a performance hit, and if that's what reviewers use, it's not affecting benchmarks.
 
Jawed said:
Anyone know of a game screenshots comparison of HQ and regular AF on R6xx?
AFAIK default quality is the same as HQ (default was the same as HQ on R5xx anyway). The only thing that might cause shimmering, AFAICS, is having Cat AI set to High.
 