The LAST R600 Rumours & Speculation Thread

Well, I think it's worth noting that R580 was really held back by texturing. If R600 has 32 TMUs, that's double. Plus add a little more for some extra clock.

So I think something like 100% plus a little bit is about as much faster than R580 as R600 could hope to be in general.

Which means it'll be solidly in G80's range and all, but it's unlikely to be a performance stunner.
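
Quick napkin math on where that "100% plus a little bit" comes from; the R600 TMU count and clock below are assumed/rumoured figures, not confirmed:

```python
# Napkin math: peak bilinear texel rate = TMUs * core clock.
# R580 figures are real (X1900 XTX class); the R600 TMU count and clock are rumoured/assumed.
def texel_rate_gtexels(tmus, core_mhz):
    return tmus * core_mhz / 1000.0

r580 = texel_rate_gtexels(tmus=16, core_mhz=650)   # ~10.4 GTexels/s
r600 = texel_rate_gtexels(tmus=32, core_mhz=750)   # assumed clock -> ~24.0 GTexels/s

print(f"R580: {r580:.1f} GTex/s, R600 (rumoured): {r600:.1f} GTex/s, ratio {r600 / r580:.2f}x")
```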

G80 isn't quite 2x R580 except in a few games.
 
Well, we know that AMD/ATI's unified architecture is heavily influenced by the Xbox 360 graphics chip. And we know that the Xbox 360 is closely associated with HD DVD, which in turn is the HD format of choice of pr0n. Nvidia, on the other hand, is in the Sony PS3 camp, which is tied to Blu-ray, which is "not so much" on the HD pr0n front. What does this mean for R600 TMUs? Well, I think the facts speak for themselves, do they not? ;)

No, really, that's enough about naughty TMUs for a while... back to feeds & speeds, plz. :cool:


Considering Nvidia sent me an HD DVD drive to test out with the G80, I don't particularly think they're in support of either; they seem more interested in HD video quality. Whatever makes their product look good will likely be what they support.

chris
 
Well, as I recall saying somewhere or other recently (ahem!), they've had 3-loop * 2x AA since the 9700 Pro... which had 128MB of RAM and ~20 GB/s of bandwidth. R600 looks likely to have on the order of 8x more of each! They're going to bump to 8x MSAA (an increase of 33% over that venerable 9700 Pro) and quit? I'm not buying that right now. The obvious answer would be a bump to 3 * single-cycle 4x for 12x MSAA. But possibly that's too obvious and they have something else up their sleeves.
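
Putting rough numbers on that (the 9700 Pro figures are the real ones; the R600 memory and bandwidth values are just the rumoured ballpark, purely for illustration):

```python
# MSAA level = loop count * samples resolved per (single-cycle) loop in the ROPs.
# 9700 Pro numbers are real; the R600 line is the rumoured ballpark, not confirmed.
def msaa_level(loops, samples_per_loop):
    return loops * samples_per_loop

print("9700 Pro:  ", msaa_level(3, 2), "x MSAA max, 128MB, ~20 GB/s")
print("R600 guess:", msaa_level(3, 4), "x MSAA, ~1GB, ~160 GB/s (512-bit) ->",
      160 // 20, "x the bandwidth")
```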

My question was more like whether 8x MSAA is good enough or not. If your monitor can support resolutions way beyond 1600*1200, the most sensible solution is to use the highest possible resolution and the highest AA amount performance allows. If you're running at 1920 and beyond, it would make more sense to use something like 4xAA, and if the game also has a shitload of alpha tests, invest your resources in transparency AA.

I've no idea what it has or hasn't, but it takes a wee bit more than a simple equation like "OK, we have the bandwidth, our ROPs can do single-cycle 4x, so let's loop as much as bandwidth allows."

Given its extreme bandwidth figures it might slaughter the G80 at very high resolutions with 8xMS + Transparency AA. Besides, even if R600 supports more, the only other modes it would make sense to compare performance in would be 2x, 4x and 8x MSAA. If ATI supports 12x MSAA after all, I'll run my head against the wall again if I see performance comparisons to CSAA, and yes, any reviewer going through such an attempt deserves to be shot.
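
For a feel of the resolution-vs-AA trade-off in raw, uncompressed framebuffer terms (real hardware compresses colour and Z heavily, so treat this as an upper bound, not a prediction):

```python
# Uncompressed multisampled framebuffer footprint: width * height * samples * bytes/sample.
# Assumes 4 bytes colour + 4 bytes Z/stencil per sample; ignores compression entirely.
def fb_mbytes(width, height, samples, bytes_per_sample=8):
    return width * height * samples * bytes_per_sample / 2**20

for (w, h), aa in [((1600, 1200), 8), ((1920, 1200), 4), ((1920, 1200), 8), ((2560, 1600), 4)]:
    print(f"{w}x{h} @ {aa}xAA: {fb_mbytes(w, h, aa):.0f} MB")
```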
 
Considering Nvidia sent me an HD DVD drive to test out with the G80, I don't particularly think they're in support of either; they seem more interested in HD video quality. Whatever makes their product look good will likely be what they support.

chris
You do realise that geo was joking, don't you? ;) (But yes, Nvidia is vocal about their support for the BDA, which is lacking for HD DVD.)
 
Not really, no. Geo's jokes are often missed by me. It didn't seem obvious to me.

Chris
 
My question was more like whether 8x MSAA is good enough or not. If your monitor can support resolutions way beyond 1600*1200, the most sensible solution is to use the highest possible resolution and the highest AA amount performance allows. If you're running at 1920 and beyond, it would make more sense to use something like 4xAA, and if the game also has a shitload of alpha tests, invest your resources in transparency AA.

That's an interesting point. Did we ever get to the bottom of whether NV's Transparency AA had a hardware component or not? At any rate, AMD might have slipped in some hw support for Adaptive AA.

Well, we'll see what AA modes they end up with. I can sense the marketers groaning/grinding their teeth at being on the short end of a "16x" vs "8x" checkbox war, though.
 
That's an interesting point. Did we ever get to the bottom of whether NV's Transparency AA had a hardware component or not? At any rate, AMD might have slipped in some hw support for Adaptive AA.

No idea. Considering that Transparency AA has also been possible for some time on NV4x, if there's any HW support in either G7x or G8x it shouldn't account for a lot. I'd rather guess that it's mostly a SW affair. No idea about ATI and, by extension, R600; considering though that in my former post I mostly had transparency supersampling in mind, I wouldn't be particularly surprised if the excessive bandwidth made for a significant difference even if it's also for the most part a software affair (well, take "software" with a grain of salt, since an algorithm isn't software per se).

Well, we'll see what AA modes they end up with. I can sense the marketers groaning/grinding their teeth at being on the short end of a "16x" vs "8x" checkbox war, though.

I'm usually somewhat allergic to half-assed solutions. However, wherever performance doesn't allow more than 4x MSAA, I just use 16x (not Q, obviously, heh). Unless a game is overloaded with alphas, adding TSAA on top isn't usually a problem. It's a very interesting trade-off and I wonder how it could evolve in the future.
 
Ahem..

Excuse me.. what other planets did WE travel to .. besides in our dreams..??

The harsh reality is that probably none of the current board members will be alive when someone actually sets foot on any other planet...


Even the Moon is in doubt :LOL:

Oh my. And I just came here to read some expert speculation and graphics card excogitation. :D The humor is clearly value added.
 
That's an interesting point. Did we ever get to the bottom of whether NV's Transparency AA had a hardware component or not? At any rate, AMD might have slipped in some hw support for Adaptive AA.

When NVIDIA added Transparency anti-aliasing support for NV4x to their drivers, I did some testing and found no difference in the level of performance hit using TAA between NV4x and G7x-based boards. With that in mind, I'd say it's almost certainly software based.
 
When NVIDIA added Transparency anti-aliasing support for NV4x to their drivers, I did some testing and found no difference in the level of performance hit using TAA between NV4x and G7x-based boards. With that in mind, I'd say it's almost certainly software based.

Just a bit off topic: which driver(s) enabled TRAA on NV4x-based cards?
 
I believe TAA is implemented in the drivers by simply redrawing the same stuff multiple times, jittering or somehow changing the AA samples...
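
Something like this, purely as an illustrative sketch of the idea; the function names and offsets below are invented, and this isn't any actual driver code:

```python
# Hypothetical sketch of driver-side transparency supersampling: re-issue the
# alpha-tested draw N times with jittered sample offsets and accumulate the results.
# Everything here (names, pattern) is invented for illustration only.

SAMPLE_OFFSETS = [(-0.25, -0.25), (0.25, -0.25), (-0.25, 0.25), (0.25, 0.25)]  # 4x pattern

def reissue_draw_with_offset(draw_call, offset):
    """Stand-in for re-submitting the draw with a nudged sample position."""
    ...

def transparency_supersample(draw_call):
    for offset in SAMPLE_OFFSETS:
        reissue_draw_with_offset(draw_call, offset)
    # the N passes get blended/accumulated so alpha-tested edges receive averaged coverage
```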
 
Staggered Striped Register File?

Last night I realised a possible way to configure the register file in order to effect thread packing:

b3d78.gif

The 4 columns, delineated by yellow borders, are 4 memory banks. Each bank has one read port per operand.

The 4 rows, delineated by magenta borders, are 4 independent threads.

Each row within a thread corresponds to a quad of, e.g., pixels, so each thread holds 4 such rows, making 16 pixels in total. I've labelled the 16 pixels in a thread 0-F, i.e. hexadecimal.
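
Here's a little sketch of the quad-to-bank mapping as I read the diagram; the stagger stride (shift the stripe by one bank per thread) is just an assumption for illustration:

```python
# Sketch of a staggered/striped quad-to-bank mapping for the register file.
# The stride (one bank shift per thread) is assumed from the diagram, not confirmed.
NUM_BANKS = 4
NUM_THREADS = 4
QUADS_PER_THREAD = 4   # 4 quads * 4 pixels = 16 pixels per thread, labelled 0-F

def bank_of(thread, quad):
    return (thread + quad) % NUM_BANKS   # stagger so concurrent threads hit different banks

for t in range(NUM_THREADS):
    print(f"thread {t}:", [f"quad {q} -> bank {bank_of(t, q)}" for q in range(QUADS_PER_THREAD)])
```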

Jawed
 
So this is how a normal fetch is done, four pixels at a time, e.g. r0.rgba:​

b3danim01.gif

It will take 16 clocks to process 64 pixels.​

This is how a scalar fetch is done, 16 pixels at a time, e.g. r0.r:​


b3danim02.gif

Note how each bank is only being accessed once per clock. This takes 4 clocks to process 64 pixels.​

Here's how a two component fetch looks, 8 pixels at a time, e.g. r0.rg:​

b3danim03.gif

This takes 8 clocks to process 64 pixels.​
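
The clock counts in all three cases fall out of the same relation: as I read the diagrams, the 4 banks together deliver 16 scalar components per clock, so pixels per clock = 16 / components fetched. A quick sanity check:

```python
# Sanity check on the clock counts above, assuming 4 banks each delivering a 4-wide
# register read per clock (16 scalar components total per clock), as read off the diagrams.
SCALARS_PER_CLOCK = 4 * 4   # banks * scalars per bank per clock
TOTAL_PIXELS = 64

for fetch, components in [("r0.rgba", 4), ("r0.rg", 2), ("r0.r", 1)]:
    pixels_per_clock = SCALARS_PER_CLOCK // components
    clocks = TOTAL_PIXELS // pixels_per_clock
    print(f"{fetch}: {pixels_per_clock:2d} pixels/clock -> {clocks:2d} clocks for 64 pixels")
```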

Jawed​
 
Now for the fun bit, when dynamic branching applies. This shows a scalar fetch for 16 pixels per clock, e.g. r0.r:

b3danim04.gif

Note I've darkened the quads that fail the dynamic branching test. So this takes 3 clocks to execute the 9 quads (36 pixels) that have at least one pixel that passes the dynamic branching test.

This is a two component fetch, using the same pass/fail dynamic branching, e.g. r0.rg:​

b3danim05.gif

Because I've only shown 4 threads to keep things simple, the processing rates shown here are worse than normal. In reality there'd be many more threads that require fetching, e.g. 128. So, while it takes 5 clocks here to process 8 quads, the "missing" threads that don't fit in the diagram would make this more effective, i.e. 5 clocks would process 10 quads.​

The overall effect of the SSRF is that the "effective batch size" for dynamic branching is 1 quad. It is still possible for 3 pixels in a quad to fail the dynamic branching test. But this is much better than the worst case of a conventional 16-pixel per thread GPU, where 15 pixels fail the test.​

Also, note that I've shown a thread size of 16 and a bank size of 4. These could be scaled up, e.g. a thread size of 32 with banks of 8, i.e. an effective batch size of 8.​
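
A toy way to see the "effective batch size of one quad" point: compare how many quad-fetches are needed when whole quads can be skipped versus when a whole 16-pixel thread must run if any of its quads is live. The pass/fail pattern below is made up; only the counting mirrors the argument.

```python
# Toy comparison: quad-granular skipping (the SSRF idea) vs 16-pixel-thread granularity.
# The live/dead pattern is invented; only the counting logic follows the argument above.
import random

random.seed(1)
THREADS, QUADS_PER_THREAD = 4, 4
# True = quad has at least one pixel that passes the dynamic branch test.
live = [[random.random() < 0.6 for _ in range(QUADS_PER_THREAD)] for _ in range(THREADS)]

quads_quad_granular = sum(sum(row) for row in live)
quads_thread_granular = sum(QUADS_PER_THREAD for row in live if any(row))

print("live quads per thread:       ", [sum(row) for row in live])
print("quads fetched, quad batches:  ", quads_quad_granular)
print("quads fetched, 16-px batches: ", quads_thread_granular)
```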

(All this counting in fours has got me putting 4 sugars in my tea :LOL: )

Jawed​
 