AMD: R7xx Speculation

Mintmaster · May 21, 2008

Humus said:
Fetch4 might be rare, but point sampling is not particularly rare. In a modern engine I'd guess that typically 25% or more of the fetches are point sampled. Most table lookups and screen-space based fetches are point sampled.

True, but does additional point-sampling ability help? We'd need to be texture fetch limited when doing point samples and less so when doing bilinear samples for ATI's architectural choices to make sense.

Entropy · May 21, 2008

It is unfortunate that AMD seems to have pulled the 4850 SKU that was supposed to have 4850 Pro GPU clocks and GDDR5, and the same TDP as the same part with GDDR3. It offered the best performance/Watt of the alternatives.

mao5 · May 21, 2008

R870 is 40nm with 2000sp?

http://www.pczilla.net/en/post/16.html

Arun · May 21, 2008

Heh, that COULD be right, however that seems to me like someone who just scaled 800 SPs by slightly more than the gate density improvement of 40nm, which is 2.35x... Of course, that's unrealistic in practice because power scaling is substantially lower than the perf/mm² improvement, as I point out here: http://www.beyond3d.com/content/news/636
EDIT: Of course, if you optimized much more for power (and thus had only a low clockrate increase), it might be doable; I explain that in the piece too but I thought it was important to point it out explicitly here.
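
A quick back-of-the-envelope check of the scaling argument above, as a minimal sketch. The figures are the ones quoted in the posts (800 SPs, a 2.35x gate density improvement at 40nm, and the rumoured 2000 SPs); the variable names are purely illustrative, and power scaling, which Arun notes is the real constraint, is not modelled.

```python
# Figures quoted in the thread; nothing here is measured data.
BASE_SPS = 800             # SP count used as the scaling baseline
DENSITY_GAIN_40NM = 2.35   # gate density improvement cited for 40nm
RUMOURED_SPS = 2000        # SP count from the R870 rumour

# SP count if the whole density gain were spent on more SPs.
density_scaled_sps = BASE_SPS * DENSITY_GAIN_40NM

# Scaling factor the rumour actually implies.
rumour_ratio = RUMOURED_SPS / BASE_SPS

print(f"800 SPs scaled by density alone: {density_scaled_sps:.0f}")
print(f"Scaling implied by the rumour:   {rumour_ratio:.2f}x")
```

The rumour implies 2.5x, slightly more than the 2.35x density gain, which matches the "slightly more than the gate density improvement" remark.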

Arun · May 21, 2008

Mintmaster said:
True, but does additional point-sampling ability help? We'd need to be texture fetch limited when doing point samples and less so when doing bilinear samples for ATI's architectural choices to make sense.

Well, clearly it would help in texture-limited shaders that combine both bilinear and point sampling at different points of the program, wouldn't it?

Jawed · May 21, 2008

Arun said:
Well, clearly it would help in texture-limited shaders that combine both bilinear and point sampling at different points of the program, wouldn't it?

I presume there are 3 cases that relate to 2D sampling:

1. post-process, e.g. tonemapping where there's a 1:1 relation between texels and screen pixels
2. BRDFs where texture coordinates are used for the fetch but no filtering is performed
3. filtered sampling for regular surface textures

As far as I can tell 2 and 3 are commonly combined in a given pixel shader.

As far as I can tell BRDF sampling occupies the filtering pipeline in R6xx even though no filtering is required, because the addressing must be performed as though the BRDF is a surface texture - and the point-samplers that are normally used for vertex fetch in R6xx cannot do this complete address calculation.

Is that right?

Jawed

ShaidarHaran · May 21, 2008

Jawed said:
I'm looking forward to the day when there's no dedicated texture-filtering hardware in at least one GPU :smile: But for the time being it seems the balance in terms of die size is to keep it fixed-function.

This may be purely because of the range of SKUs that an architecture needs to cover, something like a 10-fold range in performance.

e.g. on a high end GPU with 2000 ALU lanes at 1GHz there might not be any need for dedicated TF, but on the $30 GPU, a couple of hundred lanes, even at 1GHz, won't be enough.

As to the actual cost of TF, one of these days perhaps we'll have a thread that tries to get to the bottom of it. I don't know how to split-out the cost of TF from the rest of a TU. I'm hazarding a guess that the whole lot is in the region of 125M transistors in R670 (caches, thread arbitration, instruction issue, point addressing, filtered addressing, fetching point samples, fetching for bilinear, filtering). A fair amount of the TU needs sizing up in order to increase the TA:TF ratio.

Needless to say, I'm pessimistic about the degree of architectural change in R7xx. ATI's designed a set of knobs (SIMD and TU width, SIMD count, RBE count, MC count, MC width) and will frobnicate them for R7xx.

Jawed

mao5 said:
R870 is 40nm with 2000sp?

http://www.pczilla.net/en/post/16.html

Arun said:
Heh, that COULD be right, however that seems to me like someone who just scaled 800 SPs by slightly more than the gate density improvement of 40nm, which is 2.35x... Of course, that's unrealistic in practice because power scaling is substantially lower than the perf/mm² improvement, as I point out here: http://www.beyond3d.com/content/news/636
EDIT: Of course, if you optimized much more for power (and thus had only a low clockrate increase), it might be doable; I explain that in the piece too but I thought it was important to point it out explicitly here.

Ok now, this is just getting ridiculous. Every time someone with some credibility says something about a possible future GPU configuration, it pops up somewhere else on the net within a day or so.

Either that or Jawed is some kind of GPU prophet

On to the analysis:
2000 SPs isn't out of the question, even for a single chip configuration. I think the 40/45nm shrink of rv770 very well could have 800 SPs, it may be a bit larger than rv770 but I think ATi may well gamble and attempt to move their ASPs back up at the high-end with the successive generation (being hd58x0, presumptively) betting on the low price and aggressive performance profile of their upcoming hd48x0 lineup. It's not much of a gamble really. I think they can count on this one.

Jawed · May 21, 2008

ShaidarHaran said:
Either that or Jawed is some kind of GPU prophet

I don't think so...

ShaidarHaran said:
On to the analysis:
2000 SPs isn't out of the question, even for a single chip configuration. I think the 40/45nm shrink of rv770 very well could have 800 SPs, it may be a bit larger than rv770 but I think ATi may well gamble and attempt to move their ASPs back up at the high-end with the successive generation (being hd58x0, presumptively) betting on the low price and aggressive performance profile of their upcoming hd48x0 lineup. It's not much of a gamble really. I think they can count on this one.

2000 lanes (400 elements) is about 1200M transistors I reckon

Including 32 TUs and 32 RBEs and 256-bit MCs it could sneak in at under 2 billion transistors.

Ah, tis fun to spew numbers like this, no real meaning and with 40nm so far off.

Jawed
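
The rough numbers in the post above can be laid out as follows. This is only a sketch of the arithmetic: the 5 lanes per element comes from the 2000 lanes / 400 elements ratio in the post, the 2 billion figure is treated as an upper bound because the post says "under 2 billion", and the names are illustrative.

```python
# Speculative figures from the post above, not real die data.
LANES = 2000
LANES_PER_ELEMENT = 5                 # 2000 lanes / 400 elements
ALU_TRANSISTORS = 1_200_000_000       # "about 1200M" for the ALUs
TOTAL_BUDGET = 2_000_000_000          # "under 2 billion" upper bound

elements = LANES // LANES_PER_ELEMENT
transistors_per_lane = ALU_TRANSISTORS / LANES

# Upper bound on what is left for the 32 TUs, 32 RBEs and the MCs.
non_alu_budget = TOTAL_BUDGET - ALU_TRANSISTORS

print(f"Elements:             {elements}")
print(f"Transistors per lane: {transistors_per_lane:,.0f}")
print(f"Left for TUs/RBEs/MCs (at most): {non_alu_budget:,}")
```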

Mart · May 21, 2008

Arnold Beckenbauer said:
We know, that R600 doesn't have 320 SPs and actually there are no "64 5D ALUs", but 4 Vec16-ALUs. But all marketing guys want us to think the R600 has 320 SPs, but G80 has 128 SPs only...

Could you please explain a bit about the "4 Vec16 ALUs"? I only discovered a few months ago that R600 wasn't 320SPs but just 4 SIMD arrays with 16 5D ALUs each, but now you're saying it's not even that. Or is each ALU in your "Vec16" one of these 5D ALUs? If so, how does this work? :???:

Jawed · May 21, 2008

Once upon a time threads would be started for this kind of stuff:

http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~125849,00.html

Anyway, it seems GDDR5 is worth some marketing mileage:

“The days of monolithic mega-chips are gone. Being first to market with GDDR5 in our next-generation architecture, AMD is able to deliver incredible performance using more cost-effective GPUs,” said Rick Bergman, Senior Vice President and General Manager, Graphics Product Group, AMD. “AMD believes that GDDR5 is the optimal way to drive performance gains while being mindful of power consumption. We’re excited about the potential GDDR5 brings to the table for innovative game development and even more exciting game play.”

GDDR5 for Stream Processing
In addition to the potential for improved gaming and PC application performance, GDDR5 also holds a number of benefits for stream processing, where GPUs are applied to address complex, massively parallel calculations. Such calculations are prevalent in high-performance computing, financial and academic segments among others. AMD expects that the increased bandwidth of GDDR5 will greatly benefit certain classes of stream computations.
New error detection mechanisms in GDDR5 can also help increase the accuracy of calculations by indentifying errors and re-issuing commands to get valid data. This capability is a level of reliability not available with other GDDR-based memory solutions today.

Let's hope the GPUs will be able to use that bandwidth...

Jawed

3dilettante · May 21, 2008

The bandwidth argument could be powerful, if the coarser granularity is accommodated.

GDDR5's CRC for data transmission isn't much of a reliability talking point, though.
It's necessary to get acceptable error rates at the very high clocks the RAM will be pegged to, but that's more like treading water than getting anywhere new, and it doesn't really address the bigger reliability concerns graphics cards have for the reliability-conscious.

AMD apparently can't pay its proofreaders anymore, or at least that's the problem I've "indentified".

Arnold Beckenbauer · May 21, 2008

Mart said:
Could you please explain a bit about the "4 Vec16 ALUs"? I only discovered a few months ago that R600 wasn't 320SPs but just 4 SIMD arrays with 16 5D ALUs each, but now you're saying it's not even that. Or is each ALU in your "Vec16" one of these 5D ALUs? If so, how does this work?

What I said wasn't correct.
This puts it much better:

aaronspink said:
G80 doesn't contain SCALAR processors, it contains multiple SIMD processors just like every graphics chip out there, 16 of them if data is to be believed, each with an 8 wide SIMD array.

RV670 is 4 processors with 5 parallel arrays of 16 wide SIMD.
...
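
To see how the marketing "SP" counts fall out of the SIMD organisation described in the quote, here is a minimal sketch. The helper `total_lanes` is hypothetical, and the configurations are only the ones stated in this thread (16 processors with 8-wide SIMD for G80; 4 processors with 5 parallel 16-wide arrays for R600/RV670).

```python
def total_lanes(processors: int, simd_width: int, parallel_arrays: int = 1) -> int:
    """Count ALU lanes as processors x parallel arrays x SIMD width."""
    return processors * parallel_arrays * simd_width

# G80 as described in the quote: 16 processors, each 8 wide.
g80_lanes = total_lanes(processors=16, simd_width=8)

# R600/RV670 as described: 4 processors, 5 parallel arrays, each 16 wide.
r600_lanes = total_lanes(processors=4, simd_width=16, parallel_arrays=5)

print(f"G80 lanes:  {g80_lanes}")
print(f"R600 lanes: {r600_lanes}")
```

Both headline numbers (128 and 320) are lane counts, not counts of independent processors, which is the point the quoted post is making.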

silent_guy · May 21, 2008

3dilettante said:
AMD apparently can't pay its proofreaders anymore, or at least that's the problem I've "indentified".

They only error-proofed the transmission, not the content itself?

Mat3 · May 21, 2008

The days of monolithic mega-chips are gone.

I hope that's a strong hint of something better than just crossfire on a card.

3dilettante · May 21, 2008

silent_guy said:
They only error-proofed the transmission, not the content itself?

Good one.
It's nice to know AMD can still supply the raw material for its own marketing-undermining snark.

sebbbi · May 21, 2008

Humus said:
Fetch4 might be rare, but point sampling is not particularly rare. In a modern engine I'd guess that typically 25% or more of the fetches are point sampled. Most table lookups and screen-space based fetches are point sampled.

25% is a pretty good estimate for current-generation games. In next-generation games, point sampling performance will become even more important as deferred shading gets more popular and the number of screen-space post-process effects increases rapidly.

McElvis · May 21, 2008

'Final' Specs

Not sure if this has been posted yet, but:

Final Specs
http://www.techreport.com/discussions.x/14763

Mintmaster · May 21, 2008

Arun said:
Well, clearly it would help in texture-limited shaders that combine both bilinear and point sampling at different points of the program, wouldn't it?

That's actually a subset of the condition in my statement. I said texture limited when doing point samples, not when exclusively doing point samples.

Anyway, I guess the point samplers' main purpose is to assist vertex shading. Maybe the additional cost was offset by some routing simplifications elsewhere.

Pete · May 22, 2008

3dilettante said:
AMD apparently can't pay its proofreaders anymore, or at least that's the problem I've "indentified".

Please, try to be more discreet about your findings. You might embarrass someone over there.

Slyne · May 22, 2008

Pete said:
Please, try to be more discreet about your findings. You might embarrass someone over there.

Whoa, took me a while to find what you were referring to, even though it is in plain sight at the beginning of the page (happens to be the last thing I read).
