AMD: R7xx Speculation

Status
Not open for further replies.
Or to break it down some more:

3870 = 4 arrays of 16 shaders, totaling 64 shaders, each shader with 4+1 MADDs, or 80 MADDs per array = 320 SPs total
4870 = 5 arrays of 32 shaders, totaling 160 shaders, each with 4+1 MADDs, or 160 MADDs per array = 800 SPs total

With g92 you'd be talking:

128 shaders, each with a 2+1 (MADD + MUL) = 384 units.

Of course that's a layman's view, not taking into account that the shader clockspeed is higher (~double) on an Nvidia part, and that the MUL is not used in general shading, but rather for special functions which (IIRC) ATi does with their MADDs. I would say look at G92 as 256 MADDs, but that wouldn't be giving it fair credit for the work the MUL is doing, or the fact it's used in CUDA (and perhaps will be used for PhysX).

At any rate, it's a serious boost.

So 5 arrays...Are we thinking 20 ROPs?
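The unit arithmetic in the breakdown above can be checked with a quick sketch (the array/shader counts are the thread's rumoured figures, not confirmed specs, and the helper name is purely illustrative):

```python
# Check of the stream-processor arithmetic above. Unit counts are
# the rumoured figures from this thread, not confirmed specs.

def total_sps(arrays, shaders_per_array, slots_per_shader):
    """Total 'stream processors' = arrays * shaders per array * ALU slots."""
    return arrays * shaders_per_array * slots_per_shader

print(total_sps(4, 16, 5))  # 320  (RV670: 4 arrays of 16 shaders, 4+1 slots)
print(total_sps(5, 32, 5))  # 800  (rumoured RV770: 5 arrays of 32)
print(128 * 3)              # 384  (G92: 128 shaders, 2+1 MADD+MUL)
```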


Just Off topic(ish) for a second...

I always thought that G80/G92 did 1 MADD (plus a MUL) per stream proc, not 2??

I worked out in another thread that RV670 at 775 MHz x 320 stream procs puts out 248,000 a second (I dunno what you'd call the number, shader calcs per second, times a million for the MHz conversion??)

I figured G92 for only 1 calc per stream proc, so that puts it at: 1650 MHz (8800GTS 512MB shader clock) x 128 = 211,200.

If it does 2 shader calcs per SP, that is 422,400!!!

That's almost double RV670. That can't be right, can it??

If it is, then what is the bottleneck of the architecture?? The ROPs?? Bandwidth??
It surely isn't the TMUs.

I'm most likely wrong on this, so can someone please explain it to a simpleton like myself.
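The throughput figures compared above work out as follows. This is a minimal sketch in the post's own unit (millions of "shader calcs" per second, i.e. clock in MHz times ops issued per clock); the clocks and unit counts are the ones quoted in the post:

```python
# "Shader calcs per second" in millions: shader clock in MHz
# multiplied by calcs issued per clock. Figures as quoted above.

def m_calcs(clock_mhz, units, calcs_per_unit=1):
    return clock_mhz * units * calcs_per_unit

print(m_calcs(775, 320))      # 248000  RV670, one calc counted per SP
print(m_calcs(1650, 128))     # 211200  G92, MADD only
print(m_calcs(1650, 128, 2))  # 422400  G92, if the co-issued MUL counted too
```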
 
This is moronic. There's no point in making all five support transcendental functions, for a start. It would hugely increase the size while bringing no appreciable performance benefit.
Yes, but having 5 simple units would make more sense. Transcendental functions could be done without explicit hardware support (or maybe some cheap support only so approximation can be done faster).
In any case, I wouldn't expect the shader units themselves to get changed so significantly for a chip which is believed to be a beefed up refresh part...
 
I figured G92 for only 1 calc per stream proc, so that puts it at: 1650 MHz (8800GTS 512MB shader clock) x 128 = 211,200.

Based on what we know of G80/G92 this is as close as you'll get to a true performance metric for shader calcs. Even if the MUL was co-issued you couldn't say that shader calcs were doubled. All you could say was that MULs were doubled. But that's all moot because the missing MUL is still missing to this day :D
 
Thanks Trinibwoy, I was just getting confused with the terminology.

Turtle is talking about floating-point ops with his numbers, not the number of MADDs/MULs.

Appreciate the help though :)
 
Yes, but having 5 simple units would make more sense. Transcendental functions could be done without explicit hardware support (or maybe some cheap support only so approximation can be done faster).
I suggested this a long time ago - you'd want look-up tables for each lane and then use repeated MADs (e.g. to produce a result every 4 clocks). But the look-up tables are still relatively costly so they'd be anything but "simple units".

Then you get into the question of why have 5 and then you get into questions of register file organisation, batching, clause-scheduling (ALU instructions are issued in groups of a maximum of 32 slots) etc. It would be a complete re-design.
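For what it's worth, the table-plus-repeated-MADs scheme described above can be sketched in software, using reciprocal as the transcendental. The table size, seed construction and iteration count here are made-up illustrations, not anything a real chip uses:

```python
# Table seed + Newton-Raphson refinement for 1/x: each iteration is
# two dependent MADs (e = 1 - x*y; y = y + y*e), which is the
# "repeated MADs" part. TABLE_BITS is an arbitrary illustrative size.

TABLE_BITS = 6
# Seed table indexed by the top mantissa bits of x in [1, 2),
# storing the reciprocal of each interval's midpoint.
SEED = [1.0 / (1.0 + (i + 0.5) / 2**TABLE_BITS) for i in range(2**TABLE_BITS)]

def recip(x, iterations=2):
    assert 1.0 <= x < 2.0          # normalised mantissa only, for simplicity
    y = SEED[int((x - 1.0) * 2**TABLE_BITS)]
    for _ in range(iterations):    # the error roughly squares every pass
        e = 1.0 - x * y            # MAD 1
        y = y + y * e              # MAD 2
    return y

print(abs(recip(1.5) - 1 / 1.5) < 1e-8)  # True
```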

In any case, I wouldn't expect the shader units themselves to get changed so significantly for a chip which is believed to be a beefed up refresh part...
Yeah, this is the most maddening thing, people are expecting ATI to do some kind of about-face as if they've made some grave mistake. Compounded by the moronic idea that G80 is a scalar architecture.

Jawed
 
I don't think you've understood it correctly.

That's (4+1) * 32 * 5 arrays = 800 stream processors.
HD3870/HD3850 is (4+1) * 64 = 320 stream processors.

Oh god, will the graphics people please stop trying to rename everything under the sun so it seems like it's better than it is.

"The new G9999 with 400000000000000 Uberness units!"
"what is an 'uberness' unit?"
"oh that, you see its this really awesome thing that buffers and strengthens the clock"
"oh god, you are using clock buffers as a feature!"

Seriously, people: an ALU is an ALU. A processor is a processor. A SIMD ALU is a SIMD ALU. I don't care what fancy feature it has, a 5-wide SIMD ALU IS NOT 5 "stream processors". Nvidia is just as guilty of this, if not more so, than ATI, but it's all just BS.

We all know that the number of actual processors in both the 670 and G92 is in the single to low double digits!

Aaron Spink
speaking for myself inc.
 
Oh god, will the graphics people please stop trying to rename everything under the sun so it seems like it's better than it is.

"The new G9999 with 400000000000000 Uberness units!"
"what is an 'uberness' unit?"
"oh that, you see its this really awesome thing that buffers and strengthens the clock"
"oh god, you are using clock buffers as a feature!"

Seriously, people: an ALU is an ALU. A processor is a processor. A SIMD ALU is a SIMD ALU. I don't care what fancy feature it has, a 5-wide SIMD ALU IS NOT 5 "stream processors". Nvidia is just as guilty of this, if not more so, than ATI, but it's all just BS.

We all know that the number of actual processors in both the 670 and G92 is in the single to low double digits!

Aaron Spink
speaking for myself inc.

I thought the 670 ALUs were MIMD, yes?
 
LMAO, getting all bent out of shape about nothing.
5 arrays = 5 shader blocks/groups/clusters (don't know what everyone else calls them) of 32 shaders each. 1 shader is 4+1, or whatever Dave was hinting at earlier.

Basically this picture but with an extra "array" and with double the shaders in an "array."


Looking at that picture, IMO, it would make sense that with an extra array they might add another quad of ROPs for a total of 20.
And the way AnandTech explains the architecture, which may or may not be wrong, is that each of the 4 shader "arrays" is connected to its own quad TMU. So if there are 5 "arrays", shouldn't they add another TMU block, so it might be 40 TMUs?
 
We all know that the number of actual processors in both the 670 and G92 is in the single to low double digits!

It's true that they are all just SIMD processors of varying widths. But how exactly would you market the differences between G80 and G71 or R600 and R580? I'm sure you appreciate the added value there. IMO it's not something that should be ignored simply because "they're all SIMD".

I for one think Nvidia's approach of filling their processors with scalar operands from different pixels/vertices/threads etc instead of independent instructions from the same object is bloody fantastic and deserves to be highlighted in some way. Of course some would argue that it isn't true scalar and there's still some dual-issue, instruction reordering and other wizardry to perform in the compiler but it's disingenuous to dismiss the elegance of the whole setup.

If calling it a "stream processor" or "scalar ALU" gets the message across that's fine with me. Same goes for R600 as its co-issue flexibility far outshines R580 in its ability to handle all possible operand combinations including 5 independent scalar instructions per ALU.

Having said that, the term "scalar" if it must be used certainly doesn't mean the same thing when comparing G8x with R6xx but it's still a nice easy way of describing the ALU improvements this generation. But of course B3D is the place to wax pedantic and set the unwashed masses straight so I digress :)
 
Looking at that picture, IMO, it would make sense that with an extra array they might add another quad of ROPs for a total of 20.
And the way AnandTech explains the architecture, which may or may not be wrong, is that each of the 4 shader "arrays" is connected to its own quad TMU. So if there are 5 "arrays", shouldn't they add another TMU block, so it might be 40 TMUs?

IIRC TMU's in R6xx aren't tied to a particular array but to a particular quad location across the chip. So TMU quad #1 serves shader quad #1 in ALL arrays. The number of TMU quads is tied to the number of quads in each array ergo #TMU's = #shaders per array.

With the current 5 array * 32 shader rumour that points to 32 TMU's, a nice round number!
 
IIRC TMU's in R6xx aren't tied to a particular array but to a particular quad location across the chip. So TMU quad #1 serves shader quad #1 in ALL arrays. The number of TMU quads is tied to the number of quads in each array ergo #TMU's = #shaders per array.

With the current 5 array * 32 shader rumour that points to 32 TMU's, a nice round number!

That makes sense. Thanks for clearing that up.
I guess AnandTech just worded this badly (or I misinterpreted/misunderstood it):
AnandTech.com said:
Rather than a small number of SPs spread across eight groups, our block diagram shows R600 has a high number of SPs in each of four groups. Each of these four groups is connected to its own texture unit, while they share a connection to shader export hardware and a local read/write cache.

LoL. Some, in this thread, even typed out pictures that I guess didn't register for me...
Edit- They didn't register because they were made to picture older rumors, not this one.
 
I for one think Nvidia's approach of filling their processors with scalar operands from different pixels/vertices/threads etc instead of independent instructions from the same object is bloody fantastic and deserves to be highlighted in some way.

Errm, eh?
 
I'm wondering where the thoughts of working on different threads or objects in the same SIMD / shader grouping come from.
 
I think he means that ATI's filling of a "processor" with scalar operands from both various pixels and various components per pixel is inelegant.

I would appreciate it if you wouldn't put words in my mouth. You can scurry along now....thanks pardner ;)

Dave said:
I'm wondering where the thoughts of working on different threads or objects in the same SIMD / shader grouping come from.

I intentionally put pixel/vertex/thread all together to make it clear that by "thread" I meant an individual primitive. Not a complex "object" or "batch". Is it not the case that both R6xx and G80 issue instructions to 16 different primitives per-clock per processor (well 8x2 for G80 but you know what I mean)? The major difference being that with R6xx - as Jawed so snidely pointed out - multiple instructions and/or operands are issued together?

The point being of course that it's not just sufficient to have 16 objects to work on in R6xx...you also need enough operands and/or instructions available in order to "fill" all slots in the ALU. With G80 this second requirement isn't nearly as onerous. And though I wasn't trying to draw a comparison, since Jawed asked - yes I do find it more elegant in that respect.
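The "filling" requirement above can be illustrated with a toy utilisation model. The numbers are entirely made up; it only captures the point that a 5-wide unit needs 5 independent ops per object per issue, while a 1-wide unit is full whenever any work exists:

```python
# Toy model: fraction of a wide ALU's slots that get filled when the
# compiler can only find 'ilp' independent scalar ops per object per
# issue. A scalar (1-wide) unit is always fully utilised here.

def utilisation(ilp, width=5):
    return min(ilp, width) / width

for ilp in (1, 2, 3, 5):
    print(ilp, utilisation(ilp))    # 0.2, 0.4, 0.6, 1.0 for a 5-wide unit
print(utilisation(1, width=1))      # 1.0: scalar issue stays full
```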
 
I intentionally put pixel/vertex/thread all together to make it clear that by "thread" I meant an individual primitive. Not a complex "object" or "batch". Is it not the case that both R6xx and G80 issue instructions to 16 different primitives per-clock per processor (well 8x2 for G80 but you know what I mean)? The major difference being that with R6xx - as Jawed so snidely pointed out - multiple instructions and/or operands are issued together?

The point being of course that it's not just sufficient to have 16 objects to work on in R6xx...you also need enough operands and/or instructions available in order to "fill" all slots in the ALU. With G80 this second requirement isn't nearly as onerous. And though I wasn't trying to draw a comparison, since Jawed asked - yes I do find it more elegant in that respect.

The primary difference is that G80 serializes all instructions on each object, whereas, while R6xx can do that, it parallelizes as well to get maximum utilization. However, you should also consider what, if any, restrictions there are in the serialization.
 
That makes sense. Thanks for clearing that up.
I guess AnandTech just worded this badly (or I misinterpreted/misunderstood it).


LoL. Some, in this thread, even typed out pictures that I guess didn't register for me...
Edit- They didn't register because they were made to picture older rumors, not this one.
You misinterpreted the Anandtech quote. There are 4 SIMDs and each is composed of 4 groups that connect to 4 TMUs. Too many 4's. ;)
 
Dumb question: why does it have to be 5 clusters?

R600/RV670 =

4 clusters * 16 ALUs = 64 ALUs * 5D = 320SPs or 80SPs/cluster

Why can't it be:

4 clusters * 32 ALUs = 128 ALUs * 5D = 640SPs or 160SPs/cluster
 
The primary difference is that G80 serializes all instructions on each object, whereas, while R6xx can do that, it parallelizes as well to get maximum utilization. However, you should also consider what, if any, restrictions there are in the serialization.

I was under the impression that it should read rather like this:
The primary difference is that G80 just has to serialize[strike]s[/strike] all instructions on each object in one pixel-processor, whereas, while R6xx can under certain circumstances, such as no dependency between scalars, do that to achieve similar effectiveness, it generally also has to parallelize[strike]s[/strike] as well to get maximum utilization.
[my bolding/additions]

Please tell me if/where I'm wrong.

4 clusters * 32 ALUs = 128 ALUs * 5D = 640SPs or 160SPs/cluster
The cheapest, but not the most efficient, way to do that would be serializing a second MADD onto all 320 ALUs.
 