AMD: R7xx Speculation

MfA · Mar 14, 2008

The problem with calling what GPUs do SIMD is that people will tend to think it works in the same way as desktop processors (which have no MD input parameters for load & store).

trinibwoy · Mar 14, 2008

Quasar said:
I was under the impression that it should read rather like this:

The primary difference is that G80 just has to serialize all instructions on each object in one pixel processor, whereas R6xx, while it can do that under certain circumstances (such as no dependency between scalars) to achieve similar effectiveness, generally also has to parallelize to get maximum utilization.

That's how I tend to look at it as well. You can make the case for higher ALU density with ATI's approach but I don't believe the ability to process vector instructions should be highlighted as an advantage. At least I don't recall any cases where R6xx proves more efficient per flop.
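To make the serialize-versus-parallelize contrast concrete, here is a toy sketch in Python. It is not a model of either chip: it assumes a hypothetical shader written as scalar ops with dependencies and greedily packs them into 5-slot bundles, where only independent scalars may share a bundle. The function names and the two example op lists are made up for illustration.

```python
# Toy model only: each op is (name, set_of_dependencies); all names are hypothetical.
def pack_bundles(ops, width=5):
    """Greedily pack scalar ops into bundles of up to `width` slots.

    An op may only be issued once every op it depends on has
    completed in an earlier bundle.
    """
    done = set()
    pending = list(ops)
    bundles = []
    while pending:
        bundle = []
        for name, deps in pending:
            if len(bundle) < width and deps <= done:
                bundle.append(name)
        if not bundle:
            raise ValueError("dependency cycle or missing op")
        done.update(bundle)
        pending = [(n, d) for n, d in pending if n not in done]
        bundles.append(bundle)
    return bundles


def utilization(bundles, width=5):
    """Fraction of issue slots that carry useful work."""
    used = sum(len(b) for b in bundles)
    return used / (len(bundles) * width)


# A vec4 MAD: four independent scalar MADs share one bundle (4 of 5 slots).
vec4_mad = [(f"mad.{c}", set()) for c in "xyzw"]

# A fully dependent scalar chain: each op needs the previous result.
chain = [("a", set()), ("b", {"a"}), ("c", {"b"}), ("d", {"c"})]

print(utilization(pack_bundles(vec4_mad)))  # 0.8
print(utilization(pack_bundles(chain)))     # 0.2
```

In this toy model a scalar lane would spend four issue slots per object on either example with none idle; the dependent chain is the case where the 5-wide unit sits mostly empty unless the compiler finds other independent work.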

The cheapest, but not the most efficient way of which would be serializing a second MADD to all 320 ALUs.

Isn't that what G71 did? It's probably hard enough to find co-issue opportunities for R6xx....don't think they would want to add dual-issue on top of that!

CarstenS · Mar 14, 2008

trinibwoy said:
Isn't that what G71 did? It's probably hard enough to find co-issue opportunities for R6xx....don't think they would want to add dual-issue on top of that!

Yeah, and I sure hope this is not what AMD is going for with RV770.

Razor1 · Mar 14, 2008

trinibwoy said:
That's how I tend to look at it as well. You can make the case for higher ALU density with ATI's approach but I don't believe the ability to process vector instructions should be highlighted as an advantage. At least I don't recall any cases where R6xx proves more efficient per flop.

This is interesting, actually. With the G80 vs the R600 in geometry performance (which uses a lot of vector instructions), in Dx9 it certainly does have a huge advantage and fares well in Dx10 too, but for Dx9 performance I'm not fully convinced the G80 is doing load balancing, because of its comparatively poor performance vs Dx10.

trinibwoy · Mar 14, 2008

Razor1 said:
This is interesting, actually. With the G80 vs the R600 in geometry performance (which uses a lot of vector instructions), in Dx9 it certainly does have a huge advantage and fares well in Dx10 too, but for Dx9 performance I'm not fully convinced the G80 is doing load balancing, because of its comparatively poor performance vs Dx10.

TBH I don't know what numbers you're referring to here but G8x's synthetic geometry performance isn't shader bound. See G94 vs G92. Do you have a link demonstrating what you're talking about re DX9 vs DX10?

Kanyamagufa · Mar 14, 2008

Seems like the "800 stream processors" rumor is picking up speed...

Tech Report link which leads to vr-zone and finally finishes up at Chiphell.

Razor1 · Mar 14, 2008

trinibwoy said:
TBH I don't know what numbers you're referring to here but G8x's synthetic geometry performance isn't shader bound. See G94 vs G92. Do you have a link demonstrating what you're talking about re DX9 vs DX10?

http://www.digit-life.com/articles3/video/rv670-part2-page8.html (Dx10)

http://www.digit-life.com/articles3/video/rv670-part2-page5.html (Dx9)

Sorry, I should have added vertex shader performance in my initial statement too.

trinibwoy · Mar 14, 2008

Razor1 said:
http://www.digit-life.com/articles3/video/rv670-part2-page8.html (Dx10)

http://www.digit-life.com/articles3/video/rv670-part2-page5.html (Dx9)

Sorry, I should have added vertex shader performance in my initial statement too.

Those DX10 tests are geometry shader and the DX9 are vertex shader. Performance on the latter is supposedly limited by triangle setup on G8x.

Take a look at a more recent review using the same tests - http://www.digit-life.com/articles3/video/g94-part2.html and note how vertex and geometry shader performance stacks up between RV670, G94 and G92.

Ailuros · Mar 14, 2008

Kanyamagufa said:
Seems like the "800 stream processors" rumor is picking up speed...

Tech Report link which leads to vr-zone and finally finishes up at Chiphell.

If RV670/SP = RV770SP and at the same time the rumours about a 250mm2 die size under 55nm should be true, the answer to that riddle is roughly the same as when somebody asks how you can fit an elephant into a refrigerator. You open the refrigerator, put the elephant inside and close the refrigerator.

Jawed · Mar 14, 2008

Jawed said:
I'm gritting my teeth, as I think this rumour is one of those dreams, but 160:32 is a 5:1 ALU:TEX ratio

I've just thought, two RV770s (R780 is what I like to call it) could have 400SPs each arranged in 5 SIMDs. Then 32 TUs is 16 on each RV770 and 32 RBEs would arise the same way.

Did anyone suggest this already?
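For anyone checking the bookkeeping behind that two-chip idea, a minimal Python sketch follows. All figures are the rumoured ones from this thread, not confirmed hardware data, and the helper name `chip_totals` is hypothetical.

```python
# Rumoured figures from the thread; nothing here is confirmed hardware data.
SPS_PER_5D_ALU = 5


def chip_totals(chips, sps_per_chip, simds_per_chip, tus_per_chip):
    """Sum up stream processors and texture units across identical chips."""
    sps = chips * sps_per_chip
    tus = chips * tus_per_chip
    alus_5d = sps // SPS_PER_5D_ALU
    return {
        "sps": sps,
        "alus_5d": alus_5d,
        "tus": tus,
        "sps_per_simd": sps_per_chip // simds_per_chip,
        "alu_tex_ratio": alus_5d / tus,
    }


totals = chip_totals(chips=2, sps_per_chip=400, simds_per_chip=5, tus_per_chip=16)
print(totals)
```

Two such chips give 800 SPs, i.e. 160 5D ALUs against 32 TUs, which is the 5:1 ALU:TEX ratio from the quoted post.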

Jawed

no-X · Mar 14, 2008

How many transistors does one 5D ALU contain? An R5xx ALU with its register array was about 2M. The R6xx ALU is less powerful in theoretical flops, but more effective (scalar). Could we assume 2-3M for one 5D ALU + reg. array?

One R5xx TU + one R5xx ROP together cost about 8M. R6xx's TMUs are significantly beefier, so 6M for the TU alone could be close to reality.

RV770 is rumoured to be ~200M transistors larger than RV670. I'd be very surprised if all those transistors were used for only one SIMD. I think 96 5D ALUs (+64-96M) and 32 texture units (+96M) isn't an unrealistic possibility (still waiting for Orton's future)
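The arithmetic behind those "+64-96M" and "+96M" figures can be sketched in a few lines of Python. The baseline is an assumption: RV670 taken as 64 5D ALUs (320 SPs) and 16 texture units. The per-unit costs are the guesses from the post, and the function name is hypothetical.

```python
# Baseline is an assumption: RV670 with 64 5D ALUs (320 SPs) and 16 TUs.
RV670_ALUS_5D = 64
RV670_TUS = 16


def extra_transistors_m(alus_5d, tus, m_per_alu, m_per_tu):
    """Added transistors, in millions, over the assumed RV670 baseline."""
    extra_alu = (alus_5d - RV670_ALUS_5D) * m_per_alu
    extra_tu = (tus - RV670_TUS) * m_per_tu
    return extra_alu + extra_tu


low = extra_transistors_m(96, 32, m_per_alu=2, m_per_tu=6)   # 64 + 96
high = extra_transistors_m(96, 32, m_per_alu=3, m_per_tu=6)  # 96 + 96
print(low, high)  # 160 192
```

Under these guesses the additions come to roughly 160-192M, which is in the neighbourhood of the rumoured ~200M.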

Anyway, why use high-speed GDDR5 modules for a 16-TU GPU? That would be quite an expensive overkill...

hoom · Mar 14, 2008

Because if all 5 units had the same capabilities the compiling/scheduling would be easier.

I suggested this a long time ago - you'd want look-up tables for each lane and then use repeated MADs (e.g. to produce a result every 4 clocks). But the look-up tables are still relatively costly so they'd be anything but "simple units".

Then you get into the question of why have 5 and then you get into questions of register file organisation, batching, clause-scheduling (ALU instructions are issued in groups of a maximum of 32 slots) etc. It would be a complete re-design.

This was moronic when I said it, but when you said it with some extra explanation & caveats it's not?

Note that I didn't specify that they would be full-blown transcendental units.
You put those words in my mouth.

Obviously would need to compromise on either number of units or complexity of units.

Shtal · Mar 15, 2008

Kanyamagufa said:
Seems like the "800 stream processors" rumor is picking up speed...

Tech Report link which leads to vr-zone and finally finishes up at Chiphell.

I'm hoping that ATI will be able to follow in the footsteps of Intel's Core 2, with Nvidia playing the Athlon 64; figuratively speaking, a "Radeon 4870 Core2".

Just guessing - Edit: Let's assume the rumors are correct and RV770 will have 160 SPs. How much of a performance boost over the GF8800 Ultra are we really looking at?

ATI Core2 RV770 @ approx ~825-875MHz GPU
55nm tech
32TMU's
16 ROP's
crossbar 4x64 or 8x32 = 256bit GDDR5 memory at 2200MHz x2 = 4400MHz effective, ~141GB/s memory bandwidth.

My calculation is that RV770 will be 2.5x-3x faster than RV670, which puts it 0.5x-1x faster than the GF8800 Ultra.

Jawed · Mar 15, 2008

hoom said:
This was moronic when I said it, but when you said it with some extra explanation & caveats it's not?

Note that I didn't specify that they would be full-blown transcendental units.
You put those words in my mouth.

Obviously would need to compromise on either number of units or complexity of units.

I don't think it's a good idea because the lookup tables are expensive.

Method and system for approximating sine and cosine functions

That's why NVidia did all that work to find a low cost amalgamation of transcendental/attribute-interpolation - to "slightly" expand the size of the lookup tables, but get a relatively big payoff (as well as amalgamated use of some of the logic).

If you can work out the relative costs of ATI's transcendental unit (note extra tables required for RCP, LOG etc.) and a "simplified, slow" transcendental unit with no dedicated multipliers, then it'd certainly be interesting...

But it's also worth remembering the T unit in R6xx also has other functionality over the X,Y,Z,W units: integer multiplication/division; integer bit shifting; type conversion.

Jawed

Jawed · Mar 15, 2008

no-X said:
How many transistors does one 5D ALU contain? An R5xx ALU with its register array was about 2M. The R6xx ALU is less powerful in theoretical flops, but more effective (scalar). Could we assume 2-3M for one 5D ALU + reg. array?

Sounds like a reasonable size to me. Say 3M.

One R5xx TU + one R5xx ROP together cost about 8M. R6xx's TMUs are significantly beefier, so 6M for the TU alone could be close to reality.

I'd edge to about 8M, there's L2 cache too.

RV770 is rumoured to be ~200M transistors larger than RV670. I'd be very surprised if all those transistors were used for only one SIMD. I think 96 5D ALUs (+64-96M) and 32 texture units (+96M) isn't an unrealistic possibility (still waiting for Orton's future)

I've completely forgotten about that. Will have to rummage for that, later.

Overall I've lost patience with these rumours - we're in the pipeline arms-race phase

Anyway, why use high-speed GDDR5 modules for a 16-TU GPU? That would be quite an expensive overkill...

I'm presuming the RBEs will have double the per-clock Z performance and that'll have a stronger influence on bandwidth.

Jawed

Ailuros · Mar 15, 2008

no-X said:
Anyway, why use high-speed GDDR5 modules for a 16-TU GPU? That would be quite an expensive overkill...

GDDR5 starts at 1.6GHz afaik; that gives 102.4GB/s bandwidth on a 256bit bus.
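The bandwidth figure is easy to reproduce. The sketch below assumes the "x2" convention used earlier in the thread (two transfers per quoted clock) and 1 GB = 10^9 bytes; the function name is hypothetical.

```python
def bandwidth_gb_s(bus_bits, clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    transfers_per_clock=2 mirrors the "x2" convention used in the thread;
    it is an assumption about how the quoted clock is counted.
    """
    bytes_per_transfer = bus_bits / 8
    transfers_per_s = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_s / 1e9


print(bandwidth_gb_s(256, 1600))  # 102.4
print(bandwidth_gb_s(256, 2200))  # 140.8
```

The second call is the 2200MHz case floated earlier in the thread, on the same 256bit bus.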

aaronspink · Mar 15, 2008

compres said:
I thought the 670 ALUs were MIMD, yes?

you tell me, do they have separate instruction pointers?

aaronspink · Mar 15, 2008

trinibwoy said:
It's true that they are all just SIMD processors of varying widths. But how exactly would you market the differences between G80 and G71 or R600 and R580? I'm sure you appreciate the added value there. IMO it's not something that should be ignored simply because "they're all SIMD".

I'm not ignoring the improvements, I'm just railing against the bs terminology being used. What it really comes down to is they have some better scheduling algorithms using effectively similar SIMD ALU arrays. i.e., each one is still being fed by a single instruction pointer.

I for one think Nvidia's approach of filling their processors with scalar operands from different pixels/vertices/threads etc instead of independent instructions from the same object is bloody fantastic and deserves to be highlighted in some way. Of course some would argue that it isn't true scalar and there's still some dual-issue, instruction reordering and other wizardry to perform in the compiler but it's disingenuous to dismiss the elegance of the whole setup.

but all they are really doing is batch processing of several pixels/vertexes using the same instruction stream. nothing really scalar about that.
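A toy Python illustration of that point, purely as a sketch and not a description of any real GPU: a batch of pixels is pushed through one instruction stream with one program counter. The opcode names and `run_batch` are invented for the example.

```python
# Toy illustration: one program counter drives every lane (SIMD-style batching).
def run_batch(program, lanes):
    """Apply each instruction to all lanes in lockstep.

    `program` is a list of (opcode, operand); `lanes` holds one value per pixel.
    There is a single instruction pointer `pc` for the whole batch.
    """
    ops = {
        "mul": lambda v, k: v * k,
        "add": lambda v, k: v + k,
    }
    values = list(lanes)
    pc = 0  # the only instruction pointer
    while pc < len(program):
        opcode, operand = program[pc]
        values = [ops[opcode](v, operand) for v in values]
        pc += 1
    return values


program = [("mul", 2), ("add", 1)]  # per-pixel scalar work: v * 2 + 1
print(run_batch(program, [0, 1, 2, 3]))  # [1, 3, 5, 7]
```

Each lane only ever sees scalar operands, yet the batch as a whole is still one instruction stream; a MIMD design would need a separate `pc` per lane.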

aaronspink · Mar 15, 2008

Dave Baumann said:
I'm wondering where the thoughts of working on different threads or objects in the same SIMD / shader grouping comes from.

from marketing of course, like I said, I'm surprised that the graphics vendors haven't touted clock buffers as a feature and given them some nonsense name!

Aaron Spink
speaking for myself inc.

aaronspink · Mar 15, 2008

MfA said:
The problem with calling what GPUs do SIMD is that people will tend to think it works in the same way as desktop processors (which have no MD input parameters for load & store).

It's probably still more accurate than the impression given by calling them stream processors. Which, btw, is already a name taken in the history of computers and computer architecture, and the very concept of a processor REQUIRES an independent instruction pointer!

Having scatter/gather and parametrized vector loads and stores isn't original either, see the late 60s and 70s!

Aaron Spink
speaking for myself inc.
