AMD: R9xx Speculation

EduardoS · Apr 3, 2010

MfA said:
Also there are multipliers in series ... the current dependent ops only have shared addition.

Cypress can do multiplies in series too, as two pairs; they match precisely the patent Jawed pointed to.

But in the paper there is a fifth multiplier, I mean, a squarer. Could a squarer be much simpler and faster than a multiplier?

Dave Baumann · Apr 3, 2010

rpg.314 said:
AMD's CPUs have been on SOI for the last ~7+ years. Is that expertise transferable to bulk, even partly?

Oh, believe me, the two groups coming together is having lots of benefits.

rpg.314 · Apr 3, 2010

Dave Baumann said:
Oh, believe me, the two groups coming together is having lots of benefits.

Well, good for you.

MfA · Apr 3, 2010

EduardoS said:
Cypress can do multiplies in series too

Which instruction would that be?

Jawed · Apr 3, 2010

[I notice you ninja-edited your posting.]

Yes, I know that multiple LUTs are required, one per estimated function. 5x something tiny is still something small.

Mintmaster said:
You need to add the data paths to feed the LUT data into the ALUs, and also need to use the high bits of 'x' to index into the LUT. That's probably an extra pipeline stage.

T already does this and then follows that with 5 multiplications and a 4-input adder. All in one logical clock cycle (8 physical cycles).

With a LUT private to each lane, you need to distribute the 5 high bits of the mantissa to address each distinct LUT. This is trivial as each lane requires the mantissa anyway, i.e. the operand is duplicated to all the lanes.

Getting the data from the LUT to the multiplier is a non-issue, the LUT's right there in the single pipeline it's providing its data to.
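To make the LUT-plus-multiplier idea concrete, here is a minimal Python sketch of a table-driven reciprocal: the top 5 mantissa bits pick one of 32 segments, and a squarer plus two multiplies evaluate a quadratic in the remaining low bits. This is not the T unit's actual datapath (the thread describes 5 multiplications and a 4-input adder). The names `rcp_approx` and `RCP_TABLE` are made up for illustration, and the coefficients are plain Taylor terms, an assumption rather than whatever the hardware tables hold.

```python
import math

INDEX_BITS = 5                  # high mantissa bits used to address the LUT
SEGMENTS = 1 << INDEX_BITS      # 32 table entries

def build_rcp_table():
    """One (c0, c1, c2) triple per segment: Taylor terms of 1/x at the segment start."""
    table = []
    for i in range(SEGMENTS):
        x0 = 1.0 + i / SEGMENTS
        table.append((1.0 / x0, -1.0 / x0 ** 2, 1.0 / x0 ** 3))
    return table

RCP_TABLE = build_rcp_table()   # hypothetical table, not real hardware contents

def rcp_approx(x):
    """Approximate 1/x for positive finite x with a LUT and a quadratic."""
    m, e = math.frexp(x)        # x = m * 2**e with m in [0.5, 1)
    m, e = m * 2.0, e - 1       # renormalise so m is in [1, 2)
    frac = m - 1.0
    idx = int(frac * SEGMENTS)  # the 5 high mantissa bits
    dx = frac - idx / SEGMENTS  # the remaining low bits
    c0, c1, c2 = RCP_TABLE[idx]
    dx2 = dx * dx               # the "squarer"
    r = c0 + c1 * dx + c2 * dx2 # two multiplies feeding a 3-input add
    return math.ldexp(r, -e)    # exponent handling: 1/(m * 2**e) = (1/m) * 2**-e

if __name__ == "__main__":
    for x in (1.0, 3.0, 7.3, 1000.0):
        print(x, rcp_approx(x), 1.0 / x)
```

With 32 segments the relative error of this quadratic stays below (1/32)^3, roughly 3e-5, so a real design would need better coefficients, a cubic term, or both to reach full single precision.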

There are just too many loose ends to do what you're proposing, and in the end you won't gain anything.

Let's say that T consumes 24% of the current ALU transistor count, leaving the other 4 lanes as 19% each. That's about as small a difference as I can make it, while keeping nice round numbers. Because T also does int32 MUL and ftoi/itof, the overhead for T in comparison with the other lanes is likely to be notable (it's not called the Rys unit for nothing).

By deleting T you save:

~20% of the ALU transistor count (assuming that transplanting T to XYZW has an overhead)
three operand fetch paths from the 12-operand collector into T
eight 32-bit registers in the PS intra-pipeline circular buffer (forwarding-path delay line)
the register store path

And then, just for our amusement, earlier in the thread:

http://forum.beyond3d.com/showpost.php?p=1414576&postcount=414

"less stream processors than Cypress"

That wouldn't happen to be 1280 would it?

~20% area saving for the ALUs is not to be sniffed at.
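The arithmetic behind the 1280 guess and the ~20% figure, as a small Python sketch. The 24%/19% split is my illustrative guess from above, not measured data; the 4-point transplant overhead is simply whatever turns 24% into ~20%, and the helper names are made up. Cypress's 1600 stream processors is the public spec.

```python
CYPRESS_STREAM_PROCESSORS = 1600   # public Cypress figure: 320 units x 5 lanes
LANES_PER_UNIT = 5                 # X, Y, Z, W and T

# Illustrative split of ALU transistor count, in percent (a guess, not data)
T_SHARE = 24
XYZW_SHARE_EACH = 19

def lanes_without_t(total_sps, lanes=LANES_PER_UNIT):
    """Stream processor count if every unit loses its T lane."""
    units = total_sps // lanes
    return units * (lanes - 1)

def net_alu_saving(t_share, transplant_overhead):
    """Percent of ALU transistors saved after moving T's functions into XYZW."""
    return t_share - transplant_overhead

if __name__ == "__main__":
    print(lanes_without_t(CYPRESS_STREAM_PROCESSORS))            # 1280
    print(net_alu_saving(T_SHARE, transplant_overhead=4))        # 20
```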

With ALU utilisation typically peaking at 75-80%, deleting this unit isn't generally going to cost any performance. Rightmark 4.0's Mineral and Fire shaders will get slower, admittedly. prunedtree's SGEMM will also slow down a smidgen.

It wasn't until Evergreen that DOT3 was implemented without killing a lane, or that two DOT2s could be done in parallel. Both of these look "obvious" in retrospect (I've been moaning about the waste of DOT3 for years). Deleting T has an obvious look about it, too, though overall I suspect it could be too gnarly; I just can't say specifically why.

Jawed

hoom · Apr 3, 2010

There's also Istanbul.

45nm.
On 45nm there are plenty of manufacturers for quite a while (Intel, AMD, IBM Power7, Cell).

On smaller than 45nm as far as I know there are only TSMC, Intel & flash manufacturers, maybe some RAM, ARM & FPGA now but a bunch of that is gonna be on TSMC 40nm & is way smaller or simpler than RV740.

Jawed · Apr 3, 2010

EduardoS said:
But in the paper there is a fifth multiplier, I mean, a squarer. Could a squarer be much simpler and faster than a multiplier?

http://www.ac.usc.es/arith19/sites/default/files/3670a039-session2-paper1.pdf

The squarer in the patent document I linked is 16 bit.

Clearly, ATI already has a "fast squarer" in order to feed 2 successive multipliers (one of which is followed by another multiplier, making that 3 multiplications in sequence).

So the longest sequence of operations is 3 multiplies and an add all within 8 physical clocks. Preceded by the float-to-fixed converter.

Jawed

MfA · Apr 3, 2010

Unless you run staggered you can't really count the 8 clocks as your space to do it in, because other instructions are coming in and wanting their pipeline slots on the multiplier too.

Jawed · Apr 3, 2010

The islands Kauai and Cozumel are at latitude ~22/20 north, while Ibiza is at latitude ~39 north.

So, ahem, perhaps there's another island at latitude ~22 north we're looking for, to make up a set of 3 southern islands. And then there'd be another 2 islands at ~39 north to make up a set of 3 northern islands.

(Who says the sets are 3 in size?...)

Jawed

Jawed · Apr 3, 2010

MfA said:
Unless you run staggered you can't really count the 8 clocks as your space to do it in, because other instructions are coming in and wanting their pipeline slots on the multiplier too.

How is this different from running 8 RCPs on successive clocks through the T unit?

Jawed

EduardoS · Apr 3, 2010

MfA said:
Which instruction would that be?

MUL_PREV and MUL_IEEE_PREV.

MfA · Apr 3, 2010

Its resources aren't shared.

For the 4-wide solution, let's say the input for the second multiplication for RCP is ready on the 5th clock and the first runs from 1-4 ... we have a couple of RCPs followed by a simple MAD; what can the pipeline do for the pipe which was running the second multiplier? Bugger all for 4 cycles, since it still has incoming data from the primary multiplier for those cycles ...

Given what EduardoS says though, it seems that's indeed pretty much what they are already doing anyway.

So I'm guessing that the multiply in y indeed starts a few cycles later all the time, and the one in x a few cycles later still ... so staggered, sorta.

Jawed · Apr 3, 2010

Finally, of course, there's no absolute reason why a transcendental has to be done in a single cycle. e.g. as two cycles, with the squarer in the first cycle. Perhaps that's what ASIC_ALU_REORDER is about. Older chips have XYZW instructions issuing alongside transcendentals, etc., but this design requires the compiler to choose whether to issue those instructions before or after the transcendental.

The first-cycle's squarer leaves 3 lanes for other instructions. Since 4 instructions can be issued alongside T in Evergreen and older, that would leave a single possible instruction to be issued after the completion of the transcendental.

The compiler needs to know this, as it affects how an operand that is both source and destination of a single XYZW instruction is handled.

It also has implications on the PV "register", as a two-cycle transcendental instruction would invalidate PV in this design, but a single-cycle version of the instruction has no impact on PV in older GPUs.

Jawed

Jawed · Apr 3, 2010

MfA said:
So I'm guessing that the multiply in y indeed starts a few cycles later all the time, and the one in x a few cycles later still ... it's kinda running the pipelines staggered, but not very elegant. Without the support for 2 dependent ops they could probably get the entire pipeline latency down to 4 cycles and halve the workgroup size :/

I believe that texturing data to/from the TMUs runs staggered. I can't remember where I saw this, but the whole architecture appears to have a nice 4-cycle stagger across it.

Jawed

Triskaine · Apr 3, 2010

hoom said:
On 45nm there are plenty of manufacturers for quite a while (Intel, AMD, IBM Power7, Cell).

On smaller than 45nm as far as I know there are only TSMC, Intel & flash manufacturers, maybe some RAM, ARM & FPGA now but a bunch of that is gonna be on TSMC 40nm & is way smaller or simpler than RV740.

TSMC's 40nm node was called "45nm" while it was developed and renamed to 40nm in 2007 for marketing reasons. Also, that 40nm moniker says diddly-squat about the actual feature size or transistor performance. It is entirely legitimate to compare TSMC's 40nm to AMD's or Intel's 45nm node, because there is nothing in it that makes it superior or gives it a special advantage.

Mintmaster · Apr 3, 2010

Jawed said:
T already does this and then follows that with 5 multiplications and a 4-input adder. All in one logical clock cycle (8 physical cycles).

But those are structured differently from the multiplies in the ALU units. As you mentioned to me, you can do two multiplies in series in the other ALUs, so the first multiplication starts very soon. In the T units, you have to get the LUT value and do a partial square and cube before starting the multiply. The multiplies by the LUT coefficients don't need full IEEE precision and don't need another multiply after them (so they sit further down the pipeline than in the other ALUs). Note how the other ALUs don't let you do an add after two serial muls, so that's not a viable path to accommodate the square/cube.

The structure is totally different. Look at Fig. 7 in the patent you linked to.

Jawed said:
Getting the data from the LUT to the multiplier is a non-issue, the LUT's right there in the single pipeline it's providing its data to.

It still needs a cycle. The squaring and cubing probably need two.

Jawed said:
With ALU utilisation typically peaking at 75-80%, deleting this unit isn't generally going to cost any performance.

What makes you think the T-unit idling is responsible for the last 20-25%? Or that the other ALUs are idling while the T-unit is working? It absolutely will cost performance.

Jawed said:
~20% area saving for the ALUs is not to be sniffed at.

Your figure is arrived at with fairytale accounting. Much of what is eliminated is simply transplanted, like the LUTs, squarer, cuber, and exponent processing (exp, log, div). You're going to increase the longest path, thus requiring slower clock speeds, to squeeze all this into the framework of the main ALUs. I don't even think it's possible at all. Maybe it could be done with the modified pipeline I suggested earlier, as that is more accommodating to dependent math, but as it is, a separate T just makes sense.

When you take into account these costs instead of just looking at the T-unit savings, and also look at the total size of the SIMD engine, you'll only save a few percent from the cost of the latter. That's unlikely to be worth the loss of throughput.

Jawed said:
Finally, of course, there's no absolute reason why a transcendental has to be done in a single cycle. e.g. as two cycles, with the squarer in the first cycle.

Well now you're reducing throughput even more. With the current architecture, in two cycles you can do 1 or 2 transcendentals and 9 or 8 regular ops. With your modification, you can only do one transcendental (and maybe one regular op alongside the square/cube cycle).

Jawed said:
I can't remember where I saw this, but the whole architecture appears to have a nice 4-cycle stagger across it.

Whole SIMD staggering is actually isomorphic to the quasi-scalar architecture I was proposing. It has the same requirement of needing more active batches. Four cycles of stagger between x and y, y and z, and z and w, for example, would have 20 total cycles of latency before proceeding to the next instruction group, so you'd need 5 active wavefronts instead of 2.

I'm pretty sure that ATI isn't doing this, though. The restrictions on dependent math are a big hint, IMO.
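For anyone who wants to check the 20-cycle / 5-wavefront numbers, here is a rough Python sketch. It assumes an 8-cycle base ALU latency (the "8 physical cycles" mentioned earlier in the thread) and that each wavefront occupies the SIMD for 4 issue cycles, which is what makes 2 wavefronts enough today. The function names are made up for illustration.

```python
def staggered_latency(base_latency, stagger, lanes):
    """Cycles until the last lane finishes when each lane starts `stagger` cycles after the previous one."""
    return base_latency + stagger * (lanes - 1)

def wavefronts_needed(latency, issue_cycles):
    """Wavefronts that must be in flight to cover `latency` (ceiling division)."""
    return -(-latency // issue_cycles)

BASE_LATENCY = 8   # assumed: physical cycles per ALU instruction
ISSUE_CYCLES = 4   # assumed: cycles one wavefront occupies the SIMD

if __name__ == "__main__":
    today = wavefronts_needed(BASE_LATENCY, ISSUE_CYCLES)
    staggered = staggered_latency(BASE_LATENCY, stagger=4, lanes=4)  # x, y, z, w
    print(today)                                       # 2
    print(staggered)                                   # 20
    print(wavefronts_needed(staggered, ISSUE_CYCLES))  # 5
```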

neliz · Apr 3, 2010

Holy sh*t! Southern Islands has been hiding in plain sight for over half a year!
http://www.hwinfo.com/

LEXINGTON XGL/XT, BROADWAY XGL/PRO/LP/XT, MADISON XT/PRO/LP/XT GL/PRO/LP GL, PARK PRO/XT/LP/PRO/XT GL/LP GL.

We know what the later parts became, we never knew what the first part was.

Alexko · Apr 3, 2010

Uh? Lexington is an avenue in Manhattan. It's probably just a mobile Cypress that AMD is keeping to itself until there's any sort of competition from NVIDIA.

Babel-17 · Apr 3, 2010

Well, southern part of Manhattan Island, if that's any consolation.

MfA · Apr 3, 2010

Mintmaster said:
Whole SIMD staggering is actually isomorphic to the quasi-scalar architecture I was proposing. It has the same requirement of needing more active batches. Four cycles of stagger between x and y, y and z, and z and w, for example, would have 20 total cycles of latency before proceeding to the next instruction group, so you'd need 5 active wavefronts instead of 2.

I'm pretty sure that ATI isn't doing this, though. The restrictions on dependent math are a big hint, IMO.

Not by that much, but still, they almost certainly have an increasing number of empty (delay only) pipeline stages on the front end of the z/y/x channels.
