AMD: R9xx Speculation

As you need some lookup tables for an SFU (transcendentals, div, sqrt and so on; most use some kind of LUT), I really doubt it. And as I said, from what I've read there is still a t unit.
As you can see from figure 7:

http://v3.espacenet.com/publication...T=D&date=20050331&CC=US&NR=2005071401A1&KC=A1

4 tables are needed.

Years ago I speculated about deleting the unit completely and using just a sequence of MADs and ADDs to produce the result.

We can already see that ATI is able to do dependent instructions within a single cycle across pairs of lanes in Cypress, so transcendentals would be an extension of that. In fact it looks like an extremely close match for Evergreen's current capability.
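To make that concrete, here's a minimal C sketch of the MAD-only approach (purely illustrative - the seed table size, its contents and the iteration count are my own assumptions, not ATI's implementation): a small LUT supplies a starting estimate, then a couple of Newton-Raphson steps, each just a multiply plus a multiply-add, refine it.

[code]
#include <math.h>
#include <stdio.h>

/* Illustration only: reciprocal computed with a tiny seed LUT plus
 * Newton-Raphson steps, each step being just a MUL and a MAD -- the kind
 * of dependent-MAD sequence a design without a T unit would have to issue.
 * Handles positive, normal inputs only; table size/contents are made up. */
static float recip_seed(float x)
{
    /* In hardware the top mantissa bits would index a ROM; here a crude
     * 3-bit table of reciprocals at the midpoints of [1,2) sub-ranges. */
    static const float seed[8] = {
        1.0f/1.0625f, 1.0f/1.1875f, 1.0f/1.3125f, 1.0f/1.4375f,
        1.0f/1.5625f, 1.0f/1.6875f, 1.0f/1.8125f, 1.0f/1.9375f
    };
    int e;
    float m = frexpf(x, &e);   /* x = m * 2^e, m in [0.5, 1) */
    m *= 2.0f; e -= 1;         /* normalise mantissa to [1, 2) */
    return ldexpf(seed[(int)((m - 1.0f) * 8.0f)], -e);
}

static float recip_mad(float x)
{
    float r = recip_seed(x);
    /* r = r * (2 - x*r): one MAD (2 - x*r) followed by one MUL, i.e. two
     * dependent ALU ops per iteration; two iterations give ~17 bits here. */
    for (int i = 0; i < 2; ++i)
        r = r * (2.0f - x * r);
    return r;
}

int main(void)
{
    printf("1/3 ~= %.7f (exact %.7f)\n", recip_mad(3.0f), 1.0f / 3.0f);
    return 0;
}
[/code]

Getting to full single precision would need a bigger seed table or a third iteration - which is where the concern about the latency of dependent iterations comes in.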

Btw, the possibility of doing a 32-bit integer multiply with the combined xyzw ALUs was already mentioned in an Evergreen presentation, but documentation for it is still missing (as is the mul_int24 instruction in IL).
I haven't come across the 32-bit integer mul thing you're referring to, or didn't realise it was staring back at me :LOL:

Jawed
 

Aren't the conditionals done in the t unit too? How would you do them if you got rid of it?
 
Can ATI do a dumb shrink of RV870? Would it be easier or more difficult than Fermi? Why?

Honestly I don't think a dumb shrink from 40nm to 28nm is even possible; both AMD and NV will have to redesign their chips. NV will have a tougher time IMO, as they usually do on a cutting-edge process (hence their reluctance to use one), plus they have far more kinks to work out than AMD (unless the new tech in SI turns out to be a challenge).
 
Ditching the T-unit would mean breaking the VLIW width -- how would that reflect on the compiler base and all the optimizations gained through the years?
 
I haven't come across the 32-bit integer mul thing you're referring to, or didn't realise it was staring back at me :LOL:
Have a look at page 10 of the SigGraph09 presentation.
Or to save you the effort of downloading,
Evergreen_VLIW.png
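For what it's worth, the usual way a wide multiply is composed from narrower ones looks like the sketch below (plain C, 16x16 partial products). How Evergreen actually gangs its xyzw ALUs together is exactly the part that's undocumented, so this is just the textbook decomposition, not a claim about the hardware:

[code]
#include <stdint.h>
#include <stdio.h>

/* 32x32 -> 64-bit multiply built from four 16x16 -> 32-bit partial
 * products, the way several narrow lane multipliers can be combined.
 * Illustrative only -- not a description of Evergreen's actual datapath. */
static uint64_t mul32_wide(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint64_t lo  = (uint64_t)al * bl;                      /* bits  0..31 */
    uint64_t mid = (uint64_t)al * bh + (uint64_t)ah * bl;  /* bits 16..48 */
    uint64_t hi  = (uint64_t)ah * bh;                      /* bits 32..63 */

    return lo + (mid << 16) + (hi << 32);
}

int main(void)
{
    uint32_t a = 0xDEADBEEFu, b = 0x12345678u;
    printf("%llx\n", (unsigned long long)mul32_wide(a, b));
    printf("%llx\n", (unsigned long long)((uint64_t)a * b));  /* reference */
    return 0;
}
[/code]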
 
It'd be a pretty serious change for the compiler to drop T. Fewer units are simpler, though. It would also cut down the operand dependency analysis, which currently has to feed up to 15 operands when only 12 can be fetched.

I've got that presentation, just didn't pay much attention to the details there :oops:

Jawed
 
We can see, already, that ATI is able to do dependent instructions
They're hardly all created equal ... a single-precision FP iteration of Newton-Raphson or whatever is long latency. You would have to run the pipelines staggered, with each pipeline running instructions delayed by one relative to the previous one.
 
They're hardly all created equal ... a single-precision FP iteration of Newton-Raphson or whatever is long latency. You would have to run the pipelines staggered, with each pipeline running instructions delayed by one relative to the previous one.
Curious: did you look at the patent?

It's evaluating a 3rd-order Lagrange polynomial using 5 multiplies (a squarer, which isn't really a multiply of course, plus 4 multiplies) and a 4-input adder.

It is more effort than Cypress: the squarer and the extra input for the adder + the more serpentine inter-lane stuff. I presume some intrinsic precision can be sacrificed in the final adder since the 3 terms require 20, 16 and 12 bit precision.

Anyway, I'm always over-optimistic about these things, so I expect to be disappointed.

Jawed
 
Maybe one has a different viewpoint if one sits in Canada ;)

:LOL:

Southern Islands codenames have nothing to do with them being somewhere "in the south" though.

Why is that? Southern Islands refers to Pokémon - gotta collect 'em all? Or something to do with Singapore? What's the connection of the names of islands in the northern hemisphere to a group codename of Southern Islands? Misdirection? Are SI and NI codenames for the same thing, like Eyefinity had multiple codenames, to see where the leaks were coming from?
 
Jawed, it's just not worth it. If you were going to do this, you'd still need the LUTs and you'd have to have special data paths to feed the regular ALUs the appropriate values. You'll only save a few multiplier/adder transistors, most of them being fixed point, and that doesn't really buy you anything.
 
The LUTs are "tiny" and are required in whatever implementation of this algorithm you choose.

[0039] The present invention overcomes the limitations of prior art techniques by providing a relatively inexpensive implementation of LaGrange polynomials while simultaneously providing substantially similar or better accuracy than these prior art techniques. For example, the implementation of the third-order LaGrange polynomial illustrated in FIG. 7 is capable of providing up to 23 bits of precision for reciprocal, square root functions, exponential, sine, cosine and logarithmic functions within the valid ranges for these functions. In such an implementation, for each function implemented, the point table 710 comprises 32 point values of at least 24 bits of accuracy; the first slope table 704 comprises 32 constant values of at least 20 bits of accuracy; the second slope table 706 comprises 32 constant values of at least 16 bits of accuracy; and the third slope table 708 comprises 32 constant values of at least 12 bits of accuracy. Likewise, the first multiplier 712 comprises a 20-bit multiplier; the second and third multipliers 714, 716 comprise 16-bit multipliers; the fourth and fifth multipliers 718, 720 comprise 12-bit multipliers; and the adder 722 comprises a 26-bit adder.
You'd also spread them around, one LUT per lane.
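To show the structure the patent text describes, here's a small C sketch of that datapath for 1/x on [1,2). The 32-segment split, the point table and the three slope tables mirror FIG. 7; the table values here are just Taylor coefficients at each segment start so the example is self-contained - the patent's actual constants are hand-tuned fixed-point values of 24/20/16/12 bits, which this floating-point toy doesn't reproduce.

[code]
#include <stdio.h>

#define SEGMENTS 32   /* the patent's tables each hold 32 entries */

/* One point table and three slope tables -- the four tables of FIG. 7.
 * Filled for 1/x over [1,2) with Taylor coefficients at each segment
 * start, as a stand-in for the patent's tuned fixed-point constants. */
static float point_tab[SEGMENTS], slope1_tab[SEGMENTS],
             slope2_tab[SEGMENTS], slope3_tab[SEGMENTS];

static void fill_tables(void)
{
    for (int i = 0; i < SEGMENTS; ++i) {
        float x0 = 1.0f + (float)i / SEGMENTS;   /* segment start in [1,2) */
        point_tab[i]  =  1.0f / x0;
        slope1_tab[i] = -1.0f / (x0 * x0);
        slope2_tab[i] =  1.0f / (x0 * x0 * x0);
        slope3_tab[i] = -1.0f / (x0 * x0 * x0 * x0);
    }
}

/* The FIG. 7 datapath in miniature: high bits of the argument index the
 * tables, low bits become dx, then one squarer, a few multipliers and a
 * single 4-input add produce p + s1*dx + s2*dx^2 + s3*dx^3. */
static float recip_poly(float m)                    /* m in [1,2) */
{
    int   idx = (int)((m - 1.0f) * SEGMENTS);       /* "high bits" of m */
    float dx  = m - (1.0f + (float)idx / SEGMENTS); /* "low bits" of m  */
    float dx2 = dx * dx;                            /* the squarer      */
    return point_tab[idx]
         + slope1_tab[idx] * dx
         + slope2_tab[idx] * dx2
         + slope3_tab[idx] * dx2 * dx;              /* 4-input adder    */
}

int main(void)
{
    fill_tables();
    printf("1/1.7 ~= %.7f (exact %.7f)\n", recip_poly(1.7f), 1.0f / 1.7f);
    return 0;
}
[/code]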

As for the lane timing and movement across lanes, well there's already a substantial amount of that in place with dependent MUL MUL and dot product and friends. Not to mention that the double-precision and sub-normal support in the ALU already means there's a beefy final adder in place.

But yeah, I'm pessimistic too...

Jawed
 
You need to add the data paths to feed the LUT data into the ALUs, and also need to use the high bits of 'x' to index into the LUT. That's probably an extra pipeline stage.

There are just too many loose ends to do what you're proposing, and in the end you won't gain anything.
 
TSMC is particularly bad at complex chips on new cutting edge processes.
Bad compared to who??? :runaway:

Right now Intel is the only other foundry with complex < 45nm chips on the market. (unless flash counts as complex?)
TSMC has had RV740 on the market for a year now & it's more than twice the transistor count of Clarkdale.
i7 980X is only just out & so expensive it might as well not be.
 
It is more effort than Cypress: the squarer and the extra input for the adder + the more serpentine inter-lane stuff.
Also there are multipliers in series ... the current dependent ops only have shared addition.
 
Why is that? Southern Islands refers to Pokémon - gotta collect 'em all? Or something to do with Singapore? What's the connection of the names of islands in the northern hemisphere to a group codename of Southern Islands? Misdirection? Are SI and NI codenames for the same thing, like Eyefinity had multiple codenames, to see where the leaks were coming from?

It has nothing to do with Singapore though.
 
Bad compared to who??? :runaway:

Right now Intel is the only other foundry with complex < 45nm chips on the market. (unless flash counts as complex?)
TSMC has had RV740 on the market for a year now & it's more than twice the transistor count of Clarkdale.
i7 980X is only just out & so expensive it might as well not be.

There's also Istanbul. It doesn't have as many transistors as GF100, but it's still pretty complex.
 
If ATI goes with 40nm for SI, then there is a real chance of NV risking 28nm and landing a KO. Even if they don't have a new process any more, they still have a new pipeline arch - Charlie seems to suggest it has already been done, though - and NV's GDDR5 MC should have been fixed by then - it's been more than a year now. Overall, there's a good chance the competitive situation will reverse itself.

However, considering TSMC's 40nm track record, 2Q11 seems the best bet for 28nm GPUs from both.

No offense, but what planet do you live on?

ATI has substantially more VLSI expertise and is almost always first to a new process - precisely because NV is very conservative with new process tech. There's no way that NV would make a crazy bet like this...

That's not to mention the fact that now that ATI and AMD are under the same roof, ATI's advantage in VLSI will accelerate even more. While AMD doesn't have the most compelling CPUs in the world, they certainly have considerable expertise in VLSI and work closely with Global Foundries. That experience is definitely going to prove helpful when moving to new process technologies.

DK
 
ATI has substantially more VLSI expertise and is almost always first to a new process - precisely because NV is very conservative with new process tech. There's no way that NV would make a crazy bet like this...
I am not arguing against any of that. Perhaps I should make my point more clearly.

All I am saying is

a) They decided to go for 40nm (whether by choice or due to circumstances) while it was clearly not mature enough. To me, it is not unbelievable that over the last ~2 years NV has been building up its VLSI experience to cancel out ATI's process expertise. No idea if that is enough time for it; maybe you can enlighten us.

b) This feeds from a). It is in NV's interest to work on a shrink of GF100 at 28nm. If it is delayed, well, they of course can't help it, but if it works, it could reverse a lot of momentum. For that upside, it *MAY* be worthwhile for them to make this bet. In view of ATI's choice of 40nm for this fall, I'd say the chances of this are low.

c) The downside, of course, is that it will compete for engineering resources with the Bx spin. No idea how well that is coming along, if at all.

d) If they aren't going to risk a 28nm shrink around that time, then the no-B1 rumours seem odd. That would suggest they are trying for, say, a 20-SM chip this fall. They do seem to have some space left in the reticle to fill. Alternatively, they could be trying to up their MC speeds for Fermi 2.

That's not to mention the fact that now that ATI and AMD are under the same roof, ATI's advantage in VLSI will accelerate even more. While AMD doesn't have the most compelling CPUs in the world, they certainly have considerable expertise in VLSI and work closely with Global Foundries. That experience is definitely going to prove helpful when moving to new process technologies.
AMD's CPUs have been on SOI for the last ~7 years or more. Is that expertise transferable to bulk, even partly?
 