AMD: R7xx Speculation

Jawed · Jun 20, 2008

Geo said:
So they're saying ATI made this decision roughly just after R580? That's interesting. That would be after the discussions with AMD began (Dec 2005), tho before they were consummated (July 2006). And possibly nearly simultaneous with when the decision was made to go forward with the AMD marriage.

Maybe we could rationalise it as "hmm, R580 is as big as we ever want to make a chip, we know R600 is looking way too big..." How suddenly does a company come to appreciate the economics of these big GPUs? I can't imagine that it actually was sudden, these GPUs overlapped too much.

Jawed

Mintmaster · Jun 20, 2008

3dilettante said:
When would this merging be determined? The contexts can be completely separate and completely independent. Currently, there doesn't appear to be any such merging going on.

I guess I'm not really sure what you're asking. I take it that you want to run multiple programs simultaneously so that you don't need as many threads running the same program to maintain utilization, right? You can pool the threads and make an uber-program to achieve the same thing.

RV770 has 10 SIMD arrays

Okay, I got it now. Had a little brainfart there...

3dilettante · Jun 20, 2008

Jawed said:
What I'm wary of is that making the TUs interconnect many:many with the L1s is just way more complexity than necessary. With a crossbar between L1 and L2 there's already a many:many.

It doesn't need to be a crossbar, but at the same time the memory request bus doesn't seem worth mentioning if all it did was link things 1:1.

I suppose LDS is for inter-element sharing and perhaps is blind to context.

Because R6xx issues clauses of ALU instructions (from 1 to 128 instructions) which are "atomic", when the sequencer performs a constant cache fetch the cache is set up for the entire duration of the clause.

It might need to be aware of at least 2 contexts, if the SIMD is still alternating threads every cycle.

In theory, since ATI virtualises the register file, it's possible to share all extant contexts'. Looking at the newer RV770 diagrams there's no "memory read/write cache" (which I presume was the mechanism for moving register file data to/from video memory) as there was in R6xx, so presumably that's what GDS is doing. But GDS looks rather isolated. Bit confusing.

I'm not sure which slides are wholly accurate. The ExtremeTech images omitted the memory read/write cache that sat between the hub and shader export, while PCinlife had it there.
PCinlife had a crossbar between the L1s and the TUs.
On the other hand, ExtremeTech's diagram for RV770 gave the chip 1600 ALUs.

Which one screwed up more?

3dilettante · Jun 21, 2008

Mintmaster said:
I guess I'm not really sure what you're asking. I take it that you want to run multiple programs simultaneously so that you don't need as many threads running the same program to maintain utilization, right? You can pool the threads and make an uber-program to achieve the same thing.

But that means you know everything that will be running ahead of time, and that everything running is something you've made. In the future that might not be the case.

What if the gamer is running folding@home or some future computation program at the same time that a game is running your graphics code and game physics.
It may be stupid, but still possible.
In the case of current physics code, it's running on CUDA, so your graphics code currently isn't even allowed to interact with it correctly.

If that trend of multiple separate programs takes off, no branching written ahead of time can capture it.

Jawed · Jun 21, 2008

CarstenS said:
Strange - until very recently i was (supposed to be...) under the impression, that a ring bus MC was to deliver a much more efficient memory architecture than an old-fashioned crossbar-controller.

Yep, the lack of a routing hotspot was supposed to be key.

Now we have a hub which is, if anything, a routing hotspot.

Though being able to put something else in the centre of the die, instead of some of the MC (as we see in R5xx) was an improvement in R600's ring bus, a facet of full distribution.

There's an interesting patent relating to making interconnects travel long distances:

Integrated Circuit Chip With Repeater Flops and Method for Automated Design of Same

Is that involved in connecting the hub to the MCs?

Jawed

CarstenS · Jun 21, 2008

3dilettante said:
If that's the case, RV770's serial performance would be double that of an 800 ALU design.
Wouldn't a shader that was only a string of dependent incrementing adds pull up a higher sum per pixel?

Something like this?

GPU Bench 1.2.1 for HD4850 said:
--------------------------------------------------------------------------
Instruction Issue
--------------------------------------------------------------------------
512 98.2528 ADD 4 64
512 98.2656 SUB 4 64
512 98.2708 MUL 4 64
512 98.2651 MAD 4 64
512 98.2654 EX2 4 64
512 98.2673 LG2 4 64
512 49.5622 POW 4 64
512 122.7652 FLR 4 64
512 122.7732 FRC 4 64
512 98.2657 RSQ 4 64
512 98.2700 RCP 4 64
512 39.4741 SIN 4 64
512 39.4738 COS 4 64
512 93.9361 SCS 4 64
512 99.8046 DP3 4 64
512 99.7913 DP4 4 64
512 78.8841 XPD 4 64
512 98.2601 CMP 4 64

--------------------------------------------------------------------------
Scalar vs Vector Instruction Issue
--------------------------------------------------------------------------
512 76.9105 ADD 1 40
512 76.8444 ADD 4 40
512 76.8567 SUB 1 40
512 76.8543 SUB 4 40
512 76.9362 MUL 1 40
512 76.9019 MUL 4 40
512 76.8410 MAD 1 40
512 76.8566 MAD 4 40

GPU Bench 1.2.1 for HD2900 XT said:
--------------------------------------------------------------------------
Instruction Issue
--------------------------------------------------------------------------
512 46.7418 ADD 4 64
512 46.7418 SUB 4 64
512 46.7434 MUL 4 64
512 46.7433 MAD 4 64
512 46.7433 EX2 4 64
512 46.7430 LG2 4 64
512 23.5639 POW 4 64
512 58.4136 FLR 4 64
512 58.4162 FRC 4 64
512 46.7431 RSQ 4 64
512 46.7439 RCP 4 64
512 23.3827 SIN 4 64
512 23.3827 COS 4 64
512 45.3481 SCS 4 64
512 47.4739 DP3 4 64
512 47.4726 DP4 4 64
512 37.5172 XPD 4 64
512 46.7423 CMP 4 64

--------------------------------------------------------------------------
Scalar vs Vector Instruction Issue
--------------------------------------------------------------------------
512 47.4468 ADD 1 40
512 46.2910 ADD 4 40
512 47.4481 SUB 1 40
512 46.2924 SUB 4 40
512 47.4470 MUL 1 40
512 46.2918 MUL 4 40
512 47.4463 MAD 1 40
512 46.2924 MAD 4 40
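
One rough way to read the two dumps side by side, as a sketch only: take the second column for a few ops and compare the boards both raw and per clock. The core clocks (625 MHz for HD4850, 742 MHz for HD2900 XT) and the ALU counts (800 and 320) are assumptions brought in from the reference specs, not something the dumps themselves state.

```python
# Second column of the "Instruction Issue" section for a few ops, copied from
# the two GPUBench dumps above (units as reported by GPUBench).
hd4850 = {"ADD": 98.2528, "MUL": 98.2708, "MAD": 98.2651, "POW": 49.5622, "SIN": 39.4741}
hd2900xt = {"ADD": 46.7418, "MUL": 46.7434, "MAD": 46.7433, "POW": 23.5639, "SIN": 23.3827}

# ASSUMPTIONS (not in the dumps): reference core clocks in MHz and ALU counts.
CLOCK_MHZ = {"hd4850": 625.0, "hd2900xt": 742.0}
ALUS = {"hd4850": 800, "hd2900xt": 320}


def raw_ratio(op):
    """HD4850 result divided by HD2900 XT result for one op."""
    return hd4850[op] / hd2900xt[op]


def per_clock_ratio(op):
    """Same ratio after dividing each result by its board's assumed clock."""
    return (hd4850[op] / CLOCK_MHZ["hd4850"]) / (hd2900xt[op] / CLOCK_MHZ["hd2900xt"])


if __name__ == "__main__":
    for op in hd4850:
        print(f"{op}: raw x{raw_ratio(op):.2f}, per clock x{per_clock_ratio(op):.2f}")
    print(f"ALU count ratio: x{ALUS['hd4850'] / ALUS['hd2900xt']:.2f}")
```

With those assumed clocks the plain arithmetic ops land close to 2.5x per clock, which is what the 800 vs 320 ALU counts would suggest, while SIN/COS come out nearer 2x.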

Mintmaster · Jun 21, 2008

3dilettante said:
If that trend of multiple separate programs takes off, no branching written ahead of time can capture it.

So you envision a scenario, say next gen, where running 20 separate programs simultaneously isn't enough, and you need 60?

Jawed · Jun 21, 2008

Mintmaster said:
Now that we have better die shots, I found that the ALUs on RV770 occupy either 28.0% or 25.3% of the die, depending on whether that similar looking sliver next to the left of the 4x10 array is redundancy or not. Feel free to update your numbers.

I'll have another go at the numbers over the weekend.

This is my current theory on the layout of a SIMD:

In my opinion, redundancy (1 in 17) needs to be localised, because the mechanism for redundancy is a "bit-shift" to channel operands around the dud lane. So each ALU is self-contained in terms of redundancy.
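
To make the "shift around the dud lane" idea concrete, here is a toy model, not a description of the actual hardware: 16 logical lanes are mapped onto 17 physical lanes, and every logical lane at or above the defective one is shifted up by one. The names `lane_map` and `route` are made up for this illustration.

```python
PHYSICAL_LANES = 17  # 16 in use plus 1 spare, per the "1 in 17" figure
LOGICAL_LANES = 16


def lane_map(dud=None):
    """Return the physical lane index used by each logical lane.

    Hypothetical model: lanes below the dud keep their index, lanes at or
    above it shift up by one so the dud is never used.
    """
    if dud is None:
        return list(range(LOGICAL_LANES))  # spare lane (index 16) stays idle
    if not 0 <= dud < PHYSICAL_LANES:
        raise ValueError("dud must be a physical lane index in 0..16")
    return [i if i < dud else i + 1 for i in range(LOGICAL_LANES)]


def route(operands, dud=None):
    """Place 16 logical operands on 17 physical lanes; the unused lane holds None."""
    physical = [None] * PHYSICAL_LANES
    for logical, phys in enumerate(lane_map(dud)):
        physical[phys] = operands[logical]
    return physical


if __name__ == "__main__":
    print(lane_map(dud=3))
    print(route(list("abcdefghijklmnop"), dud=3))
```

The point of the sketch is only that the remapping is local to one 17-lane block: nothing outside the block needs to know which lane was skipped.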

Each of the MAD-only ALUs has register file, the shiny stuff I reckon. The T ALU doesn't have any register file, but it does have look-up tables. The Sequencer, obviously, has per hardware thread status and also cache (LDS too, I guess).

Of course it would be funny if I've read the colours wrong and the dark bits are memory and the bright bits are logic...

Jawed

randomhack · Jun 21, 2008

unleashonetera.com site seems a little updated now.
curiously i can only see the updated site in opera browser.

AlphaWolf · Jun 21, 2008

randomhack said:
unleashonetera.com site seems a little updated now.
curiously i can only see the updated site in opera browser.

Works fine in FF3 for me. I guess the site is targeted at europe as all the buy links are european.

randomhack · Jun 21, 2008

Btw what is Advanced Video Transcode? On one of the card descriptions I see it can accelerate video transcoding?
edit : that should read Accelerated Video Transcode
edit : i see i might be living a couple of years in the past.

WaltC · Jun 21, 2008

dizietsma said:
Incontrovertibly right. The first Opterons were released back in 2003 or so and 5 years later the vast majority of desktops still use 32 bits. As for milking and innovators, AMD milked the K8 for far far too long.

I think I mentioned very clearly that I thought AMD had milked the K8 for too long, so we agree on that. As far as "64-bits on the desktop" goes, there's no doubt that Intel is now firmly a "64-bits on the desktop" believer...

Without x86-64, I think Core 2 would have been a dud. The point was that Intel was wrong about "nobody needing it" or wanting it, for that matter, and the matter seems closed for debate, imo. I can't imagine why you might think it wouldn't be.

Jawed · Jun 21, 2008

Arun said:
then they're really occupying 40% of the die while they occupy 26% of the die on GT200... 0.4*260=104 & 0.26*595=155 - now, if you shrink that by 19%, it becomes 125. At least that's my estimate, if anyone has a better one please let me know.

http://forum.beyond3d.com/showpost.php?p=1178574&postcount=4052

I said 40.8% of the die for RV770's SIMDs.

If RV670's SIMDs were the same size, then 4 of these SIMDs would amount to 42mm2, which is 22% of RV670. Except, of course that RV770 SIMDs are smaller (by what percentage?) and they also have extra memory inside, for LDS.
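
The arithmetic behind the 42mm2 and 22% figures can be reproduced as below. This is a sketch: the 260mm2 RV770 die size is the number used in Arun's quote, and the 192mm2 RV670 die size is an assumption (the commonly quoted figure), not something stated in this post.

```python
RV770_DIE_MM2 = 260.0     # figure used in Arun's quote above
RV770_SIMD_SHARE = 0.408  # "40.8% of the die for RV770's SIMDs"
RV770_SIMDS = 10
RV670_SIMDS = 4
RV670_DIE_MM2 = 192.0     # ASSUMPTION: commonly quoted RV670 die size

# Area of all ten RV770 SIMDs, then of a single SIMD.
rv770_simd_area = RV770_DIE_MM2 * RV770_SIMD_SHARE
area_per_simd = rv770_simd_area / RV770_SIMDS

# Four SIMDs of that size, and their share of the assumed RV670 die.
rv670_simd_area = area_per_simd * RV670_SIMDS
rv670_share = rv670_simd_area / RV670_DIE_MM2

if __name__ == "__main__":
    print(f"RV770 SIMD area: {rv770_simd_area:.1f} mm2")
    print(f"4 SIMDs at that size: {rv670_simd_area:.1f} mm2")
    print(f"Share of RV670: {rv670_share:.1%}")
```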

Anyway, it seems likely to me that RV670's SIMDs occupy less than 30% of the die. All the indications are that AMD has chopped out a lot of TU functionality. I think this, along with the extra Z capability per RBE (and the deleted fog unit?), really mucks up scaling comparisons.

Then there's the hub instead of ring...

Jawed

Jawed · Jun 21, 2008

3dilettante said:
The future might require that the GPUs start pulling shaders from different contexts, or the trend for running non-graphics code concurrently with the graphics code may take off.

As far as I can tell CAL and CUDA both support more than one kernel running simultaneously.

Both architectures support F@H while doing 3D graphics (e.g. Vista Aero).

We're looking at the following types of kernel in D3D11 I reckon:

Control Point
Vertex
Geometry
Pixel
General Computation

Jawed

mczak · Jun 21, 2008

Jawed said:
This is my current theory on the layout of a SIMD:

Interesting. Scaling the simd width would require the alu blocks to get redesigned (ok if that's fully automated not a problem) though with this organization. It definitely would make sense wrt redundancy though.
I don't quite understand however the split of T-MAD and T-Trans unit, that doesn't look right. Also, T-Trans doesn't look quite regular enough to me for something 16-wide.

nAo · Jun 21, 2008

Jawed said:
As far as I can tell CAL and CUDA both support more than one kernel running simultaneously.

Both architectures support F@H while doing 3D graphics (e.g. Vista Aero).

We're looking at the following types of kernel in D3D11 I reckon:

Control Point
Vertex
Geometry
Pixel
General Computation

Jawed

Tessellation adds 2 new (different) programmable stages in DX11.

Jawed · Jun 21, 2008

Mintmaster said:
I thought it's 32 if you want everything going at full speed, as instruction changes happen every 4 clocks from what I remember.

No, a hardware thread in G80 contains 16 elements, with instructions lasting 2 clocks. That's how vertex shaders are executed.

Isn't it 64 for similar reasons as above?

RV610/20 are 16.

GT200, I think, has 32-element hardware threads. I have to admit my first quick dose of the CUDA 2.0 documentation, where it describes coalescing and memory pages, sent me packing, but I think somewhere in there (or the operand fetch waterfalling discussion) should be the true hardware thread size...
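
The sizes being traded back and forth here all follow from one product: SIMD lanes times the number of clocks an instruction is held. A minimal sketch, where the lane counts (8 for a G80 multiprocessor, 16 for an R600 SIMD) are assumptions supplied for illustration and `hardware_thread_size` is a made-up helper:

```python
def hardware_thread_size(simd_lanes, clocks_per_instruction):
    """Elements in one hardware thread: lanes times clocks per instruction."""
    return simd_lanes * clocks_per_instruction


# ASSUMPTIONS: lane counts are not stated in the posts above.
G80_LANES = 8
R600_LANES = 16

if __name__ == "__main__":
    # G80 with instructions lasting 2 clocks (the vertex shader case).
    print("G80, 2 clocks:", hardware_thread_size(G80_LANES, 2))
    # G80 if instruction changes happen every 4 clocks.
    print("G80, 4 clocks:", hardware_thread_size(G80_LANES, 4))
    # R600 with instructions lasting 4 clocks.
    print("R600, 4 clocks:", hardware_thread_size(R600_LANES, 4))
```

Under those assumptions the 16 vs 32 disagreement for G80 is just a disagreement about whether an instruction is held for 2 or 4 clocks.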

Jawed

Jawed · Jun 21, 2008

3dilettante said:
It doesn't need to be a crossbar, but at the same time the memory request bus doesn't seem worth mentioning if all it did was link things 1:1.

How about this, each TU fetches data from:

anywhere in GDS
anywhere in vertex cache
its private L1

So it's one:many for GDS and vertex cache and one:one for texels.

It might need to be aware of at least 2 contexts, if the SIMD is still alternating threads every cycle.

Yeah, you're right. I was just trying to isolate a hardware thread from all the others running on the SIMD, in which case the pairing is invisible.

I'm not sure which slides are wholly accurate. The ExtremeTech images omitted the memory read/write cache that sat between the hub and shader export, while PCinlife had it there.
PCinlife had a crossbar between the L1s and the TUs.
On the other hand, ExtremeTech's diagram for RV770 gave the chip 1600 ALUs.

Which one screwed up more?

I suspect they're all AMD diagrams, but the ones at Extremetech are very recent. AMD seems to have decided to go for a black background for the "aggressive" marketing of RV770.

But yeah, you're right, the confusion on the location of the crossbar is annoying. I dunno, since RV770 seems to be more and more different the closer you look I guess we'll just have to keep our fingers crossed.

As this is so important for CAL it should become clear.

Jawed

Mintmaster · Jun 21, 2008

Jawed said:
This is my current theory on the layout of a SIMD:

The problem with that type of layout is the crazy data movement that needs to happen across blocks. Think about swizzling and dot products and channel replication.

It makes a lot more sense to keep a fragment's data in one place. I'd bet that each of the blocks has X,Y,Z,W,T in there for a quad. The register files are out in the corners, and in the center there is some sharing for derivative instructions, table lookups for transcendentals, etc.

In my opinion, redundancy (1 in 17) needs to be localised, because the mechanism for redundancy is a "bit-shift" to channel operands around the dud lane. So each ALU is self-contained in terms of redundancy.

Maybe the 55nm process is mature enough that they just skipped a lot of the redundancy, or limited it to register files and parts of the shader core instead of its entirety.

Each of the MAD-only ALUs has register file, the shiny stuff I reckon. The T ALU doesn't have any register file, but it does have look-up tables. The Sequencer, obviously, has per hardware thread status and also cache (LDS too, I guess).

I think some of those T-labelled regions are texture units. I don't see why the TU quads wouldn't show up on the die with a nice regular pattern.

Jawed · Jun 21, 2008

mczak said:
I don't quite understand however the split of T-MAD and T-Trans unit, that doesn't look right.

The T unit has no register file so that's why the MAD looks different.

Also, T-Trans doesn't look quite regular enough to me for something 16-wide.

You can get an idea of the size of a transcendental unit here:

Method and system for approximating sine and cosine functions

Though that's not a complete unit.

The adder at the bottom in that patent document may be the adder from the MAD unit, dunno. Also, T does int32 MUL which no other unit does.

Jawed
