AMD/ATI Evergreen: Architecture Discussion

Regarding the initial benchmarks, it would have been nice to see the 4890 at 850/600 as a true "half Cypress", instead of the extra bandwidth-limited Cypress, which doesn't really add anything to the comparison with the 4890.
I'd really like to see a configuration with disabled SIMDs but a higher clock vs. one with all SIMDs at a lower clock, so they have the same FLOPS. The article says that L2 bandwidth looks a little low, and there could be other things scaling only with clock but not with SIMD count (like setup), but I'm really wondering how much of a difference this actually makes. Granted, such a configuration would also give the higher-clocked one an unfair advantage in ROP throughput, but that probably shouldn't make too much of a difference.
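For reference, a quick back-of-envelope on what that pairing would look like (my own stock numbers, not from the article: Cypress with 1600 lanes at 850 MHz and 4.8 Gbps GDDR5 on a 256-bit bus, the 4890 with 800 lanes; GDDR5 moves 4 transfers per command clock):

Code:
#include <stdio.h>

static double tflops(int lanes, double core_mhz) { return lanes * 2.0 * core_mhz / 1e6; }            /* MAD = 2 flops */
static double gbs(double mem_mhz, int bus_bits)  { return mem_mhz * 4.0 * (bus_bits / 8) / 1000.0; } /* GDDR5: 4 transfers/clock */

int main(void)
{
    printf("Cypress:        %.2f TFLOPS, %5.1f GB/s\n", tflops(1600, 850), gbs(1200, 256));
    printf("4890 @ 850/600: %.2f TFLOPS, %5.1f GB/s\n", tflops( 800, 850), gbs( 600, 256));
    return 0;
}
which gives exactly half the ALU rate and half the bandwidth, so a 600 MHz memory clock really would make it a "half Cypress" on paper.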
 
I'd really like to see a configuration with disabled SIMDs but a higher clock vs. one with all SIMDs at a lower clock, so they have the same FLOPS. The article says that L2 bandwidth looks a little low, and there could be other things scaling only with clock but not with SIMD count (like setup), but I'm really wondering how much of a difference this actually makes. Granted, such a configuration would also give the higher-clocked one an unfair advantage in ROP throughput, but that probably shouldn't make too much of a difference.

If they do the Compute article, they could run a set of test configurations like this in OpenCL:

- 8 opencl gpu "processors" @ 900MHz
- 9 opencl gpu "processors" @ 800MHz
- 10 opencl gpu "processors" @ 720MHz

That would measure whether MHz or scheduling has the advantage, the classic scaling question. Since you clock the entire chip, it would be necessary to factor out the reduction in cache speeds, which may not be possible.
On the other hand, if more "processors" are faster than or equal to their fewer-but-higher-clocked counterparts despite having slower caches and higher cache pressure, the scaling question is answered even without any mathematical somersaults. :)
If you repeat the test with the 4000 series and detect a disparity in scaling, you could conclude that Cypress has a better/worse cache system per "processor" than the 4000 series (assuming the x,y,z,w,t unit did not change).
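To sanity-check such a run from the OpenCL side, a minimal sketch (assuming the reduced-SIMD/underclocked configurations actually show up with the corresponding values in the runtime, which I haven't verified) would just read back the unit count and clock and confirm the products match:

Code:
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    cl_uint cus, mhz;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,   sizeof(cus), &cus, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(mhz), &mhz, NULL);

    /* 8 x 900 = 9 x 800 = 10 x 720 = 7200, so the three configs above are FLOP-matched */
    printf("%u compute units @ %u MHz -> %u unit-MHz\n", cus, mhz, cus * mhz);
    return 0;
}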

I think with OpenCL it's really possible to answer some fundamental architectural questions about Cypress. Can't wait (too busy to program that myself ;)) for the "promised" article.
 
They render fine for me in IE8 and IE8 64-bit. Do you have something installed that fiddles with Flash content?
They do not all appear right for me; only one appears right on Linux.


As an aside, the current graph renderer that I wrote way back when will go at some point, and be replaced by something non-Flash.
Maybe write it in canvas / Processing.js, but as this is Beyond3D, maybe WebGL :smile:
 
Given the glacial rate at which we do things around here, WebGL might be a usable thing when I get round to it :LOL:
 
They do not all appear right for me; only one appears right on Linux.

On Linux, I'm using the 10.0 r45 64-bit beta Flash player in Firefox, which mangles the graphs. Opera 10 uses the stock 32-bit Flash player and displays the graphs fine.

I'm in the "Flash is evil" camp.

Cheers
 
Let's move over here, talking NI now.

Oh. I was still thinking CPU; I really have to change my thinking a bit now. I understand now that x,y,z,w,t are basically a bit like an ALU pool that can be connected/configured by the VLIW instruction in quite a flexible way. The DOT4 instruction (for example) is not simply a serial accumulation of its partial products; it's really an instruction that creates a specific configuration of the network of ALU nodes to accomplish the DOT4 not only in one clock, in order, but also more exactly. Right so far?
Yep.

Okay, this brings up a truckload of questions and ideas.
The VLIWs, I suppose, are too complex to be decoded into completely distinct signal sets; I suppose the bits in the VLIW map almost directly to pathway on/offs.
Wouldn't this be a clear incentive to explore the VLIW instruction space? Trying to detect VLIW configurations which are not documented but work (because they follow the mechanics of the VLIW instruction encoding).
It is interesting to realize that basically any permutation of all ALUs available in the pool could be expressed and executed as a VLIW instruction.
Including possibly rerouting t-unit outputs into the other ALUs' inputs. Something like MULSIN.
If it's not possible yet, it's definitely a great way to generalize the current architecture, making it extremely powerful.

I think I was still thinking out-of-order, meaning I thought you get the same throughput from a single DOT4 instruction as from multiple equivalent MUL/ADD instructions (using distinct outputs, though).
Okay, my brain is firing up; better to wait for confirmation (GPUs == in-order?). Praise for ATI's x,y,z,w,t concept will come later. No wonder Fermi has such monstrous dimensions...

Processing of ALU instructions is strictly in-order in ATI.

Okay, that's another thing it may be possible to modify just a little for a big effect.
Though I still can't really connect what I know now with the assembler output:

Code:
     60  x: MUL_e       T0.x,  PV59.z,  R8.x      
         y: MUL_e       T1.y,  PV59.z,  R6.x      VEC_021 
         z: MUL_e       T0.z,  PV59.z,  R7.x      VEC_102 
         w: ADD         ____,  PV59.w,  T0.z
In theory the three MULs within this x,y,z,w,t block are independent, which means there could be a throughput of 1/3, doing all three MULs in a single clock (there must be 4 multipliers to support the 1-clock DOT4). There could even be a throughput of 1/4 (if the assembler realizes that T0.z is temporary and discarded directly afterwards), because the last ADD could be folded into a MULADD, leading to a single clock for the entire operation.

So, what I don't really understand is how the identifiers at the front of the line relate to the identifiers on the registers.

The destination registers all appear to be named identically to the identifier in front; with the t-unit it's different:

Code:
    120  x: MUL_e       R5.x,  R1.x,  PV119.z      
         t: MUL_e       R27.x,  R0.x,  PV119.z
So what I wonder is whether this whole identifier business is basically the assembler's expression of the wiring to apply between the ALUs, with "____" being a buffer-less wiring (the value does not go to the register file and does not receive $100 :)).

I suspect making the shader-internal execution OoO is not really as simple (in terms of additional transistors) as saying it, but it's a very local change with a possibly huge effect.
Once OoO is there, the calculations are basically wire-limited; you could technically do a DOT4 explicitly as MUL/ADD instructions if you had enough wires x,y,z,u,v,w,a,b.

Well, this is just a crazy outburst without a deep understanding of how a particular x,y,z,w,t shader unit exactly looks and behaves (I mean a real logic plan and an FSM description).

There's some debate about whether NVidia's GPUs actually re-order ALU instructions - I think they do.

Jawed

Isn't it determinable? Say you see a value pulled out of the cache before the supposedly following write to global memory (an inversion)?
 
Okay, this brings up a truckload of questions and ideas.
I should have referenced the R700 ISA too:

http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

since the Evergreen one is missing huge amounts of stuff. The R600 ISA, too:

http://developer.amd.com/gpu_assets/r600isa.pdf

The VLIWs, I suppose, are too complex to be decoded into completely distinct signal sets; I suppose the bits in the VLIW map almost directly to pathway on/offs.
Not sure what you're saying. It's just a variable-length VLIW instruction format that can contain some literals and drives the hardware directly.

When you look at a complete program you need to bear in mind that it contains two different kinds of instructions: control flow instructions and clause instructions. CF instructions string together the clauses and they also use/manipulate predicates to form loops etc. and they fire off certain kinds of memory operations. Clauses contain either ALU instructions or texturing/vertex-fetch instructions.

Wouldn't this be a clear incentive to explore the VLIW instruction space? Trying to detect VLIW configurations which are not documented but work (because they follow the mechanics of the VLIW instruction encoding).
The designers have explored the space a little, as double-precision, variations on dot product and the interpolation instructions have been added over time. I don't think there's much scope for a programmer to mess about with machine code as so far no-one has cracked writing their own binaries independent of AMD's IL compiler.

You can do a bit of ISA archaeology with the 3 documents you now have :p

It is interesting to realize that basically any permutation of all ALUs available in the pool could be expressed and executed as a VLIW instruction.
Including possibly rerouting t-unit outputs into the other ALUs' inputs. Something like MULSIN.
If it's not possible yet, it's definitely a great way to generalize the current architecture, making it extremely powerful.
It's really a matter of payback on these other refinements, I reckon.

Another side to this is a discussion of the optimal lane count. There's a lot of discussion about all this stuff...

Okay, that's another thing it may be possible to modify just a little for a big effect.
Though I still can't really connect what I know now with the assembler output:

Code:
     60  x: MUL_e       T0.x,  PV59.z,  R8.x      
         y: MUL_e       T1.y,  PV59.z,  R6.x      VEC_021 
         z: MUL_e       T0.z,  PV59.z,  R7.x      VEC_102 
         w: ADD         ____,  PV59.w,  T0.z
In theory the three MULs within this x,y,z,w,t block are independent, which means there could be a throughput of 1/3, doing all three MULs in a single clock (there must be 4 multipliers to support the 1-clock DOT4). There could even be a throughput of 1/4 (if the assembler realizes that T0.z is temporary and discarded directly afterwards), because the last ADD could be folded into a MULADD, leading to a single clock for the entire operation.
I'm not sure what you're saying really - the hardware can do 5 MADs per clock on every clock cycle.
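
Just to make the "5 MADs per clock" concrete, here's a purely conceptual model of one VLIW issue (nothing to do with the real datapaths or register ports, just the data flow): every one of the x,y,z,w,t lanes can do its own a*b+c each logical cycle, so the three MULs plus the ADD in the group above already fit in a single issue.

Code:
#include <stdio.h>

struct mad { float a, b, c; };

/* one VLIW issue: each lane performs an independent MAD in the same logical cycle */
static void issue_bundle(const struct mad op[5], float out[5])
{
    for (int lane = 0; lane < 5; ++lane)   /* x, y, z, w, t */
        out[lane] = op[lane].a * op[lane].b + op[lane].c;
}

int main(void)
{
    struct mad bundle[5] = {
        {2, 3, 0}, {4, 5, 0}, {6, 7, 0},   /* three independent MULs (c = 0) */
        {1, 8, 9},                         /* an ADD expressed as 1*b + c    */
        {0, 0, 0},                         /* t lane idle this cycle         */
    };
    float out[5];

    issue_bundle(bundle, out);
    for (int i = 0; i < 5; ++i)
        printf("lane %d: %g\n", i, out[i]);
    return 0;
}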

So, what I don't really understand is how the identifiers at the front of the line relate to the identifiers on the registers.

The destination registers all appear to be named identically to the identifier in front; with the t-unit it's different:

Code:
    120  x: MUL_e       R5.x,  R1.x,  PV119.z      
         t: MUL_e       R27.x,  R0.x,  PV119.z
So what I wonder is whether this whole identifier business is basically the assembler's expression of the wiring to apply between the ALUs, with "____" being a buffer-less wiring (the value does not go to the register file and does not receive $100 :)).
"____" tells you that the result will only be used in the succeeding instruction. So in instruction 61 you will see somewhere an operand called PV60.w.

Instruction 121 might also refer to the previous instruction, e.g. there might be an operand called PS120. PS always refers to the result in the T lane.

So the PS/PV operand names are referring to an in-pipe circular buffer used specifically to avoid RAW latency. It has to be a circular buffer, because the actual timing of a succeeding instruction is 8 physical cycles and in those 7 other cycles the ALUs will want to use data from "8 physical cycles earlier".
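
A toy data-flow model of that forwarding, matching the listing quoted above (it ignores the real 8-cycle physical timing and only shows where the values come from): after each ALU group the x,y,z,w results are latched as PV and the t result as PS, and only the immediately following group can read them.

Code:
#include <stdio.h>

struct fwd { float pv[4]; float ps; };   /* PV.x/.y/.z/.w and PS for the next group */

int main(void)
{
    struct fwd prev = {{0, 0, 0, 0}, 0};
    float r8x = 2.0f, r6x = 3.0f;        /* stand-ins for R8.x and R6.x */

    prev.pv[2] = 5.0f;                   /* pretend group 59's z lane produced this -> PV59.z */

    /* group 60 reads PV59.z directly, no register-file round trip: */
    float t0x = prev.pv[2] * r8x;        /* x: MUL_e  T0.x, PV59.z, R8.x */
    float t1y = prev.pv[2] * r6x;        /* y: MUL_e  T1.y, PV59.z, R6.x */

    printf("T0.x = %g, T1.y = %g\n", t0x, t1y);
    return 0;
}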

Note that the hardware runs a pair of hardware threads over 8 physical cycles AAAABBBB, i.e. thread A runs a single instruction (e.g. instruction 3) for 4 cycles followed by thread B (which might be instruction 7 from some other kernel). The number of work items that make up a hardware thread is 4 times the width of the hardware. Most ATI GPUs are 16 wide. So in 8 physical cycles 1 logical cycle from two distinct hardware threads is executed. It's a variation of a barrel processor:

http://en.wikipedia.org/wiki/Barrel_processor
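
A tiny illustration of that interleaving (just printing the schedule for the 16-wide case): two wavefronts alternate in 4-cycle blocks, so each one covers 4 x 16 = 64 work items and completes one logical cycle every 8 physical cycles.

Code:
#include <stdio.h>

int main(void)
{
    const int width = 16;                /* SIMD width in lanes            */
    const int wavefront = 4 * width;     /* work items per hardware thread */

    for (int cycle = 0; cycle < 16; ++cycle) {
        char thread = ((cycle / 4) % 2) ? 'B' : 'A';
        printf("physical cycle %2d: thread %c, work items %2d..%2d\n",
               cycle, thread, (cycle % 4) * width, (cycle % 4) * width + width - 1);
    }
    printf("wavefront size: %d work items\n", wavefront);
    return 0;
}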

The T0 registers are clause temporary registers. Their lifetime is the clause, e.g. 5 instructions bounded by control flow instructions. They're kept in the register file, which is 256KB in size for a SIMD (but each set of 5 lanes has a private 16KB register file). Because a clause, once it starts, is uninterruptible, the T registers (up to 8 in Evergreen, 4 in previous GPUs) take up almost no space in the register file. So this is a way to save overall register file space, leaving more for those registers whose lifetime is multiple clauses or indeed the entire kernel.
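
To put those sizes in perspective, some quick arithmetic of my own (ignoring the clause temporaries and any other reservations, and the ISA's own per-thread GPR limit aside): 256KB per SIMD, 64 work items per wavefront, 16 bytes per vec4 register.

Code:
#include <stdio.h>

int main(void)
{
    const int regfile_bytes = 256 * 1024;   /* per SIMD              */
    const int wavefront     = 64;           /* work items            */
    const int vec4_bytes    = 4 * 4;        /* 4 x 32-bit components */

    for (int waves = 1; waves <= 8; waves *= 2)
        printf("%d wavefront(s) resident: %d vec4 GPRs per work item\n",
               waves, regfile_bytes / (wavefront * vec4_bytes * waves));
    return 0;
}
So a single resident wavefront could in principle hold 256 vec4 GPRs per work item, and every doubling of wavefronts in flight halves that.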

The VEC_ modifiers tell the hardware the order in which to fetch registers - there's a nasty bunch of rules about the way register fetches can be timed/ordered. This comes together over 3 out of the 4 physical cycles that are dedicated to the thread (either A or B).

I suspect making the shader-internal execution OoO is not really as simple (in terms of additional transistors) as saying it, but it's a very local change with a possibly huge effect.
OoO and VLIW are sort of opposite in this context. VLIW increases compiler pain but makes the hardware simpler. OoO has implications for the way registers and other memory are handled, too.

Well, this is just a crazy outburst without a deep understanding of how a particular x,y,z,w,t shader unit exactly looks and behaves (I mean a real logic plan and an FSM description).
You'd have to construct your own. All the microcode formats are laid out in painful detail!

Isn't it determinable? Say you see a value pulled out of the cache before the supposedly following write to global memory (an inversion)?
There are dependency analysers and operand readiness scoreboarding to handle all this stuff. You can rummage in NVidia patents. Here's the last time this subject came up:

http://forum.beyond3d.com/showthread.php?p=1374360#post1374360

Mike Shebanow's talk is useful:

http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/lectures//Lecture12-MikeShebanow.pdf

The audio is very good, make sure to catch all the Q&A:
http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/lectures/Lecture12.mp3

Sadly the original of this page is no longer there:

http://66.102.9.132/search?q=cache:...e:courses.ece.illinois.edu&cd=4&hl=en&ct=clnk

so hopefully the Google cache version works. The links off that seem to work. The more up-to-date versions of the course will also be useful, start here:

http://courses.ece.illinois.edu/ece498/al/

Jawed
 
So AMD was right after all that the tessellation implementation in the Evergreen series is not limited to 1 tri/clock. It's just that some buffer sizes appear to be quite a bit on the low side.
Did you mean 1/3 tri/clock? Because it is limited to 1 tri/clock as the peak theoretical rate.
 
So AMD was right after all that the tessellation implementation in the Evergreen series is not limited to 1/3 tri/clock. It's just that some buffer sizes appear to be quite a bit on the low side.
Weird, but isn't this a similar case to G80's and G92's wimpy GS performance with high amplification?
 