AMD/ATI Evergreen: Architecture Discussion

Discussion in 'Architecture and Products' started by Dave Baumann, Jan 19, 2010.

  1. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    I'd really like to see a configuration with disabled SIMDs but a higher clock vs. one with all SIMDs at a lower clock, so they have the same FLOPS. The article says that L2 bandwidth looks a little low, and there could be other things that scale only with clock but not with SIMD count (like setup), but I'm really wondering how much of a difference this actually makes. Granted, such a configuration would also give the higher-clocked one an unfair advantage in ROP throughput, but that probably shouldn't make too much of a difference.
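    The matched-FLOPS comparison described above can be set up numerically; a small sketch with hypothetical Cypress-like numbers (20 SIMDs of 16 lanes x 5 ALUs, one MAD per ALU per clock — illustrative assumptions, not measurements):

```python
# Sketch: pick a clock for a SIMD-reduced part so its peak FLOPS match
# a full part at a lower clock. All figures are illustrative.

ALUS_PER_SIMD = 16 * 5   # 16 lanes, 5-wide VLIW per lane
FLOPS_PER_ALU = 2        # one MAD per clock = 2 FLOPs

def peak_gflops(simds, mhz):
    return simds * ALUS_PER_SIMD * FLOPS_PER_ALU * mhz / 1000.0

full = peak_gflops(20, 725)       # hypothetical full-SIMD config
mhz_18 = 725 * 20 / 18            # clock an 18-SIMD part needs to match
print(full, peak_gflops(18, mhz_18))
```

With matched peak FLOPS, any measured gap between the two would come from the clock-scaled parts of the chip (setup, L2, ROPs) rather than raw ALU throughput.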
     
  2. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    If they do the Compute article, they could build a test configuration like this in OpenCL:

    - 8 opencl gpu "processors" @ 900MHz
    - 9 opencl gpu "processors" @ 800MHz
    - 10 opencl gpu "processors" @ 720MHz

    If you want to measure whether MHz or scheduling has the advantage, that's the classic scaling question. Since you would be clocking the entire chip, you would also need to factor out the reduction in cache speed, which may not be possible.
    On the other hand, if more "processors" are faster than or equal to their less numerous, higher-clocked counterparts despite having slower caches and higher cache pressure, the scaling question is answered even without mathematical somersaults. :)
    If you repeat the test with the 4k series and detect a disparity in scaling, you could assume Cypress has a better/worse cache system per "processor" than the 4k series (assuming the xyzwt-unit did not change).

    I think with OpenCL it's really possible to answer some fundamental architecture questions about Cypress. Can't wait (too busy to program it myself ;) ) for the "promised" article.
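    The test matrix above is built so that unit count times clock is constant, which is what isolates scheduling/cache effects from raw throughput. A quick sketch of that invariant and the comparison it enables (placeholder numbers; a real run would fill them in via OpenCL timings):

```python
# Ethatron's equal-throughput matrix: compute units x MHz is held
# constant, so configs differ only in width vs. frequency.

configs = [(8, 900), (9, 800), (10, 720)]  # (units, MHz)

products = [units * mhz for units, mhz in configs]
assert len(set(products)) == 1  # all 7200 "unit-MHz"

def scaling_efficiency(measured_gflops, peak_gflops):
    """Fraction of peak achieved. If the wider, slower config scores
    equal or higher despite slower caches, width scales cleanly."""
    return measured_gflops / peak_gflops

print(products[0], scaling_efficiency(3600.0, 7200.0))
```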
     
  3. argor

    Newcomer

    Joined:
    Nov 25, 2008
    Messages:
    96
    Likes Received:
    0
    They do not all appear right for me; only one appears right on Linux.


    Maybe write it in canvas / Processing.js, but as this is Beyond3D, maybe WebGL :smile:
     
  4. Rys

    Rys Graphics @ AMD
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,182
    Likes Received:
    1,579
    Location:
    Beyond3D HQ
    Given the glacial rate at which we do things around here, WebGL might be a usable thing by the time I get round to it :lol:
     
  5. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    B3D's 7th commandment: Thou shalt not choose Flash over WebGL.

    :runaway:
     
  6. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    On Linux, I'm using the 10.0 r45 64 bit beta flashplayer in Firefox, which mangles the graphs. Opera 10 uses the stock 32bit flash player and it displays the graphs fine.

    I'm in the "Flash is evil" camp.

    Cheers
     
  7. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    #87 rpg.314, Feb 24, 2010
    Last edited by a moderator: Feb 24, 2010
  8. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    Let's move over here, talking NI now.

    Okay, this brings up a truckload of questions and ideas.
    The VLIW words, I suppose, are too complex to be decoded into completely distinct signal sets; the bits in the VLIW word probably map almost directly onto pathway on/offs.
    Wouldn't this be a clear incentive to explore the VLIW instruction space, trying to detect VLIW configurations which are not documented but work (because they follow the mechanics of VLIW instruction encoding)?
    It is interesting to consider that basically any permutation of all the ALUs available in the pool could be expressed and executed as a VLIW instruction.
    Including possible rerouting of t-unit outputs into the other ALUs' inputs. Something like MULSIN.
    If it's not possible yet, it's definitely a great way to generalize the current architecture, making it extremely powerful.

    Okay, that's another thing that could be modified just a little for a big effect.
    Though I still can't really connect what I know now with the assembler output:

    Code:
         60  x: MUL_e       T0.x,  PV59.z,  R8.x      
             y: MUL_e       T1.y,  PV59.z,  R6.x      VEC_021 
             z: MUL_e       T0.z,  PV59.z,  R7.x      VEC_102 
             w: ADD         ____,  PV59.w,  T0.z      
    
    In theory the three MULs within this x,y,z,w,t-block are uncorrelated, which means there could be a throughput of 1/3, doing all three MULs in a single clock (there must be 4 multipliers to support the 1-clock DOT4). There could even be a throughput of 1/4 (if the assembler realizes that T0.z is temporary and trashed directly afterwards), because the last ADD can be folded into a MULADD, leading to a single clock for the entire operation.
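    The packing argument above can be made concrete with a toy scheduler: independent ops share one 5-slot bundle, and an op whose source was just written must start a new bundle (unless fused into a MULADD). This is purely illustrative, not AMD's actual shader compiler logic:

```python
# Toy VLIW packer: greedily place ops into one 5-slot bundle
# (x, y, z, w, t); an op joins only if its sources aren't written
# by the current bundle. Illustrative sketch only.

def pack_bundles(ops):
    """ops: list of (dest, src_a, src_b) register-name tuples."""
    bundles, current, written = [], [], set()
    for dest, a, b in ops:
        if len(current) == 5 or a in written or b in written:
            bundles.append(current)
            current, written = [], set()
        current.append((dest, a, b))
        written.add(dest)
    bundles.append(current)
    return bundles

# The three independent MULs from instruction 60 fit in one bundle;
# the ADD reads T0.z, so it needs a second bundle (or MULADD fusion).
ops = [("T0.x", "PV59.z", "R8.x"),
       ("T1.y", "PV59.z", "R6.x"),
       ("T0.z", "PV59.z", "R7.x"),
       ("____", "PV59.w", "T0.z")]
print(len(pack_bundles(ops)))  # 2
```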

    So, what I don't really understand is the relationship between the identifiers in front of each line and the identifiers on the registers.

    The destination registers all appear to be named identically to the identifier in front; with the t-unit it's different:

    Code:
        120  x: MUL_e       R5.x,  R1.x,  PV119.z      
             t: MUL_e       R27.x,  R0.x,  PV119.z   
    
    So what I wonder is whether all this identifier business is basically the assembler's expression of the wiring to apply between the ALUs, with "____" being a buffer-less wiring (the value does not go to the register file and does not collect $100 :) ).

    I suspect that making the shader-internal execution OoO is not really as simple (in terms of additional transistors) as saying it, but it's a very local change with a possibly huge effect.
    Once OoO is there, the calculations are basically wire-limited; you could technically do a DOT4 explicitly as MUL/ADD instructions if you had enough wires x,y,z,u,v,w,a,b.

    Well, this is just a crazy outburst without a deep understanding of how a particular shader unit x,y,z,w,t actually looks and behaves (I mean a real logic plan and an FSM description).

    It's not determinable? Let's say you see a value pulled out of the cache before the supposed follow-up write to global memory (an inversion)?
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I should have referenced the R700 ISA too:

    http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

    since the Evergreen one is missing huge amounts of stuff. The R600 ISA, too:

    http://developer.amd.com/gpu_assets/r600isa.pdf

    Not sure what you're saying. It's just a variable-length VLIW instruction format that can contain some literals and drives the hardware directly.

    When you look at a complete program you need to bear in mind that it contains two different kinds of instructions: control flow instructions and clause instructions. CF instructions string together the clauses and they also use/manipulate predicates to form loops etc. and they fire off certain kinds of memory operations. Clauses contain either ALU instructions or texturing/vertex-fetch instructions.

    The designers have explored the space a little, as double-precision, variations on dot product and the interpolation instructions have been added over time. I don't think there's much scope for a programmer to mess about with machine code as so far no-one has cracked writing their own binaries independent of AMD's IL compiler.

    You can do a bit of ISA archaeology with the 3 documents you now have :razz:

    It's really a matter of payback on these other refinements, I reckon.

    Another side to this is a discussion of the optimal lane count. There's a lot of discussion about all this stuff...

    I'm not sure what you're saying really - the hardware can do 5 MADs per clock on every clock cycle.

    "____" tells you that the result will only be used in the succeeding instruction. So in instruction 61 you will see somewhere an operand called PV60.w.

    Instruction 121 might also refer to the previous instruction, e.g. there might be an operand called PS120. PS always refers to the result in the T lane.

    So the PS/PV operand names are referring to an in-pipe circular buffer used specifically to avoid RAW latency. It has to be a circular buffer, because the actual timing of a succeeding instruction is 8 physical cycles and in those 7 other cycles the ALUs will want to use data from "8 physical cycles earlier".

    Note that the hardware runs a pair of hardware threads over 8 physical cycles AAAABBBB, i.e. thread A runs a single instruction (e.g. instruction 3) for 4 cycles followed by thread B (which might be instruction 7 from some other kernel). The number of work items that make up a hardware thread is 4 times the width of the hardware. Most ATI GPUs are 16 wide. So in 8 physical cycles 1 logical cycle from two distinct hardware threads is executed. It's a variation of a barrel processor:

    http://en.wikipedia.org/wiki/Barrel_processor
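    The AAAABBBB interleave described above is easy to model; a sketch under the stated assumptions (16-wide hardware, two threads alternating in 4-cycle groups):

```python
# Toy model of the AAAABBBB issue pattern: two hardware threads
# alternate in 4-cycle groups, so a thread's next instruction runs
# 8 physical cycles after its previous one (hiding RAW latency).

def issue_schedule(n_cycles):
    """Return which thread ('A' or 'B') owns each physical cycle."""
    return ["A" if (cycle // 4) % 2 == 0 else "B"
            for cycle in range(n_cycles)]

print("".join(issue_schedule(16)))  # AAAABBBBAAAABBBB

# A hardware thread covers 4x the SIMD width in work items:
SIMD_WIDTH = 16
wavefront_size = 4 * SIMD_WIDTH
print(wavefront_size)  # 64
```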

    The T0 registers are clause temporary registers. Their lifetime is the clause, e.g. 5 instructions bounded by control flow instructions. They're kept in the register file, which is 256KB in size for a SIMD (but each set of 5 lanes has a private 16KB register file). Because a clause, once it starts, is uninterruptible, the T registers (up to 8 in Evergreen, 4 in previous GPUs) take up almost no space in the register file. So this is a way to save overall register file space, leaving more for those registers whose lifetime is multiple clauses or indeed the entire kernel.
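    The register-file figures in that paragraph can be checked arithmetically; a sketch (the per-work-item GPR count derived here is an inference from the stated sizes, assuming vec4 fp32 registers and 64-work-item hardware threads):

```python
# Quick arithmetic on the register-file sizes quoted above:
# 16 groups of 5 lanes per SIMD, each group with a private 16 KiB file.
LANE_GROUPS = 16
KIB_PER_GROUP = 16

total_kib = LANE_GROUPS * KIB_PER_GROUP
print(total_kib)  # 256

# Spread over a 64-work-item hardware thread holding vec4 fp32
# registers (16 bytes each), that works out to 256 GPRs per work item.
gprs_per_item = (total_kib * 1024) // (64 * 16)
print(gprs_per_item)  # 256
```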

    The VEC_ modifiers tell the hardware the order in which to fetch registers - there's a nasty bunch of rules about the way register fetches can be timed/ordered. This comes together over 3 out of the 4 physical cycles that are dedicated to the thread (either A or B).

    OoO and VLIW are sort of opposite in this context. VLIW increases compiler pain but makes the hardware simpler. OoO has implications for the way registers and other memory are handled, too.

    You'd have to construct your own. All the microcode formats are laid out in painful detail!

    There are dependency analysers and operand readiness scoreboarding to handle all this stuff. You can rummage in NVidia patents. Here's the last time this subject came up:

    http://forum.beyond3d.com/showthread.php?p=1374360#post1374360

    Mike Shebanow's talk is useful:

    http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/lectures//Lecture12-MikeShebanow.pdf

    The audio is very good; make sure to catch all the Q&A:
    http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/lectures/Lecture12.mp3

    Sadly the original of this page is no longer there:

    http://66.102.9.132/search?q=cache:...e:courses.ece.illinois.edu&cd=4&hl=en&ct=clnk

    so hopefully the Google cache version works. The links off that seem to work. The more up-to-date versions of the course will also be useful, start here:

    http://courses.ece.illinois.edu/ece498/al/

    Jawed
     
  10. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    Okay, that will fill my weekend, thanks. :)
    I'll be back.
     
  11. bridgman

    Newcomer Subscriber

    Joined:
    Dec 1, 2007
    Messages:
    62
    Likes Received:
    123
    Location:
    Toronto-ish
  12. Arnold Beckenbauer

    Veteran Subscriber

    Joined:
    Oct 11, 2006
    Messages:
    1,756
    Likes Received:
    722
    Location:
    Germany
  13. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    So AMD was right after all that the tessellation implementation in the Evergreen series is not limited to 1/3 tri/clock. It's just that some buffer sizes appear to be quite a bit on the low side.
     
    #93 Gipsel, Oct 24, 2010
    Last edited by a moderator: Oct 24, 2010
  14. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    Did you mean 1/3 tri/clock? Because it is limited to 1 tri/clock as the peak theoretical rate.
     
  15. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Yes, I meant to type 1/3 tris/clock, of course.
     
  16. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Weird, but isn't this a similar case to the wimpy GS performance of G80 and G92 under high amplification?
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.