AMD: R7xx Speculation

Discussion in 'Architecture and Products' started by Unknown Soldier, May 18, 2007.

Thread Status:
Not open for further replies.
  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Maybe we could rationalise it as "hmm, R580 is as big as we ever want to make a chip, we know R600 is looking way too big..." How suddenly does a company come to appreciate the economics of these big GPUs? I can't imagine that it actually was sudden, these GPUs overlapped too much.

    Jawed
     
  2. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I guess I'm not really sure what you're asking. I take it that you want to run multiple programs simultaneously so that you don't need as many threads running the same program to maintain utilization, right? You can pool the threads and make an uber-program to achieve the same thing.
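
    A minimal sketch of the uber-program idea, assuming a per-thread program id selects which body runs. The function names and the id scheme are hypothetical and do not come from any real GPU API; the point is only that several small programs can be merged behind one branch so their threads share a single pool.

```python
# Hypothetical illustration: none of these names come from a real GPU API.

def program_a(x):
    """Stand-in for one small shader/kernel."""
    return x * 2.0

def program_b(x):
    """Stand-in for a second, unrelated kernel."""
    return x + 1.0

def uber_program(program_id, x):
    """One merged program: branch on a per-thread id, then run that body."""
    if program_id == 0:
        return program_a(x)
    elif program_id == 1:
        return program_b(x)
    raise ValueError(f"unknown program id {program_id}")

def run_pool(threads):
    """Run pooled (program_id, input) pairs through the single uber-program."""
    return [uber_program(pid, x) for pid, x in threads]

# Threads from both programs share one pool.
pool = [(0, 1.0), (1, 1.0), (0, 3.0), (1, 3.0)]
results = run_pool(pool)
```

    The branch table has to be written before anything runs, so this only works when every program in the pool is known ahead of time.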

    Okay, I got it now. Had a little brainfart there...
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    It doesn't need to be a crossbar, but at the same time the memory request bus doesn't seem worth mentioning if all it did was link things 1:1.

    It might need to be aware of at least 2 contexts, if the SIMD is still alternating threads every cycle.

    I'm not sure which slides are wholly accurate. The ExtremeTech images omitted the memory read/write cache that sat between the hub and shader export, while PCinlife had it there.
    PCinlife had a crossbar between the L1s and the TUs.
    On the other hand, ExtremeTech's diagram for RV770 gave the chip 1600 ALUs.

    Which one screwed up more?
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    But that means you know everything that will be running ahead of time, and that everything running is something you've made. In the future that might not be the case.

    What if the gamer is running Folding@home or some future computation program at the same time that a game is running your graphics code and game physics?
    It may be stupid, but still possible.
    In the case of current physics code, it's running on CUDA, so your graphics code currently isn't even allowed to interact with it correctly.

    If that trend of multiple separate programs takes off, no branching written ahead of time can capture it.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Yep, the lack of a routing hotspot was supposed to be key.

    Now we have a hub which is, if anything, a routing hotspot.

    Though being able to put something else in the centre of the die, instead of some of the MC (as we see in R5xx), was an improvement in R600's ring bus, a facet of full distribution.

    There's an interesting patent relating to making interconnects travel long distances:

    Integrated Circuit Chip With Repeater Flops and Method for Automated Design of Same

    Is that involved in connecting the hub to the MCs?

    Jawed
     
  6. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Something like this?

     
    #4346 CarstenS, Jun 21, 2008
    Last edited by a moderator: Jun 21, 2008
  7. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    So you envision a scenario, say next gen, where running 20 separate programs simultaneously isn't enough, and you need 60?
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I'll have another go at the numbers over the weekend.

    This is my current theory on the layout of a SIMD:

    [IMG]

    In my opinion, redundancy (1 in 17) needs to be localised, because the mechanism for redundancy is a "bit-shift" to channel operands around the dud lane. So each ALU is self-contained in terms of redundancy.
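
    A rough sketch of how such a shift around a dud lane could work, assuming 16 logical lanes mapped onto 17 physical ones. This is purely illustrative: the function names are made up, and nothing here is confirmed as AMD's actual mechanism.

```python
PHYSICAL_LANES = 17  # 16 needed + 1 spare, per the "1 in 17" figure

def lane_map(dud_lane=None, logical_lanes=16):
    """Map each logical lane to a physical lane, shifting past the dud.

    Logical lanes below the dud keep their index; lanes at or above it
    move up by one, so the defective physical lane is never used.
    """
    if dud_lane is None:
        return list(range(logical_lanes))  # spare lane simply idles
    if not 0 <= dud_lane < PHYSICAL_LANES:
        raise ValueError("dud lane out of range")
    return [i if i < dud_lane else i + 1 for i in range(logical_lanes)]

def route_operands(operands, dud_lane=None):
    """Place 16 operands onto 17 physical lanes; the unused lane holds None."""
    physical = [None] * PHYSICAL_LANES
    for logical, phys in enumerate(lane_map(dud_lane, len(operands))):
        physical[phys] = operands[logical]
    return physical
```

    Because the mapping only ever shifts by one position, each block can fix itself up locally without routing operands across the rest of the SIMD.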

    Each of the MAD-only ALUs has a register file, the shiny stuff I reckon. The T ALU doesn't have any register file, but it does have look-up tables. The Sequencer, obviously, has per hardware thread status and also cache (LDS too, I guess).

    Of course it would be funny if I've read the colours wrong and the dark bits are memory and the bright bits are logic...

    Jawed
     
  9. randomhack

    Newcomer

    Joined:
    Apr 4, 2008
    Messages:
    41
    Likes Received:
    0
    unleashonetera.com site seems a little updated now.
    curiously i can only see the updated site in opera browser.
     
  10. AlphaWolf

    Specious Misanthrope
    Legend

    Joined:
    May 28, 2003
    Messages:
    8,415
    Likes Received:
    270
    Location:
    Treading Water
    Works fine in FF3 for me. I guess the site is targeted at Europe as all the buy links are European.
     
  11. randomhack

    Newcomer

    Joined:
    Apr 4, 2008
    Messages:
    41
    Likes Received:
    0
    Btw what is Advanced Video Transcode? On one of the card descriptions I see it can accelerate video transcoding?
    edit : that should read Accelerated Video Transcode
    edit : i see i might be living a couple of years in the past.
     
  12. WaltC

    Veteran

    Joined:
    Jul 22, 2002
    Messages:
    2,710
    Likes Received:
    8
    Location:
    BelleVue Sanatorium, Billary, NY. Patient privile
    I think I mentioned very clearly that I thought AMD had milked the K8 for too long, so we agree on that. As far as "64-bits on the desktop" goes, there's no doubt that Intel is now firmly a "64-bits on the desktop" believer...;) Without x86-64, I think Core 2 would have been a dud. The point was that Intel was wrong about "nobody needing it" or wanting it, for that matter, and the matter seems closed for debate, imo. I can't imagine why you might think it wouldn't be.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    http://forum.beyond3d.com/showpost.php?p=1178574&postcount=4052

    I said 40.8% of the die for RV770's SIMDs.

    If RV670's SIMDs were the same size, then 4 of these SIMDs would amount to 42 mm², which is 22% of RV670. Except, of course, that RV770's SIMDs are smaller (by what percentage?) and they also have extra memory inside, for LDS.
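
    The arithmetic behind those two numbers can be reproduced as below. The die areas (256 mm² for RV770, 192 mm² for RV670) and the count of 10 SIMDs on RV770 are assumptions taken from commonly reported figures, not from the post itself; only the 40.8% share and the 4 SIMDs come from the text above.

```python
# Assumed inputs (not stated in the post; commonly reported figures).
RV770_DIE_MM2 = 256.0
RV670_DIE_MM2 = 192.0
RV770_SIMD_COUNT = 10

# Inputs taken from the post.
RV770_SIMD_FRACTION = 0.408  # 40.8% of the RV770 die is SIMDs
RV670_SIMD_COUNT = 4

def simd_area_estimate():
    """Return (area per SIMD, area of 4 SIMDs, their share of the RV670 die)."""
    per_simd = RV770_DIE_MM2 * RV770_SIMD_FRACTION / RV770_SIMD_COUNT
    four_simds = per_simd * RV670_SIMD_COUNT
    share_of_rv670 = four_simds / RV670_DIE_MM2
    return per_simd, four_simds, share_of_rv670

per_simd, four_simds, share = simd_area_estimate()
print(f"{per_simd:.1f} mm2 per SIMD, {four_simds:.0f} mm2 for 4, {share:.0%} of RV670")
```

    Under those assumed die sizes the result rounds to the 42 mm² and 22% quoted above; different die-area figures would shift both numbers proportionally.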

    Anyway, it seems likely to me that RV670's SIMDs occupy less than 30% of the die. All the indications are that AMD has chopped out a lot of TU functionality. I think this, along with the extra Z capability per RBE (and the deleted fog unit?), really mucks up scaling comparisons.

    Then there's the hub instead of ring...

    Jawed
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    As far as I can tell CAL and CUDA both support more than one kernel running simultaneously.

    Both architectures support F@H while doing 3D graphics (e.g. Vista Aero).

    We're looking at the following types of kernel in D3D11 I reckon:
    • Control Point
    • Vertex
    • Geometry
    • Pixel
    • General Computation
    Jawed
     
  15. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    Interesting. With this organization, though, scaling the SIMD width would require the ALU blocks to be redesigned (not a problem if that's fully automated). It definitely would make sense wrt redundancy.
    However, I don't quite understand the split between the T-MAD and T-Trans units; that doesn't look right. Also, T-Trans doesn't look quite regular enough to me for something 16-wide.
     
  16. nAo

    Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Tessellation adds 2 new (different) programmable stages in DX11...
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    No, a hardware thread in G80 contains 16 elements, with instructions lasting 2 clocks. That's how vertex shaders are executed.

    RV610/20 are 16.

    GT200, I think, has 32-element hardware threads. I have to admit my first quick dose of the CUDA 2.0 documentation, where it describes coalescing and memory pages, sent me packing, but I think somewhere in there (or the operand fetch waterfalling discussion) should be the true hardware thread size...
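
    As a quick check on the relationship between thread size and instruction duration, here is a small sketch. It assumes 8 ALU lanes per multiprocessor, which is not stated in the post, and the helper name is made up for illustration.

```python
import math

def clocks_per_instruction(thread_elements, simd_lanes):
    """Clocks needed to push one instruction through every element of a thread."""
    return math.ceil(thread_elements / simd_lanes)

ASSUMED_LANES = 8  # assumption: 8 lanes per multiprocessor

g80_vertex_clocks = clocks_per_instruction(16, ASSUMED_LANES)
gt200_guess_clocks = clocks_per_instruction(32, ASSUMED_LANES)
```

    With 8 lanes, a 16-element thread takes 2 clocks per instruction, matching the figure above, and a 32-element thread would take 4.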

    Jawed
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    How about this, each TU fetches data from:
    • anywhere in GDS
    • anywhere in vertex cache
    • its private L1
    So it's one:many for GDS and vertex cache and one:one for texels.

    Yeah, you're right. I was just trying to isolate a hardware thread from all the others running on the SIMD, in which case the pairing is invisible.

    I suspect they're all AMD diagrams, but the ones at Extremetech are very recent. AMD seems to have decided to go for a black background for the "aggressive" marketing of RV770.

    But yeah, you're right, the confusion on the location of the crossbar is annoying. I dunno, since RV770 seems to be more and more different the closer you look I guess we'll just have to keep our fingers crossed.

    As this is so important for CAL it should become clear.

    Jawed
     
  19. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    The problem with that type of layout is the crazy data movement that needs to happen across blocks. Think about swizzling and dot products and channel replication.

    It makes a lot more sense to keep a fragment's data in one place. I'd bet that each of the blocks has X,Y,Z,W,T in there for a quad. The register files are out in the corners, and in the center there is some sharing for derivative instructions, table lookups for transcendentals, etc.

    Maybe the 55nm process is mature enough that they just skipped a lot of the redundancy, or limited it to register files and parts of the shader core instead of its entirety.

    I think some of those T-labelled regions are texture units. I don't see why the TU quads wouldn't show up on the die with a nice regular pattern.
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The T unit has no register file so that's why the MAD looks different.

    You can get an idea of the size of a transcendental unit here:

    Method and system for approximating sine and cosine functions

    Though that's not a complete unit.

    The adder at the bottom in that patent document may be the adder from the MAD unit, dunno. Also, T does int32 MUL which no other unit does.

    Jawed
     
