AMD: R9xx Speculation

Discussion in 'Architecture and Products' started by Lukfi, Oct 5, 2009.

  1. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Ah yes, I didn't see there's also an HD6800. I completely agree.

    Goes a bit against the idea of releasing a whole series in a short timeframe, but I guess it would make sense. Arguably, Cypress could benefit the most from a refresh (it scales badly compared to Juniper, and the slow tessellation should be much less of a problem for Juniper and below). Whatever AMD took out of the initial Cypress for time-to-market reasons they could also put back in (it's unclear to me if those scratched parts would have also affected the lower end parts).
    Indeed the 68xx on 40nm with 4 times the flops of gtx480 seems rather unlikely - would be twice the flops of Cypress, that would be one massive chip on 40nm.
    If however 67xx is on 40nm and 68xx on 28nm, I'm missing the "pipe cleaner" part on 28nm. I don't see AMD starting on 28nm with such a massive chip, but maybe there could be some 66xx series or so before the 68xx on 28nm?
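
    A rough sanity check of the "4 times the flops of gtx480 is twice the flops of Cypress" comparison, as a minimal sketch. It assumes peak throughput is lanes x 2 flops (one multiply-add) x shader clock. The lane counts and clocks are the publicly listed specs for GTX 480 and HD 5870, not figures from this thread, and `peak_gflops` is just an illustrative helper name.

```python
def peak_gflops(lanes: int, clock_mhz: float, flops_per_lane_per_clock: int = 2) -> float:
    """Peak single-precision GFLOPS, counting one multiply-add as 2 flops."""
    return lanes * flops_per_lane_per_clock * clock_mhz / 1000.0


# Assumed public specs (not taken from the thread):
gtx480 = peak_gflops(480, 1401)    # 480 CUDA cores, 1401 MHz shader clock
cypress = peak_gflops(1600, 850)   # HD 5870: 1600 lanes, 850 MHz

ratio = 4 * gtx480 / cypress
print(f"GTX 480: {gtx480:.0f} GFLOPS, Cypress: {cypress:.0f} GFLOPS")
print(f"4x GTX 480 is {ratio:.2f}x Cypress")
```

    Under those assumptions the ratio comes out just under 2, which is where the "twice the flops of Cypress" figure comes from.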
     
  2. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,451
    Likes Received:
    471
    If the HD6770 is ~420mm² at 40nm, it would measure about 220mm² at 28nm. That could be enough for a 256bit GDDR5 interface (without sideport). The 28nm part could later replace the initial 40nm model as a slightly faster HD6790.
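
    For reference, a minimal sketch of the shrink arithmetic, assuming an ideal optical shrink where area scales with the square of the feature size. Real shrinks usually do worse than this (pads and analog blocks don't scale as well), which is presumably why the estimate above is ~220mm² rather than the ideal figure. The function name is made up for illustration.

```python
def ideal_shrink_area(area_mm2: float, old_nm: float, new_nm: float) -> float:
    """Die area after an ideal shrink: area scales with (new_nm / old_nm) squared."""
    return area_mm2 * (new_nm / old_nm) ** 2


ideal = ideal_shrink_area(420.0, 40.0, 28.0)
implied_scale = 220.0 / 420.0   # scale factor implied by the ~220mm² estimate

print(f"ideal 28nm area: {ideal:.1f} mm^2")
print(f"ideal scale: {(28 / 40) ** 2:.2f}, implied by 220mm^2: {implied_scale:.2f}")
```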
     
  3. leoneazzurro

    Regular

    Joined:
    Nov 3, 2005
    Messages:
    518
    Likes Received:
    25
    Location:
    Rome, Italy
    And what if the new chip has fewer shaders but different clock domains for the shaders and the rest of the chip? I mean, if it really has double the FLOPs with fewer SPs it would need crazy high clocks. And to improve tessellation dramatically, it would need an approach similar to Fermi's.

    So maybe it could be a chip with SIMD clusters, each with its own setup and tessellation engine (maybe four "big" clusters with more than one SIMD inside? :???: ), where the ALUs are not scalar but 4-way VLIW, in a clock domain running at 2x that of the texturing, ROPs, etc.
    Also the scheduling should have changed substantially, I think.

    Hmm.. a lot of rumors and a lot of possibilities...
     
  4. ferro

    Newcomer

    Joined:
    Apr 8, 2005
    Messages:
    130
    Likes Received:
    0
    Location:
    The Netherlands
    I think neliz got his information from tweakers.net rumors. A rough translation/interpretation from tweakers.net:

    • Hybrid Evergreen/NI
    • Much improved stream processor architecture
    • Much improved tessellation unit with 3 or 4 times Cypress performance
    • Enhanced rasterizer for improved efficiency
    • Improved UVD unit
    • Improved cache architecture for better GPGPU performance
    • 6600 has 1 "SP module", 40nm, planned for Q4 2010
    • 6700 has 2 "SP modules", 40nm, 10-20% faster than GF100 with 512SP, 400-440mm2, planned for Q3 2010
    • 6800 has 4 "SP modules", 28nm, 512 bit memory bus, planned for Q1 2011
     
  5. CRoland

    Newcomer

    Joined:
    Jan 19, 2010
    Messages:
    114
    Likes Received:
    0
    Quite a lot of changes for a supposedly unplanned generation.
     
  6. TKK

    TKK
    Newcomer

    Joined:
    Jan 12, 2010
    Messages:
    148
    Likes Received:
    0
    My take on the matter of SI using 4D instead of 5D is (something like this was actually suggested by someone else already, not sure whether it was here or on S|A though):
    SP-functionality is getting removed from the T-unit, which now serves as a pure Special Function Unit instead of being a jack-of-all-trades. This reduces the 'official' number of SPs by 1/5th per SP cluster.

    Number of SIMDs per block increased from 10 to 12, resulting in a total of 24 for the fastest 40nm part. This would result in 'only' 1536 SPs (slightly below Cypress), but 20% more TMUs and ex-T-units-now-SFUs.

    Add those uncore improvements to the equation, and ~20-40% more performance (depending on game and workload) compared to Cypress seems absolutely possible.
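
    The unit counts above can be tallied with a short sketch. It assumes the Cypress layout (16 VLIW units per SIMD, 4 TMUs per SIMD, one T unit per VLIW unit) carries over unchanged, which is my assumption and not something the rumors state. `ShaderConfig` is a hypothetical name used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShaderConfig:
    simds: int
    lanes_per_unit: int        # 5 for XYZWT (Cypress), 4 if T no longer counts as an SP
    units_per_simd: int = 16   # assumed unchanged from Cypress
    tmus_per_simd: int = 4     # assumed unchanged from Cypress (80 TMUs / 20 SIMDs)

    @property
    def sps(self) -> int:
        return self.simds * self.units_per_simd * self.lanes_per_unit

    @property
    def tmus(self) -> int:
        return self.simds * self.tmus_per_simd

    @property
    def t_units(self) -> int:
        # One T unit (or pure SFU in the speculated design) per VLIW unit.
        return self.simds * self.units_per_simd


cypress = ShaderConfig(simds=20, lanes_per_unit=5)
speculated = ShaderConfig(simds=24, lanes_per_unit=4)

tmu_gain_pct = (speculated.tmus - cypress.tmus) * 100 // cypress.tmus
sfu_gain_pct = (speculated.t_units - cypress.t_units) * 100 // cypress.t_units

print(cypress.sps, speculated.sps)   # 1600 1536
print(tmu_gain_pct, sfu_gain_pct)    # 20 20
```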
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    24 could be reasonable I suppose, 1536 lanes. That would leave real world FLOPS pretty much unchanged, the ALUs would take about the same space as Cypress's (i.e. assuming a 20% saving per SIMD by deleting T, as I described earlier) and all the extra die space would be dedicated to the swathe of efficiency improvements - the stuff that was left out of Cypress due to the 40nm problems :smile:

    Though I wouldn't bet against only 20 SIMDs, 1280 lanes.

    All this in a "HD6770" :grin: now that would be neat. EDIT: Hmm, if the modules are 8 SIMDs, then we could be looking at 16 SIMDs, 1024 lanes.

    I wonder if Hecatoncheires was the original name for Evergreen + 1, but when 40nm troubles arose AMD split Evergreen + 1 into SI and NI: SI being on a process older than 28nm (was going to be 32nm, but is now 40nm) and NI being 28nm. So many different ways to argue this :lol:

    Jawed
     
  8. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    You can't just increase clock (a lot) for the alus without changing them (very substantially), hence that seems unlikely. Might not be worth it (based on transistor count, power) anyway.

    That wouldn't be that impressive, though I guess might be "good enough".
    Hmm, I wonder what it's missing today.
    Might also fix the supposedly low internal bandwidth problem I guess, hence might be beneficial to graphics too?
    That would look quite good if true.

    That would make sense since nvidia did something similar (no more mul/interpolation done in sfu), though nvidia also reduced the number of sfus (if t unit loses normal operations, that'll give a normal/sfu rate of 1:4, nvidia has 1:8 now).
    The only "non-special" instructions which can only be done in the t unit currently are those related to the 32bit int multiplies I think. I'd guess if the t unit is going to be less jack-of-all-trades it'll lose the multiplier. 32bit int muls though should be doable by combining the normal units.

    I wonder how that compares in area-efficiency. I guess the expectation is this will perform better per transistor count? In any case losing a tiny bit of flops shouldn't be that big of a deal (we've seen Fermi being faster overall with half the flops). Should also make utilization of ALUs a bit higher I guess.

    Uncore could probably be more important than the shader changes as outlined above imho, though I guess no matter how you look at it if it's 25% bigger it should indeed be also quite a bit faster.
     
  9. racca

    Newcomer

    Joined:
    Apr 3, 2010
    Messages:
    51
    Likes Received:
    0
    There's no way you could've been right.

    1. SIMD is NOT just ALUs, it's bundled with (quite large) caches, registers, TMUs, etc. If you take the ALU out of T-units, it could only save you much less than 20% even just for the ALU parts -- it's 4xALU+(ALU+SFU)->4xALU+SFU, you'd be lucky to get a 5% shrink per SIMD core. Counting the 20% increase in SIMD units, you should spend around 15% more on the shader core.

    2. 20 SIMDs would be unacceptable. Unless AMD can find a way to let SI run at a much higher clock, i.e. 25%+, there's no way 1280 SP parts could outperform HD5870/GTX480, let alone 1024 SP. So what's the point in having those parts if they can't even beat the 5870?
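
    Both numbers in the two points above can be reproduced with a quick sketch. The 5% per-SIMD shrink and 20% SIMD growth are the guesses from point 1. The clock figure assumes plain lanes x clock parity with a 1600-lane Cypress and ignores any utilization gain from dropping T, so treat it as a back-of-the-envelope check only. Function names are illustrative.

```python
def shader_core_area_ratio(simd_growth: float, per_simd_shrink: float) -> float:
    """Relative shader-core area after adding SIMDs that are each slightly smaller."""
    return (1.0 + simd_growth) * (1.0 - per_simd_shrink)


def clock_ratio_for_parity(baseline_lanes: int, new_lanes: int) -> float:
    """Clock multiplier needed for new_lanes to match baseline_lanes in lanes x clock."""
    return baseline_lanes / new_lanes


# Point 1: 20% more SIMDs, each about 5% smaller.
area = shader_core_area_ratio(simd_growth=0.20, per_simd_shrink=0.05)
print(f"shader core area: {area:.2f}x")

# Point 2: clock needed for fewer lanes to match 1600 lanes.
for lanes in (1280, 1024):
    print(f"{lanes} lanes need {clock_ratio_for_parity(1600, lanes):.2f}x the clock")
```

    That gives roughly 14% more shader core area, close to the "around 15%" above, and exactly the 25% clock bump for the 1280 SP case.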
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    The SIMDs are ALUs, registers, LDS and buses to get data in and out. I'm specifically excluding TMUs and L1s - in fact LDS should be excluded since it's fixed size per core regardless of the ALU count. You can call the whole lot clusters or cores.

    My guesstimate of 20% saving going from XYZWT to XYZW is excluding registers and LDS (which wouldn't change) but includes savings in routing and buffering that T requires.

    On the other hand you're right, I'm mistakenly ignoring the fact that, say, 24 SIMDs adds TMUs, registers and all the other stuff. So on that side of the equation 24 SIMDs would see an overall increase in the area consumed by the cores, i.e. cutting into the supposed ~20% die size budget increase.

    It's hard to tell but the cores in Cypress are probably in the region of 30-40% of the die. But that's pretty woolly - RV770's cores are ~41% of the die, while the SIMDs are ~29%. Anyway, increasing the count of TMUs, L1s and other stuff that scales with core count would cost some extra area.

    In RV770 registers are 29% versus 71% for the ALUs when looking at just the computation part of the ALUs (i.e. excluding redundancy and LDS).

    If this is really HD6770, then it's a cut-down part which, due to the demise of 32nm, is a bit larger than it should have been... In my view that makes it likely to have no more ALUs. And if I'm right about the way things pan out with an XYZW configuration, performance won't be notably affected.

    We'll be arguing about its ALU count right up until the moment we know precisely what it is :razz:

    The real replacement for HD5870, HD6870, theoretically only appears once 28nm is ready. Something like 18 months after HD5870?

    Though this does raise the question: "what happened to the refresh of HD5870?". Dunno, still mulling that over.

    Jawed
     
  11. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    There was supposed to be an Evergreen refresh in March.
     
  12. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    Any idea what happened to that?
     
  13. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    A launch turned fizzle?
     
  14. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    Sure, but if the refresh is ready, why not release it?
     
  15. GZ007

    Regular

    Joined:
    Jan 22, 2010
    Messages:
    416
    Likes Received:
    0
    Does it make sense to waste money on a refresh on the same troublesome TSMC 40nm :?: Wasn't the refresh supposed to be on TSMC 32nm, and when that was cancelled the refresh was killed too and changed to SI :?:
     
  16. leoneazzurro

    Regular

    Joined:
    Nov 3, 2005
    Messages:
    518
    Likes Received:
    25
    Location:
    Rome, Italy
    It depends. The rumors say that the shaders have indeed changed, and if the SP count is lower than Cypress but FLOPS are much higher, then it would be plausible to suppose that the ALU clock is much higher. But running the whole chip at high speed seems unlikely, hence the idea that there could be a different clock domain for shaders. Or maybe they are EXTREME+ shaders, who knows :grin:
     
  17. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    My problem is that at the size of their register sets they are very similar to caches already ... the access method differs somewhat, the access of caches takes more pipeline stages and power for instance, but at the size of GPU's register sets they start behaving more like caches than "normal" registers in access time (exemplified by the PV/PS registers). It's not like the "normal" register sets like say Larrabee has. I'm not convinced however that what Larrabee is doing makes sense, I think the majority of L1 cache accesses will be pushing and popping while running shaders.

    If the GPU approach makes sense, but you start adding caches isn't there another compromise ... might it be possible to just combine the registers with the cache in a unified pool? (Along with local storage.)

    Let's say that instead of normal registers we give each SC a window on the cache? The method of access would simply use direct indexing, so it would bypass the normal tag comparisons etc. which make the accesses expensive time and power wise. NVIDIA already did this for local storage, why not for registers as well?
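
    A toy software model of the idea, purely as a sketch: one storage array where the register window is read by base + offset with no tag check, while the rest behaves as a direct-mapped cache with a tag compare. The class, its sizes and its method names are all hypothetical, and it says nothing about how real hardware would partition or time this.

```python
from typing import Optional


class UnifiedPool:
    """Toy model: one array shared by direct-indexed registers and a tagged cache."""

    def __init__(self, total_lines: int, register_lines: int):
        self.data = [0] * total_lines
        self.register_lines = register_lines              # lines reserved for register windows
        self.cache_lines = total_lines - register_lines   # remainder acts as a cache
        self.tags = [None] * self.cache_lines             # tags exist only for the cache part

    def reg_write(self, window_base: int, reg: int, value: int) -> None:
        assert window_base + reg < self.register_lines
        self.data[window_base + reg] = value              # direct index, no tag compare

    def reg_read(self, window_base: int, reg: int) -> int:
        assert window_base + reg < self.register_lines
        return self.data[window_base + reg]               # direct index, no tag compare

    def cache_fill(self, addr: int, value: int) -> None:
        index, tag = addr % self.cache_lines, addr // self.cache_lines
        self.tags[index] = tag
        self.data[self.register_lines + index] = value

    def cache_read(self, addr: int) -> Optional[int]:
        index, tag = addr % self.cache_lines, addr // self.cache_lines
        if self.tags[index] == tag:                       # tag compare on every access
            return self.data[self.register_lines + index]
        return None                                       # miss


pool = UnifiedPool(total_lines=64, register_lines=32)
pool.reg_write(window_base=8, reg=3, value=42)
pool.cache_fill(addr=100, value=7)
print(pool.reg_read(8, 3), pool.cache_read(100), pool.cache_read(36))
```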
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Isn't this more an artifact of having a VLIW architecture? Standard pipelines also have pipeline registers and bypass, a VLIW would just expose what a more complex chip does automatically.
     
  19. jaredpace

    Newcomer

    Joined:
    Sep 28, 2009
    Messages:
    157
    Likes Received:
    0
    My guess is a "SP module" is a block of 10 SIMD. 4x16x10 = 640 SP in a Module. 6700 = 1280 SP and 6800 = 2560 SP. Edit: If a block of 12, then 1536 SP & 3072 SP.
    :eek:
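
    The same guess spelled out as a quick sketch, assuming 4 lanes per VLIW unit and 16 units per SIMD (the 4x16 in the sum above), with 2 modules for the 6700 and 4 for the 6800 as in the tweakers.net list. The helper name is illustrative.

```python
def sps_per_module(simds_per_module: int, units_per_simd: int = 16, lanes_per_unit: int = 4) -> int:
    """SP count of one "SP module" under the assumed 4-lane, 16-unit SIMD layout."""
    return lanes_per_unit * units_per_simd * simds_per_module


for simds in (10, 12):
    module = sps_per_module(simds)
    print(f"{simds} SIMDs/module: {module} SP, 6700 = {2 * module} SP, 6800 = {4 * module} SP")
```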
     
  20. LordEC911

    Regular

    Joined:
    Nov 25, 2007
    Messages:
    877
    Likes Received:
    208
    Location:
    'Zona
    I was thinking the same thing. ~420mm2 on 40nm w/ 2.8-3bil transistors. Shrink to 28nm in Q1 and effectively cut the die size almost in half.

    I would say 12SIMDs makes more sense with neliz's comment of 4x the flops of, my interpretation, GTX480.
    The real question is, will the 6800 be a single 28nm GPU, a 3072 SP part w/ a 512bit bus on 400-480mm2? Or is it just a dual-GPU part built from a 6700 shrunk to 28nm?
     
    #700 LordEC911, Apr 23, 2010
    Last edited by a moderator: Apr 23, 2010
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.