AMD: R9xx Speculation

Discussion in 'Architecture and Products' started by Lukfi, Oct 5, 2009.

  1. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,439
    Likes Received:
    280
    If you want to call the first shipping product the full node then there's no arguing with that beyond arguing the definition of a node.

32nm was most definitely not an optical shrink. The only full node and optical shrink combo TSMC has shipped in recent years was 65/55 nm. For 55nm, customers used 65nm synthesis libraries and the PD tools shrank the design.
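As a rough illustration of what an optical shrink buys, the nominal node names give the scaling directly (treating node names as linear dimensions is a simplification, but it's the usual back-of-envelope):

```python
# Illustrative optical-shrink arithmetic for TSMC's 65nm -> 55nm half node.
# Node names are nominal, not exact drawn dimensions.
full_node = 65.0   # nm
half_node = 55.0   # nm

linear_scale = half_node / full_node     # ~0.846 per dimension
area_scale = linear_scale ** 2           # ~0.716 of the original area

print(f"linear scale: {linear_scale:.3f}")
print(f"area scale:   {area_scale:.3f}")  # roughly a 28% area saving
```

Same synthesis libraries, same layout, just uniformly scaled down, which is why 65/55 was such a cheap transition compared with a real new node.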
     
  2. Tchock

    Regular

    Joined:
    Mar 4, 2008
    Messages:
    849
    Likes Received:
    2
    Location:
    PVG
Err... 400mm^2 on an X700 part is just madness. That's the second biggest chip ATI has ever made.

It's a 68/5990 I reckon. And the 28nm shrink/Globalfoundries part is a new family.
     
  3. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
My 4x should be 2x :( Too busy, and posting too hurriedly.
     
  4. psolord

    Regular

    Joined:
    Jun 22, 2008
    Messages:
    444
    Likes Received:
    55

    Are you referring to the 6800 flops = 4X GTX 480 flops?

    If so, then it should be 6800 flops = 2X GTX 480 flops, which is exactly the same as the 5870.

    Maybe there will be an increase in efficiency though.

    Just try to find out how many transistors (I really laugh when we call them "trannies") the 6800 will have. This is the most important spec.
     
  5. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Larrabee's approach may or may not be scalable into the future, but it certainly showed that it is possible to unify the 3 pools of memory.
     
  6. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    a 6700 would be faster than GTX480 and 5870. So a 6700 would have around 3TFlop @40nm and a 6800 would have 6TFlop at 28nm but not show up until somewhere in 2011.
    I think "increased efficiency" is the keyword of the coming AMD designs.
     
  7. psolord

    Regular

    Joined:
    Jun 22, 2008
    Messages:
    444
    Likes Received:
    55
    So the "6770" should be something like a 5870 but with increased efficiency that would make it faster!?

    The 5770 was also something like the 4870 but it turned out a bit slower really, quite possibly due to the memory bus.

This new analogy may indicate that ATI wants to, and can, tip the scales by quite a bit.

    What worries me are the prices that will come with these products. I have the feeling that ATI is slowly going the Nvidia way! :S
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    They're sort of sized like we see L2s in x86 CPUs - but in ATI they're incredibly fragmented.

    Each ALU (XYZWT) has 4 private 256 entry 128-bit register files. This means there are actually 1280 lickle register files in Cypress - all with independent data paths and cycle mappings (4-way stagger is an extra complication - I think causing an effective 8-cycle latency, which is where PV/PS come in). I presume the addressing is ganged, within a SIMD, though.
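A back-of-envelope tally of that fragmentation, assuming the commonly cited Cypress organisation of 20 SIMDs with 16 VLIW-5 (XYZWT) units each (the decomposition is my assumption; the 1280 total and the per-file sizes are from the post):

```python
# Tallying Cypress's fragmented register files.
# Assumed organisation: 20 SIMDs x 16 VLIW-5 (XYZWT) ALUs per SIMD.
simds = 20
alus_per_simd = 16       # VLIW-5 units per SIMD
files_per_alu = 4        # 4 private register files per ALU
entries_per_file = 256   # 256 entries each
entry_bytes = 16         # 128-bit (vec4 fp32) entries

total_files = simds * alus_per_simd * files_per_alu
total_bytes = total_files * entries_per_file * entry_bytes

print(total_files)                 # 1280 separate little register files
print(total_bytes // 1024, "KiB")  # 5120 KiB = 5 MiB of registers in total
```

That 5 MiB is spread over 1280 independent data paths, which is exactly the fragmentation being contrasted with a monolithic x86 L2.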

    NVidia's design is rather different. It's a single address space (i.e. an operand collector accesses every address) with 16 or 32 banks.

    So, NVidia's design is more like a cache. Or if you prefer it's more of a gather/scatter architecture on every cycle.

    I have minor qualms over Larrabee's register file/L1 marriage, too. I know practically nothing about SSE but Larrabee's vector unit seems to me to be a 4-threaded super-wide SSE on steroids - with a much nicer instruction set. So the 4-way register file is really just a scratchpad - which is really what registers were always about, originally, anyway.

    I used to think along these lines. It's quite a tempting view.

    NVidia's architecture is closer to this. If you're going to do that then you need to be able to stand some pretty high latency. NVidia's design (at least G80-GT200 - still unclear on what Fermi's doing) with operand scoreboarding is closer. But NVidia still sees fit to keep them separate. At least for the time being.

    I can't help thinking that the Larrabee approach is where ATI and NVidia will end up. A key feature in Larrabee is a single operand optionally coming directly from L1 (subject to waterfalling, of course).

Fundamentally it's about bandwidth and hiding the latencies of register gather/scatter. Notice that shared memory has lower throughput.

Larrabee's trade is "narrow" hardware threads (single cycle per instruction, arguably each vector unit is actually natively working on a single work item - though it can also be viewed as 16 work items), software managed fibres and no pretence of ever holding a work item's entire context in registers (though it's technically possible, of course).

    So in Larrabee, L1 and registers both fundamentally act as operand collectors/store-queues, implementing a hierarchy of gather/scatter.

    Jawed
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    Need to stop equating FLOPs with performance. I'm not denying that the next ALU configuration is more efficient, but that doesn't mean that the next chip actually has more FLOPs than HD5870. e.g. HD6770 could be 2.2 TFLOPs theoretical peak and be faster than HD5870 on most games.
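The point can be made with a toy calculation. Delivered performance is roughly peak FLOPs times achieved utilisation; the utilisation figures below are pure illustrations, not measurements of any real chip:

```python
# Peak FLOPs don't determine delivered performance; utilisation does.
# The utilisation percentages here are invented for illustration only.
def delivered_tflops(peak_tflops, utilisation):
    return peak_tflops * utilisation

older = delivered_tflops(2.72, 0.60)   # e.g. a 2.72 TFLOP part at 60% efficiency
newer = delivered_tflops(2.20, 0.80)   # a 2.2 TFLOP part at 80% efficiency

print(older, newer)   # 1.632 vs 1.76 -> the lower-peak part wins
```

So a hypothetical 2.2 TFLOP chip with a more efficient ALU configuration can beat a 2.72 TFLOP chip on real workloads.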

    Jawed
     
  10. w0mbat

    Newcomer

    Joined:
    Nov 18, 2006
    Messages:
    234
    Likes Received:
    5
Next week I'm going to meet someone in Kuala Lumpur who knows quite a bit about ATI@TSMC. So if Charlie is right and there has been a new 40nm ATI chip tapeout there, I can confirm it (or not). My "contacts" here in Singapore are not in the know...
     
  11. racca

    Newcomer

    Joined:
    Apr 3, 2010
    Messages:
    51
    Likes Received:
    0
I don't think that's the case. I agree (mostly) with you about the SIMDs, but if you'd agree that one SIMD (along with TMUs, etc.) per se won't have much impact on die size, then neither do the ROPs/MCs and quite possibly the UTDP -- all of which should account for at least 80% of the die size.

If the SIMD count stays at 20 -- say the 20 SIMDs/TMUs + MC + ROP + UTDP etc. occupy almost the same die area as Cypress -- then where does the extra 20% of size go? It's highly unlikely the front end could be more than double that of Cypress with reduced raw shader power.
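The area bookkeeping behind that objection can be made explicit. All figures below are illustrative assumptions (Cypress at ~334 mm^2 is an approximation, and the 80% fixed fraction is the post's own estimate), not floorplan data:

```python
# Rough die-area bookkeeping: if 80% of the die stays the same size
# while the total grows 20%, the remaining blocks must double.
cypress_area = 334.0          # mm^2, approximate Cypress die size (assumption)
fixed_fraction = 0.80         # SIMDs + TMUs + MCs + ROPs + UTDP etc.
new_area = cypress_area * 1.20  # a die ~20% larger

fixed = cypress_area * fixed_fraction   # ~267 mm^2 unchanged
front_end_old = cypress_area - fixed    # ~67 mm^2
front_end_new = new_area - fixed        # ~134 mm^2

print(front_end_new / front_end_old)    # ~2.0 -> the front end would need to double
```

Note the ratio is (1.20 - 0.80) / (1 - 0.80) = 2 regardless of the absolute die size, which is why the argument doesn't hinge on the exact Cypress figure.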
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
In my view, in order for triangle rate to increase, rasterisation rate for small triangles needs to increase.

    My theory is that hardware threads can only accept a single triangle. If that's true then rasterisation rate for small triangles can only increase if the way hardware threads are constructed is also revised. If a hardware thread can accept up to 16 triangles at the rate of 4 triangles per clock (since a thread takes 4 clocks anyway) then rasterisation and barycentric stuff all needs to be at least 4x faster.

    All told that's potentially quite a lot of interdependent new stuff, e.g. hierarchical-Z needs to scale within screen-space tiles, not merely across tiles.

    Then there's also the question of a revised memory architecture. e.g. bigger L2s. L2s that are generalised like in Fermi. etc.

    Also, I reckon it's about time for 8x Z. 4x Z is so 2006.

    --

    I've not seen this patent document before, it's called "UNIFIED TESSELLATION CIRCUIT AND METHOD THEREFOR"

    http://v3.espacenet.com/publication...=A1&FT=D&date=20100304&DB=EPODOC&locale=en_gb

    Might have some clues in it...

    Jawed
     
  13. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,140
    Likes Received:
    577
Wait ... it's not possible to do indexed accessing of registers, right? Threadblocks which aren't a multiple of 64 threads waste register space, right?
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    Assuming you're querying ATI's architecture:

    The indexing is per lane ID.

    SRs ("global registers" - registers shared across all hardware threads within a SIMD) are shared across hardware threads, but private to lane ID - e.g. hardware thread 3 lane 12 shares five SRs with lane 12 in 7 other hardware threads (if there are 8 hardware threads in total).

    Lane 28 (same XYZWT ALU as work item 12, but different register file) has 5 SRs that are distinct from lane 12's.

    Sure.

    In fact if you are only running one shader on the GPU (compute shader is the normal case) and you have say 34 GPRs allocated per work item, resulting in 7 hardware threads per SIMD, the hardware can only allocate 7 threads * 64 lanes * 34 registers = 15232 GPRs. The total capacity is 16384 GPRs, so wasting 1152 GPRs (18KB).

    Some of that waste will be clawed back in the use of clause temporary registers. e.g. if 8 clause temporaries are defined, then they will consume 2 hardware threads * 64 lanes * 8 registers = 1024 GPRs.

    So in this example the wastage is reduced to 128 GPRs, 2KB.
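The allocation arithmetic above, as a quick sketch (all figures are taken from the worked example in the post):

```python
# GPR-wastage arithmetic for an ATI SIMD with a 16K-GPR register file.
total_gprs = 16384
lanes = 64           # lanes per hardware thread
gprs_per_item = 34   # GPRs allocated per work item

threads = total_gprs // (lanes * gprs_per_item)   # 7 hardware threads fit
allocated = threads * lanes * gprs_per_item       # 15232 GPRs
wasted = total_gprs - allocated                   # 1152 GPRs (18 KiB at 16 B each)

# Clause temporaries claw some of this back: the post's example defines 8,
# consuming 2 hardware threads' worth of register space.
clause_temps = 8
temp_gprs = 2 * lanes * clause_temps              # 1024 GPRs reclaimed

print(threads, allocated, wasted)
print(wasted - temp_gprs, "GPRs still wasted")    # 128 GPRs = 2 KiB
```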

    Jawed
     
  15. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,140
    Likes Received:
    577
That's a rather strange assumption, since I quoted a statement about NVIDIA. I meant NVIDIA ...

For NVIDIA ... it's not possible to do indexed accessing of registers, right? Threadblocks which aren't a multiple of 64 threads waste register space, right?
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,944
    Location:
    Well within 3d
Nvidia has moved further away from unifying register and memory pools. The description of Fermi's ISA indicates a move to a more purely load/store architecture, whereas its immediate predecessor had memory operands.

    Why expose every operand access to possible TLB fills and memory faults, or why have the additional complexity in hardware to do this, and then avoid using it most of the time?

If it weren't for the x86 core, x86 hardware thread context, comparatively minuscule reg file and its reg/memory operand legacy, I wonder if its designers would have skipped over that "feature".

    A memory access is not as cheap as a register file access, for various reasons. It is a much more complex case to get right, and getting it wrong has much bigger consequences for the system in general. The load/store and execution pipelines of even the P55 core are at least somewhat more complex because of this.

    I wouldn't mind accessing that pool of SRAM, perhaps in some kind of linear line access absent the TLB and fault handling part of the pipeline, but those are usually an integral part of the pipeline and not totally removable.
    I would be curious if Nvidia's configurable L1 does somehow convert cache accesses to the shared memory region into something addressed to the physical lines of the cache, though it could just be some kind of creative page mapping, where the cache logic does not bother to keep it coherent.

    I would potentially disagree, if I knew more of the implementation. It's possible that Larrabee would already have store queues as explicit parts of its memory pipeline.
The L1 and registers are just what they are, and whatever hierarchy they implement is what any other fully fleshed-out memory pipeline can provide with proper software usage.
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    But you mentioned 64, not 32 (which is the "warp" size in NVidia, although 16 is, at least for some GPUs, the hardware thread size), so I thought you were querying ATI's architecture in comparison with NVidia.

    Patent documents indicate that GPRs can be allocated horizontally (a warp's single or multiple registers can occupy all banks) or vertically (each register is allocated along a bank) or mixed.

    I don't know if there's indexing. I have a vague memory of it being a part of the architecture, but that's all.

Thinking about it more, I would tend to suspect it's not: register allocation in NVidia has always been very tight. Secondly, with Fermi's register spill through the memory hierarchy, they might simply have chosen to use addressing for indexing. Dunno really.

    Any time the domain of execution defines a count of work-items that isn't an exact multiple of the hardware thread size, you'll get register file wastage. Just the same as killed pixels in pixel shading will waste registers.
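The rounding-up that causes that wastage is easy to show. Nothing here is vendor-specific; it's just ceiling arithmetic on the hardware thread width (64 used as the example width, per the discussion above):

```python
import math

# Any domain size that isn't a multiple of the hardware thread width
# leaves some lanes -- and their allocated registers -- idle in the
# final hardware thread.
def wasted_lanes(work_items, thread_width=64):
    threads = math.ceil(work_items / thread_width)
    return threads * thread_width - work_items

print(wasted_lanes(100))   # 28 idle lanes in the final 64-wide thread
print(wasted_lanes(128))   # 0 -- exact multiple, no waste
print(wasted_lanes(65))    # 63 -- worst case, one lane used in the last thread
```

Multiply the idle-lane count by the per-work-item GPR allocation to get the wasted register capacity.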

    Jawed
     
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
Has to be. All 16 lanes in a thread have to access the same register in their respective banks, with additional banking over the 4 xyzw slots. Independent register fetch from each lane would be quite a bit of overhead. In that sense, it is not very different from NV's scatter/gather every cycle.
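A toy model of that ganged addressing (a sketch, not any real ISA: one shared register index drives 16 private per-lane banks, so there is a single address path but 16 data paths):

```python
# Toy model of a banked register file with ganged addressing:
# every lane owns a private bank, and a single register index
# is broadcast to all banks on each access.
LANES = 16
REGS_PER_LANE = 8

# banks[lane][reg] -- filled with distinguishable dummy values
banks = [[lane * 100 + reg for reg in range(REGS_PER_LANE)]
         for lane in range(LANES)]

def read_register(reg_index):
    # ganged addressing: one index, 16 parallel reads
    return [banks[lane][reg_index] for lane in range(LANES)]

print(read_register(3))   # lane i returns its own private value i*100 + 3
```

Per-lane independent indices would turn every register read into a full crossbar gather, which is the overhead being avoided.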


They have to. The present segregation between private, local and global caches is just too wasteful.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    :oops: I don't understand the distinction you're making.

    This is where you get into a nebulous argument over whether the memory in the operand collector, holding operands for multiple cycles until a warp's worth of operands are all populated, is really the register file :???: In this model the "registers", the constant cache, the shared memory and global memory are all just addressable memories.

    The thing that makes these GPUs different from CPUs is that gather/scatter is essentially a first-class instruction. Or, at least in the future, it is. There's no choice when the whole thing is a SIMD. Historically GPU ALUs have avoided the gather/scatter problem because pixel shading doesn't expose the ALUs to it - the pipeline has been designed to farm out texture-mapping gather and pixel-blending scatter operations.

    Many of these fancy new algorithms (or re-discovered supercomputing principles) push repeatedly on the gather/scatter button at the ALU instruction level.
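For anyone unfamiliar with the terms, gather and scatter as first-class operations look like this (a plain-Python sketch of the semantics, not any GPU's ISA: each lane reads or writes its own independently computed address):

```python
# Gather/scatter semantics: per-lane indexed memory access.
def gather(memory, indices):
    # out[lane] = memory[indices[lane]]
    return [memory[i] for i in indices]

def scatter(memory, indices, values):
    # memory[indices[lane]] = values[lane]
    for i, v in zip(indices, values):
        memory[i] = v

mem = [10, 20, 30, 40]
print(gather(mem, [3, 0, 2]))   # [40, 10, 30]
scatter(mem, [1, 3], [99, 77])
print(mem)                      # [10, 99, 30, 77]
```

On a CPU these are loops of scalar loads/stores; the point above is that a SIMD machine has no such fallback, so the memory pipeline has to service all the lanes' addresses as one operation.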

    G80->G92->GT200 saw progressively increasing register capacity and/or increasing work-items per SIMD. Fermi actually reverses things a little, I think. In other words it seems to me NVidia hasn't really settled on anything.

    Obviously this discussion would be easier with Larrabee to play with. But I trust the Intel engineers to the extent that the originally-presented chip wasn't fundamentally broken in this respect. Though I still strongly suspect texturing is a black hole they've been struggling with.

    One could argue that texturing is still so massively important that it steers GPUs towards large RFs and the ALU-gather-scatter centric argument is merely a distraction, and Intel's stumbling block :razz:

    That's all very well. But GPU performance falls off a cliff if the context doesn't fit into the RF (don't know how successfully GF100 tackles this). So, what we're looking for is an architecture that degrades gracefully in the face of an increasing context.

    The question is: can register files either keep growing or at the least retain their current size, in the face of ever more-complex workloads?

    What happens when GPUs have to support true multiple, heavyweight, contexts all providing real time responsiveness? The stuff we take for granted on CPUs?

    NVidia has a gather unit (the operand collector) that essentially hides a load of mess there (and a store queue). I'm presuming the cache is just coherent+bankset aligned accesses to banked shared memory.

    Sorry, I wasn't trying to say that L1/registers replace a conventional memory interface for the ALUs - I'm simply saying that the way Larrabee is designed, gather/scatter is built upon the workings of L1/registers. This comes back to the SIMD architecture and first-class gather/scatter. Gotta wait to see it in action...

    Jawed
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    I agree it's highly likely. Just haven't seen absolute evidence though.

    Actually ATI is quite different. There's no latency-hiding for gather misses, all the resulting latency is always experienced by the ALUs. NVidia's operand collection hides that latency, generally.

    This is really only relevant to constant buffer fetches and LDS accesses. It's possible to make an indexed register operand cause a stall (indexed read after indexed write, I think) or an SR cause a stall. But that's not a gather/scatter problem per se.

    ATI RF is designed, generally, never to induce variable latency. It's the entire basis of the clause-by-clause execution model.

    Jawed
     