AMD: R9xx Speculation

Discussion in 'Architecture and Products' started by Lukfi, Oct 5, 2009.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Any particular reason why? They already have texture filtering/decompression in hw. Are you really referring to latency hiding when you say texturing? Or the connection of TMU's to cores?
     
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    For indexed register read, I can see there being gather misses. But why would there be a gather miss for NV, whose reg file has to be statically addressed, AFAIK. Worst case, if there are nasty register access patterns, the compiler could insert NOOPs.
    Fine, but why will there be a variable latency in nv rf, it's all statically addressed.
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Moving to a fully load/store ISA means that instructions that perform computation are explicitly separate from memory-access instructions. As far as the software is concerned, the register memory is more distinct from other memory pools with Fermi than it was prior, and the more robust memory model of modern GPUs would add to the expense of integrating it into every operand access.

    The register file is the big collection of SRAM, sitting on one side of the ALUs, that holds operands. It is quite distinct physically and distinct in how it is treated.

    I don't see why the operand collector needs to care about memory at all. The ISA is load/store, so all it needs to track is the readiness of the destination register of a given load. No instruction other than the memory access instructions would know of an address, which is much simpler to handle.
    The operand collector would be wasting its time tracking the memory addresses.
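The readiness tracking described above can be sketched as a per-register ready bit. A toy model, with all names invented here (this is not any real GPU's scheduler):

```python
# Minimal scoreboard sketch: with a load/store ISA, issue logic only needs
# a "ready" bit per destination register -- no memory addresses ever reach
# the operand-collection stage.
class Scoreboard:
    def __init__(self):
        self.pending = set()       # destination regs of in-flight loads

    def issue_load(self, dst):
        self.pending.add(dst)      # dst is not-ready until the load returns

    def load_complete(self, dst):
        self.pending.discard(dst)  # data arrived; dst becomes readable

    def can_issue(self, src_regs):
        # An ALU op may issue once none of its sources are still in flight.
        return not (set(src_regs) & self.pending)

sb = Scoreboard()
sb.issue_load("r4")
print(sb.can_issue(["r4", "r5"]))  # False -- r4 still in flight
sb.load_complete("r4")
print(sb.can_issue(["r4", "r5"]))  # True
```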


    No new high-performance ISA puts memory operands at the ALU instruction level.
    x86 internally splits ALU work off from the memory access because it is such a problem. Register accesses do not generate page faults, access violations, or require paging in memory.

    What needs to be settled? An example architecture with complex ALU instructions that could source multiple operands directly from memory was VAX.

    The x86 core is what it is. Plenty of other architectures don't try to combine memory loads with ALU work, and the P55C core internally cracks the instructions apart anyway.

    I think it's mostly a distraction. The register file does very well on its own. The failings GPUs have with spills are more a product of their design. Other designs that degrade more gracefully just spill with loads and stores. It's cheaper and faster than trying to drive a full cache or make the internal scheduling hardware capable of handling memory faults.

    Memory operands save little here.
    The difference between an x86 instruction with a memory operand and a load/store equivalent is a Load and then the ALU instruction (which the x86 does implicitly anyway).
    It saves a bit on the instruction decode bandwidth and Icache pressure, but that is far from the limiting factor for GPU loads, and is not considered a limiting factor without an aggressively OoO speculative processor.
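The internal cracking described above can be illustrated with a toy micro-op splitter; the instruction encoding here is invented purely for the sketch:

```python
# Hypothetical illustration: cracking an x86-style ALU instruction with a
# memory operand into load/store-style micro-ops, as the post describes.
# The tuple encoding and register names are made up for this sketch.

def crack(insn):
    """Split one architectural instruction into RISC-like micro-ops."""
    op, dst, src = insn
    if isinstance(src, tuple) and src[0] == "mem":
        # Memory operand: emit an explicit load into a temp register first,
        # so only the load micro-op ever touches addresses, TLBs, or faults.
        return [("load", "tmp0", src[1]), (op, dst, "tmp0")]
    return [insn]  # register operand: passes through unchanged

# 'add r1, [0x40]' becomes a load followed by a register-register add.
print(crack(("add", "r1", ("mem", 0x40))))
# [('load', 'tmp0', 64), ('add', 'r1', 'tmp0')]
```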

    That is subject to speed and size constraints. It's not better with caches. L1s have stagnated and even begun to shrink.

    Should they?
    If you want latency-optimized performance, you don't design a throughput processor.
     
  4. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Jawed: The current perf cliff with GPU regs spill/fill is probably a byproduct of the lack of r/w caches more than anything else. Fermi already has L1s (it would be interesting to test how Fermi behaves with regs spill/fill compared to a GT200) and it's likely that AMD GPUs will have r/w caches in a not so distant future (not just for atomics/ROPs..)
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    It's my suspicion that this is the hardest thing to do right at enthusiast discrete GPU performance levels for a brand new architecture (there's now something like a 20x span in performance from IGP to enthusiast level). The cornerstone is basically cache locality (i.e. enough that's fast enough for AF), along with global latency hiding and enough "optimisations" to keep up with the IHVs who've optimised the snot out of texturing.

    Throughput (32 filtering units at 1.5GHz+, say) isn't the end of the story.

    Saving lots of render target bandwidth, something that Larrabee does really sweetly, doesn't help much with making texturing work fast, though screen-space tiled rendering is relevant in texture command ordering - something the other GPUs are doing too, though.

    Jawed
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Search: "operand collector". NVidia (at least historically) doesn't work that way at all.

    It's not just the RF, it's all operands (constants, shared memory, global memory) - at least, historically.

    Fermi's different (I understand what 3dilettante's referring to now - gather as an explicit, presumably non-blocking, instruction - not much different from a TEX instruction in this sense) and it may be that a lot of the complexity in NVidia's old-style operand collector, including handling variable latencies, has effectively disappeared. The mechanics of this I don't understand.

    It may be that Fermi is fixed-latency for RF now and that no operand can come from anywhere other than RF. I don't know much about it.

    Maybe that's why you're querying what I said, because you know it's actually fixed-latency?

    Jawed
     
  7. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Ah, I hadn't thought of local/const/global memory while thinking about operand collector. I just had reg file in mind. :oops:

    Well, I think this reduces complexity in a manner somewhat similar to the CISC-to-load/store architecture transition.
    It should be fixed-latency now, IIUC.
     
  8. Psoomah

    Banned

    Joined:
    Mar 16, 2010
    Messages:
    5
    Likes Received:
    0
    It's about the money Lebowski.

    AMD currently has a 40nm 5xxx cash cow on their hands at all market segments/price points. Economically, why change that up anytime soon except to ...

    a) move to a smaller, more cost effective node.
    b) respond to competitive pressures from Nvidia.

    If S.I. is on the 40nm node, that would nullify a).

    That leaves what Nvidia will be bringing to the table ... after all, if AMD had reason to believe Nvidia wasn't going to bring much heat for the rest of 2010, they would have little reason to do much at all. Maybe do a lower-cost 'hybrid' 5790 part to replace the 5830 (or shove it up to a higher price point), and a 5890~5950 (5790x2?) to hit the $500 price point and take back the single-GPU crown.

    It would be very time/resource economical to tape out only one 'hybrid' chip if that is all that was really needed. It would also give them working knowledge and experience of most of N.I.'s architecture at 40nm, paving the road and freeing up resources to concentrate on implementing the full N.I. on GF's 28nm.

    This would also keep product-line confusion to a minimum. Bringing out hybrid S.I. cards that compete with existing 5xxx cards ... why? What would they be called? Better to wait and roll out an entire 6xxx product line over a few months, like they did with the 5xxx line.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    So the question is, why not make GPUs that way. Well, Intel is - why wouldn't AMD and NVidia do the same?

    I don't buy the "well, everything looks like x86 to Intel" argument. Besides, in this detail, it seems the only way to go, long-term.

    Well the whole deal with cache hierarchies is their spectacularly non-linear (both good and bad) effects on performance.

    Well, one could argue that Fusion style GPUs will make any such era short lived, but: if I'm video-encoding on the GPU (if it ever becomes worth doing :roll:) I don't want my 3D-accelerated UI to stutter.

    Jawed
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Well, the caches aren't the entirety of all problems. There is too much wastage as well, even on Fermi. If I need 18KB of local mem per work-group, then I have to waste 30KB of on-chip memory. It can't be used for caches or for my own needs. Sure, you could tune the parameters a bit, but it won't change the basic problem of three pools of SRAM sitting on-chip, yet segregated. And caches can't help with that problem.
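As a back-of-envelope check on the numbers above, assuming a Fermi-style 64KB array configured as 48KB shared memory plus 16KB L1 (an assumption for the sketch, not a statement about any shipping part):

```python
# Rough residency arithmetic behind the post: shared memory claimed by
# resident work-groups is a hard partition, so the leftover slice idles.
SHARED_KB = 48  # assumed shared-memory configuration per multiprocessor

def wasted_kb(local_per_group_kb, resident_groups):
    """Shared memory left idle once the resident groups take their slice."""
    used = local_per_group_kb * resident_groups
    assert used <= SHARED_KB, "configuration does not fit"
    return SHARED_KB - used

# One resident 18KB work-group strands 30KB of SRAM that can serve neither
# the kernel nor the cache; even two resident groups still strand 12KB.
print(wasted_kb(18, 1), wasted_kb(18, 2))  # 30 12
```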
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    R600 has a small R/W cache that was supposed to support register spill. The register architecture of R600 is supposed to be fully virtualised.

    For whatever reason it hasn't turned into anything worth having, though. Or, maybe it does work as intended, but performance is still shit :???:

    Yes, I'm definitely interested to see how Fermi works out for real.

    Going back to the fixed-latency RF of Fermi, presuming that that's how it works: Why not just hold this stuff as locked lines in shared memory/L1, and then burst that stuff into the ALUs? Both memories are in the 10s+ KB (RF is 128KB, shared/L1 is 64KB), both are banked 10s of ways.

    Then just have in-pipe registers to deal with cycle-to-cycle RAW.

    Jawed
     
  12. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Verifying the NI uncore in silicon, and getting a single GPU that beats the 480, sounds like a mighty big motivation to me.
     
  13. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    IMHO, Reg Files - of today's designs - are too wasteful going forward.
     
  14. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    The NVIDIA occupancy spreadsheet says registers are allocated for 2 warps at a time.
    How would you write anything in CUDA which could make use of that kind of flexibility? If registers are allocated en bloc and accessed en bloc, it seems unlikely to me that the register file is capable of scatter/gather.
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    My answer would depend on which part you are addressing.
    The weakness of spill/fill for GPUs is the traditionally unclosed write/read path for the shaders, where writes had to go off-chip and then come back again.
    Fermi may change this.
    Until recently, it was not a deal-breaker, though as the loads become more diverse it will become so.

    Improved spill performance may come almost free with the closing of this loop on chip for actual producer/consumer relationships between shaders. A spill would basically make a shader its own consumer, and it would be a less hairy problem to track as this should rarely if ever hit the same kind of problems a true operand read could cause.

    If addressing the feature that a Larrabee vector ALU instruction can include a read from the L1 "for free", then:

    This would be valid.
    I suspect that little feature would not have been included if Larrabee didn't already have x86 as its baseline. The hardware itself goes out of its way to crack the instruction.

    Any significant capacity gain is probably going to happen at the lower and slower cache levels.
    If you limit the view to pools of on-chip memory accessible within a single-digit number of cycles, the GPUs and Larrabee become rather close. It becomes a question of what can physically be fast enough for that time frame.

    Registers still need to be around in some quantity, and it can be wasteful to lean on memory too much given that each access is significantly more expensive.
    It's why I'd like to see a way to access the L1 as a big backup to the reg file, but also magically skip the TLB and exception hardware (it would become like a reg file or scratchpad). Unfortunately, in real life the memory pipe stages are defined in part by that cache hardware, so you can't freely dispense with it without some loss.


    Perhaps some modest improvements in this regard are possible. Fermi makes claims here, and it seems PhysX may not be the complete frame-rate killer it was before.
    Even so, its context switch times are still pretty slow compared to a CPU.
    Fusion could handle most consumer loads, and if it becomes widespread, an app could put the UI on the Fusion GPU and the big work on the card.
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Really? Ha.

    This is just the driver's problem, in a bid to optimise for different scenarios.

    http://v3.espacenet.com/publication...b&FT=D&date=20091215&CC=US&NR=7634621B1&KC=B1

    No idea if Fermi allows this flexibility.

    Jawed
     
  17. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I have been told that - at least on gt200 - the registers per work group are allocated in units of 512. Could be a compiler hack though.
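If the 512-register granularity is real (it is hearsay in the post, so treat it as an assumption), the rounding would look something like:

```python
import math

# Sketch of allocation-granularity rounding as described for GT200.
# The 512-register unit is an assumption taken from the post, not a
# confirmed hardware parameter.
GRANULARITY = 512  # registers, allocated per work-group at a time

def regs_allocated(threads, regs_per_thread):
    """Registers actually charged to a work-group after rounding up."""
    raw = threads * regs_per_thread
    return math.ceil(raw / GRANULARITY) * GRANULARITY

# A 64-thread group at 10 regs/thread needs 640 registers but would be
# charged a full 1024 under this rounding.
print(regs_allocated(64, 10))  # 1024
```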
     
  18. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    I'm not so sure ... a fundamental problem with Larrabee's approach is that it can only handle near-certain high-latency events; the cost of pushing/popping is way too high to use for every type of memory access. Everything else has to be rare.

    In a throughput-optimized architecture, doesn't that strike you as strange? I would personally want an architecture which is able to deal well with any type of cache miss through vertical multithreading ... Larrabee ain't getting me there, not enough threads.
     
  19. racca

    Newcomer

    Joined:
    Apr 3, 2010
    Messages:
    51
    Likes Received:
    0
    Exactly, if this is truly 67xx, I don't think they can or will simply yank the NI core out and replace it with the Evergreen shader core. The shader core per se is relatively easy, isn't it? Otherwise they might as well produce something like a Cedar-like NI in 40nm to save some R&D.

    Quite unlikely: a Fermi-like cache doesn't really translate to real-world performance, Cypress isn't cache-bound, and GPGPU is not SI's forte anyway.

    Like I said above, improving GPGPU-related performance will be highly unlikely on SI GPUs, which are either a stop-over or the next 67x0; note that even the 57x0 removed DP capability.

    On the other hand, SI/NI might share some miracle-worker (MC/ROP) from Llano to drastically reduce, or at least mitigate, the bandwidth requirement. Otherwise even GDDR5+ won't do it on NI, and SI won't even have faster memory parts. At least a 20% increase in real-life performance, which should be the minimal expectation 6 months from now, can't just come out of nowhere.
     
  20. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    What is all this rubbish about variable latency RFs? The operand collector is really there to handle banking conflicts and improve bandwidth.

    Nobody makes variable-latency RFs, AFAIK, because it would cause an absolute shit storm for the compiler and the rest of the pipeline. Remember, the whole point of RISC was to try and get as many instructions as possible to have 1-cycle latency (or, failing that, fixed latency).
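The bank-conflict gathering the operand collector handles can be modelled in a few lines; the bank count and register-to-bank mapping below are invented for illustration, not any real GPU's layout:

```python
from collections import Counter

# Toy model of why an operand collector exists: a banked register file can
# read one operand per bank per cycle, so an instruction whose source
# registers map to the same bank needs extra cycles to gather them.
NUM_BANKS = 4  # illustrative figure only

def cycles_to_collect(src_regs):
    """Cycles to read all operands at one read per bank per cycle."""
    per_bank = Counter(r % NUM_BANKS for r in src_regs)
    return max(per_bank.values())

print(cycles_to_collect([1, 2, 3]))  # 1 -- all operands in different banks
print(cycles_to_collect([1, 5, 9]))  # 3 -- all three map to bank 1
```

The collector buffers partially gathered operands so other instructions can keep issuing during those extra cycles, which is how it improves effective bandwidth without the RF itself being variable-latency.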

    DK
     