AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Discussion in 'Architecture and Products' started by UniversalTruth, Dec 17, 2010.

  1. tmavr

    Newcomer

    Joined:
    Sep 2, 2010
    Messages:
    10
    Likes Received:
    0
    Can anyone confirm a unified virtual address space for SI?
    Or any of the features related to this slide?
     
  2. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    Are you sure about this? And is it fully independent, meaning any ratio is possible, or is there still some constraint?

    Full independence would sure be helpful for CU scaling.
     
  3. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    AMD pulled a Fermi, but better.
     
  4. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    If there is some exploitation of a relation between the number of SIMDs and the instruction latency(*) to keep the scheduling simple, it can't be scaled without taking some serious constraints into account. In any case, the significant rise in the effective shader power of a CU will raise the effective ALU:TEX ratio (edit: it will be the same as with GF100/110-type SMs and 25% higher than with GF104/114-style SMs).

    (*):
    With the round-robin approach of scheduling wavefronts belonging to the individual SIMDs, it's basically required that the instruction latency is an integer multiple of the number of SIMDs. The easiest would be a 1:1 ratio. ;) But 2:1 still looks good (4 vector ALUs, latency 8 cycles; the next step to a higher ALU:TEX ratio would be 8 SIMDs, anything in between would require more sophisticated scheduling).
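    The footnote's constraint can be sketched as a quick check (my own formulation, using the wave64/16-wide numbers from the thread; `schedule_is_simple` is a hypothetical helper, and it only checks issue alignment, not ALU utilization):

    ```python
    # A sketch of the round-robin issue constraint described above (numbers
    # from the thread, not official specs): a wave64 instruction occupies a
    # 16-wide SIMD for 4 cycles, and a scheduler that round-robins over N
    # SIMDs returns to a given SIMD every N cycles. Dependent back-to-back
    # issue only lines up if the ALU latency is a multiple of that period.
    WAVEFRONT = 64
    SIMD_WIDTH = 16
    ISSUE_PERIOD = WAVEFRONT // SIMD_WIDTH  # 4 cycles per vector instruction

    def schedule_is_simple(num_simds: int, alu_latency: int) -> bool:
        """True if dependent issue lines up with the round-robin revisit period."""
        revisit = max(num_simds, ISSUE_PERIOD)  # cycles between issue slots per SIMD
        return alu_latency % revisit == 0

    print(schedule_is_simple(4, 4))   # True: the 1:1 ratio, easiest case
    print(schedule_is_simple(4, 8))   # True: 2:1, two interleaved wavefronts
    print(schedule_is_simple(6, 8))   # False: 6 SIMDs break the simple scheme
    ```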
     
    #284 Gipsel, Jun 17, 2011
    Last edited by a moderator: Jun 17, 2011
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Hmmm, I don't follow. How does scheduling benefit from the relationship between SIMD count and instruction latency? It should just follow the normal mantra of having enough wavefronts available to hide instruction latency. There shouldn't be any hardcoded assumptions at play - that's why they got rid of clauses.

    The number of SIMDs is determined by the number of instruction schedulers/dispatchers, the width of the SIMD and the wavefront size.

    # SIMDs (x) = instruction dispatch rate * wavefront size / SIMD width

    Since the SIMDs are pipelined you should ALWAYS be able to issue a new wavefront to a given SIMD every x clocks regardless of instruction latency.
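    The relation above, written out as a sketch (the default values are the wave64/16-wide figures discussed in the thread, and `num_simds` is a name I made up):

    ```python
    def num_simds(dispatch_rate: int, wavefront_size: int = 64,
                  simd_width: int = 16) -> int:
        # dispatch_rate: vector instructions issued per clock by the scheduler(s);
        # each instruction keeps one SIMD busy for wavefront_size / simd_width clocks
        return dispatch_rate * wavefront_size // simd_width

    print(num_simds(1))  # 1 instruction/clock, wave64, 16-wide -> 4 SIMDs
    print(num_simds(2))  # doubling the dispatch width -> 8 SIMDs
    ```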
     
  6. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Well, that remains to be seen. It looks OK on paper so far, but we don't really know yet how well it'll work in practice...
    Though a 32 CU chip would be quite a monster indeed, basically twice the shader capacity of GF110. Now, the raw flop number wouldn't be all that much higher than Cypress/Cayman, but if efficiency (not per area but per theoretical throughput) improves to GF110 levels, that would be quite a beast. Granted, maybe nvidia will be able to fit twice a GF110 on 28nm too...
    Though wouldn't that also need quite a bit more memory bandwidth to really shine? I wonder what we're gonna see there.


    64 FMA : 4 bilinear samples looks like a 33% increase over 96 FMA : 8 bilinear samples to me...
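    The ratio arithmetic checks out; a quick verification using the FMA and bilinear-sample counts quoted in the post:

    ```python
    cu_ratio = 64 / 4      # GCN-style CU: 64 FMA : 4 bilinear samples -> 16.0
    gf104_ratio = 96 / 8   # GF104-style SM: 96 FMA : 8 bilinear samples -> 12.0
    increase = cu_ratio / gf104_ratio - 1
    print(f"{increase:.0%}")  # 33%
    ```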
     
  7. JoshMST

    Regular

    Joined:
    Sep 2, 2002
    Messages:
    467
    Likes Received:
    25
    In his presentation, Demers addressed that they did a lot of work in general on the caches and memory utilization/efficiency for the architecture. So this is aimed not just at improving APU performance, where main memory bandwidth is shared and constrained, but also at the larger discrete units, which will have their own dedicated memory.

    I tried my best to cover his presentation and the materials passed to me, but what I was given just had too many holes in it to get a complete overview and understanding: http://www.pcper.com/reviews/Graphi...ecture-Overview-Southern-Isle-GPUs-and-Beyond
     
  8. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    My bad, I used the wrong base. So a GF104-style SM has a 25% lower ALU:TEX ratio than AMD's CU (or the current SIMD engine; at least theoretically, but harder to compare), or AMD's ratio is 33% higher.
     
  9. wishiknew

    Regular

    Joined:
    May 19, 2004
    Messages:
    341
    Likes Received:
    9
    Thanks PcPer and hardware.fr! Realworldtech is going to do an article, but I'm really disappointed there hasn't been more coverage.
     
  10. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    It's certainly an improvement over previous AMD architectures, but Fermi likely still has more efficient caching (not just because it has more L1, but also because it can likely cover L1 latency with fewer workgroups in flight).
     
  11. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    You can use a simple round-robin scheme for the schedulers (as implied in the presentation) and you are guaranteed to be able to issue a dependent instruction without tracking dependencies between arithmetic instructions. It's the same reasoning as for the two interleaved wavefronts used from R600 through Cayman.
    AMD currently hides the instruction latencies with the execution of two interleaved wavefronts. They don't need to track dependencies between individual instructions at all. They only need to track dependencies between clauses (involving memory clauses; arithmetic ones don't have dependencies on each other, so they can simply be executed in order).
    They are getting rid of the clauses because control flow (which opens up a new clause) was inefficient (generally, all code with a lot of short clauses). Clauses simplify checking for dependencies (at least one kind of them). The arithmetic dependencies were completely hidden by the interleaved wavefront issue.
    But if you increase the number of SIMDs to, let's say, 6 and you use the same scheduling scheme as suggested (round robin), it's not going to be any faster than putting in just 4 SIMDs ;)

    The wavefront size stays at 64 as stated and the width of the vector ALUs is 16, so a new instruction can be accepted every 4 clocks. If you have more than 4 SIMDs, you can't serve them round-robin anymore, because you will come by only every 6 cycles, for instance (when you have 6 SIMDs). There is basically no easy way to increase that number arbitrarily; you have to double your scheduling width (twice the number of instructions per cycle) to accommodate 8 SIMDs. Something in between doesn't make much sense.

    Let's go back to the 4 SIMDs seen on the slides. The instruction latency comes into play when you think along the lines of keeping things simple. If it is 8 cycles, for instance, you need to issue the next instruction for that SIMD from another wavefront. Welcome back to the interleaved wavefront scheme (now just for more than two). Only with a 4-cycle latency do you get the "vector back-to-back wavefront issue" mentioned on the slides and get rid of this interleaving, as promised on one of the slides (though one may argue about that, I agree). Everything between 5 and 8 cycles requires at least 2 wavefronts/SIMD, 9-12 will require 3, and so on. While just marking a wavefront as "busy", i.e. suppressing the issue of an instruction for the next (one or two) rounds, sounds fairly easy, it will nevertheless complicate things a bit (it has to be checked when considering instructions for issue) and will also raise the minimum number of wavefronts necessary to hide the arithmetic latencies. => Bad

    An old SIMD engine could attain peak rates with only 2 wavefronts (128 data elements). With a 4-cycle latency, a CU could get to peak rates with 4 wavefronts (256 data elements). With an 8-cycle latency (everything in between makes no difference and is only a possible reason for limiting the clock speed) you need at least 8 wavefronts (512 data elements), and for 12 cycles you are at 12 wavefronts (768 data elements). For comparison, a GF100 SM needs 576 data elements (18 warps, the instruction latency) in the worst case (absolutely no ILP present in the instruction stream), down to 192 data elements (6 warps) with ILP=4 (it does not scale beyond that; obviously the size of the instruction window the Fermi schedulers look at). So especially for lightly threaded problems, there is a strong incentive to keep the latencies down so one doesn't fall behind Fermi (AMD's CUs probably don't care about ILP at all, as this would again mean one would need to track dependencies between arithmetic instructions). And as the effective size of the register file available to one thread is quite a bit smaller in the new architecture (because each SIMD has a separate one; before, the instructions in VLIW slot z, for instance, could access elements from all other banks), that's another incentive to keep the number of threads somewhat in check, i.e. the latencies down.
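    The occupancy arithmetic above can be sketched in a few lines (the function and its defaults are my own framing of the numbers in this post, not anything official):

    ```python
    import math

    def cu_elements_for_peak(alu_latency: int, num_simds: int = 4,
                             issue_period: int = 4, wave: int = 64) -> int:
        # each SIMD needs ceil(latency / issue period) wavefronts in flight
        # to cover the ALU latency between dependent instructions
        waves_per_simd = math.ceil(alu_latency / issue_period)
        return num_simds * waves_per_simd * wave

    for lat in (4, 8, 12):
        print(lat, cu_elements_for_peak(lat))  # 256, 512 and 768 elements
    # Compare the Fermi figures quoted in the post: 576 elements (18 warps)
    # worst case, down to 192 (6 warps) with ILP=4.
    ```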

    The alternative is Fermi: in comparison, much more effort in the scheduling to hide the long latencies, not only for memory accesses but also for plain arithmetic instructions. I would guess AMD tries to save something on the latter part, especially as they claimed that the die real estate of a CU will not rise excessively compared to Cayman's SIMD engines (it has to rise either way, with the scalar units, double the LDS, and double the L1, which is now R/W).
     
    #291 Gipsel, Jun 17, 2011
    Last edited by a moderator: Jun 17, 2011
  12. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Why should LDS be different from any other resource? If you oversubscribe a resource, performance goes down; that's neither complicated nor different from anything else.
     
  13. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I think devs would expect to get good perf even if they use a lot of LDS if the spec says they can do so.
     
  14. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Naa, the spec allows for textures with dimensions of 8192x8192 (or 16k x 16k now?). I think no developer would expect good performance if he really used that excessively. :roll:
     
  15. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Isn't there a difference between cache and scratchpad memory?
     
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Ah, so you're taking the position that they're not doing any fancy scoreboarding and just relying on the same sort of assumptions they made for the interleaving of clauses. How well is that going to work when you toss variable latency memory transactions in there? If you size your CU based on instruction latencies only, memory latencies could make things go south real fast.

    I agree, think I summed it up in my equation above :)

    Right and that's where I'm missing something. 10 wavefronts per SIMD limits latency hiding to 40 cycles. Isn't that a bit on the low side, especially for an architecture that can be running different apps and kernels in parallel? What happens when a few wavefronts hit long latency memory instructions (> 40 cycles)? Do you get a 4 cycle bubble in the pipeline or does the entire thing stall?

    In contrast, if we look at Fermi:

    Pipeline depth: Fermi: 36 clocks, FSA: 4 clocks (if consecutive issue of dependent instructions is really possible)
    Wavefronts per SIMD: Fermi: 24, FSA: 10
    Latency hiding: Fermi: 48 clocks, FSA: 40 clocks

    I have to think that FSA will be doing some sort of scoreboarding / readiness tracking of its wavefronts and not just blindly doing round-robin issue.
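    The latency-hiding figures in the comparison above, spelled out (the per-architecture issue intervals, 4 clocks for FSA and 2 clocks for Fermi, are my assumption to make the quoted numbers come out):

    ```python
    def latency_hiding(waves_per_simd: int, issue_interval: int) -> int:
        # clocks covered if every resident wavefront issues once before the
        # first one needs its result back
        return waves_per_simd * issue_interval

    print(latency_hiding(10, 4))  # FSA: 10 wavefronts * 4 clocks = 40
    print(latency_hiding(24, 2))  # Fermi: 24 warps * 2 clocks = 48
    ```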
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    It's round-robin amongst the SIMDs, and then it picks amongst the ~10 wavefronts on a SIMD in that cycle.
    It's 40*4 or 160 cycles before you potentially need to worry about the next instruction, and that's only if all the other wavefronts are stalled. (edit: to clarify, if all the other wavefronts issue once and then stall after the instruction issue in question)
     
    #297 3dilettante, Jun 17, 2011
    Last edited by a moderator: Jun 17, 2011
  18. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    As I said already, you still have to track memory dependencies of course. You just don't have to care about dependencies between arithmetic instructions anymore. That basically reduces the dimensionality of a scoreboard to a score line. Appears simpler to me ;)
    Are you assuming that all 10 wavefronts per SIMD issue memory operations at the same time and can't continue until they complete (because the next arithmetic instruction accesses the destination register of a pending memory instruction), with no wavefront left doing some calculations? How probable is that? According to the presentation, all wavefronts on one SIMD are evaluated for the possibility of scheduling their next instruction (which means the scheduler has to check 10 possible instructions for conflicts with pending memory operations for the respective wavefront, but not with all instructions in flight, and it has basically 4 cycles for that check). There is no round-robin scheme there. So if any wavefront on the SIMD has a vector instruction left, the SIMD will not be idle.
    Afaik, Fermi is at 18 cycles for single precision instructions done on the vector ALUs. It isn't that bad.
    That would mean that in absolute terms FSA can hide more latency time (because the clock rate is significantly lower)? So where is the problem then?
    As explained above, it has to do something for memory instructions of course, but in my opinion not for dependencies between arithmetic instructions.

    What FSA misses compared to Fermi is load balancing between the SIMDs. With FSA, the register files are pinned to a certain vector ALU, unlike with the VLIW approach before (only the write port was fixed; one could read from each bank of the register file) and also unlike with Fermi's register file (both vector ALUs have access to the whole register file). That needs some complexity and time for collecting the operands, which can now be shaved off with FSA (the disadvantage is a smaller effective register file).
    As mentioned already, the Cayman VLIWs apparently support dependent instructions within a VLIW (also multiplications, which is basically the maximum effort). That tells me that the ALUs themselves have no problem completing the operations within 4 clock cycles at <= 1 GHz. But reading the operands for the operations from the 4 register file banks is distributed over 3 cycles and swizzled to the right VLIW slot (that's where some of the weird rules came from), according to the documentation. That will not be needed anymore, which makes things much more straightforward and also saves some time. A similar problem as sketched for VLIW applies to Fermi, too. Both vector ALUs and also the SFUs need to access the full register file. And as opposed to VLIW (where the slot determines the bank to write to), the results have to be routed back. So you need more ports on your register files and a more sophisticated operand collector and result network. And all that costs time (latency).

    Edit:
    Btw., I really think one has to give AMD some credit for naming things what they are. No stupid "scalar" ALUs (besides the one that really is scalar ;)) or such stuff. No, they named a SIMD a SIMD and a vector ALU a vector ALU.
     
    #298 Gipsel, Jun 17, 2011
    Last edited by a moderator: Jun 17, 2011
  19. DarthShader

    Regular

    Joined:
    Jul 18, 2010
    Messages:
    350
    Likes Received:
    0
    Location:
    Land of Mu
  20. JoshMST

    Regular

    Joined:
    Sep 2, 2002
    Messages:
    467
    Likes Received:
    25
    We were told directly by Eric Demers that the first products using this technology will be released in Q4 of this year. But it looks like, on the APU side, Trinity is based on the VLIW4 of the Cayman chips. So it could be another 1.5 to 2 years before we see this technology in the APUs. Having said that, I am curious whether the Brazos update will in fact utilize the FSA.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.