AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by ToTTenTranz, Sep 20, 2016.

  1. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    I would think it's time to move this discussion to the generic AMD architecture thread. This discussion has nothing to do with consoles. We are speculating how the current GCN architecture works, and what kind of options AMD has to evolve it for Vega. The Vega rumors that started the 50% fatter CU speculation aren't even about a console product.

    I have also been wondering why the 16 wide SIMD doesn't have 16 wide split register files. Four 16 KB register files (split to serve lanes 0-15, 16-31, 32-47, 48-63).

    If I understood the GCN execution model correctly, there either have to be four separate 16 wide register files (16 KB each) per SIMD or a multi-ported register file that is capable of 3 reads per cycle. I might of course be completely wrong as I am not a hardware engineer. I might have missed something.

    64 wide instructions take 4 cycles to finish (4 cycle latency). However the SIMD starts execution of a new 16 wide instruction every cycle and finishes a 16 wide "partial" instruction every cycle (SIMD throughput of 16xFMADD/cycle). If the SIMD first fetched whole 64 wide registers (one per cycle), the execution would start at cycle 3 (0 based). The SIMD execution unit is only 16 wide, meaning that the work would be finished on cycle 7. In order to keep the 16 wide FMADD ALU filled with work every cycle, the next instruction would need to start 3 cycles before the previous one ends. However the 64 wide register fetch would need all lanes of the previous instruction to be completed (assuming it has a dependency). GCN documents clearly specify that even dependent instructions can be executed one after another with no stalls.

    So I would assume that the register fetches are also 16 wide and pipelined. This would completely hide the latency and result in 100% workload for the FMADD unit. Two consecutive instructions would always be in flight.
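
    A minimal sketch of the arithmetic behind the non-pipelined case, assuming each 64 wide operand fetch takes one cycle and an FMADD has three source operands. The constant and function names are made up for illustration; nothing here comes from AMD documentation.

```python
SIMD_WIDTH = 16   # lanes executed per cycle
WAVE_WIDTH = 64   # lanes per instruction
OPERANDS = 3      # FMADD source registers (assumed: one 64 wide fetch per cycle)


def unpipelined_timing():
    """Timing of one FMADD when whole 64 wide registers are fetched first."""
    passes = WAVE_WIDTH // SIMD_WIDTH   # 4 execute cycles on a 16 wide ALU
    exec_start = OPERANDS               # fetches occupy cycles 0..2, execute starts at cycle 3
    exec_done = exec_start + passes     # work is finished on cycle 7
    return passes, exec_start, exec_done


passes, exec_start, exec_done = unpipelined_timing()

# To keep the ALU busy, the next fetch must begin this many cycles before the previous one ends.
required_overlap = exec_done - passes

# A dependent instruction cannot overlap at all, so it can only issue every exec_done cycles.
alu_utilization = passes / exec_done

print(f"execute starts at cycle {exec_start}, done on cycle {exec_done}")
print(f"required overlap: {required_overlap} cycles")
print(f"ALU utilization with dependent instructions: {alu_utilization:.0%}")
```

    Under these assumptions a chain of dependent FMADDs would keep the ALU busy for only 4 cycles out of every 7, which does not match the no-stall behavior described in the GCN documents.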

    Let's assume four 16 KB register files. Split by 16 lane boundaries. Single read port each. Let's also assume register fetch takes a single cycle and execution takes a single cycle. FMADD is thus 3 register fetch cycles + 1 execute cycle. This is what we get.

    Timeline:
    1. Instruction A (0-15) fetches register 0
    2. Instruction A (16-31) fetches register 0
    2. Instruction A (0-15) fetches register 1
    3. Instruction A (32-47) fetches register 0
    3. Instruction A (16-31) fetches register 1
    3. Instruction A (0-15) fetches register 2
    4. Instruction A (48-63) fetches register 0
    4. Instruction A (32-47) fetches register 1
    4. Instruction A (16-31) fetches register 2
    4. Instruction A (0-15) executes FMADD + stores result
    5. Instruction A (48-63) fetches register 1
    5. Instruction A (32-47) fetches register 2
    5. Instruction A (16-31) executes FMADD + stores result
    5. Instruction B (0-15) fetches register 0
    6. Instruction A (48-63) fetches register 2
    6. Instruction A (32-47) executes FMADD + stores result
    6. Instruction B (16-31) fetches register 0
    6. Instruction B (0-15) fetches register 1
    7. Instruction A (48-63) executes FMADD + stores result
    7. Instruction B (32-47) fetches register 0
    7. Instruction B (16-31) fetches register 1
    7. Instruction B (0-15) fetches register 2
    8. Instruction B (48-63) fetches register 0
    8. Instruction B (32-47) fetches register 1
    8. Instruction B (16-31) fetches register 2
    8. Instruction B (0-15) executes FMADD + stores result

    9. Instruction B (48-63) fetches register 1
    9. Instruction B (32-47) fetches register 2
    9. Instruction B (16-31) executes FMADD + stores result
    10. Instruction B (48-63) fetches register 2
    10. Instruction B (32-47) executes FMADD + stores result
    11. Instruction B (48-63) executes FMADD + stores result

    The cycle number is at the beginning of each line. The steady state (bold in the original formatting) is the part where every cycle has three register fetches and one execute, i.e. cycles 4-8 above. Steady state continues as long as we have no stalls (memory waits). In steady state a new (64 wide) instruction is issued every 4 cycles, and one (64 wide) instruction retires every 4 cycles (just like described in the GCN documents).

    Worth noting:
    - One 16 wide FMADD gets executed every cycle. 100% ALU unit usage in steady state.
    - Three 16 wide register fetches per cycle.
    - All three register fetches in the same cycle come from separate 16 KB register files (split by 16 lanes). No need for big multi-ported register file.
    - One 16 wide register write per cycle. Cycling through all four 16 KB register files.

    Conclusion: Four small (fully separate) 16 KB register files per SIMD would work perfectly. I don't see how a single big 64 KB register file would work, unless it has 3 read ports. But I might have misunderstood something, as I am not a hardware designer.
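
    The timeline can be reproduced with a small simulation. This is only a sketch of the model assumed in this post (four single-read-port banks split by 16 lanes, one cycle per fetch, one cycle per execute, a new instruction every 4 cycles), not a description of the real hardware. All names are made up for illustration.

```python
from collections import defaultdict

LANE_GROUPS = 4      # 64 lanes / 16 wide SIMD; lane group g lives in bank g
OPERANDS = 3         # FMADD reads three source registers
ISSUE_INTERVAL = 4   # one 64 wide instruction issued every 4 cycles


def build_schedule(num_instructions):
    """Return {cycle: [(kind, instruction, bank, operand), ...]} with 1-based cycles."""
    schedule = defaultdict(list)
    for instr in range(num_instructions):
        start = 1 + instr * ISSUE_INTERVAL
        for group in range(LANE_GROUPS):
            for operand in range(OPERANDS):
                schedule[start + group + operand].append(("fetch", instr, group, operand))
            schedule[start + group + OPERANDS].append(("execute", instr, group, None))
    return schedule


def check_cycle(events):
    """Count fetches and executes in one cycle and flag two reads hitting the same bank."""
    fetch_banks = [bank for kind, _, bank, _ in events if kind == "fetch"]
    executes = sum(1 for kind, *_ in events if kind == "execute")
    conflict = len(fetch_banks) != len(set(fetch_banks))
    return len(fetch_banks), executes, conflict


schedule = build_schedule(num_instructions=8)
for cycle in range(1, 13):
    fetches, executes, conflict = check_cycle(schedule[cycle])
    print(f"cycle {cycle:2d}: {fetches} fetches, {executes} execute, read conflict={conflict}")
```

    From cycle 4 onwards every cycle has three fetches from three different banks and one 16 wide execute, as long as new instructions keep arriving.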

    Now let's read AMD's cross lane operation article:
    http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

    If we ignore the new DPP operations, all the older cross lane operations go through LDS permutation hardware. You need an extra register to receive the result and you need to use waitcnt to ensure completion before you read the result. This is similar to all variable length memory ops. Direct access between 16 lane register files is thus not required. LDS permutation hardware takes care of mixing the data. This could be seen as a hint that at least earlier GCN designs could have used separate 16 KB register files (lane / 16).

    The new DPP operations don't use LDS permutation hardware (and don't need waitcnt) but are more limited. There are some operations that move data across 16 lane boundaries, so there must be some kind of data path to fetch data from other 16 KB register files. DPP requires two wait states (= 2x NOP if no independent instructions). Two wait states = 8 cycles (NOP = 4 cycles like every other instruction). Maybe there's some new additional permute hardware inside the SIMD that does this operation, but it is pipelined itself and/or far enough away to require extra cycles. Thus separate 16 KB register files would still work.

    If this is all true, then GCN needs to execute the FMADD (ALU part) in a single cycle. The result needs to be written back to the register file immediately as the next instruction might fetch it on the next cycle. I don't know whether this is possible, and if it is, how much it would limit the clock rate.

    DISCLAIMER: If I understood something wrong, please fix my wrong assumptions about hardware engineering :)
     
    #61 sebbbi, Sep 24, 2016
    Last edited: Sep 24, 2016
  2. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    If I'm understanding this correctly, there are registers for the CU and SIMD. The SIMD local registers would be multi-ported, but pulling in data from the CU registers would be single ported.

    The 4 clock cadence generally rules this out. Lanes 48:63 would need to pass data to 0:15 for the subsequent instruction. This could occur without actually writing out the values if no other lanes depended on the data. A 16 thread wave executing each cycle should be able to pass an output as an operand for another lane without writing out the value. I'm not 100% on this, but I've seen it in circuits before. It seems likely if they mention fast 16 lane transfers.

    As for the clocks, I don't believe the SIMDs are gated independently. There was mention of doing this with the variable SIMD sizes though. Speculating here, but my thoughts are the bottleneck lies in the ACE/Schedulers polling metrics from the CUs. Programmable logic would pile up a bunch of transistors exacerbating propagation delay. If locked to cache frequencies that could back everything up. Clocking SIMDs asynchronously would provide a significant boost in compute, but not necessarily address the bandwidth problem.
     
  3. gamervivek

    Regular Newcomer

    Joined:
    Sep 13, 2008
    Messages:
    699
    Likes Received:
    210
    Location:
    india
    If 64 CUs is the big Vega then HBM2 on smaller Vega doesn't make much sense. And if AMD are repeating Fiji on 14nm then it's likely the geometry performance and die size compared to Polaris 10 would be more like Pitcairn-Tahiti than Pitcairn-Hawaii.

    http://www.hardware.fr/articles/951-8/performances-theoriques-geometrie.html

    Anyway, two likely Vega based cards on zauba:

    [IMG: Zauba listing, image not preserved]

    And a 6980:00 was spotted on gfxbench, though the only given scores (ALU2) are way lower than Polaris 10's.

    https://gfxbench.com/compare.jsp?benchmark=gfx40&D1=AMD+6980:00
     
  4. sebbbi

    Why would there be separate SIMD and CU registers? Or do you mean scalar registers (SGPR)? A single SIMD executes a single wave from start to end. There is no way for waves to communicate directly with each other (except LDS of course). There is no reason for anybody outside the SIMD to access its VGPRs.
    Why would lanes 0:15 need input from lanes 48:63? Each lane represents a single thread (in HLSL or GLSL). There is no way to communicate between threads except for LDS, or cross lane operations in SM 6.0 / OpenCL 2.0. Cross lane operations on AMD hardware go through LDS permute hardware (and require variable cycles + waitcnt sync). DPP requires two wait states, so no direct single cycle way to fetch registers across 16 lane boundaries exists. See: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
     
  5. Anarchist4000

    Likely to avoid a huge crossbar or allow independent clocking. Waves couldn't communicate with each other outside of LDS. Threads within a wave should be able to. A single wave with 16 threads would open up some different possibilities as all are active simultaneously. The 4 cycle stride is just to hide latency. Ideally a 16 wide SIMD has 16 wide waves.

    Someone outside the SIMD could theoretically access the VGPRs of an inactive wave. It's not a usual occurrence, but I believe it's how they were doing the prefetching that came with Polaris. In that case the scalar unit was theoretically reading ahead and executing texture fetches for upcoming instructions. It's also possible lanes could be reorganized, although not while the wave is active. This is all theoretical Vega stuff based on some papers I've seen floating around.

    A single SIMD should be switching waves as they stall. At that point the stalled wave gets dumped back to CU registers and the next ready wave is scheduled in. I believe this step is happening transparently in the scheduler.

    Generally you wouldn't as each lane should be an independent thread, but a lot of micros have the capability. It's simply the result of the source and destination reading the same pins simultaneously. The ISA is abstracting all of this. So the output of instruction 1 becomes the input of instruction 2 in a single clock and you would likely need to repeat the instruction. If assuming a 64 thread wave it would be impractical unless threads 0:15 were passing data to 16:31 etc and executing the same instruction for several cycles. A 16 thread wave however is another matter. If you created a wave with only 16 threads the same threads should be active each cycle. I don't see the capability mentioned with the DPP capabilities, but it would be under DPP Modifier. It may not be possible with the level of granularity I suggested, or they simply didn't expose all the patterns. It's possible the crossbar isn't large enough on all hardware. "Full crossbar in group of four" over two cycles would be 16 lanes so they may have encapsulated the capability. Someone with knowledge of the hardware beyond the ISA document would have to answer this. For all intents and purposes the two cycle wait is sufficient, I'd think.
     
  6. sebbbi

    All control flow decisions and waiting decisions are made by the scalar unit using SGPRs (outside the SIMD). The scalar unit has no access to the VGPRs of the SIMDs. If you want to move data from a vector to a scalar register, you need to use s_waitcnt to ensure the data is available before you read it. No cycle guarantees in data move time -> no tight coupling needed.

    GCN4 documents and slides only mention instruction prefetch. I would assume it is just a simple linear prefetch (next icache lines, similar to CPUs). No mention about complex speculation or any kind of data prefetch in the documents. Data prefetch on modern CPUs is done by analyzing cache line access patterns (linear, strided, backwards/forwards). Prefetchers do not execute real code ahead of time or anything fancy like that. I would be really surprised if AMD added super fancy stuff like this as a minor architectural step (GCN3->GCN4). Also GPUs don't need data prefetch as much as CPUs do, since the excess parallelism allows cheaper latency hiding with no need to burn excess bandwidth (guessing is never 100% accurate).
    Lane reorganization/packing by execution mask (branches) would be a huge change to the whole GPU SPMD architecture, especially to the register files. Nobody else (NV / Intel) has yet managed to do this. If Vega has it, it is a completely new architecture. In this case, this discussion is moot. It would change everything.
    Up to 10 waves per SIMD are ready to execute. The SIMD switches the wave every cycle if needed. I haven't seen any stalls from repeated wave switches (as long as at least one wave is ready to run all the time). Copying large numbers of registers back and forth would inevitably cause stalls that would be visible in low level performance profiling tools.

    I fail to see any advantages in having shared CU (VGPR) registers (in addition to the 64 KB register files per SIMD). No documents mention a shared CU register file either. Cross lane ops don't need a shared register file (slow enough to avoid need for direct register access). Even if each SIMD had a fast L0 register file to hold just the current wave's registers, it would still be beneficial to split the VGPRs by SIMD (64 KB each as seen in documents) instead of having a single big 256 KB per CU register file.
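
    To illustrate why wave switching needs no register copying, here is a rough sketch of how the 64 KB per SIMD file could be statically split between resident waves. It uses the figures from this post (64 KB per SIMD, up to 10 waves) and assumes 32-bit VGPRs and 64 lane waves. It ignores any allocation granularity the real hardware may have, and the names are made up for illustration.

```python
REGISTER_FILE_BYTES = 64 * 1024   # per SIMD, as stated in the post
WAVE_WIDTH = 64                   # lanes per wave
BYTES_PER_LANE = 4                # assumption: 32-bit VGPRs
MAX_WAVES_PER_SIMD = 10           # as stated in the post


def vgpr_budget():
    """Number of 64 wide VGPRs that fit into one SIMD's register file."""
    return REGISTER_FILE_BYTES // (WAVE_WIDTH * BYTES_PER_LANE)


def resident_waves(vgprs_per_wave):
    """Waves that can stay resident at once if each one needs vgprs_per_wave registers."""
    if vgprs_per_wave < 1:
        raise ValueError("a wave needs at least one VGPR")
    return min(MAX_WAVES_PER_SIMD, vgpr_budget() // vgprs_per_wave)


for vgprs in (24, 32, 64, 128, 256):
    print(f"{vgprs:3d} VGPRs per wave -> {resident_waves(vgprs)} resident waves")
```

    Every resident wave keeps its registers in place for its whole lifetime, so picking a different wave on the next cycle is only a scheduling decision.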
     
  7. Anarchist4000

    http://people.engr.ncsu.edu/hzhou/ipdps14.pdf

    Section VI of that document describes using a scalar for prefetching.
    "In this execution paradigm, we adapt a recent work on
    using CPU to prefetch data for GPU [40] to generate the
    scalar kernel. This collaborative execution paradigm is used
    when a performance profiler shows that a SIMT kernel
    under-utilizes the GPU memory bandwidth and suffers from
    stall cycles due to memory accesses. The key differences
    from the previous work [40] are two folds. First, in our
    proposed architecture, the scalar unit and the SIMT unit
    share the L1 cache and the L2 cache, which simplifies the
    control mechanism for prefetch timing. Second, one scalar
    thread prefetches for one SIMT thread block whereas in the
    previous work, a CPU thread prefetches data for all the
    thread blocks on the GPU."
    I'm assuming this is roughly how the prefetch is working. It would likely be a relatively simple algorithm compiled for the scalar unit. It wouldn't necessarily have direct access to the registers, but the compiler duplicated part of the stream in advance. It likely only kicks in with compute and different access patterns. It's also possible the sharing isn't exposed for security reasons.

    Generally yes, but the paper I mention models the scalar as an additional lane of the SIMT with an independent command stream. The flexible SIMD paper model was using 64+1 lane waves with the possibility of the scalar unit executing vector code serially with a large cadence. This was primarily to address divergence, but would suggest shared register space.
    "The instruction multiplexer (I-MUX) in Fig. 2 can
    also be configured such that scalar instructions are issued by
    the warp scheduler in the SIMT unit. With this configuration,
    our proposed architecture can operate in the same way as the
    GCN architecture, where the scalar operations are embedded
    in the SIMT/vector instruction stream and encoded with a
    scalar operation type."

    The above paper is what I've been basing a fair amount of the Vega speculation on. The paper was written by some of AMD's senior hardware guys a few years ago. Yeah it would change a lot, which is part of the reason I've been looking into it.

    http://www.freepatentsonline.com/20160085551.pdf
    Not saying this is the current implementation, but each SIMD would have an independent register file and require the ability to dump the registers back for transfer to a different SIMD following divergence. It's possible the diagram I saw was referencing L0 or temporary register separate from the register file. It wasn't really explicit on where the data was going. I'd agree some sort of stall would be visible, unless there was a mechanism to page it out independently or it held multiple waves. As for the sizes of the caches I'm uncertain.

    I don't believe it would be in addition to, as much as a method to transfer a wave between SIMDs and track other data. Repacking might be an option as well. The CU registers might be SGPRs or something else for tracking various metrics by the scheduling hardware. The above paper would seem to indicate the transfer is a requirement. I'm not sure that's exposed in any ISA or current hardware, but the hardware may be capable.

    It also seems likely the SIMD VGPRs would be split in upcoming designs. If SIMDs were clocked independently of each other that would be a necessity. I know Polaris doubled a lot of cache sizes, but I'm unsure of their exact arrangement.
     
  8. sebbbi

    Aha! Seems that I was talking about the current GCN, and you were talking about a brand new AMD architecture design described in that paper. As I said, being able to combine/split/pack waves depending on a branch would change everything. I agree that in this kind of architecture registers would have to be frequently moved across the CU. And this kind of architecture would likely also get performance benefits from a fatter CU.

    None of us has programmed an architecture like that, so it remains to be seen how big gains it provides. You could create much longer shaders with complex loops and branches and subfunctions. Would be fun. But at the same time that kind of architecture would be less power efficient for simple linear code, such as most pixel shaders. Static register allocation is power efficient (assuming simple shaders with low register count and not many branches).

    (Could a moderator move this discussion to the AMD Vega speculation thread?)
     
  9. Anarchist4000

    Theoretical architecture as they may or may not have actually made it. That paper was from 2012 as I recall, so elements of it could exist in GCN3/4 which is what I've been suggesting. The scalar stuff could be in Polaris, but the instructions aren't enumerated. So we can't really test or explore it yet without modifying a compiler and guessing instruction codes. GFX IP9 could expose the capabilities if the hardware is there. Might be a matter of the software stack maturing.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,069
    Likes Received:
    2,739
    Location:
    Well within 3d
    Some of the speculation about Scorpio focuses on its time frame being late enough to come after Vega. Aside from Scorpio's general TFLOP count, there's not enough detail to rule out CUs with similar resources. However, an ISA mismatch like what exists with Polaris versus the PS4 could readily apply to Vega, which would be one more ISA revision beyond what is already incompatible with the consoles.

    That would be the 4-way banking of each SIMD register. A register would be a single row of the register file, and it could be subdivided into 16-wide banks. Full independence may not be necessary since the row address could be decoded once and reused for the next cycle's bank access.

    One item that might be different is that the ALU execution and writeback to the register file could be on separate cycles.
    If we go with the assumption that a whole cycle is dedicated to a read access, there may not be enough time for the ALU operation to complete and finish writing to a register in cycle 3.
    Forwarding within the SIMD could satisfy a dependent instruction, so it wouldn't be strictly necessary that the destination register be updated before the next clock cycle.
    The wait states for DPP might stem from some kind of complex behavior at the retirement phase.
    DPP might have a separate buffering stage that collects operands and does not pull from the SIMD's forwarding bus, and if the write back to the register file completes on a different cycle it might have an inconsistent view of the register file.

    That being said, it still might be possible that the register file is not strictly single-ported, or it could be banked more heavily than 4 in order to reduce conflicts with other pipelines that can read/write to the registers.
    A pure FMADD loop can be satisfied without penalty with the minimum port count, but could units trying to read addresses or return values hit bank conflicts unless there's some extra access?
     
  11. sebbbi

    Yes. If the register can be directly forwarded to the next instruction, then the store could happen in the next cycle. In this case however there would be both a read and a write occurring to the same 16 KB register file at the same cycle. You could however delay the write by 4 cycles instead of 1 (there is always a free slot at i+4). No problems whatsoever. Sounds reasonable. In this case we would have a full cycle to perform the 16 wide FMADD ALU operation.
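
    A quick sketch to check the free slot claim, using the same assumed model as the earlier timeline (four single-ported banks split by 16 lanes, three operand reads per FMADD, a new instruction every 4 cycles). The function names are made up for illustration. A write delay of 0 means the result is written in the execute cycle itself.

```python
LANE_GROUPS = 4      # one bank per 16 lane group
OPERANDS = 3         # FMADD source registers
ISSUE_INTERVAL = 4   # cycles between 64 wide instructions


def read_cycles(bank, num_instructions):
    """Cycles in which `bank` is busy serving an operand read."""
    return {
        instr * ISSUE_INTERVAL + bank + operand
        for instr in range(num_instructions)
        for operand in range(OPERANDS)
    }


def write_conflicts(write_delay, num_instructions=8):
    """Count result writes that land on a cycle where the same bank is being read."""
    conflicts = 0
    for bank in range(LANE_GROUPS):
        # Two extra instructions so that reads exist for every delayed write we test.
        reads = read_cycles(bank, num_instructions + 2)
        for instr in range(num_instructions):
            execute = instr * ISSUE_INTERVAL + bank + OPERANDS
            if execute + write_delay in reads:
                conflicts += 1
    return conflicts


for delay in range(5):
    print(f"write delayed by {delay} cycles: {write_conflicts(delay)} conflicts")
```

    In this model each bank is read on three cycles out of every four, so only delays that are a multiple of 4 (including 0) land on the idle cycle. A delay of 1 collides with the next instruction's first operand fetch from the same bank.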
     
  12. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,784
    Likes Received:
    1,509
  13. Newguy

    Regular Newcomer

    Joined:
    Nov 10, 2014
    Messages:
    256
    Likes Received:
    112
  14. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,665
    Likes Received:
    4,327
    Fudzilla? Weren't they spreading the news that Vega 10 was being pushed forward to October 2016?

    Here it is:
    http://www.fudzilla.com/news/graphics/40662-amd-allegedly-pulls-vega-launch-forward


    Makes some sense that Vega 10 - if it's a 1:2 FP64 chip - would come to professional market first, though. Their most recent DP-heavy solution is the 2.5 year-old W9100 with a Hawaii.
    The P100, although it seems to still be vaporware (don't know of any of those cards being shipped anywhere) has about twice the DP throughput at a lower power consumption.


    EDIT: Meant Vega 10, not Vega as family of GPUs.
     
    #74 ToTTenTranz, Sep 29, 2016
    Last edited: Sep 29, 2016
  15. Anarchist4000

    Most of the Vega lineup won't be 1:2 FP64 as it's a waste for most applications. As the article stated, it will likely be a high end server processor for specific markets. Same concept as Pascal, where the P100 variants are the only boards with fast FP64. I'm not sure that card would have any uses outside of a professional market.

    It might not be as large as P100, but they may also be planning on integrating it with Zen which could have interesting applications.
     
  16. Esrever

    Regular Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    594
    Likes Received:
    298
    It would make sense that big Vega would be half rate DP and server oriented. However I don't think AMD has ever released a server GPU before its desktop counterpart.
     
  17. Anarchist4000

    I doubt big Vega is server oriented, just that it's going there first. Cost isn't an issue there if constrained by component production. Not to mention the margins will be better. I'm still expecting these things to show up on an interposer with Zen, regardless of size, and also get attached to discrete cards. Going off the rumors linked, I'm not sure "big" Vega is actually big. Specs look like a Fury with some architectural tweaks, moved to 14nm FinFET. Compared to a nearly 600mm2 chip that's not exactly huge. Truly "big" would make more sense as a dual V10 packed on an interposer.

    Bonus points for 4 GPUs and 8 stacks of HBM2 on both sides of the interposer! :cool2:
     
  18. ToTTenTranz

  19. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Servers with P100 won't ship until early Q1 2017; this has been known for quite some time.

    But it does make sense to work on Vega in the HPC space first.
     
  20. Esrever

    Didn't other leaks also claim 12TF SP performance? That would make for 6TF DP at half rate. That would more than double the W9100.
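
    The arithmetic, as a small sketch. The 12 TF figure is the rumor quoted above. The W9100 number (2.62 TF double precision) is not from this thread; it is assumed here from AMD's published peak spec for that card, so treat it as an assumption.

```python
RUMORED_VEGA10_SP_TFLOPS = 12.0   # rumored single precision figure from the leaks
HALF_RATE = 0.5                   # 1:2 FP64
W9100_DP_TFLOPS = 2.62            # assumption: AMD's published peak DP figure for the W9100


def dp_tflops(sp_tflops, dp_ratio):
    """Peak double precision throughput for a given SP figure and DP:SP ratio."""
    return sp_tflops * dp_ratio


vega10_dp = dp_tflops(RUMORED_VEGA10_SP_TFLOPS, HALF_RATE)
ratio_vs_w9100 = vega10_dp / W9100_DP_TFLOPS

print(f"Vega 10 at half rate: {vega10_dp:.1f} TF DP")
print(f"Relative to W9100: {ratio_vs_w9100:.2f}x")
```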
     