AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

Discussion in 'Architecture and Products' started by iMacmatician, Apr 10, 2014.

  1. LordEC911

    Regular

    Joined:
    Nov 25, 2007
    Messages:
    790
    Likes Received:
    74
    Location:
    'Zona
    Yes, I said "supposed" because some other people that were discussing it were trying to pass it off as a PI document.
    I just skimmed through it and saw that it was from this month, but also saw the 2013/7 reference in the url.

    I also wouldn't put it past those aforementioned trolls to take an old document they don't understand, update a few things (dates, images, and a few code names) and try to pass it off as documentation on the next-gen.
     
  2. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    863
    Likes Received:
    264
    “DPP” – Data Parallel Processing allows VALU instructions to access data from neighboring lanes

    This is related to "ballot", and similar to ddx/ddy for compute shaders, but without the d(erivative). No more going over the LDS when you can peek directly into your siblings' VGPRs. I'm sure it isn't going to be exposed in HLSL, but the compiler could detect the round-trip and place these instructions. I can imagine dpp and ddx/y share silicon.
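    To make the lane-neighbour idea concrete, here's a toy Python model of a DPP-style shifted read (the names and the bound-lane policy are my own illustration, not actual GCN ISA): each lane reads its neighbour's value directly, with no LDS round-trip, which is exactly the pattern a ddx needs.

    ```python
    # Toy model: a wavefront is a list of per-lane values.
    WAVE_SIZE = 64

    def dpp_shift_left(lanes, n=1):
        """Each lane i reads the value held by lane i + n; lanes that would
        read past the end of the wavefront keep their own value (one
        possible bound-lane policy)."""
        return [lanes[i + n] if i + n < len(lanes) else lanes[i]
                for i in range(len(lanes))]

    def ddx_like(lanes):
        """Forward difference with the next lane - the ddx-style round-trip
        through the LDS replaced by a direct cross-lane read."""
        neighbour = dpp_shift_left(lanes, 1)
        return [nb - v for v, nb in zip(lanes, neighbour)]

    wave = [i * i for i in range(WAVE_SIZE)]   # per-lane f(x) = x^2
    diffs = ddx_like(wave)                     # forward differences: 2x + 1
    ```

    The compiler-detectable pattern is the pair "write my value where my neighbour can see it, read my neighbour's value back", which collapses into the single shifted read above.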
     
  3. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    702
    Likes Received:
    588
    Location:
    55°38′33″ N, 37°28′37″ E
    The quoted link is available right on http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/
     
  4. LordEC911

    Regular

    Joined:
    Nov 25, 2007
    Messages:
    790
    Likes Received:
    74
    Location:
    'Zona
  5. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    The GCN diagram is certainly wrong. Only 32 KB LDS, among lots of other issues (it looks like their old VLIW GPU design). This seems to have already been discussed on the previous page.
    GCN implements ddx/ddy using cross lane operations (4 lane crossbar). AMD has a public presentation about the lane swizzles (4 lane crossbar, reverse, broadcast, swap). By combining these you can do everything. See slides 42-43 here: http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah
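    Those 4-lane crossbar swizzles can be sketched in Python (permutation names are mine; the actual encodings are in the linked AMD material). Combining a horizontal swap with a subtraction gives the coarse ddx over a 2x2 pixel quad:

    ```python
    def quad_swizzle(lanes, perm):
        """4-lane crossbar: every aligned quad of lanes is permuted by perm."""
        out = []
        for q in range(0, len(lanes), 4):
            quad = lanes[q:q + 4]
            out.extend(quad[p] for p in perm)
        return out

    REVERSE = (3, 2, 1, 0)   # reverse within the quad
    SWAP_X  = (1, 0, 3, 2)   # swap horizontal neighbours
    SWAP_Y  = (2, 3, 0, 1)   # swap vertical neighbours
    BCAST0  = (0, 0, 0, 0)   # broadcast lane 0 of each quad

    def ddx_quad(values):
        """Coarse ddx over 2x2 quads: neighbour minus self, sign-corrected
        so both pixels in a row agree on the derivative."""
        swapped = quad_swizzle(values, SWAP_X)
        return [(s - v) if i % 2 == 0 else (v - s)
                for i, (v, s) in enumerate(zip(values, swapped))]
    ```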

    OpenCL 2.0 already exposes the cross lane operations (called work group functions). Intel and AMD both support them (NVIDIA doesn't yet have OpenCL 2.0 drivers). OpenCL supports (work_group_) all, any, broadcast, reduce, inclusive_scan and exclusive_scan. These compile to GPU specific cross lane operations.
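    The work-group function semantics can be modelled in a few lines of Python (this models what the spec guarantees each work item receives, not how a driver implements it):

    ```python
    from itertools import accumulate

    def work_group_reduce_add(values):
        """Every work item receives the sum over the whole work group."""
        total = sum(values)
        return [total] * len(values)

    def work_group_scan_inclusive_add(values):
        """Work item i receives the sum of items 0..i."""
        return list(accumulate(values))

    def work_group_scan_exclusive_add(values):
        """Work item i receives the sum of items 0..i-1 (0 for item 0)."""
        return [0] + list(accumulate(values))[:-1]
    ```

    On hardware, each of these lowers to the GPU's cross lane operations instead of the naive loop the model suggests.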

    More info:
    http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/

    HLSL (DirectX 11.2) doesn't yet support these operations. This is unfortunate, since these operations both simplify algorithms and make them faster.
     
    #486 sebbbi, Mar 12, 2015
    Last edited: Mar 12, 2015
    homerdog likes this.
  6. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    Nvidia has supported a single shuffle instruction that does arbitrary communication between lanes in a warp (without using any shared memory) since Kepler. Certain functions have been supported across all elements in a work group since Fermi (any, all, reduce). Fermi and up also support the ballot instruction, which is useful for these types of algorithms. I think doing scans and broadcasts across all work items in a work group on Nvidia hardware requires allocating a very small amount of shared memory - one element for broadcast, and w elements for scans, where w is the number of warps in the workgroup. The OpenCL compiler could do this automatically.
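    A Python sketch of that hierarchical scheme (illustrative only; a real CUDA version would use __shfl_up and __syncthreads): a shuffle-style Kogge-Stone scan inside each 32-lane warp, then the w shared slots holding the per-warp totals.

    ```python
    WARP = 32

    def warp_inclusive_scan(lanes):
        """Kogge-Stone scan: at each step every lane adds the value it would
        receive from a shuffle-up by 'offset'; pure shuffles, no memory."""
        vals = list(lanes)
        offset = 1
        while offset < len(vals):
            vals = [v + (vals[i - offset] if i >= offset else 0)
                    for i, v in enumerate(vals)]
            offset *= 2
        return vals

    def work_group_inclusive_scan(values, warp=WARP):
        """Scan each warp with shuffles, then combine via w shared slots."""
        warps = [values[i:i + warp] for i in range(0, len(values), warp)]
        scanned = [warp_inclusive_scan(w) for w in warps]
        shared = [s[-1] for s in scanned]   # the w shared-memory slots
        bases, acc = [], 0
        for t in shared:                    # exclusive scan of warp totals
            bases.append(acc)
            acc += t
        return [v + b for b, s in zip(bases, scanned) for v in s]
    ```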
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,137
    Likes Received:
    2,939
    Location:
    Well within 3d
    Is there an indication that the DPP instructions do something different than the no-allocation LDS instructions introduced with GCN, or is this more of an encoding change that still leverages the same hardware?
    This may leverage a similar method of execution as VALU instructions that source from the LDS, where there is no need for a WAITCNT for the LDS because a wavefront cannot issue past an in-progress ALU operation.
    Using explicit LDS to swizzle would require software tracking, whereas an encoding that rolls it into an ALU instruction would not.

    If it's reusing the LDS data paths, it could have implications for the time it takes to execute the ops if the LDS is being more heavily used. Contention would go down if a separate network were present per SIMD, which would have a hardware cost.
     
  8. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    863
    Likes Received:
    264
    I imagine you have some sort of barrel register file. Because all threads execute a symmetric cross-lane operation, you just need to rewire the ALU<->register connection, which I assume exists anyway because you can switch active threads in a CU without a big penalty on GCN.
    A la: register-file base address = "(rotate((base + threadid) % groupsize, simd rotation width, amount))". Something like that.
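    One way to read that formula, as a toy Python model (entirely speculative, like the post itself): the "rotation" is just arithmetic on which register-file slice each lane's port is wired to, so no data ever moves.

    ```python
    def rotated_source_lane(threadid, amount, width=4):
        """Purely arithmetic remap: within each aligned group of 'width'
        lanes, lane i sources from lane (i + amount) mod width. Only the
        register-file addressing changes; no values are copied."""
        group = (threadid // width) * width
        return group + (threadid - group + amount) % width

    # each lane of two quads reads its right circular neighbour
    sources = [rotated_source_lane(i, 1) for i in range(8)]
    ```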
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,137
    Likes Received:
    2,939
    Location:
    Well within 3d
    Why would switching active threads require moving data between lanes? Absent cross-lane activity, register access can start with the base register ID for the wavefront+whatever ID the code thinks it is using.
    The originally introduced swizzling methods were categorized as LDS instructions that didn't write to the LDS storage banks.

    There are some operations that also require more than a simple rotation, including mirroring and broadcasting of specific lanes to later rows.
    Having an LDS-like network at the SIMD level could make the LDS network redundant.
     
  10. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    863
    Likes Received:
    264
    Hm. To me it looks like the GCN register file is like, say, a 23-bit address space (8 MB), of which you can only address a window of 7 bits (128 VGPRs). So every SIMD in the CUs has a base address, and all register access is relative to the base address, much like "ebp", just for registers instead of memory. Let's call it "tctx" (thread context). Now, when you want to deactivate stalling SIMD threads in a CU, you remember the program counter, store it somewhere and reset the program with a different base address. All the state is in the register file; there are no flags or other processor state which could get lost, like with OoO and so on.
    No data is moved, but the lanes between SIMDs/CU are rewired to give access to different windows of the register file. The real register file wiring is much larger than just 7 bits, which means you can create an instruction which temporarily alters the access-network's base address such that a "mov v0, tctx[23].v56" would make the address of v0 fetch that part of the register file which is the 56th VGPR of the 23rd thread. That would be the generic idea.
    If you don't need to address every other register explicitly, but you want to have registers that are actual swizzles to the sibling threads, then you see that you only need to do things like "(threadid + 1) % threadgroup", which gives you the results of your next circular neighbour; subtract your own value and you get derivatives. Multiply by 4 and you get the next SIMD's, multiply by 64 and you get the next CU's, and so on. These operations are all arithmetic modifications of a base address in a global VGPR address space. The cross-SIMD "address" modifications that are possible are written down in the GCN documentation. The command is executed in the same cycle, in parallel on all threads, with the same modifier, which is designed such that it's impossible to have a read/write conflict; it's also impossible to have read/read conflicts.
    But I'm not a low-level silicon guy, it might sound more like how a GCN emulator could implement the instruction efficiently.

    Not really, I think; you have just a handful of registers (per thread), and the LDS is huge in comparison.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    GCN register file is 256 addresses x 2048-bit (64 x 32-bit). It's why it can store 1 and fetch 3 distinct addresses every four cycles. For ever.

    It's trivially simple to divide the addresses equally among the set of wavefronts running on the SIMD, nothing fancy.
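    Working those numbers out (the 4-SIMDs-per-CU figure is from AMD's public GCN documentation; the 4-wavefront split is just an example of the equal division):

    ```python
    addresses = 256                    # register file entries per SIMD
    bits_per_entry = 64 * 32           # 64 lanes x 32-bit = 2048 bits

    simd_vgpr_bytes = addresses * bits_per_entry // 8   # 64 KB per SIMD
    cu_vgpr_bytes = 4 * simd_vgpr_bytes                 # 4 SIMDs: 256 KB per CU

    # dividing the addresses equally, e.g. 4 wavefronts resident per SIMD:
    vgprs_per_wavefront = addresses // 4                # 64 VGPRs each
    ```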

    I suspect you're using imprecise language, when you say:

    I suspect you mean SIMD lane 23 (wavefront lane 23 of 64), VGPR 56. GCN doesn't have that concept. All register addressing is uniform across all lanes. It's fat and dumb. Or, it's brutishly elegant, to be less pejorative.

    There is state beyond PC, e.g. to handle exceptions, countdowns for memory accesses etc.
     
  12. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    863
    Likes Received:
    264
    Yes. lane == thread. wavefront == threadgroup.

    It's certainly interesting. I had discussions lately where we tried to conjecture what exactly makes GCN so efficient at thread switching, and I think it's the register-file connection/addressability.
    It's too bad these instructions don't make it in any shape or form into HLSL pixel and compute shaders. There are funny "x and x + 3" offsets possible in the DPP instructions; I wonder if they are related to getting the next triangle's vertex triple and if they're used in domain and geometry shaders. Imagine you'd make a pixel shader which decides it returns the same value for all pixels, distributes the calculation over the quad's threads and makes a SIMD-coherent branch out. I think I could play with this all day. :)
     
  13. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    These instructions got into OpenCL 2.0 very recently. AMD, Intel and NVIDIA all have cross lane operations in their GPUs. I would expect to see cross lane operations in Vulkan, since it shares the same SPIR-V back end with OpenCL 2.1.

    IMHO, it is only a matter of time before we get these in DirectX, since the cross lane operations provide nice performance gains for many algorithms (reduced GPR usage, reduced LDS usage, fewer instructions, etc.) and at the same time make writing these algorithms much simpler. Just look at the OpenCL 2.0 examples to see how nice the code looks written with these operations.

    Intel example about sorting with cross lane operations: https://software.intel.com/en-us/ar...ted-parallelism-and-work-group-scan-functions
     
    #494 sebbbi, Mar 13, 2015
    Last edited: Mar 13, 2015
  14. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    999
    Likes Received:
    293
  15. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,079
    Likes Received:
    4,658
    Or the consoles are expected to use >6GB so having "just" 4GB is likely to hinder performance sooner rather than later.
    Or because the GM200 Geforce will probably bring 6GB so they'd get the shorter end of the stick if performance isn't that different between the two.

    Regardless, I'm glad that AMD continues to push the envelope for more memory in their graphics cards "for the common mortal".
    (Except for the R9 285.. 2GB? Bad AMD!)
     
    #496 ToTTenTranz, Mar 13, 2015
    Last edited: Mar 13, 2015
  16. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
  17. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,499
    Likes Received:
    919
    He actually said:
    This makes no sense. There's no way that AMD designed Fiji, then saw the Titan X and decided to redesign it with a memory bus twice as wide just a few weeks away from mass production. Yes, higher-density chips would be plausible, but I'm more inclined to think he has no idea what he's talking about.
     
    Lightman likes this.
  18. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,172
    Location:
    La-la land
    That pretty much sums up the entirety of fudzilla now, doesn't it?
     
    elect and Alexko like this.
  19. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    I can live with that!
     
