AMD Architecture Discussion

Discussion in 'Architecture and Products' started by Anarchist4000, Dec 19, 2016.

  1. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    As I haven't really followed much of the discussion I could be interpreting it badly, but don't Vega and Zen introduce a new interconnect link, with extremely high bandwidth compared to the previous one (and with nearly no latency at all)? The R&D must have been extremely costly for AMD (in terms of people and time spent working on it).
     
  2. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    AMD (as reported by EETimes) says that they have a new data fabric that can scale not only for their mobile and server SoCs, but also to the scale of what Vega and beyond need (512 GB/s+). There is no detail about it, other than it being a network-on-chip and having only one variant, with coherency ("comes only in a coherent version"). What we called GMI before is apparently the inter-chip component of this new architecture.

    So we are talking about both the inter-chip link AND the on-chip cache coherent interconnect.

    http://www.eetimes.com/document.asp?doc_id=1330981&page_number=2
     
    pharma likes this.
  3. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Just the abstract on that makes a ton of sense. I'll need to read through that later when I get more time.

    One possibility is that they have the cache mapped to different pools, and in turn to interconnect channels. That would be easier to maintain than spreading cache across an increasing number of logical units. Cache size could then be tailored towards a specific task. Combine 6 CUs for one task, 4 CUs for another, etc., and disable some of the interconnect links. So while it uses a mesh, not all the links would be active. Have a pool of interconnect links/buses from which to map out logical, task-specific hardware units. That would go a long way towards alleviating the spaghetti ball of an exponentially large mesh. Treat Vega as a giant FPGA, but instead of adders they'd be using entire CUs. That works well with what I've seen of the paper you linked, as it might be increasing effective CU size in the process. Great for cache, but the cadence which normally accounts for synchronization would be interesting.
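    To make the "pool of links" idea concrete, here is a tiny Python sketch of a mesh whose links can be selectively disabled to carve out task-specific CU clusters. This is purely illustrative speculation; the mesh size, cluster shape and all names are invented.

    ```python
    # Model a mesh of CU nodes whose links can be disabled so that groups
    # of CUs (6 for one task, 4 for another, etc.) become isolated,
    # task-specific clusters. Everything here is a made-up illustration.
    from itertools import product

    class CUMesh:
        def __init__(self, rows, cols):
            self.nodes = set(product(range(rows), range(cols)))
            # Start with every nearest-neighbour link enabled.
            self.links = {frozenset((a, b))
                          for a in self.nodes for b in self.nodes
                          if abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1}

        def isolate(self, cluster):
            """Disable every link crossing the cluster boundary, leaving
            the cluster's CUs connected only to each other."""
            cluster = set(cluster)
            self.links = {link for link in self.links
                          if len(link & cluster) != 1}

    mesh = CUMesh(4, 4)                              # 16 "CUs" on a 4x4 mesh
    mesh.isolate({(0, 0), (0, 1), (1, 0), (1, 1)})   # carve off a 4-CU cluster
    print(len(mesh.links))                           # 20 of 24 links stay active
    ```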

    That's what we were discussing. The EETimes article said it was a mesh interconnect in excess of 512 GB/s for Vega, and that they could rework it in a matter of hours as opposed to months. The debate was over just how to scale and implement that. My previous assumption was that the link they were describing was just a point-to-point interconnect between GPU and CPU: a faster, lower-latency PCIe link thanks to the interposer. Given the mesh, it seems highly likely it's an FPGA-style interconnect.
     
  4. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    Kind of. But I was thinking about bandwidth amplification, which 3dilettante has touched on.

    One key thing about Quick Release (QR) IMO is that it can be layered easily, and it is designed with the assumption of weak consistency, so that the only external coherence traffic would be invalidations. So it is possible to have (1) a building block of a small number of CUs and a private L2 for local bandwidth amplification; and (2) multiple building blocks on the mesh, with a per-channel L3 handling GPU-local atomics and global write combining. Both levels could then use the same interconnect (e.g. a ring locally, the mesh globally).
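    A minimal sketch of that two-level arrangement, assuming QR-style weak consistency where the only traffic crossing a block boundary is an invalidation; the block sizes and class names are invented:

    ```python
    # Speculative model: small CU blocks share a private L2 (local
    # bandwidth amplification), and blocks sit on a global mesh with a
    # per-memory-channel L3 handling atomics and write combining. Under
    # weak consistency the L3 only broadcasts invalidations on a write,
    # with no ownership tracking.

    class L2Block:
        """A building block: a few CUs behind one private L2."""
        def __init__(self, n_cus):
            self.n_cus = n_cus
            self.lines = {}                # addr -> locally cached data

        def invalidate(self, addr):
            self.lines.pop(addr, None)     # the only inbound external traffic

    class MeshL3:
        """Per-channel L3 on the global mesh: serves misses and
        broadcasts invalidations to the blocks on a write."""
        def __init__(self, blocks):
            self.blocks = blocks
            self.memory = {}

        def write(self, addr, data, writer):
            self.memory[addr] = data
            for block in self.blocks:      # weak consistency: invalidate only
                if block is not writer:
                    block.invalidate(addr)

    blocks = [L2Block(n_cus=4) for _ in range(16)]   # e.g. 16 blocks of 4 CUs
    l3 = MeshL3(blocks)
    l3.write(0x1000, b"tile", writer=blocks[0])
    ```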
     
  5. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    The timeline for that paper will be interesting. It may have come too late (~2014, best I can tell) in Vega's development cycle to get implemented; the fabric design might allow for it though. The applications will be interesting, as it enables capabilities that GPUs have avoided in the past. It may address the ROP scaling issues rather efficiently with the layering as you described, with obvious benefits for compute when needed. Might be the closest thing to compute compression we've seen yet. I could see delta compression working alongside that rather nicely for certain workloads.

    In the case of Scorpio they could probably map an entire block of EDRAM into a CU cluster. For that matter an entire node on the mesh could be nothing but specialized cache, ROPs, or geometry engines. Definitely changes the programming model for certain elements.

    One other interesting possibility is that the mesh could in theory be used like a series of buses to broadcast data. Or possibly a <2048b memory channel. Unsure what applications that might have, but if every CU had to read the same data it could work.
     
  6. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    Not quite sure how it would fix ROP scaling.

    GPUs are already screen-space partitioned in several stages (rasteriser, ROP, etc.), so it seems meshes could fit well. Say, as ROPs are bound to shader engines and memory channels, they could virtualise the export bus to run on top of the NoC by exploiting locality in the mesh.
     
  7. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    https://www.eecis.udel.edu/~lxu/resources/TOP-PIM: Throughput-Oriented Programmable Processing in Memory.pdf
    If, for example, the ROPs weren't in the CUs and you needed to coalesce the writes a bit more. Still reading through that one myself.

    Throwback to the Navi discussion in March/April. Hitting on all the same stuff we're looking at now. Looks like parts of it happened with Vega though.
    https://forum.beyond3d.com/threads/amd-navi-speculation-rumours-and-discussion.57684/
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,125
    Likes Received:
    2,883
    Location:
    Well within 3d
    To clarify, the ROPs as we know them are not in the CUs, and are currently not coherent at all.
    Their usage model has very separated synchronization points, and various behaviors break the Quick Release assumption of rare read after write traffic.
    ROPs move tiles of data at a time, and the current method of use when dealing with depth and transparency puts a higher premium on reading data that was written earlier. Delta compression is currently not coherent and depending on the compressor's choice of tile and format can generate a variable and wide swath of invalidations based on whether a single value change flips the delta for the tile and possibly any other tiles/control words in the hierarchy.

    How ROPs could be brought into the coherent domain would need to be outlined given some of the current design choices.
    If they were brought in, it would seemingly mean bringing some kind of tiled memory pipeline into the neighborhood of the CU, or as a subsidiary export pipeline in the CU proper. There could be uses for doing so, including expanding the programmability of export operations, creating a different memory access type with different sizes and visibility controls, and possibly using ordering information and delta compression values as inputs to the shaders or the shader engine's wavefront logic.
     
  9. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    It's the programmable blending and OIT I was looking to address. At least the OIT should be coming with Vega, and ideally the blending, as they're similar. In theory that's somehow coherent along with the interconnect. My thinking was a two-stage design with a programmable first stage in the CU(s), as opposed to a programmable ROP elsewhere, and a second stage with some basic blending or data-manipulation logic. Possibly actual data compression and indexing to conserve space as opposed to just bandwidth. The QR would be for cases where programmable units in different CUs did something strange: computing a tile-wide value, perhaps, once the CUs simplified results. Perhaps related to DCC or a compression scheme. Ensure another CU didn't make any changes that would affect the outcome if several were working together. That would be at a rather coarse granularity.
     
  10. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    OIT was in consoles before, if I'm not mistaken; Sega comes to mind, but it was later dropped due to limitations. Depending on what you want and the accuracy you want, it's extremely expensive, and this is probably why it hasn't been used in GPUs much, so I wouldn't think it would be even remotely a reason for the CUs' and ROPs' needs.

    PS: OIT performance has more to do with register pressure, local GPU cache, and how the communication between the CUs in different blocks is handled.
     
  11. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    OIT can always be implemented in DX11 using a per-pixel linked list, and AMD's TressFX uses it. The problems are the unbounded memory growth (if not hard-capped), the information loss (if hard-capped) and the scattered access pattern.

    He probably meant Rasterizer Ordered Views, which are IIRC essential for approximated/adaptive OIT.
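    For illustration, a CPU-side Python model of the per-pixel linked list approach (real implementations, TressFX included, do the append phase in a pixel shader with UAV atomics such as InterlockedExchange on the head-pointer buffer); the buffer layout and names here are just for the sketch:

    ```python
    # Phase 1 appends fragments to a shared node buffer and swaps the
    # per-pixel head pointer; phase 2 walks each pixel's list, sorts by
    # depth and blends back to front.

    NULL = -1
    heads = {}          # pixel (x, y) -> index of newest node for it
    nodes = []          # shared node buffer: (depth, rgba, next_index)

    def append_fragment(pixel, depth, rgba):
        """Phase 1: the atomic head-pointer exchange step."""
        nodes.append((depth, rgba, heads.get(pixel, NULL)))
        heads[pixel] = len(nodes) - 1    # unbounded growth unless capped

    def resolve(pixel, dest):
        """Phase 2: gather the list, sort far-to-near, blend over dest."""
        frags, i = [], heads.get(pixel, NULL)
        while i != NULL:
            depth, rgba, i = nodes[i]
            frags.append((depth, rgba))
        for _, (r, g, b, a) in sorted(frags, reverse=True):  # back to front
            dest = tuple(c * (1 - a) + s * a
                         for c, s in zip(dest, (r, g, b)))
        return dest

    append_fragment((3, 5), depth=0.7, rgba=(1.0, 0.0, 0.0, 0.5))
    append_fragment((3, 5), depth=0.4, rgba=(0.0, 0.0, 1.0, 0.5))
    print(resolve((3, 5), dest=(0.0, 0.0, 0.0)))   # (0.25, 0.0, 0.5)
    ```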
     
    Razor1 likes this.
  12. pMax

    Regular Newcomer

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
    I tried to follow this discussion, but I honestly do not understand why you are talking about the bus interconnects.
    AMD stated they moved to a coherent bus: coherence is handled at the MC access level, as the DC level is already too late.
    So, given that they were doing coherency between the GPU MMU and the CPU MMU using the PCI interconnect or something like that, IMHO this means they finally made an Intel-like fully unified northbridge.
    At least, that is what I understood.
     
  13. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    A cache coherent interconnect doesn't mean everything has to be cache coherent. It just means cache coherency is maintained, and clients can very well bypass the coherence domain as needed; the GPU is an obvious one.
     
  14. pMax

    Regular Newcomer

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
    ...yet I miss why people are discussing bumps and so on: those terminations must end in DCT/PHYs, so do you multiply them as well (edit: rethinking, maybe they were thinking of interconnecting the various MCs with bumps?)? Plus, as they are talking about coherence, that happens in the northbridge. Coherence between the CPU and GPU blocks means a unified NB across both. So, again, I miss the point?
    (Plus, once you have a fully coherent 512 GB/s bus, why would you add an extra incoherent bus? You use what you already have, with a different protocol maybe.)
    Also, a fully coherent, fully unified northbridge allows you to put an L3 behind the CPU and GPU, thus getting those nifty Intel benefits with large L3 caches.
     
  15. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    There are multiple facets of coherence being talked about here. You are focusing on system-level cache coherence. We are discussing coherence and consistency in the GPU's private domain, which is preferably incoherent with the system while reusing the cache coherence fabric (protocol and interconnect).


    Maintaining cache coherency has overheads, and in some cases things are intentionally designed to avoid it for maximum efficiency. Take a GPU writing to its local frame buffer, which is designed to be coherent with the CPU only at API boundaries. There is no point in putting it within the coherence domain; that gives you a hell of a lot of ownership and invalidation traffic at the system level that wasn't necessary at all. Moreover, it is not "adding an extra incoherent bus": modern SoC interconnects usually support both coherent and incoherent accesses under the same umbrella.
     
    #35 pTmdfx, Dec 23, 2016
    Last edited: Dec 23, 2016
  16. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    If you need an example, the AMBA 4 ACE white paper might help. Starting from page 9, it explains that the interconnect supports both "non-shared" transactions that are non-coherent at the system level, and various levels of coherent transactions (broadly: caching vs. non-caching clients).
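    As a toy illustration of that split (ReadNoSnoop/WriteNoSnoop and ReadShared/WriteUnique are actual ACE transaction types; the routing function below is made up):

    ```python
    # One interconnect, but each transaction declares whether it needs
    # snooping, so coherent and non-coherent traffic share the fabric.
    from enum import Enum

    class Txn(Enum):
        READ_NO_SNOOP = "ReadNoSnoop"     # non-shared: bypasses coherency
        WRITE_NO_SNOOP = "WriteNoSnoop"
        READ_SHARED = "ReadShared"        # coherent read by a caching master
        WRITE_UNIQUE = "WriteUnique"      # coherent write, non-caching master

    def route(txn: Txn) -> str:
        """Same fabric either snoops the other masters' caches or goes
        straight to memory, depending on the transaction type."""
        if txn in (Txn.READ_NO_SNOOP, Txn.WRITE_NO_SNOOP):
            return "direct to memory controller (no snooping)"
        return "snoop other caching masters, then memory on miss"

    print(route(Txn.READ_NO_SNOOP))   # e.g. a GPU streaming its frame buffer
    print(route(Txn.READ_SHARED))     # e.g. a CPU cluster on shared data
    ```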
     
  17. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    https://software.intel.com/en-us/gamedev/articles/rasterizer-order-views-101-a-primer

    This is the approach I was thinking of, accelerated with those high-performance scalars I've been theorizing about when a list is complete or overflowing. They'd be perfect for the sorting or the linked lists, so long as those were small enough to fit within the 256B-1KB (guessing) scratch pad. Keep in mind I was theorizing packed vectors (ideally all the ROV samples) to feed them data, where the scattering is a bit less of a concern; you just end up with the larger initial memory allocation. Use the traditional ROP design for aggregating results, then the CU/scalars if it starts overflowing as highlighted above, along with finalizing those results in a programmable manner.

    In the case of an overflow it would either compress, or get a new memory allocation to link to. One which likely isn't readily available to the ROP. Dump all the samples to a new linked location and start collecting again. Maintains performance until a shader decides to make use of all the samples.

    Extension of ROV. Traditional ROP handling opaque samples, possibly appending to a list of transparent samples. The CU stage, along with that theorized scalar, doing the sort, blending, and overflow.

    http://www.eecg.toronto.edu/~enright/micro14-interposer.pdf
    Current thinking was a 2048b(?) interface on each CU or SE. Part of that would be tied directly to memory, another part remapped to a configurable network topology through an internal crossbar to create the mesh. The coherence would be a protocol over whatever topology was configured. Most of the interface would be disabled, as it would be used by other nodes and would require a really large crossbar and the ability to consume data in each node.
     
  18. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    A few wild ideas:

    1. Let's assume the rasterisers already own large interleaved partitions of the screen space in Z order (see the sketch after this list).

    2. A CU group (let's say four CUs) can further own a fixed subpartition in the logical space of its parent rasteriser.

    3. Since the subpartitions are fixed, maintaining the API ordering can be done locally in the CU group. This means the synchronisation needed by ROV can be maintained locally between just four CUs.

    4. Exporting to the ROP AFAIU has the same API (re)ordering requirement as ROV, except that you can only write data out with a predefined atomic operation. As the CU group now maintains the API order for a fixed set of partitions in the screen space, perhaps ROPs can be tied to the same mapping, and moved into a CU group too. With a mesh NoC, it would not be a problem if the partitioning is configured with proximity to memory controllers in mind (?).

    5. Then the pixel exporting process would effectively become no different from ROV, except that ROV reads and writes like a UAV in the protected section, while pixel exporting is just writing stuff to the local ROP.

    6. Since there is already ordered sync at CU group level, perhaps a workgroup (that needs only unordered sync) can also be allowed to span over CUs within the same group. That would require the LDSes moving a bit further from the CUs though, and not as pipelined as it was.

    So in the end, say with 64CU Greenland, it may have 16 CU groups, each of which has 4 CUs, 4 Color ROPs, 16 Z/Stencil ROPs, 128KB L2 Cache and 256KB shared LDS.
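    For illustration, a small Python sketch of the static Z-order ownership assumed in (1) and (2): interleave the x/y bits of a tile address into a Morton index and derive fixed owners from it. The tile size and the rasteriser/group counts are invented.

    ```python
    # Z-order (Morton) mapping of screen tiles to fixed owners.
    def morton(x, y, bits=16):
        """Interleave the bits of x and y into a Z-order index."""
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
        return z

    def owners(px, py, tile=8, n_rast=4, n_groups=4):
        """Map a pixel to a fixed (rasteriser, CU group) pair."""
        z = morton(px // tile, py // tile)     # Z-order index of the tile
        rasteriser = z % n_rast                # global interleave (step 1)
        cu_group = (z // n_rast) % n_groups    # fixed sub-partition (step 2)
        return rasteriser, cu_group

    # Neighbouring tiles land on different owners, so nearby screen work
    # spreads out while each tile's owner never changes:
    for px in (0, 8, 16, 24):
        print((px, 0), "->", owners(px, 0))
    ```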
     
    #38 pTmdfx, Dec 26, 2016
    Last edited: Dec 26, 2016
    Malo likes this.
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,125
    Likes Received:
    2,883
    Location:
    Well within 3d
    Just to point out for future clarification, coherence wouldn't change the problem that OIT solutions seek to solve: wavefronts reaching their export phase for transparent data at varying times based on the dynamic behavior of the CUs involved, leading to inconsistent final output. Having multiple threads spread across different cores in a coherent multicore CPU would have similar problems in the absence of any additional synchronization.
    Raster-order views inject a stall if it is determined that a pixel in a ROP tile is currently being worked on by a primitive earlier in submission order.
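    A toy model of that stall, assuming a per-tile record of in-flight primitive IDs in submission order (the mechanism and names are illustrative, not the actual PixelSync implementation):

    ```python
    # Each ROP tile keeps the set of primitive IDs (submission order)
    # still in flight for it; a later primitive may enter its ordered
    # section only when no earlier one is still working on that tile.
    in_flight = {}      # tile -> set of primitive IDs currently shading it

    def try_enter(tile, prim_id):
        """ROV-style gate: stall if an earlier primitive holds the tile."""
        live = in_flight.setdefault(tile, set())
        if any(p < prim_id for p in live):
            return False                   # stall: earlier prim not retired
        live.add(prim_id)
        return True

    def retire(tile, prim_id):
        in_flight[tile].discard(prim_id)

    assert try_enter(tile=(2, 2), prim_id=7)       # first prim proceeds
    assert not try_enter(tile=(2, 2), prim_id=9)   # later prim must wait
    retire(tile=(2, 2), prim_id=7)
    assert try_enter(tile=(2, 2), prim_id=9)       # now it can enter
    ```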

    I'm having some difficulty parsing this, and what parts are different or the same as before.
    The programmable CUs are the first stage already, if the definition of the second stage for the ROPs is basic blending and data manipulation.
    Having the working cache compressed for the second stage would be different, but it would likely make parallel work significantly more painful to achieve.
    I don't follow how QR is supposed to be conditional on the CUs doing something strange. It's either being used or it isn't. It's a little late to use it if it was not used and then we realize something strange happened. That would sound like some kind of very optimistic speculation with an intractable fallback case.

    As noted in that paper, ~2048 bits of interface per core was prohibitive in terms of die area lost to pad space (the paper's projections are actually optimistic relative to the bump pitches we have currently).
    In that regard, could you clarify what you mean by having most of the interface disabled? These are physical objects with non-zero area, and that area doesn't become available if they are turned off.

    Currently, 2D screen space is partitioned into rectangles and the rasterizers each get a subset, which helps match up with the way the address space is striped to allow optimal utilization of the DRAM channels. A rasterizer can handle one primitive per clock and 16 pixels are rasterized per clock, so it takes four clocks to fill the 16 quads of a 64-lane wavefront.

    Is this static ownership, or some kind of dynamically balanced ownership? There is a decent amount of static partitioning based on a shader engine having a rasterizer with a static mapping, but the programmable resources are more flexible within that subset to handle variability in demand.

    Export is granted to whichever wavefront is able to negotiate an export buffer over the export bus at the end of a thread group's life span. ROV has a more global view, hence why PixelSync stalls a group until earlier ones are able to finish their export. Without it, it's a race like any other concurrent-access situation. At least with opaque geometry the constraint is what is closest, rather than the order of what was overdrawn/culled.
     
  20. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    It does not necessarily work at the granularity of pixels though. At least in theory, the post-rasteriser wavefront packing can be batched and sorted, which ensures overlapping pixels get packed into separate wavefronts with larger IDs (that is, in submission order).
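    A sketch of that packing in Python: fragments arrive in submission order and are placed greedily into the earliest wavefront that has no fragment for the same pixel, so overlapping pixels land in wavefronts with strictly increasing IDs. The lane count and data layout are invented.

    ```python
    # Conflict-free wavefront packing: no wavefront ever holds two
    # fragments covering the same pixel, so ordered export only has to
    # respect wavefront IDs.
    def pack_wavefronts(fragments, lanes=64):
        """fragments: list of (prim_id, pixel) in submission order.
        Returns a list of wavefronts, each a list of fragments."""
        waves = []                           # wavefront i has ID i
        for frag in fragments:
            _, pixel = frag
            for wave in waves:               # earliest wave without conflict
                if len(wave) < lanes and all(p != pixel for _, p in wave):
                    wave.append(frag)
                    break
            else:
                waves.append([frag])         # overlap (or full): new wave
        return waves

    frags = [(0, (1, 1)), (0, (1, 2)), (1, (1, 1)), (2, (1, 1))]
    for wid, wave in enumerate(pack_wavefronts(frags)):
        print("wavefront", wid, wave)
    # Pixel (1, 1) ends up in wavefronts 0, 1 and 2, in submission order.
    ```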

    Static ownership. The global 2D screen space is already interleaved in Z order between the rasterisers. So the sub-partitioning ideally would be interleaving in Z order within the local 2D space of the global partitions.

    Yes, but then, at least for alpha blending, AFAIU the pixel export process would have to reorder out-of-order exports back into submission order for stable and consistent results. In this sense, even if blending opaque fragments does not require ordering but only a depth check, the exporting process as a whole would still have an ordering requirement that can share a synchronization mechanic with ROV.

    Moreover, the idea is that with static partitioning and mapping of the screen space, the synchronization can be localised.
     
    #40 pTmdfx, Dec 27, 2016
    Last edited by a moderator: Dec 27, 2016