AMD Mantle API [updating]

Discussion in 'Rendering Technology and APIs' started by MarkoIt, Sep 26, 2013.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    GDS is exposed by an extension in OpenCL on GCN. I can't think of a reason why Mantle wouldn't too.

    Why do you come to the conclusion that "Mantle only handles CPU side command submission"?

    Putting queued data in RAM versus on-die storage amounts to a huge difference in throughput. It's the difference between a GPU implementing an alternative rendering pipeline and a demonstration that such an alternative pipeline is possible. The latter is only interesting for research purposes if you have access to the former.

    It's like the early days of GPU compute, in which OpenGL pixel shaders were wrangled into doing something that wasn't pixel shading. It works, yes. But it's not the same as full-blooded compute.

    I don't understand why it's tricky - it's a block of memory with a different policy (compared with L2 cache, say). And until WDDM/D3D catches up and embraces such memory, there's no need.

    Mantle, at the very least, brings back some of the excitement that Larrabee engendered. I'm still mad at Intel for killing that off.
     
  2. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    Can you remind me what that extension is? Last time I asked, the only things I heard were that certain kinds of atomics were placed in GDS, and I'd expect the OpenCL 2.0 pipes would also be placed in GDS. But that's not really exposing GDS - both of those things are very restrictive uses of GDS, so I don't think they count.

    The issue is that really using GDS fully would require a global barrier in OpenCL. That's a far bigger extension, and I haven't seen that either.

    Am I overlooking something?
     
  3. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    AFAIK, GDS on OpenCL exposes only counters, not block RAM.
    Primarily from the optics of the reveal. They highlighted that they can get more draw calls and true parallel command submission. It does not appear that they will be exposing GPU-side stuff, at least in the beginning. Or I am reading too much into it.

    I guess the tricky points come when the GPU becomes a system-level co-processor. How do you handle multiple kernels using GDS when some of them might be context-switched out at any time? What if you were doing producer-consumer and one of them is context-switched? I am not sure what the right way of handling such situations would be.
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Those.

    Unless GDS expands into the MB range, tens of KB really can't do much when you have thousands of work items in flight at any point in time: 4 or 8 bytes of globally shared on-die memory per work item isn't much use for a pipeline buffer (you'll only get indices to off-die data or very compact structures in the queue - no good if a workgroup produces 100KB of data).
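    The per-work-item arithmetic here can be checked in a couple of lines - a back-of-envelope sketch in which the 64 KB GDS size matches GCN, but the number of work items in flight is an assumed round figure (real occupancy varies by chip and kernel):

    ```python
    # Why tens of KB of GDS is tiny per work item: a fully occupied
    # GCN GPU keeps tens of thousands of work items in flight.
    GDS_BYTES = 64 * 1024             # GDS size on GCN (64 KB)
    work_items_in_flight = 16 * 1024  # assumed figure; varies by chip/kernel

    bytes_per_item = GDS_BYTES // work_items_in_flight
    print(bytes_per_item)  # 4 bytes of GDS per work item
    ```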

    Which is why I'm alluding to GDS plus MB of on-die memory as being the killer feature of XBone.

    With producer-consumer you don't use global barriers. I think taking the G, "global", in GDS literally (as a global analogue of LDS) is starting off on the wrong foot. e.g. if you have an algorithm that depends on an intermediate kernel that does pruning on a data-structure produced by one kernel and consumed by another, you run these kernels simultaneously and use GDS atomics to manage the queue and get performance from on-die storage of the intermediate data. (Tis a pity that this doesn't work on current GCN cards, but there it is.)

    GPU workload management (amongst three kernels, say) is then driven by queue-derived metrics.

    I have exactly this problem on a pair of kernels I'm working on right now - my intermediate data is too large for LS, but an on-die queue of even just a MB would be perfect with these kernels working as producer-consumer. Especially as the producer has a throughput of 1/100th of the consumer, roughly. But I would also need each work item to be able to read/write its blob within the queue. You could argue that last point is exactly why we need a multi-MB GDS. The problem is you then lose fast global atomics (requiring a trade-off of banking versus logic to perform the gamut of atomic operations on multiple MB of data). So an architecture of GDS + large on-die RAM is the sweet spot. Exactly what we see in XBone.
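    The queue-management idea sketched above - atomics in fast on-die storage hand out slot indices, while the payload lives in a separate buffer - can be illustrated with a minimal single-threaded simulation. This is plain Python standing in for GPU code; the `AtomicCounter` class, the 8-slot queue size, and the omission of backpressure are all illustrative assumptions:

    ```python
    # Toy model of a producer-consumer ring buffer managed by
    # fetch-and-add counters, in the spirit of GDS atomics: the counters
    # (which would sit in on-die storage) only reserve slot indices;
    # the payload lives in a separate backing buffer.

    QUEUE_SLOTS = 8  # stand-in for a small on-die queue

    class AtomicCounter:
        """Models an atomic fetch-and-add on a GDS counter."""
        def __init__(self):
            self.value = 0
        def fetch_add(self, n=1):
            old = self.value
            self.value += n
            return old

    head = AtomicCounter()   # producer reserves slots here
    tail = AtomicCounter()   # consumer claims slots here
    ring = [None] * QUEUE_SLOTS

    def produce(item):
        slot = head.fetch_add() % QUEUE_SLOTS
        ring[slot] = item    # payload write into the queue slot
        return slot

    def consume():
        slot = tail.fetch_add() % QUEUE_SLOTS
        item, ring[slot] = ring[slot], None
        return item

    # The producer runs ahead of the consumer with no global barrier;
    # real code would also enforce head - tail < QUEUE_SLOTS (backpressure),
    # omitted here for brevity.
    for i in range(4):
        produce(i * 100)
    drained = [consume() for _ in range(4)]
    print(drained)  # [0, 100, 200, 300]
    ```

    The point of the sketch is that no barrier is needed: ordering between the two kernels falls out of the slot indices the counters hand back.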

    Honestly I'm not sure what kind of throughput-centric algorithms could make meaningful use of global barriers coupled with a small GDS (as it currently is) that wouldn't be better as producer-consumer (though small GDS + large on-die memory is where the fun starts). What am I missing?

    If there was a way for GCN on desktop to lock L2 cache lines, to provide a fixed, large, on-die memory, that would be cool. It wouldn't be multi-MB, but it would be a start.

    I'm intrigued by the problem you're considering where 64KB of GDS with support for a global barrier would be an effective solution. A single multi-modal (producer-consumer) persistent kernel dealing in small primitives is the only thing I can think of with current architectures. L2 caching on Kepler should give you something as fast as GDS - I honestly don't know what kind of performance GCN L2 cache would give you in this scenario.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Sorry, I took your original statement to mean something incorrect, I suspect: as a general dismissal of intra-GPU capabilities rather than solely a reference to spawning work.

    Well I dare say, that's what L2 (ultimately backed by off die memory) is for. You could stripe GDS across the L2 partitions to maximise responsiveness during a swap (?). The GPU can track which kernels have GDS instructions (or which pages of kernel instructions have GDS instructions), so it can pre-swap GDS<->L2 as kernels switch. If multiple live kernels are GDS heavy, well, you might find your algorithm works as well using global atomics rather than GDS atomics.

    If you're referring to multiple-contexts sharing a large on-die memory, you have the same. Only paged. And slow if heavily contended.

     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The bandwidth provided by the eSRAM is why I didn't put the Xbox One at the level of Cape Verde, which is where its other resource counts almost put it.
    Microsoft is pretty keen on telling the world how this architecture allows it to reach parity with the PS4.

    For most of the workload the consoles are tasked with, the 290X is going to be very, very good at it.
    For things that it isn't good at, a gaming rig will have a CPU that will outclass the console CPUs, and depending on the particular task can outclass the resources they will be able to devote to compute.

    That does leave a subset of tasks that could hop the PCIe bus, if this isn't an APU+dGPU system that AMD hopes will be more common.

    I'm not entirely sure where GPU compute fits in the picture for Mantle, since the big marketing references were for things like draw calls.
    As far as Microsoft or Sony's reaction on compute, they had absolutely no reason to be surprised.
    Sea Islands introduced the expanded ACEs and their user-level compute queues. Bonaire at least internally has it, as does Kabini, and from the looks of things Kaveri has it as well.
    Neither console maker would have reason to think that AMD would add the hardware to its non-console chips and then never expose it.


    In terms of CPU, GPU, and disk I/O, a good rig has a lot of brute force and TDP to burn.
    Even if efficiencies on the consoles were massive and Mantle didn't come around to smooth cases where modern desktop hardware has trouble, that buys them maybe one upgrade cycle in PC terms, not 7-10 years as the console makers intend.
     
  7. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    The 290X's off-chip bandwidth exceeds the likely on-chip ESRAM bandwidth of Xbone by a fairly large margin. Now I'm all for big on-chip caches going forward, but anything the ESRAM can do can indeed be brute forced through GDDR on a high end GPU. It'll use more power, but that's not a large concern for discrete GPUs at the moment.

    If you're concerned about CPU->GPU transfers over PCI-E, fine, but it's tough to make arguments about Xbone's ESRAM being very useful for that considering the size of it and the fact that the GPU tends to run enough behind the CPU to make fine-grained interaction problematic. Haswell's cache is big enough that it might be more viable, but those sorts of usages still have yet to be proven.

    I don't think you're drawing a meaningful distinction between what the "portable" APIs do and what you're assuming (or know?) that Mantle does. To be clearer, the UMD in DirectX is already supposed to fulfill exactly the "minimal layer to encode command buffers" purpose that you describe. So if Mantle is going to do something much superior, what changes is it going to make to the programming model that will allow that UMD layer to execute much more efficiently than it does today?

    Basically I see a few "big ticket" opportunities:
    1) Multithreaded submit can obviously be made better than it is today by removing some stuff that makes it awkward (Map/Discard, creation/destruction of textures on the fly, etc). I imagine this is part of where their "9x" number comes from, but honestly if all you're doing is moving relatively slow code to more cores, that's not a particularly compelling story in the long term.
    2) State blocks can be made to match a particular architecture more directly. There are still cases where drivers have to handle non-1:1 mappings of state and in some cases state being baked into shader code that are not ideal and necessitate checking those special cases all over the place. Specializing this for one architecture definitely helps, but it makes it less portable of course...
    3) Move fine grained resource stuff (creation/destruction, hazard tracking, etc) out of the driver and into user space. This is likely the biggest potential for real improvement, but it's less an API issue than an OS issue. If you go down this road, you start to have to lie to the OS memory manager and that can have a variety of consequences to the user experience. i.e. ultimately this one needs to be solved in the OS too.
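    Point 1 - parallel recording with a cheap serialized submit - can be illustrated with a toy sketch. Python threads stand in for render threads here, and `CommandBuffer` and its `draw` command are invented for illustration, not any real Mantle or D3D interface:

    ```python
    import threading

    # Toy model of multithreaded command submission: each thread records
    # into its own command buffer with no shared driver state, so recording
    # parallelizes trivially; only the final submit is serialized.

    class CommandBuffer:
        def __init__(self):
            self.commands = []
        def draw(self, mesh_id):
            self.commands.append(("draw", mesh_id))

    def record(cmd_buf, meshes):
        for m in meshes:       # no locks needed: the buffer is thread-local
            cmd_buf.draw(m)

    buffers = [CommandBuffer() for _ in range(4)]
    threads = [threading.Thread(target=record,
                                args=(b, range(t * 10, t * 10 + 10)))
               for t, b in enumerate(buffers)]
    for t in threads: t.start()
    for t in threads: t.join()

    # Single-threaded "submit": just hands the pre-encoded buffers over.
    submitted = [cmd for b in buffers for cmd in b.commands]
    print(len(submitted))  # 40
    ```

    The caveat in the text still applies: this only moves recording work onto more cores; it does nothing for GPU-side cost.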

    These things are really only going to improve the CPU overhead. AMD hasn't really said that they expect GPU-side performance improvements so it's possible that they indeed do not expose additional features there, but a lot of people have been talking as if they expect that side to go faster too. So one of the two groups is wrong :) Furthermore, there's nothing really on the GPU side that can't be done with GL/DX extensions so that's hardly an argument for needing a new API.
     
    #87 Andrew Lauritzen, Sep 29, 2013
    Last edited by a moderator: Sep 29, 2013
  8. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    From whom? I'm just waiting until tomorrow morning, until I've enough sobriety to contribute to this conversation (in whatever tiny way I can, of course)! These guys' opinions are exactly what you and I are looking for, so don't knock it - just appreciate it, mate, and feel privileged that you're part of it :wink:
     
  9. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I'd much rather AMD beef up the L2 in front of the memory controllers and stack eDRAM behind them. Even the 290X, with its 512-bit bus, will only have 1MB of L2.

    At 20 nm (next year?), 8MB of SRAM should be about 16 mm2 - cheap for a 350mm2 chip. Going that way is much, much better than a SW-managed cache, imho. If we are going to have lots of SRAM, why not put it in as a cache?
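    The area estimate above is easy to sanity-check; the ~2 mm² per MB density figure is the one implied by the post itself, not a foundry number:

    ```python
    # Back-of-envelope check of the SRAM area claim, using the density
    # implied by the post (~2 mm^2 per MB of SRAM at 20 nm, incl. overhead).
    MM2_PER_MB = 2.0   # assumed density
    sram_mb = 8
    die_mm2 = 350

    sram_area = sram_mb * MM2_PER_MB
    die_fraction = round(100 * sram_area / die_mm2, 1)
    print(sram_area)     # 16.0 mm^2
    print(die_fraction)  # 4.6 (% of a 350 mm^2 die)
    ```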
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Then why not cut out the middleman, and just beef up the L2 in size and use a large Crystalwell, instead of a SW scratchpad?
     
  11. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Battlefield 4 and Frostbite 3 Will Support Both AMD Mantle and NVIDIA NVAPI APIs For PC Optimizations
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I'm talking about latency, not bandwidth. Think of this as a general purpose memory that, amongst other things, can do colour/z/stencil buffer caching, on a huge scale :cool:

    In traditional forward GPUs, the combination of batched pixels and on-die colour/z/stencil buffer cache allows the ROPs to keep up in high fill-/blend-rate scenarios. If these (read/modify/)write operations weren't block-cached on-die, then GPUs would need far far more off-die bandwidth and far more ROPs to maintain fillrates.

    In other words, ROPs are latency sensitive. It's just that pixel export needs only the tiniest amount of cache to make this operation comfortably fast.

    This is why I've been making the comparison with Crystalwell. It's bandwidth/latency in combination that opens up new algorithms.

    Until the second generation of games.

    This will have an adoption curve similar to how it took developers a while to get used to being forced to write multi-threaded code for PS3/XB360.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Not at the same latency it won't. XBone does the large, fully-programmable GDS that RecessionCone wants.

    In compute, LS (or Larrabee's L2 cache slices) provide a combination of latency and bandwidth that no amount of brute-force off-die bandwidth can replicate.

    XBone's on-die memory is global rather than local. It has an intermediate latency/bandwidth profile (compared with LS and GDDR5 in the 290X). It sits nicely in the middle, while the 290X, if it only has L2, won't have enough cache to make a damn difference.
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    This is why I'm mad at Intel for canning Larrabee for consumer graphics. Well, other things too.
     
  15. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,879
    Likes Received:
    5,330
    A couple of noob questions:
    GDS?
    Crystalwell?
     
  16. Sinistar

    Sinistar I LIVE
    Regular Subscriber

    Joined:
    Aug 11, 2004
    Messages:
    660
    Likes Received:
    74
    Location:
    Indiana
  17. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    I think this is how Nvidia enabled the specific MSAA support in Batman: Arkham Asylum. :???:
     
  18. repi

    Newcomer

    Joined:
    Dec 7, 2004
    Messages:
    203
    Likes Received:
    34
    Location:
    Sweden
    NVAPI is a small but good utility library to control display setup and get access to some 10.1 functionality on 10.0 devices. It is _not_ a graphics API. It is more comparable to AMD's AGS library, which we also use.
     
  19. Sinistar

    Sinistar I LIVE
    Regular Subscriber

    Joined:
    Aug 11, 2004
    Messages:
    660
    Likes Received:
    74
    Location:
    Indiana
    Weren't 10.1 features what Nvidia was paying developers to disable? So while paying UBI to disable 10.1, they were adding it through a proprietary API for their own cards.
     
    #99 Sinistar, Sep 29, 2013
    Last edited by a moderator: Sep 29, 2013
  20. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    9,044
    Likes Received:
    1,116
    Location:
    WI, USA
    The graphics API for the NVIDIA NV1 was NVAPI. I guess it has gone through a few revisions / transformations. ;)
     