AMD Mantle API [updating]

Discussion in 'Rendering Technology and APIs' started by MarkoIt, Sep 26, 2013.

  1. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    18,409
    Likes Received:
    8,843
    Which has happened before, with a company whose name I cannot remember opting to take less money from another company rather than accepting Microsoft's more generous offer.

    I believe the contract gave Microsoft the right of first refusal. Meaning that as long as Microsoft matched whatever offer was on the table, the company (or was it certain assets of Nvidia?) legally had to be sold to Microsoft. It's been a long arsed time since I paid attention to it, but that's what I recall.

    Regards,
    SB
     
  2. liquidboy

    Regular Newcomer

    Joined:
    Jan 16, 2013
    Messages:
    416
    Likes Received:
    77
    Impressive results from the Mantle/DX12 preliminary tests by Ryan ...

    I have a question, not sure if anyone here can answer it (due to NDA) ...

    I know that DX12 is tied to WDDM 2.0, which is tied to Windows 10 ..

    And as Ryan points out, WDDM 2.0 is an enabler of the performance gains that DX12 delivers (on the Windows 10 OS) ..

    Ryan also rightly points out that WDDM 2.0 is an enabler for the third-party graphics drivers; their performance is also tied to WDDM 2.0.

    So my question is: just as DX12's impressive results are tied to WDDM 2.0, is it fair to say that Mantle's impressive results are also tied to WDDM 2.0? Basically, without WDDM 2.0, would Mantle not achieve such impressive results?!

    If we ran these same tests on Windows 7 (WDDM 1.1) with Mantle, would it be able to achieve such impressive 143% - 300% improvements?!

    Similarly, on Windows 10, if it were still using WDDM 1.3, would Mantle still see these impressive 100+% results?!
     
  3. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    629
    Likes Received:
    1,131
    Location:
    PCIe x16_1
    I have an update from Oxide on this matter. The following has been appended to the article:

     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Interesting. I had seen discussion about what it would take to swamp the command processor with targeted benchmarks, but this may be the most prominent and complex software to manifest that limit so clearly.
    Does it seem reasonable to read into that statement that Mantle would be even closer to DX12 if it weren't for that API-specific optimization?

    Nvidia's path has better submission latency and also doesn't bottleneck as readily on the CPU, which wouldn't be out of line with the historical trend that AMD's GPUs become limited somewhere in the driver and front end more quickly.
     
  5. liquidboy

    Regular Newcomer

    Joined:
    Jan 16, 2013
    Messages:
    416
    Likes Received:
    77
    The way I interpret that "update" to the article ...

    Once the CPU bottleneck is removed, a new bottleneck emerges: the "GPU command processor" ...

    So how does Mantle/DirectX12 overcome this bottleneck?

    1. Mantle - introduced a "driver" optimization called "OptimizeSmallBatch" that combines many small batches, all in a second pass on the CPU, before submitting to the GPU (a rough sketch of such coalescing follows at the end of this post) ...
    2. DirectX12 - doesn't sound like it did anything to optimize for "small batches", hence why the "Mantle path holds a slight performance edge over the DX12 path on our AMD cards"

    So, totally speculating here, BUT another approach to overcoming this "GPU command processor" bottleneck is to "add more GPU command processors" ... keep the optimization out of the "driver", keeping it thin/light ... just fix the problem in HW ...
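    In that same speculative vein, here's a minimal sketch of what a CPU-side second pass over the command stream could look like, merging runs of tiny draws that share identical state into one bigger batch before submission. Everything here (Draw, stateHash, the threshold) is illustrative; it is not Mantle's or Oxide's actual interface.

    Code:
    #include <cstdint>
    #include <vector>

    struct Draw {
        uint64_t stateHash;    // pipeline + bindings; only identical-state draws can merge
        uint32_t firstVertex;
        uint32_t vertexCount;
    };

    // Merge runs of contiguous, identically-bound small draws into one batch,
    // so the GPU command processor sees fewer, larger commands.
    std::vector<Draw> CoalesceSmallBatches(const std::vector<Draw>& in,
                                           uint32_t smallThreshold /* e.g. 128 */) {
        std::vector<Draw> out;
        for (const Draw& d : in) {
            if (!out.empty()) {
                Draw& prev = out.back();
                if (prev.stateHash == d.stateHash &&
                    prev.firstVertex + prev.vertexCount == d.firstVertex &&
                    (prev.vertexCount < smallThreshold || d.vertexCount < smallThreshold)) {
                    prev.vertexCount += d.vertexCount;  // extend the previous batch
                    continue;
                }
            }
            out.push_back(d);
        }
        return out;
    }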
     
    #1985 liquidboy, Feb 7, 2015
    Last edited: Feb 7, 2015
    mosen likes this.
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I think Mantle did not introduce the batching optimization; rather, Oxide's Mantle path has an optional software optimization that coalesces very small batches.

    The small batches problem is a giant reason why AMD used Oxide as a marketing tool for Mantle.
    It just turns out that once things are opened up, while much better than DX11, AMD is not the best at the use case it championed.

    I think it's somewhat ironic that Oxide, which so stridently opposed such optimizations that it made an engine that is almost pathologically sub-optimal, ended up babying GCN (just a little) by compromising its own code of ultimate developer freedom.
     
  7. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    629
    Likes Received:
    1,131
    Location:
    PCIe x16_1
    Correct. 290X performance dips from 45fps to 39fps if you disable the batch optimization. But submit times also drop, from 9ms to 4.4ms, which is comparable to DX12.

    http://images.anandtech.com/graphs/graph8962/71461.png
    http://images.anandtech.com/graphs/graph8962/71462.png

    Also correct. The batch optimization feature is built into the application. The driver is just a thin layer that has little control; this is what low-level APIs are all about.
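    Putting those numbers in frame-time terms (my arithmetic, assuming the submit work overlaps with GPU execution): 45fps is roughly 22.2ms per frame versus roughly 25.6ms at 39fps, so the coalescing pass buys back about 3.4ms of frame time at the cost of about 4.6ms of extra CPU-side submit time.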
     
    #1987 Ryan Smith, Feb 7, 2015
    Last edited: Feb 7, 2015
    iroboto and mosen like this.
  8. liquidboy

    Regular Newcomer

    Joined:
    Jan 16, 2013
    Messages:
    416
    Likes Received:
    77
    Thanks for the clarification; my mistake, the optimization was indeed done by the engine devs ... keeping the low-level APIs thin ...

    It's great that an engine can do these optimizations as it sees fit and not be dictated to by the driver ...

    My speculation still stands: ultimately it would be great if this "command processor" bottleneck didn't exist in the first place ... BUT if it didn't, then the bottleneck would just move to another area, etc. etc.
     
  9. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    13,049
    Likes Received:
    3,860
    The bottleneck would move to another area, but then they can add more tasks to your CPU, improving another part of the game.

    Serious PC gamers have to be at the point where we have at least quad-core CPUs, with Intel now making hex- and octa-cores affordable.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Any engine can do this, on any API. This is a very watered-down version of the batching and coalescing done by engines or artists operating under standard APIs. This is a smaller instance of the small batch problem that was one of the bugbears Mantle (and its PR centerpiece, Oxide) was meant to defeat.
    One of those small batches was a petard, and Mantle was slightly hoisted by it.
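    For a concrete standard-API example of that kind of batching, think of instancing under D3D11: one call replaces hundreds of tiny draws. A hedged sketch (the scenario and the DrawRocks function are invented; the two D3D11 calls are the stock ones):

    Code:
    #include <d3d11.h>

    // Assumes the pipeline, index buffer, and a per-instance transform buffer
    // are already bound.
    void DrawRocks(ID3D11DeviceContext* ctx, UINT indicesPerRock, UINT rockCount) {
        // Naive path: rockCount tiny batches, one draw call each.
        // for (UINT i = 0; i < rockCount; ++i)
        //     ctx->DrawIndexed(indicesPerRock, 0, 0);

        // Batched path: a single instanced draw, so the per-draw cost is paid once.
        ctx->DrawIndexedInstanced(indicesPerRock, rockCount, 0, 0, 0);
    }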


    There's going to be a bottleneck somewhere, unless the whole system is perfectly balanced; then the whole thing is the limiter.
    In reality, it takes a fair amount of work to consistently bottleneck at the command processor, particularly when it's hidden behind a thick API.

    The command processor is an unglamorous piece of front-end infrastructure that wasn't elaborated on much when GCN came out. It's not sexy, and that unsexy domain hasn't had many notable changes publicized.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,499
    Likes Received:
    1,858
    Location:
    London
    What makes you think NVidia hardware doesn't have the same problem?

    There's no Mantle path on NVidia, so there's no opportunity for the developer to optimise for the smallest batches.

    How do you know NVidia isn't coalescing in its driver?

    Except in XB One, apparently...
     
  12. Osamar

    Newcomer

    Joined:
    Sep 19, 2006
    Messages:
    218
    Likes Received:
    40
    Location:
    40,00ºN - 00,00ºE
    Excuse me, Ryan.
    If I am not wrong, the GTX 980 can see all the system memory while the R9 290X sees "just" 8 GB. Could you tell us what happens with the rest of the cards?
     
  13. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,538
    Likes Received:
    962
    Based on the submission times, if NVIDIA's driver is coalescing batches, it's doing it very, very fast.
     
  14. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    629
    Likes Received:
    1,131
    Location:
    PCIe x16_1
    Each card is different. GTX 680 was 17.9GB, and 285/260 were 5.7GB. AMD seems to allocate 4GB of virtual RAM on top of the physical VRAM, whereas NV allocates 16GB.
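    If I have the physical VRAM sizes right, the reported pools line up as physical VRAM plus that per-vendor virtual allotment: GTX 680, 2GB + 16GB ≈ the 17.9GB reported; R9 285, 2GB + 4GB ≈ the 5.7GB reported; and 290X, 4GB + 4GB = 8GB.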
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Presumably, there is a limit for the processing rate of Nvidia's command processor. The 980 has sub-Mantle submission times and does not bottleneck at two cores, so it is at least higher than what Oxide can show.
    Nvidia's GPU clocks in the latest generation are higher, which could give a linear boost over the 290X. There could be architectural differences in the RISC cores or the surrounding hardware in the command processor block between the two architectures.

    I don't, but if it is, it is doing so while maintaining submission times that are as good as or better than AMD's DX12 path. Going by Ryan's numbers (9ms with the coalescing pass, 4.4ms without), roughly half of the submission latency for Mantle appears to be attributable to coalescing.
    Something at a hardware or software level for Nvidia would have to be an order of magnitude better to obscure that, which seems unlikely.

    Do you mean the dual command processors that, as of the last known revelations, only allow one to be used by the developer, or the unspecified improvements alluded to by Microsoft, in a generation where IP found in both consoles is embedded in Bonaire and Hawaii?

    Carrizo is the generation of GCN that might promise a bigger change in the front end, since it is promising preemption of the graphics context the command processor manages. I do not think that alone would help this problem, but it may be a sign that the architecture has gone through more design refinement in an area that until recently did not need to reach for peak performance due to being hidden by a thick API.
     
  16. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,347
    Likes Received:
    4,770
    What would happen if a game needed more memory?
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,499
    Likes Received:
    1,858
    Location:
    London
    Agreed.

    Cannot be asserted, because you don't know whether the driver is coalescing.

    Agreed.

    I don't know what RISC cores you're referring to. I don't know the architecture of Maxwell's command processing. Is there parallelism there for command processing?

    Are we seeing a combination of a parallelised command processor + coalescing?

    I have no idea how you make the leap to an order of magnitude.

    Also, remember that NVidia's driver has perfect information about the GPU's state at any time, including the operation of the command processor. Oxide's engine doesn't have command processor register/pipeline-load information or a detailed CP performance model under Mantle, hence the simplistic "coalesce off/on" switch.

    "only allows one to be used by the developer". Let's see if there really is a second processor and if it's any use for games... It's why I used the word "apparently".

    I'm finding that OpenCL kernels (in a variety of applications, some I've coded) running in the background under Catalyst Omega have a severe impact on desktop responsiveness. Older drivers (multiple) had significantly less impact and were no worse in performance (and in some cases better).

    So it seems to me that AMD's recently moved in the wrong direction with regard to driver interaction with GPU command processing. I'm not making excuses, merely pointing out that there's plenty of room to manoeuvre.

    It would be interesting to compare CPU load in a D3D12 benchmark like this where the GPU framerate is the same on both AMD and NVidia (with a locked 20fps, say): the application's draw call rate is then known, and we can observe whether one driver is doing more work on the CPU.
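    Something like this minimal timing harness would do it (SubmitFixedDrawCallWorkload is a hypothetical stand-in for issuing the same N draw calls through whichever API/driver is under test):

    Code:
    #include <chrono>
    #include <cstdio>
    #include <thread>

    static void SubmitFixedDrawCallWorkload() {
        // Hypothetical: build and submit the same N draw calls every frame.
    }

    int main() {
        using clock = std::chrono::steady_clock;
        const auto frameBudget = std::chrono::milliseconds(50); // locked 20fps
        for (int frame = 0; frame < 600; ++frame) {
            const auto start = clock::now();
            SubmitFixedDrawCallWorkload();
            const long long busyUs = std::chrono::duration_cast<std::chrono::microseconds>(
                clock::now() - start).count();
            std::printf("frame %d: %lld us of CPU submission work\n", frame, busyUs);
            std::this_thread::sleep_until(start + frameBudget); // hold the frame cap
        }
        return 0;
    }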
     
  18. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    Indeed. Draw calls that are very small are a bad idea because there are other bottlenecks right behind the command processor. Partial wavefronts on AMD hardware, for one. Probably partial warps on Nvidia too.
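    For a rough sense of the cost (illustrative numbers): a draw that produces only 20 pixel-shader invocations still occupies a full 64-wide wavefront on GCN, leaving 44 of 64 lanes (about 69%) idle, while the same draw on a 32-wide Nvidia warp wastes 12 of 32 lanes (about 38%).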
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Internally, the front ends run microcode on custom RISC or RISC-like processors. This was stated as such in a few AMD presentations back in the VLIW days, and it is also mentioned here: http://www.beyond3d.com/content/reviews/52/7.
    Later resources like the Xbox SDK discuss a microcode engine, but they do mention a prefetch processor that feeds into it. How new that is, I'm not sure.
    There are vendors for inexpensive licensable and customizable RISC cores for this sort of thing. The other command front ends likely have something similar, possibly with feature support for the graphics-specific functionality stripped out.
    Various custom units probably have them as well, like the DMA engines. UVD and TrueAudio are the most obvious examples of licensed Tensilica cores, though I forget who is the go-to company for the sort of cores used in the command processors.
    It looks like the processor reads in a queue command, pulls in the necessary program from microcode, runs it, then fetches the next command. Within a queue, this seems pretty serial. There is parallelism for compute, at least between queues, but not within one.
    The ACEs are able to be freely used in parallel, and those are a subset of command processing. I guess the more complex graphics state hinders doing the same for graphics, although it seems AMD has been moving towards tidying it up if preemption is coming.
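    To make that serial model concrete, here is a minimal sketch of the fetch/run loop as software (PacketHeader and RunMicrocodeRoutine are illustrative stand-ins, not real hardware or driver interfaces):

    Code:
    #include <cstdint>
    #include <queue>

    struct PacketHeader { uint32_t opcode; uint32_t payloadWords; };

    // Stand-in for dispatching into the microcode routine for this opcode.
    void RunMicrocodeRoutine(uint32_t opcode) { /* set state, kick off work... */ }

    void CommandProcessorLoop(std::queue<PacketHeader>& ring) {
        // Serial within one queue: a packet's routine must finish before the
        // next packet is fetched, which is why many tiny packets let the
        // per-packet overhead dominate.
        while (!ring.empty()) {
            PacketHeader pkt = ring.front(); // read the next queue command
            ring.pop();
            RunMicrocodeRoutine(pkt.opcode); // pull in and run its microcode program
        }                                    // then fetch the next command
    }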

    Nvidia's solution was not so plainly described as such, but it has to do the same things, so there's some kind of simple processor hiding inside of the front end. AMD and Nvidia at this time do make non-RISC cores, but simple cores have sufficed until recently.

    I admit that was probably overestimating it for rhetorical effect.
    The 980's submission time is 3.3 ms, while the 290X is 4.8 and 9.0 for DX12 and Mantle, respectively.
    That puts Oxide's coalescing work at around 4.2ms (the 9.0ms Mantle submit minus the 4.8ms DX12 submit). If Nvidia's driver were coalescing, that work would somehow be fitting into a fraction of the 980's 3.3ms, and a large fraction of that 3.3ms is going to be devoted to the actual submission process rather than analyzing batch contents.
    Either Nvidia is very good at doing what Oxide is doing, or its inherent overhead is low enough to give that much leeway for the analysis, or some combination of the two.

    I'm pretty sure there is. Whether it can be readily exposed to games, I'm not sure. It would have benefits in a system that is split into two guest VMs.

    It is unfortunate that this is still such an acute problem, given how important freely mixing different workloads is to AMD's vision of the future.

    If the AI could be ramped, it could also be approached from the direction of loading the CPU until the frame rate suffers. It doesn't seem like anything happens with all the core cycles that get freed up.
     
    liquidboy likes this.
  20. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    629
    Likes Received:
    1,131
    Location:
    PCIe x16_1
    It's up to the application. It will do whatever its programmers told it to do if it runs out of memory (keep in mind that the application knows how much memory is available beforehand, so it's never a surprise). I have to imagine if you're absolutely out of memory after 2+4GB, it's probably not in your best interest to keep going.
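    For what it's worth, on the DX12/Windows 10 side that "knows beforehand" part is plain DXGI 1.4; a minimal sketch of querying the budget (standard DXGI calls, nothing application-specific):

    Code:
    #include <dxgi1_4.h>
    #include <cstdio>

    void PrintMemoryBudget(IDXGIAdapter3* adapter) {
        DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
        // LOCAL = physical VRAM; NON_LOCAL = GPU-visible system memory.
        if (SUCCEEDED(adapter->QueryVideoMemoryInfo(
                0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info))) {
            std::printf("budget: %llu MB, in use: %llu MB\n",
                        static_cast<unsigned long long>(info.Budget >> 20),
                        static_cast<unsigned long long>(info.CurrentUsage >> 20));
        }
    }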
     