nVidia's Island DirectX 11 Demo runs slowly on AMD GPUs

Discussion in 'Architecture and Products' started by A1xLLcqAgt0qc2RyMz0y, Mar 30, 2010.

  1. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    985
    Likes Received:
    277
    http://www.brightsideofnews.com/news/2010/3/30/nvidias-island-directx-11-demo-works-on-amd-gpus.aspx

    Theo's analysis seems to be wrong.

    If you program to DX11 and DirectCompute, isn't that vendor neutral?
    Better hardware would render faster. A 5870 should be faster than a 5830.

    So if Fermi renders faster, shouldn't that be because Fermi has better hardware for DirectCompute and tessellation?
     
  2. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    The API being vendor neutral doesn't mean every app treats all HW equally. You can tune your algorithm to run better on one platform or another, just like with CPU code.
     
  3. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    True, but it could be an issue of developer productivity. It could be that Fermi's caches and architecture permit simpler, more straightforward code, or 'sloppier', less-frugal code. I'm more productive if I can take a traditional algorithm and run it on an out-of-order CPU with a traditional cache, vs an in-order CPU with a manually managed local store, for example.

    And sometimes, it might even be an issue where "tuning" doesn't work, and you need a separate path. Pre-G8x Nvidia GPUs had terrible dynamic branching behavior, so no amount of tweaking inputs or registers would solve the problem.

    Of course, it could also be a matter of not leveraging what AMD is good at, and coding for their memory architecture and atomics. That's not really tuning tho, as it forces the developer to maintain two separate paths, and there's still the possibility that even with custom paths, Fermi is just better suited to physics workloads.

    I'll note that I don't like the bifurcation that's happening because of the different memory/cache architectures of Fermi vs Cypress, but I can't criticize NVidia's decision to amp up the caches, since it appears well within their rights to implement the DX Compute/OpenCL spec with an architecture like this. The divergence is regrettable, as it will add pain for developers.
     
  4. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Or it could be something as simple as vectorizing your code when possible. You wouldn't use x87 when SSE was appropriate on a CPU.
    The OP was referring to DX11 and DirectCompute. Plenty of ways to take these vendor-neutral APIs and create workloads that favor one architecture over another, particularly with DirectCompute.
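As an illustration of the point about vectorizing when possible, here is a CPU-side Python/NumPy sketch (an analogy only, not the demo's actual DirectCompute code): the same simple wave-height update can be written as a scalar per-cell loop or over whole arrays at once, and the array form is what maps onto SIMD hardware the way SSE does versus x87.

```python
import numpy as np

def wave_step_scalar(h, v, dt=0.1, c=0.5):
    """Naive scalar update of a 1D wave height field, one cell at a time."""
    h, v = h.copy(), v.copy()
    n = len(h)
    for i in range(1, n - 1):
        # discrete Laplacian drives the velocity
        v[i] += c * (h[i - 1] - 2 * h[i] + h[i + 1]) * dt
    for i in range(1, n - 1):
        # velocity drives the height
        h[i] += v[i] * dt
    return h, v

def wave_step_vectorized(h, v, dt=0.1, c=0.5):
    """Same update expressed over whole arrays (the SIMD analogue)."""
    h, v = h.copy(), v.copy()
    v[1:-1] += c * (h[:-2] - 2 * h[1:-1] + h[2:]) * dt
    h[1:-1] += v[1:-1] * dt
    return h, v

# Both forms produce identical results; only the expression differs.
h0 = np.sin(np.linspace(0, np.pi, 64))
v0 = np.zeros_like(h0)
hs, vs = wave_step_scalar(h0, v0)
hv, vv = wave_step_vectorized(h0, v0)
assert np.allclose(hs, hv) and np.allclose(vs, vv)
```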
     
  5. Sontin

    Banned

    Joined:
    Dec 9, 2009
    Messages:
    399
    Likes Received:
    0
    Maybe GF100 has the better Tessellation implementation - and the water demo is nVidia's Tessellation showcase.
     
  6. Mat3

    Newcomer

    Joined:
    Nov 15, 2005
    Messages:
    163
    Likes Received:
    8
    Isn't the whole point of tessellation to reduce bandwidth by creating the extra details on chip? What's being spilled out to memory? Not the extra triangles...
     
  7. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Depends on the situation. In the case of something like the water simulation, I'd be inclined to agree with you, but as you know, many compilers and VMs offer auto-vectorization features, and C programmers don't automatically write code to leverage x87 vs SSE in every situation; I see lots of math-related code written in a scalar fashion. My point is, if Fermi has better performance on non-vectorized code, and if the non-vectorized code performs on par with the hand-vectorized code, then Fermi presents a net win for the developer, since he can obtain equal performance for less development effort.

    I'm in 100% agreement, and I don't see the situation changing anytime soon with respect to the current tools and APIs devs have to work with, and I don't blame either IHV for this; it's just a hard problem to create an abstract platform that can fully leverage divergent architectures. However, I would like to point out that there is an 'ease of development' story here, just as there is with a single-threaded OoO x86 with cache vs an in-order CPU with a local store (e.g. Cell). There are obviously things that HW can do to make devs' and tool vendors' jobs easier. Two platforms with equal sustained performance can still impose quite different costs on devs.
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    How do the Ladybug and Mecha D3D11 demos run on GTX480?

    Regardless, I think that's the nicest GPU water I've ever seen :cool:

    Jawed
     
  9. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Actually, pretty good (we were planning to do diagrams on them too, but they somehow got lost). But then, they do not over-emphasize stuff where GF100 is way ahead of Cypress; they are normal technology showcases for general DX11 techniques. You could argue, of course, that tessellation is also a general DX11 technique.
     
  10. cho

    cho
    Regular

    Joined:
    Feb 9, 2002
    Messages:
    416
    Likes Received:
    2
  11. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yes, this definitely can't be emphasized enough, as I noted in the Fermi thread. Performance portability is going to be less and less common moving forward. Already an increasingly large number of constants need to be tweaked for various architectures, and Fermi and Cypress have some fundamental architectural differences that seem to affect even which algorithms you should use. This is not unlike the CPU space, but it's a bit more extreme: performance cliffs are orders of magnitude rather than single-digit percentages.

    This is neither good nor bad... it's just something people need to be aware of now that we're writing fairly low level code compared to the traditional graphics pipeline. Thus it's going to be harder to summarize things about which architecture is "better" in broad categories... the answer is almost always going to be "it depends" now.
     
  12. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    My guess is that this particular test has nothing to do with L1 and L2, or even the DirectCompute part of the workload, as wave physics are very straightforward and don't need complex data structures with weird access patterns.

    NVidia just cranked up the tessellation to obscene levels. You can see in the screenshot that 12M triangles are created. Fermi can do 2-3 tessellated triangles per clock; Cypress can do 1/3 of a triangle per clock. That's a factor of 5 at least, and sometimes more.
     
  13. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Definitely possible. You can see in the NVIDIA videos that the quality changes very little after the first few steps, yet the application defaults to a fairly high tessellation level. I'm also questioning whether the frame rates in the YouTube videos are correct, as the rate appears pegged at 25fps on the 480 regardless of the tessellation level, which seems odd assuming a fairly simple wave physics step.

    Has anyone played with how this scales with tessellation levels across ATI and NVIDIA? I'd love to see a graph.
     
  14. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    Since you put it that nicely, I'll go and make one tomorrow morning :)
     
  15. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yay thanks! I'll have that to look forward to tomorrow :)
     
  16. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    I may have spoken a bit too soon; it would appear the demo isn't publicly available yet? In which case we'll have to postpone any investigation until it becomes public, since I don't have access to nV's top-secret reviewer sauces :???:
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
  18. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Sweet, I love their reviews. Given that the GTX480's advantage is under 5% at LOD1, that confirms my belief: ATI's large perf deficit has nothing to do with the compute shader and everything to do with tessellation.

    If we assume that ixbt's setting of LOD=50 gives about the same triangle count as bson's settings (since both give 9.4 fps on Cypress), then we see that those 12M triangles take 84.5ms extra over the LOD=1 case, which works out to exactly 6 clocks per triangle. Considering Damien's figure of 3 clocks per tessellated tri and his information about degrading performance when multiple input vertices are used on non-Fermi architectures, this makes sense.

    I still don't understand why ATI designed it to be so slow. The shader processors are very fast and have extremely high bandwidth to the L1 and even the L2. A read-port limitation on the cache holding the control points (e.g. 2 vec4s per clock) is the only reason I can think of. Are you sure that Cypress doesn't use a separate vertex cache anymore?
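A quick back-of-the-envelope check of that 6-clocks-per-triangle figure (a sketch assuming the HD 5870's 850 MHz core clock; the 84.5 ms and 12M-triangle numbers are taken from the post):

```python
# Sanity-check: extra frame time divided into the extra triangles,
# expressed in Cypress core clocks (assumed 850 MHz for the HD 5870).
clock_hz = 850e6
extra_time_s = 84.5e-3     # extra frame time at high LOD vs LOD=1
extra_tris = 12e6          # extra tessellated triangles

clocks_per_tri = extra_time_s * clock_hz / extra_tris
print(round(clocks_per_tri, 2))   # ~5.99, i.e. about 6 clocks per triangle
```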
     
    #18 Mintmaster, Apr 1, 2010
    Last edited by a moderator: Apr 1, 2010
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    AMD claimed the vertex cache was omitted. I think they said it now goes through the texture caches.

    Could it be serialization over that crossbar between the two sides?
    Cycle 0: tessellator output
    Cycle 1: send to SIMD bank 0
    Cycle 2: send to SIMD bank 1

    I'm probably wrong on how I imagine the data path being followed.

    edit: Might be easier to just broadcast to both sides.
     
  20. chavvdarrr

    Veteran

    Joined:
    Feb 25, 2003
    Messages:
    1,165
    Likes Received:
    34
    Location:
    Sofia, BG
    The text says that at LOD=100 there are 28M triangles, while at LOD=25 there are 4M tris.
    The 480 at LOD=100 is as fast as the 5870 at LOD=25.
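Taking those numbers at face value, and assuming tessellation dominates the frame time (an assumption, since the wave physics and shading cost is not broken out), equal frame rates at those two LOD settings imply roughly a 7x gap in tessellated-triangle throughput:

```python
# Equal fps at different LODs means equal frame time for different
# triangle counts, so the throughput ratio is just the triangle ratio.
tris_480 = 28e6    # GTX 480 at LOD=100
tris_5870 = 4e6    # HD 5870 at LOD=25

throughput_ratio = tris_480 / tris_5870
print(throughput_ratio)   # 7.0
```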
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.