AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by ToTTenTranz, Sep 20, 2016.

  1. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Keep in mind "automatic" isn't much more than the driver optimizations that likely already occur behind the scenes. Nvidia's tiled raster for example may very well be the equivalent of a primitive shader. The only difference being exposure to devs.
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,098
    Likes Received:
    2,814
    Location:
    Well within 3d
    I recall Fermi being criticized in old articles for some kind of "software" tessellation, which perhaps may have involved some additional optimizations in the purportedly fixed-function hand-off. Perhaps some element of that is similar.

    The automatic generation in this case is creating a set of derivative shaders based on developers' code and inserting them into a phase of vertex processing the developers did not reference, which seems a bit more involved. AMD's point about primitive shaders is that if you get to the point of involving the rasterizer, it's already too late.
     
  3. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,702
    Likes Received:
    2,430
    I think those were rumors started by none other than Charlie, claiming Fermi would have bad tessellation performance because it used a software solution. That later proved to be wrong: Fermi came with so many hardware tessellators that it was leaps and bounds beyond AMD's GPUs for several generations.
     
    #4583 DavidGraham, Nov 8, 2017
    Last edited: Nov 8, 2017
    A1xLLcqAgt0qc2RyMz0y and Picao84 like this.
  4. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Exactly. Charlie didn’t understand that DX11 tessellation had a lot of shader involvement. Which isn’t too surprising coming from somebody who also claimed that Nvidia’s CPU and GPU would use the same execution pipeline.
     
    A1xLLcqAgt0qc2RyMz0y and Picao84 like this.
  5. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Not according to him; according to him it's the whole pipeline.

    From the amdvegahardwarereviews thread:
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,098
    Likes Received:
    2,814
    Location:
    Well within 3d
    I do not interpret statements from AMD's designers, Vega ISA doc, AMD diagrams, presentations, patents, and other disclosures to be consistent with that.
     
    3dcgi, pharma and Jawed like this.
  7. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Neither do I, that's why I was arguing with him in that thread... but he's still sticking to his 'opinion'.
     
  8. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Doesn't matter if it's my opinion if it's correct. In a system designed to take advantage of feedback it stands to reason they would use everything available. Besides, they even included instructions in the ISA to accelerate what I suggested. That, and the part of the pipeline discarding primitives would seem an ideal place to handle all culling of said primitives, where spending 300 instructions to discard one is a performance win. Binning isn't too unlike rasterization with big pixels. Then throw in some Z culling, which has been around forever, testing against the current scene and the rest of a bin. I'd be shocked if they weren't doing something like what I described in the face of triangle and bandwidth constraints.
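
    To make "rasterization with big pixels" concrete, here is a minimal sketch that assigns a triangle to screen-space bins by its bounding box. The function name, the 32-pixel bin size and the screen dimensions are illustrative assumptions, not anything AMD has described for Vega's binner.

```python
def bins_touched(tri, bin_size, screen_w, screen_h):
    """Return the (bx, by) bins overlapped by a triangle's screen-space bounding box.

    tri is three (x, y) vertices in pixels. This is a conservative test:
    it treats each bin as one big pixel and may report bins the triangle
    itself does not actually cover.
    """
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)

    # Bounding box entirely off screen: no bins.
    if max_x < 0 or max_y < 0 or min_x >= screen_w or min_y >= screen_h:
        return []

    bins_x = -(-screen_w // bin_size)  # ceiling division
    bins_y = -(-screen_h // bin_size)

    # Convert the box to bin indices and clamp to the bin grid.
    bx0 = max(int(min_x // bin_size), 0)
    bx1 = min(int(max_x // bin_size), bins_x - 1)
    by0 = max(int(min_y // bin_size), 0)
    by1 = min(int(max_y // bin_size), bins_y - 1)

    return [(bx, by) for by in range(by0, by1 + 1) for bx in range(bx0, bx1 + 1)]


if __name__ == "__main__":
    print(bins_touched([(10, 10), (40, 10), (10, 40)], 32, 128, 128))
```

    Note that the example triangle is reported in bin (1, 1) even though its hypotenuse never reaches it, which is the kind of over-estimate a finer per-bin test would have to clean up.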
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,098
    Likes Received:
    2,814
    Location:
    Well within 3d
    Is the 300 instructions to cull one triangle figure a rhetorical flourish?

    There are many implications and questions to that and everything else claimed, but before wading far into the weeds I will say my gut instinct is to take a lost cycle in the geometry engine and have faith that maybe one of the next 299 triangles will launch at least one wavefront that affects final rendering.
     
    pharma likes this.
  10. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    I guess that figure was pulled out of this. :)
     
    Anarchist4000 and AlBran like this.
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,098
    Likes Received:
    2,814
    Location:
    Well within 3d
    The specific reference I saw for 300 was the ns cost for a memory reference necessary for an indirect draw being handled by the command processor, which the system has limited ability to hide if there are calls with 0 triangles.

    This does show where there are points in the architecture where there's massively less leeway to handle costs concurrently. One CP versus 2-4 DSBRs is still dwarfed by the parallel capacity of the back end. Given the Vega white paper's placement of primitive shaders, it would require time travel for an intra-draw element like a primitive shader to retroactively decide the command processor shouldn't have launched the vertex process that contains it.

    The architectural descriptions and patents also usually describe the tiled deferred rasterization step as being post primitive assembly, which if mapped to the "everything is a primitive shader" claim would create a shader spanning VGT, the FIFO, PA, and right up to the start of the SPI instantiating wavefronts.

    The balancing act in managing the VGT to PA path is where I'm curious if there's an impact for Vega.
    A primitive shader's code is inside the VGT path, and it is culling code that exists in addition to (or redundantly with) code or hardware that implements the VGT and standard culling portion of the process. AMD indicated that the standard pipeline is still there, which may explain some of the complexity in activating or exposing primitive shaders more generally, if it does involve balancing overproduction or starvation across elements like a limited number of inter-stage FIFOs.

    I also note that the culling shaders do not try to check for all coverage scenarios for MSAA, which makes me wonder if a primitive shader would either.
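
    For reference, the single-sample small-primitive test used by compute culling shaders can be sketched as below. It assumes pixel centers at half-integer coordinates and ignores samples that land exactly on the box edge; the helper names and the MSAA sample position are hypothetical and only there to show how the test can disagree with a multisampled target.

```python
from math import floor


def snaps_to_same_center(lo, hi):
    """True if no pixel center (k + 0.5) lies strictly between lo and hi."""
    # floor(x + 0.5) is round-half-up; Python's round() uses banker's rounding.
    return floor(lo + 0.5) == floor(hi + 0.5)


def cull_small_primitive(bbox_min, bbox_max):
    """True if the bounding box misses every pixel center along at least one axis."""
    return any(snaps_to_same_center(lo, hi) for lo, hi in zip(bbox_min, bbox_max))


def bbox_contains(bbox_min, bbox_max, point):
    """True if point lies inside the axis-aligned box (edges included)."""
    return all(lo <= p <= hi for lo, p, hi in zip(bbox_min, point, bbox_max))


if __name__ == "__main__":
    box_min, box_max = (0.6, 0.6), (0.9, 0.9)
    msaa_sample = (0.75, 0.75)  # hypothetical sub-pixel sample position
    print(cull_small_primitive(box_min, box_max))        # culled by the center test
    print(bbox_contains(box_min, box_max, msaa_sample))  # yet the box spans an MSAA sample
```

    The last two lines show the gap: a box that the pixel-center test rejects can still enclose a sub-pixel sample, so a test written this way would need to be more conservative when MSAA is on.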
     
    AlBran, Grall, Digidi and 1 other person like this.
  12. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Do you have a link to that? I'm interested in GPU-driven rendering and was wondering about the cost of zero-triangle draw calls for indirect draws. Given that presentations found it important enough to compact the draw indirect buffer, the info would be interesting.
     
  13. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Don't do zero triangle draws. They still cost GPU cycles (I only have console numbers, so I can't give them). ExecuteIndirect supports indirect draw count (filled by GPU). OpenGL 4.3+ supports indirect draw count too (https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_indirect_parameters.txt).

    Compacting data with a local+global atomic counter is efficient. There are, however, no ordering guarantees (depth sorting isn't preserved). DX12 SM6.0 has GlobalOrderedCountIncrement. Haven't tried this, but the GCN equivalent instruction (DS_ORDERED_COUNT) works well for this case on consoles. DICE had some info about using it in their culling implementation.
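
    A CPU-side sketch of the local+global counter pattern, for anyone who wants to see the ordering issue. Thread groups are simulated as list slices, the "atomics" are plain integers, and the function name, the (draw_id, index_count) tuples and the group_order parameter are all assumptions made for illustration, not any real API.

```python
def compact_draws(draws, group_size, group_order=None):
    """Drop zero-index draws and pack the survivors into a dense list.

    draws is a list of (draw_id, index_count). group_order simulates the
    order in which thread groups reach the global atomic; on a GPU that
    order is not guaranteed.
    Returns (compacted_draws, draw_count).
    """
    groups = [draws[i:i + group_size] for i in range(0, len(draws), group_size)]
    if group_order is None:
        group_order = range(len(groups))

    out = [None] * len(draws)
    global_count = 0  # stands in for the global atomic counter

    for g in group_order:
        # Local counter: survivors get consecutive slots within the group.
        survivors = [d for d in groups[g] if d[1] > 0]
        # One global atomic add per group reserves a contiguous output range.
        base = global_count
        global_count += len(survivors)
        for local_slot, d in enumerate(survivors):
            out[base + local_slot] = d

    return out[:global_count], global_count


if __name__ == "__main__":
    draws = [(0, 3), (1, 0), (2, 6), (3, 0), (4, 9), (5, 12)]
    print(compact_draws(draws, 2))
    print(compact_draws(draws, 2, group_order=[2, 1, 0]))
```

    Both runs keep the same four draws and produce the same count for the indirect draw count buffer, but the second one comes out in a different order, which is why a front-to-back sort does not survive this kind of compaction.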
     
    tinokun, AlBran and BRiT like this.
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,098
    Likes Received:
    2,814
    Location:
    Well within 3d
    There is an embedded link in the post I replied to for a GDC16 slide deck for a presentation concerning GPU-driven culling with compute shaders for the Frostbite engine.
    http://www.wihlidal.ca/Presentations/GDC_2016_Compute.pdf
     
  15. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    @sebbbi Thanks for the input. BTW, do you still work on GPU-driven pipelines, or does Claybook take up all your time? If so, have you experimented with DX12 yet? Do you find it a good fit for GPU-driven pipelines? Also, any advice for breaking a model down into clusters? It's a hard problem to try to do optimally. How did you handle bounding volumes for clusters in skinned meshes? I haven't come up with a good solution for that.

    edit - Sorry for being off topic but I figured I had his attention so why not.

    Thanks... I must have missed it as I already have that presentation, or I perused the PowerPoint version of it.

    edit - the PowerPoint version doesn't have the notes below the slide.
     
    #4595 Infinisearch, Nov 10, 2017
    Last edited: Nov 10, 2017
  16. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    It wouldn't retroactively decide, but poll input from prior geometry as one possible test. Move Z culling into the primitive testing at a per bin resolution. There would also be some mechanism to analyze or reduce the bins. For example if the last triangle occluded some portion of the triangles. Keep running simple passes until the bin was full of valid geometry or even spawn a new partition within.

    That's roughly what I'm suggesting. All of that running on a CU until satisfied with whatever bins were created. A single draw call possibly instantiating only one wavefront for establishing bins.
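
    A rough sketch of what a per-bin Z test could look like, purely as an illustration of the idea and not a description of anything Vega is known to do. It assumes smaller z is nearer, that each bin stores the farthest depth of geometry known to fully cover it, and that the bin ids and function names are made up for the example.

```python
def record_full_cover(bin_max_z, bin_id, occluder_max_z):
    """Remember the nearest 'farthest depth' of any occluder fully covering a bin."""
    bin_max_z[bin_id] = min(bin_max_z.get(bin_id, float("inf")), occluder_max_z)


def occluded_in_all_bins(tri_min_z, bins, bin_max_z):
    """True if the triangle's nearest depth is behind the occluder in every bin it touches.

    A bin with no recorded full cover defaults to infinity, so the
    triangle is never culled on account of that bin.
    """
    return all(tri_min_z > bin_max_z.get(b, float("inf")) for b in bins)


if __name__ == "__main__":
    coarse_z = {}
    record_full_cover(coarse_z, (0, 0), 0.3)
    record_full_cover(coarse_z, (1, 0), 0.5)
    print(occluded_in_all_bins(0.6, [(0, 0), (1, 0)], coarse_z))
    print(occluded_in_all_bins(0.4, [(0, 0), (1, 0)], coarse_z))
```

    The test is only valid where an occluder covers the whole bin, which is why the table is updated through record_full_cover rather than from every triangle that lands in a bin.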
     
  17. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    492
    Likes Received:
    212
  18. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    216
    Likes Received:
    95
  19. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Well primitive shaders replace the vertex shader and that patent mentions feeding the vertex shader so...
     
    Digidi likes this.