AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by Deleted member 13524, Sep 20, 2016.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Geometry engines are the fixed-function portion of a shader engine, and Vega 10's CU count has been apparently pinned at 64. The design norm would be to have 4 shader engines and 16 CUs each. That doesn't rule out AMD changing something, but if the pattern holds the next increment is a big jump to 8 shader engines, with a lower CU to fixed-function ratio than Polaris.
    Rasterizer and ROP throughput would presumably jump by a similar magnitude.
    The number of RBE clients now plugged into AMD's L2 is another item of concern, since there was less to worry about with that many ROPs back when they were incoherent and AMD didn't dare make them L2 clients.

    As notable as that would be, it would seemingly be contrary to the efficiency goals implied by the binning logic and probably belied by the emphasis on a clock increase. It would also be rather ironic if Vega is supposed to bring in features heralding the replacement of fixed-function primitive handling by having the highest ratio of dedicated hardware to programmable throughput in generations.

    I'm open to being pleasantly surprised, but the conservative interpretation of the footnote is that Fury X has 4 geometry engines and a peak of 4 polygons per clock, and Vega (non-specific, possibly speaking for the whole Vega family) can go up to ("up to" is another way of saying "peak") 11 polygons per clock with 4 geometry engines. A >5x increase in any SKU of the Vega line would be something AMD would be sorely tempted to put into marketing.


    I look forward to more information on the L2 and memory controller/interconnect. One interpretation of all of this is that the high bandwidth cache controller is where the Infinity Fabric is, which leaves the L2 less disrupted by not letting the fabric's throughput change the L2's traditionally higher internal bandwidth, particularly with geometry, compute, and pixel data paths hitting the L2. It doesn't seem to make too much sense for a consumer graphics discrete, but perhaps this is a hallmark of Vega's non-consumer ambitions.
    Another question I have is the L2's slice structure and capacity. Fiji had 32 channels of HBM, and had memory synthetics that showed there was a general equivalence to Hawaii until access patterns found a way to exceed the on-die capabilities of the whole hierarchy, rather than finding any scaling of L2 capability with the higher channel count. That's almost as if the L2 cache was not distributed fully for 4 stacks of HBM. Vega's keeping to 2 stacks of HBM2 allows a Hawaii-type pairing of slices to channels, which might make for better utilization of memory bandwidth.
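    For illustration, here is a hedged sketch of the kind of slice-to-channel pairing being speculated about. Nothing here is documented by AMD: the interleave granularity, channel counts, and slice counts are illustrative assumptions, chosen only to show why slice throughput may fail to scale when channels outnumber slices (the hypothesized Fiji case) versus a Hawaii-like 1:1 pairing.

    ```python
    # Hypothetical sketch: mapping an address to an L2 slice when slices are
    # paired 1:1 vs. 2:1 with memory channels. The 256-byte interleave and all
    # counts are assumptions for illustration, not AMD's documented scheme.

    INTERLEAVE = 256  # bytes per channel before switching (assumed)

    def channel_of(addr: int, num_channels: int) -> int:
        """Which memory channel an address falls on under simple interleaving."""
        return (addr // INTERLEAVE) % num_channels

    def slice_of(addr: int, num_channels: int, num_slices: int) -> int:
        """Which L2 slice services the address. When channels outnumber
        slices (hypothesized Fiji-like case), adjacent channels share a
        slice, so slice bandwidth doesn't scale with the extra channels."""
        ch = channel_of(addr, num_channels)
        return ch * num_slices // num_channels

    # Hawaii-like pairing: 16 channels onto 16 slices -> one slice per channel.
    # Fiji-like case: 32 HBM channels onto (hypothetically) 16 slices -> two
    # channels per slice, which would match the observed lack of L2 scaling.
    ```

    Under this toy model, doubling the channel count without adding slices only doubles contention per slice, which is consistent with the synthetic results described above.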
     
  2. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Or 6 for Vega 10 (like GP100/102) and 3 for Vega11? :runaway:
    6 for Vega 10 doesn't fit nicely with 64 CUs though, as long as AMD keeps the shader engines symmetrical (nV does not, at least for salvage versions). Or the implicit 1:1 mapping of geometry engines to shader engines is removed, and AMD can now arbitrarily redistribute work between CUs.
     
    chris1515 likes this.
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The slide for the Intelligent Workgroup Distributor seems to give a linear relationship between Geometry Engine to Compute Engine to Pixel Engine, reminiscent of the Shader Engine arrangement.
    It's a sparse slide, however, and how a distributor can load balance effectively in the face of static assignment is unclear. The data flow also puts the load-balancing portion ahead of everything that might be oversubscribed, so it feels like there need to be feedback pathways to know when particular shader engines are getting slammed.

    However, even should geometry engines no longer be mapped to a shader engine, getting the exports back out of an arbitrary CU would imply a similar break in the 1:1 RBE mapping, or some other design change. However, both geometry and ROP mapping have a traditional 1:1 link to the rasterizer's assigned screen space, which to be extra complicated would be on the far side of the primitive setup path. That either puts the binning rasterizer (which has a reliance on some kind of screen-space assignment) in the geometry engine, or on the wrong side relative to the workgroup distributor.
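    To make the static screen-space assignment concrete, here is a hedged sketch of a checkerboard-style tile-to-shader-engine mapping of the sort GCN-era GPUs are generally understood to use. The 32-pixel tile size and the 2x2 interleave pattern are illustrative assumptions, not Vega's actual arrangement:

    ```python
    # Hypothetical sketch of static screen-space partitioning: each fixed-size
    # screen tile is owned by one shader engine (and hence one rasterizer/RBE
    # group). Tile size and interleave pattern are assumed for illustration.

    TILE = 32  # pixels per tile edge (assumed)

    def owning_engine(x: int, y: int) -> int:
        """Return which of 4 shader engines owns pixel (x, y) under a simple
        2x2 checkerboard of screen tiles, so neighbouring tiles land on
        different engines and local load spreads across all four."""
        tx, ty = x // TILE, y // TILE
        return (tx % 2) + 2 * (ty % 2)

    # With a static mapping like this, every primitive covering a given tile
    # must be routed to that tile's rasterizer/RBE, which is why breaking the
    # geometry-engine-to-shader-engine link alone isn't enough.
    ```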

    That doesn't rule out some kind of re-mapping or re-routing of outputs from the respective stages, but that does sound complicated. Messing with ROP assignment would also inject the possibility of ping-ponging between RBEs--so I suppose it's a good thing those are L2 clients in that scenario. However, that leaves the question of the nature of the ROP caches and the positioning of the various forms of compression relative to the ROP caches, L2, memory controller, and CUs. Playing with the ROP tiling behavior and compression (metadata has its own memory/caching concerns) and fitting it into the L2 sounds like a fun place for complexity.
     
  4. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I was suggesting it as a more remote possibility. Didn't make it very clear though.

    Regarding the work distribution you probably refer to this slide:
    [image: AMD slide showing the Geometry Engine → Compute Engine → Pixel Engine work distribution]

    This is a pretty high level pictogram, which likely doesn't show how it really works.
    It would probably be better to put the engines next to each other and loop back to the distributor, instead of letting the geometry engine feed the compute engine, which again feeds into the pixel engine. That wouldn't be a very sensible arrangement. For instance, a single compute engine can feed the complete shader array, i.e. all shader engines. It would be kind of stupid to restrict a compute shader to a fourth of the GPU, wouldn't it? So there is definitely an arbitration/distribution mechanism between these engines and the shader array, too.

    But I fully agree that it seems to be a sensible arrangement to tie subsets of RBEs to each rasterizer. And it will be fun to see how exactly AMD is doing the binning, and how the binning tiles align with the screen-space tiles for the rasterizers and RBEs.
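    For reference, the generic first phase of any binning rasterizer is to sort primitives into screen-space bins before rasterizing bin by bin. The sketch below shows that phase in its textbook form; the 64-pixel bin size and the bounding-box test are illustrative assumptions, not what Vega's draw-stream binning actually does:

    ```python
    # Generic binning sketch: assign each triangle to the screen-space bins its
    # bounding box overlaps. Bin size is an assumed value; real hardware picks
    # it so per-bin state fits in on-chip storage.

    BIN = 64  # pixels per bin edge (assumed)

    def bins_for_triangle(tri, screen_w, screen_h):
        """Yield (bx, by) bin coordinates overlapped by the triangle's
        axis-aligned bounding box. tri is [(x0, y0), (x1, y1), (x2, y2)]."""
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        x0, x1 = max(0, min(xs)), min(screen_w - 1, max(xs))
        y0, y1 = max(0, min(ys)), min(screen_h - 1, max(ys))
        for by in range(int(y0) // BIN, int(y1) // BIN + 1):
            for bx in range(int(x0) // BIN, int(x1) // BIN + 1):
                yield (bx, by)

    def bin_primitives(tris, screen_w, screen_h):
        """Phase 1 of a binning rasterizer: build a per-bin list of triangle
        indices, so phase 2 can rasterize one bin at a time with its depth
        and color traffic staying on chip."""
        bins = {}
        for i, tri in enumerate(tris):
            for key in bins_for_triangle(tri, screen_w, screen_h):
                bins.setdefault(key, []).append(i)
        return bins
    ```

    The open question in the thread is exactly how such bins line up with the rasterizers' static screen-space tiles, since a bin straddling two tiles would belong to two different RBE groups.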
     
    #844 Gipsel, Jan 17, 2017
    Last edited: Jan 17, 2017
  5. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    417
    Likes Received:
    381
    I think they meant shader cores by "compute engines" for this particular slide. Tying RBEs to screen-space-tiled rasterisers seems to have been the case for quite a while, too. Not quite sure how primitives are distributed in GCN, though.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The distributor is being brought up in the context of geometry processing, which currently does imply some level of static assignment. Compute shaders wouldn't have the same limits, but will hit the distributor or contend with it for CU wavefront dispatch. That horizontal arrangement in the pictograph would be close to the current high-level concept of processing within a shader engine, hence why breaking a 1:1 mapping for just one portion can complicate matters if the others are left unchanged.
    A scenario where load balancing can start to matter is if the output of one or more geometry engines starts hitting the same tiled rasterizer/ROP assignment, which would in the current shader engine arrangement leave the possibility of stalling on export resources or the fraction of CUs linked to those resources in one shader engine.
    Actual load-balancing would mean making it so that more CUs, rasterizers, and RBEs can participate, but that leaves a pixel-sync kind of scenario where you don't want to find the same tile accidentally dispatched if there's work already in-flight for another geometry engine, CU, bin, rasterizer, RBE, or any caches or buffers associated with any of them.
    Feedback from each stage would seem to be necessary for the load balancer, and since this is "intelligent" it seems like some heuristics may be needed to avoid some of the turnaround or sync penalties for the more distant ends of the process.

    A path might still exist in the case of a fall-back to pure IMR mode, or the distributor might have a conservative path emulated.

    One question is where the binning portion is versus the final higher-precision rasterization. This may explain the odd triangle throughput for geometry engines, if there's a limit to the size of a packet of processed primitives that they can spit out, or the ability of the balancer to process and assign the packet/bin. The little rasterizer block that determines coverage could work offset from the bin, and possibly in the arbitrary routing scenario could be determining coverage for an arbitrary rectangle of screen space. On the other hand, the binning process might actually be improved if the most recent depth information can be pulled from the depth cache or a hierarchical buffer--which would have been easier and faster to query when the assignment was more static. Perhaps that might fit into the distributor's job description as well.
     
  7. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Question answered: you get two times or more the polygon throughput when using primitive shaders, by way of the shader array.

    Enough? 2:34



    So pretty much the same capabilities as Polaris (as far as geometry in current or older games is concerned); most likely it still has 4 geometry units. With the addition of the primitive shaders you get more.

    I can see this tech coming in handy sooner in consoles with Vega architecture (a next-gen Xbox is rumored), but PCs probably won't see this for a year or two after Vega is released to developers.....

    This also goes for Vega's tile renderer, which also needs to go through the primitive shader.
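    For a sense of what shader-based primitive culling does in general, here is a hedged sketch of the standard early-culling tests (degenerate and back-facing triangles) that GPU-driven pipelines run before fixed-function setup. This is the generic technique only; it is not AMD's actual primitive shader code, and the winding convention is an assumption:

    ```python
    # Generic shader-style primitive culling sketch: cheap tests a
    # primitive-shader-like pass can run before fixed-function setup, so
    # discarded triangles never consume geometry-engine throughput.

    def cull_triangle(tri) -> bool:
        """Return True if the triangle can be discarded. tri is three (x, y)
        screen-space vertices; counter-clockwise = front-facing (assumed)."""
        (x0, y0), (x1, y1), (x2, y2) = tri
        # Signed double-area: zero means degenerate (no coverage), negative
        # means back-facing under the assumed CCW-front winding.
        area2 = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
        return area2 <= 0

    def cull_pass(tris):
        """Compact the survivors, as a primitive-shader-like pass would,
        handing the fixed-function pipeline a shorter primitive stream."""
        return [t for t in tris if not cull_triangle(t)]
    ```

    The claimed throughput uplift comes from the compaction: triangles that would have been culled one per clock in fixed-function setup are instead removed in bulk by the wide shader array.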
     
    #847 Razor1, Jan 18, 2017
    Last edited: Jan 18, 2017
    CSI PC, Malo and pharma like this.
  8. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    No.
    It is a (conservative) estimate of the potential additional benefit of using a primitive shader compared to the traditional pipeline. He explicitly said so. It is very likely that this is a comparison of Vega with a primitive shader vs. Vega without using it. Later in the interview, he says that Vega offers a geometry throughput uplift compared to previous generations even without using a primitive shader (while refusing to quantify it or provide specifics).
    What?
    If you mean the tiling (draw stream binning) rasterizer, then clearly no. Or have you ever heard nV's rasterizer requires additional special shader stages to work?

    Edit:
    The better part of the video for your point actually starts at the 36 minute mark, when he gets confronted directly with the footnote mentioning the 11 triangles per clock.
    He admitted to having been unaware of that footnote and struggled a bit, saying he thinks it doesn't apply to a specific product and is an example of what Vega could do in a configuration with 4 geometry engines (reinforcing a bit the point CarstenS was making [that AMD didn't exactly say that Vega10 has 4 geometry engines; it could be a different number], if it was not just hedging on his side). He then came back to one of the "talking points" of the Vega reveal, the primitive shaders (giving some credence to the idea that this number really pertains to that), but basically said "it's difficult" to explain how one arrives at the number of 11 triangles per clock (allegedly realistic, and possibly taking into account multiple constraints like memory bandwidth [I mentioned that before]).
    So maybe he was not fully briefed about what exactly is on the slides and was not willing to reveal any specifics. Or someone at AMD pulled a shaky number out of the air with the help of some half-baked rules of thumb and put it on that slide. As explained, in that case neither the number nor putting it on that slide makes much sense, as it would be somewhat fundamentally flawed.
     
    #848 Gipsel, Jan 18, 2017
    Last edited: Jan 18, 2017
    Lightman likes this.
  9. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Check around 36 minutes as well, where they ask Scott about the 11 polygons. Scott repeats it again: primitive shaders are what give the performance increase in polygon throughput, due to the culling.

    Yes, the draw stream binning NEEDS primitive shaders to work in its current iteration in Vega.

    nV's hardware has nothing to do with this; AMD has also stated this need for the draw stream binning to be utilized. It's mentioned in this video too. Really wish it was an article, much quicker to read than to sit around and listen lol.
     
    DavidGraham and pharma like this.
  10. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Where exactly did you get this from? Scott Wasson said the new pixel pipeline, including the tiling rasterizer, works without the need to change the code.
    NV's hardware also has a binning rasterizer, so why should AMD need a primitive shader to get it to work? The two don't have much to do with each other. I listened to the complete video; if I have missed such a comment, please tell me where it is.
     
    #850 Gipsel, Jan 18, 2017
    Last edited: Jan 18, 2017
    pTmdfx and no-X like this.
  11. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Damn videos lol, tellin ya they should go back to webpages lol

    6:40

    They are talking about the primitive shader and the draw stream binning, and where it needs to be exposed in the API or a library.
     
    CSI PC likes this.
  12. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    The primitive shader needs to be exposed, right. That is in a passage where he hedges against promising too much performance uplift through the new features. After discussing the benefits of the new rasterizers, and that in some games (optimized for very low overdraw) the benefit could be limited, he comes to the primitive shaders and that they need to be exposed, i.e. they don't provide an "automatic" improvement.
     
    Razor1 and Lightman like this.
  13. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    If it was automagical, as nV's has been, I would expect to see quite a bit of improvement in some current games. His statement leads me to believe there needs to be programmer intervention to see a decent amount of performance improvement from it. And possibly implications for power consumption too, which they didn't get into; but if that benefit was there, I'm sure it's something they would have hinted at right off the bat, since power is a problem area for current AMD chips in comparison to their direct counterparts.
     
  14. seahawk

    Regular

    Joined:
    May 18, 2004
    Messages:
    511
    Likes Received:
    141
    What I think is that they need software work to improve the geometry performance, either by the application using primitive shaders, or by the driver wrapping the code to achieve a similar effect.
     
    Razor1 and DavidGraham like this.
  15. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Off-topic, but I find it quite sad how websites/youtubers are stealing other people's IP (in this case photos, not mine, but still) without even crediting them. Sorry for the µrant.

    A couple of seconds before that (around 6:10), the interviewed guy (who in this part of the video explicitly does NOT sound like Scott...) says that the effectiveness of the draw stream binning will depend on how much is done in software already; with heavily optimized software culling already in place, the effect of the hardware is less pronounced. AFAIGI as a non-native speaker.
     
    Lightman and Razor1 like this.
  16. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    As Carsten said, I think they are just pointing out that the performance gain could be relative and depend on the use case, not that you forcibly need that use case (at least that is how I understand it). If the developers have already done the job in the engine, the "hardware" optimization will not double the performance over it.

    But that's the case for everything.
     
    Lightman likes this.
  17. Malo

    Malo Yak Mechanicum
    Legend Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,931
    Likes Received:
    5,533
    Location:
    Pennsylvania
    What really happened at AMD:

    Marketing meets with engineer.
    Marketing: "So I hear something called geometry units are more powerful with how many triangles they can draw?"
    Engineer: "Yeah we really dialed them up to 11!"
    Marketing: "11. Got it, thanks!"
    Engineer: "Wait, no that's not what I meant!"
     
    #857 Malo, Jan 18, 2017
    Last edited: Jan 18, 2017
  18. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    I have to say, after listening to the video, they heavily state and hint at their solution being software-based more than hardware-based. So the automatic performance uplift seems to be not as pronounced or transparent as in any traditional, mostly hardware-based, solution. It needs special attention to be attainable.

    AMD marketing people might not be knowledgeable or properly briefed, but this is what they are strongly implying: cautiously promising uplifts if all conditions are met (software/developer awareness and adoption).
     
    pharma and Razor1 like this.
  19. revan

    Newcomer

    Joined:
    Nov 9, 2007
    Messages:
    55
    Likes Received:
    18
    Location:
    look in the sunrise ..will find me
    "AMD’s new Vega GPU architecture changes things on two fronts, on die and off. Of the two the off-die seems the most fundamental change to SemiAccurate but both bring ground up new tech to GPUs.

    During CES AMD unveiled a bit more about Vega including some high level architecture details. It isn’t the full technical deep dive but there is a lot of information to be had. What’s more interesting is when you start asking how it ties into the other technologies they have introduced lately, SSG being a key one. The bits that make a gaming GPU into an AI device like Instinct also benefit from these changes too.

    "Vega at a high level

    Lets start out with the obvious changes, the three on the left. If you are familiar with GCN architecture devices like Hawaii, you probably realize they are getting a bit long in tooth. The architecture isn’t bad but the process nodes it was originally meant for have long past and the optimization points for 16/14/10/7nm call for fundamentally different methods. Those changes require both shader level and device level architecture changes and that starts with the engines and pipelines.

    First on the list of big changes is a really big bang, think DX9 or geometry shader addition. It is called the Primitive shader and it is lumped under the heading of New Programmable Geometry Pipeline. The old way of doing things was to have separate pixel, vertex, and geometry shaders fed by a compute engine (ACE) or geometry command processor (GCP). These fed the Geometry Processor and then the various pipelines, Vertex Shader(VS) then Geometry Shader(GS). "

    and so on...

    MORE HERE : http://semiaccurate.com/2017/01/17/amd-talks-vega-high-level/
    A very serious article, despite the site's reputation!

    Going after pop-corn... :)
     
  20. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    You can leverage (quote) or (code) blocks to separate which text belongs to the article and which text is yours.
     
    Malo likes this.