AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

3dilettante · Jan 17, 2017

CarstenS said:
Sorry for double-posting, but it just occured to me, that AMD actually did NOT say, Vega was designed with four geometry engines. Only that in Vega, four geometry engines could handle 11 polygons. Vega's physical implementation could have more than those 4, if they're trying and understate a characteristic for once. A little far-fetched, I know, but still possible.

The text of the slide says:
New Programmable Geometry Pipeline
Over 2X peak throughput per clock

The respective footnote (full):
Geometry throughput slide: Data based on AMD Engineering design of Vega. Radeon R9 Fury X has 4 geometry engines and a peak of 4 polygons per clock. Vega is designed to handle up to 11 polygons per clock with 4 geometry engines. This represents an increase of 2.6x. VG-3

Geometry engines are the fixed-function portion of a shader engine, and Vega 10's CU count has been apparently pinned at 64. The design norm would be to have 4 shader engines and 16 CUs each. That doesn't rule out AMD changing something, but if the pattern holds the next increment is a big jump to 8 shader engines, with a lower CU to fixed-function ratio than Polaris.
Rasterizer and ROP throughput would presumably jump by a similar magnitude.
The number of RBE clients now plugged into AMD's L2 is another item of concern, since there would have been less to worry about with that many ROPs back when they were incoherent and AMD didn't dare.

As notable as that would be, it would seemingly be contrary to the efficiency goals implied by the binning logic and probably belied by the emphasis on a clock increase. It would also be rather ironic if Vega is supposed to bring in features heralding the replacement of fixed-function primitive handling by having the highest ratio of dedicated hardware to programmable throughput in generations.

I'm open to being pleasantly surprised, but the conservative interpretation of the footnote is that Fury X has 4 geometry engines and a peak of 4 polygons, and Vega (non-specific, possibly speaking for whole Vega family) can go up to ("up to" is another way of saying "peak") 11 polygons per clock with 4 geometry engines. A >5x increase in any SKU of the Vega line would be something AMD would be sorely tempted to put into marketing.

I look forward to more information on the L2 and memory controller/interconnect. One interpretation of all of this is that the high bandwidth cache controller is where the Infinity Fabric is, which leaves the L2 less disrupted by not letting the fabric's throughput change the L2's traditionally higher internal bandwidth, particularly with geometry, compute, and pixel data paths hitting the L2. It doesn't seem to make too much sense for a consumer graphics discrete, but perhaps this is a hallmark of Vega's non-consumer ambitions.
Another question I have is the L2's slice structure and capacity. Fiji had 32 channels of HBM, and had memory synthetics that showed there was a general equivalence to Hawaii until access patterns found a way to exceed the on-die capabilities of the whole hierarchy, rather than finding any scaling of L2 capability with the higher channel count. That's almost as if the L2 cache was not distributed fully for 4 stacks of HBM. Vega's keeping to 2 stacks of HBM2 allows a Hawaii-type pairing of slices to channels, which might make for better utilization of memory bandwidth.

Gipsel · Jan 17, 2017

seahawk said:
Could mean 4 for Vega 11 and maybe 8 for Vega 10. But could also mean 4 for Vega 10 and 2 for Vega 11.

Or 6 for Vega 10 (like GP100/102) and 3 for Vega11? :runaway:

6 for Vega 10 don't fit nicely with 64 CUs though, as long as AMD keeps the shader engines symmetrical (nV does not do it, at least for salvage versions). Or the implicit 1:1 mapping of geometry engines to shader engines is removed and AMD can now arbitrarily redistribute stuff between CUs.

3dilettante · Jan 17, 2017

Gipsel said:
Or the implicit 1:1 mapping of geometry engines to shader engines is removed and AMD can now arbitrarily redistribute stuff between CUs.

The slide for the Intelligent Workgroup Distributor seems to give a linear relationship between Geometry Engine to Compute Engine to Pixel Engine, reminiscent of the Shader Engine arrangement.
It's a sparse slide, however, and how a distributor can load balance effectively in the face of static assignment is unclear. The data flow also puts the load-balancing portion ahead of everything that might be oversubscribed, so it feels like there needs to be feedback pathways to know when particular shader engines are getting slammed.

However, even should geometry engines no longer be mapped to a shader engine, getting the exports back out of an arbitrary CU would imply a similar break in the 1:1 RBE mapping, or some other design change. However, both geometry and ROP mapping have a traditional 1:1 link to the rasterizer's assigned screen space, which to be extra complicated would be on the far side of the primitive setup path. That either puts the binning rasterizer (which has a reliance on some kind of screen-space assignment) in the geometry engine, or on the wrong side relative to the workgroup distributor.

That doesn't rule out some kind of re-mapping or re-routing of outputs from the respective stages, but that does sound complicated. Messing with ROP assignment would also inject the possibility of ping-ponging between RBEs--so I suppose it's a good thing those are L2 clients in that scenario. However, that leaves the question of the nature of the ROP caches and the positioning of the various forms of compression relative to the ROP caches, L2, memory controller, and CUs. Playing with the ROP tiling behavior and compression (metadata has its own memory/caching concerns) and fitting it into the L2 sounds like a fun place for complexity.

Gipsel · Jan 17, 2017

3dilettante said:
The slide for the Intelligent Workgroup Distributor seems to give a linear relationship between Geometry Engine to Compute Engine to Pixel Engine, reminiscent of the Shader Engine arrangement.
It's a sparse slide, however, and how a distributor can load balance effectively in the face of static assignment is unclear. The data flow also puts the load-balancing portion ahead of everything that might be oversubscribed, so it feels like there needs to be feedback pathways to know when particular shader engines are getting slammed.

However, even should geometry engines no longer be mapped to a shader engine, getting the exports back out of an arbitrary CU would imply a similar break in the 1:1 RBE mapping, or some other design change. However, both geometry and ROP mapping have a traditional 1:1 link to the rasterizer's assigned screen space, which to be extra complicated would be on the far side of the primitive setup path. That either puts the binning rasterizer (which has a reliance on some kind of screen-space assignment) in the geometry engine, or on the wrong side relative to the workgroup distributor.

That doesn't rule out some kind of re-mapping or re-routing of outputs from the respective stages, but that does sound complicated. Messing with ROP assignment would also inject the possibility of ping-ponging between RBEs--so I suppose it's a good thing those are L2 clients in that scenario. However, that leaves the question of the nature of the ROP caches and the positioning of the various forms of compression relative to the ROP caches, L2, memory controller, and CUs. Playing with the ROP tiling behavior and compression (metadata has its own memory/caching concerns) and fitting it into the L2 sounds like a fun place for complexity.

I was suggesting it as a more remote possibility. Didn't make it very clear though.

Regarding the work distribution you probably refer to this slide:

This is a pretty high level pictogram, which likely doesn't show how it really works.
It would be probably better to put the engines next to each other and loop back to the distributor instead of letting the geometry engine feed the compute engine which again feeds into the pixel engine. This wouldn't be a very sensible arrangement. For instance, a single compute engine can feed the complete shader array, i.e. all shader engines. Would be kind of stupid to restrict a compute shader to a fourth of the GPU, isn't it? So there is definitely a arbiting/distribution mechanism between these engines and the shader array, too.

But I fully agree that it seems to be a sensible arrangement to tie subsets of RBEs to each rasterizer. And it will be fun to see, how exactly AMD is doing the binning and how the binning tiles align to the screen space tiles for the rasterizers and RBEs.

pTmdfx · Jan 17, 2017

Gipsel said:
I was suggesting it as a more remote possibility. Didn't make it very clear though.

Regarding the work distribution you probably refer to this slide:

This is a pretty high level pictogram, which likely doesn't show how it really works.
It would be probably better to put the engines next to each other and loop back to the distributor instead of letting the geometry engine feed the compute engine which again feeds into the pixel engine. This wouldn't be a very sensible arrangement. For instance, a single compute engine can feed the complete shader array, i.e. all shader engines. Would be kind of stupid to restrict a compute shader to a fourth of the GPU, isn't it? So there is definitely a arbiting/distribution mechanism between these engines and the shader array, too.

But I fully agree that it seems to be a sensible arrangement to tie subsets of RBEs to each rasterizer. And it will be fun to see, how exactly AMD is doing the binning and how the binning tiles align to the screen space tiles for the rasterizers and RBEs.

I think they meant shader cores for compute engines for this particular slide. Tying RBEs with screen-space-tiled rasterisers seem to be the case for quite a while too. Not quite sure if how primitives are distributed in GCN though.

3dilettante · Jan 17, 2017

Gipsel said:
I was suggesting it as a more remote possibility. Didn't make it very clear though.

Regarding the work distribution you probably refer to this slide:

This is a pretty high level pictogram, which likely doesn't show how it really works.
It would be probably better to put the engines next to each other and loop back to the distributor instead of letting the geometry engine feed the compute engine which again feeds into the pixel engine. This wouldn't be a very sensible arrangement. For instance, a single compute engine can feed the complete shader array, i.e. all shader engines. Would be kind of stupid to restrict a compute shader to a fourth of the GPU, isn't it? So there is definitely a arbiting/distribution mechanism between these engines and the shader array, too.

The distributor is being brought up in the context of geometry processing, which currently does imply some level of static assignment. Compute shaders wouldn't have the same limits, but will hit the distributor or contend with it for CU wavefront dispatch. That horizontal arrangement in the pictograph would be close to the current high-level concept of processing within a shader engine, hence why breaking a 1:1 mapping for just one portion can complicate matters if the others are left unchanged.
A scenario where load balancing can start to matter is if the output of one or more geometry engines starts hitting the same tiled rasterizer/ROP assignment, which would in the current shader engine arrangement leave the possibility of stalling on export resources or the fraction of CUs linked to those resources in one shader engine.
Actual load-balancing would mean making it so that more CUs, rasterizers, and RBEs can participate, but that leaves a pixel-sync kind of scenario where you don't want to find the same tile accidentally dispatched if there's work already in-flight for another geometry engine, CU, bin, rasterizer, RBE, or any caches or buffers associated with any of them.
Feedback from each stage would seem to be necessary for the load balancer, and since this is "intelligent" it seems like some heuristics may be needed to avoid some of the turnaround or sync penalties for the more distant ends of the process.

A path might still exist in the case of a fall-back to pure IMR mode, or the distributor might have a conservative path emulated.

But I fully agree that it seems to be a sensible arrangement to tie subsets of RBEs to each rasterizer. And it will be fun to see, how exactly AMD is doing the binning and how the binning tiles align to the screen space tiles for the rasterizers and RBEs.

One question is where the binning portion is versus the final higher-precision rasterization. This may explain the odd triangle throughput for geometry engines, if there's a limit to the size of a packet of processed primitives that they can spit out, or the ability of the balancer to process and assign the packet/bin. The little rasterizer block that determines coverage could work offset from the bin, and possibly in the arbitrary routing scenario could be determining coverage for an arbitrary rectangle of screen space. On the other hand, the binning process might actually be improved if the most recent depth information can be pulled from the depth cache or a hierarchical buffer--which would have been easier and faster to query when the assignment was more static. Perhaps that might fit into the distributor's job description as well.

Razor1 · Jan 18, 2017

Question ansered, you get 2 times and more the polygon through put when using primitive shaders by the use of the shader array.

Enough? 2:34

So pretty much same capabilities as Polaris (geometry is concerned with current or older games) most likely still has 4 geometry units......with the addition of the primitive shaders you get more.

I can see this tech come in handy quicker in consoles with Vega architecture (next gen xbox rumored) but for PC's probably won't see this for a year or two after Vega is released to developers.....

This also goes for Vega's tile renderer which it too needs to go through the primitive shader.

Gipsel · Jan 18, 2017

Razor1 said:
Question ansered, you get 2 times and more the polygon through put when using primitive shaders by the use of the shader array.

Enough? 2:34

No.
It is a (conservative) estimate of the potential additional benefit of using a primitive shader compared to the traditional pipeline. He explicitly said so. It is very likely, that this is a comparison of Vega with primitive shader vs. Vega without using it. Later in the interview, he says that Vega offers a geometry throughput uplift compared to previous generations also without using a primitive shader (while refusing to quantify it or provide specifics).

Razor1 said:
This also goes for Vega's tile renderer which it too needs to go through the primitive shader.

What?
If you mean the tiling (draw stream binning) rasterizer, then clearly no. Or have you ever heard nV's rasterizer requires additional special shader stages to work?

Edit:
The better part of the video for your point would be actually starting at the 36 minute mark, when he gets confronted directly with the footnote mentioning the 11 triangles per clock.
He admitted to have been unaware of that footnote, struggled a bit by saying he thinks it's not applying to a specific product and an example what Vega could do in a configuration with 4 geometry engines (reinforcing a bit the point CarstenS was making [that AMD didn't exactly say, that Vega10 has 4 geometry engines, could be a different number], if it was not just hedging from his side). He then came back to one of the "talking points" of the Vega reveal, the primitive shaders (giving some credence to the idea, that this number is really pertaining to that), but basically saying "it's difficult" to explain how one arrives at the number of 11 triangles per clock (allegedly realistic and possibly taking into account multiple constraints like memory bandwidth [I mentioned that before]).
So maybe he was not fully briefed about what exactly is on the slides and was not willing to reveal any specifics. Or someone at AMD pulled some shaky number out of the air with the help of some halfbaked rules of thumb and put them on that slide. As explained, that number and also putting it on that slide doesn't make much sense in that case, as it would be somewhat fundamentally flawed.

Razor1 · Jan 18, 2017

Gipsel said:
No.
It is a (conservative) estimate of the potential additional benefit of using a primitive shader compared to the traditional pipeline. He explicitly said so. It is very likely, that this is a comparison of Vega with primitive shader vs. Vega without using it. Later in the interview, he says that Vega offers a geometry throughput uplift compared to previous generations also without using a primitive shader (while refusing to quantify it or provide specifics).
What?
If you mean the tiling (draw stream binning) rasterizer, then clearly no. Or have you ever heard nV's rasterizer requires additional special shader stages to work?

Check around 36 minutes as well where they ask Scott about the 11 polygons, Scott repeats it again, primitive shaders are what is giving the performance increase in polygon through put due to the culling.

yes the draw stream binning NEEDS primitive shaders to work in its current iteration in Vega.

nV's hardware has nothing to do with this, AMD also has stated this need for draw stream binning to be utilized. Also mentioned in this video too. Really wish it was an article, much quicker to read then sit around and listen lol.

Gipsel · Jan 18, 2017

Razor1 said:
yes the draw stream binning NEEDS primitive shaders to work in its current iteration in Vega.

Where exactly did you get this from? Scott Wasson said the new pixel pipeline including the tiling rasterizer works without the need to change the code.

Razor1 said:
nV's hardware has nothing to do with this, AMD also has stated this need for draw stream binning to be utilized. Also mentioned in this video too.

NV's hardware also has a binning rasterizer. So why should AMD need a primitive shader to get it to work? That has not much to do with each other. I listened to the complete video. If I have missed such a comment, please tell me where it is.

Razor1 · Jan 18, 2017

Gipsel said:
Where exactly did you get this from? Scott Wasson said explicitly, the new pixel pipeline including the tiling rasterizer works "automatic", without the need to change the code.
NV's hardware also has a binning rasterizer. So why should AMD need a primitive shader to get it to work? That has not much to do with each other. I listened to the complete video. If I have missed such a comment, please tell me where it is.

Damn videos lol, tellin ya they should go back to webpages lol

6:40

They are talking about primitive shader and the draw stream binning, where it needs to be exposed in the API or a library.

Gipsel · Jan 18, 2017

Razor1 said:
6:40

They are talking about primitive shader and the draw stream binning, where it needs to be exposed in the API or a library.

The primitive shader needs to be exposed, right. That is in a passage where he hedges against promising too much performance uplift through the new features. After discussing the benefits of the new rasterizers and that in some games (optimzed for very low overdraw) it could be limited, he comes to the primitive shaders and that they need to be exposed, i.e. don't provide an "automatic" improvement.

Razor1 · Jan 18, 2017

Gipsel said:
The primitive shader needs to be exposed, right. That is in a passage where he hedges against promising too much performance uplift through the new features. After discussing the benefits of the new rasterizers and that in some games (optimzed for very low overdraw) it could be limited, he comes to the primitive shaders and that they need to be exposed, i.e. don't provide an "automatic" improvement.

If it was auto magical as nV's has been, I would expect to see quite a bit of improvement in some current games. His statement leads me to believe, there needs to be programmer intervention to see a decent amount of performance improvements from it. And possible implication on power consumption too which they didn't get into, but if it was there I'm sure it would be something they would have hinted on right off the bat since its a problem area for current AMD chips in comparison to their direct counterparts.

seahawk · Jan 18, 2017

What I think is that they need software work to improve the geometry performance, either by the application using primitive shaders, or by the driver wrapping the code to achieve a similar effect.

CarstenS · Jan 18, 2017

Off-topic, but I find it quite sad how websites/youtubers are stealing other people's IP (in this case fotos, not mine, but still) without even crediting them. Sorry for the µrant.

Razor1 said:
Damn videos lol, tellin ya they should go back to webpages lol
6:40
They are talking about primitive shader and the draw stream binning, where it needs to be exposed in the API or a library.

Couple of seconds before that (around 6:10) the interviewed guy (who in this part of the video does explicitly does NOT sound like Scott...) explicitly says, the effectiveness of the draw stream binning will depend on how much is done in software already - with heavily optimized software culling already in place, the effect of the hardware being less pronounced. AFAIGI as a non-native speaker.

lanek · Jan 18, 2017

As said Carsten, i think they just expose that performance could be relative and depend on case use, not that you need this case use forcibly. ( at least it is how i understand it. ) If the developpers have allready done the job on the engine, the "hardware" optimization will not double the performance over it.

But that's the case for everything.

Malo · Jan 18, 2017

What really happened at AMD:

Marketing meets with engineer.
Marketing: "So I hear something called geometry units are more powerful with how many triangles they can draw?"
Engineer: "Yeah we really dialed them up to 11!"
Marketing: "11. Got it, thanks!"
Engineer: "Wait, no that's not what I meant!"

DavidGraham · Jan 18, 2017

I have to say, after listening to the video, they heavily state and hint at their solution being software based more than hardware based. So the automatic performance uplift seems to be not as pronounced or transparent as any traditional -moslty hardware based- solution. It needs special attention to be attainable.

AMD marketing peoole might not be knowledgable or porperly oriented, but this is what they are strongly implying. Cautiously promising uplifts if all conditions are met (software/developer awareness and adoption).

revan · Jan 18, 2017

"AMD’s new Vega GPU architecture changes things on two fronts, on die and off. Of the two the off-die seems the most fundamental change to SemiAccurate but both bring ground up new tech to GPUs.

During CES AMD unveiled a bit more about Vega including some high level architecture details. It isn’t the full technical deep dive but there is a lot of information to be had. What’s more interesting is when you start asking how it ties into the other technologies they have introduced lately, SSG being a key one. The bits that make a gaming GPU into an AI device like Instinct also benefit from these changes too.

"Vega at a high level

Lets start out with the obvious changes, the three on the left. If you are familiar with GCN architecture devices like Hawaii, you probably realize they are getting a bit long in tooth. The architecture isn’t bad but the process nodes it was originally meant for have long past and the optimization points for 16/14/10/7nm call for fundamentally different methods. Those changes require both shader level and device level architecture changes and that starts with the engines and pipelines.

First on the list of big changes is a really big bang, think DX9 or geometry shader addition. It is called the Primitive shader and it is lumped under the heading of New Programmable Geometry Pipeline. The old way of doing things was to have separate pixel, vertex, and geometry shaders fed by a compute engine (ACE) or geometry command processor (GCP). These fed the Geometry Processor and then the various pipelines, Vertex Shader(VS) then Geometry Shader(GS). "

and so on...

MORE HERE : http://semiaccurate.com/2017/01/17/amd-talks-vega-high-level/
A very serious article , despite site's repution!

Going after pop-corn...

iroboto · Jan 18, 2017

revan said:
"AMD’s new Vega GPU architecture changes things on two fronts, on die and off. Of the two the off-die seems the most fundamental change to SemiAccurate but both bring ground up new tech to GPUs.

During CES AMD unveiled a bit more about Vega including some high level architecture details. It isn’t the full technical deep dive but there is a lot of information to be had. What’s more interesting is when you start asking how it ties into the other technologies they have introduced lately, SSG being a key one. The bits that make a gaming GPU into an AI device like Instinct also benefit from these changes too.

"Vega at a high level

Lets start out with the obvious changes, the three on the left. If you are familiar with GCN architecture devices like Hawaii, you probably realize they are getting a bit long in tooth. The architecture isn’t bad but the process nodes it was originally meant for have long past and the optimization points for 16/14/10/7nm call for fundamentally different methods. Those changes require both shader level and device level architecture changes and that starts with the engines and pipelines.

First on the list of big changes is a really big bang, think DX9 or geometry shader addition. It is called the Primitive shader and it is lumped under the heading of New Programmable Geometry Pipeline. The old way of doing things was to have separate pixel, vertex, and geometry shaders fed by a compute engine (ACE) or geometry command processor (GCP). These fed the Geometry Processor and then the various pipelines, Vertex Shader(VS) then Geometry Shader(GS). "

and so on...

MORE HERE : http://semiaccurate.com/2017/01/17/amd-talks-vega-high-level/
A very serious article , despite site's repution!

Going after pop-corn...

you can leverage (quote) or (code) blocks to separate which text belongs to the article and which text is yours

AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

3dilettante

Gipsel

3dilettante

Gipsel

pTmdfx

3dilettante

Razor1

Gipsel

Razor1

Gipsel

Razor1

Gipsel

Razor1

seahawk

CarstenS

Moderator

lanek

Malo

Yak Mechanicum

DavidGraham

revan

iroboto

Daft Funk