AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

From a (VERY) quick glance with some basic fillrate tests (yes, I know, hence VERY quick), there does not appear to be anything unusual with Vega - unlike certain past architectures where enabling MSAA would greatly reduce pixel or texture rate, which they were not designed to do in the first place.
 
I was disappointed that not a single reviewer tested Vega with huge datasets against traditional 8 GB GPUs and the 12 GB Titan X.
I read somewhere that HBCC is disabled in current drivers. Or at least in the drivers that were sent to reviewers.
 
Who has all the swizzles... :devilish:
There are three completely different topics that go by "swizzling":
- standard swizzle (in this thread): the layout of texels within a texture as defined by D3D (as opposed to IHV-defined swizzles, which may differ from one IHV to another or from one GPU generation to another).
- indexed swizzle: which sebbbi mentioned here and which refers to the ability of threads to exchange data within a warp/wavefront. Specifically, one lane may directly index a value in a register belonging to another lane (see the sketch below).
- viewport swizzle: which you mentioned and which allows a pass-through geometry shader to output a triangle to multiple viewports and reorient it properly (flip some coordinates around), for example when rendering a cube map in a single pass.
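For the indexed swizzle in particular, here is a minimal CUDA sketch of the idea (the warp-shuffle intrinsic is just the most convenient way to illustrate it; GCN exposes the same concept through its own cross-lane permute operations):

```
// "Indexed swizzle": each lane of a warp reads a register value held by
// an arbitrary other lane, with no round trip through memory. Launch
// with a block size that is a multiple of 32 for this sketch.
__global__ void reverse_within_warp(const float* in, float* out)
{
    int lane = threadIdx.x & 31;               // lane index within the warp
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    float v  = in[idx];                        // each lane loads one value

    // Lane i fetches the register belonging to lane (31 - i): the warp's
    // 32 values are reversed purely by cross-lane data exchange.
    float swizzled = __shfl_sync(0xffffffffu, v, 31 - lane);

    out[idx] = swizzled;
}
```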
 
@sebbbi, now that you've played with Vega quite a bit, do you have an educated guess as to why it performs so much worse than what you'd expect from its specifications?
 
[...]

HBCC also should reduce frame judder in DX11 games, since the GPU doesn't need to transfer whole resources on use. It can simply page on demand, causing much smaller data movement per frame -> fewer stalls. I would be interested to know whether this is already visible in current games. Do we see fewer fps spikes and better minimum fps compared to other AMD GPUs?
[...]
As far as I know, three German sites tested the HBCC (High-Bandwidth Cache Controller).
The results are mixed: some games benefit from the HBCC, some run worse.
I uploaded two images with some results below; there are more results if you click on the links.

Computerbase (had mixed results):
[image: Computerbase HBCC benchmark results]

https://www.computerbase.de/2017-08..._hbcc_hat_im_schnitt_weder_vor_noch_nachteile

PCGH (the HBCC was never worse, and in Metro it improved performance):
[image: PCGH HBCC benchmark results]

http://www.pcgameshardware.de/Radeo...66623/Specials/HBCC-Gaming-Benchmark-1236099/

Gamestar (the HBCC was never better, on average a few % loss):
http://www.gamestar.de/artikel/rade...t-high-bandwith-cache-controller,3318564.html


The games each site tested were different, and so were the test systems.

PCGH used a 6800K @ 4.4 GHz with 32 GB (3200) in quad-channel mode; ~4 GB of system memory was reserved for ~12 GB of unified memory.

CB also used 32 GB (probably quad-channel, 3000) with a 6850K @ 4.3 GHz.
They reserved 8 GB for 16 GB of unified memory.

Gamestar, in contrast, used a 7700K at stock with 16 GB (2400) of dual-channel memory, and reserved 4 GB for 12 GB of unified memory.
 
Why are you so obsessed with this? Each GPC can rasterize 1 triangle per clock and each SM has a PolyMorph engine which can process a triangle every two clocks. But as I pointed out earlier, and sebbbi pointed out as well, you really, really need to take quite a lot of other stuff into consideration. Reaching these peaks in benchmarks will be a whole different matter (have you seen a Vega benchmark that would suggest 17 triangles per clock? Or even 8?).

That's a good point. AMD stated that they reached the 17 polygons per clock in their internal benchmarks. I was also surprised that they raised the value from 11 to 17, and that the primitive shader is not really activated in the driver yet and will be delivered later. It's also interesting that they already know the name of the driver it is supposed to arrive in.

To me it is strange: first they make jokes like "poor Volta", then they compare Vega to a 1080 (non-Ti). It looks to me like they have serious problems getting their geometry engine running, but when it does run we will see a huge performance increase, because the utilization of the hardware will be higher.
 
That's a good point. AMD stated that they reached the 17 polygons per clock in their internal benchmarks. I was also surprised that they raised the value from 11 to 17.
The exact wording is 11 polygons per clock in the earlier slide versus "17 or more" in the Vega whitepaper. I'm not sure if there's some subtly different scenario being applied in each, or if AMD was being conservative about what it could apply its primitive shaders to.

To me it is strange: first they make jokes like "poor Volta".
I think it's possible that it was some marketing intern's idea of a little joke that took on a life of its own.
I don't think anybody engineering Vega would have taken it seriously.


In the other thread, I was speculating on whether some of the apparent lack of development in advance of the chip being taped out could have stemmed from some kind of re-shuffling in Vega's design process. The leaked Greenland GPU had certain features that may not be in Vega 10, like 1/2 rate DP and 4 GMI links.

If there was something like AMD having multiple candidate designs that weren't settled on until late, it may have inhibited preliminary development or led to some IP blocks being adjusted to different levels than early modeling was done for.
 
Maybe AMD was not sure about the primitive shader. It's interesting that the primitive shader coexists with the old geometry pipeline.
 
The formulation of the primitive shader given so far is that it contributes to efficiency by conservatively culling geometry that the current geometry pipeline would wind up rejecting anyway, freeing up the traffic and storage spent on attribute calculation and the cycles the fixed-function engines must use to reject even the most obviously unused primitives.

Because it's conservative, sometimes a little or a lot of the primitive shader's input must still get through, and since it only culls, it doesn't do much of the processing the main pipeline must perform.

Some of the items that haven't been fully explored are what portion of the culling can be handled by the primitive shader versus the primitive discard accelerator, versus the DSBR, versus the rest of the pipeline.
The other hardware elements somewhat overlap with the primitive shader and are themselves conservative. The main pipeline and rasterizer also eventually drill down to the sub-pixel level, which could mean a fair amount of math for the primitive shader to do on its own, only to have the rasterizer do it all over again.
Perhaps if there were a way to pass along a guarantee that certain culling actions are known to be fully accurate, something could be skipped.

I'm not sure how many situations involve the primitive shader's calculations for frustum and triangle facing needing to be more conservative than the main pipeline's. Sample coverage at the level the rasterizer can drill down to might incur a fair amount of math, and might tie the per-primitive cost of a primitive shader to the complexity of the sampling if it were coded to try calculating things that finely for every triangle, only to have to pass it all through anyway.
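To make the "conservative" part concrete, here is a rough sketch of the kind of coarse zero-coverage test a culling pre-pass can afford. This is my own illustration rather than anything AMD has published, written as a CUDA device function with samples assumed at pixel centers and real rasterization edge rules ignored:

```
// Coarse zero-coverage test: cull only when the triangle's screen-space
// bounding box provably misses every sample; otherwise keep the triangle
// and let the rasterizer make the exact, sub-pixel decision later.
// Positions are in pixel coordinates, samples assumed at pixel centers.
__device__ bool coarse_zero_coverage(float2 a, float2 b, float2 c)
{
    float minx = fminf(a.x, fminf(b.x, c.x));
    float maxx = fmaxf(a.x, fmaxf(b.x, c.x));
    float miny = fminf(a.y, fminf(b.y, c.y));
    float maxy = fmaxf(a.y, fmaxf(b.y, c.y));

    // If the box rounds to the same pixel on either axis, no sample
    // center can lie inside it -> the triangle produces zero coverage.
    bool no_x = roundf(minx) == roundf(maxx);
    bool no_y = roundf(miny) == roundf(maxy);
    return no_x || no_y;
}
```

Note how cheap, and how blunt, this is compared with the rasterizer's exact per-sample tests: plenty of zero-coverage triangles slip through, which is fine, because everything that slips through still goes down the normal pipeline.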
 
Maybe AMD was not sure about the primitive shader. It's interesting that the primitive shader coexists with the old geometry pipeline.
It may be better to think of primitive shaders as a means to control the fixed-function pipeline. The whole point is to take a primitive, do something to it, and hand it off for rasterization. That can be culling, transforming, sorting, generating, etc., to make the pipeline more efficient. If presented with triangle strips consisting of only front-facing, visible triangles, it's unnecessary. Therefore it coexists.
 
Are you sure that it's only culling? If you look at the AMD slide it looks like a different/independent pass through the geometry engine.

https://techgage.com/article/amd-details-vega-architecture-hbm2-ncu-primitive-shaders/

I think that slide is more about illustrating that the primitive shader combines the capabilities of the formerly separate stages. The culling portion is what's been mostly talked about in the most recent disclosures, and it was discussed earlier that the system or software could opt to not use its capabilities, possibly in cases where it would be more overhead than help.

Using that illustration, the fixed-function portion of the setup pipeline would come after, and that's where the other culling checks and rasterizer would be.
 
I find this sentence interesting:
And at that point when you finally have the final position of the vertices of the triangle, is one point where we can always find out whether or not the triangle is inside of the frustum, back-faced, or too small to hit. From frustum testing, there’s a mathematical way to figure out whether or not a vertex is inside of the view frustum. If any one of the vertices are inside of the view frustum, then we’ll know that the triangle can potentially create pixels. To do a back-faced culling perspective, you can find two edges or with three vertices you can find one edge and a second edge, and then you can take a cross-product of that and determine the facedness of the triangle. You can then product that with the eye-ray, and if it’s a positive result, it’s facing the direction of the view, and if it’s negative it’s a back-faced triangle and you don’t need to do it.
http://www.gamersnexus.net/guides/3010-primitive-discarding-in-vega-with-mike-mantor

Looks like AMD found a way to do back-face culling with vertex data?

From the white paper:
Furthermore, traditional geometry pipelines discard primitives after vertex processing is completed, which can waste computing resources and create bottlenecks when storing a large batch of unnecessary attributes. Primitive shaders enable early culling to save those resources.
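Mantor's description translates almost line-for-line into code. Here is an illustrative CUDA device function of my own (not AMD's implementation); the sign convention depends on winding order and on which way the eye ray points, so treat the final comparison as an assumption:

```
// Back-face test from three post-transform vertex positions, following
// the quoted recipe: build two edges, take their cross product to get
// the facet normal, then dot it with the eye ray. One sign means the
// triangle faces the viewer, the other means it can be culled.
__device__ bool is_back_facing(float3 v0, float3 v1, float3 v2, float3 eye)
{
    // Two edges sharing vertex v0.
    float3 e0 = make_float3(v1.x - v0.x, v1.y - v0.y, v1.z - v0.z);
    float3 e1 = make_float3(v2.x - v0.x, v2.y - v0.y, v2.z - v0.z);

    // Cross product -> facet normal ("facedness" of the triangle).
    float3 n = make_float3(e0.y * e1.z - e0.z * e1.y,
                           e0.z * e1.x - e0.x * e1.z,
                           e0.x * e1.y - e0.y * e1.x);

    // Eye ray toward the triangle.
    float3 r = make_float3(v0.x - eye.x, v0.y - eye.y, v0.z - eye.z);
    float d = n.x * r.x + n.y * r.y + n.z * r.z;

    // Per the quote: positive -> facing the view, negative -> back-faced.
    // (The sign flips with the opposite winding convention.)
    return d < 0.0f;
}
```

The frustum part of the quote works the same conservative way: any vertex inside the frustum guarantees the triangle must be kept, while a safe cull needs all three vertices outside the same frustum plane.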
 
I find this sentence interesting:

http://www.gamersnexus.net/guides/3010-primitive-discarding-in-vega-with-mike-mantor

Looks like AMD found a way to do back-face culling with vertex data?
The latter portion of the quote discusses having two edges, or creating two edges from three vertices, which should be available in this instance. Then it discusses taking the cross product and dotting it with the eye-ray, indicating whether the triangle is back-faced or not.
That goes to where I wasn't sure how often the primitive shader would diverge from the primitive pipeline in terms of frustum and back-faced triangles.

I read that text, and I watched the video for any mention of the overhead of the zero-coverage checks, which might be influenced by the sampling mode and by elements that are typically handled at higher precision by the rasterizer. I don't recall that portion being mentioned, and drilling down to the level the DSBR can go would potentially scale overhead with a portion of the complexity of the pixel output.

The primitive shader is conservative, in part to be safe, and I think in part because it should stay coarse enough that the work to cull plus the non-culled work remains less than just using what's already there.
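For contrast with the coarse test sketched earlier, here is what an exact zero-coverage check looks like if it is pushed down to individual sample positions. Again this is my own illustration, not anything from AMD; the point is only that the cost now scales with the number of samples the triangle's bounding box touches, which is the math a coarse culling pass would rather leave to the rasterizer:

```
// Edge function: positive when point p lies on the inner side of the
// directed edge (p0 -> p1), assuming a consistent winding order.
__device__ float edge_fn(float2 p0, float2 p1, float2 p)
{
    return (p1.x - p0.x) * (p.y - p0.y) - (p1.y - p0.y) * (p.x - p0.x);
}

// Exact coverage: evaluate three edge functions at every candidate sample
// position. Cost grows with the sample count (e.g. MSAA), unlike the
// bounding-box test, which is one comparison per axis.
__device__ bool covers_any_sample(float2 a, float2 b, float2 c,
                                  const float2* samples, int num_samples)
{
    for (int i = 0; i < num_samples; ++i) {
        float2 s = samples[i];
        if (edge_fn(a, b, s) >= 0.0f &&
            edge_fn(b, c, s) >= 0.0f &&
            edge_fn(c, a, s) >= 0.0f)
            return true;    // at least one sample covered: keep the triangle
    }
    return false;           // zero coverage: safe to discard
}
```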
 
It's always been possible to do this in hardware; it's just a small part of the acceleration that you get from GPUs! Using shader code to do it instead is not a major engineering feat - Larrabee was doing this.

It's analogous to when GPUs changed from having fixed counts of vertex and pixel shader pipelines in hardware to a unified architecture where all shader pipes could do both vertex and pixel shading. It was done because it allowed the GPU to adapt to the workload, using programmability and load-balancing metrics, and it led to better performance.

So the primitive shader is similar: it allows the hardware to cover a large range of situations, particularly very high geometry loads, and to be less bottlenecked than a fixed configuration of buffers/hardware.
 
At least for my browser right now, I cannot see the actual tweet.
However, would the following be an earlier precursor, given what it does and that it was done on GCN?
http://www.gamasutra.com/view/feature/191007/inside_the_playstation_4_with_mark_.php?page=3

"There are a broad variety of techniques we've come up with to reduce the vertex bottlenecks, in some cases they are enhancements to the hardware," said Cerny. "The most interesting of those is that you can use compute as a frontend for your graphics."

This technique, he said, is "a mix of hardware, firmware inside of the GPU, and compiler technology. What happens is you take your vertex shader, and you compile it twice, once as a compute shader, once as a vertex shader. The compute shader does a triangle sieve -- it just does the position computations from the original vertex shader and sees if the triangle is backfaced, or the like. And it's generating, on the fly, a reduced set of triangles for the vertex shader to use. This compute shader and the vertex shader are very, very tightly linked inside of the hardware."

At least up to the point where it is running two separate kernels communicating through a buffer, it's using a subset of the vertex shader's functionality to do similar position and visibility culling before feeding the remaining vertices into the main vertex processing phase. Possibly, the architecture goes a step further in generalizing and combining shader stages so that it can make one stage out of the formerly separate sieve and vertex pair?
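As a sketch of what that sieve amounts to in practice (my own illustration in CUDA, not Sony's or AMD's code; the position-only transform, buffer layout, and back-face convention are all assumptions): run only the position math for each triangle, test it, and append the survivors to a compacted index buffer for the real vertex pass.

```
// Position-only "triangle sieve": a separate pass that runs just the
// position math, culls what it can prove invisible, and writes a
// compacted index buffer for the full vertex shader to consume.
struct float4x4 { float m[4][4]; };

__device__ float4 transform(const float4x4& M, float3 p)
{
    return make_float4(
        M.m[0][0]*p.x + M.m[0][1]*p.y + M.m[0][2]*p.z + M.m[0][3],
        M.m[1][0]*p.x + M.m[1][1]*p.y + M.m[1][2]*p.z + M.m[1][3],
        M.m[2][0]*p.x + M.m[2][1]*p.y + M.m[2][2]*p.z + M.m[2][3],
        M.m[3][0]*p.x + M.m[3][1]*p.y + M.m[3][2]*p.z + M.m[3][3]);
}

__global__ void triangle_sieve(const float3* positions,
                               const uint3*  triangles, int num_tris,
                               float4x4 mvp,
                               uint3* out_triangles, int* out_count)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= num_tris) return;

    uint3  tri = triangles[t];
    float4 a = transform(mvp, positions[tri.x]);
    float4 b = transform(mvp, positions[tri.y]);
    float4 c = transform(mvp, positions[tri.z]);

    // Conservative back-face test via the signed area of the projected
    // triangle; only attempted when all vertices are in front of the
    // camera, otherwise the triangle is kept to stay on the safe side.
    bool keep = true;
    if (a.w > 0.0f && b.w > 0.0f && c.w > 0.0f) {
        float area = (b.x/b.w - a.x/a.w) * (c.y/c.w - a.y/a.w)
                   - (c.x/c.w - a.x/a.w) * (b.y/b.w - a.y/a.w);
        keep = area > 0.0f;   // winding convention is an assumption
    }

    if (keep) {
        int slot = atomicAdd(out_count, 1);     // compact the survivors
        out_triangles[slot] = tri;
    }
}
```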
 