AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

A number of things.
Clock speed is dependent on logic design.
Clock speed is dependent on layout - a given design can be implemented to minimize electrical interference and allow higher clock speeds (which may mean lower density and more processing steps). Different cell libraries are an aspect of this.
Clock speed is dependent on process. As far as I'm aware, we don't even necessarily know where Vega will be fabbed, much less the particulars of the process.
Agreed but within reason.
The clocks were reasonably similar between 390/390x and Fiji.
There is a 10% difference between the clocks of the 1050 (Samsung) and, say, the 1060 (TSMC), but the 1050 is deliberately kept comfortably within 75W in reference spec-rated Boost form and is designed without auxiliary power connectors (even if one is added, it still cannot exceed 75W).
The primary difference is a slight variation in the voltage-frequency performance envelope, but what stops it from matching or getting much closer to the 1060 is the strict power regulation to stay below 75W in all forms.
The 1050 hits 1840MHz at around 74.4W when overclocked; the 1060, without the same restriction, hits around 2000-2050MHz.
So: different fabs and libraries, and yet without the strict 75W power regulation the 1050 would be pretty close to the 1060, albeit with a subtly different voltage-frequency performance envelope.
Cheers
 
Is it expected that there will be a significant number of perfect (fully-enabled) Vegas, or will the mainstream part be somewhat cut down? Maybe a liquid-cooled perfect part with perfect HBM2 chips to take the single-GPU performance crown, with pricing that makes the HBM and liquid cooling work within a reasonable profit margin?
 
If this is pure OpenCL benchmark, then it has nothing to do with hardware tessellation or fixed function geometry performance.
My understanding was that the geometry processors still leaned on a CU/LDS for the shading/interpolation and part of the tessellation work. Single permanent wave per SE, not necessarily adhering to the cadence. I'll admit I don't fully understand that part of the pipeline though.

I would be surprised if GCN5 is a brand new CU architecture. I still believe that NCU is simply their marketing name for a heavily power- & clock-optimized CU: 1.5 GHz clock rate, reduced power usage, support for double/quad rate 16/8 bit ops.
The question is how they optimized it, though. The clock difference from Polaris seems too significant to be explained simply by offsetting the power savings of an RF cache.

An extension of their register indexing, where the RF only consists of cached registers, might be interesting. Spill the rest to RAM to keep more waves in flight. That may explain the decreased cache sizes and could work in conjunction with the proposed operand cache.

Dual-issue ADD/MUL could technically increase clocks, faster than fused, if all four operands were used as inputs and the outputs reused. I think that's roughly the AVX-512 4FMAPS instructions as well, which to my understanding entail four consecutive multiply-accumulate operations. That'd be useful in graphics and deep learning, not to mention avoiding bank conflicts.
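To make that concrete, here's a rough scalar sketch (plain C, function name mine) of the four consecutive multiply-accumulates that AVX-512 4FMAPS performs per lane; real hardware would do this across a whole SIMD register:
Code:
/* Rough sketch of a four-deep multiply-accumulate chain in the spirit
 * of AVX-512 4FMAPS. Hypothetical function; per-lane view only. */
float fma4(float acc, const float a[4], const float b[4])
{
    for (int i = 0; i < 4; ++i)
        acc += a[i] * b[i];   /* one multiply-accumulate per step */
    return acc;
}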
 
My understanding was that the geometry processors still leaned on a CU/LDS for the shading/interpolation and part of the tessellation work. Single permanent wave per SE, not necessarily adhering to the cadence. I'll admit I don't fully understand that part of the pipeline though.
Yes, but you are still going at it the wrong way. OpenCL has no concept of "geometry". Unless this benchmark uses OpenGL/D3D/Vulkan, it doesn't matter the tiniest bit what the geometry throughput of a GPU is, because none of the geometry features are going to be visible to it. It's general purpose OpenCL code (running on CUs) that happens to process geometry.
 
Yes, but you are still going at it the wrong way. OpenCL has no concept of "geometry". Unless this benchmark uses OpenGL/D3D/Vulkan, it doesn't matter the tiniest bit what the geometry throughput of a GPU is, because none of the geometry features are going to be visible to it. It's general purpose OpenCL code (running on CUs) that happens to process geometry.
I realize that, but I'd expect the bulk of the code and OpenCL throughput to be representative of what a fully programmable card could push. The only difference being that instead of a fixed function unit filling a wave, it's accomplished through software, and there is no raster.
 
Why are you pulling in "geometry pipelines" then? OpenCL only has a concept of a "kernel". It does not matter if a GPU has no geometry pipes (say, Knights Landing) or has a billion of them.
 
Why are you pulling in "geometry pipelines" then? OpenCL only has a concept of a "kernel". It does not matter if a GPU has no geometry pipes (say, Knights Landing) or has a billion of them.
Because to my understanding, a CU encompasses part of that pipeline. Only difference being the OpenCL version would have saturated the card with triangles. If Vega removed the first stage and all that's left is shaders, then Vega can't handle graphics by your definition. All that would be left is a kernel dividing triangles.
 
Only difference being the OpenCL version would have saturated the card with triangles.
OpenCL has no concept of triangles.
If Vega removed the first stage and all that's left is shaders, then Vega can't handle graphics by your definition.
With OpenCL no GPU can handle graphics.
All that would be left is a kernel dividing triangles.
As MDolenc already said: with OpenCL, all that is left is compute shaders/kernels processing some data (as there is no concept of a triangle).

As you recognized before, OpenCL doesn't use any of the fixed function hardware for geometry processing or rasterization. Therefore, you can't assess the performance of these parts with OpenCL. It's that simple.
 
Because to my understanding, a CU encompasses part of that pipeline. Only difference being the OpenCL version would have saturated the card with triangles.
Again: there are no triangles in OpenCL. Graphics APIs have a concept of triangles, and yes, in that case hull/domain/geometry/(primitive) shaders will invoke CUs. There is no such thing in OpenCL (or CUDA). It's just data that goes into the kernel and data that comes out of the kernel. It does not matter one bit if the data happens to be triangles; it does not affect anything.
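To illustrate (a minimal sketch of my own, not from the benchmark in question): this OpenCL kernel is identical whether `in` holds triangle vertices or arbitrary floats; the runtime only ever sees buffers and a kernel.
Code:
/* Minimal OpenCL C sketch: just data in, data out, running on the CUs.
 * Nothing here tells the GPU the float4s are "vertices" or "triangles". */
__kernel void transform(__global const float4 *in,
                        __global float4 *out,
                        const float scale)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * scale;
}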
 
Why are you pulling in "geometry pipelines" then? OpenCL only has a concept of "kernel". It does not matter if GPU has no geometry pipes (say Knights Landing) or it has a billion of them.
I'd guess that Anarchist4000 expects the Catmull-Clark compute shader to exhibit a similar load on the GPU SIMDs as tessellated games do. However, this isn't the case, since the geometry/tessellation bottleneck of early GCN versions was elsewhere, not in the SIMDs. The SIMDs were poorly utilized in these games. An OpenCL compute kernel isn't using these parts of the GPU, so that bottleneck doesn't apply.
 
Again: there are no triangles in OpenCL. Graphics APIs have a concept of triangles, and yes, in that case hull/domain/geometry/(primitive) shaders will invoke CUs. There is no such thing in OpenCL (or CUDA). It's just data that goes into the kernel and data that comes out of the kernel. It does not matter one bit if the data happens to be triangles; it does not affect anything.
No triangles, but the same algorithms and series of instructions mapped to generic hardware. That's what could occur if the fixed function hardware was removed entirely. Not unlike matrix multiplication at a low level being a bunch of MUL and ADD instructions. To my understanding this is what primitive shaders are doing, but with far more flexibility at the cost of some efficiency.
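For illustration, a tiny C sketch of that point: a matrix multiply at the instruction level is nothing but MULs and ADDs (or fused FMAs) over plain data.
Code:
/* Matrix multiply reduced to its MUL/ADD core (row-major, n x n). */
void matmul(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += a[i*n + k] * b[k*n + j];   /* one MUL + one ADD per step */
            c[i*n + j] = acc;
        }
}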

I'd guess that Anarchist4000 expects the Catmull-Clark compute shader to exhibit a similar load on the GPU SIMDs as tessellated games do. However, this isn't the case, since the geometry/tessellation bottleneck of early GCN versions was elsewhere, not in the SIMDs. The SIMDs were poorly utilized in these games. An OpenCL compute kernel isn't using these parts of the GPU, so that bottleneck doesn't apply.
In early GCN, yes, but Nvidia hardware for example spreads that load across all SMs. A move away from fixed function would likely entail some form of compute shader. That would put all the load, somewhat less efficiently, onto the SIMDs and probably the ACE hardware. More so with extremely high levels of tessellation. While that result may have been anomalous, it would indicate how well part of the algorithm mapped to the underlying hardware. If accurate, it might indicate Vega required tessellation for acceptable performance, the opposite of prior GCN hardware. For what should be a standard compute kernel, Vega was executing it differently from past generations.
 
No triangles, but the same algorithms and series of instructions mapped to generic hardware. That's what could occur if the fixed function hardware was removed entirely. Not unlike matrix multiplication at a low level being a bunch of MUL and ADD instructions. To my understanding this is what primitive shaders are doing, but with far more flexibility at the cost of some efficiency.
The fixed function tessellator hardware simply generates barycentric coordinates based on the edge & center tessellation factors (received from the hull shader). These barycentric coordinates are the input to the domain shader invocations. The problem with old AMD hardware was that the hull shader and domain shader had to run on the same CU. This is problematic when the hull shader outputs large tessellation factors -> lots of domain shader invocations. Geometry shaders have similar problems with load balancing (one of the reasons for GS slowness).
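To make that concrete, a toy C sketch (my own simplification, not the real hardware) of the barycentric grid a single uniform integer factor n produces for a triangle patch; each (u, v, w) triple becomes one domain shader invocation. The real tessellator also handles fractional and per-edge factors, connectivity and winding.
Code:
/* Enumerate barycentric coordinates for a triangle patch with uniform
 * integer tessellation factor n (n >= 1). Returns the invocation count. */
int emit_barycentrics(int n, float *uvw /* capacity 3*(n+1)*(n+2)/2 */)
{
    int count = 0;
    for (int i = 0; i <= n; ++i)
        for (int j = 0; j <= n - i; ++j) {
            uvw[3*count + 0] = (float)i / (float)n;          /* u */
            uvw[3*count + 1] = (float)j / (float)n;          /* v */
            uvw[3*count + 2] = 1.0f - uvw[3*count + 0]
                                    - uvw[3*count + 1];      /* w */
            ++count;
        }
    return count;   /* (n+1)(n+2)/2 domain shader invocations */
}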

Hull shader and domain shader are already fully programmable. All math is done in shaders. The only fixed function part left is the tessellator (that simply calculates the triangle count and barycentrics based on the tessellation factors). The most important thing to notice is that the data paths are also fixed. If we had flexible data paths + efficient way to spawn warps/waves from shader code, we wouldn't need hull/domain shaders or the tessellator at all. This is mostly a data flow problem. Who produces the data, who consumes it and where it is stored between the stages. Current hardware isn't flexible enough to allow the programmer to define this.
For what should be a standard compute kernel, Vega was executing it differently from past generations.
It is apparent that for some reason Vega benchmark score was bad in this particular benchmark. However it is important to understand that the geometry pipeline changes between the GCN4->GCN5 architectures play no role in this result. OpenCL doesn't use geometry pipelines at all. Most likely there are some changes in the shader cores (NCU) or shader compiler that explain this difference. Or simply that beta hardware has some units disabled and this test is heavily bottlenecked by that. Or their preliminary NCU shader compiler still needs some work before it's ready.
 
It is apparent that for some reason Vega benchmark score was bad in this particular benchmark. However it is important to understand that the geometry pipeline changes between the GCN4->GCN5 architectures play no role in this result. OpenCL doesn't use geometry pipelines at all.
Not denying it could be a regression, but it's also possible GCN5 emulates the entire geometry pipeline with a primitive shader and some new capabilities. In that case the OpenCL result would be part of the pipeline. AMD devs did say they were doing a lot of shader work to get Vega running on Linux with the first stage removed. It could be a software-based pipeline.

The most important thing to notice is that the data paths are also fixed. If we had flexible data paths + efficient way to spawn warps/waves from shader code, we wouldn't need hull/domain shaders or the tessellator at all. But this is mostly a data flow problem. Who produces the data, who consumes it and where it is stored between the stages.
The ACEs should be able to do that with some modifications, to my understanding. Each can hold an essentially unlimited number of inactive queues as pointers in RAM. Just need some sort of fork to invoke the new shaders by kicking the dispatch back to the ACEs. HBCC could probably handle allocation, as everything is virtual and capable of being evicted.
 
The ACEs should be able to do that with some modifications, to my understanding. Each can hold an essentially unlimited number of inactive queues as pointers in RAM. Just need some sort of fork to invoke the new shaders by kicking the dispatch back to the ACEs. HBCC could probably handle allocation, as everything is virtual and capable of being evicted.
You don't want to store the intermediate data between pipeline stages to RAM. You want to use fast on-chip memories, such as LDS and GDS. There's a total of 4 MB LDS on the chip. If we want to use memory instead, we need significantly larger L2 cache.
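(For reference, that 4 MB assumes a 64 CU part with 64 KB of LDS per CU: 64 × 64 KB = 4 MB.)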
 
You don't want to store the intermediate data between pipeline stages to RAM. You want to use fast on-chip memories, such as LDS and GDS. There's a total of 4 MB LDS on the chip. If we want to use memory instead, we need significantly larger L2 cache.
Agreed, but that wouldn't be intermediate data as much as shader invocations filling a work queue. Ideally that stays in cache, but a deep queue has the possibility to overflow already.

A kernel could create a persistent queue on an ACE and stream work to keep intermediate data manageable. Kind of surprised current GCN can't do this. It's not far off from indirect execution, so it may be a software thing. With some creativity, a programmer could create an entire pipeline with hopefully fewer than 4-8 stages (whatever the HWS actively tracks) without running to completion. Just need to ensure everything fits to keep it optimal and throttle appropriately.
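A hand-wavy sketch of that idea in standard OpenCL (no ACE-specific features; all names made up): launched waves keep pulling entries off a global queue via an atomic counter instead of each running one fixed item.
Code:
/* Persistent-threads style queue drain: each lane claims queue entries
 * with an atomic counter until the queue is empty, then retires. */
__kernel void drain_queue(__global volatile int *next,
                          const int count,
                          __global const float4 *work,
                          __global float4 *out)
{
    for (;;) {
        int idx = atomic_inc(next);    /* claim the next queue entry */
        if (idx >= count) return;      /* queue drained */
        out[idx] = work[idx] * 2.0f;   /* stand-in for the real shader stage */
    }
}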
 
Looks like we'll get some moderate-to-major roadmap updates on May 16th.

http://wccftech.com/amd-taking-the-covers-off-vega-navi-may-16th/

And hopefully this old thing will get revised.

[Image: AMD next-gen Vega and Navi GPU roadmap, 2017-2018]
 
Not denying it could be a regression, but it's also possible GCN5 emulates the entire geometry pipeline with a primitive shader and some new capabilities.
I'm pretty sure the "Primitive Shader" refers to this (from the open-source driver):
Code:
/* Valid shader configurations:
 *
 * API shaders       VS | TCS | TES | GS |pass| PS
 * are compiled as:     |     |     |    |thru|
 *                      |     |     |    |    |
 * Only VS & PS:     VS |     |     |    |    | PS
 * GFX6 - with GS:   ES |     |     | GS | VS | PS
 *      - with tess: LS | HS  | VS  |    |    | PS
 *      - with both: LS | HS  | ES  | GS | VS | PS
 * GFX9 - with GS:   -> |     |     | GS | VS | PS
 *      - with tess: -> | HS  | VS  |    |    | PS
 *      - with both: -> | HS  | ->  | GS | VS | PS
 *
 * -> = merged with the next stage
 */
So, essentially there's now only one hw shader pre-tessellation, and one post-tessellation. Plus, if a geometry shader is used, a passthrough copy shader (note that the hw "VS" stage is the only one that can write to the parameter cache, and is therefore always last).
But there's still hw support for all this; the driver certainly doesn't just emulate everything with ordinary compute kernels.

The fixed function tessellator hardware simply generates barycentric coordinates based on the edge & center tessellation factors (received from the hull shader). These barycentric coordinates are the input to the domain shader invocations. The problem with old AMD hardware was that the hull shader and domain shader had to run on the same CU. This is problematic when the hull shader outputs large tessellation factors -> lots of domain shader invocations. Geometry shaders have similar problems with load balancing (one of the reasons for GS slowness).
Those shader stages use something called the "ESGS ring" for exporting values. If I understood things correctly, prior to VI this was indeed just memory, but VI can use LDS for it, and Vega uses LDS exclusively. Therefore, I can't see how you could execute later stages in a different CU. Albeit I could certainly be wrong...
 
So this isn't good...

http://www.tweaktown.com/news/57418/amd-radeon-rx-vega-less-20-000-available-launch/index.html

Supposedly, it'll be a near-paper launch:

I've been told that there will be less than 16,000 cards that will ship in the first few months after it launches, something that will come down to the HBM2 used on the card. HBM2 is in extremely limited supply, and is expensive to use - and since there's not enough, that scarcity is driving up the production costs of the card - and will see AMD only having 16,000 cards or so in the months post-launch.

So if this rumor were to be true, it's fair to assume that AMD might use 2 Gbps HBM2 parts for the higher end and 1.6 Gbps HBM2 parts for the 1080-competing lower-end Vega 10 Pro stuff, right?
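For scale, assuming Vega 10 keeps a 2048-bit (two stack) HBM2 bus: 2.0 Gbps × 2048 / 8 = 512 GB/s, versus 1.6 Gbps × 2048 / 8 = 409.6 GB/s for the slower parts.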
 