AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Also some technical questions:
Does Vega support double-precision global atomics, like Pascal added to OpenGL via GL_NV_shader_atomic_float64?
Also, what about 64-bit integer global atomics? Were those already there with Polaris?
Also, with the SIGGRAPH OpenGL news plus the Vega details, I expected to see a new EXT or ARB extension for a feature added by Vega that was previously only available on NV GPUs (and Intel iGPUs) and exposed via NV extensions.
I'm talking about conservative rasterization (even Intel has an OpenGL extension for it, and it's already implemented in Mesa on Linux).
NV has:
GL_NV_conservative_raster
GL_NV_conservative_raster_dilate
GL_NV_conservative_raster_pre_snap_triangles (a tier 2 feature)
I also hope to see the Vega OpenGL driver support the ARB_fragment_shader_interlock extension now that ROVs are supported. Strangely, AMD already exposed INTEL_fragment_ordering, which should provide equivalent functionality, on pre-Vega cards, even though those cards don't support the ROV feature. How?
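
For anyone who hasn't used it, a minimal GLSL sketch of what exposing ARB_fragment_shader_interlock would let a fragment shader do; the image binding and the blend math here are made up for illustration:

#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Ask for rasterization-order execution of the critical section
// among overlapping fragments at the same pixel (ROV-style behavior).
layout(pixel_interlock_ordered) in;

// Hypothetical image we blend into manually instead of using fixed-function blending.
layout(binding = 0, rgba8) uniform coherent image2D colorBuffer;

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockARB();
    // Critical section: this read-modify-write is now ordered between
    // overlapping fragments, which is what programmable blending / OIT needs.
    vec4 dst = imageLoad(colorBuffer, p);
    imageStore(colorBuffer, p, mix(dst, vec4(1.0, 0.0, 0.0, 1.0), 0.5));
    endInvocationInterlockARB();
}

pixel_interlock_ordered gives rasterization-order guarantees; the extension also offers unordered and sample-granularity variants when full ordering or pixel-rate locking is more than you need.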
 
Exotic memory raising platform cost, lengthened pipeline for higher clock-speeds costing a lot of transistors and reducing IPC, feature extensions which require significant developer effort to implement, an architecture pitched as best suited for 'tomorrow's workloads'.
Sounding a bit like the P4.
I don't think IPC has been reduced. FWIW, MAD-latency per clock still seems exactly in line with Fiji.
 
Nevertheless it does clock lower than the same architecture implemented on a similar TSMC process.
Boost clocks are actually 26% higher on GP106 vs GP107.
That's pretty much the real-world performance delta between the GTX 1080 and the 1080 Ti. In Vega's performance segment such differences make a large difference in perception, and thus also in what prices you can ask.
Isn't the max OC clock difference between GP107 and GP106 around 10% (1900 MHz vs. 2100 MHz)? Even with 10% higher clocks, Vega wouldn't be competitive with GP102. So it's the architecture that makes the large(r) difference.
 
Also, it can be quite difficult to compare chips even of the same architecture, because GP107, for example, may use fewer metal layers than GP106 to save cost, being an ultra-low-budget part. It would be interesting to know these details.
 
Nevertheless it does clock lower than the same architecture implemented on a similar TSMC process.
Boost clocks are actually 26% higher on GP106 vs GP107.
That's pretty much the real-world performance delta between the GTX 1080 and the 1080 Ti. In Vega's performance segment such differences make a large difference in perception, and thus also in what prices you can ask.

But still there is a good factory-setting difference of 15% in Boost clock and around 25% in practical Boost performance.
 
I don't think IPC has been reduced. FWIW, MAD-latency per clock still seems exactly in line with Fiji.
If AMD wants to maintain the 4-cycle instruction cadence, they can't touch the instruction latency. If they break the 4-cycle cadence, they need a new shader compiler as the current one is allowed to assume that all standard instructions have no visible latency (results immediately usable for the next instruction). A 64-wide wavefront executes over four cycles on a 16-lane SIMD, so by the time the same wavefront's next instruction issues, the previous result is already complete. The 4-cycle cadence with no visible instruction latency was one of the key points of GCN architecture. It simplified shader compiler design radically.

AMD could, however, have increased the latency of instructions requiring s_waitcnt. These include LDS loads, vector memory loads, scalar memory loads, texture sampling and some cross-lane ops. It should also be simple to increase the latencies of the L1 and L2 caches. But of course you then need higher occupancy to hide these extra latencies.
 
Also some technical questions:
Does Vega support double-precision global atomics, like Pascal added to OpenGL via GL_NV_shader_atomic_float64?
Why do you even want float atomics?
GCN1 supported float and double atomics (cmpswap, min, max) - nvidia can't do any of that (per this extension) but can do atomic add.
But GCN3 tossed out all float and double atomics (well you can do swap...).
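
As an aside, where only compare-swap survives, a float atomic add can still be emulated in the shader with the usual CAS loop. A minimal GLSL compute sketch of the idea, with the image name and the usage in main() invented for illustration:

#version 450

layout(local_size_x = 64) in;

// Single-channel 32-bit unsigned image holding float data as raw bits;
// image atomics in core GLSL require an integer r32ui/r32i format.
layout(binding = 0, r32ui) uniform coherent uimage2D accum;

// Emulate a float atomic add using integer compare-and-swap:
// keep retrying until our swap wins the race.
void atomicAddFloat(ivec2 p, float v)
{
    uint expected = imageLoad(accum, p).x;
    while (true) {
        uint desired = floatBitsToUint(uintBitsToFloat(expected) + v);
        uint old = imageAtomicCompSwap(accum, p, expected, desired);
        if (old == expected)
            break;          // our swap succeeded
        expected = old;     // lost the race; retry with the winner's value
    }
}

void main()
{
    // Hypothetical usage: every invocation accumulates 1.0 into texel (0, 0).
    atomicAddFloat(ivec2(0, 0), 1.0);
}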
 
GCN1 supported float and double atomics (cmpswap, min, max) - nvidia can't do any of that (per this extension) but can do atomic add.
But GCN3 tossed out all float and double atomics (well you can do swap...).

I think nvidia now includes additional functionality with its OpenGL 4.6 support.

The ARB_shader_atomic_counters extension introduced atomic counters, but
it limits the list of potential operations that can be performed on them to
increment, decrement, and query. This extension extends the list of GLSL
built-in functions that can operate on atomic counters. The list of new
operations includes:

* Addition and subtraction
* Minimum and maximum
* Bitwise operators (AND, OR, XOR, etc.)
* Exchange, and compare and exchange operators
https://forum.beyond3d.com/posts/1993984/
https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_shader_atomic_counter_ops.txt
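
Going by the spec linked above, a short GLSL sketch of how those new built-ins look in a shader (the binding and operand values are arbitrary; all of these return the counter's value prior to the operation):

#version 450
#extension GL_ARB_shader_atomic_counter_ops : require

layout(local_size_x = 64) in;

layout(binding = 0) uniform atomic_uint counter;

void main()
{
    // Beyond the old increment/decrement/query, the extension adds e.g.:
    uint prev = atomicCounterAddARB(counter, 8u);            // addition
    uint low  = atomicCounterMinARB(counter, prev);          // minimum
    uint bits = atomicCounterOrARB(counter, 0x80000000u);    // bitwise OR
    uint old  = atomicCounterCompSwapARB(counter, bits, 0u); // compare and exchange
}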
 
If all of AMD's slides were made using the newer drivers with DSBR enabled and those slides show Vega ~= 1080FE, that indicates Vega would be 10-15% slower than the 1080 without that new feature enabled. I'm not saying you're wrong but my brain can't wrap itself around what that means (i.e. with a 50% clock increase Vega would only be ~15% faster than a FuryX). What has AMD been doing for the last 2yrs? :no:

Polaris was kind of a success. But for the high end, yeah, it's a letdown... Being one year late, and with all the architectural changes, they still can't beat nVidia? Having more features and being kind of "future proof" is great, but you have to think about "raw performance right now" too. I just think they're understaffed and/or less talented; they just can't compete right now. It's just speculation anyway...
 
If all of AMD's slides were made using the newer drivers with DSBR enabled and those slides show Vega ~= 1080FE, that indicates Vega would be 10-15% slower than the 1080 without that new feature enabled. I'm not saying you're wrong but my brain can't wrap itself around what that means (i.e. with a 50% clock increase Vega would only be ~15% faster than a FuryX). What has AMD been doing for the last 2yrs? :no:

DSBR is enabled for Vega FE in pro applications only in the current driver (17.20?). AMD posted a slide indicating the performance uplift with it enabled vs. disabled. Clearly it was a best case for them. The lack of similar slides for game workloads implies rather limited gains there. But I have read that power consumption improves a bit with it enabled anyway.

[attached slide: slides-42.jpg]
 
If AMD wants to maintain the 4-cycle instruction cadence, they can't touch the instruction latency. If they break the 4-cycle cadence, they need a new shader compiler as the current one is allowed to assume that all standard instructions have no visible latency (results immediately usable for the next instruction). The 4-cycle cadence with no visible instruction latency was one of the key points of GCN architecture. It simplified shader compiler design radically.
I understand what you're getting at, but I don't think maintaining the 4-cycle instruction cadence is necessary.

Today's SIMDs can be kept filled with only one warp (as long as there are no CU-external data fetches). But if they simply increase that number from 1 to 2, they could ping-pong between those two warps, and no changes to the compiler would be strictly necessary. Since Nvidia needs that too, I don't think that's a huge problem?
 
Interview with Chris Hook, with some more info on power consumption:


Vega Nano - 150W TGP (total graphics power = GPU + Memory)
Vega 56 - 165W TGP
Vega 64 - 220W TGP

He claims power consumption will vary with the complexity of the game being run. Maybe this means Vega drivers will have FRTC enabled by default, as part of Enhanced Sync.
I have to say I really like FRTC. In my CrossFire setup, using a 74 FPS FRTC cap (on a 40-75 Hz FreeSync monitor), I sometimes save over 300W (at the wall) with no discernible loss in performance.


For Vega 64, the difference between TGP and board TDP is 75W (295W board power minus 220W TGP), and for Vega 56 the difference is 45W (210W minus 165W); this much is being spent on power conversion, the active fan and I/O?..
I guess the Nano should be a sub-200W TDP card, maybe with the same 175W as its predecessor.
 
Remember Polaris and the GPU-Z (not the tool's fault!) reports of power consumption, where the figure was in fact only the GPU, not the whole card. The difference was quite substantial. With Vega and the memory included in this figure, the differences will be smaller, but they're there nonetheless.
 
But still there is a good factory-setting difference of 15% in Boost clock and around 25% in practical Boost performance.
GP106 boost: 1706MHz (1060 6GB)
GP107 boost: 1392MHz (1050Ti)
1706/1392 = 1.226, or rounded to two significant digits, 23% higher clocks for GP106.

Numbers this time taken from nVidia's own site.
 
GP106 boost: 1706MHz (1060 6GB)
GP107 boost: 1392MHz (1050Ti)
1706/1392 = 1.226, or rounded to two significant digits, 23% higher clocks for GP106.

Numbers this time taken from nVidia's own site.

And this is TSMC's 16FF+ (GP106) vs. Samsung's 14LPP (GP107). Although GlobalFoundries' 14LPP (Polaris & Vega) is technically the same implementation as Samsung's, it's still a different foundry in a different place, so there could be some differences there, too.
In the end, we still have no idea what's coming from TSMC out of that GF payout. Maybe Ryzen+Vega APUs?
 
GP106 boost: 1706MHz (1060 6GB)
GP107 boost: 1392MHz (1050Ti)
1706/1392 = 1.226, or rounded to two significant digits, 23% higher clocks for GP106.

Numbers this time taken from nVidia's own site.

I was comparing the 1050 Ti to the RX 560, where the 1050 Ti still enjoys a healthy advantage over the RX 560.
 
If AMD wants to maintain the 4-cycle instruction cadence, they can't touch the instruction latency.
They could as long as instruction issue latency does not drop below execution cycle count. They would have to touch something else.
There are some measures, like forwarding or rearranging operand fetch, that could help cover a number of gaps; and at least with the probably delayed register writeback, forwarding is already involved in covering for the trailing edge of the current pipeline.

If they break the 4-cycle cadence, they need a new shader compiler as the current one is allowed to assume that all standard instructions have no visible latency (results immediately usable for the next instruction).
I've seen some commentary here, and also in places like GPUOpen about flaws in AMD's code generation choices. Is a new compiler necessarily a bad thing?

The 4-cycle cadence with no visible instruction latency was one of the key points of GCN architecture. It simplified shader compiler design radically.
I'm not convinced at this point that having instruction latency is an intractable, unsolved, or substantially difficult problem.

GCN's tidy execution loop strikes me as being trapped in a local minimum. It affects too many architectural parameters to be easily changed, and simultaneously they cannot be adjusted without impacting it and each other.

AMD could, however, have increased the latency of instructions requiring s_waitcnt.
GFX9 increases vmcnt by 4x, with the extra two bits placed at the end of the representation to maintain backwards compatibility.
Vega strives to maintain binary compatibility in other ways. Its old FP16 instruction encoding is maintained with the pre-GFX9 semantics, although those are now renamed as legacy operations. The new FP16 instructions that mostly mirror the old ones inherited their names, something the LLVM patch notes are rather caustic about.

Today's SIMDs can be kept filled with only one warp (as long as there are no CU-external data fetches). But if they simply increase that number from 1 to 2, they could ping-pong between those two warps, and no changes to the compiler would be strictly necessary. Since Nvidia needs that too, I don't think that's a huge problem?
It's what AMD's VLIW GPUs did. The minimum occupancy for basic throughput per SIMD was two wavefronts, an A and B.
There were some interesting games that could be played with passing register values or allocating within the register file, but they were potentially complex.
GCN upped it to 4, but since the CU was the new basis it was able to drop it to 1 per SIMD.
 