AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

How much of Vega's (and Polaris's?) power problems is solely down to GlobalFoundries, versus TSMC?
I'd wager none. They basically used the new node to increase clock speeds at roughly the same power consumption as Fury X. Apparently GCN required a lot of juice and area for that. Ryzen turned out fine despite being fabbed at GF.
 
Reference?
radeonsi/gfx9: reduce max threads per block to 1024 on gfx9+
The number of supported waves per thread group has been reduced to 16 with gfx9. Trying to use 32 waves causes hangs, and barriers might not work correctly with > 16 waves.
https://cgit.freedesktop.org/mesa/mesa/commit/?id=a0e6b9a2db5aa5f06a4f60d270aca8344e7d8b3f
Sorry, per thread group, not per SIMD. Regardless, that seems a rather interesting change in the scheme of things. However, I thought 1024 was an established limit for most APIs.
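For what it's worth, this is roughly how that limit surfaces to applications. A minimal host-side sketch, assuming OpenCL and an already-created device and kernel (the function name is mine, not from the Mesa change):

```c
/* Minimal sketch: query the work-group limits the runtime actually
 * advertises instead of assuming a fixed 1024/2048 work-item cap.
 * Assumes a valid cl_device_id and cl_kernel already exist. */
#include <stdio.h>
#include <CL/cl.h>

void print_workgroup_limits(cl_device_id dev, cl_kernel kernel)
{
    size_t dev_max = 0, krn_max = 0;

    /* Device-wide ceiling on work items per work group. */
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(dev_max), &dev_max, NULL);

    /* Per-kernel ceiling, which can be lower (register/LDS pressure). */
    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(krn_max), &krn_max, NULL);

    /* On GCN, 16 wavefronts x 64 lanes = 1024 work items per group. */
    printf("device max: %zu, kernel max: %zu\n", dev_max, krn_max);
}
```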

I would reference the filing date for AMD's binning rasterizer, which is 2013. Vega's development may have been prolonged by internal issues, but even in other contexts you wouldn't expect something patented now to have been sat on for the years the design was moving down the pipeline. Given the timing, we might have to wonder if or when it might show up. That it was filed and made public may mean AMD isn't worried about competitors seeing it too early (publication can be delayed significantly from filing if you care).
AMD has had a bunch of patents recently that were all quickly filed and published. For the most part they seem to be software techniques for ambiguous hardware. Like most patents. Anyways:
MEMORY MANAGEMENT IN GRAPHICS AND COMPUTE APPLICATION PROGRAMMING INTERFACES
METHOD AND APPARATUS TO ACCELERATE RENDERING OF GRAPHICS IMAGES (Perhaps the worst patent title ever?)

128K of what, page table entries?
GCN does support 4KB x86 page tables, but 128K entries of 4KB pages only map 512MB, which would not cover 16GB. Even if going with the coarse 64KB PRT granularity, the additional context and history tracking would go over a MB.
On top of that, CPU TLB hierarchies and the page table hierarchy are backed up by their caches. The HBCC may not have that option, or it might want to avoid using it that much given the L2 isn't that expansive.
In a more compressed form, yes. While certainly possible, I'm guessing the paged memory is a subset of the overall pool, leaving the HBCC to only track active pages. Some resources (framebuffer, meshes, stacks) simply won't lend themselves to paging very well and will likely be kept in a separate pool.
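Rough back-of-the-envelope for the tracking cost, just to put numbers on the above. The bytes-per-page figure is my own illustrative assumption, not a known HBCC number:

```c
/* Back-of-the-envelope page-tracking cost for a 16 GB pool.
 * The per-entry size is an illustrative assumption, not an AMD figure. */
#include <stdio.h>

int main(void)
{
    const unsigned long long pool  = 16ULL << 30;  /* 16 GB                 */
    const unsigned long long pg4k  = 4ULL  << 10;  /* 4 KB page             */
    const unsigned long long pg64k = 64ULL << 10;  /* 64 KB PRT granularity */
    const unsigned long long entry = 8;            /* assumed bytes of state/history per page */

    printf("4KB pages : %llu entries, ~%llu KB of tracking\n",
           pool / pg4k,  (pool / pg4k)  * entry >> 10);   /* 4M entries, ~32 MB  */
    printf("64KB pages: %llu entries, ~%llu KB of tracking\n",
           pool / pg64k, (pool / pg64k) * entry >> 10);   /* 256K entries, ~2 MB */
    return 0;
}
```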
 
I'd wager none. They basically used the new node to increase clock speeds at roughly the same power consumption as Fury X. Apparently GCN required a lot of juice and area for that. Ryzen turned out fine despite being fabbed at GF.

I'd say you're oversimplifying things. Ryzen is too different from Vega/Fury to compare (and who's to say it doesn't perform as it does in spite of the process hampering it, rather than because of it).

Further, for Fury vs Vega, both the architecture and the process are different. Which one is the main cause of Vega underperforming is anyone's guess.
 
Possible additions so far - without actual sizes though:
Some references to "Texture Caches" in the drivers, so there could be multiple generic L2s. That might actually make sense for async to avoid thrashing.

Instruction caches would be significant. 48KB (16+32KB) per CU on an old GCN iteration, and they've been growing. Those could be critical with higher clocks and total over 3MB. Then, if INT and FP are running concurrently as suggested in one slide, they would need to be much larger. Could be 10MB or more in various instruction caches there. Certainly not everything, but that could be half the unaccounted SRAM.
 
Texture Caches are L1. 16 KiB per CU.
Where did you get the instruction caches sizes from? Very curious, since I've either completely forgotten about them being discussed or have never seen it.
edit: Ah, the very first GCN presentation. 4 CUs sharing a 16 KiB scalar read-only cache (constants?) and a 32 KiB instruction L1!
Which slide suggests that INT and FP are running concurrently (on the vec16-SIMDs)?
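For scale, a quick tally of those figures over a hypothetical 64-CU part. The CU count and the assumption that the old per-4-CU sharing arrangement carries over to Vega are mine:

```c
/* SRAM tally using the original GCN figures quoted above:
 * 16 KiB vector/texture L1 per CU, and per group of 4 CUs a shared
 * 32 KiB instruction L1 + 16 KiB scalar read-only cache.
 * The 64-CU count is an assumption for a full Vega 10. */
#include <stdio.h>

int main(void)
{
    const int cus       = 64;
    const int l1_per_cu = 16;   /* KiB vector/texture L1               */
    const int i_cache   = 32;   /* KiB instruction L1, shared by 4 CUs */
    const int k_cache   = 16;   /* KiB scalar cache, shared by 4 CUs   */

    int groups = cus / 4;
    int total  = cus * l1_per_cu + groups * (i_cache + k_cache);

    printf("%d KiB (~%.2f MiB) across L1 + instruction + scalar caches\n",
           total, total / 1024.0);   /* 1024 + 768 = 1792 KiB, ~1.75 MiB */
    return 0;
}
```

So with the shared arrangement the instruction/scalar caches come to well under a MiB, rather than the 3MB+ that a flat 48KB-per-CU reading would give.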
 
Sorry, per thread group, not per SIMD. Regardless, that seems a rather interesting change in the scheme of things. However, I thought 1024 was an established limit for most APIs.
The GCN3 ISA indicates the maximum workgroup size is 16 wavefronts (1024 work items). Whatever limit is being set here has some other confounding issue if they got away with 2048 before.
 
The "explanation" of the primitive shader is entirely unconvincing. Seems likely to be a white elephant. I wonder if this was built by AMD as the basis for a console chip at some later date. In a console it would be totally awesome, I presume.
It could be an evolution of a customization created for a console already built.
http://www.gamasutra.com/view/feature/191007/inside_the_playstation_4_with_mark_.php?page=3

"There are a broad variety of techniques we've come up with to reduce the vertex bottlenecks, in some cases they are enhancements to the hardware," said Cerny. "The most interesting of those is that you can use compute as a frontend for your graphics."

This technique, he said, is "a mix of hardware, firmware inside of the GPU, and compiler technology. What happens is you take your vertex shader, and you compile it twice, once as a compute shader, once as a vertex shader. The compute shader does a triangle sieve -- it just does the position computations from the original vertex shader and sees if the triangle is backfaced, or the like. And it's generating, on the fly, a reduced set of triangles for the vertex shader to use. This compute shader and the vertex shader are very, very tightly linked inside of the hardware."

The front end reorganization may allow for a more generic version of this compute shader, so that it can feed primitive setup across various combinations of VS, TS, and GS. Perhaps the specialized compiler mode is a precursor to how AMD expected to make existing vertex code work for Vega.
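A conceptual sketch of the "triangle sieve" idea described above, written as plain C rather than an actual compute shader. The names, the data layout, and the screen-space backface test are illustrative, not the PS4 or Vega implementation:

```c
/* Conceptual sketch of a position-only "triangle sieve":
 * run just the position math, drop back-facing triangles, and emit a
 * compacted index list for the full vertex shader to consume.
 * Names and layout are illustrative, not AMD's actual implementation. */
#include <stddef.h>

typedef struct { float x, y, z, w; } Vec4;

/* Stand-in for the position-only portion of the original vertex shader. */
extern Vec4 run_position_shader(unsigned vertex_index);

/* Signed area of the projected triangle; <= 0 means back-facing
 * (assuming counter-clockwise front faces). */
static int is_front_facing(Vec4 a, Vec4 b, Vec4 c)
{
    float ax = a.x / a.w, ay = a.y / a.w;
    float bx = b.x / b.w, by = b.y / b.w;
    float cx = c.x / c.w, cy = c.y / c.w;
    float area = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay);
    return area > 0.0f;
}

/* Sieve: returns the number of surviving indices written to out_indices. */
size_t triangle_sieve(const unsigned *indices, size_t index_count,
                      unsigned *out_indices)
{
    size_t out = 0;
    for (size_t i = 0; i + 2 < index_count; i += 3) {
        Vec4 a = run_position_shader(indices[i + 0]);
        Vec4 b = run_position_shader(indices[i + 1]);
        Vec4 c = run_position_shader(indices[i + 2]);
        if (is_front_facing(a, b, c)) {
            out_indices[out++] = indices[i + 0];
            out_indices[out++] = indices[i + 1];
            out_indices[out++] = indices[i + 2];
        }
    }
    return out;
}
```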


Some references to "Texture Caches" in the drivers, so there could be multiple generic L2s. That might actually make sense for async to avoid thrashing.
Exact wording may be important. There are references to texture channel caches, which are actually describing the L2.
 
Which slide suggests that INT and FP are running concurrently (on the vec16-SIMDs)?
AMD estimates a Vega NGCU to be able to handle 4-5x the number of operations per clock cycle relative to the previous CUs in Polaris. They demonstrate a use case of Rapid Packed Math using 3DMark Serra, a custom demo created by Futuremark for AMD to show off this technology, wherein 16-bit integer and floating point operations result in as much as a 25% benefit in operation count.
https://www.techpowerup.com/reviews/AMD/Vega_Microarchitecture_Technical_Overview/3.html
Going to hold off on the concurrency part for the time being, given the context after re-reading that paragraph. That 4-5x part is still interesting though. INT8 would be 4x, but the extra +1 relative to Polaris I'm unsure about. The rest of the paragraph is INT16/FP16, which I figured ran concurrently for 4x.

Looking at slide 17, it's possible there are two 64KB banks per SIMD. That could account for a good chunk of SRAM and make sense with the longer pipelines and higher clocks.
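If those really are two 64KB banks per SIMD, the register-file SRAM alone adds up quickly. Quick tally below; the two-bank reading of slide 17 and the 64-CU count are both assumptions:

```c
/* Register-file SRAM if each SIMD carries two 64 KB banks
 * (the two-bank reading of slide 17 and the 64-CU count are assumptions). */
#include <stdio.h>

int main(void)
{
    const int cus            = 64;
    const int simds_per_cu   = 4;
    const int banks_per_simd = 2;
    const int bank_kb        = 64;

    int total_kb = cus * simds_per_cu * banks_per_simd * bank_kb;
    printf("%d KB, i.e. %d MB of VGPR SRAM\n", total_kb, total_kb / 1024);
    /* 64 * 4 * 2 * 64 = 32768 KB = 32 MB */
    return 0;
}
```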

The GCN3 ISA indicates the maximum workgroup size is 16 wavefronts (1024 work items). Whatever limit is being set here has some other confounding issue if they got away with 2048 before.
It may be a Linux thing, because that code would have been actively used for years now. All the documentation I recall has the 1024 work item limit as you mentioned, but obviously they were exceeding that limit with some success.
 
Going to hold off on the concurrency part for the time being, given the context after re-reading that paragraph. That 4-5x part is still interesting though. INT8 would be 4x, but the extra +1 relative to Polaris I'm unsure about. The rest of the paragraph is INT16/FP16, which I figured ran concurrently for 4x.
If some of the instructions in the addressing category have scalar and vector variants, the scalar portion running a chained operation can be the +1 if running concurrently with a 4x INT8 operation. It seems like a sensible thing to have in both domains.
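A scalar reference for how packed 8-bit math gets to 4x per 32-bit lane. This is a plain-C emulation of the idea, not a claim about which instruction Vega actually exposes:

```c
/* Reference emulation of a packed INT8 dot-product-accumulate:
 * four 8-bit multiply-adds folded into one 32-bit operand pair.
 * Conceptual illustration only, not a specific Vega instruction. */
#include <stdint.h>
#include <stdio.h>

int32_t dot4_i8(uint32_t a_packed, uint32_t b_packed, int32_t acc)
{
    for (int i = 0; i < 4; ++i) {
        int8_t a = (int8_t)(a_packed >> (8 * i));
        int8_t b = (int8_t)(b_packed >> (8 * i));
        acc += (int32_t)a * (int32_t)b;   /* 4 MACs per 32-bit lane */
    }
    return acc;
}

int main(void)
{
    /* a = {1, 2, 3, 4}, b = {5, 6, 7, 8} packed little-endian into 32 bits */
    uint32_t a = 0x04030201u, b = 0x08070605u;
    printf("%d\n", dot4_i8(a, b, 0));     /* 1*5 + 2*6 + 3*7 + 4*8 = 70 */
    return 0;
}
```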
 
It's only inactive in Vega FE; it's active in AMD's final RX performance targets. AMD stated it's active in the 17.20 driver (page 43 note), and AMD tested all of their games using driver 17.30, which is the driver after it.

If all of AMD's slides were made using the newer drivers with DSBR enabled, and those slides show Vega ~= 1080 FE, that indicates Vega would be 10-15% slower than the 1080 without that new feature enabled. I'm not saying you're wrong, but my brain can't wrap itself around what that means (i.e. with a 50% clock increase Vega would only be ~15% faster than a Fury X). What has AMD been doing for the last 2yrs? :no:
 
So, having viewed the slides: in summary, no advancement of FP64 performance, which is my main interest. The graphics performance of my R9 290Xs is good enough for me, but it's great to see the other improvements.
 
I'd say you're oversimplifying things. Ryzen is too different from Vega/Fury to compare (and who's to say it doesn't perform as it does in spite of the process hampering it, rather than because of it).

Further, for Fury vs Vega, both the architecture and the process are different. Which one is the main cause of Vega underperforming is anyone's guess.

GP107 is the counterpoint. Although its baseline frequencies are rather low, it boosts easily to 1700 MHz, uses less power per FPS than the AMD competition, and is made on the same process, but at Samsung.
 
that indicates Vega would be 10-15% slower than the 1080 without that new feature enabled.
It was indeed; Vega FE is in between the 1070 and 1080, and it had DSBR disabled. AMD will provide a patch to enable it for FE when RX launches, so FE and RX will have equal gaming performance. Maybe @Rys can shed more light on the matter, if his hands are not tied, that is.
 
I think the real question is, with so many FreeSync monitors on the market, why do they bundle the one with the worst reputation? I think Samsung simply saw this as an opportunity to offload a lemon, and AMD took the bait.

GSync has plenty of flickering issues as well. Samsung is one of the biggest suppliers, and a 20%+ discount on a monitor is pretty huge, which many manufacturers probably can't offer. Since Samsung makes the panels, they obviously have the highest markup.

A quick fix was already posted and I'm sure saving a ton vs GSync is welcome.
 
If all of AMD's slides were made using the newer drivers with DSBR enabled, and those slides show Vega ~= 1080 FE, that indicates Vega would be 10-15% slower than the 1080 without that new feature enabled. I'm not saying you're wrong, but my brain can't wrap itself around what that means (i.e. with a 50% clock increase Vega would only be ~15% faster than a Fury X). What has AMD been doing for the last 2yrs? :no:
Exotic memory raising platform cost, lengthened pipeline for higher clock-speeds costing a lot of transistors and reducing IPC, feature extensions which require significant developer effort to implement, an architecture pitched as best suited for 'tomorrow's workloads'.
Sounding a bit like the P4.
 
GP107 is the counterpoint. Although its baseline frequencies are rather low, it boosts easily to 1700 MHz, uses less power per FPS than the AMD competition, and is made on the same process, but at Samsung.
Nevertheless, it does clock lower than the same architecture implemented on the similar TSMC process.
Boost clocks are actually 26% higher on GP106 vs GP107.
That's pretty much the real-world performance delta between the GTX 1080 and 1080 Ti. In Vega's performance segment such differences make a large difference in perception, and thus also in what prices you can ask.
 