The flickering happens when you enable the 48-100 mode.
See https://forum.beyond3d.com/posts/1993993/
How much of Vega's (and Polaris's?) power problems are solely due to Global Foundries, versus TSMC?

I'd wager none. They basically used the new node to increase clock speeds at roughly the same power consumption as the Fury X. Apparently GCN required a lot of juice and area for that. Ryzen turned out fine despite being fabbed at GF.
Reference?
Sorry, per thread group, not per SIMD. Regardless, that seems a rather interesting change in the scheme of things. However, I thought 1024 was an established limit for most APIs.

radeonsi/gfx9: reduce max threads per block to 1024 on gfx9+
The number of supported waves per thread group has been reduced to 16 with gfx9. Trying to use 32 waves causes hangs, and barriers might not work correctly with > 16 waves.
https://cgit.freedesktop.org/mesa/mesa/commit/?id=a0e6b9a2db5aa5f06a4f60d270aca8344e7d8b3f
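For reference, the 1024 figure falls straight out of the wave math: a GCN wavefront is 64 work items, so a 16-wave cap per thread group allows at most 16 x 64 = 1024 work items, and a 2048-item group would need the 32 waves the commit says can hang. A minimal C sketch of that arithmetic (the helper name is just illustrative):

```c
#include <stdio.h>

#define WAVEFRONT_SIZE      64  /* work items per GCN wavefront */
#define MAX_WAVES_PER_GROUP 16  /* gfx9 cap per the Mesa commit above */

/* Illustrative helper: wavefronts needed for a given workgroup size. */
static unsigned waves_for_workgroup(unsigned work_items)
{
    return (work_items + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

int main(void)
{
    const unsigned sizes[] = { 256, 1024, 2048 };
    for (int i = 0; i < 3; i++) {
        unsigned waves = waves_for_workgroup(sizes[i]);
        printf("%4u work items -> %2u waves (%s)\n", sizes[i], waves,
               waves <= MAX_WAVES_PER_GROUP ? "fits the gfx9 limit"
                                            : "exceeds the 16-wave limit");
    }
    return 0;
}
```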
AMD has had a bunch of patents recently that were all quickly filed and published. For the most part they seem to be software techniques for ambiguous hardware, like most patents. Anyways: I would reference the filing date for AMD's binning rasterizer, which is 2013. Vega's development may have been prolonged by internal issues, but even in other contexts you wouldn't expect something patented now to have been sat on for the years the design was moving down the pipeline. Given the timing, we might have to wonder if or when it might show up. That it was filed and made public may mean AMD isn't worried about competitors seeing it too early (publication can be delayed significantly from filing if you care).
128K of what, page table entries?

In a more compressed form, yes. While certainly possible, I'm guessing the paged memory is a subset of the overall pool, leaving the HBCC to only track active pages. Some resources (framebuffer, meshes, stacks) simply won't lend themselves to paging very well and will likely be kept in a separate pool.
GCN does support 4KB x86 page tables, but 128K entries at that granularity would not come close to covering 16GB. Even going with the coarse 64KB PRT granularity, the additional context and history tracking would run over a MB.
On top of that, CPU TLB hierarchies and the page table hierarchy are backed up by their caches. The HBCC may not have that option, or it might want to avoid using it that much given the L2 isn't that expansive.
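Rough numbers behind that estimate, assuming 8 bytes per entry (a typical x86-64 PTE size; the actual per-page state the HBCC keeps, including any history bits, isn't public):

```c
#include <stdio.h>

int main(void)
{
    const unsigned long long pool_bytes   = 16ULL << 30;  /* 16 GiB local pool */
    const unsigned long long entry_bytes  = 8;            /* assumed per-entry size */
    const unsigned long long page_sizes[] = { 4ULL << 10, 64ULL << 10 };

    for (int i = 0; i < 2; i++) {
        unsigned long long pages = pool_bytes / page_sizes[i];
        printf("%2llu KiB pages: %7llu entries (~%5llu KiB of entries); "
               "128K entries would map only %llu MiB\n",
               page_sizes[i] >> 10, pages, (pages * entry_bytes) >> 10,
               (131072ULL * page_sizes[i]) >> 20);
    }
    return 0;
}
```

That works out to roughly 2 MiB of entries at 64 KiB granularity (the "over a MB" above) and 32 MiB at 4 KiB, while 128K entries of either size would cover only 8 GiB or 512 MiB respectively.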
Some references to "Texture Caches" in drivers so there could be multiple generic L2s. That might actually make sense for async to avoid trashing.Possible additions so far - without actual sizes though:
The GCN3 ISA indicates the maximum workgroup size is 16 wavefronts (1024 work items). Whatever limit is being set here has some other confounding issue if they got away with 2048 before.
The "explanation" of the primitive shader is entirely unconvincing. Seems likely to be a white elephant. I wonder if this was built by AMD as the basis for a console chip at some later date. In a console it would be totally awesome, I presume.

It could be an evolution of a customization created for a console already built.
"There are a broad variety of techniques we've come up with to reduce the vertex bottlenecks, in some cases they are enhancements to the hardware," said Cerny. "The most interesting of those is that you can use compute as a frontend for your graphics."
This technique, he said, is "a mix of hardware, firmware inside of the GPU, and compiler technology. What happens is you take your vertex shader, and you compile it twice, once as a compute shader, once as a vertex shader. The compute shader does a triangle sieve -- it just does the position computations from the original vertex shader and sees if the triangle is backfaced, or the like. And it's generating, on the fly, a reduced set of triangles for the vertex shader to use. This compute shader and the vertex shader are very, very tightly linked inside of the hardware."
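Not AMD's or Sony's actual implementation, but a minimal CPU-side sketch of the "triangle sieve" idea Cerny describes: run only the position math, drop back-facing triangles, and emit a compacted index list for the real vertex shader pass. All names here are hypothetical.

```c
#include <stddef.h>

typedef struct { float x, y; } vec2;  /* post-projection screen-space positions */

/* Twice the signed area of the screen-space triangle; <= 0 means
 * back-facing or degenerate under a counter-clockwise front-face rule. */
static float signed_area2(vec2 a, vec2 b, vec2 c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

/* "Triangle sieve": walk the original index buffer, keep only triangles that
 * pass the facing test, and write a reduced index buffer for the second pass.
 * Returns the number of indices written. */
static size_t sieve_triangles(const vec2 *pos, const unsigned *idx,
                              size_t idx_count, unsigned *out_idx)
{
    size_t out = 0;
    for (size_t i = 0; i + 2 < idx_count; i += 3) {
        unsigned i0 = idx[i], i1 = idx[i + 1], i2 = idx[i + 2];
        if (signed_area2(pos[i0], pos[i1], pos[i2]) > 0.0f) {
            out_idx[out++] = i0;
            out_idx[out++] = i1;
            out_idx[out++] = i2;
        }
    }
    return out;
}
```

The interesting part in Cerny's description is that this filtering runs as a compute shader generated from the same vertex shader source and feeds the vertex stage on the fly, which a plain CPU loop obviously can't capture.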
Exact wording may be important. There are references to "texture channel caches", which are actually describing the L2.
Which slide suggests that INT and FP are running concurrently (on the vec16-SIMDs)?
Going to hold off on the concurrently part given the context for the time being, after re-reading that paragraph. That 4-5x part is still interesting though. INT8 would be 4x, but the extra +1 relative to Polaris I'm unsure about. The rest of the paragraph is INT16/FP16, which I figured ran concurrently for 4x.

AMD estimates a Vega NGCU to be able to handle 4-5x the number of operations per clock cycle relative to the previous CUs in Polaris. They demonstrate a use case of Rapid Packed Math using 3DMark Serra, a custom demo created by Futuremark for AMD to show off this technology, wherein 16-bit integer and floating point operations result in as much as 25% benefit in operation count.
https://www.techpowerup.com/reviews/AMD/Vega_Microarchitecture_Technical_Overview/3.html
It may be a Linux thing, because that code would have been actively used for years now. All the documentation I recall has the 1024 work item limit as you mentioned, but obviously they were exceeding that limit with some success.
If some of the instructions in the addressing category have scalar and vector variants, the scalar portion running a chained operation can be the +1 if running concurrently with a 4x INT8 operation. It seems like a sensible thing to have in both domains.
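On the counting itself, the packing factors are just lane-width arithmetic: a 32-bit lane holds two 16-bit or four 8-bit values, which gives the 2x and 4x figures; the extra +1 from a co-issued scalar op is the speculation above, not anything AMD has confirmed. A small C sketch of the per-CU numbers:

```c
#include <stdio.h>

int main(void)
{
    const int simds_per_cu   = 4;   /* four vec16 SIMDs per (NG)CU */
    const int lanes_per_simd = 16;
    const int lane_bits      = 32;

    const int op_widths[] = { 32, 16, 8 };  /* FP32, FP16/INT16 packed, INT8 packed */
    for (int i = 0; i < 3; i++) {
        int packing = lane_bits / op_widths[i];  /* values per 32-bit lane */
        printf("%2d-bit ops: %dx packing -> %3d ops/clock per CU\n",
               op_widths[i], packing,
               packing * lanes_per_simd * simds_per_cu);
    }
    return 0;
}
```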
It's only inactive in Vega FE; it's active in AMD's final RX performance targets. AMD stated it's active in the 17.20 driver (page 43 note), and AMD tested all of their games using driver 17.30, which is the driver after it.
I'd say you're oversimplifying things. Ryzen is too different from Vega / Fury to compare (and who's to say it doesn't perform as it does in spite of the process hampering it, rather than because of it).
Further, for Fury vs. Vega, both the architecture and the process are different. Which is the main cause of Vega's underperformance could be anyone's guess.
Honestly no idea why Samsung does the multiple modes when other vendors using the same panels don't, because yeah, 80-100 is very small.

I think the real question is: with so many FreeSync monitors on the market, why do they bundle the one with the worst reputation? I think Samsung simply saw this as an opportunity to offload a lemon, and AMD took the bait.
that indicates Vega would be 10-15% slower than the 1080 without that new feature enabled.

It was indeed; Vega FE is in between the 1070 and 1080, and it had DSBR disabled. AMD will provide a patch to enable it for FE when RX launches, so FE and RX will have equal gaming performance. Maybe @Rys can shed more light on the matter, if his hands aren't tied, that is.
If all of AMD's slides were made using the newer drivers with DSBR enabled, and those slides show Vega ~= 1080 FE, that indicates Vega would be 10-15% slower than the 1080 without that new feature enabled. I'm not saying you're wrong, but my brain can't wrap itself around what that means (i.e. with a 50% clock increase, Vega would only be ~15% faster than a Fury X). What has AMD been doing for the last 2 years?

Exotic memory raising platform cost, a lengthened pipeline for higher clock speeds costing a lot of transistors and reducing IPC, feature extensions which require significant developer effort to implement, an architecture pitched as best suited for 'tomorrow's workloads'.
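To put numbers on that gap: both chips are 64-CU parts, so if performance tracked clocks alone, a roughly 1050 MHz to 1600 MHz move (approximate reference/boost clocks, used purely for illustration) would predict about +50%, not +15%. A trivial sketch of that comparison:

```c
#include <stdio.h>

int main(void)
{
    /* Approximate clocks, for illustration only; both parts have 64 CUs. */
    const double fury_x_mhz = 1050.0;
    const double vega_mhz   = 1600.0;

    double clock_scaling = vega_mhz / fury_x_mhz;  /* ~1.52x */
    double estimated     = 1.15;  /* "only ~15% faster than a Fury X", per the post above */

    printf("Ideal gain from clocks alone:  %.0f%%\n", (clock_scaling - 1.0) * 100.0);
    printf("Estimated gain without DSBR:   %.0f%%\n", (estimated - 1.0) * 100.0);
    printf("Implied per-clock throughput:  %.0f%% of Fury X\n",
           estimated / clock_scaling * 100.0);
    return 0;
}
```

In other words, if the DSBR-off estimate is right, roughly a quarter of the per-clock throughput would have gone missing relative to Fury X, which is the part that is hard to wrap one's head around.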
GP107 is the counter point. Although with baseline frequencies rather low, it boosts easily to 1700 MHz, uses less power per FPS than the AMD competition, and is made on the same process, but at Samsung.

Nevertheless, it does clock lower than the same architecture implemented on the similar TSMC process.