AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

The problem with the typo is that an RX 480 would never be in the same performance bracket as the GTX 1080. Then again, these requirements are messed up far more often than they should be.

There is precedent for game requirements to pick non-equivalent parts from different camps.

Like how the BF1 requirements recommend a 4790 or an 8350 on the CPU side. I think we can agree those aren't going to perform similarly.

And then the "minimum" is a 6600K or a 6350. I can't make this shit up: http://techreport.com/news/30679/battlefield-1-system-requirements-ask-for-a-core-i5-6600k

I know that's CPUs and we're talking about GPUs, but if Bethesda wanted to pick the "best" option from each camp, then a 1080 and a 480 would be just that (even though they aren't equivalent).

Not necessarily. A web guy could have removed it while verifying its accuracy, or NDAs could have been breached if it was drawing attention.

That is possible. I know tons of people picked up on it in that Reddit post I linked.

We'll have to check it in a couple days.
 
Speaking of Vega, Hynix has updated their HBM2 portfolio again, and it seems they now only have 1.6Gbps chips. 2Gbps chips are nowhere to be found.

https://videocardz.com/65649/sk-hynix-updates-memory-product-catalog-hbm2-available-in-q1-2017

So either big Vega is coming with 1.6Gbps after all (410GB/s total), or AMD will be turning to Samsung to get the 2Gbps chips on big Vega and Hynix's chips will be used on something else like small Vega or APUs.


They can't turn to Samsung until their exclusivity clause is fulfilled.....

And big Vega is surely going to need more than 410 GB/s bandwidth, and this is why AMD hasn't nailed down a launch date and gave us a "vague" 1st half; even that looks to be iffy going by Hynix's 2.0Gbps memory being TBA. Kinda getting close to the wire.

So was it really good for AMD to have HBM exclusivity from Hynix? Looks to me they got bitten in the ass by this one.....
 
And big Vega is surely going to need more than 410 GB/s bandwidth

Why? The chip is close to GP102 in size and the latter has a 480GB/s bandwidth. With two 1.6Gbps stacks the big Vega would get 410GB/s.
It's a 17% difference in memory bandwidth.
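For reference, the back-of-the-envelope math behind those numbers, assuming two stacks with the standard 1024-bit HBM2 interface each (a quick sketch, not official figures):

    # Rough HBM2 bandwidth: stacks * interface width * per-pin data rate
    def hbm2_bandwidth_gbs(stacks, pins_per_stack=1024, gbps_per_pin=1.6):
        return stacks * pins_per_stack * gbps_per_pin / 8  # Gbit/s -> GB/s

    low   = hbm2_bandwidth_gbs(2, gbps_per_pin=1.6)  # ~409.6 GB/s
    high  = hbm2_bandwidth_gbs(2, gbps_per_pin=2.0)  # ~512 GB/s
    gp102 = 480.0                                    # GB/s, as mentioned above
    print(low, high, gp102 / low - 1)                # last value is ~0.17, i.e. the 17% gap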
 
They can't turn to Samsung until their exclusivity clause is fulfilled.....

And big Vega is surely going to need more than 410 GB/s bandwidth, and this is why AMD hasn't nailed down a launch date and gave us a "vague" 1st half; even that looks to be iffy going by Hynix's 2.0Gbps memory being TBA. Kinda getting close to the wire.

So was it really good for AMD to have HBM exclusivity from Hynix? Looks to me they got bitten in the ass by this one.....
What Zaphod said.

And I always understood that AMD has secured "priority access" to HBM2 produced by SKHynix (i.e. they can block access by others by buying all of it) in exchange for the commitment to buy a significant share of the initial products. To my understanding that doesn't necessarily prohibit AMD from also buying HBM2 from Samsung, especially in case SKHynix can't deliver 2GT/s HBM2 and Samsung offers it. But I guess nobody here knows (and wants to share) the exact wording of the agreement.
A more fundamental problem with Samsung's HBM2 could come from implementation details like the physical dimensions of the HBM2 stacks, which are likely slightly different between Samsung and SKHynix (just the height of the stack is specified). It could be that the stacks of one of the manufacturers need a slightly bigger interposer, i.e., AMD may need to design and manufacture a different interposer to be able to use Samsung HBM2. Does anybody know the size of the Samsung stacks? I've seen the exact dimensions only for SKHynix.
 
Why? The chip is close to GP102 in size and the latter has a 480GB/s bandwidth. With two 1.6Gbps stacks the big Vega would get 410GB/s.
It's a 17% difference in memory bandwidth.


Because as it is right now, AMD's chips need about 30% more bandwidth than nV chips for the same or similar performance, and this is because nV's chips currently have better compression techniques. Since we haven't heard anything from AMD about this for Vega, I'm thinking it's not going to change much.
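To put a rough number on that claim, here is a quick back-of-the-envelope comparison using two cards that traded blows at the time, the RX 480 (256-bit GDDR5 at 8Gbps) and the GTX 1060 (192-bit at 8Gbps) - my own example, treating their performance as roughly equal:

    # Memory bandwidth of two roughly comparable 2016 cards
    def gddr5_bandwidth_gbs(bus_bits, gbps_per_pin):
        return bus_bits * gbps_per_pin / 8

    rx480   = gddr5_bandwidth_gbs(256, 8.0)  # 256 GB/s
    gtx1060 = gddr5_bandwidth_gbs(192, 8.0)  # 192 GB/s
    print(rx480 / gtx1060 - 1)               # ~0.33, i.e. roughly the "30% more" figure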
 
What Zaphod said.

And I always understood that AMD has secured "priority access" to HBM2 produced by SKHynix (i.e. they can block access by others by buying all of it) in exchange for the commitment to buy a significant share of the initial products. To my understanding that doesn't necessarily prohibit AMD from also buying HBM2 from Samsung, especially in case SKHynix can't deliver 2GT/s HBM2 and Samsung offers it. But I guess nobody here knows (and wants to share) the exact wording of the agreement.
A more fundamental problem with Samsung's HBM2 could come from implementation details like the physical dimensions of the HBM2 stacks, which are likely slightly different between Samsung and SKHynix (just the height of the stack is specified). It could be that the stacks of one of the manufacturers need a slightly bigger interposer, i.e., AMD may need to design and manufacture a different interposer to be able to use Samsung HBM2. Does anybody know the size of the Samsung stacks? I've seen the exact dimensions only for SKHynix.


Missed that post, ok. Yeah, but usually exclusivity isn't worded as just priority access. We know AMD has exclusivity for HBM1, which probably wasn't fulfilled, so it might also carry over to the rest of the HBM family. This kind of agreement has to protect both sides: if AMD wanted exclusivity, they would need to buy a certain amount of memory modules to cover the possibility of Hynix supplying the entire market with HBM.

If AMD is allowed to use Samsung, there would have to be some sort of "buy out" type clause with fairly strict cost ramifications, like what AMD has with GF for wafer allocations. And that would make an already expensive memory even more expensive.
 
Because as it is right now, AMD's chips need about 30% more bandwidth than nV chips for the same or similar performance, and this is because nV's chips currently have better compression techniques. Since we haven't heard anything from AMD about this for Vega, I'm thinking it's not going to change much.
It's not the compression technique that gives nV this advantage. The major reason is the tiling rasterizer and the ROPs connected to the L2, both of which are also featured in Vega. We don't know yet how efficient AMD's implementation will be compared to Pascal. But there is a certain chance it will work just as well (as there is also the chance it will work a bit worse or a bit better than nV's solution).
Missed that post, ok. Yeah, but usually exclusivity isn't worded as just priority access. We know AMD has exclusivity for HBM1, which probably wasn't fulfilled, so it might also carry over to the rest of the HBM family. This kind of agreement has to protect both sides: if AMD wanted exclusivity, they would need to buy a certain amount of memory modules to cover the possibility of Hynix supplying the entire market with HBM.

If AMD is allowed to use Samsung, there would have to be some sort of "buy out" type clause with fairly strict cost ramifications, like what AMD has with GF for wafer allocations. And that would make an already expensive memory even more expensive.
Do you know the exact wording of the agreement, or is it just chit-chat without anything concrete? For me it's the latter.
 
Because as it is right now, AMD's chips need about 30% more bandwidth than nV chips for the same or similar performance, and this is because nV's chips currently have better compression techniques. Since we haven't heard anything from AMD about this for Vega, I'm thinking it's not going to change much.
Oh, but we have heard about this. The bandwidth advantage of Nvidia isn't really due to actual "better compression" (it's possible theirs is a bit better, but that would be a tiny part of why it needs less bandwidth). Rather, it is safe to say it's the TBDR tricks + unified L2 (with ROPs) which make the most difference. Exactly what Vega is supposed to do too...
Not saying bandwidth efficiency is going to be the same (there aren't nearly enough details available yet), but the potential is certainly there to be in the same ballpark - it could really be better or worse...
 
It's not the compression technique that gives nV this advantage. The major reason is the tiling rasterizer and the ROPs connected to the L2, both of which are also featured in Vega. We don't know yet how efficient AMD's implementation will be compared to Pascal. But there is a certain chance it will work just as well (as there is also the chance it will work a bit worse or a bit better than nV's solution).
Do you know the exact wording of the agreement, or is it just chit-chat without anything concrete? For me it's the latter.


We don't know how much bandwidth the tiled rasterizer alone is saving them, but looking at Pascal over Maxwell, the compression technique is giving them a major increase in bandwidth efficiency.

I don't know the exact words, but I have worked with exclusivity clauses in another industry before, and yeah, whenever they are used, the company getting exclusivity has to protect the company granting it by covering the profits lost from not selling to others. Otherwise it won't hold up in a court of law, since there would be no need for the exclusivity.

Essentially, what exclusivity does in a competitive market is stop it from being competitive; it removes competition from a free market, so that market is no longer free. There needs to be something that covers that, and it's always money that covers it.

http://www.businessopportunity.com/Blog/better-understand-exclusivity-clause/

Just a tidbit on exclusivity: this is why any attorney worth their salt will tell you to only get into exclusivity with another company if you are 100% sure of the outcomes. There should be no guesswork involved in what you are getting and what you plan to do with the exclusivity; if there is, they will tell you right off the bat not to do it, and that's when an actuary will step in and advise the attorney on what the potentials are for both sides.
 
Imho it makes sense that they do not list faster HBM2, as this will only go to AMD and others who have already signed up.
 
It's not the compression technique that gives nV this advantage. The major reason is the tiling rasterizer and the ROPs connected to the L2, both of which are also featured in Vega. We don't know yet how efficient AMD's implementation will be compared to Pascal. But there is a certain chance it will work just as well (as there is also the chance it will work a bit worse or a bit better than nV's solution).

In one of the interviews with Scott Wasson concerning Vega, there was a comparison between the old way and new way, where the ROPs were not aware of copies of the data they were modifying residing in the L2.

It's a bit ambiguous as to what it means to be "aware" of data in the L2.
I believe Nvidia's scheme was characterized as using the L2 to host data for both ROPs and the SMs.
Being made aware of data in the L2 for Vega may mean that both paths put data into the shared cache, but it could be "aware" in the same way Onion+ makes coherent GPU memory "aware" by having GPU traffic invalidate CPU cache lines.

There's still a win by avoiding heavy synchronization and flushing, but it may be more incremental by leaving the ROPs generally operating as they have been, just invalidating L2 lines in a more unidirectional manner rather than operating in a shared cache.
It could be more integrated than that, although some of the questions that it would raise would be interesting to see answered.
 
In one of the interviews with Scott Wasson concerning Vega, there was a comparison between the old way and new way, where the ROPs were not aware of copies of the data they were modifying residing in the L2.

It's a bit ambiguous as to what it means to be "aware" of data in the L2.
I believe Nvidia's scheme was characterized as using the L2 to host data for both ROPs and the SMs.
Being made aware of data in the L2 for Vega may mean that both paths put data into the shared cache, but it could be "aware" in the same way Onion+ makes coherent GPU memory "aware" by having GPU traffic invalidate CPU cache lines.

There's still a win by avoiding heavy synchronization and flushing, but it may be more incremental by leaving the ROPs generally operating as they have been, just invalidating L2 lines in a more unidirectional manner rather than operating in a shared cache.
It could be more integrated than that, although some of the questions that it would raise would be interesting to see answered.
According to AMD's presentation, the ROPs will be a client of the L2 (and they lose the direct connection to the memory controller). That gives it much more space to work with than the relatively tiny ROP caches connected directly to the memory controller. Therefore, using more of the inherent spatial locality of primitives in a drawcall gets easier as framebuffer tiles have to be swapped to memory less often. On top of that, the binning rasterizer basically orders the rendering process to increase the spatial locality in a significant way to make this caching even more efficient. The bandwidth savings have two sources:
(i) a much larger effective cache size before hitting memory (according to the Vega presentation it will be two-level caching for the ROPs)
(ii) increased spatial locality of framebuffer accesses due to the tiling rasterizer increasing the effectiveness of the caches

Other potential advantages, such as a shorter turnaround time for reading a render target after writing to it (as one could directly read stuff still in the L2), come on top of it.
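As a toy illustration of point (ii) above - not a model of the actual hardware - here is a small script that counts how often framebuffer cache lines have to be fetched for the same set of pixel writes, once in scattered order and once grouped by screen tile. The constants (framebuffer size, line size, cache size, tile size) are made up for the example:

    # Toy LRU "ROP cache" in front of memory: count line fetches for two access orders.
    from collections import OrderedDict
    import random

    FB_W, FB_H, LINE_PIXELS, CACHE_LINES, TILE = 256, 256, 16, 64, 32

    def line_fetches(pixel_order):
        cache, misses = OrderedDict(), 0          # OrderedDict used as a simple LRU
        for x, y in pixel_order:
            line = (y * FB_W + x) // LINE_PIXELS  # cache line holding this pixel
            if line in cache:
                cache.move_to_end(line)           # hit: promote to most recently used
            else:
                misses += 1
                cache[line] = True
                if len(cache) > CACHE_LINES:
                    cache.popitem(last=False)     # evict the least recently used line
        return misses

    pixels = [(x, y) for y in range(FB_H) for x in range(FB_W)]
    scattered = pixels[:]
    random.shuffle(scattered)                                               # poor locality
    binned = sorted(scattered, key=lambda p: (p[1] // TILE, p[0] // TILE))  # grouped by tile
    print(line_fetches(scattered), line_fetches(binned))

With these numbers the tile-grouped order misses roughly 15x less often, which is the kind of effect a binning rasterizer plus a larger cache is after; the real savings obviously depend on the actual cache sizes, tile sizes and workload.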
 
According to AMD's presentation, the ROPs will be a client of the L2 (and they lose the direct connection to the memory controller).
That was my general assumption after the initial presentation.
It was in the interview (wish it were in text form) that I found the odd turn of phrase. It was all about the ROPs being aware of data in the L2, but the other direction went without comment. It could simply be a case of verbal imprecision in a time-constrained interview, although he spent a good amount of effort in being as precise as he could overall.

That gives it much more space to work with than the relatively tiny ROP caches connected directly to the memory controller.
However, the way the ROPs work with memory and compression doesn't align with how the L2 has traditionally worked. Dropping the read/write path for the ROP caches into the L2 would require addressing some of the mismatched requirements ROPs and CUs have for memory and compression handling.

Other potential advantages, such as a shorter turnaround time for reading a render target after writing to it (as one could directly read stuff still in the L2), come on top of it.
That's an area where it would be interesting to see how it was handled. AMD's CU L1 coherence is maintained at wavefront boundaries, unlike the kernel-level coherence of Nvidia's L1, which would make sharing the L2 a more intensive exercise. ROP tiles can vary in size and timing, and items like the traditionally separate compression pipeline are unclear in how they are positioned in this new scheme.

There are implications for communicating through the L2 and compression methods for any implementation choice. The best ROP compression ratios cover bigger tiles and do not fit the CUs. Per AMD, DCC works better with current GPUs when the driver can determine that CUs will not be reading a render target.
 
However, the way the ROPs work with memory and compression doesn't align with how the L2 has traditionally worked. Dropping the read/write path for the ROP caches into the L2 would require addressing some of the mismatched requirements ROPs and CUs have for memory and compression handling.
I don't know. Looks like they are still keeping the tiny caches within the RBEs (which handle compression and decompression) as L1 and just back them with the L2. That means the L2 would only be exposed to the compressed render target tiles (probably limited in size to multiples of the cache line size) and it doesn't need to know anything about it in detail (it just caches cachelines mapping to certain memory addresses). To avoid too much thrashing of the L2, one could think of partitioning the L2. One could restrict the RBEs to use only a subset of the L2 (for instance just a quarter or half of the ways of the set-associative L2). This should work pretty well (as a render target is usually contiguous).
There are implications for communicating through the L2 and compression methods for any implementation choice. The best ROP compression ratios cover bigger tiles and do not fit the CUs. Per AMD, DCC works better with current GPUs when the driver can determine that CUs will not be reading a render target.
That's true if a shader access to the render target necessitates a decompression pass. If the TMUs can directly read compressed render targets, I don't see a problem.
AMD's problem right now is also that larger tiles allow for better compression but also cause larger overhead for swapping them in and out of the ROP caches. That means that larger tiles (with better compression on average) only become feasible with larger caches, which would kinda be the case with an L2 backing the ROPs.
 
I don't know. Looks like they are still keeping the tiny caches within the RBEs (which handle compression and decompression) as L1 and just back them with the L2. That means the L2 would only be exposed to the compressed render target tiles (probably limited in size to multiples of the cache line size) and it doesn't need to know anything about it in detail (it just caches cachelines mapping to certain memory addresses).
If the idea is that CUs can read that target back in the L2, it introduces a dynamically variable relationship between how many lines in the L2 belong to a tile, and the value of their contents.
The DCC pipeline is described as internally maintaining a secondary cache of compression metadata, which is now an unknown as far as whether it is cached.
Getting a consistent view across multiple locations when there's a hierarchical relationship involved seems like a complex undertaking.
I don't know if that would make the compression path an arbiter of access to compressed targets, since it would know how many lines are relevant and what values it generated. There are many more clients to service than back when it was situated between a statically mapped RBE and its associated channel.

That's true if a shader access to the render target necessitates a decompression pass. If the TMUs can directly read compressed render targets, I don't see a problem.
The TMUs already can read DCC targets, at the cost of losing compression efficiency versus targets defined as being off-limits.

"Shader-readable targets are not as well compressed as when it is known that the shader will not read them."
http://gpuopen.com/dcc-overview/

One possible scenario where this may happen is from the Polaris whitepaper, where 256-byte blocks can compress at an 8:1 ratio, which means a block can compress down to less than the width of a non-ROP cache line.

"A single pixel in each block is written using a normal representation and all other pixels in the block are encoded as a difference from the first value. The block size is dynamically chosen based on access patterns and the data patterns to maximize the benefits. The peak compression ratio is 8:1 for a 256-byte block."
http://radeon.wpengine.netdna-cdn.c...is-Architecture-Whitepaper-Final-08042016.pdf

I'm not sure if it's coincidental that it matches GDDR5's prefetch length of 8 and transfer size of 32B.
This may make sense since it's currently more about saving DRAM accesses, and any ratios that wind up creating an additional partial burst save nothing in that regard.
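As a toy illustration of the "one anchor pixel plus deltas" idea from that whitepaper quote (not AMD's actual encoding, which isn't documented): take a 256-byte block of 64 RGBA8 pixels, store the first pixel in full and every other pixel as a small per-channel difference, and round the result up to whole 32-byte bursts since nothing is saved below that granularity. A perfectly flat block then lands exactly on the 8:1 peak ratio:

    # Toy "anchor + deltas" compression of one 256-byte block (64 RGBA8 pixels).
    def compress_block(pixels, burst_bytes=32):
        assert len(pixels) == 64
        anchor = pixels[0]
        # bits needed for the largest per-channel difference (sign bit ignored in this toy)
        max_delta = max(abs(c - a) for px in pixels for c, a in zip(px, anchor))
        delta_bits = max_delta.bit_length()
        payload_bits = 4 * 8 + 63 * 4 * delta_bits     # full anchor + deltas for the rest
        payload_bytes = (payload_bits + 7) // 8
        bursts = -(-payload_bytes // burst_bytes)      # round up to whole memory bursts
        return bursts * burst_bytes

    uniform = [(100, 120, 140, 255)] * 64              # flat colour: all deltas are zero
    noisy   = [(i, i, i, 255) for i in range(64)]      # gradient: deltas need 6 bits each
    print(compress_block(uniform))                     # 32 bytes -> 8:1 for the 256-byte block
    print(compress_block(noisy))                       # 224 bytes -> much less compression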
 
If the idea is that CUs can read that target back in the L2, it introduces a dynamically variable relationship between how many lines in the L2 belong to a tile, and the value of their contents.
The DCC pipeline is described as internally maintaining a secondary cache of compression metadata, which is now an unknown as far as whether it is cached.
Getting a consistent view across multiple locations when there's a hierarchical relationship involved seems like a complex undertaking.
I don't know if that would make the compression path an arbiter of access to compressed targets, since it would know how many lines are relevant and what values it generated. There are many more clients to service than back when it was situated between a statically mapped RBE and its associated channel.

...

The TMUs already can read DCC targets, at the cost of losing compression efficiency versus targets defined as being off-limits.
As you said, the TMUs (GCN3/4) already can read (sample) variable width DCC targets, meaning that the DCC metadata needs to be loaded to L1 and L2 caches. I imagine Vega ROPs (tiny L1 ROP caches) and TMUs (CU L1 caches) load the DCC metadata from the shared L2 cache. I would assume that you need to flush both the ROP L1 caches and the CU L1 caches when transitioning a render target to readable. But since these caches are inclusive (L2 also has the L1 lines), this flush never causes any memory traffic. This should be very fast compared to the current ROP cache + L2 cache flush.

But Rasterizer Ordered Views (ROV) need tighter synchronization. Current GCN L1 caches between CUs are not coherent. The driver needs to disable the L1 cache for UAV writes when "globallycoherent" is used (multiple CUs reading & writing the same memory). GCN also writes atomics directly to L2 (automatically seen by all CUs). When ROV is enabled, it needs to behave similarly to "globallycoherent": ROP writes need to go directly to the L2 cache. Also, I wouldn't be surprised if Nvidia and Intel disabled DCC when ROV is enabled. AMD could simply do the same.
 
To avoid too much thrashing of the L2, one could think of partitioning the L2. One could restrict the RBEs to use only a subset of the L2 (for instance just a quarter or half of the ways of the set-associative L2). This should work pretty well (as a render target is usually contiguous).

Alternatively, assuming some sort of pseudo-LRU eviction policy in the L2, insert ROP data not as most recently used, but with an "age". If the ROP data sees reuse, it will be promoted to most recently used and stay in the L2; if not, it will be evicted quickly.
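Something like this, as a toy sketch of a single set of a set-associative cache (not the actual Vega L2, and the way counts are made up): regular fills go in as most recently used, ROP fills go in "aged" a few slots above the LRU end and only get promoted if they are actually re-read. The way-partitioning idea quoted above would instead restrict which ways ROP fills may occupy.

    # Toy cache set with an aged insertion position for ROP fills. Index 0 is the LRU end.
    class CacheSet:
        def __init__(self, ways=16, rop_insert_pos=4):
            self.lines = []
            self.ways = ways
            self.rop_insert_pos = rop_insert_pos   # ROP fills start here, close to eviction

        def access(self, tag, is_rop_fill=False):
            if tag in self.lines:                  # hit: promote to MRU, whoever filled it
                self.lines.remove(tag)
                self.lines.append(tag)
                return "hit"
            if len(self.lines) >= self.ways:
                self.lines.pop(0)                  # evict the least recently used line
            if is_rop_fill:                        # aged insert: evicted soon unless reused
                self.lines.insert(min(self.rop_insert_pos, len(self.lines)), tag)
            else:
                self.lines.append(tag)             # normal fill: insert as MRU
            return "miss"

    s = CacheSet()
    s.access("rop_tile_0", is_rop_fill=True)       # comes in aged
    s.access("rop_tile_0")                         # reuse promotes it, so it sticks around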

Cheers
 