AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Not even close.
[attached image]
Link to the tests?

https://www.spec.org/gwpg/gpc.data/vp12.1/summary.html
There seem to be two kinds of P6000 results (Filter): two are like the one you have there, and two are lower, quite close to Vega. The same goes for the P5000.
 
Is Pascal more bandwidth efficient than Polaris? Sure, no question about it.
Does this mean that Pascal's color compression is "better"? Not necessarily (it's even difficult to define what "better" means)...
Lul...

If AMD's Polaris had implemented the exact same compression scheme as Pascal, it would likely be slower and consume more bandwidth than it does now.
*cough* Bullshit. *cough*

which makes it difficult to impossible to compare.
Oh, I haven't had any problem comparing the two... YMMV.
I don't think the issues raised can be so readily dismissed.

References on Nvidia:
http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/8
Delta compression is by its nature less efficient than whole color compression, topping out at just 2:1 compared to 8:1 for the latter.
...
The impact of 3rd generation delta color compression is enough to reduce NVIDIA’s bandwidth requirements by 25% over Kepler, and again this comes just from having more delta patterns to choose from. In fact color compression is so important that NVIDIA will actually spend multiple cycles trying different compression ratios, simply because the memory bandwidth is more important than the computational time.

http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/8
Meanwhile, new to delta color compression with Pascal is 4:1 and 8:1 compression modes, joining the aforementioned 2:1 mode. Unlike 2:1 mode, the higher compression modes are a little less straightforward, as there’s a bit more involved than simply the pattern of the pixels. 4:1 compression is in essence a special case of 2:1 compression, where NVIDIA can achieve better compression when the deltas between pixels are very small, allowing those differences to be described in fewer bits. 8:1 is more radical still; rather than operating on individual pixels, it operates on multiple 2x2 blocks. Specifically, after NVIDIA’s constant color compressor does its job – finding 2x2 blocks of identical pixels and compressing them to a single sample – the 8:1 delta mode then applies 2:1 delta compression to the already compressed blocks, achieving the titular 8:1 effective compression ratio.
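Just to spell out the arithmetic in that 8:1 description (the pixel size is my own assumption, purely for illustration):

```python
# Toy arithmetic for the 8:1 mode described above: a 2x2 block of identical
# pixels collapses to a single sample, and 2:1 delta compression is then
# applied to the already-compressed block. Assumes 4-byte (RGBA8) pixels.
raw_block = 2 * 2 * 4             # 16 bytes for an uncompressed 2x2 block
after_constant = raw_block / 4    # 4 bytes once four identical pixels become one sample
after_delta = after_constant / 2  # 2 bytes after 2:1 delta on the compressed block
print(raw_block / after_delta)    # 8.0 -> the titular 8:1 effective ratio
```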

For AMD:
http://techreport.com/review/30328/amd-radeon-rx-480-graphics-card-reviewed/2
In addition to a larger L2 cache that allows more data to remain on the chip, Polaris has improved delta color compression (or DCC) capabilities that allow it to compress color data at 2:1, 4:1, or 8:1 ratios.

AMD's discussion of DCC: http://gpuopen.com/dcc-overview/

Speculative reference to one way AMD could be doing this hierarchically:
https://google.com/patents/US8810562?cl=nl
An iteratively applied hierarchical scheme where compression is affected by the absolute value of the deltas in a given tiled area, but not necessarily pattern-matched.


Comparing compression methods can involve multiple factors, such as the highest ratios they can muster, how frequently they can hit various ratios, and possibly what complications they may inject into the design or usage.

Per the references, Nvidia's compression started out with a potentially lower max ratio of 2:1, but used a pattern-matching method that expends multiple cycles on data held in the L2 before writeback.
Pascal moved to higher ratios like 4:1 and 8:1, with a hierarchical application of compression on tiles of compressed data.
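Neither vendor documents the exact selection logic, but the "spend multiple cycles trying different ratios" idea can be sketched like this (the single-channel pixels, the block size, and the delta test are my own simplifications, not Nvidia's actual method):

```python
def bits_for_deltas(block):
    """Smallest signed bit-width that covers every pixel's delta against the
    block's first pixel (a stand-in for the real pattern matching)."""
    anchor = block[0]
    worst = max(abs(p - anchor) for p in block)
    bits = 1
    while worst >= (1 << (bits - 1)):
        bits += 1
    return bits

def pick_mode(block, raw_bits_per_pixel=32):
    """Try the highest ratio first and fall back: roughly 'attempt 8:1,
    then 4:1, then 2:1, else write the block uncompressed'."""
    for ratio in (8, 4, 2):
        budget = raw_bits_per_pixel // ratio   # bits allowed per pixel at this ratio
        if bits_for_deltas(block) <= budget:
            return ratio
    return 1                                   # uncompressed fallback

print(pick_mode([100, 101, 102, 103]))  # tiny deltas -> 8
print(pick_mode([0, 200, 13, 255]))     # large deltas -> only 2
```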

Per the Techreport blurb, Polaris at least has the 4:1 and 8:1 ratios. What exactly it is doing to get those ratios is uncertain, but there is a patent describing a loop that takes hierarchical deltas of deltas.
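To make that "deltas of deltas" loop a little more concrete, here is a minimal sketch of iteratively re-differencing a tile until the residuals stop shrinking; the actual hardware is surely far more constrained, so treat it purely as an illustration:

```python
def delta_pass(values):
    """One pass: keep the first value, replace the rest with differences
    to their predecessor."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def iterate_deltas(tile, max_passes=3):
    """Re-apply the delta pass while the residuals keep shrinking in
    absolute value, loosely mirroring an iterative hierarchical scheme."""
    current = tile
    for _ in range(max_passes):
        nxt = delta_pass(current)
        if max(map(abs, nxt[1:])) >= max(map(abs, current[1:])):
            break                   # deltas stopped getting smaller
        current = nxt
    return current

# A linear ramp collapses to one constant delta after the first pass
# and to near-zero residuals after the second.
print(iterate_deltas([10, 20, 30, 40, 50]))  # -> [10, 0, 0, 0, 0]
```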
We do know from AMD's DCC description and from its patches concerning metadata that there is a separate and thrashable compression cache, which can actually worsen performance if it is forced to spill enough. This path is also non-coherent and requires flushing for read after write.
It is not clear to what extent Nvidia's compression matches AMD's restrictions.

We do not have visibility into what the compressors' output actually is, so we do not know how often they compress data or how well they do. Even if a compressor can reach a higher average ratio in a no-spill case, if it is doing something like spilling halfway through, then there is a constraint on its effectiveness that is not pattern-related.
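As a toy illustration of that last point, with every number invented: a compressor that hits 4:1 on the tiles it keeps loses a lot of that on average if some fraction of tiles get evicted and written out uncompressed partway through.

```python
def effective_ratio(ideal_ratio, spill_fraction):
    """Average ratio actually achieved when `spill_fraction` of tiles get
    written out uncompressed (1:1) because they were evicted mid-pass.
    Weighted by bytes written, not by tile count."""
    compressed_bytes = (1 - spill_fraction) / ideal_ratio
    spilled_bytes = spill_fraction
    return 1.0 / (compressed_bytes + spilled_bytes)

for spill in (0.0, 0.1, 0.25, 0.5):
    print(f"{spill:>4.0%} spilled -> effective {effective_ratio(4, spill):.2f}:1")
# 0% gives 4.00:1, but 25% of tiles spilling already drags it down to ~2.29:1.
```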

Synthetics that try to measure bandwidth ratios can see what the memory hierarchy spits out, but that could be obfuscated by multiple spills of highly compressed data, special cases (1, 0.0, etc.), additional flushes or format restrictions based on whether there are RAW hazards, or possibly inflated outputs due to fastpaths added by tiling unrelated to the compressor.

AMD's method puts the compressor in the path where its thrash-prone ROP caches spill out to a DRAM write. There is much less space or time for trying multiple patterns than there is for a tile that can sit in the L2 rather than obstructing the DRAM path. A more hierarchical approach would seem to fit this placement, but it might have more cases where it falls through.
The threshold where the cost of multiple attempts exceeds the savings, because too many deltas have to be recalculated, may also be crossed more readily.
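A crude way to frame that break-even, with made-up numbers: extra cycles spent on further compression attempts only pay off while the DRAM time they save exceeds the compressor time they cost.

```python
def extra_attempt_pays_off(bytes_saved, cycles_spent, dram_bytes_per_cycle=32):
    """True if the DRAM time saved by a further compression attempt exceeds
    the cycles it costs. `dram_bytes_per_cycle` is an invented stand-in for
    how much the memory interface moves per cycle."""
    return bytes_saved / dram_bytes_per_cycle > cycles_spent

# Parked in the L2, 4 extra cycles to shave 256 bytes is a clear win;
# on a latency-critical spill path, the same attempt may not be.
print(extra_attempt_pays_off(bytes_saved=256, cycles_spent=4))  # True
print(extra_attempt_pays_off(bytes_saved=64, cycles_spent=4))   # False
```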
 
There seem to be two kinds of P6000 results (Filter): two are like the one you have there, and two are lower, quite close to Vega. The same goes for the P5000.
The lower P6000 results are from CPUs clocked @2.2GHz. Are you really going to take those and compare them to a Vega with a CPU running @4GHz?
These are P6000/P5000 on a 3.6GHz CPU
https://www.spec.org/gwpg/gpc.data/...ults_20170210T1401_r1_HIGHEST/resultHTML.html
https://www.spec.org/gwpg/gpc.data/...ults_20170207T1002_r1_HIGHEST/resultHTML.html
 
3dilettante said:
The threshold where the cost of multiple attempts exceeds the savings, because too many deltas have to be recalculated, may also be crossed more readily.
I don't dispute that such conditions are possible. I do have a problem with asserting that they are "likely", which is the basis of the entire argument. I mean, the guy apparently can't even see the irony of contesting an unfounded claim (which I did not make, btw) with another, even more unfounded claim. He is essentially arguing with himself, so I have no idea why I need to be involved in the conversation at this point at all...

For my part, I am reasonably satisfied with the conclusions I have drawn, and while the topic is indeed interesting, I have not witnessed any compelling evidence to revise my beliefs. I also do not anticipate any such evidence to be forthcoming... If some should arise, that would be great, until then....
 
Why the strawman? I do not know why you have turned hostile toward me, but my perspective remains the same as it ever has.

Again, I never said otherwise, so why the strawman? I don't get it.
Obviously. It's not a strawman. I explained why your claim regarding the color compression can't be made by looking just at the overall bandwidth efficiency. You somewhat agree, and at the same time you called that explanation bullshit (I was not hostile, btw).
As for what I am interested in
You stated quite clearly that you are interested in the overall bandwidth efficiency; I got that. But you took offense at my remark that one can't readily judge the quality of the color compression from the overall bandwidth efficiency (as you did!), especially when there are pretty significant architectural differences between the compared GPUs. You could have just accepted that, or you could have engaged in a meaningful discussion to explore the nitty-gritty details or to refute my remark. You did neither.

But no hard feelings! Let's return to the topic.
 
Gipsel said:
I explained why your claim regarding the color compression can't be made by looking just at the overall bandwidth efficiency.
No, you misinterpreted my claim (strawman), and continue to do so, but I digress....
 
Atomics exist that could bypass those lines. If paging registers, it makes sense; for the framebuffer, less so, as there won't be gaps. So yeah, it won't work everywhere, but it's significant. If they did some variable SIMD design, those cache sizes could change as well.
GCN's L2 atomics have hardware that is tied to cache lines, and the DRAM devices themselves work at a burst granularity.
GCN's compute focus shifted its cache lines to better match those of the CPU and coherent domains, so one cannot simultaneously claim that compute is the future and ask for random special-purpose alignments; you cannot have it both ways.
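For a sense of the granularity involved (interface width, burst length, and line size are assumed typical values, not figures for any particular board):

```python
# Rough granularity arithmetic; widths and burst length are assumed
# GDDR5-style values, not figures for any specific card.
channel_width_bits = 32
burst_length = 8
bytes_per_burst = channel_width_bits // 8 * burst_length  # 32 bytes minimum per access
cache_line_bytes = 64                                      # typical GCN line size

print(bytes_per_burst)                      # 32
print(cache_line_bytes // bytes_per_burst)  # 2 bursts per cache line
# A 4-byte atomic still drags a whole line / burst through the hierarchy,
# so "bypassing those lines" doesn't reduce the DRAM-side traffic.
```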

So by Pascal being replaced "soon" you mean "6-8 months from now"?
For Vega, the relevant competition is the upper range of Pascal, which is starting to be phased out at the highest end right now. Vega isn't fully out yet, and the announcement/hype process for the next chips can start earlier than full deployment.

For me, 6 months isn't that long, but that is admittedly subjective. My perception may be colored by how long the generational time frames have become.
 
The lower P6000 results are from CPUs clocked @2.2GHz. Are you really going to take those and compare them to a Vega with a CPU running @4GHz?
These are P6000/P5000 on a 3.6GHz CPU
https://www.spec.org/gwpg/gpc.data/...ults_20170210T1401_r1_HIGHEST/resultHTML.html
https://www.spec.org/gwpg/gpc.data/...ults_20170207T1002_r1_HIGHEST/resultHTML.html

Your link says "Highest result obtained on a Dell Precision Tower 5810" (which will be replaced/updated with the Vega version). It seems Dell has contracted a lot around Vega, including exclusivity on Ryzen Threadripper for some months.

That said, I would be happy if they ran their CPU at 4.6GHz... Wow, a stock Xeon with 128GB DDR4 running at 4.6GHz, matching the 7900X overclocked on air, and on all cores...
 
Lots of people are complaining that the guy is testing Vega FE on a PC with a 550w PSU and therefore everything he does is invalid.
Why? He is even forcing the clocks to stay @1600MHz all the time, so the card is operating at full power already.
 
I don't dispute that such conditions are possible. I do have a problem with asserting that they are "likely", which is the basis of the entire argument.
The assumed pattern for ROP behavior is that tiles of pixels are imported in a pipelined fashion to the ROP caches, modified, and then moved back out in favor of the lines that belong to the next export.
There are some data points in favor of this.
One is that the highest utilization of DRAM bandwidth comes from the ROPs, per various descriptions of the Xbox One's ESRAM bandwidth usage and tests of GPU fill rate. DRAM utilization would actually fall if there were periods where it was left unused.
The tight coupling of ROPs to channels is consistent with their need to heavily use DRAM.
There are also games that tile things like particle rendering to the small size of ROP caches to reach high performance, with significant performance drop-off if the workload doesn't completely fit the caches.

I think that's consistent with caches that are expected to miss regularly.
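To put a rough number on why games tile to the ROP caches (the cache size and pixel formats here are assumptions, purely for illustration):

```python
def tile_fits(tile_w, tile_h, bytes_per_pixel, rop_cache_bytes=16 * 1024):
    """Does one render tile's colour data fit in an assumed per-partition
    ROP cache? The 16 KiB figure is invented purely for illustration."""
    return tile_w * tile_h * bytes_per_pixel <= rop_cache_bytes

print(tile_fits(64, 64, 4))  # RGBA8: 16384 bytes -> True, stays resident
print(tile_fits(64, 64, 8))  # FP16 HDR: 32768 bytes -> False, spills and thrashes
```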
 
Why? He is even forcing the clocks to stay @1600MHz all the time, so the card is operating at full power already.
I don't think his results are invalid. I just thought it was worth mentioning since many people (mostly AMD fans) refuse to accept the guy's results.
 
In case you don't recall it anymore, this is what I was answering:
You did make that claim. I just argued that (and why) it is hard to judge from the overall bandwidth efficiency alone. Make of that whatever you want. I made my point.

I think the simple fact is that Gipsel is right and nineliven is wrong. When did NV really get their big perf-per-watt advantage: was it when they released their first product with colour compression (Fermi, as far as I'm aware), or when they released their first product with the ROP caches backed by the L2 (Maxwell)?

Unless colour compression just sucked in Fermi/Kepler and then magically became super awesome in Maxwell because reasons.
 
When did NV really get their big perf-per-watt advantage: was it when they released their first product with colour compression (Fermi, as far as I'm aware), or when they released their first product with the ROP caches backed by the L2 (Maxwell)?

Unless colour compression just sucked in Fermi/Kepler and then magically became super awesome in Maxwell because reasons.
I think Kepler did okay, which compression may have helped with. Fermi had some other things going on, which just goes to show that the analysis is complicated.
Changing to the more heavily tiled rasterization method may have also injected some other optimizations. Some of the discussions around Realworldtech's article about its tiling methods intimated that Nvidia might be getting higher than theoretical ROP throughput in some tests because it might be using other write paths to the tile data or coalescing ROP work. Bandwidth tests would not be able to catch that.
 
Performance per Watt has even more confounding factors than bandwidth efficiency. ;)

I find it somewhat funny that the only group who really cares about perf/watt is the miners, because it determines their profits, and they mostly use AMD GPUs...
 
Gipsel said:
I just argued that (and why) it is hard to judge from the overall bandwidth efficiency alone.
Did you ever stop to think that maybe... just maybe I had already taken into account everything you have written here before I made my original post? I'll guess no.

Gipsel said:
I made my point.
Congrats on winning an argument with yourself I guess?

itsmydamnation said:
Unless colour compression just sucked in fermi/kepler then magically became super awesome in maxwell because reasons.
I don't recall giving a value for the amount color compression contributes to the overall bandwidth savings. I simply said Nvidia's was better than AMD's and gave a greatest lower bound for the total savings (from all sources) contribution to perf/watt of 25%. Thus the range for what I said is 0 < X < 25 where X is color compression. Of fucking course, I don't think that color compression is responsible for all of that.... but when has interpreting other people's words accurately ever been an issue here before... :sleep:
 