AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Story so far:

- "You can't compare these cards to Quadro! I don't care that AMD themselves categorizes this card as part of Pro line and boasts "professional, but not certified drivers". It's not a "Pro card"!
- "OK then, if we look at some gaming benchmarks..."
- "OMG, what kind of idiot would expect a PRO card to have decent gaming drivers? "

You know how sometimes the studios refuse to screen a movie for the critics...?

- Certified drivers from Autodesk, Dassault (SolidWorks), etc. take time... I've never seen a pro card get them at launch, before the tests have been conducted intensively. In general, we allow our customers to use non-certified drivers at their own risk before we give the green light on selected drivers. (Meaning that, in general, only a few certified drivers get the green light in a year.) (By "green light", I mean that after conducting all tests internally, we certify that this driver is stable and recommended for use with our software.) Certified drivers exist more to protect the companies who supply the software than to satisfy consumers... They are not WHQL certifications from MS.
 
AMD does have color compression and I'd expect diminishing returns on improving that feature.
And yet, Nvidia still thought it useful to improve compression on Pascal. Diminishing returns or not, when consumers and review websites declare a GPU a win or a loss based on single-digit percentages, you probably can't afford to leave that kind of improvement on the table.

Especially if AMD foresees the market moving towards compute where it isn't used. Compression there is a programmer implementation.
That's the "forward looking architecture" argument. The one that makes you win benchmarks a year after the damaging reviews have been published.

Prefetch is consecutive data words. Most of that compression will be from working around masked lanes. Graphics bandwidth comes from this, as SIMDs usually read consecutive data; zero compression is a significant component of overall compression. From there it's a matter of smaller block sizes mapping better to sparse data. A prefetch of one plus zero compression would be pointless, as you'd just skip it. Then add in the additional channels on separate tasks to make up the bandwidth.
Now I'm even more lost. Compression is about being able to reduce a transaction of a certain number of 32-byte blocks (say, 8) into a smaller number of 32-byte blocks.

The prefetch length (number of cycles) is completely irrelevant if the prefetch size (32 bytes) is identical.
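To put toy numbers on that: with a fixed 32-byte access granule, the only thing compression can change is how many granules get transferred; the prefetch length behind the granule never enters into it. A minimal sketch, with made-up compression ratios:

Code:
#include <stdio.h>

int main(void) {
    const int granule = 32;              /* bytes per memory access */
    const int tile_bytes = 8 * granule;  /* e.g. 8 granules of raw pixel data */

    /* Hypothetical compression ratios, for illustration only. */
    const double ratios[] = { 1.0, 2.0, 4.0, 8.0 };

    for (unsigned i = 0; i < sizeof ratios / sizeof ratios[0]; ++i) {
        /* The compressed size still gets rounded up to whole granules. */
        int granules = (int)((tile_bytes / ratios[i] + granule - 1) / granule);
        printf("%4.1f:1 compression -> %d of %d granules transferred\n",
               ratios[i], granules, tile_bytes / granule);
    }
    return 0;
}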
 
Compression there is a programmer implementation.
Humm, no. We are not talking about texture block compression (which just means running a script to prepare the application assets...) and which is usually lossy. We are talking about frame colour compression, which is lossless and affects the GPU's resources in several ways to optimize bandwidth. It is hardware-related only, since it affects cache read optimizations, which means the programmer cannot directly invoke it as if it were an API.
You can read more about AMD delta colour compression here: http://gpuopen.com/dcc-overview/
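For a feel of the idea only (this is not AMD's actual DCC format; the tile size, bit packing and fallback rule are all made up), a toy lossless delta encoder might look like this:

Code:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TILE_PIXELS 64  /* assume an 8x8 tile of 32-bit pixels */

/* Returns the "compressed" size in bytes, or the raw size if the deltas are
 * too large to pay off (the lossless fallback). */
static size_t compress_tile(const uint32_t px[TILE_PIXELS]) {
    uint32_t anchor = px[0];
    uint32_t max_delta = 0;
    for (int i = 1; i < TILE_PIXELS; ++i) {
        uint32_t d = px[i] > anchor ? px[i] - anchor : anchor - px[i];
        if (d > max_delta) max_delta = d;
    }

    /* Bits needed per delta (toy: treats the pixel as one integer; real
     * hardware would work per colour channel). */
    int bits = 0;
    while (bits < 32 && (max_delta >> bits) != 0) ++bits;

    size_t compressed = 4 /* anchor */ + (size_t)((TILE_PIXELS - 1) * bits + 7) / 8;
    size_t raw = TILE_PIXELS * 4;
    return compressed < raw ? compressed : raw;  /* never expand: stays lossless */
}

int main(void) {
    uint32_t smooth[TILE_PIXELS], noisy[TILE_PIXELS];
    for (int i = 0; i < TILE_PIXELS; ++i) {
        smooth[i] = 0x20u + (uint32_t)(i & 3);   /* near-constant colour */
        noisy[i]  = (uint32_t)rand();            /* poorly compressible */
    }
    printf("smooth tile: %zu of %d bytes\n", compress_tile(smooth), TILE_PIXELS * 4);
    printf("noisy tile:  %zu of %d bytes\n", compress_tile(noisy), TILE_PIXELS * 4);
    return 0;
}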
 
That's the "forward looking architecture" argument. The one that makes you win benchmarks a year after the damaging reviews have been published.
Fine Wine technology. Given the feature support it seems an accurate description of Vega here.

Now I'm even more lost. Compression is about being able to reduce a transaction of a certain number of 32-byte blocks (say, 8) into a smaller number of 32-byte blocks.
Agreed, but that works backwards as well. Mask off all but one lane, going scalar, then compress the result into too large of a block, block size being the prefetch window for maximum bandwidth. A smaller inherent block size allows better compression, or more appropriately, wastes less space. At a 16n prefetch with GDDR5X, any sparse waves are going to be rough. So even without true compression, you avoid transferring lots of zeroes.
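Rough numbers for that padding cost (my reading of the argument, not a statement about real GCN transaction sizes): a single live lane producing 4 bytes still drags a whole block across the bus, so the block size sets the waste.

Code:
#include <stdio.h>

int main(void) {
    const int elem = 4;                       /* bytes the one live lane produces */
    const int block_sizes[] = { 16, 32, 64 }; /* hypothetical block sizes */

    for (unsigned b = 0; b < sizeof block_sizes / sizeof block_sizes[0]; ++b) {
        int block = block_sizes[b];
        printf("%2d-byte block for a %d-byte scalar store: %2d bytes moved, "
               "%.1f%% of the transfer is padding\n",
               block, elem, block, 100.0 * (block - elem) / block);
    }
    return 0;
}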

Humm, no. We are not talking about texture block compression (which just means running a script to prepare the application assets...) and which is usually lossy. We are talking about frame colour compression, which is lossless and affects the GPU's resources in several ways to optimize bandwidth. It is hardware-related only, since it affects cache read optimizations, which means the programmer cannot directly invoke it as if it were an API.
You can read more about AMD delta colour compression here: http://gpuopen.com/dcc-overview/
I'm saying that in a compute shader, which lacks the concept of colors, textures, etc., some sort of reduction or programmer implementation would do the compressing, the compression being a programmable feature. I get the lossless part, which is why I limited my example to zero compression. DCC I excluded from my example, but it would stack with some form of zero compression given stencils. Maybe checkerboard? It really depends on what operations are being performed.
 
Agreed, but that works backwards as well. Mask off all but one lane, going scalar, then compress the result into too large of a block, block size being the prefetch window for maximum bandwidth. A smaller inherent block size allows better compression, or more appropriately, wastes less space. At a 16n prefetch with GDDR5X, any sparse waves are going to be rough. So even without true compression, you avoid transferring lots of zeroes.
The CU and L2 cache lines are 64 bytes. Compression or masking will not change that this is the minimum they can operate with.
Some of the other caches like ROP caches are less clear, although even then I think it's at worst 32 bytes.
DRAM burst length doesn't have any reason to change, given what it interfaces with. Prefetch is what the DRAM must send in a burst, or what will become dead cycles in a chopped burst (for the standards that allow the option).
 
I don't think I ever indicated otherwise? In my post from August of last year I believe I described it as a necessary but not sufficient condition...
I think you misunderstood the point I was trying to make. You claimed AMD's color compression is not as good as nVidia's. I claim you simply can't judge that in isolation.
Is Pascal more bandwidth efficient than Polaris? Sure, no question about it.
Does this mean that the Pascal color compression is "better"? Not necessarily (it's even difficult to define what "better" means). If AMD's Polaris had implemented the exact same compression scheme as Pascal, it would likely be slower and would consume more bandwidth than it does now. Does this mean the color compression in Polaris is better than in Pascal? No, that wouldn't be proof of that either. It just means that the color compression doesn't work in a vacuum and needs to be tailored to the specifics of the architecture, especially how the rasterization and ROP/framebuffer caching work (specifically the combination of the sizes of framebuffer tiles, compression blocks, and caches). And even having a color compression scheme with a better (or at least as good) compression ratio for all cases doesn't mean the architecture comes out on top in the bandwidth efficiency department. On top of that, different architectures use different sizes for their compression blocks and ROP tiles (to accommodate their different raster/ROP/cache designs), which makes it difficult to impossible to compare.
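A toy model of that trade-off, with entirely made-up block sizes, compression ratios and locality figures, just to show why the "better" compression block size depends on the raster/cache behaviour around it:

Code:
#include <stdio.h>

int main(void) {
    /* Hypothetical block sizes and compression ratios, illustration only:
     * assume the larger block compresses a bit better on average. */
    const int    block[] = { 128, 256 };
    const double ratio[] = { 1.6, 2.0 };

    /* Bytes of a block actually consumed before it leaves the cache:
     * high for a tiling rasterizer with big caches, low otherwise. */
    const int useful_bytes[] = { 64, 256 };

    for (unsigned u = 0; u < 2; ++u) {
        printf("locality: ~%d useful bytes per block touch\n", useful_bytes[u]);
        for (unsigned b = 0; b < 2; ++b) {
            int consumed = useful_bytes[u] < block[b] ? useful_bytes[u] : block[b];
            double transferred = block[b] / ratio[b];   /* bytes moved per touch */
            double cost = transferred / consumed;       /* bytes moved per useful byte */
            printf("  %3dB blocks @ %.1f:1 -> %.2f bytes moved per useful byte\n",
                   block[b], ratio[b], cost);
        }
    }
    return 0;
}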
 
Fine Wine technology. Given the feature support it seems an accurate description of Vega here.
Raja actually seemed to be even proud of it and mentioned it during a financial conference call. "Look how great we are at underperforming on launch day!"

Agreed, but that works backwards as well. Mask off all but one lane, going scalar, then compress the result into too large of a block, block size being the prefetch window for maximum bandwidth. A smaller inherent block size allows better compression, or more appropriately, wastes less space. At a 16n prefetch with GDDR5X, any sparse waves are going to be rough. So even without true compression, you avoid transferring lots of zeroes.
I don't know what you mean by "lane" (or the rest of your explanation for that matter), but let's leave it at that.

At least you agree that only GDDR5X is different with 64 byte prefetch, while HBM2 and GDDR5 are essentially the same with 32 bytes.

And that's before considering GDDR5X pseudo-channels which, unlike HBM2, do reduce the prefetch size back to 32 bytes.
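The base figures behind that come out of prefetch length times channel width (the GDDR5X option mentioned above isn't modelled here, and this is just arithmetic, not a claim about any particular memory controller):

Code:
#include <stdio.h>

struct dram { const char *name; int prefetch; int width_bits; };

int main(void) {
    const struct dram parts[] = {
        { "GDDR5 (x32 channel)",         8, 32 },
        { "GDDR5X (x32 channel)",       16, 32 },
        { "HBM2 (64-bit pseudo-chan)",   4, 64 },
    };
    for (unsigned i = 0; i < sizeof parts / sizeof parts[0]; ++i) {
        /* Minimum access granularity = prefetch length x channel width. */
        int bytes = parts[i].prefetch * parts[i].width_bits / 8;
        printf("%-28s %2dn prefetch -> %2d-byte burst\n",
               parts[i].name, parts[i].prefetch, bytes);
    }
    return 0;
}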
 
The way AMD has commented on it, the "game mode" isn't supposed to be equal to having the Radeon RX version of the card.

So what is the deal then:

a) you buy the expensive FE and get an intentional hit on gaming performance compared to the RX because AMD thinks that is correct
b) you get an immensely expensive semi-pro card that has drivers that are not suited for gaming but also not certified for most pro applications

we could add

c) even though the card is delayed, AMD is still unable to deliver acceptable launch drivers
 
Gipsel said:
Is Pascal more bandwidth efficient than Polaris? Sure, no question about it.
Does this mean that the Pascal color compression is "better"? Not necessarily (it's even difficult to define what "better" means)...
Lul...

Gipsel said:
If AMD's Polaris had implemented the exact same compression scheme as Pascal, it would likely be slower and would consume more bandwidth than it does now.
*cough* Bullshit. *cough*

Gipsel said:
which makes it difficult to impossible to compare.
Oh, I haven't had any problem comparing the two... YMMV.

silent_guy said:
Raja actually seemed to be even proud of it and mentioned it during a financial conference call. "Look how great we are at underperforming on launch day!"
Goes to show the power of messaging, which AMD screwed up pretty hard in the RX480 launch...

Even if Vega still trails Pascal by a small % in perf/watt and thus ultimate performance, I don't think the community will crucify them too hard if they get the messaging right. Of course, AMD's execution lately has been like a game of beanboozled, time to grab the popcorn...
 
I'm saying that in a compute shader, which lacks the concept of colors, textures, etc., some sort of reduction or programmer implementation would do the compressing, the compression being a programmable feature. I get the lossless part, which is why I limited my example to zero compression. DCC I excluded from my example, but it would stack with some form of zero compression given stencils. Maybe checkerboard? It really depends on what operations are being performed.
Are you talking about software compression via compute shaders? Even with proper intrinsics support, I doubt the wasted clock cycles will be rewarded with memory or bandwidth savings, except in extreme cases, at least if we are talking about real-time applications. Probably the best thing is the proper choice of algorithm and data structures... Anyway, I think developers would appreciate more a working general-purpose shading language (hello Microsoft, templates are REALLY the first thing you should add to the new compiler!) with proper, stable drivers (i.e., no shit OpenCL drivers). Valid page-faulting support in WDDM would also be really appreciated (doesn't Pascal support some form of page faulting? What about the new memory controller of Vega?).
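For what a programmer-side zero compression could mean here, a CPU-side sketch of the idea only (block and buffer sizes are made up; a real compute-shader version would work per workgroup, e.g. staging through LDS):

Code:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_WORDS 8          /* 8 x 4 bytes = one 32-byte block */
#define NUM_BLOCKS  16

/* Packs only non-zero blocks into 'out' and records occupancy in 'mask'.
 * Returns the number of blocks actually stored. */
static int pack_nonzero(const uint32_t *src, uint32_t *out, uint16_t *mask) {
    int stored = 0;
    *mask = 0;
    for (int b = 0; b < NUM_BLOCKS; ++b) {
        const uint32_t *blk = src + b * BLOCK_WORDS;
        int nonzero = 0;
        for (int w = 0; w < BLOCK_WORDS; ++w)
            if (blk[w]) { nonzero = 1; break; }
        if (nonzero) {
            memcpy(out + stored * BLOCK_WORDS, blk, BLOCK_WORDS * sizeof *blk);
            *mask |= (uint16_t)(1u << b);
            ++stored;
        }
    }
    return stored;
}

int main(void) {
    uint32_t data[NUM_BLOCKS * BLOCK_WORDS] = {0};
    uint32_t packed[NUM_BLOCKS * BLOCK_WORDS];
    uint16_t mask;

    data[0] = 42;                      /* only blocks 0 and 5 are non-zero */
    data[5 * BLOCK_WORDS + 3] = 7;

    int stored = pack_nonzero(data, packed, &mask);
    printf("stored %d of %d blocks (mask 0x%04x): %d instead of %d bytes\n",
           stored, NUM_BLOCKS, (unsigned)mask,
           stored * BLOCK_WORDS * 4, NUM_BLOCKS * BLOCK_WORDS * 4);
    return 0;
}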
 
SPECviewperf numbers are up. AMD ran their SPECviewperf tests at the monitor's native resolution (4K), while official results are run at 1900x1060. That's why Vega FE looked so bad against the Quadros; at the proper resolution Vega is up around the level of the Quadro P5000 and P6000.
Not even close.
[image: SPECviewperf results chart]
 
It doesn't look too bad compared to the P5000: although it has a couple of big losses, it also has a couple of wins and is half the cost. Of course, if it isn't officially a pro card (support-wise), then that is the cost differential right there...
 
Gipsel said:
I feel struck with awe in view of such compelling arguments!
Same.

Gipsel said:
I gave you some reasoning for my point of view. You don't have to agree. But at least try to give a meaningful answer or don't bother to answer at all. That's something like the discussion 101. ;)
I'm not really interested in "reasoning" but in hard data. If you have some you would like to share, by all means.... Until then I have no reason to trust your "reasoning" over actual real world results. Sorry.
 
The CU and L2 cache lines are 64 bytes. Compression or masking will not change that this is the minimum they can operate with.
Some of the other caches like ROP caches are less clear, although even then I think it's at worst 32 bytes.
Atomics exist that could bypass those lines. If you're paging registers, it makes sense; for the framebuffer less so, as there won't be gaps. So yeah, it won't work everywhere, but it's significant. If they did some variable-SIMD design, those cache sizes could change as well.

Are you talking about software compression via compute shaders? Even with proper intrinsics support, I doubt the wasted clock cycles will be rewarded with memory or bandwidth savings, except in extreme cases, at least if we are talking about real-time applications. Probably the best thing is the proper choice of algorithm and data structures... Anyway, I think developers would appreciate more a working general-purpose shading language (hello Microsoft, templates are REALLY the first thing you should add to the new compiler!) with proper, stable drivers (i.e., no shit OpenCL drivers). Valid page-faulting support in WDDM would also be really appreciated (doesn't Pascal support some form of page faulting? What about the new memory controller of Vega?).
Software via compute would be an option, but more as an algorithm: writing to LDS/GDS prior to flushing between waves, manual DCC, etc. Implementing tiled rasterization manually, for example. Not compression so much as alleviating the need for it, the same thing GPUs are doing transparently already. Was DCC even a separate circuit, or just appended instructions that calculate deltas with a reduction to get the final result?

For a shading language, C would be nice. Just write GPU code as giant loops a GPU can unroll. So close with unified memory and page faulting.
 
I should have added the "rumored" qualifier, but it was based on the planned PCIe Volta at the end of the year that might presage something coming out for the upper range, and additional rumors of different Volta chips early next year.

So by Pascal being replaced "soon" you mean "6-8 months from now"?
 
I had some at least. :rolleyes:
I'm not really interested in "reasoning" but in hard data. If you have some you would like to share, by all means.... Until then I have no reason to trust your "reasoning" over actual real world results. Sorry.
So you are denying that the spatial ordering of a tiling rasterizer (higher locality and therefore a higher cache hit rate) and the larger cache sizes (again a higher hit rate) influence the bandwidth efficiency of a GPU (and not only the color compression)? And you also don't deem it likely that with larger caches and hit rates there is an incentive to increase the size of the compression blocks, where on another architecture without these features larger compression blocks would be prohibitive because of the added cost of the higher number of block/tile reloads? Or in other words, you would deny that differences in GPU architectures also create different choices for the color compression?
Now that would be something I would consider highly unlikely (to not call it bullshit, you know).

As I said: Bandwidth efficiency comprises way more than just the color compression. And these different factors need to work together, they influence each other. And therefore you can't judge the color compression in an isolated way from the overall bandwidth efficiency.

If you are not interested in such a discussion, that's fine. But others may be. And for sure it doesn't mean it's bullshit. ;)
 