AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Broken and not enabled are two very different things...
He just said it does not work as designed (answering your very question), and Vega clearly was designed with Primitive Shaders in mind, if you go by AMD PR.
 
What do you mean by half-broken? Not enabled/working in drivers, or are you saying you think it's the hardware? Because when you say "half-broken new uArch" it sounds like you think it's a hardware issue.
A GPU without drivers is nothing more than a slab of Cu and Si.
It's half-broken whether it's hardware or software.
 
Where did anyone say that? I've seen no official words to that effect at all. Is there a link you could post please?
I think you misunderstood me. I was just pointing out that it was confirmed not to be working as designed (as opposed to being confirmed as broken).

[hope this quote orgy works]
I really wonder if Vega works as designed or if implementation of design is broken.
It doesn't, that was confirmed ages ago.
Where was it confirmed exactly?
Right on this very forum. https://forum.beyond3d.com/posts/1997699/
Broken and not enabled are two very different things...
„It doesn't“ seems to indicate to me that Bondrewd was commenting grammatically on the „works as designed“ part. No guarantees though, as I am not a native speaker.
 
I think you misunderstood me. I was just pointing out that it was confirmed not to be working as designed (as opposed to being confirmed as broken).
I understood you fine... I'm just wondering where the actual confirmation is that it's not working as designed, because I've seen no such confirmation, other than forum busybodies making claims, that is. :p
 
What I find interesting is that the whitepaper says they "tested" the culling rate with a special driver, 17.320.

17.320 is strange; it also says there were test drivers 17.20 and 17.30.
 
I understood you fine... I'm just wondering where the actual confirmation is that it's not working as designed, because I've seen no such confirmation, other than forum busybodies making claims, that is. :p
Well, Vega is designed to make use of primitive shaders. According to Ryan Smith, who asked AMD about it, PS are not enabled in any (then) current shipping drivers. You do the logic.

P.S.:
Maybe it helps to mentally differentiate between „works as designed with the current driver software“ and „is theoretically able to work as designed“.
 
It makes me wonder if Vega would be better off without hierarchical-Z support. Basically, DSBR and hierarchical-Z are competing to do the same thing (prevent shading of fragments that will have no effect on the render target). Why have two things on chip that are trying to do the same thing?
That seems too simple of a change to make. I wouldn't say they do the same thing so much as that DSBR facilitates improved HiZ, allowing occlusion culling against a subsequent triangle that ends up on top of the pile.

Assuming the Hi-Z and DSBR tiles line up that is. More efficient cache usage that way.
They don't have to line up, but yeah it would be more ideal. Not being aligned just means more tests against impacted HiZ or bin tiles. Should still be a performance win, but obviously less is more.
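To put the deferral point in concrete terms, here is a toy sketch in Python (my own simplification, not anything from AMD's implementation) of why resolving visibility for a bin's whole batch before shading saves invocations that a purely in-order depth test can't:

```python
# Toy model: smaller z = closer, all triangles opaque.

def shade_in_order(fragments):
    """Immediate-mode style: shade every fragment that passes the depth
    test at the moment it arrives, even if it is overdrawn later."""
    depth = {}   # (x, y) -> nearest depth seen so far
    shaded = 0
    for x, y, z in fragments:
        if z < depth.get((x, y), float("inf")):
            depth[(x, y)] = z
            shaded += 1   # work that a later triangle may waste
    return shaded

def shade_batched(fragments):
    """DSBR-like deferral within one bin's batch: resolve visibility for
    the whole batch first, then shade only the surviving fragments."""
    nearest = {}
    for x, y, z in fragments:
        if z < nearest.get((x, y), float("inf")):
            nearest[(x, y)] = z
    return len(nearest)   # one shaded fragment per covered pixel

# Fragments from two overlapping opaque triangles; the second lands on top.
frags = [(0, 0, 0.8), (1, 0, 0.8), (0, 0, 0.2), (1, 0, 0.2)]
print(shade_in_order(frags))  # 4 shader invocations, 2 of them wasted
print(shade_batched(frags))   # 2 shader invocations
```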

What I find interesting: inside the whitepaper there is a chart which shows that primitive shaders are working.
At that time the context would have been dev-implemented primitive shaders. Since an API, at least publicly, still doesn't exist, no devs could have implemented them. Maybe they exist as a driver optimization involving shader replacement, but hard-coded test cases may be a more apt analogy for demonstrating techniques. The implementation appears to be evolving, likely with dev feedback and experience.

From Nvidia's old paper, we know that a monolithic design will handily beat an "equivalent" chiplet design, but with these kinds of savings, you can afford to underprice the monolithic design by a wide margin.
Or make larger designs than would otherwise be possible. Epyc and V100 are roughly the same size already, and AMD could go larger while V100 is at the reticle limit.

Why make all of those architectural changes to increase clocks if you're going to underclock just one generation later?
The ability to increase clocks typically corresponds to less energy usage per clock. Fiji and Vega being similarly scaled designs, it's not just a contention issue.
 
According to Ryan Smith, who asked AMD about it, PS are not enabled in any (then) current shipping drivers. You do the logic.
Not sure what logic it is I'm supposed to do, as all he said is it's not enabled. That's not confirmation, official or otherwise, that the feature isn't working as designed. It's just an assumption.

Maybe it helps to mentally differentiate between „works as designed with the current driver software“ and „is theoretically able to work as designed“.
Well, he didn't say anything like that, so again, all he said is that it's not enabled.
 
Not sure what logic it is I'm supposed to do, as all he said is it's not enabled. That's not confirmation, official or otherwise, that the feature isn't working as designed. It's just an assumption.


Well, he didn't say anything like that, so again, all he said is that it's not enabled.

Well, he reported he had explicit confirmation from AMD. I don't think he'd lie about it, being the Editor-in-chief of anandtech.com, not some cheesy random blog. If that's not good enough for you, I guess we have to agree to disagree upon this.

The logic you were supposed to do is as follows: If PS are not enabled in the drivers, Vega is not working as designed - because the design includes PS. Nowhere does this imply or deny that the feature is broken in hardware, just that Vega is currently not working as designed. Naively, I thought it blatantly obvious.
 
It makes me wonder if Vega would be better off without hierarchical-Z support. Basically, DSBR and hierarchical-Z are competing to do the same thing (prevent shading of fragments that will have no effect on the render target). Why have two things on chip that are trying to do the same thing?

Without hierarchical-Z, perhaps there'd be fewer paths competing for resources. Hierarchical-Z needs to support two kinds of queries:
  1. coarse-grained - can any of this triangle be visible in each of the coarse tiles of the render target?
  2. fine-grained - which fragment-quads (or MSAA fragment-quads) of this triangle can be visible?
Per the GDC presentation from DICE on using compute to do triangle culling, Hi-Z is a 32-bit code word per 8x8 region of pixels.
From the patents on AMD's rasterization and Linux patches for Raven Ridge, the binning process would seem to fall into a coarser level 0, with areas of screen space ranging from thousands to hundreds of thousands of pixels. The choice would seemingly vary per combination of render target and shader features.
The coarse bin intercept stage is so broad that it would seem like there are still many orders of magnitude between bin determination (0) and sub-pixel coverage (2) where there's potentially room for an additional reject, perhaps as a shortcut to speed up bin processing before a triangle hits the scan-converter, or a way to remove a triangle from a batch context before it overflows (possible in some potential embodiments).
The stride of the rasterizer doesn't seem to have grown to match the bin extents.
The Vega ISA instruction for primitive shader SE coverage posits 32-pixel regions, which is somewhat out of alignment with that sizing, although it doesn't align perfectly with the bin sizes given in the Linux patches either.
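For anyone who wants the coarse reject spelled out, here is a minimal sketch assuming just one conservative far-depth value per 8x8 tile (the real 32-bit HTILE word packs more state than that); the helper names are mine:

```python
# A triangle whose nearest possible depth over a tile is still behind the
# tile's conservative far bound cannot contribute anything there, so it can
# be rejected before per-pixel rasterization. Smaller z = closer.
TILE = 8

def tiles_touched(lo, hi):
    """Indices of the 8x8 tiles overlapped by a screen extent [lo, hi)."""
    return range(lo // TILE, (hi - 1) // TILE + 1)

def coarse_reject(bbox, tri_min_z, hiz_far):
    """hiz_far[(tx, ty)] = farthest depth that could still be visible in
    that tile. Returns True if the triangle is occluded over its bbox."""
    x0, y0, x1, y1 = bbox
    for ty in tiles_touched(y0, y1):
        for tx in tiles_touched(x0, x1):
            if tri_min_z < hiz_far.get((tx, ty), 1.0):
                return False  # might be visible somewhere in this tile
    return True  # behind the known far bound in every touched tile

# A 32x32 region already covered by near geometry at depth 0.1:
hiz = {(tx, ty): 0.1 for tx in range(4) for ty in range(4)}
print(coarse_reject((0, 0, 32, 32), tri_min_z=0.5, hiz_far=hiz))   # True
print(coarse_reject((0, 0, 32, 32), tri_min_z=0.05, hiz_far=hiz))  # False
```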

One reason why HiZ could remain is that while there is some rough overlap, its lifetime and terminating conditions versus those of the DSBR and batching are unclear. There are some shared conditions where HiZ and the DSBR do not apply, such as transparencies or shaders that modify depth. I'm less clear on whether some of those conditions necessarily terminate HiZ versus delay its use until the pipeline clears. In steady-state function, the lifetime of a bin in process is also short, although I suppose the most relevant data in this case would live on in the depth buffer. Htile would have a lifetime spanning countless bins.
Outside of that, we know that the DSBR and batching can give up in instances where HiZ does not, per the descriptions of embodiments and the Linux patch. If pixel or depth context is heavy enough, or some other glass jaw of the DSBR or batching logic is hit, the fallback path is to revert to the standard methods--if a slightly worse form of them.
There's also a mode where the DSBR doesn't defer shader invocation, where having HiZ would help prevent a regression in performance.

That speculation aside, if the contention is that both HiZ and DSBR walk like a duck and quack like a duck, maybe they aren't different birds?
The bin sizing seems to indicate a goal of keeping whatever color, depth, and stencil context there is on-chip. If this means keeping the relevant tiles in the depth and color caches, wouldn't the metadata like Htile be kept similarly unthrashed?
The DSBR's depth tracking could be initialized from a snapshot of existing depth, one of which might be delivered much more rapidly. Additionally, the math to determine which fixed 8x8 regions of the screen could be checked might be easier to calculate and apply to multiple primitives in the pipeline hardware than the variable bin dimensions, and a handful of 32-bit values every 64 pixels could be kept closer to the front end than a whole bin context.
Past that, it would seem like writing the bin's depth data back could write back Htile at the same time. Since there seems to be software that uses HiZ directly, like in the consoles, the hardware ideal might have to lose to the reality of software already using it. Since HiZ and the DSBR have an early input into the geometry engine's triangle processing, their physical proximity might translate into them being physically conjoined or being modes of the same hardware. The complexity might be causing problems, although the counter-point may be that a pure implementation's downsides leave it with glass jaws that yield only mediocre improvement anyway.
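To illustrate the kind of interaction speculated on above, here is a sketch in the same toy style (structure and names are illustrative, not taken from the patents or Linux patches): a bin's coarse depth state seeded from an HiZ snapshot when the bin opens, and folded back when the bin is flushed.

```python
class BinDepth:
    """Per-bin coarse depth tracking, seeded from an HiZ snapshot so
    triangles already hidden by earlier work are culled immediately."""
    def __init__(self, hiz_snapshot):
        self.tile_far = dict(hiz_snapshot)  # (tx, ty) -> far bound

    def test_and_update(self, tile, tri_min_z, tri_max_z):
        """Return False if the tile is already nearer than anything the
        triangle could produce; otherwise tighten the tile's bound
        (assumes an opaque triangle fully covering the tile)."""
        current = self.tile_far.get(tile, 1.0)
        if tri_min_z >= current:
            return False
        self.tile_far[tile] = min(current, tri_max_z)
        return True

def flush_bin(bin_depth, hiz):
    """On bin flush, fold the refined bounds back so HiZ stays in sync
    for later bins, alongside the depth-buffer writeback."""
    for tile, z in bin_depth.tile_far.items():
        hiz[tile] = min(hiz.get(tile, 1.0), z)

hiz = {(0, 0): 1.0}
bd = BinDepth(hiz)
print(bd.test_and_update((0, 0), 0.3, 0.4))  # True: visible, bound tightens
print(bd.test_and_update((0, 0), 0.6, 0.7))  # False: culled within the bin
flush_bin(bd, hiz)
print(hiz[(0, 0)])                           # 0.4
```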

Some of these elements might be used to reduce the overall latency of the DSBR, which AMD has indicated is a side-effect of the whole batching process. I don't recall seeing results for the DX12 thread's asynchronous compute test. I'm not so curious about AC itself, but about whether the frequent trend of AMD's straight-line graphics passes running long enough to counteract its compute prowess is helped or hindered--if we had the ability to toggle this part of the architecture.

But is something really broken? What if everything is working as intended, or what if part of the design just turned out to yield bad results, causing a need to revisit the chip architecture?
It has precedent.
x87 floating point stands as a design that was implemented mostly as intended (to an extent, one of the deviations in its implementation may have been more likely to be salvageable https://cims.nyu.edu/~dbindel/class/cs279/stack87.pdf), but as it turned out some elements of the design weren't as important or helpful as was thought, and a few complex points that the designers hand-waved as being handled by software turned out to be intractable for software to handle. This left even the best x87 CPUs generally hamstrung to around half the throughput of comparable non-x87 FPUs.

Perhaps, and some rumors did hint at a possible scenario like this, the hardware designers or implementation fobbed off some points for the software to handle, and reality "Noped" back at them in response.
It's massively worse with x87 given the heavy ISA and architectural exposure, which the DSBR doesn't seem to have--although if there were exposure AMD could have neglected to mention it.
I don't think GCN will have the opportunity to correct any missteps after 20 years like x86 did by deprecating x87 for SSE2.


Then you scrap it/don't talk about it. They very much did talk about Vega.
There were a number of instances where RTG and/or Koduri should have kept their mouths shut but didn't. I appreciated his desire to make RTG more assertive in its role, but at least some rumors make it seem that the gap between results and talking big finally prompted AMD to tell him to take a long dance off a short pier.
 
The logic you were supposed to do is as follows: If PS are not enabled in the drivers, Vega is not working as designed - because the design includes PS. Nowhere does this imply or deny that the feature is broken in hardware, just that Vega is currently not working as designed. Naively, I thought it blatantly obvious.
For what it's worth, I picked the words I used with quite a bit of care; AMD is only confirming that the feature is not enabled in their drivers right now. They weren't saying anything about the hardware (which isn't meant to be a negative implication, only that they had nothing to say).

To most native English speakers, saying something is not working right, especially when it comes to electronics, implies a hardware failure ala R600.
 
For what it's worth, I picked the words I used with quite a bit of care; AMD is only confirming that the feature is not enabled in their drivers right now. They weren't saying anything about the hardware (which isn't meant to be a negative implication, only that they had nothing to say).

To most native English speakers, saying something is not working right, especially when it comes to electronics, implies a hardware failure ala R600.
Thank you for stepping in. That's exactly why I made the differentiation in my later posts.
 