AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Actually, isn't the PS4 Pro GPU a Vega chip without HBM2 (and also lacking the tile-based rendering) in disguise? The shader cores seem the same (capable of FP16 packed math), and the scheduler and geometry engine are also Vega-like...
No, it's a Polaris with some added extras that will become available in Vega for PC
 
I don't see a reason not to do exactly the same with the HW_ID bits. And there is plenty of space for just adding a different set of bits for newer versions.
I wouldn't see a need to do that until the needs of the internals were expanded so that there needed to be extra representation externally.

Because AMD seems to value consistency across generations? ;)
For a number of these context values, apparently yes. AMD went so far as to reserve bits inside the ID representation, which seems to indicate that there is value to them in having things positioned the way they are.

The main task of the needed rework is not changing a few bits in some registers. It's actually designing the work distribution and all the crossbars in the chip to the needed size.
The amount of work distribution is less variable, as wavefront allocation and launches are constrained to at most 1 wave per pipe capable of dispatch, and some limited number per shader engine (1-2?).
How the GPU tracks free resources for determining if a wavefront can launch, allocating them, and in what order looks like something that relies on some of these values, perhaps in a mix of hardware or microcode since sources like the Xbox One SDK indicate the strides used distributing work can be set by the developer.
Coordination from the command processors, shader engine fixed function regions, and the CU arrays also frequently use pathways independent of the vector memory path, such as the GDS and export buses.
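To make the stride idea concrete, here is a purely hypothetical sketch of stride-based wavefront distribution; the structure, fields, and limits are all invented for illustration, not taken from any SDK or header:

/* Hypothetical stride-based wavefront distribution across shader engines.
 * NUM_SE, CUS_PER_SE and the round-robin policy are illustrative only. */
#define NUM_SE      4
#define CUS_PER_SE  16

struct dispatch_state {
    unsigned next_se;   /* round-robin shader engine pointer */
    unsigned next_cu;   /* CU index within the current SE */
    unsigned cu_stride; /* developer-settable stride, per the SDK hint above */
};

/* Pick a target CU for the next wavefront (at most one launch per pipe per cycle). */
static unsigned pick_cu(struct dispatch_state *s)
{
    unsigned target = s->next_se * CUS_PER_SE + s->next_cu;

    /* Step the CU index by the configured stride within the SE,
     * then rotate to the next shader engine. */
    s->next_cu = (s->next_cu + s->cu_stride) % CUS_PER_SE;
    s->next_se = (s->next_se + 1) % NUM_SE;
    return target;
}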

The crossbars in GPUs have way more clients (just look at the connections between all the L1 caches [there are 96 separate ones!] and L2 tiles [everything has a 512bit wide connection btw.]).
There's the crossbar or a cheaper approximation of one between the write-through L1s to the L2. No communication between L1s, no coherence outside of the trivial handling by the L2, limited RAS, and very weak memory model.

And isn't AMD on record, that AMD will use some infinity fabric derived thingy to connect everything on the chip to the L2 tiles/memory controllers in Vega?
The most AMD has said is that the infinity fabric is implemented as a mesh in Vega, and that it has the same bandwidth as the DRAM.
Putting that between the CUs and the L2s would be a regression in the usual ratio of L1-L2 to DRAM bandwidth from prior architectures, and in raw terms inferior to Hawaii's L2. This is why I am having doubts about the infinity data fabric going into the traditionally encapsulated portion of the GPU. For Zen, the infinity fabric doesn't inject itself between the cores and their local L3 either.

But a crossbar (and the effort needed for it) is not agnostic to the number of attached clients, quite the contrary actually.
That wouldn't have anything to do with work distribution. All of that decision making is done by logic on either side, the fabric may be responsible for routing payloads, but it wouldn't know what they are or decide on the endpoints. It wouldn't know what mode the GPU or CU groups have decided for kernel allocation, how two dispatchers arbitrate a collision, nor could it interpret the meaning of a wavefront's signalling its final export.
The data payload movement isn't what I was considering, although I think most of that portion is actually handled by bus connections, more unidirectional flow, and internal hierarchies--and given the other limits some of those might be streamlined down to have at most N clients or would scale very poorly if they had more.

Also, the vector memory path prior to Vega wouldn't put a limit on the RBE count or rasterizer width per SE, since those aren't L2 clients.

If you scale a chip from 10 CUs (with 16 L1$ and 4 L2-tiles, i.e. you have a 16x4 port crossbar between them [as mentioned, every port is capable of 512bit per clock]) to 64 CUs (96 L1$, 32 L2 tiles, i.e. a 96x32 port crossbar),
I accept that the memory crossbar can scale significantly between chips, although some of the elements I was discussing would only see a generic memory port with 1 access per cycle or hook into separate data paths.
I presume the 32 L2 tile case is for Fury and its 4x8 channels? Bandwidth synthetics showed a distinct lack of bandwidth scaling over Hawaii until after accesses went outside the L2, for some reason. Also, I'm not entirely sure how many of the 96 clients can independently use the crossbar, and this may leave out some of the blocks that can also read from the L2.

Access arbitration as broad as 96x32 may be excessive, given the limits of the L2 and possibly other simplifying assumptions about distribution. It's possible that this could be a limiter in this pathway, or in access arbitration.

Each vALU instruction has a certain fixed latency usually identical to the throughput (4 cycles for full rate instruction, 8 cycles for half rate and so on, exceptions exist).
A lot of exceptions stem from VALU instructions that can source from the LDS, which for some reason don't make the list for a waitcnt. As documented in the ISA doc, VALU instructions cannot increment or decrement the counter anyway.

To give a specific example, the consumer parts for Hawaii (which is physically a half rate double precision chip) have a fuse blown (or potentially some µcode update done by the firmware) which sets this to 1/8 rate for the DP instructions. The chip effectively pauses the vALU for some cycles after each DP instruction.
Other chips have units which can do only 1/16 DP rate.
Perhaps if the GPU assembly were manually arranged or the compiler hacked, that pause could be tested to see if it allows a shader to remove some of the required wait states for various vector ops, if there's effectively 6 vector issue cycles of delay added this way.
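Back-of-the-envelope for that "6 cycles" figure, assuming the usual 4-cycle full-rate and 8-cycle half-rate issue times mentioned above:

/* Assumed issue occupancies; the 1/8-rate figure is the fused consumer setting. */
enum {
    VEC_ISSUE_CYCLES    = 4,                     /* one vector issue slot      */
    DP_NATIVE_CYCLES    = 2 * VEC_ISSUE_CYCLES,  /* Hawaii's native 1/2 rate   */
    DP_THROTTLED_CYCLES = 8 * VEC_ISSUE_CYCLES,  /* 1/8 rate => 32 cycles      */
    EXTRA_ISSUE_SLOTS   = (DP_THROTTLED_CYCLES - DP_NATIVE_CYCLES) / VEC_ISSUE_CYCLES  /* = 6 */
};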

What sequencer?
This was part of my earlier speculation, where there's a smaller execution loop in the various domains that manages their pipelines and allows for them to coordinate despite varying latencies.
That allows for several smaller execution loops that modify the counters, which would then relieve the scalar pipeline of having to tightly coordinate with them. I think the scalar pipe itself is more closely linked to one of them, or is itself derived from a formerly unexposed sequencer in pre-GCN GPUs.
Something that awkwardly bridges multiple domains requires conservatively tracking multiple counters by waiting for them to reach 0, like for flat addressing or scalar memory ops.

Why that might be interesting for instructions like vector ops that can have unexpected latencies or variable latencies is that there apparently is or was a VALU count, one which hasn't been fully expunged. However, hard-wiring an implicit requirement that it be 0 can allow for some of those variations or new vector functions to be handled transparently if there is similar logic in the SIMD block.
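For reference, this is how the three architecturally exposed counters pack into the s_waitcnt immediate as I read the GCN3 ISA doc (field positions and widths may differ on other generations, so treat it as a sketch); the point is that there is no exposed VALU counter alongside them:

#include <stdint.h>

/* GCN3-style s_waitcnt immediate: vmcnt[3:0], expcnt[6:4], lgkmcnt[11:8]. */
static inline uint16_t s_waitcnt_imm(unsigned vmcnt, unsigned expcnt,
                                     unsigned lgkmcnt)
{
    return (uint16_t)((vmcnt & 0xf) |
                      ((expcnt & 0x7) << 4) |
                      ((lgkmcnt & 0xf) << 8));
}
/* e.g. s_waitcnt_imm(0, 0x7, 0xf) waits only on outstanding vector memory ops. */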

It's very much about economizing the effort. But you can't just build a large GPU out of the design of a small CP, some compute pipe, a CU, an RBE, and an L2 tile with a connected 32bit memory channel.
The relationship between a number of these is rather constrained. CP to SE has a few combinations with no clear dependence on CU count. SE front end is linked to RBE, not CU count or the command processor block. RBEs have previously not connected to the L2. Some of the wait counts mentioned earlier are about arbitrating access to a bus, rather than having an ever-changing interconnect.
 
I wouldn't see a need to do that until the needs of the internals were expanded so that there needed to be extra representation externally.
And so the whole argument for a limit based on these bits collapses.
What were we discussing again?
There's the crossbar or a cheaper approximation of one between the write-through L1s to the L2. No communication between L1s, no coherence outside of the trivial handling by the L2, limited RAS, and very weak memory model.
You still need the crossbar switch (or some other non-blocking multistage interconnection network, which amounts to basically the same effort) with exactly the number of input/output ports (data can obviously be transferred in both directions) as I said (that was the a×b notation, the number of ports on both sides). That already factors in that you don't need to directly connect the L1 caches to each other. ;)
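To put rough numbers on that effort, counting only the data links of a full a×b crossbar (a sketch using the figures quoted above, not a claim about AMD's actual implementation):

#define PORT_WIDTH_BITS 512u

/* A full crossbar needs a path from every L1-side port to every L2 tile. */
static unsigned long xbar_wire_bits(unsigned l1_ports, unsigned l2_tiles)
{
    return (unsigned long)l1_ports * l2_tiles * PORT_WIDTH_BITS;
}
/* 10-CU chip:  xbar_wire_bits(16,  4) = 16 *  4 * 512 =    32768 bit-links        */
/* 64-CU chip:  xbar_wire_bits(96, 32) = 96 * 32 * 512 = 1572864 bit-links (~48x)  */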
The most AMD has said is that the infinity fabric is implemented as a mesh in Vega, and that it has the same bandwidth as the DRAM.
Putting that between the CUs and the L2s would be a regression in the usual ratio of L1-L2 to DRAM bandwidth from prior architectures, and in raw terms inferior to Hawaii's L2. This is why I am having doubts about the infinity data fabric going into the traditionally encapsulated portion of the GPU.
That's too early to call as we are lacking detailed information. And I think AMD also said they are heading towards an NoC architecture (with all the layers that comes with it) for the future.
For Zen, the infinity fabric doesn't inject itself between the cores and their local L3 either.
As you say, the L3 in Zen is local to a CCX. The memory controller isn't (both CCXs can access it; in case of multi die/multi socket solutions all can), and the infinity fabric sits between the CCXs and the memory controllers. In GPUs, an L2 tile is tied to a memory channel; it is a global resource. From that point of view it makes perfect sense to use the infinity fabric to connect all clients to the memory controllers in GPUs (where each channel has an L2 tile) or APUs in much the same way it is done with Ryzen.
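A trivial sketch of why each L2 tile is a global resource, assuming plain power-of-two address interleaving (the real GCN channel hashing is more involved):

#include <stdint.h>

#define INTERLEAVE_BYTES 256u   /* assumed granularity */

/* Every address maps to exactly one channel, and thus to the one L2 tile
 * sitting in front of that channel. */
static unsigned channel_for_address(uint64_t addr, unsigned num_channels)
{
    return (unsigned)((addr / INTERLEAVE_BYTES) % num_channels);
}
/* Any CU in any shader engine can generate an address for any channel,
 * so the fabric between clients and L2 tiles has to be all-to-all. */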
That wouldn't have anything to do with work distribution. All of that decision making is done by logic on either side, the fabric may be responsible for routing payloads, but it wouldn't know what they are or decide on the endpoints. It wouldn't know what mode the GPU or CU groups have decided for kernel allocation, how two dispatchers arbitrate a collision, nor could it interpret the meaning of a wavefront's signalling its final export.
The data fabric (no matter what it actually is) is pretty important to actually facilitate the work distribution. Of course the decision making has to be done somewhere, but this is only one part. Eventually, the GPU has to act on these decisions.
And yes, each CU (or even IBs within the CU) signals completion of wavefronts and such stuff to the command processor, where one keeps track of resource allocation, occupancy and so on. But as you say, the number of decisions per unit of time is not that high. But it also scales with the size of the GPU, roughly the same as all the necessary data fabric.
Also, the vector memory path prior to Vega wouldn't put a limit on the RBE count or rasterizer width per SE, since those aren't L2 clients.
The recent GDC talk stated that basically every CU can export through every RBE ("there is no fixed mapping of ROP tile to CU on the chip"). That also matches earlier hints that the rasterizer tiles don't necessarily match the RBE tiles and there is some interleaving pattern between them (probably for load balancing reasons). So while this data isn't going over the L1-L2 path, it necessitates another crossbar between the CUs and RBEs with a comparable amount of clients. The effort for scaling the unit counts keeps growing ;).
I accept that the memory crossbar can scale significantly between chips, although some of the elements I was discussing would only see a generic memory port with 1 access per cycle or hook into separate data paths.
Those are the things that don't complicate the scaling to larger GPUs that much, I guess?
I presume the 32 L2 tile case is for Fury and its 4x8 channels? Bandwidth synthetics showed a distinct lack of bandwidth scaling over Hawaii until after accesses went outside the L2, for some reason. Also, I'm not entirely sure how many of the 96 clients can independently use the crossbar, and this may leave out some of the blocks that can also read from the L2.
Access arbitration as broad as 96x32 may be excessive, given the limits of the L2 and possibly other simplifying assumptions about distribution. It's possible that this could be a limiter in this pathway, or in access arbitration.
So maybe AMD couldn't implement a "fatter"/full crossbar with reasonable effort? That kind of reinforces my point about what the complicated things are, doesn't it? ;)
Perhaps if the GPU assembly were manually arranged or the compiler hacked, that pause could be tested to see if it allows a shader to remove some of the required wait states for various vector ops, if there's effectively 6 vector issue cycles of delay added this way.
You can't, as the delay is introduced by the scheduler hardware in the CU. And if you could somehow manipulate the instruction decoding (that's more a theoretical possibility for most parts, maybe a real one for consumer parts of Hawaii), a faster issue rate than the hardware is capable of would likely result in crashes/undefined behaviour on 1/16 DP rate parts, although you could coerce a consumer model of Hawaii back to its full 1/2 rate that way. But this is somewhat unconnected to the scaling issue.
The relationship between a number of these is rather constrained. CP to SE has a few combinations with no clear dependence on CU count. SE front end is linked to RBE, not CU count or the command processor block. RBEs have previously not connected to the L2. Some of the wait counts mentioned earlier are about arbitrating access to a bus, rather than having an ever-changing interconnect.
It doesn't have to be a bus. If you have multiple clients on some resource (no matter whether those are the CUs, the LDS, memory [caches] or the RBEs), you need arbitration to handle collisions, which induces variable latencies. A bus just restricts the number of parallel accesses to all connected resources to 1. A crossbar can connect all (idle) ports on both sides with each other simultaneously. And there is a whole range in between. You will always have variable latencies from arbitration, no matter what you use, if collisions can occur.
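A minimal sketch of such an arbitration point, assuming simple rotating-priority round-robin over N requesters sharing a single port:

#include <stdbool.h>

#define NUM_CLIENTS 16

/* One grant per cycle: whoever else requests in the same cycle pushes your
 * request back, which is exactly where the variable latency comes from.
 * A crossbar raises the number of simultaneous grants but still has to
 * arbitrate collisions on the same destination port. */
static int arbitrate(const bool req[NUM_CLIENTS], unsigned *rr_ptr)
{
    for (unsigned i = 0; i < NUM_CLIENTS; i++) {
        unsigned c = (*rr_ptr + i) % NUM_CLIENTS;
        if (req[c]) {
            *rr_ptr = (c + 1) % NUM_CLIENTS;  /* rotate priority */
            return (int)c;                    /* granted this cycle */
        }
    }
    return -1;  /* no requester, idle cycle */
}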
 
And so the whole argument for a limit based on these bits collapses.
What were we discussing again?
I'm not sure, it's something of an orthogonal track.
The argument I have is that there's limited reason to expand the representation more than the architecture needs, and if the representation makes certain choices it may reflect baked-in choices of the architecture that would require propagating changes of non-zero cost through the overall pipeline.
I think that since what was set down in SI matches the bounds of Fiji, the ID bits are consistent with AMD's statement that the architecture has certain designed limits in the shader engine, RBE, and CU counts.

The argument I am interpreting is that changing the output is easy, and that the hardware can be redesigned to fit expanded counts.
I think we disagree on whether there are significant incremental costs above the costs inherent to a larger implementation, and, if there are such additional costs, how much AMD is willing to pay.
I tend to assume the low end on the latter point for the architectural range that AMD's statement applied to.

You still need the crossbar switch (or some other non-blocking multistage interconnection network
The non-blocking part is something I'm curious about, particularly near the upper end. Fury's somewhat anomalous bandwidth behavior might be related to some limitation there, or perhaps the assumption about the L2's number of slices needs revisiting.
Polaris had a blurb about doing more to group L2 requests, but that could also be more of a voluntary blocking on the part of the vector L1s to buy time for more requests to coalesce.

That's too early to call as we are lacking detailed information. And I think AMD also said they are heading towards an NoC architecture (with all the layers that comes with it) for the future.
AMD gave the bandwidth number for Vega's fabric, and it's inferior to Hawaii's L1-L2 bandwidth. If Hawaii's interface were clocked as high as Vega apparently will be, it would have 3x the bandwidth of an L1-L2 connection using AMD's fabric number.
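The back-of-the-envelope I'm using, with the slice count and per-slice width being my assumptions rather than published figures:

enum { L2_SLICES = 16, BYTES_PER_SLICE_CLK = 64 };  /* assumed Hawaii-like L2 */

static double l1_l2_bw_gbs(double clock_ghz)
{
    return L2_SLICES * BYTES_PER_SLICE_CLK * clock_ghz;  /* GB/s */
}
/* l1_l2_bw_gbs(1.0) ~= 1024 GB/s (Hawaii as shipped)                     */
/* l1_l2_bw_gbs(1.5) ~= 1536 GB/s (at a Vega-like clock)                  */
/* vs. ~500 GB/s if the fabric tops out at HBM2 bandwidth -> roughly 3x.  */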

In GPUs, an L2 tile is tied to a memory channel; it is a global resource. From that point of view it makes perfect sense to use the infinity fabric to connect all clients to the memory controllers in GPUs (where each channel has an L2 tile) or APUs in much the same way it is done with Ryzen.
Some of the discussions about Zen's inter-CCX communications also include mention of links between CCXs, although I'm not sure it's been fully described.
Is the assumption that there's a change with Vega's L1s? The current GCN L1s are write-through, not coherent with each other, and have a messy, non-exclusive, and extremely weak consistency relationship with the L2.

The data fabric (no matter what it actually is) is pretty important to actually facilitate the work distribution.
Something has to be the transport mechanism, but a major selling point of the fabric is that it is generic and encapsulates the details of the clients from the interconnect.

And yes, each CU (or even IBs within the CU) signals completion of wavefronts and such stuff to the command processor, where one keeps track of resource allocation, occupancy and so on. But as you say, the number of decisions per unit of time is not that high. But it also scales with the size of the GPU, roughly the same as all the necessary data fabric.
Is there a ranking on this? The command processor blocks have ranged from 3 in Tahiti to 5-9 in GPUs and APUs ranging from Kabini to Fiji.
Dispatch capability is unclear to me, possibly 2-4 or 1-4 if including the APUs?

The number of memory channels varies widely, and the product of channels and CU clients varies wildly even among the GPUs that have similar labels to Kaveri's front end.
Memory channels vary more freely as well, given Tahiti and Tonga's non-power of 2 interfaces. That involved yet another crossbar between the controllers and ROPs in Tahiti, and presumably the same for Tonga even though the extra channels and their L2 slices went unused.


The recent GDC talk stated that basically every CU can export through every RBE ("there is no fixed mapping of ROP tile to CU on the chip").
Is there a document or reference with the context? I know of an option for geometry processors to perform setup from other SEs. The static assignments wouldn't be at the CU's level of the hierarchy.

That also matches earlier hints that the rasterizer tiles don't necessarily match the RBE tiles and there is some interleaving pattern between them (probably for load balancing reasons).
For Vega or prior GPUs? One rasterizer per SE, screen-space assignment was stated for known GCN chips.

So while this data isn't going over the L1-L2 path, it necessitates another crossbar between the CUs and RBEs with a comparable amount of clients.
Which fields in the export instruction identify the target RBE, or is it relying on some other identifiers? Otherwise, is it some other unit that makes the determination?

Those are the things that don't complicate the scaling to larger GPUs that much, I guess?
They wouldn't complicate the CU or unit's perception of the rest of the system.

So maybe AMD couldn't implement a "fatter"/full crossbar with reasonable effort? That kind of reinforces my point about what the complicated things are, doesn't it? ;)
I have not discounted the possibility of other sources of difficult-to-scale complexity; I'm considering the idea that the interconnect may be more limited than a full crossbar even at lower sizes.

You can't, as the delay is introduced by the scheduler hardware in the CU.
I'm talking about the required NOPs or independent instructions mandated in the ISA document that the instruction stream must have. If the CU does its own stalling for that many cycles, wouldn't it be possible to anticipate this and craft an instruction sequence for that mode that doesn't need the manual wait states?

It doesn't have to be a bus.
It's documented as a bus or set of them based on the specific type, at least so far. Perhaps there's imprecision in the language.
 
4gb version? 4GB version! AMD???

2 years ago, yeah ok fine. Not the end of the world, whatever. Today, for a mid range card, still fine. Sure sometimes, rarely, there's some crazy "4k!" texture pack, with a texel density so high you'd need a very high res monitor to see it anyway. But for the highest end GPU you make? Now that's just plain out of date. Sure it's only out of date on "some" titles, but if you're paying $600+ that's just not even close to good enough.
 
2x compared to what?
4gb version? 4GB version! AMD???

2 years ago, yeah ok fine. Not the end of the world, whatever. Today, for a mid range card, still fine. Sure sometimes, rarely, there's some crazy "4k!" texture pack, with a texel density so high you'd need a very high res monitor to see it anyway. But for the highest end GPU you make? Now that's just plain out of date. Sure it's only out of date on "some" titles, but if you're paying $600+ that's just not even close to good enough.

You need to read more; AMD says that with HBC 4GB is enough, and if they are committing to it, it must be true
 
2x compared to what?


You need to read more; AMD says that with HBC 4GB is enough, and if they are committing to it, it must be true

Except, the worst case scenario is that the title in question really is efficient in memory use. E.g. the newest version of Frostbite (anything made by EA) evicts used buffers during the very frame they were created, let alone more long-term memory resources like textures. Which is to say, since newer Vulkan/DX12 titles have finer-grained memory control, and other titles like Ashes will use the entire object space cache to "cache" shading results, 4GB can still fail.

Are HBM yields really that low that you have to put 4GB in a high end card? Honestly...
 
There are two Vega cards. I honestly doubt there will be lots of 4GB Vega 10 out there. Maybe AMD will "launch" one just to claim a price range achievement (ahem RX480 4GB ahem), but then it just won't be an interesting offer.
 
There are two Vega cards. I honestly doubt there will be lots of 4GB Vega 10 out there. Maybe AMD will "launch" one just to claim a price range achievement (ahem RX480 4GB ahem), but then it just won't be an interesting offer.
A 4GB Vega at $200-$250 to push the RX 480 8GB down in price could work, if it's another $50 to go to 8GB. It would compete with the 1060, wouldn't it? I believe that's 3/6GB, so it could compete nicely against that.
 
$200 for the Vega and $150 for the 580? And $120 for the 480? O.O
The 480 would go out of commission, since the 580 is probably the same chip on a later stepping/BIOS with different clocks and no reference cooler this time (think 290X -> 390X).
I don't think the RX580 needs to go that far down in price, as the RX480 is still selling well when there's a discount to ~$200 for the 8GB model.

I'd bet $180-200 for RX580 8GB and $130-150 for RX570 4GB. Polaris 11 RX560 could come down to $70-90 but this time with all the CUs activated.
 
$130 for the 570 would make a used 480 cost $150/$130 and the 470 around $100. That is one hell of a deal for cards that can play at 1080p with everything maxed... I'm an optimist, but even I have my doubts that the world would be such a sweet place :D
 
Except, the worst case scenario is that the title in question really is efficient in memory use. E.g. the newest version of Frostbite (anything made by EA) evicts used buffers during the very frame they were created, let alone more long-term memory resources like textures. Which is to say, since newer Vulkan/DX12 titles have finer-grained memory control, and other titles like Ashes will use the entire object space cache to "cache" shading results, 4GB can still fail.
Render target optimization actually has the opposite effect. Frostbite (at 4K) used to need 1 GB of memory for render targets. The new version only needs 0.5 GB. Render targets will never be paged out, since they are accessed every frame. In that old engine version most of the render target memory was only accessed once (or a few times) per frame. Now the render target memory is reused inside the same frame and the same regions are touched multiple times. Thus previously Frostbite had 1 GB of memory that always had to be resident. Now it only has 0.5 GB.

Remaining GPU memory is mostly assets (textures, meshes, etc). As render targets get smaller, this portion gets proportionally bigger. This means that bigger portion of the allocated memory can be non-resident. Thus Vega should be very happy about optimizations like this. Also this makes 4 GB cards in general better. 4K has been a problem for 4 GB cards. 4K at half the memory cost is a good thing for 4 GB cards. If enough other AAA devs follow the suit, this kind of technical advances in game engines will extend the life time of 4 GB graphics cards.
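Roughly the kind of lifetime-based aliasing a frame graph enables (a sketch, not actual Frostbite code; all names here are made up):

#include <stddef.h>

/* Two transient render targets whose lifetimes don't overlap within the
 * frame can share the same physical allocation; that reuse is where the
 * ~2x render target memory saving comes from. */
struct transient_rt {
    size_t   size;
    unsigned first_use_pass;
    unsigned last_use_pass;
};

static int lifetimes_overlap(const struct transient_rt *a,
                             const struct transient_rt *b)
{
    return !(a->last_use_pass < b->first_use_pass ||
             b->last_use_pass < a->first_use_pass);
}
/* If !lifetimes_overlap(&gbuffer_normals, &bloom_half_res), both can be
 * placed at the same offset of one big heap instead of two allocations. */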
 
I did not follow the discussion over the past months... Any hope for a new rasterizer? Q___Q

FP16 packed vectors are fine but... You know, conservative rasterization and pixel sync/ROVs/~programmable blend state, that's what I really wish to hear.
 
Where'd you read this?

http://www.frostbite.com/2017/03/framegraph-extensible-rendering-architecture-in-frostbite/

It's very neat! Though it does seem a bit, uhm, overkill for a few hundred MB. It does show, though, that the usual overcaching of data can be overcome in modern engine design with newer API access and such.
Render target optimization actually has the opposite effect. Frostbite (at 4K) used to need 1 GB of memory for render targets. The new version only needs 0.5 GB. Render targets will never be paged out, since they are accessed every frame. In that old engine version most of the render target memory was only accessed once (or a few times) per frame. Now the render target memory is reused inside the same frame and the same regions are touched multiple times. Thus previously Frostbite had 1 GB of memory that always had to be resident. Now it only has 0.5 GB.

Remaining GPU memory is mostly assets (textures, meshes, etc). As render targets get smaller, this portion gets proportionally bigger. This means that bigger portion of the allocated memory can be non-resident. Thus Vega should be very happy about optimizations like this. Also this makes 4 GB cards in general better. 4K has been a problem for 4 GB cards. 4K at half the memory cost is a good thing for 4 GB cards. If enough other AAA devs follow the suit, this kind of technical advances in game engines will extend the life time of 4 GB graphics cards.

Wasn't meant to show that this specific example was something the Vega design would find troubling, but that the usual overcaching that Vega is supposed to eliminate can now be eliminated by devs themselves. Surely this is better for high resolution RTs on low memory systems, but as memory residency gets refined by devs, the idea of it getting refined by some hardware cache becomes irrelevant, since the optimization is already done. E.g. you're not gonna be able to run DOOM on max settings, because all that overcaching of virtualized textures is there to prevent any streaming pop-in (afaik? do they overcache tiles to use for anisotropic filtering or something?)

Just suggesting that the High Bandwidth Cache claims of discarding unused memory residency could deliver less gains over time. And if someone wants a super high end card, you probably want it to last a long time too. Then again it's perhaps a belabored point. By the time 2020 and a "new" generation of VR/AR/whatever consoles rolls around maybe Vega will be outdated anyway. The new ones could have Shader Model 7 and transactional memory and etc. etc.
 
I did not follow the discussion over the past months... Any hope for a new rasterizer? Q___Q

FP16 packed vectors are fine but... You know, conservative rasterization and pixel sync/ROVs/~programmable blend state, that's what I really wish to hear.

Do you mean aside from the deferred binning portion of the rasterizer?
There does seem to be a mode control for conservative rasterization: https://lists.freedesktop.org/archives/mesa-dev/2017-March/148861.html
+ si_pm4_set_reg(pm4, R_028C4C_PA_SC_CONSERVATIVE_RASTERIZATION_CNTL,

Other things I've run across:

Like virtually all architectures, GFX9 does have to work around a few quirks/bugs:
https://lists.freedesktop.org/archives/mesa-dev/2017-March/148879.html, titled "radeonsi/gfx9: add a scissor bug workaround"

There was some speculation about how the ROPs being L2 clients would change the behavior of the rest of the GPU. One item that appears to be continuing is a separate and incoherent metadata path, for things like delta compression:
https://lists.freedesktop.org/archives/mesa-dev/2017-March/148890.html
Also, whatever relationship the ROP path has with the L2 seems to have some subtleties to how coherence is or is not handled. GFX9 avoids some cache flushes, but still needs others that prior generations did:
+ /* GFX9: Wait for idle if we're flushing CB or DB. ACQUIRE_MEM doesn't
+ * wait for idle on GFX9. We have to use a TS event.
+ */
+ if (sctx->b.chip_class >= GFX9 && flush_cb_db) {
+ struct r600_resource *rbuf = NULL;
+ uint64_t va;
+ unsigned offset = 0, tc_flags, cb_db_event;
+
+ /* Set the CB/DB flush event. */
+ switch (flush_cb_db) {
+ case SI_CONTEXT_FLUSH_AND_INV_CB:
+ cb_db_event = V_028A90_FLUSH_AND_INV_CB_DATA_TS;
+ break;
+ case SI_CONTEXT_FLUSH_AND_INV_DB:
+ cb_db_event = V_028A90_FLUSH_AND_INV_DB_DATA_TS;
+ break;
+ default:
+ /* both CB & DB */
+ cb_db_event = V_028A90_CACHE_FLUSH_AND_INV_TS_EVENT;
+ }
+
+ /* TC | TC_WB = invalidate L2 data
+ * TC_MD | TC_WB = invalidate L2 metadata
+ * TC | TC_WB | TC_MD = invalidate L2 data & metadata
+ *
+ * The metadata cache must always be invalidated for coherency
+ * between CB/DB and shaders. (metadata = HTILE, CMASK, DCC)
+ *
+ * TC must be invalidated on GFX9 only if the CB/DB surface is
+ * not pipe-aligned. If the surface is RB-aligned, it might not
+ * strictly be pipe-aligned since RB alignment takes precedence.

The above's mention about ACQUIRE_MEM not waiting for idle runs counter to how it is discussed in a patch that removes a note that exempted the graphics ring from using it (perhaps a quirk of the ROP path?):
https://lists.freedesktop.org/archives/mesa-dev/2017-March/148925.html

While we're at it, it seems AMD has merged a few of its internal shader types, potentially correlating with the changes made in the geometry engine:
https://lists.freedesktop.org/archives/mesa-dev/2017-March/148903.html
+ /* GFX9 merged LS-HS and ES-GS. Only set RW_BUFFERS for ES and LS. */

There are some preliminary descriptions of the hardware dimensions of Vega10, which seem mostly in line with Fiji. The one item with a question mark, max_tile_pipes, may have it because this is one number that is inferior to Fiji--listed as 16 elsewhere. One item of note is that the max number of texture channel caches, which seems to give a count of active L2 cache slices, is also 16 for Fiji elsewhere despite its wider channel count. (Hawaii's is also 16.)
https://lists.freedesktop.org/archives/amd-gfx/2017-March/006570.html

+ case CHIP_VEGA10:
+ adev->gfx.config.max_shader_engines = 4;
+ adev->gfx.config.max_tile_pipes = 8; //??
+ adev->gfx.config.max_cu_per_sh = 16;
+ adev->gfx.config.max_sh_per_se = 1;
+ adev->gfx.config.max_backends_per_se = 4;
+ adev->gfx.config.max_texture_channel_caches = 16;
+ adev->gfx.config.max_gprs = 256;
+ adev->gfx.config.max_gs_threads = 32;
+ adev->gfx.config.max_hw_contexts = 8;
+
+ adev->gfx.config.sc_prim_fifo_size_frontend = 0x20;
+ adev->gfx.config.sc_prim_fifo_size_backend = 0x100;
+ adev->gfx.config.sc_hiz_tile_fifo_size = 0x30;
+ adev->gfx.config.sc_earlyz_tile_fifo_size = 0x4C0;
+ gb_addr_config = VEGA10_GB_ADDR_CONFIG_GOLDEN;
+ break;
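For what it's worth, multiplying those fields out (my own arithmetic, not something stated in the patch):

enum {
    VEGA10_CUS  = 4 /* SEs */ * 1 /* SH/SE */ * 16 /* CU/SH */,  /* = 64, Fiji-like */
    VEGA10_RBES = 4 /* SEs */ * 4 /* RBE/SE */                   /* = 16 RBEs       */
};
/* At the commonly quoted 4 pixels/clock per RBE that would be 64 ROPs, and
 * the 16 texture channel caches match the Hawaii/Fiji L2 slice count noted above. */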
 
Just suggesting that the High Bandwidth Cache claims of discarding unused memory residency could deliver less gains over time. And if someone wants a super high end card, you probably want it to last a long time too. Then again it's perhaps a belabored point. By the time 2020 and a "new" generation of VR/AR/whatever consoles rolls around maybe Vega will be outdated anyway. The new ones could have Shader Model 7 and transactional memory and etc. etc.
It is hard to see what happens in the future. If hardware paging solutions become prevalent there's significantly less reason to create your own highly complex software paging solution. Data sets are growing all the time, meaning that less and less data gets actually accessed per frame. My prediction is that gains from automated paging systems are increasing in the future. But obviously we are also going to see games using techniques that require big chunks of physical memory. For example big runtime generated volume textures used in global illumination systems. The best bet is to have both more memory and an automated paging system.

The reality right now is that recently Nvidia released Geforce GTX 1060 with 3 GB of memory. Developers need to ensure that their games work perfectly fine on this popular GPU model. Thus I don't see any problem with a 4 GB mid-tier GPU with automated paging solution. Automated memory paging solution doesn't need to double your available memory in every single game to be useful.
 
I like AMD's approach and it gives more flexibility: if a game uses only 4GB of data, why make a card with 11? You can have your 4GB on the card and use system RAM, which has plenty of unused space, to "cache" the rest, and in the not-too-distant future maybe even the whole drive, if the new NAND tech with near-RAM speed materializes in time. Because right now it is not very resource-efficient to just put as much data as possible in RAM only to use a fraction of it.

I think that recent history shows AMD ahead of its time (more cores, Mantle, async compute), so they at least deserve the benefit of the doubt: if they think this is the right direction and are committing to it (I doubt the next generation of GPUs in any segment won't use it), it is because they think this is the best way to do it, and maybe it is even a feature asked for by developers themselves or console makers, in the same way they helped create the first GCN card a long time ago.
 
I think that recent history shows AMD ahead of its time (more cores, Mantle, async compute), so they at least deserve the benefit of the doubt: if they think this is the right direction and are committing to it (I doubt the next generation of GPUs in any segment won't use it), it is because they think this is the best way to do it, and maybe it is even a feature asked for by developers themselves or console makers, in the same way they helped create the first GCN card a long time ago.
Pascal has a similar feature already, but it is designed for professional compute use. NVIDIA even designed a custom NVLink interconnect (to unified main memory) to maximize their technology's potential.

Some info:
https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/
http://www.techenablement.com/key-aspects-pascal-commercial-exascale-computing/
 