AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

[Meta]
Is it just me, or is labeling things as high level or low level too much of a simplification? I'm referring to the view that DX12 is "close to the metal".

While I know nothing about DX, I view abstractions mostly as being either good or bad for a given task. A bad abstraction can be introduced at the wrong layer (so maybe here we get the low- vs. high-level separation), may not allow the user to specify all the information she wishes, or may fail to abstract away minute and useless details. I believe the latter is the case here.

Defining a good abstraction, on the other hand, is very hard for complex APIs such as these, of course.

Of course, when these change, a programmer may be forced to learn the new model, hence a temporary decrease in productivity and/or some resistance. But this is different from the "only good programmers can handle low level" mantra.
[/Meta]
 
They provide the structure to accelerate it with the queues. There is no reason they can't dispatch draw calls, as was already explained. It should be possible to craft a loop that runs continuously with little to no CPU intervention: the CPU used only for external stimulus, HBCC and unified memory for resource management/allocation, and a single compute shader or CP thread as a main game loop keeping everything in sync. That system should be able to load balance itself with minimal tuning on the part of the shader or developer. Low-level APIs would be the first step towards this, because doing all that validation in a shader would be a nightmare. In effect it's stripping out the branching portion that required a CPU for performance.
As explained where? You are describing something that doesn't exist yet. You can queue a number of draw calls, and a lot of parameters can be decided upon later on, including the exact number of draw calls being issued (that is, they exist in GPU space). But a lot of GPU state still can't be touched by the GPU. And I wouldn't call calculating some parameters for a draw call packet later in the command stream dispatching a draw call.
 
Really? So devs wasted their time optimizing for a feature that will not help NV in any way? I am curious to know which developer actually said that! The last thing I heard was the developer of Ashes of the Singularity stating that Maxwell doesn't do well with async in their game.
Most devs won't outright say that for political reasons as I've pointed out already. Case in point: the very presentation you linked. There have been comments on various forums and bug reports. Anyone who understands how the architecture works should be able to guess at the limitations based on API design decisions. Any restrictions are for compatibility.

Yes, that happened in multiple games already, but this is not about NV's async; this is about DX12 in general. Devs are not coming out and criticizing NV's async, they are criticizing the whole API.
Everything about DX12 is asynchronous. All queues operate that way until you tell them not to with fence and barrier constructs. It's based on out-of-order processing. Nvidia prefers in-order, with the ability to predict inherently unpredictable ratios in their driver. Evidenced by the asynchronous time warp implementations. It can't randomly launch and take priority. Even pure compute tasks struggle with that. That's the very situation the ACE hardware resolves. Pausing and switching dispatches shouldn't be difficult, but Nvidia can't seem to do it along with prioritization.
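For example, cross-queue ordering in D3D12 is expressed with a fence that the queues signal and wait on themselves. A minimal sketch, assuming an already-created device and queues (error handling omitted):

Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// A compute queue signals a fence when its dispatches retire; the graphics
// queue waits on that value before consuming the results. Until a fence like
// this is inserted, the two queues run independently of each other.
void SyncComputeToGraphics(ID3D12Device* device,
                           ID3D12CommandQueue* computeQueue,
                           ID3D12CommandQueue* graphicsQueue)
{
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    const UINT64 doneValue = 1;
    computeQueue->Signal(fence.Get(), doneValue);   // GPU-side signal
    graphicsQueue->Wait(fence.Get(), doneValue);    // GPU-side wait, no CPU involved
}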

As explained where? You are describing something that doesn't exist yet. You can queue a number of draw calls, and a lot of parameters can be decided upon later on, including the exact number of draw calls being issued (that is, they exist in GPU space). But a lot of GPU state still can't be touched by the GPU. And I wouldn't call calculating some parameters for a draw call packet later in the command stream dispatching a draw call.
Explained in the whitepapers, in combination with currently required capabilities. At least for AMD, a lot shows up in the ROCm documentation. If a shader can output data, forging a command packet is straightforward enough. I've also seen some hacks where a shader can write its own code. At that point you have a Von Neumann architecture. Write and submit command packets and you're golden.

In the case of the XB1X, consider Microsoft's CPU modifications to reduce overhead. There's really no reason that same capability couldn't exist on the GPU along with bundles. Much more efficient, with less validation, on a SIMD. HBCC and virtual addressing in theory open the door for dynamic memory allocations on the GPU. AMD's latest HC compiler hints at this feature at least. That solves a lot of GPU state issues. It should currently be possible to pre-compile bundles and lists with fixed addressing, leaving very little work to kick off execution.
 
Explained in the whitepapers, in combination with currently required capabilities. At least for AMD, a lot shows up in the ROCm documentation. If a shader can output data, forging a command packet is straightforward enough. I've also seen some hacks where a shader can write its own code. At that point you have a Von Neumann architecture. Write and submit command packets and you're golden.
Not what I'm saying. Sure, if you know how a draw call is structured (or any other command for that matter) you can in theory construct whichever command stream you want. In theory. It's not something that's doable through D3D/Vulkan/Mantle/... I'm fairly certain this is not available to Xbox/PS developers either. So yes, in theory, if you're a Linux driver developer you can play with that. If you're a game developer then no.
 
Not just a cache flush, a pipeline flush too. I don't really understand why they did it either; seems like a performance sink, but what do I know about the hardware/CPU interface.
The system is running the GPU and CPU asynchronously through queues that are at a variable depth.
The queues themselves translate into any number of independent internal contexts run by implementation-specific controllers and processors, and they may not be structured to be safe with regards to concurrency in their execution or the hardware contexts they share.
The data-race issue is something that comes up in GCN when reading in shader output. It's not the API that's mandating ROP and L2 cache flushes, or the command processor(s) stall that comes with them.

GCN requires lots of GPU cache flushes during frame rendering, as the ROP caches aren't coherent with the L1/L2 caches. Every time you stop writing to a render target and start sampling it, you need to flush the caches. Vega (GCN5) moves the ROP caches under the L2 cache. This reduces the need for GPU cache flushes drastically. AFAIK all Nvidia DX11 GPUs had ROPs under L2. Vega also has a tiled rasterizer (Nvidia has had one since Maxwell 1). So it seems that the memory hierarchies of AMD & NV are going to be pretty similar after the Vega launch.
Vega's avoidance of L2 flushes does require some kind of alignment, probably to avoid straddling the L2's partitions in a manner where the hardware can no longer guarantee consistency. Apparently the RBE and L2 channel mappings can disagree, and Vega must respect the data's alignment if it is aligned to the RBEs but not to the L2.

Even with that, the metadata needs to be flushed for DCC, Htile, multisample color compression, etc.
And there's a stall on the queue until this process finishes.


If there is to be a relaxation on DX12's mandatory flushes, there may need to be a way to really version or isolate data and execution flow. The front ends can run ahead very far, the intermediate stages may not be fully architected to be safe, and the back ends have limited visibility or intelligence when it comes to how much disparate hardware interacts.
Even if one GPU did make this leap, that leaves out everyone else an API must cater to.
 
Hardware Battle has a rumor about consumer Vega (picked up by WCCFTech).

(Google Translate) said:
According to the information, AMD will ship VEGA GPU chipsets for gaming to global graphics card makers as early as this week.

So far, only two GPUs are known to exist, but this is also uncertain.

Domestic shipments are expected to arrive at the end of July or early August, and for a certain period after the release of the reference version
(Note for those reading the WCCFTech report: the comparison with the GTX 1080 has been crossed out on the Hardware Battle page.)

If there are two consumer Vega GPUs right now, then could they correspond to two of the 68xx:yy codenames for Vega engineering samples? I was thinking that if so, then maybe the 687F:C3 and 687F:C1 end up as consumer Vega, but I don't know enough about engineering samples to properly speculate.
 
If working back from the supposed Vega RX launch at SIGGRAPH, I think shipments to board makers would need to be starting or underway since manufacturing and shipping have lead times.
 
Sure, if you know how a draw call is structured (or any other command for that matter) you can in theory construct whichever command stream you want. In theory. It's not something that's doable through D3D/Vulkan/Mantle/...

In Direct3D you have an abstract command-buffer encoding which is used for ExecuteIndirect. Nvidia added an extension to Vulkan for some sort of abstract Nvidia-only command buffer. Vulkan has indirect stuff, but it seems underdocumented: they don't mention what kind of representation it has or whether it can be generated by compute shaders. I have no idea how the abstract command buffers become machine-dependent command buffers.
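For concreteness, the "abstract encoding" on the D3D12 side is just a fixed argument layout in an ordinary GPU buffer plus a command signature describing it. A minimal sketch (device/resource creation omitted; the argument buffer could just as well be filled by a compute shader):

Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// One record of the "abstract" encoding: the spec-defined draw argument struct.
struct IndirectDraw
{
    D3D12_DRAW_ARGUMENTS Draw;  // VertexCountPerInstance, InstanceCount, ...
};

ComPtr<ID3D12CommandSignature> MakeDrawSignature(ID3D12Device* device)
{
    D3D12_INDIRECT_ARGUMENT_DESC arg = {};
    arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW;

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride       = sizeof(IndirectDraw);
    desc.NumArgumentDescs = 1;
    desc.pArgumentDescs   = &arg;

    ComPtr<ID3D12CommandSignature> sig;
    // No root signature is needed while the signature only contains draw arguments.
    device->CreateCommandSignature(&desc, nullptr, IID_PPV_ARGS(&sig));
    return sig;
}

// Consume up to maxDraws records from argBuffer; how they become
// machine-dependent packets is up to the driver/hardware.
void RecordIndirectDraws(ID3D12GraphicsCommandList* cmdList,
                         ID3D12CommandSignature* sig,
                         ID3D12Resource* argBuffer, UINT maxDraws)
{
    cmdList->ExecuteIndirect(sig, maxDraws, argBuffer, 0, nullptr, 0);
}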
 
Most devs won't outright say that for political reasons as I've pointed out already
In other words you made it up.
There have been comments on various forums and bug reports
Comments on forums? Yeah, we all know how accurate those can be, pretty soon someone will compare FP16 to FP32 and call it a win.
but Pascal can get gains from async, so it does help NV too
True, but most DX12 games before the Pascal era just disabled async on Maxwell, as performance regressed with it enabled. Sometimes NVIDIA did the disabling themselves or collaborated with the developers to disable it automatically in the game (like in Gears 4). So yeah, nobody wasted any time getting async working on NV. Some of those who tried got flak for doing it (like 3DMark Time Spy).
 
Not what I'm saying. Sure, if you know how a draw call is structured (or any other command for that matter) you can in theory construct whichever command stream you want. In theory. It's not something that's doable through D3D/Vulkan/Mantle/... I'm fairly certain this is not available to Xbox/PS developers either. So yes, in theory, if you're a Linux driver developer you can play with that. If you're a game developer then no.
As the link above shows, it's just a matter of an extension exposing it. Crafting a command buffer is easy, as it's mostly boilerplate. Crafting a coherent program and scheduling is another issue. That's where hardware schedulers and dynamic memory can make execution relatively safe. So there's at least one Vulkan extension available. I'd hazard a guess that some (DICE?) developers are experimenting with the capability privately. Not far from how Mantle developed.
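For reference, even core Vulkan (no extension) lets a compute shader fill the spec-defined VkDrawIndirectCommand layout that a later vkCmdDrawIndirect consumes. A rough sketch, with pipeline and render-pass setup omitted and the handle names purely illustrative:

Code:
#include <vulkan/vulkan.h>

// cullPipeline (compute) writes VkDrawIndirectCommand records into argBuffer;
// drawPipeline (graphics) then draws from them. drawCount > 1 assumes the
// multiDrawIndirect feature; render pass begin/end is omitted for brevity.
void RecordGpuDrivenDraws(VkCommandBuffer cmd,
                          VkPipeline cullPipeline,
                          VkPipeline drawPipeline,
                          VkBuffer argBuffer,
                          uint32_t drawCount)
{
    // 1. Compute pass fills argBuffer (one workgroup per record, as an example).
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, cullPipeline);
    vkCmdDispatch(cmd, drawCount, 1, 1);

    // 2. Make the shader writes visible to the indirect-command fetch.
    VkMemoryBarrier barrier = {};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
                         0, 1, &barrier, 0, nullptr, 0, nullptr);

    // 3. Graphics pass consumes the GPU-written arguments.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, drawPipeline);
    vkCmdDrawIndirect(cmd, argBuffer, 0, drawCount,
                      sizeof(VkDrawIndirectCommand));
}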

In other words you made it up.
There are comments around if you bother to go looking. I have some I could link, but I'm not going to. So no, I didn't make it up; there's ample evidence out there if you go figure it out for yourself.

pretty soon someone will compare FP16 to FP32 and call it a win.
Peak usable FLOP rates aren't comparable? In the case of an FP16-heavy workload, FP16 for Vega and FP32 for Pascal would be the ideal comparison. Unless you really expect a developer to use double-rate on AMD and 1/16th-rate on Nvidia. That should end well.
 
Even with that, the metadata needs to be flushed for DCC, Htile, multisample color compression, etc.
And there's a stall on the queue until this process finishes.
There's a stall if you want there to be a stall. GCN is certainly capable of running compute shader work concurrently with metadata flushes and multisample/HTILE/DCC decompression steps. You just don't want to touch those resources, to avoid race conditions. DX12/Vulkan barriers and async compute are adequate to describe the available parallelism in this case. A good driver should be able to organize things in a way that allows compute to overlap with these operations (compute on GCN has no global state dependency). Pixel shader overlap is of course a harder case, as some of these steps will likely be performed by ROP hardware.

DX12 split barriers and Vulkan events are designed exactly for this use case. A basic barrier is immediate, so there's no transition period during which other work can run to overlap with the decompress etc. operations. A split barrier, on the other hand, tells the GPU that this resource is no longer used, and that there will be an end barrier later as a sync point for the decompress etc. steps. This allows overlapping these operations even inside a single queue (no async compute needed).
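Roughly what that looks like in D3D12 (a minimal sketch; the render-target resource and the states chosen here are just placeholders):

Code:
#include <d3d12.h>

// Build the same transition twice, once flagged BEGIN_ONLY and once END_ONLY.
static D3D12_RESOURCE_BARRIER SplitTransition(ID3D12Resource* rt,
                                              D3D12_RESOURCE_BARRIER_FLAGS flags)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type  = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Flags = flags;
    barrier.Transition.pResource   = rt;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    return barrier;
}

void RecordWithSplitBarrier(ID3D12GraphicsCommandList* cmdList, ID3D12Resource* rt)
{
    // Begin the transition as soon as the render target stops being written.
    D3D12_RESOURCE_BARRIER begin =
        SplitTransition(rt, D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY);
    cmdList->ResourceBarrier(1, &begin);

    // ... record unrelated work here; it can overlap the decompress/transition ...

    // End the transition just before the resource is actually sampled.
    D3D12_RESOURCE_BARRIER end =
        SplitTransition(rt, D3D12_RESOURCE_BARRIER_FLAG_END_ONLY);
    cmdList->ResourceBarrier(1, &end);
}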
In Direct3D you have an abstract command-buffer encoding which is used for ExecuteIndirect. Nvidia added an extension to Vulkan for some sort of abstract Nvidia-only command buffer. Vulkan has indirect stuff, but it seems underdocumented: they don't mention what kind of representation it has or whether it can be generated by compute shaders. I have no idea how the abstract command buffers become machine-dependent command buffers.
GPU-generated commands obviously never leave GPU memory. That would require a PCI-E round trip and would kill performance (huge stall). There's obviously a compute shader that reads the abstract command buffer and outputs the device-specific commands to the real command buffer.
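As a toy illustration of that translation step (both packet formats below are made up purely for illustration; real device packets, e.g. AMD's PM4, are hardware-specific and not what's shown here):

Code:
#include <cstdint>
#include <vector>

// "Abstract" record, as an API like ExecuteIndirect defines it.
struct AbstractDraw
{
    uint32_t vertexCount;
    uint32_t instanceCount;
    uint32_t firstVertex;
    uint32_t firstInstance;
};

// Hypothetical device packet the translation shader would emit.
struct DevicePacket
{
    uint32_t opcode;      // imaginary "draw" opcode for this illustration
    uint32_t payload[4];
};

// The translation described above, written as plain CPU code; on the GPU this
// would be one compute thread per record, writing straight into GPU-resident
// memory so nothing crosses PCI-E.
std::vector<DevicePacket> Translate(const std::vector<AbstractDraw>& args)
{
    std::vector<DevicePacket> out;
    out.reserve(args.size());
    for (const AbstractDraw& a : args)
    {
        DevicePacket p = {};
        p.opcode     = 0x7D;  // made-up opcode
        p.payload[0] = a.vertexCount;
        p.payload[1] = a.instanceCount;
        p.payload[2] = a.firstVertex;
        p.payload[3] = a.firstInstance;
        out.push_back(p);
    }
    return out;
}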
 
If working back from the supposed Vega RX launch at SIGGRAPH, I think shipments to board makers would need to be starting or underway since manufacturing and shipping have lead times.

Looks like it is exactly that rumor (a rumor, but from different sources): AMD starts sending the silicon to AIB partners this week... impossible to verify anyway.
 
GPU-generated commands obviously never leave GPU memory. That would require a PCI-E round trip and would kill performance (huge stall). There's obviously a compute shader that reads the abstract command buffer and outputs the device-specific commands to the real command buffer.

I felt tempted to consider custom FF hardware for this. Using just one lane seems like a complete waste to me. :D In a way, x86 instruction encoding is also abstract and becomes fixed-length instructions internally; the instruction decoder does that. Why not in a CU? Using a compute shader to transform the command stream has the disadvantage that you need a temporary, distinct command buffer, which is ... really inconvenient for an implementer (edit: of the driver/ExecuteIndirect of course, not the app guy).
Regardless, I also think it's a compute shader (currently).
 
AMD confirms the MI25 specs:

[Radeon Instinct MI25 specification slides]

https://videocardz.com/70440/amd-announces-radeon-instinct-mi25-specifications
 
Evidenced by the asynchronous time warp implementations. It can't randomly launch and take priority. Even pure compute tasks struggle with that. That's the very situation the ACE hardware resolves. Pausing and switching dispatches shouldn't be difficult, but Nvidia can't seem to do it along with prioritization.
That's all wrong. First of all, ACEs are not meant to resolve such situations; there is a CU reservation feature for this, which allows dedicating a particular number of CUs to timewarp in order to get it completed in time. SM reservation is the same method, and it has been implemented for parallel compute and graphics execution on NV's GPUs since Maxwell. So all modern GPUs, including Maxwell, can do it. Unlike reservation, fine-grained warp-level scheduling of async dispatches can't guarantee timewarp execution in time. Moreover, Pascal can also use its mid-triangle preemption feature to ensure that timewarp is executed in time, which I guess is less expensive than dedicating a number of SMs to this task, but should still guarantee timewarp execution in time for every frame.
 
The MI25's 484 GB/s memory bandwidth is basically the same as the Frontier Edition's 483 GB/s (could the difference be due to rounding?).
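For what it's worth, with Vega 10's two HBM2 stacks (a 2048-bit bus), a 945 MHz memory clock works out to 945 MHz × 2 (DDR) × 2048 bit / 8 ≈ 483.8 GB/s, so the MI25's 484 GB/s and the FE's 483 GB/s look like the same ~1.89 Gbps pin speed rounded in opposite directions.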

Looks like we may never see 2 Gbps at this rate….
The GP100 used the very first HBM2 batches from Samsung, and the GV100 is using 4 stacks, so there's a chance the gains from going over 900 GB/s would be negligible and Nvidia can save a bit on power consumption.


Both the MI25 and the FE are using 8-Hi stacks. RX Vega 8GB should be the first AMD card with 4-Hi HBM2 stacks, and those might be able to reach 2 Gbps.
BTW, the HBM in Fiji cards could be overclocked from 500 up to 600 MHz. Maybe HBM2 will be harder to overclock than HBM1, but there isn't any reason to believe AMD will enforce a hard lock on HBM2's clocks.
 