AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Except that in neither the DX12 nor the Vulkan case is the shader responsible for submitting draw calls. Shaders can prepare data for the draw calls, and that data doesn't have to be shipped back to the CPU side, but the actual draw is still dispatched by the CPU.
ExecuteIndirect and MultiDraw (OpenGL) support an indirect draw count, meaning that the CPU doesn't even know how many draw calls are being made. The draw arguments and draw count are written by the GPU to a buffer that is only accessible by the GPU. The CPU does, however, send the GPU a command to start ExecuteIndirect/MultiDraw, but it doesn't know the draw count or the draw parameters.
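
For illustration, here's a minimal D3D12 sketch of the host side of this (assuming an existing device, command list and GPU-local argument/count buffers; all the names here are made up):

```cpp
#include <d3d12.h>

// One entry per potential draw, written by a culling compute shader on the GPU.
struct IndirectDraw
{
    D3D12_DRAW_INDEXED_ARGUMENTS draw; // index count, instance count, offsets
};

// Create a command signature that consumes one DrawIndexed per argument entry.
ID3D12CommandSignature* CreateDrawSignature(ID3D12Device* device)
{
    D3D12_INDIRECT_ARGUMENT_DESC arg = {};
    arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride = sizeof(IndirectDraw);
    desc.NumArgumentDescs = 1;
    desc.pArgumentDescs = &arg;

    ID3D12CommandSignature* signature = nullptr;
    device->CreateCommandSignature(&desc, nullptr,
                                   __uuidof(ID3D12CommandSignature),
                                   reinterpret_cast<void**>(&signature));
    return signature;
}

// The CPU only knows the upper bound (maxDraws); the actual count lives in
// countBuffer, which was filled by the GPU and is never read back by the CPU.
void SubmitGpuGeneratedDraws(ID3D12GraphicsCommandList* cmdList,
                             ID3D12CommandSignature* signature,
                             ID3D12Resource* argumentBuffer,
                             ID3D12Resource* countBuffer,
                             UINT maxDraws)
{
    cmdList->ExecuteIndirect(signature, maxDraws, argumentBuffer, 0, countBuffer, 0);
}
```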

These draws are made either by the GPU command processor (CP loop) or by a compute shader writing directly to a GPU command buffer. Intel, for example, implements ExecuteIndirect with an indirect draw count as a compute dispatch that writes to the command buffer. Nvidia's Vulkan device generated commands extension (https://developer.nvidia.com/device-generated-commands-vulkan) does the same. This extension can even change render state and shaders. AMD implements ExecuteIndirect in their command processor (no extra dispatch to set up the draws).

CUDA and OpenCL have functionality to spawn new kernels directly from compute kernels (dynamic parallelism / device-side enqueue). You can write a lambda inside the kernel and spawn N instances of it. The kernel can then wait for completion and continue work from there, without any CPU intervention.

But the driver is still a big deal. The driver spawns the extra setup dispatches and decides how and when to run them (possibly concurrently with other tasks) in order to ensure that the data is ready in time, while the data generation doesn't block other tasks. DX12 barriers are still a pretty high-level construct. GPU caches tend not to be fully coherent, so a barrier might trigger flushes of some caches. Render targets might be in a compressed format that needs to be decompressed before sampling. There might be temp resource allocations and preparation needed before tessellation, geometry shader and tiled rasterizer work can begin. All of these extra tasks can be executed in parallel with other work. It is the driver's responsibility to make this happen in the fastest safe way possible. The driver can also do things like combine cache flushes of the same type, and increase parallelism in cases where it knows that concurrent execution of some shaders is safe.

Async compute opens up a whole new can of worms. Console devs are used to hard-coding CU masks, wave limits, etc. to fine-tune the performance of concurrent work (so that neither queue starves). PC drivers need to do this without application-level knowledge. This isn't an easy problem to solve.

And most importantly, the driver isn't just the CPU code that is being run. The driver sends the microcode for all the GPU fixed function processors. Thus the driver is also fully responsible for how the GPU command processor executes the commands, and how indirect multidraw and spawning new shaders from kernels are implemented on the GPU side.
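
As a small illustration of how high level those barriers are, here's a D3D12 sketch (assuming existing resources; names are made up). The application only declares before/after states; which ROP/L1/L2 flushes or decompression steps this actually turns into on a given GPU is entirely the driver's decision:

```cpp
#include <d3d12.h>

// Transition two render targets so the next pass can sample them.
// The single ResourceBarrier call below is all the application says about it.
void TransitionForSampling(ID3D12GraphicsCommandList* cmdList,
                           ID3D12Resource* colorTarget,
                           ID3D12Resource* depthTarget)
{
    D3D12_RESOURCE_BARRIER barriers[2] = {};

    barriers[0].Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barriers[0].Transition.pResource   = colorTarget;
    barriers[0].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barriers[0].Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barriers[0].Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

    barriers[1].Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barriers[1].Transition.pResource   = depthTarget;
    barriers[1].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barriers[1].Transition.StateBefore = D3D12_RESOURCE_STATE_DEPTH_WRITE;
    barriers[1].Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

    // Batching both transitions into one call gives the driver the chance to
    // combine any cache flushes of the same type into a single flush.
    cmdList->ResourceBarrier(2, barriers);
}
```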

During the last year I rewrote the resource transitions of Unreal Engine's two console backends. There are just so many small things you can do differently to reduce stalls between draws & dispatches and to allow the best possible overlap. But there are also so many corner cases that you need to take into account to prevent a data race that only occurs in some obscenely rare case. You would need to know what the shader does in order to write the best possible transition code. Priority #1 for any driver is to guarantee correctness. This unfortunately means that the default code path will likely be slightly slower than game-specific code paths.
 
This exactly. In fact, in DX12, underneath every ExecuteCommandLists call is a GPU pipeline flush, which by its very nature requires CPU intervention. So the driver is more involved than you think.
The CPU sets up the commands needed for barrier transitions, ExecuteCommandLists and ExecuteIndirect, but the GPU doesn't need to wait for the CPU. In some cases the command processor handles this directly, or a compute shader writes directly to the command list. But the driver uploads all this code to the GPU, so it doesn't really matter whether it is the CPU or the GPU doing the work. The driver is responsible for it.
 
This looks like brushed aluminum:

[renders]

This will be one beautiful GPU :)

And it looks like the final renders are using 2x8-pin.
 
The CPU sets up the commands needed for barrier transitions, ExecuteCommandLists and ExecuteIndirect, but the GPU doesn't need to wait for the CPU. In some cases the command processor handles this directly, or a compute shader writes directly to the command list.
I would think in the case of a pipeline flush, CPU intervention might be necessary depending on the implementation of the CP. Isn't the command processor reading from a ring buffer? If so, the only way to flush the pipeline would be to put a command in the ring, or to not feed the ring and have the GPU notify the CPU it needs more work. The latter is what I assumed was happening, but I wasn't sure. But I'm sure a pipeline flush happens for each ExecuteCommandLists call, so something is happening.
 
Close to the metal usually makes it possible for crack developers to extract better performance. It also gives less experienced developers more rope to hang themselves with (both in terms of crashes and bad performance).
Anytime that happens, there'll be opportunities for a driver to step in and fix things.
In the interests of educational value, perhaps you can describe how a driver compensates for errors and bad choices in close-to-the-metal code?
 
If a shader is responsible for submitting draw calls, and the card can run a loop with no interaction from the CPU and driver, how do you propose the drivers will be making a huge difference? Even with async the scheduling is ideally hardware driven. Beyond setup and configuration the drivers shouldn't be doing much work in a GPU driven scenario. That's the whole point of GPU driven rendering, beyond the GPU scaling better on lots of objects. I did say compilers will make a difference.
You said the compiler makes a bit of a difference. I read that to mean small, though perhaps it was just a poor choice of words on your part. I would say the driver makes less of a difference for DX12 with GPU driven rendering, but it will still make a difference. Sebbbi did a good job explaining it.

Actually, here they refer to drivers from both parties; you could listen to the whole lecture yourself. AMD drivers also require big efforts to tune Async Compute, since they rely on it far more than NVIDIA.
The application does the most important tuning for Async Compute. For the driver to do much the code will be application specific and most applications won't get this kind of attention.

This exactly. In fact, in DX12, underneath every ExecuteCommandLists call is a GPU pipeline flush, which by its very nature requires CPU intervention. So the driver is more involved than you think.
If there's a pipeline flush it should be implementation dependent and not a hard rule.
 
The application does the most important tuning for Async Compute. For the driver to do much the code will be application specific and most applications won't get this kind of attention.
True, but developers argue that this is indeed happening; according to their experience, driver complexity increases with Async Compute.
 
True, but developers argue that this is indeed happening; according to their experience, driver complexity increases with Async Compute.
When you say "argue", do you mean they think this is a bad thing? In theory if developers tune their code drivers won't have to. I understand developers will never be able to tune for every card out there, hence the "in theory" part of my statement. Sebbbi describes reserving CUs, etc. on consoles, but developers shouldn't need to do this to extract some performance in low-hanging-fruit cases like overlapping compute with depth-only rendering. A key part of my statement is some performance. I think there will always be cases where tuning helps, as it's very difficult (maybe impossible) to design a chip that does everything perfectly.
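
To make the low-hanging-fruit case concrete, here's a minimal D3D12 sketch of overlapping a compute dispatch with a depth-only pass on a separate queue (assuming existing queues, fence and recorded command lists; names are made up, and real code would rotate the fence value per frame):

```cpp
#include <d3d12.h>

// Overlap a compute pass (e.g. light culling on last frame's data) with a
// depth-only pass on the graphics queue, then make the graphics queue wait
// on the GPU before it consumes the compute results.
void SubmitOverlappedFrame(ID3D12CommandQueue* graphicsQueue,
                           ID3D12CommandQueue* computeQueue,
                           ID3D12CommandList*  depthOnlyPass,
                           ID3D12CommandList*  asyncComputePass,
                           ID3D12CommandList*  mainPass,
                           ID3D12Fence*        fence,
                           UINT64              fenceValue)
{
    // Kick the compute work on its own queue...
    computeQueue->ExecuteCommandLists(1, &asyncComputePass);
    computeQueue->Signal(fence, fenceValue);

    // ...while the graphics queue renders the depth/shadow pass concurrently.
    graphicsQueue->ExecuteCommandLists(1, &depthOnlyPass);

    // GPU-side wait: the main pass only starts once the compute results exist.
    graphicsQueue->Wait(fence, fenceValue);
    graphicsQueue->ExecuteCommandLists(1, &mainPass);
}
```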
 
If there's a pipeline flush it should be implementation dependent and not a hard rule.
Actually, according to someone from MS it's due to the "API design of D3D12". See this thread https://www.gamedev.net/forums/topic/677701-d3d12-resource-barriers-in-multiple-command-lists/ and search for the post starting with "With the current API design of D3D12, the hardware needs to flush all caches and drain the pipeline at the end of a group of command lists, because the CPU might read/write resources at that granularity." Also see this presentation GDC16_gthomas_adunn_Practical_DX12.pdf = https://developer.nvidia.com/sites/.../GDC16/GDC16_gthomas_adunn_Practical_DX12.pdf which lists the flush as non-vendor-specific on page 8.
 
Actually, according to someone from MS it's due to the "API design of D3D12". See this thread https://www.gamedev.net/forums/topic/677701-d3d12-resource-barriers-in-multiple-command-lists/ and search for the post starting with "With the current API design of D3D12, the hardware needs to flush all caches and drain the pipeline at the end of a group of command lists, because the CPU might read/write resources at that granularity." Also see this presentation GDC16_gthomas_adunn_Practical_DX12.pdf = https://developer.nvidia.com/sites/.../GDC16/GDC16_gthomas_adunn_Practical_DX12.pdf which lists the flush as non-vendor-specific on page 8.
I didn't know the API enforced a cache flush. It seems weird to require a flush because the data might be needed by another engine. I'd expect an optional flush for that, though a small number of flushes per frame is likely not noticeable in the final frame rate.
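
Since that quote says the flush happens at the end of a group of command lists (i.e. per ExecuteCommandLists call) rather than per command list, the usual mitigation is simply to batch lists into fewer submits. A tiny sketch (assuming already-recorded lists; names are made up):

```cpp
#include <d3d12.h>

// Submitting four command lists in one call means one implicit end-of-batch
// flush, instead of four flushes from four separate ExecuteCommandLists calls.
void SubmitFrame(ID3D12CommandQueue* queue,
                 ID3D12CommandList* depthPrepass,
                 ID3D12CommandList* gbufferPass,
                 ID3D12CommandList* lightingPass,
                 ID3D12CommandList* postProcessPass)
{
    ID3D12CommandList* lists[] = { depthPrepass, gbufferPass, lightingPass, postProcessPass };
    queue->ExecuteCommandLists(static_cast<UINT>(sizeof(lists) / sizeof(lists[0])), lists);
}
```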
 
Actually, here they refer to drivers from both parties; you could listen to the whole lecture yourself. AMD drivers also require big efforts to tune Async Compute, since they rely on it far more than NVIDIA.
Yes the devs remained neutral in the presentation. In reality the AMD implementation takes very little effort. The Nvidia implementation has a bunch of gotchas. Devs have the ability to tune, but they don't have to and in fact can't tune for all workloads. The more recent GCN variants should do that tuning in hardware. Manual tuning is an option, not a requirement for acceptable performance.

Again with this fallacy? I thought we put it to rest a long time ago. This has nothing to do with a "hardware scheduler".
What fallacy? You provided a presentation that all but said it was the case. The entire ecosystem is having to adapt to that inability at significant development cost. There are plenty of devs saying they spend the vast majority of their time tuning for Nvidia hardware so performance doesn't go backwards with async. Or they outright disabled it because of the effort.

Except that in neither the DX12 nor the Vulkan case is the shader responsible for submitting draw calls. Shaders can prepare data for the draw calls, and that data doesn't have to be shipped back to the CPU side, but the actual draw is still dispatched by the CPU.
They provide the structure to accelerate it with the queues. There is no reason they can't dispatch draw calls, as was already explained. It should be possible to craft a loop that runs continuously with little to no CPU intervention: using the CPU only for external stimulus, HBCC and unified memory for resource management/allocation, and a single compute shader or CP thread as a main game loop keeping everything in sync. That system should be able to load balance itself with minimal tuning on the part of the shader or developer. Low level APIs would be the first step towards this, because all that validation in a shader would be a nightmare. In effect it's stripping out the branching portion that required a CPU for performance.

The application does the most important tuning for Async Compute. For the driver to do much the code will be application specific and most applications won't get this kind of attention.
The application can tune, but if you threw two independent tasks, possibly different applications, at it the hardware should load balance automatically. That's the packetized design of a stream processor anyways.

In theory if developers tune their code drivers won't have to.
Programmers can't tune for unknown hardware and the hardware could adapt more intuitively without constraints while saving development effort. It probably won't perform as well as a team of expert programmers could manage on strictly defined hardware with extensive effort, but not all developers are experts and there are still costs. In a larger, more complex system the hardware method may be faster. There are just too many corner cases for programmers to anticipate.

Not just a cache flush, a pipeline flush too. I don't really understand why they did it either; seems like a performance sink, but what do I know about the hardware/CPU interface.
It's a compatibility issue on some hardware with static partitioning or limited resources as a design choice. That doesn't have to be a bad choice. If all execution units share one instruction buffer that only holds one kernel, a flush is required. It's more complex than that, but that's the idea.
 
There are plenty of devs saying they spend the vast majority of their time tuning for Nvidia hardware so performance doesn't go backwards with async.
Really? So devs wasted their time optimizing for a feature that will not help NV in any way? I am curious to know which developer actually said that! The last thing I know is the developer of Ashes of the Singularity stating that Maxwell doesn't do well with Async in their game.
Or they outright disabled it because of the effort
Yes, that happened in multiple games already, but this is not about NV's async, this is about DX12 in general. Devs are not coming out and criticizing NV's Async; they are criticizing the whole API. This is not an NV vs AMD argument here. You said drivers are not involved in DX12 and matter very little, I presented a developer who stated otherwise, and several other members here posted why this is just plain wrong. And drivers still matter very much in DX12. So I would really appreciate you not confusing the issue with the old worn-out argument of NV's supposedly lackluster Async capability.

What fallacy?
The fallacy that a lack of a hardware scheduler is responsible for bad async on NV. Or are you discussing something else entirely?
 
How about we start with the presence of a shader compiler?
Yeah, you're right. I think the key thing is that DX12 is just not close to the metal at all...
DX12 SM 6.0 just introduced cross-lane operations that operate on waves. These are still higher level compared to writing the swizzle instructions yourself. Each cross-lane operation needs a series of swizzles, depending on the GPU functionality available. There's a good example of an SM 6.0 wave prefix sum implementation here (see "DPP Example"): http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/. The shader compiler would need to emit completely different code for GCN2 and GCN3. GCN2 code would still work, but wouldn't be optimal for GCN3.

AMD's scalar unit needs special care in the shader compiler. Instructions offloaded to the scalar unit are practically free (co-issued). The shader compiler needs to run analysis on all registers to see which ones are wave invariant (guaranteed to hold the same value for all 64 lanes of a wave). These registers can be offloaded to the scalar unit. This saves VGPR pressure, and allows offloading math based on these registers (and the instruction results) to the scalar ALU. This can bring pretty nice gains in performance and reductions in power usage when done right. This is also very important because AMD doesn't have fixed function constant buffer hardware like Nvidia and Intel do. AMD relies heavily on their scalar units. But HLSL/GLSL code is written from one lane's perspective, not from the perspective of one wave. The compiler needs to extract the scalar part without any info from the developer. There have been several technical papers about this.

DX12 SM 6.0 has some helper functions (WaveIsFirstLane) to help generate good code for GPUs that have scalar units. But only time will tell how much developers actually adopt this. I would assume that the compiler still needs to analyze the code and data, but WaveIsFirstLane allows the developer to help in this process (as dynamic data is often impossible to prove at compile time).
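
To make the terminology concrete, here's a small C++ sketch of the semantics only (not real shader code): what an exclusive wave prefix sum returns per lane, and what "wave invariant" means for the scalarization the compiler tries to prove. The 64-lane width matches a GCN wave:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

constexpr int kWaveSize = 64; // one GCN wave = 64 lanes

// Exclusive prefix sum across a wave: each lane gets the sum of the values
// held by all lower-numbered lanes (the lane's own value is not included).
std::array<uint32_t, kWaveSize> WavePrefixSumEmulated(const std::array<uint32_t, kWaveSize>& value)
{
    std::array<uint32_t, kWaveSize> result{};
    uint32_t running = 0;
    for (int lane = 0; lane < kWaveSize; ++lane)
    {
        result[lane] = running;
        running += value[lane];
    }
    return result;
}

// A register is "wave invariant" when every lane holds the same value; such a
// value can live in a single scalar register instead of 64 vector lanes.
bool IsWaveInvariant(const std::array<uint32_t, kWaveSize>& value)
{
    for (int lane = 1; lane < kWaveSize; ++lane)
        if (value[lane] != value[0])
            return false;
    return true;
}

int main()
{
    std::array<uint32_t, kWaveSize> perLaneCount{};
    for (int lane = 0; lane < kWaveSize; ++lane)
        perLaneCount[lane] = 2; // e.g. every lane wants to append 2 items

    const auto offsets = WavePrefixSumEmulated(perLaneCount);
    std::printf("lane 63 writes at offset %u\n", offsets[63]); // 126
    std::printf("wave invariant: %s\n", IsWaveInvariant(perLaneCount) ? "yes" : "no");
    return 0;
}
```
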
I didn't know the API enforced a cache flush. It seems weird to require a flush because the data might be needed by another engine. I'd expect an optional flush for that, though a small number of flushes per frame is likely not noticeable in the final frame rate.
GCN requires lots of GPU cache flushes during frame rendering, as the ROP caches aren't coherent with the L1/L2 caches. Every time you stop writing to a render target and start sampling it, you need to flush the caches. Vega (GCN5) moves the ROP caches under the L2 cache. This reduces the need for GPU cache flushes drastically. AFAIK all Nvidia DX11 GPUs had the ROPs under L2. Vega also has a tiled rasterizer (Nvidia has had one since Maxwell1). So it seems that the memory hierarchies of AMD & NV are going to be pretty similar after the Vega launch.
 
The fallacy that a lack of a hardware scheduler is responsible for bad async on NV. Or are you discussing something else entirely?
I have always thought that the main difference is simply that NV can't run compute and graphics on the same compute unit (SM). I believe the assumption has been that Nvidia reconfigures their compute unit memory pools (groupshared memory) based on graphics/compute mode. They need temp storage for their graphics pipe outputs (VS->PS, tessellation) and for their fixed function constant buffer hardware, etc. AMD has separate scalar register files on each CU, and scalar registers are useful for both graphics and compute. In graphics shaders, scalar registers store resource descriptors and constants, among other things. This is just an assumption, however. Nvidia hasn't been very open about their GPU architecture, so we don't know exactly why they have limitations regarding concurrent graphics and compute, or what kind of limits they are.
 
Really? So devs wasted their time optimizing for a feature that will not help NV in any way? I am curious to know which developer actually said that! The last thing I know is the developer of Ashes of the Singularity stating that Maxwell doesn't do well with Async in their game.
I'm curious too, but Pascal can get gains from async, so it does help NV too.
 