AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

You are the one making the assertions, i.e. "draw", even though you put question marks behind them.
Dude, chill.. The question marks are there because I was making questions, not assertions.

I simply assumed the "throughput" meant draw, but Digidi made the proper correction.
 
Dude, chill.. The question marks are there because I was making questions, not assertions.
I simply assumed the "throughput" meant draw, but Digidi made the proper correction.
Really? Cool, dude! 3dgi corrected you almost instantly on it, which you must have overlooked, so I had some question marks in my mind. Like whether you have an agenda, for example, which you don't, do you?

edit. Of course, that's a completely ridiculous question, ain't it?
 
I did a little calculation. If you have 17 primitives per clock and a clock speed of 1400 MHz, you can handle about 23,800,000,000 polygons per second. In this Deus Ex scene you have 220,000,000 polygons. If you divide the two, you get 108 fps, which Deus Ex could hit if you don't do any heavy shader work. Maybe somebody can test it?
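For reference, a minimal sketch of that back-of-the-envelope math (assuming the ideal case of 17 primitives per clock at 1400 MHz, with nothing else limiting):

```cpp
#include <cstdio>

int main() {
    // Assumed ideal peak: 17 primitives per clock at 1400 MHz.
    const double prims_per_clock = 17.0;
    const double clock_hz        = 1.4e9;   // 1400 MHz
    const double scene_polys     = 220e6;   // polygons in the scene

    const double polys_per_second = prims_per_clock * clock_hz;   // ~23.8 billion/s
    const double ideal_fps        = polys_per_second / scene_polys;

    printf("Peak: %.1f Gpolys/s -> %.0f fps (ideal, purely geometry bound)\n",
           polys_per_second * 1e-9, ideal_fps);   // ~108 fps
    return 0;
}
```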
 
Does anyone know if vega supports the same type of order-independent transparency rendering as Intel premiered in its IGPs a few years back?

Thanks, peeps! :)
 
AFAIK there's nothing that keeps those algorithms from functioning on other DX12 hardware with similar feature support, so yes.
 
Does anyone know if vega supports the same type of order-independent transparency rendering as Intel premiered in its IGPs a few years back?

Thanks, peeps! :)
We waited almost a year for your return and this is all you got? :p
 
I did a little calculation. If you have 17 primitives per clock and a clock speed of 1400 MHz, you can handle about 23,800,000,000 polygons per second. In this Deus Ex scene you have 220,000,000 polygons. If you divide the two, you get 108 fps, which Deus Ex could hit if you don't do any heavy shader work. Maybe somebody can test it?
You can't calculate triangle/primitive throughput like this anymore. Modern games don't spend their whole frame time rasterizing geometry. Lighting, post processing, etc. take a significant chunk of GPU time (up to 50%), and during that time the geometry pipelines are idling. Only a small part of the frame is geometry bound. Shadow map rendering is the most geometry bound step. G-buffer rendering also tends to be partially geometry bound (no matter how fat the pixels are), since there tend to be lots of triangles submitted that result in zero pixel shader invocations (backfacing or earlyZ/hiZ rejected).

For example, drawing a high poly character behind a nearby corner would cause 100k vertex shader invocations, but zero pixel shader invocations. This draw call would cause a bubble in GPU utilization, since the geometry pipeline can't go through these triangles fast enough before the existing pixel shader work (from previous draw calls) finishes executing (on the CUs). I have found out that on GCN2, vertex shader work can only utilize roughly two CUs in the common case (the geom pipes simply can't feed more vertex waves). The remaining (= most) CUs will idle if there's a big chunk of sequential triangles which generate no pixel shader invocations. This is one of the reasons why you want to cull the triangles early, to avoid underutilizing the GPU.
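To make the "cull early" point concrete, here is a minimal illustrative sketch (not code from the post, just the idea) of the per-triangle tests a compute culling pre-pass typically does, so that zero-pixel triangles never reach the geometry pipes:

```cpp
#include <algorithm>
#include <cmath>

struct Vec2 { float x, y; };

// Returns true if the triangle can be rejected before any geometry work.
// p0..p2 are screen-space positions in pixel units; pixel sample centers are
// assumed to sit at (x + 0.5, y + 0.5), and front faces are counter-clockwise.
bool CanCullTriangle(Vec2 p0, Vec2 p1, Vec2 p2, float width, float height) {
    // 1. Backface test: non-positive signed area means the triangle faces away.
    float area2 = (p1.x - p0.x) * (p2.y - p0.y) - (p2.x - p0.x) * (p1.y - p0.y);
    if (area2 <= 0.0f)
        return true;

    // Screen-space bounding box of the triangle.
    float minX = std::min({p0.x, p1.x, p2.x}), maxX = std::max({p0.x, p1.x, p2.x});
    float minY = std::min({p0.y, p1.y, p2.y}), maxY = std::max({p0.y, p1.y, p2.y});

    // 2. Small-triangle test: if the bounding box falls between sample centers
    //    on either axis, the triangle can never generate a pixel shader invocation.
    if (std::round(minX) == std::round(maxX) || std::round(minY) == std::round(maxY))
        return true;

    // 3. Viewport test: entirely off screen.
    return maxX < 0.0f || maxY < 0.0f || minX >= width || minY >= height;
}
```

The same tests map directly onto a compute shader that compacts the index buffer, which is roughly what the Frostbite GDC presentation linked below describes.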

Slide 12 of this presentation is a good example:
https://frostbite-wp-prd.s3.amazonaws.com/wp-content/uploads/2016/03/29204330/GDC_2016_Compute.pdf

Here you can see that the GPU occupancy is very low in a part of the G-buffer rendering step where most of the geometry is occluded. The green part of the occupancy graph is the vertex shader work. As you can see, VS occupancy never goes above a certain (small) portion of the whole GPU. GCN simply needs lots of pixel shader work to saturate the GPU when it is rendering geometry that results in a small number of visible pixels. This is also the main reason why async compute helps GCN so much. CUs can simply execute background compute shader work when there aren't enough pixel waves spawned.

Async compute also has another advantage. It allows the developer to keep geometry units active for larger portions of the frame, because you can freely overlap compute with graphics. You don't need to dedicate a big chunk of GPU time to non-rasterization work (post processing and lighting). You can overlap this work with geometry-heavy work to reduce both the time when geometry units are idling and the time when CUs are idling. This is one advantage that AMD has over the competition. Games could spend the whole frame submitting both geometry work and compute work. This way geometry units can be utilized during the whole frame (instead of <50% of the frame). The downside is unfortunately that AMD has been behind Nvidia in geometry performance, so you need to use techniques like this to reach parity, instead of gaining big advantages. Polaris improved things a bit, and Vega should improve things further, but so far the results haven't been as good as I had hoped. I would guess that the primitive shaders and DSBR still need some additional driver work to show their full potential. I just hope (for AMD's sake) that they don't need to write custom profiles for each game to utilize these... AMD simply doesn't have as many resources as Nvidia to optimize individual titles separately.
Does anyone know if vega supports the same type of order-independent transparency rendering as Intel premiered in its IGPs a few years back?
Rasterizer Ordered Views (ROV) is the DirectX FL 12_1 "standardized" version of Intel's PixelSync. It is supported by all 3 IHVs. AMD = Vega. Intel = Haswell, Broadwell, Skylake/Kaby. Nvidia = Maxwell2, Pascal. There are no synthetic FL 12_1 feature benchmarks around yet, so it's hard to say which IHV has the most efficient implementation.
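For anyone who wants to check this on their own hardware, the support is exposed as a plain capability bit in D3D12; a minimal sketch of the query (standard API, nothing Vega-specific):

```cpp
#include <windows.h>
#include <d3d12.h>

// Minimal sketch: ask the device whether Rasterizer Ordered Views are supported.
bool SupportsROVs(ID3D12Device* device) {
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                           &options, sizeof(options))))
        return false;

    // TRUE on FL 12_1 class hardware: Vega, Maxwell2/Pascal, Haswell and later Intel.
    return options.ROVsSupported != FALSE;
}
```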
 
Rasterizer Ordered Views (ROV) is the DirectX FL 12_1 "standardized" version of Intel's PixelSync. It is supported by all 3 IHVs.
Sweet. :) Thanks a lot. Really appreciate it!

New question lol:
On this page here, there's a slide with a bunch of DX12 features, raster ordered views being one of them, resource binding another, conservative rasterization and so on - are there any explainers for noobies on what these features actually do? I read comments on Vega articles, and people go like, "standard swizzle fuck yeah!", and I'm like, "????"... :p
 
Question:
Is it known whether AMD and/or Nvidia are using power gating in their GPUs yet, or if it's still just the clock that is gated?

If the latter, why is that? Is power gating too complex for such large chips? Not that high-core-count Intel Xeons are particularly small or anything, but they're made of repeated individual cores; wouldn't those be easier to gate than a structurally diverse IC like a GPU? *speculation alert lol*

Thanks.
 
Sweet. :) Thanks a lot. Really appreciate it!

New question lol:
On this page here, there's a slide with a bunch of DX12 features, raster ordered views being one of them, resource binding another, conservative rasterization and so on - are there any explainers for noobies on what these features actually do? I read comments on Vega articles, and people go like, "standard swizzle fuck yeah!", and I'm like, "????"... :p
Yes, Vega is actually the most feature-complete DirectX 12 GPU. It is the first GPU bringing working support for 64KB standard swizzle, AFAIK (IIRC Skylake and up could support it, but it's still not enabled in the drivers)... It would be interesting to see if AMD also improved the LDA (Crossfire) resource sharing (AFAIK all previous AMD GPUs are D3D12_CROSS_NODE_SHARING_TIER_1, while at least Pascal and Maxwell NVIDIA GPUs should be D3D12_CROSS_NODE_SHARING_TIER_2).
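Both of those caps live in the same D3D12 options struct, so anyone with a Vega card could verify this directly; a minimal sketch:

```cpp
#include <windows.h>
#include <d3d12.h>
#include <cstdio>

// Minimal sketch: print the 64KB standard swizzle and cross-node (LDA) sharing caps.
void PrintSharingCaps(ID3D12Device* device) {
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                           &options, sizeof(options))))
        return;

    printf("StandardSwizzle64KBSupported: %d\n", options.StandardSwizzle64KBSupported);
    // TIER_1 limits cross-node (linked adapter) access to copy operations;
    // TIER_2 relaxes this to broader resource usage across nodes.
    printf("CrossNodeSharingTier:         %d\n", options.CrossNodeSharingTier);
}
```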
 
... I would guess that the primitive shaders and DSBR still need some additional driver work to show their full potential. I just hope (for AMD's sake) that they don't need to write custom profiles for each game to utilize these... AMD simply doesn't have as many resources as Nvidia to optimize individual titles separately.

Thank you sebbbi for the detailed information. AMD's white paper says that the primitive shader is coming with a later driver (17.360).

The white paper also says that AMD has achieved 17 triangles per clock. Does anybody know how many GP102 can handle? My last information from the German 3DCenter forum was that each PolyMorph Engine in a GTX 680 can handle 0.5 polygons per clock. That would mean GP102 can do 15 polygons per clock.

https://www.forum-3dcenter.org/vbulletin/showthread.php?t=574736&page=59
 
I was on a one week vacation during the Vega launch. Now back at office. I can now freely talk about Vega, since I am now using public Vega RX drivers.

Some tidbits:
- I can confirm that Vega's ROP caches under L2 seem to work properly. I see a reduction in L2 cache flushes in Claybook (UE 4.16.1) with RGP (http://gpuopen.com/gaming-product/radeon-gpu-profiler-rgp/) on Vega FE versus RX 480. Flushes (L1+L2+K$+CB+DB) = 881 (RX480) -> 674 (Vega). 142 L2, 60 CB and 66 DB flushes on Vega (will paste RX 480 numbers when I get home). I actually found some UE 4.16.1 DX12 back end perf bugs and sync bugs with this tool.
- Async compute works fine on Vega and RX 480. In Claybook we don't currently have enough async compute work (mostly physics and fluid sim) to cover all raster passes (shadows, g-buffer, velocity buffer, custom Z) on Vega. Currently our async compute workload is designed around current gen base consoles, which only have 12-18 CUs. You could easily add 10x+ more fluid on Vega without seeing noticeable slow down (would simply increase GPU utilization). RX 480 behaves more like consoles, but even it would need more async compute work to be fully occupied.
- Currently RX 480 runs Claybook at 2560x1440 @ locked 60 fps (~15 ms). Vega runs 4K at ~50 fps (~20 ms). Haven't yet done PC specific optimizations, but console opts work pretty well on PC AMD.

If you have some questions, feel free to ask me. I don't however right now have time to write stuff like FL 12_1 synthetic benchmarks (I would love to do that eventually).
 
I was on a one week vacation during the Vega launch. Now back at office. I can now freely talk about Vega, since I am now using public Vega RX drivers.

Some tidbits:
- I can confirm that Vega's ROP caches under L2 seem to work properly. I see a reduction in L2 cache flushes in Claybook (UE 4.16.1) with RGP (http://gpuopen.com/gaming-product/radeon-gpu-profiler-rgp/) on Vega FE versus RX 480. Flushes (L1+L2+K$+CB+DB) = 881 (RX480) -> 674 (Vega). 142 L2, 60 CB and 66 DB flushes on Vega (will paste RX 480 numbers when I get home). I actually found some UE 4.16.1 DX12 back end perf bugs and sync bugs with this tool.
- Async compute works fine on Vega and RX 480. In Claybook we don't currently have enough async compute work (mostly physics and fluid sim) to cover all raster passes (shadows, g-buffer, velocity buffer, custom Z) on Vega. Currently our async compute workload is designed around current gen base consoles, which only have 12-18 CUs. You could easily add 10x+ more fluid on Vega without seeing noticeable slow down (would simply increase GPU utilization). RX 480 behaves more like consoles, but even it would need more async compute work to be fully occupied.
- Currently RX 480 runs Claybook at 2560x1440 @ locked 60 fps (~15 ms). Vega runs 4K at ~50 fps (~20 ms). Haven't yet done PC specific optimizations, but console opts work pretty well on PC AMD.

If you have some questions, feel free to ask me. I don't however right now have time to write stuff like FL 12_1 synthetic benchmarks (I would love to do that eventually).

How does the raytracer in Claybook work at a layman/high level, and how does it interact with UE4? If the physics is running on the GPU, how much work is left for the CPU, and is it possible to do that work on the GPU, even if it won't be as effective?
Can AI code run on the GPU?
Could your game take advantage of fp16/rapid packed math?
How close are we to fully ray traced games?
What API is Claybook?
What's the most interesting thing about Vega?
Does anything in Vega give you opportunities to get massive perf gains in old code?
When will you talk about your engine tech?

The questions are a bit all over the place.
 
Yes, Vega is actually the most feature-complete DirectX 12 GPU. It is the first GPU bringing working support for 64KB standard swizzle, AFAIK (IIRC Skylake and up could support it, but it's still not enabled in the drivers)... It would be interesting to see if AMD also improved the LDA (Crossfire) resource sharing (AFAIK all previous AMD GPUs are D3D12_CROSS_NODE_SHARING_TIER_1, while at least Pascal and Maxwell NVIDIA GPUs should be D3D12_CROSS_NODE_SHARING_TIER_2).
This answer is still the equivalent of "swizzle, fuck yeah!" without telling us why this is useful. :)
 
The white paper also says that AMD has achieved 17 triangles per clock. Does anybody know how many GP102 can handle? My last information from the German 3DCenter forum was that each PolyMorph Engine in a GTX 680 can handle 0.5 polygons per clock. That would mean GP102 can do 15 polygons per clock.
Why are you so obsessed with this? Each GPC can rasterize 1 triangle per clock, and each SM has a PolyMorph engine which can process a triangle every two clocks. But as I pointed out earlier, and sebbbi pointed out as well, you really, really need to take quite a lot of other stuff into consideration. Reaching these peaks in benchmarks will be a whole different matter (have you seen a Vega benchmark that would suggest 17 triangles per clock? Or even 8?).
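Putting rough numbers on that (assuming a full GP102 with 6 GPCs and 30 SMs; those figures are an assumption, not from this thread):

```cpp
#include <cstdio>

int main() {
    // Assumed full GP102 configuration (illustrative only).
    const int gpcs = 6;    // each GPC rasterizes 1 triangle per clock
    const int sms  = 30;   // each SM's PolyMorph engine handles 1 triangle every 2 clocks

    printf("Raster limit: %d tris/clock\n", gpcs);          // 6
    printf("Setup limit:  %.1f tris/clock\n", sms * 0.5);   // 15, matching the 3DCenter figure
    return 0;
}
```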

This answer is still the equivalent of "swizzle, fuck yeah!" without telling us why this is useful. :)
It's useful if you want to copy a resource from one GPU to another GPU that uses a different swizzle internally... You don't have to reswizzle (tm). :devilish:
 
This answer is still the equivalent of "swizzle, fuck yeah!" without telling us why this is useful. :)
I can answer that. GPUs tile 2D and 3D textures in a way that improves data cache locality. It usually follows some sort of Morton order (but with a limited tile size instead of being fully global): https://en.wikipedia.org/wiki/Z-order_curve. The standard swizzle standardizes the order, making it possible to cook assets to this layout on disk, making it faster to stream data from the CPU to the GPU. It also makes it possible for multiple GPUs of different brands to more efficiently access 2D and 3D textures generated by each other.
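A minimal sketch of the 2D Morton (Z-order) indexing idea; the actual 64KB standard swizzle uses a specific tile size and bit pattern, but the bit interleaving below is what gives the cache locality:

```cpp
#include <cstdint>

// Spread the low 16 bits of x so they occupy the even bit positions.
static uint32_t Part1By1(uint32_t x) {
    x &= 0x0000FFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

// Z-order (Morton) index of texel (x, y): texels that are close in 2D end up
// close in memory, which is what improves data cache locality.
uint32_t MortonIndex2D(uint32_t x, uint32_t y) {
    return Part1By1(x) | (Part1By1(y) << 1);
}
```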
@sebbbi Curious about the effect of MSAA on L2 cache. MSAA performance seems relatively poor on Vega. Although I guess your renderer may not be compatible with msaa?
UE4 deferred renderer doesn't support MSAA. It is fully designed around temporal antialiasing. Temporal antialiasing is also used to enable stochastic optimizations (remove noise) from screen space reflections, ambient occlusion and transparencies, among others. This makes these effects cheaper and better quality. I don't know how Vega behaves with MSAA, and frankly I don't care about MSAA anymore, as good temporal antialiasing is better, and as a bonus allows high quality temporal upsampling as well (saving 50%+ of frame cost with minimal IQ degradation).
How does the raytracer in Claybook work at a layman/high level, and how does it interact with UE4?
We ray trace signed distance field volumes (SDFs). Our volumes are stored as a multiresolution (hierarchical) volume texture. It uses a hybrid sphere tracing / cone tracing algorithm that does empty space skipping with wide cones and then splits the cones into single pixel rays on impact. Ray tracing runs on async compute during g-buffer rendering, and there's a full screen pixel shader pass that combines the result into the g-buffer at the end of the g-buffer pass. There's also a shadow ray trace pass (proper penumbra-widening soft shadows) that currently runs in a pixel shader, writing to UE's full screen shadow mask buffer.
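For readers who haven't seen sphere tracing before, the basic (non-hierarchical) loop looks roughly like the sketch below. This is the textbook algorithm, not Claybook's hybrid cone/sphere tracer, and SampleSDF is a stand-in for a lookup into the distance field volume texture:

```cpp
struct Vec3 { float x, y, z; };

// Stand-in for sampling the signed distance field at a point: returns the
// distance to the nearest surface (negative inside the surface).
float SampleSDF(Vec3 p);

// Basic sphere tracing: step along the ray by the sampled distance, which by
// the SDF property cannot overshoot the surface. On a hit, t is the ray distance.
bool SphereTrace(Vec3 origin, Vec3 dir, float maxT, float& t) {
    t = 0.0f;
    for (int i = 0; i < 128 && t < maxT; ++i) {   // iteration cap avoids endless marching
        Vec3 p = { origin.x + dir.x * t, origin.y + dir.y * t, origin.z + dir.z * t };
        float d = SampleSDF(p);
        if (d < 1e-3f)                            // close enough to the surface: hit
            return true;
        t += d;                                   // safe step: a sphere of radius d is empty
    }
    return false;
}
```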
If the physics is running on the GPU, how much work is left for the CPU, and is it possible to do that work on the GPU, even if it won't be as effective?
Physics is 100% running on the GPU. Every shape has 16k particles and we have lots of real time (3D) fluid interacting seamlessly with the deforming shapes, so running the physics on the CPU would not be possible.
Can AI code run on the GPU?
We have done some mass AI tests that run on GPU. We also generated mass path finding data (velocity field) for them on GPU.
Could your game take advantage of fp16/rapid packed math?
Yes. But Unreal Engine doesn't yet support fp16 on desktops. Only on mobiles. Their DX11 backend doesn't yet support DX 11.2 API (which enables fp16 support).
How close are we to fully ray traced games?
Ray tracing is great for some forms of geometry, but less great for others. Branch coherency on the GPU makes heterogeneous ray tracing (= multiple levels of different acceleration structures and/or early out tests) inefficient. Our multires SDF sidesteps this problem (it has a simple inner loop with no branches). Ray tracing is also awesome for shadows and AO. I expect games to start ray tracing shadows and AO before they start ray tracing the visible rays.
What API is Claybook?
On PC we will support DX11 and DX12. Possibly Vulkan if UE desktop Vulkan backend gets ready in time, and if we have time to port all our customizations to Vulkan.
What's the most interesting thing about Vega?
The new virtual memory system is by far the most impressive achievement. AMD calls it HBCC, and it basically maps the whole system memory (DDR4) into GPU use. When the GPU touches a memory region, that piece of data is instantly transferred from DDR4 to HBM2 at page granularity. Games only touch a small percentage of GPU memory every frame, and the accessed data set changes slowly (because animation needs to be smooth to look like continuous movement). With this tech, 8 GB of fast HBM2 should behave similarly to 16+ GB of traditional GPU memory (or even more, depending on how big a portion of the data is high res content which is only needed up close to a particular surface). I have plans to test this technology by ray tracing huge (32+ GB) volume textures when I have time. I was disappointed that not a single reviewer tested Vega with huge datasets versus traditional 8 GB GPUs and the 12 GB Titan X.
Does anything in Vega give you opportunities to get massive perf gains in old code?
No massive gains. Iterative improvements mostly. The tiled rasterizer shows huge gains in some engineering apps with massive overdraw, but no old game behaves like this. Hopefully no future game behaves like this either (it's better to occlusion cull early in software). HBCC should be a huge win for very large datasets, but current games don't have any problems with traditional 6 GB GPUs, so an 8 GB GPU with HBCC doesn't show any benefits at the moment. Xbox One X with 12 GB of memory should accelerate GPU memory consumption (the 24 GB devkit is just sweet). Maybe next year we'll see some gains over traditional 8 GB GPUs.

HBCC should also reduce frame judder in DX11 games, since the GPU doesn't need to transfer whole resources on use. It can simply page on demand, causing much smaller data movement per frame -> fewer stalls. I would be interested to know whether this is already visible in current games. Do we see fewer fps spikes and better minimum fps compared to other AMD GPUs?
When will you talk about your engine tech?
I will write something after we ship the game.
 
...
UE4 deferred renderer doesn't support MSAA. It is fully designed around temporal antialiasing. Temporal antialiasing is also used to enable stochastic optimizations (remove noise) from screen space reflections, ambient occlusion and transparencies, among others. This makes these effects cheaper and better quality. I don't know how Vega behaves with MSAA, and frankly I don't care about MSAA anymore, as good temporal antialiasing is better, and as a bonus allows high quality temporal upsampling as well (saving 50%+ of frame cost with minimal IQ degradation).
...

I was more curious, just from an architecture point of view, why it might not perform as well with MSAA as expected. Most of the reviews were done with fairly high levels of MSAA. Thanks for the feedback.
 