AMD: Speculation, Rumors, and Discussion (Archive)

I recall reading that. Still thinking there was another limitation or that "slightly worse" was a bit more than slight. It just seemed like some devs were tripping over it more than would be expected if the fix was a simple creation option.
I would take "slightly worse" compression any day over expensive depth decompress and fast clear elimination steps. I like to do partial updates to my resources, and decompression steps are a real pain in the ass in some cases.

It would be fantastic to have more visible flags to control the resource compression, but unfortunately PC APIs are still a bit too high level for that. Would be hard to design a system that is universal to all AMD/NV/Intel GPUs.
 
Kyro 2 is back!
Exciting if true.

This later post of his is interesting, and certainly strange:

I'm not saying anything about the PS4. But if there is going to be a PS4 "Matrix or something", it won't use Polaris. It should be binary compatible with the existing console.
Polaris uses a fully new encoding scheme (ASIC 130), and an earlier GCN binary is not executable on this microarchitecture.
 
I'm not saying anything about the PS4. But if there is going to be a PS4 "Matrix or something", it won't use Polaris. It should be binary compatible with the existing console.
Polaris uses a fully new encoding scheme (ASIC 130), and an earlier GCN binary is not executable on this microarchitecture.
GCN 1.2 is not fully backwards compatible with GCN 1.1 either. Each GCN ISA manual has a list of deprecated instructions that no longer exist. Maybe they could translate the shaders on the fly, but it's not straightforward, as the shaders are offline compiled (or potentially even hand-written GCN 1.1 microcode). Runtime translation needs memory (use the OS reserve?) and processing time (potentially some stalls?). Shouldn't be show stoppers.
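As a purely illustrative sketch of what such a load-time translation pass could look like (hypothetical instruction layout and opcode numbers, nothing resembling real GCN encodings), assuming the shader binary has already been decoded into discrete instructions:

```cpp
// Illustrative only: rewrite instructions that no longer exist on the newer
// ISA into replacement sequences. The Instr layout and opcode values here are
// made up for the example.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Instr { uint32_t opcode; uint64_t operands; };

// Hypothetical table: removed opcode -> replacement sequence.
static const std::unordered_map<uint32_t, std::vector<Instr>> kRewrites = {
    { 0x123u, { { 0x456u, 0u } } },
};

std::vector<Instr> TranslateShader(const std::vector<Instr>& in)
{
    std::vector<Instr> out;
    out.reserve(in.size());              // translated copy needs its own memory
    for (const Instr& i : in) {
        auto it = kRewrites.find(i.opcode);
        if (it == kRewrites.end()) {
            out.push_back(i);            // most instructions pass through as-is
        } else {
            for (Instr r : it->second) {
                r.operands = i.operands; // carry operands over (oversimplified)
                out.push_back(r);
            }
        }
    }
    return out;                          // would run at load/install time
}
```

The painful cases would be removed instructions with no cheap 1:1 replacement, which is where the translation cost and potential stalls come in.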
 
GCN 1.2 is not fully backwards compatible with GCN 1.1 either. Each GCN ISA manual has a list of deprecated instructions that no longer exist. Maybe they could translate the shaders on the fly, but it's not straightforward, as the shaders are offline compiled (or potentially even hand-written GCN 1.1 microcode). Runtime translation needs memory (use the OS reserve?) and processing time (potentially some stalls?). Shouldn't be show stoppers.

On game install, would it be possible for the PS4 to do a decompile/recompile? Or would that be too dangerous?
 
I would assume that the "Primitive Discard Accelerator" is just a marketing term for some additional early-out tests for back-facing, smaller-than-pixel, and off-screen triangles (and maybe an early-out test for small triangles vs HTILE). Currently Nvidia beats AMD badly in triangle rate benchmarks, especially in cases where the triangles result in zero visible pixels. Nvidia certainly has more advanced triangle processing hardware, but the interesting question is whether they win purely by brute force (Nvidia has distributed geometry processing to parallelize the work and get better load balancing), or whether they also have better (early-out) culling for triangles that are not visible.
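To make that concrete, here is a rough CPU-side sketch of those early-out tests (back-face, off-screen, and smaller-than-a-sample) in viewport space; purely illustrative, not a description of AMD's actual hardware:

```cpp
#include <algorithm>
#include <cmath>

struct Vec2 { float x, y; };

// Returns true if the triangle can be rejected before rasterization.
// Vertices are assumed to already be in viewport (pixel) space, CCW = front.
bool CanDiscardTriangle(Vec2 a, Vec2 b, Vec2 c,
                        float vpMinX, float vpMinY, float vpMaxX, float vpMaxY)
{
    // 1. Back-face / zero-area test: non-positive signed area.
    float area = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
    if (area <= 0.0f)
        return true;

    // Screen-space bounding box of the triangle.
    float minX = std::min({a.x, b.x, c.x}), maxX = std::max({a.x, b.x, c.x});
    float minY = std::min({a.y, b.y, c.y}), maxY = std::max({a.y, b.y, c.y});

    // 2. Off-screen test: bounding box entirely outside the viewport.
    if (maxX < vpMinX || minX > vpMaxX || maxY < vpMinY || minY > vpMaxY)
        return true;

    // 3. Small-triangle test: with pixel centers at integer + 0.5, the box
    //    covers no sample if no integer i satisfies min <= i + 0.5 <= max.
    bool noSampleX = std::ceil(minX - 0.5f) > std::floor(maxX - 0.5f);
    bool noSampleY = std::ceil(minY - 0.5f) > std::floor(maxY - 0.5f);
    if (noSampleX || noSampleY)
        return true;

    return false; // survives the early-out tests; rasterize normally
}
```

A hardware discard stage would presumably run tests along these lines before triangles ever reach the rasterizer.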

Nvidia has a restricted fast-path geometry shader mode (no expansion) that can be used to efficiently cull individual triangles that don't hit any sample points, but it has to be called and set up manually, so I'd assume there's no automatic discard in that regard.
 
Some info about this:
http://gpuopen.com/dcc-overview/
So it seems that it can directly read both depth and MSAA without decompression. However the readable format compresses slightly worse. A huge improvement over GCN 1.0/1.1.
Interesting indeed. Based on the wording in the link, though, it looks to me like at least depth MSAA compresses _significantly_ worse. And I'm wondering if I'm missing something in the open-source driver wrt depth buffer compression...
 
Nvidia has a restricted fast-path geometry shader mode (no expansion) that can be used to efficiently cull individual triangles that don't hit any sample points, but it has to be called and set up manually, so I'd assume there's no automatic discard in that regard.


I am unaware of this, do you have any documentation on this?
 
Good point, although that's circumstantial evidence in a way. We don't know if there are any specific differences between the two chips...


Well, without any specifics it's hard to say, but it turns out the 16nm chip used less power and had higher performance, so there are definitely more factors involved.
 
Well, without any specifics it's hard to say, but it turns out the 16nm chip used less power and had higher performance, so there are definitely more factors involved.
One process will inherently be faster or more efficient than the other, and maybe that's the 16nm process... but is your claim based on a sufficient number of samples?
Even if it were the case for the Apple chips, it's not conclusive in terms of inherent process characteristics: it may be that the libraries were more pessimistic for one process than the other, which pushed the pessimistic one towards more aggressive cells, etc.

This is the kind of topic that requires a lot of salt.
 
Nvidia has a restricted fast-path geometry shader mode (no expansion) that can be used to efficiently cull individual triangles that don't hit any sample points, but it has to be called and set up manually, so I'd assume there's no automatic discard in that regard.
I am unaware of this, do you have any documentation on this?
https://www.opengl.org/registry/specs/NV/viewport_array2.txt

Output a zero viewport mask to discard the primitive. (Maxwell GM200 only)
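For illustration, here is my reading of how that looks in GLSL: a minimal sketch based on the NV_geometry_shader_passthrough and NV_viewport_array2 extension specs (not code from either spec), held in a C++ string purely for presentation, with the actual culling test omitted:

```cpp
const char* kCullGeometryShaderGLSL = R"GLSL(
#version 450
#extension GL_NV_geometry_shader_passthrough : require
#extension GL_NV_viewport_array2 : require

layout(triangles) in;

// Passthrough: per-vertex data is forwarded as-is, no EmitVertex()/expansion.
layout(passthrough) in gl_PerVertex {
    vec4 gl_Position;
} gl_in[];

void main()
{
    // A per-primitive test (e.g. back-face or zero-coverage) on
    // gl_in[0..2].gl_Position would go here.
    bool cull = false;

    // Per NV_viewport_array2: a zero viewport mask sends the primitive to no
    // viewport, i.e. it is discarded (Maxwell GM200).
    gl_ViewportMask[0] = cull ? 0 : 1;
}
)GLSL";
```

Because the passthrough mode forwards per-vertex data unchanged and only lets the shader write per-primitive outputs like the viewport mask, it avoids the usual geometry shader expansion cost, which is presumably why it works as a "fast path" for manual culling.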
 
GCN 1.2 is not fully backwards compatible with GCN 1.1 either. Each GCN ISA manual has a list of deprecated instructions that no longer exist. Maybe they could translate the shaders on the fly, but it's not straightforward, as the shaders are offline compiled (or potentially even hand-written GCN 1.1 microcode). Runtime translation needs memory (use the OS reserve?) and processing time (potentially some stalls?). Shouldn't be show stoppers.
Maybe that's why PS4K rumors said there would be an additional 512 MB of RAM available for devs; that could be used to re-compile legacy PS4 code on the fly.
 
One process will inherently be faster or more efficient than the other, and maybe that's the 16nm process... but is your claim based on a sufficient number of samples?
Even if it were the case for the Apple chips, it's not conclusive in terms of inherent process characteristics: it may be that the libraries were more pessimistic for one process than the other, which pushed the pessimistic one towards more aggressive cells, etc.

This is the kind of topic that requires a lot of salt.
I remember reading a lot about this in various publications, and the problem is the way the testing and results are taken out of their specific context.
Case in point: Apple and quite a few other sites mention there is a negligible difference between the two chips, and their reference point is Ars Technica. However, Ars tested at a lower power/processing threshold. Specifically:
Ars Technica said:
Update: To clarify exactly what Xcode's Activity Monitor is telling us, remember that every logical CPU core is tracked individually, so for a dual-core CPU like the A9 "full utilization" would be about 200 percent, 100 for each core. The Geekbench test is putting about 30 percent load on each core for a total of 60 percent. For comparison, the relatively light but modern iOS game Shooty Skies oscillates between 30 and 70 percent depending on how many objects are being drawn on screen.

Our WebGL battery life test similarly keeps the CPU (and the GPU) working continuously, but at a slightly lower level of load. CPU load for this test typically hovers between 45 and 50 percent, and the GPU Driver instrument says the GPU utilization is between 25 and 30 percent.
We also ran the GFXBench GL 3.1 battery life test for good measure, which loops the "T-Rex" test 30 times while measuring performance and power drain. These two tests approximate the load that a 3D game might put on the A9.

So this probably depends on the threshold the testing is done at; one takeaway is how the battery life advantage swapped in favour of TSMC with Geekbench, which used more processing power (the quote above mentions a 60 percent total load).
Worth noting they were specifically looking at battery life, but they also monitored processor usage and controlled it as much as possible.
http://arstechnica.co.uk/apple/2015/10/samsung-vs-tsmc-comparing-the-battery-life-of-two-apple-a9s/
[Image: Samsung vs TSMC iPhone 6S battery life chart]


Some other sites that used heavier processing tests suggest there is a subtle performance and battery gain for the TSMC chip (5-10%), but this is muddied in the same way as when reviewers test modern PC CPUs: some sites see notable gains with the Skylake 6700K over Haswell, for example, while others see no difference at all.
 
One process will inherently be faster or more efficient than the other, and maybe that's the 16nm process... but is your claim based on a sufficient number of samples?
Even if it were the case for the Apple chips, it's not conclusive in terms of inherent process characteristics: it may be that the libraries were more pessimistic for one process than the other, which pushed the pessimistic one towards more aggressive cells, etc.

This is the kind of topic that requires a lot of salt.


I completely agree; there are too many variables for the tests that were done to be anything conclusive.
 
I would take "slightly worse" compression any day over expensive depth decompress and fast clear elimination steps. I like to do partial updates to my resources, and decompression steps are a real pain in the ass in some cases.

It would be fantastic to have more visible flags to control the resource compression, but unfortunately PC APIs are still a bit too high level for that. Would be hard to design a system that is universal to all AMD/NV/Intel GPUs.
Agreed, but I thought the compression was largely controlled by creating the resource as shared in DX11? DX12/Vulkan I'd have to check.
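For reference, the DX11 pattern being referred to is just a misc flag at resource creation time; whether a driver actually falls back to an uncompressed layout for shared resources is exactly the speculation here, not something the API guarantees. A minimal sketch:

```cpp
// Sketch of creating a DX11 render target marked as shared. Any effect on the
// driver's internal (DCC/depth) compression is driver-specific and speculative.
#include <d3d11.h>

HRESULT CreateSharedColorTarget(ID3D11Device* device, UINT width, UINT height,
                                ID3D11Texture2D** outTex)
{
    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width = width;
    desc.Height = height;
    desc.MipLevels = 1;
    desc.ArraySize = 1;
    desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;
    // The flag in question: makes the resource shareable between devices,
    // which (speculatively) may steer the driver away from compressed layouts.
    desc.MiscFlags = D3D11_RESOURCE_MISC_SHARED;

    return device->CreateTexture2D(&desc, nullptr, outTex);
}
```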

On game install, would it be possible for the PS4 to do a decompile/recompile? Or would that be too dangerous?
Decompile would be ugly. Devs would likely want to repackage for NEO mode anyways. For those that don't, I'd guess there is some sort of fallback compatibility layer.

Interesting indeed. Based on the wording in the link, though, it looks to me like at least depth MSAA compresses _significantly_ worse. And I'm wondering if I'm missing something in the open-source driver wrt depth buffer compression...
HyperZ is the only compression I recall hearing about, besides texture compression, in the open source drivers. It would have to be something added for Tonga/Fiji/Polaris. I can't imagine they missed something like that.

Found it; it's not in there yet: drm/amd/dal: Add framebuffer compression HW programming
https://lists.freedesktop.org/archives/dri-devel/2016-February/100524.html
 